Description |
In today's business world, collecting data for business analysis is no longer difficult. The major concern for business managers is whether they can use the data to build predictive models that provide accurate information for decision-making. Knowledge Discovery from Databases (KDD) provides a guideline for collecting data by identifying the knowledge inside it. As one of the KDD steps, data mining provides a systematic and intelligent approach to learning from large amounts of data and is critical to the success of KDD. Over the past several decades, many data mining algorithms have been developed; they can be categorized as classification, association rule mining, and clustering. These algorithms have proven very effective in answering different business questions. Among them, classification is the most popular category and is widely used across business areas. However, existing classification algorithms are designed to maximize prediction accuracy under the assumption of equal class distribution and equal error costs. This assumption seldom holds in the real world. Thus, it is necessary to extend current classification methods so that they can handle data with imbalanced class distributions and unequal costs. In this dissertation, I propose an Iterative Cost-sensitive Naïve Bayes (ICSNB) method aimed at reducing overall misclassification cost regardless of class distribution. During each iteration, K nearest neighbors are identified and form a new training set, which is used to learn unsolved instances. In the second study, I use the characteristics of the nearest neighbor method to develop a new under-sampling method: a general approach to the imbalance problem that identifies noisy instances in the data set and creates a balanced data set for learning.
Both methods are validated on multiple real-world data sets. The empirical results show that they outperform several popular existing methods. |