Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
2. to substitute computationally simple distance measures for the error measures—this approximation will reduce the computation time yet give satisfactory results for comparison of subset alternatives;
3. to select features based only on subsets of large amounts of data, but the subsequent steps of data mining will be applied on the whole set.
The application of feature selection and reduction of data dimensionality may be used in all phases of the data-mining process for successful knowledge discovery. It has to be started in the preprocessing phase, but, on many occasions, feature selection and reduction is a part of the data-mining algorithm, even if it is applied in postprocessing for better evaluation and consolidation of obtained results.
Let us return to the promising subsets of features. One possible technique for feature selection is based on comparison of means and variances. To summarize the key characteristics of the distribution of values for a given feature, it is necessary to compute the mean value and the corresponding variance. The main weakness in this approach is that the distribution for the feature is not known. If it is assumed to be a normal curve, the statistics can work out very well, but this may in fact be a poor assumption. Without knowing the shape of the distribution curve, the means and variances are viewed as heuristics for feature selection, not exact, mathematical modeling tools.
In general, if one feature describes different classes of entities, samples of two different classes can be examined. The means of feature values are normalized by their variances and then compared. If the means are far apart, interest in a feature increases; it has potential, in terms of its use in distinguishing between two classes. If the means are indistinguishable, interest wanes in that feature. It is a heuristic, nonoptimal approach to feature selection, but it is consistent with practical experience in many data-mining applications in the triage of features. Next, equations formalize the test, where A and B are sets of feature values measured for two different classes, and n1 and n2 are the corresponding number of samples:
The mean of a feature is compared in both classes without taking into consideration relationship to other features. In this approach to feature selection, we assumed a priori that the given feature is independent of the others. A comparison of means is a natural fit to classification problems. For the purposes of feature selection, a regression problem can be considered as a pseudo-classification problem. For k classes, k pairwise comparisons can be made, comparing each class with its complement. A feature is retained if it is significant for any of the pairwise comparisons.
We can analyze this approach in feature selection through one example. A simple data set is given in Table 3.1 with two input features X and Y, and an additional output feature C that classifies samples into two classes, A and B. It is necessary to decide whether the features X and Y are candidates for reduction or not. Suppose that the threshold value of the applied test is 0.5.
TABLE 3.1. Dataset with Three FeaturesXYC0.30.7A0.20.9B0.60.6A0.50.5A0.70.7B0.40.9B
First, we need to compute a mean value and a variance for both classes and both features X and Y. The analyzed subsets of the feature’s values are
and the results of applied tests are
This analysis shows that X is a candidate for reduction because its mean values are close and, therefore, the final test is below the threshold value. On the other hand, the test for feature Y is significantly above the threshold value; this feature is not a candidate for reduction because it has the potential to be a distinguishing feature between two classes.
A similar idea for feature ranking is shown in the algorithm that is based on correlation criteria. Let us consider first the prediction of a continuous outcome y. The Pearson correlation coefficient is defined as:
where cov designates the covariance and var the variance. The estimate of R(i) for the given data set with samples’ inputs xk,,j and outputs yk is defined by:
where the bar notation stands for an average over the index k (set of all samples). Using R(i)2 as a variable-ranking criterion enforces a ranking according to goodness of linear fit of individual variables. Correlation criteria such as R(i)2 can only detect linear dependencies between input features and target or output feature (variable). One common criticism of variable ranking is that it leads to the selection of a redundant subset. The same performance could possibly be achieved with a smaller subset of complementary variables. Still, one may wonder whether deleting presumably redundant variables can result in a performance gain.
Practical experiments show that noise reduction and consequently better model estimation may be obtained by adding features that are presumably redundant. Therefore, we have to be very careful in the preprocessing analysis. Yes, perfectly correlated variables are truly redundant in the sense that no additional information is gained by adding them. But, even variables with relatively high correlation (or anti-correlation) do not guarantee absence of variables’ complementarity. We can find cases where a feature looks completely useless by itself, and it is ranked very low, but it can provide significant information to the model and performance improvement when taken with others. These features by themselves may have little correlation with the output, target concept, but when combined with some other features, they can be strongly correlated with the target feature. Unintentional removal of these features can result in poor mining performance.
The previous simple methods test features separately. Several features may be useful when considered separately, but they may be redundant in their predictive ability. If the features are examined collectively, instead of independently, additional information can be obtained about their characteristics and mutual relations. Assuming normal distributions of values, it is possible to describe an efficient technique for selecting subsets of features. Two descriptors characterize a multivariate normal distribution:
1. M, a
Comments (0)