Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
1. Start with the initial full set of feature F.
2. For each feature f∈F, remove one feature f from F and obtain a subset Ff. Find the difference between entropy for F and entropy for all Ff. In the example in Figure 3.2, we have to compare the differences (EF − E F-F1), (EF − E F-F2), and (EF − E F-F3).
3. Let fk be a feature such that the difference between entropy for F and entropy for Ffk is minimum.
4. Update the set of feature F = F − {fk}, where “–” is a difference operation on sets. In our example, if the difference (EF − E F-F1) is minimum, then the reduced set of features is {F2, F3}. F1 becomes the bottom of the ranked list.
5. Repeat steps 2–4 until there is only one feature in F.
A ranking process may be stopped in any iteration, and may be transformed into a process of selecting features, using the additional criterion mentioned in step 4. This criterion is that the difference between entropy for F and entropy for Ff should be less than the approved threshold value to reduce feature fk from set F. A computational complexity is the basic disadvantage of this algorithm, and its parallel implementation could overcome the problems of working with large data sets and large number of features sequentially.
3.5 PCA
The most popular statistical method for dimensionality reduction of a large data set is the Karhunen-Loeve (K-L) method, also called PCA. In various fields, it is also known as the singular value decomposition (SVD), the Hotelling transform, and the empirical orthogonal function (EOF) method. PCA is a method of transforming the initial data set represented by vector samples into a new set of vector samples with derived dimensions. The goal of this transformation is to concentrate the information about the differences between samples into a small number of dimensions. Practical applications confirmed that PCA is the best linear dimension-reduction technique in the mean-square error sense. Being based on the covariance matrix of the features, it is a second-order method. In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations of the original features with the largest variance. Since the variance depends on the scale of the variables, it is customary to first standardize each variable to have mean 0 and standard deviation 1. After the standardization, the original variables with possibly different units of measurement are all in comparable units.
More formally, the basic idea can be described as follows: A set of n-dimensional vector samples X = {x1, x2, x3, … , xm} should be transformed into another set Y = {y1, y2, … , ym} of the same dimensionality, but y-s have the property that most of their information content is stored in the first few dimensions. This will allow us to reduce the data set to a smaller number of dimensions with low information loss.
The transformation is based on the assumption that high information corresponds to high variance. So, if we want to reduce a set of input dimensions X to a single dimension Y, we should transform X into Y as a matrix computation
choosing A such that Y has the largest variance possible for a given data set. The single dimension Y obtained in this transformation is called the first principal component. The first principal component is an axis in the direction of maximum variance. It minimizes the distance of the sum of squares between data points and their projections on the component axis, as shown in Figure 3.4 where a two-dimensional space is transformed into a one-dimensional space in which the data set has the highest variance.
Figure 3.4. The first principal component is an axis in the direction of maximum variance.
In practice, it is not possible to determine matrix A directly, and therefore we compute the covariance matrix S as a first step in feature transformation. Matrix S is defined as
The eigenvalues of the covariance matrix S for the given data should be calculated in the next step. Finally, the m eigenvectors corresponding to the m largest eigenvalues of S define a linear transformation from the n-dimensional space to an m-dimensional space in which the features are uncorrelated. To specify the principal components we need the following additional explanations about the notation in matrix S:
1. The eigenvalues of S n×n are λ1, λ2, … , λn, where
2. The eigenvectors e1, e2, … , en correspond to eigenvalues λ1, λ2, … , λn, and they are called the principal axes.
Principal axes are new, transformed axes of an n-dimensional space, where the new variables are uncorrelated, and variance for the ith component is equal to the ith eigenvalue. Because λi’s are sorted, most of the information about the data set is concentrated in a few first-principal components. The fundamental question is how many of the principal components are needed to get a good representation of the data? In other words, what is the effective dimensionality of the data set? The easiest way to answer the question is to analyze the proportion of variance. Dividing the sum of the first m eigenvalues by the sum of all the variances (all eigenvalues), we will get the measure for the quality of representation based on the first m principal components. The result is expressed as a percentage, and if, for example, the projection accounts for over 90% of the total variance, it is considered to be good. More formally, we can express this ratio in the following way. The criterion for feature selection is based on the ratio of the sum of the m largest eigenvalues of S to the trace of S. That is a fraction of the variance retained in the m-dimensional space. If the eigenvalues are labeled so that λ1 ≥ λ2 ≥ … ≥ λn, then the ratio can be written as
When
Comments (0)