Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
6.6 CART ALGORITHM & GINI INDEX
CART, as indicated earlier, is an acronym for Classification and Regression Trees. The basic methodology of divide and conquer described in C4.5 is also used in CART. The main differences are in the tree structure, the splitting criteria, the pruning method, and the way missing values are handled.
CART constructs trees that have only binary splits. This restriction simplifies the splitting criterion because there need not be a penalty for multi-way splits. Furthermore, if the label is binary, the binary split restriction allows CART to optimally partition categorical attributes (minimizing any concave splitting criteria) to two subsets of values in the number of attribute values. The restriction has its disadvantages, however, because the tree may be less interpretable with multiple splits occurring on the same attribute at the adjacent levels.
CART uses the Gini diversity index as a splitting criterion instead of information-based criteria for C4.5. The CART authors favor the Gini criterion over information gain because the Gini can be extended to include symmetrized costs, and it is computed more rapidly than information gain. The Gini index is used to select the feature at each internal node of the decision tree. We define the Gini index for a data set S as follows:
where
c is the number of predefined classes,
Ci are classes for i = 1, … , c − 1,
si is the number of samples belonging to class Ci, and
pi= si/S is a relative frequency of class Ci in the set.
This metric indicates the partition purity of the data set S. For branch prediction where we have two classes the Gini index lies within [0, 0.5]. If all the data in S belong to the same class, Gini S equals the minimum value 0, which means that S is pure. If Gini S equals 0.5, all observations in S are equally distributed among two classes. This decreases as a split favoring one class: for instance a 70/30 distribution produces Gini index of 0.42. If we have more than two classes, the maximum possible value for index increases; for instance, the worst possible diversity for three classes is a 33% split, and it produces a Gini value of 0.67. The Gini coefficient, which ranges from 0 to 1 (for extremely large number of classes), is multiplied by 100 to range between 0 and 100 in some commercial tools.
The quality of a split on a feature into k subsets Si is then computed as the weighted sum of the Gini indices of the resulting subsets:
where
ni is the number of samples in subset Si after splitting, and
n is the total number of samples in the given node.
Thus, Ginisplit is calculated for all possible features, and the feature with minimum Ginisplit is selected as split point. The procedure is repetitive as in C4.5. We may compare the results of CART and C4.5 attribute selection by applying entropy index (C4.5) and Gini index (CART) for splitting Attribute1 in Table 6.1. While we already have results for C4.5 (gain ratio for Attribute 1), Ginisplit index for the same attribute may be calculated as:
Interpretation of the absolute value for Ginisplit index is not important. It is important that relatively this value is lower for AttributeA1 than for the other two attributes in Table 6.1. The reader may easily check this claim. Therefore, based on Gini index Attribute1 is selected for splitting. It is the same result obtained in C4.5 algorithm using an entropy criterion. Although the results are the same in this example, for many other data sets there could be (usually small) differences in results between these two approaches.
CART also supports the twoing splitting criterion, which can be used for multi-class problems. At each node, the classes are separated into two superclasses containing disjoint and mutually exhaustive classes. A splitting criterion for a two-class problem is used to find the attribute, and the two superclasses that optimize the two-class criterion. The approach gives “strategic” splits in the sense that several classes that are similar are grouped together. Although twoing splitting rule allows us to build more balanced trees, this algorithm works slower than Gini rule. For example, if the total number of classes is equal to K, then we will have 2K − 1 possible grouping into two classes. It can be seen that there is a small difference between trees constructed using Gini and trees constructed via twoing rule. The difference can be seen mainly at the bottom of the tree where the variables are less significant in comparison with the top of the tree.
Pruning technique used in CART is called minimal cost complexity pruning, while C4.5 uses binomial confidence limits. The proposed approach assumes that the bias in the re-substitution error of a tree increases linearly with the number of leaf nodes. The cost assigned to a subtree is the sum of two terms: the re-substitution error and the number of leaves times a complexity parameter α. It can be shown that, for every α value, there exists a unique smallest tree minimizing cost of the tree. Note that, although α runs through a continuum of values, there are at most a finite number of possible subtrees for analysis and pruning.
Unlike C4.5, CART does not penalize the splitting criterion during the tree construction if examples have unknown values for the attribute used in the split. The criterion uses only those instances for which the value is known. CART finds several surrogate splits that can be used instead of the original split. During classification, the first surrogate split based on a known attribute value is used. The surrogates cannot be chosen based on the original splitting criterion because the subtree at each node is constructed based on the original split selected. The surrogate splits are therefore chosen to maximize a measure of predictive association with the original split. This procedure works well if there
Comments (0)