Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
To illustrate this threshold-finding process, we could analyze, for our example of database T, the possibilities of Attribute2 splitting. After a sorting process, the set of values for Attribute2 is {65, 70, 75, 78, 80, 85, 90, 95, 96} and the set of potential threshold values Z is {65, 70, 75, 78, 80, 85, 90, 95}. Out of these eight values the optimal Z (with the highest information gain) should be selected. For our example, the optimal Z value is Z = 80 and the corresponding process of information-gain computation for the test x3 (Attribute2 ≤ 80 or Attribute2 > 80) is the following:
Now, if we compare the information gain for the three attributes in our example, we can see that Attribute1 still gives the highest gain of 0.246 bits and therefore this attribute will be selected for the first splitting in the construction of a decision tree. The root node will have the test for the values of Attribute1, and three branches will be created, one for each of the attribute values. This initial tree with the corresponding subsets of samples in the children nodes is represented in Figure 6.4.
Figure 6.4. Initial decision tree and subset cases for a database in Table 6.1.
After initial splitting, every child node has several samples from the database, and the entire process of test selection and optimization will be repeated for every child node. Because the child node for test x1, Attribute1 = B, has four cases and all of them are in CLASS1, this node will be the leaf node, and no additional tests are necessary for this branch of the tree.
For the remaining child node where we have five cases in subset T1, tests on the remaining attributes can be performed; an optimal test (with maximum information gain) will be test x4 with two alternatives: Attribute2 ≤ 70 or Attribute2 > 70.
Using Attribute2 to divide T1 into two subsets (test x4 represents the selection of one of two intervals), the resulting information is given by:
The information gained by this test is maximal:
and two branches will create the final leaf nodes because the subsets of cases in each of the branches belong to the same class.
A similar computation will be carried out for the third child of the root node. For the subset T3 of the database T, the selected optimal test x5 is the test on Attribute3 values. Branches of the tree, Attribute3 = True and Attribute3 = False, will create uniform subsets of cases that belong to the same class. The final decision tree for database T is represented in Figure 6.5.
Figure 6.5. A final decision tree for database T given in Table 6.1.
Alternatively, a decision tree can be presented in the form of an executable code (or pseudo-code) with if-then constructions for branching into a tree structure. The transformation of a decision tree from one representation to the other is very simple and straightforward. The final decision tree for our example is given in pseudocode in Figure 6.6.
Figure 6.6. A decision tree in the form of pseudocode for the database T given in Table 6.1.
While the gain criterion has had some good results in the construction of compact decision trees, it also has one serious deficiency: a strong bias in favor of tests with many outcomes. A solution was found in some kinds of normalization. By analogy with the definition of Info(S), an additional parameter was specified:
This represented the potential information generated by dividing set T into n subsets Ti. Now, a new gain measure could be defined:
This new gain measure expresses the proportion of information generated by the split that is useful, that is, that appears helpful in classification. The gain-ratio criterion also selects a test that maximizes the ratio given earlier. This criterion is robust and typically gives a consistently better choice of a test than the previous gain criterion. A computation of the gain-ratio test can be illustrated for our example. To find the gain-ratio measure for the test x1, an additional parameter Split-info(x1) is calculated:
A similar procedure should be performed for other tests in the decision tree. Instead of gain measure, the maximal gain ratio will be the criterion for attribute selection, along with a test to split samples into subsets. The final decision tree created using this new criterion for splitting a set of samples will be the most compact.
6.3 UNKNOWN ATTRIBUTE VALUES
The previous version of the C4.5 algorithm is based on the assumption that all values for all attributes are determined. But in a data set, often some attribute values for some samples can be missing—such incompleteness is typical in real-world
Comments (0)