Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
To find the subset of rules having d3, CMAR traverses nodes having the attribute value d3 and looks “upward” the FP tree to collect d3-projected samples. In our example, there are three samples represented in the FP tree, and they are (a1, b2, c1, d3):C, (a1, b2, d3):C, and (d3):A. The problem of finding all FPs in the training set can be reduced to mining FPs in the d3-projected database. In our example, in the d3-projected database, since the pattern (a1, b2, d3) occurs twice its support is equal to the required threshold value 2. Also, the rule based on this FP, (a1, b2, d3) → C has a confidence of 100% (above the threshold value), and that is the only rule generated in the given projection of the database.
After a search for rules having d3 value, all the nodes of d3 and their corresponding class labels are merged into their parent nodes at the FP tree. The FP tree is shrunk as shown in Figure 10.7b. The remaining set of rules can be mined similarly by repeating the previous procedures for a c1-projected database, then for the b2-projected database, and finally for the a1-projected database. In this analysis, (a1, c1) is an FP with support 3, but all rules are with confidence less than the threshold value. The same conclusions can be drawn for pattern (a1, b2) and for (a1). Therefore, the only association rule generated through the training process with the database T is (a1, b2, d3) → C with support equal to 2 and 100% confidence.
When a set of rules is selected for classification, CMAR is ready to classify new samples. For the new sample, CMAR collects the subset of rules matching the sample from the total set of rules. Trivially, if all the rules have the same class, CMAR simply assigns that label to the new sample. If the rules are not consistent in the class label, CMAR divides the rules into groups according to the class label and yields the label of the “strongest” group. To compare the strength of groups, it is necessary to measure the “combined effect” of each group. Intuitively, if the rules in a group are highly positively correlated and have good support, the group should have a strong effect. CMAR uses the strongest rule in the group as its representative, that is, the rule with highest χ2 test value (adopted for this algorithm for a simplified computation). Preliminary experiments have shown that CMAR outperforms the C4.5 algorithm in terms of average accuracy, efficiency, and scalability.
10.7 MULTIDIMENSIONAL ASSOCIATION–RULES MINING
A multidimensional transactional database DB has the schema
where ID is a unique identification of each transaction, Ai are structured attributes in the database, and items are sets of items connected with the given transaction. The information in each tuple t = (id, a1, a2, … , an, items-t) can be partitioned into two: dimensional part (a1, a2, … , an) and itemset part (items-t). It is commonsense to divide the mining process into two steps: first mine patterns about dimensional information and then find frequent itemsets from the projected sub-database, or vice versa. Without any preferences in the methodology we will illustrate the first approach using the multidimensional database DB in Table 10.4.
TABLE 10.4. Multidimensional-Transactional Database DB
One can first find the frequent multidimensional-value combinations and then find the corresponding frequent itemsets of a database. Suppose that the threshold value for our database DB in Table 10.4 is set to 2. Then, the combination of attribute values that occurs two or more than two times is frequent, and it is called a multidimensional pattern or MD pattern. For mining MD patterns, a modified Bottom Up Computation (BUC) algorithm can be used (it is an efficient “iceberg cube” computing algorithm). The basic steps of the BUC algorithm are as follows:
1. First, sort all tuples in the database in alphabetical order of values in the first dimension (A1), because the values for A1 are categorical. The only MD pattern found for this dimension is (a, *, *) because only the value a occurs two times; the other values b and c occur only once and they are not part of the MD patterns. Value * for the other two dimensions shows that they are not relevant in this first step, and they could have any combination of allowed values.
Select tuples in a database with found M pattern (or patterns). In our database, these are the samples with ID values 01 and 03. Sort the reduced database again with respect to the second dimension (A2), where the values are 1 and 2. Since no pattern occurs twice, there are no MD patterns for exact A1 and A2 values. Therefore, one can ignore the second dimension A2 (this dimension does not reduce the database further). All selected tuples are used in the next phase.
Selected tuples in the database are sorted in alphabetical order of values for the third dimension (in our example A3 with categorical values). A subgroup (a, *, m) is contained in two tuples and it is an MD pattern. Since there are no more dimensions in our example, the search continues with the second step.
2. Repeat the processes in step 1; only start not with the first but with the second dimension (first dimension is not analyzed at all in this iteration). In the following iterations, reduce the search process further for one additional dimension at the beginning. Continue with other dimensions.
In our example in the second iteration, starting with attribute A2, MD pattern (*, 2, *) will be found. Including dimension A3, there are no
Comments (0)