Data Mining by Mehmed Kantardzic

Medical diagnostic decisions are a typical example of this kind of classification. If four out of 11 symptoms support diagnosis of a given disease, then the corresponding classifier will generate 330 regions in an 11-dimensional space for positive diagnosis only. That corresponds to 330 decision rules. Therefore, a data-mining analyst has to be very careful in applying the orthogonal-classification methodology of decision rules to this type of nonlinear problem.
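The figure of 330 is the number of ways to choose which 4 of the 11 symptoms appear in a rule, that is, the binomial coefficient C(11, 4). A minimal Python check follows; the function name `n_of_m_regions` is illustrative only and does not come from the book.

```python
from math import comb


def n_of_m_regions(n: int, m: int) -> int:
    """Count the conjunctive rules that each test exactly n of m binary attributes.

    Hypothetical helper: one rule (one orthogonal region) per choice of n attributes.
    """
    return comb(m, n)


# Four supporting symptoms out of 11, as in the medical-diagnosis example.
print(n_of_m_regions(4, 11))  # 330
```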

Finally, introducing new attributes rather than removing old ones can avoid the sometimes intensive fragmentation of the n-dimensional space by additional rules. Let us analyze a simple example. A classification problem is described by nine binary inputs {A1, A2, … , A9}, and the output class C is specified by the logical relation

C = (A1 ∨ A2 ∨ A3) ∧ (A4 ∨ A5 ∨ A6) ∧ (A7 ∨ A8 ∨ A9)

The above expression can be rewritten as a disjunction of conjunctions:

C = (A1 ∧ A4 ∧ A7) ∨ (A1 ∧ A5 ∧ A7) ∨ …

and it will have 27 factors, each containing only ∧ operations. Every one of these factors is a region in a 9-D space and corresponds to one rule. Taking into account regions for negative examples, there exist about 50 leaves in the decision tree (and the same number of rules) describing class C. If new attributes are introduced:

B1 = A1 ∨ A2 ∨ A3, B2 = A4 ∨ A5 ∨ A6, B3 = A7 ∨ A8 ∨ A9

the description of class C will be simplified into the logical rule

C = B1 ∧ B2 ∧ B3

It is possible to specify the correct classification using a decision tree with only four leaves. In the new three-dimensional space (B1, B2, B3) there will be only four decision regions. This kind of simplification via constructive induction (development of new attributes in the preprocessing phase) can also be applied to n-of-m attribute decisions. If none of the previous transformations is found appropriate, the only way to deal with the increased fragmentation of an n-dimensional space is to bring more data to bear on the problem.
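The effect of constructive induction can be checked by brute force. The sketch below assumes that class C is the conjunction of three disjunctions over consecutive triples of the nine inputs, which is consistent with the 27 factors and the three new attributes mentioned above; the function names are hypothetical.

```python
from itertools import product


def class_c(a):
    """Assumed target concept: AND of three ORs over consecutive input triples."""
    return (a[0] or a[1] or a[2]) and (a[3] or a[4] or a[5]) and (a[6] or a[7] or a[8])


def constructed(a):
    """New attributes B1, B2, B3, each the OR of one triple of original inputs."""
    return (a[0] or a[1] or a[2], a[3] or a[4] or a[5], a[6] or a[7] or a[8])


# Expanding the expression gives one purely conjunctive factor for every
# choice of one input from each triple: 3 * 3 * 3 factors in total.
factors = list(product(range(0, 3), range(3, 6), range(6, 9)))


def expanded(a):
    """Disjunction of all conjunctive factors."""
    return any(all(a[i] for i in factor) for factor in factors)


inputs = list(product([False, True], repeat=9))
agree = all(class_c(a) == all(constructed(a)) == expanded(a) for a in inputs)

print(len(factors))  # 27
print(agree)         # True
```

Under this assumption the three formulations agree on all 512 input combinations, so the single rule over (B1, B2, B3) replaces the 27 rules needed in the original 9-D space.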

6.8 REVIEW QUESTIONS AND PROBLEMS

1. Explain the differences between the statistical and logical approaches in the construction of a classification model.

2. What are the new features of the C4.5 algorithm compared with Quinlan’s original ID3 algorithm for decision-tree generation?

3. Given a data set X with 3-D categorical samples:

Construct a decision tree using the computation steps given in the C4.5 algorithm.

4. Given a training data set Y:

(a) Find the best threshold (for the maximal gain) for AttributeA.

(b) Find the best threshold (for the maximal gain) for AttributeB.

(c) Find a decision tree for data set Y.

(d) If the testing set is:

What is the percentage of correct classifications using the decision tree developed in (c)?

(e) Derive decision rules from the decision tree.
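As a reminder of how a threshold search for a numeric attribute can be organized, here is a minimal sketch. It is not the book's implementation: the helper names and the four-sample toy data are hypothetical and unrelated to data set Y, and the candidate thresholds are taken to be the distinct attribute values present in the data (all but the largest).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) maximizing information gain for the test value <= threshold.

    Returns (None, -1.0) when the attribute has fewer than two distinct values.
    """
    base = entropy(labels)
    n = len(labels)
    best = (None, -1.0)
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best


# Hypothetical toy data, used only to exercise the function.
toy_values = [1, 2, 3, 4]
toy_labels = ["C1", "C1", "C2", "C2"]
threshold, gain = best_threshold(toy_values, toy_labels)
print(threshold, gain)
```

The same function can be applied to each numeric attribute in turn; the attribute and threshold with the largest gain become the candidate test for the node.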

5. Use the C4.5 algorithm to build a decision tree for classifying the following objects:

6. Given a training data set Y* with missing values:

(a) Apply a modified C4.5 algorithm to construct a decision tree with the (Ti/E) parameters explained in Section 7.3.

(b) Analyze the possibility of pruning the decision tree obtained in (a).

(c) Generate decision rules for the solution in (a). Is it necessary to generate a default rule for this rule-based model?

7. Why is postpruning in C4.5 defined as pessimistic pruning?

8. Suppose that two decision rules are generated with C4.5:

Rule1: (X > 3) ∧ (Y ≥ 2) → Class1 (9.6/0.4)

Rule2: (X > 3) ∧ (Y < 2) → Class2 (2.4/2.0)

Analyze whether it is possible to generalize these rules into one using the confidence limit U25% for the binomial distribution.

9. Discuss the complexity of the algorithm for optimal splitting of numeric attributes into more than two intervals.

10. In real-world data-mining applications, a final model may consist of an extremely large number of decision rules. Discuss the potential actions and analyses you should perform to reduce the complexity of the model.

11. Search the Web to find the basic characteristics of publicly available or commercial software tools for generating decision rules and decision trees. Document the results of your search.

12. Consider a binary classification problem (output attribute value = {Low, High}) with the following set of input attributes and attribute values:

Air Conditioner = {Working, Broken}

Engine = {Good, Bad}

Mileage = {High, Medium, Low}

Rust = {Yes, No}

Suppose a rule-based classifier produces the following rule set:

Mileage = High → Value = Low

Mileage = Low → Value = High

Air Conditioner = Working and Engine = Good → Value = High

Air Conditioner = Working and Engine = Bad → Value = Low

Air Conditioner = Broken → Value = Low

(a) Are the rules mutually exclusive? Explain your answer.

(b) Is the rule set exhaustive (covering each possible case)? Explain your answer.

(c) Is ordering needed for this set of rules? Explain your answer.

(d) Do you need a default class for the rule set? Explain your answer.
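Properties such as mutual exclusivity and exhaustiveness can be examined mechanically by enumerating every combination of attribute values and recording which rules fire. The sketch below encodes the attributes and rules listed above; the variable and function names are hypothetical, and the script only produces counts that still need to be interpreted.

```python
from itertools import product

ATTRIBUTES = {
    "Air Conditioner": ["Working", "Broken"],
    "Engine": ["Good", "Bad"],
    "Mileage": ["High", "Medium", "Low"],
    "Rust": ["Yes", "No"],
}

# Each rule is (conditions, predicted value), in the order given in the text.
RULES = [
    ({"Mileage": "High"}, "Low"),
    ({"Mileage": "Low"}, "High"),
    ({"Air Conditioner": "Working", "Engine": "Good"}, "High"),
    ({"Air Conditioner": "Working", "Engine": "Bad"}, "Low"),
    ({"Air Conditioner": "Broken"}, "Low"),
]


def firing_rules(case):
    """Return the predictions of all rules whose conditions match the case."""
    return [value for cond, value in RULES
            if all(case[k] == v for k, v in cond.items())]


cases = [dict(zip(ATTRIBUTES, combo)) for combo in product(*ATTRIBUTES.values())]
uncovered = [c for c in cases if len(firing_rules(c)) == 0]
overlapping = [c for c in cases if len(firing_rules(c)) > 1]
conflicting = [c for c in overlapping if len(set(firing_rules(c))) > 1]

print(len(cases), len(uncovered), len(overlapping), len(conflicting))
```

An empty `uncovered` list indicates an exhaustive rule set, an empty `overlapping` list indicates mutually exclusive rules, and a nonempty `conflicting` list shows where rule ordering or another conflict-resolution scheme matters.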

13. Of the following algorithms:

C4.5

K-Nearest Neighbor

Naïve Bayes

Linear Regression

(a) Which are fast in training but slow in classification?

(b) Which one produces classification rules?

(c) Which one requires discretization of continuous attributes before application?

(d) Which model is the most complex?

14.

(a) How much information is involved in choosing one of eight items, assuming that they have an equal frequency?

(b) One of 16 items?
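For equally likely alternatives, the information involved in selecting one of N items is log2(N) bits. A minimal sketch of that formula follows; the function name is illustrative.

```python
from math import log2


def information_bits(n_items: int) -> float:
    """Information (in bits) of choosing one of n_items equally likely alternatives."""
    return log2(n_items)


print(information_bits(8), information_bits(16))
```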

15. The following data set will be used to learn a decision tree for predicting whether a mushroom is edible or not based on its shape, color, and odor.

(a) What is entropy H(Edible|Odor = 1 or Odor = 3)?

(b) Which attribute would the C4.5 algorithm choose to use for the root of the tree?

(c) Draw the full decision tree that would be learned for this data (no pruning).

(d) Suppose we have a validation set as follows. What will be the training set error and validation set error of the tree? Express your answer as the number of examples that would be misclassified.

6.9 REFERENCES FOR FURTHER STUDY

Dzeroski, S., N. Lavrac, eds., Relational Data Mining, Springer-Verlag, Berlin, Germany, 2001.

Relational data mining has its roots in inductive logic programming, an area in the intersection of machine learning and programming languages. The book provides a thorough overview of different techniques and strategies used in knowledge discovery from multi-relational data. The chapters describe a broad selection of practical, inductive-logic programming approaches to relational data mining, and give a good overview of several interesting applications.

Kralj Novak, P., N. Lavrac, G. L.
