arising from a simple random sample. Inverse sampling is used when a feature in a data set occurs only rarely, so that even a large subset of samples may not give enough information to estimate its value. In that case, sampling is dynamic: it starts with a small subset and continues until conditions on the required number of feature values are satisfied.
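
As an illustration of that dynamic stopping rule, the following Python sketch keeps drawing random cases until a required number of occurrences of the rare feature value has been seen. The names inverse_sample, is_rare, and required_count are hypothetical and not part of the book's text.

```python
import random

def inverse_sample(population, is_rare, required_count, max_draws=None):
    """Draw cases one at a time until `required_count` rare cases are seen.

    population     -- list of available records (illustrative input)
    is_rare        -- predicate marking the rare feature value
    required_count -- stopping condition on the number of rare cases
    max_draws      -- optional safety limit on the total number of draws
    """
    collected, rare_seen, draws = [], 0, 0
    while rare_seen < required_count:
        if max_draws is not None and draws >= max_draws:
            break                          # give up if the rare value never shows up
        record = random.choice(population)  # sampling with replacement, for simplicity
        collected.append(record)
        draws += 1
        if is_rare(record):
            rare_seen += 1
    return collected
```

For example, inverse_sample(records, lambda r: r["defect"] == 1, required_count=30) would keep sampling until 30 defective cases have been collected.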

For some specialized types of problems, alternative techniques can be helpful in reducing the number of cases. For example, for time-dependent data the number of samples is determined by the frequency of sampling. The sampling period is specified based on knowledge of the application. If the sampling period is too short, most samples are repetitive and few changes occur from case to case. For some applications, increasing the sampling period causes no harm and can even be beneficial in obtaining a good data-mining solution. Therefore, for time-series data the windows for sampling and measuring features should be optimized, and that requires additional preparation and experimentation with available data.
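
A minimal sketch of how the sampling period controls the number of time-series cases, assuming the raw measurements are regularly spaced (timestamp, value) pairs; the function name resample is hypothetical:

```python
def resample(series, period):
    """Keep every `period`-th measurement of a regularly sampled series.

    series -- list of (timestamp, value) pairs, assumed equally spaced
    period -- integer factor by which the sampling period is increased
    """
    return series[::period]

# Example: hourly readings reduced to 6-hour readings (4 cases instead of 24)
hourly = [(t, 20.0 + 0.1 * t) for t in range(24)]
six_hourly = resample(hourly, period=6)
```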

3.9 REVIEW QUESTIONS AND PROBLEMS

1. Explain what we gain and what we lose with dimensionality reduction in large data sets in the preprocessing phase of data mining.

2. Use one typical application of data mining in the retail industry to explain monotonicity and interruptibility of data-reduction algorithms.

3. Given the data set X with three input features and one output feature representing the classification of samples, X:

(a) Rank the features using a comparison of means and variances.

(b) Rank the features using the Relief algorithm. Use all samples for the algorithm (m = 7).

4. Given four-dimensional samples where the first two dimensions are numeric and the last two are categorical

(a) Apply a method for unsupervised feature selection based on entropy measure to reduce one dimension from the given data set.

(b) Apply the Relief algorithm under the assumption that X4 is the output (classification) feature.

5.

(a) Perform bin-based value reduction with the best cutoffs for the following:

(i) the feature I3 in problem 3 using mean values as representatives for two bins.

(ii) the feature X2 in problem 4 using the closest boundaries for two bin representatives.

(b) Discuss the possibility of applying approximation by rounding to reduce the values of numeric attributes in problems 3 and 4.

6. Apply the ChiMerge technique to reduce the number of values for numeric attributes in problem 3.

(a) Reduce the number of numeric values for feature I1 and find the final, reduced number of intervals.

(b) Reduce the number of numeric values for feature I2 and find the final, reduced number of intervals.

(c) Reduce the number of numeric values for feature I3 and find the final, reduced number of intervals.

(d) Discuss the results and benefits of dimensionality reduction obtained in (a), (b), and (c).

7. Explain the differences between averaged and voted combined solutions when random samples are used to reduce dimensionality of a large data set.

8. How can the incremental-sample approach and the average-sample approach be combined to reduce cases in large data sets?

9. Develop a software tool for feature ranking based on means and variances. The input data set is represented in the form of a flat file with several features.
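
One possible starting point for such a tool, assuming a two-class problem stored in a CSV flat file, is to score each numeric feature by the difference of its class means relative to the pooled spread, |mean(A) − mean(B)| / sqrt(var(A)/n_A + var(B)/n_B); larger scores suggest stronger class separation. The file layout, column names, and function names below are illustrative assumptions, not the book's solution.

```python
import csv
import math
from collections import defaultdict

def rank_by_means_variances(path, class_column):
    """Rank numeric features of a flat (CSV) file for a two-class problem."""
    by_class = defaultdict(lambda: defaultdict(list))
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            label = row[class_column]
            for name, value in row.items():
                if name != class_column:
                    by_class[label][name].append(float(value))

    a, b = list(by_class)[:2]                  # assume exactly two class labels
    scores = {}
    for name in by_class[a]:
        xa, xb = by_class[a][name], by_class[b][name]
        ma, mb = sum(xa) / len(xa), sum(xb) / len(xb)
        va = sum((x - ma) ** 2 for x in xa) / max(len(xa) - 1, 1)
        vb = sum((x - mb) ** 2 for x in xb) / max(len(xb) - 1, 1)
        denom = math.sqrt(va / len(xa) + vb / len(xb)) or 1e-12   # guard against zero spread
        scores[name] = abs(ma - mb) / denom
    # Highest score first: the most discriminative feature is ranked at the top
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```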

10. Develop a software tool for ranking features using entropy measure. The input data set is represented in the form of a flat file with several features.
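
A rough sketch of one way to implement such a tool, under the assumption that similarity between samples is measured as e^(−alpha · distance) and that a feature is less important the less the data set's entropy changes when that feature is removed. All names are illustrative, and the samples are assumed to be numeric tuples already read from the flat file.

```python
import math
from itertools import combinations

def _entropy(samples):
    """Entropy of a set of numeric samples based on pairwise similarities."""
    dists = {(i, j): math.dist(samples[i], samples[j])
             for i, j in combinations(range(len(samples)), 2)}
    mean_d = sum(dists.values()) / len(dists)
    alpha = -math.log(0.5) / mean_d            # similarity 0.5 at the average distance
    e = 0.0
    for d in dists.values():
        s = math.exp(-alpha * d)
        if 0.0 < s < 1.0:                      # skip identical pairs to avoid log(0)
            e -= s * math.log(s) + (1 - s) * math.log(1 - s)
    return e

def rank_by_entropy(samples):
    """Score each feature by |E(all features) - E(all features but one)|.

    Smaller scores mark features whose removal disturbs the data structure
    least, i.e., candidates for elimination.
    """
    base = _entropy(samples)
    scores = {}
    for f in range(len(samples[0])):
        reduced = [tuple(v for k, v in enumerate(row) if k != f) for row in samples]
        scores[f] = abs(base - _entropy(reduced))
    return sorted(scores.items(), key=lambda kv: kv[1])
```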

11. Implement the ChiMerge algorithm for automated discretization of selected features in a flat input file.
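
The sketch below outlines one possible implementation of ChiMerge: every distinct value starts as its own interval, the chi-square statistic is computed for each pair of adjacent intervals from their class-frequency counts, and the pair with the smallest statistic is merged until every remaining pair exceeds a chosen threshold. Function and parameter names are illustrative.

```python
from collections import Counter

def chi2(interval_a, interval_b, classes):
    """Chi-square statistic for two adjacent intervals (Counters of class counts)."""
    total = sum(interval_a.values()) + sum(interval_b.values())
    value = 0.0
    for interval in (interval_a, interval_b):
        row_sum = sum(interval.values())
        for c in classes:
            col_sum = interval_a[c] + interval_b[c]
            expected = row_sum * col_sum / total
            if expected == 0:
                expected = 0.1              # conventional fix for empty cells
            value += (interval[c] - expected) ** 2 / expected
    return value

def chimerge(values, labels, threshold):
    """Merge adjacent intervals until every adjacent pair exceeds `threshold`.

    values, labels -- parallel lists (feature value, class label)
    Returns the lower boundary of each remaining interval.
    """
    classes = sorted(set(labels))
    intervals = []                           # one initial interval per distinct value
    for v in sorted(set(values)):
        counts = Counter(l for x, l in zip(values, labels) if x == v)
        intervals.append([v, counts])

    while len(intervals) > 1:
        chis = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                for i in range(len(intervals) - 1)]
        i = min(range(len(chis)), key=chis.__getitem__)
        if chis[i] > threshold:
            break                            # all adjacent pairs differ enough
        intervals[i][1] += intervals[i + 1][1]   # merge interval i+1 into i
        del intervals[i + 1]
    return [iv[0] for iv in intervals]
```

With two classes there is one degree of freedom, so a threshold of about 2.706 corresponds to a 90% significance level.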

12. Given the data set F = {4, 2, 1, 6, 4, 3, 1, 7, 2, 2}, apply two iterations of the bin-based method for value reduction with best cutoffs. The initial number of bins is 3. What are the final medians of the bins, and what is the total minimized error?
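
This problem is meant to be worked by hand, but a brute-force check is easy to write. The sketch below simply tries every contiguous split of the sorted values into k bins and keeps the one with the smallest total absolute error from the bin medians; it is only one interpretation of "best cutoffs" and is not the iterative procedure the problem asks you to trace.

```python
from itertools import combinations
from statistics import median

def best_bins(values, k):
    """Exhaustive search for k contiguous bins of the sorted values that
    minimize the total absolute error from the bin medians (feasible only
    for small data sets such as the one in this problem)."""
    xs = sorted(values)
    best = None
    for cuts in combinations(range(1, len(xs)), k - 1):
        bins, start = [], 0
        for c in list(cuts) + [len(xs)]:
            bins.append(xs[start:c])
            start = c
        err = sum(abs(x - median(b)) for b in bins for x in b)
        if best is None or err < best[0]:
            best = (err, [median(b) for b in bins], bins)
    return best   # (total error, bin medians, bin contents)
```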

13. Assume you have 100 values that are all different, and use equal width discretization with 10 bins.

(a) What is the largest number of records that could appear in one bin?

(b) What is the smallest number of records that could appear in one bin?

(c) If you use equal height discretization with 10 bins, what is the largest number of records that can appear in one bin?

(d) If you use equal height discretization with 10 bins, what is the smallest number of records that can appear in one bin?

(e) Now assume that the maximum value frequency is 20. What is the largest number of records that could appear in one bin with equal width discretization (10 bins)?

(f) What about with equal height discretization (10 bins)?

3.10 REFERENCES FOR FURTHER STUDY

Fodor, I. K., A Survey of Dimension Reduction Techniques, LLNL Technical Report, June 2002.

The author reviews PCA and FA, the two most widely used linear dimension-reduction methods based on second-order statistics. However, many data sets of interest are not realizations from Gaussian distributions. For those cases, higher-order dimension-reduction methods, which use information not contained in the covariance matrix, are more appropriate. The survey also covers ICA and the method of random projections.

Liu, H., H. Motoda, eds., Instance Selection and Construction for Data Mining, Kluwer Academic Publishers, Boston, MA, 2001.

Many different approaches have been used to address the data-explosion issue, such as algorithm scale-up and data reduction. Instance, sample, or tuple selection pertains to methods that select or search for a representative portion of data that can fulfill a data-mining task as if the whole data were used. This book brings researchers and practitioners together to report new developments and applications in instance-selection techniques, to share hard-learned experiences in order to avoid similar pitfalls, and to shed light on future development.

Liu, H., H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, (Second Printing), Kluwer Academic Publishers, Boston, MA, 2000.

The book offers an overview of feature-selection methods and provides a general framework in order to examine these methods and categorize them. The book uses simple examples to show the essence of methods and suggests guidelines for using different methods under various circumstances.

Liu, H., H. Motoda, Computational Methods of Feature Selection, CRC Press, Boston, MA, 2007.

The book offers excellent surveys, practical guidance, and comprehensive tutorials from leading experts. It paints a picture of the state-of-the-art techniques that can boost the capabilities of many existing data-mining
