bookssland.com » Other » Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗

Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic



1 ... 89 90 91 92 93 94 95 96 97 ... 193
Go to page:
the most robust and accurate learning algorithms of the past decade. A multitude of heuristics has been developed for randomizing the ensemble parameters to generate diverse models. It is arguable that this line of investigation is rather oversubscribed nowadays, and the more interesting research is now in methods for nonstandard data.

Kuncheva, L. I., Combining Pattern Classifiers: Methods and Algorithms, Wiley Press, Hoboken, NJ, 2004.

Covering pattern classification methods, Combining Classifiers: Methods and Algorithms focuses on the important and widely studied issue of combining several classifiers together in order to achieve an improved recognition performance. It is one of the first books to provide unified, coherent, and expansive coverage of the topic and as such will be welcomed by those involved in the area. With case studies that bring the text alive and demonstrate “real-world” applications it is destined to become an essential reading.

Dietterich, T. G., Ensemble Methods in Machine Learning, in Lecture Notes in Computer Science on Multiple Classifier Systems, Vol. 1857, Springer, Berlin, 2000.

Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that Adaboost does not overfit rapidly.

9

CLUSTER ANALYSIS

Chapter Objectives

Distinguish between different representations of clusters and different measures of similarities.

Compare the basic characteristics of agglomerative- and partitional-clustering algorithms.

Implement agglomerative algorithms using single-link or complete-link measures of similarity.

Derive the K-means method for partitional clustering and analysis of its complexity.

Explain the implementation of incremental-clustering algorithms and its advantages and disadvantages.

Introduce concepts of density clustering, and algorithms Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Balanced and Iterative Reducing and Clustering Using Hierarchies (BIRCH).

Discuss why validation of clustering results is a difficult problem.

Cluster analysis is a set of methodologies for automatic classification of samples into a number of groups using a measure of association so that the samples in one group are similar and samples belonging to different groups are not similar. The input for a system of cluster analysis is a set of samples and a measure of similarity (or dissimilarity) between two samples. The output from cluster analysis is a number of groups (clusters) that form a partition, or a structure of partitions, of the data set. One additional result of cluster analysis is a generalized description of every cluster, and this is especially important for a deeper analysis of the data set’s characteristics.

9.1 CLUSTERING CONCEPTS

Organizing data into sensible groupings is one of the most fundamental approaches of understanding and learning. Cluster analysis is the formal study of methods and algorithms for natural grouping, or clustering, of objects according to measured or perceived intrinsic characteristics or similarities. Samples for clustering are represented as a vector of measurements, or more formally, as a point in a multidimensional space. Samples within a valid cluster are more similar to each other than they are to a sample belonging to a different cluster. Clustering methodology is particularly appropriate for the exploration of interrelationships among samples to make a preliminary assessment of the sample structure. Human performances are competitive with automatic-clustering procedures in one, two, or three dimensions, but most real problems involve clustering in higher dimensions. It is very difficult for humans to intuitively interpret data embedded in a high-dimensional space.

Table 9.1 shows a simple example of clustering information for nine customers, distributed across three clusters. Two features describe customers: The first feature is the number of items the customers bought, and the second feature shows the price they paid for each.

TABLE 9.1. Sample Set of Clusters Consisting of Similar Objects Number of ItemsPriceCluster 1217003200042300Cluster 2101800122100112500Cluster 3210032003350

Customers in Cluster 1 purchase a few high-priced items; customers in Cluster 2 purchase many high-priced items; and customers in Cluster 3 purchase few low-priced items. Even this simple example and interpretation of a cluster’s characteristics shows that clustering analysis (in some references also called unsupervised classification) refers to situations in which the objective is to construct decision boundaries (classification surfaces) based on unlabeled training data set. The samples in these data sets have only input dimensions, and the learning process is classified as unsupervised.

Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes in an n-dimensional data space. To compound the problem further, the number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data. The next example illustrates these problems through the process of clustering points in the Euclidean two-dimensional (2-D) space. Figure 9.1a shows a set of points (samples in a 2-D space) scattered on a 2-D plane. Let us analyze the problem of dividing the points into a number of groups. The number of groups N is not given beforehand. Figure 9.1b shows the natural clusters bordered by broken curves. Since the number of clusters is not given, we have another partition of four clusters in Figure 9.1c that is as natural as the groups in Figure 9.1b. This kind of arbitrariness for the number of clusters is a major problem in clustering.

Figure 9.1. Cluster analysis of points in a 2D-space. (a) Initial data; (b) three clusters of data; (c) four clusters of data.

Note that the above clusters can be recognized by sight. For a set of points in a higher dimensional Euclidean space, we cannot recognize clusters visually. Accordingly, we need an objective criterion for clustering. To describe this criterion, we have to introduce a more formalized approach in describing the basic concepts and the clustering process.

An input to a cluster analysis can be described as an ordered pair (X, s), or (X, d), where X is a set of object descriptions represented with samples, and s and d are measures for similarity or dissimilarity (distance) between

1 ... 89 90 91 92 93 94 95 96 97 ... 193
Go to page:

Free e-book «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗» - read online now

Comments (0)

There are no comments yet. You can be the first!
Add a comment