bookssland.com » Other » Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗

Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic



1 ... 95 96 97 98 99 100 101 102 103 ... 193
Go to page:
frustration in using iterative partitional-clustering programs is the lack of guidelines available for choosing K-number of clusters apart from the ambiguity about the best direction for initial partition, updating the partition, adjusting the number of clusters, and the stopping criterion. The K-means algorithm is very sensitive to noise and outlier data points, because a small number of such data can substantially influence the mean value. Unlike the K-means, the K-mediods method, instead of taking the mean value of the samples, uses the most centrally located object (mediods) in a cluster to be the cluster representative. Because of this, the K-mediods method is less sensitive to noise and outliers. Fuzzy c-means, proposed by Dunn and later improved, is an extension of K-means algorithm where each data point can be a member of multiple clusters with a membership value expressed through fuzzy sets. Despite its drawbacks, k-means remains the most widely used partitional clustering algorithm in practice. The algorithm is simple, easily understandable and reasonably scalable, and can be easily modified to deal with streaming data.

9.5 INCREMENTAL CLUSTERING

There are more and more applications where it is necessary to cluster a large collection of data. The definition of “large” has varied with changes in technology. In the 1960s, “large” meant several-thousand samples for clustering. Now, there are applications where millions of samples of high dimensionality have to be clustered. The algorithms discussed above work on large data sets, where it is possible to accommodate the entire data set in the main memory. However, there are applications where the entire data set cannot be stored in the main memory because of its size. There are currently three possible approaches to solve this problem:

1. The data set can be stored in a secondary memory and subsets of this data are clustered independently, followed by a merging step to yield a clustering of the entire set. We call this approach the divide-and-conquer approach.

2. An incremental-clustering algorithm can be employed. Here, data are stored in the secondary memory and data items are transferred to the main memory one at a time for clustering. Only the cluster representations are stored permanently in the main memory to alleviate space limitations.

3. A parallel implementation of a clustering algorithm may be used where the advantages of parallel computers increase the efficiency of the divide-and-conquer approach.

An incremental-clustering approach is most popular, and we will explain its basic principles. The following are the global steps of the incremental-clustering algorithm.

1. Assign the first data item to the first cluster.

2. Consider the next data item. Either assign this item to one of the existing clusters or assign it to a new cluster. This assignment is done based on some criterion, for example, the distance between the new item and the existing cluster centroids. In that case, after every addition of a new item to an existing cluster, recompute a new value for the centroid.

3. Repeat step 2 until all the data samples are clustered.

The space requirements of the incremental algorithm are very small, necessary only for the centroids of the clusters. Typically, these algorithms are non-iterative and therefore their time requirements are also small. But, even if we introduce iterations into the incremental-clustering algorithm, computational complexity and corresponding time requirements do not increase significantly. On the other hand, there is one obvious weakness of incremental algorithms that we have to be aware of. Most incremental algorithms do not satisfy one of the most important characteristics of a clustering process: order-independence. An algorithm is order-independent if it generates the same partition for any order in which the data set is presented. Incremental algorithms are very sensitive to the order of samples, and for different orders they generate totally different partitions.

Let us analyze the incremental-clustering algorithm with the sample set given in Figure 9.6. Suppose that the order of samples is x1, x2, x3, x4, x5 and the threshold level of similarity between clusters is δ = 3.

1. The first sample x1 will become the first cluster C1 = {x1}. The coordinates of x1 will be the coordinates of the centroid M1 = {0, 2}.

2. Start analysis of the other samples.

(a) Second sample x2 is compared with M1, and the distance d is determined

Therefore, x2 belongs to the cluster C1. The new centroid will be

(b) The third sample x3 is compared with the centroid M1 (still the only centroid):

(c) The fourth sample x4 is compared with the centroid M1:

Because the distance of the sample from the given centroid M1 is larger than the threshold value δ, this sample will create its own cluster C2 = {x4} with the corresponding centroid M2 = {5, 0}.

(d) The fifth sample x5 is compared with both cluster centroids:

The sample is closer to the centroid M2, and its distance is less than the threshold value δ. Therefore, sample x5 is added to the second cluster C2:

3. All samples are analyzed and a final clustering solution of two clusters is obtained:

The reader may check that the result of the incremental-clustering process will not be the same if the order of the samples is different. Usually, this algorithm is not iterative (although it could be) and the clusters generated after all the samples have been analyzed in one iteration are the final clusters. If the iterative approach is used, the centroids of the clusters computed in the previous iteration are used as a basis for the partitioning of samples in the next iteration.

For most partitional-clustering algorithms, including the iterative approach, a summarized representation of the cluster is given through its clustering feature (CF) vector. This vector of parameters is given for every cluster as a triple, consisting of the number of points (samples) of the cluster, the centroid of the cluster, and the radius of the cluster. The cluster’s radius is defined as the square-root of the average mean-squared distance from the centroid to the points in the cluster (averaged within-cluster variation). When

1 ... 95 96 97 98 99 100 101 102 103 ... 193
Go to page:

Free e-book «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗» - read online now

Comments (0)

There are no comments yet. You can be the first!
Add a comment