Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
9.4 PARTITIONAL CLUSTERING
Every partitional-clustering algorithm obtains a single partition of the data instead of the clustering structure, such as a dendrogram, produced by a hierarchical technique. Partitional methods have the advantage in applications involving large data sets for which the construction of a dendrogram is computationally very complex. The partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of samples) or globally (defined over all of the samples). Thus, we say that a clustering criterion can be either global or local. A global criterion, such as the Euclidean square-error measure, represents each cluster by a prototype or centroid and assigns the samples to clusters according to the most similar prototypes. A local criterion, such as the minimal MND, forms clusters by utilizing the local structure or context in the data. Therefore, identifying high-density regions in the data space is a basic criterion for forming clusters.
The most commonly used partitional-clustering strategy is based on the square-error criterion. The general objective is to obtain the partition that, for a fixed number of clusters, minimizes the total square-error. Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, … , Ck}. Each Ck has nk samples and each sample is in exactly one cluster, so that Σnk = N, where k = 1, … , K. The mean vector Mk of cluster Ck is defined as the centroid of the cluster or
where xik is the ith sample belonging to cluster Ck. The square-error for cluster Ck is the sum of the squared Euclidean distances between each sample in Ck and its centroid. This error is also called the within-cluster variation:
The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations:
The objective of a square-error clustering method is to find a partition containing K clusters that minimize for a given K.
The K-means partitional-clustering algorithm is the simplest and most commonly used algorithm employing a square-error criterion. It starts with a random, initial partition and keeps reassigning the samples to clusters, based on the similarity between samples and clusters, until a convergence criterion is met. Typically, this criterion is met when there is no reassignment of any sample from one cluster to another that will cause a decrease of the total squared error. K-means algorithm is popular because it is easy to implement, and its time and space complexity is relatively small. A major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local minimum of the criterion function if the initial partition is not properly chosen.
The simple K-means partitional-clustering algorithm is computationally efficient and gives surprisingly good results if the clusters are compact, hyperspherical in shape, and well separated in the feature space. The basic steps of the K-means algorithm are
1. select an initial partition with K clusters containing randomly chosen samples, and compute the centroids of the clusters;
2. generate a new partition by assigning each sample to the closest cluster center;
3. compute new cluster centers as the centroids of the clusters; and
4. repeat steps 2 and 3 until an optimum value of the criterion function is found (or until the cluster membership stabilizes).
Let us analyze the steps of the K-means algorithm on the simple data set given in Figure 9.6. Suppose that the required number of clusters is two, and initially, clusters are formed from a random distribution of samples: C1 = {x1, x2, x4} and C2 = {x3, x5}. The centriods for these two clusters are
Within-cluster variations, after initial random distribution of samples, are
And the total square-error is
When we reassign all samples, depending on a minimum distance from centroids M1 and M2, the new redistribution of samples inside clusters will be
New clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new centroids
The corresponding within-cluster variations and the total square-error are
We can see that after the first iteration, the total square-error is significantly reduced (from the value 27.48 to 6.17). In this simple example, the first iteration was at the same time the final one because if we analyze the distances between the new centroids and the samples, the latter will all be assigned to the same clusters. There is no reassignment and therefore the algorithm halts.
In summary, the K-means algorithm and its equivalent in an artificial neural networks domain—the Kohonen net—have been applied for clustering on large data sets. The reasons behind the popularity of the K-means algorithm are as follows:
1. Its time complexity is O(n × k × l), where n is the number of samples, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Typically, k and l are fixed in advance and so the algorithm has linear time complexity in the size of the data set.
2. Its space complexity is O(k + n), and if it is possible to store all the data in the primary memory, access time to all elements is very fast and the algorithm is very efficient.
3. It is an order-independent algorithm. For a given initial distribution of clusters, it generates the same partition of the data at the end of the partitioning process irrespective of the order in which the samples are presented to the algorithm.
A big
Comments (0)