Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic




Figure 9.5. Distances for a single-link and a complete-link clustering algorithm. (a) Single-link distance; (b) complete-link distance.

In either case, two clusters are merged to form a larger cluster based on minimum-distance criteria. Although the single-link algorithm is computationally simpler, from a practical viewpoint it has been observed that the complete-link algorithm produces more useful hierarchies in most applications.
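The two intercluster distances can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance between samples; the function names and the two small clusters are hypothetical and do not come from the text.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_link_distance(cluster_a, cluster_b):
    """Smallest distance over all cross-cluster pairs of samples."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


def complete_link_distance(cluster_a, cluster_b):
    """Largest distance over all cross-cluster pairs of samples."""
    return max(dist(p, q) for p in cluster_a for q in cluster_b)


# Two small illustrative clusters (hypothetical data, not from the text).
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (6.0, 0.0)]

print(single_link_distance(a, b))    # 2.0, the closest pair is (1, 0) and (3, 0)
print(complete_link_distance(a, b))  # 6.0, the farthest pair is (0, 0) and (6, 0)
```

In both cases the pair of clusters with the smallest such value is the one merged next; only the value assigned to each pair differs.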

As explained earlier, the only difference between the single-link and complete-link approaches is in the distance computation. For both, the basic steps of the agglomerative clustering algorithm are the same. These steps are as follows:

1. Place each sample in its own cluster. Construct the list of intercluster distances for all distinct unordered pairs of samples, and sort this list in ascending order.

2. Step through the sorted list of distances, forming for each distinct threshold value dk a graph of the samples where pairs of samples closer than dk are connected into a new cluster by a graph edge. If all the samples are members of a connected graph, stop. Otherwise, repeat this step.

3. The output of the algorithm is a nested hierarchy of graphs, which can be cut at the desired dissimilarity level forming a partition (clusters) identified by simple connected components in the corresponding subgraph.
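The steps above can be sketched for the single-link case, where clusters at a threshold are exactly the connected components of the threshold graph. This is a minimal sketch, not a definitive implementation: the function name `single_link_hierarchy`, the union-find bookkeeping, and the sample data are assumptions made for illustration. The complete-link variant needs a different test than connectivity and is not covered here.

```python
from itertools import combinations
from math import dist


def single_link_hierarchy(samples):
    """Return (threshold, partition) pairs, one per distance level at which
    at least two clusters merge. Partitions hold sample indices."""
    n = len(samples)
    parent = list(range(n))  # union-find forest: one tree per connected component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Step 1: all distinct unordered pairs, sorted by ascending distance.
    pairs = sorted(
        (dist(samples[i], samples[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    # Step 2: walk the sorted list, adding one graph edge at a time.
    hierarchy = []
    components = n
    for d, i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # this edge joins two different components
            parent[root_i] = root_j
            components -= 1
            groups = {}
            for s in range(n):
                groups.setdefault(find(s), []).append(s)
            partition = sorted(groups.values())
            if hierarchy and hierarchy[-1][0] == d:
                hierarchy[-1] = (d, partition)  # same level: keep latest state
            else:
                hierarchy.append((d, partition))
        if components == 1:  # all samples are connected: stop
            break
    return hierarchy


# Hypothetical one-dimensional samples, chosen so the merges are easy to trace.
samples = [(0.0,), (1.0,), (3.0,), (7.0,)]
hierarchy = single_link_hierarchy(samples)
for level, partition in hierarchy:
    print(level, partition)
```

Step 3 corresponds to reading the returned list: the partition for a chosen dissimilarity level is the last entry whose threshold does not exceed that level.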

Let us consider five points {x1, x2, x3, x4, x5} with the following coordinates as a 2-D sample for clustering:

For this example, we selected 2-D points because it is easier to graphically represent these points and to trace all the steps in the clustering algorithm. The points are represented graphically in Figure 9.6.

Figure 9.6. Five two-dimensional samples for clustering.

The distances between these points using the Euclidean measure are
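A table of pairwise Euclidean distances can be computed as sketched below. The coordinates in the snippet are an assumption: they are hypothetical values chosen only because the distances they produce agree with the merge levels quoted in the walkthrough that follows (1.5, 2.0, 2.5, 3.5, and about 5.4).

```python
from itertools import combinations
from math import dist

# ASSUMED coordinates (hypothetical): picked to be consistent with the
# merge levels quoted in the text, not taken from the text itself.
points = {
    "x1": (0.0, 2.0),
    "x2": (0.0, 0.0),
    "x3": (1.5, 0.0),
    "x4": (5.0, 0.0),
    "x5": (5.0, 2.0),
}

# One entry per distinct unordered pair of samples.
distances = {
    (a, b): dist(points[a], points[b])
    for a, b in combinations(points, 2)
}

# Print the pairs in ascending order of distance.
for (a, b), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"d({a}, {b}) = {d:.2f}")
```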

The distances between points as clusters in the first iteration are the same for both single-link and complete-link clustering. Further computation for these two algorithms is different. Using agglomerative single-link clustering, the following steps are performed to create a cluster and to represent the cluster structure as a dendrogram.

First, samples x2 and x3 are merged and a cluster {x2, x3} is generated with a minimum distance equal to 1.5. Second, x4 and x5 are merged into a new cluster {x4, x5} with a higher merging level of 2.0. At the same time, the minimum single-link distance between clusters {x2, x3} and {x1} is also 2.0, so these two clusters merge at the same level of similarity as x4 and x5. Finally, the two clusters {x1, x2, x3} and {x4, x5} are merged at the highest level with a minimum single-link distance of 3.5. The resulting dendrogram is shown in Figure 9.7.

Figure 9.7. Dendrogram by single-link method for the data set in Figure 9.6.

The cluster hierarchy created by an agglomerative complete-link clustering algorithm differs from the single-link solution. First, x2 and x3 are merged and a cluster {x2, x3} is generated with the minimum distance equal to 1.5. In the second step, x4 and x5 are again merged into a new cluster {x4, x5} with a higher merging level of 2.0. The minimal complete-link distance between clusters {x2, x3} and {x1} is now 2.5, so these two clusters merge after the previous two steps. Finally, the two clusters {x1, x2, x3} and {x4, x5} are merged at the highest level with a minimal complete-link distance of 5.4. The resulting dendrogram is shown in Figure 9.8.

Figure 9.8. Dendrogram by complete-link method for the data set in Figure 9.6.
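Both walkthroughs can be traced with one naive agglomerative routine in which only the linkage function changes. This is a sketch under stated assumptions: the coordinates are hypothetical values chosen to agree with the merge levels in the text, and `agglomerate` is an illustrative name, not a library function.

```python
from itertools import combinations
from math import dist

# ASSUMED coordinates (hypothetical), consistent with the quoted merge levels.
points = {
    "x1": (0.0, 2.0),
    "x2": (0.0, 0.0),
    "x3": (1.5, 0.0),
    "x4": (5.0, 0.0),
    "x5": (5.0, 2.0),
}


def agglomerate(points, linkage):
    """Naive agglomerative clustering.

    linkage=min gives single-link, linkage=max gives complete-link.
    Returns a list of (merge_level, members_of_new_cluster).
    """
    def cluster_distance(a, b):
        return linkage(dist(points[p], points[q]) for p in a for q in b)

    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest intercluster distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(*pair),
        )
        level = cluster_distance(a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((round(level, 2), sorted(a | b)))
    return merges


single = agglomerate(points, min)
complete = agglomerate(points, max)
print("single-link:  ", single)
print("complete-link:", complete)
```

With these assumed coordinates the merge levels are 1.5, 2.0, 2.0, 3.5 for single-link and 1.5, 2.0, 2.5, 5.39 for complete-link (the text rounds the last value to 5.4). The two single-link merges at level 2.0 are tied, so the order in which the routine reports them is arbitrary.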

Selecting, for example, a threshold measure of similarity s = 2.2, we can recognize from the dendrograms in Figures 9.7 and 9.8 that the final clusters for the single-link and complete-link algorithms are not the same. The single-link algorithm creates only two clusters, {x1, x2, x3} and {x4, x5}, while the complete-link algorithm creates three clusters: {x1}, {x2, x3}, and {x4, x5}.
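Cutting a hierarchy at a threshold is equivalent to merging only while the closest pair of clusters is no farther apart than that threshold. The sketch below checks the s = 2.2 case; the coordinates are an assumption (hypothetical values consistent with the merge levels in the text), and `cut_at` is an illustrative name.

```python
from itertools import combinations
from math import dist

# ASSUMED coordinates (hypothetical), consistent with the quoted merge levels.
points = {
    "x1": (0.0, 2.0),
    "x2": (0.0, 0.0),
    "x3": (1.5, 0.0),
    "x4": (5.0, 0.0),
    "x5": (5.0, 2.0),
}


def cut_at(points, linkage, threshold):
    """Merge clusters until the closest pair is farther apart than threshold.

    linkage=min gives single-link, linkage=max gives complete-link.
    """
    def cluster_distance(pair):
        a, b = pair
        return linkage(dist(points[p], points[q]) for p in a for q in b)

    clusters = [frozenset([name]) for name in points]
    while len(clusters) > 1:
        closest = min(combinations(clusters, 2), key=cluster_distance)
        if cluster_distance(closest) > threshold:
            break  # every remaining merge lies above the cut
        a, b = closest
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)


print(cut_at(points, min, 2.2))  # [['x1', 'x2', 'x3'], ['x4', 'x5']]
print(cut_at(points, max, 2.2))  # [['x1'], ['x2', 'x3'], ['x4', 'x5']]
```

Stopping early is valid here because merge levels never decrease under single-link or complete-link distances, so no later merge can fall back below the threshold.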

Unlike traditional agglomerative methods, Chameleon is a clustering algorithm that tries to improve the clustering quality by using a more elaborate criterion when merging two clusters. Two clusters will be merged if the interconnectivity and closeness of the merged clusters is very similar to the interconnectivity and closeness of the two individual clusters before merging.

To form the initial subclusters, Chameleon first creates a graph G = (V, E), where each node v ∈ V represents a data sample, and a weighted edge e(vi, vj) exists between two nodes vi and vj if vj is one of the k-nearest neighbors of vi. The weight of each edge in G represents the closeness between two samples; that is, an edge weighs more if the two data samples are closer to each other. Chameleon then uses a graph-partition algorithm to recursively partition G into many small, unconnected subgraphs by doing a min-cut on G at each level of recursion. Here, a min-cut on a graph G refers to a partitioning of G into two parts of nearly equal size such that the total weight of the edges being cut is minimized. Each subgraph is then treated as an initial subcluster, and the algorithm is repeated until a certain criterion is reached.
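The graph-construction step can be sketched as follows. This covers only the k-nearest-neighbor graph, not the min-cut partitioning. The weight formula 1 / (1 + distance) is an assumption made for illustration (the text only requires that closer samples get heavier edges), and `knn_graph` and the sample data are hypothetical.

```python
from math import dist


def knn_graph(samples, k):
    """Build an undirected k-nearest-neighbor graph.

    Returns {(i, j): weight} with i < j, where an edge is present if
    j is among the k nearest neighbors of i, or i among those of j.
    """
    edges = {}
    for i, p in enumerate(samples):
        others = [j for j in range(len(samples)) if j != i]
        nearest = sorted(others, key=lambda j: dist(p, samples[j]))[:k]
        for j in nearest:
            d = dist(p, samples[j])
            # ASSUMED weighting: any decreasing function of distance would do.
            edges[(min(i, j), max(i, j))] = 1.0 / (1.0 + d)
    return edges


# Hypothetical samples forming two well-separated pairs.
samples = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
graph = knn_graph(samples, k=1)
print(graph)  # {(0, 1): 0.5, (2, 3): 0.5}
```

With k = 1 the graph already falls apart into two components; with larger k the components become connected and a partitioning step is needed to separate them.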

In the second phase, the algorithm goes bottom-up. Chameleon determines the similarity between each pair of elementary clusters Ci and Cj according to their relative interconnectivity RI(Ci, Cj) and their relative closeness RC(Ci, Cj). Given that the interconnectivity of a cluster is defined as the total weight of edges that are removed when a min-cut is performed, the relative interconnectivity RI(Ci, Cj) is defined as the ratio of the interconnectivity of the merged cluster of Ci and Cj to the average interconnectivity of Ci and Cj. Similarly, the relative closeness RC(Ci, Cj) is defined as the ratio of the closeness of the merged cluster of Ci and Cj to the average internal closeness of Ci and Cj. Here the closeness of a cluster refers to the average weight of the edges that are removed when a min-cut is performed on the cluster.
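The two ratios can be written out as plain arithmetic. This is a sketch of the definitions as worded above, using an unweighted average of the two clusters' values; the function names, parameter names, and input numbers are hypothetical, and the min-cut values are assumed to have been computed elsewhere.

```python
def relative_interconnectivity(cut_between, cut_i, cut_j):
    """Total weight of edges joining Ci and Cj, relative to the average
    total weight of the min-cut edges inside Ci and inside Cj."""
    return cut_between / ((cut_i + cut_j) / 2)


def relative_closeness(avg_between, avg_i, avg_j):
    """Average weight of edges joining Ci and Cj, relative to the average
    of the mean min-cut edge weights inside Ci and inside Cj."""
    return avg_between / ((avg_i + avg_j) / 2)


# Hypothetical min-cut statistics for two clusters (not from the text).
ri = relative_interconnectivity(cut_between=3.0, cut_i=4.0, cut_j=8.0)
rc = relative_closeness(avg_between=0.25, avg_i=0.25, avg_j=0.75)
print(ri, rc)  # 0.5 0.5
```

Values near 1 indicate that the connection between the two clusters looks much like the connections inside each of them, which is the condition under which merging is favored.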

The similarity function is then computed as a product: RC(Ci, Cj) * RI(Ci, Cj)^α, where α
