bookssland.com » Other » Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗

Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic



1 ... 92 93 94 95 96 97 98 99 100 ... 193
Go to page:
0, that is, the similarity is 1 if the values match and 0 if the values mismatch. Additionally, measures may use the following information in computing a similarity value (all the measures in this paper use only this information):

1. f(a), f(b), f(c), and f(d), the frequencies of the values in the data set;

2. N, the size of the data set; and

3. n, the number of values taken by the attribute (4 in the case above).

We will present, as an illustrative example, only one additional measure of similarity for categorical data, Goodall3, because it shows good performances on average for a variety of experiments with different data sets. That does not mean that some other measures such as Eskin, Lin, Smirnov, or Burnaby will not be more appropriate for a specific data set. The Goodall3 measure, given in Table 9.4, assigns a high similarity if the matching values are infrequent regardless of the frequencies of the other values.The range of Sk(Xk; Yk) for matches in the Goodall3 measure is , with the minimum value being attained if Xk is the only value for attribute Ak and maximum value is attained if Xk occurs only twice.

TABLE 9.4. Goodall3 Similarity Measure for Categorical Attributes

There are some advanced distance measures applicable to categorical data, and also to numerical data, that take into account the effect of the surrounding or neighboring points in the n-dimensional spaces of samples. These surrounding points are called contexts. The similarity between two points, xi and xj, with the given context, is measured using the mutual neighbor distance (MND), which is defined as

where NN(xi, xj) is the neighbor number of xj with respect to xi. If xi is the closest point to xj, then NN(xi, xj) is equal to 1, if it is the second closest point, NN(xi, xj) is equal to 2, and so on. Figures 9.3 and 9.4 give an example of the computation and basic characteristics of the MND measure.

Figure 9.3. A and B are more similar than B and C using the MND measure.

Figure 9.4. After changes in the context, B and C are more similar than A and B using the MND measure.

Points in Figures 9.3 and 9.4, denoted by A, B, C, D, E, and F, are 2-D samples with features x1 and x2. In Figure 9.3, the nearest neighbor of A is B using Euclidian distance, and B’s nearest neighbor is A. So,

If we compute the distance between points B and C, the results will be

Figure 9.4 was obtained from Figure 9.3 by adding three new points D, E, and F (samples in the data set). Now, because the context has changed, the distances between the same points A, B, and C have also changed:

The MND between A and B has increased by introducing additional points close to A, even though A and B have not moved. B and C points become more similar than points A and B. The MND measure is not a metric because it does not satisfy the triangle inequality. Despite this, MND has been successfully applied in several real-world clustering tasks.

In general, based on a distance measure between samples, it is possible to define a distance measure between clusters (set of samples). These measures are an essential part in estimating the quality of a clustering process, and therefore they are part of clustering algorithms. The widely used measures for distance between clusters Ci and Cj are

1. Dmin (Ci, Cj) = min |pi − pj|, where pi ∈ Ci and pj ∈ Cj;

2. Dmean (Ci, Cj) = |mi − mj |, where mi and mj are centriods of Ci and Cj;

3. Davg (Ci, Cj) = 1/(ni nj) ∑ ∑ |pi − pj|, where pi ∈ Ci and pj ∈ Cj, and ni and nj are the numbers of samples in clusters Ci and Cj; and

4. Dmax (Ci, Cj) = max |pi − pj|, where pi ∈ Ci and pj ∈ Cj.

9.3 AGGLOMERATIVE HIERARCHICAL CLUSTERING

In hierarchical-cluster analysis, we do not specify the number of clusters as a part of the input. Namely, the input to a system is (X, s), where X is a set of samples, and s is a measure of similarity. An output from a system is a hierarchy of clusters. Most procedures for hierarchical clustering are not based on the concept of optimization, and the goal is to find some approximate, suboptimal solutions, using iterations for improvement of partitions until convergence. Algorithms of hierarchical cluster analysis are divided into the two categories, divisible algorithms and agglomerative algorithms. A divisible algorithm starts from the entire set of samples X and divides it into a partition of subsets, then divides each subset into smaller sets, and so on. Thus, a divisible algorithm generates a sequence of partitions that is ordered from a coarser one to a finer one. An agglomerative algorithm first regards each object as an initial cluster. The clusters are merged into a coarser partition, and the merging process proceeds until the trivial partition is obtained: All objects are in one large cluster. This process of clustering is a bottom-up process, where partitions are from a finer one to a coarser one. In general, agglomerative algorithms are more frequently used in real-world applications than divisible methods, and therefore we will explain the agglomerative approach in greater detail.

Most agglomerative hierarchical clustering algorithms are variants of the single-link or complete-link algorithms. These two basic algorithms differ only in the way they characterize the similarity between a pair of clusters. In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of samples drawn from the two clusters (one element from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum of all distances between all pairs drawn from the two clusters. A graphical illustration of these two distance measures is given in

1 ... 92 93 94 95 96 97 98 99 100 ... 193
Go to page:

Free e-book «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗» - read online now

Comments (0)

There are no comments yet. You can be the first!
Add a comment