Data Mining by Mehmed Kantardzic
- Author: Mehmed Kantardzic
This globally distributed framework can be specified more precisely once we choose a specific clustering algorithm. The density-based clustering algorithm DBSCAN is a good candidate because it is robust to outliers, easy to implement, supports clusters of different shapes, and allows an incremental, online implementation. The main steps of the algorithm are explained in Chapter 9, and the same process is applied locally. To find local clusters, DBSCAN starts with an arbitrary core object p that is not yet clustered and retrieves all objects density-reachable from p. The retrieval of density-reachable objects is repeated until all local samples have been analyzed. After clustering the data locally, we need a small number of representatives that describe the local clustering result accurately. To determine suitable representatives of the clusters, the concept of specific core points is introduced.
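The local clustering step can be sketched as follows. This is a minimal brute-force DBSCAN written for illustration, not the listing from Chapter 9; the function names `region_query` and `dbscan`, the sample points, and the parameter values are assumptions made for this example.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        labels[i] = cluster_id             # i is a core object: start a new cluster
        queue = list(neighbors)
        while queue:                       # collect everything density-reachable from i
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # border point: joins but does not expand
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core object: keep expanding
        cluster_id += 1
    return labels

# Hypothetical local data: two dense squares and one isolated sample.
local_points = [(0, 0), (0, 1), (1, 0), (1, 1),
                (10, 10), (10, 11), (11, 10), (11, 11),
                (5, 5)]
local_labels = dbscan(local_points, eps=1.5, min_pts=3)
print(local_labels)
```

The quadratic neighborhood search keeps the sketch short; a real site with many samples would normally use a spatial index instead.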
Let C be a local cluster with respect to the given DBSCAN parameters ε and MinPts. Furthermore, let CorC ⊆ C be the set of core points belonging to this cluster. Then ScorC ⊆ C is called a complete set of specific core points of C iff the following conditions are true:
ScorC ⊆ CorC
∀si, sj ∈ ScorC, si ≠ sj: si ∉ Neighborhoodε(sj)
∀c ∈ CorC, ∃s ∈ ScorC: c ∈ Neighborhoodε(s)
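One way to obtain a set that satisfies the three conditions is a greedy pass over the core points: keep a core point only if it lies outside the ε-neighborhood of every point kept so far. The sketch below assumes this greedy strategy, which the text does not prescribe; the helper names, the sample cluster, and the parameter values are hypothetical. The result depends on the order in which core points are visited, so a complete set of specific core points is generally not unique.

```python
from math import dist

def core_points(cluster, eps, min_pts):
    """Points of the cluster with at least min_pts cluster points within eps
    (the point itself is counted, an assumption of this sketch)."""
    return [p for p in cluster
            if sum(dist(p, q) <= eps for q in cluster) >= min_pts]

def specific_core_points(cores, eps):
    """Greedy selection: keep a core point only if no kept point is within eps."""
    selected = []
    for c in cores:
        if all(dist(c, s) > eps for s in selected):
            selected.append(c)
    return selected

# Hypothetical elongated cluster on a line.
cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
eps, min_pts = 1.0, 3

cor = core_points(cluster, eps, min_pts)
scor = specific_core_points(cor, eps)
print(cor)
print(scor)
```

Every skipped core point was skipped because it lies within ε of an already selected point, which is exactly the third condition; the second condition holds because a point is only added when it is farther than ε from all selected points.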
The set ScorC consists of a very small number of specific core points that describe the cluster C. For example, in Figure 12.32a, sites 2 and 3 each have only one specific core point, while site 1, because of the cluster shape, has two specific core points. To further simplify the representation of local clusters, the number of specific core points, |ScorC| = K, is used as an input parameter for a further local “clustering step” with an adapted version of K-means. For each cluster C found by DBSCAN, K-means uses the ScorC points as starting points. The result is K = |ScorC| subclusters and centroids within C.
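The seeding idea can be illustrated with plain Lloyd-style K-means started from the specific core points. This is a sketch under stated assumptions: it uses the standard K-means iteration, not the adapted version mentioned in the text, and the function name `kmeans_from_seeds` and the sample data are hypothetical.

```python
from math import dist

def kmeans_from_seeds(points, seeds, max_iter=100):
    """Standard K-means iterations with K = len(seeds) and the seeds as initial centroids."""
    centroids = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (ties go to the centroid with the lower index).
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            groups[nearest].append(p)
        # Update step: each centroid moves to the mean of its group;
        # an empty group keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups

# Hypothetical cluster C and its specific core points ScorC.
cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
scor = [(1, 0), (3, 0)]

centroids, subclusters = kmeans_from_seeds(cluster, scor)
print(centroids)
print(subclusters)
```

Because K is fixed by |ScorC| and the starting points already lie inside the cluster, no separate choice of K or random initialization is needed in this step.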
Figure 12.32. Distributed DBSCAN clustering (Januzaj et al., 2003). (a) Local clusters; (b) local representatives; (c) global model with εglobal = 2εlocal.
Each local model LocalModelk consists of a set of mk pairs (r, εr): a representative r (a specific core point) and an ε radius value εr. The number mk of pairs transmitted from site k is determined by the number of clusters Ci found on that site and by the number of specific core points in each of them. Each pair (r, εr) represents a subset of samples that are all located in the corresponding local cluster. We then have to check whether two or more of these clusters, found on different sites, can be merged. That is the main task of the global modeling part. To find such a global model, the algorithm applies the density-based clustering algorithm DBSCAN again, but only to the representatives collected from the local models. Because of the characteristics of these representative points, the parameter MinPtsglobal is set to 2, and the radius εglobal should generally be set close to 2εlocal.
In Figure 12.32, an example of distributed DBSCAN for εglobal = 2εlocal is depicted. Figure 12.32a shows the clusters detected independently on sites 1, 2, and 3. The cluster on site 1 is represented using K-means by two representatives, R1 and R2, whereas the clusters on site 2 and site 3 are each represented by only one representative, as shown in Figure 12.32b. Figure 12.32c illustrates that all four local clusters from the different sites are merged into one large cluster. This integration is obtained by using an εglobal parameter equal to 2εlocal. Figure 12.32c also makes clear that εglobal = εlocal is insufficient to detect this global cluster. When the final global model is obtained, it is distributed to the local sites, where it is used to correct the previously found local models. For example, points left as outliers in the local clustering may be integrated into modified clusters under the global model.
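The global step can be sketched in a few lines. With MinPtsglobal = 2, and assuming the usual convention that a point counts toward its own neighborhood, every representative that has at least one other representative within εglobal is a core point, so DBSCAN reduces to finding connected groups of representatives. The coordinates below are invented for illustration and are not taken from Figure 12.32; the function name `merge_representatives` is hypothetical.

```python
from math import dist

def merge_representatives(reps, eps_global):
    """DBSCAN with MinPts = 2 on the representatives.

    Representatives linked by chains of distances <= eps_global share a label;
    a representative with no neighbor within eps_global is labeled -1.
    """
    labels = [-1] * len(reps)
    cluster_id = 0
    for i in range(len(reps)):
        if labels[i] != -1:
            continue
        stack, members = [i], {i}
        while stack:                          # grow the connected group around i
            a = stack.pop()
            for b in range(len(reps)):
                if b not in members and dist(reps[a], reps[b]) <= eps_global:
                    members.add(b)
                    stack.append(b)
        if len(members) >= 2:                 # at least one neighbor: a global cluster
            for m in members:
                labels[m] = cluster_id
            cluster_id += 1
    return labels

# Hypothetical representatives: R1 and R2 from site 1, one each from sites 2 and 3,
# plus one distant representative from another site.
eps_local = 1.0
reps = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0), (4.5, 0.0), (20.0, 20.0)]

merged_wide = merge_representatives(reps, eps_global=2 * eps_local)
merged_narrow = merge_representatives(reps, eps_global=eps_local)
print(merged_wide)
print(merged_narrow)
```

In this toy configuration the neighboring representatives are 1.5 apart, so they chain into one global cluster with εglobal = 2εlocal but stay unmerged with εglobal = εlocal, mirroring the comparison made for Figure 12.32c.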
12.5 CORRELATION DOES NOT IMPLY CAUSALITY
An associational concept is any relationship that can be defined in terms of a frequency-based joint distribution of observed variables, while a causal concept is any relationship that cannot be defined from the distribution alone. Even simple examples show that the associational criterion is neither necessary nor sufficient for causality confirmation. For example, data mining might determine that males with income between $50,000 and $65,000 who subscribe to certain magazines are likely purchasers of a product you want to sell. While you can take advantage of this pattern, say by aiming your marketing at people who fit the pattern, you should not assume that any of these factors (income, type of magazine) cause them to buy your product. The predictive relationships found via data mining are not necessarily causes of an action or behavior.
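A small simulation can make the distinction concrete. The sketch below is an invented toy model, not data from the text: a hidden factor drives two observed variables `x` and `y`, and `x` has no effect on `y`. The two variables are clearly correlated in the observed data, yet the correlation disappears when `x` is assigned at random, which is what an intervention would do. All variable names, the sample size, and the noise model are assumptions of this example.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 5000

# Observational setting: a hidden common cause drives both x and y.
hidden = [rng.gauss(0, 1) for _ in range(n)]
x = [h + rng.gauss(0, 1) for h in hidden]
y = [h + rng.gauss(0, 1) for h in hidden]
observed_corr = pearson(x, y)

# Interventional setting: x is assigned at random, independently of the hidden cause;
# y is generated exactly as before because x never influenced it.
x_forced = [rng.gauss(0, 1) for _ in range(n)]
y_after = [h + rng.gauss(0, 1) for h in hidden]
intervened_corr = pearson(x_forced, y_after)

print(round(observed_corr, 2), round(intervened_corr, 2))
```

In this model `x` is a useful predictor of `y`, which is all that the joint distribution can reveal; whether changing `x` would change `y` depends on the mechanism that generated the data, and that mechanism is not identifiable from the observed frequencies alone.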
The research questions that motivate many studies in the health, social, and behavioral sciences are not statistical but causal in nature. For example, what is the efficacy of a given drug in a given population, or what fraction of past crimes could have been avoided by a given policy? The central aim of such studies is to determine cause–effect relationships among the variables of interest, for example, treatments–diseases or policies–crime, as precondition–outcome relationships. To express causal assumptions mathematically, the standard mathematical language of statistics must be extended, and these extensions are not generally emphasized in the mainstream literature and education.
The aim of standard statistical analysis, typified by regression and other estimation techniques, is to infer parameters of a distribution from samples drawn from that distribution. With the help of such