bookssland.com » Other » Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗

Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic



1 ... 132 133 134 135 136 137 138 139 140 ... 193
Go to page:
as σ ai bi, and (a • a) is a special case of the inner product representing square of the magnitude of the vector a. The reader can easily check the previous relation. If, for example, the vectors a and b are a = (1,2,3) and b = (2,1,2), then the Euclidean distance may be calculated as d2 = 14 + 9 − 2×10 = 3. While products (a • a) and (b • b) can be computed locally, and each result is a single value, the core challenge is to develop an algorithm for distributed inner product computation (a • b). A simple, communication-efficient randomized technique for computing this inner product between two vectors observed at two different sites may consist of the following steps:

1. Vectors a and b are given on two sites, A and B, respectively. Site A sends to the site B a random number generator seed. (This is only one passed message.)

2. Both sites A and B cooperatively generate a random matrix R with dimensions k × m, where k m. Each entry in matrix R is generated independently and identically from some fixed distribution with mean 0 and a finite variance.

3. Based on matrix R, sites A and B compute their own local matrix products: ∧a = R a and ∧b = R b.

Dimensions of new local vectors ∧a and ∧b are k, and that means significantly lower than initial lengths of m.

4. Site A sends the resulting vector ∧a to the site B. (This represents k passed messages.)

5. Site B computes approximate inner product (a • b) = (∧aT • ∧b)/k

So, instead of sending an m-dimensional vector to the other site, the algorithm sends only a (k + 1)-dimensional vector where k m (k is a user-defined parameter). The inner product of vectors can still be estimated accurately with lower communication load.

In the DDM literature, one of two assumptions is commonly adopted as to how data are distributed across sites: (1) homogeneously or horizontally partitioned, or (2) heterogeneously or vertically partitioned. Both viewpoints assume that the data tables at each distributed site are partitions of a single global table. It is important to stress that the global table viewpoint is strictly conceptual. It is not necessarily assumed that such a table was physically realized and partitioned to form the tables at each site. In the homogeneous case, the global table is horizontally partitioned. The tables at each site are subsets of the global table; they have exactly the same attributes. Figure 12.30a illustrates the homogeneously distributed case using an example from weather data where both tables use the same three attributes. In the heterogeneous case, the table is vertically partitioned where each site contains a subset of columns. That means sites do not have the same attributes. However, samples at each site are assumed to contain a unique identifier to facilitate matching, and Figure 12.30b illustrates this case. The tables at distributed sites have different attributes, and samples are linked through a unique identifier, Patient ID.

Figure 12.30. Horizontally versus vertically partitioned data. (a) Horizontally partitioned data; (b) vertically partitioned data.

DDM technology supports different data-mining tasks including classification, prediction, clustering, market-basket analysis, and outliers’ detection. A solution for each of these tasks may be implemented with a variety of DDM algorithms. For example, distributed Apriori has several versions for frequent itemset generation in distributed transactional database. They usually require multiple synchronizations and communication steps. Most of these implementations assume that platforms are homogeneous and therefore the data sets are partitioned evenly among the sites. However, in practice, both the data sets and the processing platforms are more likely to be heterogeneous, running multiple and different systems and tools. This leads to unbalanced data-set distributions and workloads causing additional problems in implementation.

One recent trend is online-mining technology used for monitoring in distributed sensor networks. This is because deployments of large-scale distributed sensor networks are now possible owing to hardware advances and increasing software support. Online data mining, also called data-stream mining, is concerned with extracting patterns, detecting outliers, or developing dynamic models of a system’s behavior from continuous data streams such as those generated by sensor networks. Because of the massive amount of data and the speed of which the data are generated, many data-mining applications in sensor networks require in-network processing such as aggregation to reduce sample size and communication overhead. Online data mining in sensor networks offers many additional challenges, including:

limited communication bandwidth,

constraints on local computing resources,

limited power supply,

need for fault tolerance, and

asynchronous nature of the network.

Obviously, data-mining systems have evolved in a short period of time from stand-alone programs characterized by single algorithms with little support for the entire knowledge-discovery process to integrated systems incorporating several mining algorithms, multiple users, communications, and various and heterogeneous data formats and distributed data sources. Although many DDM algorithms are developed and deployed in a variety of applications, the trend will be illustrated in this book with only one example of a distributed clustering algorithm.

12.4.1 Distributed DBSCAN Clustering

Distributed clustering assumes that samples to be clustered reside on different sites. Instead of transmitting all samples to a central site where we can apply one of the standard clustering algorithms to analyze the data locally, the data are clustered independently on the distributed local sites. Then, in a subsequent step, the central site tries to establish a global clustering model based on the downloaded local models, that is, summarized representatives of local data. Distributed clustering is carried out on two different levels, that is, the local level and the global level (Fig. 12.31). On the local level, all sites carry out clustering independently from each other. Communication with the central site and determining a global model should reflect an optimum trade-off between complexity and accuracy of the algorithm.

Figure 12.31. System architecture for distributed clustering.

Local models consist of a set of representatives for each locally found cluster. A representative is a good approximation for samples residing on the

1 ... 132 133 134 135 136 137 138 139 140 ... 193
Go to page:

Free e-book «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗» - read online now

Comments (0)

There are no comments yet. You can be the first!
Add a comment