bookssland.com » Other » Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗

Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic



1 ... 161 162 163 164 165 166 167 168 169 ... 193
Go to page:
focuses primarily on physical data such as the human body, the earth, and molecules. Scientific visualization also deals with multidimensional data, but most of the data sets used in this field use the spatial attributes of the data for visualization purposes, for example, computer-aided tomography(CAT) and CAD. Also, many of the Geographical Information Systems (GIS) use either the Cartesian coordinate system or some modified geographical coordinates to achieve a reasonable visualization of the data.

Information visualization focuses on abstract, nonphysical data such as text, hierarchies, and statistical data. Data-mining techniques are primarily oriented toward information visualization. The challenge for nonphysical data is in designing a visual representation of multidimensional samples (where the number of dimensions is greater than three). Multidimensional-information visualizations present data that are not primarily plenary or spatial. One-, two-, and three-dimensional, but also temporal information–visualization schemes can be viewed as a subset of multidimensional information visualization. One approach is to map the nonphysical data to a virtual object such as a cone tree, which can be manipulated as if it were a physical object. Another approach is to map the nonphysical data to the graphical properties of points, lines, and areas.

Using historical developments as criteria, we can divide IVT into two broad categories: traditional IVT and novel IVT. Traditional methods of 2-D and 3-D graphics offer an opportunity for information visualization, even though these techniques are more often used for presentation of physical data in scientific visualization. Traditional visual metaphors are used for a single or a small number of dimensions, and they include:

1. bar charts that show aggregations and frequencies;

2. histograms that show the distribution of variable values;

3. line charts for understanding trends in order;

4. pie charts for visualizing fractions of a total;

5. scatter plots for bivariate analysis.

Color-coding is one of the most common traditional IVT methods for displaying a 1-D set of values where each value is represented by a different color. This representation becomes a continuous tonal variation of color when real numbers are the values of a dimension. Normally, a color spectrum from blue to red is chosen, representing a natural variation from “cool” to “hot,” in other words, from the smallest to the highest values.

With the development of large data warehouses, data cubes became very popular IVT. A data cube, the raw-data structure in a multidimensional database, organizes information along a sequence of categories. The categorizing variables are called dimensions. The data, called measures, are stored in cells along given dimensions. The cube dimensions are organized into hierarchies and usually include a dimension representing time. The hierarchical levels for the dimension time may be year, quarter, month, day, and hour. Similar hierarchies could be defined for other dimensions given in a data warehouse. Multidimensional databases in modern data warehouses automatically aggregate measures across hierarchical dimensions; they support hierarchical navigation, expand and collapse dimensions, enable drill down, drill up, or drill across, and facilitate comparisons through time. In a transaction information in the database, the cube dimensions might be product, store, department, customer number, region, month, year. The dimensions are predefined indices in a cube cell and the measures in a cell are roll-ups or aggregations over the transactions. They are usually sums but may include functions such as average, standard deviation, and percentage.

For example, the values for the dimensions in a database may be

1. region: north, south, east, west;

2. product: shoes, shirts;

3. month: anuary, February, March, … , December.

Then, the cell corresponding to (north, shirt, February) is the total sales of shirts for the northern region for the month of February.

Novel IVT can simultaneously represent large data sets with many dimensions on one screen. The widely accepted classifications of these new techniques are

1. geometric-projection techniques,

2. icon-based techniques,

3. pixel-oriented techniques, and

4. hierarchical techniques.

Geometric-projection techniques aim to find interesting projections of multidimensional data sets. We will present some illustrative examples of these techniques.

The Scatter-Plot Matrix Technique is an approach that is very often available in new data-mining software tools. A grid of 2-D scatter plots is the standard means of extending a standard 2-D scatter plot to higher dimensions. If you have 10-D data, a 10 × 10 array of scatter plots is used to provide a visualization of each dimension versus every other dimension. This is useful for looking at all possible two-way interactions or correlations between dimensions. Positive and negative correlations, but only between two dimensions, can be seen easily. The standard display quickly becomes inadequate for extremely large numbers of dimensions, and user interactions of zooming and panning are needed to interpret the scatter plots effectively.

The Survey Plot is a simple technique of extending an n-dimensional point (sample) in a line graph. Each dimension of the sample is represented on a separate axis in which the dimension’s value is a proportional line from the center of the axis. The principles of representation are given in Figure 15.1.

Figure 15.1. A 4-dimensional survey plot.

This visualization of n-dimensional data allows you to see correlations between any two variables, especially when the data are sorted according to a particular dimension. When color is used for different classes of samples, you can sometimes use a sort to see which dimensions are best at classifying data samples. This technique was evaluated with different machine-learning data sets and it showed the ability to present exact IF-THEN rules in a set of samples.

The Andrews’s curves technique plots each n-dimensional sample as a curved line. This is an approach similar to a Fourier transformation of a data point. This technique uses the function f(t) in the time domain t to transform the n-dimensional point X = (x1, x2, x3, … , xn) into a continuous plot. The function is usually plotted in the interval −π ≤ t ≤ π. An example of the transforming function f(t) is

One advantage of this visualization is that it can represent many dimensions; the disadvantage, however, is the computational time required to display each n-dimensional point for large data sets.

The class of geometric-projection techniques also includes techniques of exploratory statistics such as

1 ... 161 162 163 164 165 166 167 168 169 ... 193
Go to page:

Free e-book «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗» - read online now

Comments (0)

There are no comments yet. You can be the first!
Add a comment