Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
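it is possible to calculate this force for a given point. For example, force F1 in Figure 15.8 is a product of a spring constant K1 and a distance vector between points P(x, y) and D1(1, 0):
$$F_1 = K_1\,[(x - 1)\,\mathbf{i} + y\,\mathbf{j}]$$
The same analysis will give expressions for F2, F3, and F4, whose springs are fixed at D2(0, −1), D3(−1, 0), and D4(0, 1), the anchor positions implied by the worked examples that follow. Using the basic relation between forces
$$F_1 + F_2 + F_3 + F_4 = 0$$
we will obtain
$$K_1[(x - 1)\,\mathbf{i} + y\,\mathbf{j}] + K_2[x\,\mathbf{i} + (y + 1)\,\mathbf{j}] + K_3[(x + 1)\,\mathbf{i} + y\,\mathbf{j}] + K_4[x\,\mathbf{i} + (y - 1)\,\mathbf{j}] = 0$$
Both the i and j components of the previous vector have to be equal to 0, and therefore:
$$K_1(x - 1) + K_2\,x + K_3(x + 1) + K_4\,x = 0$$
$$K_1\,y + K_2(y + 1) + K_3\,y + K_4(y - 1) = 0$$
or
$$x = \frac{K_1 - K_3}{K_1 + K_2 + K_3 + K_4}, \qquad y = \frac{K_4 - K_2}{K_1 + K_2 + K_3 + K_4}$$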
These are the basic relations for representing a 4-D point P*(K1,K2,K3,K4) in a 2-D space P(x, y) using the radial-visualization technique. Similar procedures may be performed to get transformations for other n-dimensional spaces.
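The same force balance extends to n anchors. As a sketch, assume anchor D_j sits on the unit circle at angle θ_j (the angles θ_j are notation introduced here for illustration, not taken from the original text). Setting the sum of the spring forces to zero gives the projected point as the K-weighted average of the anchor positions:

```latex
\sum_{j=1}^{n} K_j \,(P - D_j) = 0
\quad\Longrightarrow\quad
x = \frac{\sum_{j=1}^{n} K_j \cos\theta_j}{\sum_{j=1}^{n} K_j},
\qquad
y = \frac{\sum_{j=1}^{n} K_j \sin\theta_j}{\sum_{j=1}^{n} K_j}
```

With n = 4 and anchors at angles 0, −π/2, π, and π/2 (the placement implied by the worked examples in this section), the sums reduce to x = (K1 − K3)/(K1 + K2 + K3 + K4) and y = (K4 − K2)/(K1 + K2 + K3 + K4). The projection is undefined when the coordinates sum to zero.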
We can analyze the behavior of n-dimensional points after transformation and representation in two dimensions. For example, if all n coordinates have the same value, the data point will lie exactly in the center of the circle. In our 4-D space, if the initial point is P1*(0.6, 0.6, 0.6, 0.6), then using the relations for x and y its representation will be P1(0, 0). If the n-dimensional point is a unit vector for one dimension, then the projected point will lie exactly at the fixed point on the edge of the circle (where the spring for that dimension is fixed). Point P2*(0, 0, 1, 0) will be represented as P2(−1, 0). Radial visualization represents a nonlinear transformation of the data, which preserves certain symmetries. This technique emphasizes the relative magnitudes of a point's dimensional values rather than their absolute values. Some additional features of radial visualization include:
1. Points with approximately equal coordinate values will lie close to the center of the representational circle. For example, P3*(0.5, 0.6, 0.4, 0.5) will have 2-D coordinates P3(0.05, −0.05).
2. Points that have one or two coordinate values greater than the others lie closer to the anchors (the fixed spring points) of those dimensions. For example, P4*(0.1, 0.8, 0.6, −0.1) will have a 2-D representation P4(−0.36, −0.64). The point is in the third quadrant, closer to D2 and D3, the points where the springs for the second and third dimensions are fixed.
3. An n-dimensional line will map to a line or, in a special case, to a point. For example, points P5*(0.3, 0.3, 0.3, 0.3), P6*(0.6, 0.6, 0.6, 0.6), and P7*(0.9, 0.9, 0.9, 0.9) are on a line in a 4-D space, and all three of them will be transformed into the same 2-D point P567(0, 0).
4. A sphere will map to an ellipse.
5. An n-dimensional plane maps to a bounded polygon.
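The worked examples above can be checked numerically. The following is a minimal sketch, not the book's own code: the function name `radviz` and the constant `ANCHORS_4D` are hypothetical, and the anchor positions are inferred from the examples in the text because Figure 15.8 is not reproduced here.

```python
from math import isclose

# Anchor positions D1..D4, inferred from the worked examples in the text
# (an assumption; Figure 15.8 itself is not available here).
ANCHORS_4D = [(1.0, 0.0), (0.0, -1.0), (-1.0, 0.0), (0.0, 1.0)]


def radviz(point, anchors=ANCHORS_4D):
    """Project an n-D point to 2-D as the equilibrium of the spring forces.

    Coordinate K_j acts as the spring constant pulling toward anchor D_j,
    so the equilibrium is the K-weighted average of the anchor positions.
    """
    if len(point) != len(anchors):
        raise ValueError("need exactly one anchor per dimension")
    total = sum(point)
    if total == 0:
        raise ValueError("projection is undefined when the coordinates sum to zero")
    x = sum(k * ax for k, (ax, _) in zip(point, anchors)) / total
    y = sum(k * ay for k, (_, ay) in zip(point, anchors)) / total
    return x, y


# The 4-D points used as examples in the text.
examples = {
    "P1": (0.6, 0.6, 0.6, 0.6),
    "P2": (0.0, 0.0, 1.0, 0.0),
    "P3": (0.5, 0.6, 0.4, 0.5),
    "P4": (0.1, 0.8, 0.6, -0.1),
    "P5": (0.3, 0.3, 0.3, 0.3),
    "P7": (0.9, 0.9, 0.9, 0.9),
}

for name, p in examples.items():
    x, y = radviz(p)
    print(f"{name}*{p} -> ({x:.2f}, {y:.2f})")
```

Because the result is a ratio, multiplying all coordinates of a point by the same nonzero factor leaves its projection unchanged, which is why P5, P6, and P7 collapse to a single 2-D point.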
The Gradviz method is a simple extension of radial visualization that places the dimensional anchors on a rectangular grid instead of on the perimeter of a circle. The spring forces work the same way. Dimensional labeling for Gradviz is difficult, but the number of dimensions that can be displayed increases significantly in comparison with the Radviz technique. For example, in a typical Radviz display about 50 anchors seems to be a reasonable limit around a circle. However, in a grid layout supported by the Gradviz technique you can easily fit 50 × 50 grid points, that is, 2500 dimensions, into the same area.
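A grid of anchors can be sketched with the same weighted-average rule. The code below is an illustration under stated assumptions: the names `grid_anchors` and `gradviz`, the row-major ordering of dimensions, and the choice of the square [−1, 1] × [−1, 1] as the display area are all hypothetical and not specified in the text.

```python
from math import isclose


def grid_anchors(rows, cols):
    """Return rows * cols anchor positions on a regular grid, in row-major order.

    The grid spans [-1, 1] x [-1, 1]; this extent is an assumption made
    for illustration only.
    """
    def axis(n):
        return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

    return [(x, y) for y in axis(rows) for x in axis(cols)]


def gradviz(point, anchors):
    """Spring-force equilibrium: the coordinate-weighted average of the anchors."""
    if len(point) != len(anchors):
        raise ValueError("need exactly one anchor per dimension")
    total = sum(point)
    if total == 0:
        raise ValueError("projection is undefined when the coordinates sum to zero")
    x = sum(k * ax for k, (ax, _) in zip(point, anchors)) / total
    y = sum(k * ay for k, (_, ay) in zip(point, anchors)) / total
    return x, y


anchors = grid_anchors(3, 3)           # 9 dimensions on a 3 x 3 grid

unit = [0.0] * 9
unit[0] = 1.0                          # unit vector for the first dimension
print(gradviz(unit, anchors))          # lands on that dimension's anchor

print(gradviz([0.4] * 9, anchors))     # equal coordinates land at the grid center

print(len(grid_anchors(50, 50)))       # number of dimensions a 50 x 50 grid holds
```

One consequence of the grid layout is visible here: the unit vector of the dimension anchored at the center of the grid projects to the same place as a point with all coordinates equal, so a Gradviz display is more ambiguous than a Radviz display with the same data.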
15.5 VISUALIZATION USING SELF-ORGANIZING MAPS (SOMs)
SOM is often seen as a promising technique for exploratory analyses through visualization of high-dimensional data. It visualizes a data structure of a high-dimensional data space usually as a 2-D or 3-D geometrical picture. SOMs are, in effect, a nonlinear form of PCA, and share similar goals to multidimensional scaling. PCA is much faster to compute, but it has the disadvantage, compared with SOMs, of not retaining the topology of the higher dimensional space.
The topology of the data set in its n-dimensional space is captured by the SOM and reflected in the ordering of its output nodes. This is an important feature of the SOM that allows the data to be projected onto a lower dimensional space while roughly preserving the order of the data in its original space. The resulting SOMs are then visualized using graphical representations. The SOM algorithm may use different data-visualization techniques, including a cell or U-matrix visualization (a distance-matrix visualization), projections (mesh visualization), visualization of component planes (in a multiple-linked view), and 2-D and 3-D surface plots of distance matrices. These representations use visual variables (size, value, texture, color, shape, orientation) added to the position property of the map elements. This allows exploration of relationships between samples. A coordinate system makes it possible to determine distance and direction, from which other relationships (size, shape, density, arrangement, etc.) may be derived. Multiple levels of detail allow exploration at various scales, creating the potential for hierarchical grouping of items, regionalization, and other types of generalizations. Graphical representations in SOMs are used to represent uncovered structure and patterns that may be hidden in the data set and to support understanding and knowledge construction. An illustrative example is given in Figure 15.9, where linear or nonlinear relationships are detected by the SOM.
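To make the U-matrix idea concrete, here is a minimal sketch of a small rectangular SOM trained with the classic online update rule, followed by a distance matrix over the trained map. It is an illustration only: the function names, the linear decay schedules, the Gaussian neighborhood, and the synthetic two-cluster data are all assumptions, not details given in the text.

```python
import math
import random


def best_matching_unit(weights, sample):
    """Return the grid position of the node whose weights are closest to the sample."""
    return min(weights, key=lambda node: math.dist(weights[node], sample))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns a dict {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    # One weight vector per map node, initialised from randomly chosen samples.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for sample in rng.sample(data, len(data)):      # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                     # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)     # shrinking neighborhood radius
            bmu = best_matching_unit(weights, sample)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))   # Gaussian neighborhood
                for i in range(dim):
                    w[i] += lr * h * (sample[i] - w[i])
            step += 1
    return weights


def u_matrix(weights, rows, cols):
    """Mean distance from each node's weights to those of its 4-connected neighbors.

    Assumes the map has at least two nodes, so every node has a neighbor.
    """
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if (r + dr, c + dc) in weights]
            u[r][c] = sum(math.dist(weights[(r, c)], weights[n]) for n in nbrs) / len(nbrs)
    return u


# Synthetic 3-D data: two tight clusters (made up for this illustration).
data_rng = random.Random(1)


def cluster(center, n):
    return [tuple(c + data_rng.uniform(-0.05, 0.05) for c in center) for _ in range(n)]


data = cluster((0.0, 0.0, 0.0), 20) + cluster((1.0, 1.0, 1.0), 20)
som = train_som(data, rows=4, cols=4)

for row in u_matrix(som, 4, 4):
    print(" ".join(f"{v:.2f}" for v in row))
```

In a U-matrix, larger values mark map regions where neighboring nodes represent distant points in the input space, which is how cluster boundaries become visible when the matrix is rendered as a shaded grid or surface plot.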
Figure 15.9. Output maps generated by the SOM detect relationships in data. (a) 1-D image map; (b) 2-D image map; (c) nonlinear relationship; (d) linear relationship.
For years, visualization has been applied primarily to numeric data, using pie charts, colored graphs, graphs over time, multidimensional analysis, Pareto charts, and so forth. The counterpart to numeric data is unstructured, textual data. Textual data are found in many places, but nowhere more prominently than on the Web. Unstructured electronic data include emails, email attachments, PDF files, spreadsheets, PowerPoint files, text files, and document files. In this new environment, the end user faces massive amounts, often millions, of unstructured documents. The end user cannot read them all, and certainly cannot organize or summarize them manually. Unstructured data run the less formal part of the organization, while structured data run the formal part of the organization. It is a good assumption, confirmed in many real-world applications, that as many business decisions are made in the unstructured environment as in the