Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
Figure 12.15. Traditional statistical analyses of time series. (a) Trend; (b) cycle; (c) seasonal; (d) outliers.
12.2.1 Temporal Data Representation
The representation of temporal data is especially important when dealing with large volumes of data since direct manipulation of continuous, high-dimensional data in an efficient way is extremely difficult. There are a few ways this problem can be addressed.
Original Data or with Minimal Preprocessing
Use data as they are without or with minimal preprocessing. We preserve the characteristics of each data point when a model is built. The main disadvantage of this process is that it is extremely inefficient to build data-mining models with millions of records of temporal data, all of them with different values.
Windowing and Piecewise Approximations
There is well-known psychological evidence that the human eye segments smooth curves into piecewise straight lines. Based on this theory, there are a number of algorithms for segmenting a curve that represents a time series. Figure 12.16 shows a simple example where it replaces the original nonlinear function with several piecewise linear functions. As shown in the figure, the initial real-valued elements (time series) are partitioned into several segments. Finding the required number of segments that best represent the original sequence is not trivial. An easy approach is to predefine the number of segments. A more realistic approach may be to define them when a change point is detected in the original sequence. Another technique based on the same idea segments a sequence by iteratively merging two similar segments. Segments to be merged are selected based on the squared-error minimization criteria. Even though these methods have the advantage of ability to reduce the impact of noise in the original sequence, when it comes to real-world applications (e.g., sequence matching) differences in amplitudes (scaling) and time-axis distortions are not addressed easily.
Figure 12.16. Simplified representation of temporal data. (a) Piecewise linear approximation; (b) piecewise aggregate approximation.
To overcome these drawbacks, Piecewise Aggregate Approximation (PAA) technique was introduced. It approximates the original data by segmenting the sequences into same-length sections and recording the mean value of these sections. A time-series C of length n is represented as C = {c1, c2, … , cn}. The task is to represent C as C′ in a w-dimensional space (w < n) by mean values of cis in w equal-sized segments. The ith element of C′ is calculated as a mean of all values in the segment:
For example, if the original sequence is C = {−2, −4, −3, −1, 0, 1, 2, 1, 1, 0}, where n = |C| = 10, and we decide to represent C in two sections of the same length, then
Usually, PAA is visualized as a linear combination of box bases functions as illustrated in Figure 12.16b, where a continuous function is replaced with 10 discrete averages for each interval.
A modified PAA algorithm, which is called Symbolic Aggregate Approximation (SAX), is proposed with the assumption that the original normalized time series, C, has a Gaussian distribution of PAA values. SAX defines “break points” in the Gaussian curve that will produce equal-sized areas below the curve. Formally, break points are a sorted list of numbers such that the areas under the Gaussian curve from to are equal to 1/α, and they are constant. α is a parameter of the methodology representing the number of intervals. These break points can be determined in a statistical table. For example, Figure 12.17 gives the break points for α values from 3 to 10.
Figure 12.17. SAX look-up table. (a) Break points in the Gaussian curve (b = 0,c2 = 0.2) when α = 3; (b) SAX look-up table.
Once the break points are defined along with the corresponding coding symbols for each interval, the sequence is discretized as follows:
1. Obtain the PAA averages of the time series.
2. All PAA averages in the given interval (, ) are coded with a specific symbol for this interval. For example, if α = 3, all PAA averages less than the smallest break point (−0.43) are mapped to “a.” All PAA coefficients less than the second break point but greater than first breakpoint (−0.43, 0.43) are mapped to “b,” and all average values larger than the second break point (0.43) are mapped to “c.” This process is illustrated in Figure 12.18.
Figure 12.18. Transformation of PAA to SAX.
It is assumed that the symbols “a,” “b,” and “c” are approximately equiprobable symbols in the representation of a time series. The original sequence is then represented as a concatenation of these symbols, which is known as a “word.” For example, the mapping from PAA (C′) to a word C″ is represented as C″ = (bcccccbaaaaabbccccbb). The main advantage of the SAX method is that 100 different discrete numerical values in an initial discrete time series C is first reduced to 20 different (average) values using PAA, and then they are transformed into only three different categorical values using SAX.
The proposed approach is intuitive and simple, yet a powerful methodology in a simplified representation of a large number of different values in time series. The method is fast to compute and supports different distance measures. Therefore, it is applicable as a data-reduction step for different data-mining techniques.
Transformation-Based Representations
The main idea behind transformation-based techniques is to map the original data into a point of a more manageable domain. One of the widely used methods is the Discrete Fourier Transformation (DFT). It transforms a sequence in the time domain to a point in the frequency domain. This is done by selecting the top-K frequencies and representing each sequence as a point in the K-dimensional space. One important property that is worth noting is that Fourier coefficients do not change under the shift operation. One problem with DFT is that it misses the important feature of time localization. To avoid this problem Piecewise Fourier Transform was proposed, but the size
Comments (0)