Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
Figure 12.22. Markov Model versus hidden Markov Model. (a) One-coin model; (b) two-coin model.
Another possibility for explaining the observed sequence is shown in Figure 12.22b. There are again two states in this model, and each state corresponds to a separate biased coin being tossed. Each state has its own probability distribution of Heads and Tails, and therefore the model is represented as an HMM. Obviously, in this model we have several “paths” to determine the probability of the sequence. In other words, we can start with tossing one or another coin, and continue with this selection. In all these cases, composite probability will be different. In this situation, we may search for the maximum probability of the sequence O in the HMM. HMM may be formalized as a directed graph with V vertices and A arcs. Set V = {v1, v2, … , vn} represents states, and matrix A = {aij} represents transition-probability distribution, where aij is the transitional probability from state i to state j. Given a set of possible observations O = {o1, o2, … , om} for each state vi, the probability of seeing each observation in the sequence is given by Bi = {oi1, oi2, … , oim}. The initial-state distribution is represented as σ, which determines the starting state at time t = 0.
12.2.4 Mining Sequences
Temporal data-mining tasks include prediction, classification, clustering, search and retrieval, and pattern discovery. The first four have been investigated extensively in traditional time-series analysis, pattern recognition, and information retrieval. We will concentrate in this text on illustrative examples of algorithms for pattern discovery in large databases, which are of more recent origin and showing wide applicability. The problem of pattern discovery is to find and evaluate all “interesting” patterns in the data. There are many ways of defining what constitutes a pattern in the data and we shall discuss some generic approaches. There is no universal notion for interestingness of a pattern either. However, one concept that is found very useful in data mining is that of frequent patterns. A frequent pattern is one that occurs many times in the data. Much of the data-mining literature is concerned with formulating useful pattern structures and developing efficient algorithms for discovering all patterns that occur frequently in the data.
A pattern is a local structure in the database. In the sequential-pattern framework, we are given a collection of sequences, and the task is to discover sequences of items called sequential patterns that occur in sufficiently many of those sequences. In the frequent episodes analysis, the data set may be given in a single long sequence or in a large set of shorter sequences. An event sequence is denoted by {(E1, t1), (E2, t2), …, (En,, tn)}, where Ei takes values from a finite set of event types E, and ti is an integer denoting the time stamp of the ith event. The sequence is ordered with respect to the time stamps so that ti ≤ ti + 1 for all i = 1, 2, … , n. The following is an example event sequence S with 10 events in it:
An episode is a partially ordered set of events. When the order among the events of an episode is total, it is called a serial episode, and when there is no order at all, the episode is called a parallel episode. For example, (A → B → C) is a three-node serial episode. The arrows in our notation serve to emphasize the total order. In contrast, parallel episodes are somewhat similar to itemsets, and so, we can denote a three-node parallel episode with event types A, B, and C, as (ABC).
An episode is said to occur in an event sequence if there exist events in the sequence occurring with exactly the same order as that prescribed in the episode. For instance, in the example the events (A, 2), (B, 3) and (C, 8) constitute an occurrence of the serial episode (A → B → C) while the events (A, 7), (B, 3), and (C, 8) do not, because for this serial episode to occur, A must occur before B and C. Both these sets of events, however, are valid occurrences of the parallel episode (ABC), since there are no restrictions with regard to the order in which the events must occur for parallel episodes. Let α and β be two episodes. β is said to be a sub-episode of α if all the event types in β appear in α as well, and if the partial order among the event types of β is the same as that for the corresponding event types in α. For example, (A → C) is a two-node sub-episode of the serial episode (A → B → C) while (B → A) is not. In case of parallel episodes, this order constraint is not required.
The sequential pattern-mining framework may extend the frequent itemsets idea described in the chapter on association rules with temporal order. The database D of itemsets is considered no longer just some unordered collection of transactions. Now, each transaction in D carries a time stamp as well as a customer ID. Each transaction, as earlier, is simply a collection of items. The transactions associated with a single customer can be regarded as a sequence of itemsets ordered by time, and D would have one such transaction sequence corresponding to each customer. Consider an example database with five customers whose corresponding transaction sequences are as follows:Customer IDTransaction Sequence1({A,B}{A,C,D}{B,E})2({D,G} {A,B,E,H})3({A}{B,D}{A,B,E,F}{G,H})4({A}{F})5({A,D} {B,E,G,H} {F})
Each customer’s transaction sequence is enclosed in angular braces, while the items bought in a single transaction are enclosed in round braces. For example, customer 3 made four visits to the supermarket. In his/her first visit he/she bought only item A, in the second visit items B and D, and so on.
The temporal patterns of interest are sequences of itemsets. A sequence S of itemsets is denoted by {s1 s2 · · ·
Comments (0)