Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
One of the possible implementations of the previous ideas is the GIL system, which moves the GA closer to the symbolic level—mainly by defining specialized operators that manipulate binary strings. Previous symbolic classifiers are translated into binary strings, where for each attribute a binary string of fixed length is generated. The length is equal to the number of possible values for the given attribute. In the string, the required value is set to 1 and all others are set to 0. For example, if attribute A1 has the value z, it is represented with the binary string 001 (0’s are for values x, y). If the value of some attribute is *, that means that all values are possible, so it is represented with value 1 in all positions of the binary string.
For our previous example, with six attributes and a total number of 17 different values for all attributes, the classifier symbolically represented as
can be transformed into the binary representation
where bars separate bit sets for each attribute. The operators of the GIL system are modeled on inductive reasoning, which includes various inductive operators such as RuleExchange, RuleCopy, RuleGeneralization, RuleSpecialization, RuleSplit, SelectorDrop, ReferenceChange, and ReferenceExtension. We discuss some of them in turn.
13.6.1 RuleExchange
The RuleExchange operator is similar to a crossover of the classical GA as it exchanges selected complex between two parent chromosomes. For example, two parents (two rules)
may produce the following offspring (new rules)
13.6.2 RuleGeneralization
This unary operator generalizes a random subset of complexes. For example, for a parent
and the second and third complexes selected for generalization, the bits are ORed and the following offspring is produced:
13.6.3 RuleSpecialization
This unary operator specializes a random subset of complexes. For example, for a parent
and the second and third complexes selected for specialization, the bits are ANDed, and the following offspring is produced:
13.6.4 RuleSplit
This operator acts on a single complex, splitting it into a number of complexes. For example, a parent
may produce the following offspring (the operator splits the second selector):
The GIL system is a complex, inductive-learning system based on GA principles. It requires a number of parameters, such as the probabilities of applying each operator. The process is iterative. At each iteration, all chromosomes are evaluated with respect to their completeness, consistency, and fitness criteria, and a new population is formed with those chromosomes that are better and more likely to appear. The operators are applied to the new population, and the cycle is repeated.
13.7 GAS FOR CLUSTERING
Much effort has been undertaken toward applying GAs to provide better solutions than those found by traditional clustering algorithms. The emphasis was on appropriate encoding schemes, specific genetic operators, and corresponding fitness functions. Several encoding schemes have been proposed specifically for data clustering, and the main three types are binary, integer, and real encoding.
The binary encoding solution is usually represented as a binary string of length N, where N is the number of data set samples. Each position of the binary string corresponds to a particular sample. The value of the ith gene is 1 if the ith sample is a prototype of a cluster, and 0 otherwise. For example, the data set s in Table 13.5 can be encoded by means of the string [0100001010], in which samples 2, 7, and 9 are prototypes for clusters C1, C2, and C3. The total number of 1s in the string is equal to an a priori defined number of clusters. Clearly, such an encoding scheme leads to a medoid-based representation in which the cluster prototypes coincide with representative samples from the data set. There is an alternative way of representing a data partition using a binary encoding. The matrix of k × N dimensions is used in which the rows represent clusters, and the columns represent samples. In this case, if the jth sample belongs to the ith cluster then 1 is assigned to (i,j) genotype, whereas the other elements of the same column receive 0. For example, using this representation, the data set in Table 13.5 would be encoded as 3 × 10 matrix in Table 13.6.
TABLE 13.5. Three Clusters Are Defined for a Given Data Set s
TABLE 13.6. Binary Encoded Data Set s Given in Table 13.5
Integer encoding uses a vector of N integer positions, where N is the number of data set samples.
Each position corresponds to a particular sample; that is, the ith position (gene) represents the ith data sample. Assuming that there are k clusters, each gene has a value over the alphabet {1, 2, 3, … , k}. These values define the cluster labels. For example, the integer vector [1111222233] represents the clusters depicted in Table 13.5. Another way of representing a partition by means of an integer-encoding scheme involves using an array of only k elements to provide a medoid-based representation of the data set. In this case, each array element represents the index of the sample xi, i = 1, 2, … , N (with respect to the order the samples appear in the data set) corresponding to the prototype of a given cluster. As an example, the array [1 6 10] may represent a partition in which 1, 6, and 10 are indices of the cluster prototypes (medoids) for the data set in Table 13.5. Integer encoding is usually more computationally efficient than the binary-encoding schemes.
Real encoding is the third encoding scheme where the genotypes are made up of real numbers that represent the coordinates of the cluster centroids. In an n-dimensional space, the first n positions represent the n coordinates of the first centroid; the next n positions represent the coordinates of the second centroid, and so
Comments (0)