Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
4.7 MODEL SELECTION VERSUS GENERALIZATION
We assume that the empirical data are given according to an unknown probability distribution. The question arises as to whether a finite set of empirical data includes sufficient information such that the underlying regularities can be learned and represented with a corresponding model. A positive answer to this question is a necessary condition for the success of any learning algorithm. A negative answer would yield the consequence that a learning system may remember the empirical data perfectly, and the system may have an unpredictable behavior with unseen data.
We may start discussion about appropriate model selection with an easy problem of learning a Boolean function from examples. Assume that both inputs and outputs are binary. For d inputs, there are 2d different samples, and there are 22d possible Boolean functions as outputs. Each function is a potential hypothesis hi. In the case of two inputs x1 and x2, there are 24 = 16 different hypotheses as shown in Table 4.1.
TABLE 4.1. Boolean Functions as Hypotheses
Each training (learning) sample with two inputs and one output value (x1, x2, o) removes half the hypotheses (hi). For example, sample (0, 0, 0) removes h9 to h16 because these hypotheses have output value 1 for the input pair (0, 0). This is one way of interpreting learning: We start with all possible hypotheses, and as we see more training samples we remove non-consistent hypotheses. After seeing N samples, there remain 22d-N possible hypotheses, that is, Boolean functions as a model for a given data set. In reality, where all inputs are usually not binary but with k different values (k), and also data are high-dimensional (d), then kd N. The number of samples for real-world data is significantly lower than the number of hypotheses (or the number of potential models). Therefore, data set by itself is not sufficient to find a unique solution—model. There is still huge number of hypotheses. We have to make some extra assumptions to reach a unique solution with the given data (N samples). These assumptions we call inductive bias (principle) of the learning algorithm. It influences a model selection. The main question is: How well does a model trained on training data set predict the right output for new samples (not available in training data). This represents an essential requirement for model generalization. For best generalization, we should match the complexity of the hypothesis class with the complexity of the function underlying training data. We made in the learning process a trade-off between the complexity of the hypothesis, the amount of data, and the generalization error of new samples (Fig. 4.31). Therefore, building a data-mining model is not a straightforward procedure, but a very sensitive process requiring, in many cases, feedback information for multiple mining iterations.
Figure 4.31. Trade-off between model complexity and the amount of data. (a) Too simple model; (b) too complex model; (c) appropriate model.
In the final phase of the data-mining process, when the model is obtained using one or more inductive-learning techniques, one important question still exists. How does one verify and validate the model? At the outset, let us differentiate between validation and verification.
Model validation is substantiating that the model, within its domain of applicability, behaves with satisfactory accuracy consistent with the objectives defined by the users. In other words, in model validation, we substantiate that the data have transformed into the model and that they have sufficient accuracy in representing the observed system. Model validation deals with building the right model, the model that corresponds to the system. Model verification is substantiating that the model is transformed from the data as intended into new representations with sufficient accuracy. Model verification deals with building the model right, the model that corresponds correctly to the data.
Model validity is a necessary but insufficient condition for the credibility and acceptability of data-mining results. If, for example, the initial objectives are incorrectly identified or the data set is improperly specified, the data-mining results expressed through the model will not be useful; however, we may still find the model valid. We can claim that we conducted an “excellent” data-mining process, but the decision makers will not accept our results and we cannot do anything about it. Therefore, we always have to keep in mind, as it has been said, that a problem correctly formulated is a problem half-solved. Albert Einstein once indicated that the correct formulation and preparation of a problem was even more crucial than its solution. The ultimate goal of a data-mining process should not be just to produce a model for a problem at hand, but to provide one that is sufficiently credible and accepted and implemented by the decision makers.
The data-mining results are validated and verified by the testing process. Model testing is demonstrating that inaccuracies exist or revealing the existence of errors in the model. We subject the model to test data or test cases to see if it functions properly. “Test failed” implies the failure of the model, not of the test. Some tests are devised to evaluate the behavioral accuracy of the model (i.e., validity), and some tests are intended to judge the accuracy of data transformation into the model (i.e., verification).
The objective of a model obtained through the data-mining process is to classify/predict new instances correctly. The commonly used measure of a model’s quality is predictive accuracy. Since new instances are not supposed to be seen by the model in its learning phase, we need to estimate its predictive accuracy using the true error rate. The true error rate is statistically defined as the error rate of the model on an asymptotically large number of new cases that converge to the actual population distribution. In practice, the true error rate of a data-mining model must be estimated from all the available samples, which are usually split into training and testing sets. The model is first designed using
Comments (0)