On the Connection between In-sample Testing and Generalization Error
David H. Wolpert
Current address: The Santa Fe Institute,
1660 Old Pecos Trail, Suite A,
Santa Fe, NM, 87501.
Theoretical Division and Center for Nonlinear Studies,
MS B213, Los Alamos National Laboratory, Los Alamos, NM, 87545, USA
Abstract
This paper proves that it is impossible to justify a correlation between reproduction of a training set and generalization error off the training set using only a priori reasoning. As a result, the use in the real world of any generalizer that fits a hypothesis function to a training set (e.g., the use of back-propagation) is implicitly predicated on an assumption about the physical universe. This paper shows how this assumption can be expressed in terms of a non-Euclidean inner product between two vectors, one representing the physical universe and one representing the generalizer. In the course of deriving this result, a novel formalism for addressing machine learning is developed. This new formalism can be viewed as an extension of the conventional "Bayesian" formalism, to (among other things) allow one to address the case in which one's assumed "priors" are not exactly correct. The most important feature of this new formalism is that it uses an extremely low-level event space, consisting of triples of the form {target function, hypothesis function, training set}. Partly as a result of this feature, most other formalisms that have been constructed to address machine learning (e.g., PAC, the Bayesian formalism, and the "statistical mechanics" formalism) are special cases of the formalism presented in this paper. Consequently, such formalisms are capable of addressing only a subset of the issues addressed in this paper. In fact, the formalism of this paper can be used to address all generalization issues of which the author is aware: over-training, the need to restrict the number of free parameters in the hypothesis function, the problems associated with a "non-representative" training set, whether and when cross-validation works, whether and when stacked generalization works, whether and when a particular regularizer will work, and so forth. A summary of some of the more important results of this paper concerning these and related topics can be found in the conclusion.
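As a purely illustrative sketch of the event space mentioned in the abstract (the symbols f, h, d, m, X, and Y are notational assumptions introduced here for concreteness and are not fixed by the abstract itself), an element of that space is a triple

\[
\omega \;=\; (f,\, h,\, d), \qquad
f : X \to Y, \quad
h : X \to Y, \quad
d \;=\; \{(x_i, y_i)\}_{i=1}^{m},
\]

where f is the target function, h is the hypothesis function produced by the generalizer, and d is a training set of m input-output pairs. Working with events at this level allows probability statements to be made jointly over targets, hypotheses, and training sets, rather than over hypotheses alone.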