The use of Mercer's theorem for interpreting kernels as inner products in a feature space was introduced into machine learning in 1964 by the work of Aizerman, Braverman and Rozonoer on the method of potential functions, but its possibilities were not fully understood until it was used in the article by Boser, Guyon and Vapnik that introduced the Support Vector method.
The theory of kernels is, however, older: Mercer's theorem dates back to 1909, and the study of reproducing kernel Hilbert spaces was developed by Aronszajn in the 1940s. This theory was used in approximation and regularisation theory; see, for example, the book of Wahba and her 1999 survey. The first use of polynomial kernels was by Poggio in 1975. Reproducing kernels were used extensively in machine learning and neural networks by Poggio and Girosi from the early 1990s.
The theory of positive definite functions was also developed in the context of covariance and correlation functions, so that work on Gaussian processes is closely related. In fact, that literature builds on older results from the theory of reproducing kernel Hilbert spaces.
Techniques for `making kernels' can be found in many papers, for example those of Micchelli, MacKay, Evgeniou et al., Schölkopf et al., Haussler, and Watkins. The discussion of RKHSs draws on the paper of Haussler, while Example 3.1.5 is based on Watkins' paper. The one-dimensional shift-invariant kernel of Example 3.9 is taken from Girosi's paper. The differential geometric description of the feature space has been provided by Burges, along with some necessary conditions for a kernel to satisfy Mercer's theorem.
Building on an observation of Schölkopf, Watkins and Haussler have greatly extended the use of kernels, showing that they can in fact be defined on general sets, which do not need to be Euclidean spaces, paving the way for their use in a swathe of new real-world applications, on input spaces as diverse as biological sequences, text, and images. These kernels generalise the idea of recursive ANOVA kernels described in Vapnik's book.
Joachims and Dumais et al. used sparse vectors to encode text features. Jaakkola and Haussler proposed using a hidden Markov model to evaluate a kernel between biosequences, where the feature vector is the Fisher score of the distribution. This is described in more detail in Chapter 7. Watkins proposed using probabilistic context-free grammars to build kernels between sequences. Haussler also proposed the use of special kernels for biosequences.
An interesting direction of research is to learn the kernel directly from the data. The papers of Jaakkola et al. and Cristianini et al. deal with this issue. An interesting paper by Amari and Wu describes a method for acting directly on the kernel in order to affect the geometry of the input space, so that the separability of the data is improved.
The use of kernels as a general technique for using linear machines in a non-linear fashion can be exported to other learning systems, such as nearest neighbour (using the techniques of Section 3.4) or, less trivially, to PCA, as demonstrated by Schölkopf, Smola and Müller. A technique that uses a specific kernel in order to deal with noisy and non-separable data was introduced by Shawe-Taylor and Cristianini.
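The export to nearest neighbour works because distances in the feature space can be computed from kernel evaluations alone, without ever forming the feature map explicitly. The following is a minimal sketch of this idea (the function names and the choice of Gaussian kernel are illustrative, not taken from the text):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel; any Mercer kernel could be substituted here."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def feature_space_distance_sq(x, z, kernel):
    """Squared distance between phi(x) and phi(z), using kernels only:
    ||phi(x) - phi(z)||^2 = K(x, x) - 2 K(x, z) + K(z, z)."""
    return kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)

def kernel_nearest_neighbour(X_train, y_train, x, kernel):
    """Label x by its nearest training point, measured in feature space."""
    dists = [feature_space_distance_sq(xi, x, kernel) for xi in X_train]
    return y_train[int(np.argmin(dists))]
```

With a non-linear kernel, the nearest-neighbour rule is applied in the induced feature space rather than the input space; the same expansion of the squared norm underlies kernel PCA, where inner products between centred feature vectors are likewise expressed through the kernel matrix.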