Support Vector Machines are a very specific class of algorithms, characterised by the use of kernels, the absence of local minima, the sparseness of the solution, and the capacity control obtained by acting on the margin or on other 'dimension independent' quantities such as the number of support vectors. They were invented by Vladimir Vapnik and his co-workers, and first introduced in a paper at the COLT 1992 conference. All of these features, however, were already present and had been used in machine learning since the 1960s: large margin hyperplanes in the input space were discussed, for example, by Duda and Hart, Cover, Vapnik et al., and several statistical mechanics papers (for example Anlauf and Biehl); the use of kernels was proposed by Aronszajn, Wahba, Poggio, and others, but it was the paper by Aizerman et al. in 1964 that introduced the geometrical interpretation of kernels as inner products in a feature space. Similar optimisation techniques were used in pattern recognition by Mangasarian, and sparseness had also been discussed before. See also Hassoun for related early work. The use of slack variables to overcome the problems of noise and non-separability was likewise introduced in the 1960s by Smith and improved by Bennett and Mangasarian. However, it was not until 1992 that all of these features were put together to form the maximal margin classifier, the basic Support Vector Machine, and not until 1995 that the soft margin version was introduced by Cortes and Vapnik: it is surprising how naturally and elegantly all the pieces fit together and complement each other. The papers by Shawe-Taylor et al. and Bartlett gave the first rigorous statistical bounds on the generalisation of hard margin SVMs, while the paper by Shawe-Taylor and Cristianini gives similar bounds for the soft margin algorithms and for the regression case.
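The role of the slack variables mentioned above can be made concrete with a minimal sketch: a linear soft margin classifier trained by subgradient descent on the hinge loss. This is a modern illustrative formulation, not the original quadratic programming approach of Cortes and Vapnik; the toy data, learning rate, and value of C are arbitrary choices made for the example.

```python
import numpy as np

def train_soft_margin(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimise 0.5*||w||^2 + C * sum of hinge losses by subgradient descent.

    The slack variables xi_i = max(0, 1 - y_i*(w.x_i + b)) appear implicitly
    as the hinge loss; C trades margin width against margin violations.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1          # points inside the margin (xi_i > 0)
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy linearly separable data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_soft_margin(X, y)
preds = np.sign(X @ w + b)
```

On separable data such as this, the hinge term vanishes once all points clear the margin and only the regularisation term 0.5*||w||^2 remains, recovering the maximal margin behaviour; on noisy data the same objective simply pays a penalty of C per unit of slack.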
After their introduction, an increasing number of researchers worked on both the algorithmic and theoretical analysis of these systems, creating in just a few years what is effectively a new research direction in its own right, merging concepts from disciplines as distant as statistics, functional analysis, and optimisation, as well as machine learning. As noted above, the soft margin classifier was introduced by Cortes and Vapnik a few years after the maximal margin algorithm, and in 1995 the method was extended to the regression case.
The two recent books by Vapnik provide an extensive theoretical background for the field and develop the concept of a Support Vector Machine.
Since most recent advances in kernels, learning theory, and implementation are discussed at the ends of the relevant chapters, we only give a brief overview of some of the improvements recently obtained from the point of view of the overall algorithm. Work has been done on generalisations of the method, and extensions to the multiclass case (Weston and Watkins; Platt, Cristianini and Shawe-Taylor; Vapnik). More general regression scenarios have also been studied: Smola, Schoelkopf and Mueller discussed a very general class of cost functions, and the use of ridge regression in feature space was considered by Saunders et al. and by Smola and Schoelkopf.
Ridge regression is a special case of regularisation networks. The concept of regularisation was introduced by Tikhonov, and applied to learning in the form of regularisation networks by Girosi, Jones and Poggio. The relation between regularisation networks and Support Vector Machines has been explored by a number of authors (Girosi; Wahba; Smola, Schoelkopf and Mueller; see also Evgeniou, Pontil and Poggio).
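Ridge regression in feature space admits a compact dual solution that a short sketch can make concrete: the coefficients solve the linear system (K + λI)α = y, where K is the kernel matrix. The Gaussian kernel, regularisation parameter, and toy data below are illustrative assumptions, not taken from Saunders et al.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=1e-3, gamma=1.0):
    # Dual solution: alpha = (K + lam*I)^{-1} y
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    # f(x) = sum_i alpha_i k(x_i, x): a regularisation network solution
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# toy one-dimensional regression problem
X = np.linspace(0, np.pi, 20)[:, None]
y = np.sin(X).ravel()
alpha = kernel_ridge_fit(X, y)
y_hat = kernel_ridge_predict(X, alpha, X)
```

Setting λ = 0 recovers exact kernel interpolation, while larger λ smooths the solution: this is the Tikhonov regularisation trade-off that regularisation networks, and in a different form Support Vector Machines, both embody.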
The ν Support Vector algorithm for classification and regression described in Remark 6.13 was introduced and developed by Smola, Schoelkopf, Williamson and Bartlett. Other adaptations of the basic approach have been used for density estimation, transduction, Bayes point estimation, and ordinal regression. Rifkin et al. show that for some degenerate training sets the soft margin gives a trivial solution.
Theoretical advances that do not fit within the framework of Chapter 4 include: the analysis of generalisation given in the article by Dietrich, Opper and Sompolinsky, which provides a statistical mechanics analysis of SVMs; the papers by Jaakkola and Haussler, Vapnik and Chapelle, Weston and Herbrich, Wahba, Lin and Zhang, and Opper and Winther, which provide a cross-validation analysis of the expected error; and the book by Vapnik, which gives an expected error bound in terms of the margin and the radius of the smallest ball containing the essential support vectors.
Extensions of the SVM concept have been made by several authors, for example Mangasarian's generalised SVMs. Particularly interesting is the development of a system, called the Bayes Point Machine, that enforces another inductive principle, given by Bayesian generalisation theory. Although it loses the feature of sparseness, this system exhibits excellent performance, and illustrates the important point that this class of algorithms is not limited to the use of margin bounds. Another example of a system similar to SVMs that does not enforce a margin bound is given by Gaussian processes.
More recently, a number of practical applications of SVMs have been reported, in fields as diverse as bioinformatics, computational linguistics and computer vision (some of which are reported in Chapter 8). Many of these recent advances are reported in the collections of Schoelkopf et al. and Smola et al., by Burges, by Smola and Schoelkopf, and by Evgeniou, Pontil and Poggio.
Gaussian processes are surveyed in the paper by Williams, and in the dissertations of Rasmussen and Gibbs. Extensions of Gaussian processes to the classification case have also been undertaken, but fall beyond the scope of this book.