The theory of linear discriminants dates back to the 1930s, when Fisher proposed a procedure for classification. In the field of artificial intelligence attention was drawn to this problem by the work of Frank Rosenblatt, who starting from 1956 introduced the perceptron learning rule. Minsky and Papert's famous book Perceptrons analysed the computational limitations of linear learning machines. The famous book by Duda and Hart provides a complete survey of the state of the art up to 1973. For more recent results, see also Bishop's book on neural networks which includes a description of a class of generalised learning machine.
The extension of Novikoff to the non-separable case is due to Freund and Schapire; the pocket algorithm was proposed by Gallant; a simple description of Fisher's discriminant is in Duda and Hart. For discussions of the computational complexity of learning linear separations in the non-separable case, see Hoeffgen et al. and Arora et al. The idea of a maximal margin hyperplane has been rediscovered several times. It is discussed by Vapnik and Lerner, by Duda and Hart, and an iterative algorithm known as Adatron for learning maximal margin hyperplanes was investigated in the statistical mechanics literature by Anlauf and Biehl, and will be further discussed in Chapter 7.
The problem of linear regression is much older than the classification one. Least squares linear interpolation was first used by Gauss in the 18th century for astronomical problems. The ridge regression algorithm was published by Hoerl and Kennard, and subsequently discovered to be a special case of the regularisation theory of Tikhonov for the solution of ill-posed problems. The dual form of ridge regression including the derivations of Subsection 2.2.2 was studied by Saunders et al. and Smola et al. An equivalent heuristic was widely used in the neural networks literature under the name of weight decay. The Widrow-Hoff algorithm is described here.
Finally note that the representation of linear machines in the dual form, using the training data, is intimately related to the optimisation technique of Lagrange multipliers, and will be further discussed in Chapter 5. Guyon and Stork compare the primal and dual representation of linear learning machines in an analogous way to that adopted in this chapter.