{"title": "Regularisation in Sequential Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 458, "page_last": 464, "abstract": null, "full_text": "Regularisation in Sequential Learning Algorithms\n\nJoao FG de Freitas, Cambridge University Engineering Department, Cambridge CB2 1PZ England, jfgf@eng.cam.ac.uk [Corresponding author]\n\nMahesan Niranjan, Cambridge University Engineering Department, Cambridge CB2 1PZ England, niranjan@eng.cam.ac.uk\n\nAndrew H Gee, Cambridge University Engineering Department, Cambridge CB2 1PZ England, ahg@eng.cam.ac.uk\n\nAbstract\n\nIn this paper, we discuss regularisation in online/sequential learning algorithms. In environments where data arrive sequentially, techniques such as cross-validation for regularisation or model selection are not possible. Further, bootstrapping to determine a confidence level is not practical. To surmount these problems, a minimum variance estimation approach that makes use of the extended Kalman algorithm for training multi-layer perceptrons is employed. The novel contribution of this paper is to show the theoretical links between extended Kalman filtering, Sutton's variable learning rate algorithms and Mackay's Bayesian estimation framework. In doing so, we propose algorithms that overcome the need for heuristic choices of the initial conditions and noise covariance matrices in the Kalman approach.\n\n1 INTRODUCTION\n\nModel estimation involves building mathematical representations of physical processes using measured data. This problem is often referred to as system identification, time-series modelling or machine learning. On many occasions, the system being modelled varies with time. 
Under this circumstance, the estimator needs to be updated sequentially. Online or sequential learning has many applications in tracking and surveillance, control systems, fault detection, communications, econometric systems, operations research, navigation and other areas where data sequences are often non-stationary and difficult to obtain before the actual estimation process.\n\nTo achieve acceptable generalisation, the complexity of the estimator needs to be judiciously controlled. Although there are various reliable schemes for controlling model complexity when training en bloc (batch processing), the same cannot be said about sequential learning. Conventional regularisation techniques cannot be applied, simply because there are no data to cross-validate. Consequently, there is ample scope for the design of sequential methods of controlling model complexity.\n\n2 NONLINEAR ESTIMATION\n\nA dynamical system may be described by the following discrete, stochastic state space representation:\n\nw_{k+1} = w_k + d_k    (1)\ny_k = g(w_k, u_k) + v_k    (2)\n\nwhere it has been assumed that the model parameters (w_k \\in R^q) evolve as a random walk driven by the process noise d_k, and the measurements y_k are a nonlinear function g of the parameters and the input u_k, corrupted by the measurement noise v_k. [...] where the \\lambda_i correspond to the eigenvalues of the Hessian of the error function without the regularisation term.\n\nInstead of adopting Mackay's evidence framework, it is possible to maximise the posterior density function by performing the integrations over the hyper-parameters analytically (Buntine and Weigend 1991, Mackay 1994). The latter approach is known as the MAP framework for \\alpha and \\beta. The hyper-parameters computed by the MAP framework differ from the ones computed by the evidence framework in that the former makes use of the total number of parameters and not only the effective number of parameters. That is, \\alpha and \\beta are updated according to:\n\n\\alpha_{k+1} = q / \\sum_{i=1}^{q} w_i^2   and   \\beta_{k+1} = n / \\sum_{k=1}^{n} (y_k - g(w_k, u_k))^2
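To make the state-space formulation of equations (1) and (2) concrete, the following is a minimal sketch of extended Kalman filter training of a one-hidden-layer MLP. It is not the authors' implementation: the network size, the noise variances `q_noise` and `r_noise`, and the initial covariance `p0` are illustrative assumptions (the paper's point is precisely that such heuristic choices can be avoided), and the Jacobian of g is computed numerically rather than by backpropagation.

```python
import numpy as np

def mlp_output(w, u, n_hidden):
    """g(w, u): one-hidden-layer MLP with tanh units and a linear scalar output."""
    n_in = u.shape[0]
    W1 = w[: n_hidden * n_in].reshape(n_hidden, n_in)
    b1 = w[n_hidden * n_in : n_hidden * (n_in + 1)]
    W2 = w[n_hidden * (n_in + 1) : n_hidden * (n_in + 2)]
    b2 = w[-1]
    h = np.tanh(W1 @ u + b1)
    return W2 @ h + b2

def jacobian(w, u, n_hidden, eps=1e-6):
    """Numerical Jacobian (row vector) of g with respect to the weights."""
    g0 = mlp_output(w, u, n_hidden)
    J = np.zeros_like(w)
    for i in range(w.size):
        wp = w.copy()
        wp[i] += eps
        J[i] = (mlp_output(wp, u, n_hidden) - g0) / eps
    return J

def ekf_train(us, ys, n_hidden=4, q_noise=1e-4, r_noise=0.1, p0=1.0, seed=0):
    """One EKF update of the weights per data pair (u_k, y_k), following
    equations (1)-(2): random-walk weights, noisy nonlinear observations."""
    rng = np.random.default_rng(seed)
    n_in = us.shape[1]
    q = n_hidden * (n_in + 2) + 1            # total number of weights
    w = 0.1 * rng.standard_normal(q)
    P = p0 * np.eye(q)                       # weight covariance (heuristic init)
    Q = q_noise * np.eye(q)                  # covariance of the process noise d_k
    for u, y in zip(us, ys):
        P = P + Q                            # time update: w_{k+1} = w_k + d_k
        H = jacobian(w, u, n_hidden)         # linearise g around the current w
        S = H @ P @ H + r_noise              # innovation variance (scalar output)
        K = (P @ H) / S                      # Kalman gain
        w = w + K * (y - mlp_output(w, u, n_hidden))   # measurement update
        P = P - np.outer(K, H @ P)
    return w, P

# Usage: learn y = sin(u) from a sequential stream of noisy observations.
rng = np.random.default_rng(1)
us = rng.uniform(-2, 2, size=(400, 1))
ys = np.sin(us[:, 0]) + 0.05 * rng.standard_normal(400)
w, P = ekf_train(us, ys)
err = abs(mlp_output(w, np.array([1.0]), 4) - np.sin(1.0))
```

Note how the diagonal of P acts as a per-weight, adaptive step size, which is the connection to Sutton's variable learning rate algorithms mentioned in the abstract; the fixed choices of `p0`, `Q` and `r_noise` above are exactly the heuristics the paper seeks to eliminate.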