
Journal of Artificial Intelligence Research 12 (2000) 149–198. Submitted 11/99; published 3/00.

A Model of Inductive Bias Learning

Jonathan Baxter (JONATHAN.BAXTER@ANU.EDU.AU)
Research School of Information Sciences and Engineering, Australian National University, Canberra 0200, Australia

Abstract

A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is supplied by hand through the skill and insights of experts. In this paper a model for automatically learning bias is investigated. The central assumption of the model is that the learner is embedded within an environment of related learning tasks. Within such an environment the learner can sample from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to many of the problems in the environment. Under certain restrictions on the set of all hypothesis spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. Explicit bounds are also derived demonstrating that learning multiple tasks within an environment of related tasks can potentially give much better generalization than learning a single task.

1. Introduction

Often the hardest problem in any machine learning task is the initial choice of hypothesis space; it has to be large enough to contain a solution to the problem at hand, yet small enough to ensure good generalization from a small number of examples (Mitchell, 1991). Once a suitable bias has been found, the actual learning task is often straightforward. Existing methods of bias generally require the input of a human expert in the form of heuristics and domain knowledge (for example, through the selection of an appropriate set of features). Despite their successes, such methods are clearly limited by the accuracy and reliability of the expert's knowledge and also by the extent to which that knowledge can be transferred to the learner. Thus it is natural to search for methods for automatically learning the bias.

In this paper we introduce and analyze a formal model of bias learning that builds upon the PAC model of machine learning and its variants (Vapnik, 1982; Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1989; Haussler, 1992). These models typically take the following general form: the learner is supplied with a hypothesis space $\mathcal{H}$ and training data $z = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn independently according to some underlying distribution $P$ on $X \times Y$. Based on the information contained in $z$, the learner's goal is to select a hypothesis $h: X \to Y$ from $\mathcal{H}$ minimizing some measure $\mathrm{er}_P(h)$ of expected loss with respect to $P$ (for example, in the case of squared loss $\mathrm{er}_P(h) := \mathbb{E}_{(x,y) \sim P}\,(h(x) - y)^2$). In such models the learner's bias is represented by the choice of $\mathcal{H}$; if $\mathcal{H}$ does not contain a good solution to the problem, then, regardless of how much data the learner receives, it cannot learn.

Of course, the best way to bias the learner is to supply it with an $\mathcal{H}$ containing just a single optimal hypothesis. But finding such a hypothesis is precisely the original learning problem, so in the
PAC model there is no distinction between bias learning and ordinary learning. Or, put differently, the PAC model does not model the process of inductive bias; it simply takes the hypothesis space $\mathcal{H}$ as given and proceeds from there.

To overcome this problem, in this paper we assume that instead of being faced with just a single learning task, the learner is embedded within an environment of related learning tasks. The learner is supplied with a family of hypothesis spaces $\mathbb{H} = \{\mathcal{H}\}$, and its goal is to find a bias (i.e. a hypothesis space $\mathcal{H} \in \mathbb{H}$) that is appropriate for the entire environment. A simple example is the problem of handwritten character recognition. A preprocessing stage that identifies and removes any (small) rotations, dilations and translations of an image of a character will be advantageous for recognizing all characters. If the set of all individual character recognition problems is viewed as an environment of learning problems (that is, the set of all problems of the form "distinguish 'A' from all other characters", "distinguish 'B' from all other characters", and so on), this preprocessor represents a bias that is appropriate for all problems in the environment. It is likely that there are many other currently unknown biases that are also appropriate for this environment. We would like to be able to learn these automatically.

There are many other examples of learning problems that can be viewed as belonging to environments of related problems. For example, each individual face recognition problem belongs to an (essentially infinite) set of related learning problems (all the other individual face recognition problems); the set of all individual spoken word recognition problems forms another large environment, as does the set of all fingerprint recognition problems, printed Chinese and Japanese character recognition problems, stock price prediction problems and so on. Even medical diagnostic and prognostic problems, where a multitude of diseases are predicted from the same pathology tests, constitute an environment of related learning problems.

In many cases these "environments" are not normally modeled as such; instead they are treated as single, multiple-category learning problems. For example, recognizing a group of faces would normally be viewed as a single learning problem with multiple class labels (one for each face in the group), not as multiple individual learning problems. However, if a reliable classifier for each individual face in the group can be constructed then they can easily be combined to produce a classifier for the whole group. Furthermore, by viewing the faces as an environment of related learning problems, the results presented here show that bias can be learnt that will be good for learning novel faces, a claim that cannot be made for the traditional approach. This point goes to the heart of our model: we are not concerned with adjusting a learner's bias so it performs better on some fixed set of learning problems. Such a process is in fact just ordinary learning but with a richer hypothesis space in which some components labelled "bias" are also able to be varied. Instead, we suppose the learner is faced with a (potentially infinite) stream of tasks, and that by adjusting its bias on some subset of the tasks it improves its learning performance on future, as yet unseen tasks. Bias that is appropriate for all problems in an environment must be learnt by sampling from many tasks.
If only a single task is learnt then the bias extracted is likely to be specific to that task.

In the rest of this paper, a general theory of bias learning is developed based upon the idea of learning multiple related tasks. Loosely speaking (formal results are stated in Section 2), there are two main conclusions of the theory presented here:

* Learning multiple related tasks reduces the sampling burden required for good generalization, at least on a number-of-examples-required-per-task basis.

* Bias that is learnt on sufficiently many training tasks is likely to be good for learning novel tasks drawn from the same environment.

The second point shows that a form of meta-generalization is possible in bias learning. Ordinarily, we say a learner generalizes well if, after seeing sufficiently many training examples, it produces a hypothesis that with high probability will perform well on future examples of the same task. However, a bias learner generalizes well if, after seeing sufficiently many training tasks, it produces a hypothesis space that with high probability contains good solutions to novel tasks. Another term that has been used for this process is Learning to Learn (Thrun & Pratt, 1997).

Our main theorems are stated in an agnostic setting (that is, $\mathbb{H}$ does not necessarily contain a hypothesis space with solutions to all the problems in the environment), but we also give improved bounds in the realizable case. The sample complexity bounds appearing in these results are stated in terms of combinatorial parameters related to the complexity of the set of all hypothesis spaces $\mathbb{H}$ available to the bias learner. For Boolean learning problems (pattern classification) these parameters are the bias learning analogue of the Vapnik-Chervonenkis dimension (Vapnik, 1982; Blumer et al., 1989).

As an application of the general theory, the problem of learning an appropriate set of neural-network features for an environment of related tasks is formulated as a bias learning problem. In the case of continuous neural-network features we are able to prove upper bounds on the number of training tasks and number of examples of each training task required to ensure that a set of features that works well for the training tasks will, with high probability, work well on novel tasks drawn from the same environment. The upper bound on the number of tasks scales as $O(b)$ where $b$ is a measure of the complexity of the possible feature sets available to the learner, while the upper bound on the number of examples of each task scales as $O(a + b/n)$, where $O(a)$ is the number of examples required to learn a task if the "true" set of features (that is, the correct bias) is already known, and $n$ is the number of tasks. Thus, in this case we see that as the number of related tasks learnt increases, the number of examples required of each task for good generalization decays to the minimum possible. For Boolean neural-network feature maps we are able to show a matching lower bound on the number of examples required per task of the same form.

1.1 Related Work

There is a large body of previous algorithmic and experimental work in the machine learning and statistics literature addressing the problems of inductive bias learning and improving generalization through multiple task learning. Some of these approaches can be seen as special cases of, or at least closely aligned with, the model described here, while others are more orthogonal. Without being completely exhaustive, in this section we present an overview of the main contributions.
See Thrun and Pratt (1997, chapter 1) for a more comprehensive treatment.

* Hierarchical Bayes. The earliest approaches to bias learning come from Hierarchical Bayesian methods in statistics (Berger, 1985; Good, 1980; Gelman, Carlin, Stern, & Rubin, 1995). In contrast to the Bayesian methodology, the present paper takes an essentially empirical process approach to modeling the problem of bias learning. However, a model using a mixture of hierarchical Bayesian and information-theoretic ideas was presented in Baxter (1997a), with similar conclusions to those found here. An empirical study showing the utility of the hierarchical Bayes approach in a domain containing a large number of related tasks was given in Heskes (1998).

* Early machine learning work. In Rendell, Seshu, and Tcheng (1987), "VBMS" or Variable Bias Management System was introduced as a mechanism for selecting amongst different learning algorithms when tackling a new learning problem. "STABB" or Shift To a Better Bias (Utgoff, 1986) was another early scheme for adjusting bias, but unlike VBMS, STABB was not primarily focussed on searching for bias applicable to large problem domains. Our use of an "environment of related tasks" in this paper may also be interpreted as an "environment of analogous tasks" in the sense that conclusions about one task can be arrived at by analogy with (sufficiently many of) the other tasks. For an early discussion of analogy in this context, see Russell (1989, §4.3), in particular the observation that for analogous problems the sampling burden per task can be reduced.

* Metric-based approaches. The metric used in nearest-neighbour classification, and in vector quantization to determine the nearest code-book vector, represents a form of inductive bias. Using the model of the present paper, and under some extra assumptions on the tasks in the environment (specifically, that their marginal input-space distributions are identical and they only differ in the conditional probabilities they assign to class labels), it can be shown that there is an optimal metric or distance measure to use for vector quantization and one-nearest-neighbour classification (Baxter, 1995a, 1997b; Baxter & Bartlett, 1998). This metric can be learnt by sampling from a subset of tasks from the environment, and then used as a distance measure when learning novel tasks drawn from the same environment. Bounds on the number of tasks and examples of each task required to ensure good performance on novel tasks were given in Baxter and Bartlett (1998), along with an experiment in which a metric was successfully trained on examples of a subset of 400 Japanese characters and then used as a fixed distance measure when learning 2600 as yet unseen characters. A similar approach is described in Thrun and Mitchell (1995) and Thrun (1996), in which a neural network's output was trained to match labels on a novel task, while simultaneously being forced to match its gradient to derivative information generated from a distance metric trained on previous, related tasks. Performance on the novel tasks improved substantially with the use of the derivative information. Note that there are many other adaptive metric techniques used in machine learning, but these all focus exclusively on adjusting the metric for a fixed set of problems rather than learning a metric suitable for learning novel, related tasks (bias learning).
* Feature learning or learning internal representations. As with adaptive metric techniques, there are many approaches to feature learning that focus on adapting features for a fixed task rather than learning features to be used in novel tasks. One of the few cases where features have been learnt on a subset of tasks with the explicit aim of using them on novel tasks was Intrator and Edelman (1996), in which a low-dimensional representation was learnt for a set of multiple related image-recognition tasks and then used to successfully learn novel tasks of the same kind. The experiments reported in Baxter (1995a, chapter 4), Baxter (1995b) and Baxter and Bartlett (1998) are also of this nature.

* Bias learning in Inductive Logic Programming (ILP). Predicate invention refers to the process in ILP whereby new predicates thought to be useful for the classification task at hand are added to the learner's domain knowledge. By using the new predicates as background domain knowledge when learning novel tasks, predicate invention may be viewed as a form of inductive bias learning. Preliminary results with this approach on a chess domain are reported in Khan, Muggleton, and Parson (1998).

* Improving performance on a fixed reference task. "Multi-task learning" (Caruana, 1997) trains extra neural network outputs to match related tasks in order to improve generalization performance on a fixed reference task. Although this approach does not explicitly identify the extra bias generated by the related tasks in a way that can be used to learn novel tasks, it is an example of exploiting the bias provided by a set of related tasks to improve generalization performance. Other similar approaches include Suddarth and Kergosien (1990), Suddarth and Holden (1991), and Abu-Mostafa (1993).

* Bias as computational complexity. In this paper we consider inductive bias from a sample-complexity perspective: how does the learnt bias decrease the number of examples required of novel tasks for good generalization? A natural alternative line of enquiry is how the running time or computational complexity of a learning algorithm may be improved by training on related tasks. Some early algorithms for neural networks in this vein are contained in Sharkey and Sharkey (1993) and Pratt (1992).

* Reinforcement Learning. Many control tasks can appropriately be viewed as elements of sets of related tasks, such as learning to navigate to different goal states, or learning a set of complex motor control tasks. A number of papers in the reinforcement learning literature have proposed algorithms both for sharing the information in related tasks to improve average generalization performance across those tasks (Singh, 1992; Ring, 1995), and for learning bias from a set of tasks to improve performance on future tasks (Sutton, 1992; Thrun & Schwartz, 1995).

1.2 Overview of the Paper

In Section 2 the bias learning model is formally defined, and the main sample complexity results are given showing the utility of learning multiple related tasks and the feasibility of bias learning.
These results show that the sample complexity is controlled by the size of certain covering numbers associated with the set of all hypothesis spaces available to the bias learner, in much the same way as the sample complexity in learning Boolean functions is controlled by the Vapnik-Chervonenkis dimension (Vapnik, 1982; Blumer et al., 1989). The results of Section 2 are upper bounds on the sample complexity required for good generalization when learning multiple tasks and learning inductive bias. The general results of Section 2 are specialized to the case of feature learning with neural networks in Section 3, where an algorithm for training features by gradient descent is also presented. For this special case we are able to show matching lower bounds for the sample complexity of multiple task learning. In Section 4 we present some concluding remarks and directions for future research. Many of the proofs are quite lengthy and have been moved to the appendices so as not to interrupt the flow of the main text.

The following glossary lists the mathematical symbols used in the paper.

$X$: Input space
$Y$: Output space
$P$: Distribution on $X \times Y$ (learning task)
$l$: Loss function
$\mathcal{H}$: Hypothesis space
$h$: Hypothesis
$\mathrm{er}_P(h)$: Error of hypothesis $h$ on distribution $P$
$z$: Training set
$A$: Learning algorithm
$\hat{\mathrm{er}}_z(h)$: Empirical error of $h$ on training set $z$
$\mathcal{P}$: Set of all learning tasks $P$
$Q$: Distribution over learning tasks
$\mathbb{H}$: Family of hypothesis spaces
$\mathrm{er}_Q(\mathcal{H})$: Loss of hypothesis space $\mathcal{H}$ on environment $Q$
$\mathbf{z}$: $(n, m)$-sample
$\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$: Empirical loss of $\mathcal{H}$ on $\mathbf{z}$
$\mathbb{A}$: Bias learning algorithm
$h_l$: Function induced by $h$ and $l$
$\mathcal{H}_l$: Set of $h_l$
$(h_1, \ldots, h_n)_l$: Average of $h_{1,l}, \ldots, h_{n,l}$
$\mathbf{h}_l$: Same as $(h_1, \ldots, h_n)_l$
$\mathcal{H}^n_l$: Set of $(h_1, \ldots, h_n)_l$
$\mathbb{H}^n_l$: Set of $\mathcal{H}^n_l$
$\mathcal{H}^*$: Function on probability distributions
$\mathbb{H}^*$: Set of $\mathcal{H}^*$
$d_{\mathbf{P}}$: Pseudo-metric on $\mathbb{H}^n_l$
$d_Q$: Pseudo-metric on $\mathbb{H}^*$
$\mathcal{N}(\varepsilon, \mathbb{H}^*, d_Q)$: Covering number of $\mathbb{H}^*$
$\mathcal{C}(\varepsilon, \mathbb{H}^*)$: Capacity of $\mathbb{H}^*$
$\mathcal{N}(\varepsilon, \mathbb{H}^n_l, d_{\mathbf{P}})$: Covering number of $\mathbb{H}^n_l$
$\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$: Capacity of $\mathbb{H}^n_l$
$\mathbf{h}$: Sequence of $n$ hypotheses $(h_1, \ldots, h_n)$
$\mathbf{P}$: Sequence of $n$ distributions $(P_1, \ldots, P_n)$
$\mathrm{er}_{\mathbf{P}}(\mathbf{h})$: Average loss of $\mathbf{h}$ on $\mathbf{P}$
$\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h})$: Average loss of $\mathbf{h}$ on $\mathbf{z}$
$\mathcal{F}$: Set of feature maps
$\mathcal{G}$: Output class, composed with feature maps $f$
$\mathcal{G} \circ f$: Hypothesis space associated with $f$
$\mathcal{G}_l$: Loss function class associated with $\mathcal{G}$
$\mathcal{N}(\varepsilon, \mathcal{G}_l, d_P)$: Covering number of $\mathcal{G}_l$
$\mathcal{C}(\varepsilon, \mathcal{G}_l)$: Capacity of $\mathcal{G}_l$
$d_{[P, \mathcal{G}_l]}(f, f')$: Pseudo-metric on feature maps $f, f'$
$\mathcal{N}(\varepsilon, \mathcal{F}, d_{[P, \mathcal{G}_l]})$: Covering number of $\mathcal{F}$
$\mathcal{C}_{\mathcal{G}_l}(\varepsilon, \mathcal{F})$: Capacity of $\mathcal{F}$
$\mathcal{H}_w$: Neural network hypothesis space
$\mathcal{H}_{|x}$: $\mathcal{H}$ restricted to vector $x$
$\Pi_{\mathcal{H}}(m)$: Growth function of $\mathcal{H}$
$\mathrm{VCdim}(\mathcal{H})$: Vapnik-Chervonenkis dimension of $\mathcal{H}$
$\mathcal{H}_{|\mathbf{x}}$: $\mathcal{H}$ restricted to matrix $\mathbf{x}$
$\mathbb{H}_{|\mathbf{x}}$: $\mathbb{H}$ restricted to matrix $\mathbf{x}$
$\Pi_{\mathbb{H}}(n, m)$: Growth function of $\mathbb{H}$
$d_{\mathbb{H}}(n)$: Dimension function of $\mathbb{H}$
$\overline{d}(\mathbb{H})$: Upper dimension function of $\mathbb{H}$
$\underline{d}(\mathbb{H})$: Lower dimension function of $\mathbb{H}$
$\mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n)$: Optimal performance of $\mathbb{H}^n$ on $\mathbf{P}$
$d_\nu$: Metric on $\mathbb{R}^+$
$h_1 \oplus \cdots \oplus h_n$: Average of $h_1, \ldots, h_n$
$\mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_n$: Set of $h_1 \oplus \cdots \oplus h_n$
(symbol introduced in Appendix A): Permutations on integer pairs
$\mathbf{z}^\sigma$: Permuted $\mathbf{z}$
$d_{\mathbf{z}}(h, h')$: Empirical $\ell_1$ metric on functions $h$
$\hat{\mathrm{er}}_{\mathbf{P}}(\mathcal{H})$: Optimal average error of $\mathcal{H}$ on $\mathbf{P}$
2. The Bias Learning Model

In this section the bias learning model is formally introduced. To motivate the definitions, we first describe the main features of ordinary (single-task) supervised learning models.

2.1 Single-Task Learning

Computational learning theory models of supervised learning usually include the following ingredients:

* an input space $X$ and an output space $Y$,
* a probability distribution $P$ on $X \times Y$,
* a loss function $l: Y \times Y \to \mathbb{R}$, and
* a hypothesis space $\mathcal{H}$ which is a set of hypotheses or functions $h: X \to Y$.

As an example, if the problem is to learn to recognize images of Mary's face using a neural network, then $X$ would be the set of all images (typically represented as a subset of $\mathbb{R}^d$ where each component is a pixel intensity), $Y$ would be the set $\{0, 1\}$, and the distribution $P$ would be peaked over images of different faces and the correct class labels. The learner's hypothesis space $\mathcal{H}$ would be a class of neural networks mapping the input space $\mathbb{R}^d$ to $\{0, 1\}$. The loss in this case would be discrete loss:

$$l(y, y') := \begin{cases} 1 & \text{if } y \neq y' \\ 0 & \text{if } y = y' \end{cases} \qquad (1)$$

Using the loss function allows us to present a unified treatment of both pattern recognition ($Y = \{0, 1\}$, $l$ as above) and real-valued function learning (e.g. regression), in which $Y = \mathbb{R}$ and usually $l(y, y') = (y - y')^2$.

The goal of the learner is to select a hypothesis $h \in \mathcal{H}$ with minimum expected loss:

$$\mathrm{er}_P(h) := \int_{X \times Y} l(h(x), y)\, dP(x, y). \qquad (2)$$

Of course, the learner does not know $P$ and so it cannot search through $\mathcal{H}$ for an $h$ minimizing $\mathrm{er}_P(h)$. In practice, the learner samples repeatedly from $X \times Y$ according to the distribution $P$ to generate a training set

$$z := \{(x_1, y_1), \ldots, (x_m, y_m)\}. \qquad (3)$$

Based on the information contained in $z$ the learner produces a hypothesis $h \in \mathcal{H}$. Hence, in general a learner is simply a map $A$ from the set of all training samples to the hypothesis space $\mathcal{H}$:

$$A: \bigcup_{m > 0} (X \times Y)^m \to \mathcal{H}$$

(stochastic learners can be treated by assuming a distribution-valued $A$). Many algorithms seek to minimize the empirical loss of $h$ on $z$, where this is defined by:

$$\hat{\mathrm{er}}_z(h) := \frac{1}{m} \sum_{i=1}^m l(h(x_i), y_i). \qquad (4)$$

Of course, there are more intelligent things to do with the data than simply minimizing empirical error; for example, one can add regularisation terms to avoid over-fitting.

However the learner chooses its hypothesis $h$, if we have a uniform bound (over all $h \in \mathcal{H}$) on the probability of large deviation between $\hat{\mathrm{er}}_z(h)$ and $\mathrm{er}_P(h)$, then we can bound the learner's generalization error $\mathrm{er}_P(h)$ as a function of its empirical loss on the training set $\hat{\mathrm{er}}_z(h)$. Whether such a bound holds depends upon the "richness" of $\mathcal{H}$. The conditions ensuring convergence between $\hat{\mathrm{er}}_z(h)$ and $\mathrm{er}_P(h)$ are by now well understood; for Boolean function learning ($Y = \{0, 1\}$, discrete loss), convergence is controlled by the VC dimension of $\mathcal{H}$ (the VC dimension of a class of Boolean functions $\mathcal{H}$ is the largest integer $d$ such that there exists a subset $S := \{x_1, \ldots, x_d\} \subseteq X$ such that the restriction of $\mathcal{H}$ to $S$ contains all $2^d$ Boolean functions on $S$):

Theorem 1. Let $P$ be any probability distribution on $X \times \{0, 1\}$ and suppose $z = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ is generated by sampling $m$ times from $X \times \{0, 1\}$ according to $P$. Let $d := \mathrm{VCdim}(\mathcal{H})$. Then with probability at least $1 - \delta$ (over the choice of the training set $z$), all $h \in \mathcal{H}$ will satisfy

$$\mathrm{er}_P(h) \le \hat{\mathrm{er}}_z(h) + \left[ \frac{32}{m} \left( d \log \frac{2em}{d} + \log \frac{4}{\delta} \right) \right]^{1/2} \qquad (5)$$

Proofs of this result may be found in Vapnik (1982) and Blumer et al. (1989), and will not be reproduced here.
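As a concrete (if toy) illustration of the learner $A$ and the empirical loss of equation (4), the following Python sketch implements empirical risk minimization over a small finite hypothesis space under the discrete loss of equation (1). The threshold hypotheses, the target rule and the sample size are assumptions made purely for the example; they are not part of the model.

```python
import random

def discrete_loss(y_pred, y_true):
    """Discrete loss of equation (1): 1 on a mistake, 0 otherwise."""
    return 0.0 if y_pred == y_true else 1.0

def empirical_loss(h, z):
    """Empirical loss of equation (4): average loss of hypothesis h on training set z."""
    return sum(discrete_loss(h(x), y) for x, y in z) / len(z)

def erm_learner(z, hypothesis_space):
    """A learner A: training set -> hypothesis, here chosen by empirical risk minimization."""
    return min(hypothesis_space, key=lambda h: empirical_loss(h, z))

# Toy setting: X = [0, 1], Y = {0, 1}; hypotheses are threshold functions on X.
thresholds = [i / 20 for i in range(21)]
hypothesis_space = [lambda x, t=t: 1 if x >= t else 0 for t in thresholds]

random.seed(0)
# Synthetic task: the "true" label is 1 exactly when x >= 0.6 (an arbitrary choice).
z = [(x, 1 if x >= 0.6 else 0) for x in (random.random() for _ in range(50))]

h_hat = erm_learner(z, hypothesis_space)
print("empirical loss of selected hypothesis:", empirical_loss(h_hat, z))
```

Theorem 1 then bounds how far the printed empirical loss can lie below the true error $\mathrm{er}_P(h)$, uniformly over the whole (finite, here) hypothesis space.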
Theorem 1 only provides conditions under which the deviation between $\mathrm{er}_P(h)$ and $\hat{\mathrm{er}}_z(h)$ is likely to be small; it does not guarantee that the true error $\mathrm{er}_P(h)$ will actually be small. This is governed by the choice of $\mathcal{H}$. If $\mathcal{H}$ contains a solution with small error and the learner minimizes error on the training set, then with high probability $\mathrm{er}_P(h)$ will be small. However, a bad choice of $\mathcal{H}$ will mean there is no hope of achieving small error. Thus, the bias of the learner in this model is represented by the choice of hypothesis space $\mathcal{H}$. (The bias is also governed by how the learner uses the hypothesis space; for example, under some circumstances the learner may choose not to use the full power of $\mathcal{H}$, a neural network example being early stopping. For simplicity, in this paper we abstract away from such features of the algorithm $A$ and assume that it uses the entire hypothesis space $\mathcal{H}$.)

2.2 The Bias Learning Model

The main extra assumption of the bias learning model introduced here is that the learner is embedded in an environment of related tasks, and can sample from the environment to generate multiple training sets belonging to multiple different tasks. In the above model of ordinary (single-task) learning, a learning task is represented by a distribution $P$ on $X \times Y$. So in the bias learning model, an environment of learning problems is represented by a pair $(\mathcal{P}, Q)$ where $\mathcal{P}$ is the set of all probability distributions on $X \times Y$ (i.e., $\mathcal{P}$ is the set of all possible learning problems), and $Q$ is a distribution on $\mathcal{P}$. $Q$ controls which learning problems the learner is likely to see. (Strictly, $Q$'s domain is a $\sigma$-algebra of subsets of $\mathcal{P}$. A suitable one for our purposes is the Borel $\sigma$-algebra $\mathcal{B}(\mathcal{P})$ generated by the topology of weak convergence on $\mathcal{P}$. If we assume that $X$ and $Y$ are separable metric spaces, then $\mathcal{P}$ is also a separable metric space in the Prohorov metric, which metrizes the topology of weak convergence (Parthasarathy, 1967), so there is no problem with the existence of measures on $\mathcal{B}(\mathcal{P})$. See Appendix D for further discussion, particularly the proof of part 5 of Lemma 32.) For example, if the learner is in a face recognition environment, $Q$ will be highly peaked over face-recognition-type problems, whereas if the learner is in a character recognition environment $Q$ will be peaked over character-recognition-type problems (here, as in the introduction, we view these environments as sets of individual classification problems, rather than single, multiple-class classification problems).

Recall from the last paragraph of the previous section that the learner's bias is represented by its choice of hypothesis space $\mathcal{H}$. So to enable the learner to learn the bias, we supply it with a family or set of hypothesis spaces $\mathbb{H} := \{\mathcal{H}\}$. Putting all this together, formally a learning to learn or bias learning problem consists of:

* an input space $X$ and an output space $Y$ (both of which are separable metric spaces),
* a loss function $l: Y \times Y \to \mathbb{R}$,
* an environment $(\mathcal{P}, Q)$ where $\mathcal{P}$ is the set of all probability distributions on $X \times Y$ and $Q$ is a distribution on $\mathcal{P}$,
* a hypothesis space family $\mathbb{H} = \{\mathcal{H}\}$ where each $\mathcal{H} \in \mathbb{H}$ is a set of functions $h: X \to Y$.

From now on we will assume the loss function $l$ has range $[0, 1]$; equivalently, with rescaling, we assume that $l$ is bounded.
We define the goal of a bias learner to be to find a hypothesis space $\mathcal{H} \in \mathbb{H}$ minimizing the following loss:

$$\mathrm{er}_Q(\mathcal{H}) := \int_{\mathcal{P}} \inf_{h \in \mathcal{H}} \mathrm{er}_P(h)\, dQ(P) = \int_{\mathcal{P}} \inf_{h \in \mathcal{H}} \int_{X \times Y} l(h(x), y)\, dP(x, y)\, dQ(P). \qquad (6)$$

The only way $\mathrm{er}_Q(\mathcal{H})$ can be small is if, with high $Q$-probability, $\mathcal{H}$ contains a good solution $h$ to any problem $P$ drawn at random according to $Q$. In this sense $\mathrm{er}_Q(\mathcal{H})$ measures how appropriate the bias embodied by $\mathcal{H}$ is for the environment $(\mathcal{P}, Q)$.

In general the learner will not know $Q$, so it will not be able to find an $\mathcal{H}$ minimizing $\mathrm{er}_Q(\mathcal{H})$ directly. However, the learner can sample from the environment in the following way:

* Sample $n$ times from $\mathcal{P}$ according to $Q$ to yield $P_1, \ldots, P_n$.
* Sample $m$ times from $X \times Y$ according to each $P_i$ to yield $z_i = \{(x_{i1}, y_{i1}), \ldots, (x_{im}, y_{im})\}$.
* The resulting $n$ training sets, henceforth called an $(n, m)$-sample if they are generated by the above process, are supplied to the learner.

In the sequel, an $(n, m)$-sample will be denoted by $\mathbf{z}$ and written as a matrix whose $i$th row is $z_i$:

$$\mathbf{z} := \begin{pmatrix} (x_{11}, y_{11}) & \cdots & (x_{1m}, y_{1m}) \\ \vdots & & \vdots \\ (x_{n1}, y_{n1}) & \cdots & (x_{nm}, y_{nm}) \end{pmatrix} \qquad (7)$$

An $(n, m)$-sample is simply $n$ training sets $z_1, \ldots, z_n$ sampled from $n$ different learning tasks $P_1, \ldots, P_n$, where each task is selected according to the environmental probability distribution $Q$. The size of each training set is kept the same primarily to facilitate the analysis.

Based on the information contained in $\mathbf{z}$, the learner must choose a hypothesis space $\mathcal{H} \in \mathbb{H}$. One way to do this would be for the learner to find an $\mathcal{H}$ minimizing the empirical loss on $\mathbf{z}$, where this is defined by:

$$\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) := \frac{1}{n} \sum_{i=1}^n \inf_{h \in \mathcal{H}} \hat{\mathrm{er}}_{z_i}(h) \qquad (8)$$

Note that $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ is simply the average of the best possible empirical error achievable on each training set $z_i$, using a function from $\mathcal{H}$. It is a biased estimate of $\mathrm{er}_Q(\mathcal{H})$. An unbiased estimate of $\mathrm{er}_Q(\mathcal{H})$ would require choosing an $\mathcal{H}$ with minimal average error over the $n$ distributions $P_1, \ldots, P_n$, where this is defined by $\frac{1}{n} \sum_{i=1}^n \inf_{h \in \mathcal{H}} \mathrm{er}_{P_i}(h)$. As with ordinary learning, it is likely there are more intelligent things to do with the training data $\mathbf{z}$ than minimizing (8).

Denoting the set of all $(n, m)$-samples by $(X \times Y)^{(n,m)}$, a general "bias learner" is a map $\mathbb{A}$ that takes $(n, m)$-samples as input and produces hypothesis spaces $\mathcal{H} \in \mathbb{H}$ as output:

$$\mathbb{A}: \bigcup_{n > 0,\; m > 0} (X \times Y)^{(n,m)} \to \mathbb{H}. \qquad (9)$$

(As stated, $\mathbb{A}$ is a deterministic bias learner; however, it is trivial to extend our results to stochastic learners.) Note that in this paper we are concerned only with the sample complexity properties of a bias learner $\mathbb{A}$; we do not discuss issues of the computability of $\mathbb{A}$. Since $\mathbb{A}$ is searching for entire hypothesis spaces $\mathcal{H}$ within a family of such hypothesis spaces $\mathbb{H}$, there is an extra representational question in our model of bias learning that is not present in ordinary learning, and that is how the family $\mathbb{H}$ is represented and searched by $\mathbb{A}$. We defer this discussion until Section 2.5, after the main sample complexity results for this model of bias learning have been introduced. For the specific case of learning a set of features suitable for an environment of related learning problems, see Section 3.

Regardless of how the learner chooses its hypothesis space $\mathcal{H}$, if we have a uniform bound (over all $\mathcal{H} \in \mathbb{H}$) on the probability of large deviation between $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ and $\mathrm{er}_Q(\mathcal{H})$, and we can compute an upper bound on $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$, then we can bound the bias learner's "generalization error" $\mathrm{er}_Q(\mathcal{H})$.
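To make the sampling process and equation (8) concrete, here is a minimal sketch that draws a toy $(n, m)$-sample and computes the empirical loss of two candidate hypothesis spaces. The environment (linear tasks with slopes drawn uniformly), the finite hypothesis spaces, and the use of unclipped squared loss are all assumptions invented for illustration; the formal model assumes a loss with range $[0, 1]$.

```python
import random

def empirical_loss(h, z_i):
    """Empirical error of a single hypothesis on one training set (equation (4), squared loss)."""
    return sum((h(x) - y) ** 2 for x, y in z_i) / len(z_i)

def empirical_loss_of_space(hypothesis_space, nm_sample):
    """Equation (8): average over tasks of the best empirical error achievable
    on each task's training set using a hypothesis from the shared space."""
    return sum(min(empirical_loss(h, z_i) for h in hypothesis_space)
               for z_i in nm_sample) / len(nm_sample)

# Toy environment: each task is "predict a*x" with slope a drawn from Q (uniform on [0.8, 1.2]).
def draw_task(rng):
    a = rng.uniform(0.8, 1.2)
    return lambda x: a * x

def draw_nm_sample(rng, n, m):
    sample = []
    for _ in range(n):
        target = draw_task(rng)
        xs = [rng.random() for _ in range(m)]
        sample.append([(x, target(x)) for x in xs])
    return sample

rng = random.Random(0)
z = draw_nm_sample(rng, n=10, m=20)

# Two candidate hypothesis spaces (finite grids of linear predictors); a bias learner
# would prefer the one with smaller empirical loss on the (n, m)-sample.
H_good = [lambda x, a=a: a * x for a in (0.8, 0.9, 1.0, 1.1, 1.2)]
H_bad = [lambda x, a=a: a * x for a in (2.0, 3.0, 4.0)]

print("er_z(H_good) =", empirical_loss_of_space(H_good, z))
print("er_z(H_bad)  =", empirical_loss_of_space(H_bad, z))
```

The quantity printed for each candidate space is exactly the estimate that the uniform convergence results below relate to the environment loss $\mathrm{er}_Q(\mathcal{H})$.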
With this view (bounding $\mathrm{er}_Q(\mathcal{H})$ via a uniform bound over $\mathbb{H}$), the question of generalization within our bias learning model becomes: how many tasks ($n$) and how many examples of each task ($m$) are required to ensure that $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ and $\mathrm{er}_Q(\mathcal{H})$ are close with high probability, uniformly over all $\mathcal{H} \in \mathbb{H}$? Or, informally, how many tasks and how many examples of each task are required to ensure that a hypothesis space with good solutions to all the training tasks will contain good solutions to novel tasks drawn from the same environment?

It turns out that this kind of uniform convergence for bias learning is controlled by the "size" of certain function classes derived from the hypothesis space family $\mathbb{H}$, in much the same way as the VC dimension of a hypothesis space $\mathcal{H}$ controls uniform convergence in the case of Boolean function learning (Theorem 1). These "size" measures and other auxiliary definitions needed to state the main theorem are introduced in the following subsection.

2.3 Covering Numbers

Definition 1. For any hypothesis $h: X \to Y$, define $h_l: X \times Y \to [0, 1]$ by

$$h_l(x, y) := l(h(x), y) \qquad (10)$$

For any hypothesis space $\mathcal{H}$ in the hypothesis space family $\mathbb{H}$, define

$$\mathcal{H}_l := \{h_l : h \in \mathcal{H}\}. \qquad (11)$$

For any sequence of $n$ hypotheses $(h_1, \ldots, h_n)$, define $(h_1, \ldots, h_n)_l: (X \times Y)^n \to [0, 1]$ by

$$(h_1, \ldots, h_n)_l(x_1, y_1, \ldots, x_n, y_n) := \frac{1}{n} \sum_{i=1}^n l(h_i(x_i), y_i). \qquad (12)$$

We will also use $\mathbf{h}_l$ to denote $(h_1, \ldots, h_n)_l$. For any $\mathcal{H}$ in the hypothesis space family $\mathbb{H}$, define

$$\mathcal{H}^n_l := \{(h_1, \ldots, h_n)_l : h_1, \ldots, h_n \in \mathcal{H}\}. \qquad (13)$$

Define

$$\mathbb{H}^n_l := \bigcup_{\mathcal{H} \in \mathbb{H}} \mathcal{H}^n_l. \qquad (14)$$

In the first part of the definition above, hypotheses $h: X \to Y$ are turned into functions $h_l$ mapping $X \times Y \to [0, 1]$ by composition with the loss function. $\mathcal{H}_l$ is then just the collection of all such functions where the original hypotheses come from $\mathcal{H}$. $\mathcal{H}_l$ is often called a loss-function class. In our case we are interested in the average loss across $n$ tasks, where each of the $n$ hypotheses is chosen from a fixed hypothesis space $\mathcal{H}$. This motivates the definition of $\mathbf{h}_l$ and $\mathcal{H}^n_l$. Finally, $\mathbb{H}^n_l$ is the collection of all $(h_1, \ldots, h_n)_l$, with the restriction that all $h_1, \ldots, h_n$ belong to a single hypothesis space $\mathcal{H} \in \mathbb{H}$.

Definition 2. For each $\mathcal{H} \in \mathbb{H}$, define $\mathcal{H}^*: \mathcal{P} \to [0, 1]$ by

$$\mathcal{H}^*(P) := \inf_{h \in \mathcal{H}} \mathrm{er}_P(h). \qquad (15)$$

For the hypothesis space family $\mathbb{H}$, define

$$\mathbb{H}^* := \{\mathcal{H}^* : \mathcal{H} \in \mathbb{H}\}. \qquad (16)$$

It is the "size" of $\mathbb{H}^n_l$ and $\mathbb{H}^*$ that controls how large the $(n, m)$-sample $\mathbf{z}$ must be to ensure $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ and $\mathrm{er}_Q(\mathcal{H})$ are close uniformly over all $\mathcal{H} \in \mathbb{H}$. Their size will be defined in terms of certain covering numbers, and for this we need to define how to measure the distance between elements of $\mathbb{H}^n_l$ and also between elements of $\mathbb{H}^*$.

Definition 3. Let $\mathbf{P} = (P_1, \ldots, P_n)$ be any sequence of $n$ probability distributions on $X \times Y$. For any $\mathbf{h}_l, \mathbf{h}'_l \in \mathbb{H}^n_l$, define

$$d_{\mathbf{P}}(\mathbf{h}_l, \mathbf{h}'_l) := \int_{(X \times Y)^n} \left| \mathbf{h}_l(x_1, y_1, \ldots, x_n, y_n) - \mathbf{h}'_l(x_1, y_1, \ldots, x_n, y_n) \right| dP_1(x_1, y_1) \cdots dP_n(x_n, y_n) \qquad (17)$$

Similarly, for any distribution $Q$ on $\mathcal{P}$ and any $\mathcal{H}^*_1, \mathcal{H}^*_2 \in \mathbb{H}^*$, define

$$d_Q(\mathcal{H}^*_1, \mathcal{H}^*_2) := \int_{\mathcal{P}} \left| \mathcal{H}^*_1(P) - \mathcal{H}^*_2(P) \right| dQ(P) \qquad (18)$$

It is easily verified that $d_{\mathbf{P}}$ and $d_Q$ are pseudo-metrics on $\mathbb{H}^n_l$ and $\mathbb{H}^*$ respectively (a pseudo-metric $d$ is a metric without the condition that $d(x, y) = 0 \Rightarrow x = y$).

Definition 4. An $\varepsilon$-cover of $(\mathbb{H}^*, d_Q)$ is a set $\{\mathcal{H}^*_1, \ldots, \mathcal{H}^*_N\}$ such that for all $\mathcal{H}^* \in \mathbb{H}^*$, $d_Q(\mathcal{H}^*, \mathcal{H}^*_i) \le \varepsilon$ for some $i = 1, \ldots, N$. Note that we do not require the $\mathcal{H}^*_i$ to be contained in $\mathbb{H}^*$, just that they be measurable functions on $\mathcal{P}$.
Let $\mathcal{N}(\varepsilon, \mathbb{H}^*, d_Q)$ denote the size of the smallest such cover. Define the capacity of $\mathbb{H}^*$ by

$$\mathcal{C}(\varepsilon, \mathbb{H}^*) := \sup_Q \mathcal{N}(\varepsilon, \mathbb{H}^*, d_Q) \qquad (19)$$

where the supremum is over all probability measures on $\mathcal{P}$. $\mathcal{N}(\varepsilon, \mathbb{H}^n_l, d_{\mathbf{P}})$ is defined in a similar way, using $d_{\mathbf{P}}$ in place of $d_Q$. Define the capacity of $\mathbb{H}^n_l$ by

$$\mathcal{C}(\varepsilon, \mathbb{H}^n_l) := \sup_{\mathbf{P}} \mathcal{N}(\varepsilon, \mathbb{H}^n_l, d_{\mathbf{P}}) \qquad (20)$$

where now the supremum is over all sequences of $n$ probability measures on $X \times Y$.

2.4 Uniform Convergence for Bias Learners

Now we have enough machinery to state the main theorem. In the theorem the hypothesis space family is required to be permissible. Permissibility is discussed in detail in Appendix D, but note that it is a weak measure-theoretic condition satisfied by almost all "real-world" hypothesis space families. All logarithms are to base $e$.

Theorem 2. Suppose $X$ and $Y$ are separable metric spaces and let $Q$ be any probability distribution on $\mathcal{P}$, the set of all distributions on $X \times Y$. Suppose $\mathbf{z}$ is an $(n, m)$-sample generated by sampling $n$ times from $\mathcal{P}$ according to $Q$ to give $P_1, \ldots, P_n$, and then sampling $m$ times from each $P_i$ to generate $z_i = \{(x_{i1}, y_{i1}), \ldots, (x_{im}, y_{im})\}$, $i = 1, \ldots, n$. Let $\mathbb{H} = \{\mathcal{H}\}$ be any permissible hypothesis space family. If the number of tasks $n$ satisfies

$$n \ge \max\left\{ \frac{256}{\varepsilon^2} \log \frac{8\,\mathcal{C}\!\left(\frac{\varepsilon}{32}, \mathbb{H}^*\right)}{\delta},\; \frac{64}{\varepsilon^2} \right\}, \qquad (21)$$

and the number of examples $m$ of each task satisfies

$$m \ge \max\left\{ \frac{256}{n\varepsilon^2} \log \frac{8\,\mathcal{C}\!\left(\frac{\varepsilon}{32}, \mathbb{H}^n_l\right)}{\delta},\; \frac{64}{\varepsilon^2} \right\}, \qquad (22)$$

then with probability at least $1 - \delta$ (over the $(n, m)$-sample $\mathbf{z}$), all $\mathcal{H} \in \mathbb{H}$ will satisfy

$$\mathrm{er}_Q(\mathcal{H}) \le \hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) + \varepsilon \qquad (23)$$

Proof. See Appendix A.

There are several important points to note about Theorem 2:

1. Provided the capacities $\mathcal{C}(\varepsilon, \mathbb{H}^*)$ and $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ are finite, the theorem shows that any bias learner that selects hypothesis spaces from $\mathbb{H}$ can bound its generalisation error $\mathrm{er}_Q(\mathcal{H})$ in terms of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ for sufficiently large $(n, m)$-samples $\mathbf{z}$. Most bias learners will not find the exact value of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ because it involves finding the smallest error of any hypothesis $h \in \mathcal{H}$ on each of the $n$ training sets in $\mathbf{z}$. But any upper bound on $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ (found, for example, by gradient descent on some error function) will still give an upper bound on $\mathrm{er}_Q(\mathcal{H})$. See Section 3.3.1 for a brief discussion of how this can be achieved in a feature learning setting.

2. In order to learn bias (in the sense that $\mathrm{er}_Q(\mathcal{H})$ and $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ are close uniformly over all $\mathcal{H} \in \mathbb{H}$), both the number of tasks $n$ and the number of examples of each task $m$ must be sufficiently large. This is intuitively reasonable because the bias learner must see both sufficiently many tasks to be confident of the nature of the environment, and sufficiently many examples of each task to be confident of the nature of each task.

3. Once the learner has found an $\mathcal{H} \in \mathbb{H}$ with a small value of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$, it can then use $\mathcal{H}$ to learn novel tasks $P$ drawn according to $Q$. One then has the following theorem bounding the sample complexity required for good generalisation when learning with $\mathcal{H}$ (the proof is very similar to the proof of the bound on $m$ in Theorem 2).

Theorem 3. Let $z = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be a training set generated by sampling from $X \times Y$ according to some distribution $P$.
Let $\mathcal{H}$ be a permissible hypothesis space. For all $\varepsilon, \delta$ with $0 < \varepsilon, \delta < 1$, if the number of training examples $m$ satisfies

$$m \ge \max\left\{ \frac{64}{\varepsilon^2} \log \frac{4\,\mathcal{C}\!\left(\frac{\varepsilon}{16}, \mathcal{H}_l\right)}{\delta},\; \frac{16}{\varepsilon^2} \right\} \qquad (24)$$

then with probability at least $1 - \delta$, all $h \in \mathcal{H}$ will satisfy

$$\mathrm{er}_P(h) \le \hat{\mathrm{er}}_z(h) + \varepsilon.$$

The capacity $\mathcal{C}(\varepsilon, \mathcal{H}_l)$ appearing in equation (24) is defined in an analogous fashion to the capacities in Definition 4 (we just use the pseudo-metric $d_P(h_l, h'_l) := \int_{X \times Y} |h_l(x, y) - h'_l(x, y)|\, dP(x, y)$). The important thing to note about Theorem 3 is that the number of examples required for good generalisation when learning novel tasks is proportional to the logarithm of the capacity of the learnt hypothesis space $\mathcal{H}$. In contrast, if the learner does not do any bias learning, it will have no reason to select one hypothesis space $\mathcal{H} \in \mathbb{H}$ over any other, and consequently it would have to view as a candidate solution any hypothesis in any of the hypothesis spaces $\mathcal{H} \in \mathbb{H}$. Thus, its sample complexity will be proportional to the capacity of $\bigcup_{\mathcal{H} \in \mathbb{H}} \mathcal{H}_l = \mathbb{H}^1_l$, which in general will be considerably larger than the capacity of any individual $\mathcal{H} \in \mathbb{H}$. So by learning $\mathcal{H}$ the learner has learnt to learn in the environment $(\mathcal{P}, Q)$ in the sense that it needs far smaller training sets to learn novel tasks.

4. Having learnt a hypothesis space $\mathcal{H}$ with a small value of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$, Theorem 2 tells us that with probability at least $1 - \delta$, the expected value of $\inf_{h \in \mathcal{H}} \mathrm{er}_P(h)$ on a novel task $P$ will be less than $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) + \varepsilon$. Of course, this does not rule out really bad performance on some tasks $P$. However, the probability of generating such "bad" tasks can be bounded. In particular, note that $\mathrm{er}_Q(\mathcal{H})$ is just the expected value of the function $\mathcal{H}^*$ over $\mathcal{P}$, and so by Markov's inequality, for $\gamma > 0$,

$$\Pr\left\{ P : \inf_{h \in \mathcal{H}} \mathrm{er}_P(h) \ge \gamma \right\} = \Pr\left\{ P : \mathcal{H}^*(P) \ge \gamma \right\} \le \frac{\mathbb{E}_Q\, \mathcal{H}^*}{\gamma} = \frac{\mathrm{er}_Q(\mathcal{H})}{\gamma} \le \frac{\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}) + \varepsilon}{\gamma}$$

(with probability $1 - \delta$).

5. Keeping the accuracy and confidence parameters $\varepsilon, \delta$ fixed, note that the number of examples required of each task for good generalisation obeys

$$m = O\left( \frac{1}{n} \log \mathcal{C}(\varepsilon, \mathbb{H}^n_l) \right). \qquad (25)$$

So provided $\log \mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ increases sublinearly with $n$, the upper bound on the number of examples required of each task will decrease as the number of tasks increases. This shows that for suitably constructed hypothesis space families it is possible to share information between tasks. This is discussed further after Theorem 4 below.

2.5 Choosing the Hypothesis Space Family $\mathbb{H}$

Theorem 2 only provides conditions under which $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$ and $\mathrm{er}_Q(\mathcal{H})$ are close; it does not guarantee that $\mathrm{er}_Q(\mathcal{H})$ is actually small. This is governed by the choice of $\mathbb{H}$. If $\mathbb{H}$ contains a hypothesis space $\mathcal{H}$ with a small value of $\mathrm{er}_Q(\mathcal{H})$ and the learner is able to find an $\mathcal{H} \in \mathbb{H}$ minimizing error on the $(n, m)$-sample $\mathbf{z}$ (i.e., minimizing $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H})$), then, for sufficiently large $n$ and $m$, Theorem 2 ensures that with high probability $\mathrm{er}_Q(\mathcal{H})$ will be small. However, a bad choice of $\mathbb{H}$ will mean there is no hope of finding an $\mathcal{H}$ with small error. In this sense the choice of $\mathbb{H}$ represents the hyper-bias of the learner.

Note that from a sample complexity point of view, the optimal hypothesis space family to choose is one containing a single, minimal hypothesis space $\mathcal{H}$ that contains good solutions to all of the problems in the environment (or at least a set of problems with high $Q$-probability), and no more.
For then there is no bias learning to do (because there is no choice to be made between hypothesis spaces), the output of the bias learning algorithm is guaranteed to be a good hypothesis space for the environment, and since the hypothesis space is minimal, learning any problem within the environment using $\mathcal{H}$ will require the smallest possible number of examples. However, this scenario is analogous to the trivial scenario in ordinary learning in which the learning algorithm contains a single, optimal hypothesis for the problem being learnt. In that case there is no learning to be done, just as there is no bias learning to be done if the correct hypothesis space is already known.

At the other extreme, if $\mathbb{H}$ contains a single hypothesis space $\mathcal{H}$ consisting of all possible functions from $X \to Y$, then bias learning is impossible because the bias learner cannot produce a restricted hypothesis space as output, and hence cannot produce a hypothesis space with improved sample complexity requirements on as yet unseen tasks. Focussing on these two extremes highlights the minimal requirements on $\mathbb{H}$ for successful bias learning to occur: the hypothesis spaces $\mathcal{H} \in \mathbb{H}$ must be strictly smaller than the space of all functions $X \to Y$, but not so small or so "skewed" that none of them contain good solutions to a large majority of the problems in the environment.

It may seem that we have simply replaced the problem of selecting the right bias (i.e., selecting the right hypothesis space $\mathcal{H}$) with the equally difficult problem of selecting the right hyper-bias (i.e., the right hypothesis space family $\mathbb{H}$). However, in many cases selecting the right hyper-bias is far easier than selecting the right bias. For example, in Section 3 we will see how the feature selection problem may be viewed as a bias selection problem. Selecting the right features can be extremely difficult if one knows little about the environment, with intelligent trial-and-error typically the best one can do. However, in a bias learning scenario, one only has to specify that a set of features should exist, find a loosely parameterised set of features (for example neural networks), and then learn the features by sampling from multiple related tasks.

2.6 Learning Multiple Tasks

It may be that the learner is not interested in learning to learn, but just wants to learn a fixed set of $n$ tasks from the environment $(\mathcal{P}, Q)$. As in the previous section, we assume the learner starts out with a hypothesis space family $\mathbb{H}$, and also that it receives an $(n, m)$-sample $\mathbf{z}$ generated from the $n$ distributions $P_1, \ldots, P_n$. This time, however, the learner is simply looking for $n$ hypotheses $(h_1, \ldots, h_n)$, all contained in the same hypothesis space $\mathcal{H}$, such that the average generalization error of the $n$ hypotheses is minimal. Denoting $(h_1, \ldots, h_n)$ by $\mathbf{h}$ and writing $\mathbf{P} = (P_1, \ldots, P_n)$, this error is given by:

$$\mathrm{er}_{\mathbf{P}}(\mathbf{h}) := \frac{1}{n} \sum_{i=1}^n \mathrm{er}_{P_i}(h_i) = \frac{1}{n} \sum_{i=1}^n \int_{X \times Y} l(h_i(x), y)\, dP_i(x, y), \qquad (26)$$

and the empirical loss of $\mathbf{h}$ on $\mathbf{z}$ is

$$\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}) := \frac{1}{n} \sum_{i=1}^n \hat{\mathrm{er}}_{z_i}(h_i) = \frac{1}{n} \sum_{i=1}^n \frac{1}{m} \sum_{j=1}^m l(h_i(x_{ij}), y_{ij}). \qquad (27)$$

As before, regardless of how the learner chooses $(h_1, \ldots, h_n)$, if we can prove a uniform bound on the probability of large deviation between $\hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h})$ and $\mathrm{er}_{\mathbf{P}}(\mathbf{h})$, then any $(h_1, \ldots, h_n)$ that perform well on the training sets $\mathbf{z}$ will with high probability perform well on future examples of the same tasks.
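Before stating the uniform convergence result, the distinction between equation (8) (bias learning: the best hypothesis in a shared space is taken for each task) and equation (27) (multiple-task learning: a fixed choice of one hypothesis per task is evaluated) can be seen in a few lines of code. The toy tasks, the finite hypothesis space, and the particular chosen hypotheses below are invented purely for illustration.

```python
def task_empirical_loss(h, z_i):
    """Equation (4): average (squared) loss of one hypothesis on one task's training set."""
    return sum((h(x) - y) ** 2 for x, y in z_i) / len(z_i)

def multitask_empirical_loss(hs, nm_sample):
    """Equation (27): average over tasks of the loss of the chosen hypothesis h_i on z_i."""
    return sum(task_empirical_loss(h_i, z_i)
               for h_i, z_i in zip(hs, nm_sample)) / len(nm_sample)

def bias_learning_empirical_loss(hypothesis_space, nm_sample):
    """Equation (8): average over tasks of the best loss achievable within a shared space."""
    return sum(min(task_empirical_loss(h, z_i) for h in hypothesis_space)
               for z_i in nm_sample) / len(nm_sample)

# Toy (n, m)-sample with n = 2 tasks and m = 3 examples each (numbers are arbitrary).
z = [[(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)],   # task 1: y = x
     [(0.0, 0.0), (0.5, 1.0), (1.0, 2.0)]]   # task 2: y = 2x

H = [lambda x, a=a: a * x for a in (0.5, 1.0, 1.5, 2.0)]  # shared hypothesis space
chosen = [H[1], H[3]]                                     # one hypothesis per task

print(multitask_empirical_loss(chosen, z))       # equation (27)
print(bias_learning_empirical_loss(H, z))        # equation (8)
```

Equation (8) is always a lower bound on equation (27) for hypotheses drawn from the same $\mathcal{H}$, since it takes the infimum over $\mathcal{H}$ on each task.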
Theorem 4. Let $\mathbf{P} = (P_1, \ldots, P_n)$ be $n$ probability distributions on $X \times Y$ and let $\mathbf{z}$ be an $(n, m)$-sample generated by sampling $m$ times from $X \times Y$ according to each $P_i$. Let $\mathbb{H} = \{\mathcal{H}\}$ be any permissible hypothesis space family. If the number of examples $m$ of each task satisfies

$$m \ge \max\left\{ \frac{64}{n\varepsilon^2} \log \frac{4\,\mathcal{C}\!\left(\frac{\varepsilon}{16}, \mathbb{H}^n_l\right)}{\delta},\; \frac{16}{\varepsilon^2} \right\} \qquad (28)$$

then with probability at least $1 - \delta$ (over the choice of $\mathbf{z}$), any $\mathbf{h} \in \mathcal{H}^n$ (with $\mathcal{H} \in \mathbb{H}$) will satisfy

$$\mathrm{er}_{\mathbf{P}}(\mathbf{h}) \le \hat{\mathrm{er}}_{\mathbf{z}}(\mathbf{h}) + \varepsilon \qquad (29)$$

(recall Definition 4 for the meaning of $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$).

Proof. Omitted (follow the proof of the bound on $m$ in Theorem 2).

The bound on $m$ in Theorem 4 is virtually identical to the bound on $m$ in Theorem 2, and note again that it depends inversely on the number of tasks $n$ (assuming that the first part of the "max" expression is the dominant one). Whether this helps depends on the rate of growth of $\mathcal{C}(\frac{\varepsilon}{16}, \mathbb{H}^n_l)$ as a function of $n$. The following lemma shows that this growth is always small enough to ensure that we never do worse by learning multiple tasks (at least in terms of the upper bound on the number of examples required per task).

Lemma 5. For any hypothesis space family $\mathbb{H}$,

$$\mathcal{C}\!\left(\varepsilon, \mathbb{H}^1_l\right) \le \mathcal{C}\!\left(\varepsilon, \mathbb{H}^n_l\right) \le \mathcal{C}\!\left(\varepsilon, \mathbb{H}^1_l\right)^n. \qquad (30)$$

Proof. Let $\mathcal{K}$ denote the set of all functions $(h_1, \ldots, h_n)_l$ where each $h_i$ can be a member of any hypothesis space $\mathcal{H} \in \mathbb{H}$ (recall Definition 1). Then $\mathbb{H}^n_l \subseteq \mathcal{K}$ and so $\mathcal{C}(\varepsilon, \mathbb{H}^n_l) \le \mathcal{C}(\varepsilon, \mathcal{K})$. By Lemma 29 in Appendix B, $\mathcal{C}(\varepsilon, \mathcal{K}) \le \mathcal{C}(\varepsilon, \mathbb{H}^1_l)^n$, and so the right-hand inequality follows. For the first inequality, let $P$ be any probability measure on $X \times Y$ and let $\mathbf{P}$ be the measure on $(X \times Y)^n$ obtained by using $P$ on the first copy of $X \times Y$ in the product, and ignoring all other elements of the product. Let $N$ be an $\varepsilon$-cover for $(\mathbb{H}^n_l, d_{\mathbf{P}})$. Pick any $h_l \in \mathbb{H}^1_l$ and let $(g_1, \ldots, g_n)_l \in N$ be such that $d_{\mathbf{P}}((h, h, \ldots, h)_l, (g_1, \ldots, g_n)_l) \le \varepsilon$. But by construction, $d_{\mathbf{P}}((h, h, \ldots, h)_l, (g_1, \ldots, g_n)_l) = d_P(h_l, (g_1)_l)$, which establishes the first inequality.

By Lemma 5,

$$\log \mathcal{C}\!\left(\varepsilon, \mathbb{H}^1_l\right) \le \log \mathcal{C}\!\left(\varepsilon, \mathbb{H}^n_l\right) \le n \log \mathcal{C}\!\left(\varepsilon, \mathbb{H}^1_l\right). \qquad (31)$$

So keeping the accuracy parameters $\varepsilon$ and $\delta$ fixed, and plugging (31) into (28), we see that the upper bound on the number of examples required of each task never increases with the number of tasks, and at best decreases as $O(1/n)$. Although only an upper bound, this provides a strong hint that learning multiple related tasks should be advantageous on a "number of examples required per task" basis. In Section 3 it will be shown that for feature learning all types of behavior are possible, from no advantage at all to an $O(1/n)$ decrease.

2.7 Dependence on $\varepsilon$

In Theorems 2, 3 and 4 the bounds on sample complexity all scale as $1/\varepsilon^2$. This behavior can be improved to $1/\varepsilon$ if the empirical loss is always guaranteed to be zero (i.e., we are in the realizable case). The same behavior results if we are interested in relative deviation between empirical and true loss, rather than absolute deviation. Formal theorems along these lines are stated in Appendix A.3.

3. Feature Learning

The use of restricted feature sets is nearly ubiquitous as a method of encoding bias in many areas of machine learning and statistics, including classification, regression and density estimation. In this section we show how the problem of choosing a set of features for an environment of related tasks can be recast as a bias learning problem. Explicit bounds on $\mathcal{C}(\varepsilon, \mathbb{H}^*)$ and $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ are calculated for general feature classes in Section 3.2.
These bounds are applied to the problem of learning a neural network feature set in Section 3.3.

3.1 The Feature Learning Model

Consider the following quote from Vapnik (1996):

    The classical approach to estimating multidimensional functional dependencies is based on the following belief: Real-life problems are such that there exists a small number of "strong features," simple functions of which (say linear combinations) approximate well the unknown function. Therefore, it is necessary to carefully choose a low-dimensional feature space and then to use regular statistical techniques to construct an approximation.

In general a set of "strong features" may be viewed as a function $f: X \to V$ mapping the input space $X$ into some (typically lower dimensional) space $V$. Let $\mathcal{F} = \{f\}$ be a set of such feature maps (each $f$ may be viewed as a set of features $(f_1, \ldots, f_k)$ if $V = \mathbb{R}^k$). It is the $f$ that must be "carefully chosen" in the above quote. In general, the "simple functions of the features" may be represented as a class of functions $\mathcal{G}$ mapping $V$ to $Y$. If for each $f \in \mathcal{F}$ we define the hypothesis space $\mathcal{G} \circ f := \{g \circ f : g \in \mathcal{G}\}$, then we have the hypothesis space family

$$\mathbb{H} := \{\mathcal{G} \circ f : f \in \mathcal{F}\}. \qquad (32)$$

Now the problem of "carefully choosing" the right features $f$ is equivalent to the bias learning problem "find the right hypothesis space $\mathcal{H} \in \mathbb{H}$". Hence, provided the learner is embedded within an environment of related tasks, and the capacities $\mathcal{C}(\varepsilon, \mathbb{H}^*)$ and $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ are finite, Theorem 2 tells us that the feature set $f$ can be learnt rather than carefully chosen. This represents an important simplification, as choosing a set of features is often the most difficult part of any machine learning problem. In Section 3.2 we give a theorem bounding $\mathcal{C}(\varepsilon, \mathbb{H}^*)$ and $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ for general feature classes. The theorem is specialized to neural network classes in Section 3.3.

Note that we have forced the function class $\mathcal{G}$ to be the same for all feature maps $f$, although this is not necessary. Indeed, variants of the results to follow can be obtained if $\mathcal{G}$ is allowed to vary with $f$.

3.2 Capacity Bounds for General Feature Classes

Notationally it is easier to view the feature maps $f$ as mapping from $X \times Y$ to $V \times Y$ by $(x, y) \mapsto (f(x), y)$, and also to absorb the loss function $l$ into the definition of $\mathcal{G}$ by viewing each $g \in \mathcal{G}$ as a map from $V \times Y$ into $[0, 1]$ via $(v, y) \mapsto l(g(v), y)$. Previously this latter function would have been denoted $g_l$, but in what follows we will drop the subscript $l$ where this does not cause confusion. The class to which $g_l$ belongs will still be denoted by $\mathcal{G}_l$. With the above definitions let $\mathcal{G}_l \circ \mathcal{F} := \{g \circ f : g \in \mathcal{G}_l, f \in \mathcal{F}\}$.

Define the capacity of $\mathcal{G}_l$ in the usual way, $\mathcal{C}(\varepsilon, \mathcal{G}_l) := \sup_P \mathcal{N}(\varepsilon, \mathcal{G}_l, d_P)$, where the supremum is over all probability measures on $V \times Y$ and $d_P(g, g') := \int_{V \times Y} |g(v, y) - g'(v, y)|\, dP(v, y)$. To define the capacity of $\mathcal{F}$ we first define a pseudo-metric $d_{[P, \mathcal{G}_l]}$ on $\mathcal{F}$ by "pulling back" the $L^1$ metric on $\mathbb{R}$ through $\mathcal{G}_l$ as follows:

$$d_{[P, \mathcal{G}_l]}(f, f') := \int_{X \times Y} \sup_{g \in \mathcal{G}_l} \left| g \circ f(x, y) - g \circ f'(x, y) \right| dP(x, y). \qquad (33)$$

It is easily verified that $d_{[P, \mathcal{G}_l]}$ is a pseudo-metric. Note that for $d_{[P, \mathcal{G}_l]}$ to be well defined the supremum over $\mathcal{G}_l$ in the integrand must be measurable. This is guaranteed if the hypothesis space family $\mathbb{H} = \{\mathcal{G}_l \circ f : f \in \mathcal{F}\}$ is permissible (Lemma 32, part 4).
Now define $\mathcal{N}(\varepsilon, \mathcal{F}, d_{[P, \mathcal{G}_l]})$ to be the size of the smallest $\varepsilon$-cover of the pseudo-metric space $(\mathcal{F}, d_{[P, \mathcal{G}_l]})$, and the $\varepsilon$-capacity of $\mathcal{F}$ (with respect to $\mathcal{G}_l$) as $\mathcal{C}_{\mathcal{G}_l}(\varepsilon, \mathcal{F}) := \sup_P \mathcal{N}(\varepsilon, \mathcal{F}, d_{[P, \mathcal{G}_l]})$, where the supremum is over all probability measures on $X \times Y$. Now we can state the main theorem of this section.

Theorem 6. Let $\mathbb{H}$ be a hypothesis space family as in equation (32). Then for all $\varepsilon, \varepsilon_1, \varepsilon_2 > 0$ with $\varepsilon = \varepsilon_1 + \varepsilon_2$,

$$\mathcal{C}(\varepsilon, \mathbb{H}^n_l) \le \mathcal{C}(\varepsilon_1, \mathcal{G}_l)^n\, \mathcal{C}_{\mathcal{G}_l}(\varepsilon_2, \mathcal{F}) \qquad (34)$$
$$\mathcal{C}(\varepsilon, \mathbb{H}^*) \le \mathcal{C}_{\mathcal{G}_l}(\varepsilon, \mathcal{F}) \qquad (35)$$

Proof. See Appendix B.

3.3 Learning Neural Network Features

In general, a set of features may be viewed as a map from the (typically high-dimensional) input space $\mathbb{R}^d$ to a much smaller dimensional space $\mathbb{R}^k$ ($k \ll d$). In this section we consider approximating such a feature map by a one-hidden-layer neural network with $d$ input nodes and $k$ output nodes (Figure 1). We denote the set of all such feature maps by $\{\Phi_w = (\phi_{w,1}, \ldots, \phi_{w,k}) : w \in D\}$ where $D$ is a bounded subset of $\mathbb{R}^W$ ($W$ is the number of weights (parameters) in the first two layers). This set is the $\mathcal{F}$ of the previous section. Each feature $\phi_{w,i}: \mathbb{R}^d \to [0, 1]$, $i = 1, \ldots, k$, is defined by

$$\phi_{w,i}(x) := \sigma\left( \sum_{j=1}^l v_{ij} h_j(x) + v_{i,l+1} \right) \qquad (36)$$

where $h_j(x)$ is the output of the $j$th node in the first hidden layer, $(v_{i1}, \ldots, v_{i,l+1})$ are the output node parameters for the $i$th feature, and $\sigma$ is a "sigmoid" squashing function $\sigma: \mathbb{R} \to [0, 1]$. Each first-layer hidden node $h_i: \mathbb{R}^d \to \mathbb{R}$, $i = 1, \ldots, l$, computes

$$h_i(x) := \sigma\left( \sum_{j=1}^d u_{ij} x_j + u_{i,d+1} \right) \qquad (37)$$

where $(u_{i1}, \ldots, u_{i,d+1})$ are the hidden node's parameters. We assume $\sigma$ is Lipschitz ($\sigma$ is Lipschitz if there exists a constant $K$ such that $|\sigma(x) - \sigma(x')| \le K|x - x'|$ for all $x, x' \in \mathbb{R}$). The weight vector for the entire feature map is thus $w = (u_{11}, \ldots, u_{1,d+1}, \ldots, u_{l1}, \ldots, u_{l,d+1}, v_{11}, \ldots, v_{1,l+1}, \ldots, v_{k1}, \ldots, v_{k,l+1})$ and the total number of feature parameters is $W = l(d + 1) + k(l + 1)$.

For argument's sake, assume the "simple functions" of the features (the class $\mathcal{G}$ of the previous section) are squashed affine maps using the same sigmoid function $\sigma$ above (in keeping with the "neural network" flavor of the features). Thus, each setting of the feature weights $w$ generates a hypothesis space:

$$\mathcal{H}_w := \left\{ \sigma\left( \sum_{i=1}^k \alpha_i \phi_{w,i} + \alpha_{k+1} \right) : (\alpha_1, \ldots, \alpha_{k+1}) \in D' \right\}, \qquad (38)$$

where $D'$ is a bounded subset of $\mathbb{R}^{k+1}$. The set of all such hypothesis spaces,

$$\mathbb{H} := \{\mathcal{H}_w : w \in D\}, \qquad (39)$$

is a hypothesis space family. The restrictions on the output layer weights $(\alpha_1, \ldots, \alpha_{k+1})$ and feature weights $w$, and the restriction to a Lipschitz squashing function, are needed to obtain finite upper bounds on the covering numbers in Theorem 2.

[Figure 1: Neural network for feature learning. The feature map is implemented by the first two hidden layers. The $n$ output nodes correspond to the $n$ different tasks in the $(n, m)$-sample $\mathbf{z}$. Each node in the network computes a squashed linear function of the nodes in the previous layer.]

Finding a good set of features for the environment $(\mathcal{P}, Q)$ is equivalent to finding a good hypothesis space $\mathcal{H}_w \in \mathbb{H}$, which in turn means finding a good set of feature map parameters $w$. As in Theorem 2, the correct set of features may be learnt by finding a hypothesis space with small error on a sufficiently large $(n, m)$-sample $\mathbf{z}$.
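A minimal sketch of this architecture may help fix ideas: a shared two-layer feature map $\Phi_w$ (equations (36) and (37)) feeding one squashed affine output head per task (equation (38)), together with the average per-task empirical squared loss that a gradient-descent bias learner (Section 3.3.1, next) would minimize. The logistic sigmoid, the random initialisation, and all dimensions below are assumptions made for the example, not prescriptions of the theory.

```python
import numpy as np

def sigma(a):
    """Logistic squashing function with range [0, 1] (one admissible Lipschitz sigmoid)."""
    return 1.0 / (1.0 + np.exp(-a))

class SharedFeatureNet:
    """Shared feature map Phi_w (equations (36)-(37)) plus one squashed affine
    output head per task (equation (38)); dimensions are illustrative only."""

    def __init__(self, d, l, k, n_tasks, rng):
        self.U = rng.normal(scale=0.1, size=(l, d + 1))             # first hidden layer (u_ij)
        self.V = rng.normal(scale=0.1, size=(k, l + 1))             # second hidden layer (v_ij)
        self.alpha = rng.normal(scale=0.1, size=(n_tasks, k + 1))   # per-task output weights

    def features(self, x):
        h = sigma(self.U @ np.append(x, 1.0))                       # equation (37)
        return sigma(self.V @ np.append(h, 1.0))                    # equation (36)

    def predict(self, x, task):
        phi = self.features(x)
        return sigma(self.alpha[task] @ np.append(phi, 1.0))        # equation (38)

def average_empirical_loss(net, nm_sample):
    """Average over tasks of the squared loss of each task's head on its own training set.
    For fixed output weights this upper-bounds the empirical loss of H_w in equation (8);
    minimizing over the output weights per task gives equation (40) below."""
    total = 0.0
    for i, z_i in enumerate(nm_sample):
        total += sum((net.predict(x, i) - y) ** 2 for x, y in z_i) / len(z_i)
    return total / len(nm_sample)

# Toy usage: n = 3 tasks, m = 5 examples per task, inputs in R^4, k = 2 features.
rng = np.random.default_rng(0)
nm_sample = [[(rng.random(4), rng.random()) for _ in range(5)] for _ in range(3)]
net = SharedFeatureNet(d=4, l=6, k=2, n_tasks=3, rng=rng)
print(average_empirical_loss(net, nm_sample))
```

Minimizing the printed quantity over all of $U$, $V$ and the per-task output weights (for instance by gradient descent or automatic differentiation) yields an upper bound on $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}_w)$, which by Theorem 2 in turn controls $\mathrm{er}_Q(\mathcal{H}_w)$.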
Specializing to squared loss, in the present framework the empirical loss of $\mathcal{H}_w$ on $\mathbf{z}$ (equation (8)) is given by

$$\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}_w) = \frac{1}{n} \sum_{i=1}^n \inf_{(\alpha_0, \alpha_1, \ldots, \alpha_k) \in D'} \frac{1}{m} \sum_{j=1}^m \left[ \sigma\left( \sum_{l=1}^k \alpha_l \phi_{w,l}(x_{ij}) + \alpha_0 \right) - y_{ij} \right]^2 \qquad (40)$$

Since our sigmoid function $\sigma$ only has range $[0, 1]$, we also restrict the outputs $Y$ to this range.

3.3.1 Algorithms for Finding a Good Set of Features

Provided the squashing function $\sigma$ is differentiable, gradient descent (with a small variation on backpropagation to compute the derivatives) can be used to find feature weights $w$ minimizing (40) (or at least a local minimum of (40)). The only extra difficulty over and above ordinary gradient descent is the appearance of "inf" in the definition of $\hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}_w)$. The solution is to perform gradient descent over both the output parameters $(\alpha_0, \ldots, \alpha_k)$ for each node and the feature weights $w$. For more details see Baxter (1995b) and Baxter (1995a, chapter 4), where empirical results supporting the theoretical results presented here are also given.

3.3.2 Sample Complexity Bounds for Neural-Network Feature Learning

The size of $\mathbf{z}$ ensuring that the resulting features will be good for learning novel tasks from the same environment is given by Theorem 2. All we have to do is compute the logarithm of the covering numbers $\mathcal{C}(\varepsilon, \mathbb{H}^n_l)$ and $\mathcal{C}(\varepsilon, \mathbb{H}^*)$.

Theorem 7. Let $\mathbb{H} = \{\mathcal{H}_w : w \in \mathbb{R}^W\}$ be a hypothesis space family where each $\mathcal{H}_w$ is of the form

$$\mathcal{H}_w := \left\{ \sigma\left( \sum_{i=1}^k \alpha_i \phi_{w,i}(\cdot) + \alpha_0 \right) : (\alpha_1, \ldots, \alpha_k) \in \mathbb{R}^k \right\},$$

where $\Phi_w = (\phi_{w,1}, \ldots, \phi_{w,k})$ is a neural network with $W$ weights mapping from $\mathbb{R}^d$ to $\mathbb{R}^k$. If the feature weights $w$ and the output weights $\alpha_0, \alpha_1, \ldots, \alpha_k$ are bounded, the squashing function $\sigma$ is Lipschitz, $l$ is squared loss, and the output space $Y = [0, 1]$ (any bounded subset of $\mathbb{R}$ will do), then there exist constants $\kappa, \kappa'$ (independent of $\varepsilon$, $W$ and $k$) such that for all $\varepsilon > 0$,

$$\log \mathcal{C}(\varepsilon, \mathbb{H}^n_l) \le 2\left( (k + 1)n + W \right) \log \frac{\kappa}{\varepsilon} \qquad (41)$$
$$\log \mathcal{C}(\varepsilon, \mathbb{H}^*) \le 2W \log \frac{\kappa'}{\varepsilon} \qquad (42)$$

(recall that we have specialized to squared loss here).

Proof. See Appendix B.

Noting that our neural network hypothesis space family $\mathbb{H}$ is permissible, plugging (41) and (42) into Theorem 2 gives the following theorem.

Theorem 8. Let $\mathbb{H} = \{\mathcal{H}_w\}$ be a hypothesis space family where each hypothesis space $\mathcal{H}_w$ is a set of squashed linear maps composed with a neural network feature map, as above. Suppose the number of features is $k$, and the total number of feature weights is $W$. Assume all feature weights and output weights are bounded, and the squashing function $\sigma$ is Lipschitz. Let $\mathbf{z}$ be an $(n, m)$-sample generated from the environment $(\mathcal{P}, Q)$. If

$$n \ge O\left( \frac{1}{\varepsilon^2} \left[ W \log \frac{1}{\varepsilon} + \log \frac{1}{\delta} \right] \right), \qquad (43)$$

and

$$m \ge O\left( \frac{1}{\varepsilon^2} \left[ \left( k + 1 + \frac{W}{n} \right) \log \frac{1}{\varepsilon} + \frac{1}{n} \log \frac{1}{\delta} \right] \right), \qquad (44)$$

then with probability at least $1 - \delta$ any $\mathcal{H}_w \in \mathbb{H}$ will satisfy

$$\mathrm{er}_Q(\mathcal{H}_w) \le \hat{\mathrm{er}}_{\mathbf{z}}(\mathcal{H}_w) + \varepsilon. \qquad (45)$$

3.3.3 Discussion

1. Keeping the accuracy and confidence parameters $\varepsilon$ and $\delta$ fixed, the upper bound on the number of examples required of each task behaves like $O(k + W/n)$. If the learner is simply learning $n$ fixed tasks (rather than learning to learn), then the same upper bound also applies (recall Theorem 4).

2. Note that if we do away with the feature map altogether then $W = 0$ and the upper bound on $m$ becomes $O(k)$, independent of $n$ (apart from the less important $\delta$ term). So in terms of the upper bound, learning $n$ tasks becomes just as hard as learning one task.
3.3.4 COMPARISON WITH TRADITIONAL MULTIPLE-CLASS CLASSIFICATION

A special case of this multi-task framework is one in which the marginal distribution $P_i|_X$ on the input space is the same for each task $i = 1, \dots, n$, and all that varies between tasks is the conditional distribution over the output space $Y$. An example would be a multi-class problem such as face recognition, in which $Y = \{1, \dots, n\}$, where $n$ is the number of faces to be recognized and the marginal distribution on $X$ is simply the "natural" distribution over images of those faces.

In that case, if for every example $x_{ij}$ we have, in addition to the sample $y_{ij}$ from the $i$th task's conditional distribution on $Y$, samples from the remaining $n - 1$ conditional distributions on $Y$, then we can view the $n$ training sets containing $m$ examples each as one large training set for the multi-class problem with $mn$ examples altogether. The bound on $m$ in Theorem 8 states that $mn$ should be $O(nk + W)$, or proportional to the total number of parameters in the network, a result we would expect from Haussler (1992).^6

So when specialized to the traditional multiple-class, single task framework, Theorem 8 is consistent with the bounds already known. However, as we have already argued, problems such as face recognition are not really single-task, multiple-class problems. They are more appropriately viewed as a (potentially infinite) collection of distinct binary classification problems. In that case, the goal of bias learning is not to find a single $n$-output network that can classify some subset of $n$ faces well. It is to learn a set of features that can reliably be used as a fixed preprocessing for distinguishing any single face from other faces.

6. If each example can be classified with a "large margin" then naive parameter counting can be improved upon (Bartlett, 1998).
This is the new thing provided by Theorem 8: it tells us that provided we have trained our $n$-output neural network on sufficiently many examples of sufficiently many tasks, we can be confident that the common feature map learnt for those $n$ tasks will be good for learning any new, as yet unseen task, provided the new task is drawn from the same distribution that generated the training tasks. In addition, learning the new task only requires estimating the $k$ output node parameters for that task, a vastly easier problem than estimating the parameters of the entire network, from both a sample and computational complexity perspective.

Also, since we have high confidence that the learnt features will be good for learning novel tasks drawn from the same environment, those features are themselves a candidate for further study to learn more about the nature of the environment. The same claim could not be made if the features had been learnt on too small a set of tasks to guarantee generalization to novel tasks, for then it is likely that the features would implement idiosyncrasies specific to those tasks, rather than "invariances" that apply across all tasks.

When viewed from a bias (or feature) learning perspective, rather than a traditional $n$-class classification perspective, the bound $m$ on the number of examples required of each task takes on a somewhat different meaning. It tells us that provided $n$ is large (i.e., we are collecting examples of a large number of tasks), then we really only need to collect a few more examples than we would otherwise have to collect if the feature map was already known ($k + W/n$ examples vs. $k$ examples). So it tells us that the burden imposed by feature learning can be made negligibly small, at least when viewed from the perspective of the sampling burden required of each task.

3.4 Learning Multiple Tasks with Boolean Feature Maps

Ignoring the accuracy and confidence parameters $\varepsilon$ and $\delta$, Theorem 8 shows that the number of examples required of each task when learning $n$ tasks with a common neural-network feature map is bounded above by $O(k + W/n)$, where $k$ is the number of features and $W$ is the number of adjustable parameters in the feature map. Since $O(k)$ examples are required to learn a single task once the true features are known, this shows that the upper bound on the number of examples required of each task decays (in order) to the minimum possible as the number of tasks $n$ increases. This suggests that learning multiple tasks is advantageous, but to be truly convincing we need to prove a lower bound of the same form.

Proving lower bounds in a real-valued setting ($Y = \mathbb R$) is complicated by the fact that a single example can convey an infinite amount of information, so one typically has to make extra assumptions, such as that the targets $y \in Y$ are corrupted by a noise process. Rather than concern ourselves with such complications, in this section we restrict our attention to Boolean hypothesis space families (meaning each hypothesis $h \in \mathbb H_1$ maps to $Y = \{\pm 1\}$ and we measure error by discrete loss: $l(h(x), y) = 1$ if $h(x) \ne y$ and $l(h(x), y) = 0$ otherwise). We show that the sample complexity for learning $n$ tasks with a Boolean hypothesis space family $\mathbb H$ is controlled by a "VC dimension" type parameter $d_{\mathbb H}(n)$ (that is, we give nearly matching upper and lower bounds involving $d_{\mathbb H}(n)$). We then derive bounds on $d_{\mathbb H}(n)$ for the hypothesis space family considered in the previous section, with the Lipschitz sigmoid function $\sigma$ replaced by a hard threshold (linear threshold networks).
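As rough arithmetic for the $O(k + W/n)$ behaviour recalled at the start of this section, the following lines evaluate $k + W/n$ for invented values of $k$ and $W$ (constants and logarithmic factors in Theorem 8 are ignored, and none of the numbers come from the text):

k, W = 10, 100_000            # a few strong features, a large feature network (W >> k)
for n in (1, 10, 100, 1000, 10_000):
    print(n, k + W / n)       # per-task examples shrink towards the k "floor" as n grows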
As well as the bound on the number of examples required per task for good generalization across those tasks, Theorem 8 also shows that features performing well on $O(W)$ tasks will generalize well to novel tasks, where $W$ is the number of parameters in the feature map. Given that for many feature learning problems $W$ is likely to be quite large (recall Note 4 in Section 3.3.3), it would be useful to know that $O(W)$ tasks are in fact necessary without further restrictions on the environmental distributions $Q$ generating the tasks. Unfortunately, we have not yet been able to show such a lower bound. There is some empirical evidence suggesting that in practice the upper bound on the number of tasks may be very weak. For example, in Baxter and Bartlett (1998) we reported experiments in which a set of neural network features learnt on a subset of only 400 Japanese characters turned out to be good enough for classifying some 2600 unseen characters, even though the features contained several hundred thousand parameters. Similar results may be found in Intrator and Edelman (1996) and in the experiments reported in Thrun (1996) and Thrun and Pratt (1997, chapter 8). While this gap between experiment and theory may be just another example of the looseness inherent in general bounds, it may also be that the analysis can be tightened. In particular, the bound on the number of tasks is insensitive to the size of the class of output functions (the class $\mathcal G$ in Section 3.1), which may be where the looseness has arisen.

3.4.1 UPPER AND LOWER BOUNDS FOR LEARNING n TASKS WITH BOOLEAN HYPOTHESIS SPACE FAMILIES

First we recall some concepts from the theory of Boolean function learning. Let $H$ be a class of Boolean functions on $X$ and $x = (x_1, \dots, x_m) \in X^m$. $H_{|x}$ is the set of all binary vectors obtainable by applying functions in $H$ to $x$:
\[ H_{|x} := \{(h(x_1), \dots, h(x_m)) : h \in H\}. \]
Clearly $|H_{|x}| \le 2^m$. If $|H_{|x}| = 2^m$ we say $H$ shatters $x$. The growth function of $H$ is defined by
\[ \Pi_H(m) := \max_{x \in X^m} \bigl|H_{|x}\bigr|. \]
The Vapnik-Chervonenkis dimension $\mathrm{VCdim}(H)$ is the size of the largest set shattered by $H$:
\[ \mathrm{VCdim}(H) := \max\{m : \Pi_H(m) = 2^m\}. \]
An important result in the theory of learning Boolean functions is Sauer's Lemma (Sauer, 1972), of which we will also make use.

Lemma 9 (Sauer's Lemma). For a Boolean function class $H$ with $\mathrm{VCdim}(H) = d$,
\[ \Pi_H(m) \le \sum_{i=0}^d \binom{m}{i} \le \Bigl(\frac{em}{d}\Bigr)^d, \]
for all positive integers $m$.
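Before generalizing these quantities to $n$ tasks, here is a brute-force illustration of the growth function, shattering and the VC dimension for a toy class of one-dimensional threshold functions; the class and the points are invented purely for illustration. The $n$-task quantities defined next, $\Pi_{\mathbb H}(n,m)$ and $d_{\mathbb H}(n)$, can be enumerated in exactly the same way over matrices of points.

def restrictions(H, xs):
    # H_|x : the set of distinct sign vectors obtained by applying each h in H to xs.
    return {tuple(h(x) for x in xs) for h in H}

# Toy class: thresholds h_t(x) = 1 if x >= t, else -1, for a few fixed values of t.
H = [lambda x, t=t: 1 if x >= t else -1 for t in (0.5, 1.5, 2.5, 3.5)]

points = [1.0, 2.0, 3.0]
print(len(restrictions(H, points)))                        # |H_|x| = 4, well below 2**3
print(len(restrictions(H, points)) == 2 ** len(points))    # the triple is not shattered: False
print(len(restrictions(H, [2.0])) == 2 ** 1)               # a single point is shattered,
                                                           # so VCdim(H) >= 1 (in fact = 1)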
We now generalize these concepts to learning $n$ tasks with a Boolean hypothesis space family.

Definition 5. Let $\mathbb H$ be a Boolean hypothesis space family. Denote the $n \times m$ matrices over the input space $X$ by $X^{(n,m)}$. For each $x \in X^{(n,m)}$ and $H \in \mathbb H$, define $H_{|x}$ to be the set of (binary) matrices
\[ H_{|x} := \left\{ \begin{bmatrix} h_1(x_{11}) & \cdots & h_1(x_{1m}) \\ \vdots & & \vdots \\ h_n(x_{n1}) & \cdots & h_n(x_{nm}) \end{bmatrix} : h_1, \dots, h_n \in H \right\}. \]
Define $\mathbb H_{|x} := \bigcup_{H \in \mathbb H} H_{|x}$. Now for each $n > 0$, $m > 0$, define $\Pi_{\mathbb H}(n,m)$ by
\[ \Pi_{\mathbb H}(n,m) := \max_{x \in X^{(n,m)}} \bigl|\mathbb H_{|x}\bigr|. \]
Note that $\Pi_{\mathbb H}(n,m) \le 2^{nm}$. If $|\mathbb H_{|x}| = 2^{nm}$ we say $\mathbb H$ shatters the matrix $x$. For each $n > 0$ let
\[ d_{\mathbb H}(n) := \max\{m : \Pi_{\mathbb H}(n,m) = 2^{nm}\}. \]
Define $d(\mathbb H) := \mathrm{VCdim}(\mathbb H_1)$ and $\underline d(\mathbb H) := \max_{H \in \mathbb H}\mathrm{VCdim}(H)$.

Lemma 10.
\[ d(\mathbb H) \ge \underline d(\mathbb H), \qquad d_{\mathbb H}(n) \ge \max\Bigl\{\Bigl\lfloor \frac{d(\mathbb H)}{n}\Bigr\rfloor,\ \underline d(\mathbb H)\Bigr\} \ge \frac12\Bigl(\Bigl\lfloor \frac{d(\mathbb H)}{n}\Bigr\rfloor + \underline d(\mathbb H)\Bigr). \]
Proof. The first inequality is trivial from the definitions. To get the second term in the maximum in the second inequality, choose an $H \in \mathbb H$ with $\mathrm{VCdim}(H) = \underline d(\mathbb H)$ and construct a matrix $x \in X^{(n,m)}$ whose rows are of length $\underline d(\mathbb H)$ and are shattered by $H$. Then clearly $\mathbb H$ shatters $x$. For the first term in the maximum take a sequence $x = (x_1, \dots, x_{d(\mathbb H)})$ shattered by $\mathbb H_1$ (the hypothesis space consisting of the union over all hypothesis spaces from $\mathbb H$), and distribute its elements equally among the rows of $x$ (throw away any leftovers). The set of matrices
\[ \left\{ \begin{bmatrix} h(x_{11}) & \cdots & h(x_{1m}) \\ \vdots & & \vdots \\ h(x_{n1}) & \cdots & h(x_{nm}) \end{bmatrix} : h \in \mathbb H_1 \right\}, \]
where $m = \lfloor d(\mathbb H)/n\rfloor$, is a subset of $\mathbb H_{|x}$ and has size $2^{nm}$.

Lemma 11.
\[ \Pi_{\mathbb H}(n,m) \le \Bigl[\frac{em}{d_{\mathbb H}(n)}\Bigr]^{n\, d_{\mathbb H}(n)}. \]
Proof. Observe that for each $n$, $\Pi_{\mathbb H}(n,m) = \Pi_{\bar H}(nm)$, where $\bar H$ is the collection of all Boolean functions on sequences $x_1, \dots, x_{nm}$ obtained by first choosing $n$ functions $h_1, \dots, h_n$ from some $H \in \mathbb H$, and then applying $h_1$ to the first $m$ examples, $h_2$ to the second $m$ examples, and so on. By the definition of $d_{\mathbb H}(n)$, $\mathrm{VCdim}(\bar H) = n\, d_{\mathbb H}(n)$, hence the result follows from Lemma 9 applied to $\bar H$.

If one follows the proof of Theorem 4 (in particular the proof of Theorem 18 in Appendix A) then it is clear that for all $\varepsilon > 0$, $C(\varepsilon, \mathbb H^n_l)$ may be replaced by $\Pi_{\mathbb H}(n, 2m)$ in the Boolean case. Making this replacement in Theorem 18, and using the choices of $\alpha, \nu$ from the discussion following Theorem 26, we obtain the following bound on the probability of large deviation between empirical and true performance in this Boolean setting.

Theorem 12. Let $\mathbf P = (P_1, \dots, P_n)$ be $n$ probability distributions on $X \times \{\pm1\}$ and let $z$ be an $(n,m)$-sample generated by sampling $m$ times from $X \times \{\pm1\}$ according to each $P_i$. Let $\mathbb H = \{H\}$ be any permissible Boolean hypothesis space family. For all $0 < \varepsilon \le 1$,
\[ \Pr\bigl\{z : \exists\, h \in \mathbb H^n : \mathrm{er}_{\mathbf P}(h) \ge \hat{\mathrm{er}}_z(h) + \varepsilon\bigr\} \le 4\,\Pi_{\mathbb H}(n, 2m)\exp(-\varepsilon^2 nm/64). \qquad (46) \]

Corollary 13. Under the conditions of Theorem 12, if the number of examples $m$ of each task satisfies
\[ m \ge \frac{88}{\varepsilon^2}\Bigl[2 d_{\mathbb H}(n)\log\frac{22}{\varepsilon} + \frac1n\log\frac4\delta\Bigr], \qquad (47) \]
then with probability at least $1 - \delta$ (over the choice of $z$), any $h \in \mathbb H^n$ will satisfy
\[ \mathrm{er}_{\mathbf P}(h) \le \hat{\mathrm{er}}_z(h) + \varepsilon. \qquad (48) \]
Proof. Applying Theorem 12, we require
\[ 4\,\Pi_{\mathbb H}(n, 2m)\exp(-\varepsilon^2 nm/64) \le \delta, \]
which is satisfied if
\[ m \ge \frac{64}{\varepsilon^2}\Bigl[d_{\mathbb H}(n)\log\frac{2em}{d_{\mathbb H}(n)} + \frac1n\log\frac4\delta\Bigr], \qquad (49) \]
where we have used Lemma 11. Now, for all $a \ge 1$, if $m = (1 + 1/e)\,a\log\bigl((1 + 1/e)\,a\bigr)$, then $m \ge a\log m$. So setting $a = 64 d_{\mathbb H}(n)/\varepsilon^2$, (49) is satisfied if
\[ m \ge \frac{88}{\varepsilon^2}\Bigl[2 d_{\mathbb H}(n)\log\frac{22}{\varepsilon} + \frac1n\log\frac4\delta\Bigr]. \]

Corollary 13 shows that any algorithm learning $n$ tasks using the hypothesis space family $\mathbb H$ requires no more than
\[ m = O\Bigl(\frac1{\varepsilon^2}\Bigl[d_{\mathbb H}(n)\log\frac1\varepsilon + \frac1n\log\frac1\delta\Bigr]\Bigr) \qquad (50) \]
examples of each task to ensure that with high probability the average true error of any $n$ hypotheses it selects from $\mathbb H^n$ is within $\varepsilon$ of their average empirical error on the sample $z$. We now give a theorem showing that if the learning algorithm is required to produce $n$ hypotheses whose average true error is within $\varepsilon$ of the best possible error (achievable using $\mathbb H^n$) for an arbitrary sequence of distributions $P_1, \dots, P_n$, then within a $\log(1/\varepsilon)$ factor the number of examples in equation (50) is also necessary.

For any sequence $\mathbf P = (P_1, \dots, P_n)$ of $n$ probability distributions on $X \times \{\pm1\}$, define $\mathrm{opt}_{\mathbf P}(\mathbb H^n)$ by
\[ \mathrm{opt}_{\mathbf P}(\mathbb H^n) := \inf_{h \in \mathbb H^n}\mathrm{er}_{\mathbf P}(h). \]

Theorem 14. Let $\mathbb H$ be a Boolean hypothesis space family such that $\mathbb H_1$ contains at least two functions.
For each $n = 1, 2, \dots$, let $A_n$ be any learning algorithm taking as input $(n,m)$-samples $z \in (X \times \{\pm1\})^{(n,m)}$ and producing as output $n$ hypotheses $h = (h_1, \dots, h_n) \in \mathbb H^n$. For all $0 < \varepsilon < 1/64$ and $0 < \delta < 1/64$, if
\[ m < \frac1{\varepsilon^2}\Bigl[\frac{d_{\mathbb H}(n)}{616} + (1 - \varepsilon^2)\frac1n\log\frac1{8\delta(1 - 2\delta)}\Bigr], \]
then there exist distributions $\mathbf P = (P_1, \dots, P_n)$ such that with probability at least $\delta$ (over the random choice of $z$),
\[ \mathrm{er}_{\mathbf P}(A_n(z)) > \mathrm{opt}_{\mathbf P}(\mathbb H^n) + \varepsilon. \]
Proof. See Appendix C.

3.4.2 LINEAR THRESHOLD NETWORKS

Corollary 13 and Theorem 14 show that within constants and a $\log(1/\varepsilon)$ factor, the sample complexity of learning $n$ tasks using the Boolean hypothesis space family $\mathbb H$ is controlled by the complexity parameter $d_{\mathbb H}(n)$. In this section we derive bounds on $d_{\mathbb H}(n)$ for hypothesis space families constructed as thresholded linear combinations of Boolean feature maps. Specifically, we assume $\mathbb H$ is of the form given by (39), (38), (37) and (36), where now the squashing function $\sigma$ is replaced with a hard threshold:
\[ \sigma(x) := \begin{cases} 1 & \text{if } x \ge 0, \\ -1 & \text{otherwise}, \end{cases} \]
and we don't restrict the range of the feature and output layer weights. Note that in this case the proof of Theorem 8 does not carry through because the constants $\kappa, \kappa'$ in Theorem 7 depend on the Lipschitz bound on $\sigma$.

Theorem 15. Let $\mathbb H$ be a hypothesis space family of the form given in (39), (38), (37) and (36), with a hard threshold sigmoid function $\sigma$. Recall that the parameters $d$, $l$ and $k$ are the input dimension, number of hidden nodes in the feature map and number of features (output nodes in the feature map) respectively. Let $W := l(d+1) + k(l+1)$ (the number of adjustable parameters in the feature map). Then,
\[ d_{\mathbb H}(n) \le 2\Bigl(\frac Wn + k + 1\Bigr)\log_2\bigl(2e(k + l + 1)\bigr). \]
Proof. Recall that for each $w \in \mathbb R^W$, $\Phi_w: \mathbb R^d \to \mathbb R^k$ denotes the feature map with parameters $w$. For each $x \in X^{(n,m)}$, let $\Phi_{w|x}$ denote the matrix
\[ \begin{bmatrix} \Phi_w(x_{11}) & \cdots & \Phi_w(x_{1m}) \\ \vdots & & \vdots \\ \Phi_w(x_{n1}) & \cdots & \Phi_w(x_{nm}) \end{bmatrix}. \]
Note that $\mathbb H_{|x}$ is the set of all binary $n \times m$ matrices obtainable by composing thresholded linear functions with the elements of $\Phi_{w|x}$, with the restriction that the same function must be applied to each element in a row (but the functions may differ between rows). With a slight abuse of notation, define
\[ \Pi_{\Phi}(n,m) := \max_{x \in X^{(n,m)}} \bigl|\{\Phi_{w|x} : w \in \mathbb R^W\}\bigr|. \]
Fix $x \in X^{(n,m)}$. By Sauer's Lemma, each node in the first hidden layer of the feature map computes at most $(emn/(d+1))^{d+1}$ functions on the $nm$ input vectors in $x$. Thus, there can be at most $(emn/(d+1))^{l(d+1)}$ distinct functions from the input to the output of the first hidden layer on the $nm$ points in $x$. Fixing the first hidden layer parameters, each node in the second layer of the feature map computes at most $(emn/(l+1))^{l+1}$ functions on the image of $x$ produced at the output of the first hidden layer. Thus the second hidden layer computes no more than $(emn/(l+1))^{k(l+1)}$ functions on the output of the first hidden layer on the $nm$ points in $x$. So, in total,
\[ \Pi_{\Phi}(n,m) \le \Bigl(\frac{emn}{d+1}\Bigr)^{l(d+1)}\Bigl(\frac{emn}{l+1}\Bigr)^{k(l+1)}. \]
Now, for each possible matrix $\Phi_{w|x}$, the number of functions computable on each row of $\Phi_{w|x}$ by a thresholded linear combination of the output of the feature map is at most $(em/(k+1))^{k+1}$. Hence, the number of binary sign assignments obtainable by applying linear threshold functions to all the rows is at most $(em/(k+1))^{n(k+1)}$.
Thus,
\[ \Pi_{\mathbb H}(n,m) \le \Bigl(\frac{emn}{d+1}\Bigr)^{l(d+1)}\Bigl(\frac{emn}{l+1}\Bigr)^{k(l+1)}\Bigl(\frac{emn}{n(k+1)}\Bigr)^{n(k+1)}. \]
$f(x) := x\log x$ is a convex function, hence for all $a, b, c > 0$,
\[ f\Bigl(\frac{ka + lb + c}{k + l + 1}\Bigr) \le \frac1{k + l + 1}\bigl(k f(a) + l f(b) + f(c)\bigr), \]
or equivalently
\[ \Bigl(\frac{k + l + 1}{ka + lb + c}\Bigr)^{ka + lb + c} \ge \Bigl(\frac1a\Bigr)^{ka}\Bigl(\frac1b\Bigr)^{lb}\Bigl(\frac1c\Bigr)^{c}. \]
Substituting $a = l + 1$, $b = d + 1$ and $c = n(k+1)$ shows that
\[ \Pi_{\mathbb H}(n,m) \le \Bigl(\frac{emn(k + l + 1)}{W + n(k+1)}\Bigr)^{W + n(k+1)}. \qquad (51) \]
Hence, if
\[ m > \Bigl(\frac Wn + k + 1\Bigr)\log_2\Bigl(\frac{emn(k + l + 1)}{W + n(k+1)}\Bigr), \qquad (52) \]
then $\Pi_{\mathbb H}(n,m) < 2^{nm}$ and so by definition $d_{\mathbb H}(n) \le m$. For all $a > 1$, observe that $x > a\log_2 x$ if $x = 2a\log_2 2a$. Setting $x = emn(k + l + 1)/(W + n(k+1))$ and $a = e(k + l + 1)$ shows that (52) is satisfied if $m = 2(W/n + k + 1)\log_2(2e(k + l + 1))$.

Theorem 16. Let $\mathbb H$ be as in Theorem 15 with the following extra restrictions: $d \ge 3$, $l \ge k$ and $k \le d$. Then
\[ d_{\mathbb H}(n) \ge \frac12\Bigl(\Bigl\lfloor\frac{W}{2n}\Bigr\rfloor + k + 1\Bigr). \]
Proof. We bound $d(\mathbb H)$ and $\underline d(\mathbb H)$ and then apply Lemma 10. In the present setting $\mathbb H_1$ contains all three-layer linear-threshold networks with $d$ input nodes, $l$ hidden nodes in the first hidden layer, $k$ hidden nodes in the second hidden layer and one output node. From Theorem 13 in Bartlett (1993), we have
\[ \mathrm{VCdim}(\mathbb H_1) \ge dl + \frac{l(k-1)}2 + 1, \]
which under the restrictions stated above is greater than $W/2$. Hence $d(\mathbb H) \ge W/2$. As $k \le d$ and $l \ge k$ we can choose a feature weight assignment so that the feature map is the identity on $k$ components of the input vector and insensitive to the setting of the remaining $d - k$ components. Hence we can generate $k + 1$ points in $X$ whose image under the feature map is shattered by the linear threshold output node, and so $\underline d(\mathbb H) = k + 1$.

Combining Theorem 15 with Corollary 13 shows that
\[ m \ge O\Bigl(\frac1{\varepsilon^2}\Bigl[\Bigl(\frac Wn + k + 1\Bigr)\log\frac1\varepsilon + \frac1n\log\frac1\delta\Bigr]\Bigr) \]
examples of each task suffice when learning $n$ tasks using a linear threshold hypothesis space family, while combining Theorem 16 with Theorem 14 shows that if
\[ m \le \Omega\Bigl(\frac1{\varepsilon^2}\Bigl[\frac Wn + k + 1 + \frac1n\log\frac1\delta\Bigr]\Bigr), \]
then any learning algorithm will fail on some set of $n$ tasks.

4. Conclusion

The problem of inductive bias is one that has broad significance in machine learning. In this paper we have introduced a formal model of inductive bias learning that applies when the learner is able to sample from multiple related tasks. We proved that provided certain covering numbers computed from the set of all hypothesis spaces available to the bias learner are finite, any hypothesis space that contains good solutions to sufficiently many training tasks is likely to contain good solutions to novel tasks drawn from the same environment. In the specific case of learning a set of features, we showed that the number of examples $m$ required of each task in an $n$-task training set obeys $m = O(k + W/n)$, where $k$ is the number of features and $W$ is a measure of the complexity of the feature class. We showed that this bound is essentially tight for Boolean feature maps constructed from linear threshold networks. In addition, we proved that the number of tasks required to ensure good performance from the features on novel tasks is no more than $O(W)$. We also showed how a good set of features may be found by gradient descent.

The model of this paper represents a first step towards a formal model of hierarchical approaches to learning. By modelling a learner's uncertainty concerning its environment in probabilistic terms, we have shown how learning can occur simultaneously at both the base level (learn the tasks at hand) and at the meta-level (learn bias that can be transferred to novel tasks).
From a technical perspective, it is the assumption that tasks are distributed probabilistically that allows the performance guarantees to be proved. From a practical perspective, there are many problem domains that can be viewed as probabilistically distributed sets of related tasks. For example, speech recognition may be decomposed along many different axes: words, speakers, accents, etc. Face recognition represents a potentially infinite domain of related tasks. Medical diagnosis and prognosis problems using the same pathology tests are yet another example. All of these domains should benefit from being tackled with a bias learning approach.

Natural avenues for further enquiry include:

* Alternative constructions for $\mathbb H$. Although widely applicable, the specific example of feature learning via gradient descent represents just one possible way of generating and searching the hypothesis space family $\mathbb H$. It would be interesting to investigate alternative methods, including decision tree approaches, approaches from Inductive Logic Programming (Khan et al., 1998), and whether more general learning techniques such as boosting can be applied in a bias learning setting.

* Algorithms for automatically determining the hypothesis space family $\mathbb H$. In our model the structure of $\mathbb H$ is fixed a priori and represents the hyper-bias of the bias learner. It would be interesting to see to what extent this structure can also be learnt.

* Algorithms for automatically determining task relatedness. In ordinary learning there is usually little doubt whether an individual example belongs to the same learning task or not. The analogous question in bias learning is whether an individual learning task belongs to a given set of related tasks, which, in contrast to ordinary learning, does not always have such a clear-cut answer. For most of the examples we have discussed here, such as speech and face recognition, the task-relatedness is not in question, but in other cases such as medical problems it is not so clear. Grouping too large a subset of tasks together as related tasks could clearly have a detrimental impact on bias-learning or multi-task learning, and there is empirical evidence to support this (Caruana, 1997). Thus, algorithms for automatically determining task-relatedness are a potentially useful avenue for further research. In this context, see Silver and Mercer (1996), Thrun and O'Sullivan (1996). Note that the question of task relatedness is clearly only meaningful relative to a particular hypothesis space family $\mathbb H$ (for example, all possible collections of tasks are related if $\mathbb H$ contains every possible hypothesis space).

* Extended hierarchies. For an extension of our two-level approach to arbitrarily deep hierarchies, see Langford (1999). An interesting further question is to what extent the hierarchy can be inferred from data. This is somewhat related to the question of automatic induction of structure in graphical models.

Acknowledgements

This work was supported at various times by an Australian Postgraduate Award, a Shell Australia Postgraduate Fellowship, U.K. Engineering and Physical Sciences Research Council grants K70366 and K70373, and an Australian Postdoctoral Fellowship. Along the way, many people have contributed helpful comments and suggestions for improvement, including Martin Anthony, Peter Bartlett, Rich Caruana, John Langford, Stuart Russell, John Shawe-Taylor, Sebastian Thrun and several anonymous referees.
Appendix A. Uniform Convergence Results

Theorem 2 provides a bound (uniform over all $H \in \mathbb H$) on the probability of large deviation between $\mathrm{er}_Q(H)$ and $\hat{\mathrm{er}}_z(H)$. To obtain a more general result, we follow Haussler (1992) and introduce the following parameterized class of metrics on $\mathbb R^+$:
\[ d_\nu[x, y] := \frac{|x - y|}{x + y + \nu}, \]
where $\nu > 0$. Our main theorem will be a uniform bound on the probability of large values of $d_\nu[\mathrm{er}_Q(H), \hat{\mathrm{er}}_z(H)]$, rather than $|\mathrm{er}_Q(H) - \hat{\mathrm{er}}_z(H)|$. Theorem 2 will then follow as a corollary, as will better bounds for the realizable case $\hat{\mathrm{er}}_z(H) = 0$ (Appendix A.3).

Lemma 17. The following three properties of $d_\nu$ are easily established:
1. For all $r, s \ge 0$, $0 \le d_\nu[r, s] \le 1$.
2. For all $0 \le r \le s \le t$, $d_\nu[r, s] \le d_\nu[r, t]$ and $d_\nu[s, t] \le d_\nu[r, t]$.
3. For $0 \le r, s \le 1$, $\frac{|r - s|}{\nu + 2} \le d_\nu[r, s] \le \frac{|r - s|}{\nu}$.

For ease of exposition we have up until now been dealing explicitly with hypothesis spaces $H$ containing functions $h: X \to Y$, and then constructing loss functions $h_l$ mapping $X \times Y \to [0,1]$ by $h_l(x, y) := l(h(x), y)$ for some loss function $l: Y \times Y \to [0,1]$. However, in general we can view $h_l$ just as a function from an abstract set $Z$ ($= X \times Y$) to $[0,1]$ and ignore its particular construction in terms of the loss function $l$. So for the remainder of this section, unless otherwise stated, all hypothesis spaces $H$ will be sets of functions mapping $Z$ to $[0,1]$. It will also be considerably more convenient to transpose our notation for $(n,m)$-samples, writing the $n$ training sets as columns instead of rows:
\[ z = \begin{bmatrix} z_{11} & \cdots & z_{1n} \\ \vdots & & \vdots \\ z_{m1} & \cdots & z_{mn} \end{bmatrix}, \]
where each $z_{ij} \in Z$. Recalling the definition of $(X \times Y)^{(n,m)}$ (Equation 9 and the discussion preceding it), with this transposition $z$ lives in $(X \times Y)^{(m,n)}$. The following definition now generalizes quantities like $\hat{\mathrm{er}}_z(H)$, $\mathrm{er}_{\mathbf P}(H)$ and so on to this new setting.

Definition 6. Let $H_1, \dots, H_n$ be $n$ sets of functions mapping $Z$ into $[0,1]$. For any $h_1 \in H_1, \dots, h_n \in H_n$, let $h_1 \oplus \cdots \oplus h_n$, or simply $h$, denote the map
\[ h(\vec z\,) = \frac1n\sum_{i=1}^n h_i(z_i) \]
for all $\vec z = (z_1, \dots, z_n) \in Z^n$. Let $H_1 \oplus \cdots \oplus H_n$ denote the set of all such functions. Given $h \in H_1 \oplus \cdots \oplus H_n$ and $m$ elements of $(X \times Y)^n$, $(\vec z_1, \dots, \vec z_m)$ (or equivalently an element $z$ of $(X \times Y)^{(m,n)}$, obtained by writing the $\vec z_i$ as rows), define
\[ \hat{\mathrm{er}}_z(h) := \frac1m\sum_{i=1}^m h(\vec z_i) \]
(recall equation (8)). Similarly, for any product probability measure $\mathbf P = P_1 \times \cdots \times P_n$ on $(X \times Y)^n$, define
\[ \mathrm{er}_{\mathbf P}(h) := \int_{Z^n} h(\vec z\,)\,d\mathbf P(\vec z\,) \]
(recall equation (26)). For any $h, h': (X \times Y)^n \to [0,1]$ (not necessarily of the form $h_1 \oplus \cdots \oplus h_n$), define
\[ d_{\mathbf P}(h, h') := \int_{Z^n} |h(\vec z\,) - h'(\vec z\,)|\,d\mathbf P(\vec z\,) \]
(recall equation (17)). For any class of functions $H$ mapping $(X \times Y)^n$ to $[0,1]$, define
\[ C(\varepsilon, H) := \sup_{\mathbf P} N(\varepsilon, H, d_{\mathbf P}), \]
where the supremum is over all product probability measures on $(X \times Y)^n$ and $N(\varepsilon, H, d_{\mathbf P})$ is the size of the smallest $\varepsilon$-cover of $H$ under $d_{\mathbf P}$ (recall Definition 4).
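As a quick numerical sanity check of the $d_\nu$ metric introduced at the start of this appendix, the short script below verifies property 3 of Lemma 17 on randomly drawn pairs $(r, s)$ in $[0,1]$; the value $\nu = 0.5$ and the number of trials are arbitrary illustrative choices.

import random

def d_nu(x, y, nu):
    return abs(x - y) / (x + y + nu)

random.seed(0)
nu = 0.5
for _ in range(1000):
    r, s = random.random(), random.random()
    # Lemma 17, property 3: |r-s|/(nu+2) <= d_nu[r,s] <= |r-s|/nu
    assert abs(r - s) / (nu + 2) <= d_nu(r, s, nu) <= abs(r - s) / nu
print("property 3 of Lemma 17 held on all sampled pairs")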
The following theorem is the main result from which the rest of the uniform convergence results in this paper are derived.

Theorem 18. Let $H \subseteq H_1 \oplus \cdots \oplus H_n$ be a permissible class of functions mapping $(X \times Y)^n$ into $[0,1]$. Let $z \in (X \times Y)^{(m,n)}$ be generated by $m \ge 2/(\alpha^2\nu)$ independent trials from $(X \times Y)^n$ according to some product probability measure $\mathbf P = P_1 \times \cdots \times P_n$. For all $\nu > 0$ and $0 < \alpha < 1$,
\[ \Pr\Bigl\{z \in (X \times Y)^{(m,n)} : \sup_H d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] > \alpha\Bigr\} \le 4\,C(\alpha\nu/8, H)\exp(-\alpha^2\nu nm/8). \qquad (53) \]
The following immediate corollary will also be of use later.

Corollary 19. Under the same conditions as Theorem 18, if
\[ m \ge \max\Bigl\{\frac8{\alpha^2\nu n}\log\frac{4C(\alpha\nu/8, H)}\delta,\ \frac2{\alpha^2\nu}\Bigr\}, \qquad (54) \]
then
\[ \Pr\Bigl\{z \in (X \times Y)^{(m,n)} : \sup_H d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] > \alpha\Bigr\} \le \delta. \qquad (55) \]

A.1 Proof of Theorem 18

The proof is via a double symmetrization argument of the kind given in chapter 2 of Pollard (1984). I have also borrowed some ideas from the proof of Theorem 3 in Haussler (1992).

A.1.1 FIRST SYMMETRIZATION

An extra piece of notation: for all $z \in (X \times Y)^{(2m,n)}$, let $z^{(1)}$ be the top half of $z$ and $z^{(2)}$ be the bottom half, viz:
\[ z^{(1)} = \begin{bmatrix} z_{11} & \cdots & z_{1n} \\ \vdots & & \vdots \\ z_{m1} & \cdots & z_{mn} \end{bmatrix}, \qquad z^{(2)} = \begin{bmatrix} z_{m+1,1} & \cdots & z_{m+1,n} \\ \vdots & & \vdots \\ z_{2m,1} & \cdots & z_{2m,n} \end{bmatrix}. \]
The following lemma is the first "symmetrization trick." We relate the probability of large deviation between an empirical estimate of the loss and the true loss to the probability of large deviation between two independent empirical estimates of the loss.

Lemma 20. Let $H$ be a permissible set of functions from $(X \times Y)^n$ into $[0,1]$ and let $\mathbf P$ be a probability measure on $(X \times Y)^n$. For all $\nu > 0$, $0 < \alpha < 1$ and $m \ge 2/(\alpha^2\nu)$,
\[ \Pr\Bigl\{z \in (X \times Y)^{(m,n)} : \sup_H d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] > \alpha\Bigr\} \le 2\Pr\Bigl\{z \in Z^{(2m,n)} : \sup_H d_\nu\bigl[\hat{\mathrm{er}}_{z^{(1)}}(h), \hat{\mathrm{er}}_{z^{(2)}}(h)\bigr] > \frac\alpha2\Bigr\}. \qquad (56) \]
Proof. Note first that permissibility of $H$ guarantees the measurability of suprema over $H$ (Lemma 32 part 5). By the triangle inequality for $d_\nu$, if $d_\nu[\hat{\mathrm{er}}_{z^{(1)}}(h), \mathrm{er}_{\mathbf P}(h)] > \alpha$ and $d_\nu[\hat{\mathrm{er}}_{z^{(2)}}(h), \mathrm{er}_{\mathbf P}(h)] < \alpha/2$, then $d_\nu[\hat{\mathrm{er}}_{z^{(1)}}(h), \hat{\mathrm{er}}_{z^{(2)}}(h)] > \alpha/2$. Thus,
\[ \Pr\Bigl\{z \in (X \times Y)^{(2m,n)} : \exists h \in H : d_\nu\bigl[\hat{\mathrm{er}}_{z^{(1)}}(h), \hat{\mathrm{er}}_{z^{(2)}}(h)\bigr] > \frac\alpha2\Bigr\} \ge \Pr\Bigl\{z \in (X \times Y)^{(2m,n)} : \exists h \in H : d_\nu\bigl[\hat{\mathrm{er}}_{z^{(1)}}(h), \mathrm{er}_{\mathbf P}(h)\bigr] > \alpha \text{ and } d_\nu\bigl[\hat{\mathrm{er}}_{z^{(2)}}(h), \mathrm{er}_{\mathbf P}(h)\bigr] < \frac\alpha2\Bigr\}. \qquad (57) \]
By Chebyshev's inequality, for any fixed $h$,
\[ \Pr\Bigl\{z \in (X \times Y)^{(m,n)} : d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] < \frac\alpha2\Bigr\} \ge \Pr\Bigl\{z \in (X \times Y)^{(m,n)} : \frac{|\hat{\mathrm{er}}_z(h) - \mathrm{er}_{\mathbf P}(h)|}\nu < \frac\alpha2\Bigr\} \ge 1 - \frac{\mathrm{er}_{\mathbf P}(h)\bigl(1 - \mathrm{er}_{\mathbf P}(h)\bigr)}{m\nu\alpha^2/4} \ge \frac12, \]
since $m \ge 2/(\alpha^2\nu)$ and $\mathrm{er}_{\mathbf P}(h) \le 1$. Substituting this last expression into the right hand side of (57) gives the result.

A.1.2 SECOND SYMMETRIZATION

The second symmetrization trick bounds the probability of large deviation between two empirical estimates of the loss (i.e. the right hand side of (56)) by computing the probability of large deviation when elements are randomly permuted between the first and second sample. The following definition introduces the appropriate permutation group for this purpose.

Definition 7. For all integers $m, n \ge 1$, let $\Gamma_{(2m,n)}$ denote the set of all permutations $\sigma$ of the sequence of pairs of integers $\{(1,1), \dots, (1,n), \dots, (2m,1), \dots, (2m,n)\}$ such that for all $i$, $1 \le i \le m$, and each $j$, either $\sigma(i,j) = (m+i, j)$ and $\sigma(m+i, j) = (i, j)$, or $\sigma(i,j) = (i,j)$ and $\sigma(m+i,j) = (m+i,j)$. For any $z \in (X \times Y)^{(2m,n)}$ and any $\sigma \in \Gamma_{(2m,n)}$, let
\[ z_\sigma := \begin{bmatrix} z_{\sigma(1,1)} & \cdots & z_{\sigma(1,n)} \\ \vdots & & \vdots \\ z_{\sigma(2m,1)} & \cdots & z_{\sigma(2m,n)} \end{bmatrix}. \]

Lemma 21. Let $H = H_1 \oplus \cdots \oplus H_n$ be a permissible set of functions mapping $(X \times Y)^n$ into $[0,1]$ (as in the statement of Theorem 18).
Fix $z \in (X \times Y)^{(2m,n)}$ and let $\hat H := \{f_1, \dots, f_M\}$ be an $\alpha\nu/8$-cover for $(H, d_z)$, where $d_z(h, h') := \frac1{2m}\sum_{i=1}^{2m}|h(\vec z_i) - h'(\vec z_i)|$ and the $\vec z_i$ are the rows of $z$. Then,
\[ \Pr\Bigl\{\sigma \in \Gamma_{(2m,n)} : \sup_H d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(h)\bigr] > \frac\alpha2\Bigr\} \le \sum_{i=1}^M \Pr\Bigl\{\sigma \in \Gamma_{(2m,n)} : d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(f_i), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f_i)\bigr] > \frac\alpha4\Bigr\}, \qquad (58) \]
where each $\sigma \in \Gamma_{(2m,n)}$ is chosen uniformly at random.

Proof. Fix $\sigma \in \Gamma_{(2m,n)}$ and let $h \in H$ be such that $d_\nu[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(h)] > \alpha/2$ (if there is no such $h$ for any $\sigma$ we are already done). Choose $f \in \hat H$ such that $d_z(h, f) \le \alpha\nu/8$. Without loss of generality we can assume $f$ is of the form $f = f_1 \oplus \cdots \oplus f_n$. Now,
\[ \frac2\nu d_z(h, f) = \frac{\sum_{i=1}^{2m}\bigl|\sum_{j=1}^n h_j(z_{ij}) - f_j(z_{ij})\bigr|}{\nu mn} = \frac{\sum_{i=1}^{2m}\bigl|\sum_{j=1}^n h_j(z_{\sigma(i,j)}) - f_j(z_{\sigma(i,j)})\bigr|}{\nu mn} \]
\[ \ge \frac{\bigl|\sum_{i=1}^{m}\sum_{j=1}^n \bigl[h_j(z_{\sigma(i,j)}) - f_j(z_{\sigma(i,j)})\bigr]\bigr|}{\nu mn + \sum_{i=1}^{m}\sum_{j=1}^n \bigl[h_j(z_{\sigma(i,j)}) + f_j(z_{\sigma(i,j)})\bigr]} + \frac{\bigl|\sum_{i=m+1}^{2m}\sum_{j=1}^n \bigl[h_j(z_{\sigma(i,j)}) - f_j(z_{\sigma(i,j)})\bigr]\bigr|}{\nu mn + \sum_{i=m+1}^{2m}\sum_{j=1}^n \bigl[h_j(z_{\sigma(i,j)}) + f_j(z_{\sigma(i,j)})\bigr]} \]
\[ = d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(1)}}(f)\bigr] + d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(2)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f)\bigr]. \]
Hence, by the triangle inequality for $d_\nu$,
\[ \frac2\nu d_z(h, f) + d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(f), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f)\bigr] \ge d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(1)}}(f)\bigr] + d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(2)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f)\bigr] + d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(f), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f)\bigr] \ge d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(h)\bigr]. \qquad (59) \]
But $\frac2\nu d_z(h, f) \le \alpha/4$ by construction and $d_\nu[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(h)] > \alpha/2$ by assumption, so (59) implies $d_\nu[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(f), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f)] > \alpha/4$. Thus,
\[ \Bigl\{\sigma \in \Gamma_{(2m,n)} : \exists h \in H : d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(h)\bigr] > \frac\alpha2\Bigr\} \subseteq \Bigl\{\sigma \in \Gamma_{(2m,n)} : \exists f \in \hat H : d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(f), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f)\bigr] > \frac\alpha4\Bigr\}, \]
which gives (58).

Now we bound the probability of each term in the right hand side of (58).

Lemma 22. Let $f: (X \times Y)^n \to [0,1]$ be any function that can be written in the form $f = f_1 \oplus \cdots \oplus f_n$. For any $z \in (X \times Y)^{(2m,n)}$,
\[ \Pr\Bigl\{\sigma \in \Gamma_{(2m,n)} : d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(f), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f)\bigr] > \frac\alpha4\Bigr\} \le 2\exp\Bigl(-\frac{\alpha^2\nu mn}8\Bigr), \qquad (60) \]
where each $\sigma \in \Gamma_{(2m,n)}$ is chosen uniformly at random.

Proof. For any $\sigma \in \Gamma_{(2m,n)}$,
\[ d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(f), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f)\bigr] = \frac{\bigl|\sum_{i=1}^m\sum_{j=1}^n\bigl[f_j(z_{\sigma(i,j)}) - f_j(z_{\sigma(m+i,j)})\bigr]\bigr|}{\nu mn + \sum_{i=1}^{2m}\sum_{j=1}^n f_j(z_{ij})}. \qquad (61) \]
To simplify the notation, denote $f_j(z_{ij})$ by $\beta_{ij}$. For each pair $ij$, $1 \le i \le m$, $1 \le j \le n$, let $Y_{ij}$ be an independent random variable such that $Y_{ij} = \beta_{ij} - \beta_{m+i,j}$ with probability $1/2$ and $Y_{ij} = \beta_{m+i,j} - \beta_{ij}$ with probability $1/2$.
From (61),
\[ \Pr\Bigl\{\sigma \in \Gamma_{(2m,n)} : d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(f), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f)\bigr] > \frac\alpha4\Bigr\} = \Pr\Bigl\{\sigma \in \Gamma_{(2m,n)} : \Bigl|\sum_{i=1}^m\sum_{j=1}^n\bigl[f_j(z_{\sigma(i,j)}) - f_j(z_{\sigma(m+i,j)})\bigr]\Bigr| > \frac\alpha4\Bigl(\nu mn + \sum_{i=1}^{2m}\sum_{j=1}^n\beta_{ij}\Bigr)\Bigr\} = \Pr\Bigl\{\Bigl|\sum_{i=1}^m\sum_{j=1}^n Y_{ij}\Bigr| > \frac\alpha4\Bigl(\nu mn + \sum_{i=1}^{2m}\sum_{j=1}^n\beta_{ij}\Bigr)\Bigr\}. \]
For zero-mean independent random variables $Y_1, \dots, Y_k$ with bounded ranges $a_i \le Y_i \le b_i$, Hoeffding's inequality (Devroye, Gyorfi, & Lugosi, 1996) is
\[ \Pr\Bigl\{\Bigl|\sum_{i=1}^k Y_i\Bigr| \ge \eta\Bigr\} \le 2\exp\Bigl(-\frac{2\eta^2}{\sum_{i=1}^k(b_i - a_i)^2}\Bigr). \]
Noting that the range of each $Y_{ij}$ is $[-|\beta_{ij} - \beta_{i+m,j}|,\ |\beta_{ij} - \beta_{i+m,j}|]$, we have
\[ \Pr\Bigl\{\Bigl|\sum_{i=1}^m\sum_{j=1}^n Y_{ij}\Bigr| > \frac\alpha4\Bigl(\nu mn + \sum_{i=1}^{2m}\sum_{j=1}^n\beta_{ij}\Bigr)\Bigr\} \le 2\exp\Biggl(-\frac{\alpha^2\bigl[\nu mn + \sum_{i=1}^{2m}\sum_{j=1}^n\beta_{ij}\bigr]^2}{32\sum_{i=1}^m\sum_{j=1}^n(\beta_{ij} - \beta_{i+m,j})^2}\Biggr). \]
Let $\gamma = \sum_{i=1}^{2m}\sum_{j=1}^n\beta_{ij}$. As $0 \le \beta_{ij} \le 1$, $\sum_{i=1}^m\sum_{j=1}^n(\beta_{ij} - \beta_{m+i,j})^2 \le \gamma$. Hence,
\[ 2\exp\Biggl(-\frac{\alpha^2\bigl[\nu mn + \sum_{i=1}^{2m}\sum_{j=1}^n\beta_{ij}\bigr]^2}{32\sum_{i=1}^m\sum_{j=1}^n(\beta_{ij} - \beta_{i+m,j})^2}\Biggr) \le 2\exp\Bigl(-\frac{\alpha^2(\nu mn + \gamma)^2}{32\gamma}\Bigr). \]
$(\nu mn + \gamma)^2/\gamma$ is minimized by setting $\gamma = \nu mn$, giving a value of $4\nu mn$. Hence
\[ \Pr\Bigl\{\sigma \in \Gamma_{(2m,n)} : d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(f), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(f)\bigr] > \frac\alpha4\Bigr\} \le 2\exp\Bigl(-\frac{\alpha^2\nu mn}8\Bigr), \]
as required.

A.1.3 PUTTING IT TOGETHER

For fixed $z \in (X \times Y)^{(2m,n)}$, Lemmas 21 and 22 give:
\[ \Pr\Bigl\{\sigma \in \Gamma_{(2m,n)} : \sup_H d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(h)\bigr] > \frac\alpha2\Bigr\} \le 2 N(\alpha\nu/8, H, d_z)\exp\Bigl(-\frac{\alpha^2\nu mn}8\Bigr). \]
Note that $d_z$ is simply $d_{\mathbf P}$ where $\mathbf P = (P_1, \dots, P_n)$ and each $P_i$ is the empirical distribution that puts point mass $1/(2m)$ on each $z_{ji}$, $j = 1, \dots, 2m$ (recall Definition 3). Hence,
\[ \Pr\Bigl\{\sigma \in \Gamma_{(2m,n)},\ z \in (X \times Y)^{(2m,n)} : \sup_H d_\nu\bigl[\hat{\mathrm{er}}_{z_\sigma^{(1)}}(h), \hat{\mathrm{er}}_{z_\sigma^{(2)}}(h)\bigr] > \frac\alpha2\Bigr\} \le 2 C(\alpha\nu/8, H)\exp\Bigl(-\frac{\alpha^2\nu mn}8\Bigr). \]
Now, for a random choice of $z$, each $z_{ij}$ in $z$ is independently (but not identically) distributed and $\sigma$ only ever swaps $z_{ij}$ and $z_{i+m,j}$ (so that $\sigma$ swaps a $z_{ij}$ drawn according to $P_j$ with another component drawn according to the same distribution). Thus we can integrate out with respect to the choice of $\sigma$ and write
\[ \Pr\Bigl\{z \in (X \times Y)^{(2m,n)} : \sup_H d_\nu\bigl[\hat{\mathrm{er}}_{z^{(1)}}(h), \hat{\mathrm{er}}_{z^{(2)}}(h)\bigr] > \frac\alpha2\Bigr\} \le 2 C(\alpha\nu/8, H)\exp\Bigl(-\frac{\alpha^2\nu mn}8\Bigr). \]
Applying Lemma 20 to this expression gives Theorem 18.

A.2 Proof of Theorem 2

Another piece of notation is required for the proof. For any hypothesis space $H$ and any sequence of probability measures $\mathbf P = (P_1, \dots, P_n)$ on $Z$, let
\[ \hat{\mathrm{er}}_{\mathbf P}(H) := \frac1n\sum_{i=1}^n\inf_{h\in H}\mathrm{er}_{P_i}(h). \]
Note that we have used $\hat{\mathrm{er}}_{\mathbf P}(H)$ rather than $\mathrm{er}_{\mathbf P}(H)$ to indicate that $\hat{\mathrm{er}}_{\mathbf P}(H)$ is another empirical estimate of $\mathrm{er}_Q(H)$. With the $(n,m)$-sampling process, in addition to the sample $z$ there is also generated a sequence of probability measures $\mathbf P = (P_1, \dots, P_n)$, although these are not supplied to the learner. This notion is used in the following Lemma, where $\Pr\{(z, \mathbf P) \in (X \times Y)^{(n,m)} \times \mathcal P^n : A\}$ means "the probability of generating a sequence of measures $\mathbf P$ from the environment $(\mathcal P, Q)$ and then an $(n,m)$-sample $z$ according to $\mathbf P$ such that $A$ holds".

Lemma 23.
If
\[ \Pr\Bigl\{(z, \mathbf P) \in (X \times Y)^{(n,m)} \times \mathcal P^n : \sup_{\mathbb H} d_\nu[\hat{\mathrm{er}}_z(H), \hat{\mathrm{er}}_{\mathbf P}(H)] > \frac\alpha2\Bigr\} \le \frac\delta2 \qquad (62) \]
and
\[ \Pr\Bigl\{\mathbf P \in \mathcal P^n : \sup_{\mathbb H} d_\nu[\hat{\mathrm{er}}_{\mathbf P}(H), \mathrm{er}_Q(H)] > \frac\alpha2\Bigr\} \le \frac\delta2, \qquad (63) \]
then
\[ \Pr\Bigl\{z \in (X \times Y)^{(n,m)} : \sup_{\mathbb H} d_\nu[\hat{\mathrm{er}}_z(H), \mathrm{er}_Q(H)] > \alpha\Bigr\} \le \delta. \]
Proof. Follows directly from the triangle inequality for $d_\nu$.

We treat the two inequalities in Lemma 23 separately.

A.2.1 INEQUALITY (62)

In the following Lemma we replace the supremum over $H \in \mathbb H$ in inequality (62) with a supremum over $h \in \mathbb H^n_l$.

Lemma 24.
\[ \Pr\Bigl\{(z, \mathbf P) \in (X \times Y)^{(n,m)} \times \mathcal P^n : \sup_{\mathbb H} d_\nu[\hat{\mathrm{er}}_z(H), \hat{\mathrm{er}}_{\mathbf P}(H)] > \alpha\Bigr\} \le \Pr\Bigl\{(z, \mathbf P) \in (X \times Y)^{(n,m)} \times \mathcal P^n : \sup_{\mathbb H^n_l} d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] > \alpha\Bigr\}. \qquad (64) \]
Proof. Suppose that $(z, \mathbf P)$ are such that $\sup_{\mathbb H} d_\nu[\hat{\mathrm{er}}_z(H), \hat{\mathrm{er}}_{\mathbf P}(H)] > \alpha$. Let $H$ satisfy this inequality. Suppose first that $\hat{\mathrm{er}}_z(H) \le \hat{\mathrm{er}}_{\mathbf P}(H)$. By the definition of $\hat{\mathrm{er}}_z(H)$, for all $\varepsilon > 0$ there exists $h \in H^n := H \oplus \cdots \oplus H$ such that $\hat{\mathrm{er}}_z(h) < \hat{\mathrm{er}}_z(H) + \varepsilon$. Hence by property (3) of the $d_\nu$ metric, for all $\varepsilon > 0$ there exists $h \in H^n$ such that $d_\nu[\hat{\mathrm{er}}_z(h), \hat{\mathrm{er}}_z(H)] < \varepsilon$. Pick an arbitrary $h$ satisfying this inequality. By definition, $\hat{\mathrm{er}}_{\mathbf P}(H) \le \mathrm{er}_{\mathbf P}(h)$, and so $\hat{\mathrm{er}}_z(H) \le \hat{\mathrm{er}}_{\mathbf P}(H) \le \mathrm{er}_{\mathbf P}(h)$. As $d_\nu[\hat{\mathrm{er}}_z(H), \hat{\mathrm{er}}_{\mathbf P}(H)] > \alpha$ (by assumption), by the compatibility of $d_\nu$ with the ordering on the reals, $d_\nu[\hat{\mathrm{er}}_z(H), \mathrm{er}_{\mathbf P}(h)] > \alpha$; write $d_\nu[\hat{\mathrm{er}}_z(H), \mathrm{er}_{\mathbf P}(h)] = \alpha + \delta$, say. By the triangle inequality for $d_\nu$,
\[ d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] + d_\nu[\hat{\mathrm{er}}_z(h), \hat{\mathrm{er}}_z(H)] \ge d_\nu[\hat{\mathrm{er}}_z(H), \mathrm{er}_{\mathbf P}(h)] = \alpha + \delta. \]
Thus $d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] > \alpha + \delta - \varepsilon$, and for any $\varepsilon > 0$ an $h$ satisfying this inequality can be found. Choosing $\varepsilon = \delta$ shows that there exists $h \in H^n$ such that $d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] > \alpha$. If instead $\hat{\mathrm{er}}_{\mathbf P}(H) < \hat{\mathrm{er}}_z(H)$, then an identical argument can be run with the roles of $z$ and $\mathbf P$ interchanged. Thus in both cases,
\[ \sup_{\mathbb H} d_\nu[\hat{\mathrm{er}}_z(H), \hat{\mathrm{er}}_{\mathbf P}(H)] > \alpha \implies \exists h \in \mathbb H^n_l : d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] > \alpha, \]
which completes the proof of the Lemma.

By the nature of the $(n,m)$ sampling process,
\[ \Pr\Bigl\{(z, \mathbf P) \in (X \times Y)^{(n,m)} \times \mathcal P^n : \sup_{\mathbb H^n_l} d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] > \alpha\Bigr\} = \int_{\mathbf P \in \mathcal P^n}\Pr\Bigl\{z \in (X \times Y)^{(n,m)} : \sup_{\mathbb H^n_l} d_\nu[\hat{\mathrm{er}}_z(h), \mathrm{er}_{\mathbf P}(h)] > \alpha\Bigr\}\,dQ^n(\mathbf P). \qquad (65) \]
Now $\mathbb H^n_l \subseteq K \oplus \cdots \oplus K$ where $K := \{h_l : h \in H,\ H \in \mathbb H\}$, and $\mathbb H^n_l$ is permissible by the assumed permissibility of $\mathbb H$ (Lemma 32, Appendix D). Hence $\mathbb H^n_l$ satisfies the conditions of Corollary 19, and so combining Lemma 24 and Equation (65), and substituting $\alpha/2$ for $\alpha$ and $\delta/2$ for $\delta$ in Corollary 19, gives the following Lemma on the sample size required to ensure (62) holds.

Lemma 25. If
\[ m \ge \max\Bigl\{\frac{32}{\alpha^2\nu n}\log\frac{8 C(\alpha\nu/16, \mathbb H^n_l)}\delta,\ \frac8{\alpha^2\nu}\Bigr\}, \]
then
\[ \Pr\Bigl\{(z, \mathbf P) \in (X \times Y)^{(n,m)} \times \mathcal P^n : \sup_{\mathbb H} d_\nu[\hat{\mathrm{er}}_z(H), \hat{\mathrm{er}}_{\mathbf P}(H)] > \frac\alpha2\Bigr\} \le \frac\delta2. \]

A.2.2 INEQUALITY (63)

Note that $\hat{\mathrm{er}}_{\mathbf P}(H) = \frac1n\sum_{i=1}^n H^*(P_i)$ and $\mathrm{er}_Q(H) = \mathbb E_{P\sim Q}\,H^*(P)$, i.e. the expectation of $H^*(P)$ where $P$ is distributed according to $Q$. So to bound the left-hand side of (63) we can apply Corollary 19 with $n = 1$, $m$ replaced by $n$, $H$ replaced by $\mathbb H^*$, $\alpha$ and $\delta$ replaced by $\alpha/2$ and $\delta/2$ respectively, $\mathbf P$ replaced by $Q$ and $Z$ replaced by $\mathcal P$. Note that $\mathbb H^*$ is permissible whenever $\mathbb H$ is (Lemma 32). Thus, if
\[ n \ge \max\Bigl\{\frac{32}{\alpha^2\nu}\log\frac{8 C(\alpha\nu/16, \mathbb H^*)}\delta,\ \frac8{\alpha^2\nu}\Bigr\}, \qquad (66) \]
then inequality (63) is satisfied. Now, putting together Lemma 23, Lemma 25 and Equation (66), we have proved the following more general version of Theorem 2.

Theorem 26.
Let $\mathbb H$ be a permissible hypothesis space family and let $z$ be an $(n,m)$-sample generated from the environment $(\mathcal P, Q)$. For all $0 < \alpha, \delta < 1$ and $\nu > 0$, if
\[ n \ge \max\Bigl\{\frac{32}{\alpha^2\nu}\log\frac{8 C(\alpha\nu/16, \mathbb H^*)}\delta,\ \frac8{\alpha^2\nu}\Bigr\} \]
and
\[ m \ge \max\Bigl\{\frac{32}{\alpha^2\nu n}\log\frac{8 C(\alpha\nu/16, \mathbb H^n_l)}\delta,\ \frac8{\alpha^2\nu}\Bigr\}, \]
then
\[ \Pr\Bigl\{z \in (X \times Y)^{(n,m)} : \sup_{\mathbb H} d_\nu[\hat{\mathrm{er}}_z(H), \mathrm{er}_Q(H)] > \alpha\Bigr\} \le \delta. \]
To get Theorem 2, observe that $\mathrm{er}_Q(H) > \hat{\mathrm{er}}_z(H) + \varepsilon$ implies $d_\nu[\hat{\mathrm{er}}_z(H), \mathrm{er}_Q(H)] > \varepsilon/(2 + \nu)$. Setting $\alpha = \varepsilon/(2 + \nu)$ and maximizing $\alpha^2\nu$ gives $\nu = 2$. Substituting $\alpha = \varepsilon/4$ and $\nu = 2$ into Theorem 26 gives Theorem 2.

A.3 The Realizable Case

In Theorem 2 the sample complexity for both $m$ and $n$ scales as $1/\varepsilon^2$. This can be improved to $1/\varepsilon$ if instead of requiring $\mathrm{er}_Q(H) \le \hat{\mathrm{er}}_z(H) + \varepsilon$, we require only that $\mathrm{er}_Q(H) \le \kappa\,\hat{\mathrm{er}}_z(H) + \varepsilon$ for some $\kappa > 1$. To see this, observe that $\mathrm{er}_Q(H) > \hat{\mathrm{er}}_z(H)(1 + \alpha)/(1 - \alpha) + \alpha\nu/(1 - \alpha)$ implies $d_\nu[\hat{\mathrm{er}}_z(H), \mathrm{er}_Q(H)] > \alpha$, so setting $\alpha\nu/(1 - \alpha) = \varepsilon$ in Theorem 26 and treating $\alpha$ as a constant gives:

Corollary 27. Under the same conditions as Theorem 26, for all $\varepsilon > 0$ and $0 < \alpha, \delta < 1$, if
\[ n \ge \max\Bigl\{\frac{32}{\alpha(1 - \alpha)\varepsilon}\log\frac{8 C\bigl((1 - \alpha)\varepsilon/16, \mathbb H^*\bigr)}\delta,\ \frac8{\alpha(1 - \alpha)\varepsilon}\Bigr\} \]
and
\[ m \ge \max\Bigl\{\frac{32}{\alpha(1 - \alpha)\varepsilon n}\log\frac{8 C\bigl((1 - \alpha)\varepsilon/16, \mathbb H^n_l\bigr)}\delta,\ \frac8{\alpha(1 - \alpha)\varepsilon}\Bigr\}, \]
then
\[ \Pr\Bigl\{z \in (X \times Y)^{(n,m)} : \sup_{\mathbb H}\ \mathrm{er}_Q(H) \ge \frac{1 + \alpha}{1 - \alpha}\hat{\mathrm{er}}_z(H) + \varepsilon\Bigr\} \le \delta. \]
These bounds are particularly useful if we know that $\hat{\mathrm{er}}_z(H) = 0$, for then we can set $\alpha = 1/2$ (which maximizes $\alpha(1 - \alpha)$).

Appendix B. Proof of Theorem 6

Recalling Definition 6, for $\mathbb H$ of the form given in (32), $\mathbb H^n_l$ can be written
\[ \mathbb H^n_l = \{g_1 \circ f \oplus \cdots \oplus g_n \circ f : g_1, \dots, g_n \in \mathcal G_l \text{ and } f \in \mathcal F\}. \]
To write $\mathbb H^n_l$ as a composition of two function classes, note that if for each $f: X \to V$ we define $\bar f: (X \times Y)^n \to (V \times Y)^n$ by
\[ \bar f(x_1, y_1, \dots, x_n, y_n) := (f(x_1), y_1, \dots, f(x_n), y_n), \]
then $g_1 \circ f \oplus \cdots \oplus g_n \circ f = (g_1 \oplus \cdots \oplus g_n) \circ \bar f$. Thus, setting $\mathcal G_l^n := \mathcal G_l \oplus \cdots \oplus \mathcal G_l$ and $\bar{\mathcal F} := \{\bar f : f \in \mathcal F\}$,
\[ \mathbb H^n_l = \mathcal G_l^n \circ \bar{\mathcal F}. \qquad (67) \]
The following two Lemmas will enable us to bound $C(\varepsilon, \mathbb H^n_l)$.

Lemma 28. Let $H: X \times Y \to [0,1]$ be of the form $H = \mathcal G_l \circ \mathcal F$, where $X \times Y \xrightarrow{\ \mathcal F\ } V \times Y \xrightarrow{\ \mathcal G_l\ } [0,1]$. For all $\varepsilon_1, \varepsilon_2 > 0$,
\[ C(\varepsilon_1 + \varepsilon_2, H) \le C_{\mathcal G_l}(\varepsilon_1, \mathcal F)\,C(\varepsilon_2, \mathcal G_l). \]
Proof. Fix a measure $P$ on $X \times Y$ and let $\hat{\mathcal F}$ be a minimum size $\varepsilon_1$-cover for $(\mathcal F, d_{[P,\mathcal G_l]})$. By definition, $|\hat{\mathcal F}| \le C_{\mathcal G_l}(\varepsilon_1, \mathcal F)$. For each $f \in \hat{\mathcal F}$ let $P_f$ be the measure on $V \times Y$ defined by $P_f(S) = P(f^{-1}(S))$ for any set $S$ in the $\sigma$-algebra on $V \times Y$ ($f$ is measurable so $f^{-1}(S)$ is measurable). Let $G_f$ be a minimum size $\varepsilon_2$-cover for $(\mathcal G_l, d_{P_f})$. By definition again, $|G_f| \le C(\varepsilon_2, \mathcal G_l)$. Let $N := \{g \circ f : f \in \hat{\mathcal F} \text{ and } g \in G_f\}$. Note that $|N| \le C_{\mathcal G_l}(\varepsilon_1, \mathcal F)\,C(\varepsilon_2, \mathcal G_l)$, so the Lemma will be proved if $N$ can be shown to be an $(\varepsilon_1 + \varepsilon_2)$-cover for $(H, d_P)$. So, given any $g \circ f \in H$, choose $f' \in \hat{\mathcal F}$ such that $d_{[P,\mathcal G_l]}(f, f') \le \varepsilon_1$ and $g' \in G_{f'}$ such that $d_{P_{f'}}(g, g') \le \varepsilon_2$.
Now,
\[ d_P(g \circ f, g' \circ f') \le d_P(g \circ f, g \circ f') + d_P(g \circ f', g' \circ f') \le d_{[P,\mathcal G_l]}(f, f') + d_{P_{f'}}(g, g') \le \varepsilon_1 + \varepsilon_2, \]
where the first step follows from the triangle inequality for $d_P$ and the second step follows from the facts $d_P(g \circ f', g' \circ f') = d_{P_{f'}}(g, g')$ and $d_P(g \circ f, g \circ f') \le d_{[P,\mathcal G_l]}(f, f')$. Thus $N$ is an $(\varepsilon_1 + \varepsilon_2)$-cover for $(H, d_P)$ and so the result follows.

Recalling the definition of $H_1 \oplus \cdots \oplus H_n$ (Definition 6), we have the following Lemma.

Lemma 29.
\[ C(\varepsilon, H_1 \oplus \cdots \oplus H_n) \le \prod_{i=1}^n C(\varepsilon, H_i). \]
Proof. Fix a product probability measure $\mathbf P = P_1 \times \cdots \times P_n$ on $(X \times Y)^n$. Let $N_1, \dots, N_n$ be $\varepsilon$-covers of $(H_1, d_{P_1}), \dots, (H_n, d_{P_n})$, and let $N = N_1 \oplus \cdots \oplus N_n$. Given $h = h_1 \oplus \cdots \oplus h_n \in H_1 \oplus \cdots \oplus H_n$, choose $g_1 \oplus \cdots \oplus g_n \in N$ such that $d_{P_i}(h_i, g_i) \le \varepsilon$ for each $i = 1, \dots, n$. Now,
\[ d_{\mathbf P}(h_1 \oplus \cdots \oplus h_n,\ g_1 \oplus \cdots \oplus g_n) = \frac1n\int_{Z^n}\Bigl|\sum_{i=1}^n h_i(z_i) - \sum_{i=1}^n g_i(z_i)\Bigr|\,d\mathbf P(z_1, \dots, z_n) \le \frac1n\sum_{i=1}^n d_{P_i}(h_i, g_i) \le \varepsilon. \]
Thus $N$ is an $\varepsilon$-cover for $H_1 \oplus \cdots \oplus H_n$ and, as $|N| = \prod_{i=1}^n|N_i|$, the result follows.

B.1 Bounding $C(\varepsilon, \mathbb H^n_l)$

From Lemma 28,
\[ C\bigl(\varepsilon_1 + \varepsilon_2,\ \mathcal G_l^n \circ \bar{\mathcal F}\bigr) \le C(\varepsilon_1, \mathcal G_l^n)\,C_{\mathcal G_l^n}(\varepsilon_2, \bar{\mathcal F}), \qquad (68) \]
and from Lemma 29,
\[ C(\varepsilon_1, \mathcal G_l^n) \le C(\varepsilon_1, \mathcal G_l)^n. \qquad (69) \]
Using similar techniques to those used to prove Lemmas 28 and 29, $C_{\mathcal G_l^n}(\varepsilon, \bar{\mathcal F})$ can be shown to satisfy
\[ C_{\mathcal G_l^n}(\varepsilon_2, \bar{\mathcal F}) \le C_{\mathcal G_l}(\varepsilon_2, \mathcal F). \qquad (70) \]
Equations (67), (68), (69) and (70) together imply inequality (34).

B.2 Bounding $C(\varepsilon, \mathbb H^*)$

We wish to prove that $C(\varepsilon, \mathbb H^*) \le C_{\mathcal G_l}(\varepsilon, \mathcal F)$ when $\mathbb H$ is a hypothesis space family of the form $\mathbb H = \{\mathcal G_l \circ f : f \in \mathcal F\}$. Note that each $H^* \in \mathbb H^*$ corresponds to some $\mathcal G_l \circ f$, and that
\[ H^*(P) = \inf_{g\in\mathcal G_l}\mathrm{er}_P(g \circ f). \]
Any probability measure $Q$ on $\mathcal P$ induces a probability measure $Q_{X\times Y}$ on $X \times Y$, defined by
\[ Q_{X\times Y}(S) = \int_{\mathcal P} P(S)\,dQ(P) \]
for any $S$ in the $\sigma$-algebra on $X \times Y$. Note also that if $h, h'$ are bounded, positive functions on an arbitrary set $A$, then
\[ \Bigl|\inf_{a\in A} h(a) - \inf_{a\in A} h'(a)\Bigr| \le \sup_{a\in A}\bigl|h(a) - h'(a)\bigr|. \qquad (71) \]
Let $Q$ be any probability measure on the space $\mathcal P$ of probability measures on $X \times Y$. Let $H^*_1, H^*_2$ be two elements of $\mathbb H^*$ with corresponding hypothesis spaces $\mathcal G_l \circ f_1$, $\mathcal G_l \circ f_2$. Then,
\[ d_Q(H^*_1, H^*_2) = \int_{\mathcal P}\Bigl|\inf_{g\in\mathcal G_l}\mathrm{er}_P(g \circ f_1) - \inf_{g\in\mathcal G_l}\mathrm{er}_P(g \circ f_2)\Bigr|\,dQ(P) \le \int_{\mathcal P}\sup_{g\in\mathcal G_l}\bigl|\mathrm{er}_P(g \circ f_1) - \mathrm{er}_P(g \circ f_2)\bigr|\,dQ(P) \quad\text{(by (71) above)} \]
\[ \le \int_{\mathcal P}\int_{X\times Y}\sup_{g\in\mathcal G_l}\bigl|g \circ f_1(x, y) - g \circ f_2(x, y)\bigr|\,dP(x, y)\,dQ(P) = d_{[Q_{X\times Y},\mathcal G_l]}(f_1, f_2). \]
The measurability of $\sup_{\mathcal G_l} g \circ f$ is guaranteed by the permissibility of $\mathbb H$ (Lemma 32 part 4, Appendix D). From $d_Q(H^*_1, H^*_2) \le d_{[Q_{X\times Y},\mathcal G_l]}(f_1, f_2)$ we have
\[ N(\varepsilon, \mathbb H^*, d_Q) \le N\bigl(\varepsilon, \mathcal F, d_{[Q_{X\times Y},\mathcal G_l]}\bigr), \qquad (72) \]
which gives inequality (35).

B.3 Proof of Theorem 7

In order to prove the bounds in Theorem 7 we have to apply Theorem 6 to the neural network hypothesis space family of equation (39). In this case the structure is
\[ \mathbb R^d \xrightarrow{\ \mathcal F\ } \mathbb R^k \xrightarrow{\ \mathcal G\ } [0,1], \]
where $\mathcal G = \{(x_1, \dots, x_k) \mapsto \sigma(\sum_{i=1}^k\alpha_i x_i + \alpha_0) : (\alpha_0, \alpha_1, \dots, \alpha_k) \in U\}$ for some bounded subset $U$ of $\mathbb R^{k+1}$ and some Lipschitz squashing function $\sigma$.
The feature class $\mathcal F$ is the set of all one-hidden-layer neural networks mapping $\mathbb R^d$ to $\mathbb R^k$ with $d$ inputs, $l$ hidden nodes, $k$ outputs, $\sigma$ as the squashing function and weights $w \in T$, where $T$ is a bounded subset of $\mathbb R^W$. The Lipschitz restriction on $\sigma$ and the boundedness restrictions on the weights ensure that $\mathcal F$ and $\mathcal G$ are Lipschitz classes. Hence there exists $b < \infty$ such that for all $f \in \mathcal F$ and $x, x' \in \mathbb R^d$, $\|f(x) - f(x')\| \le b\|x - x'\|$, and for all $g \in \mathcal G$ and $x, x' \in \mathbb R^k$, $|g(x) - g(x')| \le b\|x - x'\|$, where $\|\cdot\|$ is the $L_1$ norm in each case. The loss function is squared loss. Now, $g_l(x, y) = l(g(x), y) = (g(x) - y)^2$, hence for all $g, g' \in \mathcal G$ and all probability measures $P$ on $\mathbb R^k \times [0,1]$ (recall that we assumed the output space $Y$ was $[0,1]$),
\[ d_P(g_l, g'_l) = \int_{\mathbb R^k\times[0,1]}\bigl|(g(v) - y)^2 - (g'(v) - y)^2\bigr|\,dP(v, y) \le 2\int_{\mathbb R^k}\bigl|g(v) - g'(v)\bigr|\,dP_{\mathbb R^k}(v), \qquad (73) \]
where $P_{\mathbb R^k}$ is the marginal distribution on $\mathbb R^k$ derived from $P$. Similarly, for all $f, f' \in \mathcal F$ and probability measures $P$ on $\mathbb R^d \times [0,1]$,
\[ d_{[P,\mathcal G_l]}(f, f') \le 2b\int_{\mathbb R^d}\|f(x) - f'(x)\|\,dP_{\mathbb R^d}(x). \qquad (74) \]
Define
\[ C(\varepsilon, \mathcal G, L_1) := \sup_P N(\varepsilon, \mathcal G, L_1(P)), \]
where the supremum is over all probability measures on (the Borel subsets of) $\mathbb R^k$, and $N(\varepsilon, \mathcal G, L_1(P))$ is the size of the smallest $\varepsilon$-cover of $\mathcal G$ under the $L_1(P)$ metric. Similarly set
\[ C(\varepsilon, \mathcal F, L_1) := \sup_P N(\varepsilon, \mathcal F, L_1(P)), \]
where now the supremum is over all probability measures on $\mathbb R^d$. Equations (73) and (74) imply
\[ C(\varepsilon, \mathcal G_l) \le C\Bigl(\frac\varepsilon2, \mathcal G, L_1\Bigr), \qquad (75) \]
\[ C_{\mathcal G_l}(\varepsilon, \mathcal F) \le C\Bigl(\frac\varepsilon{2b}, \mathcal F, L_1\Bigr). \qquad (76) \]
Applying Theorem 11 from Haussler (1992), we find
\[ C\Bigl(\frac\varepsilon2, \mathcal G, L_1\Bigr) \le \Bigl[\frac{2eb}\varepsilon\Bigr]^{2k+2}, \qquad C\Bigl(\frac\varepsilon{2b}, \mathcal F, L_1\Bigr) \le \Bigl[\frac{2eb^2}\varepsilon\Bigr]^{2W}. \]
Substituting these two expressions into (75) and (76) and applying Theorem 6 yields Theorem 7.

Appendix C. Proof of Theorem 14

This proof follows a similar argument to the one presented in Anthony and Bartlett (1999) for ordinary Boolean function learning. First we need a technical Lemma.

Lemma 30. Let $\alpha$ be a random variable uniformly distributed on $\{1/2 + \beta/2,\ 1/2 - \beta/2\}$, with $0 < \beta < 1$. Let $\xi_1, \dots, \xi_m$ be i.i.d. $\{\pm1\}$-valued random variables with $\Pr\{\xi_i = 1\} = \alpha$ for all $i$. For any function $f$ mapping $\{\pm1\}^m \to \{1/2 + \beta/2,\ 1/2 - \beta/2\}$,
\[ \Pr\bigl\{\xi_1, \dots, \xi_m : f(\xi_1, \dots, \xi_m) \ne \alpha\bigr\} > \frac14\Bigl[1 - \sqrt{1 - e^{-\frac{m\beta^2}{1-\beta^2}}}\Bigr]. \]
Proof. Let $N(\xi)$ denote the number of occurrences of $+1$ in the random sequence $\xi = (\xi_1, \dots, \xi_m)$. The function $f$ can be viewed as a decision rule, i.e. based on the observations $\xi$, $f$ tries to guess whether the probability of $+1$ is $1/2 + \beta/2$ or $1/2 - \beta/2$. The optimal decision rule is the Bayes estimator: $f(\xi_1, \dots, \xi_m) = 1/2 + \beta/2$ if $N(\xi) \ge m/2$, and $f(\xi_1, \dots, \xi_m) = 1/2 - \beta/2$ otherwise. Hence,
\[ \Pr\{f(\xi) \ne \alpha\} \ge \frac12\Pr\Bigl\{N(\xi) \ge \frac m2 \,\Bigm|\, \alpha = \frac12 - \frac\beta2\Bigr\} + \frac12\Pr\Bigl\{N(\xi) < \frac m2 \,\Bigm|\, \alpha = \frac12 + \frac\beta2\Bigr\} > \frac12\Pr\Bigl\{N(\xi) \ge \frac m2 \,\Bigm|\, \alpha = \frac12 - \frac\beta2\Bigr\}, \]
which is half the probability that a binomial $(m, 1/2 - \beta/2)$ random variable is at least $m/2$. By Slud's inequality (Slud, 1977),
\[ \Pr\{f(\xi) \ne \alpha\} > \frac12\Pr\Bigl\{Z \ge \sqrt{\frac{m\beta^2}{1-\beta^2}}\Bigr\}, \]
where $Z$ is normal $(0,1)$. Tate's inequality (Tate, 1953) states that for all $x \ge 0$,
\[ \Pr\{Z \ge x\} \ge \frac12\Bigl[1 - \sqrt{1 - e^{-x^2}}\Bigr]. \]
Combining the last two inequalities completes the proof.

Let $x \in X^{(n,m)}$ be shattered by $\mathbb H$, with $m = d_{\mathbb H}(n)$.
For each row $i$ in $x$, let $\mathcal P_i$ be the set of all $2^{d_{\mathbb H}(n)}$ distributions $P$ on $X \times \{\pm1\}$ such that $P(x, 1) = P(x, -1) = 0$ if $x$ is not contained in the $i$th row of $x$, and for each $j = 1, \dots, d_{\mathbb H}(n)$, $P(x_{ij}, 1) = (1 \pm \beta)/(2 d_{\mathbb H}(n))$ and $P(x_{ij}, -1) = (1 \mp \beta)/(2 d_{\mathbb H}(n))$. Let $\mathcal P := \mathcal P_1 \times \cdots \times \mathcal P_n$. Note that for $\mathbf P = (P_1, \dots, P_n) \in \mathcal P$, the optimal error $\mathrm{opt}_{\mathbf P}(\mathbb H^n)$ is achieved by any sequence $h^* = (h^*_1, \dots, h^*_n)$ such that $h^*_i(x_{ij}) = 1$ if and only if $P_i(x_{ij}, 1) = (1 + \beta)/(2 d_{\mathbb H}(n))$, and $\mathbb H^n$ always contains such a sequence because $\mathbb H$ shatters $x$. The optimal error is then
\[ \mathrm{opt}_{\mathbf P}(\mathbb H^n) = \mathrm{er}_{\mathbf P}(h^*) = \frac1n\sum_{i=1}^n P_i\{h^*_i(x) \ne y\} = \frac1n\sum_{i=1}^n\sum_{j=1}^{d_{\mathbb H}(n)}\frac{1 - \beta}{2 d_{\mathbb H}(n)} = \frac{1 - \beta}2, \]
and for any $h = (h_1, \dots, h_n) \in \mathbb H^n$,
\[ \mathrm{er}_{\mathbf P}(h) = \mathrm{opt}_{\mathbf P}(\mathbb H^n) + \frac\beta{n\,d_{\mathbb H}(n)}\bigl|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\bigr|. \qquad (77) \]
For any $(n,m)$-sample $z$, let each element $m_{ij}$ in the array
\[ m(z) := \begin{bmatrix} m_{11} & \cdots & m_{1,d_{\mathbb H}(n)} \\ \vdots & & \vdots \\ m_{n1} & \cdots & m_{n,d_{\mathbb H}(n)} \end{bmatrix} \]
equal the number of occurrences of $x_{ij}$ in $z$. Now, if we select $\mathbf P = (P_1, \dots, P_n)$ uniformly at random from $\mathcal P$, and generate an $(n,m)$-sample $z$ using $\mathbf P$, then for $h = A_n(z)$ (the output of the learning algorithm) we have:
\[ \mathbb E\,\bigl|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\bigr| = \sum_{m} P(m)\,\mathbb E\bigl[\,\bigl|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\bigr|\,\bigm|\,m\bigr] = \sum_{m} P(m)\sum_{i=1}^n\sum_{j=1}^{d_{\mathbb H}(n)}\Pr\bigl\{h(x_{ij}) \ne h^*(x_{ij})\bigm|m_{ij}\bigr\}, \]
where $P(m)$ is the probability of generating a configuration $m$ of the $x_{ij}$ under the $(n,m)$-sampling process and the sum is over all possible configurations. From Lemma 30,
\[ \Pr\bigl\{h(x_{ij}) \ne h^*(x_{ij})\bigm|m_{ij}\bigr\} > \frac14\Bigl[1 - \sqrt{1 - e^{-\frac{m_{ij}\beta^2}{1-\beta^2}}}\Bigr], \]
hence
\[ \mathbb E\Bigl[\frac1{n\,d_{\mathbb H}(n)}\bigl|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\bigr|\Bigr] > \frac1{n\,d_{\mathbb H}(n)}\sum_m P(m)\sum_{i=1}^n\sum_{j=1}^{d_{\mathbb H}(n)}\frac14\Bigl[1 - \sqrt{1 - e^{-\frac{m_{ij}\beta^2}{1-\beta^2}}}\Bigr] \ge \frac14\Bigl[1 - \sqrt{1 - e^{-\frac{m\beta^2}{d_{\mathbb H}(n)(1-\beta^2)}}}\Bigr] \qquad (78) \]
by Jensen's inequality. Since for any $[0,1]$-valued random variable $Z$, $\Pr\{Z > x\} \ge \mathbb E Z - x$, (78) implies:
\[ \Pr\Bigl\{\frac1{n\,d_{\mathbb H}(n)}\bigl|\{(i,j) : h_i(x_{ij}) \ne h^*_i(x_{ij})\}\bigr| > \gamma\alpha\Bigr\} > (1 - \gamma)\alpha, \]
where
\[ \alpha := \frac14\Bigl[1 - \sqrt{1 - e^{-\frac{m\beta^2}{d_{\mathbb H}(n)(1-\beta^2)}}}\Bigr] \qquad (79) \]
and $\gamma \in [0,1]$. Plugging this into (77) shows that
\[ \Pr\bigl\{(\mathbf P, z) : \mathrm{er}_{\mathbf P}(A_n(z)) > \mathrm{opt}_{\mathbf P}(\mathbb H^n) + \gamma\alpha\beta\bigr\} > (1 - \gamma)\alpha. \]
Since the inequality holds over the random choice of $\mathbf P$, it must also hold for some specific choice of $\mathbf P$. Hence for any learning algorithm $A_n$ there is some sequence of distributions $\mathbf P$ such that
\[ \Pr\bigl\{z : \mathrm{er}_{\mathbf P}(A_n(z)) > \mathrm{opt}_{\mathbf P}(\mathbb H^n) + \gamma\alpha\beta\bigr\} > (1 - \gamma)\alpha. \]
Setting
\[ (1 - \gamma)\alpha \ge \delta \quad\text{and}\quad \gamma\alpha\beta \ge \varepsilon \qquad (80) \]
ensures
\[ \Pr\bigl\{z : \mathrm{er}_{\mathbf P}(A_n(z)) > \mathrm{opt}_{\mathbf P}(\mathbb H^n) + \varepsilon\bigr\} > \delta. \qquad (81) \]
Assuming equality in (80), we get
\[ \alpha = \frac\delta{1 - \gamma}, \qquad \beta = \frac\varepsilon\delta\cdot\frac{1 - \gamma}\gamma. \]
Solving (79) for $m$, and substituting the above expressions for $\alpha$ and $\beta$, shows that (81) is satisfied provided
\[ m \le d_{\mathbb H}(n)\Biggl[\Bigl(\frac\delta\varepsilon\Bigr)^2\Bigl(\frac\gamma{1 - \gamma}\Bigr)^2 - 1\Biggr]\log\frac{(1 - \gamma)^2}{8\delta(1 - \gamma - 2\delta)}. \qquad (82) \]
Setting $\gamma = 1 - a\delta$ for some $a > 4$ ($a > 4$ since $\alpha < 1/4$ and $\alpha = \delta/(1 - \gamma)$), and assuming $\varepsilon, \delta \le 1/(ka)$ for some $k > 2$, (82) becomes
\[ m \le \frac{d_{\mathbb H}(n)}{a^2\varepsilon^2}\Bigl[1 - \frac2k\Bigr]\log\frac{a^2}{8(a - 2)}. \qquad (83) \]
Subject to the constraint $a > 4$, the right hand side of (83) is approximately maximized at $a = 8.7966$, at which point the value exceeds $d_{\mathbb H}(n)(1 - 2/k)/(220\varepsilon^2)$.
Thus, for all $k \ge 1$, if $\varepsilon, \delta \le 1/(9k)$ and
\[ m \le \frac{d_{\mathbb H}(n)\bigl(1 - \frac2k\bigr)}{220\,\varepsilon^2}, \qquad (84) \]
then
\[ \Pr\bigl\{z : \mathrm{er}_{\mathbf P}(A_n(z)) > \mathrm{opt}_{\mathbf P}(\mathbb H^n) + \varepsilon\bigr\} > \delta. \]
To obtain the $\delta$-dependence in Theorem 14, observe that by assumption $\mathbb H_1$ contains at least two functions $h_1, h_2$, hence there exists an $x \in X$ such that $h_1(x) \ne h_2(x)$. Let $P_\pm$ be two distributions concentrated on $(x, 1)$ and $(x, -1)$ such that $P_\pm(x, h_1(x)) = (1 \pm \varepsilon)/2$ and $P_\pm(x, h_2(x)) = (1 \mp \varepsilon)/2$. Let $\mathbf P_+ := P_+ \times \cdots \times P_+$ and $\mathbf P_- := P_- \times \cdots \times P_-$ be the product distributions on $(X \times \{\pm1\})^n$ generated by $P_\pm$, and let $h_1 := (h_1, \dots, h_1)$, $h_2 := (h_2, \dots, h_2)$. Note that $h_1$ and $h_2$ are both in $\mathbb H^n$. If $\mathbf P$ is one of $\mathbf P_\pm$ and the learning algorithm $A_n$ chooses the wrong hypothesis $h$, then
\[ \mathrm{er}_{\mathbf P}(h) - \mathrm{opt}_{\mathbf P}(\mathbb H^n) = \varepsilon. \]
Now, if we choose $\mathbf P$ uniformly at random from $\{\mathbf P_+, \mathbf P_-\}$ and generate an $(n,m)$-sample $z$ according to $\mathbf P$, Lemma 30 shows that
\[ \Pr\bigl\{(\mathbf P, z) : \mathrm{er}_{\mathbf P}(A_n(z)) \ge \mathrm{opt}_{\mathbf P}(\mathbb H^n) + \varepsilon\bigr\} > \frac14\Bigl[1 - \sqrt{1 - e^{-\frac{nm\varepsilon^2}{1-\varepsilon^2}}}\Bigr], \]
which is at least $\delta$ if
\[ m < \frac{1 - \varepsilon^2}{\varepsilon^2}\cdot\frac1n\log\frac1{8\delta(1 - 2\delta)}, \qquad (85) \]
provided $0 < \delta < 1/4$. Combining the two constraints on $m$, (84) (with $k = 7$) and (85), and using $\max\{x_1, x_2\} \ge \frac12(x_1 + x_2)$, finishes the proof.

Appendix D. Measurability

In order for Theorems 2 and 18 to hold in full generality we had to impose a constraint called "permissibility" on the hypothesis space family $\mathbb H$. Permissibility was introduced by Pollard (1984) for ordinary hypothesis classes $H$. His definition is very similar to Dudley's "image admissible Suslin" (Dudley, 1984). We will be extending this definition to cover hypothesis space families. Throughout this section we assume all functions $h$ map from (the complete separable metric space) $Z$ into $[0,1]$. Let $\mathcal B(T)$ denote the Borel $\sigma$-algebra of any topological space $T$. As in Section 2.2, we view $\mathcal P$, the set of all probability measures on $Z$, as a topological space by equipping it with the topology of weak convergence. $\mathcal B(\mathcal P)$ is then the $\sigma$-algebra generated by this topology. The following two definitions are taken (with minor modifications) from Pollard (1984).

Definition 8. A set $H$ of $[0,1]$-valued functions on $Z$ is indexed by the set $T$ if there exists a function $f: Z \times T \to [0,1]$ such that
\[ H = \{f(\cdot, t) : t \in T\}. \]
Definition 9. The set $H$ is permissible if it can be indexed by a set $T$ such that
1. $T$ is an analytic subset of a Polish^7 space $\bar T$, and
2. the function $f: Z \times T \to [0,1]$ indexing $H$ by $T$ is measurable with respect to the product $\sigma$-algebra $\mathcal B(Z) \otimes \mathcal B(T)$.

An analytic subset $T$ of a Polish space $\bar T$ is simply the continuous image of a Borel subset $X$ of another Polish space $\bar X$. The analytic subsets of a Polish space include the Borel sets. They are important because projections of analytic sets are analytic, and can be measured in a complete measure space, whereas projections of Borel sets are not necessarily Borel, and hence cannot be measured with a Borel measure. For more details see Dudley (1989), section 13.2.

Lemma 31. $H_1 \oplus \cdots \oplus H_n: (X \times Y)^n \to [0,1]$ is permissible if $H_1, \dots, H_n$ are all permissible.

Proof. Omitted.

We now define permissibility of hypothesis space families.

7. A topological space is called Polish if it is metrizable such that it is a complete separable metric space.

Definition 10. A hypothesis space family $\mathbb H = \{H\}$ is permissible if there exist sets $S$ and $T$ that are analytic subsets of Polish spaces $\bar S$ and $\bar T$ respectively, and a function $f: Z \times T \times S \to [0,1]$, measurable with respect to $\mathcal B(Z) \otimes \mathcal B(T) \otimes \mathcal B(S)$, such that
\[ \mathbb H = \bigl\{\{f(\cdot, t, s) : t \in T\} : s \in S\bigr\}. \]
Lemma 31. H_1 ⊗ ⋯ ⊗ H_n : (X × Y)^n → [0, 1] is permissible if H_1, ..., H_n are all permissible.

Proof. Omitted.

We now define permissibility of hypothesis space families.

Definition 10. A hypothesis space family ℍ = {H} is permissible if there exist sets S and T that are analytic subsets of Polish spaces, and a function f : Z × T × S → [0, 1], measurable with respect to B(Z) ⊗ B(T) ⊗ B(S), such that

    ℍ = { { f(·, t, s) : t ∈ T } : s ∈ S }.

Let (X, Σ, μ) be a measure space and T be an analytic subset of a Polish space. Let A(X) denote the analytic subsets of X. The following three facts about analytic sets are taken from Pollard (1984), Appendix C.

(a) If (X, Σ, μ) is complete then A(X) ⊆ Σ.
(b) A(X × T) contains the product σ-algebra Σ ⊗ B(T).
(c) For any set Y in A(X × T), the projection π_X Y of Y onto X is in A(X).

Recall Definition 2 for the definition of H*. In the following lemma we assume that (Z, B(Z)) has been completed with respect to any probability measure P, and also that (P, B(P)) is complete with respect to the environmental measure Q.

Lemma 32. For any permissible hypothesis space family ℍ,
1. ℍ^n_l is permissible.
2. { h : h ∈ H, H ∈ ℍ } is permissible.
3. H is permissible for all H ∈ ℍ.
4. sup H and inf H are measurable for all H ∈ ℍ.
5. H* is measurable for all H ∈ ℍ.
6. ℍ* is permissible.

Proof. As we have absorbed the loss function into the hypotheses h, ℍ^n_l is simply the set of all n-fold products H ⊗ ⋯ ⊗ H such that H ∈ ℍ, so (1) follows from Lemma 31. (2) and (3) are immediate from the definitions. As H is permissible for all H ∈ ℍ, (4) can be proved by an argument identical to that used in the "Measurable Suprema" section of Pollard (1984), Appendix C. For (5), note that for any Borel-measurable h : Z → [0, 1], the function h̄ : P → [0, 1] defined by h̄(P) := ∫_Z h(z) dP(z) is Borel measurable (Kechris, 1995, Chapter 17). Now, permissibility of H automatically implies permissibility of H̄ := { h̄ : h ∈ H }, and H* = inf H̄, so H* is measurable by (4). To prove (6), let ℍ be indexed by f : Z × T × S → [0, 1] in the appropriate way and define g : P × T × S → [0, 1] by g(P, t, s) := ∫_Z f(z, t, s) dP(z). By Fubini's theorem g is a B(P) ⊗ B(T) ⊗ B(S)-measurable function. Let G : P × S → [0, 1] be defined by G(P, s) := inf_{t ∈ T} g(P, t, s). G indexes ℍ* in the appropriate way for ℍ* to be permissible, provided it can be shown that G is B(P) ⊗ B(S)-measurable. This is where analyticity becomes important. Let g_α := { (P, t, s) : g(P, t, s) < α }. By property (b) of analytic sets, A(P × T × S) contains g_α. The set G_α := { (P, s) : G(P, s) < α } is the projection of g_α onto P × S, which by property (c) is also analytic. As (P, B(P), Q) is assumed complete, G_α is measurable by property (a). Thus G is a measurable function and the permissibility of ℍ* follows.

References

Abu-Mostafa, Y. (1993). A method for learning from hints. In Hanson, S. J., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5, pp. 73–80, San Mateo, CA. Morgan Kaufmann.

Anthony, M., & Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK.

Bartlett, P. L. (1993). Lower bounds on the VC-dimension of multi-layer threshold networks. In Proceedings of the Sixth ACM Conference on Computational Learning Theory, pp. 44–150, New York. ACM Press. Summary appeared in Neural Computation, 5, no. 3.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 525–536.

Baxter, J. (1995a). Learning Internal Representations. Ph.D. thesis, Department of Mathematics and Statistics, The Flinders University of South Australia. Copy available from http://wwwsyseng.anu.edu.au/~jon/papers/thesis.ps.gz.
Baxter, J. (1995b). Learning internal representations. In Proceedings of the Eighth International Conference on Computational Learning Theory, pp. 311–320. ACM Press. Copy available from http://wwwsyseng.anu.edu.au/~jon/papers/colt95.ps.gz.

Baxter, J. (1997a). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28, 7–40.

Baxter, J. (1997b). The canonical distortion measure for vector quantization and function approximation. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 39–47. Morgan Kaufmann.

Baxter, J., & Bartlett, P. L. (1998). The canonical distortion measure in feature space and 1-NN classification. In Advances in Neural Information Processing Systems 10, pp. 245–251. MIT Press.

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36, 929–965.

Caruana, R. (1997). Multitask learning. Machine Learning, 28, 41–70.

Devroye, L., Györfi, L., & Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.

Dudley, R. M. (1984). A Course on Empirical Processes, Vol. 1097 of Lecture Notes in Mathematics, pp. 2–142. Springer-Verlag.

Dudley, R. M. (1989). Real Analysis and Probability. Wadsworth & Brooks/Cole, California.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (Eds.). (1995). Bayesian Data Analysis. Chapman and Hall.

Good, I. J. (1980). Some history of the hierarchical Bayesian methodology. In Bernardo, J. M., DeGroot, M. H., Lindley, D. V., & Smith, A. F. M. (Eds.), Bayesian Statistics II. University Press, Valencia.

Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 78–150.

Heskes, T. (1998). Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Shavlik, J. (Ed.), Proceedings of the 15th International Conference on Machine Learning (ICML '98), pp. 233–241. Morgan Kaufmann.

Intrator, N., & Edelman, S. (1996). How to make a low-dimensional representation suitable for diverse tasks. Connection Science, 8.

Kechris, A. S. (1995). Classical Descriptive Set Theory. Springer-Verlag, New York.

Khan, K., Muggleton, S., & Parson, R. (1998). Repeat learning using predicate invention. In Page, C. D. (Ed.), Proceedings of the 8th International Workshop on Inductive Logic Programming (ILP-98), LNAI 1446, pp. 65–174. Springer-Verlag.

Langford, J. C. (1999). Staged learning. Tech. rep., CMU, School of Computer Science. http://www.cs.cmu.edu/~jcl/research/ltol/stagedlatest.ps.

Mitchell, T. M. (1991). The need for biases in learning generalisations. In Dietterich, T. G., & Shavlik, J. (Eds.), Readings in Machine Learning. Morgan Kaufmann.

Parthasarathy, K. R. (1967). Probability Measures on Metric Spaces. Academic Press, London.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

Pratt, L. Y. (1992). Discriminability-based transfer between neural networks. In Hanson, S. J., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5, pp. 204–211. Morgan Kaufmann.

Rendell, L., Seshu, R., & Tcheng, D. (1987). Layered concept learning and dynamically-variable bias management. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI '87), pp. 308–314. IJCAI, Inc.
Ring, M. B. (1995). Continual Learning in Reinforcement Environments. R. Oldenbourg Verlag.

Russell, S. (1989). The Use of Knowledge in Analogy and Induction. Morgan Kaufmann.

Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory A, 13, 145–168.

Sharkey, N. E., & Sharkey, A. J. C. (1993). Adaptive generalisation and the transfer of knowledge. Artificial Intelligence Review, 7, 313–328.

Silver, D. L., & Mercer, R. E. (1996). The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connection Science, 8, 277–294.

Singh, S. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8, 323–339.

Slud, E. (1977). Distribution inequalities for the binomial law. Annals of Probability, 4, 404–412.

Suddarth, S. C., & Holden, A. D. C. (1991). Symbolic-neural systems and the use of hints in developing complex systems. International Journal of Man-Machine Studies, 35, 291–311.

Suddarth, S. C., & Kergosien, Y. L. (1990). Rule-injection hints as a means of improving network performance and learning time. In Proceedings of the EURASIP Workshop on Neural Networks, Portugal. EURASIP.

Sutton, R. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 171–176. MIT Press.

Tate, R. F. (1953). On a double inequality of the normal distribution. Annals of Mathematical Statistics, 24, 132–134.

Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems 8, pp. 640–646. MIT Press.

Thrun, S., & Mitchell, T. M. (1995). Learning one more thing. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1217–1223. Morgan Kaufmann.

Thrun, S., & O'Sullivan, J. (1996). Discovering structure in multiple learning tasks: The TC algorithm. In Saitta, L. (Ed.), Proceedings of the 13th International Conference on Machine Learning (ICML '96), pp. 489–497. Morgan Kaufmann.

Thrun, S., & Pratt, L. (Eds.). (1997). Learning to Learn. Kluwer Academic.

Thrun, S., & Schwartz, A. (1995). Finding structure in reinforcement learning. In Tesauro, G., Touretzky, D., & Leen, T. (Eds.), Advances in Neural Information Processing Systems, Vol. 7, pp. 385–392. MIT Press.

Utgoff, P. E. (1986). Shift of bias for inductive concept learning. In Machine Learning: An Artificial Intelligence Approach, pp. 107–147. Morgan Kaufmann.

Valiant, L. G. (1984). A theory of the learnable. Comm. ACM, 27, 1134–1142.

Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York.

Vapnik, V. N. (1996). The Nature of Statistical Learning Theory. Springer-Verlag, New York.