A Model of Inductive Bias Learning

J. Baxter

Volume 12, 2000

Abstract:
A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is supplied by hand through the skill and insights of experts. In this paper a model for automatically learning bias is investigated. The central assumption of the model is that the learner is embedded within an environment of related learning tasks. Within such an environment the learner can sample from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to many of the problems in the environment. Under certain restrictions on the set of all hypothesis spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. Explicit bounds are also derived demonstrating that learning multiple tasks within an environment of related tasks can potentially give much better generalization than learning a single task.


Journal of Artificial Intelligence Research 12 (2000) 149-226. Submitted 11/99; published 3/00.
(c) 2000 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

Jonathan Baxter (JONATHAN.BAXTER@ANU.EDU.AU)
Research School of Information Sciences and Engineering, Australian National University, Canberra 0200, Australia
1. Introduction

Often the hardest problem in any machine learning task is the initial choice of hypothesis space; it has to be large enough to contain a solution to the problem at hand, yet small enough to ensure good generalization from a small number of examples (Mitchell, 1991). Once a suitable bias has been found, the actual learning task is often straightforward. Existing methods of bias generally require the input of a human expert in the form of heuristics and domain knowledge (for example, through the selection of an appropriate set of features). Despite their successes, such methods are clearly limited by the accuracy and reliability of the expert's knowledge and also by the extent to which that knowledge can be transferred to the learner. Thus it is natural to search for methods of automatically learning the bias.
In this paper we introduce and analyze a formal model of bias learning that builds upon the PAC model of machine learning and its variants (Vapnik, 1982; Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1989; Haussler, 1992). These models typically take the following general form: the learner is supplied with a hypothesis space H and training data z = {(x_1, y_1), ..., (x_m, y_m)} drawn independently according to some underlying distribution P on X × Y. Based on the information contained in z, the learner's goal is to select a hypothesis h : X → Y from H minimizing some measure er_P(h) of expected loss with respect to P (for example, in the case of squared loss, er_P(h) := E_{(x,y)~P} (h(x) − y)^2). In such models the learner's bias is represented by the choice of H; if H does not contain a good solution to the problem, then, regardless of how much data the learner receives, it cannot learn.
Of course, the best way to bias the learner is to supply it with an H containing just a single optimal hypothesis. But finding such a hypothesis is precisely the original learning problem, so in the PAC model there is no distinction between bias learning and ordinary learning. Or, put differently, the PAC model does not model the process of choosing an inductive bias; it simply takes the hypothesis space H as given and proceeds from there. To overcome this problem, in this paper we assume that instead of being faced with just a single learning task, the learner is embedded within an environment of related learning tasks. The learner is supplied with a family of hypothesis spaces ℍ = {H}, and its goal is to find a bias (i.e., a hypothesis space H ∈ ℍ) that is appropriate for the entire environment.

A simple example is the problem of handwritten character recognition. A preprocessing stage that identifies and removes any (small) rotations, dilations and translations of an image of a character will be advantageous for recognizing all characters. If the set of all individual character recognition problems is viewed as an environment of learning problems (that is, the set of all problems of the form "distinguish 'A' from all other characters", "distinguish 'B' from all other characters", and so on), this preprocessor represents a bias that is appropriate for all problems in the environment. It is likely that there are many other currently unknown biases that are also appropriate for this environment. We would like to be able to learn these automatically.
There are many other examples of learning problems that can be viewed as belonging to environments of related problems. For example, each individual face recognition problem belongs to an (essentially infinite) set of related learning problems (all the other individual face recognition problems); the set of all individual spoken word recognition problems forms another large environment, as does the set of all fingerprint recognition problems, printed Chinese and Japanese character recognition problems, stock price prediction problems, and so on. Even medical diagnostic and prognostic problems, where a multitude of diseases are predicted from the same pathology tests, constitute an environment of related learning problems.
In many cases these "environments" are not normally modeled as such; instead they are treated as single, multiple-category learning problems. For example, recognizing a group of faces would normally be viewed as a single learning problem with multiple class labels (one for each face in the group), not as multiple individual learning problems. However, if a reliable classifier for each individual face in the group can be constructed, then the classifiers can easily be combined to produce a classifier for the whole group. Furthermore, by viewing the faces as an environment of related learning problems, the results presented here show that bias can be learnt that will be good for learning novel faces, a claim that cannot be made for the traditional approach.
This point goes to the heart of our model: we are not concerned with adjusting a learner's bias so that it performs better on some fixed set of learning problems. Such a process is in fact just ordinary learning, but with a richer hypothesis space in which some components labelled "bias" are also able to be varied. Instead, we suppose the learner is faced with a (potentially infinite) stream of tasks, and that by adjusting its bias on some subset of the tasks it improves its learning performance on future, as yet unseen tasks. Bias that is appropriate for all problems in an environment must be learnt by sampling from many tasks. If only a single task is learnt, then the bias extracted is likely to be specific to that task.

In the rest of this paper, a general theory of bias learning is developed based upon the idea of learning multiple related tasks. Loosely speaking (formal results are stated in Section 2), there are two main conclusions of the theory presented here:
 
* Learning multiple related tasks reduces the sampling burden required for good generalization, at least on a number-of-examples-required-per-task basis.

* Bias that is learnt on sufficiently many training tasks is likely to be good for learning novel tasks drawn from the same environment.
The second point shows that a form of meta-generalization is possible in bias learning. Ordinarily, we say a learner generalizes well if, after seeing sufficiently many training examples, it produces a hypothesis that with high probability will perform well on future examples of the same task. However, a bias learner generalizes well if, after seeing sufficiently many training tasks, it produces a hypothesis space that with high probability contains good solutions to novel tasks. Another term that has been used for this process is Learning to Learn (Thrun & Pratt, 1997).
Our main theorems are stated in an agnostic setting (that is, ℍ does not necessarily contain a hypothesis space with solutions to all the problems in the environment), but we also give improved bounds in the realizable case. The sample complexity bounds appearing in these results are stated in terms of combinatorial parameters related to the complexity of the set of all hypothesis spaces ℍ available to the bias learner. For Boolean learning problems (pattern classification) these parameters are the bias-learning analogue of the Vapnik-Chervonenkis dimension (Vapnik, 1982; Blumer et al., 1989).

As an application of the general theory, the problem of learning an appropriate set of neural-network features for an environment of related tasks is formulated as a bias learning problem. In the case of continuous neural-network features we are able to prove upper bounds on the number of training tasks and the number of examples of each training task required to ensure that a set of features that works well for the training tasks will, with high probability, work well on novel tasks drawn from the same environment. The upper bound on the number of tasks scales as O(b), where b is a measure of the complexity of the possible feature sets available to the learner, while the upper bound on the number of examples of each task scales as O(a + b/n), where O(a) is the number of examples required to learn a task if the "true" set of features (that is, the correct bias) is already known, and n is the number of tasks. Thus, in this case we see that as the number of related tasks learnt increases, the number of examples required of each task for good generalization decays to the minimum possible. For Boolean neural-network feature maps we are able to show a matching lower bound of the same form on the number of examples required per task.
1.1 Related Work

There is a large body of previous algorithmic and experimental work in the machine learning and statistics literature addressing the problems of inductive bias learning and improving generalization through multiple task learning. Some of these approaches can be seen as special cases of, or at least closely aligned with, the model described here, while others are more orthogonal. Without being completely exhaustive, in this section we present an overview of the main contributions. See Thrun and Pratt (1997, chapter 1) for a more comprehensive treatment.
 
* Hierarchical Bayes. The earliest approaches to bias learning come from hierarchical Bayesian methods in statistics (Berger, 1985; Good, 1980; Gelman, Carlin, Stern, & Rubin, 1995). In contrast to the Bayesian methodology, the present paper takes an essentially empirical process approach to modeling the problem of bias learning. However, a model using a mixture of hierarchical Bayesian and information-theoretic ideas was presented in Baxter (1997a), with similar conclusions to those found here. An empirical study showing the utility of the hierarchical Bayes approach in a domain containing a large number of related tasks was given in Heskes (1998).
* Early machine learning work. In Rendell, Seshu, and Tcheng (1987), "VBMS" or the Variable Bias Management System was introduced as a mechanism for selecting amongst different learning algorithms when tackling a new learning problem. "STABB" or Shift To a Better Bias (Utgoff, 1986) was another early scheme for adjusting bias, but unlike VBMS, STABB was not primarily focussed on searching for bias applicable to large problem domains. Our use of an "environment of related tasks" in this paper may also be interpreted as an "environment of analogous tasks" in the sense that conclusions about one task can be arrived at by analogy with (sufficiently many of) the other tasks. For an early discussion of analogy in this context, see Russell (1989, S4.3), in particular the observation that for analogous problems the sampling burden per task can be reduced.
* Metric-based approaches. The metric used in nearest-neighbour classification, and in vector quantization to determine the nearest code-book vector, represents a form of inductive bias. Using the model of the present paper, and under some extra assumptions on the tasks in the environment (specifically, that their marginal input-space distributions are identical and they only differ in the conditional probabilities they assign to class labels), it can be shown that there is an optimal metric or distance measure to use for vector quantization and one-nearest-neighbour classification (Baxter, 1995a, 1997b; Baxter & Bartlett, 1998). This metric can be learnt by sampling from a subset of tasks from the environment, and then used as a distance measure when learning novel tasks drawn from the same environment. Bounds on the number of tasks and examples of each task required to ensure good performance on novel tasks were given in Baxter and Bartlett (1998), along with an experiment in which a metric was successfully trained on examples of a subset of 400 Japanese characters and then used as a fixed distance measure when learning 2600 as-yet-unseen characters.

A similar approach is described in Thrun and Mitchell (1995) and Thrun (1996), in which a neural network's output was trained to match labels on a novel task, while simultaneously being forced to match its gradient to derivative information generated from a distance metric trained on previous, related tasks. Performance on the novel tasks improved substantially with the use of the derivative information.

Note that there are many other adaptive metric techniques used in machine learning, but these all focus exclusively on adjusting the metric for a fixed set of problems rather than learning a metric suitable for learning novel, related tasks (bias learning).
 
* Feature learning or learning internal representations. As with adaptive metric techniques, there are many approaches to feature learning that focus on adapting features for a fixed task rather than learning features to be used in novel tasks. One of the few cases where features have been learnt on a subset of tasks with the explicit aim of using them on novel tasks was Intrator and Edelman (1996), in which a low-dimensional representation was learnt for a set of multiple related image-recognition tasks and then used to successfully learn novel tasks of the same kind. The experiments reported in Baxter (1995a, chapter 4), Baxter (1995b), and Baxter and Bartlett (1998) are also of this nature.
 
* Bias learning in Inductive Logic Programming (ILP). Predicate invention refers to the process in ILP whereby new predicates thought to be useful for the classification task at hand are added to the learner's domain knowledge. By using the new predicates as background domain knowledge when learning novel tasks, predicate invention may be viewed as a form of inductive bias learning. Preliminary results with this approach on a chess domain are reported in Khan, Muggleton, and Parson (1998).
* Improving performance on a fixed reference task. "Multi-task learning" (Caruana, 1997) trains extra neural network outputs to match related tasks in order to improve generalization performance on a fixed reference task. Although this approach does not explicitly identify the extra bias generated by the related tasks in a way that can be used to learn novel tasks, it is an example of exploiting the bias provided by a set of related tasks to improve generalization performance. Other similar approaches include Suddarth and Kergosien (1990), Suddarth and Holden (1991), and Abu-Mostafa (1993).
 
* Bias as computational complexity. In this paper we consider inductive bias from a sample-complexity perspective: how does the learnt bias decrease the number of examples required of novel tasks for good generalization? A natural alternative line of enquiry is how the running time or computational complexity of a learning algorithm may be improved by training on related tasks. Some early algorithms for neural networks in this vein are contained in Sharkey and Sharkey (1993) and Pratt (1992).
* Reinforcement learning. Many control tasks can appropriately be viewed as elements of sets of related tasks, such as learning to navigate to different goal states, or learning a set of complex motor control tasks. A number of papers in the reinforcement learning literature have proposed algorithms both for sharing the information in related tasks to improve average generalization performance across those tasks (Singh, 1992; Ring, 1995), and for learning bias from a set of tasks to improve performance on future tasks (Sutton, 1992; Thrun & Schwartz, 1995).
1.2 Overview of the Paper

In Section 2 the bias learning model is formally defined, and the main sample complexity results are given, showing the utility of learning multiple related tasks and the feasibility of bias learning. These results show that the sample complexity is controlled by the size of certain covering numbers associated with the set of all hypothesis spaces available to the bias learner, in much the same way as the sample complexity in learning Boolean functions is controlled by the Vapnik-Chervonenkis dimension (Vapnik, 1982; Blumer et al., 1989). The results of Section 2 are upper bounds on the sample complexity required for good generalization when learning multiple tasks and when learning inductive bias.

The general results of Section 2 are specialized to the case of feature learning with neural networks in Section 3, where an algorithm for training features by gradient descent is also presented. For this special case we are able to show matching lower bounds for the sample complexity of multiple task learning. In Section 4 we present some concluding remarks and directions for future research. Many of the proofs are quite lengthy and have been moved to the appendices so as not to interrupt the flow of the main text. The following table contains a glossary of the mathematical symbols used in the paper.
Glossary of symbols (in order of first appearance):

X : input space
Y : output space
P : distribution on X × Y (a learning task)
l : loss function
H : hypothesis space
h : hypothesis
er_P(h) : error of hypothesis h on distribution P
z : training set
A : learning algorithm
êr_z(h) : empirical error of h on training set z
𝒫 : set of all learning tasks P
Q : distribution over learning tasks
ℍ : family of hypothesis spaces
er_Q(H) : loss of hypothesis space H on environment Q
z : (n, m)-sample
êr_z(H) : empirical loss of H on z
A : bias learning algorithm
h_l : function induced by h and l
H_l : set of all h_l
(h_1, ..., h_n)_l : average of h_{1,l}, ..., h_{n,l}
h_l : same as (h_1, ..., h_n)_l
H^n_l : set of all (h_1, ..., h_n)_l
ℍ^n_l : set of all H^n_l
H* : function induced by H on probability distributions
ℍ* : set of all H*
d_P : pseudo-metric on ℍ^n_l
d_Q : pseudo-metric on ℍ*
N(ε, ℍ*, d_Q) : covering number of ℍ*
C(ε, ℍ*) : capacity of ℍ*
N(ε, ℍ^n_l, d_P) : covering number of ℍ^n_l
C(ε, ℍ^n_l) : capacity of ℍ^n_l
h : sequence of n hypotheses (h_1, ..., h_n)
P : sequence of n distributions (P_1, ..., P_n)
er_P(h) : average loss of h on P
êr_z(h) : average loss of h on z
F : set of feature maps
G : output class composed with feature maps f
G ∘ f : hypothesis space associated with f
G_l : loss function class associated with G
N(ε, G_l, d_P) : covering number of G_l
C(ε, G_l) : capacity of G_l
d_[P,G_l](f, f') : pseudo-metric on feature maps f, f'
N(ε, F, d_[P,G_l]) : covering number of F
C_{G_l}(ε, F) : capacity of F
H_w : neural network hypothesis space
H|_x : H restricted to the vector x
Π_H(m) : growth function of H
VCdim(H) : Vapnik-Chervonenkis dimension of H
H|_x : H restricted to the matrix x
ℍ|_x : ℍ restricted to the matrix x
Π_ℍ(n, m) : growth function of ℍ
d_ℍ(n) : dimension function of ℍ
d(ℍ) : upper and lower dimension functions of ℍ
opt_P(ℍ_n) : optimal performance of ℍ_n on P
d : a metric on R^+
h_1 ⊕ ··· ⊕ h_n : average of h_1, ..., h_n
H_1 ⊕ ··· ⊕ H_n : set of all h_1 ⊕ ··· ⊕ h_n
z_σ : z permuted by a permutation σ of integer pairs
d_z(h, h') : empirical l_1 metric on functions h
êr_P(H) : optimal average error of H on P

2. The Bias Learning Model

In this section the bias learning model is formally introduced. To motivate the definitions, we first describe the main features of ordinary (single-task) supervised learning models.

2.1 Single-Task Learning

Computational learning theory models of supervised learning usually include the following ingredients:
 
* an input space X and an output space Y,

* a probability distribution P on X × Y,

* a loss function l : Y × Y → R, and

* a hypothesis space H, which is a set of hypotheses or functions h : X → Y.

As an example, if the problem is to learn to recognize images of Mary's face using a neural network, then X would be the set of all images (typically represented as a subset of R^d where each component is a pixel intensity), Y would be the set {0, 1}, and the distribution P would be peaked over images of different faces and the correct class labels. The learner's hypothesis space H would be a class of neural networks mapping the input space R^d to {0, 1}. The loss in this case would be discrete loss:

    l(y, y') := 1 if y ≠ y', and l(y, y') := 0 if y = y'.    (1)

Using the loss function allows us to present a unified treatment of both pattern recognition (Y = {0, 1}, l as above) and real-valued function learning (e.g., regression), in which Y = R and usually l(y, y') = (y − y')^2.
The goal of the learner is to select a hypothesis h ∈ H with minimum expected loss:

    er_P(h) := ∫_{X×Y} l(h(x), y) dP(x, y).    (2)

Of course, the learner does not know P and so it cannot search through H for an h minimizing er_P(h). In practice, the learner samples repeatedly from X × Y according to the distribution P to generate a training set

    z := {(x_1, y_1), ..., (x_m, y_m)}.    (3)

Based on the information contained in z the learner produces a hypothesis h ∈ H. Hence, in general, a learner is simply a map A from the set of all training samples to the hypothesis space H:

    A : ∪_{m>0} (X × Y)^m → H

(stochastic learners can be treated by assuming a distribution-valued A).
Many algorithms seek to minimize the empirical loss of h on z, defined by:

    êr_z(h) := (1/m) Σ_{i=1}^m l(h(x_i), y_i).    (4)
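For concreteness, the following is a minimal Python sketch of empirical-loss minimization over a finite hypothesis space, using the discrete loss of equation (1); the threshold hypotheses and the four-point training set are purely illustrative.

```python
import numpy as np

def empirical_loss(h, z, loss):
    """Empirical loss (4): average loss of hypothesis h on training set z."""
    return np.mean([loss(h(x), y) for x, y in z])

def erm(H, z, loss):
    """Return the hypothesis in a finite class H minimizing empirical loss on z."""
    return min(H, key=lambda h: empirical_loss(h, z, loss))

# Discrete loss of equation (1).
discrete_loss = lambda yhat, y: 0.0 if yhat == y else 1.0

# Illustrative data and a small hypothesis class of threshold functions.
z = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
H = [lambda x, t=t: int(x > t) for t in (0.25, 0.5, 0.75)]
h_best = erm(H, z, discrete_loss)
print(empirical_loss(h_best, z, discrete_loss))  # 0.0 for the threshold at 0.5
```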
Of course, there are more intelligent things to do with the data than simply minimizing empirical error; for example, one can add regularisation terms to avoid over-fitting. However the learner chooses its hypothesis h, if we have a uniform bound (over all h ∈ H) on the probability of large deviation between êr_z(h) and er_P(h), then we can bound the learner's generalization error er_P(h) as a function of its empirical loss on the training set êr_z(h). Whether such a bound holds depends upon the "richness" of H. The conditions ensuring convergence between êr_z(h) and er_P(h) are by now well understood; for Boolean function learning (Y = {0, 1}, discrete loss), convergence is controlled by the VC-dimension^1 of H:
Theorem 1. Let P be any probability distribution on X × {0, 1} and suppose z = {(x_1, y_1), ..., (x_m, y_m)} is generated by sampling m times from X × {0, 1} according to P. Let d := VCdim(H). Then with probability at least 1 − δ (over the choice of the training set z), all h ∈ H will satisfy

    er_P(h) ≤ êr_z(h) + [ (32/m) ( d log(2em/d) + log(4/δ) ) ]^{1/2}.    (5)

Proofs of this result may be found in Vapnik (1982) and Blumer et al. (1989), and will not be reproduced here.

1. The VC dimension of a class of Boolean functions H is the largest integer d such that there exists a subset S := {x_1, ..., x_d} ⊆ X such that the restriction of H to S contains all 2^d Boolean functions on S.
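For concreteness, the deviation term in equation (5) can be evaluated directly; the short sketch below simply plugs in m, d = VCdim(H) and δ, with illustrative values.

```python
import numpy as np

def vc_deviation(m, d, delta):
    """Deviation term of Theorem 1, eq. (5): sqrt((32/m)(d log(2em/d) + log(4/delta)))."""
    return np.sqrt(32.0 / m * (d * np.log(2 * np.e * m / d) + np.log(4.0 / delta)))

# With probability at least 1 - delta, er_P(h) <= er_z(h) + vc_deviation(m, d, delta).
print(vc_deviation(m=100_000, d=10, delta=0.01))
```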
Theorem 1 only provides conditions under which the deviation between er_P(h) and êr_z(h) is likely to be small; it does not guarantee that the true error er_P(h) will actually be small. This is governed by the choice of H. If H contains a solution with small error and the learner minimizes error on the training set, then with high probability er_P(h) will be small. However, a bad choice of H will mean there is no hope of achieving small error. Thus, the bias of the learner in this model^2 is represented by the choice of hypothesis space H.
2.2 The Bias Learning Model

The main extra assumption of the bias learning model introduced here is that the learner is embedded in an environment of related tasks, and can sample from the environment to generate multiple training sets belonging to multiple different tasks. In the above model of ordinary (single-task) learning, a learning task is represented by a distribution P on X × Y. So in the bias learning model, an environment of learning problems is represented by a pair (𝒫, Q) where 𝒫 is the set of all probability distributions on X × Y (i.e., 𝒫 is the set of all possible learning problems), and Q is a distribution on 𝒫. Q controls which learning problems the learner is likely to see.^3 For example, if the learner is in a face recognition environment, Q will be highly peaked over face-recognition-type problems, whereas if the learner is in a character recognition environment, Q will be peaked over character-recognition-type problems (here, as in the introduction, we view these environments as sets of individual classification problems, rather than as single, multiple-class classification problems).

Recall from the last paragraph of the previous section that the learner's bias is represented by its choice of hypothesis space H. So to enable the learner to learn the bias, we supply it with a family or set of hypothesis spaces ℍ := {H}.
Putting all this together, formally a learning to learn or bias learning problem consists of:

* an input space X and an output space Y (both of which are separable metric spaces),

* a loss function l : Y × Y → R,

* an environment (𝒫, Q) where 𝒫 is the set of all probability distributions on X × Y and Q is a distribution on 𝒫,

* a hypothesis space family ℍ = {H} where each H ∈ ℍ is a set of functions h : X → Y.

From now on we will assume the loss function l has range [0, 1]; equivalently, after rescaling, we assume that l is bounded.

2. The bias is also governed by how the learner uses the hypothesis space. For example, under some circumstances the learner may choose not to use the full power of H (a neural network example is early stopping). For simplicity, in this paper we abstract away from such features of the algorithm A and assume that it uses the entire hypothesis space H.
3. Q's domain is a σ-algebra of subsets of 𝒫. A suitable one for our purposes is the Borel σ-algebra B(𝒫) generated by the topology of weak convergence on 𝒫. If we assume that X and Y are separable metric spaces, then 𝒫 is also a separable metric space in the Prohorov metric (which metrizes the topology of weak convergence) (Parthasarathy, 1967), so there is no problem with the existence of measures on B(𝒫). See Appendix D for further discussion, particularly the proof of part 5 in Lemma 32.
We define the goal of a bias learner to be to find a hypothesis space H ∈ ℍ minimizing the following loss:

    er_Q(H) := ∫_𝒫 inf_{h∈H} er_P(h) dQ(P)    (6)
             = ∫_𝒫 inf_{h∈H} ∫_{X×Y} l(h(x), y) dP(x, y) dQ(P).

The only way er_Q(H) can be small is if, with high Q-probability, H contains a good solution h to any problem P drawn at random according to Q. In this sense er_Q(H) measures how appropriate the bias embodied by H is for the environment (𝒫, Q).

In general the learner will not know Q, so it will not be able to find an H minimizing er_Q(H) directly. However, the learner can sample from the environment in the following way:
 
* Sample n times from 𝒫 according to Q to yield P_1, ..., P_n.

* Sample m times from X × Y according to each P_i to yield z_i = {(x_{i1}, y_{i1}), ..., (x_{im}, y_{im})}.

* The resulting n training sets, henceforth called an (n, m)-sample if they are generated by the above process, are supplied to the learner.

In the sequel, an (n, m)-sample will be denoted by z and written as a matrix:

    z :=   (x_11, y_11)  ...  (x_1m, y_1m)   = z_1
               ...                ...
           (x_n1, y_n1)  ...  (x_nm, y_nm)   = z_n    (7)

An (n, m)-sample is simply n training sets z_1, ..., z_n sampled from n different learning tasks P_1, ..., P_n, where each task is selected according to the environmental probability distribution Q. The size of each training set is kept the same primarily to facilitate the analysis. Based on the information contained in z, the learner must choose a hypothesis space H ∈ ℍ.
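The two-stage sampling process above can be sketched directly. In the toy Python sketch below, Q is a distribution over threshold tasks on X = [0, 1]; the environment and all numerical choices are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Draw a task P from Q: here a random threshold t defines P(y = 1 | x) = 1[x > t]."""
    t = rng.uniform(0.2, 0.8)
    def sample_example():
        x = rng.uniform(0.0, 1.0)
        return x, int(x > t)
    return sample_example

def sample_nm(n, m):
    """Generate an (n, m)-sample z: n training sets of m examples, one per task P_i ~ Q."""
    tasks = [sample_task() for _ in range(n)]
    return [[task() for _ in range(m)] for task in tasks]

z = sample_nm(n=5, m=20)   # z[i] is the training set z_i for task P_i
```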
One way to do this would be for the learner to find an H minimizing the empirical loss on z, defined by:

    êr_z(H) := (1/n) Σ_{i=1}^n inf_{h∈H} êr_{z_i}(h).    (8)

Note that êr_z(H) is simply the average of the best possible empirical error achievable on each training set z_i using a function from H. It is a biased estimate of er_Q(H). An unbiased estimate of er_Q(H) would require choosing an H with minimal average error over the n distributions P_1, ..., P_n, where this is defined by (1/n) Σ_{i=1}^n inf_{h∈H} er_{P_i}(h).
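Continuing the toy sketch, êr_z(H) of equation (8) can be computed exactly for a finite family of finite hypothesis spaces. The family of scaled-threshold spaces below is purely illustrative; it serves only to show the two nested minimizations (over h within each H, and over H within ℍ).

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy (n, m)-sample as in the previous sketch: n threshold tasks, m examples each.
thresholds = rng.uniform(0.2, 0.8, size=5)
z = [[(x, int(x > t)) for x in rng.uniform(0, 1, size=20)] for t in thresholds]

discrete_loss = lambda yhat, y: 0.0 if yhat == y else 1.0

def best_empirical_error(H, z_i):
    """inf_{h in H} of the empirical error of h on z_i (inner term of eq. (8))."""
    return min(np.mean([discrete_loss(h(x), y) for x, y in z_i]) for h in H)

def empirical_loss_family(H, z):
    """er_z(H), eq. (8): average over the n training sets of the best empirical error in H."""
    return np.mean([best_empirical_error(H, z_i) for z_i in z])

# Illustrative family of hypothesis spaces: each space shares a "feature" (an input scaling)
# and contains threshold classifiers on that feature.
def make_space(scale):
    return [lambda x, t=t, s=scale: int(s * x > t) for t in np.linspace(0.1, 0.9, 9)]

family = {s: make_space(s) for s in (0.5, 1.0, 2.0)}
best_bias = min(family, key=lambda s: empirical_loss_family(family[s], z))
print(best_bias, empirical_loss_family(family[best_bias], z))
```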
As with ordinary learning, it is likely that there are more intelligent things to do with the training data z than minimizing (8). Denoting the set of all (n, m)-samples by (X × Y)^{(n,m)}, a general "bias learner" is a map A that takes (n, m)-samples as input and produces hypothesis spaces H ∈ ℍ as output:

    A : ∪_{n>0, m>0} (X × Y)^{(n,m)} → ℍ.    (9)

(As stated, A is a deterministic bias learner; however, it is trivial to extend our results to stochastic learners.)

Note that in this paper we are concerned only with the sample complexity properties of a bias learner A; we do not discuss issues of the computability of A.
Since A is searching for entire hypothesis spaces H within a family of such hypothesis spaces ℍ, there is an extra representational question in our model of bias learning that is not present in ordinary learning, namely how the family ℍ is represented and searched by A. We defer this discussion until Section 2.5, after the main sample complexity results for this model of bias learning have been introduced. For the specific case of learning a set of features suitable for an environment of related learning problems, see Section 3.

Regardless of how the learner chooses its hypothesis space H, if we have a uniform bound (over all H ∈ ℍ) on the probability of large deviation between êr_z(H) and er_Q(H), and we can compute an upper bound on êr_z(H), then we can bound the bias learner's "generalization error" er_Q(H). With this view, the question of generalization within our bias learning model becomes: how many tasks (n) and how many examples of each task (m) are required to ensure that êr_z(H) and er_Q(H) are close with high probability, uniformly over all H ∈ ℍ? Or, informally, how many tasks and how many examples of each task are required to ensure that a hypothesis space with good solutions to all the training tasks will contain good solutions to novel tasks drawn from the same environment?

It turns out that this kind of uniform convergence for bias learning is controlled by the "size" of certain function classes derived from the hypothesis space family ℍ, in much the same way as the VC-dimension of a hypothesis space H controls uniform convergence in the case of Boolean function learning (Theorem 1). These "size" measures and other auxiliary definitions needed to state the main theorem are introduced in the following subsection.
2.3 Covering Numbers

Definition 1. For any hypothesis h : X → Y, define h_l : X × Y → [0, 1] by

    h_l(x, y) := l(h(x), y).    (10)

For any hypothesis space H in the hypothesis space family ℍ, define

    H_l := {h_l : h ∈ H}.    (11)

For any sequence of n hypotheses (h_1, ..., h_n), define (h_1, ..., h_n)_l : (X × Y)^n → [0, 1] by

    (h_1, ..., h_n)_l (x_1, y_1, ..., x_n, y_n) := (1/n) Σ_{i=1}^n l(h_i(x_i), y_i).    (12)

We will also use h_l to denote (h_1, ..., h_n)_l. For any H in the hypothesis space family ℍ, define

    H^n_l := {(h_1, ..., h_n)_l : h_1, ..., h_n ∈ H}.    (13)

Define

    ℍ^n_l := ∪_{H ∈ ℍ} H^n_l.    (14)

In the first part of the definition above, hypotheses h : X → Y are turned into functions h_l mapping X × Y → [0, 1] by composition with the loss function. H_l is then just the collection of all such functions where the original hypotheses come from H. H_l is often called a loss-function class. In our case we are interested in the average loss across n tasks, where each of the n hypotheses is chosen from a fixed hypothesis space H. This motivates the definitions of h_l and H^n_l. Finally, ℍ^n_l is the collection of all (h_1, ..., h_n)_l, with the restriction that all of h_1, ..., h_n belong to a single hypothesis space H ∈ ℍ.
Definition 2. For each H ∈ ℍ, define H* : 𝒫 → [0, 1] by

    H*(P) := inf_{h∈H} er_P(h).    (15)

For the hypothesis space family ℍ, define

    ℍ* := {H* : H ∈ ℍ}.    (16)

It is the "size" of ℍ^n_l and ℍ* that controls how large the (n, m)-sample z must be to ensure that êr_z(H) and er_Q(H) are close uniformly over all H ∈ ℍ. Their size will be defined in terms of certain covering numbers, and for this we need to define how to measure the distance between elements of ℍ^n_l and also between elements of ℍ*.
Definition 3. Let P = (P_1, ..., P_n) be any sequence of n probability distributions on X × Y. For any h_l, h'_l ∈ ℍ^n_l, define

    d_P(h_l, h'_l) := ∫_{(X×Y)^n} |h_l(x_1, y_1, ..., x_n, y_n) − h'_l(x_1, y_1, ..., x_n, y_n)| dP_1(x_1, y_1) ... dP_n(x_n, y_n).    (17)

Similarly, for any distribution Q on 𝒫 and any H*_1, H*_2 ∈ ℍ*, define

    d_Q(H*_1, H*_2) := ∫_𝒫 |H*_1(P) − H*_2(P)| dQ(P).    (18)

It is easily verified that d_P and d_Q are pseudo-metrics^4 on ℍ^n_l and ℍ* respectively.
Definition 4. An ε-cover of (ℍ*, d_Q) is a set {H*_1, ..., H*_N} such that for all H* ∈ ℍ*, d_Q(H*, H*_i) ≤ ε for some i = 1, ..., N. Note that we do not require the H*_i to be contained in ℍ*, just that they be measurable functions on 𝒫. Let N(ε, ℍ*, d_Q) denote the size of the smallest such cover. Define the capacity of ℍ* by

    C(ε, ℍ*) := sup_Q N(ε, ℍ*, d_Q)    (19)

where the supremum is over all probability measures on 𝒫. N(ε, ℍ^n_l, d_P) is defined in a similar way, using d_P in place of d_Q. Define the capacity of ℍ^n_l by:

    C(ε, ℍ^n_l) := sup_P N(ε, ℍ^n_l, d_P)    (20)

where now the supremum is over all sequences of n probability measures on X × Y.

4. A pseudo-metric d is a metric without the condition that d(x, y) = 0 ⇒ x = y.
2.4 Uniform Convergence for Bias Learners

Now we have enough machinery to state the main theorem. In the theorem the hypothesis space family is required to be permissible. Permissibility is discussed in detail in Appendix D, but note that it is a weak measure-theoretic condition satisfied by almost all "real-world" hypothesis space families. All logarithms are to base e.

Theorem 2. Suppose X and Y are separable metric spaces and let Q be any probability distribution on 𝒫, the set of all distributions on X × Y. Suppose z is an (n, m)-sample generated by sampling n times from 𝒫 according to Q to give P_1, ..., P_n, and then sampling m times from each P_i to generate z_i = {(x_{i1}, y_{i1}), ..., (x_{im}, y_{im})}, i = 1, ..., n. Let ℍ = {H} be any permissible hypothesis space family. If the number of tasks n satisfies

    n ≥ max{ (256/ε²) log( 8 C(ε/32, ℍ*) / δ ), 64/ε² },    (21)

and the number of examples m of each task satisfies

    m ≥ max{ (256/(n ε²)) log( 8 C(ε/32, ℍ^n_l) / δ ), 64/ε² },    (22)

then with probability at least 1 − δ (over the (n, m)-sample z), all H ∈ ℍ will satisfy

    er_Q(H) ≤ êr_z(H) + ε.    (23)
Proof. See Appendix A.

There are several important points to note about Theorem 2:

1. Provided the capacities C(ε, ℍ*) and C(ε, ℍ^n_l) are finite, the theorem shows that any bias learner that selects hypothesis spaces from ℍ can bound its generalisation error er_Q(H) in terms of êr_z(H) for sufficiently large (n, m)-samples z. Most bias learners will not find the exact value of êr_z(H) because it involves finding the smallest error of any hypothesis h ∈ H on each of the n training sets in z. But any upper bound on êr_z(H) (found, for example, by gradient descent on some error function) will still give an upper bound on er_Q(H). See Section 3.3.1 for a brief discussion of how this can be achieved in a feature learning setting.

2. In order to learn bias (in the sense that er_Q(H) and êr_z(H) are close uniformly over all H ∈ ℍ), both the number of tasks n and the number of examples m of each task must be sufficiently large. This is intuitively reasonable because the bias learner must see both sufficiently many tasks to be confident of the nature of the environment, and sufficiently many examples of each task to be confident of the nature of each task.

3. Once the learner has found an H ∈ ℍ with a small value of êr_z(H), it can then use H to learn novel tasks P drawn according to Q. One then has the following theorem bounding the sample complexity required for good generalisation when learning with H (the proof is very similar to the proof of the bound on m in Theorem 2).
Theorem 3. Let z = {(x_1, y_1), ..., (x_m, y_m)} be a training set generated by sampling from X × Y according to some distribution P. Let H be a permissible hypothesis space. For all ε, δ with 0 < ε, δ < 1, if the number of training examples m satisfies

    m ≥ max{ (64/ε²) log( 4 C(ε/16, H_l) / δ ), 16/ε² },    (24)

then with probability at least 1 − δ, all h ∈ H will satisfy

    er_P(h) ≤ êr_z(h) + ε.

The capacity C(ε, H_l) appearing in equation (24) is defined in an analogous fashion to the capacities in Definition 4 (we just use the pseudo-metric d_P(h_l, h'_l) := ∫_{X×Y} |h_l(x, y) − h'_l(x, y)| dP(x, y)). The important thing to note about Theorem 3 is that the number of examples required for good generalisation when learning novel tasks is proportional to the logarithm of the capacity of the learnt hypothesis space H. In contrast, if the learner does not do any bias learning, it will have no reason to select one hypothesis space H ∈ ℍ over any other, and consequently it would have to view as a candidate solution any hypothesis in any of the hypothesis spaces H ∈ ℍ. Thus, its sample complexity will be proportional to the capacity of ∪_{H∈ℍ} H_l = ℍ^1_l, which in general will be considerably larger than the capacity of any individual H ∈ ℍ. So by learning H the learner has learnt to learn in the environment (𝒫, Q), in the sense that it needs far smaller training sets to learn novel tasks.
4. Having learnt a hypothesis space H with a small value of êr_z(H), Theorem 2 tells us that with probability at least 1 − δ, the expected value of inf_{h∈H} er_P(h) on a novel task P will be less than êr_z(H) + ε. Of course, this does not rule out really bad performance on some tasks P. However, the probability of generating such "bad" tasks can be bounded. In particular, note that er_Q(H) is just the expected value of the function H* over 𝒫, and so by Markov's inequality, for γ > 0,

    Pr{ P : inf_{h∈H} er_P(h) ≥ γ } = Pr{ P : H*(P) ≥ γ } ≤ E_Q H* / γ = er_Q(H) / γ ≤ (êr_z(H) + ε) / γ

(with probability 1 − δ).
5. Keeping the accuracy and confidence parameters ε, δ fixed, note that the number of examples required of each task for good generalisation obeys

    m = O( (1/n) log C(ε, ℍ^n_l) ).    (25)

So provided log C(ε, ℍ^n_l) increases sublinearly with n, the upper bound on the number of examples required of each task will decrease as the number of tasks increases. This shows that for suitably constructed hypothesis space families it is possible to share information between tasks; a numerical illustration follows this list. This is discussed further after Theorem 4 below.
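To make points 2 and 5 concrete, the sketch below evaluates the bounds of equations (21) and (22) for assumed values of the log-capacities log C(ε/32, ℍ*) and log C(ε/32, ℍ^n_l); the capacities are not computed here, so the numbers are stand-ins chosen only to show the scaling with n.

```python
import numpy as np

def n_bound(eps, delta, logC_star):
    """Lower bound on the number of tasks n from eq. (21)."""
    return max(256.0 / eps**2 * (logC_star + np.log(8.0 / delta)), 64.0 / eps**2)

def m_bound(eps, delta, n, logC_nl):
    """Lower bound on the number of examples per task m from eq. (22)."""
    return max(256.0 / (n * eps**2) * (logC_nl + np.log(8.0 / delta)), 64.0 / eps**2)

# If log C(eps/32, H^n_l) grows sublinearly in n (here, roughly a + b*n with small b),
# the examples-per-task bound decreases towards its minimum as n grows.
for n in (1, 10, 100):
    print(n, m_bound(eps=0.1, delta=0.01, n=n, logC_nl=1000.0 + 5.0 * n))
```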
2.5 Choosing the Hypothesis Space Family ℍ

Theorem 2 only provides conditions under which êr_z(H) and er_Q(H) are close; it does not guarantee that er_Q(H) is actually small. This is governed by the choice of ℍ. If ℍ contains a hypothesis space H with a small value of er_Q(H) and the learner is able to find an H ∈ ℍ minimizing error on the (n, m)-sample z (i.e., minimizing êr_z(H)), then, for sufficiently large n and m, Theorem 2 ensures that with high probability er_Q(H) will be small. However, a bad choice of ℍ will mean there is no hope of finding an H with small error. In this sense the choice of ℍ represents the hyper-bias of the learner.

Note that from a sample complexity point of view, the optimal hypothesis space family to choose is one containing a single, minimal hypothesis space H that contains good solutions to all of the problems in the environment (or at least to a set of problems with high Q-probability), and no more. For then there is no bias learning to do (because there is no choice to be made between hypothesis spaces), the output of the bias learning algorithm is guaranteed to be a good hypothesis space for the environment, and, since the hypothesis space is minimal, learning any problem within the environment using H will require the smallest possible number of examples. However, this scenario is analogous to the trivial scenario in ordinary learning in which the learner's hypothesis space contains a single, optimal hypothesis for the problem being learnt. In that case there is no learning to be done, just as there is no bias learning to be done if the correct hypothesis space is already known.

At the other extreme, if ℍ contains a single hypothesis space H consisting of all possible functions from X → Y, then bias learning is impossible because the bias learner cannot produce a restricted hypothesis space as output, and hence cannot produce a hypothesis space with improved sample complexity requirements on as yet unseen tasks.

Focussing on these two extremes highlights the minimal requirements on ℍ for successful bias learning to occur: the hypothesis spaces H ∈ ℍ must be strictly smaller than the space of all functions X → Y, but not so small or so "skewed" that none of them contain good solutions to a large majority of the problems in the environment.

It may seem that we have simply replaced the problem of selecting the right bias (i.e., selecting the right hypothesis space H) with the equally difficult problem of selecting the right hyper-bias (i.e., the right hypothesis space family ℍ). However, in many cases selecting the right hyper-bias is far easier than selecting the right bias. For example, in Section 3 we will see how the feature selection problem may be viewed as a bias selection problem. Selecting the right features can be extremely difficult if one knows little about the environment, with intelligent trial and error typically the best one can do. However, in a bias learning scenario, one only has to specify that a set of features should exist, find a loosely parameterised set of features (for example, neural networks), and then learn the features by sampling from multiple related tasks.
2.6 Learning Multiple Tasks

It may be that the learner is not interested in learning to learn, but just wants to learn a fixed set of n tasks from the environment (𝒫, Q). As in the previous section, we assume the learner starts out with a hypothesis space family ℍ, and also that it receives an (n, m)-sample z generated from the n distributions P_1, ..., P_n. This time, however, the learner is simply looking for n hypotheses (h_1, ..., h_n), all contained in the same hypothesis space H, such that the average generalization error of the n hypotheses is minimal. Denoting (h_1, ..., h_n) by h and writing P = (P_1, ..., P_n), this error is given by:

    er_P(h) := (1/n) Σ_{i=1}^n er_{P_i}(h_i)    (26)
             = (1/n) Σ_{i=1}^n ∫_{X×Y} l(h_i(x), y) dP_i(x, y),

and the empirical loss of h on z is

    êr_z(h) := (1/n) Σ_{i=1}^n êr_{z_i}(h_i)    (27)
             = (1/n) Σ_{i=1}^n (1/m) Σ_{j=1}^m l(h_i(x_{ij}), y_{ij}).
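A minimal sketch of the multi-task empirical loss (27), assuming the n hypotheses are supplied as a list aligned with the n training sets of the (n, m)-sample:

```python
import numpy as np

def multitask_empirical_loss(h, z, loss):
    """Eq. (27): average over the n tasks of hypothesis h_i's empirical loss on its own z_i."""
    return np.mean([np.mean([loss(h_i(x), y) for x, y in z_i])
                    for h_i, z_i in zip(h, z)])
```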
As before, regardless of how the learner chooses (h_1, ..., h_n), if we can prove a uniform bound on the probability of large deviation between êr_z(h) and er_P(h), then any (h_1, ..., h_n) that perform well on the training sets z will, with high probability, perform well on future examples of the same tasks.

Theorem 4. Let P = (P_1, ..., P_n) be n probability distributions on X × Y and let z be an (n, m)-sample generated by sampling m times from X × Y according to each P_i. Let ℍ = {H} be any permissible hypothesis space family. If the number of examples m of each task satisfies

    m ≥ max{ (64/(n ε²)) log( 4 C(ε/16, ℍ^n_l) / δ ), 16/ε² },    (28)

then with probability at least 1 − δ (over the choice of z), any h ∈ H^n will satisfy

    er_P(h) ≤ êr_z(h) + ε    (29)

(recall Definition 4 for the meaning of C(ε, ℍ^n_l)).
Proof. Omitted (follow the proof of the bound on m in Theorem 2).

The bound on m in Theorem 4 is virtually identical to the bound on m in Theorem 2, and note again that it depends inversely on the number of tasks n (assuming that the first part of the "max" expression is the dominant one). Whether this helps depends on the rate of growth of C(ε/16, ℍ^n_l) as a function of n. The following lemma shows that this growth is always small enough to ensure that we never do worse by learning multiple tasks (at least in terms of the upper bound on the number of examples required per task).

Lemma 5. For any hypothesis space family ℍ,

    C(ε, ℍ^1_l) ≤ C(ε, ℍ^n_l) ≤ C(ε, ℍ^1_l)^n.    (30)
Proof. Let K denote the set of all functions (h_1, ..., h_n)_l where each h_i can be a member of any hypothesis space H ∈ ℍ (recall Definition 1). Then ℍ^n_l ⊆ K and so C(ε, ℍ^n_l) ≤ C(ε, K). By Lemma 29 in Appendix B, C(ε, K) ≤ C(ε, ℍ^1_l)^n, and so the right-hand inequality follows.

For the first inequality, let P be any probability measure on X × Y and let P' be the measure on (X × Y)^n obtained by using P on the first copy of X × Y in the product, and ignoring all other elements of the product. Let N be an ε-cover for (ℍ^n_l, d_{P'}). Pick any h_l ∈ ℍ^1_l and let (g_1, ..., g_n)_l ∈ N be such that d_{P'}((h, h, ..., h)_l, (g_1, ..., g_n)_l) ≤ ε. But by construction, d_{P'}((h, h, ..., h)_l, (g_1, ..., g_n)_l) = d_P(h_l, (g_1)_l), which establishes the first inequality.

By Lemma 5,

    log C(ε, ℍ^1_l) ≤ log C(ε, ℍ^n_l) ≤ n log C(ε, ℍ^1_l).    (31)

So keeping the accuracy parameters ε and δ fixed, and plugging (31) into (28), we see that the upper bound on the number of examples required of each task never increases with the number of tasks, and at best decreases as O(1/n). Although this is only an upper bound, it provides a strong hint that learning multiple related tasks should be advantageous on a "number of examples required per task" basis. In Section 3 it will be shown that for feature learning all types of behavior are possible, from no advantage at all to an O(1/n) decrease.
2.7 Dependence on ε

In Theorems 2, 3 and 4 the bounds on sample complexity all scale as 1/ε². This behavior can be improved to 1/ε if the empirical loss is always guaranteed to be zero (i.e., we are in the realizable case). The same behavior results if we are interested in the relative deviation between empirical and true loss, rather than the absolute deviation. Formal theorems along these lines are stated in Appendix A.3.
3. Feature Learning

The use of restricted feature sets is nearly ubiquitous as a method of encoding bias in many areas of machine learning and statistics, including classification, regression and density estimation.

In this section we show how the problem of choosing a set of features for an environment of related tasks can be recast as a bias learning problem. Explicit bounds on C(ε, ℍ*) and C(ε, ℍ^n_l) are calculated for general feature classes in Section 3.2. These bounds are applied to the problem of learning a neural network feature set in Section 3.3.
3.1 The Feature Learning Model

Consider the following quote from Vapnik (1996):

    The classical approach to estimating multidimensional functional dependencies is based on the following belief:

    Real-life problems are such that there exists a small number of "strong features," simple functions of which (say linear combinations) approximate well the unknown function. Therefore, it is necessary to carefully choose a low-dimensional feature space and then to use regular statistical techniques to construct an approximation.
In general a set of "strong features" may be viewed as a function f : X → V mapping the input space X into some (typically lower-dimensional) space V. Let F = {f} be a set of such feature maps (each f may be viewed as a set of features (f_1, ..., f_k) if V = R^k). It is the f that must be "carefully chosen" in the above quote. In general, the "simple functions of the features" may be represented as a class of functions G mapping V to Y. If for each f ∈ F we define the hypothesis space G ∘ f := {g ∘ f : g ∈ G}, then we have the hypothesis space family

    ℍ := {G ∘ f : f ∈ F}.    (32)

Now the problem of "carefully choosing" the right features f is equivalent to the bias learning problem "find the right hypothesis space H ∈ ℍ". Hence, provided the learner is embedded within an environment of related tasks, and the capacities C(ε, ℍ*) and C(ε, ℍ^n_l) are finite, Theorem 2 tells us that the feature set f can be learnt rather than carefully chosen. This represents an important simplification, as choosing a set of features is often the most difficult part of any machine learning problem.

In Section 3.2 we give a theorem bounding C(ε, ℍ*) and C(ε, ℍ^n_l) for general feature classes. The theorem is specialized to neural network classes in Section 3.3.

Note that we have forced the function class G to be the same for all feature maps f, although this is not necessary. Indeed, variants of the results to follow can be obtained if G is allowed to vary with f.
3.2 Capacity Bounds for General Feature Classes

Notationally it is easier to view the feature maps f as mapping from X × Y to V × Y by (x, y) ↦ (f(x), y), and also to absorb the loss function l into the definition of G by viewing each g ∈ G as a map from V × Y into [0, 1] via (v, y) ↦ l(g(v), y). Previously this latter function would have been denoted g_l, but in what follows we will drop the subscript l where this does not cause confusion. The class to which g_l belongs will still be denoted by G_l.

With the above definitions let G_l ∘ F := {g ∘ f : g ∈ G_l, f ∈ F}. Define the capacity of G_l in the usual way,

    C(ε, G_l) := sup_P N(ε, G_l, d_P),

where the supremum is over all probability measures on V × Y, and d_P(g, g') := ∫_{V×Y} |g(v, y) − g'(v, y)| dP(v, y). To define the capacity of F, we first define a pseudo-metric d_[P,G_l] on F by "pulling back" the L^1 metric on R through G_l as follows:

    d_[P,G_l](f, f') := ∫_{X×Y} sup_{g∈G_l} |g ∘ f(x, y) − g ∘ f'(x, y)| dP(x, y).    (33)

It is easily verified that d_[P,G_l] is a pseudo-metric. Note that for d_[P,G_l] to be well defined, the supremum over G_l in the integrand must be measurable. This is guaranteed if the hypothesis space family ℍ = {G_l ∘ f : f ∈ F} is permissible (Lemma 32, part 4). Now define N(ε, F, d_[P,G_l]) to be the size of the smallest ε-cover of the pseudo-metric space (F, d_[P,G_l]), and define the ε-capacity of F (with respect to G_l) as

    C_{G_l}(ε, F) := sup_P N(ε, F, d_[P,G_l]),

where the supremum is over all probability measures on X × Y. Now we can state the main theorem of this section.
Theorem 6. Let ℍ be a hypothesis space family as in equation (32). Then for all ε, ε_1, ε_2 > 0 with ε = ε_1 + ε_2,

    C(ε, ℍ^n_l) ≤ C(ε_1, G_l)^n C_{G_l}(ε_2, F)    (34)
    C(ε, ℍ*) ≤ C_{G_l}(ε, F).    (35)

Proof. See Appendix B.

3.3 Learning Neural Network Features
In general, a set of features may be viewed as a map from the (typically high-dimensional) input space R^d to a much smaller dimensional space R^k (k ≪ d). In this section we consider approximating such a feature map by a one-hidden-layer neural network with d input nodes and k output nodes (Figure 1). We denote the set of all such feature maps by {Φ_w = (φ_{w,1}, ..., φ_{w,k}) : w ∈ D} where D is a bounded subset of R^W (W is the number of weights (parameters) in the first two layers). This set is the F of the previous section. Each feature φ_{w,i} : R^d → [0, 1], i = 1, ..., k, is defined by

    φ_{w,i}(x) := σ( Σ_{j=1}^l v_{ij} h_j(x) + v_{i,l+1} ),    (36)

where h_j(x) is the output of the j-th node in the first hidden layer, (v_{i1}, ..., v_{i,l+1}) are the output node parameters for the i-th feature, and σ is a "sigmoid" squashing function σ : R → [0, 1]. Each first-layer hidden node h_i : R^d → R, i = 1, ..., l, computes

    h_i(x) := σ( Σ_{j=1}^d u_{ij} x_j + u_{i,d+1} ),    (37)

where (u_{i1}, ..., u_{i,d+1}) are the hidden node's parameters. We assume σ is Lipschitz.^5 The weight vector for the entire feature map is thus

    w = (u_{11}, ..., u_{1,d+1}, ..., u_{l1}, ..., u_{l,d+1}, v_{11}, ..., v_{1,l+1}, ..., v_{k1}, ..., v_{k,l+1}),

and the total number of feature parameters is W = l(d + 1) + k(l + 1).
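A minimal NumPy sketch of the feature map of equations (36) and (37); the logistic squashing function and the random parameter values are illustrative choices consistent with the assumptions above (σ Lipschitz with range [0, 1], bounded weights).

```python
import numpy as np

def sigma(a):
    """A Lipschitz 'sigmoid' squashing function with range [0, 1]."""
    return 1.0 / (1.0 + np.exp(-a))

def feature_map(x, U, V):
    """Phi_w(x) = (phi_{w,1}(x), ..., phi_{w,k}(x)), eqs. (36)-(37).
    U has shape (l, d+1): hidden-node weights u_ij and biases u_{i,d+1}.
    V has shape (k, l+1): feature output weights v_ij and biases v_{i,l+1}."""
    h = sigma(U[:, :-1] @ x + U[:, -1])     # eq. (37): l first-layer hidden nodes
    return sigma(V[:, :-1] @ h + V[:, -1])  # eq. (36): k features

d, l, k = 8, 4, 2
W = l * (d + 1) + k * (l + 1)               # total number of feature parameters (here 46)
rng = np.random.default_rng(0)
U, V = rng.normal(size=(l, d + 1)), rng.normal(size=(k, l + 1))
print(feature_map(rng.normal(size=d), U, V).shape, W)   # (2,) 46
```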
For argument's sake, assume the "simple functions" of the features (the class G of the previous section) are squashed affine maps using the same sigmoid function σ above (in keeping with the "neural network" flavor of the features). Thus, each setting of the feature weights w generates a hypothesis space:

    H_w := { σ( Σ_{i=1}^k α_i φ_{w,i} + α_{k+1} ) : (α_1, ..., α_{k+1}) ∈ D' },    (38)

where D' is a bounded subset of R^{k+1}. The set of all such hypothesis spaces,

    ℍ := {H_w : w ∈ D},    (39)

is a hypothesis space family. The restrictions on the output-layer weights (α_1, ..., α_{k+1}) and the feature weights w, and the restriction to a Lipschitz squashing function, are needed to obtain finite upper bounds on the covering numbers in Theorem 2.

5. σ is Lipschitz if there exists a constant K such that |σ(x) − σ(x')| ≤ K |x − x'| for all x, x' ∈ R.

Figure 1: Neural network for feature learning. The feature map is implemented by the first two hidden layers. The n output nodes correspond to the n different tasks in the (n, m)-sample z. Each node in the network computes a squashed linear function of the nodes in the previous layer.
 Finding a good set of featuresfor the environment {P ;Q} is equivalent to
002nding a goodhy-
 pothesisspace H
 w 2 H , which in turn means002nding a good set of feature map parametersw.
 As in Theorem 2, the correct setof features may be learnt by 002nding a
hypothesisspace with
 small error on a suf002ciently large {n; m}-samplez. Specializing to squared
loss, in thepresent
 framework the empirical loss ofH
 w
 on z {equation {8}} is given by
 ^er z
 {H
 w } =
 1n
 n
 X
 i=1
 inf
 {ff
 0
 ;ff
 1
 ;:::;ff
 k }2D
 0
 1m
 m
 X
 j=1 "
 033
   k
 X
 l=1 ff
 l
 036 w;l
 {x
ij
 } + ff
0
 !
 000 y ij
 #
 2
 {40}
 Since our sigmoid function033 only has range [0; 1], we also restrict the
outputs Yto this range.
 3.3.1ALGORITHMS FORFINDING A GOODSET OF FEATURES Provided the squashing
function033 is differentiable, gradientdescent {with a small variation on
backpropagation to compute the derivatives} can be used to 002nd feature
weightsw minimizing {40}
 {or at least alocal minimum of {40}}. The only extradif002culty over and
above ordinarygradient
 descent is the appearance of 223inf 224 in the de002nition of^er
 z
 {H
 w
 }. Thesolution is to perform gradient
 descent over both the output parameters {ff 0
 ; : : : ; ff k
 } for each node and thefeature weights w. For
 more detailssee Baxter {1995b} and Baxter {1995a, chapter 4},where empirical
results supporting
 the theoretical results presented here are also given. 168A MODEL OFINDUCTIVE
BIASLEARNING
 3.3.2SAMPLE COMPLEXITY BOUNDS FOR NEURAL-NETWORKFEATURE LEARNING The size
of z ensuring that theresulting features will be good for learning novel
tasks from the same
 environmentisgiven by Theorem 2. All we haveto do is compute the logarithm
of the covering numbers C {"; H n
 l
 } and C{"; H
 003
 }.
 Theorem 7. LetH =
 010
 H w
 : w 2 R W
 011
 be a hypothesisspace family where each H
 w
 is of the form
H
 w
 :=
 { 033
  
 k X
 i=1
 ff i
 036
 w;i {001} + ff
 0
 !
 : {ff 1
 ; : : : ; ff k
 } 2 R
 k
 }
 ;
where 010
 w
 = {036
 w;1
 ;: : : ; 036
 w;k
 } is a neural network withW weights mapping from R d
 to R
 k . If the
 feature weightsw and the output weights ff 0
 ; ff
 1 ; : : : ; ff
 k are bounded, the squashing function033 is
 Lipschitz,l is squaredloss, and the output space Y =[0; 1] {any bounded
subset ofR will do}, then
 there existconstants 024; 024
 0 {independent of "; W andk} such that for all " >0,
 log C {"; H
 n
 l
 } 024 2 {{k +1}n + W } log
 024"
 {41} log C {"; H 003
 } 024 2Wlog
 024
 0"
 {42}
 {recall that we have specialized tosquaredloss here}.
 Proof. SeeAppendix B.Noting that our neuralnetwork hypothesis space family
His permissible, plugging {41} and {42} into Theorem 2 gives the followingtheorem.
 Theorem 8. LetH = fH
 w
 g be a hypothesis space family where each hypothesis space H
 w is a
 set of squashed linear mapscomposed with a neural network feature map, asabove.
Suppose the
 numberof features is k, and the total numberof feature weights is W. Assume
all feature weights and
 output weights are bounded,and the squashing function 033 is Lipschitz.Let
z be an {n; m}-sample
 generated from the environment {P ; Q}. If
 n025 O
 022
 1"
 2
 024
 W log
 1"
 + log
 1ffi
 025 ; {43}
 and
 m 025 O
 022
 1"
 2
 024022
 k + 1 + Wn
 
 log
 1"
 +
 1n
 log 1ffi 025
 {44}
 then withprobability at least 1 000 ffiany H
 w
 2H will satisfy
 er Q
 {H
 w } 024 ^er
z
 {H
 w
 } + ": {45}
 169BAXTER 3.3.3 DISCUSSION
 1. Keeping the accuracy andcon002dence parameters " and ffi002xed, the
upper bound on the number of examples required of each task behaveslike O{k
+ W=n}. If the learner is simply learning n 002xed tasks {rather than learningto
learn}, then the same upper bound also applies{recall
 Theorem 4}.
 2.Note that if we do away with the featuremap altogether then W = 0 and
theupper bound on
 m becomes O{k}, independent of n{apart from the less important ffiterm}.
So in terms of the
 upper bound,learning n tasks becomes just as hard aslearning one task. At
the other extreme, if we 002x the output weights then effectively k = 0
and the number of examples required of
 each task decreases asO{W=n}. Thusa range of behavior in the number of examplesrequired
 of each task is possible:from no improvement at all to anO{1=n} decrease
as the number of tasks n increases {recall the discussionat the end of Section
2.6}.
 3.Once the feature map is learnt {which can beachieved using the techniques
outlined in Baxter,
 1995b; Baxter & Bartlett, 1998; Baxter,1995a, chapter 4}, only the output
weights haveto be
 estimated to learn a novel task.Again keeping the accuracy parameters 002xed,
this requires no
 more that O{k} examples. Thus,as the number of tasks learnt increases, the
upperbound on
 the number of examples required of each task decays to the minimum possible,O{k}.
 4. Ifthe 223small number of strong features224 assumptionis correct, then
k will be small.However,
 typically we will have very little idea of what the features are, soto be
con002dent that the neural
 network is capable of implementing a goodfeature set it will need to be
very large,implying
 W 035 k.O{k+W=n} decreases mostrapidly with increasing n when W035 k,
so at least in terms of the upper bound on the number of examples required
per task, learning small feature sets is an ideal application for biaslearning.
However, the upper bound onthe number of
 tasks does not fare sowell as it scales as O{W}.
 3.3.4 COMPARISON WITHTRADITIONAL MULTIPLE- CL AS SCLASSIFICATION A special
case of this multi-task framework is one in which the marginal distributionon
the input
 space P
 ijX
 is the same for each task i = 1; : : : ; n, and all that varies between
tasks is theconditional
 distribution over the output space Y . An example would be amulti-class
problem such as face
 recognition, in which Y =f1; : : : ; ngwhere n is the number of faces to
berecognized and the
 marginal distribution onX is simply the 223natural224 distributionover
images of those faces. Inthat
 case, if for every examplex
 ij
 we have227in addition to the sample y
 ij
 from the ith task's conditional distribution onY 227samplesfrom the remaining
n 000 1conditional distributions onY, then we can
 view then training sets containing mexamples eachas one large training set
for the multi-class problem with mn examples altogether. The bound on m in
Theorem 8 statesthat mn should be
 O{nk +W }, or proportional tothe total number of parameters in the network,
aresult we would
 expect from 6
 {Haussler, 1992}. So when specialized to the traditionalmultiple-class,
single task framework, Theorem 8 is con-
 sistent with the bounds already known.However, as we have already argued,
problems such as face
 recognition arenot really single-task, multiple-class problems.They are
more appropriately viewed6. If each example canbe classi002ed with a 223large
margin224 thennaive parameter counting can be improved upon {Bartlett,
 1998}.
 170A MODEL OFINDUCTIVE BIASLEARNING
 as a{potentially in002nite} collection of distinct binaryclassi002cation
problems. In that case, the goal of bias learning is not to 002nd asingle
n-output network that can classifysome subset of n faces
 well. Itis to learn a set of features that can reliablybe used as a 002xed
preprocessing fordistinguish-
 ing any single face from other faces. This is the new thing providedby Theorem
8: it tells us that provided we have trained ourn-output neural network on
suf002ciently many examples of suf002ciently
 many tasks, we can be con002dent that thecommon feature map learnt for
those ntasks will be good
 for learningany new, as yet unseen task, provided the new task is drawn
from the same distribution
 that generated the training tasks.In addition, learning the new task only
requires estimating the k
 output node parametersfor that task, a vastly easier problem thanestimating
the parameters of the
 entire network, from both a sample and computational complexity perspective.
Also, since we have high con002dence that the learnt features willbe good
for learning novel tasks drawn fromthe same
 environment, thosefeatures arethemselves a candidate for further study to
learnmore about the
 nature of the environment.The same claim could not be made if thefeatures
had been learnt on too
 small a setof tasks to guarantee generalization tonoveltasks, for then it
is likely that the features would implement idiosyncrasies speci002cto those
tasks, rather than 223invariances224that apply across
 all tasks.
 When viewed from a bias {or feature} learningperspective, ratherthan a traditionaln-class
 classi002cation perspective,the bound m on the number of examplesrequired
of each task takes on
 a somewhat different meaning. Ittells us that provided nis large {i.e.,we
are collecting examples
 of a largenumber tasks}, then we really only need to collecta few more examples
than we would otherwise have to collect if the featuremap was already known
{k +W=n examples vs. k examples}. So it tells us that the burden imposedby
feature learning can be made negligibly small,at least when
 viewed from the perspective of the sampling burden required of each task.
3.4 Learning Multiple Taskswith Boolean Feature Maps
 Ignoringthe accuracy and con002dence parameters" and ffi, Theorem 8 shows
that thenumber of
 examples required of each taskwhen learning n tasks with a common neural-network
feature map
 is bounded above byO{k +W=n},where k is the number of features andW is the
number of
 adjustableparameters in the feature map. SinceO{k} examples are required
tolearn a single task
 once the true featuresare known, this shows that the upper bound onthe number
of examples
 requiredof each task decays {in order} to the minimumpossible as the number
of tasks n increases. This suggests that learning multiple tasks isadvantageous,
butto be truly convincing weneed to
 prove a lower bound of thesame form. Proving lower bounds in a real-valued
setting {Y = R} is complicated by the fact that a singleexample can convey
an in002nite amount ofinformation, so
 one typically has to make extra assumptions, suchas that the targetsy 2
Y are corrupted by a noise process. Rather than concern ourselves with such
complications, in this section werestrict
 our attention to Boolean hypothesisspace families {meaning each hypothesish
2 H
 1
maps to
 Y = f0061g and we measure error by discrete lossl{h{x};y} = 1 if h{x} 6=
y and l{h{x}; y}= 0
 otherwise}.
 Weshow that the sample complexity for learningntasks with a Boolean hypothesis
space family H is controlled by a 223VCdimension224 type parameter d
H
 {n} {that is, we give nearly matching upper
 and lower boundsinvolving d
 H
{n}}. We then derivebounds on d
 H
 {n} for the hypothesis space family considered inthe previous sectionwith
the Lipschitz sigmoid function 033replaced by a hard
 threshold {linearthreshold networks}.
 171BAXTER As well as the bound on the number of examples required per task
for good generalizationacross
 those tasks, Theorem 8 also showsthat features performing well on O{W }
tasks will generalize well to novel tasks, where W isthe number of parameters
in the feature map.Given that for many feature learning problems W is likely
tobe quite large {recall Note 4 in Section 3.3.3}, it would be useful
 to know thatO{W } tasks are in factnecessary without further restrictions
on the environmental
 distributions Qgeneratingthe tasks. Unfortunately, wehave notyet been able
to show such a lower bound.
 There is some empirical evidencesuggesting that in practice the upper bound
on thenumber of
 tasks may be very weak.For example, in Baxter and Bartlett {1998}we reported
experiments in
 whicha set of neural network features learnt on asubset of only 400 Japanese
characters turned out to be good enough for classifying some 2600unseen characters,
even though the featurescontained
 several hundred thousand parameters. Similar results may be found in Intrator
andEdelman {1996}
 and in the experimentsreported in Thrun {1996} and Thrun and Pratt{1997,
chapter 8}. While
 thisgap between experiment and theory may be justanother example of the
looseness inherent in general bounds, itmay also be that theanalysis can
be tightened. In particular,thebound on the
 number of tasks isinsensitive to the size of the class of outputfunctions
{the class G inSection 3.1}, which may be where the looseness hasarisen.
3.4.1 UPPERANDLOWER BOUNDSFOR LEARNING nTASKS WITH BOOLEAN HYPOTHESIS SPACE
FAMILIES First we recall some concepts from thetheory of Boolean function
learning. LetH be a class of
 Booleanfunctions on X and x = {x
 1
 ; : : : ; x m
 } 2 X
 m
 . H
 jx is the set of all binary vectorsobtainable
 by applying functions inH to x:
 H
 jx
 := f{h{x
 1
 };: : : ; h{x
 m }) : h 2 Hg: Clearly jH
 jx j 024 2
 m
. IfjH
 jx
j = 2
 m
 we sayH shatters x. The growth function of H is de002ned by
 005 H
 {m} :=max
 x2X
 m fi
 fi
 H jx
 fi
 fi :
 The Vapnik-Chervonenkisdimension VCdim{H} is the sizeof the largest set
shattered by H:
 VCdim{H} := maxfm : 005
 H
 {m} = 2
 m g:
 An important result in thetheory of learning Boolean functions is Sauer's
Lemma {Sauer, 1972}, of
 which we will also make use.
 Lemma 9 {Sauer's Lemma}. For a Boolean function class H withVCdim{H} = d,
005
 H
 {m} 024
 d
 X i=0
 022
 m i
 
 024 020
 emd
 021
 d ;
 for all positive integersm.
 We now generalize theseconcepts to learning n tasks with a Booleanhypothesis
space family.
 172A MODEL OFINDUCTIVE BIASLEARNING
 De002nition 5.Let H be a Boolean hypothesis spacefamily. Denote the n 002m
matrices over the
 input spaceX by X
 {n;m}
 . For eachx 2 X
 {n;m} and H 2 H , de002neH
 jx
 to be the set of{binary}
 matrices,
 H
 jx
 :=
 8
 >
 <
 >
 :
 2 6
 4
 h
 1 {x
 11
 }001 001 001 h
 1 {x
 1m
}
 .
 .
 . .
 .
 .
 .
 .
 .
 h
 n {x
 n1
} 001 001 001 h
 n
 {x
 nm }
 3
 7
 5 : h
 1
 ;: : : ; h
 n
 2H
 9
 >
 =
 >
 ;
 :
 De002ne H
 jx
 := [
 H2 H
 H jx
 :
 Nowfor each n > 0; m >0, de002ne 005
 H {n; m} by
 005
 H
 {n;m} := max
 x2X
 {n;m}
 fi fi
 H
 jx fi
 fi
 : Note that 005
 H {n; m} 024 2 nm
 . If
 fi fi
 H
 jx fi
 fi
 =2 nm
 we say H shattersthe matrix x. For eachn > 0 let
 d
H
 {n} := maxfm : 005
 H
{n; m} = 2
 nm g:
 De002ned{ H }: = VCdim{ H
 1
 }and
 d{ H}: = max
 H2 H VCdim{H}:
 Lemma 10.d{H } 025 d{H }
 d
 H {n} 025 max
 032026d{H }n 027
 ; d{ H }
 033 025
 12
 022026d{ H }n
 027
+ d{ H}
 
 Proof.The 002rst inequality is trivial from thede002nitions. To get the
second term inthe maximum
 in the second inequality, choosean H 2 H withVCdim{H} = d{ H } and construct
a matrix x 2 X
 {n;m}
 whose rows are of lengthd{ H } and areshattered by H. Then clearlyH shatters
x. For the 002rst term in the maximum take asequence x = {x
 1 ; : : : ; xd{ H }
 }shattered by H
 1
{thehypothesis
 space consisting ofthe union over all hypothesis spaces from H}, and distribute
its elements equally among the rows of x {throw awayany leftovers}. The set
of matrices 8
 >
 <
 > :
 2
 6
 4 h{x
 11
 } 001 001 001 h{x
 1m
 }
 .
 .
 .
 . .
 .
 .
 . .
 h{x
n1
 } 001 001 001h{x
 nm
 } 3
 7
 5
 :h2 H
 1
 9 >
 =
 >
 ; :
 where m = bd{ H }=ncis a subset of H
 jx
 andhas size 2
 nm
 .Lemma 11. 005
 H
 {n;m} 024
 024
 emd
H
 {n}
 025 nd
 H
 {n}
 173BAXTER Proof. Observe that for eachn, 005
 H
 {n; m} = 005
 H {nm} where H is thecollection of all Boolean
 functions on sequencesx
 1
 ; : : : ; x nm
 obtained by 002rst choosingnfunctions h
 1
; : : : ; h
 n
 from some
 H 2 H ,and then applying h
 1
 to the 002rst m examples, h 2
 to the second m examplesand so on. By
 the de002nition ofd
 H
 {n}, VCdim{H} = nd H
 {n}, hence theresult follows from Lemma 9 applied to H.If one follows the
proof of Theorem 4 {in particular the proof of Theorem18 in Appendix
 A} then it is clear thatfor all 
*  > 0, C{ H
 n
 l
 ; "} may be replaced by005
 H
 {n;2m} in the Boolean case. Making this replacement in Theorem 18,and using
the choices of ff; 027from the discussion
 followingTheorem 26, we obtain the following bound on theprobability oflarge
deviation between empirical and true performance in this Booleansetting.
 Theorem 12. LetP = {P
 1
 ; : : : ; P
 n
 } be n probability distributionson X 002 f0061gand let z be an
 {n;m}-sample generated by samplingm times from X 002 f0061g according
to each P i
 . Let H =fHg
 be any permissible Booleanhypothesis space family. Forall 0 < 
*  024 1, Pr fz: 9h2 H
 n
 : er P
 {h} 025^er
 z
 {h} + "g 024 4005 H
 {n; 2m} exp{000
* 
 2 nm=64}: {46}
 Corollary 13. Under the conditions ofTheorem 12, ifthe number of examplesm
of each task
 satis002es m 025
 88"
 2
024
 2d
 H {n} log
 22"
 +
 1n
 log 4ffi 025
 {47}
 thenwith probability at least 1 000ffi {over the choice of z}, any h 2
H
 n will satisfy
 er
 P
 {h} 024^er
 z
 {h} + " {48}
 Proof. Applying Theorem 12, we require 4005
 H
 {n; 2m} exp{000
* 
 2
 nm=64}024 ffi;
 which is satis002ed if m 025
 64
* 
 2
 024
 d
 H
 {n} log
 2emd
 H {n}
 +
 1n
 log 4ffi 025
 ; {49}
 where we have used Lemma 11.Now, for all a 025 1, if
 m =
 022 1+
 1e
 
 a log 022
 1 +
 1e
  a;
 then m 025 alog m. So setting a= 64d
 H
 {n}="
 2
 , {49} issatis002ed if
 m 025
 88"
 2
 024
 2d H
 {n} log 22"
+
 1n log
 4ffi
 025
 :174A MODEL OFINDUCTIVE BIASLEARNING
 Corollary13 shows that any algorithm learningn tasks using the hypothesis
space familyH
 requires no more than m = O
 022
 1"
 2 024
 d
 H
 {n} log
 1"
 +
 1n
 log 1ffi
 025
 {50}
 examplesof each task to ensure that with high probabilitytheaverage true
error of any nhypotheses
 it selects from H n
 is within " of their average empirical error on the sample z. We now give
a
 theorem showing that if the learning algorithmis required to produce n hypotheseswhose
average
 true error is within" of the best possible error{achievable using H
 n } for an arbitrary sequence of distributions P
 1
 ; : : : ; P
 n
 , then within a log
 1"
 factor thenumber of examples in equation {50} is also necessary.
 For any sequenceP = {P
 1
 ; : : : ; P
 n
 } of n probability distributions onX 002 f0061g, de002ne opt
 P
 { H n
 } by
 opt P
 { H
 n } := inf
 h2H
 n
 er
P
 {h}:
 Theorem 14. Let H bea Boolean hypothesis space family such thatH
 1
 containsat least two functions. For each n= 1; 2; : : : ;let A
 n
 be any learningalgorithm taking as input {n; m}-samples
 z 2 {X002 f0061g}
 {n;m}
 and producing asoutput n hypotheses h = {h
 1
 ; : : : ; h n
 } 2 H
 n
 . For all 0 < " < 1=64and 0 < ffi <1=64, if
 m <
 1"
 2
 024
 d
 H
 {n}616
 + {1 000 "
 2
 }
 1n
 log
 022 18ffi{1 000 2ffi}
025
 then there exist distributions P = {P
 1 ; : : : ; P
 n } such that with probability atleast ffi {over the
 randomchoice of z},
 er P
 {A
 n {z}) > opt
 P
 { H
 n } + "
 Proof.See Appendix C3.4.2LINEAR THRESHOLD NETWORKS
 Theorems 13and 14 show that within constants and alog{1="} factor, the samplecomplexity
of
 learning n tasks usingthe Boolean hypothesis space family His controlled
by the complexity pa- rameter d
 H
 {n}. In this section we derivebounds on d
 H
 {n} for hypothesis space families constructed as thresholded linear combinations
of Booleanfeature maps. Speci002cally, weassumeH is of the
 form given by{39}, {38}, {37} and {36}, where nowthe squashing function
033 is replaced with ahard
 threshold:
 033{x} :=
 {
 1if x 025 0;
0001 otherwise;
 andwe don't restrict the range of the feature andoutput layer weights. Note
that in this casethe
 proof of Theorem 8 does not carrythrough because the constants 024; 024
0
 in Theorem 7 depend on the Lipschitz bound on 033.
 Theorem 15. Let H bea hypothesis space family of the form given in{39},
{38}, {37} and{36},with
 a hard thresholdsigmoid function 033. Recallthat the parameters d, l and
k are theinput dimension,
 numberof hidden nodes in the feature map and number of features {outputnodes
in the feature map}
 175BAXTER respectively. Let W:= l{d + 1} +k{l + 1} {the number ofadjustable
parameters in the feature map}. Then,
 d
 H {n}024 2
 022
 Wn
 + k +1
 
 log
 2
 {2e{k +l + 1}):
 Proof. Recall that for eachw 2 R
 W
 ,010
 w
 : R d
 ! R
 k denotes the feature map with parametersw.
 For each x 2X
 {n;m}
 , let010
 wjx
 denote thematrix
 2
 6
 4 010
 w
 {x 11
 } 001 001 001010
 w
 {x 1m
 }
 . .
 .
 .
 . .
 .
 .
 . 010
 w
 {x n1
 } 001001 001 010
 w
 {x
 nm
 } 3
 7
 5
 : Note that H
 jx
 is the set of all binaryn 002 m matrices obtainable by composingthresholded
linear
 functions withthe elementsof 010
 wjx
 , with the restriction that the same functionmust be applied to
 eachelement in a row{but the functions may differ between rows}.With a slight
abuse of notation, de002ne
 005
 010 {n; m} := max x2X
 {n;m}
 fi
 fi
 010
 010
 wjx
 : w 2 R
 W
 011
 fi
 fi
 :
 Fix x2 X
 {n;m}
 . By Sauer's Lemma, each node in the002rst hidden layer of the feature
map computes at most {emn={d+ 1})
 d+1
 functionson the nm input vectors in x. Thus, there can be at most {emn={d
+ 1}) l{d+1}
 distinctfunctions from the input to the output of the002rst hidden layer
on
 the nmpoints in x. Fixing the 002rst hiddenlayer parameters, each node
in the second layer ofthe
 feature map computes at most{emn={l + 1})
 l+1
 functions on the image ofx produced at the output
 of the 002rsthidden layer. Thus the second hidden layercomputes no more
than {emn={l + 1})
 k{l+1}
 functions on the output of the002rst hidden layer on the nm points inx.
So, in total,
 005 010
 {n; m}024
 022
 emnd+ 1
  l{d+1}
 022 emnl + 1 
 k{l+1} :
 Now, for each possiblematrix 010
 wjx , the number of functions computable on eachrow of 010
 wjx by a
 thresholded linear combination ofthe output of the feature map is at most{em={k
+ 1}) k+1
 . Hence, the number of binary sign assignments obtainableby applying linear
threshold functions to all the rowsis at most {em={k +1})
 n{k+1}
 . Thus,
 005 H
 {n; m}024
 022
 emnd + 1

 l{d+1} 022
 emnl + 1
 
 k{l+1}
 022 emnn{k + 1}
 
 n{k+1}
 :
 f{x} := x log xis a convex function, hence for alla; b; c > 0,
 f 022
 ka+ lb+ ck +l + 1
 
 024 1k +l + 1
 {kf {a} + lf {b}+ f {c})
 } 022
 k + l + 1ka + lb+ c
 
 ka+lb+c
 025 022
 1a
 
 ka 022
 1b
 
 lb 022
 1c
 
 c
 :
 Substituting a =l + 1, b = d + 1and c = n{k + 1}shows that
 005
 H {n; m} 024 022
 emn{k +l + 1}W+ n{k + 1}
 
 W +n{k+1}
 : {51} 176A MODEL OFINDUCTIVE BIASLEARNING
 Hence, if m >
 022
 Wn
 + k + 1 
 log
 2 022
 emn{k +l + 1}W+ n{k +1}
 
 {52}
 then 005 H
 {n; m}< 2
 nm
 and so byde002nition d
 H
 {n} 024 m. For alla > 1, observe that x > alog
 2
 x
 ifx = 2a log
 2 2a. Setting x =emn{k +l + 1}={W + n{k + 1})and a = e{k +l + 1} shows
that
 {52} issatis002ed if m = 2{W=n+ k + 1} log
 2 {2e{k + l+ 1}).Theorem 16.Let H be as in Theorem 15 withthe following
extra restrictions:d 025 3,l 025k and
 k 024 d. Then d
 H
 {n}025
 12
 022026
 W2n
 027 + k + 1
  Proof. We boundd{ H }and d{ H }and then apply Lemma 10. In the presentsetting
H
 1
 containsall three-layer linear-threshold networks withd input nodes, l hidden
nodes in the002rst hidden layer, k
 hidden nodes in the second hidden layer and one output node.From Theorem
13 in Bartlett {1993}, we have
 VCdim{H
 1
 } 025dl +
 l{k 0001}2
 + 1;
 which under the restrictions stated above is greater than W=2. Henced{ H
}025 W=2.
 As k024 dand l 025 kwe can choose a feature weight assignment so thatthe
feature map is the
 identity onk components of the input vector and insensitive to the setting
of the reminaingd 000 k
 components.Hence we can generate k+ 1points in X whose image under thefeature
map is
 shattered by the linearthreshold output node, and so d{ H }= k +1.Combining
Theorem 15 withCorrolary 13 shows that
 m025 O
 022
 1"
 2 024022
 Wn
 + k + 1 
 log
 1"
 +
 1n
 log 1ffi 025
 examples of each task suf002ce when learning n tasks using a linearthreshold
hypothesis space family, while combining Theorem 16 with Theorem 14 shows
that if
 m 024 012 022
 1"
 2
 024022 Wn
 + k +1
  +
 1n
 log
 1ffi
 025
 then any learning algorithm will fail on some set of n tasks.
 4. Conclusion The problem of inductive bias is onethat has broad signi002cance
in machine learning.In this paper
 we have introduced aformal model of inductive bias learning thatapplies
when the learner is able
 tosample from multiple related tasks.We proved that provided certain covering
numbers computed
 from the set of allhypothesis spaces available to the bias learnerare 002nite,
any hypothesis space
 that contains good solutions to suf002ciently many training tasks is likely
to contain goodsolutions to
 novel tasks drawn from the same environment.
 In the speci002c caseof learning a set of features, we showed thatthe number
of examples m
 requiredof each task in an n-task training set obeys m = O{k +W=n}, where
k is the number of 177BAXTER features and W is a measure ofthe complexity
of the feature class.We showed that this bound is essentially tight for Boolean
feature mapsconstructed fromlinear threshold networks.In addition,
 weproved that the numberof tasks required to ensure good performance from
the features on novel
 tasks is no more thanO{W }. Wealso showed how a good set of features maybe
found by gradient
 descent.
 Themodel of this paper represents a 002rst steptowards a formal model of
hierarchical approaches to learning. By modelling a learner's uncertainty
concerning its environment inprobabilistic terms,
 we have shown how learning can occur simultaneously atboth the baselevel227learn
the tasks at
 hand227and atthe meta-level227learn bias that can betransferred to novel
tasks. Froma technical
 perspective, itis the assumption that tasks are distributedprobabilstically
that allows the perfor- mance guarantees to be proved.From a practical perspective,
there are many problem domains that
 can be viewed asprobabilistically distributed sets of related tasks.For
example, speech recognition may be decomposed along many different axes:
words, speakers, accents, etc.Face recognition
 represents a potentiallyin002nite domain of related tasks. Medicaldiagnosis
and prognosis problems
 using the same pathology tests are yet another example.All of these domains
should bene002t from being tackled with a bias learning approach. Natural
avenues for further enquiry include: 
*  Alternative constructions forH . Although widely applicable, thespeci002c
example on feature
 learning viagradient descent represents just one possible wayof generating
and searching
 the hypothesis space family H . It would be interestingto investigate alternative
methods, including decision tree approaches, approachesfrom Inductive Logic
Programming {Khan et al., 1998}, and whether more generallearning techniques
such as boosting can be applied in a bias learning setting.

*  Algorithms for automatically determining thehypothesis space family H
. Inour model the
 structure of H is002xed apriori and represents thehyper-bias of the bias
learner.It would
 be interesting to see towhat extent this structure can also be learnt. 
*  Algorithms for automatically determiningtask relatedness. In ordinary
learning thereis usu-
 ally little doubt whether an individual example belongs to the same learningtask
or not.
 The analogous question in bias learning is whether an individual learning
taskbelongs to a
 given set of related tasks, which in contrast to ordinary learning, does
not always have such
 a clear-cut answer. For most of the examples we havediscussed here, such
as speech and
 face recognition, thetask-relatedness isnot inquestion, but in other cases
such as medical problems it is not so clear.Grouping too large a subset of
tasks togetheras related tasks could
 clearly have adetrimental impact on bias-learning or multi-tasklearning,
and there is empri-
 cal evidenceto support this {Caruana, 1997}. Thus,algorithms for automatically
determining task-relatedness are a potentially useful avenue for further
research. In this context,see Silver
 and Mercer {1996}, Thrun andO'Sullivan {1996}. Note that the questionof
task relatedness
 is clearly onlymeaningful relative to a particular hypothesisspace family
H {for example, all possible collections oftasks are related ifH contains
every possible hypothesisspace}.
 
*  Extended hierarchies.For an extension of our two-levelapproach to arbitrarily
deep hierarchies, see Langford {1999}. An interestingfurther question is
to what extent the hierarchycan
 be inferred from data. Thisis somewhat related to the question of automaticinduction
of
 structure in graphical models. 178A MODEL OFINDUCTIVE BIASLEARNING
 Acknowledgements This work was supported at varioustimes by an Australian
Postgraduate Award,aShell Aus-
 tralia Postgraduate Fellowship, U.K Engineering and Physical Sciences ResearchCouncil
grants
 K70366 and K70373, and anAustralian Postdoctoral Fellowship. Alongthe way,
many people
 havecontributed helpful comments and suggestions for improvement including
Martin Anthony, Peter Bartlett, Rich Caruana, John Langford,Stuart Russell,
John Shawe-Taylor,Sebastian Thrun
 and several anonymousreferees.
 Appendix A. UniformConvergence Results
 Theorem2 provides a bound {uniform over allH 2 H } on the probability of
large deviation between
 er
Q
 {H} and ^er
 z
 {H}. To obtain a more general result, wefollow Haussler {1992} and introduce
the following parameterized class of metrics onR
 +
 :
 d 027
 [x; y]:=
 jx 000 yjx + y+ 027
 ;
 where027 > 0. Our main theorem willbe a uniform bound on the probability
of large values of
 d
 027
 [er
 Q
 {H}; er
 z
 {H}], rather than j er Q
 {H} 000^er
 z
 {H}j. Theorem2 will then followas a corollary, as
 will better boundsfor the realizable case ^er z
 {H} = 0{Appendix A.3}.
 Lemma 17.The following three properties ofd
 027
 are easily established: 1. For all r;s 025 0, 0 024 d 027
 [r; s]024 1
 2. Forall 0 024 r 024 s 024t, d
 027
 [r; s] 024 d
 027
 [r; t]and d
 027
 [s; t] 024 d
 027 [r; t].
 3. For 0 024 r;s 024 1,
 jr000sj027+2
 024 d
 027
 [r; s]024
 jr000sj027
 For easeof exposition we have up until now beendealing explicitly withhypothesis
spacesH
 containing functions h: X ! Y , and then constructingloss functions h
 l
mapping X 002 Y ![0; 1]
 by h
 l
 {x; y} :=l{h{x};y} for some loss function l: Y 002Y ! [0; 1]. However,
in generalwe can view
 h
 l just as a function from an abstract setZ {X 002 Y }to [0; 1] and ignore
its particular construction
 in terms of the loss functionl. So for the remainder of this section,unless
otherwise stated, all
 hypothesisspaces H will be sets of functions mappingZ to [0; 1].It will
also be considerably more convenient to transpose our notation for{n; m}-samples,
writing then training sets as columns
 insteadof rows:
 z =
 z 11
 : : : z 1n
 .
 . .
 .
 .
 . .
 .
 .
 z m1
 : : : z mn
 where each z ij
 2 Z . Recallingthe de002nition of {X 002Y }
 {n;m} {Equation 9 and prior discussion}, with this transposition z lives
in{X 002 Y }
 {m;n}
 . Thefollowing de002nition now generalizes quantities like ^er
 z {H}, er
 P
 {H} and so on to thisnew setting.
 De002nition 6.Let H
 1
 ; : : : ;H
 n
 be n sets offunctions mapping Z into [0; 1]. For any h 1
 2
 H
1
 ; : : : ; h
 n
 2 H
 n
 , let h
 1
 010001001 001 010 h
 n or simply h denote the map h{~z } = 1=n
 n
 X
 i=1
 h
 i
 {z
 i
 }
 179BAXTER for all ~z = {z
 1
 ; : : : ; z n
 } 2 Z
 n
 . Let H
1
 010 001 001 001010 H
 n
 denote the set ofall such functions. Given
 h2 H
 1
 010 001001 001 010 H
 n
and m elements of {X 002Y }
 n
 , {~z
 1
 ; : : : ; ~z
 m
 } {or equivalently an element z of
 {X002 Y }
 {m;n}
 by writing the ~z i
 as rows}, de002ne ^er
 z
 {h} :=
 1m
 m
 X
i=1
 h{~z i
 }
 {recallequation {8}}. Similarly, for anyproduct probability measure P =
P 1
 002 001 001 001002 P
 n
 on {X 002 Y }
 n
 , de002ne
 er P
 {h} := Z
 Z
 n
 h{~z } dP{~z }
 {recallequation {26}}. For any h; h
 0
 : {X 002 Y }
 n ! [0; 1] {not necessarilyof the form h
 1
010 001 001 001 010 h n
 },
 de002ne d
 P
 {h; h
 0
 }:=
 Z
 Z
 n jh{~z }000 h
 0
 {~z }j dP{~z }
 {recall equation {17}}. For any class of functionsH mapping {X 002 Y}
 n
 to [0;1], de002ne
 C{"; H} := sup
 P N {"; H; d P
 }
 wherethe supremum is over all product probabilitymeasures on {X 002 Y}
 n
 and N{"; H; d
 P } is the
 size of the smallest"-cover of H under d P
 {recall De002nition 4}. The following theorem is the main result fromwhich
the rest of the uniform convergenceresults
 in this paper are derived. Theorem 18. Let H 022 H 1
 010 001 001 001 010 H n
 be a permissible class offunctions mapping {X 002 Y}
 n
 into
[0; 1]. Let z2 {X 002 Y } {m;n}
 be generatedby m 025 2={ff 2
 027 } independent trials from {X 002 Y } n
 according to some productprobability measure P = P 1
 002 001 001 001002 P
 n
 . For all 027 > 0, 0< ff < 1,
 Pr
 032
 z 2 {X 002Y }
 {m;n} : sup
 H
 d 027
 [ ^er z
 {h};er
 P
 {h}]> ff
 033
 0244C {ff027 =8;H} exp{000ff
 2
 027 nm=8}:{53}
 The following immediate corollarywill also be of use later.
 Corollary 19. Under the same conditionsasTheorem 18, if
 m 025max
 {
 8ff
 2
 027 n
 log
 4C 000
 ff0278
 ; H 001ffi ;
 2ff
 2
 027
 }
 ; {54}
 then Pr
 032
 z2 {X 002 Y } {m;n}
 : sup H
 d
 027 [ ^er
 z
 {h}; er
P
 {h}] > ff 033
 024 ffi {55} A.1 Proof of Theorem 18 The proof is via a double symmetrization
argument of the kind given in chapter 2 ofPollard {1984}.
 I have also borrowedsome ideas from the proof of Theorem 3 in Haussler{1992}.
 180A MODEL OFINDUCTIVE BIASLEARNING
 A.1.1FIRST SYMMETRIZATION
 An extra piece ofnotation: for all z 2 {X 002 Y }
 {2m;n}
 , let z{1}be the top half of z and z{2} be the
 bottom half, viz: z{1} =
 z
11
 : : : z
 1n
 .
 .
 .
 .
 .
 .
 .
 . .
 z
 m1 : : : z
 mn z{2} =
 z
 m+1;1
 : : :z
 m+1;n
 . .
 .
 .
 .
 .
 .
 .
 .
 z
 2m;1
 :: : z
 2m;n
The following lemma is the 002rst223symmetrization trick.224 Werelate
the probability oflarge deviation between an empirical estimate of the loss
andthe true loss to the probability oflarge deviation
 between two independent empiricalestimates of the loss.
 Lemma 20.Let H be a permissible set of functionsfrom {X 002 Y } n
 into [0;1] and let P be a probability measure on {X 002 Y }
 n . For all 027 >0; 0 < ff < 1and m 025
 2ff
 2
 027
 ,
 Pr
 032
 z 2 {X 002Y }
 {m;n} : sup
 H
 d 027
 [ ^er z
 {h};er
 P
 {h}]> ff
 033
 0242 Pr
 032
 z2 Z
 {2m;n} : sup
 H
 d 027
 002
 ^er
 z{1}
 {h}; ^er
 z{2}
 {h}
003
 >
 ff2
 033
 : {56}
 Proof.Note 002rst that permissibility ofHguarantees themeasurability ofsuprema
overH
 {Lemma32 part 5}. Bythe triangle inequality for d
 027
 , ifd
027
 002
 ^er z{1}
 {h}; er
 P
 {h}
 003
 >ff and
 d
 027 002
 ^er
 z{2}
 {h};er
 P
 {h} 003
 < ff=2, thend
 027
 002
 ^er
 z{1}
 {h}; ^er z{2}
 {h} 003
 > ff=2. Thus, Pr
 n
 z 2{X 002 Y }
 {2m;n}
 : 9h2 H : d
 027 002
 ^er
z{1}
 {h};^er
 z{2}
 {h}
 003
 >
 ff2 o
 025 Pr
n
 z 2 {X 002Y }
 {2m;n} : 9h2 H : d 027
 002
 ^er
 z{1}
 {h}; er
 P {h}
 003 > ff and
 d 027
 002
 ^er
 z{2}
 {h}; er
 P {h}
 003 < ff=2
 o
 :
 {57}
 By Chebyshev's inequality, for any 002xedh,
 Pr
 n
 z 2 {X 002 Y } {m;n}
 : d 027
 [ ^er z
 {h};er
 P
 {h}]<
 ff2 o
 025 Pr
 032 z 2 {X 002 Y}
 {m;n}
:
 j ^er
z
 {h} 000er
 P
 {h}j027
 <
 ff2 033
 025 1000 er
 P
 {h}{1 000 er
 P
 {h})m027 ff
 2
 =4
 025
 12
 as m 0252={ff
 2
 027 } and er
 P {h} 024 1.Substituting this last expression into the righthand side of
{57}
 gives the result.181BAXTER A.1.2 SECONDSYMMETRIZATION The second symmetrization
trick bounds theprobability of large deviation between twoempirical
 estimates of the loss {i.e.the right hand side of {56}} by computingthe
probability of large devi-
 ation when elements are randomly permuted between the002rst and second
sample. The following de002nitionintroduces the appropriate permutationgroup
for this purpose.
 De002nition 7.For all integers m; n025 1, let 000
{2m;n}
 denote the set of allpermutations 033 of the
 sequenceof pairs of integers f{1;1}; : : : ;{1; n}; : : : ; {2m; 1}; : :
: ;{2m; n}gsuch that for all i, 1 024 i 024 m,either 033{i; j } = {m+
i; j } and 033{m + i; j } = {i; j } or 033{i; j } = {i; j} and 033{m+
i;j } =
 {m +i; j }.
 For anyz 2 {X 002 Y } {2m;n}
 and any033 2 000
 {2m;n}
 , let
 z
 033
 :=
 z
 033{1;1}
 :: : z
 033{1;n}
 .
 .
 . .
 .
 .
 . .
 .
 z
 033{2m;1}
 : : :z
 033{2m;n} :
 Lemma 21. LetH = H
 1
010 001 001 001 010 H n
 be a permissible set offunctions mapping {X002 Y}
 n
 into
[0; 1] {as in the statement ofTheorem 18}. Fix z 2{X002 Y }
{2m;n}
 and let ^
 H := ff 1
 ; : : : ; f M
 g be
 anff027 =8-cover for {H; d
 z
 }, where d
 z
 {h; h
 0 } :=
 12m
 P
2m
 i=1
 jh{~z
 i
 } 000 h
 0
 {~z
 i
 }j where the ~z
 i
 are the rows
 of z. Then,
 Pr
 032
 033 2 000
 {2m;n}
 : sup H
 d
 027 002
 ^er
z
 033
 {1}
 {h}; ^er z
 033
 {2} {h}
 003 >
 ff2
 033
 024 M
 X
 i=1 Pr
 n
 0332000
 {2m;n} : d
 027
002
 ^er
 z 033
 {1}
 {f
 i
 }; ^er
 z
 033
{2}
 {f
 i }
 003
 > ff4
 o
 ; {58}
 whereeach 033 2 000
 {2m;n}
 is chosen uniformly at random.
 Proof. Fix 0332 000
 {2m;n} and let h 2 H be such thatd
 027
 002
 ^er
 z
 033 {1}
 {h};^er
 z
 033 {2}
 {h} 003
 > ff=2{if thereis
 nosuch h for any 033we are already done}. Choose f2
 ^
 H such thatd
 z
 {h;f } 024 ff027 =8. Without loss
 of generality wecan assume f is of the formf = f
 1
010 001 001 001 010 f n
 . Now,
 2027
 d
 z
 {h;f } =
 P
2m
 i=1
 fi fi
 fi
 P
 n
 j=1
 h
 j
 {z
 ij } 000 f
 j {z
 ij
 } fi
 fi
 fi027 mn
=
 P
 2m
 i=1
 fi
 fi fi
 P
 n
 j=1
 h
 j
 {z
 033{i;j}
 } 000 f
 j
 {z
 033{i;j}
 }
fi
 fi
 fi027 mn
 025 fi
 fi
 fi P
 m
 i=1 P
 n
 j=1 002
 h
 j {z
 033{i;j}
 } 000 f
 j
 {z
 033{i;j}
 }
003
 fi
 fi
 fi027 mn + P
 m
 i=1 P
 n
 j=1 002
 h
 j {z
 033{i;j}
 } + f
 j {z
 033{i;j}
 }
 003
 +
 fi
 fi
 fi
 P
 2m
 i=m+1
 P
 n
 j=1
 002 h
 j
 {z 033{i;j}
 }000 f
 j
 {z 033{i;j}
 } 003
 fi
 fi fi027mn +
 P
 2m i=m+1
 P n
 j=1
 002 h
 j
 {z 033{i;j}
 } +f
 j
 {z
 033{i;j}
 } 003
 = d
027
 002
 ^er z
 033
 {1} {h}; ^er z
 033
 {1} {f }
 003 + d
 027
002
 ^er
 z 033
 {2}
 {h}; ^er
 z 033
 {2}
 {f }
 003
 : 182A MODEL OFINDUCTIVE BIASLEARNING
 Hence, by thetriangle inequality for d
 027 ,
 2027
 d
 z
 {h; f }+d
 027 002
 ^er
 z 033
 {1}
 {f }; ^er
z
 033
 {2}
 {f }
 003
 025 d
 027
 002 ^er
 z
 033 {1}
 {h};^er
 z
 033 {1}
 {f } 003
 + d
 027 002
 ^er
 z 033
 {2}
 {h}; ^er
 z 033
 {2}
 {f }
 003
 +d
 027
 002
 ^er
 z
 033 {1}
 {f };^er
 z
 033 {2}
 {f } 003
 025 d
027
 002
 ^er z
 033
 {1} {h}; ^er z
 033
 {2} {h}
 003
 :
 {59}
 But
 2027
 d z
 {h; f} 024 ff=4 by construction andd
 027
 002
 ^er
 z
 033 {1}
 {h};^er
 z
 033 {2}
 {h} 003
 > ff=2by assumption, so
 {59} impliesd
 027
 002
^er
 z
 033 {1}
 {f }; ^er
 z
033
 {2}
 {f}
 003
 > ff=4. Thus,
 n
 0332 000
 {2m;n} : 9h 2 H :d
 027
 002
 ^er
 z
 033 {1}
 {h};^er
 z
 033 {2}
 {h} 003
 >
 ff2
 o
 022
 n
 033 2000
 {2m;n}
 : 9f 2
 ^ H : d
 027 002
 ^er
z
 033
 {1}
 {f }; ^er z
 033
 {2} {f }
 003 } >
 ff4
 o
 ; which gives {58}.Now we bound the probability ofeach termin the right
hand side of {58}. Lemma 22. Let f :{X 002 Y }
 n
 ! [0; 1]be any function that can be written in the form f =
 f
 1 010 001 001 001 010f
 n
 . Forany z 2 {X 002 Y}
 {2m;n}
,
 Pr
 n
 0332000
 {2m;n} : d
 027
002
 ^er
 z 033
 {1}
 {f }; ^er
z
 033
 {2}
 {f }
 003
 >
 ff4
 o
 024 2exp
 022
 000ff 2
 027 mn8
 
; {60}
 where each033 2 000
 {2m;n}
 is chosen uniformly at random. Proof. For any 0332 000
 {2m;n} ,
 d
 027
 002
 ^er
 z 033
 {1}
 {f }; ^er
z
 033
 {2}
 {f }
 003
 =
 fi
 fi
 fi
 P
 m
 i=1
 P
 n
 j=1
 002
 f
 j
 {z
 033{i;j}
 } 000f
 j
 {z
 033{m+i;j} }
 003
 fi fi
 fi027 mn +
 P 2m
 i=1
P
 n
 j=1
 f
 j
 {z ij
 }
 :{61}
 To simplify the notation denotef
 j
 {z
 ij
 } by fi
 ij
 . For each pairij , 1 024 i 024m, 1 024 j 024n, let
 Y
 ij be an independent random variable such thatY
 ij
 = fi ij
 000 fi
 m+i;j
 with probability1=2 and
 Y
 ij
 = fi
 m+i;j
 000 fi
 ij
 with probability 1=2. From {61},
 Pr n
 033 2 000 {2m;n}
 : d 027
 002
 ^er
 z
 033
{1}
 {f };^er
 z
 033 {2}
 {f } 003
 >
 ff4
 o
 = Pr
 8
 <
 :
 033 2000
 {2m;n}
 :
 fi fi
 fi
 fi
 fi
 fi
 m
X
 i=1
 n
 X
 j=1
 002 f
 j
 {z 033{i;j}
 }000 f
 j
 {z 033{m+i;j} }
 003
 fi fi
 fi
 fi
 fi
 fi
 >
 ff4
 0
 @
 027 mn +
2m
 X
 i=1 n
 X
 j=1 fi
 ij
 1 A
 9
 =
 ; = Pr
 8
 < :
 fi
 fi
 fi fi
 fi
 fi
 m
 X
 i=1
 n
 X
 j=1
 Y
 ij
 fi
 fi
 fi
 fi
 fi fi
 ff4
 0
 @
 027mn +
 2m
 X i=1
 n
 X j=1
 fi
 ij 1
 A
 9
 = ;
 For zero-mean independent randomvariables Y
 1
 ;: : : ; Y
 k
 withbounded ranges a
 i
 024 Y
 i
 024b
 i
 , Ho-
 effding's inequality {Devroye, Gy250or002, & Lugosi, 1996} is Pr
 {
 fi fi
 fi
 fi
 fi
 k
 X
i=1
 Y
 i
 fi
 fi
 fi
 fi
 fi
 025 021 }
 024 2 exp  
 000
 2021 2P k
 i=1
 {b
 i
 000 a i
 }
 2
 !
 :
 183BAXTER Noting that the range of eachY
 ij
 is [000jfi
 ij
 000 fi i+m;j
 }j; jfi
 ij
000 fi
 i+m;j }j], we have Pr
 8
 <
 : fi
 fi
 fi
 fi fi
 fi
 m
 X i=1
 n
 X j=1
 Y
 ij fi
 fi
 fi
 fi fi
 fi
 >
 ff4
 0
 @
 027 mn +
 2m
 X
 i=1
 n
 X
 j=1
 fi
 ij
 1
 A 9
 =
 ;
 0242 exp
 0
 B
 @
 000
 ff
2
 h
 027 mn + P
 2m
 i=1 P
 n
 j=1 fi
 ij
 i
 232
 P
 m
 i=1
 P
 n
 j=1
 {fi
 ij
 000fi
 i+m;j
 } 2
 1
 C
 A Let fl =
 P
 2m
 i=1
 P n
 j=1
 fi ij
 . As 0 024fi
 ij
 024 1,
 P
 m
 i=1 P
 n
 j=1 {fi
 ij
 000fi
 m+ij
 } 2
 024 fl .Hence,
 2 exp
 0 B
 @
 000
 ff 2
 h
 027 mn+
 P
 2m i=1
 P
 n j=1
 fi
 ij i
 232
 P
 m
 i=1
 P
 n
 j=1
 {fi
 ij 000 fi
 i+m;j }
 2
 1
 C
 A
 024 2 exp 022
 000
 ff 2
 {027 mn +fl }
 232fl
  :
 {027 mn +fl }
 2
 =flis minimized by setting fl =027 mn giving a value of4027 mn. Hence
 Pr n
 033 2 000 {2m;n}
 :d
 027
 002
 ^er
 z
 033 {1}
 {f }; ^er
 z
033
 {2}
 {f}
 003
 >
 ff4
 o
 024 2 exp
 022
 000ff
 2 027 mn8
 
 ;
as required.A.1.3PUTTING IT TOGETHER
 For 002xed z2 {X 002 Y }
 {2m;n}
 , Lemmas 21 and 22give:
 Pr
 032 033 2 000
 {2m;n}
 : sup
 H d
 027
 002 ^er
 z
 033 {1}
 {h};^er
 z
 033 {2}
 {h} 003
 >
 ff2
 033 024 2N {ff027=8; H; d
 z }) exp
 022
000
 ff
 2
 027 mn8 
 :
 Notethat d
 z
 is simplyd
 P
 where P= {P
 1
 ;: : : ; P
 n
 }and each P
 i
 is theempirical distribution that
 putspoint mass 1=m on each z ji
 ; j =1; : : : ; 2m {recallDe002nition 3}. Hence,
 Pr
 032
 033 2 000
 {2m;n}
 ; z2 {X 002 Y } {2m;n}
 : sup H
 d
 027 002
 ^er
z
 033
 {1}
 {h}; ^er z
 033
 {2} {h}
 003 >
 ff2
 033
 0242C{ff027 =8;H} exp
 022
000
 ff
 2
 027 mn8 
 :
 Now, for a random choice of z, eachz
 ij
 in z isindependently {butnot identically} distributedand 033
 only ever swapsz
 ij
 and z
 i+m;j
 {so that 033swaps a z
 ij
drawn according to P
 j with another component
 drawnaccording to the same distribution}.Thus we can integrate out with
respect to thechoice of
 033 and write Pr
 032
 z 2{X 002 Y }
 {2m;n}
 : sup
H
 d
 027
002
 ^er
 z{1}
 {h};^er
 z{2}
{h}
 003
>
 ff2
 033
 024 2C{ff027 =8; H} exp
 022
 000 ff
 2
 027mn8

 :
 Applying Lemma 20 tothis expression gives Theorem 18.184A MODEL OFINDUCTIVE
BIASLEARNING
 A.2Proof of Theorem 2
 Another piece ofnotation is required for the proof. Forany hypothesis space
H and any probability measures P = {P
 1
 ; : : : ; P
 n
 } on Z ,let
 ^er
 P
 {H} :=
 1n
 n
 X i=1
 inf
 h2H
 er
 P
 i {h}:
 Notethat we have used ^er P
 {H} rather thaner
 P
 {H}to indicate that ^er
 P
 {H} is anotherempirical
 estimate of er
 Q
 {H}.
 With the {n; m}-samplingprocess, inaddition to the sample zthere is also
generated a se-
 quenceof probability measures, P = {P 1
 ; : : : ; P n
 } although these are notsupplied to the learner.
 Thisnotion is used in the following Lemma, wherePrf{z; P}2 {X 002 Y } {n;m}
 002 P n
 : Ag means 223the probability of generating asequence ofmeasures P from
the environment {P ; Q} and then an {n; m}-samplez according to P such that
A holds224. Lemma 23. If
 Pr 032
 {z; P} 2 {X 002 Y}
 {n;m}
002 P
 n
 : sup H
 d
 027 [ ^er
 z
 {H}; ^er P
 {H}] > ff2 033
 024
 ffi2
 ;{62}
 and
 Pr
 032
 P 2 P
n
 : sup
 H d
 027
 [ ^er
 P
 {H}; er
 Q
 {H}] >
 ff2
 033
 024
 ffi2
 ; {63}
 then
 Pr
 032
 z 2 {X 002 Y}
 {n;m}
: sup
 H
 d 027
 [ ^er z
 {H};er
 Q
 {H}]> ff
 033
 024ffi:
 Proof. Followsdirectly from the triangle inequality ford
 027
 .We treat the two inequalities inLemma 23 separately.
 A.2.1INEQUALITY {62} In the following Lemma we replace thesupremum over
H 2 H ininequality {62} with a supremum
 over h 2 H
 n .
 Lemma 24.
 Pr 032
 {z; P} 2 {X 002 Y}
 {n;m}
002 P
 n
 : sup H
 d
 027 [ ^er
 z
 {H}; ^er P
 {H}] > ff 033
 024 Pr
{
 {z; P}2 {X 002 Y } {n;m}
 002 P n
 : sup
 H n
 l
 d
 027
 [ ^er
 z
 {h}; er P
 {h}] > ff }
 {64}
 185BAXTER Proof. Suppose that {z; P} are such thatsup
 H
 d
 027 [ ^er
 z
{H}; ^er
 P
 {H}] > ff. Let H satisfy this in- equality. Suppose 002rst that^er
 z
 {H} 024 ^er
 P {H}. By the de002nition of^er
 z
 {H}, for all " > 0there
 exists h 2 H n
 := H 010 001001 001 010 H such that ^er
 z
 {h}< ^er
 z
{H} + ". Henceby property {3} of the d
027
 metric, for all " >0,there exists h 2 H n
 suchthat d
 027
 [ ^er
z
 {h}; ^er
 z
 {H}]< ". Pick an arbitrary h satisfying this inequality. Byde002nition,
^er
 P {H} 024 er P
 {h},and so^er
 z
 {H} 024 ^er
 P {H} 024 er P
 {h}. As d
 027
 [^er
 z
 {H}; ^er
 P
 {H}] > ff{by assumption},by the compatibility of d
 027 with the ordering on the
 reals,d
 027
 [ ^er z
 {H};er
 P
 {h}]> ff = ff + ffi, say. By the triangle inequality ford
 027
 ,
 d 027
 [ ^er
 z
 {h}; er P
 {h}] +d
 027
 [^er z
 {h};^er
 z
 {H}] 025 d
 027 [ ^er
 z
 {H}; er
P
 {h}] = ff+ ffi:
 Thus d
 027
 [ ^er
z
 {h}; er P
 {h}] > ff+ ffi 000 " and for any" > 0 an h satisfying thisinequality can
be
 found. Choosing" = ffi shows that there existsh 2 H
 n
 such thatd
 027
 [ ^er
 z
 {h}; er
 P
 {h}] > ff.
 If instead,^er
 P
 {H} < ^er
 z {H}, then an identical argument can be run with the role of zand P
 interchanged. Thus in bothcases,
 sup
 H
d
 027
 [ ^er z
 {H};^er
 P
 {H}] > ff } 9h 2H
 n
 l
 :d
 027
 [ ^er
 z
 {h}; er
 P
 {h}] > ff;
 which completes theproof of the Lemma.By the nature ofthe {n; m} sampling
process, Pr
 {
 {z; P} 2 {X 002Y }
 {n;m} 002 P
 n
 sup H
 n
 l
 : d
 027
 [^er
 z
 {h}; er
 P
{h}] > ff
 } =
 Z
 P2P n
 Pr
 { z 2 {X 002 Y}
 {n;m}
: sup
 H
 n l
 d
 027
 [ ^er
 z
 {h}; er
 P {h}] > ff
 }
 dQ
 n
 {P}: {65}
 Now H
 n
 l
 022 K 010 001 001 001010K where K := fh l
 : h 2 H: H 2 H g andH
 n
 l
 ispermissible by the assumed
 permissibilityof H {Lemma 32, Appendix D}.Hence H
 n
 l satis002esthe conditions of Corollary 19 and so combining Lemma24, Equation
{65} andsubstituting ff=2 for ffand ffi=2 for ffi in Corollary 19 gives the
following Lemma on thesample size required to ensure {62} holds. Lemma 25.
If
 m025 max
 032
 32ff
 2 027 n
 log
 8C {ff027 =16; H n
 l
 }ffi
 ;
8ff
 2
 027
 033
 then
 Pr
 032
 {z; P} 2 {X 002 Y }
 {n;m}
 002 P
 n : sup
 H
 d 027
 [ ^er z
 {H};^er
 P
 {H}] >
 ff2
 033
 024
 ffi2
 :
 A.2.2INEQUALITY {63} Note that er
 P
{H} =
 1n
 P
 n i=1
 H
 003 {P
 i
 }and er
 Q
 {H} = E
 P 030Q
 H
 003
{P }, i.e the expectation ofH
 003
 {P}
 where P is distributedaccording toQ. So to bound theleft-hand-side of{63}
we can apply Corollary 19 with n = 1, m replaced byn, H replaced by H
 003
 , ff and ffireplaced by ff=2 and ffi=2 respectively,
 Preplaced by Q and Z replacedbyP . Note that H
 003
 is permissible wheneverH is {Lemma 32}.
 Thus,if n 025 max
 032 32ff 2
 027
 log 8C {ff027 =16; H
 003
 }ffi
 ; 8ff 2
 027
 033 {66}
 186A MODEL OFINDUCTIVE BIASLEARNING
 then inequality{63} is satis002ed.
 Now,putting together Lemma 23, Lemma 25 and Equation 66, we have proved
the following more general version of Theorem 2. Theorem 26. Let Hbe a permissible
hypothesis space family and letz be an {n; m}-samplegen-
 erated from the environment{P ; Q}. For all 0 < ff; ffi <1 and 027 > 0,
if n 025 max
 032 32ff 2
 027
 log 8C {ff027 =16; H
 003
 }ffi
 ; 8ff 2
 027
 033 and m 025 max
 032
 32ff
 2
 027n
 log
 8C {ff027 =16; H
n
 l
 }ffi
 ;
 8ff
 2 027
 033
 ; then
 Pr
 032 z 2 {X 002 Y}
 {n;m}
: sup
 H
 d 027
 [ ^er z
 {H};er
 Q
 {H}]> ff
 033
 024ffi
 To get Theorem 2, observethat er
 Q
 {H} > ^er
 z {H} + " }d
 027
 [ ^er
 z
 {H}; er
 Q
 {H}] > "={2 + 027}.
 Setting ff ="={2 + 027 } and maximizingff
 2
 027 gives027 = 2. Substitutingff = "=4 and 027= 2 into
 Theorem 26 givesTheorem 2.
 A.3 The Realizable Case In Theorem 2 the sample complexity forboth m and
n scales as 1="
 2
 . Thiscan be improved to
 1="if instead of requiring er
Q
 {H} 024^er
 z
 {H} + ", we require only thater
 Q
 {H}024 024 ^er
 z {H} + "
for some 024 > 1. To see this, observe that er Q
 {H} >^er
 z
 {H}{1 + ff}={1000 ff} + ff027 ={1000 ff} }
 d 027
 [ ^er
 z
 {H}; er Q
 {H}] > ff, so setting ff027 ={1000 ff} = "in Theorem 26 and treating ff
as a constant
 gives:
 Corollary 27.Under the same conditions asTheorem 26, forall " > 0 and 0
< ff;ffi < 1, if
 n025 max
 032
 32ff{1 000ff}"
 log
 8C {(1 000 ff}"=16; H
 003
 }ffi
 ; 8ff{1 000 ff}"
033
 and m 025 max 032
 32ff{1 000 ff}"n
 log
 8C {(1000 ff}"=16;H
 n
 l
 }ffi
 ; 8ff{1 000 ff}"
033
 ;
 then
 Pr
 032
 z 2{X 002 Y }
{n;m}
 : sup H
 er
 Q
 {H} 025
 1+ ff1000 ff
 ^er
 z
 {H} + " 033
 024 ffi: These bounds are particularly useful if we know that ^er
 z
 {H} = 0, for then we can set ff = 1=2
 {which maximizesff{1000 ff}}. Appendix B. Proof of Theorem 6 Recalling
De002nition 6, forH of the form given in {32},H
 n
 l
 can be written H
 n
 l
 = fg
 1
 ffif 010 001 001 001 010g
 n
 ffi f :g
 1
 ; : : : ; g n
 2 G
 l and f 2 F g : To write H
 n l
 asa composition of two functionclasses note that if for each f: X ! V we
de002ne 026
 f : {X002 Y }
 n
! {V 002 Y } n
 by
 026 f {x
 1
 ; y
 1
 ; : : : ; x n
 ; y
 n } := {f {x 1
 }; y
 1
 ; : : : ; f {x
 n
 }; y n
 }
 187BAXTER then g
 1
 ffif010 001 001 001 010g
 n
 ffi f= g
 1
 010001 001 001 010 g
 n ffi
 026
 f. Thus, setting G
 n l
 := G
 l 010 001 001 001 010 G l
 andF:= f
 026
 f: f 2
 F g, H
 n
 l
 =G
 n
 l
 ffiF : {67}
 Thefollowing two Lemmas will enable us to boundC{"; H
 n
 l
 }.
 Lemma 28.Let H : X 002Y ! [0; 1] be of theform H = G
 l
 ffi F where X 002Y
 F
 000!V 002 Y
 G
l
 000!
 [0;1]. For all "
 1
 ; "
 2
 > 0,
 C {"
 1
 + " 2
 ; H} 024 C G
 l
 {" 1
 ; F }C {"
 2
 ;G
 l
 }: Proof. Fix a measureP on X 002 Yand let F be a minimum size"
 1
 -cover for{F ; d
 [P;G
 l
 ]
 }. By
 de002nition jFj 024 C
 G
 l {"
 1
 ;F }. For each f2 F let P
 f be the measure on V 002Y de002ned by P
 f {S } =
 P{f
 0001
 {S }) for any set Sin the 033-algebra on V002 Y {f is measurable sof
 0001
 {S} is measurable}.
 LetG
 f
 be a minimum size"
 2
 -cover for{G
 l
 ; d P
 f
 }.By de002nition again, jG
 f
 j 024 C {" 2
 ; G
 l }. Let
 N :=fg ffi f : f2 F and g 2G
 f
 g. Notethat jN j 024 C
 G
 l
 {" 1
 ; F }C{"
 2
 ; G l
 } so the Lemma will be proved if N can be shown tobe an "
 1
 +"
 2
 -coverfor {H; d
 P }. So, given anyg ffi f 2 H choose f
 0
 2 Fsuch that d
 [P;G
 l
 ]
 {f ; f
 0
 }024 "
 1
 andg
 0
 2 G
 f
 0
 such thatd
 P
 f
 0 {g; g
 0 } 024 "
 2 . Now,
 d
 P
 {g ffi f; g
 0
 ffif
 0
 } 024d
 P
 {g ffif ; g ffif
0
 } + d
 P {g ffif
 0 ; g
 0
 ffif
 0
 }
024 d
 [P;G l
 ]
 {f; f
 0
 } +d
 P
 f
 0 {g; g
 0 }
 024 "
 1
 + "
 2 :
 where the 002rst line follows from the triangle inequality for d P
 and the second line followsfrom
 the facts: d
P
 {g ffif 0
 ; g
 0 ffi f
 0
} = d
 P
 f {g; g
 0 } and d
 P {g ffif ; gffi f
 0
 }024 d
 [P;G l
 ]
 {f; f
 0
 }.Thus N is an
 " 1
 + "
 2 -cover for {H;d
 P
 } and so the result follows.Recalling the de002nition of H
 1
 010001001 001 010 H
 n
 {De002nition 6}, we have the followingLemma.
 Lemma 29.
 C{"; H
 1
 010 001 001 001 010 H n
 } 024
 n
 Y
 i=1 C {"; H
 i }
 Proof. Fixa product probability measure P =P
 1
 002 001001 001 002 P
 n on {X 002 Y}
 n
 . LetN
 1
 ; : : : ; N n
 be
 "-coversof {H
 1
 ;d
 P
 1
 }: : : ; {H
 n ; d
 P
 n }. and let N =N
 1
 010 001001 001 010 N
 n . Given h = h 1
 010 001 001 001010 h
 n
 2
 H
 1
 010 001 001 001010 H
 n
 , chooseg
 1
 010 001001 001 010 g
 n 2 N such that d P
 i
 {h i
 ; g
 i } 024 " for each i= 1; : : : ; n. Now,
 d
 P
 {h
 1
 010 001001 001 010 h
 n ; g
 1
 010001 001 001 010 g
n
 } =
 1n
 Z
Z
 n
 fi
 fi fi
 fi
 fi
 n
 X
 i=1 h
 i
 {z i
 } 000
 n
 X
 i=1 g
 i
 {z i
 }
 fi fi
 fi
 fi
 fi
 dP{z
 1
 ; : : : ; z
 n
 }
 024
 1n
 n X
 i=1
 d P
 i
 {h i
 ; g
 i }
 024 ":
 Thus N is an "-cover forH
 1
 010 001 001 001010 H
 n
 and asjN j =
 Q n
 i=1
 jN
 i
 j the result follows.188A MODEL OFINDUCTIVE BIASLEARNING
 B.1Bounding C {"; H
n
 l
 }
 From Lemma 28, C
 000
 "
 1
 + "
 2
 ; G
 l
 n
 ffiF
 001 024 C {"
 1 ; G
 l
 n } C
 G
 l n
 000
 "
2
 ;F
 001
 {68}
 and from Lemma 29, C {"
 1
 ; G
 l
 n
 } 024 C{"
 1 ; G
 l
 } n
 : {69}
 Using similar techniques tothose used to prove Lemmas 28 and 29, C
 G l
 n
 {";F } can be shown to satisfy
 C
 G l
 n
 {" 2
 ;F} 024 C
 G
l
 {"
 2
 ; F } : {70} Equations {67}, {68}, {69} and {70}together imply inequality
{34}.
 B.2 Bounding C {"; H 003
 }
 Wewish to prove that C {"; H
 003
 }024 C
 G
 l {"; F } whenH is a hypothesis space family of theform
 H = fG
l
 ffif : f 2F g. Note that each H 003
 2 H
003
 corresponds to some G l
 ffi f , and that H
 003
 {P} = inf
 g2G l
 er
 P
 {g ffi f }: Any probability measure Qon P induces a probability measureQ
 X 002Y
on X 002 Y , de002ned by Q
 X 002Y {S } =
 Z P
 P {S }dQ{P }
 for anyS in the 033-algebra onX 002 Y . Note also that ifh; h
 0
 are bounded,positive functions on an
 arbitrary setA, then
 fi
 fi fi
 fi
 inf a2A
 h{a} 000 inf
 a2A
 h
 0
 {a}
 fi
 fi
 fi
 fi
024 sup
 a2A fi
 fi
 h{a}000 h
 0
 {a}
 fi
 fi
 : {71}
 LetQ be any probability measure on the spaceP of probability measures on
X002 Y . Let H
 003
 1
 ; H 003
 2
 be twoelements of H
 003
withcorresponding hypothesis spaces G l
 ffif
 1 ; G
 l
 ffif
 2
 . Then, d
 Q
 {H 003
 1
 ; H 003
 2
 } = Z
 P
 fi fi
 fi
 fi
 inf
 g2G
 l er
 P
 {gffi f
 1
 } 000inf
 g2G
 l er
 P
 {gffi f
 2
 } fi
 fi
 fi fi
 dQ{P } 024
 Z
 P sup
 g2G
 l jer
 P
 {g ffif
 1
} 000 er
 P
{g ffif
 2
 }j dQ{P }{by {71} above}
 024 Z
 P
 Z
 X 002Y
 sup
 g2G
 l
 jgffi f
 1
 {x; y} 000 g ffif
 2
 {x; y}j dP {x; y}dQ{P }
 = d [Q
 X002Y ;G
 l
 ] {f
 1
 ;f
 2
 }:
 The measurability of sup
G
 l
 gffi fis guaranteed by the permissibility ofH {Lemma 32 part 4, Ap-
 pendixD}. From d
 Q {H
 003
 1 ; H
 003
2
 } 024 d
 [Q
 X002Y ;G
 l
 ] {f
 1
 ;f
 2
 } we have, N {"; H
003
 ; d
 Q } 024 N
 000 "; F ; d
 [Q
 X002Y
 ;G
 l
 ]
 001 ; {72}
 which givesinequality {35}.
 189BAXTER B.3 Proof of Theorem 7 In order to prove the bounds inTheorem
7 we have to apply Theorem 6 to theneural network
 hypothesis space family ofequation {39}. In this case the structure is R
 d
 F 000! R
 k G
 000! [0;1]
 where G = f{x
 1
 ; : : : ; x k
 } 7! 033 020
 P
 k
 i=1
 ff
 i
 x i
 + ff
 0 021
 : {ff 0
 ; ff
 1 ; : : : ; ff
 k } 2 U g for some bounded subset U of R
 k+1
 and some Lipschitz squashingfunction 033. The feature classF : R
 d
! R
 k
 is the set ofall one hidden layer neural networks withd inputs, l hidden
nodes, koutputs, 033 as
 the squashing functionand weights w 2 T whereT is a bounded subset of R
W
 . The Lipschitz restriction on 033 and the boundedrestrictions on the weights
ensure thatF and G areLipschitz
 classes. Hence there exists b< 1 such that for all f2 F and x; x
0
 2 R
 d , kf {x}000 f {x
 0 }k <
 bkx000 x
 0
 k and for allg 2 G andx; x
 0 2 R
 k
 ,jg{x} 000 g{x
 0
 }j< bkx 000 x
 0
 k where k 001 kis the L
 1
 norm in each case. The loss function is squaredloss.
 Now, g
l
 {x; y} =l{g{x};y} = {g{x}000 y}
 2
 , hence for all g; g
 0
2 G and all probability measures P on R
 k 002 [0; 1] {recall that weassumed the output space Y was[0; 1]},
 d P
 {g
 l ; g
 0
 l } =
 Z
 R k
 002[0;1] fi
 fi
 {g{v} 000 y}
 2
 000 {g
0
 {v} 000 y}
 2
 fi
 fi
 dP {v; y}
 024 2
 Z R
 k
 fi fi
 g{v}000 g
 0
 {v}
 fi
 fi
 dP
 R
 k
{v}; {73}
 where P
 R
 k is the marginal distribution onR
 k
 derived fromP . Similarly, for allf ; f
 0
 2F and
 probability measures Pon R
 d
 002[0; 1],
 d [P;G
 l
 ]
 {f ; f
 0
 } 024 2b Z
 R
 d
 kf {x} 000f
 0
 {x}k dP
 R
 d {x}: {74} De002ne
 C
 000 "; G ; L
 1 001
 := sup
P
 N
 000
"; G ; L
 1 {P }
 001 ;
 where the supremum is overall probability measures on {the Borel subsets
of}R
 k
 , and
 N
 000
 "; G; L
 1
 {P}
 001
 is the size of thesmallest "-cover of G under theL
 1
 {P }metric. Similarly set,
 C 000
 "; F ;L
 1
 001
:= sup
 P
 N 000
 "; F ;L
 1
 {P } 001
 ;
 wherenow the supremum is over all probabilitymeasures on R
 d
 . Equations{73} and {74} imply
 C {"; G
 l
 }024 C
 020
 "2
 ;G ; L
 1
 021 {75}
 C
 G l
 {"; F }024 C
 020
 "2b
 ;F ; L
 1
 021 {76}
 Applying Theorem 11 fromHaussler {1992}, we 002nd
 C 020
 "2
 ; G
 l ; L
 1
 021 024
 024
 2eb"
 025 2k+2
 C
 020
 "2b
 ; F ; L 1
 021
 024 024
 2eb
 2"
 025 2W
 :
 Substitutingthese two expressions into {75} and {76} and applying Theorem
6 yields Theorem
 7.190A MODEL OFINDUCTIVE BIASLEARNING
Appendix C. Proof of Theorem 14

This proof follows a similar argument to the one presented in Anthony and Bartlett (1999) for ordinary Boolean function learning. First we need a technical Lemma.

Lemma 30. Let $\alpha$ be a random variable uniformly distributed on $\{1/2 + \beta/2, 1/2 - \beta/2\}$, with $0 < \beta < 1$. Let $\xi_1, \dots, \xi_m$ be i.i.d. $\{\pm 1\}$-valued random variables with $\Pr(\xi_i = 1) = \alpha$ for all $i$. For any function $f$ mapping $\{1, -1\}^m \to \{1/2 + \beta/2, 1/2 - \beta/2\}$,

  \Pr\bigl\{ \xi_1, \dots, \xi_m : f(\xi_1, \dots, \xi_m) \ne \alpha \bigr\} > \frac{1}{4} \Bigl[ 1 - \sqrt{1 - e^{-\frac{m\beta^2}{1 - \beta^2}}} \Bigr].
Proof. Let $N(\xi)$ denote the number of occurrences of $+1$ in the random sequence $\xi = (\xi_1, \dots, \xi_m)$. The function $f$ can be viewed as a decision rule, i.e. based on the observations $\xi$, $f$ tries to guess whether the probability of $+1$ is $1/2 + \beta/2$ or $1/2 - \beta/2$. The optimal decision rule is the Bayes estimator: $f(\xi_1, \dots, \xi_m) = 1/2 + \beta/2$ if $N(\xi) \ge m/2$, and $f(\xi_1, \dots, \xi_m) = 1/2 - \beta/2$ otherwise. Hence,

  \Pr\{ f(\xi) \ne \alpha \}
    \ge \frac{1}{2} \Pr\Bigl( N(\xi) \ge \frac{m}{2} \Bigm| \alpha = \frac{1}{2} - \frac{\beta}{2} \Bigr)
      + \frac{1}{2} \Pr\Bigl( N(\xi) < \frac{m}{2} \Bigm| \alpha = \frac{1}{2} + \frac{\beta}{2} \Bigr)
    > \frac{1}{2} \Pr\Bigl( N(\xi) \ge \frac{m}{2} \Bigm| \alpha = \frac{1}{2} - \frac{\beta}{2} \Bigr),

which is half the probability that a binomial $(m, 1/2 - \beta/2)$ random variable is at least $m/2$. By Slud's inequality (Slud, 1977),

  \Pr\{ f(\xi) \ne \alpha \} > \frac{1}{2} \Pr\Biggl( Z \ge \sqrt{\frac{m\beta^2}{1 - \beta^2}} \Biggr),

where $Z$ is normal $(0, 1)$. Tate's inequality (Tate, 1953) states that for all $x \ge 0$,

  \Pr(Z \ge x) \ge \frac{1}{2} \Bigl[ 1 - \sqrt{1 - e^{-x^2}} \Bigr].

Combining the last two inequalities completes the proof.
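As a quick sanity check of Lemma 30 (not part of the proof), the following Python sketch estimates the error probability of the Bayes rule by Monte Carlo and compares it with the stated lower bound; the values of $m$ and $\beta$ are arbitrary.

# Monte Carlo sanity check of Lemma 30 (illustrative only).
# Draw alpha uniformly from {1/2 + beta/2, 1/2 - beta/2}, generate m coin flips with
# Pr(+1) = alpha, apply the Bayes rule (guess the larger bias iff at least m/2 heads),
# and compare the empirical error rate with the stated lower bound.
import math
import random

def error_rate(m, beta, trials=200_000, seed=0):
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        alpha = 0.5 + beta / 2 if rng.random() < 0.5 else 0.5 - beta / 2
        heads = sum(rng.random() < alpha for _ in range(m))
        guess = 0.5 + beta / 2 if heads >= m / 2 else 0.5 - beta / 2
        errors += (guess != alpha)
    return errors / trials

def lower_bound(m, beta):
    return 0.25 * (1.0 - math.sqrt(1.0 - math.exp(-m * beta**2 / (1.0 - beta**2))))

m, beta = 50, 0.2
print("empirical Bayes error:", error_rate(m, beta))
print("Lemma 30 lower bound :", lower_bound(m, beta))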
Let $\mathbf{x} \in X^{(n,d)}$ be shattered by $\mathbb{H}$, where $d = d_{\mathbb{H}}(n)$. For each row $i$ in $\mathbf{x}$ let $\mathcal{P}_i$ be the set of all $2^d$ distributions $P$ on $X \times \{\pm 1\}$ such that $P(x, 1) = P(x, -1) = 0$ if $x$ is not contained in the $i$th row of $\mathbf{x}$, and for each $j = 1, \dots, d_{\mathbb{H}}(n)$, $P(x_{ij}, 1) = (1 \pm \beta)/(2 d_{\mathbb{H}}(n))$ and $P(x_{ij}, -1) = (1 \mp \beta)/(2 d_{\mathbb{H}}(n))$. Let $\mathcal{P} := \mathcal{P}_1 \times \cdots \times \mathcal{P}_n$.

Note that for $\mathbf{P} = (P_1, \dots, P_n) \in \mathcal{P}$, the optimal error $\mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n)$ is achieved by any sequence $\mathbf{h}^* = (h^*_1, \dots, h^*_n)$ such that $h^*_i(x_{ij}) = 1$ if and only if $P_i(x_{ij}, 1) = (1 + \beta)/(2 d_{\mathbb{H}}(n))$, and $\mathbb{H}^n$ always contains such a sequence because $\mathbb{H}$ shatters $\mathbf{x}$. The optimal error is then

  \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) = \mathrm{er}_{\mathbf{P}}(\mathbf{h}^*)
    = \frac{1}{n} \sum_{i=1}^{n} P_i\{ h^*_i(x) \ne y \}
    = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{d_{\mathbb{H}}(n)} \frac{1 - \beta}{2 d_{\mathbb{H}}(n)}
    = \frac{1 - \beta}{2},

and for any $\mathbf{h} = (h_1, \dots, h_n) \in \mathbb{H}^n$,

  \mathrm{er}_{\mathbf{P}}(\mathbf{h}) = \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \frac{\beta}{n \, d_{\mathbb{H}}(n)} \, \bigl| \{ (i, j) : h_i(x_{ij}) \ne h^*_i(x_{ij}) \} \bigr|. \qquad (77)
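Purely as an illustration of the construction above (with hypothetical small values of $n$, $d_{\mathbb{H}}(n)$ and $\beta$, and with every $P_i$ chosen to favour the label $+1$ at every point), the following Python sketch verifies that $\mathbf{h}^*$ attains error $(1 - \beta)/2$ and checks the excess-error formula (77) for a perturbed $\mathbf{h}$.

# Illustrative check of the construction above (not part of the proof).
# We build the distributions P_i on the shattered points x_ij, verify that the
# sequence h* attains error (1 - beta)/2, and check formula (77) for an h that
# disagrees with h* on r of the (i, j) pairs.  n, d and beta are hypothetical.
import itertools
import random

n, d, beta = 3, 4, 0.2        # number of tasks, points per row (d = d_H(n)), bias
rng = random.Random(0)

# In this illustration every P_i puts the larger mass (1 + beta)/(2d) on label +1
# at every point, so h*_i(x_ij) = 1 for all (i, j).
p_plus, p_minus = (1 + beta) / (2 * d), (1 - beta) / (2 * d)
h_star = [[1] * d for _ in range(n)]

def er(h):
    """Average over tasks of the misclassification probability under P_i."""
    total = 0.0
    for i in range(n):
        for j in range(d):
            # an error occurs when the true label is the one h does not predict
            total += p_minus if h[i][j] == 1 else p_plus
    return total / n

# Perturb h* on r randomly chosen (i, j) pairs.
r = 5
flips = rng.sample(list(itertools.product(range(n), range(d))), r)
h = [row[:] for row in h_star]
for i, j in flips:
    h[i][j] = -h[i][j]

opt = er(h_star)
print("opt_P(H^n)         :", opt, "  (1 - beta)/2 =", (1 - beta) / 2)
print("er_P(h)            :", er(h))
print("opt + beta*r/(n*d) :", opt + beta * r / (n * d))   # right-hand side of (77)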
For any $(n, m)$-sample $\mathbf{z}$, let each element $m_{ij}$ in the array

  m(\mathbf{z}) := \begin{pmatrix} m_{11} & \cdots & m_{1 d_{\mathbb{H}}(n)} \\ \vdots & \ddots & \vdots \\ m_{n1} & \cdots & m_{n d_{\mathbb{H}}(n)} \end{pmatrix}

equal the number of occurrences of $x_{ij}$ in $\mathbf{z}$.

Now, if we select $\mathbf{P} = (P_1, \dots, P_n)$ uniformly at random from $\mathcal{P}$, and generate an $(n, m)$-sample $\mathbf{z}$ using $\mathbf{P}$, then for $\mathbf{h} = A_n(\mathbf{z})$ (the output of the learning algorithm) we have:

  E\bigl( \bigl| \{ (i, j) : h_i(x_{ij}) \ne h^*_i(x_{ij}) \} \bigr| \bigr)
    = \sum_{m} P(m) \, E\bigl( \bigl| \{ (i, j) : h_i(x_{ij}) \ne h^*_i(x_{ij}) \} \bigr| \bigm| m \bigr)
    = \sum_{m} P(m) \sum_{i=1}^{n} \sum_{j=1}^{d_{\mathbb{H}}(n)} P\bigl( h(x_{ij}) \ne h^*(x_{ij}) \bigm| m_{ij} \bigr),

where $P(m)$ is the probability of generating a configuration $m$ of the $x_{ij}$ under the $(n, m)$-sampling process and the sum is over all possible configurations. From Lemma 30,

  P\bigl( h(x_{ij}) \ne h^*(x_{ij}) \bigm| m_{ij} \bigr) > \frac{1}{4} \Bigl[ 1 - \sqrt{1 - e^{-\frac{m_{ij}\beta^2}{1 - \beta^2}}} \Bigr],
hence

  E\Bigl[ \frac{1}{n \, d_{\mathbb{H}}(n)} \bigl| \{ (i, j) : h_i(x_{ij}) \ne h^*_i(x_{ij}) \} \bigr| \Bigr]
    > \frac{1}{n \, d_{\mathbb{H}}(n)} \sum_{m} P(m) \sum_{i=1}^{n} \sum_{j=1}^{d_{\mathbb{H}}(n)} \frac{1}{4} \Bigl[ 1 - \sqrt{1 - e^{-\frac{m_{ij}\beta^2}{1 - \beta^2}}} \Bigr]
    \ge \frac{1}{4} \Bigl[ 1 - \sqrt{1 - e^{-\frac{m\beta^2}{d_{\mathbb{H}}(n)(1 - \beta^2)}}} \Bigr] \qquad (78)

by Jensen's inequality. Since for any $[0, 1]$-valued random variable $Z$, $\Pr(Z > x) \ge E Z - x$, (78) implies:

  \Pr\Bigl( \frac{1}{n \, d_{\mathbb{H}}(n)} \bigl| \{ (i, j) : h_i(x_{ij}) \ne h^*_i(x_{ij}) \} \bigr| > \gamma\alpha \Bigr) > (1 - \gamma)\alpha,

where

  \alpha := \frac{1}{4} \Bigl[ 1 - \sqrt{1 - e^{-\frac{m\beta^2}{d_{\mathbb{H}}(n)(1 - \beta^2)}}} \Bigr] \qquad (79)

and $\gamma \in [0, 1]$. Plugging this into (77) shows that

  \Pr\bigl\{ (\mathbf{P}, \mathbf{z}) : \mathrm{er}_{\mathbf{P}}(A_n(\mathbf{z})) > \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \gamma\alpha\beta \bigr\} > (1 - \gamma)\alpha.
Since the inequality holds over the random choice of $\mathbf{P}$, it must also hold for some specific choice of $\mathbf{P}$. Hence for any learning algorithm $A_n$ there is some sequence of distributions $\mathbf{P}$ such that

  \Pr\bigl\{ \mathbf{z} : \mathrm{er}_{\mathbf{P}}(A_n(\mathbf{z})) > \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \gamma\alpha\beta \bigr\} > (1 - \gamma)\alpha.

Setting

  (1 - \gamma)\alpha \ge \delta \quad \text{and} \quad \gamma\alpha\beta \ge \varepsilon \qquad (80)

ensures

  \Pr\bigl\{ \mathbf{z} : \mathrm{er}_{\mathbf{P}}(A_n(\mathbf{z})) > \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \varepsilon \bigr\} > \delta. \qquad (81)

Assuming equality in (80), we get

  \alpha = \frac{\delta}{1 - \gamma}, \qquad \beta = \frac{\varepsilon}{\delta}\,\frac{1 - \gamma}{\gamma}.

Solving (79) for $m$, and substituting the above expressions for $\alpha$ and $\beta$, shows that (81) is satisfied provided

  m \le d_{\mathbb{H}}(n) \Biggl[ \Bigl( \frac{\delta}{\varepsilon} \Bigr)^2 \Bigl( \frac{\gamma}{1 - \gamma} \Bigr)^2 - 1 \Biggr] \log \frac{(1 - \gamma)^2}{8\delta(1 - \gamma - 2\delta)}. \qquad (82)

Setting $\gamma = 1 - a\delta$ for some $a > 4$ ($a > 4$ since $\alpha < 1/4$ and $\alpha = \delta/(1 - \gamma)$), and assuming $\varepsilon, \delta \le 1/(ka)$ for some $k > 2$, (82) becomes

  m \le \frac{d_{\mathbb{H}}(n)}{a^2 \varepsilon^2} \Bigl( 1 - \frac{2}{k} \Bigr) \log \frac{a^2}{8(a - 2)}. \qquad (83)

Subject to the constraint $a > 4$, the right hand side of (83) is approximately maximized at $a = 8.7966$, at which point the value exceeds $d_{\mathbb{H}}(n)(1 - 2/k)/(220\varepsilon^2)$. Thus, for all $k \ge 1$, if $\varepsilon, \delta \le 1/(9k)$ and

  m \le \frac{d_{\mathbb{H}}(n)\bigl(1 - \frac{2}{k}\bigr)}{220\varepsilon^2}, \qquad (84)

then

  \Pr\bigl\{ \mathbf{z} : \mathrm{er}_{\mathbf{P}}(A_n(\mathbf{z})) > \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \varepsilon \bigr\} > \delta.
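The constant 220 and the value $a = 8.7966$ quoted above are easy to check numerically; the short Python sketch below (illustrative only) maximizes the $a$-dependent factor $\log(a^2/(8(a-2)))/a^2$ of (83) over $a > 4$ by a coarse grid search.

# Numerical check (illustrative only) of the claim that, subject to a > 4,
# the a-dependent factor  log(a^2 / (8*(a - 2))) / a^2  appearing in (83) is
# approximately maximized near a = 8.7966, where its value exceeds 1/220.
import math

def factor(a):
    return math.log(a ** 2 / (8.0 * (a - 2.0))) / a ** 2

# Coarse grid search over a in (4, 100]; any simple optimizer would do here.
grid = [4.0 + 0.001 * (i + 1) for i in range(96_000)]
a_best = max(grid, key=factor)
print("maximizing a  ~", round(a_best, 3))       # close to 8.7966, as quoted above
print("maximum value ~", factor(a_best))         # roughly 1/219, which exceeds 1/220
print("1/220          =", 1 / 220)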
To obtain the $\delta$-dependence in Theorem 14, observe that by assumption $\mathbb{H}^1$ contains at least two functions $h_1, h_2$, hence there exists an $x \in X$ such that $h_1(x) \ne h_2(x)$. Let $P_{\pm}$ be two distributions concentrated on $(x, 1)$ and $(x, -1)$ such that $P_{\pm}(x, h_1(x)) = (1 \pm \varepsilon)/2$ and $P_{\pm}(x, h_2(x)) = (1 \mp \varepsilon)/2$. Let $\mathbf{P}_+ := P_+ \times \cdots \times P_+$ and $\mathbf{P}_- := P_- \times \cdots \times P_-$ be the product distributions on $(X \times \{\pm 1\})^n$ generated by $P_{\pm}$, and $\mathbf{h}_1 := (h_1, \dots, h_1)$, $\mathbf{h}_2 := (h_2, \dots, h_2)$. Note that $\mathbf{h}_1$ and $\mathbf{h}_2$ are both in $\mathbb{H}^n$. If $\mathbf{P}$ is one of $\mathbf{P}_{\pm}$ and the learning algorithm $A_n$ chooses the wrong hypothesis $\mathbf{h}$, then

  \mathrm{er}_{\mathbf{P}}(\mathbf{h}) - \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) = \varepsilon.

Now, if we choose $\mathbf{P}$ uniformly at random from $\{\mathbf{P}_+, \mathbf{P}_-\}$ and generate an $(n, m)$-sample $\mathbf{z}$ according to $\mathbf{P}$, Lemma 30 shows that

  \Pr\bigl\{ (\mathbf{P}, \mathbf{z}) : \mathrm{er}_{\mathbf{P}}(A_n(\mathbf{z})) \ge \mathrm{opt}_{\mathbf{P}}(\mathbb{H}^n) + \varepsilon \bigr\} > \frac{1}{4} \Bigl[ 1 - \sqrt{1 - e^{-\frac{nm\varepsilon^2}{1 - \varepsilon^2}}} \Bigr],

which is at least $\delta$ if

  m < \frac{1 - \varepsilon^2}{\varepsilon^2}\,\frac{1}{n}\,\log \frac{1}{8\delta(1 - 2\delta)}, \qquad (85)

provided $0 < \delta < 1/4$. Combining the two constraints on $m$, (84) (with $k = 7$) and (85), and using $\max\{x_1, x_2\} \ge \frac{1}{2}(x_1 + x_2)$, finishes the proof.
Appendix D. Measurability

In order for Theorems 2 and 18 to hold in full generality we had to impose a constraint called "permissibility" on the hypothesis space family $\mathbb{H}$. Permissibility was introduced by Pollard (1984) for ordinary hypothesis classes $\mathcal{H}$. His definition is very similar to Dudley's "image admissible Suslin" (Dudley, 1984). We will be extending this definition to cover hypothesis space families.

Throughout this section we assume all functions $h$ map from (the complete separable metric space) $Z$ into $[0, 1]$. Let $\mathcal{B}(T)$ denote the Borel $\sigma$-algebra of any topological space $T$. As in Section 2.2, we view $\mathcal{P}$, the set of all probability measures on $Z$, as a topological space by equipping it with the topology of weak convergence. $\mathcal{B}(\mathcal{P})$ is then the $\sigma$-algebra generated by this topology. The following two definitions are taken (with minor modifications) from Pollard (1984).

Definition 8. A set $\mathcal{H}$ of $[0, 1]$-valued functions on $Z$ is indexed by the set $T$ if there exists a function $f : Z \times T \to [0, 1]$ such that $\mathcal{H} = \{ f(\cdot, t) : t \in T \}$.

Definition 9. The set $\mathcal{H}$ is permissible if it can be indexed by a set $T$ such that

  1. $T$ is an analytic subset of a Polish^7 space $\bar{T}$, and
  2. the function $f : Z \times T \to [0, 1]$ indexing $\mathcal{H}$ by $T$ is measurable with respect to the product $\sigma$-algebra $\mathcal{B}(Z) \otimes \mathcal{B}(T)$.

An analytic subset $T$ of a Polish space $\bar{T}$ is simply the continuous image of a Borel subset $X$ of another Polish space $\bar{X}$. The analytic subsets of a Polish space include the Borel sets. They are important because projections of analytic sets are analytic, and can be measured in a complete measure space, whereas projections of Borel sets are not necessarily Borel, and hence cannot be measured with a Borel measure. For more details see Dudley (1989), section 13.2.
Lemma 31. $\mathcal{H}_1 \otimes \cdots \otimes \mathcal{H}_n : (X \times Y)^n \to [0, 1]$ is permissible if $\mathcal{H}_1, \dots, \mathcal{H}_n$ are all permissible.

Proof. Omitted.

We now define permissibility of hypothesis space families.

7. A topological space is called Polish if it is metrizable such that it is a complete separable metric space.

Definition 10. A hypothesis space family $\mathbb{H} = \{\mathcal{H}\}$ is permissible if there exist sets $S$ and $T$ that are analytic subsets of Polish spaces $\bar{S}$ and $\bar{T}$ respectively, and a function $f : Z \times T \times S \to [0, 1]$, measurable with respect to $\mathcal{B}(Z) \otimes \mathcal{B}(T) \otimes \mathcal{B}(S)$, such that

  \mathbb{H} = \Bigl\{ \{ f(\cdot, t, s) : t \in T \} : s \in S \Bigr\}.
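As an informal illustration (this identification is an assumption of this note, not a construction given explicitly in the text), the loss-composed neural-network family of Appendix B.3 fits Definition 10 by taking $Z = X \times Y$, the inner index set to be the bounded output-weight set $U \subset \mathbb{R}^{k+1}$, the outer index set to be the bounded feature-weight set $T \subset \mathbb{R}^W$, and, writing $\phi_s$ for the feature map computed with weights $s$:

% Sketch only: an assumed indexing of the squared-loss network family in the sense
% of Definition 10; phi_s and the index-set identification are illustrative.
\[
  f\bigl((x, y), t, s\bigr) \;:=\; \Bigl( \sigma\bigl( \alpha_0 + \textstyle\sum_{i=1}^{k} \alpha_i\, \phi_s(x)_i \bigr) - y \Bigr)^{2},
  \qquad t = (\alpha_0, \dots, \alpha_k) \in U, \quad s \in T.
\]

Bounded Borel subsets of Euclidean space are in particular analytic subsets of Polish spaces, so under this assumed indexing the substantive requirement of Definition 10 is the joint measurability of $f$, which holds, for instance, when $\sigma$ is continuous.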
Let $(X, \Sigma, \mu)$ be a measure space and $T$ be an analytic subset of a Polish space. Let $\mathcal{A}(X)$ denote the analytic subsets of $X$. The following three facts about analytic sets are taken from Pollard (1984), appendix C.

  (a) If $(X, \Sigma, \mu)$ is complete then $\mathcal{A}(X) \subset \Sigma$.
  (b) $\mathcal{A}(X \times T)$ contains the product $\sigma$-algebra $\Sigma \otimes \mathcal{B}(T)$.
  (c) For any set $Y$ in $\mathcal{A}(X \times T)$, the projection $\pi_X Y$ of $Y$ onto $X$ is in $\mathcal{A}(X)$.

Recall Definition 2 for the definition of $\mathcal{H}^*$. In the following Lemma we assume that $(Z, \mathcal{B}(Z))$ has been completed with respect to any probability measure $P$, and also that $(\mathcal{P}, \mathcal{B}(\mathcal{P}))$ is complete with respect to the environmental measure $Q$.

Lemma 32. For any permissible hypothesis space family $\mathbb{H}$,

  1. $\mathbb{H}^n_l$ is permissible.
  2. $\{ h \in \mathcal{H} : \mathcal{H} \in \mathbb{H} \}$ is permissible.
  3. $\mathcal{H}$ is permissible for all $\mathcal{H} \in \mathbb{H}$.
  4. $\sup_{\mathcal{H}}$ and $\inf_{\mathcal{H}}$ are measurable for all $\mathcal{H} \in \mathbb{H}$.
  5. $\mathcal{H}^*$ is measurable for all $\mathcal{H} \in \mathbb{H}$.
  6. $\mathbb{H}^*$ is permissible.
Proof. As we have absorbed the loss function into the hypotheses $h$, $\mathbb{H}^n_l$ is simply the set of all $n$-fold products $\mathcal{H} \otimes \cdots \otimes \mathcal{H}$ such that $\mathcal{H} \in \mathbb{H}$. Thus (1) follows from Lemma 31. (2) and (3) are immediate from the definitions. As $\mathcal{H}$ is permissible for all $\mathcal{H} \in \mathbb{H}$, (4) can be proved by an identical argument to that used in the "Measurable Suprema" section of Pollard (1984), appendix C.

For (5), note that for any Borel-measurable $h : Z \to [0, 1]$, the function $\bar{h} : \mathcal{P} \to [0, 1]$ defined by $\bar{h}(P) := \int_Z h(z) \, dP(z)$ is Borel measurable (Kechris, 1995, chapter 17). Now, permissibility of $\mathcal{H}$ automatically implies permissibility of $\bar{\mathcal{H}} := \{ \bar{h} : h \in \mathcal{H} \}$, and $\mathcal{H}^* = \inf \bar{\mathcal{H}}$, so $\mathcal{H}^*$ is measurable by (4).

Now let $\mathbb{H}$ be indexed by $f : Z \times T \times S \to [0, 1]$ in the appropriate way. To prove (6), define $g : \mathcal{P} \times T \times S \to [0, 1]$ by $g(P, t, s) := \int_Z f(z, t, s) \, dP(z)$. By Fubini's theorem $g$ is a $\mathcal{B}(\mathcal{P}) \otimes \mathcal{B}(T) \otimes \mathcal{B}(S)$-measurable function. Let $G : \mathcal{P} \times S \to [0, 1]$ be defined by $G(P, s) := \inf_{t \in T} g(P, t, s)$. $G$ indexes $\mathbb{H}^*$ in the appropriate way for $\mathbb{H}^*$ to be permissible, provided it can be shown that $G$ is $\mathcal{B}(\mathcal{P}) \otimes \mathcal{B}(S)$-measurable. This is where analyticity becomes important. Let $g_\alpha := \{ (P, t, s) : g(P, t, s) > \alpha \}$. By property (b) of analytic sets, $\mathcal{A}(\mathcal{P} \times T \times S)$ contains $g_\alpha$. The set $G_\alpha := \{ (P, s) : G(P, s) > \alpha \}$ is the projection of $g_\alpha$ onto $\mathcal{P} \times S$, which by property (c) is also analytic. As $(\mathcal{P}, \mathcal{B}(\mathcal{P}), Q)$ is assumed complete, $G_\alpha$ is measurable, by property (a). Thus $G$ is a measurable function and the permissibility of $\mathbb{H}^*$ follows.

References
Abu-Mostafa, Y. (1993). A method for learning from hints. In Hanson, S. J., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5, pp. 73-80, San Mateo, CA. Morgan Kaufmann.

Anthony, M., & Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK.

Bartlett, P. L. (1993). Lower bounds on the VC-dimension of multi-layer threshold networks. In Proceedings of the Sixth ACM Conference on Computational Learning Theory, pp. 144-150, New York. ACM Press. Summary appeared in Neural Computation, 5, no. 3.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 525-536.

Baxter, J. (1995a). Learning Internal Representations. Ph.D. thesis, Department of Mathematics and Statistics, The Flinders University of South Australia. Copy available from http://wwwsyseng.anu.edu.au/~jon/papers/thesis.ps.gz.

Baxter, J. (1995b). Learning internal representations. In Proceedings of the Eighth International Conference on Computational Learning Theory, pp. 311-320. ACM Press. Copy available from http://wwwsyseng.anu.edu.au/~jon/papers/colt95.ps.gz.

Baxter, J. (1997a). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28, 7-40.

Baxter, J. (1997b). The canonical distortion measure for vector quantization and function approximation. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 39-47. Morgan Kaufmann.

Baxter, J., & Bartlett, P. L. (1998). The canonical distortion measure in feature space and 1-NN classification. In Advances in Neural Information Processing Systems 10, pp. 245-251. MIT Press.

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36, 929-965.

Caruana, R. (1997). Multitask learning. Machine Learning, 28, 41-70.

Devroye, L., Györfi, L., & Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.

Dudley, R. M. (1984). A Course on Empirical Processes, Vol. 1097 of Lecture Notes in Mathematics, pp. 2-142. Springer-Verlag.

Dudley, R. M. (1989). Real Analysis and Probability. Wadsworth & Brooks/Cole, California.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (Eds.). (1995). Bayesian Data Analysis. Chapman and Hall.

Good, I. J. (1980). Some history of the hierarchical Bayesian methodology. In Bernardo, J. M., Groot, M. H. D., Lindley, D. V., & Smith, A. F. M. (Eds.), Bayesian Statistics II. University Press, Valencia.

Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 78-150.

Heskes, T. (1998). Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Shavlik, J. (Ed.), Proceedings of the 15th International Conference on Machine Learning (ICML '98), pp. 233-241. Morgan Kaufmann.

Intrator, N., & Edelman, S. (1996). How to make a low-dimensional representation suitable for diverse tasks. Connection Science, 8.

Kechris, A. S. (1995). Classical Descriptive Set Theory. Springer-Verlag, New York.

Khan, K., Muggleton, S., & Parson, R. (1998). Repeat learning using predicate invention. In Page, C. D. (Ed.), Proceedings of the 8th International Workshop on Inductive Logic Programming (ILP-98), LNAI 1446, pp. 165-174. Springer-Verlag.

Langford, J. C. (1999). Staged learning. Tech. rep., CMU, School of Computer Science. http://www.cs.cmu.edu/~jcl/research/ltol/stagedlatest.ps.

Mitchell, T. M. (1991). The need for biases in learning generalisations. In Dietterich, T. G., & Shavlik, J. (Eds.), Readings in Machine Learning. Morgan Kaufmann.

Parthasarathy, K. R. (1967). Probability Measures on Metric Spaces. Academic Press, London.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

Pratt, L. Y. (1992). Discriminability-based transfer between neural networks. In Hanson, S. J., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5, pp. 204-211. Morgan Kaufmann.

Rendell, L., Seshu, R., & Tcheng, D. (1987). Layered concept learning and dynamically-variable bias management. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI '87), pp. 308-314. IJCAI, Inc.

Ring, M. B. (1995). Continual Learning in Reinforcement Environments. R. Oldenbourg Verlag.

Russell, S. (1989). The Use of Knowledge in Analogy and Induction. Morgan Kaufmann.

Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory A, 13, 145-168.

Sharkey, N. E., & Sharkey, A. J. C. (1993). Adaptive generalisation and the transfer of knowledge. Artificial Intelligence Review, 7, 313-328.
Silver, D. L., & Mercer, R. E. (1996). The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connection Science, 8, 277-294.

Singh, S. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8, 323-339.

Slud, E. (1977). Distribution inequalities for the binomial law. Annals of Probability, 4, 404-412.

Suddarth, S. C., & Holden, A. D. C. (1991). Symbolic-neural systems and the use of hints in developing complex systems. International Journal of Man-Machine Studies, 35, 291-311.

Suddarth, S. C., & Kergosien, Y. L. (1990). Rule-injection hints as a means of improving network performance and learning time. In Proceedings of the EURASIP Workshop on Neural Networks, Portugal. EURASIP.

Sutton, R. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 171-176. MIT Press.

Tate, R. F. (1953). On a double inequality of the normal distribution. Annals of Mathematical Statistics, 24, 132-134.

Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems 8, pp. 640-646. MIT Press.

Thrun, S., & Mitchell, T. M. (1995). Learning one more thing. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1217-1223. Morgan Kaufmann.

Thrun, S., & O'Sullivan, J. (1996). Discovering structure in multiple learning tasks: The TC algorithm. In Saitta, L. (Ed.), Proceedings of the 13th International Conference on Machine Learning (ICML '96), pp. 489-497. Morgan Kaufmann.

Thrun, S., & Pratt, L. (Eds.). (1997). Learning to Learn. Kluwer Academic.

Thrun, S., & Schwartz, A. (1995). Finding structure in reinforcement learning. In Tesauro, G., Touretzky, D., & Leen, T. (Eds.), Advances in Neural Information Processing Systems, Vol. 7, pp. 385-392. MIT Press.

Utgoff, P. E. (1986). Shift of bias for inductive concept learning. In Machine Learning: An Artificial Intelligence Approach, pp. 107-147. Morgan Kaufmann.

Valiant, L. G. (1984). A theory of the learnable. Comm. ACM, 27, 1134-1142.

Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York.

Vapnik, V. N. (1996). The Nature of Statistical Learning Theory. Springer Verlag, New York.