Abstract:A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is supplied by hand through the skill and insights of experts. In this paper a model for automatically learning bias is investigated. The central assumption of the model is that the learner is embedded within an environment of related learning tasks. Within such an environment the learner can sample from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to many of the problems in the environment. Under certain restrictions on the set of all hypothesis spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. Explicit bounds are also derived demonstrating that learning multiple tasks within an environment of related tasks can potentially give much better generalization than learning a single task.
Extracted TextJournal of Arti002cial IntelligenceResearch 12 {2000} 149226198 Submitted11/99;
published 3/00
A Model of Inductive Bias Learning
JonathanBaxter JONATHAN.BAXTER@ANU.EDU.AU
ResearchSchool of Information Sciences and Engineering Australian National
University, Canberra 0200, Australia
Abstract
A major problem in machinelearning is that of inductive bias:how to choose
a learner's hy- pothesis space so that it is large enough to contain a solution
to the problem being learnt, yetsmall
enough to ensure reliable generalizationfrom reasonably-sized training sets.
Typically such bias is
supplied by handthrough the skill and insights of experts.In this paper
a model for automatically learning bias is investigated.The central assumption
of the model is that thelearner is embedded
within anenvironment of related learning tasks.Within such an environment
the learner cansample
from multiple tasks, and hence it cansearch for a hypothesis space that
contains goodsolutions to
many of the problems in the environment. Under certain restrictions on theset
of all hypothesis
spaces availableto the learner, we show that a hypothesis space that performs
well on a suf002ciently large number of training tasks will alsoperform
well when learning novel tasks in thesame en-
vironment. Explicit bounds arealso derived demonstrating that learning multipletasks
within an
environment of relatedtasks can potentially give much bettergeneralization
than learning a single
task.
1. Introduction
Oftenthe hardest problem in any machinelearning task is the initial choice
of hypothesisspace;
it has to be large enough tocontain a solution to the problem at hand, yetsmall
enough to ensure
good generalization froma small number of examples {Mitchell, 1991}.Once
a suitable bias has
been found, theactual learning task is often straightforward.Existing methods
of bias generally
requirethe input of a human expert in the form of heuristics and domain
knowledge {for example, through the selection of an appropriate setof features}.
Despite their successes, suchmethods are
clearly limited by the accuracyand reliability of the expert'sknowledge
and also by the extent to whichthat knowledge can be transferred to the learner.
Thus it is natural to search formethods for
automatically learningthe bias.
In this paper we introduce andanalyze a formal model of bias learningthat
builds upon
the PACmodel of machine learning and its variants {Vapnik, 1982; Valiant,
1984;Blumer,
Ehrenfeucht, Haussler, &Warmuth, 1989; Haussler, 1992}.These models typically
take the
following general form: the learner is supplied with a hypothesis space
H and training data z =f{x
1 ; y
1
}; : : : ; {x
m ; y
m
}g drawn independently according tosomeunderlying distribution P
onX 002 Y . Based on theinformation contained in z, the learner's goal
is to select a hypothesis h : X ! Yfrom H minimizing some measureer
P
{h}of expected loss with respect to P{for ex-
ample, in the case ofsquared loss er
P
{h} := E
{x;y}030P
{h{x} 000 y} 2
}. In such models thelearner's
bias is represented by thechoice of H; if H does not contain agood solution
to the problem, then,
regardless of how much data the learner receives, it cannot learn.
Of course, the bestway to bias the learner is to supply it withan H containing
just a single op- timal hypothesis. But 002nding such ahypothesis is precisely
the original learning problem,so in the
c
fl2000AI Access Foundation and Morgan KaufmannPublishers. Allrights reserved.BAXTER
PAC model there is no distinctionbetween bias learning and ordinary learning.Or
put differently,
thePAC model does not model the process ofinductive bias, it simply takes
the hypothesisspace H
as given and proceeds fromthere. To overcome this problem, inthis paper
we assume that instead
of being faced with just a single learning task,the learner is embedded
within an environmentof
related learning tasks. Thelearner is supplied with a family of hypothesisspaces
H = fHg, and its goal is to 002nd a bias {i.e.hypothesis space H 2 H } that
isappropriate for the entire environment. A simple example is the problem
of handwrittencharacter recognition. A preprocessingstage that
identi002es and removes any{small} rotations, dilations and translations
ofanimage of a character
will be advantageousfor recognizing allcharacters. If the set ofall individual
character recognition
problems is viewed as an environment oflearning problems {that is, the set
of all problems of the form 223distinguish `A ' from allother characters224,
223distinguish `B' from allother characters224, and
so on}, thispreprocessor represents a bias that is appropriate forall problems
in the environment.
It is likely that there are many othercurrently unknown biases that are
also appropriateforthis
environment. Wewould like to be able to learn theseautomatically.
There are many other examples of learning problems that can be viewed as
belonging to envi-
ronments of relatedproblems. For example, each individual face recognition
problem belongs to an {essentially in002nite} set of related learningproblems
{all the other individual facerecognition prob-
lems}; the set of all individual spoken word recognition problems formsanother
large environment,
asdoes the set of all 002ngerprint recognitionproblems, printed Chinese
and Japanese character recog- nition problems, stock price prediction problemsand
so on. Even medical diagnostic andprognostic
problems, where a multitude ofdiseases are predicted from the same pathology
tests,constitute an
environment ofrelated learningproblems.
In many cases these 223environments224 arenot normally modeled as such;instead
they are treated
as single, multiplecategory learning problems. For example,recognizing a
group of faces would normally be viewed as a single learningproblem with
multiple class labels {one for each face in
the group}, not as multiple individual learning problems. However,if a reliable
classi002er for each
individual face in the group can beconstructed then they can easily be combined
toproduce a
classi002er for the whole group.Furthermore, byviewing the faces as an
environment of related
learning problems, theresults presented here show that bias can belearnt
that will be good for
learningnovel faces, a claim that cannot be madefor the traditional approach.
This point goesto the heart of our model: we are not notconcerned with adjusting
a learner's bias so it performs better on some002xed set of learning problems.
Sucha process is in fact just
ordinarylearning but with a richer hypothesis space inwhich some components
labelled 223bias224 are also able to be varied. Instead,we suppose the
learner is faced with a{potentially in002nite} stream of
tasks, and that by adjusting its bias on somesubset of the tasks it improves
its learningperformance
on future, as yet unseen tasks. Bias that is appropriate forall problemsin
an environment must be learnt by sampling from many tasks. If only a single
task is learnt then the bias extracted is likely to be speci002c to that
task. In the rest ofthis paper, a general theory of bias learning isdeveloped
based upon the idea of learning multiple related tasks. Looselyspeaking {formal
results are stated in Section 2},there are
twomain conclusions ofthetheory presented here:
* Learningmultiple related tasks reduces the sampling burdenrequired for
good generalization,
at least ona number-of-examples-required-per-taskbasis.
150A MODEL OFINDUCTIVE BIASLEARNING
* Bias that is learnt on suf002ciently manytraining tasks is likely to be
good for learningnovel
tasks drawn from the same environment.
The second point shows that aform of meta-generalization ispossible in bias
learning. Or-
dinarily, wesay a learner generalizes wellif, after seeing suf002ciently
many training examples, it
produces a hypothesis that withhigh probability will perform well on future
examples of the same
task. However,a bias learner generalizes well if, after seeing suf002ciently
many training tasks it pro- duces a hypothesis space that with highprobability
contains good solutions to novel tasks.Another
term that has been used forthis process is Learning to Learn {Thrun &Pratt,
1997}.
Our main theorems are statedin an agnostic setting {that is, Hdoes not necessarily
contain a
hypothesisspace with solutions to all the problemsin the environment}, butwe
also giveimproved
bounds in the realizable case.The sample complexity bounds appearing intheseresults
are stated
interms of combinatorialparameters related to the complexity of the set
of all hypothesis spaces H
availableto the bias learner. For Boolean learning problems {pattern classi002cation}
these parameters are the bias learning analogue of theVapnik-Chervonenkis
dimension{Vapnik, 1982; Blumer et al.,
1989}. As an application of the general theory, the problem of learning
an appropriate set ofneural-
network features for an environmentof related tasks is formulated as a bias
learningproblem. In
the case of continuousneural-network features we are able to proveupper
bounds on the number
of training tasksand number of examples of each training taskrequired to
ensure a set of features that works well for the training tasks will,with
high probability, work well on novel tasks drawn
from the same environment.The upper bound on the number of tasks scalesas
O{b} where bis
a measure of the complexity of thepossible feature sets available to the
learner,whilethe upper
bound on the number of examples of each task scales as O{a+ b=n} where O{a}
is the number
ofexamples required to learn a task if the223true224 set of features {that
is, the correctbias} is already
known, and nis the number of tasks. Thus, in this casewe see that as the
number of related tasks learnt increases, the number of examplesrequired
of each task for good generalization decaysto
the minimum possible. ForBoolean neural-network feature maps we are able
toshow a matching
lowerbound on thenumber of examples required per task of the sameform.
1.1 Related Work There is a large body of previousalgorithmic and experimental
work in the machinelearning and
statistics literature addressingthe problems of inductive bias learning
and improving generalization
through multiple tasklearning. Some of these approaches can be seenas special
cases of, or at least
closely aligned with, the model described here,while others are more orthogonal.
Withoutbeing
completely exhaustive, in thissection we present an overview of the maincontributions.
See Thrun
and Pratt{1997, chapter 1} for a more comprehensivetreatment.
* Hierarchical Bayes.The earliest approaches to bias learning comefrom Hierarchical
Bayesian
methods in statistics{Berger, 1985; Good,1980; Gelman, Carlin, Stern,& Rubim,
1995}.
In contrast to the Bayesianmethodology, the present paper takes anessentially
empirical
process approach tomodeling the problem of bias learning.However, a model
using a mixture of hierarchical Bayesian and information-theoretic ideaswas
presented in Baxter {1997a}, with similar conclusions tothose found here.An
empirical study showing the utility of the hierarchical Bayes approach in
a domaincontaining alarge number of related tasks wasgiven
in Heskes {1998}. 151BAXTER
* Early machine learning work.In Rendell, Seshu, and Tcheng {1987}223VBMS224
or Variable Bias Management System was introduced as amechanism for selecting
amongst different learning algorithms when tackling a new learningproblem.
223STABB224 or Shift To a Better Bias {Ut-
goff, 1986} was another early scheme for adjusting bias, butunlike VBMS,
STABB was not primarily focussed on searching forbias applicableto large
problem domains. Our use of an 223environment ofrelated tasks224 in thispaper
may also be interpreted as an 223environmentof
analogous tasks224 in the sense thatconclusions about one task can be arrived
at by analogy
with {suf002ciently many of} the other tasks. For an early discussion ofanalogy
in this con-
text, see Russell{1989, S4.3}, in particular the observation thatfor analogous
problems the
samplingburden per task can be reduced.
* Metric-based approaches.The metric used in nearest-neighbourclassi002cation,
andin vector
quantizationtodetermine the nearest code-book vector,represents aform of
inductive bias. Using the model of the present paper, andunder some extra
assumptions on the tasks in the environment {speci002cally, that their marginal
input-space distributionsare identical and
they only differ inthe conditional probabilities theyassign to classlabels},
it can be shown
that there isan optimal metric or distance measure to usefor vector quantization
andone-
nearest-neighbourclassi002cation {Baxter, 1995a,1997b; Baxter & Bartlett,
1998}. Thismetric
can be learnt by sampling from asubset of tasks from the environment,andthen
used as a
distance measure whenlearning novel tasks drawn from the same environment.
Bounds on
the number oftasks and examples of each task required to ensure good performance
on novel
taskswere given in Baxter and Bartlett {1998},along with an experiment in
which a metric was successfully trained on examples of asubset of 400 Japanese
characters and then used as a 002xed distance measure when learning 2600as
yet unseen characters.
A similar approachis described in Thrun and Mitchell {1995}, Thrun{1996},
in which a
neuralnetwork's output was trained to match labels on a novel task, while
simultaneously
beingforced to match its gradient to derivativeinformation generated from
a distance metric trained on previous, related tasks.Performance on the novel
tasks improvedsubstantially
with the use of the derivative information.
Notethat there are many other adaptive metric techniques used inmachine
learning, but these
all focus exclusively on adjusting the metric for a 002xed set of problems
rather than learning a metric suitable for learning novel, relatedtasks {bias
learning}.
* Featurelearning or learning internal representations.As with adaptive
metric techniques, there are many approaches to feature learning that focus
on adapting features for a 002xedtask
rather than learning features to beused in novel tasks. One of the fewcases
where features
have been learnt ona subset of tasks with the explicit aim of using them
on novel tasks was
Intrator andEdelman {1996} in which a low-dimensionalrepresentation waslearnt
for a set ofmultiple related image-recognition tasksandthen used to successfully
learn novel tasks of the same kind. The experiments reportedin Baxter {1995a,
chapter 4} and Baxter {1995b}, Baxter and Bartlett {1998} are also of thisnature.
* Bias learning in Inductive Logic Programming {ILP}. Predicate invention
refers to the pro-
cessin ILP whereby new predicates thought to beuseful for the classi002cation
task at hand are added to the learner'sdomain knowledge. By using the newpredicates
as background do-
main knowledgewhen learning novel tasks, predicate inventionmay be viewed
as a form of
152A MODEL OFINDUCTIVE BIASLEARNING
inductivebias learning. Preliminary results with thisapproach on a chess
domain are reported in Khan, Muggleton, and Parson {1998}.
* Improving performance on a 002xed reference task. 223Multi-task learning224{Caruana,
1997}
trains extra neural network outputs to match related tasks in order toimprove
generalization
performance on a 002xed reference task. Although this approach does notexplicitly
identify the
extrabias generated by the related tasks in a waythat can be used to learn
novel tasks, it is anexample of exploiting the bias provided by a set of
related tasks to improvegeneralization
performance. Othersimilar approaches include Suddarth and Kergosien{1990},
Suddarth and
Holden {1991}, Abu-Mostafa {1993}.
* Bias ascomputational complexity. In this paper weconsider inductive bias
from a sample- complexity perspective: howdoes the learnt bias decrease the
number of examples required of
novel tasks for goodgeneralization? A natural alternative line ofenquiry
is how the running-
timeor computational complexity of a learning algorithmmay be improved by
training on
related tasks. Some early algorithms for neural networks in this vein are
contained in Sharkey
and Sharkey {1993}, Pratt {1992}.
* Reinforcement Learning.Many control tasks can appropriately be viewed
as elements of sets
of related tasks,suchas learning to navigate to different goalstates, orlearning
a set of
complex motorcontrol tasks. A number of papers in thereinforcement learning
literature
haveproposed algorithms for both sharing the informationin related tasks
to improve average generalization performance across those tasksSingh {1992},
Ring {1995}, orlearning bias from a set of tasks to improveperformance on
future tasks Sutton {1992}, Thrun andSchwartz
{1995}.
1.2Overview of the Paper
In Section 2 the bias learning model is formally de002ned, and the main
sample complexity results aregiven showing the utility of learningmultiple
related tasks and the feasibility ofbiaslearning.
These results show that thesample complexity is controlled by the size ofcertain
covering numbers
associated with the set of all hypothesis spaces available to thebias learner,
in much the same way as the sample complexity in learning Booleanfunctions
is controlled bythe Vapnik-Chervonenkis
dimension {Vapnik, 1982; Blumer et al., 1989}.The results of Section 2 are
upper bounds on the sample complexity required for goodgeneralization when
learning multiple tasks and learning inductive bias.
The general resultsof Section 2 are specialized tothe case of feature learning
with neural net-
works in Section3, where an algorithm for training features bygradient descent
is also presented.
For this special case we are able to showmatching lower bounds for the sample
complexityof
multiple task learning. InSection 4 we present some concluding remarks anddirections
for future
research. Manyof the proofs are quite lengthy and have beenmoved to the
appendices so as not to interrupt the 003ow of the main text. The following
tables contain a glossary of the mathematical symbols used in the paper.
153BAXTERSymbolDescriptionFirstReferencedXInput Space155YOutput Space155PDistribution
on X 002Y {learning task}155lLoss function155HHypothesis Space155hHypothesis155er
P
{h}Error of hypothesis hon distribution P156zTraining set156ALearning Algorithm156^er
z
{h}Empirical error of h on training setz156PSet of all learningtasks P157QDistribution
over learning tasks157HFamily of hypothesis spaces157er Q
{H}Lossof hypothesis space H on environment Q158z{n; m}-sample158^er z
{H}Empirical loss of H onz158ABias learning algorithm159h
lFunction inducedby h and l159H
lSet of h
l159{h
1
; : : : ; h
n
}
lAverage of h
1;l ; : : : ; h
n;l159h
lSame as{h
1 ; : : : ; h
n }
l159H n
lSetof {h
1
;: : : ; h
n
} l159H
n
lSet of H n
l159H
003Function on probability distributions160H
003Set of H
003160d
PPseudo-metric on H
n
l160d
QPseudo-metric on H
003160N {"; H
003
; d
Q }Coveringnumber of H
003160C {"; H
003 }Capacity of H 003160N {";H
n
l
;d
P
}Covering number of H
n
l160C {";H
n
l
}Capacity of H
n
l160hSequence of n hypotheses {h 1
; : : : ; h n
}163PSequence of n distributions{P
1
; : : : ; P n
}163er P
{h}Average loss of h onP164^er
z
{h}Averageloss of h on z164FSet of feature maps166GOutput class composed
with feature mapsf166G ffi fHypothesis space associated withf166G
lLoss function class associated with G166N {"; G
l ; d
P
}Covering number of G l166C {"; G l
}Capacityof G
l166d [P;G
l
]
{f ; f
0
}Pseudo-metricon feature maps f ; f
0166N {"; F ;d
[P;G
l ]
}Coveringnumber of F166154A MODEL OFINDUCTIVE BIASLEARNINGSymbolDescriptionFirst
ReferencedN {"; F ; d
[P;G
l
] }Covering number ofF166C
G
l {"; F }Capacity of F166H
wNeural network hypothesis space167H
jxH restricted to vector x172005
H
{m}Growth function of H172VCdim{H}Vapnik-Chervonenkis dimensionof H172H
jxH restricted to matrix x173H
jxH restricted to matrix x173005
H
{n;m}Growth function ofH173d
H
{n}Dimension function ofH173d{ H}Upper dimension functionof H173d{H }Lowerdimension
function of H173opt P
{ H
n }Optimal performance ofH
n
onP175d
027Metric onR
+179h
1 010001 001 001 010 h nAverageof h
1
, :: : , h
n179H
1
010 001 001 001010 H
nSet ofh
1
010 001 001 001010 h
n180000 {2m;n}Permutations on integer pairs182z
033Permutedz182d
z
{h; h
0
}Empirical l
1 metric on functions h182^er
P
{H}Optimal average error ofH on P1852. TheBias Learning Model
In this section thebias learning model is formally introduced.To motivate
the de002nitions, we 002rst describe the main features of ordinary{single-task}
supervised learning models. 2.1 Single-Task Learning Computational learning
theory models ofsupervised learning usually include the followingingre-
dients:
* Aninput space X and an output spaceY ,
* a probabilitydistribution P on X 002Y ,
* a loss functionl : Y 002 Y! R, and
* ahypothesis space H which is a set ofhypotheses or functions h :X ! Y
.
As an example,if the problem is to learn to recognize images ofMary's face
using a neural network, then X would be the set of allimages {typically represented
as a subset ofR
d
where each component is a pixel intensity}, Y wouldbe the set f0; 1g, and
the distribution P wouldbe peaked over images
of differentfaces and the correct class labels.The learner's hypothesis
spaceH would be a class of
neural networks mapping the input space R
d
to f0; 1g. The loss in this case would bediscrete loss:
l{y;y
0
} :=
032
1 if y 6= y
0
0 ify = y
0
{1} 155BAXTER Using the loss function allows us topresent a uni002ed treatment
of both patternrecognition {Y =
f0; 1g, l as above}, andreal-valued function learning {e.g.regression} in
which Y =R and usually
l{y; y
0
} = {y 000 y
0
} 2
.
The goal of the learneris to select a hypothesis h 2 Hwith minimum expected
loss: er
P
{h} :=
Z
X 002Y
l{h{x}; y} dP {x; y}: {2}
Of course,thelearner does not know P andso it cannot search through H for
anh minimizing
er
P {h}. In practice, the learner samples repeatedly from X 002 Yaccording
to the distribution Pto
generate a training set z := f{x
1
; y
1
}; : : : ; {x
m
; y
m
}g: {3}
Basedon the information contained in z the learnerproduces a hypothesis
h 2 H. Hence,in general
a learner is simply a mapA from the set of all training samples to thehypothesis
space H:
A:
[
m>0 {X 002 Y } m
! H
{stochasticlearner's can be treated by assuming a distribution-valued A.}
Manyalgorithms seek to minimize the empiricalloss of h on z, where this
isde002ned by:
^er
z
{h} := 1m
m
X
i=1 l{h{x
i
}; y
i }: {4}
Ofcourse, there are more intelligent things to do with the data than simply
minimizing empirical error227for example one can add regularisation terms
to avoid over-002tting. However the learner chooses its hypothesish, if
we have a uniformbound {over all h 2 H} on the probability oflarge deviation
between^er
z
{h} and er
P {h}, then we can bound thelearner's gen-
eralization errorer
P
{h}as a function of its empirical loss on thetraining set ^er
z {h}. Whether such a bound holds depends upon the 223richness224of H.
The conditions ensuring convergence between
^er
z
{h} and er P
{h} are by nowwell understood; for Boolean function learning {Y = f0; 1g,
discrete
loss}, convergence iscontrolled by the VC-dimension
1
of H:
Theorem 1.Let P be any probability distribution onX 002 f0;1g and suppose
z = f{x
1
; y
1
}; : : : ;{x
m
; y m
}g is generated bysampling m times from X 002f0; 1g according toP . Let
d := VCdim{H}. Then with probability at least1000 ffi {over the choice
ofthe training set z}, all h2 H will satisfy
er
P
{h}024 ^er
z
{h} +
024 32m 022
d log
2emd
+ log
4ffi
025
1=2
{5}
Proofsof this result may be found in Vapnik {1982}, Blumer et al. {1989},and
will not be
reproduced here.1. The VC dimension of aclass of Boolean functions H is
the largestinteger d such that there exists a subsetS :=
fx
1 ; : : : ; x
d g 032 X such that the restrictionof H to S contains all 2 d
Boolean functions onS.
156A MODEL OFINDUCTIVE BIASLEARNING
Theorem 1 onlyprovides conditions under which the deviationbetween er
P
{h} and ^er
z {h} is
likely to besmall, it does not guarantee that the true errorer
P
{h}will actually be small. This is governed by the choice of H. If H contains
a solution with smallerror and the learner minimizes
erroron thetraining set, then with high probabilityer
P
{h}will be small. However, a bad choiceof
H will mean there is no hope ofachieving small error. Thus, thebias of the
learner in this model 2
is represented by the choice ofhypothesis space H.
2.2The Bias Learning Model
The main extra assumption of the bias learning model introducedhere is that
the learner is embed-
ded in an environment of related tasks,and can sample from the environment
to generatemultiple
training sets belonging to multipledifferent tasks. In the above model ofordinary
{single-task}
learning, alearning task is represented by a distributionP on X 002 Y .So
in the bias learning
model,an environment oflearning problems is representedby a pair {P ; Q}where
P is the set of all probability distributions onX 002 Y {i.e., Pis the set
of all possible learning problems},and Q is a
distributiononP . Q controls which learningproblems the learner is likely
to see 3
. For example, if the learner is in a face recognition environment, Qwill
be highly peaked overface-recognition-type
problems,whereas if the learner is in a character recognition environment
Q will be peaked over character-recognition-type problems{here,as in the
introduction, we view these environments as
sets of individualclassi002cation problems, rather than single, multipleclass
classi002cation problems}.
Recall from the last paragraph of the previous section that thelearner's
bias is represented byits choice of hypothesis space H.So to enable the learner
to learn the bias, wesupply it with a family
or set ofhypothesis spaces H := fHg.
Putting all this together, formally alearning to learn or bias learningproblem
consists of:
* aninput space X and an output spaceY {both of which are separable metricspaces},
* a loss functionl : Y 002 Y! R,
* anenvironment {P ; Q} where P is the set of allprobability distributions
on X 002Y and Q is
a distributionon P ,
* ahypothesis space family H =fHg where each H 2 His a set of functions
h :X ! Y .
From now on wewill assume the loss function l has range[0; 1], or equivalently,
with rescaling,
we assume thatl is bounded.2. The bias is also governed by howthe learner
uses the hypothesis space.For example, under some circumstances the learner
may choose not to use the full power of H {a neural network example isearly-stopping}.
For simplicity in this paper we abstract away from suchfeatures of the algorithm
A and assume that ituses the entire hypothesis space
H.
3. Q'sdomain is a033-algebra of subsets of P. A suitable one for our purposes
is theBorel 033-algebra B{P} generated
by the topology of weakconvergence on P . Ifwe assume that X and Yare separable
metric spaces, then Pis also
a separable metric space in the Prohorov metric {which metrizes the topology
ofweak convergence} {Parthasarathy,
1967}, so there is no problem withthe existence of measures on B{P }. See
Appendix D for furtherdiscussion,
particularly the proof of part 5in Lemma 32.
157BAXTER We de002ne the goal of a bias learner to be to 002nd a hypothesis
space H2 H minimizing the
followingloss:
er
Q
{H} :=
Z
P inf
h2H
er P
{h} dQ{P } {6}
= Z
P
inf
h2H
Z
X 002Y
l{h{x}; y} dP {x; y} dQ{P }:
The only way er
Q {H} can be small is if, withhigh Q-probability, Hcontains agood solution
h to
any problemP drawn at random according toQ. In this sense er
Q
{H} measures howappropriate
the bias embodied byH is for the environment {P; Q}.
In general thelearner will not know Q, so it will notbe able to 002nd an
H minimizinger
Q
{H} directly. However,the learner can sample from the environment inthe
following way:
* Sample n times from P accordingto Q to yield:
P
1
; : : : ; P
n
.
* Samplem times from X 002 Yaccording to each P
i to yield:
z
i = f{x
i1 ; y
i1
} : : : ; {x
im ; y
im
}g.
* The resultingntraining sets227henceforth called an{n; m}-sample if they
aregenerated by the
above process227aresupplied to the learner. In the sequel, an{n; m}-sample
will be denoted by z and written as a matrix: {x
11
;y
11
} 001001 001 {x
1m ; y
1m
} = z
1
z :=
.
.
.
.
.
.
.
.
.
.
.
.
{x
n1
; y
n1
} 001 001 001{x
nm
;y
nm
} = z n
{7}
An {n; m}-sample is simplyn training sets z
1 ; : : : ; z
n sampled from n different learningtasks
P
1
;: : : ; P
n
,where each task is selected according to the environmental probability
distribution Q.
The size of each training set is kept the same primarily to facilitate the
analysis. Based on the information contained inz, the learner must choose
a hypothesis spaceH 2 H .
One way to dothis would be for the learner to 002nd anH minimizing the
empirical loss onz, where
this is de002ned by: ^er
z
{H} :=
1n
n
X i=1
inf
h2H
^er
z i
{h} {8} Note that ^er
z {H} is simply the averageof the best possible empirical error achievableon
each
training set z
i
, using a function fromH. It is a biased estimate ofer
Q
{H}. An unbiased esti-
mateof er
Q
{H} would require choosing anHwith minimal average error over then distributions
P
1
; : : : ; P
n
, where this is de002ned by 1n P
n
i=1 inf
h2H
er P
i
{h}.
As with ordinary learning, it islikely there are more intelligent things
to dowith the training data
z thanminimizing {8}. Denoting the set of all{n; m}-samples by{X 002 Y
}
{n;m}
, a general 223bias learner224is a map A that takes{n; m}-samples as input
andproduces hypothesis spaces H 2 Has
output:
A : [
n>0
m>0
{X 002 Y}
{n;m}
! H : {9}
158A MODEL OFINDUCTIVE BIASLEARNING
{as stated,Ais a deterministic bias learner, however it is trivial to extend
our results tostochastic
learners}.
Note that in thispaper we are concerned only with the sample complexity
properties of a bias
learnerA; we do not discuss issues of thecomputability of A.
Since Ais searching for entire hypothesis spacesH within a family of such
hypothesis spaces H , there is an extra representationalquestion in our model
of bias learning that is notpresent in
ordinary learning, and that is how the family H is represented andsearched
by A. We defer this discussion until Section 2.5, after the mainsample complexity
results for this model of biaslearning
have been introduced.For the speci002c case of learning a set offeatures
suitable for an environment of related learning problems, see Section 3.
Regardless of how the learner chooses itshypothesis space H, if we have a
uniform bound {over
all H 2 H} on the probability oflarge deviationbetween ^er
z
{H} and er
Q
{H}, and we cancompute
an upper bound on ^er
z
{H}, then we can bound the bias learner's 223generalization error224 er
Q
{H}. With this view, the question ofgeneralization withinour bias learning
model becomes:how many
tasks {n} and how many examples of each task {m} are required to ensure
that^er
z
{H} and er
Q {H}
areclose with highprobability, uniformly over allH 2 H ? Or, informally,
howmany tasks and how many examples of each task are requiredto ensure that
a hypothesis space with good solutions to
all the training tasks will containgood solutions to novel tasks drawn from
thesame environment?
It turns out that thiskind of uniform convergence forbias learningis controlled
by the 223size224
of certainfunction classes derived from the hypothesis spacefamily H , in
much the same way as the VC-dimension of a hypothesis spaceH controls uniform
convergence inthe case of Boolean
function learning{Theorem 1}. These 223size224 measures andother auxiliary
de002nitions needed to state the main theorem are introduced inthefollowing
subsection.
2.3 CoveringNumbers
De002nition 1. For any hypothesis h : X! Y , de002ne h
l
: X002 Y! [0; 1] by
h
l
{x;y} := l{h{x}; y} {10}
For any hypothesis space Hin the hypothesis space family H ,de002ne
H
l
:= fh
l
:h2 Hg: {11}
For any sequence of n hypotheses{h
1
; : : : ;h
n
}, de002ne{h
1
; : : : ; h n
}
l : {X 002 Y } n
! [0; 1]by
{h
1 ; : : : ; h
n }
l
{x 1
; y
1 ; : : : ; x
n ; y
n
}:=
1n n
X
i=1 l{h
i
{x
i
};y
i
}: {12} We will also use h l
to denote {h 1
; : : : ; h n
}
l
. For any H in the hypothesisspace family H , de002ne
H
n
l
:=f{h
1
;: : : ; h
n
} l
: h
1 ; : : : ; h
n 2 Hg: {13}
De002ne
H
n l
:=
[
H2 H
H
n l
: {14}
159BAXTER In the 002rst part of the de002nition above, hypotheses h :
X! Y are turned into functionsh
l
mapping X002 Y ! [0; 1]by composition withthe loss function.H
l
is then just thecollection ofall
such functions where theoriginal hypotheses comefrom H.H
l
is often called aloss-function class.
In our case weare interested in the average loss acrossn tasks, where each
of the n hypotheses is chosen from a 002xed hypothesis spaceH. This motivates
the de002nition ofh
l
and H
n
l
. Finally, H
n
l
isthe collection of all {h
1
; : : : ; h
n
}
l
,with the restriction thatall h
1
; : : : ; h
n
belong to a single hypothesis space H 2 H . De002nition 2. Foreach H 2
H , de002ne H 003
: P ! [0; 1] by
H
003
{P } :=inf
h2H
er P
{h}:{15}
For the hypothesis spacefamily H , de002ne
H 003
:= fH
003
: H2 H g: {16}
It is the 223size224 ofH
n
l
andH
003
that controls howlarge the {n; m}-samplez must be to ensure
^er
z
{H}and er
Q
{H} are close uniformly over allH 2 H . Their size will bede002ned in terms
of
certain coveringnumbers, and for this we need to de002ne how to measure
the distance between
elementsof H
n
l
andalsobetween elements of H
003 .
De002nition 3. LetP = {P
1
; : : : ; P
n
} be any sequence of n probabilitydistributions on X 002 Y. For
any h
l
; h
0 l
2 H
n l
, de002ne
d P
{h
l ; h
0
l } :=
Z
{X 002Y }
n jh
l
{x
1
; y
1
; : : : ; x
n
; y
n
} 000 h
0
l
{x
1 ; y
1
; : : : ;x
n
; y
n
}j
dP 1
{x
1 ; y
1
}: : : dP
n
{x
n
; y
n
}
{17}
Similarly, for any distributionQ on P and any H 003
1
;H
003
2
2 H
003
,de002ne
d
Q
{H
003
1 ; H
003
2
} :=
Z P
jH
003 1
{P }000 H
003
2 {P }j dQ{P } {18}
It is easily veri002ed that d
P
and d
Q
are pseudo-metrics 4
on H
n l
and H
003 respectively.
De002nition4. An "-cover of {H
003
; d Q
} is a setfH
003
1
; : : : ; H
003 N
g such that for allH
003
2 H 003
,
d
Q
{H
003 ; H
003
i } 024 " for somei = 1 : : : N .Note that we do not require theH
003
i
to be contained in
H
003
, just that they be measurable functions on P . LetN {"; H
003 ; d
Q
}denote the size of the smallest
such cover. De002ne thecapacity of H
003
by
C {"; H 003
} := sup
Q
N {"; H 003
; d
Q } {19}
wherethe supremum is over all probability measures on P . N {";H
n
l
;d
P
} is de002ned in a similar
way, using d P
in place of d Q
. De002nethe capacityof H
n
l
by:
C {"; H n
l
} := sup P
N {";H
n
l
;d
P
} {20} where now the supremum is over allsequences of n probability measures
onX 002 Y .4. A pseudo-metric d is a metricwithout the condition that d{x;y}
= 0 } x = y.
160A MODEL OFINDUCTIVE BIASLEARNING
2.4Uniform Convergence for Bias Learners Now we have enough machinery tostate
the main theorem. In the theorem thehypothesis space
familyis required to bepermissible. Permissibility is discussed indetail
in Appendix D, but note
that it is a weak measure-theoretic conditionsatis002ed by almost all 223real-world224
hypothesisspace
families. All logarithms are to basee.
Theorem 2. SupposeX and Y are separable metricspaces and let Q be any probability
distri- butionon P , the set of alldistributions on X 002 Y. Suppose z is
an {n; m}-sample generated by sampling n times from Paccordingto Q to give
P 1
; : : : ; P n
, and then samplingm times from each
P i
to generate z i
= f{x i1
; y
i1
}; : : : ;{x
im
; y im
}g, i= 1; : : : ; n. LetH = fHgbe any permissible hypothesis space family.
Ifthe number of tasks n satis002es n 025 max
{ 256" 2
log
8C 000
"32
;H
003 001ffi ;
64"
2
}
; {21}
and the number of examples m of each task satis002es m 025 max
032 256n" 2
log
8C{
"32
; H
n l
}ffi
;
64"
2
033
; {22}
thenwith probability at least 1 000ffi {over the {n; m}-sample z}, all
H 2H will satisfy
er Q
{H} 024^er
z
{H} + " {23}
Proof. See Appendix A.There are several important points tonote about Theorem
2:
1. Providedthe capacities C {"; H 003
} and C{"; H
n
l
} are 002nite, the theorem shows that any bias
learner that selectshypothesis spaces from H can bound itsgeneralisation
error er
Q {H} in
termsof^er
z
{H} for suf002ciently large {n; m}-samples z.Most bias learner's will not
002nd the exact value of ^er z
{H} because it involves 002nding the smallest error of anyhypothesis h
2 H
on each of then training sets in z. Butany upper bound on ^er z
{H} {found,for example
bygradient descent on someerror function} willstill give an upper boundon
er
Q
{H}. See
Section 3.3.1 for a briefdiscussion on how this can be achieved in afeature
learning setting.
2. Inorder to learn bias {in the sense thater
Q
{H}and ^er
z
{H} are close uniformly overall
H 2 H }, both thenumber of tasks n and the number of examples of each task
m must
be suf002cientlylarge. This is intuitively reasonablebecause the bias learner
must see both suf002ciently many tasks to be con002dentof the nature of
the environment, andsuf002ciently
many examples of eachtask to be con002dent of the nature of each task.
3. Once the learner has found anH 2 H with a small value of^er
z
{H}, it can then use Hto
learn novel tasks Pdrawn according to Q. One then hasthe following theorem
bounding the
sample complexity required for good generalisationwhenlearning with H {the
proof is very similar to the proof of the bound onm in Theorem 2}.
161BAXTER Theorem 3. Let z= f{x
1 ; y
1
}; : : : ; {x
m ; y
m
}gbe a training set generated by sampling from
X 002 Y accordingto some distribution P . LetH be a permissible hypothesis
space.For all
"; ffiwith 0 < "; ffi < 1, if the number of training examplesm satis002es
m 025 max {
64"
2
log
4C
000
"16
; H
l
001ffi
;
16"
2
} {24}
then with probability atleast 1 000 ffi, all h2 H will satisfy
er
P
{h} 024^er
z
{h} + ":
The capacityC {"; H} appearing inequation {24} is de002ned in an analogous
fashion to the
capacities in De002nition 4 {wejust use the pseudo-metric d
P
{h
l
; h
0
l } :=
R
X002Y
jh
l
{x; y} 000 h
0
l
{x; y}j dP {x; y}}. The important thing to note about Theorem 3 is that
the number of ex- amples required for good generalisation whenlearning novel
tasks is proportional tothe log-
arithm of the capacity of thelearnt hypothesis space H. Incontrast, if the
learner does not do any bias learning, it will have noreason to select one
hypothesis space H2 H over any
other andconsequently it would have to view as acandidate solution any hypothesis
in any of the hypothesis spaces H 2H . Thus, its sample complexity willbe
proportional to the
capacityof [
H2 H
fH l
g = H
1
l
, which in generalwill be considerably larger than the capacity of any individual
H 2 H. So by learning H the learner haslearnt to learn in the environment
{P ; Q} in the sensethat it needs far smaller training sets to learnnovel
tasks.
4. Havinglearnt a hypothesis space H with a small value of ^er
z
{H}, Theorem 2 tells us that with probability atleast 1000 ffi, the expected
value ofinf
h2H
er P
{h} on a novel task P will be
less than^er
z
{H} + ". Of course, this does notrule out really bad performance onsome
tasks P . However, the probability of generating such 223bad224 tasks can
be bounded.In particular,
note that er Q
{H} is just the expected value of the function H 003
over P ,and so by Markov's
inequality ,for fl > 0,
Pr 032
P : inf
h2H
er
P {h} 025 fl 033
= Pr fP: H
003
{P } 025 fl g 024
E
Q
H
003fl
=
er
Q
{H}fl
024
^er
z
{H} + "fl
{with probability 1000 ffi}.
5. Keepingthe accuracy and con002dence parameters"; ffi 002xed, note that
the numberof examples
required of each task for goodgeneralisation obeys
m =O
022
1n
log C{"; H
n
l }
:{25}
So provided logC {"; H
n l
} increases sublinearly withn, the upper bound on the number of examples
required of each task willdecrease as the number of tasks increases.This
shows
that for suitably constructed hypothesis space families it is possible toshare
information
between tasks.This is discussed further after Theorem 4 below.
162A MODEL OFINDUCTIVE BIASLEARNING
2.5Choosing the Hypothesis Space FamilyH .
Theorem 2 only providesconditions under which ^er
z
{H} ander
Q
{H}are close, it does not guaran-
tee thater
Q
{H}is actually small. This is governed bythe choice of H . IfH containsa
hypothesis
space H with a small value of er
Q
{H} and the learner is able to 002nd anH 2 H minimizing error on the {n;
m} samplez {i.e., minimizing ^er z
{H}},then, for suf002ciently large nand m, Theorem 2 en-
sures that with high probability er
Q
{H}will be small. However, a bad choice of H willmean there
is no hope of 002nding anH with small error. In this sense thechoice of
H represents the hyper-bias of the learner.
Notethat from a sample complexity point of view, the optimal hypothesis
space family tochoose
is one containing a single, minimalhypothesis space H that contains good
solutionsto all of the
problems in the environment{or at least a set of problems with highQ-probability},
and no more. For then there is no bias learning to do{because there is no
choice to be made betweenhypothesis
spaces}, the output of the biaslearning algorithm is guaranteed tobe a goodhypothesis
space for
the environment,andsince the hypothesis space is minimal, learningany problem
within the en-
vironment usingH will require the smallest possible number of examples.
However, this scenario is analagous tothe trivial scenario inordinary learning
in which the learning algorithmcontains a
single, optimal hypothesis for the problem being learnt. In that case there
is no learning to be done,
just as there is nobias learning to be done if the correct hypothesisspace
is already known.
At the other extreme, if H contains a single hypothesisspace H consisting
ofall possible func- tions from X ! Ythen bias learning is impossible because
thebias learner cannot produce a
restrictedhypothesis space as output, and hence cannot producea hypothesis
space with improved
sample complexity requirements on as yet unseentasks.
Focussing on these two extremeshighlights the minimal requirements on Hfor
successful bias
learningto occur: the hypothesis spaces H2 H must be strictly smaller than
thespace of all
functions X! Y , but not so small or so223skewed224 that none of them
contain goodsolutions to a
large majority of theproblems in the environment.
It may seemthat we have simply replaced the problem ofselecting the right
bias {i.e., selecting theright hypothesis space H} with theequally dif002cult
problem of selecting the righthyper-bias {i.e.,
the right hypothesisspace family H }. However,in many cases selecting the
right hyper-bias isfar
easier than selecting the right bias.For example, in Section 3 we will see
how the feature selection
problem may be viewed as a bias selection problem. Selectingthe right features
can be extremely dif002cult if one knows little about theenvironment, withintelligent
trial-and-error typicallythe best
one can do. However,in a bias learning scenario, one only has tospecify
that a set of features should exist, 002nd a loosely parameterised setoffeatures
{for example neural networks}, and thenlearn the
features by sampling from multiplerelated tasks.
2.6 LearningMultiple Tasks
It may be that thelearner is not interested in learning to learn, but just
wants to learn a 002xed set of n tasks from the environment{P ; Q}. Asin
the previous section, we assume the learnerstarts
outwith a hypothesis space familyH , and also that it receives an{n; m}-sample
zgenerated from
the n distributionsP
1
; : : : ; P n
. This time, however, the learner is simply looking for nhypotheses
{h
1 ; : : : ; h
n }, all contained in the samehypothesis space H, such that the averagegeneralization
error of the nhypotheses is minimal. Denoting {h
1
; : : : ; h n
} by h and writingP = {P
1
; : : : ; P
n
},
163BAXTER this error is given by: er
P
{h} :=
1n
n
X
i=1
er
P
i {h
i
}{26}
=
1n
n
X i=1
Z
X002Y
l{h i
{x};y} dP
i
{x; y};
and the empirical loss of h on z is
^er
z
{h}:=
1n n
X
i=1 ^er
z
i {h
i
}{27}
=
1n
n
X i=1
1m
m
X j=1
l{h i
{x
ij }; y
ij
}:
As before, regardless of how the learner chooses {h
1
; : : : ; h
n
}, if we can prove auniform bound on
the probability oflargedeviation between ^er
z
{h} ander
P
{h}then any {h
1 ; : : : ; h
n } that perform
wellon thetraining sets z will with high probabilityperform well on future
examples of the same tasks.
Theorem 4. LetP = {P
1
; : : : ; P
n
} be n probability distributionsonX 002 Y and letz be an {n; m}- sample
generated by samplingm times from X 002 Yaccording to each P
i
. Let H =fHg be any
permissible hypothesisspace family. If the number of examplesmof each task
satis002es
m 025 max
032
64n"
2
log
4C { "16 ; H
n
l }ffi ;
16"
2
033
{28}
then with probability at least1 000 ffi {over the choice ofz}, any h 2
H
n
will satisfy
er P
{h} 024^er
z
{h} + " {29}
{recall De002nition 4 for the meaning ofC {"; H
n l
}}.
Proof.Omitted {follow the proof of the bound onm in Theorem 2}.Thebound
on m in Theorem 4 is virtually identical to the bound on m in Theorem 2,
and note againthat it depends inverselyon the number of tasks n {assuming
that the002rst part of the 223max224
expression is the dominate one}. Whether this helps dependson the rate of
growth of C { "16 ; H
n
l } as
a function ofn. The following Lemma shows that this growth is always small
enough to ensure that we never do worse by learning multipletasks {at least
in terms of the upper bound onthe number of
examples required per task}. Lemma 5. For any hypothesisspace family H ,
C
000
"; H
1 l
001
024C {"; H
n
l
} 024 C
000 ";H
1
l 001
n
:{30}
164A MODEL OFINDUCTIVE BIASLEARNING
Proof.Let K denotethe set of all functions{h
1
; : : : ; h n
}
l where each h
i can be a member of any
hypothesis space H 2 H {recallDe002nition 1}. Then H
n
l
022 K andso C {"; H
n l
} 024 C {"; K }. By
Lemma29 in Appendix B, C {"; K} 024 C
000
"; H
1
l 001
n
and so the righthand inequality follows.
Forthe 002rst inequality, let Pbe any probability measure on X002 Y and
let P be the mea- sureon {X 002 Y}
n
obtained by usingP on the 002rst copy of X002 Y in the product, and ignoring
all other elements of the product.Let N be an "-cover for{ H
n
l
; d
P
}.Pick any h
l
2 H
1
l
and
let {g
1 ; : : : ; g
n }
l
2 Nbe such that d
P
{(h; h; : : : ; h} l
; {g
1
; : : : ; g
n }
l
} 024". But by construction,
d
P
{(h;h; : : : ;h}
l ; {g
1
; : : : ; g
n
}
l
} = d P
{h; {g 1
}
l
}, which establishes the002rst inequality.By Lemma 5
log C
000
";H
1
l
001 024 log C {";H
n
l
}024 n log C
000 "; H
1
l 001
: {31}
So keeping the accuracy parameters" and ffi 002xed, and plugging {31}into
{28}, we see that the upper bound on the number of examples required of each
task never increases with the number of tasks,
and at best decreases asO{1=n}. Althoughonly an upper bound, this provides
a strong hintthat
learning multiple related tasks shouldbe advantageous on a 223number of
examplesrequired per task224
basis. InSection 3 it will be shown that for featurelearning all types of
behavior are possible, from no advantage at all to O{1=n} decrease.
2.7 Dependenceon "
In Theorems 2, 3 and 4 thebounds on sample complexity all scale as1="
2
. Thisbehavior can be
improved to1=" if the empirical loss is alwaysguaranteed tobe zero {i.e.,
we are in therealizable
case}. The same behaviorresults if we are interested in relative deviation
between empirical and
trueloss, rather than absolute deviation.Formal theorems along these lines
are stated inAppendix
A.3.
3. FeatureLearning
The use of restricted featuresets is nearly ubiquitous as a method of encodingbias
in many areas of
machine learning andstatistics, including classi002cation, regression anddensity
estimation.
In this section we showhow the problem of choosing aset of featuresfor an
environment of
related tasks can berecast as a bias learning problem. Explicitbounds on
C { H
003 ; "} and C {H
n
l
;"}
are calculated for generalfeature classes in Section 3.2. Thesebounds are
applied to the problem of learning a neural network feature set inSection
3.3.
3.1 The Feature Learning Model
Consider the following quotefrom Vapnik {1996}:
Theclassical approach to estimating multidimensionalfunctional dependencies
is
based on the following belief:
Real-lifeproblems are such thatthere exists a small number of 223strong
features,224 simple
functions of which {saylinear combinations} approximate well the unknownfunction.
Therefore, it is necessary tocarefully choose a low-dimensional feature
space andthen
to use regular statistical techniques to construct an approximation.
165BAXTER In general a set of 223strong features224may be viewed as a
function f: X ! V mapping the input space X into some {typically lower} dimensional
space V . LetF = ff g be a set ofsuch feature
maps {each f maybe viewed as a set of features {f
1
; : : : ; f k
} if V =R
k
}. It is thef that must be
223carefullychosen224 in the above quote. Ingeneral, the 223simple functions
of the features224may be
represented as a class of functionsG mappingV to Y .If for each f 2 F wede002ne
the hypothesis
space G ffif := fg ffi f :g 2 G g, then we have thehypothesis space family
H
H:= fG ffi f : f2 F g: {32}
Now the problem of 223carefully choosing224 theright features f is equivalent
to thebias learning
problem 223002nd the righthypothesis space H 2 H 224. Hence,provided
the learner is embedded within an environment ofrelated tasks, and thecapacities
C { H
003 ; "} and C {H
n
l
;"} are 002nite, Theorem 2 tells us that the feature set fcan be learnt
rather than carefully chosen.This represents animportant
simpli002cation,as choosing a set of features is often the mostdif002cult
part of any machine learning problem.
In Section 3.2 we give atheorem bounding C { H
003
; "} andC { H
n
l ; "} for general feature classes. The theorem is specialized toneural
network classes in Section 3.3.
Note that we have forced the function class Gto be the same for all feature
mapsf , although
thisis not necessary. Indeed variants of the results to follow can be obtained
if G is allowed to vary
with f .
3.2Capacity Bounds for General Feature Classes Notationally it is easier
to view thefeature maps f as mapping fromX 002 Y to V 002Y by {x; y}7!
{f {x}; y}, and also to absorb the lossfunction l into the de002nition
ofG by viewing each g 2 Gas a
map from V 002Y into [0; 1]via {v; y} 7!l{g{v};y}. Previously this latter
functionwould have been
denoted g l
but in what follows we willdrop the subscript l where this does not causeconfusion.
The
class to whichg
l
belongs will still bedenoted by G
l
. With the above de002nitions letG
l
ffi F:= fg ffi f :g 2 G
l
;f 2 F g. De002ne the capacityof G
l
in
the usual way,
C {"; G
l
}:= sup
P
N {"; G
l
; d P
}
wherethe supremum is over all probability measures on V 002 Y , and d P
{g; g 0
} :=
R V 002Y
jg{v; y} 000
g
0
{v; y}j dP {v; y}. To de002ne the capacity ofF we 002rst de002ne a pseudo-metricd
[P;G
l ]
on F by
223pulling back224 the L
1 metric on R through G l
as follows:
d
[P;G
l ]
{f ; f 0
} :=
Z X 002Y
sup g2G
l
jg ffi f {x; y}000 g ffi f
0 {x; y}j dP{x; y}: {33} It is easily veri002ed that d [P;G
l
]
is a pseudo-metric. Notethat for d
[P;G l
]
to be well de002ned the supre-
mumover G l
in the integrand must bemeasurable. This is guaranteed if the hypothesisspace
family
H = fG l
ffi f :f 2 F g is permissible {Lemma 32,part 4}. Now de002ne N {"; F ;
d
[P;G
l
]
}to be the
smallest "-coverof the pseudo-metric space{F; d
[P;G
l
]
} and the"-capacity ofF {with respect toG
l
}
as C
G
l
{"; F } := sup P
N {"; F; d
[P;G l
]
}
where the supremum is over all probabilitymeasures on X 002 Y .Now we can
state the main theorem of this section.
166A MODEL OFINDUCTIVE BIASLEARNING
Theorem 6.Let H be a hypothesis space familyas in equation {32}. Then for
all"; "
1
; " 2
> 0 with
" = "
1
+"
2
,
C{"; H
n
l } 024 C {"
1
; G
l }
n
C
G
l
{"
2
; F } {34} C {"; H
003 } 024 C
G
l
{"; F }{35}
Proof. SeeAppendix B.3.3 LearningNeural Network Features
Ingeneral, a set of features may be viewed as amap from the {typically high-dimensional}
input space R
d
to a muchsmaller dimensional space R
k {k 034d}. Inthis section we consider approximat-
ing such a feature map by a one-hidden-layerneuralnetwork with d input nodes
andk output nodes
{Figure 1}. We denote the set of all such feature maps byf010
w
= {036
w;1
;: : : ; 036
w;k
} : w 2 Dg where D is a bounded subset ofR
W
{W isthe number of weights {parameters} inthe 002rsttwo layers}.
This set is theF of the previous section. Each feature 036
w;i : R
d
![0; 1], i = 1; : : : ; k is de002ned by 036
w;i
{x} := 033
0
@
l
X
j=1
v
ij
h
j
{x}+v
il+1
1 A
{36}
whereh
j
{x}is the output of the j th node in the002rst hidden layer, {v
i1
; : : : ; v il+1
} are the output node parameters for the ith feature and033 is a 223sigmoid224
squashing function033 : R ! [0;1]. Each
002rst layer hidden node h
i
: R d
! R, i= 1; : : : ; l, computes h
i
{x} := 033
0
@ d
X
j=1 u
ij
x
j
+ u
id+1
1
A
{37} where {u
i1 ; : : : ; u
id+1
} are the hidden node's parameters. We assume 033is Lipschitz.
5
The weight vector for the entire feature map is thus w = {u
11
; : : : ; u
1d+1
; : : : ; u
l1
; : : : ; u ld+1
; v
11
; : : : ; v
1l+1
; : : : ; v k1
; : : : ; v kl+1
}
and the total number of feature parametersW = l{d+ 1} +k{l + 1}.
For argument's sake, assume the 223simplefunctions224 of the features
{the classG of the previous
section}are squashed af002ne maps using the same sigmoidfunction 033 above
{in keeping with the 223neural network224 003avor of thefeatures}. Thus,
each setting of the featureweights w generates a
hypothesisspace:
H
w
:= {
033
k
X
i=1 ff
i
036 w;i
+ ff
k+1
!
: {ff
1
; : : : ; ff k+1
} 2D
0
}
;{38}
where D
0 is a bounded subset of R k+1
. The set of allsuch hypothesis spaces,
H :=fH
w
: w2 Dg {39}5. 033 is Lipschitz if there exists a constant K such that
j033{x} 000 033{x
0
}j 024K jx 000 x
0
j for all x; x 0
2 R.
167BAXTERrootknld FeatureMapInput Multiple Output ClassesFigure
1: Neuralnetwork for feature learning. Thefeature map is implemented by the
002rst two hidden layers. The n output nodescorrespond to the n different
tasks in the{n; m}-
samplez. Each node in the network computesa squashed linear function of
the nodes in the previous layer.
isa hypothesis space family. Therestrictions on the output layer weights{ff
1
;: : : ; ff
k+1 } and feature
weightsw, and the restriction to a Lipschitzsquashing function are needed
to obtain 002nite upper bounds on the covering numbers inTheorem 2.
Finding a good set of featuresfor the environment {P ;Q} is equivalent to
002nding a goodhy-
pothesisspace H
w 2 H , which in turn means002nding a good set of feature map parametersw.
As in Theorem 2, the correct setof features may be learnt by 002nding a
hypothesisspace with
small error on a suf002ciently large {n; m}-samplez. Specializing to squared
loss, in thepresent
framework the empirical loss ofH
w
on z {equation {8}} is given by
^er z
{H
w } =
1n
n
X
i=1
inf
{ff
0
;ff
1
;:::;ff
k }2D
0
1m
m
X
j=1 "
033
k
X
l=1 ff
l
036 w;l
{x
ij
} + ff
0
!
000 y ij
#
2
{40}
Since our sigmoid function033 only has range [0; 1], we also restrict the
outputs Yto this range.
3.3.1ALGORITHMS FORFINDING A GOODSET OF FEATURES Provided the squashing
function033 is differentiable, gradientdescent {with a small variation on
backpropagation to compute the derivatives} can be used to 002nd feature
weightsw minimizing {40}
{or at least alocal minimum of {40}}. The only extradif002culty over and
above ordinarygradient
descent is the appearance of 223inf 224 in the de002nition of^er
z
{H
w
}. Thesolution is to perform gradient
descent over both the output parameters {ff 0
; : : : ; ff k
} for each node and thefeature weights w. For
more detailssee Baxter {1995b} and Baxter {1995a, chapter 4},where empirical
results supporting
the theoretical results presented here are also given. 168A MODEL OFINDUCTIVE
BIASLEARNING
3.3.2SAMPLE COMPLEXITY BOUNDS FOR NEURAL-NETWORKFEATURE LEARNING The size
of z ensuring that theresulting features will be good for learning novel
tasks from the same
environmentisgiven by Theorem 2. All we haveto do is compute the logarithm
of the covering numbers C {"; H n
l
} and C{"; H
003
}.
Theorem 7. LetH =
010
H w
: w 2 R W
011
be a hypothesisspace family where each H
w
is of the form
H
w
:=
{ 033
k X
i=1
ff i
036
w;i {001} + ff
0
!
: {ff 1
; : : : ; ff k
} 2 R
k
}
;
where 010
w
= {036
w;1
;: : : ; 036
w;k
} is a neural network withW weights mapping from R d
to R
k . If the
feature weightsw and the output weights ff 0
; ff
1 ; : : : ; ff
k are bounded, the squashing function033 is
Lipschitz,l is squaredloss, and the output space Y =[0; 1] {any bounded
subset ofR will do}, then
there existconstants 024; 024
0 {independent of "; W andk} such that for all " >0,
log C {"; H
n
l
} 024 2 {{k +1}n + W } log
024"
{41} log C {"; H 003
} 024 2Wlog
024
0"
{42}
{recall that we have specialized tosquaredloss here}.
Proof. SeeAppendix B.Noting that our neuralnetwork hypothesis space family
His permissible, plugging {41} and {42} into Theorem 2 gives the followingtheorem.
Theorem 8. LetH = fH
w
g be a hypothesis space family where each hypothesis space H
w is a
set of squashed linear mapscomposed with a neural network feature map, asabove.
Suppose the
numberof features is k, and the total numberof feature weights is W. Assume
all feature weights and
output weights are bounded,and the squashing function 033 is Lipschitz.Let
z be an {n; m}-sample
generated from the environment {P ; Q}. If
n025 O
022
1"
2
024
W log
1"
+ log
1ffi
025 ; {43}
and
m 025 O
022
1"
2
024022
k + 1 + Wn
log
1"
+
1n
log 1ffi 025
{44}
then withprobability at least 1 000 ffiany H
w
2H will satisfy
er Q
{H
w } 024 ^er
z
{H
w
} + ": {45}
169BAXTER 3.3.3 DISCUSSION
1. Keeping the accuracy andcon002dence parameters " and ffi002xed, the
upper bound on the number of examples required of each task behaveslike O{k
+ W=n}. If the learner is simply learning n 002xed tasks {rather than learningto
learn}, then the same upper bound also applies{recall
Theorem 4}.
2.Note that if we do away with the featuremap altogether then W = 0 and
theupper bound on
m becomes O{k}, independent of n{apart from the less important ffiterm}.
So in terms of the
upper bound,learning n tasks becomes just as hard aslearning one task. At
the other extreme, if we 002x the output weights then effectively k = 0
and the number of examples required of
each task decreases asO{W=n}. Thusa range of behavior in the number of examplesrequired
of each task is possible:from no improvement at all to anO{1=n} decrease
as the number of tasks n increases {recall the discussionat the end of Section
2.6}.
3.Once the feature map is learnt {which can beachieved using the techniques
outlined in Baxter,
1995b; Baxter & Bartlett, 1998; Baxter,1995a, chapter 4}, only the output
weights haveto be
estimated to learn a novel task.Again keeping the accuracy parameters 002xed,
this requires no
more that O{k} examples. Thus,as the number of tasks learnt increases, the
upperbound on
the number of examples required of each task decays to the minimum possible,O{k}.
4. Ifthe 223small number of strong features224 assumptionis correct, then
k will be small.However,
typically we will have very little idea of what the features are, soto be
con002dent that the neural
network is capable of implementing a goodfeature set it will need to be
very large,implying
W 035 k.O{k+W=n} decreases mostrapidly with increasing n when W035 k,
so at least in terms of the upper bound on the number of examples required
per task, learning small feature sets is an ideal application for biaslearning.
However, the upper bound onthe number of
tasks does not fare sowell as it scales as O{W}.
3.3.4 COMPARISON WITHTRADITIONAL MULTIPLE- CL AS SCLASSIFICATION A special
case of this multi-task framework is one in which the marginal distributionon
the input
space P
ijX
is the same for each task i = 1; : : : ; n, and all that varies between
tasks is theconditional
distribution over the output space Y . An example would be amulti-class
problem such as face
recognition, in which Y =f1; : : : ; ngwhere n is the number of faces to
berecognized and the
marginal distribution onX is simply the 223natural224 distributionover
images of those faces. Inthat
case, if for every examplex
ij
we have227in addition to the sample y
ij
from the ith task's conditional distribution onY 227samplesfrom the remaining
n 000 1conditional distributions onY, then we can
view then training sets containing mexamples eachas one large training set
for the multi-class problem with mn examples altogether. The bound on m in
Theorem 8 statesthat mn should be
O{nk +W }, or proportional tothe total number of parameters in the network,
aresult we would
expect from 6
{Haussler, 1992}. So when specialized to the traditionalmultiple-class,
single task framework, Theorem 8 is con-
sistent with the bounds already known.However, as we have already argued,
problems such as face
recognition arenot really single-task, multiple-class problems.They are
more appropriately viewed6. If each example canbe classi002ed with a 223large
margin224 thennaive parameter counting can be improved upon {Bartlett,
1998}.
170A MODEL OFINDUCTIVE BIASLEARNING
as a{potentially in002nite} collection of distinct binaryclassi002cation
problems. In that case, the goal of bias learning is not to 002nd asingle
n-output network that can classifysome subset of n faces
well. Itis to learn a set of features that can reliablybe used as a 002xed
preprocessing fordistinguish-
ing any single face from other faces. This is the new thing providedby Theorem
8: it tells us that provided we have trained ourn-output neural network on
suf002ciently many examples of suf002ciently
many tasks, we can be con002dent that thecommon feature map learnt for
those ntasks will be good
for learningany new, as yet unseen task, provided the new task is drawn
from the same distribution
that generated the training tasks.In addition, learning the new task only
requires estimating the k
output node parametersfor that task, a vastly easier problem thanestimating
the parameters of the
entire network, from both a sample and computational complexity perspective.
Also, since we have high con002dence that the learnt features willbe good
for learning novel tasks drawn fromthe same
environment, thosefeatures arethemselves a candidate for further study to
learnmore about the
nature of the environment.The same claim could not be made if thefeatures
had been learnt on too
small a setof tasks to guarantee generalization tonoveltasks, for then it
is likely that the features would implement idiosyncrasies speci002cto those
tasks, rather than 223invariances224that apply across
all tasks.
When viewed from a bias {or feature} learningperspective, ratherthan a traditionaln-class
classi002cation perspective,the bound m on the number of examplesrequired
of each task takes on
a somewhat different meaning. Ittells us that provided nis large {i.e.,we
are collecting examples
of a largenumber tasks}, then we really only need to collecta few more examples
than we would otherwise have to collect if the featuremap was already known
{k +W=n examples vs. k examples}. So it tells us that the burden imposedby
feature learning can be made negligibly small,at least when
viewed from the perspective of the sampling burden required of each task.
3.4 Learning Multiple Taskswith Boolean Feature Maps
Ignoringthe accuracy and con002dence parameters" and ffi, Theorem 8 shows
that thenumber of
examples required of each taskwhen learning n tasks with a common neural-network
feature map
is bounded above byO{k +W=n},where k is the number of features andW is the
number of
adjustableparameters in the feature map. SinceO{k} examples are required
tolearn a single task
once the true featuresare known, this shows that the upper bound onthe number
of examples
requiredof each task decays {in order} to the minimumpossible as the number
of tasks n increases. This suggests that learning multiple tasks isadvantageous,
butto be truly convincing weneed to
prove a lower bound of thesame form. Proving lower bounds in a real-valued
setting {Y = R} is complicated by the fact that a singleexample can convey
an in002nite amount ofinformation, so
one typically has to make extra assumptions, suchas that the targetsy 2
Y are corrupted by a noise process. Rather than concern ourselves with such
complications, in this section werestrict
our attention to Boolean hypothesisspace families {meaning each hypothesish
2 H
1
maps to
Y = f0061g and we measure error by discrete lossl{h{x};y} = 1 if h{x} 6=
y and l{h{x}; y}= 0
otherwise}.
Weshow that the sample complexity for learningntasks with a Boolean hypothesis
space family H is controlled by a 223VCdimension224 type parameter d
H
{n} {that is, we give nearly matching upper
and lower boundsinvolving d
H
{n}}. We then derivebounds on d
H
{n} for the hypothesis space family considered inthe previous sectionwith
the Lipschitz sigmoid function 033replaced by a hard
threshold {linearthreshold networks}.
171BAXTER As well as the bound on the number of examples required per task
for good generalizationacross
those tasks, Theorem 8 also showsthat features performing well on O{W }
tasks will generalize well to novel tasks, where W isthe number of parameters
in the feature map.Given that for many feature learning problems W is likely
tobe quite large {recall Note 4 in Section 3.3.3}, it would be useful
to know thatO{W } tasks are in factnecessary without further restrictions
on the environmental
distributions Qgeneratingthe tasks. Unfortunately, wehave notyet been able
to show such a lower bound.
There is some empirical evidencesuggesting that in practice the upper bound
on thenumber of
tasks may be very weak.For example, in Baxter and Bartlett {1998}we reported
experiments in
whicha set of neural network features learnt on asubset of only 400 Japanese
characters turned out to be good enough for classifying some 2600unseen characters,
even though the featurescontained
several hundred thousand parameters. Similar results may be found in Intrator
andEdelman {1996}
and in the experimentsreported in Thrun {1996} and Thrun and Pratt{1997,
chapter 8}. While
thisgap between experiment and theory may be justanother example of the
looseness inherent in general bounds, itmay also be that theanalysis can
be tightened. In particular,thebound on the
number of tasks isinsensitive to the size of the class of outputfunctions
{the class G inSection 3.1}, which may be where the looseness hasarisen.
3.4.1 UPPERANDLOWER BOUNDSFOR LEARNING nTASKS WITH BOOLEAN HYPOTHESIS SPACE
FAMILIES First we recall some concepts from thetheory of Boolean function
learning. LetH be a class of
Booleanfunctions on X and x = {x
1
; : : : ; x m
} 2 X
m
. H
jx is the set of all binary vectorsobtainable
by applying functions inH to x:
H
jx
:= f{h{x
1
};: : : ; h{x
m }) : h 2 Hg: Clearly jH
jx j 024 2
m
. IfjH
jx
j = 2
m
we sayH shatters x. The growth function of H is de002ned by
005 H
{m} :=max
x2X
m fi
fi
H jx
fi
fi :
The Vapnik-Chervonenkisdimension VCdim{H} is the sizeof the largest set
shattered by H:
VCdim{H} := maxfm : 005
H
{m} = 2
m g:
An important result in thetheory of learning Boolean functions is Sauer's
Lemma {Sauer, 1972}, of
which we will also make use.
Lemma 9 {Sauer's Lemma}. For a Boolean function class H withVCdim{H} = d,
005
H
{m} 024
d
X i=0
022
m i
024 020
emd
021
d ;
for all positive integersm.
We now generalize theseconcepts to learning n tasks with a Booleanhypothesis
space family.
172A MODEL OFINDUCTIVE BIASLEARNING
De002nition 5.Let H be a Boolean hypothesis spacefamily. Denote the n 002m
matrices over the
input spaceX by X
{n;m}
. For eachx 2 X
{n;m} and H 2 H , de002neH
jx
to be the set of{binary}
matrices,
H
jx
:=
8
>
<
>
:
2 6
4
h
1 {x
11
}001 001 001 h
1 {x
1m
}
.
.
. .
.
.
.
.
.
h
n {x
n1
} 001 001 001 h
n
{x
nm }
3
7
5 : h
1
;: : : ; h
n
2H
9
>
=
>
;
:
De002ne H
jx
:= [
H2 H
H jx
:
Nowfor each n > 0; m >0, de002ne 005
H {n; m} by
005
H
{n;m} := max
x2X
{n;m}
fi fi
H
jx fi
fi
: Note that 005
H {n; m} 024 2 nm
. If
fi fi
H
jx fi
fi
=2 nm
we say H shattersthe matrix x. For eachn > 0 let
d
H
{n} := maxfm : 005
H
{n; m} = 2
nm g:
De002ned{ H }: = VCdim{ H
1
}and
d{ H}: = max
H2 H VCdim{H}:
Lemma 10.d{H } 025 d{H }
d
H {n} 025 max
032026d{H }n 027
; d{ H }
033 025
12
022026d{ H }n
027
+ d{ H}
Proof.The 002rst inequality is trivial from thede002nitions. To get the
second term inthe maximum
in the second inequality, choosean H 2 H withVCdim{H} = d{ H } and construct
a matrix x 2 X
{n;m}
whose rows are of lengthd{ H } and areshattered by H. Then clearlyH shatters
x. For the 002rst term in the maximum take asequence x = {x
1 ; : : : ; xd{ H }
}shattered by H
1
{thehypothesis
space consisting ofthe union over all hypothesis spaces from H}, and distribute
its elements equally among the rows of x {throw awayany leftovers}. The set
of matrices 8
>
<
> :
2
6
4 h{x
11
} 001 001 001 h{x
1m
}
.
.
.
. .
.
.
. .
h{x
n1
} 001 001 001h{x
nm
} 3
7
5
:h2 H
1
9 >
=
>
; :
where m = bd{ H }=ncis a subset of H
jx
andhas size 2
nm
.Lemma 11. 005
H
{n;m} 024
024
emd
H
{n}
025 nd
H
{n}
173BAXTER Proof. Observe that for eachn, 005
H
{n; m} = 005
H {nm} where H is thecollection of all Boolean
functions on sequencesx
1
; : : : ; x nm
obtained by 002rst choosingnfunctions h
1
; : : : ; h
n
from some
H 2 H ,and then applying h
1
to the 002rst m examples, h 2
to the second m examplesand so on. By
the de002nition ofd
H
{n}, VCdim{H} = nd H
{n}, hence theresult follows from Lemma 9 applied to H.If one follows the
proof of Theorem 4 {in particular the proof of Theorem18 in Appendix
A} then it is clear thatfor all
* > 0, C{ H
n
l
; "} may be replaced by005
H
{n;2m} in the Boolean case. Making this replacement in Theorem 18,and using
the choices of ff; 027from the discussion
followingTheorem 26, we obtain the following bound on theprobability oflarge
deviation between empirical and true performance in this Booleansetting.
Theorem 12. LetP = {P
1
; : : : ; P
n
} be n probability distributionson X 002 f0061gand let z be an
{n;m}-sample generated by samplingm times from X 002 f0061g according
to each P i
. Let H =fHg
be any permissible Booleanhypothesis space family. Forall 0 <
* 024 1, Pr fz: 9h2 H
n
: er P
{h} 025^er
z
{h} + "g 024 4005 H
{n; 2m} exp{000
*
2 nm=64}: {46}
Corollary 13. Under the conditions ofTheorem 12, ifthe number of examplesm
of each task
satis002es m 025
88"
2
024
2d
H {n} log
22"
+
1n
log 4ffi 025
{47}
thenwith probability at least 1 000ffi {over the choice of z}, any h 2
H
n will satisfy
er
P
{h} 024^er
z
{h} + " {48}
Proof. Applying Theorem 12, we require 4005
H
{n; 2m} exp{000
*
2
nm=64}024 ffi;
which is satis002ed if m 025
64
*
2
024
d
H
{n} log
2emd
H {n}
+
1n
log 4ffi 025
; {49}
where we have used Lemma 11.Now, for all a 025 1, if
m =
022 1+
1e
a log 022
1 +
1e
a;
then m 025 alog m. So setting a= 64d
H
{n}="
2
, {49} issatis002ed if
m 025
88"
2
024
2d H
{n} log 22"
+
1n log
4ffi
025
:174A MODEL OFINDUCTIVE BIASLEARNING
Corollary13 shows that any algorithm learningn tasks using the hypothesis
space familyH
requires no more than m = O
022
1"
2 024
d
H
{n} log
1"
+
1n
log 1ffi
025
{50}
examplesof each task to ensure that with high probabilitytheaverage true
error of any nhypotheses
it selects from H n
is within " of their average empirical error on the sample z. We now give
a
theorem showing that if the learning algorithmis required to produce n hypotheseswhose
average
true error is within" of the best possible error{achievable using H
n } for an arbitrary sequence of distributions P
1
; : : : ; P
n
, then within a log
1"
factor thenumber of examples in equation {50} is also necessary.
For any sequenceP = {P
1
; : : : ; P
n
} of n probability distributions onX 002 f0061g, de002ne opt
P
{ H n
} by
opt P
{ H
n } := inf
h2H
n
er
P
{h}:
Theorem 14. Let H bea Boolean hypothesis space family such thatH
1
containsat least two functions. For each n= 1; 2; : : : ;let A
n
be any learningalgorithm taking as input {n; m}-samples
z 2 {X002 f0061g}
{n;m}
and producing asoutput n hypotheses h = {h
1
; : : : ; h n
} 2 H
n
. For all 0 < " < 1=64and 0 < ffi <1=64, if
m <
1"
2
024
d
H
{n}616
+ {1 000 "
2
}
1n
log
022 18ffi{1 000 2ffi}
025
then there exist distributions P = {P
1 ; : : : ; P
n } such that with probability atleast ffi {over the
randomchoice of z},
er P
{A
n {z}) > opt
P
{ H
n } + "
Proof.See Appendix C3.4.2LINEAR THRESHOLD NETWORKS
Theorems 13and 14 show that within constants and alog{1="} factor, the samplecomplexity
of
learning n tasks usingthe Boolean hypothesis space family His controlled
by the complexity pa- rameter d
H
{n}. In this section we derivebounds on d
H
{n} for hypothesis space families constructed as thresholded linear combinations
of Booleanfeature maps. Speci002cally, weassumeH is of the
form given by{39}, {38}, {37} and {36}, where nowthe squashing function
033 is replaced with ahard
threshold:
033{x} :=
{
1if x 025 0;
0001 otherwise;
andwe don't restrict the range of the feature andoutput layer weights. Note
that in this casethe
proof of Theorem 8 does not carrythrough because the constants 024; 024
0
in Theorem 7 depend on the Lipschitz bound on 033.
Theorem 15. Let H bea hypothesis space family of the form given in{39},
{38}, {37} and{36},with
a hard thresholdsigmoid function 033. Recallthat the parameters d, l and
k are theinput dimension,
numberof hidden nodes in the feature map and number of features {outputnodes
in the feature map}
175BAXTER respectively. Let W:= l{d + 1} +k{l + 1} {the number ofadjustable
parameters in the feature map}. Then,
d
H {n}024 2
022
Wn
+ k +1
log
2
{2e{k +l + 1}):
Proof. Recall that for eachw 2 R
W
,010
w
: R d
! R
k denotes the feature map with parametersw.
For each x 2X
{n;m}
, let010
wjx
denote thematrix
2
6
4 010
w
{x 11
} 001 001 001010
w
{x 1m
}
. .
.
.
. .
.
.
. 010
w
{x n1
} 001001 001 010
w
{x
nm
} 3
7
5
: Note that H
jx
is the set of all binaryn 002 m matrices obtainable by composingthresholded
linear
functions withthe elementsof 010
wjx
, with the restriction that the same functionmust be applied to
eachelement in a row{but the functions may differ between rows}.With a slight
abuse of notation, de002ne
005
010 {n; m} := max x2X
{n;m}
fi
fi
010
010
wjx
: w 2 R
W
011
fi
fi
:
Fix x2 X
{n;m}
. By Sauer's Lemma, each node in the002rst hidden layer of the feature
map computes at most {emn={d+ 1})
d+1
functionson the nm input vectors in x. Thus, there can be at most {emn={d
+ 1}) l{d+1}
distinctfunctions from the input to the output of the002rst hidden layer
on
the nmpoints in x. Fixing the 002rst hiddenlayer parameters, each node
in the second layer ofthe
feature map computes at most{emn={l + 1})
l+1
functions on the image ofx produced at the output
of the 002rsthidden layer. Thus the second hidden layercomputes no more
than {emn={l + 1})
k{l+1}
functions on the output of the002rst hidden layer on the nm points inx.
So, in total,
005 010
{n; m}024
022
emnd+ 1
l{d+1}
022 emnl + 1
k{l+1} :
Now, for each possiblematrix 010
wjx , the number of functions computable on eachrow of 010
wjx by a
thresholded linear combination ofthe output of the feature map is at most{em={k
+ 1}) k+1
. Hence, the number of binary sign assignments obtainableby applying linear
threshold functions to all the rowsis at most {em={k +1})
n{k+1}
. Thus,
005 H
{n; m}024
022
emnd + 1
l{d+1} 022
emnl + 1
k{l+1}
022 emnn{k + 1}
n{k+1}
:
f{x} := x log xis a convex function, hence for alla; b; c > 0,
f 022
ka+ lb+ ck +l + 1
024 1k +l + 1
{kf {a} + lf {b}+ f {c})
} 022
k + l + 1ka + lb+ c
ka+lb+c
025 022
1a
ka 022
1b
lb 022
1c
c
:
Substituting a =l + 1, b = d + 1and c = n{k + 1}shows that
005
H {n; m} 024 022
emn{k +l + 1}W+ n{k + 1}
W +n{k+1}
: {51} 176A MODEL OFINDUCTIVE BIASLEARNING
Hence, if m >
022
Wn
+ k + 1
log
2 022
emn{k +l + 1}W+ n{k +1}
{52}
then 005 H
{n; m}< 2
nm
and so byde002nition d
H
{n} 024 m. For alla > 1, observe that x > alog
2
x
ifx = 2a log
2 2a. Setting x =emn{k +l + 1}={W + n{k + 1})and a = e{k +l + 1} shows
that
{52} issatis002ed if m = 2{W=n+ k + 1} log
2 {2e{k + l+ 1}).Theorem 16.Let H be as in Theorem 15 withthe following
extra restrictions:d 025 3,l 025k and
k 024 d. Then d
H
{n}025
12
022026
W2n
027 + k + 1
Proof. We boundd{ H }and d{ H }and then apply Lemma 10. In the presentsetting
H
1
containsall three-layer linear-threshold networks withd input nodes, l hidden
nodes in the002rst hidden layer, k
hidden nodes in the second hidden layer and one output node.From Theorem
13 in Bartlett {1993}, we have
VCdim{H
1
} 025dl +
l{k 0001}2
+ 1;
which under the restrictions stated above is greater than W=2. Henced{ H
}025 W=2.
As k024 dand l 025 kwe can choose a feature weight assignment so thatthe
feature map is the
identity onk components of the input vector and insensitive to the setting
of the reminaingd 000 k
components.Hence we can generate k+ 1points in X whose image under thefeature
map is
shattered by the linearthreshold output node, and so d{ H }= k +1.Combining
Theorem 15 withCorrolary 13 shows that
m025 O
022
1"
2 024022
Wn
+ k + 1
log
1"
+
1n
log 1ffi 025
examples of each task suf002ce when learning n tasks using a linearthreshold
hypothesis space family, while combining Theorem 16 with Theorem 14 shows
that if
m 024 012 022
1"
2
024022 Wn
+ k +1
+
1n
log
1ffi
025
then any learning algorithm will fail on some set of n tasks.
4. Conclusion The problem of inductive bias is onethat has broad signi002cance
in machine learning.In this paper
we have introduced aformal model of inductive bias learning thatapplies
when the learner is able
tosample from multiple related tasks.We proved that provided certain covering
numbers computed
from the set of allhypothesis spaces available to the bias learnerare 002nite,
any hypothesis space
that contains good solutions to suf002ciently many training tasks is likely
to contain goodsolutions to
novel tasks drawn from the same environment.
In the speci002c caseof learning a set of features, we showed thatthe number
of examples m
requiredof each task in an n-task training set obeys m = O{k +W=n}, where
k is the number of 177BAXTER features and W is a measure ofthe complexity
of the feature class.We showed that this bound is essentially tight for Boolean
feature mapsconstructed fromlinear threshold networks.In addition,
weproved that the numberof tasks required to ensure good performance from
the features on novel
tasks is no more thanO{W }. Wealso showed how a good set of features maybe
found by gradient
descent.
Themodel of this paper represents a 002rst steptowards a formal model of
hierarchical approaches to learning. By modelling a learner's uncertainty
concerning its environment inprobabilistic terms,
we have shown how learning can occur simultaneously atboth the baselevel227learn
the tasks at
hand227and atthe meta-level227learn bias that can betransferred to novel
tasks. Froma technical
perspective, itis the assumption that tasks are distributedprobabilstically
that allows the perfor- mance guarantees to be proved.From a practical perspective,
there are many problem domains that
can be viewed asprobabilistically distributed sets of related tasks.For
example, speech recognition may be decomposed along many different axes:
words, speakers, accents, etc.Face recognition
represents a potentiallyin002nite domain of related tasks. Medicaldiagnosis
and prognosis problems
using the same pathology tests are yet another example.All of these domains
should bene002t from being tackled with a bias learning approach. Natural
avenues for further enquiry include:
* Alternative constructions forH . Although widely applicable, thespeci002c
example on feature
learning viagradient descent represents just one possible wayof generating
and searching
the hypothesis space family H . It would be interestingto investigate alternative
methods, including decision tree approaches, approachesfrom Inductive Logic
Programming {Khan et al., 1998}, and whether more generallearning techniques
such as boosting can be applied in a bias learning setting.
* Algorithms for automatically determining thehypothesis space family H
. Inour model the
structure of H is002xed apriori and represents thehyper-bias of the bias
learner.It would
be interesting to see towhat extent this structure can also be learnt.
* Algorithms for automatically determiningtask relatedness. In ordinary
learning thereis usu-
ally little doubt whether an individual example belongs to the same learningtask
or not.
The analogous question in bias learning is whether an individual learning
taskbelongs to a
given set of related tasks, which in contrast to ordinary learning, does
not always have such
a clear-cut answer. For most of the examples we havediscussed here, such
as speech and
face recognition, thetask-relatedness isnot inquestion, but in other cases
such as medical problems it is not so clear.Grouping too large a subset of
tasks togetheras related tasks could
clearly have adetrimental impact on bias-learning or multi-tasklearning,
and there is empri-
cal evidenceto support this {Caruana, 1997}. Thus,algorithms for automatically
determining task-relatedness are a potentially useful avenue for further
research. In this context,see Silver
and Mercer {1996}, Thrun andO'Sullivan {1996}. Note that the questionof
task relatedness
is clearly onlymeaningful relative to a particular hypothesisspace family
H {for example, all possible collections oftasks are related ifH contains
every possible hypothesisspace}.
* Extended hierarchies.For an extension of our two-levelapproach to arbitrarily
deep hierarchies, see Langford {1999}. An interestingfurther question is
to what extent the hierarchycan
be inferred from data. Thisis somewhat related to the question of automaticinduction
of
structure in graphical models. 178A MODEL OFINDUCTIVE BIASLEARNING
Acknowledgements This work was supported at varioustimes by an Australian
Postgraduate Award,aShell Aus-
tralia Postgraduate Fellowship, U.K Engineering and Physical Sciences ResearchCouncil
grants
K70366 and K70373, and anAustralian Postdoctoral Fellowship. Alongthe way,
many people
havecontributed helpful comments and suggestions for improvement including
Martin Anthony, Peter Bartlett, Rich Caruana, John Langford,Stuart Russell,
John Shawe-Taylor,Sebastian Thrun
and several anonymousreferees.
Appendix A. UniformConvergence Results
Theorem2 provides a bound {uniform over allH 2 H } on the probability of
large deviation between
er
Q
{H} and ^er
z
{H}. To obtain a more general result, wefollow Haussler {1992} and introduce
the following parameterized class of metrics onR
+
:
d 027
[x; y]:=
jx 000 yjx + y+ 027
;
where027 > 0. Our main theorem willbe a uniform bound on the probability
of large values of
d
027
[er
Q
{H}; er
z
{H}], rather than j er Q
{H} 000^er
z
{H}j. Theorem2 will then followas a corollary, as
will better boundsfor the realizable case ^er z
{H} = 0{Appendix A.3}.
Lemma 17.The following three properties ofd
027
are easily established: 1. For all r;s 025 0, 0 024 d 027
[r; s]024 1
2. Forall 0 024 r 024 s 024t, d
027
[r; s] 024 d
027
[r; t]and d
027
[s; t] 024 d
027 [r; t].
3. For 0 024 r;s 024 1,
jr000sj027+2
024 d
027
[r; s]024
jr000sj027
For easeof exposition we have up until now beendealing explicitly withhypothesis
spacesH
containing functions h: X ! Y , and then constructingloss functions h
l
mapping X 002 Y ![0; 1]
by h
l
{x; y} :=l{h{x};y} for some loss function l: Y 002Y ! [0; 1]. However,
in generalwe can view
h
l just as a function from an abstract setZ {X 002 Y }to [0; 1] and ignore
its particular construction
in terms of the loss functionl. So for the remainder of this section,unless
otherwise stated, all
hypothesisspaces H will be sets of functions mappingZ to [0; 1].It will
also be considerably more convenient to transpose our notation for{n; m}-samples,
writing then training sets as columns
insteadof rows:
z =
z 11
: : : z 1n
.
. .
.
.
. .
.
.
z m1
: : : z mn
where each z ij
2 Z . Recallingthe de002nition of {X 002Y }
{n;m} {Equation 9 and prior discussion}, with this transposition z lives
in{X 002 Y }
{m;n}
. Thefollowing de002nition now generalizes quantities like ^er
z {H}, er
P
{H} and so on to thisnew setting.
De002nition 6.Let H
1
; : : : ;H
n
be n sets offunctions mapping Z into [0; 1]. For any h 1
2
H
1
; : : : ; h
n
2 H
n
, let h
1
010001001 001 010 h
n or simply h denote the map h{~z } = 1=n
n
X
i=1
h
i
{z
i
}
179BAXTER for all ~z = {z
1
; : : : ; z n
} 2 Z
n
. Let H
1
010 001 001 001010 H
n
denote the set ofall such functions. Given
h2 H
1
010 001001 001 010 H
n
and m elements of {X 002Y }
n
, {~z
1
; : : : ; ~z
m
} {or equivalently an element z of
{X002 Y }
{m;n}
by writing the ~z i
as rows}, de002ne ^er
z
{h} :=
1m
m
X
i=1
h{~z i
}
{recallequation {8}}. Similarly, for anyproduct probability measure P =
P 1
002 001 001 001002 P
n
on {X 002 Y }
n
, de002ne
er P
{h} := Z
Z
n
h{~z } dP{~z }
{recallequation {26}}. For any h; h
0
: {X 002 Y }
n ! [0; 1] {not necessarilyof the form h
1
010 001 001 001 010 h n
},
de002ne d
P
{h; h
0
}:=
Z
Z
n jh{~z }000 h
0
{~z }j dP{~z }
{recall equation {17}}. For any class of functionsH mapping {X 002 Y}
n
to [0;1], de002ne
C{"; H} := sup
P N {"; H; d P
}
wherethe supremum is over all product probabilitymeasures on {X 002 Y}
n
and N{"; H; d
P } is the
size of the smallest"-cover of H under d P
{recall De002nition 4}. The following theorem is the main result fromwhich
the rest of the uniform convergenceresults
in this paper are derived. Theorem 18. Let H 022 H 1
010 001 001 001 010 H n
be a permissible class offunctions mapping {X 002 Y}
n
into
[0; 1]. Let z2 {X 002 Y } {m;n}
be generatedby m 025 2={ff 2
027 } independent trials from {X 002 Y } n
according to some productprobability measure P = P 1
002 001 001 001002 P
n
. For all 027 > 0, 0< ff < 1,
Pr
032
z 2 {X 002Y }
{m;n} : sup
H
d 027
[ ^er z
{h};er
P
{h}]> ff
033
0244C {ff027 =8;H} exp{000ff
2
027 nm=8}:{53}
The following immediate corollarywill also be of use later.
Corollary 19. Under the same conditionsasTheorem 18, if
m 025max
{
8ff
2
027 n
log
4C 000
ff0278
; H 001ffi ;
2ff
2
027
}
; {54}
then Pr
032
z2 {X 002 Y } {m;n}
: sup H
d
027 [ ^er
z
{h}; er
P
{h}] > ff 033
024 ffi {55} A.1 Proof of Theorem 18 The proof is via a double symmetrization
argument of the kind given in chapter 2 ofPollard {1984}.
I have also borrowedsome ideas from the proof of Theorem 3 in Haussler{1992}.
180A MODEL OFINDUCTIVE BIASLEARNING
A.1.1FIRST SYMMETRIZATION
An extra piece ofnotation: for all z 2 {X 002 Y }
{2m;n}
, let z{1}be the top half of z and z{2} be the
bottom half, viz: z{1} =
z
11
: : : z
1n
.
.
.
.
.
.
.
. .
z
m1 : : : z
mn z{2} =
z
m+1;1
: : :z
m+1;n
. .
.
.
.
.
.
.
.
z
2m;1
:: : z
2m;n
The following lemma is the 002rst223symmetrization trick.224 Werelate
the probability oflarge deviation between an empirical estimate of the loss
andthe true loss to the probability oflarge deviation
between two independent empiricalestimates of the loss.
Lemma 20.Let H be a permissible set of functionsfrom {X 002 Y } n
into [0;1] and let P be a probability measure on {X 002 Y }
n . For all 027 >0; 0 < ff < 1and m 025
2ff
2
027
,
Pr
032
z 2 {X 002Y }
{m;n} : sup
H
d 027
[ ^er z
{h};er
P
{h}]> ff
033
0242 Pr
032
z2 Z
{2m;n} : sup
H
d 027
002
^er
z{1}
{h}; ^er
z{2}
{h}
003
>
ff2
033
: {56}
Proof.Note 002rst that permissibility ofHguarantees themeasurability ofsuprema
overH
{Lemma32 part 5}. Bythe triangle inequality for d
027
, ifd
027
002
^er z{1}
{h}; er
P
{h}
003
>ff and
d
027 002
^er
z{2}
{h};er
P
{h} 003
< ff=2, thend
027
002
^er
z{1}
{h}; ^er z{2}
{h} 003
> ff=2. Thus, Pr
n
z 2{X 002 Y }
{2m;n}
: 9h2 H : d
027 002
^er
z{1}
{h};^er
z{2}
{h}
003
>
ff2 o
025 Pr
n
z 2 {X 002Y }
{2m;n} : 9h2 H : d 027
002
^er
z{1}
{h}; er
P {h}
003 > ff and
d 027
002
^er
z{2}
{h}; er
P {h}
003 < ff=2
o
:
{57}
By Chebyshev's inequality, for any 002xedh,
Pr
n
z 2 {X 002 Y } {m;n}
: d 027
[ ^er z
{h};er
P
{h}]<
ff2 o
025 Pr
032 z 2 {X 002 Y}
{m;n}
:
j ^er
z
{h} 000er
P
{h}j027
<
ff2 033
025 1000 er
P
{h}{1 000 er
P
{h})m027 ff
2
=4
025
12
as m 0252={ff
2
027 } and er
P {h} 024 1.Substituting this last expression into the righthand side of
{57}
gives the result.181BAXTER A.1.2 SECONDSYMMETRIZATION The second symmetrization
trick bounds theprobability of large deviation between twoempirical
estimates of the loss {i.e.the right hand side of {56}} by computingthe
probability of large devi-
ation when elements are randomly permuted between the002rst and second
sample. The following de002nitionintroduces the appropriate permutationgroup
for this purpose.
De002nition 7.For all integers m; n025 1, let 000
{2m;n}
denote the set of allpermutations 033 of the
sequenceof pairs of integers f{1;1}; : : : ;{1; n}; : : : ; {2m; 1}; : :
: ;{2m; n}gsuch that for all i, 1 024 i 024 m,either 033{i; j } = {m+
i; j } and 033{m + i; j } = {i; j } or 033{i; j } = {i; j} and 033{m+
i;j } =
{m +i; j }.
For anyz 2 {X 002 Y } {2m;n}
and any033 2 000
{2m;n}
, let
z
033
:=
z
033{1;1}
:: : z
033{1;n}
.
.
. .
.
.
. .
.
z
033{2m;1}
: : :z
033{2m;n} :
Lemma 21. LetH = H
1
010 001 001 001 010 H n
be a permissible set offunctions mapping {X002 Y}
n
into
[0; 1] {as in the statement ofTheorem 18}. Fix z 2{X002 Y }
{2m;n}
and let ^
H := ff 1
; : : : ; f M
g be
anff027 =8-cover for {H; d
z
}, where d
z
{h; h
0 } :=
12m
P
2m
i=1
jh{~z
i
} 000 h
0
{~z
i
}j where the ~z
i
are the rows
of z. Then,
Pr
032
033 2 000
{2m;n}
: sup H
d
027 002
^er
z
033
{1}
{h}; ^er z
033
{2} {h}
003 >
ff2
033
024 M
X
i=1 Pr
n
0332000
{2m;n} : d
027
002
^er
z 033
{1}
{f
i
}; ^er
z
033
{2}
{f
i }
003
> ff4
o
; {58}
whereeach 033 2 000
{2m;n}
is chosen uniformly at random.
Proof. Fix 0332 000
{2m;n} and let h 2 H be such thatd
027
002
^er
z
033 {1}
{h};^er
z
033 {2}
{h} 003
> ff=2{if thereis
nosuch h for any 033we are already done}. Choose f2
^
H such thatd
z
{h;f } 024 ff027 =8. Without loss
of generality wecan assume f is of the formf = f
1
010 001 001 001 010 f n
. Now,
2027
d
z
{h;f } =
P
2m
i=1
fi fi
fi
P
n
j=1
h
j
{z
ij } 000 f
j {z
ij
} fi
fi
fi027 mn
=
P
2m
i=1
fi
fi fi
P
n
j=1
h
j
{z
033{i;j}
} 000 f
j
{z
033{i;j}
}
fi
fi
fi027 mn
025 fi
fi
fi P
m
i=1 P
n
j=1 002
h
j {z
033{i;j}
} 000 f
j
{z
033{i;j}
}
003
fi
fi
fi027 mn + P
m
i=1 P
n
j=1 002
h
j {z
033{i;j}
} + f
j {z
033{i;j}
}
003
+
fi
fi
fi
P
2m
i=m+1
P
n
j=1
002 h
j
{z 033{i;j}
}000 f
j
{z 033{i;j}
} 003
fi
fi fi027mn +
P
2m i=m+1
P n
j=1
002 h
j
{z 033{i;j}
} +f
j
{z
033{i;j}
} 003
= d
027
002
^er z
033
{1} {h}; ^er z
033
{1} {f }
003 + d
027
002
^er
z 033
{2}
{h}; ^er
z 033
{2}
{f }
003
: 182A MODEL OFINDUCTIVE BIASLEARNING
Hence, by thetriangle inequality for d
027 ,
2027
d
z
{h; f }+d
027 002
^er
z 033
{1}
{f }; ^er
z
033
{2}
{f }
003
025 d
027
002 ^er
z
033 {1}
{h};^er
z
033 {1}
{f } 003
+ d
027 002
^er
z 033
{2}
{h}; ^er
z 033
{2}
{f }
003
+d
027
002
^er
z
033 {1}
{f };^er
z
033 {2}
{f } 003
025 d
027
002
^er z
033
{1} {h}; ^er z
033
{2} {h}
003
:
{59}
But
2027
d z
{h; f} 024 ff=4 by construction andd
027
002
^er
z
033 {1}
{h};^er
z
033 {2}
{h} 003
> ff=2by assumption, so
{59} impliesd
027
002
^er
z
033 {1}
{f }; ^er
z
033
{2}
{f}
003
> ff=4. Thus,
n
0332 000
{2m;n} : 9h 2 H :d
027
002
^er
z
033 {1}
{h};^er
z
033 {2}
{h} 003
>
ff2
o
022
n
033 2000
{2m;n}
: 9f 2
^ H : d
027 002
^er
z
033
{1}
{f }; ^er z
033
{2} {f }
003 } >
ff4
o
; which gives {58}.Now we bound the probability ofeach termin the right
hand side of {58}. Lemma 22. Let f :{X 002 Y }
n
! [0; 1]be any function that can be written in the form f =
f
1 010 001 001 001 010f
n
. Forany z 2 {X 002 Y}
{2m;n}
,
Pr
n
0332000
{2m;n} : d
027
002
^er
z 033
{1}
{f }; ^er
z
033
{2}
{f }
003
>
ff4
o
024 2exp
022
000ff 2
027 mn8
; {60}
where each033 2 000
{2m;n}
is chosen uniformly at random. Proof. For any 0332 000
{2m;n} ,
d
027
002
^er
z 033
{1}
{f }; ^er
z
033
{2}
{f }
003
=
fi
fi
fi
P
m
i=1
P
n
j=1
002
f
j
{z
033{i;j}
} 000f
j
{z
033{m+i;j} }
003
fi fi
fi027 mn +
P 2m
i=1
P
n
j=1
f
j
{z ij
}
:{61}
To simplify the notation denotef
j
{z
ij
} by fi
ij
. For each pairij , 1 024 i 024m, 1 024 j 024n, let
Y
ij be an independent random variable such thatY
ij
= fi ij
000 fi
m+i;j
with probability1=2 and
Y
ij
= fi
m+i;j
000 fi
ij
with probability 1=2. From {61},
Pr n
033 2 000 {2m;n}
: d 027
002
^er
z
033
{1}
{f };^er
z
033 {2}
{f } 003
>
ff4
o
= Pr
8
<
:
033 2000
{2m;n}
:
fi fi
fi
fi
fi
fi
m
X
i=1
n
X
j=1
002 f
j
{z 033{i;j}
}000 f
j
{z 033{m+i;j} }
003
fi fi
fi
fi
fi
fi
>
ff4
0
@
027 mn +
2m
X
i=1 n
X
j=1 fi
ij
1 A
9
=
; = Pr
8
< :
fi
fi
fi fi
fi
fi
m
X
i=1
n
X
j=1
Y
ij
fi
fi
fi
fi
fi fi
ff4
0
@
027mn +
2m
X i=1
n
X j=1
fi
ij 1
A
9
= ;
For zero-mean independent randomvariables Y
1
;: : : ; Y
k
withbounded ranges a
i
024 Y
i
024b
i
, Ho-
effding's inequality {Devroye, Gy250or002, & Lugosi, 1996} is Pr
{
fi fi
fi
fi
fi
k
X
i=1
Y
i
fi
fi
fi
fi
fi
025 021 }
024 2 exp
000
2021 2P k
i=1
{b
i
000 a i
}
2
!
:
183BAXTER Noting that the range of eachY
ij
is [000jfi
ij
000 fi i+m;j
}j; jfi
ij
000 fi
i+m;j }j], we have Pr
8
<
: fi
fi
fi
fi fi
fi
m
X i=1
n
X j=1
Y
ij fi
fi
fi
fi fi
fi
>
ff4
0
@
027 mn +
2m
X
i=1
n
X
j=1
fi
ij
1
A 9
=
;
0242 exp
0
B
@
000
ff
2
h
027 mn + P
2m
i=1 P
n
j=1 fi
ij
i
232
P
m
i=1
P
n
j=1
{fi
ij
000fi
i+m;j
} 2
1
C
A Let fl =
P
2m
i=1
P n
j=1
fi ij
. As 0 024fi
ij
024 1,
P
m
i=1 P
n
j=1 {fi
ij
000fi
m+ij
} 2
024 fl .Hence,
2 exp
0 B
@
000
ff 2
h
027 mn+
P
2m i=1
P
n j=1
fi
ij i
232
P
m
i=1
P
n
j=1
{fi
ij 000 fi
i+m;j }
2
1
C
A
024 2 exp 022
000
ff 2
{027 mn +fl }
232fl
:
{027 mn +fl }
2
=flis minimized by setting fl =027 mn giving a value of4027 mn. Hence
Pr n
033 2 000 {2m;n}
:d
027
002
^er
z
033 {1}
{f }; ^er
z
033
{2}
{f}
003
>
ff4
o
024 2 exp
022
000ff
2 027 mn8
;
as required.A.1.3PUTTING IT TOGETHER
For 002xed z2 {X 002 Y }
{2m;n}
, Lemmas 21 and 22give:
Pr
032 033 2 000
{2m;n}
: sup
H d
027
002 ^er
z
033 {1}
{h};^er
z
033 {2}
{h} 003
>
ff2
033 024 2N {ff027=8; H; d
z }) exp
022
000
ff
2
027 mn8
:
Notethat d
z
is simplyd
P
where P= {P
1
;: : : ; P
n
}and each P
i
is theempirical distribution that
putspoint mass 1=m on each z ji
; j =1; : : : ; 2m {recallDe002nition 3}. Hence,
Pr
032
033 2 000
{2m;n}
; z2 {X 002 Y } {2m;n}
: sup H
d
027 002
^er
z
033
{1}
{h}; ^er z
033
{2} {h}
003 >
ff2
033
0242C{ff027 =8;H} exp
022
000
ff
2
027 mn8
:
Now, for a random choice of z, eachz
ij
in z isindependently {butnot identically} distributedand 033
only ever swapsz
ij
and z
i+m;j
{so that 033swaps a z
ij
drawn according to P
j with another component
drawnaccording to the same distribution}.Thus we can integrate out with
respect to thechoice of
033 and write Pr
032
z 2{X 002 Y }
{2m;n}
: sup
H
d
027
002
^er
z{1}
{h};^er
z{2}
{h}
003
>
ff2
033
024 2C{ff027 =8; H} exp
022
000 ff
2
027mn8
:
Applying Lemma 20 tothis expression gives Theorem 18.184A MODEL OFINDUCTIVE
BIASLEARNING
A.2Proof of Theorem 2
Another piece ofnotation is required for the proof. Forany hypothesis space
H and any probability measures P = {P
1
; : : : ; P
n
} on Z ,let
^er
P
{H} :=
1n
n
X i=1
inf
h2H
er
P
i {h}:
Notethat we have used ^er P
{H} rather thaner
P
{H}to indicate that ^er
P
{H} is anotherempirical
estimate of er
Q
{H}.
With the {n; m}-samplingprocess, inaddition to the sample zthere is also
generated a se-
quenceof probability measures, P = {P 1
; : : : ; P n
} although these are notsupplied to the learner.
Thisnotion is used in the following Lemma, wherePrf{z; P}2 {X 002 Y } {n;m}
002 P n
: Ag means 223the probability of generating asequence ofmeasures P from
the environment {P ; Q} and then an {n; m}-samplez according to P such that
A holds224. Lemma 23. If
Pr 032
{z; P} 2 {X 002 Y}
{n;m}
002 P
n
: sup H
d
027 [ ^er
z
{H}; ^er P
{H}] > ff2 033
024
ffi2
;{62}
and
Pr
032
P 2 P
n
: sup
H d
027
[ ^er
P
{H}; er
Q
{H}] >
ff2
033
024
ffi2
; {63}
then
Pr
032
z 2 {X 002 Y}
{n;m}
: sup
H
d 027
[ ^er z
{H};er
Q
{H}]> ff
033
024ffi:
Proof. Followsdirectly from the triangle inequality ford
027
.We treat the two inequalities inLemma 23 separately.
A.2.1INEQUALITY {62} In the following Lemma we replace thesupremum over
H 2 H ininequality {62} with a supremum
over h 2 H
n .
Lemma 24.
Pr 032
{z; P} 2 {X 002 Y}
{n;m}
002 P
n
: sup H
d
027 [ ^er
z
{H}; ^er P
{H}] > ff 033
024 Pr
{
{z; P}2 {X 002 Y } {n;m}
002 P n
: sup
H n
l
d
027
[ ^er
z
{h}; er P
{h}] > ff }
{64}
185BAXTER Proof. Suppose that {z; P} are such thatsup
H
d
027 [ ^er
z
{H}; ^er
P
{H}] > ff. Let H satisfy this in- equality. Suppose 002rst that^er
z
{H} 024 ^er
P {H}. By the de002nition of^er
z
{H}, for all " > 0there
exists h 2 H n
:= H 010 001001 001 010 H such that ^er
z
{h}< ^er
z
{H} + ". Henceby property {3} of the d
027
metric, for all " >0,there exists h 2 H n
suchthat d
027
[ ^er
z
{h}; ^er
z
{H}]< ". Pick an arbitrary h satisfying this inequality. Byde002nition,
^er
P {H} 024 er P
{h},and so^er
z
{H} 024 ^er
P {H} 024 er P
{h}. As d
027
[^er
z
{H}; ^er
P
{H}] > ff{by assumption},by the compatibility of d
027 with the ordering on the
reals,d
027
[ ^er z
{H};er
P
{h}]> ff = ff + ffi, say. By the triangle inequality ford
027
,
d 027
[ ^er
z
{h}; er P
{h}] +d
027
[^er z
{h};^er
z
{H}] 025 d
027 [ ^er
z
{H}; er
P
{h}] = ff+ ffi:
Thus d
027
[ ^er
z
{h}; er P
{h}] > ff+ ffi 000 " and for any" > 0 an h satisfying thisinequality can
be
found. Choosing" = ffi shows that there existsh 2 H
n
such thatd
027
[ ^er
z
{h}; er
P
{h}] > ff.
If instead,^er
P
{H} < ^er
z {H}, then an identical argument can be run with the role of zand P
interchanged. Thus in bothcases,
sup
H
d
027
[ ^er z
{H};^er
P
{H}] > ff } 9h 2H
n
l
:d
027
[ ^er
z
{h}; er
P
{h}] > ff;
which completes theproof of the Lemma.By the nature ofthe {n; m} sampling
process, Pr
{
{z; P} 2 {X 002Y }
{n;m} 002 P
n
sup H
n
l
: d
027
[^er
z
{h}; er
P
{h}] > ff
} =
Z
P2P n
Pr
{ z 2 {X 002 Y}
{n;m}
: sup
H
n l
d
027
[ ^er
z
{h}; er
P {h}] > ff
}
dQ
n
{P}: {65}
Now H
n
l
022 K 010 001 001 001010K where K := fh l
: h 2 H: H 2 H g andH
n
l
ispermissible by the assumed
permissibilityof H {Lemma 32, Appendix D}.Hence H
n
l satis002esthe conditions of Corollary 19 and so combining Lemma24, Equation
{65} andsubstituting ff=2 for ffand ffi=2 for ffi in Corollary 19 gives the
following Lemma on thesample size required to ensure {62} holds. Lemma 25.
If
m025 max
032
32ff
2 027 n
log
8C {ff027 =16; H n
l
}ffi
;
8ff
2
027
033
then
Pr
032
{z; P} 2 {X 002 Y }
{n;m}
002 P
n : sup
H
d 027
[ ^er z
{H};^er
P
{H}] >
ff2
033
024
ffi2
:
A.2.2INEQUALITY {63} Note that er
P
{H} =
1n
P
n i=1
H
003 {P
i
}and er
Q
{H} = E
P 030Q
H
003
{P }, i.e the expectation ofH
003
{P}
where P is distributedaccording toQ. So to bound theleft-hand-side of{63}
we can apply Corollary 19 with n = 1, m replaced byn, H replaced by H
003
, ff and ffireplaced by ff=2 and ffi=2 respectively,
Preplaced by Q and Z replacedbyP . Note that H
003
is permissible wheneverH is {Lemma 32}.
Thus,if n 025 max
032 32ff 2
027
log 8C {ff027 =16; H
003
}ffi
; 8ff 2
027
033 {66}
186A MODEL OFINDUCTIVE BIASLEARNING
then inequality{63} is satis002ed.
Now,putting together Lemma 23, Lemma 25 and Equation 66, we have proved
the following more general version of Theorem 2. Theorem 26. Let Hbe a permissible
hypothesis space family and letz be an {n; m}-samplegen-
erated from the environment{P ; Q}. For all 0 < ff; ffi <1 and 027 > 0,
if n 025 max
032 32ff 2
027
log 8C {ff027 =16; H
003
}ffi
; 8ff 2
027
033 and m 025 max
032
32ff
2
027n
log
8C {ff027 =16; H
n
l
}ffi
;
8ff
2 027
033
; then
Pr
032 z 2 {X 002 Y}
{n;m}
: sup
H
d 027
[ ^er z
{H};er
Q
{H}]> ff
033
024ffi
To get Theorem 2, observethat er
Q
{H} > ^er
z {H} + " }d
027
[ ^er
z
{H}; er
Q
{H}] > "={2 + 027}.
Setting ff ="={2 + 027 } and maximizingff
2
027 gives027 = 2. Substitutingff = "=4 and 027= 2 into
Theorem 26 givesTheorem 2.
A.3 The Realizable Case In Theorem 2 the sample complexity forboth m and
n scales as 1="
2
. Thiscan be improved to
1="if instead of requiring er
Q
{H} 024^er
z
{H} + ", we require only thater
Q
{H}024 024 ^er
z {H} + "
for some 024 > 1. To see this, observe that er Q
{H} >^er
z
{H}{1 + ff}={1000 ff} + ff027 ={1000 ff} }
d 027
[ ^er
z
{H}; er Q
{H}] > ff, so setting ff027 ={1000 ff} = "in Theorem 26 and treating ff
as a constant
gives:
Corollary 27.Under the same conditions asTheorem 26, forall " > 0 and 0
< ff;ffi < 1, if
n025 max
032
32ff{1 000ff}"
log
8C {(1 000 ff}"=16; H
003
}ffi
; 8ff{1 000 ff}"
033
and m 025 max 032
32ff{1 000 ff}"n
log
8C {(1000 ff}"=16;H
n
l
}ffi
; 8ff{1 000 ff}"
033
;
then
Pr
032
z 2{X 002 Y }
{n;m}
: sup H
er
Q
{H} 025
1+ ff1000 ff
^er
z
{H} + " 033
024 ffi: These bounds are particularly useful if we know that ^er
z
{H} = 0, for then we can set ff = 1=2
{which maximizesff{1000 ff}}. Appendix B. Proof of Theorem 6 Recalling
De002nition 6, forH of the form given in {32},H
n
l
can be written H
n
l
= fg
1
ffif 010 001 001 001 010g
n
ffi f :g
1
; : : : ; g n
2 G
l and f 2 F g : To write H
n l
asa composition of two functionclasses note that if for each f: X ! V we
de002ne 026
f : {X002 Y }
n
! {V 002 Y } n
by
026 f {x
1
; y
1
; : : : ; x n
; y
n } := {f {x 1
}; y
1
; : : : ; f {x
n
}; y n
}
187BAXTER then g
1
ffif010 001 001 001 010g
n
ffi f= g
1
010001 001 001 010 g
n ffi
026
f. Thus, setting G
n l
:= G
l 010 001 001 001 010 G l
andF:= f
026
f: f 2
F g, H
n
l
=G
n
l
ffiF : {67}
Thefollowing two Lemmas will enable us to boundC{"; H
n
l
}.
Lemma 28.Let H : X 002Y ! [0; 1] be of theform H = G
l
ffi F where X 002Y
F
000!V 002 Y
G
l
000!
[0;1]. For all "
1
; "
2
> 0,
C {"
1
+ " 2
; H} 024 C G
l
{" 1
; F }C {"
2
;G
l
}: Proof. Fix a measureP on X 002 Yand let F be a minimum size"
1
-cover for{F ; d
[P;G
l
]
}. By
de002nition jFj 024 C
G
l {"
1
;F }. For each f2 F let P
f be the measure on V 002Y de002ned by P
f {S } =
P{f
0001
{S }) for any set Sin the 033-algebra on V002 Y {f is measurable sof
0001
{S} is measurable}.
LetG
f
be a minimum size"
2
-cover for{G
l
; d P
f
}.By de002nition again, jG
f
j 024 C {" 2
; G
l }. Let
N :=fg ffi f : f2 F and g 2G
f
g. Notethat jN j 024 C
G
l
{" 1
; F }C{"
2
; G l
} so the Lemma will be proved if N can be shown tobe an "
1
+"
2
-coverfor {H; d
P }. So, given anyg ffi f 2 H choose f
0
2 Fsuch that d
[P;G
l
]
{f ; f
0
}024 "
1
andg
0
2 G
f
0
such thatd
P
f
0 {g; g
0 } 024 "
2 . Now,
d
P
{g ffi f; g
0
ffif
0
} 024d
P
{g ffif ; g ffif
0
} + d
P {g ffif
0 ; g
0
ffif
0
}
024 d
[P;G l
]
{f; f
0
} +d
P
f
0 {g; g
0 }
024 "
1
+ "
2 :
where the 002rst line follows from the triangle inequality for d P
and the second line followsfrom
the facts: d
P
{g ffif 0
; g
0 ffi f
0
} = d
P
f {g; g
0 } and d
P {g ffif ; gffi f
0
}024 d
[P;G l
]
{f; f
0
}.Thus N is an
" 1
+ "
2 -cover for {H;d
P
} and so the result follows.Recalling the de002nition of H
1
010001001 001 010 H
n
{De002nition 6}, we have the followingLemma.
Lemma 29.
C{"; H
1
010 001 001 001 010 H n
} 024
n
Y
i=1 C {"; H
i }
Proof. Fixa product probability measure P =P
1
002 001001 001 002 P
n on {X 002 Y}
n
. LetN
1
; : : : ; N n
be
"-coversof {H
1
;d
P
1
}: : : ; {H
n ; d
P
n }. and let N =N
1
010 001001 001 010 N
n . Given h = h 1
010 001 001 001010 h
n
2
H
1
010 001 001 001010 H
n
, chooseg
1
010 001001 001 010 g
n 2 N such that d P
i
{h i
; g
i } 024 " for each i= 1; : : : ; n. Now,
d
P
{h
1
010 001001 001 010 h
n ; g
1
010001 001 001 010 g
n
} =
1n
Z
Z
n
fi
fi fi
fi
fi
n
X
i=1 h
i
{z i
} 000
n
X
i=1 g
i
{z i
}
fi fi
fi
fi
fi
dP{z
1
; : : : ; z
n
}
024
1n
n X
i=1
d P
i
{h i
; g
i }
024 ":
Thus N is an "-cover forH
1
010 001 001 001010 H
n
and asjN j =
Q n
i=1
jN
i
j the result follows.188A MODEL OFINDUCTIVE BIASLEARNING
B.1Bounding C {"; H
n
l
}
From Lemma 28, C
000
"
1
+ "
2
; G
l
n
ffiF
001 024 C {"
1 ; G
l
n } C
G
l n
000
"
2
;F
001
{68}
and from Lemma 29, C {"
1
; G
l
n
} 024 C{"
1 ; G
l
} n
: {69}
Using similar techniques tothose used to prove Lemmas 28 and 29, C
G l
n
{";F } can be shown to satisfy
C
G l
n
{" 2
;F} 024 C
G
l
{"
2
; F } : {70} Equations {67}, {68}, {69} and {70}together imply inequality
{34}.
B.2 Bounding C {"; H 003
}
Wewish to prove that C {"; H
003
}024 C
G
l {"; F } whenH is a hypothesis space family of theform
H = fG
l
ffif : f 2F g. Note that each H 003
2 H
003
corresponds to some G l
ffi f , and that H
003
{P} = inf
g2G l
er
P
{g ffi f }: Any probability measure Qon P induces a probability measureQ
X 002Y
on X 002 Y , de002ned by Q
X 002Y {S } =
Z P
P {S }dQ{P }
for anyS in the 033-algebra onX 002 Y . Note also that ifh; h
0
are bounded,positive functions on an
arbitrary setA, then
fi
fi fi
fi
inf a2A
h{a} 000 inf
a2A
h
0
{a}
fi
fi
fi
fi
024 sup
a2A fi
fi
h{a}000 h
0
{a}
fi
fi
: {71}
LetQ be any probability measure on the spaceP of probability measures on
X002 Y . Let H
003
1
; H 003
2
be twoelements of H
003
withcorresponding hypothesis spaces G l
ffif
1 ; G
l
ffif
2
. Then, d
Q
{H 003
1
; H 003
2
} = Z
P
fi fi
fi
fi
inf
g2G
l er
P
{gffi f
1
} 000inf
g2G
l er
P
{gffi f
2
} fi
fi
fi fi
dQ{P } 024
Z
P sup
g2G
l jer
P
{g ffif
1
} 000 er
P
{g ffif
2
}j dQ{P }{by {71} above}
024 Z
P
Z
X 002Y
sup
g2G
l
jgffi f
1
{x; y} 000 g ffif
2
{x; y}j dP {x; y}dQ{P }
= d [Q
X002Y ;G
l
] {f
1
;f
2
}:
The measurability of sup
G
l
gffi fis guaranteed by the permissibility ofH {Lemma 32 part 4, Ap-
pendixD}. From d
Q {H
003
1 ; H
003
2
} 024 d
[Q
X002Y ;G
l
] {f
1
;f
2
} we have, N {"; H
003
; d
Q } 024 N
000 "; F ; d
[Q
X002Y
;G
l
]
001 ; {72}
which givesinequality {35}.
189BAXTER B.3 Proof of Theorem 7 In order to prove the bounds inTheorem
7 we have to apply Theorem 6 to theneural network
hypothesis space family ofequation {39}. In this case the structure is R
d
F 000! R
k G
000! [0;1]
where G = f{x
1
; : : : ; x k
} 7! 033 020
P
k
i=1
ff
i
x i
+ ff
0 021
: {ff 0
; ff
1 ; : : : ; ff
k } 2 U g for some bounded subset U of R
k+1
and some Lipschitz squashingfunction 033. The feature classF : R
d
! R
k
is the set ofall one hidden layer neural networks withd inputs, l hidden
nodes, koutputs, 033 as
the squashing functionand weights w 2 T whereT is a bounded subset of R
W
. The Lipschitz restriction on 033 and the boundedrestrictions on the weights
ensure thatF and G areLipschitz
classes. Hence there exists b< 1 such that for all f2 F and x; x
0
2 R
d , kf {x}000 f {x
0 }k <
bkx000 x
0
k and for allg 2 G andx; x
0 2 R
k
,jg{x} 000 g{x
0
}j< bkx 000 x
0
k where k 001 kis the L
1
norm in each case. The loss function is squaredloss.
Now, g
l
{x; y} =l{g{x};y} = {g{x}000 y}
2
, hence for all g; g
0
2 G and all probability measures P on R
k 002 [0; 1] {recall that weassumed the output space Y was[0; 1]},
d P
{g
l ; g
0
l } =
Z
R k
002[0;1] fi
fi
{g{v} 000 y}
2
000 {g
0
{v} 000 y}
2
fi
fi
dP {v; y}
024 2
Z R
k
fi fi
g{v}000 g
0
{v}
fi
fi
dP
R
k
{v}; {73}
where P
R
k is the marginal distribution onR
k
derived fromP . Similarly, for allf ; f
0
2F and
probability measures Pon R
d
002[0; 1],
d [P;G
l
]
{f ; f
0
} 024 2b Z
R
d
kf {x} 000f
0
{x}k dP
R
d {x}: {74} De002ne
C
000 "; G ; L
1 001
:= sup
P
N
000
"; G ; L
1 {P }
001 ;
where the supremum is overall probability measures on {the Borel subsets
of}R
k
, and
N
000
"; G; L
1
{P}
001
is the size of thesmallest "-cover of G under theL
1
{P }metric. Similarly set,
C 000
"; F ;L
1
001
:= sup
P
N 000
"; F ;L
1
{P } 001
;
wherenow the supremum is over all probabilitymeasures on R
d
. Equations{73} and {74} imply
C {"; G
l
}024 C
020
"2
;G ; L
1
021 {75}
C
G l
{"; F }024 C
020
"2b
;F ; L
1
021 {76}
Applying Theorem 11 fromHaussler {1992}, we 002nd
C 020
"2
; G
l ; L
1
021 024
024
2eb"
025 2k+2
C
020
"2b
; F ; L 1
021
024 024
2eb
2"
025 2W
:
Substitutingthese two expressions into {75} and {76} and applying Theorem
6 yields Theorem
7.190A MODEL OFINDUCTIVE BIASLEARNING
Appendix C.Proof of Theorem 14
Thisproof follows a similar argument to the onepresented in Anthony and
Bartlett {1999} for ordinary Boolean function learning. First we need a technical
Lemma. Lemma 30. Let ff be a randomvariable uniformly distributed onf1=2
+ fi =2;1=2 000 fi =2g, with
0 < fi< 1. Let 030
1
; : : : ; 030
m
be i.i.d. f1;0001g-valued random variables with Pr{030
i
= 1} =ff for all
i. For any function f mappingf1; 0001g
n
! f1=2 +fi =2; 1=2000 fi =2g,
Pr f030
1
; : : : ; 030
m
: f {030
1 ; : : : ; 030
m } 6= ffg > 14
"
1 000
q1 000 e
000
mfi
21000fi
2
#
:
Proof.Let N {030}denote the number of occurences of+1in the random sequence
030 ={030
1
; : : : ; 030 m
}.
Thefunction f can be viewed as a decisionrule, i.e. based on the observations030,f
tries to guess whether the probability of +1 is1=2 + fi =2 or1=2 000 fi
=2. The optimal decision rule is the Bayes estimator: f {030
1
; : : : ; 030 m
} = 1=2+ fi =2 if N {030} 025 m=2,andf {030
1
; : : : ; 030
m } = 1=2 000fi =2 otherwise.
Hence, Pr {f{030}6= ff} 025
12
Pr 020
N {030} 025
m2
fi
fi fi
fi
ff = 12
000
fi2
+
12
Pr 020
N {030} <
m2
fi
fi fi
fi
ff = 12
+ fi2
>
12
Pr
020
N {030}025
m2
fi
fi
fi
fi
ff = 12
000
fi2
which is half theprobability that a binomial {m;1=2 000 fi =2}random variable
is at least m=2. By
Slud's inequality {Slud,1977},
Pr {f {030} 6= ff}>
12 Pr
Z 025 smfi 21000 fi
2
! where Z is normal {0; 1}. Tate' s inequality {Tate, 1953} states that
for all x025 0,
Pr {Z025 x} 025
12
h 1 000
p1 000 e
000x
2
i
: Combining the last two inequalitiescompletes the proof.Letx 2 X
{n;m} beshattered by H , withm = d
H
{n}. For each rowi in x let P
i be the set of
all 2 d
distributions Pon X 002 f0061gsuch that P {x; 1}= P {x; 0} = 0ifx is not
contained in the
ith row of x, and for eachj = 1; : : : ; d H
{n},P {x
ij
; 1} = {1 006 fi}={2d
H {n}) and P{x
ij
; 0001} =
{1 007 fi}={2d
H
{n}). Let P:= P
1
002001 001 001 002 P
n .
Note that for P = {P
1
; : : : ; P n
} 2 P ,the optimal error opt
P { H
n
}is achieved by any sequence h
003
= {h
003
1
;: : : ; h
003
n } such that h
003 i
{x
ij } = 1 if and only ifP
i
{x
ij
; 1} = {1+ fi }={2d
H
{n}), andH
n
alwayscontains such a sequence because Hshatters x. The optimal error is
then opt
P
{H
n
} = er P
{h
003 } =
1n
n
X
i=1
P
i fh
003
i {x} 6= yg=
1n n
X
i=1 d
H
{n}
X
j=1 1 000 fi2d
H
{n}
=
1000 fi2 ;
191BAXTER and for any h = {h 1
; : : : ; h n
} 2 H
n
,
er
P {h} = opt
P { H
n
}+
find H
{n}
jf{i; j } : h i
{x
ij } 6= h
003 i
{x
ij }gj : {77} For any {n; m}-sample z, let each elementm
ij
in the array m{z} :=
m 11
001 001 001m
1d
H
{n}
.
. .
.
.
. .
.
.
m n1
001 001 001m
nd
H
{n}
equal the number ofoccurrences of x
ij
in z.
Now, if we selectP = {P
1
; : : : ; P
n
} uniformly at random from P, and generate an {n; m}-
sample z usingP, then for h = A
n
{z} {the output of the learning algorithm} we have:
E { jf{i; j }: h
i
{x ij
} 6= h 003
i
{x ij
}gj} = X
m
P {m}E { jf{i;j } : h
i {x
ij
}6= h
003
i
{x
ij }gj jm}
=
X
m
P{m}
n
X i=1
d
H {n}
X
j=1
P {h{x
ij
} 6=h
003
{x ij
}jm
ij
}
where P{m} is the probability ofgenerating acon002guration mof thex
ij
under the {n; m}-sampling
processand the sum is over all possiblecon002gurations. From Lemma 30,
P {h{x
ij } 6= h
003 {x
ij
}jm
ij
} > 14
2
4
1 000 r1 000e
000
m
ij fi
21000fi
2
3
5
;
hence E
024
1nd
H
{n}
jf{i;j } : h
i
{x
ij
}6= h
003
i
{x
ij }gj
025
>
1nd H
{n}
X
m
P {m}
n
X
i=1
d
H {n}
X
j=1
14
2
4
1000
r1 000 e
000
m
ij
fi
21000fi 2
3
5
025
14
"
1 000 q1 000e
000
mfi
2d
H
{n}{1000fi
2
}
# {78}
by Jensen'sinequality. Since for any [0; 1]-valued random variableZ , Pr{Z
> x}025 E Z 000 x, {78} implies:
Pr
022 1nd H
{n}
jf{i; j } :h
i
{x
ij
} 6= h 003
i
{x ij
}gj > flff
> {1000 fl }ff
where ff :=
14
"
1000
q1 000 e
000
mfi
2d
H
{n}{1000fi
2
}
#
{79}
and fl 2 [0; 1]. Plugging this into {77} shows that Pr f{P; z} : er
P
{A
n
{z})> opt
P
{H
n
} + flfffi g > {1000 fl}ff:
192A MODEL OFINDUCTIVE BIASLEARNING
Since theinequality holds over the random choice ofP, it must also hold
for some speci002cchoice
ofP. Hence for anylearning algorithm A
n
there is some sequence of distributionsP such that
Pr fz : er
P
{A
n
{z})> opt
P
{H
n
} + flfffi g > {1000 fl}ff:
Setting
{1000 fl }ff 025ffi; and fl fffi025 "; {80}
ensures Pr fz: er
P
{A
n {z}) > opt P
{ H
n } + "g > ffi:{81}
Assuming equality in {80}, weget
ff =
ffi1 000 fl ; fi =
"ffi
1 000flfl
:
Solving {79} for m, and substituting theabove expressions forff and fi shows
that {81} issatis002ed
provided
m024 d
H
{n}
"
022
ffi"
2
022
fl1 000 fl
2
0001
#
log
{1000 fl }
28ffi {1 000fl 000 2ffi} {82}
Setting fl=1 000 affi for somea > 4 {a > 4since ff < 1=4and ff = ffi={1000
fl }}, and assuming "; ffi 024 1={ka} for some k >2, {82} becomes
m 024 d
H
{n}a
2
024
1 000 2k 025
log
a 28{a 000 2}
: {83} Subject to the constraint a >4, the right hand side of {83} isapproximately
maximized at a = 8:7966, at which point the value exceeds d
H
{n}{1000 2=k}={220"
2 }. Thus, for all k0251,if "; ffi 024 1=9k and
m 024
d
H {n}
000 1000
2k
001220"
2 ; {84}
then
Pr fz: er
P {A
n
{z}) > opt
P { H
n
}+ "g > ffi:
To obtain the ffi-dependence inTheorem 14 observe that by assumptionH
1
containsat least two functions h
1
; h
2
, hence there exists an x 2 X such thath
1
{x}6= h
2
{x}. Let P
006 be two distributions
concentratedon {x; 1} and {x; 0001} such that P 006
{x; h 1
{x}) = {1006 "}=2 andP
006
{x;h
2
{x})=
{1 007 "}=2. Let P
+
:= P
+
002001 001 001 002 P
+
and P
000 := P
000
002001 001 001 002 P
000
be the product distributions on {X 002 f0061g}
n
generated byP
006
, and h 1
:= {h
1
; : : : ; h
1
}; h
2 := {h
2
; : : : ; h
2
}. Note that h
1
and h
2
are both in H
n
. If P is one of P 006
and the learning algorithmA
n
chooses the wronghypothesis h,
then
er
P
{h}000 opt
P
{H
n
} = ": 193BAXTER Now, if we choose Puniformly at random from fP +
; P
000 g and generate an {n;m}-sample z ac-
cordingtoP, Lemma 30 shows that
Prf{P; z}: er
P
{A n
{z}) 025opt
P
{ H n
} + "g>
14 "
1 000
q1 000 e
nm"
21000"
2
# ;
which is at least ffiif
m <
1000"
2"
2
1n
log
18ffi{1000 2ffi}
{85} provided 0 < ffi <1=4. Combining the two constraints onm: {84} {with
k = 7} and {85}, and using
maxfx
1
; x 2
g 025
12
{x 1
+ x
2 } 002nishes the proof.Appendix D. Measurability In order for Theorems
2 and 18 to hold infull generality wehad to impose a constraint called 223permissibility224
on the hypothesisspace family H . Permissibility wasintroduced by Pollard
{1984}
forordinary hypothesis classes H. Hisde002nition is very similar to Dudley's
223image admissible
Suslin224{Dudley, 1984}. We will be extending this de002nition to cover
hypothesisspace families.
Throughout this section weassume all functions hmap from {the completeseparable
metric
space} Zinto [0; 1]. LetB{T } denote the Borel033-algebra of any topological
spaceT . As in Section
2.2,we view P , the set of all probability measures on Z , as a topological
space by equipping it
with the topology of weak convergence. B{P }is then the 033-algebra generated
by thistopology. The
followingtwo de002nitions are taken {with minormodi002cations} from Pollard
{1984}.
De002nition 8. A set H of[0; 1]-valued functions onZ is indexed by the
setT if there exists a function f : Z002 T! [0; 1] such that H = ff { 001;
t} : t 2 Tg :
De002nition 9.The set H is permissible if it canbe indexed by a set T suchthat
1. T is an analytic subsetof a Polish
7
spaceT , and
2. the function f : Z002 T ! [0; 1]indexing H by T ismeasurable with respect
to the product 033-algebra B{Z} 012 B{T }.
An analytic subset Tof a Polish spaceTis simply the continuous imageof a
Borelsubset X
of another Polish spaceX . The analytic subsetsof a Polish space include
the Borel sets.They
are important because projections ofanalytic sets are analytic, and can
be measured ina complete
measure space whereas projectionsof Borel sets are not necessarily Borel,
and hencecannot be
measured with a Borel measure.For more details see Dudley {1989}, section13.2.
Lemma 31. H
1
010 001001 001 010 H n
: {X 002Y }
n
![0; 1] is permissible ifH
1
; : : : ;H
n
are all permissible. Proof. Omitted.We now de002ne permissibility ofhypothesis
space families.7. A topological space is calledPolish if it is metrizable
such that itis a complete separable metric space. 194A MODEL OFINDUCTIVE
BIASLEARNING
De002nition 10.A hypothesis space family H= fHg is permissible if thereexist
sets S andT that are analytic subsets of PolishspacesS andT respectively,and
a function f : Z002T 002 S ! [0; 1],
measurable with respectto S 012 B{T }012 B{S }, such that H =
010
ff { 001 ; t; s}: t 2 T g :s 2 S
011
: Let {X; 006; 026} be a measure space and Tbe an analytic subset of a
Polish space.Let A{X }
denotethe analytic subsets of X . Thefollowing three facts about analytic
sets aretaken from
Pollard {1984}, appendix C. {a} If {X; 006; 026} is complete then A{X
} 022 006. {b} A{X 002T } contains the product033-algebra 006 012 B{T
}.
{c} Forany set Y in A{X 002 T }, the projection031
X
Y ofY onto X is in A{X }.
Recall De002nition 2for the de002nition of H
003 . In the following Lemma we assume that{Z; B{Z }) has been completed
with respect to anyprobability measure P , and also that{P ; B{P })is complete
with respect to the environmental measure Q.
Lemma 32.For any permissible hypothesis space familyH ,
1. H
n l
is permissible.
2. fh 2 H : H2 H gis permissible.
3. H is permissible for allH 2 H .
4. sup H
and inf
H
are measurable for allH 2 H .
5. H
003
is measurable for allH 2 H .
6. H 003
is permissible. Proof. As we have absorbed theloss function into the hypotheses
h,H
n
l
issimply the set of all
n-fold productsH 010 001 001 001 010 Hsuch that H 2 H . Thus{1} follows
from Lemma 31. {2}and {3}
are immediate from the de002nitions.As H is permissible for allH 2 H ,
{4} can be proved byan
identical argument to that used in the223Measurable Suprema224 section
of Pollard {1984},appendix
C.
For {5}, note thatfor any Borel-measurable h :Z ! [0; 1],the functionh :P
! [0; 1] de002ned byh{P} :=
R
Z
h{z} dP {z} is Borel measurable Kechris {1995, chapter 17}. Now, permissibility
of H automatically implies permissibility ofH := fh : h 2 Hg, andH
003
= infH
so H 003
is measurable
by {4}.
Now let Hbe indexed by f :Z 002 T 002 S! [0; 1] in the appropriate way.
To prove {6}, de002ne g : P 002T 002 S ! [0; 1] by g{P;t; s} :=
R
Z
f {z; t; s} dP {z}. ByFubini's theorem g is a B{P } 012 B{T } 012 B{S}-measurable
function. LetG : P 002 S ![0; 1] be de002ned byG{P; s} :=
inf
t2T
g{P; t; s}. Gindexes H
003 in the appropriate way forH
003
to be permissible,provided it can
be shown thatG is B{P } 012 B{S }-measurable. Thisis where analyticity
becomes important.Let
g
ff
:=f{P; t; s} :g{P; t; s} > ffg. By property {b} of analytic sets,A {P 002
T 002S } contains g
ff
.
The set G ff
:= f{P;s} : G{P; s} > ffg is the projection ofg
ff
onto P 002S , which by property {c} is also analytic. As {P; B{P }; Q}
is assumed complete, G
ff
is measurable, by property {a}.Thus G is
a measurable function andthe permissibility of H
003 follows.195BAXTER References
Abu-Mostafa,Y. {1993}. A method for learning fromhints. In Hanson, S. J.,
Cowan, J. D., &Giles,
C. L. {Eds.}, Advances in Neural Information Processing Systems 5, pp.7322680
San Mateo,
CA. Morgan Kaufmann. Anthony, M., & Bartlett, P. L. {1999}. Neural Network
Learning:Theoretical Foundations. Cam- bridge University Press, Cambridge,
UK. Bartlett, P. L. {1993}.Lower bounds on the VC-dimension of multi-layerthreshold
networks. In
Proccedingsof the Sixth ACM Conference on ComputationalLearning Theory,
pp. 44226150
New York. ACM Press. Summaryappeared in Neural Computation, 5, no. 3. Bartlett,
P. L. {1998}.The sample complexity of pattern classi002cationwithneural
networks: the
sizeof the weights is more important than the size ofthe network. IEEE Transactionson
Information Theory, 44{2}, 525226536.
Baxter, J.{1995a}. Learning Internal Representations. Ph.D. thesis, Department
of Mathe- matics and Statistics, The Flinders University of South Australia.
Copy availablefrom
http://wwwsyseng.anu.edu.au/030jon/papers/thesis.ps.gz. Baxter, J. {1995b}.
Learninginternal representations. In Proceedings ofthe Eighth International
Conference onComputational Learning Theory, pp. 311226320. ACM Press. Copy
available
fromhttp://wwwsyseng.anu.edu.au/030jon/papers/colt95.ps.gz.
Baxter, J. {1997a}.A Bayesian/information theoretic model of learningto
learn via multiple task
sampling.Machine Learning, 28, 722640. Baxter, J. {1997b}. The canonicaldistortion
measure for vector quantization andfunction approx-
imation. InProceedings of the Fourteenth InternationalConference on Machine
Learning, pp. 3922647. Morgan Kaufmann. Baxter, J., & Bartlett, P.L. {1998}.
The canonical distortion measure infeature space and 1-NN
classi002cation.In Advances in Neural Information ProcessingSystems 10,
pp. 245226251. MIT
Press.
Berger, J. O. {1985}.Statistical Decision Theory and Bayesian Analysis.
Springer-Verlag, New York.
Blumer, A., Ehrenfeucht,A.,Haussler, D.,& Warmuth, M. K.{1989}. Learnability
and the vapnik- chervonenkis dimension. Journalof the ACM, 36, 929226965.
Caruana, R. {1997}. Multitask learning.Machine Learning, 28, 4122670. Devroye,
L., Gy 250or002,L., & Lugosi, G. {1996}. AProbabilistic Theory of PatternRecognition.
Springer, New York.
Dudley , R. M. {1984}.A Course on Empirical Processes, Vol. 1097 of Lecture
Notes inMathemat-
ics, pp. 2226142.Springer-Verlag.
Dudley ,R. M. {1989}. Real Analysis and Probability. Wadsworth & Brooks/Cole,California.
196A MODEL OFINDUCTIVE BIASLEARNING
Gelman,A., Carlin, J. B., Stern, H. S., & Rubim, D. B.{Eds.}. {1995}. Bayesian
Data Analysis.
Chapman and Hall.
Good,I. J. {1980}. Some history of the hierarchical Bayesian methodology.
In Bernardo, J.M., Groot, M. H. D., Lindley, D. V., & Smith, A. F. M. {Eds.},Bayesian
Statistics II. University Press, Valencia.
Haussler,D. {1992}. Decision theoretic generalizationsofthe pac model for
neural net and other learning applications. Informationand Computation, 100,
78226150. Heskes, T. {1998}. Solvinga huge number of similar tasks: acombination
ofmulti-task learning and a hierarchical Bayesian approach. InShavlik, J.
{Ed.}, Proceedings of the 15thInternational
Conference on Machine Learning{ICML '98}, pp. 233226241. Morgan Kaufmann.
Intrator, N., & Edelman, S.{1996}.How to make a low-dimensional representationsuitable
for
diverse tasks.Connection Science, 8. Kechris, A. S. {1995}. ClassicalDescriptive
Set Theory. Springer-Verlag, NewYork.
Khan,K., Muggleton, S., & Parson, R. {1998}.Repeat learning using predicate
invention.In Page,
C. D. {Ed.}, Proceedingsof the 8th International Workshop on InductiveLogic
Programming
{ILP-98}, LNAI 1446, pp. 65226174. Springer-Verlag.
Langford, J. C. {1999}.Staged learning. Tech. rep.,CMU, School of Computer
Science.
http://www.cs.cmu.edu/030jcl/research/ltol/stagedlatest.ps. Mitchell, T.
M. {1991}.The need for biases in learning generalisations.In Dietterich,
T. G., &
Shavlik, J. {Eds.}, Readings in MachineLearning. Morgan Kaufmann.
Parthasarathy, K.R. {1967}. ProbabiliityMeasures on Metric Spaces. AcademicPress,
London.
Pollard, D. {1984}.Convergence of Stochastic Processes. Springer-Verlag,
NewYork.
Pratt, L. Y. {1992}.Discriminability-based transferbetween neural networks.
In Hanson, S.J.,
Cowan,J.D., & Giles, C. L. {Eds.}, Advances in Neural Information Processing
Systems 5, pp. 204226211. Morgan Kaufmann. Rendell, L., Seshu, R., & Tcheng,
D.{1987}.Layered concept learning and dynamically-variable bias management.
In Proceedings ofthe Tenth International Joint Conference onArti002cial
Intelligence {IJCAI'87}, pp. 308226314. IJCAI , Inc.
Ring, M. B. {1995}. Continual Learning inReinforcement Environments. R.Oldenbourg
Verlag.
Russell,S. {1989}. The Use of Knowledge in Analogy and Induction. Morgan
Kaufmann. Sauer, N. {1972}. On the densityof families of sets. Journalof
Combinatorial Theory A, 13, 145226168.
Sharkey,N. E., & Sharkey, A. J. C. {1993}.Adaptive generalisation andthe
transfer of knowledge.
Arti002cial Intelligence Review,7, 313226328.
197BAXTER Silver, D. L., & Mercer,R. E. {1996}. The parallel transfer of
task knowledge using dynamic
learningrates based on a measure of relatedness.Connection Science, 8, 277226294.
Singh, S. {1992}. Transferof learning by composing solutions of elementalsequential
tasks. Ma-
chineLearning, 8, 323226339.
Slud, E. {1977}. Distribution inequalities forthe binomial law. Annals of
Probability, 4,404226412.
Suddarth, S.C., & Holden, A. D. C. {1991}.Symolic-neural systems and the
use of hints in devel-
oping complex systems.International Journal of Man-Machine Studies, 35,
291226311.
Suddarth,S. C., & Kergosien, Y.L. {1990}. Rule-injection hints as a means
ofimproving net-
work performance andlearning time. In Proceedings of the EURASIPWorkshop
on Neural
NetworksPortugal. EURASIP.
Sutton,R. {1992}. Adapting bias by gradient descent:An incremental version
of delta-bar-delta.In
Proceedings of the TenthNational Conference on Arti002cial Intelligence,
pp. 171226176. MIT
Press. Tate, R.F. {1953}.On a double inequality of the normal distribution.
Annals of Mathematical
Statistics, 24, 132226134. Thrun, S. {1996}. Is learning the n-ththing
any easier than learning the 002rst?.In Advances in Neural
InformationProcessing Systems 8, pp. 640226646. MITPress.
Thrun, S., & Mitchell, T. M. {1995}. Learning one more thing.In Proceedings
of the International Joint Conference on Arti002cial Intelligence, pp. 12172261223.
Morgan Kaufmann. Thrun, S., & O'Sullivan, J. {1996}.Discovering structure
in multiple learning tasks:The TC al-
gorithm. InSaitta, L.{Ed.}, Proceedings ofthe 13thInternational Conference
on Machine Learning {ICML '96}, pp. 489226497. Morgen Kaufmann.
Thrun, S., & Pratt, L. {Eds.}. {1997}. Learning to Learn.Kluwer Academic.
Thrun, S.,& Schwartz, A.{1995}. Finding structure inreinforcement learning.
In Tesauro,G.,
Touretzky , D., & Leen, T. {Eds.}, Advances in Neural Information Processing
Systems, Vol. 7, pp. 385226392. MIT Press.
Utgoff, P. E. {1986}. Shift of bias for inductive concept learning. In Machine
Learning:An Arti002cial
IntelligenceApproach, pp. 107226147. Morgan Kaufmann. Valiant, L. G. {1984}.
Atheory of the learnable. Comm. ACM, 27, 11342261142.
Vapnik, V. N. {1982}. Estimationof Dependences Based on Empirical Data.
Springer-Verlag, New York.
V apnik, V. N. {1996}. The Nature of StatisticalLearning Theory. Springer
Verlag,New York.
198