Active Learning with Statistical Models

D. A. Cohn, Z. Ghahramani and M. I. Jordan

Volume 4, 1996

Links to Full Text:

Abstract:
For many types of machine learning algorithms, one can compute the statistically `optimal' way to select training data. In this paper, we review how optimal data selection techniques have been used with feedforward neural networks. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are computationally expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate. Empirically, we observe that the optimality criterion sharply decreases the number of training examples the learner needs in order to achieve good performance.

Extracted Text

Journal of Artificial Intelligence Research 4 {1996} 129-145 Submitted 11/95;
published 3/96 Active Learning with Statistical Models David A. Cohn cohn@harlequin.com
Zoubin Ghahramani zoubin@cs.toronto.edu Michael I. Jordan jordan@psyche.mit.edu
Center for Biological and Computational Learning
 Dept. of Brain and Cognitive Sciences Massachusetts Institute of Technology
Cambridge, MA 02139 USA
 Abstract Formany types of machine learning algorithms, one can compute the
statistically \op- timal" way to select training data. In this paper, we
review how optimal data selection
 techniques have been used with feedforward neural networks. We then show
how the same
 principles may be used to select data for two alternative, statistically-based
learning ar-
 chitectures: mixtures of Gaussians and locally weighted regression. While
the techniques
 for neural networks are computationally expensive and approximate, the techniques
for
 mixtures of Gaussians and locally weighted regression are both efficient
and accurate. Em-
 pirically, we observe that the optimality criterion sharply decreases the
number of training
 examples the learner needs in order to achieve good performance.
 1. Introduction
 The goal of machine learning is to create systems that can improve their
performance at
 some task as they acquire experience or data. In many natural learning tasks,
this experience
 or data is gained interactively, by taking actions, making queries, or doing
experiments.
 Mostmachine learning research, however, treats the learner as a passive
recipient of data
 to be processed. This \passive" approach ignores the fact that, in many
situations, the
 learner's most powerful tool is its ability to act, to gather data, and
to influence the world
 it is trying to understand. Active learning is the study of how to use this
ability effectively.
 Formally, active learning studies the closed-loop phenomenon of a learner
selecting ac-
 tions or making queries that influence what data are added to its training
set. Examples
 include selecting joint angles or torques to learn the kinematics or dynamics
of a robot arm, selecting locations for sensor measurements to identify and
locate buried hazardous
 wastes, or querying a human expert to classify an unknown word in a natural
language
 understandingproblem.
 When actions/queries are selected properly, the data requirements for some
problems
 decrease drastically, and some NP-complete learning problems become polynomial
in com- putation time {Angluin, 1988; Baum & Lang, 1991}. In practice, active
learning offers its greatest rewards in situations where data are expensive
or difficult to obtain, or when the environment is complex or dangerous.
In industrial settings each training point may take
 days to gather and cost thousands of dollars; a method for optimally selecting
these points
 could offer enormous savings in time and money.
 c
 fl1996 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.Cohn,
Ghahramani & Jordan There are a number of different goals which one may wish
to achieve using active learn-
 ing. One is optimization, where the learner performs experiments to find
a set of inputs
 that maximize some response variable. An example of the optimization problem
would be finding the operating parameters that maximize the output of a steel
mill or candy factory. There is an extensive literature on optimization,
examining both cases where the learner
 has some prior knowledge of the parameterized functional form and cases
where the learner has no such knowledge; the latter case is generally of
greater interest to machine learning practitioners. The favored technique
for this kind of optimization is usually a form of re- sponse surface methodology
{Box & Draper, 1987}, which performs experiments that guide hill-climbing
through the input space. Arelated problem exists in the field of adaptive
control, where one must learn a control policy by taking actions. Incontrol
problems, one faces the complication that the value of a specific action
may not be known until many time steps after it is taken. Also,in control
 {as in optimization}, one is usually concerned with the performing well
during the learning
 task and must trade of exploitation of the current policy for exploration
which may improve
 it. The subfield of dual control {Fe'ldbaum, 1965} is specifically concerned
with finding an
 optimal balance of exploration and control while learning.
 In this paper, we will restrict ourselves to examining the problem of supervised
learning:
 based on a set of potentially noisy training examples D = f{x
 i
 ; y
 i }g
 m
 i=1 , where x
 i
 2 X and
 y
 i
 2 Y , we wish to learn a general mapping X ! Y . In robot control, the mapping
may be
 state002action ! new state; in hazard location it may be sensor reading
! target position.
 In contrast to the goals of optimization and control, the goal of supervised
learning is to be
 able to efficiently and accurately predict y for a given x.
 In active learning situations, the learner itself is responsible for acquiring
the training
 set. Here, we assume it can iteratively select a new input ~x {possibly
from a constrained
 set}, observe the resulting output ~y, and incorporate the new example {~x;
~y} into its training set. This contrasts with related work by Plutowski
and White {1993}, which is concerned with filtering an existing data set.
In our case, ~x may be thought of as a query, experiment,
 or action, depending on the research field and problem domain. The question
we will be
 concerned with is how to choose which ~x to try next.
 There are many heuristics for choosing ~x, including choosing places where
we don't have
 data {Whitehead, 1991}, where we perform poorly {Linden & Weber, 1993},
where we have
 low confidence {Thrun & Mªoller, 1992}, where we expect it to change our
model {Cohn,
 Atlas, & Ladner, 1990, 1994}, and where we previously found data that resulted
in learning
 {Schmidhuber & Storck, 1993}. In this paper we will consider how one may
select ~x in a
 statistically \optimal" manner for some classes of machine learning algorithms.
We first
 briefly review how the statistical approach can be applied to neural networks,
as described
 in earlier work {MacKay, 1992; Cohn, 1994}. Then, in Sections 3 and 4 we
consider two
 alternative, statistically-based learning architectures: mixtures of Gaussians
and locally
 weighted regression. Section5 presents the empirical results of applying
statistically-based active learning to these architectures. While optimal
data selection for a neural network is computationally expensive and approximate,
we find that optimal data selection for the two statistical models is efficient
and accurate.
 130Active Learning with Statistical Models
 2. Active Learning { A Statistical Approach
 We begin by defining P {x; y} to be the unknown joint distribution over
x and y, and P {x}
 to be the known marginal distribution of x {commonly called the input distribution}.
We
 denote the learner's output on input x, given training set D as ^y{x; D}.
 1
 We can then write
 the expected error of the learner as follows:
 Z
 x
 E
 T h
 {^y{x; D} 000 y{x}) 2
 jx
 i
 P{x}dx; {1}
 where E
 T
 [001] denotes expectation over P {yjx} and over training sets D. The expectation
 inside the integral may be decomposed as follows {Geman, Bienenstock, &
Doursat, 1992}:
 E
 T h
 {^y{x; D} 000 y{x}) 2
 jx
 i
 = E
 h
 {y{x} 000 E[yjx]} 2
 i
 {2}
 + {E
 D
 [^y{x; D}] 000 E[yjx]}
 2
 +E D
 h
 {^y{x; D} 000 E D
 [^y{x; D}]}
 2
 i where E
 D
 [001] denotes the expectation over training sets D and the remaining expectations
on the right-hand side are expectations with respect to the conditional density
P {yjx}. It is
 important to remember here that in the case of active learning, the distribution
of D may
 differ substantially from the joint distribution P {x; y}.
 The first term in Equation 2 is the variance of y given x  -  it is the
noise in the
 distribution, and does not depend on the learner or on the training data.
Thesecond term
 is the learner's squared bias, and the third is its variance; these last
two terms comprise the
 mean squared error of the learner with respect to the regression function
E[yjx]. Whenthe
 second term of Equation 2 is zero, we say that the learner is unbiased.
We shall assume that the learners considered in this paper are approximately
unbiased; that is, that their squared bias is negligible when compared with
their overall mean squared error. Thus we focus on algorithms that minimize
the learner's error by minimizing its variance:
 033
 2
 ^y
 021 033
 2
 ^y {x} = E
 D h
 {^y{x; D} 000 E
 D [^y{x; D}]}
 2
 i
 : {3}
 {For readability, we will drop the explicit dependence on x and D  -  unless
denoted other-
 wise, ^y and 033
 2
 ^y
 are functions of x and D.} Inan active learning setting, we will have chosen
 the x-component of our training set D; we indicate this by rewriting Equation
3 as
 033
 2
 ^y
 =
 D {^y 000 h^yi}
 2
 E
 ;
 where h001i denotes E D
 [001] given a fixed x-component of D. When a new input ~x is selected and
queried, and the resulting {~x; ~y} added to the training set, 033
 2
 ^y should change. We will
 denote the expectation {over values of ~y} of the learner's new variance
as D
 ~033
 2 ^y
 E
 = E D[{~x;~ y}
 h
 033
 2 ^ y
 j~x
 i
 : {4}1. We present our equations in the univariate setting. All results
in the paper apply equally to the multi-
 variate case. 131Cohn, Ghahramani & Jordan 2.1 Selecting Data to Minimize
Learner Variance
 In this paper we consider algorithms for active learning which select data
in an attempt to
 minimize the value of Equation 4, integrated over X. Intuitively, the minimization
proceeds
 as follows: we assume that we have an estimate of 033
 2 ^y
 , the variance of the learner at x. If,
 for some new input ~x, we knew the conditional distribution P {~yj~ x},
we could compute an
 estimate of the learner's new variance at x given an additional example
at ~x. While the
 truedistribution P {~yj~ x} is unknown, many learning architectures let
us approximate it by
 giving us estimates of its mean and variance. Using the estimated distribution
of ~y, we can
 estimate D
 ~033
 2 ^y
 E
 , the expected variance of the learner after querying at ~x.
 Given the estimate of D
 ~033
 2 ^y
 E
 , which applies to a given x and a given query ~x, we must
 integrate x over the input distribution to compute the integrated average
variance of the
 learner. In practice, we will compute a Monte Carlo approximation of this
integral, eval-
 uating D
 ~033
 2 ^y
 E
 at a number of reference points drawn according to P {x}. By querying an
~x that minimizes the average expected variance over the reference points,
we have a solid
 statistical basis for choosing new examples.
 2.2 Example: Active Learning with a Neural Network
 In this section we review the use of techniques from Optimal Experiment
Design {OED} to
 minimize the estimated variance of a neural network {Fedorov, 1972; MacKay,
1992; Cohn,
 1994}. We will assume we have been given a learner ^y = f
 ^w {}, a training set D = f{x
 i
 ; y
 i
 }g
 m
 i=1
 and a parameter vector estimate ^w that maximizes some likelihood measuregiven
D. If, for
 example, one assumes that the data were produced by a process whose structure
matches
 that of the network, and that noise in the process outputs is normal and
independently identically distributed, then the negative log likelihood of
^w given D is proportional to
 S
 2 =
 1m m
 X
 i=1
 {y
 i
 000 ^y{x
 i
 }) 2
 :
 The maximum likelihood estimate for ^w is that which minimizes S
 2
 .
 The estimated output variance of the network is 033
 2
 ^y 031 S
 2
 022 @ ^y{x}@w
 
 T
  
 @
 2 S
 2@w 2
 !
 0001 022
 @ ^y{x}@w 
 ; {MacKay, 1992}
 where the true variance is approximated by a second-order Taylor series
expansion around S
 2
 . This estimate makes the assumption that @ ^y=@w is locally linear. Combined
with the assumption that P {yjx} is Gaussian with constant variance for all
x, one can derive a closed
 form expression for
 D
 ~033
 2
 ^y
 E
 . SeeCohn {1994} for details.
 In practice, @ ^y=@w may be highly nonlinear, and P {yjx} may be far from
Gaussian; in
 spite of this, empirical results show that it works well on some problems
{Cohn, 1994}. It
 hasthe advantage of being grounded in statistics, and is optimal given the
assumptions.
 Furthermore, the expectation is differentiable with respect to ~x. As such,
it is applicable
 in continuous domains with continuous action spaces, and allows hillclimbing
to find the ~x that minimizes the expected model variance.
 132Active Learning with Statistical Models
 For neural networks, however, this approach has many disadvantages. In addition
to
 relying on simplifications and assumptions which hold only approximately,
the process is
 computationally expensive. Computing the variance estimate requires inversion
of a jwj002jwj
 matrix for each new example, and incorporating new examples into the network
requires expensive retraining. Paass and Kindermann {1995}discuss a Markov-chain
based sampling approach which addresses some of these problems. In the rest
of this paper, we consider two\non-neural" machine learning architectures
that are much more amenable to optimal data selection.
 3. Mixtures of Gaussians The mixture of Gaussians model is a powerful estimation
and prediction technique with roots in the statistics literature {Titterington,
Smith, & Makov, 1985}; it has, over the last few years, been adopted by researchers
in machine learning {Cheeseman et al., 1988; Nowlan, 1991; Specht, 1991;
Ghahramani & Jordan, 1994}. The model assumes that the data are produced
by a mixture of N multivariate Gaussians g
 i
 , for i = 1; :::; N {see Figure 1}. Inthe context of learning from random
examples, one begins by producing a joint density estimate over the input/output
space X 002Y based on the training set D. The EM algorithm
 {Dempster, Laird, & Rubin, 1977} can be used to efficiently find a locally
optimal fit of the
 Gaussians to the data. It is then straightforward to compute ^y given x
by conditioning the joint distribution on x and taking the expected value.
ooooooooooooxoy1y2g1g2g3Figure 1: Using a mixture of Gaussians to compute
^y. The Gaussians model the data
 density. Predictions are made by mixing the conditional expectations of
each Gaussian given the input x. One benefit of learning with a mixture of
Gaussians is that there is no fixed distinction between inputs and outputs
-  one may specify any subset of the input-output dimensions, and compute
expectations on the remaining dimensions. If one has learned a forward model
of the dynamics of a robot arm, for example, conditioning on the outputs
automatically gives a model of the arm's inverse dynamics. With the mixture
model, it is also straightforward
 to compute the mode of the output, rather than its mean, which obviates
many of the problemsof learning direct inverse models {Ghahramani & Jordan,
1994}.
 133Cohn, Ghahramani & Jordan For each Gaussian g
 i we will denote the input/output means as 026
 x;i
 and 026 y;i
 and vari-
 ances and covariances as 033
 2 x;i
 , 033
 2
 y;i
 and 033
 xy;i respectively. We can then express the probability
 of point {x; y}, given g
 i
 as P {x; yji} = 12031
 pj006
 i j
 exp
 024
 000 12
 {x 000 026
 i
 }
 T
 006
 0001
 i
 {x 000 026
 i
 }
 025
 {5} where we have defined
 x =
 "
 x
 y
 # 026
 i
 =
 " 026
 x;i
 026
 y;i
 #
 006
 i =
 "
 033
 2 x;i
 033
 xy;i 033
 xy;i
 033 2
 y;i
 #
 :
 In practice, the true means and variances will be unknown, but can be estimated
from data via the EM algorithm. The {estimated} conditional variance of y
given x is then 033
 2
 yjx;i = 033
 2
 y;i 000
 033
 2
 xy;i033
 2
 x;i
 :
 and the conditional expectation ^y
 i
 and variance 033
 2
 ^y;i
 given x are:
 ^y
 i
 = 026 y;i
 +
 033
 xy;i033
 2 x;i
 {x 000 026 x;i
 }; 033
 2
 ^y;i
 =
 033
 2
 yjx;in
 i
  
 1 +
 {x 000 026
 x;i
 }
 2033
 2 x;i
 !
 : {6} Here, n
 i
 is the amount of \support" for the Gaussian g
 i
 in the training data. It can be
 computed as
 n
 i
 =
 m
 X j=1
 P {x
 j
 ; y
 j
 ji}P
 N
 k=1
 P {x j
 ; y
 j
 jk}
 :
 The expectations and variances in Equation 6 are mixed according to the
probability
 thatg i
 has of being responsible for x, prior to observing y:
 h
 i
 021 h
 i
 {x} =
 P {xji}P
 N
 j=1
 P {xjj}
 ; where
 P {xji} =
 1q2031033
 2 x;i
 exp
 "
 000
 {x 000 026
 x;i
 }
 22033
 2
 x;i
 #
 : {7} For input x then, the conditional expectation ^y of the resulting
mixture and its variance
 may be written:
 ^y =
 N
 X
 i=1
 h
 i
 ^y
 i
 ; 033
 2
 ^y
 =
 N X
 i=1
 h
 2
 i
 033
 2 yjx;in
 i
  
 1 + {x 000 026
 x;i }
 2033
 2
 x;i
 ! ;
 where we have assumed that the ^y
 i
 are independent in calculating 033
 2
 ^y
 . Both of these terms can be computed efficiently in closed form. It is
also worth noting that 033 2
 ^y
 is only one of many
 variance measures we might be interested in. If, for example, our mapping
is stochastically
 multivalued {that is, if the Gaussians overlapped significantly in the x
dimension}, we may
 wish our prediction ^y to reflect the most likely y value. In this case,
^y would be the mode,
 and a preferable measure of uncertainty would be the {unmixed} variance
of the individual
 Gaussians. 134Active Learning with Statistical Models
 3.1 Active Learning with a Mixture of Gaussians
 In the context of active learning, we are assuming that the input distribution
P {x} is known.
 With a mixture of Gaussians, one interpretation of this assumption is that
we know 026
 x;i
 and033
 2
 x;i
 for each Gaussian. In that case, our application of EM will estimate only
026 y;i
 , 033
 2 y;i
 ,
 and 033 xy;i
 .
 Generally however, knowing the input distribution will not correspond to
knowing the
 actual 026
 x;i and 033
 2
 x;i for each Gaussian. We may simply know, for example, that P {x} is
 uniform, or can be approximated by some set of sampled inputs. In such cases,
we must use
 EM to estimate 026
 x;i
 and 033
 2
 x;i
 in addition to the parameters involving y. If we simply estimate these values
from the training data, though, we will be estimating the joint distribution
of P {~x; yji} instead of P {x; yji}. To obtain a proper estimate, we must
correct Equation 5 as
 follows: P {x; yji} = P {~x; yji} P {xji}P {~xji}
 : {8}
 Here, P {~xji} is computed by applying Equation 7 given the mean and x variance
of the
 training data, and P {xji} is computed by applying the same equation using
the mean and x variance of a set of reference data drawn according to P {x}.
 If our goal in active learning is to minimize variance, we should select
training examples ~x to minimize
 D ~033
 2
 ^y
 E
 . With a mixture of Gaussians, we can compute
 D
 ~033
 2
 ^y E
 efficiently. The model's estimated distribution of ~y given ~x is explicit:
P {~yj~ x} =
 N
 X
 i=1
 ~
 h
 i
 P {~yj~ x; i} =
 N
 X
 i=1
 ~
 h
 i N{^y
 i
 {~ x}; 033
 2 yjx;i
 {~x});
 where
 ~
 h i
 021 h
 i {~x}, and N {026; 033
 2
 } denotes the normal distribution with mean 026 and variance 033
 2
 . Given this, we can model the change in each g
 i
 separately, calculating its expected
 variancegiven a new point sampled from P {~yj~ x; i} and weight this change
by
 ~
 h
 i
 . The new
 expectations combine to form the learner's new expected variance
 D
 ~033
 2
 ^y
 E
 =
 N X
 i=1
 h
 2
 i
 D
 ~033
 2
 yjx;i En
 i +
 ~
 h
 i  
 1 +
 {x 000 026
 x;i
 } 2033
 2
 x;i
 !
 {9} where the expectation can be computed exactly in closed form:
 D
 ~033 2
 y;i
 E
 =
 n
 i
 033 2
 y;in
 i
 +
 ~
 h
 i
 +
 n i
 ~
 h
 i 020
 033
 2
 yj~x;i
 + {^y
 i
 {~ x} 000 026
 y;i
 } 2
 021{n
 i
 +
 ~ h
 i
 }
 2 ;
 D
 ~033 2
 yjx;i
 E =
 D
 ~ 033 2
 y;i
 E
 000
 D
 ~ 033 2
 xy;i
 E033
 2
 x;i
 ;
 D
 ~ 033
 xy;i
 E
 = n
 i
 033
 xy;in
 i
 +
 ~
 h
 i
 +
 n
 i
 ~ h
 i
 {~ x 000 026
 x;i
 }{^y
 i
 {~ x} 000 026
 y;i
 }{n
 i
 +
 ~
 h
 i
 }
 2
 ;
 D ~033
 2
 xy;i E
 = h~033 xy;i
 i
 2
 +
 n
 2
 i ~
 h
 2
 i 033
 2
 yj~x;i
 {~x 000 026 x;i
 }
 2{n
 i
 +
 ~
 h
 i }
 4
 :
 If, as discussed earlier, we are also estimating 026
 x;i
 and 033 2
 x;i
 , we must take into account the
 effect of the new example on those estimates, and must replace 026 x;i
 and 033
 2 x;i
 in the above
 equations with
 ~026
 x;i =
 n
 i
 026 x;i
 +
 ~
 h i
 ~ xn
 i
 +
 ~
 h
 i
 ; ~033 2
 x;i
 =
 n i
 033
 2
 x;in
 i +
 ~
 h
 i +
 n
 i
 ~ h
 i
 {~ x 000 026
 x;i
 } 2{n i
 +
 ~
 h i
 }
 2
 :
 135Cohn, Ghahramani & Jordan We can use Equation 9 to guide active learning.
By evaluating the expected new variance
 over a reference set given candidate ~x, we can select the ~x giving the
lowest expected model variance. Note that in high-dimensional spaces, it
may be necessary to evaluate an excessive number of candidate points to get
good coverage of the potential query space. In these cases,
 it is more efficient to differentiate Equation 9 and hillclimb on @
 D
 ~033
 2
 ^y
 E
 =@ ~x to find a locally
 maximal ~x. See, for example, {Cohn, 1994}.
 4. Locally Weighted Regression
 Model-based methods, such as neural networks and the mixture of Gaussians,
use the data
 to build a parameterized model. After training, the model is used for predictions
and the
 data are generally discarded. In contrast, \memory-based" methods are non-parametric
 approaches that explicitly retain the training data, and use it each time
a prediction needs tobe made. Locally weighted regression {LWR} is a memory-based
method that performs a
 regression around a point of interest using only training data that are
\local" to that point.
 One recent study demonstrated that LWR was suitable for real-time control
by constructing
 an LWR-based system that learned a difficult juggling task {Schaal & Atkeson,
1994}.
oooooooooooooxFigure 2: Inlocally weighted regression, points are weighted
by proximity to the current
 x in question using a kernel. A regression is then computed using the weighted
points.
 We consider here a form of locally weighted regression that is a variant
of the LOESS
 model {Cleveland, Devlin, & Grosse, 1988}. The LOESS model performs a linear
regression on points in the data set, weighted by a kernel centered at x
{see Figure 2}. The kernel
 shape is a design parameter for which there are many possible choices: the
original LOESS
 model uses a \tricubic" kernel; in our experiments we have used a Gaussian
h
 i
 {x} 021 h{x 000 x
 i } = exp{000k{x 000 x
 i
 }
 2
 };
 where k is a smoothing parameter. In Section 4.1 we will describe several
methods for
 automatically setting k.
 136Active Learning with Statistical Models
kernel too wide - includes nonlinear region)(kernel just right)(kernel too
narrow - excludes some of linear regionooooooooooooxoFigure 3: The estimator
variance is minimized when the kernel includes as many training points as
can be accommodated by the model. Here the linear LOESS model is shown. Too
large a kernel includes points that degrade the fit; too small a kernel neglects
points that increase confidence in the fit.
 For brevity, we will drop the argument x for h
 i
 {x}, and define n= P
 i
 h
 i . Wecan then
 write the estimated means and covariances as:
 026 x
 =
 P
 i h
 i
 x
 in
 ; 033 2
 x
 =
 P i
 h
 i
 {x
 i
 000 026
 x
 }
 2n
 ; 033
 xy =
 P
 i
 h i
 {x
 i 000 026
 x
 }{y
 i
 000 026
 y
 }n 026
 y
 =
 P i
 h
 i
 y in
 ; 033
 2
 y
 = P
 i
 h
 i {y
 i
 000 026
 y
 }
 2n
 ; 033 2
 yjx
 = 033
 2
 y
 000 033
 2
 xy033
 2
 x :
 We use the data covariances to express the conditional expectations and
their estimated
 variances:
 ^y = 026
 y
 +
 033
 xy033 2
 x
 {x000 026
 x
 }; 033 2
 ^y
 =
 033
 2
 yjxn
 2
   X
 i
 h
 2 i
 +
 {x 000 026
 x
 }
 2033
 2 x
 X
 i
 h 2
 i
 {x i
 000 026
 x }
 2033
 2
 x
 ! {10}
 4.1 Setting the Smoothing Parameter k
 There are a number of ways one can set k, the smoothing parameter. The method
used
 by Cleveland et al. {1988} is to set k such that the reference point being
predicted has a
 predetermined amount of support, that is, k is set so that n is close to
some target value.
 This has the disadvantage of requiring assumptions about the noise and smoothness
of the
 function being learned. Another technique, used by Schaal and Atkeson {1994},
sets k to
 minimize the crossvalidated error on the training set. A disadvantage of
this technique is that it assumes the distribution of the training set is
representative of P {x}, which it
 may not be in an active learning situation. A third method, also described
by Schaal and
 Atkeson {1994}, is to set k so as to minimize the estimate of 033
 2
 ^y
 at the reference points. As
 k decreases, the regression becomes more global. The total weight n will
increase {which decreases 033
 2
 ^y
 }, but so will the conditional variance 033
 2
 yjx
 {which increases 033
 2
 ^y
 }. At some
 value of k, these two quantities will balance to produce a minimum estimated
variance {see
 Figure 3}. This estimate can be computed for arbitrary reference points
in the domain,
 137Cohn, Ghahramani & Jordan and the user has the option of using either
a different k for each reference point or a single
 global k that minimizes the average 033
 2
 ^y
 over all reference points. Empirically, we found that
 the variance-based method gave the best performance.
 4.2 Active Learning with Locally Weighted Regression
 As with the mixture of Gaussians, we want to select ~x to minimize
 D
 ~033
 2
 ^y
 E
 . To do this, we must estimate the mean and variance of P {~yj~ x}. With
locally weighted regression, these are explicit: the mean is ^y{~ x} and
the variance is 033 2
 yj~x
 . The estimate of
 D
 ~033
 2
 ^y E
 is also explicit.
 Defining ~
 h as the weight assigned to ~x by the kernel we can compute these expectations
 exactly in closed form. For the LOESS model, the learner's expected new
variance is
 D
 ~033 2
 ^y
 E
 =
 D
 ~033
 2
 yjx
 E{n +
 ~
 h}
 2
 "
 X i
 h
 2
 i +
 ~
 h
 2
 +
 {x 000 ~026 x
 }
 2~ 033
 2
 x  
 X
 i
 h 2
 i
 {x
 i
 000 ~026
 x
 }
 2~ 033
 2
 x +
 ~
 h
 2 {~ x 000 ~026 x
 }
 2~ 033
 2
 x !#
 : {11}
 Note that, since
 P
 i
 h 2
 i
 {x
 i
 000 026
 x
 } 2
 =
 P
 i h
 2
 i
 x 2
 i
 + 026
 2
 x
 P
 i h
 2
 i
 000 2026
 x
 P
 i h
 2
 i
 x i
 , the new expectation of Equation 11 may be efficiently computed by caching
the values of
 P i
 h
 2
 i x
 2
 i
 and P
 i
 h
 2 i
 x
 i
 . This obviates the need to recompute the entire sum for each new candidate
point. The component expectations in Equation 11 are computed as follows:
 D
 ~033
 2
 yjx E
 =
 D
 ~ 033
 2
 y
 E 000
 D
 ~ 033 2
 xy
 E~ 033
 2
 x ;
 D
 ~ 033
 2
 y
 E
 = n033
 2
 yn +
 ~
 h
 +
 n
 ~ h
 020
 033
 2
 yj~x
 + {^y{~ x} 000 026
 y
 } 2
 021{n +
 ~
 h} 2
 ;
 ~026 x
 =
 n026
 x
 +
 ~
 h~ xn +
 ~ h
 ; h~033 xy
 i =
 n033 xyn + ~
 h
 +
 n ~
 h{~x 000 026
 x
 }{^y{~ x} 000 026
 y
 }{n +
 ~
 h}
 2
 ;
 ~033
 2
 x
 =
 n033 2
 xn +
 ~
 h
 +
 n
 ~
 h{~x 000 026
 x
 }
 2{n + ~
 h}
 2
 ;
 D
 ~033
 2
 xy
 E
 = h~033
 xy
 i 2
 +
 n
 2 ~
 h
 2
 033 2
 yj~x {~x 000 026
 x
 }
 2{n +
 ~
 h}
 4
 :
 Just as with the mixture of Gaussians, we can use the expectation in Equation
11 to guide
 active learning.
 5. Experimental Results For an experimental testbed, we used the \Arm2D"
problem described by Cohn {1994}. The task is to learn the kinematics of
a toy 2-degree-of-freedom robot arm {see Figure 4}. The inputs are joint
angles {002 1
 ; 002
 2 }, and the outputs are the Cartesian coordinates of the
 tip {X
 1
 ; X
 2
 }. One of the implicit assumptions of both models described here is that
the
 noise is Gaussian in the output dimensions. To test the robustness of the
algorithm to this
 assumption, we ran experiments using no noise, using additive Gaussian noise
in the outputs,
 and using additive Gaussian noise in the inputs. The results of each were
comparable; we
 report here the results using additive Gaussian noise in the inputs. Gaussian
input noise
 correspondsto the case where the arm effectors or joint angle sensors are
noisy, and results
 in non-Gaussian errors in the learner's outputs. The input distribution
P {x} is assumed to
 be uniform. We compared the performance of the variance-minimizing criterion
by comparing the learning curves of a learner using the criterion with that
of one learning from random 138Active Learning with Statistical Models
1212{x ,x }Figure 4: The arm kinematics problem. The learner attempts to
predict tip position given a set of joint angles {022
 1
 ; 022
 2
 }. samples. The learning curves plot the mean squared error and variance
of the learner as its training set size increases. The curves are created
by starting with an initial sample,
 measuring the learner's mean squared error or estimated variance on a set
of \reference" points {independent of the training set}, selecting and adding
a new example to the training set, retraining the learner on the augmented
set, and repeating.
 On each step, the variance-minimizing learner chose a set of 64 unlabeled
reference
 points drawn from input distribution P {x}. It then selected a query ~x
= {022
 1
 ; 022
 2
 } that it
 estimated would minimize D
 ~033
 2 yjx
 E
 over the reference set. In the experiments reported here,
 the best ~x was selected from another set of 64 \candidate" points drawn
at random on each
 iteration.
 2
 5.1 Experiments with Mixtures of Gaussians
 With the mixtures of Gaussians model, there are three design parameters
that must be considered  -  the number of Gaussians, their initial placement,
and the number of itera- tions of the EM algorithm. We set these parameters
by optimizing them on the learner using random examples, then used the same
settings on the learner using the variance-
 minimization criterion. Parameters were set as follows: Models with fewer
Gaussians have the obvious advantage of requiring less storage space and
computation. Intuitively,a small model should also have the advantage of
avoiding overfitting, which is thought to occur in
 systems with extraneous parameters. Empirically, aswe increased the number
of Gaussians,
 generalization improved monotonically with diminishing returns{for a fixed
training set size and number of EM iterations}. The test error of the larger
models generally matched that
 of the smaller models on small training sets {where overfitting would be
a concern}, and
 continued to decrease on large training sets where the smaller networks
\bottomed out." We therefore preferred the larger mixtures, and report here
our results with mixtures of 60 Gaussians. We selected initial placement
of the Gaussians randomly, chosen uniformly from the smallest hypercube containing
all current training examples. We arbitrarily chose the2. As described earlier,
we could also have selected queries by hillclimbing on @
 012
 ~033 2
 yjx
 ff =@~ x; in this low dimensional problem it was more computationally efficient
to consider a random candidate set. 139Cohn, Ghahramani & Jordan identity
matrix as an initial covariance matrix. The learner was surprisingly sensitive
to the
 number of EM iterations. We examined a range of 5 to 40 iterations of the
EM algorithm
 per step. Small numbers of iterations {5-10} appear insufficent to allow
convergence with
 large training sets, while large numbers of iterations {30-40} degraded
performance on small
 training sets. An ideal training regime would employ some form of regularization,
or would examine the degree of change between iterations to detect convergence;
in our experiments,
 however, we settled on a fixed regime of 20 iterations per step.
50100150200250300350400450500Var0.0030.010.030.10.3 1randomvariance50100150200250300350400450500MSE0.0010.0030.010.030.10.3
1randomvarianceFigure 5: Varianceand MSE learning curves for mixture of 60
Gaussians trained on the Arm2D domain. Dotted lines denote standard error
for average of 10 runs, each
 started with one initial random example.
 Figure 5 plots the variance and MSE learning curves for a mixture of 60
Gaussians
 trained on the Arm2D domain with 1045 input noise added. The estimated
model variance
 usingthe variance-minimizing criterion is significantly better than that
of the learner select-
 ing data at random. The mean squared error, however, exhibits even greater
improvement,
 with an error that is consistently 1=3 that of the randomly sampling learner.
5.2 Experiments with LOESS Regression With LOESS, the design parameters are
the the size and shape of the kernel. As described earlier, we arbitrarily
chose to work with a Gaussian kernel; we used the variance-based methodfor
automatically selecting the kernel size.
 In the case of LOESS, both the variance and the MSE of the learner using
the variance- minimizing criterion are significantly lower than those of
the learner selecting data randomly. It is worth noting that on the Arm2D
domain, this form of locally weighted regression also significantly outperforms
both the mixture of Gaussians and the neural networks discussed by Cohn {1994}.
 140Active Learning with Statistical Models
training set size50100150200250300350400450500Var5e-050.00010.00020.00040.001randomvariancetraining
set size50100150200250300350400450500MSE0.00010.0010.010.1 110randomvarianceFigure
6: Variance and MSE learning curves for LOESS model trained on the Arm2D
do- main. Dotted lines denote standard error for average of 60 runs, each
started
 with a single initial random example.
 5.3 Computation Time
 One obvious concern about the criterion described here is its computational
cost. In sit-
 uations where obtaining new examples may take days and cost thousands of
dollars, it is clearly wise to expend computation to ensure that those examples
are as useful as possible. In other situations, however, new data may be
relatively inexpensive, so the computational cost of finding optimal examples
must be considered.
 Table 1 summarizes the computation times for the two learning algorithms
discussed in this paper.
 3
 Note that, with the mixture of Gaussians, training time depends linearly
 on the number of examples, but prediction time is independent. Conversely,
with locally
 weighted regression, there is no \training time" per se, but the cost of
additional examples
 accrues when predictions are made using the training set.
 While the training time incurred by the mixture of Gaussians may make it
infeasible forselecting optimal action learning actions in realtime control,
it is certainly fast enough to be used in many applications. Optimized, parallel
implementations will also enhance its
 utility.
 4
 Locally weighted regression is certainly fast enough for many control applications,
 and may be made faster still by optimized, parallel implementations. It
is worth noting3. The times reported are \per reference point" and \per
candidate per reference point"; overall time must
 be computed from the number of candidates and reference points examined.
In the case of the LOESS
 model, for example, with 100 training points, 64 reference points and 64
candidate points, the time required to select an action would be {58+ 0:16
002 100} 002 4096026seconds, or about 0.3 seconds. 4. It is worth mentioning
that approximately half of the training time for the mixture of Gaussians
is spent
 computing the correction factor in Equation 8. Withoutthe correction, the
learner still computes P {yjx}, but does so by modeling the training set
distribution ratherthan the reference distribution. We have
 found however, that for the problems examined, the performance of such \uncorrected"
learners does
 not differ appreciably from that of the \corrected" learners.
 141Cohn, Ghahramani & JordanTrainingEvaluating ReferenceEvaluating CandidatesMixture3:9
+ 0:05m sec15000 026sec1300 026secLOESS-92+ 9:7m 026sec58 + 0:16m 026secTable
1: Computationtimes on a Sparc 10 as a function of training set size m. Mixture
 model had 60 Gaussians trained for 20 iterations. Reference times are per
reference
 point; candidate times are per candidate point per reference point. that,
since the prediction speed of these learners depends on their training set
size, optimal data selection is doubly important, as it creates a parsimonious
training set that allows faster predictions on future points. 6. Discussion
 Mixtures of Gaussians and locally weighted regression are two statistical
models that offer
 elegant representations and efficient learning algorithms. Inthis paper
we have shown that
 they also offer the opportunity to perform active learning in an efficient
and statistically
 correct manner. The criteria derived here can be computed cheaply and, for
problems
 tested, demonstrate good predictive power. In industrial settings, where
gathering a single
 data point may take days and cost thousands of dollars, the techniques described
here have
 the potential for enormous savings.
 In this paper, we have only considered function approximation problems.
Problems requiring classification could be handled analogously with the appropriate
models. For learning classification with a mixture model, one would select
examples so as to maximize discriminability between Gaussians; for locally
weighted regression, one would use a logistic
 regression instead of the linear one considered here {Weisberg, 1985}.
 Our future work will proceed in several directions. The most important is
active bias
 minimization. As noted in Section 2, the learner's error is composed of
both bias and
 variance. The variance-minimizing strategy examined here ignores the bias
component, which can lead to significant errors when the learner's bias is
non-negligible. Work in progress examines effective ways of measuring and
optimally eliminating bias {Cohn, 1995}; future work will examine how to
jointly minimize both bias and variance to produce a criterion that truly
minimizes the learner's expected error.
 Another direction for future research is the derivation of variance- {and
bias-} mini- mizing techniques for other statistical learning models. Of
particular interest is the class of models known as \belief networks" or
\Bayesian networks" {Pearl, 1988; Heckerman,
 Geiger, & Chickering, 1994}. These models have the advantage of allowing
inclusion of
 domain knowledge and prior constraints while still adhering to a statistically
sound frame- work. Current research in belief networks focuses on algorithms
for efficient inference and
 learning; it would be an important step to derive the proper criteria for
learning actively
 with these models.
 142Active Learning with Statistical Models
 Appendix A. Notation
 GeneralX input space
 Y output space
 x an arbitrary point in the input space y true output value corresponding
to input x
 ^y predicted output value corresponding to input x x
 i
 \input" part of example i
 y
 i
 \output" part of example i
 m the number of examples in the training set
 ~x specified input of a query
 ~y the {possibly not yet known} output of query ~x
 033
 2
 ^y estimated variance of ^y ~ 033
 2
 ^y new variance of ^y, after example {~x; ~y} has been added
 D
 ~033
 2
 ^y
 E the expected value of ~033 2
 ^y
 P {x} the {known} natural distribution over x
 Neural Networkw a weight vector for a neural network
 ^w estimated \best" w given a training set f
 ^w
 {} function computed by neural network given ^w
 S
 2
 average estimated noise in data, used as an estimate for 033
 2
 y
 Mixture of GaussiansN total number of Gaussians
 g
 i
 Gaussian number i
 n
 i
 total point weighting attributed to Gaussian i
 026
 x;i
 estimated x mean of Gaussian i 026
 y;i
 estimated y mean of Gaussian i
 033
 2
 x;i
 estimated x variance of Gaussian i
 033
 2
 y;i
 estimated y variance of Gaussian i
 033
 xy;i
 estimated xy covariance of Gaussian i
 033
 2
 yjx;i estimated y variance of Gaussian i, given x
 P {x; yji} joint distribution of input-output pair given Gaussian i
 P {xji} distribution x being given Gaussian i
 h
 i
 weight of a given point that is attributed to Gaussian i
 ~
 h i
 weight of new point {~x; ~y} that is attributed to Gaussian i
 Locally Weighted Regressionk kernel smoothing parameter
 h
 i
 weight given to example i by kernel centered at x
 n sum of weights given to all points by kernel
 026
 x
 mean of inputs, weighted by kernel centered at x
 026
 y
 mean of outputs, weighted by kernel centered at x
 ~
 h weight of new point {~x; ~y} given kernel centered at x
 143Cohn, Ghahramani & Jordan Acknowledgements
 David Cohn's current address is: Harlequin, Inc., One Cambridge Center,
Cambridge, MA
 02142 USA. Zoubin Ghahramani's current address is: Department of Computer
Science, University of Toronto, Toronto, Ontario M5S 1A4 CANADA. This work
was funded by NSF grant CDA-9309300, the McDonnell-Pew Foundation, ATR Human
Information Pro-
 cessing Laboratories and Siemens Corporate Research. Weare deeply indebted
to Michael
 Titterington and Jim Kay, whose careful attention and continued kind help
allowed us to make several corrections to an earlier version of this paper.
 References
 Angluin, D. {1988}. Queries and concept learning. Machine Learning,2, 319{342.
 Baum, E., & Lang, K. {1991}. Neural network algorithms that learn in polynomial
time
 from examples and queries. IEEE Trans. Neural Networks, 2.
 Box, G., & Draper, N. {1987}. Empirical model-building and response surfaces.
Wiley.
 Cheeseman, P., Self, M., Kelly, J., Taylor, W., Freeman, D., & Stutz, J.
{1988}. Bayesian
 classification. In AAAI 88, The 7th National Conference on Artificial Intelligence,
 pp.607{611. AAAI Press.
 Cleveland, W., Devlin, S., & Grosse, E. {1988}. Regression by local fitting.
Journal of
 Econometrics, 37, 87{114.
 Cohn, D. {1994}. Neural network exploration using optimal experiment design.
In Cowan,
 J., Tesauro, G., & Alspector, J. {Eds.}, Advances in Neural Information
Processing
 Systems 6. Morgan Kaufmann. Expanded version available as MIT AI Lab memo
1491 by anonymous ftp to publications.ai.mit.edu.
 Cohn, D.{1995}. Minimizing statistical bias with queries. AI Lab memo AIM-
 1552,Massachusetts Institute of Technology. Available by anonymous ftp from
 publications.ai.mit.edu.
 Cohn, D., Atlas, L., & Ladner, R. {1990}. Training connectionist networkswith
queries and selective sampling. In Touretzky, D. {Ed.}, Advances in Neural
Information Processing Systems 2. Morgan Kaufmann.
 Cohn, D., Atlas, L., & Ladner, R. {1994}. Improving generalization with
active learning.
 Machine Learning, 5 {2}, 201{221.
 Dempster, A., Laird, N., & Rubin, D. {1977}. Maximum likelihood fromincomplete
data
 viathe EM algorithm. J.Royal Statistical Society Series B, 39, 1{38.
 Fedorov, V. {1972}. Theory of Optimal Experiments. Academic Press.
 Fe'ldbaum, A. A. {1965}. Optimal control systems. Academic Press, New York,
NY.
 144Active Learning with Statistical Models
 Geman, S., Bienenstock, E., & Doursat, R. {1992}. Neural networks and the
bias/variance dilemma. Neural Computation, 4, 1{58.
 Ghahramani, Z., & Jordan, M. {1994}. Supervised learning from incomplete
data via an
 EM approach. In Cowan, J., Tesauro, G., & Alspector, J. {Eds.}, Advances
in Neural
 Information Processing Systems 6. Morgan Kaufmann.
 Heckerman, D., Geiger, D., & Chickering, D. {1994}. Learning Bayesian networks:
the
 combination of knowledge and statistical data. Tech report MSR-TR-94-09,
Microsoft.
 Linden, A., & Weber, F. {1993}. Implementing inner drive by competence reflection.
In
 Roitblat, H. {Ed.}, Proceedings of the 2nd International Conference on Simulation
of
 Adaptive Behavior. MIT Press, Cambridge, MA.
 MacKay, D. J. {1992}. Information-based objective functions for active data
selection.
 Neural Computation, 4 {4}, 590{604.
 Nowlan, S. {1991}. Soft competitive adaptation: Neuralnetwork learning algorithms
based on fitting statistical mixtures. Tech report CS-91-126, Carnegie Mellon
University. Paass, G., & Kindermann, J. {1995}. Bayesian query construction
for neural network models. In Tesauro, G., Touretzky, D., & Leen, T. {Eds.},
Advances in Neural Information Processing Systems 7. MIT Press. Pearl, J.
{1988}. Probablistic Reasoning in Intelligent Systems. Morgan Kaufmann. Plutowski,
M., & White, H. {1993}. Selecting concise training sets from clean data.
IEEE
 Transactions on Neural Networks, 4, 305{318.
 Schaal, S., & Atkeson, C. {1994}. Robot juggling: An implementation of memory-based
 learning. Control Systems, 14, 57{71. Schmidhuber, J., & Storck, J. {1993}.
Reinforcement driven information acquisition in non- deterministic environments.
Tech report, Fakultªat fªur Informatik, Technische Univer-
 sitªat Mªunchen.
 Specht, D. {1991}. A general regression neural network. IEEE Trans. Neural
Networks, 2 {6}, 568{576.
 Thrun, S., & Mªoller, K. {1992}. Active exploration in dynamic environments.
In Moody, J., Hanson, S., & Lippmann, R. {Eds.}, Advances in Neural Information
Processing Systems 4. Morgan Kaufmann.
 Titterington, D., Smith, A., & Makov, U. {1985}. Statistical Analysis of
Finite Mixture Distributions. Wiley.
 Weisberg, S. {1985}. Applied Linear Regression. Wiley.
 Whitehead,S. {1991}. A study of cooperative mechanisms for faster reinforcement
learning.
 Technical report CS-365, University of Rochester, Rochester, NY.
 145