Learning to Order Things

W. W. Cohen, R. E. Schapire and Y. Singer

Volume 10, 1999

Journal of Artificial Intelligence Research 10 (1999) 243-270. Submitted 10/98; published 5/99.

Learning to Order Things

William W. Cohen (wcohen@research.att.com)
Robert E. Schapire (schapire@research.att.com)
Yoram Singer (singer@research.att.com)
AT&T Labs, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932, USA

Abstract

There are many applications in which it is desirable to order rather
than classify instances. Here we consider the problem of learning how to
order instances given feedback in the form of preference judgments, i.e.,
statements to the effect that one instance should be ranked ahead of another.
We outline a two-stage approach in which one first learns by conventional
means a binary preference function indicating whether it is advisable to
rank one instance before another. Here we consider an on-line algorithm for
learning preference functions that is based on Freund and Schapire's "Hedge"
algorithm. In the second stage, new instances are ordered so as to maximize
agreement with the learned preference function. We show that the problem
of finding the ordering that agrees best with a learned preference function
is NP-complete. Nevertheless, we describe simple greedy algorithms that are
guaranteed to find a good approximation. Finally, we show how metasearch
can be formulated as an ordering problem, and present experimental results
on learning a combination of "search experts," each of which is a domain-specific
query expansion strategy for a web search engine.

1. Introduction

Work in inductive learning has mostly concentrated on learning
to classify. However, there are many applications in which it is desirable
to order rather than classify instances. An example might be a personalized
email filter that prioritizes unread mail. Here we will consider the problem
of learning how to construct such orderings given feedback in the form of
preference judgments, i.e., statements that one instance should be ranked
ahead of another.

Such orderings could be constructed based on a learned probabilistic classifier
or regression model and in fact often are. For instance, it is common practice
in information retrieval to rank documents according to their probability
of relevance to a query, as estimated by a learned classifier for the concept
"relevant document." An advantage of learning orderings directly is that
preference judgments can be much easier to obtain than the labels required
for classification learning.

For instance, in the email application mentioned above, one approach might
be to rank messages according to their estimated probability of membership
in the class of "urgent" messages, or by some numerical estimate of urgency
obtained by regression. Suppose, however, that a user is presented with an
ordered list of email messages, and elects to read the third message first.
Given this election, it is not necessarily the case that message three is
urgent, nor is there sufficient information to estimate any numerical urgency
measures.

©1999 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

However, it seems quite reasonable to infer that
message three should have been ranked ahead of the others. Thus, in this
setting, obtaining preference information may be easier and more natural
than obtaining the labels needed for a classification or regression approach.

Another application domain that requires ordering instances is collaborative
filtering; see, for instance, the papers contained in Resnick and Varian
(1997). In a typical collaborative filtering task, a user seeks recommendations,
say, on movies that she is likely to enjoy. Such recommendations are usually
expressed as ordered lists of recommended movies, produced by combining movie
ratings supplied by other users. Notice that each user's movie ratings can
be viewed as a set of preference judgements. In fact, interpreting ratings
as preferences is advantageous in several ways: for instance, it is not necessary
to assume that a rating of "7" means the same thing to every user.

In the remainder of this paper, we will investigate the following two-stage
approach to learning how to order. In stage one, we learn a preference function,
a two-argument function PREF(u, v) which returns a numerical measure of how
certain it is that u should be ranked before v. In stage two, we use the
learned preference function to order a set of new instances X; to accomplish
this, we evaluate the learned function PREF(u, v) on all pairs of instances
u, v ∈ X, and choose an ordering of X that agrees, as much as possible, with
these pairwise preference judgments.

For stage one, we describe a specific algorithm for learning a preference
function from a set of "ranking-experts". The algorithm is an on-line weight
allocation algorithm, much like the weighted majority algorithm (Littlestone
& Warmuth, 1994) and Winnow (Littlestone, 1988), and, more directly, Freund
and Schapire's (1997) "Hedge" algorithm. For stage two, we show that finding
a total order that agrees best with such a preference function is NP-complete.
Nevertheless, we show that there are efficient greedy algorithms that always
find a good approximation to the best ordering.

We then present some experimental results in which these algorithms are used
to combine the results of several "search experts," each of which is a domain-specific
query expansion strategy for a web search engine. Since our work touches
several different fields we defer the discussion of related work to Sec.
6.

2. Preliminaries

Let X be a set of instances. For simplicity, in this paper,
we always assume that X is finite. A preference function PREF is a binary
function PREF : X × X → [0, 1]. A value of PREF(u, v) which is close
to 1 (respectively 0) is interpreted as a strong recommendation that u should
be ranked above (respectively, below) v. A value close to 1/2 is interpreted
as an abstention from making a recommendation. As noted earlier, the hypothesis
of our learning system will be a preference function, and new instances will
be ranked so as to agree as much as possible with the preferences predicted
by this hypothesis.

In standard classification learning, a hypothesis is constructed by combining
primitive features. Similarly, in this paper, a preference function will
be a combination of primitive preference functions. In particular, we will
typically assume the availability of a set of N primitive preference functions
R_1, ..., R_N. These can then be combined in the usual ways, for instance
with a boolean or linear combination of their values. We will be especially
interested in the latter combination method.

Figure 1: Left and middle: Two ordering functions and their graph representation
(f with f(a)=1, f(b)=2, f(c)=0, f(d)=⊥, and g with g(a)=0, g(b)=2, g(c)=1, g(d)=2).
Right: The graph representation of the preference function created by a weighted
(1/4 and 3/4) combination of the two functions. Edges with weight 1/2 or 0 are omitted.

It is convenient to assume that the R_i's are well-formed in certain ways.
To this end, we introduce a special kind of preference function called a
rank ordering, which is defined by an ordering function. Let S be a totally
ordered set. We assume without loss of generality that S ⊆ ℝ. An ordering
function into S is any function f : X → S, where we interpret an inequality
f(u) > f(v) to mean that u is ranked above v by f. It is sometimes convenient
to allow an ordering function to "abstain" and not give a preference for
a pair u, v. We therefore allow S to include a special symbol ⊥ not in ℝ,
and we interpret f(u) = ⊥ to mean that u is "unranked." We define the symbol
⊥ to be incomparable to all the elements in S (that is, ⊥ ≮ s and s ≮ ⊥
for all s ∈ S).

An ordering function f induces the preference function R_f, defined as

    R_f(u, v) = 1    if f(u) > f(v)
              = 0    if f(u) < f(v)
              = 1/2  otherwise.

We call R_f a rank ordering for X into S. If R_f(u, v) = 1, then we say that
u is preferred to v, or u is ranked higher than v. Note that R_f(u, v) = 1/2
if either u or v (or both) is unranked.
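To make these definitions concrete, the following is a minimal sketch (not from the paper; all names are illustrative) of an ordering function with abstentions and the rank ordering R_f it induces:

    # A minimal sketch of Sec. 2's definitions; names are illustrative, not the authors'.
    UNRANKED = None  # stands in for the "unranked" symbol ⊥

    def rank_ordering(f):
        """Given an ordering function f (mapping instances to numbers or UNRANKED),
        return the induced preference function R_f."""
        def R_f(u, v):
            fu, fv = f(u), f(v)
            if fu is UNRANKED or fv is UNRANKED:
                return 0.5            # abstain when either instance is unranked
            if fu > fv:
                return 1.0            # u is preferred to v
            if fu < fv:
                return 0.0            # v is preferred to u
            return 0.5                # ties are treated as abstentions
        return R_f

    # Example: the ordering function f from Figure 1.
    f_values = {"a": 1, "b": 2, "c": 0, "d": UNRANKED}
    R_f = rank_ordering(lambda x: f_values[x])
    assert R_f("b", "a") == 1.0 and R_f("a", "b") == 0.0 and R_f("d", "a") == 0.5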

We will sometimes describe and manipulate preference functions as directed
weighted graphs. The nodes of a graph correspond to the instances in X. Each
pair (u, v) is connected by a directed edge with weight PREF(u, v). Since
an ordering function f induces a preference function R_f, we can also describe
ordering functions as graphs. In Fig. 1 we give an example of two ordering
functions and their corresponding graphs. For brevity, we do not draw edges
(u, v) such that PREF(u, v) = 1/2 or PREF(u, v) = 0.

To give a concrete example of rank orderings, imagine learning to order documents
based on the words that they contain. To model this, let X be the set of
all documents in a repository, and for N words w_1, ..., w_N, let f_i(u)
be the number of occurrences of word w_i in document u. Then R_{f_i} will prefer
u to v whenever w_i occurs more often in u than in v. As a second example, consider
a metasearch application in which the goal is to combine the rankings of several
web search engines on some fixed query. For N search engines e_1, ..., e_N, one
might define f_i so that R_{f_i} prefers web page u to web page v whenever u is
ranked ahead of v in the list L_i produced by the corresponding search engine.
To do this, one could let f_i(u) = −k for the web page u appearing in the k-th
position in the list L_i, and let f_i(u) = −M (where M > |L_i|) for any web page
u not appearing in L_i.
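As an illustration of this encoding (a sketch following the description above, with hypothetical helper names), a search engine's ranked list can be turned into such an ordering function as follows:

    # Sketch of the metasearch encoding described above; names are illustrative.
    def ordering_from_ranked_list(ranked_pages, M=None):
        """Map the k-th page in a search engine's list to the score -k, and every
        unlisted page to -M with M > len(ranked_pages), as in the text."""
        if M is None:
            M = len(ranked_pages) + 1
        scores = {page: -(k + 1) for k, page in enumerate(ranked_pages)}
        return lambda page: scores.get(page, -M)

    f1 = ordering_from_ranked_list(["pageA", "pageB", "pageC"])
    assert f1("pageA") > f1("pageB") > f1("pageC") > f1("pageZ")  # unlisted pages rank last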

Feedback from the user will be represented in a similar but more general
way. We will assume that feedback is a set of element pairs (u, v), each representing
an assertion of the form "u should be preferred to v." This definition of
feedback is less restricted than ordering functions. In particular, we will
not assume that the feedback is consistent--cycles, such as a > b > a, will
be allowed.

3. Learning a Combination of Ordering Functions

In this section, we consider
the problem of learning a good linear combination of a set of ordering functions.
Specifically, we assume access to a set of ranking experts, each of which
generates an ordering function when provided with a set of instances. For
instance, in a metasearch problem, each ranking expert might be a function
that submits the user's query to a different search engine; the domain of
instances might be the set of all web pages returned by any of the ranking
experts; and the ordering function associated with each ranking expert might
be represented as in the example above (i.e., letting f_i(u) = −k for
the k-th web page u returned by the i-th search engine, and letting f_i(u) =
−M for any web page u not retrieved by the i-th search engine). The
user's feedback will be a set of pairwise preferences between web pages.
This feedback may be obtained directly, for example, by asking the user to
explicitly rank the URL's returned by the search engine; or the feedback
may be obtained indirectly, for example, by measuring the time spent viewing
each of the returned pages.

We note that for the metasearch problem, an approach that works directly
with the numerical scores associated with the different search engines might
not be feasible; these numerical scores might not be comparable across different
search engines, or might not be provided by all search engines. Another problem
is that most web pages will not be indexed by all search engines. This can
be easily modeled in our setting: rather than letting f_i(u) = −M for
a web page u that is not ranked by search engine i, one could let f_i(u) =
⊥. This corresponds to the assumption that the search engine's preference
for u relative to ranked web pages is unknown.

We now describe a weight allocation algorithm that uses the preference functions
R_i to learn a preference function of the form PREF(u, v) = Σ_{i=1}^N w_i R_i(u, v).
We adopt the on-line learning framework first studied by Littlestone
(1988) in which the weight w_i assigned to each ranking expert i is updated
incrementally.

Formally, learning is assumed to take place in a sequence of rounds. On each
round t, we assume the learning algorithm is provided with a set X^t of instances
to be ranked, for which each ranking expert i ∈ {1, ..., N} provides
an ordering function f_i^t. (In metasearch, for instance, f_i^t is the ordering
function associated with the list L_i^t of web pages returned by the i-th ranking
expert for the t-th query, and X^t is the set of all web pages that appear
in any of the lists L_1^t, ..., L_N^t.) Each ordering function f_i^t induces
a preference function R_{f_i^t}, which we denote for brevity by R_i^t. The learner
may compute R_i^t(u, v) for any and all preference functions R_i^t and pairs
u, v ∈ X^t before producing a combined preference function PREF^t, which is then
used to produce an ordering ρ̂^t of X^t. (Methods for producing an ordering from
a preference function will be discussed below.)

After producing the ordering ρ̂^t, the learner receives feedback from the
environment. We assume that the feedback is an arbitrary set of assertions
of the form "u should be preferred to v." That is, the feedback on the t-th
round is a set F^t of pairs (u, v).

The algorithm we propose for this problem is based on the "weighted majority
algorithm" of Littlestone and Warmuth (1994) and, more directly, on Freund
and Schapire's (1997) "Hedge" algorithm. We define the loss of a preference
function R with respect to the user's feedback F as

    Loss(R, F) = ( Σ_{(u,v)∈F} (1 − R(u, v)) ) / |F| = 1 − (1/|F|) Σ_{(u,v)∈F} R(u, v).    (1)

This loss has a natural probabilistic interpretation. If R is viewed as a
randomized prediction algorithm that predicts that u will precede v with
probability R(u, v), then Loss(R, F) is the probability of R disagreeing
with the feedback on a pair (u, v) chosen uniformly at random from F.
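A direct transcription of Eq. (1) (a small sketch; the function and variable names are ours, not the paper's) might look like:

    # Sketch of the loss in Eq. (1): the average disagreement of a preference
    # function with a set of feedback pairs.
    def loss(pref, feedback):
        """pref(u, v) -> value in [0, 1]; feedback is a set of (u, v) pairs,
        each asserting that u should be preferred to v."""
        if not feedback:
            return 0.0  # guard for empty feedback; the paper assumes F is non-empty
        return 1.0 - sum(pref(u, v) for (u, v) in feedback) / len(feedback)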

It is worth noting that the assumption on the form of the feedback can be
further relaxed by allowing the user to indicate the degree to which she
prefers u over v. In this case, the loss should be normalized by the weighted
sum of feedback pairs. Since this generalization is rather straightforward,
we assume for brevity that the feedback is an unweighted set of assertions
over element pairs.

We now can use the Hedge algorithm almost verbatim, as shown in Figure 2.
The algorithm maintains a positive weight vector whose value at time t is
denoted by w^t = (w_1^t, ..., w_N^t). If there is no prior knowledge about
the ranking experts, we set all initial weights to be equal so that w_i^1 =
1/N.

On each round t, the weight vector w^t is used to combine the preference functions
of the different experts to obtain the preference function PREF^t(u, v) =
Σ_{i=1}^N w_i^t R_i^t(u, v). This preference function is next converted into an
ordering ρ̂^t on the current set of elements X^t. For the purposes of this section,
the method of producing an ordering is immaterial; in particular, any of
the methods described in Sec. 4 could be used here. Based on this ordering,
the user provides feedback F^t, and the loss for each preference function,
Loss(R_i^t, F^t), is evaluated as in Eq. (1). Finally, the weight vector w^t
is updated using the multiplicative rule

    w_i^{t+1} = w_i^t · β^{Loss(R_i^t, F^t)} / Z_t

where β ∈ [0, 1] is a parameter, and Z_t is a normalization constant,
chosen so that the weights sum to one after the update. Thus, in each round,
the weights of ranking experts whose preference functions have relatively large
agreement with the feedback are increased.
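The round structure and the multiplicative update described above are easy to put together in code. The following is a minimal sketch (our own illustrative names, not the authors' code), reusing the loss function sketched after Eq. (1):

    # Sketch of one round of the weight-allocation algorithm of Figure 2.
    def combine(weights, expert_prefs):
        """Return PREF^t(u, v) = sum_i w_i * R_i(u, v) for the current weights."""
        return lambda u, v: sum(w * R(u, v) for w, R in zip(weights, expert_prefs))

    def hedge_update(weights, expert_prefs, feedback, beta=0.5):
        """Multiplicative update w_i <- w_i * beta**Loss(R_i, F), then renormalize."""
        new_w = [w * beta ** loss(R, feedback) for w, R in zip(weights, expert_prefs)]
        Z = sum(new_w)
        return [w / Z for w in new_w]

On each round one would call combine to build PREF^t, order the instances with one of the methods of Sec. 4, collect the feedback F^t, and then call hedge_update to obtain w^{t+1}.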

We now give the theoretical rationale behind this algorithm. Freund and Schapire
(1997) prove general results about Hedge which can be applied directly to
this loss function. Their results imply almost immediately a bound on the
cumulative loss of the preference functions PREF^t in terms of the loss of
the best ranking expert, specifically:

Algorithm: Allocate Weights for Ranking Experts
Parameters: β ∈ [0, 1]; initial weight vector w^1 ∈ [0, 1]^N with Σ_{i=1}^N w_i^1 = 1;
N ranking experts; number of rounds T.

Do for t = 1, 2, ..., T:

1. Receive a set of elements X^t and ordering functions f_1^t, ..., f_N^t. Let R_i^t
   denote the preference function induced by f_i^t.

2. Compute a total order ρ̂^t which approximates

       PREF^t(u, v) = Σ_{i=1}^N w_i^t R_i^t(u, v).

   (Sec. 4 describes several ways of approximating a preference function with a total order.)

3. Order X^t using ρ̂^t.

4. Receive feedback F^t from the user.

5. Evaluate losses Loss(R_i^t, F^t) as defined in Eq. (1).

6. Set the new weight vector

       w_i^{t+1} = w_i^t · β^{Loss(R_i^t, F^t)} / Z_t

   where Z_t is a normalization constant, chosen so that Σ_{i=1}^N w_i^{t+1} = 1.

Figure 2: The on-line weight allocation algorithm.

Theorem 1 For the algorithm of Fig. 2,

    Σ_{t=1}^T Loss(PREF^t, F^t) ≤ a_β min_i Σ_{t=1}^T Loss(R_i^t, F^t) + c_β ln N

where a_β = ln(1/β)/(1 − β) and c_β = 1/(1 − β).

Note that Σ_t Loss(PREF^t, F^t) is the cumulative loss of the combined preference
functions PREF^t, and Σ_t Loss(R_i^t, F^t) is the cumulative loss of the i-th
ranking expert. Thus, Theorem 1 states that the cumulative loss of the combined
preference functions will not be much worse than that of the best ranking
expert.

Proof: We have that

    Loss(PREF^t, F^t) = 1 − (1/|F^t|) Σ_{(u,v)∈F^t} Σ_i w_i^t R_i^t(u, v)

                      = Σ_i w_i^t ( 1 − (1/|F^t|) Σ_{(u,v)∈F^t} R_i^t(u, v) )

                      = Σ_i w_i^t Loss(R_i^t, F^t).

Therefore, by Freund and Schapire's (1997) Theorem 2,

    Σ_{t=1}^T Loss(PREF^t, F^t) = Σ_{t=1}^T Σ_i w_i^t Loss(R_i^t, F^t)
                                ≤ a_β min_i Σ_{t=1}^T Loss(R_i^t, F^t) + c_β ln N.    □

Of course, we are not interested in the loss of PREF^t (since it is not an
ordering), but rather in the performance of the actual ordering ρ̂^t computed
by the learning algorithm. Fortunately, the losses of these can be related
using a kind of triangle inequality. Let

    DISAGREE(ρ, PREF) = Σ_{u,v: ρ(u)>ρ(v)} (1 − PREF(u, v)).    (2)

Theorem 2 For any PREF, F and total order defined by an ordering function ρ,

    Loss(R_ρ, F) ≤ DISAGREE(ρ, PREF)/|F| + Loss(PREF, F).    (3)

Proof: For x, y ∈ [0, 1], let us define d(x, y) = x(1 − y) + y(1 − x). We now
show that d satisfies the triangle inequality. Let x, y and z be in [0, 1],
and let X, Y and Z be independent Bernoulli ({0, 1}-valued) random variables
with probability of outcome 1 equal to x, y and z, respectively. Then

    d(x, z) = Pr[X ≠ Z]
            = Pr[(X ≠ Y ∧ Y = Z) ∨ (X = Y ∧ Y ≠ Z)]
            ≤ Pr[X ≠ Y ∨ Y ≠ Z]
            ≤ Pr[X ≠ Y] + Pr[Y ≠ Z]
            = d(x, y) + d(y, z).

For [0, 1]-valued functions f, g defined on X × X, we next define

    D(f, g) = Σ_{u,v: u≠v} d(f(u, v), g(u, v)).

Clearly, D also satisfies the triangle inequality.

Let χ_F be the characteristic function of F, so that χ_F : X × X → {0, 1}
and χ_F(u, v) = 1 if and only if (u, v) ∈ F. Then from the definition
of Loss and DISAGREE, we have

    |F| Loss(R_ρ, F) = D(R_ρ, χ_F)
                     ≤ D(R_ρ, PREF) + D(PREF, χ_F)
                     = DISAGREE(ρ, PREF) + |F| Loss(PREF, F).    □

Notice that the learning algorithm Hedge minimizes the second term on the
right hand side of Eq. (3). Below, we consider the problem of finding an
ordering ρ which minimizes the first term, namely, DISAGREE.

4. Ordering Instances with a Preference Function

4.1 Measuring the Quality of an Ordering

We now consider the complexity of
finding a total order that agrees best with a learned preference function.
To analyze this, we must first quantify the notion of agreement between a
preference function PREF and an ordering. One natural notion is the following:
Let X be a set, PREF be a preference function, and let ρ be a total ordering
of X, expressed again as an ordering function (i.e., ρ(u) > ρ(v) if and
only if u is above v in the order). For the analysis of this section, it
is convenient to use the measure AGREE(ρ, PREF), which is defined to be
the sum of PREF(u, v) over all pairs u, v such that u is ranked above v by
ρ:

    AGREE(ρ, PREF) = Σ_{u,v: ρ(u)>ρ(v)} PREF(u, v).    (4)

Clearly, AGREE is a linear transformation of the measure DISAGREE introduced
in Eq. (2), and hence maximizing AGREE is equivalent to minimizing DISAGREE.
This definition is also closely related to similarity metrics used in decision
theory and information processing (Kemeny & Snell, 1962; Fishburn, 1970;
Roberts, 1979; French, 1989; Yao, 1995) (see the discussion in Sec. 6).
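For reference, AGREE (and, via Eq. (2), DISAGREE) can be computed directly from an ordering function and a preference function. The sketch below uses illustrative names of our own and assumes the instances are given as a finite collection:

    # Sketch of the agreement measures of Eqs. (2) and (4).
    def agree(rho, pref, instances):
        """Sum of PREF(u, v) over pairs with rho(u) > rho(v)."""
        return sum(pref(u, v) for u in instances for v in instances
                   if u != v and rho(u) > rho(v))

    def disagree(rho, pref, instances):
        """Sum of 1 - PREF(u, v) over pairs with rho(u) > rho(v)."""
        return sum(1.0 - pref(u, v) for u in instances for v in instances
                   if u != v and rho(u) > rho(v))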

4.2 Finding an Optimal Ordering is Hard

Ideally one would like to find a
ρ that maximizes AGREE(ρ, PREF). The general optimization problem is of
little interest in our setting, since there are many constraints on the preference
function that are imposed by the learning algorithm. Using the learning algorithm
of Sec. 3, for instance, PREF will always be a linear combination of simpler
functions. However, the theorem below shows that this optimization problem
is NP-complete even if PREF is restricted to be a linear combination of well-behaved
preference functions. In particular, the problem is NP-complete even if all
the primitive preference functions used in the linear combination are rank
orderings which map into a set S with only three elements, one of which may
or may not be ⊥. (Clearly, if S consists of more than three elements then
the problem is still hard.)

Theorem 3 The following decision problem is NP-complete for any set S with |S| ≥ 3:

Input: A rational number κ; a set X; a collection of N ordering functions
f_i : X → S; and a preference function PREF defined as

    PREF(u, v) = Σ_{i=1}^N w_i R_{f_i}(u, v)    (5)

where w = (w_1, ..., w_N) is a rational weight vector in [0, 1]^N with Σ_{i=1}^N w_i = 1.

Question: Does there exist a total order ρ such that AGREE(ρ, PREF) ≥ κ?

Proof: The problem is clearly in NP since a nondeterministic algorithm can
guess a total order and check the weighted number of agreements in polynomial
time.

To prove that the problem is NP-hard we reduce from CYCLIC-ORDERING (Galil
& Megiddo, 1977; Garey & Johnson, 1979), defined as follows: "Given a set A
and a collection C of ordered triples (a, b, c) of distinct elements
from A, is there a one-to-one function f : A → {1, 2, ..., |A|} such that
for each (a, b, c) ∈ C we have either f(a) > f(b) > f(c) or f(b) > f(c) > f(a)
or f(c) > f(a) > f(b)?"

Without loss of generality, S is either {0, 1, ⊥} or {0, 1, 2}. We first
show that the problem of finding an optimal total order is hard when S =
{0, 1, ⊥}. Given an instance of CYCLIC-ORDERING, we let X = A. For each triplet
t = (a, b, c) we will introduce three ordering functions f_{t,1}, f_{t,2}, and
f_{t,3}, and define them so that f_{t,1}(a) > f_{t,1}(b), f_{t,2}(b) > f_{t,2}(c),
and f_{t,3}(c) > f_{t,3}(a). To do this, we let f_{t,1}(a) = f_{t,2}(b) = f_{t,3}(c) = 1,
f_{t,1}(b) = f_{t,2}(c) = f_{t,3}(a) = 0, and f_{t,i}(·) = ⊥ in all other cases.
We let the weight vector be uniform, so that w_{t,i} = 1/(3|C|). Let

    κ = 5/3 + (|A|(|A| − 1)/2 − 3)/2.

Define R_t(u, v) = Σ_{i=1}^3 w_{t,i} R_{f_{t,i}}(u, v), which is the contribution of
these three functions to PREF(u, v). Notice that for any triplet t = (a, b, c) ∈ C,
R_t(a, b) = 2/(3|C|) whereas R_t(b, a) = 1/(3|C|), and similarly for b, c and c, a.
In addition, for any pair u, v ∈ A such that at least one of them does not appear
in t, we get that R_t(u, v) = 1/(2|C|). Since a total order ρ can satisfy at most
two of the three conditions ρ(a) > ρ(b), ρ(b) > ρ(c), and ρ(c) > ρ(a), the
largest possible weighted number of agreements associated with this triple
is exactly κ/|C|.

If the number of weighted agreements is at least κ, it must be exactly κ,
by the argument above; and if there are exactly κ weighted agreements, then
the total order must satisfy exactly 2 out of the possible 3 relations for
each three elements that form a triplet from C. Thus, the constructed rank
ordering instance will be positive if and only if the original CYCLIC-ORDERING
instance is positive.

The case for S = {0, 1, 2} uses a similar construction; however, for each
triplet t = (a, b, c), we define six ordering functions, f_{t,1}^j, f_{t,2}^j,
and f_{t,3}^j, where j ∈ {0, 1}. The basic idea here is to replace each f_{t,i}
with two functions, f_{t,i}^0 and f_{t,i}^1, that agree on the single ordering
constraint associated with f_{t,i}, but disagree on all other orderings. For
instance, we will define these functions so that f_{t,1}^j(a) > f_{t,1}^j(b) for
j = 0 and j = 1, but for all other pairs u, v, f_{t,1}^1(u) > f_{t,1}^1(v) iff
f_{t,1}^0(v) > f_{t,1}^0(u). Averaging the two orderings f_{t,1}^0 and f_{t,1}^1
will thus yield the same preference expressed by the original function f_{t,1}
(i.e., a preference for a > b only).

In more detail, we let f_{t,1}^j(a) = f_{t,2}^j(b) = f_{t,3}^j(c) = 2 − j,
f_{t,1}^j(b) = f_{t,2}^j(c) = f_{t,3}^j(a) = 1 − j, and f_{t,i}^j(·) = 2j in all other
cases. We again let the weight vector be uniform, so that w_{t,i}^j = 1/(6|C|).
Similar to the first case, we define R_t(u, v) = Σ_{i,j} w_{t,i}^j R_{f_{t,i}^j}(u, v).
It can be verified that R_t is identical to the R_t constructed in the first case.
Therefore, by the same argument, the constructed rank ordering instance will be
positive if and only if the original CYCLIC-ORDERING instance is positive.    □

Although this problem is hard when |S| ≥ 3, the next theorem shows that it
becomes tractable for linear combinations of rank orderings into a set S
of size two. Of course, when |S| = 2, the rank orderings are really only
binary classifiers. The fact that this special case is tractable underscores
the fact that manipulating orderings (even relatively simple ones) can be
computationally more difficult than performing the corresponding operations
on binary classifiers.

Theorem 4 The following optimization problem is solvable in linear time:

Input: A set X; a set S with |S| = 2; a collection of N ordering functions
f_i : X → S; and a preference function PREF defined by Eq. (5).

Output: A total order defined by an ordering function ρ which maximizes
AGREE(ρ, PREF).

Proof: Assume without loss of generality that the two-element set S is {0, 1},
and define ρ(u) = Σ_i w_i f_i(u). We now show that any total order1 consistent
with ρ maximizes AGREE(ρ, PREF). Fix a pair u, v ∈ X and let

    q_{b1 b2} = Σ_{i s.t. f_i(u)=b1, f_i(v)=b2} w_i.

We can now rewrite ρ and PREF as

    ρ(u) = q_10 + q_11        PREF(u, v) = q_10 + (1/2) q_11 + (1/2) q_00
    ρ(v) = q_01 + q_11        PREF(v, u) = q_01 + (1/2) q_11 + (1/2) q_00.

Note that both ρ(u) − ρ(v) and PREF(u, v) − PREF(v, u) are
equal to q_10 − q_01. Hence, if ρ(u) > ρ(v) then PREF(u, v) > PREF(v, u).
Therefore, the order defined by ρ agrees with the pairwise preference defined
by PREF on every pair u, v ∈ X. In other words, we have shown that

    AGREE(ρ, PREF) = Σ_{{u,v}} max{PREF(u, v), PREF(v, u)}    (6)

where the sum is over all unordered pairs. Clearly, the right hand side of
Eq. (6) maximizes the right hand side of Eq. (4) since at most one of (u, v)
or (v, u) can be included in the latter sum.    □

4.3 Finding an Approximately Optimal Ordering

Theorem 3 implies that we are
unlikely to find an efficient algorithm that finds the optimal total order
for a weighted combination of rank orderings. Fortunately, there do exist
efficient algorithms for finding an approximately optimal total order. In
fact, finding a good total order is closely related to the problem of finding
the minimum feedback arc set, for which there exist good approximation algorithms;
see, for instance, (Shmoys, 1997) and the references therein. However, the
algorithms that achieve the good approximation results for the minimum feedback
arc set problem are based on (or further approximate) a linear-programming
relaxation (Seymour, 1995; Even, Naor, Rao, & Schieber, 1996; Berger & Shor,
1997; Even, Naor, Schieber, & Sudan, 1998) which is rather complex to implement
and quite slow in practice.

1. Notice that in case of a tie, so that ρ(u) = ρ(v) for distinct u, v,
ρ defines only a partial order. The theorem holds for any total order which
is consistent with this partial order, i.e., for any ρ′ so that
ρ(u) > ρ(v) ⇒ ρ′(u) > ρ′(v).

Algorithm Greedy-Order
Inputs: an instance set X; a preference function PREF
Output: an approximately optimal ordering function ρ̂

    let V = X
    for each v ∈ V do π(v) = Σ_{u∈V} PREF(v, u) − Σ_{u∈V} PREF(u, v)
    while V is non-empty do
        let t = arg max_{u∈V} π(u)
        let ρ̂(t) = |V|
        V = V − {t}
        for each v ∈ V do π(v) = π(v) + PREF(t, v) − PREF(v, t)
    endwhile

Figure 3: The greedy ordering algorithm.

We describe instead a simple greedy algorithm which is very easy to implement;
Figure 3 summarizes it. As we will shortly demonstrate, this algorithm produces
a good approximation to the best total order.

The algorithm is easiest to describe by thinking of PREF as a directed weighted
graph, where initially the set of vertices V is equal to the set of instances
X, and each edge u → v has weight PREF(u, v). We assign to each vertex v ∈ V
a potential value π(v), which is the weighted sum of the outgoing edges
minus the weighted sum of the ingoing edges. That is,

    π(v) = Σ_{u∈V} PREF(v, u) − Σ_{u∈V} PREF(u, v).

The greedy algorithm then picks some node t that has maximum potential2,
and assigns it a rank by setting ρ̂(t) = |V|, effectively ordering it ahead
of all the remaining nodes. This node, together with all incident edges,
is then deleted from the graph, and the potential values π of the remaining
vertices are updated appropriately. This process is repeated until the graph
is empty. Notice that nodes removed in subsequent iterations will have progressively
smaller and smaller ranks.
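A direct implementation of Greedy-Order is short; the sketch below follows Figure 3 (the function and variable names are ours), representing PREF as a function over pairs of instances:

    # Sketch of Algorithm Greedy-Order (Figure 3).
    def greedy_order(instances, pref):
        """Return a dict mapping each instance to its rank; larger rank = earlier."""
        V = set(instances)
        # initial potentials: outgoing weight minus incoming weight
        pi = {v: sum(pref(v, u) - pref(u, v) for u in V if u != v) for v in V}
        rho_hat = {}
        while V:
            t = max(V, key=lambda v: pi[v])      # node with maximum potential
            rho_hat[t] = len(V)                  # rank it ahead of remaining nodes
            V.remove(t)
            for v in V:                          # update potentials after deleting t
                pi[v] += pref(t, v) - pref(v, t)
        return rho_hat

Sorting the instances by decreasing rank then gives the greedy total order (for the preference function of Fig. 4, this is b > d > c > a, as in the worked example below).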

As an example, consider the preference function defined by the leftmost graph
of Fig. 4. (This graph is identical to the weighted combination of the two
ordering functions from Fig. 1.) The initial potentials the algorithm assigns
are: π(b) = 2, π(d) = 3/2, π(c) = −5/4, and π(a) = −9/4.
Hence, b has maximal potential. It is given a rank of 4, and then node b
and all incident edges are removed from the graph.

The result is the middle graph of Fig. 4. After deleting b, the potentials
of the remaining nodes are updated: π(d) = 3/2, π(c) = −1/4, and
π(a) = −5/4. Thus, d will be assigned rank |V| = 3 and removed from
the graph, resulting in the rightmost graph of Fig. 4.

After updating potentials again, π(c) = 1/2 and π(a) = −1/2. Now
c will be assigned rank |V| = 2 and removed, resulting in a graph containing
the single node a, which will finally be assigned the rank |V| = 1. The ordering
produced by the greedy algorithm is thus b > d > c > a.

2. Ties can be broken arbitrarily in case of two or more nodes with the same potential.

Figure 4: Behavior of the greedy ordering algorithm. The leftmost graph is the
original input. From this graph, node b will be assigned maximal rank and deleted,
leading to the middle graph; from this graph, node d will be deleted, leading
to the rightmost graph. In the rightmost graph, node c will be ranked ahead
of node a, leading to the total ordering b > d > c > a.

The next theorem shows that this greedy algorithm comes within a factor of
two of optimal.

Theorem 5 Let OPT(PREF) be the weighted agreement achieved by an optimal
total order for the preference function PREF, and let APPROX(PREF) be the
weighted agreement achieved by the greedy algorithm. Then

    APPROX(PREF) ≥ (1/2) OPT(PREF).

Proof: Consider the edges that are incident on the node v_j which is selected
on the j-th repetition of the while loop of Figure 3. The ordering produced by
the algorithm will agree with all of the outgoing edges of v_j and disagree with
all of the ingoing edges. Let a_j be the sum of the weights of the outgoing edges
of v_j, and d_j be the sum of the weights of the ingoing edges. Clearly
APPROX(PREF) ≥ Σ_{j=1}^{|V|} a_j. However, at every repetition, the total weight
of all incoming edges must equal the total weight of all outgoing edges. This
means that Σ_{v∈V} π(v) = 0, and hence for the node v* that has maximal potential,
π(v*) ≥ 0. Thus on every repetition j, it must be that a_j ≥ d_j, so we have that

    OPT(PREF) ≤ Σ_{j=1}^{|V|} (a_j + d_j) ≤ Σ_{j=1}^{|V|} (a_j + a_j) ≤ 2 · APPROX(PREF).

The first inequality holds because OPT(PREF) can at best include every edge
in the graph, and since every edge is removed exactly once, each edge must
contribute to some a_j or some d_j.    □

Figure 5: An example of a graph (left) for which the node-based greedy algorithm
achieves an approximation factor of 1/2 by constructing the partial order on the right.

In passing, we note that there are other natural greedy algorithms that do
not achieve good approximations. Consider, for example, an algorithm that
starts from a graph consisting of all the nodes but with no edges, and iteratively
adds the highest weighted edge in the graph, while avoiding cycles. It can
be shown that this algorithm can produce a very poor partial order, given
an adversarially chosen graph; there are cases where the optimal total order
achieves a multiplicative factor of O(|V|) more weighted agreements than
this "edge-based" greedy algorithm.

4.4 Improvements to the Greedy Algorithm

The approximation factor of two
given in Theorem 5 is tight. That is, there exist problems for which the
greedy algorithm's approximation is worse than the optimal solution by a factor
arbitrarily close to two. Consider the graph shown on the left-hand side
of Fig. 5. An optimal total order ranks the instances according to their
position in the figure, left to right, breaking ties randomly, and achieves
OPT(PREF) = 2k + 2 weighted agreements. However, the greedy algorithm picks
the node labeled k + 1 first and orders all the remaining nodes randomly,
achieving as few as APPROX(PREF) = k + 2 agreements. For large k, the ratio
APPROX(PREF)/OPT(PREF) approaches 1/2.

For the graph of Figure 5, there is another simple algorithm which produces an
optimal ordering: since the graph is already a partial order, picking any
total order consistent with this partial order gives an optimal result. To
cope with problems such as the one of Figure 5, we devised an improvement
to the greedy algorithm which combines a greedy method with topological sorting.
The aim of the improvement is to find better approximations for graphs which
are composed of many strongly connected components.

As before, the modified algorithm is easiest to describe by thinking of PREF
as a weighted directed graph. Recall that for each pair of nodes u and v,
there exist two edges: one from u to v with weight PREF(u, v) and one from
v to u with weight PREF(v, u). In the modified greedy algorithm we will pre-process
the graph. For each pair of nodes, we
remove the edge with the smaller weight and set the weight of the other edge to be

    |PREF(v, u) − PREF(u, v)|.

For the special case where PREF(v, u) = PREF(u, v) = 1/2, we remove both edges.
In the reduced graph, there is at most one directed edge between each pair of
nodes. Note that the greedy algorithm would behave identically on the transformed
graph, since it is based on the weighted differences between the incoming and
outgoing edges.

Algorithm SCC-Greedy-Order
Inputs: an instance set X; a preference function PREF
Output: an approximately optimal ordering function ρ̂

    Define PREF′(u, v) = max{PREF(u, v) − PREF(v, u), 0}.
    Find the strongly connected components U_1, ..., U_k of the graph G = (V, E), where
        V = X and E = {(u, v) | PREF′(u, v) > 0}.
    Order the strongly connected components in any way consistent with the partial
    order →_scc, where U →_scc U′ iff ∃ u ∈ U, u′ ∈ U′ : (u, u′) ∈ E.
    Use algorithm Greedy-Order or full enumeration to order the instances within
    each component U_i according to PREF′.

Figure 6: The improved greedy ordering algorithm.

We next find the strongly connected components3 of the reduced graph, ignoring
(for now) the weights. One can now split the edges of the reduced graph into
two classes: inter-component edges connect nodes u and v, where u and v are
in different strongly connected components; and intra-component edges connect
nodes u and v from the same strongly connected component. It is straightforward
to verify that any optimal order agrees with all the inter-component edges.
Put another way, if there is an edge from node u to node v of two different
connected components in the reduced graph, then ρ(u) > ρ(v) for any optimal
total order ρ.

The first step of the improved algorithm is thus to totally order the strongly
connected components in some way consistent with the partial order defined
by the inter-component edges. More precisely, we pick a total ordering for
the components consistent with the partial order →_scc, defined as follows:
for components U and U′, U →_scc U′ iff there is an edge from some node
u ∈ U to some node u′ ∈ U′ in the reduced graph.

We next order the nodes within each strongly connected component, thus providing
a total order of all nodes. Here the greedy algorithm can be used. As an
alternative, in cases where a component contains only a few elements (say
at most five), one can find the optimal order between the elements of the
component by a brute-force approach, i.e., by full enumeration of all permutations.
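A compact way to realize this two-level scheme is to reuse the greedy_order function sketched earlier within each strongly connected component. The code below is our own illustrative sketch (not the authors' implementation) and uses the third-party networkx library for the strongly-connected-component step:

    # Sketch of Algorithm SCC-Greedy-Order (Figure 6); greedy_order is the
    # function sketched above, networkx is a third-party graph library.
    import networkx as nx

    def scc_greedy_order(instances, pref):
        """Return the instances as a list, most-preferred first."""
        pref2 = lambda u, v: max(pref(u, v) - pref(v, u), 0.0)   # reduced edge weights
        G = nx.DiGraph()
        G.add_nodes_from(instances)
        G.add_edges_from((u, v) for u in instances for v in instances
                         if u != v and pref2(u, v) > 0)
        cond = nx.condensation(G)        # one node per strongly connected component
        ordered = []
        for comp in nx.topological_sort(cond):    # consistent with inter-component edges
            members = cond.nodes[comp]["members"]
            ranks = greedy_order(members, pref2)  # order within the component
            ordered.extend(sorted(members, key=lambda v: ranks[v], reverse=True))
        return ordered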

3. Two nodes u and v are in the same strongly connected component iff there
are directed paths from u to v and from v to u.

Figure 7: An illustration of the approximation algorithm for finding a total order
from a weighted combination of ordering functions. The original graph (top left)
is reduced by removing at least one edge for each edge-pair (u, v) and (v, u)
(middle). The strongly connected components are then found (right). Finally, an
ordering is found within each strongly connected component, which yields the
order b > c > d > a (bottom).

The improved algorithm is summarized in Figure 6 and illustrated in Figure 7.
There are four elements in Figure 7, which constitute two strongly connected
components in the reduced graph ({b} and {a, c, d}). Therefore, b is assigned
the top rank and ranked above a, c and d. If the brute-force algorithm were
used to order the components, then we would check all 3! permutations of
a, c and d and output the total order b > c > d > a, which is the optimal
order in this toy example.

In the worst case, the reduced graph contains only a single strongly connected
component. In this case, the improved algorithm generates the same ordering
as the greedy algorithm. However, in the experiments on metasearch problems
described in Sec. 5, many of the strongly connected components are small;
the average size of a strongly connected component is less than five. In
cases such as these, the improved algorithm will often improve on the simple
greedy algorithm.

4.5 Experiments with the Ordering Algorithms

Ideally, each algorithm would
be evaluated by determining how closely it approximates the optimal ordering
on large, realistic problems. Unfortunately, finding the optimal ordering
for large graphs is impractical. We thus performed two sets of experiments
with the ordering algorithms described above. In the first set of experiments,
we evaluated the algorithms on small graphs--specifically, graphs for which
the optimal ordering could be feasibly found with brute-force enumeration.
In these experiments, we measure the "goodness" of the resulting orderings
relative to the optimal ordering. In the second set of experiments, we evaluated
the algorithms on large graphs for which the optimal orderings are unknown.
In these experiments, we compute a "goodness" measure which depends on the
total weight of all edges, rather than the optimal ordering.

In addition to the simple greedy algorithm and
its improvement, we also considered the following simple randomized algorithm:
pick a permutation at random, and then output the better of that permutation
and its reverse. It can be easily shown that this algorithm achieves the
same approximation bound on expected performance as the greedy algorithm.
(Briefly, one of the two permutations must agree with at least half of the
weighted edges in the graph.) The random algorithm can be improved by repeating
the process, i.e., examining many random permutations and their reverses,
and choosing the permutation that achieves the largest number of weighted
agreements.
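For completeness, the randomized baseline is equally short to sketch (again with our own illustrative names, reusing the agree function from the sketch in Sec. 4.1):

    # Sketch of the randomized baseline: try random permutations and their reverses,
    # keep whichever achieves the most weighted agreement.
    import random

    def randomized_order(instances, pref, trials=10):
        """Return the best ordering (as a list, most-preferred first) found among
        `trials` random permutations and their reverses."""
        def score(perm):
            rho = {v: len(perm) - i for i, v in enumerate(perm)}   # earlier = higher rank
            return agree(lambda u: rho[u], pref, perm)
        best, best_score = None, -1.0
        for _ in range(trials):
            perm = list(instances)
            random.shuffle(perm)
            for candidate in (perm, perm[::-1]):
                s = score(candidate)
                if s > best_score:
                    best, best_score = candidate, s
        return best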

In a first set of experiments, we compared the performance of the greedy
approximation algorithm, the improved algorithm which first finds strongly
connected components, and the randomized algorithm on graphs of nine or fewer
elements. For each number of elements, we generated 10,000 random graphs
by choosing PREF(u, v) uniformly at random, and setting PREF(v, u) to
1 − PREF(u, v). For the randomized algorithm, we evaluated 10n random permutations
(and their reverses), where n is the number of instances (nodes). To have
a fair comparison between the different algorithms on the smaller graphs,
we always used the greedy algorithm (rather than a brute-force algorithm)
to order the elements of each strongly connected component of a graph.

To evaluate the algorithms, we examined the reduced graph and calculated
the average ratio of the weights of the edges chosen by the approximation
algorithm to the weights of the edges that were chosen by the optimal order.
More precisely, let ρ be the optimal order and ρ̂ be an order chosen by
an approximation algorithm. Then for each random graph, we calculated

    Σ_{u,v: ρ̂(u)>ρ̂(v)} max{PREF(u, v) − PREF(v, u), 0}
    ----------------------------------------------------
    Σ_{u,v: ρ(u)>ρ(v)} max{PREF(u, v) − PREF(v, u), 0}.

If this measure is 0.9, for instance, then the total weight of the edges
in the total order picked by the approximation algorithm is 90% of the corresponding
figure for the optimal algorithm.

We averaged the above ratios over all random graphs of the same size. The
results are shown on the left hand side of Figure 8. On the right hand side
of the figure, we show the average running time for each of the algorithms
as a function of the number of elements. When the number of ranked elements
is more than five, the greedy algorithms outperform the randomized algorithm,
while their running time is much smaller. Thus, if a full enumeration had
been used to find the optimal order of small strongly connected components,
the approximation would have been consistently better than the randomized
algorithm.

We note that the greedy algorithm also generally performs better on average
than the lower bound given in Theorem 5. In fact, combining the greedy algorithm
with prepartitioning of the graph into strongly connected components often
yields the optimal order.

In the second set of experiments, we measured performance and running time
for larger random graphs. Since for large graphs we cannot find the optimal
solution by brute-force enumeration, we use as a "goodness" measure the ratio
of the weights of the edges that were left in the reduced graph after applying
an approximation algorithm to the total weight of

Figure 8: Comparison of the goodness (left) and the running time (right) of the
approximations achieved by the greedy algorithms and the randomized algorithm,
as a function of the number of ranked elements, for random preference functions
with 3 through 9 elements.

Figure 9: Comparison of the goodness (left) and the running time (right) of the
approximations achieved by the greedy algorithms and the randomized algorithm,
as a function of the number of ranked elements, for random preference functions
with 3 through 30 elements. Note that the graphs for Greedy and SCC+Greedy coincide
for most of the points.

edges in the graph. That is, for each random graph we calculated

    Σ_{u,v: ρ̂(u)>ρ̂(v)} max{PREF(u, v) − PREF(v, u), 0}
    ----------------------------------------------------
    Σ_{u,v} max{PREF(u, v) − PREF(v, u), 0}.

We ran the three algorithms with the same parameters as above (i.e., 10,000
random graphs). The results are given in Figure 9. The advantage of the greedy
algorithms over the randomized algorithm is even more apparent on these larger
problems. Note also that for large graphs the performance of the two greedy
algorithms is indistinguishable. This is mainly due to the fact that large
random graphs are strongly connected with high probability.

To summarize the experiments, when there are six or more elements the greedy
algorithm clearly outperforms the randomized algorithm even if many randomly
chosen permutations are examined. Furthermore, the improved algorithm which
first finds the strongly connected components outperforms the randomized
algorithm for all graph sizes. In practice the improved greedy algorithm
achieves very good approximations--within about 5 percent of optimal, for
the cases in which optimal graphs can be feasibly found.

5. Experimental Results for Metasearch

So far, we have described a method
for learning a preference function, and a means of converting a preference
function into an ordering of new instances. We will now present some experimental
results in learning to order. In particular, we will describe results on
learning to combine the orderings of several web "search experts" using the
algorithm of Figure 2 to learn a preference function, and the simple greedy
algorithm to order instances using the learned preference function. The goals
of these experiments are to illustrate the type of problems that can be solved
with our method; to empirically evaluate the learning method; to evaluate
the ordering algorithm on large, non-random graphs, such as might arise in
a realistic application; and to confirm the theoretical results of the preceding
sections. We thus restrict ourselves to comparing the learned orderings to
individual search experts, as is suggested by Theorem 1, rather than attempt
to compare this application of learning-to-order with previous experimental
techniques for metasearch, e.g., (Lochbaum & Streeter, 1989; Kantor, 1994;
Boyan, Freitag, & Joachims, 1994; Bartell, Cottrell, & Belew, 1994).

We note that this metasearch problem exhibits several properties that suggest
a general approach such as ours. For instance, approaches that learn to combine
similarity scores are not applicable, since the similarity scores of web
search engines are often unavailable. In the experiments presented here,
the learning algorithm was provided with ordered lists for each search engine
without any associated scores. To further demonstrate the merits of our approach,
we also describe experiments with partial feedback--that is, with preference
judgments that are less informative than the relevance judgments more typically
used in improving search engines.

ML Search Experts:
    NAME
    "NAME"
    title:"NAME"
    NAME +LASTNAME title:"home page"
    NAME +LASTNAME title:homepage
    NAME +LASTNAME machine learning
    NAME +LASTNAME "machine learning"
    NAME +LASTNAME case based reasoning
    NAME +LASTNAME "case based reasoning"
    NAME +LASTNAME PLACE
    NAME +LASTNAME "PLACE"
    NAME +LASTNAME url:index.html
    NAME +LASTNAME url:home.html
    NAME +LASTNAME url:~*LASTNAME*
    NAME +LASTNAME url:~LASTNAME
    NAME +LASTNAME url:LASTNAME

UNIV Search Experts:
    NAME
    "NAME"
    "NAME" PLACE
    title:NAME
    title:"NAME"
    title:"NAME" PLACE
    NAME title:"home page"
    NAME title:"homepage"
    NAME welcome
    NAME url:index.html
    NAME url:home.html
    "NAME" title:"home page"
    "NAME" title:"homepage"
    "NAME" welcome
    "NAME" url:index.html
    "NAME" url:home.html
    "NAME" PLACE title:"home page"
    "NAME" PLACE title:"homepage"
    "NAME" PLACE welcome
    "NAME" PLACE url:index.html
    "NAME" PLACE url:home.html

Table 1: Search (and ranking) experts used in the metasearch experiments.
In the associated queries, NAME is replaced with the person's (or university's)
full name, LASTNAME with the person's last name, and PLACE with the person's
affiliation (or the university's location). Sequences of words enclosed
in quotes must appear as a phrase, and terms prefixed by title: and url:
must appear in that part of the web page. Words prefixed by a "+" must appear
in the web page; other words may or may not appear.

5.1 Test Problems and Encoding

We chose to simulate the problem of learning
a domain-specific search engine--i.e., an engine that searches for pages
of a particular, narrow type. Ahoy! (Shakes, Langheinrich, & Etzioni, 1997)
is one instance of such a domain-specific search engine. As test cases, we
picked two problems: retrieving the home pages of machine learning researchers
(ML), and retrieving the home pages of universities (UNIV). To obtain sample
queries, we obtained a listing of machine learning researchers, identified
by name and affiliated institution, together with their home pages,4 and
a similar list for universities, identified by name and (sometimes) geographical
location.5 Each entry on a list was viewed as a query, with the associated
URL the sole relevant web page.

4. From http://www.aic.nrl.navy.mil/~aha/research/machine-learning.html, a list
maintained by David Aha.

5. From Yahoo!

We then constructed a series of special-purpose
"search experts" for each domain. These were implemented as query expansion
methods which converted a name/affiliation pair (or a name/location pair)
to a likely-seeming Altavista query. For example, one expert for the UNIV
domain searched for the university name appearing as a phrase, together with
the phrase "home page" in the title; another expert for the ML domain searched
for all the words in the person's name plus the words "machine" and "learning,"
and further enforces a strict requirement that the person's last name appear.
Overall, we defined 16 search experts for the ML domain and 22 for the UNIV
domain; these are summarized in Table 1. Each search expert returned the
top 30 ranked web pages. In the ML domain, there were 210 searches for which
at least one search expert returned the named home page; for the UNIV domain,
there were 290 such searches. The task of the learning system is to find
an appropriate way of combining the output of these search experts.

To give a more precise description of the search experts, for each query
t, we first constructed the set X^t consisting of all web pages returned by
all of the expanded queries defined by the search experts. Next, each search
expert i was represented as a preference function R_i^t. We chose these preference
functions to be rank orderings defined with respect to an ordering function
f_i^t in the natural way: we assigned a rank of f_i^t = 30 to the first listed
page, f_i^t = 29 to the second-listed page, and so on, finally assigning a
rank of f_i^t = 0 to every page not retrieved in the top 30 by the expanded
query associated with expert i.

To encode feedback, we considered two schemes. In the first, we simulated
complete relevance feedback--that is, for each query, we constructed feedback
in which the sole relevant page was preferred to all other pages. In the
second, we simulated the sort of feedback that could be collected from "click
data"--i.e., from observing a user's interactions with a metasearch system.
For each query, after presenting a ranked list of pages, we noted the rank
of the one relevant web page. We then constructed a feedback ranking in which
the relevant page is preferred to all preceding pages. This would correspond
to observing which link the user actually followed, and making the assumption
that this link was preferred to previous links.

It should be emphasized that both of these forms of feedback are simulated,
and contain less noise than would be expected from real user data. In reality
some fraction of the relevance feedback would be missing or erroneous, and
some fraction of click data would not satisfy the assumption stated above.

5.2 Evaluation and Results

To evaluate the expected performance of a fully-trained
system on novel queries in this domain, we employed leave-one-out testing.
For each query t, we trained the learning system on all the other queries,
and then recorded the rank of the learned system on query t. For complete
relevance feedback, this rank is invariant of the ordering of the training
examples, but for the "click data" feedback, it is not; the feedback collected
at each stage depends on the behavior of the partially learned system, which
in turn depends on the previous training examples. Thus for click data training,
we trained on 100 randomly chosen permutations of the training data and recorded
the median rank for t.

5.2.1 Performance Relative to Individual Experts

The theoretical results provide a guarantee of performance relative to the
performance of the best individual search (ranking) expert. It is therefore
natural to consider comparing the performance of the learned system to the
best of the individual experts. However, for each search expert, only the
top 30 ranked web pages for a query are known; if the single relevant page
for a query is not among these top 30, then it is impossible to compute any
natural measures of performance for this query. This complicates any comparison
of the learned system to the individual search experts.

However, in spite of the incomplete information about the performance of
the search experts, it is usually possible to tell if the learned system
ranks a web page higher than a particular expert.6 Motivated by this, we
performed a sign test: we compared the rank of the learning systems to the
rank given by each search expert, checking to see whether this rank was lower,
and discarding queries for which this comparison was impossible. We then
used a normal approximation to the binomial distribution to test the following
two null hypotheses (where the probability is taken over the distribution
from which the queries are drawn):

H1. With probability at least 0.5, the search expert performs better than the learning
system (i.e., gives a lower rank to the relevant page than the learning system does).

H2. With probability at least 0.5, the search expert performs no worse than the learning
system (i.e., gives an equal or lower rank to the relevant page).

In training, we explored learning rates in the range [0.001, 0.999]. For
complete feedback in the ML domain, hypothesis H1 can be rejected with high
confidence (p > 0.999) for every search expert and every learning rate 0.01
≤ β ≤ 0.99. The same holds in the UNIV domain for all learning rates 0.02
≤ β ≤ 0.99. The results for click data training are nearly as strong, except
that 2 of the 22 search experts in the UNIV domain show a greater sensitivity
to the learning rate: for these engines, H1 can only be rejected with high
confidence for 0.3 ≤ β ≤ 0.6. To summarize, with high confidence, in both
domains, the learned ranking system is no worse than any individual search
expert for moderate values of β.

Hypothesis H2 is more stringent, since it can be rejected only if we are sure that the learned system is strictly better than the expert. With complete feedback in the ML domain and 0.3 ≤ β ≤ 0.8, hypothesis H2 can be rejected with confidence p > 0.999 for 14 of the 16 search experts. For the remaining two experts the learned system does perform better more often, but the difference is not significant. In the UNIV domain, the results are similar. For 0.2 ≤ β ≤ 0.99, hypothesis H2 can be rejected with confidence p > 0.999 for 21 of the 22 search experts, and the learned engine tends to perform better than the single remaining expert.

Again, the results for click data training are only slightly weaker. In the ML domain, hypothesis H2 can be rejected for all but three experts at all but the most extreme learning rates; in the UNIV domain, hypothesis H2 can be rejected for all but two experts for 0.4 ≤ β ≤ 0.6. For the remaining experts and learning rates the differences are not statistically significant; however, it is not always the case that the learned engine tends to perform better.

6. The only time this cannot be determined is when neither the learned system nor the expert ranks the relevant web page in the top 30, a case of little practical interest.

To summarize the experiments, for moderate values of β the learned system is, with high confidence, strictly better than most of the search experts in both domains, and never significantly worse than any expert. When trained with full relevance judgments, the learned system performs better on average than any individual expert.

5.2.2 Other Performance Measures

We measured the number of queries for which the correct web page was in the top k ranked pages, for various values of k. These results are shown in Figure 10. Here the lines show the performance of the learned systems (with β = 0.5, a generally favorable learning rate) and the points correspond to the individual experts. In most cases, the learned system closely tracks the performance of the best expert at every value of k. This is especially interesting since no single expert is best at all values of k.

The final graph in this figure investigates the sensitivity of this measure to the learning rate β. As a representative illustration, we varied β in the ML domain and plotted the top-k performance of the system learned from complete feedback for three values of k. Note that performance is roughly comparable over a wide range of values for β.

Another plausible measure of performance is the average rank of the (single) relevant web page. We computed an approximation to average rank by artificially assigning a rank of 31 to every page that was either unranked or ranked worse than 30. (The latter case is to be fair to the learned system, which is the only one for which a rank greater than 30 is possible.) A summary of these results for β = 0.5 is given in Table 2, together with some additional
data on top-k performance. In the table, we give the top-k performance for
three values of k, and average rank for several ranking systems: the two
learned systems; the naive query, i.e., the person or university's name;
and the single search expert that performed best with respect to each performance
measure. Note that not all of these experts are distinct since several experts
scored the best on more than one measure.
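The two summary statistics reported in Table 2 are simple to state precisely; the sketch below (with toy data chosen only for illustration) spells out the top-k count and the truncated average rank in which any unranked page, or any page ranked worse than 30, is charged a rank of 31.

    def top_k_count(ranks, k):
        """ranks[q] is the rank of the relevant page for query q,
        or None if it did not appear in the ranking at all."""
        return sum(1 for r in ranks.values() if r is not None and r <= k)

    def truncated_average_rank(ranks, cutoff=30):
        charged = [min(r, cutoff + 1) if r is not None else cutoff + 1
                   for r in ranks.values()]
        return sum(charged) / len(charged)

    # Toy data for four queries.
    ranks = {"q1": 1, "q2": 12, "q3": None, "q4": 45}
    print(top_k_count(ranks, 10))          # 2 queries answered in the top 10
    print(truncated_average_rank(ranks))   # (1 + 12 + 31 + 31) / 4 = 18.75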

The table illustrates the robustness of the learned systems, which are nearly
always competitive with the best expert for every performance measure listed.
The only exception to this is that the system trained on click data trails
the best expert in top-k performance for small values of k. It is also worth
noting that in both domains, the naive query (simply the person or university's
name) is not very effective: even with the weaker click data feedback, the
learned system achieves a 36% decrease in average rank over the naive query
in the ML domain, and a 46% decrease in the UNIV domain.

To summarize the experiments, on these domains the learned system not only
performs much better than naive search strategies, but also consistently
performs at least as well as, and perhaps slightly better than, any single
domain-specific search expert. This observation holds regardless of the performance
metric considered; for nearly every metric we computed, the learned system matches, and usually exceeds, the performance of the search expert that is best for that metric. Finally, the performance of the learned system
is almost as good with the weaker "click data" training as with complete
relevance feedback.


[Figure 10 appears here. The top and middle panels plot, for the ML and UNIV domains respectively, the number of queries answered in the top k against k, for the learned systems (full feedback and click data) and for the individual rankers. The bottom panel plots the percentage of queries with the relevant page in the top k against the learning rate, for k = 1, 4, and 8.]

Figure 10: Top and middle: Performance of the learned system versus individual experts for two different domains. Bottom: the percentage of time the relevant web page was in the top-k list for k = 1, 4, and 8.


                          ML Domain                         University Domain
                      Top 1  Top 10  Top 30  Avg Rank   Top 1  Top 10  Top 30  Avg Rank
Learned (Full Feed.)   114    185     198      4.9       111    225     253      7.8
Learned (Click Data)    93    185     198      4.9        87    229     259      7.8
Naive                   89    165     176      7.7        79    157     191     14.4
Best (Top 1)           119    170     184      6.7       112    221     247      8.2
Best (Top 10)          114    182     190      5.3       111    223     249      8.0
Best (Top 30)           97    181     194      5.6       111    223     249      8.0
Best (Avg Rank)        114    182     190      5.3       111    223     249      8.0

Table 2: Comparison of the learned systems, the naive query, and the best individual search experts.

6. Related Work

Problems that involve ordering and ranking have been investigated in various fields such as decision theory, the social sciences, information retrieval and mathematical economics (Black, 1958; Kemeny & Snell, 1962; Cooper, 1968; Fishburn, 1970; Roberts, 1979; Salton & McGill, 1983; French, 1989; Yao, 1995). Among the wealth of literature on the subject, the closest to ours appears to be the work of Kemeny and Snell (1962), which was extended by Yao (1995) and used by Balabanović and Shoham (1997) in their FAB collaborative filtering system. These works use a similar notion of ordering functions and feedback; however, they assume that both the ordering functions and the feedback are complete and transitive. Hence, it is not possible to leave elements unranked, or to give inconsistent feedback that violates the transitivity requirements. It is therefore difficult to combine and fuse inconsistent and incomplete orderings in the Kemeny and Snell model.

There are also several related intractability results. Most of them are concerned
with the difficulty in reaching consensus in voting systems based on preference
ordering. Specifically, Bartholdi, Tovey and Trick (1989) study the problem
of finding a winner in an election when the preferences of all voters are
irreflexive, antisymmetric, transitive, and complete. Thus, their setting
is more restrictive than ours. They study two similar schemes to decide on
a winner of an election. The first was invented by Dodgson (1876) (better
known by his pen name, Lewis Carroll) and the second is due to Kemeny (1959).
For both models, they show that the problem of finding a winner in an election
is NP-hard. Among these two models, the one suggested by Kemeny is the closest
to ours. However, as mentioned above, this model is more restrictive as it
does not allow voters to abstain (preferences are required to be complete)
or to be inconsistent (all preferences are transitive).

As illustrated by the experiments, the problem of learning to rank is closely
related to the problem of combining the results of different search engines.
Many methods for this have been proposed by the information retrieval community,
and many of these are adaptive, using relevance judgments to make an appropriate
choice of parameters. However, generally, rankings are combined by combining
the scores that were used to rank documents (Lochbaum & Streeter, 1989; Kantor,
1994). It is also frequently assumed that other properties of the objects
(documents) to be ranked are available, such as word frequencies. In contrast,
in our experiments, instances are atomic entities with no associated properties
except for their position in various rank-orderings. Similarly, we make minimal
assumptions about the rank-orderings--in particular, we do not assume scores are available. Our methods are thus applicable to a broader class of ranking problems.

General optimization methods have also been adopted to adjust parameters
of an IR system so as to improve agreement with a set of user-given preference
judgments. For instance, Boyan, Freitag, and Joachims (1994) use simulated
annealing to improve agreement with "click data," and Bartell, Cottrell and
Belew (1994) use conjugate gradient descent to choose parameters for a linear
combination of scoring functions, each associated with a different search
expert. Typically, such approaches offer few guarantees of efficiency, optimality,
or generalization performance.

Another related task is collection fusion. Here, several searches are executed
on disjoint subsets of a large collection, and the results are combined.
Several approaches to this problem that do not rely on combining ranking
scores have been described (Towell, Voorhees, Gupta, & Johnson-Laird, 1995;
Voorhees, Gupta, & Johnson-Laird, 1994). However, although the problem is
superficially similar to the one presented here, the assumption that the
different search engines index disjoint sets of documents actually makes
the problem quite different. In particular, since it is impossible for two
engines to give different relative orderings to the same pair of documents,
combining the rankings can be done relatively easily.

Etzioni et al. (1996) formally considered another aspect of metasearch--the
task of optimally combining information sources with associated costs and
time delays. Our formal results are disjoint from theirs, as they assume
that every query has a single recognizable correct answer, rendering ordering
issues unimportant.

There are many other applications in machine learning, reinforcement learning, neural networks, and collaborative filtering that employ ranking and preferences, e.g., (Utgoff & Saxena, 1987; Utgoff & Clouse, 1991; Caruana, Baluja, & Mitchell, 1996; Resnick & Varian, 1997). While our work does not directly address these applications, it might be possible to use the framework suggested in this paper in similar settings. This is one of our future research goals.

Finally, we would like to note that the framework and algorithms presented in this paper can be extended in several ways. Our current research focuses on efficient batch algorithms for combining preference functions, and on using restricted ranking experts for which the problem of finding an optimal total ordering can be solved in polynomial time (Freund, Iyer, Schapire, & Singer, 1998).

7. Conclusions

In many applications, it is desirable to order rather than classify instances. We investigated a two-stage approach to learning to order in which one first learns a preference function by conventional means, and then orders a new set of instances by finding the total ordering that best approximates the preference function. The preference function that is learned is a binary function PREF(u, v), which returns a measure of confidence reflecting how likely it is that u is preferred to v. This is learned from a set of "experts" which suggest specific orderings, and from user feedback in the form of assertions of the form "u should be preferred to v".

We have presented two sets of results on this problem. First, we presented
an online learning algorithm for learning a weighted combination of ranking
experts which is based on an adaptation of Freund and Schapire's Hedge
algorithm. Second, we explored the complexity of the problem of finding a
total ordering that agrees best with a preference function. We showed that
this problem is NP-complete even in a highly restrictive case, namely, preference
predicates that are linear combinations of a certain class of well-behaved
"experts" called rank orderings. However, we also showed that for any preference
predicate, there is a greedy algorithm that always obtains a total ordering
that is within a factor of two of optimal. We also presented an algorithm that first divides the set of instances into strongly connected components and then uses the greedy algorithm (or full enumeration, for small components) to find a good approximate ordering within the large strongly connected components. We found that this approximation algorithm works very well in practice and often finds the best ordering.
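For reference, the following sketch gives a greedy ordering of the flavor just described; it is an illustration written for this summary rather than a transcription of the pseudocode presented earlier, and PREF is assumed to be available as a function over pairs of instances.

    def greedy_order(instances, pref):
        """Repeatedly place the instance whose total preference for coming
        early (over the instances still unordered) most exceeds the total
        preference for it coming late."""
        remaining = set(instances)
        ordering = []
        while remaining:
            def potential(v):
                return (sum(pref(v, u) for u in remaining if u is not v)
                        - sum(pref(u, v) for u in remaining if u is not v))
            best = max(remaining, key=potential)
            ordering.append(best)
            remaining.remove(best)
        return ordering

    # Toy preference function that always prefers the smaller number.
    print(greedy_order([3, 1, 2], lambda u, v: 1.0 if u < v else 0.0))  # [1, 2, 3]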

We also presented experimental results in which these algorithms were used
to combine the results of a number of "search experts," each of which corresponds
to a domain-specific strategy for searching the web. We showed that in two
domains, the learned system closely tracks and often exceeds the performance
of the best of these search experts. These results hold both when training uses traditional relevance feedback and when it uses the weaker feedback of simulated "click data." The performance of the learned systems also clearly
exceeds the performance of more naive approaches to searching.

Acknowledgments

We would like to thank Noga Alon, Edith Cohen, Dana Ron, and Rick Vohra for numerous helpful discussions. An extended abstract of this paper appeared in Advances in Neural Information Processing Systems 10, MIT Press, 1998.

References

Balabanović, M., & Shoham, Y. (1997). FAB: Content-based, collaborative recommendation. Communications of the ACM, 40(3), 66-72.

Bartell, B., Cottrell, G., & Belew, R. (1994). Automatic combination of multiple ranked retrieval systems. In Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Bartholdi, J., Tovey, C., & Trick, M. (1989). Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 6, 157-165.

Berger, B., & Shor, P. (1997). Tight bounds for the acyclic subgraph problem. Journal of Algorithms, 25, 1-18.

Black, D. (1958). Theory of Committees and Elections. Cambridge University Press.

Boyan, J., Freitag, D., & Joachims, T. (1994). A machine learning architecture for optimizing web search engines. Tech. rep. WS-96-05, American Association for Artificial Intelligence.

Caruana, R., Baluja, S., & Mitchell, T. (1996). Using the future to 'sort out' the present: Rankprop and multitask learning for medical risk evaluation. In Advances in Neural Information Processing Systems (NIPS) 8.

Cooper, W. (1968). Expected search length: A single measure of retrieval effectiveness based on the weak ordering action of retrieval systems. American Documentation, 19, 30-41.

Dodgson, C. (1876). A method for taking votes on more than two issues. Clarendon Press, Oxford. Reprinted with discussion in (Black, 1958).

Etzioni, O., Hanks, S., Jiang, T., Karp, R. M., Madani, O., & Waarts, O. (1996). Efficient information gathering on the internet. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science (FOCS-96), Burlington, Vermont. IEEE Computer Society Press.

Even, G., Naor, J., Rao, S., & Schieber, B. (1996). Divide-and-conquer approximation algorithms via spreading metrics. In 36th Annual Symposium on Foundations of Computer Science (FOCS-96), pp. 62-71, Burlington, Vermont. IEEE Computer Society Press.

Even, G., Naor, J., Schieber, B., & Sudan, M. (1998). Approximating minimum feedback sets and multicuts in directed graphs. Algorithmica, 20(2), 151-174.

Fishburn, P. (1970). Utility Theory for Decision Making. Wiley, New York.

French, S. (1989). Decision Theory: An Introduction to the Mathematics of Rationality. Ellis Horwood Series in Mathematics and Its Applications.

Freund, Y., Iyer, R., Schapire, R., & Singer, Y. (1998). An efficient boosting algorithm for combining preferences. In Machine Learning: Proceedings of the Fifteenth International Conference.

Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.

Galil, Z., & Megiddo, N. (1977). Cyclic ordering is NP-complete. Theoretical Computer Science, 5, 179-182.

Garey, M., & Johnson, D. (1979). Computers and Intractability: A Guide to the Theory of NP-completeness. W. H. Freeman and Company, New York.

Kantor, P. (1994). Decision level data fusion for routing of documents in the TREC3 context: A best case analysis of worst case results. In Proceedings of the Third Text REtrieval Conference (TREC-3).

Kemeny, J. (1959). Mathematics without numbers. Daedalus, 88, 571-591.

Kemeny, J., & Snell, J. (1962). Mathematical Models in the Social Sciences. Blaisdell, New York.

Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4).

Littlestone, N., & Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108(2), 212-261.

Lochbaum, K., & Streeter, L. (1989). Comparing and combining the effectiveness of latent semantic indexing and the ordinary vector space model for information retrieval. Information Processing and Management, 25(6), 665-676.

Resnick, P., & Varian, H. (1997). Introduction to special section on recommender systems. Communications of the ACM, 40(3).

Roberts, F. (1979). Measurement Theory with Applications to Decision Making, Utility, and Social Sciences. Addison-Wesley, Reading, MA.

Salton, G., & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.

Seymour, P. (1995). Packing directed circuits fractionally. Combinatorica, 15, 281-288.

Shakes, J., Langheinrich, M., & Etzioni, O. (1997). Dynamic reference sifting: A case study in the homepage domain. In Proceedings of WWW6.

Shmoys, D. (1997). Cut problems and their application to divide-and-conquer. In Hochbaum, D. (Ed.), Approximation Algorithms for NP-Hard Problems. PWS Publishing Company, New York.

Towell, G., Voorhees, E., Gupta, N., & Johnson-Laird, B. (1995). Learning collection fusion strategies for information retrieval. In Machine Learning: Proceedings of the Twelfth International Conference, Lake Tahoe, California. Morgan Kaufmann.

Utgoff, P., & Clouse, J. (1991). Two kinds of training information for evaluation function learning. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), pp. 596-600, Cambridge, MA. AAAI Press/MIT Press.

Utgoff, P., & Saxena, S. (1987). Learning a preference predicate. In Proceedings of the Fourth International Workshop on Machine Learning, pp. 115-121, San Francisco, CA. Morgan Kaufmann.

Voorhees, E., Gupta, N., & Johnson-Laird, B. (1994). The collection fusion problem. In Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Yao, Y. (1995). Measuring retrieval effectiveness based on user preference of documents. Journal of the American Society for Information Science, 46(2), 133-145.