Existence of Multiagent Equilibria with Limited Agents

M. Bowling and M. Veloso

Volume 22, 2004

Links to Full Text:

Abstract:
Multiagent learning is a necessary yet challenging problem as multiagent systems become more prevalent and environments become more dynamic. Much of the groundbreaking work in this area draws on notable results from game theory, in particular, the concept of Nash equilibria. Learners that directly learn an equilibrium obviously rely on their existence. Learners that instead seek to play optimally with respect to the other players also depend upon equilibria since equilibria are fixed points for learning. From another perspective, agents with limitations are real and common. These may be undesired physical limitations as well as self-imposed rational limitations, such as abstraction and approximation techniques, used to make learning tractable. This article explores the interactions of these two important concepts: equilibria and limitations in learning. We introduce the question of whether equilibria continue to exist when agents have limitations. We look at the general effects limitations can have on agent behavior, and define a natural extension of equilibria that accounts for these limitations. Using this formalization, we make three major contributions: (i) a counterexample for the general existence of equilibria with limitations, (ii) sufficient conditions on limitations that preserve their existence, (iii) three general classes of games and limitations that satisfy these conditions. We then present empirical results from a specific multiagent learning algorithm applied to a specific instance of limited agents. These results demonstrate that learning with limitations is feasible, when the conditions outlined by our theoretical analysis hold.

Extracted Text

 

Journal of Artificial Intelligence Research 22 (2004) 353–384 Submitted 8/03; published 12/04 Existence of Multiagent Equilibria with Limited Agents Michael Bowling BOWLING@CS.UALBERTA.CA Department of Computing Science, University of Alberta Edmonton, Alberta, Canada, T6G 2E8Manuela Veloso MMV@CS.CMU.EDU Computer Science Department, Carnegie Mellon University Pittsburgh PA, 15213-3891 Abstract Multiagent learning is a necessary yet challenging problem as multiagent systems become more prevalent and environments become more dynamic. Much of the groundbreaking work in this areadraws on notable results from game theory, in particular, the concept of Nash equilibria. Learnersthat directly learn an equilibrium obviously rely on their existence. Learners that instead seek toplay optimally with respect to the other players also depend upon equilibria since equilibria arefixed points for learning. From another perspective, agents with limitations are real and common.These may be undesired physical limitations as well as self-imposed rational limitations, such asabstraction and approximation techniques, used to make learning tractable. This article explores theinteractions of these two important concepts: equilibria and limitations in learning. We introducethe question of whether equilibria continue to exist when agents have limitations. We look at thegeneral effects limitations can have on agent behavior, and define a natural extension of equilibriathat accounts for these limitations. Using this formalization, we make three major contributions: (i)a counterexample for the general existence of equilibria with limitations, (ii) sufficient conditionson limitations that preserve their existence, (iii) three general classes of games and limitations thatsatisfy these conditions. We then present empirical results from a specific multiagent learningalgorithm applied to a specific instance of limited agents. These results demonstrate that learningwith limitations is feasible, when the conditions outlined by our theoretical analysis hold. 1. Introduction Multiagent domains are becoming more prevalent as more applications and situations require multi-ple agents. Learning in these systems is as useful and important as in single-agent domains, possiblymore so. Optimal behavior in a multiagent system may depend on the behavior of the other agents.For example, in robot soccer, passing the ball may only be optimal if the defending goalie is goingto move to block the player’s shot and no defender will move to intercept the pass. This challengeis complicated by the fact that the behavior of the other agents is often not predictable by the agentdesigner, making learning and adaptation a necessary component of the agent itself. In addition, thebehavior of the other agents, and therefore the optimal response, can be changing as they also adaptto achieve their own goals. Game theory provides a framework for reasoning about these strategic interactions. The game theoretic concepts of stochastic games and Nash equilibria are the foundation for much of the recentresearch in multiagent learning, (e.g., Littman, 1994; Hu & Wellman, 1998; Greenwald & Hall,2002; Bowling & Veloso, 2002). A Nash equilibrium defines a course of action for each agent, suchthat no agent could benefit by changing their behavior. So, all agents are playing optimally, giventhat the other agents continue to play according to the equilibrium. c 2004 AI Access Foundation. All rights reserved.

BOWLING & VELOSO From the agent design perspective, optimal agents in realistic environments are not practical. Agents are faced with all sorts of limitations. Some limitations may physically prevent certainbehavior, e.g., a soccer robot that has traction limits on its acceleration. Other limitations are self-imposed to help guide an agent’s learning, e.g., using a subproblem solution for advancing theball down the field. In short, limitations prevent agents from playing optimally and possibly fromfollowing a Nash equilibrium. This clash between the concept of equilibrium and the reality of limited agents is a topic of critical importance. Do equilibria exist when agents have limitations? Are there classes of domainsor classes of limitations where equilibria are guaranteed to exist? This article introduces these ques-tions and provides concrete answers. In particular, we introduce two models of limitations: implicitgames and restricted policy spaces. We use these models to demonstrate in two key counterexam-ples the threat that limitations pose to the existence of equilibria. We refine a sufficient condition forthe existence of equilibria, and use to prove existence in three classes of games and limitations. Thisanalysis is peppered with examples of applications of these results. We conclude with an empiricalexample of learning with limitations, and a brief survey of related work. The article is organized as follows. Section 2 introduces the stochastic game framework as a model for multiagent learning. We define the game theoretic concept of equilibrium, and examinethe dependence of current multiagent learning algorithms on this concept. Section 3 enumerates andclassifies some common agent limitations and presents two formal models incorporating the effectsof limitations into the stochastic game framework. Section 4 is the major contribution of the article,presenting both proofs of existence for certain domains and limitations as well as counterexamplesfor others. Section 5 gives an example of how these results affect and relate to one particular multi-agent learning algorithm. We present interesting results of applying a multiagent learning algorithmin a setting with limited agents. Section 6 compares our approach with other investigations of agentlimitations in the fields of game theory and artificial intelligence. Finally, Section 7 concludes withimplications of this work and future directions. 2. Stochastic Games A stochastic game is a tuple (n, S, A1...n, T, R1...n), where, • n is the number of agents, • S is a set of states, • Ai is the set of actions available to agent i with A being the joint action space, A1 × . . . × An, • T is a transition function, S × A × S → [0, 1], such that, ∀s ∈ S ∀a ∈ A T (s, a, s ) = 1, s ∈S • and Ri is a reward function for the ith agent, S × A → R. This framework is very similar to the framework of a Markov Decision Process (MDP). Instead of asingle agent, though, there are multiple agents whose joint action determines the next state and re-wards to the agents. The goal of an agent, as in MDPs, is to maximize its long-term reward. Notice, 354

EXISTENCE OF MULTIAGENT EQUILIBRIA though, that each agent has its own independent reward function that it is seeking to maximize. Thegoal of maximizing “long-term reward” will be made formal in Section 2.2. Stochastic games can also be thought of as an extension of the concept of matrix games to mul- tiple states. Two common matrix games are in Figure 1. In these games there are two players; oneselects a row and the other selects a column of the matrix. The entry of the matrix they jointly selectdetermines the payoffs. Rock-Paper-Scissors in Figure 1(a) is a zero-sum game, where the columnplayer receives the negative of the row player’s payoff. In the general case (general-sum games;e.g., Bach or Stravinsky in Figure 1(b)) each player has an independent matrix that determines itspayoff. Stochastic games, then, can be viewed as having a matrix game associated with each state.The immediate payoffs at a particular state are determined by the matrix entries Ri(s, a). Afterselecting actions and receiving their rewards from the matrix game, the players are transitioned toanother state and associated matrix game, which is determined by their joint action. So stochas-tic games contain both MDPs (when n = 1) and matrix games (when |S| = 1) as subsets of theframework.  0 −1 1  2 0 Rr(s0, ·) = 1 0 −1   Rr(s0, ·) = 0 1 −1 1 0  0 1 −1  1 0 Rc(s0, ·) = −1 0 1 R   c(s0, ·) = 0 2 1 −1 0 (a) Rock-Paper-Scissors (b) Bach or Stravinsky Table 1: Two example matrix games. 2.1 Policies Unlike in single-agent settings, deterministic policies, which associate a single action with everystate, can often be exploited in multiagent settings. Consider Rock-Paper-Scissors as shown inFigure 1(a). If the column player were to play any action deterministically, the row player couldwin a payoff of one every time. This fact requires us to consider stochastic strategies and policies. Astochastic policy for player i, πi : S → P D(Ai), is a function that maps states to mixed strategies,which are probability distributions over the player’s actions. We use the notation Πi to be the set ofall possible stochastic policies available to player i, and Π = Π1 × . . . × Πn to be the set of jointpolicies of all the players. We also use the notation π−i to refer to a particular joint policy of allthe players except player i, and Π−i to refer to the set of such joint policies. Finally, the notation πi, π−i refers to the joint policy where player i follows πi while the other players follow their policy from π−i. In this work, we make the distinction between the concept of stochastic policies and mixtures of policies. A mixture of policies, σi : P D(S → Ai), is a probability distribution over the set of deter-ministic policies. An agent following a mixture of policies selects a deterministic policy accordingto its mixture distribution at the start of the game and always follows this policy. This differenceis similar to the distinction between mixed strategies and behavioral strategies in extensive-formgames (Kuhn, 1953). Our work focuses on stochastic policies as they (i) are a more compact repre- 355

BOWLING & VELOSO sentation requiring |Ai||S| parameters instead of |Ai||S| parameters to represent the complete spaceof policies, (ii) are the common notion of stochastic policies in single-agent behavior learning, (e.g.,Jaakkola, Singh, & Jordan, 1994; Sutton, McAllester, Singh, & Mansour, 2000; Ng, Parr, & Koller,1999), and (iii) do not require the artificial commitment to a single deterministic policy at the startof the game, which can be difficult to understand within a learning context. 2.2 Reward Formulations There are a number of possible reward formulations in single-agent learning that define the agent’snotion of optimality. These formulations also apply to stochastic games. We will explore two ofthese reward formulations in this article: discounted reward and average reward. Although thiswork focuses on discounted reward, many of our theoretical results also apply to average reward. Discounted Reward. In the discounted reward formulation, the value of future rewards is dimin-ished by a discount factor γ. Formally, given a joint policy π for all the agents, the value to agent iof starting at state s ∈ S is, ∞ V π | i (s) = γtE rti s0 = s, π , (1) t=0 where rt is the immediate reward to player i at time t with the expectation conditioned on s as the i initial state and the players following the joint policy π. In our formulation, we will assume an initial state, s0 ∈ S, is given and define the goal of each agent i as maximizing V π(s i 0). This formulation differs from the usual goal in MDPs and stochastic games, which is to simultaneously maximize the value of all states. We require this weaker goalsince our exploration into agent limitations makes simultaneous maximization unattainable.1 Thissame distinction was required by Sutton and colleagues (2000) in their work on parameterizedpolicies, one example of an agent limitation. Average Reward. In the average reward formulation all rewards in the sequence are equallyweighted. Formally, this corresponds to, T 1 V π | i (s) = lim E rti s0 = s, π , (2) T →∞ T t=0 with the expectation defined as in Equation 1. As is common with this formulation, we assume thatthe stochastic game is ergodic. A stochastic game is ergodic if for all joint policies any state canbe reached in finite time from any other state with non-zero probability. This assumption makes thevalue of a policy independent of the initial state. Therefore, ∀s, s ∈ S V π i (s) = V π i (s ). So any policy that maximizes the average value from one state maximizes the average value fromall states. These results along with more details on the average reward formulation for MDPs aresummarized by Mahadevan (1996). 1. This fact is demonstrated later by the example in Fact 5 in Section 4. In this game with the described limitation, if the column player randomizes among its actions, then the row player cannot simultaneously maximize the value ofthe left and right states. 356

EXISTENCE OF MULTIAGENT EQUILIBRIA For either formulation we will use the notation V π to refer to the value of the joint policy π to agent i i, which in either formulation is simply V π(s i 0), where s0 can be any arbitrary state for the average reward formulation. 2.3 Best-Response and Equilibria Even with the concept of stochastic policies and well-defined reward formulations, there are, ingeneral, still no optimal policies that are independent of the other players’ policies. We can, though,define a notion of best-response. Definition 1 For a game, the best-response function for player i, BRi(π−i), is the set of all policiesthat are optimal given the other player(s) play the joint policy π−i. A policy πi is optimal given π−iif and only if, π ,π ∀ π − π ∈ i,π−i ≥ i i i Πi V V . i i The major advancement that has driven much of the development of game theory, matrix games,and stochastic games is the notion of a best-response equilibrium, or Nash equilibrium (Nash, Jr.,1950). Definition 2 A Nash equilibrium is a joint policy, πi=1...n, with ∀i = 1, . . . , n πi ∈ BRi(π−i). Basically, an equilibrium is a policy for each player where each is playing a best-response to theother players’ policies. Hence, no player can do better by changing policies given that all the otherplayers continue to follow the equilibrium policy. What makes the notion of an equilibrium inter-esting is that at least one, possibly many, exist in all matrix games and most stochastic games. Thiswas proven by Nash (1950) for matrix games, Shapley (1953) for zero-sum discounted stochasticgames, Fink (1964) for general-sum discounted stochastic games, and Mertens and Neyman (1981)for zero-sum average reward stochastic games. The existence of equilibria for general-sum averagereward stochastic games is still an open problem (Filar & Vrieze, 1997). In the Rock-Paper-Scissors example in Figure 1(a), the only equilibrium consists of each player playing the mixed strategy where all the actions have equal probability. In the Bach-or-Stravinskyexample in Figure 1(b), there are three equilibria. Two consist of both players selecting their firstaction or both selecting their second. The third involves both players selecting their preferred coop-erative action with probability 2/3, and the other action with probability 1/3. 2.4 Learning in Stochastic Games Learning in stochastic games has received much attention in recent years as the natural extensionof MDPs to multiple agents. The Minimax-Q algorithm (Littman, 1994) was the first reinforcementlearning algorithm to explicitly consider the stochastic game framework. Developed for discountedreward, zero-sum stochastic games, the essence of the algorithm was to use Q-learning to learn thevalues of joint actions. The value of the next state was then computed by solving for the value ofthe unique Nash equilibrium of that state’s Q-values. Littman and Szepesvari (1996) proved thatunder usual exploration requirements for both players, Minimax-Q would converge to the Nashequilibrium of the game, independent of the opponent’s play beyond the exploration requirement. 357

BOWLING & VELOSO Other algorithms have since been presented for learning in stochastic games. We will summarizethese algorithms by broadly grouping them into two categories: equilibrium learners and best-response learners. The main focus of this summarization is to demonstrate how the existence ofequilibria under limitations is a critical question to existing algorithms. Equilibrium Learners. Minimax-Q has been extended in many different ways. Nash-Q (Hu &Wellman, 1998), Friend-or-Foe-Q (Littman, 2001), and Correlated-Q (Greenwald & Hall, 2002) areall variations on this same theme with different restrictions on the applicable class of games or theexact notion of equilibrium learned. All of the algorithms, though, seek to learn an equilibriumof the game directly, by iteratively computing intermediate equilibria. Some of the algorithms havetheoretical guarantees of convergence to equilibrium play, others have empirical results to this effect.Like Minimax-Q, though, the policies learned are independent of the play of the other agents beyondexploration requirements. We refer collectively to these algorithms as equilibrium learners. Whatis important to observe is that these algorithms depend explicitly on the existence of equilibria. Ifan agent or agents were limited in such a way so that no equilibria existed then these algorithmswould be, for the most part, ill-defined.2 Best-Response Learners. Another class of algorithms is the class ofbest-response learners. Thesealgorithms do not explicitly seek to learn an equilibrium, instead seeking to learn best-responses tothe other agents. The simplest example of one of these algorithms is Q-learning (Watkins, 1989).Although not an explicitly multiagent algorithm, it was one of the first algorithms applied to mul-tiagent environments (Tan, 1993; Sen, Sekaran, & Hale, 1994). Another less naive best-responselearning algorithm is WoLF-PHC (Bowling & Veloso, 2002), which varies the learning rate to ac-count for the other agents learning simultaneously. Other best-response learners include FictitiousPlay (Robinson, 1951; Vrieze, 1987), Opponent-Modeling Q-Learning (Uther & Veloso, 1997),Joint Action Learners (Claus & Boutilier, 1998), and any single-agent learning algorithm that learnsoptimal policies. Although these algorithms have no explicit dependence on equilibria, there is animportant implicit dependence. If algorithms that learn best-responses converge when playing eachother, then it must be to a Nash equilibrium (Bowling & Veloso, 2002). Therefore, all learning fixedpoints are Nash equilibria. In the context of agent limitations, this means that if limitations causeequilibria to not exist, then best-response learners could not converge. This nonexistence of equilibria is exactly one of the problems faced by Q-learning in stochastic games. Q-learning is limited to deterministic policies. The deterministic policy limitation can, infact, cause no equilibria to exist (see Fact 1 in Section 4.) So there are many games for which Q-learning cannot converge when playing with other best-response learners, such as other Q-learners. In summary, both equilibrium and best-response learners depend, in some way, on the existence ofequilibria. The next section explores agent limitations that are likely to be faced in realistic learningsituations. In Section 4, we then present our main results examining the effect these limitations haveon the existence of equilibria, and consequently on both equilibrium and best-response learners. 2. It should be noted that in the case of Minimax-Q, the algorithm and solution concept are still well-defined. A policy that maximizes its worst-case value may still exist even if limitations make it such that no equilibria exist. But, thisminimax optimal policy might not necessarily be part of an equilibrium. Later, in Section 4, Fact 5, we present anexample of a zero-sum stochastic game and agent limitations where the minimax optimal policies exist but do notcomprise an equilibrium. 358

EXISTENCE OF MULTIAGENT EQUILIBRIA 3. Limitations The solution concept of Nash equilibrium depends on all the agents playing optimally. From theagent development perspective, agents have limitations that prevent this from being a reality. Theworking definition of limitation in this article is anything that can restrict the agent from learningor playing optimal policies. Broadly speaking, limitations can be classified into two categories:physical limitations and rational limitations. Physical limitations are those caused by the interactionof the agent with its environment and are often unavoidable. Rational limitations are limitationsspecifically chosen by the agent designer to make the learning problem tractable, either in memoryor time. We briefly explore some of these limitations informally before presenting a formal modelof limitations that attempts to capture their effect within the stochastic game framework. 3.1 Physical Limitations One obvious physical limitation is that the agent simply is broken. A mobile agent may cease tomove or less drastically may lose the use of one of its actuators preventing certain movements.Similarly, another agent may appear to be “broken” when in fact the motion is simply outside itscapabilities. For example, in a mobile robot environment where the “rules” allow robots to moveup to two meters per second, there may be a robot that is not capable of reaching that speed. Anagent that is not broken, may suffer from poor control where its actions are not always carried outas desired, e.g., due to poorly tuned servos, inadequate wheel traction, or high system latency. Another common physical limitation is hardwired behavior. Most agents in dynamic domains need some amount of hard-wiring for fast response and safety. For example, many mobile robotplatforms are programmed to immediately stop if an obstacle is too close. These hardwired actionsprevent certain behavior by the agent, which is often unsafe but is potentially optimal. Sensing is a common area of agent limitations containing everything from noise to partial ob- servability. Here we’ll mention just one broad category of sensing problems: state aliasing. Thisoccurs when an agent cannot distinguish between two different states of the world. An agent mayneed to remember past states and actions in order to properly distinguish the states, or may simplyexecute the same action in both states. 3.2 Rational Limitations Rational limitations are a requirement for agents to learn in even moderately sized problems. Tech-niques for making learning scale, which often focus on near-optimal solutions, continue to be pro-posed and investigated in single-agent learning. They are likely to be even more necessary in multi-agent environments which tend to have larger state spaces. We will examine a few specific methods. In domains with sparse rewards one common technique is reward shaping, (e.g., Mataric, 1994). A designer artificially rewards the agent for actions the designer believes to be progressing towardthe sparse rewards. These additional rewards can often speed learning by focusing exploration, butalso can cause the agent to learn suboptimal policies. For example, in robotic soccer moving theball down the field is a good heuristic for goal progression, but at times the optimal goal-scoringpolicy is to pass the ball backward to an open teammate. Subproblem reuse also has a similar effect, where a subgoal is used in a portion of the state space to speed learning, (e.g., Hauskrecht, Meuleau, Kaelbling, Dean, & Boutilier, 1998; Bowling& Veloso, 1999). These subgoals, though, may not be optimal for the global problem and so prevent 359

BOWLING & VELOSO the agent from playing optimally. Temporally abstract options, either provided (Sutton, Precup, &Singh, 1998) or learned (McGovern & Barto, 2001; Uther, 2002), also enforce a particular sub-policy on a portion of the state space. Although in theory, the primitive actions are still availableto the agents to play optimal policies, in practice abstraction away from primitive actions is oftennecessary in large or continuous state spaces. Parameterized policies are receiving a great deal of attention as a way for reinforcement learning to scale to large problems, (e.g., Williams & Baird, 1993; Sutton et al., 2000; Baxter & Bartlett,2000). The idea is to give the learner a policy that depends on far less parameters than the entirepolicy space actually would require. Learning is then performed in the smaller space of parametersusing gradient techniques. Parameterized policies simplify and speed learning at the expense ofpossibly not being able to represent the optimal policy in the parameter space. 3.3 Models of Limitations This enumeration of limitations shows that there are a number and variety of limitations with whichagents may be faced, and they cannot be realistically avoided. In order to understand their impacton equilibria, we model limitations formally within the game theoretic framework. We introducetwo models that capture broad classes of limitations: implicit games and restricted policy spaces. Implicit Games. Limitations may cause an agent to play suboptimally, but it may be that theagent is actually playing optimally in a different game. If this new game can be defined within thestochastic game framework we call this the implicit game, in contrast to the original game called theexplicit game. For example, reward shaping adds artificial rewards to help guide the agent’s search.Although the agent is no longer learning an optimal policy in the explicit game, it is learning anoptimal policy of some game, specifically the game with these additional rewards added to thatagent’s Ri function. Another example is due to broken actuators preventing an agent from takingsome action. The agent may be suboptimal in the explicit game, while still being optimal in theimplicit game defined by removing these actions from the agent’s action set, Ai. We can formalizethis concept in the following definition. Definition 3 Given a stochastic game (n, S, A1...n, T, R1...n) the tuple (n, S, ˆ A1...n, ˆ T , ˆ R1...n) is an implicit game if and only if it is itself a stochastic game and there exist mappings, τi : S × ˆ Ai × Ai → [0, 1], such that, ∀s, s ∈ S ∀ˆ a ˆ i ∈ ˆ Ai T (s, ˆ ai , s ) = Πn , s ). i=1...n i=1τi(s, ˆ ai, ai) T (s, ai i=1...n a∈A The τi’s are mappings from the implicit action space into stochastic actions in the explicit actionspace. The τi mappings insure that, for each player, the choices available in the implicit game are a restric-tion of the choices in the explicit game. In other words, for any policy in the implicit game, thereexists an equivalent, possibly stochastic policy, in the explicit game. Reward shaping, broken actuators, and exploration can all be captured within this model. For reward shaping the implicit game is (n, S, A1...n, T, ˆ R1...n), where ˆ Ri adds the shaped reward into 360

EXISTENCE OF MULTIAGENT EQUILIBRIA the original reward, Ri. In this case the τ mappings are just the identity, 1 if a = a τi(s, a, a ) = . 0 otherwise For the broken actuator example, let a0 ∈ A ∈ A i i be some null action for agent i and let ab i i be some broken action for agent i that under the limitation has the same effect as the null action. Theimplicit game, then, is (n, S, A1...n, ˆ T , ˆ R1...n), where, ˆ T (s, a0, a T (s, a, s ) = i −i , s ) if ai = abi T (s, a, s ) otherwise ˆ R(s, a0, a R(s, a) = i −i ) if ai = abi , R(s, a) otherwise and, a0 if a = ab τ i i i(s, a) = . a otherwise For -exploration, as with broken actuators, only ˆ T and ˆ R need to be defined. They are just the combination of T or R with the transition probabilities or reward values of selecting a random action. Limitations captured by this model can be easily analyzed with respect to their effect on the existence of equilibria. We now restrict ourselves to discounted reward stochastic games, whereequilibria existence is known. Using the intuitive definition of an equilibrium as a joint policy suchthat “no player can do better by changing policies,” an equilibrium in the implicit game achievesthis definition for the explicit game. Since all discounted reward stochastic games have at least oneequilibrium, so must the implicit game, which is also in this class. This equilibrium for the implicitgame is then an equilibrium in the explicit game given that the agents are limited. On the other hand, many of the limitations described above cannot be modeled in this way. None of the limitations of abstraction, subproblem reuse, parameterized policies, or state aliasinglend themselves to being described by this model. This leads us to our second, and in many waysmore general, model of limitations. Restricted Policy Spaces. The second model is that of restricted policy spaces, which models limi-tations as restricting the agent from playing certain policies. For example, an -exploration strategyrestricts the player to policies that select all actions with some minimum probability. Parameter-ized policy spaces have a restricted policy space corresponding to the space of policies that can berepresented by their parameters. We can define this formally. Definition 4 A restricted policy space for player i is a non-empty and compact subset, Πi ⊆ Πi, ofthe complete space of stochastic policies. The assumption of compactness3 may at first appear strange, but it is not particularly limiting, andis critical for any equilibrium analysis. 3. Since Πi is a subset of a bounded set, the requirement that Πi is compact merely adds that the limit point of any sequence of elements from the set is also in the set. 361

BOWLING & VELOSO Physical Limitations Implicit Games Restricted Policies Broken ActuatorsHardwired BehaviorPoor ControlState AliasingRational Limitations Implicit Games Restricted Policies Reward Shaping or IncentivesExplorationState Abstraction/OptionsSubproblemsParameterized Policy Table 2: Common agent limitations. The column check-marks correspond to whether the limitation can be modeled straightforwardly using implicit games and/or restricted policy spaces. It should be straightforward to see that parameterized policies, state aliasing (with no memory), and subproblem reuse, as well as exploration and broken actuators all can be captured as a restric-tion on policies the agent can play. Therefore they can be naturally described as restricted policyspaces. On the other hand, the analysis of the existence of equilibria under this model is not at allstraightforward. Since restricted policy spaces capture most of the really interesting limitations wehave discussed, this is precisely the focus of the rest of this article. Before moving on to this analysis, we summarize our enumeration of limitations in Table 2. Thelimitations that we have been discussing are listed as well as denoting the model that most naturallycaptures their effect on agent behavior. 4. Existence of Equilibria In this section we define formally the concept of a restricted equilibrium, which account for agents’restricted policy spaces. We then carefully analyze what can be proven about the existence of re-stricted equilibria. The results presented range from somewhat trivial examples (Facts 1, 2, 3, and 4)and applications of known results from game theory and basic analysis (Theorems 1 and 5) to resultsthat we believe are completely new (Theorems 2, 3, and 4), as well as a critical counterexample tothe wider existence of restricted equilibria (Fact 5). But all of the results are in a sense novel sincethis specific question has received no direct attention in the game theory nor the multiagent learningliterature. 4.1 Restricted Equilibria We begin by defining the concept of an equilibrium under the model of restricted policy spaces.First we need a notion of best-response that accounts for the players’ limitations. Definition 5 A restricted best-response for player i, BRi(π−i), is the set of all policies from Πi thatare optimal given the other player(s) play the joint policy π−i. We can now use this to define an equilibrium. 362

EXISTENCE OF MULTIAGENT EQUILIBRIA Definition 6 A restricted equilibrium is a joint policy, πi=1...n, where, πi ∈ BRi(π−i). Hence, no player can do better by changing policies given that all the other players continue tofollow the equilibrium policy, and the player can only switch to policies within their restrictedpolicy space. 4.2 Existence of Restricted Equilibria We can now state some results about when equilibria are preserved by restricted policy spaces, andwhen they are not. Unless otherwise stated (as in Theorems 2 and 4, which only apply to discountedreward), the results presented here apply equally to both the discounted reward and the averagereward formulations. We will separate the proofs for the two reward formulations when needed.The first four facts show that the question of the existence of restricted equilibria does not have atrivial answer. Fact 1 Restricted equilibria do not necessarily exist. Proof. Consider the Rock-Paper-Scissors matrix game with players restricted to the space of deter-ministic policies. There are nine joint deterministic policies, and none of these joint policies are anequilibrium. Fact 2 There exist restricted policy spaces such that restricted equilibria exist. Proof. One trivial restricted equilibrium is in the case where all agents have a singleton policysubspace. The singleton joint policy therefore must be a restricted equilibrium. Fact 3 If π∗ is a Nash equilibrium and π∗ ∈ Π, then π∗ is a restricted equilibrium. Proof. If π∗ is a Nash equilibrium, then we have π ∀i ∈ 1 . . . n ∀π i,π∗ −i i ∈ Πi V π∗ ≥ i V . i Since Πi ⊆ Πi, then we also have π ∀i ∈ 1 . . . n ∀π i,π∗ −i i ∈ Πi V π∗ ≥ i V , i and thus π∗ is a restricted equilibrium. On the other hand, the converse is not true; not all restricted equilibria are of this trivial variety. Fact 4 There exist non-trivial restricted equilibria that are neither Nash equilibria nor come fromsingleton policy spaces. Proof. Consider the Rock-Paper-Scissors matrix game from Figure 1. Suppose the column player isforced, due to some limitation, to play “Paper” exactly half the time, but is free to choose between“Rock” and “Scissors” otherwise. This is a restricted policy space that excludes the only Nashequilibrium of the game. We can solve this game using the implicit game model, by giving the 363

BOWLING & VELOSO R Restricted Policy SpaceNash EquilibriumRestricted Equilibrium s1 P S s2 Explicit Game Implicit Game  0 −1 1   − 1 0  2 Payoffs  1 0 −1   1 − 1     2 2  −1 1 0 0 12 Nash Equilibrium 1 , 1 , 1 , 1 , 1 , 1 0, 1 , 2 , 2 , 1 3 3 3 3 3 3 3 3 3 3 Restricted Equilibrium 0, 1 , 2 , 1 , 1 , 1 3 3 3 2 6 Figure 1: Example of a restricted equilibrium that is not a Nash equilibrium. Here, the column player in Rock-Paper-Scissors is restricted to playing only linear combinations of thestrategies s1 = 1 , 1 , 0 and s , 1 . 2 2 2 = 0, 12 2 limited player only two actions, s1 = (0.5, 0.5, 0) and s2 = (0, 0.5, 0.5), which the player can mixbetween. This is depicted graphically in Figure 1. We can solve the implicit game and convertthe two actions back to actions of the explicit game to find a restricted equilibrium. Notice thisrestricted equilibrium is not a Nash equilibrium. Notice that the Fact 4 example has a convex policy space, i.e., all linear combinations of policies in the set are also in the set. Also, notice that the Fact 1 counterexample has a non-convex policyspace. These examples suggest that restricted equilibria may exist as long as the restricted policyspace is convex. The convexity restriction is, in fact, sufficient for matrix games. Theorem 1 When |S| = 1, i.e. in matrix games, if Πi is convex, then there exists a restrictedequilibrium. The proof is based on an application of Rosen’s theorem for concave games (Rosen, 1965) andis included in Appendix A. Surprisingly, the convexity restriction is not sufficient for stochasticgames. Fact 5 For a stochastic game, even if Πi is convex, restricted equilibria do not necessarily exist. Proof. Consider the stochastic game in Figure 2. This is a zero-sum game where only the payoffsto the row player are shown. The discount factor γ ∈ (0, 1). The actions available to the row playerare U and D, and for the column player L or R. From the initial state, s0, the column player mayselect either L or R which results in no rewards but with high probability, 1 − , transitions to thespecified state (regardless of the row player’s action), and with low probability, , transitions to theopposite state. Notice that this stochasticity is not explicitly shown in Figure 2. In each of the 364

EXISTENCE OF MULTIAGENT EQUILIBRIA resulting states the players play the matrix game shown and then deterministically transition backto the initial state. Notice that this game is unichain, where all the states are in a single ergodicset, thus satisfying the average reward formulation requirement. Also, notice that the game withoutlimitations is guaranteed to have an equilibrium in both the discounted reward and the averagereward, since it is zero-sum, formulations. 0 0 s L 0 R   0 0 1 − 1 − 1 0  sL sR 0 0     0 0 0 1 Figure 2: An example stochastic game where convex restricted policy spaces do not preserve the existence of equilibria. Now consider the restricted policy space where players have to play their actions with the same probability in all states. So, Πi = πi ∈ Πi|∀s, s ∈ S ∀a ∈ A πi(s, a) = πi(s , a) . (3) Notice that this is a convex set of policies. That is, if policies x1 and x2 are in Πi (according toEquation 3), then for any α ∈ [0, 1], x3 must also be in Πi, where, x3(s, a) = αx1(s, a) + (1 − α)x2(s, a). (4) The convexity of Πi can be seen by examining x3(s , a) for any s ∈ S. From Equation 4, we have, x3(s , a) = αx1(s , a) + (1 − α)x2(s , a) (5) = αx1(s, a) + (1 − α)x2(s, a) (6) = x3(s, a). (7) Therefore, x3 is in Πi and hence Πi is convex. This game, though, does not have a restricted equilibrium. The four possible joint deterministic policies, (U, L), (U, R), (D, L), and (D, R), are not equilibria. So if there exists an equilibriumit must be mixed. Consider any mixed strategy for the row player. If this strategy plays U withprobability less than 1 then the only best-response for the column player is to play L; if greater than 2 1 then the only best-response is to play R; if equal then the only best-responses are to play L or R 2 deterministically. In all cases, all best-responses are deterministic, so this rules out mixed strategyequilibria, and so no equilibrium exists. Fact 5 demonstrates that convexity is not a strong enough property to guarantee the existence of restricted equilibria. The discussion in Section 2.4 concluded that the majority of existing mul-tiagent learning algorithms hold some dependence on the existence of equilibria. Therefore, this 365

BOWLING & VELOSO result may have serious implications for the scaling of multiagent learning to large problems. Wewill explore these implications by first examining this example more closely, in order to identify anew sufficient condition. We will then seek to rescue the equilibrium notion by exploring specificsituations where its existence can be preserved. 4.2.1 A SUFFICIENT CONDITION FOR EQUILIBRIA Standard equilibrium proof techniques fail in the Fact 5 counterexample because the player’s best-response sets are not convex, even though their restricted policy spaces are convex. Notice thatthe best-response to the row player mixing equally between actions is to play either of its actionsdeterministically. But, linear combinations of these actions (e.g., mixing equally) are not best-responses. This intuition is proven in the following lemma. Lemma 1 For any stochastic game, if Πi is convex and for all π−i ∈ Π−i, BRi(π−i) is convex,then there exists a restricted equilibrium. The proof of this lemma relies on Kakutani’s fixed point theorem using the convexity of the best-response as the critical condition for applying the theorem. Complete details of the proof can befound in Appendix B. The consequence of this lemma is that, if we can prove that the sets of restricted best-responses are convex then restricted equilibria exist. Notice that the convexity of the restricted policy space isnecessary for this condition (e.g., consider a uniform reward function resulting in BR(π−i) = Π).The Fact 5 counterexample demonstrates that it is not sufficient, since a restricted policy space canbe convex without the restricted best-response sets being convex. We now look at placing further restrictions either on the restricted policy spaces or the stochas- tic game, to insure provably convex best-response sets. After each existence proof for restrictedequilibria, we give one or more practical examples of domains where the theoretical results apply.These results begin to enumerate a few general classes where restricted equilibria are guaranteed toexist. 4.2.2 A SUBCLASS OF RESTRICTED POLICIES The first result for general stochastic games uses a stronger notion of convexity for restricted policyspaces. Definition 7 A restricted policy space Πi is statewise convex, if it is the Cartesian product over allstates of convex strategy sets. Equivalently, if for all x1, x2 ∈ Πi and all functions α : S → [0, 1],the policy x3(s, a) = α(s)x1(s, a) + (1 − α(s))x2(s, a) is also in Πi. In words, statewise convex policy spaces allow the policy’s action probabilities at each state to bemodified (within the restricted set) independently. We can show that this is a sufficient condition. Theorem 2 In the discounted reward formulation, if Πi is statewise convex, then there exists arestricted equilibrium. Proof. With statewise convex policy spaces, there exist optimal policies in the strong sense asmentioned in Section 2. Specifically, there exists a policy that can simultaneously maximize thevalue of all states. Formally, for any π−i there exists a πi ∈ Πi such that, π ,π ∀ π − s ∈ S ∀π ∈ i,π−i i i i Πi V (s) ≥ V (s). i i 366

EXISTENCE OF MULTIAGENT EQUILIBRIA Suppose this were not true, i.e., there were two policies each of which maximized the value ofdifferent states. We can construct a new policy that in each state follows the policy whose value islarger for that state. This policy will maximize the value of both states that those policies maximized,and due to statewise convexity is also in Πi. We will use that fact to redefine optimality to this strongsense for this proof. We will now make use of Lemma 1. First, notice the lemma’s proof still holds even with this new definition of optimality. We just showed that under this redefinition, BRi(π−i) is non-empty, andthe same argument for compactness of BRi(π−i) holds. So we can make use of Lemma 1 and whatremains is to prove that BRi(π−i) is convex. Since π−i is a fixed policy for all the other players thisdefines an MDP for player i (Filar & Vrieze, 1997, Corollary 4.2.11). So we need to show that the setof polices from the player’s restricted set that are optimal for this MDP is a convex set. Concretely,if x1, x2 ∈ Π are optimal for this MDP, then the policy x3(s, a) = αx1(s, a) + (1 − α)x2(s, a) isalso optimal for any α ∈ [0, 1]. Since x1 and x2 are optimal in the strong sense, i.e., maximizingthe value of all states simultaneously, then they must have the same per-state value. Here, we will use the notation V x(s) to refer to the value of policy x from state s in this fixed MDP. The value function for any policy is the unique solution to the Bellman equations, specifically, ∀s ∈ S V x(s) = x(s, a) R(s, a) + γ T (s, a, s )V x(s ) . (8) a s For x3 then we get the following, V x3 (s) = x3(s, a) R(s, a) + γ T (s, a, s )V x3 (s ) (9) a s = (αx1(s, a) + (1 − α)x2(s, a)) R(s, a) + γ T (s, a, s )V x3 (s ) (10) a s = α x1(s, a) R(s, a) + γ T (s, a, s )V x3 (s ) + a s (1 − α) x2(s, a)) R(s, a) + γ T (s, a, s )V x3 (s ) . (11) a s Notice that V x3 (s) = V x1 (s) = V x2 (s) is a solution to these equations, and therefore is the uniquesolution for the value of x3. Therefore, x3 has the same values as x1 and x2, and hence is also opti-mal. Therefore BRi(π−i) is convex, and from Lemma 1 we get the existence of restricted equilibriaunder this stricter notion of optimality, which also makes the policies a restricted equilibrium underour original notion of optimality, that is only maximizing the value of the initial state. Example: Multirobot Indoor Navigation. Consider an indoor navigation domain with multiplerobots traversing constrained hallways. The agents are the robots, all with their own navigationalgoals. A robot’s state consists of a highly discretized representation of position and orientationresulting in an intractable number of states per robot. The global state is the joint states of all ofthe robots and is known by all the agents. The robots’ actions are low-level navigation commands.We will construct an example of a useful restricted policy space for this domain that is statewiseconvex, and therefore satisfies the conditions of Theorem 2. Define a set of key decision points, K, 367

BOWLING & VELOSO made up of the hallway intersections. These decision points define a subset of the complete statespace, Kn ⊆ S. While between key decision points, each agent will use a fixed policy assigningprobabilities to actions to reach the next decision point. The agents’ policies are restricted so thatdecisions are only made when all agents have arrived at a key decision point, where they all simulta-neously choose an action contingent on the global state. The agents can independently assign actionprobabilities for all states in Kn, while all the action probabilities in S − Kn are fixed. Hence, this isa statewise convex restricted policy space. Theorem 2 therefore applies and we conclude that thereexists restricted equilibria in this domain. Despite the above example, most rational limitations that allow reinforcement learning to scale are unlikely to be statewise convex restrictions. They often have a strong dependence betweenstates. For example, parameterized policies involve far less parameters than the number of states,which is often intractably large. Either the majority of states have fixed action probabilities, as in theabove example, or there will be fart too many parameters to optimize. Similarly, subproblems forcewhole portions of the state space to follow the same subproblem solution. Therefore, these portionsof the state space do not select their actions independently. Abstraction and state aliasing use thesame action probabilities across multiple states, and so also are not statewise convex. We now lookat retaining a simple convexity condition on the restricted policy spaces, but examine constrainedspaces of stochastic games. 4.2.3 SUBCLASSES OF STOCHASTIC GAMES One way to relax from statewise convexity to general convexity is to consider only a subset ofstochastic games. Theorem 1 is one example, where restricted equilibria exist for the subclassof matrix games with convex restricted policy spaces. Matrix games have a highly constrained“transition” function allowing only self transitions, which are independent of the players’ actions.We can generalize this idea beyond single-state games. Theorem 3 Consider no-control stochastic games, where all transitions are independent of theplayers’ actions, i.e., ∀s, s ∈ S ∀a, b ∈ A T (s, a, s ) = T (s, b, s ). If Πi is convex, then there exists a restricted equilibrium. Proof (Discounted Reward). This proof also makes use of Lemma 1, leaving us only to show thatBRi(π−i) is convex. Just as in the proof of Theorem 2 we will consider the MDP defined for playeri when the other players follow the fixed policy π−i. As before it suffices to show that for this MDP,if x1, x2 ∈ Π are optimal for this MDP, then the policy x3(s, a) = αx1(s, a) + (1 − α)x2(s, a) isalso optimal for any α ∈ [0, 1]. Again we use the notation V x(s) to refer to the traditional value of a policy x at state s in this fixed MDP. Since T (s, a, s ) is independent of a, we can simplify the Bellman equations (Equa-tion 8) to V x(s) = x(s, a)R(s, a) + γ x(s, a)T (s, a, s )V x(s ) (12) a s a = x(s, a)R(s, a) + γ T (s, ·, s )V x(s ). (13) a s 368

EXISTENCE OF MULTIAGENT EQUILIBRIA For the policy x3, the value of state s is then, V x3 (s) = α x1(s, a)R(s, a) + (1 − α) x2(s, a)R(s, a) + a a γ T (s, ·, s )V x3 (s ). (14) s Using equation 13 for both x1 and x2 we get, V x3 (s) = α(V x1 (s) − γ T (s, ·, s )V x1 (s )) + s (1 − α)(V x2 (s) − γ T (s, ·, s )V x2 (s )) + s γ T (s, ·, s )V x3 (s ) (15) s = αV x1 (s) + (1 − α)V x2 (s) + γ T (s, ·, s ) V x3 (s ) − αV x1 (s ) − (1 − α)V x2 (s ) (16) s Notice that a solution to these equations is V x3 (s) = αV x1 (s) + (1 − α)V x2 (s), and the Bellmanequations must have a unique solution. Hence, V x3 (s0) is equal to V x1(s0) and V x2(s0), which areequal since both are optimal. So x3 is optimal, and BRi(π) is convex. Applying Lemma 1 we getthat restricted equilibria exist. Proof (Average Reward). An equivalent definition to Equation 2 of a policy’s average reward is, V π i = dπ(s) π(s, a)R(s, a), (17) s∈S a where dπ(s) defines the distribution over states visited while following π after infinite time. For astochastic game or MDP that is unichain we know that this distribution is independent of the initialstate. In the case of no-control stochastic games or MDPs, this distribution becomes independentof the actions and policies of the players, and depends solely on the transition probabilities. SoEquation 17 can be written, V π i = d(s) π(s, a)R(s, a). (18) s∈S a As before, we must show that BRi(π−i) is convex to apply Lemma 1. Consider the MDP definedfor player i when the other players follow the policy π−i. It suffices to show that for this MDP, ifx1, x2 ∈ Π are optimal for this MDP, then the policy x3(s, a) = αx1(s, a) + (1 − α)x2(s, a) is alsooptimal for any α ∈ [0, 1]. Using Equation 18, we can write the value of x3 as, V x3 = d(s) x i 3(s, a)R(s, a) (19) s∈S a = d(s) (αx1(s, a) + (1 − α)x2(s, a)) R(s, a) (20) s∈S a 369

BOWLING & VELOSO = d(s) αx1(s, a)R(s, a) + (1 − α)x2(s, a)R(s, a) (21) a a = α d(s) x1(s, a)R(s, a) + (1 − α) d(s) x2(s, a)R(s, a) (22) s∈S a s∈S a = αV x1 (s) + (1 − α)V x2 (s) (23) i i = αV x1 (s) + (1 − α)V x1 (s) = V x1 (s). (24) i i i Therefore x3 has the same average reward as x1 and so is also optimal. So BRi(π−i) is convex andby Lemma 1 there exists an equilibrium. Example: Space Probe. Consider a space probe traveling on a predefined planetary trajectory.Consider using a multiagent approach to coordinating the gathering of data from scientific instru-ments. Each instrument is an agent making independent decisions and maximizing an assignedreward function. The state of the system is the position and orientation of the probe, which fol-lows a fixed transition function. The state determines the effectiveness of the various data-gatheringtasks (i.e., the agents’ rewards), which also may be determined by the decisions of the other in-struments. For example, camera images may be occluded when collecting atmospheric samples, ortwo instruments may draw too much power if activated simultaneously, causing both to fail. Thisis a no-control stochastic game, as the instruments’ actions do not control the state, which is thepredefined trajectory of the probe. By Theorem 3, restricted equilibria exist for all convex restrictedpolicy spaces in this domain. We can now merge Theorem 2 and Theorem 3 allowing us to prove existence of equilibria for a general class of games where only one of the player’s actions affects the next state. Sincethis theorem’s sufficient conditions are the most general of those presented, multiple examples ofapplications follow. Theorem 4 Consider single-controller stochastic games (Filar & Vrieze, 1997), where all transi-tions depend solely on player 1’s actions, i.e., ∀s, s ∈ S ∀a, b ∈ A a1 = b1 ⇒ T (s, a, s ) = T (s, b, s ). In the discounted reward formulation, if Π1 is statewise convex and Πi=1 is convex, then there existsa restricted equilibrium. Proof. This proof again makes use of Lemma 1, leaving us to show that BRi(π−i) is convex. Fori = 1 we use the argument from the proof of Theorem 2. For i = 1 we use the argument fromTheorem 3. Example: Haunted House Ride. An example that helps to illustrate the concept of a single-controllerstochastic game is an amusement park haunted house ride. Consider the system as consisting ofmultiple agents including the ride operator and the ride’s passengers. Suppose the ride operatorhas choices over different ride routes therefore determining the state of the world for the passengeragents. The passengers, on the other hand, can only choose at each state where to look and whento try and scare their fellow passengers. Therefore, one agent (the operator) controls the transitions,while all the agents’ (operator and passengers) affect the resulting rewards. Consider that the ride 370

EXISTENCE OF MULTIAGENT EQUILIBRIA operator may be restricted to only making decisions at a few key decision points determined bythe ride designer. Therefore, the operator’s policy space is statewise convex. Hence, accordingto Theorem 4 if the passengers have convex restricted policy spaces resulting from their imposedlimitations, then restricted equilibria exist. Example: Mobile Data Gathering. The general pattern of the haunted house ride can also be seenin practical situations. Consider mobile data gathering, such as an extended version of the spaceprobe example presented earlier. There is one navigational agent whose actions determine the state(e.g., position and orientation). The other agents are instruments choosing how to gather data at anygiven moment, as in the no-control version described above. The rewards are determined by all ofthe agents actions, but the state transition is defined only by the navigational agent. This fits thedefinition of a single-controller stochastic game and the Theorem holds for appropriate restrictedpolicy spaces. Example: Asymmetric Competition The above examples, although not strictly team games, do havea cooperative nature to them. Single-controller stochastic games can also be competitive. Con-sider an economic competition between two firms, where one firm’s supply technology is primarilyinfrastructure-based, while the other’s is very fluid. For example, a transportation company usingrail lines versus a company using trucks. Since one agent’s actions are implemented on a timescalefar smaller than the other agent, we can effectively ignore that firm’s state. The global state is justthe built infrastructure of the first agent, and is only influenced by that agent’s actions. The rewards,though, are determined by both actions of the agents, and may well be zero-sum. If the first firmuses a statewise convex restricted policy space (e.g., using key decision points) and the other firmuses a convex restricted policy space, then restricted equilibria exist. The previous results have looked at stochastic games whose transition functions have particu- lar properties. Our final theorem examines stochastic games where the rewards have a particularstructure. Specifically we address team games, where the agents all receive equal payoffs. Theorem 5 For team games, i.e., ∀i, j ∈ 1, . . . , n ∀s ∈ S ∀a ∈ A Ri(s, a) = Rj(s, a), there exists a restricted equilibrium. Proof. The only constraints on the players’ restricted policy spaces are those stated at the beginningof this section: non-empty and compact. Since Π is compact, being a Cartesian product of compactsets, and player one’s value in either formulation is a continuous function of the joint policy, thenthe value function attains its maximum (Gaughan, 1993, Corollary 3.11). Specifically, there existsπ∗ ∈ Π such that, ∀π ∈ Π V π∗ ≥ 1 V π 1 . Since ∀iVi = V1 we then get that the policy π∗ maximizes all the players’ rewards, and so eachmust be playing a restricted best-response to the others’ policies. Example: Distributed Network Routing. Consider a distributed network routing domain wheremessage passing at nodes is controlled by an intelligent agent. The agent can only observe itsown queue of incoming messages and congestion on outgoing links. Hence each agent has analiased view of the global state. Therefore, policies mapping the observed state to actions are convex 371

BOWLING & VELOSO restricted policy spaces in the fully-observable stochastic game. All the agents seek to maximizeglobal throughput. Hence, they have an identical reward function based on global packet arrival. Aslong as further agent limitations preserve the convexity of their policy spaces, by Theorem 5 thereexists a restricted equilibrium. 4.3 Summary Facts 1 and 5 provide counterexamples that show the threat limitations pose to equilibria. Theo-rems 1, 2, 3, 4, and 5 give us five general classes of stochastic games and restricted policy spaceswhere equilibria are guaranteed to exist. The fact that equilibria do not exist in general raisesconcerns about equilibria as a general basis for multiagent learning in domains where agents havelimitations. On the other hand, combined with the model of implicit games, the presented theoreti-cal results lays the initial groundwork for understanding when equilibria can be relied on and whentheir existence may be in question. These contributions also provide some formal foundation forapplying multiagent learning in limited agent problems. 5. Learning with Limitations In Section 2, we highlighted the importance of the existence of equilibria to multiagent learningalgorithms. This section presents results of applying a particular learning algorithm to a settingof limited agents. We use the best-response learner, WoLF-PHC (Bowling & Veloso, 2002). Thisalgorithm is rational, that is, it is guaranteed to converge to a best-response if the other players’ poli-cies converge and appropriate decayed exploration is used (Singh, Jaakkola, Littman, & Szepesv´ari,2000). In addition, it has been empirically shown to converge in self-play, where both players useWoLF-PHC for learning. In this article we apply this algorithm in self-play to matrix games, bothwith and without player limitations. Since the algorithm is rational, if the players converge thentheir converged policies must be an equilibrium (Bowling & Veloso, 2002). The specific limitations we examine fall into both the restricted policy space model as well as the implicit game model. One player is restricted to playing strategies that are the convex hullof a subset of the available strategies. From Theorem 1, there exists a restricted equilibrium withthese limitations. For best-response learners, this amounts to a possible convergence point for theplayers. For the limited player, the WoLF-PHC algorithms were modified slightly so that the playermaintains Q-values of its restricted set of available strategies and performs its usual hill-climbing inthe mixed space of these strategies. The unlimited player is unchanged and completely uninformedof the limitation of its opponent. For all the experiments, we use very small learning rates and alarge number of trials to better display how the algorithm learns and converges. 5.1 Rock-Paper-Scissors The first game we examine is Rock-Paper-Scissors. Figure 3 shows the results of learning whenneither player is limited. Each graph shows the mixed policy the player is playing over time. Thelabels to the right of the graph signify the probabilities of each action in the game’s unique Nashequilibrium. Observe that the players’ strategies converge to this learning fixed point. Figure 4 shows the results of restricting player 1 to a convex restricted policy space, defined by requiring the player to play “Paper” exactly half the time. This is the same restriction as showngraphically in Figure 1. The graphs again show the players’ strategies over time, and the labels 372

EXISTENCE OF MULTIAGENT EQUILIBRIA Player 1: Unlimited Player 2: Unlimited E(Reward) = 0 E(Reward) = 0 1 1 P(Rock) P(Rock) P(Paper) P(Paper) 0.8 P(Scissors) 0.8 P(Scissors) 0.6 0.6 0.4 Rock 0.4 Rock Paper Paper Scissors Action Probabilities Scissors 0.2 Action Probabilities 0.2 0 0 0 100000 200000 300000 0 100000 200000 300000 Number of Trials Number of Trials Figure 3: WoLF-PHC in Rock-Paper-Scissors. Neither player is limited. to the right now label the game’s restricted equilibrium, which accounts for the limitation (seeFigure 1). The player’s strategies now converge to this new learning fixed point. If we examine theexpected rewards to the players, we see that the unrestricted player gets a higher expected reward inthe restricted equilibrium than in the game’s Nash equilibrium (1/6 compared to 0.) In summary,both players learn optimal best-response policies with the unrestricted learner appropriately takingadvantage of the other player’s limitation. Player 1: Limited Player 1: Unlimited E(Reward) = -0.167 E(Reward) = 0.167 1 1 P(Rock) P(Rock) P(Paper) P(Paper) 0.8 P(Scissors) 0.8 P(Scissors) Scissors 0.6 0.6 Paper 0.4 0.4 Rock Paper Action Probabilities 0.2 Action Probabilities Scissors 0.2 0 0 Rock 0 100000 200000 300000 0 100000 200000 300000 Number of Trials Number of Trials Figure 4: WoLF-PHC in Rock-Paper-Scissors. Player 1 must play “Paper” with probability 1 . 2 5.2 Colonel Blotto The second game we examined is “Colonel Blotto” (Gintis, 2000), which is also a zero-sum matrixgame. In this game, players simultaneously allot regiments to one of two battlefields. If one playerallots more armies to a battlefield than the other, they receive a reward of one plus the number ofarmies defeated, and the other player loses this amount. If the players tie, then the reward is zerofor both. In the unlimited game, the row player has four regiments to allot, and the column playerhas only three. The matrix of payoffs for this game is shown in Figure 5. 373

BOWLING & VELOSO 3-0 2-1 1-2 0-3 3-0 2-1 1-2 0-3 4-0  4 2 1 0  4-0  -4 -2 -1 0  3-1 1 3 0 −1 −1 −3 0 1   3-1   R     r = 2-2  −2 2 2 −2  Rc = 2-2  2 −2 −2 2  1-3      −1 0 3 1  1-3  1 0 −3 −1  0-4 0 1 2 4 0-4 0 −1 −2 −4 Figure 5: Colonel Blotto Game. Figure 6 shows experimental results with unlimited players. The labels on the right signify the probabilities associated with the Nash equilibrium to which the players’ strategies converge.Player 1 is then given the limitation that it could only allot two of its armies, the other two wouldbe allotted randomly. This is also a convex restricted policy space and therefore by Theorem 1has a restricted equilibrium. Figure 7 shows the learning results. The labels to the right corre-spond to the action probabilities for the restricted equilibrium, which was computed by hand. As inRock-Paper-Scissors, the players’ strategies converge to the new learning fixed point. Similarly, theexpected reward for the unrestricted player resulting from the restricted equilibrium is considerablyhigher than that of the Nash equilibrium (0 to −14/9), as the player takes advantage of the other’slimitation. Player 1: Unlimited Player 2: Unlimited E(Reward) = 1.56 E(Reward) = -1.56 1 1 P(4-0) P(3-0) P(3-1) P(2-1) 0.8 P(2-2) 0.8 P(1-2) P(1-3) P(0-3) P(0-4) 0.6 0.6 4-0, 0-4 2-1, 1-2 0.4 0.4 Action Probabilities 0.2 Action Probabilities 0.2 2-2 3-0, 0-3 0 3-1, 1-3 0 0 400000 800000 1.2e+06 0 400000 800000 1.2e+06 Number of Trials Number of Trials Figure 6: WoLF-PHC in Colonel Blotto. Neither player is limited. There is one final observations about these results. In Section 3 we discussed the use of rational limitations to speed learning. Even in these very small single-state problems, our results demon-strate that limitations can be used to speed learning. Notice that convergence occurs more quicklyin the limited situations where one of the players has less parameters and less freedom in its policyspace. In the case of the Colonel Blotto game this is a dramatic difference. (Notice the x-axes differby a factor of four!) In games with very large state spaces this will be even more dramatic. Agentswill need to make use of rational limitations to do any learning at all, and similarly the less restrictedagents will likely be able to benefit from taking advantage of the more limited learners.4 4. Strictly speaking, restricted agents are not always at a disadvantage. In zero-sum games, such as those for which results are presented here, and team-games where all players rewards are identical, restrictions can never benefit theagent. In general-sum games, such as Bach or Stravinsky from Table 1(b), this is not true. In this case, an imposed 374

EXISTENCE OF MULTIAGENT EQUILIBRIA Player 2: Limited Player 2: Unlimited E(Reward) = 0 E(Reward) = 0 1 1 P(4-0) P(3-0) P(3-1) P(2-1) 0.8 P(2-2) 0.8 P(1-2) P(1-3) P(0-3) P(0-4) 0.6 0.6 3-0, 0-3 0.4 0.4 3-1, 2-2, Action Probabilities 0.2 1-3 Action Probabilities 0.2 4-0, 0-4 0 0 2-1, 1-2 0 100000 200000 300000 0 100000 200000 300000 Number of Trials Number of Trials Figure 7: WoLF-PHC in Colonel Blotto. Player one is forced to randomly allot two regiments. 6. Related Work It is commonly recognized that humans are not capable of being fully rational. There is a wholebody of work in the social sciences, particularly economics and game theory, relating to understand-ing humans’ bounded rationality (Simon, 1982). Bounded rationality is also a reality for artificialagents. As we discussed in Section 3, agent limitations are unavoidable in realistic problems. As aresult, there has been much interesting work in both the fields of game theory and artificial intelli-gence on how limitations effect agent behavior. A useful distinction presented by Simon (1976) is the difference between substantive rationality and procedural rationality. Substantive rationality is primarily concerned with whether a resultingbehavior is appropriate for the achievement of specific goals given specific constraints. Proceduralrationality is concerned with the behavior that is the result of appropriate deliberation. The dif-ference is in whether it is the outcome that is appropriate to the goals or the deliberation that isappropriate to the goals. The approach taken in this work is primarily procedural. We take the variety of well motivated learning procedures, both equilibrium learners and best-response learners, as given decision mech-anisms. We also take the variety of mechanisms for making learning tractable (e.g., abstraction andapproximation) as given mechanisms. The focus of our investigation is to see what is the resultingbehavior of these combined mechanisms. In particular, do equilibria exist for equilibrium learnersto learn and to which best-response learners can converge? This differs from other investigationswhich we’ll briefly review here. 6.1 Game Theory Much of the research in game theory on modeling bounded rationality, has focused on the inves-tigation of procedural rationality (Rubinstein, 1998). This approach is crucial given the goal ofmodeling, explaining, and predicting human behavior. An example of this approach is that taken byOsborne and Rubinstein (1997). They define a decision procedure for mapping actions to discreteoutcomes, rather than distributions over outcomes. This mapping is generated stochastically ac- restriction, such as the row player being forced to select its first action, can actually improve the agent’s equilibriumvalue (see also Littman & Stone, 2001; Gilboa & Samet, 1989). 375

BOWLING & VELOSO cording to the actual distribution of outcomes for given actions. Given a mapping, the agent selectsthe action whose associated outcome is preferred. Given the decision procedure, they investigatewhether an equilibrium fixed point exists and also how well such a procedure corresponds to humanbehavior in particular games. Our work follows a similar approach of considering the outcome of using particular learning procedures. On the other hand, we are not dealing with the problem of human behavior but ratherthe behavior of artificial learning agents and how imposed limitations effect interactions betweenagents. By generalizing these effects into the models of implicit games and restricted policy spaces,we then ask similar questions. Do equilibrium fixed points exist and what is the expected outcomeof employing such learning rules? This simplifies our problem since we’re only analyzing theoutcomes of known machine learning techniques rather than explaining human behavior. In addition to the work along the lines of procedural rationality, there is a large body of tradi- tional work following along the lines of substantive rationality. The particular goals of this approachare to model the agents’ limitations in the game itself and find optimal behavior given the constraintsof these limitations. These results include Bayesian games where other agents’ utilities are uncertaina priori. Extensive form games with imperfect information is also an interesting model accountingfor situations of state aliasing. Finally, strategies as finite state machines in repeated games tryto account for limitations on computation and memory. A summary of this work can be found instandard game theory texts (e.g., Osborne & Rubinstein, 1994)). 6.2 Artificial Agents While risking over generalizing, much of the work in the field of artificial agents that examines theeffect of limitations on agent behavior has focused more on substantive rationality. In particular,a common thread of this work is the development of models of interaction and beliefs along withefficient algorithms for computing optimal or near-optimal behavior given these models. This ap-proach focuses on what is appropriate behavior rather than what is the result of appropriate decisionprocedures. We will briefly consider some of the recent work that consider limitations and agentbehavior, and give some comparison to the approach and results we have taken in this work. 6.2.1 COMPUTATIONAL LIMITATIONS Larson and Sandholm (2001) consider the problem of reasoning in the presence of agents withlimited computational resources. Specifically, they examine a two agent setting where agents haveintractable individual problems, but may gain an improvement on their solutions by solving the alsointractable joint problem. An agent’s strategy determines how to invest its limited computationalresources to improve its solution for its own problem, its opponent’s problem, or the joint problem.When the deadline is reached, the agents must decide whether to simply use its own computedsolution or bargain over a joint solution. Among other results, they demonstrate that the concept ofNash equilibrium can be extended to this setting, where the decision of computation is part of theagent’s strategy. They call this concept deliberation equilibria and present algorithms for findingthese equilibria solutions in certain situations. The concept of deliberation equilibria has many similarities to the restricted equilibria concept we presented in Section 4. The core idea is to incorporate the agents’ limitations into their choicesof strategies. Restricted equilibria, though, incorporate limitations as restrictions on the agents’choices, rather than incorporating limitations by adding computational decisions into their choices. 376

EXISTENCE OF MULTIAGENT EQUILIBRIA This difference, though, is a fundamental one as the two approaches solve different problems. Thecomplexity of decision-making in the Larson and Sandholm setting is directly due to the compu-tational limitations placed on the agents. In Chapter 3, the complexity of decision-making is partof the task, and agent limitations are required in order to simplify the problem and make learningtractable. This core difference also explains the difference in results, since restricted equilibria arenot guaranteed to exist in general. 6.2.2 GAME-TREE SEARCH WITH OPPONENT KNOWLEDGE Jansen’s Ph.D. thesis (1992) examined the issue of whether “optimal” play in the game-theoreticsense is in fact always “optimal”. In the case of a fallible opponent that may make an incor-rect evaluation of future states, the answer is naturally no. He proposed the idea of speculativeplay, which uses a prediction function for the opponent in determining its opponent’s moves whensearching a game-tree. The thesis focuses on deterministic, alternating-move, perfect informationgames, specifically a subgame of chess. He demonstrated that using speculative play with end-gamedatabases as knowledge of the opponent has the potential for improving play against humans. Inthis work, limitations on human rationality are recognized, and so game-theoretic optimal play isdiscarded for optimization against particulars of human play. 6.2.3 PLANNING WITH OPPONENT MODELS. Riley and Veloso (2000, 2002) present similar ideas for using opponent models to make decisions ina less abstract domain. They examine the problem of adapting to a specific opponent in simulatedrobotic soccer (Noda, Matsubara, Hiraki, & Frank, 1998). This work recognizes that modelingopponents is not a new idea and there has been a great deal of related research in the field ofplan recognition. The work, though, contributes an actual technique for using opponent models toimprove the team behavior, hence learning. The team designer provides a set of opponent models of possible behavior of the other team. These models define a distribution of players’ positions as a function of the ball’s trajectory andtheir initial positions. During play, the actual opponent is matched to one of these models throughBayesian updating from actual observations. The model is then used to plan during set-play situa-tions, such as throw-ins or goal-kicks. The planning uses hill-climbing on the ball-trajectory to finda trajectory with low probability of being intercepted by the other team given the matched opponentmodel of their movement. The planner then determines the positions of players and their actionsso as to carry out the planned ball trajectory. This is compiled into a simple temporal networkwhich can be executed in a distributed manner by the players. They demonstrate through controlledempirical experiments that the plans generated depend on the opponent model that is used duringplanning. The effectiveness of this technique in actual play against a real opponents, that may ormay not match well with the models, is still unknown. This technique allows a team of agents to adapt their behavior in a principled manner to better address a specific team. This does not strictly address the problem of concurrent learning in amultiagent setting, which is the primary focus of this work. The critical assumption is that theopponents’ play is not changing over time. The authors mention this briefly and provide a smallchange to their Bayesian update through the addition of weight sharing. A small weight is addedto each model to keep its probability bounded from zero. This effectively introduces a non-zeroprobability that the other team may change its behavior after each observation. There are no results, 377

BOWLING & VELOSO though, of the outcome of this approach if the other agents are also adapting. For example, it isnot clear how robust the algorithm is to the other team employing a similar or identical approach todefending against a set-play. Will adaptation oscillate or is there an equilibrium point? How wouldsuch oscillation effect performance? 6.2.4 LIMITATIONS AS FINITE AUTOMATA. Carmel and Markovitch (1996, 1997) examined a model-based approach to multiagent learning.They introduced algorithms for efficiently learning a model of the opponent as a deterministic fi-nite automaton (DFA). Using this learned model, they then compute the optimal best response au-tomaton. They examined this technique in the situation of repeated games and showed that theirtechnique can learn effectively against a variety of DFA opponents. They also showed that theiralgorithm learned much more quickly (i.e., was more data efficient) than Q-learning, a model-freereinforcement learner. It is unknown how this compares with a model-based reinforcement learnersuch as prioritized sweeping (Moore & Atkeson, 1993). This technique also does not address the problem of concurrent learning in a multiagent envi- ronment. Although a learning rule can be be modeled by a DFA, the number of states that a rulerequires may be huge. For example, Q-learning with a decaying learning rate would be a completelyintractable DFA. Also, the model-based learning agent is required to be strictly more powerful thanits opponent, since its learning rule both learns the opponent’s DFA and also generates its own. It isunclear whether any sort of equilibrium fixed point would be reached or even exists if two model-based DFA learners played each other. It is also not certain how DFA modeling scales to multiplestate stochastic games. 6.2.5 SUMMARY As we stated earlier, this body of work focuses primarily on substantive rationality: what is appro-priate behavior given these assumptions? In addition, most of the approaches include the assumptionof an asymmetric superiority over the other decision making agents. In the case of constructing DFAmodels of opponent behavior, there is an implicit assumption that the opponent behavior is simpleenough to be constructed by the adapting agent. In the case of planning with opponent models, itis assumed that the opponent’s behavior is not changing and so the learned model is accurate inpredicting their future behavior. Although the work recognizes that other agents have limitations,they do so in a way that expects other agents to be strictly inferior. There are no results, empirical ortheoretical, of applying these model-based techniques to situations of self-play, where all agents usethe same adaptation procedure. Larson and Sandholm’s work on computational limitations allowsfor limitations on the part of both agents. They do so by adding the complexity of computationaldecisions into the strategy space. Hence, their approach does not make this assumption. 7. Conclusion Nash equilibrium is a crucial concept in multiagent learning both for algorithms that directly learnan equilibrium and algorithms that learn best-responses. Agent limitations, though, are unavoidablein realistic settings and can prevent agents from playing optimally or playing the equilibrium. Inthis article, we introduce and answer two critical questions: Do equilibria exist when agents have 378

EXISTENCE OF MULTIAGENT EQUILIBRIA limitations? Not necessarily. Are there classes of domains or classes of limitations where equilibriaare guaranteed to exist? Yes. We have proven that for some classes of stochastic games and agent limitations equilibria are guaranteed to exist. We have also given counterexamples that help understand why equilibria do notexist in the general case. In addition to these theoretical results, we demonstrate the implications ofthese results in a real learning algorithm. We present empirical results that show that learning withlimitations is possible, and equilibria under limitations is relevant. There are two main future directions for this work. The first is continuing to explore the the- oretical existence of equilibria. We have proven the existence of equilibria for some interestingclasses of stochastic games and restricted policy spaces. We have also established in Lemma 1 akey criterion, the convexity of best-response sets, as the basis for further theoretical results. It isstill unknown whether there are other general classes of games and limitations for which equilibriaexist. The second direction is the practical application of multiagent learning algorithms to real prob- lems when agents have real limitations. The theoretical results we have presented and the empiricalresults on simple matrix games, give a foundation as well as encouraging evidence. There are stillchallenging questions to answer. How do specific limitations map on to the models that we exploredin this article? Do equilibria exist for stochastic games and limitations typically faced in practice?What is the goal of learning when equilibria do not exist? This article lays the groundwork for ex-ploring these and other important learning issues that are relevant to realistic multiagent scenarios. Acknowledgments The authors are indebted to Martin Zinkevich for numerous insights as well as finding the exampledemonstrating Fact 5. The authors are grateful to Craig Boutilier and Tuomas Sandholm for helpfuldiscussions. In addition, we owe gratitude to the team of anonymous reviewers for their careful eyesand invaluable comments. This research was sponsored by the United States Air Force under Cooperative Agreements No. F30602-00-2-0549 and No. F30602-98-2-0135. The views and conclusions contained in thisdocument are those of the authors and should not be interpreted as necessarily representing theofficial policies or endorsements, either expressed or implied, of the Defense Advanced ResearchProjects Agency (DARPA), the Air Force, or the US Government. Appendix A. Proof of Theorem 1 Proof. One might think of proving this theorem by appealing to implicit games as was used inFact 4. In fact, if Πi was a convex hull of a finite number of strategies, this would be the case. Inorder to prove it for any convex Πi we apply Rosen’s theorem about the existence of equilibria inconcave games (Rosen, 1965). In order to use this theorem we need to show the following: 1. Πi is non-empty, compact, and convex. 2. V π as a function over π ∈ Π is continuous. i π ,π−i 3. For any π ∈ Π, the function over π ∈ Π i is concave. i i defined as Vi 379

BOWLING & VELOSO Condition 1 is by assumption. In matrix games, where S = s0, we can simplify the definition ofa policy’s value from Equations 1 and 2. For discounted reward we get, 1 V π i = Ri(s, a)Πn 1 − γ i=1πi(s0, ai), (25) a∈A For average reward, the value is just the value of the one-shot matrix game given the players’policies. This is equivalent to setting γ = 0 in the Equation 25. Equation 25 shows that forboth reward formulations the value is a multilinear function with respect to the joint policy and,therefore, is continuous. So Condition 2 is satisfied. Observe that by fixing the policies for all butone player, Equation 25 becomes a linear function over the remaining player’s policy and so is alsoconcave, satisfying Condition 3. Therefore Rosen’s theorem applies and this game has a restrictedequilibrium. Appendix B. Proof of Lemma 1 Proof. The proof relies on Kakutani’s fixed point theorem. We first need to show some facts aboutthe restricted best-response function. First, remember that Πi is non-empty and compact. Also, notethat the value (with both discounted and average reward) to a player at any state of a joint policyis a continuous function of that joint policy (Filar & Vrieze, 1997, Theorem 4.3.7 and Lemma5.1.4). Therefore, from basic analysis (Gaughan, 1993, Theorem 3.5 and Corollary 3.11), the set ofmaximizing (or optimal) points must be a non-empty and compact set. So BRi(π−i) is non-emptyand compact. Define the set-valued function, F : Π → Π, F (π) = ×n i=1 BRi(π−i). We want to show F has a fixed point. To apply Kakutani’s fixed point theorem we must show thefollowing conditions to be true, 1. Π is a non-empty, compact, and convex subset of a Euclidean space, 2. F (π) is non-empty, 3. F (π) is compact and convex, and 4. F is upper hemi-continuous. Since the Cartesian product of non-empty, compact, and convex sets is non-empty, compact, andconvex we have condition (1) by the assumptions on Πi. By the facts of BRi from above and thelemma’s assumptions we similarly get conditions (2) and (3). What remains is to show condition (4). Consider two sequences xj → x ∈ Π and yj → y ∈ Π such that ∀j yj ∈ F (xj). It must be shown that y ∈ F (x), or just yi ∈ BRi(x). Let v be yi’s valueagainst x. For the purposes of contradiction assume that there exists a y with higher value, v , than i yi. Let δ = v − v. Since the value function is continuous, we can choose a point in the sequencewhere the policies are close enough to their limit points that they make an arbitrarily small changeto the value. Specifically, let N be large enough that the value of y against xN differs from v by at i most δ/4, and the value of yi against xN differs from v by at most δ/4, and the value of yN against i 380

EXISTENCE OF MULTIAGENT EQUILIBRIA xN differs from yi against xN by at most δ/4. Adding all of these together, we have a point in thesequence yN whose value against xN is less than the value of y against xN . So yN / ∈ BR i i i i(xN ), and therefore yN / ∈ F (xN ) creating our contradiction. The comparison of values of these various four joint policies is shown in Figure 8 and helps illustrate the resulting contradiction. (yi, x) (yi, xN ) (yN , xN ) (y , xN ) (y , x) i i i v v δ/4 δ/4 −δ/4 Increasing Value Figure 8: An illustration of the demonstration by contradiction that the best-response functions are upper hemi-continuous. We can now apply Kakutani’s fixed point theorem. So there exists π ∈ Π such that π ∈ F (π). This means πi ∈ BRi(π−i), and therefore this is a restricted equilibrium. References Baxter, J., & Bartlett, P. L. (2000). Reinforcement learning in POMDP’s via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 41–48,Stanford University. Morgan Kaufman. Bowling, M., & Veloso, M. (1999). Bounding the suboptimality of reusing subproblems. In Pro- ceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 1340–1345, Stockholm, Sweden. Morgan Kaufman. An earlier version appeared in the Proceedingsof the NIPS Workshop on Abstraction in Reinforcement Learning, 1998. Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136, 215–250. Carmel, D. (1997). Model-based Learning of Interaction Strategies in Multi-agent Systems. Ph.D. thesis, Computer Science Department, Technion. Carmel, D., & Markovitch, S. (1996). Learning models of intelligent agents. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, Menlo Park, CA. AAAI Press. Claus, C., & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp.746–752, Menlo Park, CA. AAAI Press. Filar, J., & Vrieze, K. (1997). Competitive Markov Decision Processes. Springer Verlag, New York. Fink, A. M. (1964). Equilibrium in a stochastic n-person game. Journal of Science in Hiroshima University, Series A-I, 28, 89–93. Gaughan, E. D. (1993). Introduction to Analysis, 4th Edition. Brooks/Cole Publishing Company, Pacific Grove, CA. Gilboa, I., & Samet, D. (1989). Bounded versus unbounded rationality: The tyranny of the weak. Games and Economic Behavior, 213–221. 381

BOWLING & VELOSO Gintis, H. (2000). Game Theory Evolving. Princeton University Press. Greenwald, A., & Hall, K. (2002). Correlated Q-learning. In Proceedings of the AAAI Spring Symposium Workshop on Collaborative Learning Agents. Hauskrecht, M., Meuleau, N., Kaelbling, L. P., Dean, T., & Boutilier, C. (1998). Hierarchical so- lution of Markov decision processes using macro-actions. In Proceedings of the FourteenthAnnual Conference on Uncertainty in Artificial Intelligence (UAI-98). Hu, J., & Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning,pp. 242–250, San Francisco. Morgan Kaufman. Jaakkola, T., Singh, S. P., & Jordan, M. I. (1994). Reinforcement learning algorithm for partially ob- servable Markov decision problems. In Advances in Neural Information Processing Systems6. MIT Press. Jansen, P. J. (1992). Using Knowledge About the Opponent in Game-tree Search. Ph.D. thesis, Computer Science Department, Carnegie Mellon University. Kuhn, H. W. (1953). Extensive games and the problem of information. In Kuhn, H. W., & Tucker, A. W. (Eds.), Contributions to the Theory of Games II, pp. 193–216. Princeton UniversityPress. Reprinted in (Kuhn, 1997). Kuhn, H. W. (Ed.). (1997). Classics in Game Theory. Princeton University Press. Larson, K., & Sandholm, T. (2001). Bargaining with limited computation: Deliberation equilibrium. Artificial Intelligence, 132(2), 183–217. Littman, M. (2001). Friend-or-foe Q-learning in general-sum games. In Proceedings of the Eigh- teenth International Conference on Machine Learning, pp. 322–328. Morgan Kaufman. Littman, M., & Stone, P. (2001). Leading best-response strategies in repeated games. In Seven- teenth Annual International Joint Conference on Artificial Intelligence Workshop on Eco-nomic Agents, Models, and Mechanisms. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 157–163.Morgan Kaufman. Littman, M. L., & Szepesv´ari, G. (1996). A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the 13th International Conference on Machine Learning,pp. 310–318, Bari, Italy. Morgan Kaufmann. Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and em- pirical results. Machine Learning, 22, 159–196. Mataric, M. J. (1994). Reward functions for accelerated learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco. Morgan Kaufman. McGovern, A., & Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on MachineLearning, pp. 361–368. Morgan Kaufman. Mertens, J. F., & Neyman, A. (1981). Stochastic games. International Journal of Game Theory, 10, 53–56. 382

EXISTENCE OF MULTIAGENT EQUILIBRIA Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, 103–130. Nash, Jr., J. F. (1950). Equilibrium points in n-person games. PNAS, 36, 48–49. Reprinted in (Kuhn, 1997). Ng, A. Y., Parr, R., & Koller, D. (1999). Policy search via density estimation. In Advances in Neural Information Processing Systems 12, pp. 1022–1028. MIT Press. Noda, I., Matsubara, H., Hiraki, K., & Frank, I. (1998). Soccer server: a tool for research on multi- agent systems. Applied Artificial Intelligence, 12, 233–250. Osborne, M. J., & Rubinstein, A. (1994). A Course in Game Theory. The MIT Press. Osborne, M. J., & Rubinstein, A. (1997). Games with procedurally rational players. Working papers 9702, McMaster University, Department of Economics. Riley, P., & Veloso, M. (2000). On Behavior Classification in Adversarial Environments, pp. 371– 380. Springer-Verlag. Riley, P., & Veloso, M. (2002). Planning for distributed execution through use of probabilistic opponent models. In Proceedings of the Sixth International Conference on AI Planning andScheduling, Toulouse, France. Robinson, J. (1951). An iterative method of solving a game. Annals of Mathematics, 54, 296–301. Reprinted in (Kuhn, 1997). Rosen, J. B. (1965). Existence and uniqueness of equilibrium points for concave n-person games. Econometrica, 33, 520–534. Rubinstein, A. (1998). Modeling Bounded Rationality. The MIT Press. Sen, S., Sekaran, M., & Hale, J. (1994). Learning to coordinate without sharing information. In Proceedings of the 13th National Conference on Artificial Intelligence. Shapley, L. S. (1953). Stochastic games. PNAS, 39, 1095–1100. Reprinted in (Kuhn, 1997). Simon, H. A. (1976). From substantive to procedural rationality. In Latis, S. J. (Ed.), Methods and Appraisals in Economics, pp. 129–148. Cambridge University Press, New York. Simon, H. A. (1982). Models of Bounded Rationality. MIT Press, Cambridge, MA. Singh, S., Jaakkola, T., Littman, M. L., & Szepesv´ari, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for re- inforcement learning with function approximation. In Advances in Neural Information Pro-cessing Systems 12, pp. 1057–1063. MIT Press. Sutton, R. S., Precup, D., & Singh, S. (1998). Intra-option learning about temporally abstract ac- tions. In Proceedings of the Fifteenth International Conference on Machine Learning, pp.556–564, San Francisco. Morgan Kaufman. Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In Pro- ceedings of the Tenth International Conference on Machine Learning, pp. 330–337, Amherst,MA. 383

BOWLING & VELOSO Uther, W., & Veloso, M. (1997). Adversarial reinforcement learning. Tech. rep., Carnegie Mellon University. Unpublished. Uther, W. T. B. (2002). Tree Based Hierarchical Reinforcement Learning. Ph.D. thesis, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. Available as technical re-port CMU-CS-02-169. Vrieze, O. J. (1987). Stochastic Games with Finite State and Action Spaces. No. 33 in CWI Tracts. Centrum voor Wiskunde en Informatica. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, King’s College, Cam- bridge, UK. Williams, R. J., & Baird, L. C. (1993). Tight performance bounds on greedy policies based on imperfect value functions. Technical report, College of Computer Science, Northeastern Uni-versity. 384