M. Bowling and M. Veloso
Volume 22, 2004
Links to Full Text:
Journal of Artiï¬cial Intelligence Research 22 (2004) 353â384 Submitted 8/03; published 12/04 Existence of Multiagent Equilibria with Limited Agents Michael Bowling BOWLING@CS.UALBERTA.CA Department of Computing Science, University of Alberta Edmonton, Alberta, Canada, T6G 2E8Manuela Veloso MMV@CS.CMU.EDU Computer Science Department, Carnegie Mellon University Pittsburgh PA, 15213-3891 Abstract Multiagent learning is a necessary yet challenging problem as multiagent systems become more prevalent and environments become more dynamic. Much of the groundbreaking work in this areadraws on notable results from game theory, in particular, the concept of Nash equilibria. Learnersthat directly learn an equilibrium obviously rely on their existence. Learners that instead seek toplay optimally with respect to the other players also depend upon equilibria since equilibria areï¬xed points for learning. From another perspective, agents with limitations are real and common.These may be undesired physical limitations as well as self-imposed rational limitations, such asabstraction and approximation techniques, used to make learning tractable. This article explores theinteractions of these two important concepts: equilibria and limitations in learning. We introducethe question of whether equilibria continue to exist when agents have limitations. We look at thegeneral effects limitations can have on agent behavior, and deï¬ne a natural extension of equilibriathat accounts for these limitations. Using this formalization, we make three major contributions: (i)a counterexample for the general existence of equilibria with limitations, (ii) sufï¬cient conditionson limitations that preserve their existence, (iii) three general classes of games and limitations thatsatisfy these conditions. We then present empirical results from a speciï¬c multiagent learningalgorithm applied to a speciï¬c instance of limited agents. These results demonstrate that learningwith limitations is feasible, when the conditions outlined by our theoretical analysis hold. 1. Introduction Multiagent domains are becoming more prevalent as more applications and situations require multi-ple agents. Learning in these systems is as useful and important as in single-agent domains, possiblymore so. Optimal behavior in a multiagent system may depend on the behavior of the other agents.For example, in robot soccer, passing the ball may only be optimal if the defending goalie is goingto move to block the playerâs shot and no defender will move to intercept the pass. This challengeis complicated by the fact that the behavior of the other agents is often not predictable by the agentdesigner, making learning and adaptation a necessary component of the agent itself. In addition, thebehavior of the other agents, and therefore the optimal response, can be changing as they also adaptto achieve their own goals. Game theory provides a framework for reasoning about these strategic interactions. The game theoretic concepts of stochastic games and Nash equilibria are the foundation for much of the recentresearch in multiagent learning, (e.g., Littman, 1994; Hu & Wellman, 1998; Greenwald & Hall,2002; Bowling & Veloso, 2002). A Nash equilibrium deï¬nes a course of action for each agent, suchthat no agent could beneï¬t by changing their behavior. So, all agents are playing optimally, giventhat the other agents continue to play according to the equilibrium. c 2004 AI Access Foundation. All rights reserved.
BOWLING & VELOSO From the agent design perspective, optimal agents in realistic environments are not practical. Agents are faced with all sorts of limitations. Some limitations may physically prevent certainbehavior, e.g., a soccer robot that has traction limits on its acceleration. Other limitations are self-imposed to help guide an agentâs learning, e.g., using a subproblem solution for advancing theball down the ï¬eld. In short, limitations prevent agents from playing optimally and possibly fromfollowing a Nash equilibrium. This clash between the concept of equilibrium and the reality of limited agents is a topic of critical importance. Do equilibria exist when agents have limitations? Are there classes of domainsor classes of limitations where equilibria are guaranteed to exist? This article introduces these ques-tions and provides concrete answers. In particular, we introduce two models of limitations: implicitgames and restricted policy spaces. We use these models to demonstrate in two key counterexam-ples the threat that limitations pose to the existence of equilibria. We reï¬ne a sufï¬cient condition forthe existence of equilibria, and use to prove existence in three classes of games and limitations. Thisanalysis is peppered with examples of applications of these results. We conclude with an empiricalexample of learning with limitations, and a brief survey of related work. The article is organized as follows. Section 2 introduces the stochastic game framework as a model for multiagent learning. We deï¬ne the game theoretic concept of equilibrium, and examinethe dependence of current multiagent learning algorithms on this concept. Section 3 enumerates andclassiï¬es some common agent limitations and presents two formal models incorporating the effectsof limitations into the stochastic game framework. Section 4 is the major contribution of the article,presenting both proofs of existence for certain domains and limitations as well as counterexamplesfor others. Section 5 gives an example of how these results affect and relate to one particular multi-agent learning algorithm. We present interesting results of applying a multiagent learning algorithmin a setting with limited agents. Section 6 compares our approach with other investigations of agentlimitations in the ï¬elds of game theory and artiï¬cial intelligence. Finally, Section 7 concludes withimplications of this work and future directions. 2. Stochastic Games A stochastic game is a tuple (n, S, A1...n, T, R1...n), where, ⢠n is the number of agents, ⢠S is a set of states, ⢠Ai is the set of actions available to agent i with A being the joint action space, A1 à . . . à An, ⢠T is a transition function, S à A à S â [0, 1], such that, âs â S âa â A T (s, a, s ) = 1, s âS ⢠and Ri is a reward function for the ith agent, S à A â R. This framework is very similar to the framework of a Markov Decision Process (MDP). Instead of asingle agent, though, there are multiple agents whose joint action determines the next state and re-wards to the agents. The goal of an agent, as in MDPs, is to maximize its long-term reward. Notice, 354
EXISTENCE OF MULTIAGENT EQUILIBRIA though, that each agent has its own independent reward function that it is seeking to maximize. Thegoal of maximizing âlong-term rewardâ will be made formal in Section 2.2. Stochastic games can also be thought of as an extension of the concept of matrix games to mul- tiple states. Two common matrix games are in Figure 1. In these games there are two players; oneselects a row and the other selects a column of the matrix. The entry of the matrix they jointly selectdetermines the payoffs. Rock-Paper-Scissors in Figure 1(a) is a zero-sum game, where the columnplayer receives the negative of the row playerâs payoff. In the general case (general-sum games;e.g., Bach or Stravinsky in Figure 1(b)) each player has an independent matrix that determines itspayoff. Stochastic games, then, can be viewed as having a matrix game associated with each state.The immediate payoffs at a particular state are determined by the matrix entries Ri(s, a). Afterselecting actions and receiving their rewards from the matrix game, the players are transitioned toanother state and associated matrix game, which is determined by their joint action. So stochas-tic games contain both MDPs (when n = 1) and matrix games (when |S| = 1) as subsets of theframework.  0 â1 1  2 0 Rr(s0, ·) = 1 0 â1 ï£ ï£¸ Rr(s0, ·) = 0 1 â1 1 0  0 1 â1  1 0 Rc(s0, ·) = â1 0 1 R ï£ ï£¸ c(s0, ·) = 0 2 1 â1 0 (a) Rock-Paper-Scissors (b) Bach or Stravinsky Table 1: Two example matrix games. 2.1 Policies Unlike in single-agent settings, deterministic policies, which associate a single action with everystate, can often be exploited in multiagent settings. Consider Rock-Paper-Scissors as shown inFigure 1(a). If the column player were to play any action deterministically, the row player couldwin a payoff of one every time. This fact requires us to consider stochastic strategies and policies. Astochastic policy for player i, Ïi : S â P D(Ai), is a function that maps states to mixed strategies,which are probability distributions over the playerâs actions. We use the notation Î i to be the set ofall possible stochastic policies available to player i, and Î = Î 1 à . . . à Πn to be the set of jointpolicies of all the players. We also use the notation Ïâi to refer to a particular joint policy of allthe players except player i, and Î âi to refer to the set of such joint policies. Finally, the notation Ïi, Ïâi refers to the joint policy where player i follows Ïi while the other players follow their policy from Ïâi. In this work, we make the distinction between the concept of stochastic policies and mixtures of policies. A mixture of policies, Ïi : P D(S â Ai), is a probability distribution over the set of deter-ministic policies. An agent following a mixture of policies selects a deterministic policy accordingto its mixture distribution at the start of the game and always follows this policy. This differenceis similar to the distinction between mixed strategies and behavioral strategies in extensive-formgames (Kuhn, 1953). Our work focuses on stochastic policies as they (i) are a more compact repre- 355
BOWLING & VELOSO sentation requiring |Ai||S| parameters instead of |Ai||S| parameters to represent the complete spaceof policies, (ii) are the common notion of stochastic policies in single-agent behavior learning, (e.g.,Jaakkola, Singh, & Jordan, 1994; Sutton, McAllester, Singh, & Mansour, 2000; Ng, Parr, & Koller,1999), and (iii) do not require the artiï¬cial commitment to a single deterministic policy at the startof the game, which can be difï¬cult to understand within a learning context. 2.2 Reward Formulations There are a number of possible reward formulations in single-agent learning that deï¬ne the agentâsnotion of optimality. These formulations also apply to stochastic games. We will explore two ofthese reward formulations in this article: discounted reward and average reward. Although thiswork focuses on discounted reward, many of our theoretical results also apply to average reward. Discounted Reward. In the discounted reward formulation, the value of future rewards is dimin-ished by a discount factor γ. Formally, given a joint policy Ï for all the agents, the value to agent iof starting at state s â S is, â V Ï | i (s) = γtE rti s0 = s, Ï , (1) t=0 where rt is the immediate reward to player i at time t with the expectation conditioned on s as the i initial state and the players following the joint policy Ï. In our formulation, we will assume an initial state, s0 â S, is given and deï¬ne the goal of each agent i as maximizing V Ï(s i 0). This formulation differs from the usual goal in MDPs and stochastic games, which is to simultaneously maximize the value of all states. We require this weaker goalsince our exploration into agent limitations makes simultaneous maximization unattainable.1 Thissame distinction was required by Sutton and colleagues (2000) in their work on parameterizedpolicies, one example of an agent limitation. Average Reward. In the average reward formulation all rewards in the sequence are equallyweighted. Formally, this corresponds to, T 1 V Ï | i (s) = lim E rti s0 = s, Ï , (2) T ââ T t=0 with the expectation deï¬ned as in Equation 1. As is common with this formulation, we assume thatthe stochastic game is ergodic. A stochastic game is ergodic if for all joint policies any state canbe reached in ï¬nite time from any other state with non-zero probability. This assumption makes thevalue of a policy independent of the initial state. Therefore, âs, s â S V Ï i (s) = V Ï i (s ). So any policy that maximizes the average value from one state maximizes the average value fromall states. These results along with more details on the average reward formulation for MDPs aresummarized by Mahadevan (1996). 1. This fact is demonstrated later by the example in Fact 5 in Section 4. In this game with the described limitation, if the column player randomizes among its actions, then the row player cannot simultaneously maximize the value ofthe left and right states. 356
EXISTENCE OF MULTIAGENT EQUILIBRIA For either formulation we will use the notation V Ï to refer to the value of the joint policy Ï to agent i i, which in either formulation is simply V Ï(s i 0), where s0 can be any arbitrary state for the average reward formulation. 2.3 Best-Response and Equilibria Even with the concept of stochastic policies and well-deï¬ned reward formulations, there are, ingeneral, still no optimal policies that are independent of the other playersâ policies. We can, though,deï¬ne a notion of best-response. Deï¬nition 1 For a game, the best-response function for player i, BRi(Ïâi), is the set of all policiesthat are optimal given the other player(s) play the joint policy Ïâi. A policy Ïi is optimal given Ïâiif and only if, Ï ,Ï â Ï â Ï â i,Ïâi ⥠i i i Î i V V . i i The major advancement that has driven much of the development of game theory, matrix games,and stochastic games is the notion of a best-response equilibrium, or Nash equilibrium (Nash, Jr.,1950). Deï¬nition 2 A Nash equilibrium is a joint policy, Ïi=1...n, with âi = 1, . . . , n Ïi â BRi(Ïâi). Basically, an equilibrium is a policy for each player where each is playing a best-response to theother playersâ policies. Hence, no player can do better by changing policies given that all the otherplayers continue to follow the equilibrium policy. What makes the notion of an equilibrium inter-esting is that at least one, possibly many, exist in all matrix games and most stochastic games. Thiswas proven by Nash (1950) for matrix games, Shapley (1953) for zero-sum discounted stochasticgames, Fink (1964) for general-sum discounted stochastic games, and Mertens and Neyman (1981)for zero-sum average reward stochastic games. The existence of equilibria for general-sum averagereward stochastic games is still an open problem (Filar & Vrieze, 1997). In the Rock-Paper-Scissors example in Figure 1(a), the only equilibrium consists of each player playing the mixed strategy where all the actions have equal probability. In the Bach-or-Stravinskyexample in Figure 1(b), there are three equilibria. Two consist of both players selecting their ï¬rstaction or both selecting their second. The third involves both players selecting their preferred coop-erative action with probability 2/3, and the other action with probability 1/3. 2.4 Learning in Stochastic Games Learning in stochastic games has received much attention in recent years as the natural extensionof MDPs to multiple agents. The Minimax-Q algorithm (Littman, 1994) was the ï¬rst reinforcementlearning algorithm to explicitly consider the stochastic game framework. Developed for discountedreward, zero-sum stochastic games, the essence of the algorithm was to use Q-learning to learn thevalues of joint actions. The value of the next state was then computed by solving for the value ofthe unique Nash equilibrium of that stateâs Q-values. Littman and Szepesvari (1996) proved thatunder usual exploration requirements for both players, Minimax-Q would converge to the Nashequilibrium of the game, independent of the opponentâs play beyond the exploration requirement. 357
BOWLING & VELOSO Other algorithms have since been presented for learning in stochastic games. We will summarizethese algorithms by broadly grouping them into two categories: equilibrium learners and best-response learners. The main focus of this summarization is to demonstrate how the existence ofequilibria under limitations is a critical question to existing algorithms. Equilibrium Learners. Minimax-Q has been extended in many different ways. Nash-Q (Hu &Wellman, 1998), Friend-or-Foe-Q (Littman, 2001), and Correlated-Q (Greenwald & Hall, 2002) areall variations on this same theme with different restrictions on the applicable class of games or theexact notion of equilibrium learned. All of the algorithms, though, seek to learn an equilibriumof the game directly, by iteratively computing intermediate equilibria. Some of the algorithms havetheoretical guarantees of convergence to equilibrium play, others have empirical results to this effect.Like Minimax-Q, though, the policies learned are independent of the play of the other agents beyondexploration requirements. We refer collectively to these algorithms as equilibrium learners. Whatis important to observe is that these algorithms depend explicitly on the existence of equilibria. Ifan agent or agents were limited in such a way so that no equilibria existed then these algorithmswould be, for the most part, ill-deï¬ned.2 Best-Response Learners. Another class of algorithms is the class ofbest-response learners. Thesealgorithms do not explicitly seek to learn an equilibrium, instead seeking to learn best-responses tothe other agents. The simplest example of one of these algorithms is Q-learning (Watkins, 1989).Although not an explicitly multiagent algorithm, it was one of the ï¬rst algorithms applied to mul-tiagent environments (Tan, 1993; Sen, Sekaran, & Hale, 1994). Another less naive best-responselearning algorithm is WoLF-PHC (Bowling & Veloso, 2002), which varies the learning rate to ac-count for the other agents learning simultaneously. Other best-response learners include FictitiousPlay (Robinson, 1951; Vrieze, 1987), Opponent-Modeling Q-Learning (Uther & Veloso, 1997),Joint Action Learners (Claus & Boutilier, 1998), and any single-agent learning algorithm that learnsoptimal policies. Although these algorithms have no explicit dependence on equilibria, there is animportant implicit dependence. If algorithms that learn best-responses converge when playing eachother, then it must be to a Nash equilibrium (Bowling & Veloso, 2002). Therefore, all learning ï¬xedpoints are Nash equilibria. In the context of agent limitations, this means that if limitations causeequilibria to not exist, then best-response learners could not converge. This nonexistence of equilibria is exactly one of the problems faced by Q-learning in stochastic games. Q-learning is limited to deterministic policies. The deterministic policy limitation can, infact, cause no equilibria to exist (see Fact 1 in Section 4.) So there are many games for which Q-learning cannot converge when playing with other best-response learners, such as other Q-learners. In summary, both equilibrium and best-response learners depend, in some way, on the existence ofequilibria. The next section explores agent limitations that are likely to be faced in realistic learningsituations. In Section 4, we then present our main results examining the effect these limitations haveon the existence of equilibria, and consequently on both equilibrium and best-response learners. 2. It should be noted that in the case of Minimax-Q, the algorithm and solution concept are still well-deï¬ned. A policy that maximizes its worst-case value may still exist even if limitations make it such that no equilibria exist. But, thisminimax optimal policy might not necessarily be part of an equilibrium. Later, in Section 4, Fact 5, we present anexample of a zero-sum stochastic game and agent limitations where the minimax optimal policies exist but do notcomprise an equilibrium. 358
EXISTENCE OF MULTIAGENT EQUILIBRIA 3. Limitations The solution concept of Nash equilibrium depends on all the agents playing optimally. From theagent development perspective, agents have limitations that prevent this from being a reality. Theworking deï¬nition of limitation in this article is anything that can restrict the agent from learningor playing optimal policies. Broadly speaking, limitations can be classiï¬ed into two categories:physical limitations and rational limitations. Physical limitations are those caused by the interactionof the agent with its environment and are often unavoidable. Rational limitations are limitationsspeciï¬cally chosen by the agent designer to make the learning problem tractable, either in memoryor time. We brieï¬y explore some of these limitations informally before presenting a formal modelof limitations that attempts to capture their effect within the stochastic game framework. 3.1 Physical Limitations One obvious physical limitation is that the agent simply is broken. A mobile agent may cease tomove or less drastically may lose the use of one of its actuators preventing certain movements.Similarly, another agent may appear to be âbrokenâ when in fact the motion is simply outside itscapabilities. For example, in a mobile robot environment where the ârulesâ allow robots to moveup to two meters per second, there may be a robot that is not capable of reaching that speed. Anagent that is not broken, may suffer from poor control where its actions are not always carried outas desired, e.g., due to poorly tuned servos, inadequate wheel traction, or high system latency. Another common physical limitation is hardwired behavior. Most agents in dynamic domains need some amount of hard-wiring for fast response and safety. For example, many mobile robotplatforms are programmed to immediately stop if an obstacle is too close. These hardwired actionsprevent certain behavior by the agent, which is often unsafe but is potentially optimal. Sensing is a common area of agent limitations containing everything from noise to partial ob- servability. Here weâll mention just one broad category of sensing problems: state aliasing. Thisoccurs when an agent cannot distinguish between two different states of the world. An agent mayneed to remember past states and actions in order to properly distinguish the states, or may simplyexecute the same action in both states. 3.2 Rational Limitations Rational limitations are a requirement for agents to learn in even moderately sized problems. Tech-niques for making learning scale, which often focus on near-optimal solutions, continue to be pro-posed and investigated in single-agent learning. They are likely to be even more necessary in multi-agent environments which tend to have larger state spaces. We will examine a few speciï¬c methods. In domains with sparse rewards one common technique is reward shaping, (e.g., Mataric, 1994). A designer artiï¬cially rewards the agent for actions the designer believes to be progressing towardthe sparse rewards. These additional rewards can often speed learning by focusing exploration, butalso can cause the agent to learn suboptimal policies. For example, in robotic soccer moving theball down the ï¬eld is a good heuristic for goal progression, but at times the optimal goal-scoringpolicy is to pass the ball backward to an open teammate. Subproblem reuse also has a similar effect, where a subgoal is used in a portion of the state space to speed learning, (e.g., Hauskrecht, Meuleau, Kaelbling, Dean, & Boutilier, 1998; Bowling& Veloso, 1999). These subgoals, though, may not be optimal for the global problem and so prevent 359
BOWLING & VELOSO the agent from playing optimally. Temporally abstract options, either provided (Sutton, Precup, &Singh, 1998) or learned (McGovern & Barto, 2001; Uther, 2002), also enforce a particular sub-policy on a portion of the state space. Although in theory, the primitive actions are still availableto the agents to play optimal policies, in practice abstraction away from primitive actions is oftennecessary in large or continuous state spaces. Parameterized policies are receiving a great deal of attention as a way for reinforcement learning to scale to large problems, (e.g., Williams & Baird, 1993; Sutton et al., 2000; Baxter & Bartlett,2000). The idea is to give the learner a policy that depends on far less parameters than the entirepolicy space actually would require. Learning is then performed in the smaller space of parametersusing gradient techniques. Parameterized policies simplify and speed learning at the expense ofpossibly not being able to represent the optimal policy in the parameter space. 3.3 Models of Limitations This enumeration of limitations shows that there are a number and variety of limitations with whichagents may be faced, and they cannot be realistically avoided. In order to understand their impacton equilibria, we model limitations formally within the game theoretic framework. We introducetwo models that capture broad classes of limitations: implicit games and restricted policy spaces. Implicit Games. Limitations may cause an agent to play suboptimally, but it may be that theagent is actually playing optimally in a different game. If this new game can be deï¬ned within thestochastic game framework we call this the implicit game, in contrast to the original game called theexplicit game. For example, reward shaping adds artiï¬cial rewards to help guide the agentâs search.Although the agent is no longer learning an optimal policy in the explicit game, it is learning anoptimal policy of some game, speciï¬cally the game with these additional rewards added to thatagentâs Ri function. Another example is due to broken actuators preventing an agent from takingsome action. The agent may be suboptimal in the explicit game, while still being optimal in theimplicit game deï¬ned by removing these actions from the agentâs action set, Ai. We can formalizethis concept in the following deï¬nition. Deï¬nition 3 Given a stochastic game (n, S, A1...n, T, R1...n) the tuple (n, S, Ë A1...n, Ë T , Ë R1...n) is an implicit game if and only if it is itself a stochastic game and there exist mappings, Ïi : S Ã Ë Ai à Ai â [0, 1], such that, âs, s â S âË a Ë i â Ë Ai T (s, Ë ai , s ) = Î n , s ). i=1...n i=1Ïi(s, Ë ai, ai) T (s, ai i=1...n aâA The Ïiâs are mappings from the implicit action space into stochastic actions in the explicit actionspace. The Ïi mappings insure that, for each player, the choices available in the implicit game are a restric-tion of the choices in the explicit game. In other words, for any policy in the implicit game, thereexists an equivalent, possibly stochastic policy, in the explicit game. Reward shaping, broken actuators, and exploration can all be captured within this model. For reward shaping the implicit game is (n, S, A1...n, T, Ë R1...n), where Ë Ri adds the shaped reward into 360
EXISTENCE OF MULTIAGENT EQUILIBRIA the original reward, Ri. In this case the Ï mappings are just the identity, 1 if a = a Ïi(s, a, a ) = . 0 otherwise For the broken actuator example, let a0 â A â A i i be some null action for agent i and let ab i i be some broken action for agent i that under the limitation has the same effect as the null action. Theimplicit game, then, is (n, S, A1...n, Ë T , Ë R1...n), where, Ë T (s, a0, a T (s, a, s ) = i âi , s ) if ai = abi T (s, a, s ) otherwise Ë R(s, a0, a R(s, a) = i âi ) if ai = abi , R(s, a) otherwise and, a0 if a = ab Ï i i i(s, a) = . a otherwise For -exploration, as with broken actuators, only Ë T and Ë R need to be deï¬ned. They are just the combination of T or R with the transition probabilities or reward values of selecting a random action. Limitations captured by this model can be easily analyzed with respect to their effect on the existence of equilibria. We now restrict ourselves to discounted reward stochastic games, whereequilibria existence is known. Using the intuitive deï¬nition of an equilibrium as a joint policy suchthat âno player can do better by changing policies,â an equilibrium in the implicit game achievesthis deï¬nition for the explicit game. Since all discounted reward stochastic games have at least oneequilibrium, so must the implicit game, which is also in this class. This equilibrium for the implicitgame is then an equilibrium in the explicit game given that the agents are limited. On the other hand, many of the limitations described above cannot be modeled in this way. None of the limitations of abstraction, subproblem reuse, parameterized policies, or state aliasinglend themselves to being described by this model. This leads us to our second, and in many waysmore general, model of limitations. Restricted Policy Spaces. The second model is that of restricted policy spaces, which models limi-tations as restricting the agent from playing certain policies. For example, an -exploration strategyrestricts the player to policies that select all actions with some minimum probability. Parameter-ized policy spaces have a restricted policy space corresponding to the space of policies that can berepresented by their parameters. We can deï¬ne this formally. Deï¬nition 4 A restricted policy space for player i is a non-empty and compact subset, Î i â Î i, ofthe complete space of stochastic policies. The assumption of compactness3 may at ï¬rst appear strange, but it is not particularly limiting, andis critical for any equilibrium analysis. 3. Since Î i is a subset of a bounded set, the requirement that Î i is compact merely adds that the limit point of any sequence of elements from the set is also in the set. 361
BOWLING & VELOSO Physical Limitations Implicit Games Restricted Policies Broken ActuatorsHardwired BehaviorPoor ControlState AliasingRational Limitations Implicit Games Restricted Policies Reward Shaping or IncentivesExplorationState Abstraction/OptionsSubproblemsParameterized Policy Table 2: Common agent limitations. The column check-marks correspond to whether the limitation can be modeled straightforwardly using implicit games and/or restricted policy spaces. It should be straightforward to see that parameterized policies, state aliasing (with no memory), and subproblem reuse, as well as exploration and broken actuators all can be captured as a restric-tion on policies the agent can play. Therefore they can be naturally described as restricted policyspaces. On the other hand, the analysis of the existence of equilibria under this model is not at allstraightforward. Since restricted policy spaces capture most of the really interesting limitations wehave discussed, this is precisely the focus of the rest of this article. Before moving on to this analysis, we summarize our enumeration of limitations in Table 2. Thelimitations that we have been discussing are listed as well as denoting the model that most naturallycaptures their effect on agent behavior. 4. Existence of Equilibria In this section we deï¬ne formally the concept of a restricted equilibrium, which account for agentsârestricted policy spaces. We then carefully analyze what can be proven about the existence of re-stricted equilibria. The results presented range from somewhat trivial examples (Facts 1, 2, 3, and 4)and applications of known results from game theory and basic analysis (Theorems 1 and 5) to resultsthat we believe are completely new (Theorems 2, 3, and 4), as well as a critical counterexample tothe wider existence of restricted equilibria (Fact 5). But all of the results are in a sense novel sincethis speciï¬c question has received no direct attention in the game theory nor the multiagent learningliterature. 4.1 Restricted Equilibria We begin by deï¬ning the concept of an equilibrium under the model of restricted policy spaces.First we need a notion of best-response that accounts for the playersâ limitations. Deï¬nition 5 A restricted best-response for player i, BRi(Ïâi), is the set of all policies from Î i thatare optimal given the other player(s) play the joint policy Ïâi. We can now use this to deï¬ne an equilibrium. 362
EXISTENCE OF MULTIAGENT EQUILIBRIA Deï¬nition 6 A restricted equilibrium is a joint policy, Ïi=1...n, where, Ïi â BRi(Ïâi). Hence, no player can do better by changing policies given that all the other players continue tofollow the equilibrium policy, and the player can only switch to policies within their restrictedpolicy space. 4.2 Existence of Restricted Equilibria We can now state some results about when equilibria are preserved by restricted policy spaces, andwhen they are not. Unless otherwise stated (as in Theorems 2 and 4, which only apply to discountedreward), the results presented here apply equally to both the discounted reward and the averagereward formulations. We will separate the proofs for the two reward formulations when needed.The ï¬rst four facts show that the question of the existence of restricted equilibria does not have atrivial answer. Fact 1 Restricted equilibria do not necessarily exist. Proof. Consider the Rock-Paper-Scissors matrix game with players restricted to the space of deter-ministic policies. There are nine joint deterministic policies, and none of these joint policies are anequilibrium. Fact 2 There exist restricted policy spaces such that restricted equilibria exist. Proof. One trivial restricted equilibrium is in the case where all agents have a singleton policysubspace. The singleton joint policy therefore must be a restricted equilibrium. Fact 3 If Ïâ is a Nash equilibrium and Ïâ â Î , then Ïâ is a restricted equilibrium. Proof. If Ïâ is a Nash equilibrium, then we have Ï âi â 1 . . . n âÏ i,Ïâ âi i â Î i V Ïâ ⥠i V . i Since Î i â Î i, then we also have Ï âi â 1 . . . n âÏ i,Ïâ âi i â Î i V Ïâ ⥠i V , i and thus Ïâ is a restricted equilibrium. On the other hand, the converse is not true; not all restricted equilibria are of this trivial variety. Fact 4 There exist non-trivial restricted equilibria that are neither Nash equilibria nor come fromsingleton policy spaces. Proof. Consider the Rock-Paper-Scissors matrix game from Figure 1. Suppose the column player isforced, due to some limitation, to play âPaperâ exactly half the time, but is free to choose betweenâRockâ and âScissorsâ otherwise. This is a restricted policy space that excludes the only Nashequilibrium of the game. We can solve this game using the implicit game model, by giving the 363
BOWLING & VELOSO R Restricted Policy SpaceNash EquilibriumRestricted Equilibrium s1 P S s2 Explicit Game Implicit Game  0 â1 1   â 1 0  2 Payoffs  1 0 â1   1 â 1  ï£ ï£¸ ï£ 2 2  â1 1 0 0 12 Nash Equilibrium 1 , 1 , 1 , 1 , 1 , 1 0, 1 , 2 , 2 , 1 3 3 3 3 3 3 3 3 3 3 Restricted Equilibrium 0, 1 , 2 , 1 , 1 , 1 3 3 3 2 6 Figure 1: Example of a restricted equilibrium that is not a Nash equilibrium. Here, the column player in Rock-Paper-Scissors is restricted to playing only linear combinations of thestrategies s1 = 1 , 1 , 0 and s , 1 . 2 2 2 = 0, 12 2 limited player only two actions, s1 = (0.5, 0.5, 0) and s2 = (0, 0.5, 0.5), which the player can mixbetween. This is depicted graphically in Figure 1. We can solve the implicit game and convertthe two actions back to actions of the explicit game to ï¬nd a restricted equilibrium. Notice thisrestricted equilibrium is not a Nash equilibrium. Notice that the Fact 4 example has a convex policy space, i.e., all linear combinations of policies in the set are also in the set. Also, notice that the Fact 1 counterexample has a non-convex policyspace. These examples suggest that restricted equilibria may exist as long as the restricted policyspace is convex. The convexity restriction is, in fact, sufï¬cient for matrix games. Theorem 1 When |S| = 1, i.e. in matrix games, if Î i is convex, then there exists a restrictedequilibrium. The proof is based on an application of Rosenâs theorem for concave games (Rosen, 1965) andis included in Appendix A. Surprisingly, the convexity restriction is not sufï¬cient for stochasticgames. Fact 5 For a stochastic game, even if Î i is convex, restricted equilibria do not necessarily exist. Proof. Consider the stochastic game in Figure 2. This is a zero-sum game where only the payoffsto the row player are shown. The discount factor γ â (0, 1). The actions available to the row playerare U and D, and for the column player L or R. From the initial state, s0, the column player mayselect either L or R which results in no rewards but with high probability, 1 â , transitions to thespeciï¬ed state (regardless of the row playerâs action), and with low probability, , transitions to theopposite state. Notice that this stochasticity is not explicitly shown in Figure 2. In each of the 364
EXISTENCE OF MULTIAGENT EQUILIBRIA resulting states the players play the matrix game shown and then deterministically transition backto the initial state. Notice that this game is unichain, where all the states are in a single ergodicset, thus satisfying the average reward formulation requirement. Also, notice that the game withoutlimitations is guaranteed to have an equilibrium in both the discounted reward and the averagereward, since it is zero-sum, formulations. 0 0 s L 0 R   ï£0 0 1 â 1 â 1 0  sL sR 0 0     ï£0 0 ï£0 1 Figure 2: An example stochastic game where convex restricted policy spaces do not preserve the existence of equilibria. Now consider the restricted policy space where players have to play their actions with the same probability in all states. So, Î i = Ïi â Î i|âs, s â S âa â A Ïi(s, a) = Ïi(s , a) . (3) Notice that this is a convex set of policies. That is, if policies x1 and x2 are in Î i (according toEquation 3), then for any α â [0, 1], x3 must also be in Î i, where, x3(s, a) = αx1(s, a) + (1 â α)x2(s, a). (4) The convexity of Î i can be seen by examining x3(s , a) for any s â S. From Equation 4, we have, x3(s , a) = αx1(s , a) + (1 â α)x2(s , a) (5) = αx1(s, a) + (1 â α)x2(s, a) (6) = x3(s, a). (7) Therefore, x3 is in Î i and hence Î i is convex. This game, though, does not have a restricted equilibrium. The four possible joint deterministic policies, (U, L), (U, R), (D, L), and (D, R), are not equilibria. So if there exists an equilibriumit must be mixed. Consider any mixed strategy for the row player. If this strategy plays U withprobability less than 1 then the only best-response for the column player is to play L; if greater than 2 1 then the only best-response is to play R; if equal then the only best-responses are to play L or R 2 deterministically. In all cases, all best-responses are deterministic, so this rules out mixed strategyequilibria, and so no equilibrium exists. Fact 5 demonstrates that convexity is not a strong enough property to guarantee the existence of restricted equilibria. The discussion in Section 2.4 concluded that the majority of existing mul-tiagent learning algorithms hold some dependence on the existence of equilibria. Therefore, this 365
BOWLING & VELOSO result may have serious implications for the scaling of multiagent learning to large problems. Wewill explore these implications by ï¬rst examining this example more closely, in order to identify anew sufï¬cient condition. We will then seek to rescue the equilibrium notion by exploring speciï¬csituations where its existence can be preserved. 4.2.1 A SUFFICIENT CONDITION FOR EQUILIBRIA Standard equilibrium proof techniques fail in the Fact 5 counterexample because the playerâs best-response sets are not convex, even though their restricted policy spaces are convex. Notice thatthe best-response to the row player mixing equally between actions is to play either of its actionsdeterministically. But, linear combinations of these actions (e.g., mixing equally) are not best-responses. This intuition is proven in the following lemma. Lemma 1 For any stochastic game, if Î i is convex and for all Ïâi â Î âi, BRi(Ïâi) is convex,then there exists a restricted equilibrium. The proof of this lemma relies on Kakutaniâs ï¬xed point theorem using the convexity of the best-response as the critical condition for applying the theorem. Complete details of the proof can befound in Appendix B. The consequence of this lemma is that, if we can prove that the sets of restricted best-responses are convex then restricted equilibria exist. Notice that the convexity of the restricted policy space isnecessary for this condition (e.g., consider a uniform reward function resulting in BR(Ïâi) = Î ).The Fact 5 counterexample demonstrates that it is not sufï¬cient, since a restricted policy space canbe convex without the restricted best-response sets being convex. We now look at placing further restrictions either on the restricted policy spaces or the stochas- tic game, to insure provably convex best-response sets. After each existence proof for restrictedequilibria, we give one or more practical examples of domains where the theoretical results apply.These results begin to enumerate a few general classes where restricted equilibria are guaranteed toexist. 4.2.2 A SUBCLASS OF RESTRICTED POLICIES The ï¬rst result for general stochastic games uses a stronger notion of convexity for restricted policyspaces. Deï¬nition 7 A restricted policy space Î i is statewise convex, if it is the Cartesian product over allstates of convex strategy sets. Equivalently, if for all x1, x2 â Î i and all functions α : S â [0, 1],the policy x3(s, a) = α(s)x1(s, a) + (1 â α(s))x2(s, a) is also in Î i. In words, statewise convex policy spaces allow the policyâs action probabilities at each state to bemodiï¬ed (within the restricted set) independently. We can show that this is a sufï¬cient condition. Theorem 2 In the discounted reward formulation, if Î i is statewise convex, then there exists arestricted equilibrium. Proof. With statewise convex policy spaces, there exist optimal policies in the strong sense asmentioned in Section 2. Speciï¬cally, there exists a policy that can simultaneously maximize thevalue of all states. Formally, for any Ïâi there exists a Ïi â Î i such that, Ï ,Ï â Ï â s â S âÏ â i,Ïâi i i i Î i V (s) ⥠V (s). i i 366
EXISTENCE OF MULTIAGENT EQUILIBRIA Suppose this were not true, i.e., there were two policies each of which maximized the value ofdifferent states. We can construct a new policy that in each state follows the policy whose value islarger for that state. This policy will maximize the value of both states that those policies maximized,and due to statewise convexity is also in Î i. We will use that fact to redeï¬ne optimality to this strongsense for this proof. We will now make use of Lemma 1. First, notice the lemmaâs proof still holds even with this new deï¬nition of optimality. We just showed that under this redeï¬nition, BRi(Ïâi) is non-empty, andthe same argument for compactness of BRi(Ïâi) holds. So we can make use of Lemma 1 and whatremains is to prove that BRi(Ïâi) is convex. Since Ïâi is a ï¬xed policy for all the other players thisdeï¬nes an MDP for player i (Filar & Vrieze, 1997, Corollary 4.2.11). So we need to show that the setof polices from the playerâs restricted set that are optimal for this MDP is a convex set. Concretely,if x1, x2 â Î are optimal for this MDP, then the policy x3(s, a) = αx1(s, a) + (1 â α)x2(s, a) isalso optimal for any α â [0, 1]. Since x1 and x2 are optimal in the strong sense, i.e., maximizingthe value of all states simultaneously, then they must have the same per-state value. Here, we will use the notation V x(s) to refer to the value of policy x from state s in this ï¬xed MDP. The value function for any policy is the unique solution to the Bellman equations, speciï¬cally, âs â S V x(s) = x(s, a) R(s, a) + γ T (s, a, s )V x(s ) . (8) a s For x3 then we get the following, V x3 (s) = x3(s, a) R(s, a) + γ T (s, a, s )V x3 (s ) (9) a s = (αx1(s, a) + (1 â α)x2(s, a)) R(s, a) + γ T (s, a, s )V x3 (s ) (10) a s = α x1(s, a) R(s, a) + γ T (s, a, s )V x3 (s ) + a s (1 â α) x2(s, a)) R(s, a) + γ T (s, a, s )V x3 (s ) . (11) a s Notice that V x3 (s) = V x1 (s) = V x2 (s) is a solution to these equations, and therefore is the uniquesolution for the value of x3. Therefore, x3 has the same values as x1 and x2, and hence is also opti-mal. Therefore BRi(Ïâi) is convex, and from Lemma 1 we get the existence of restricted equilibriaunder this stricter notion of optimality, which also makes the policies a restricted equilibrium underour original notion of optimality, that is only maximizing the value of the initial state. Example: Multirobot Indoor Navigation. Consider an indoor navigation domain with multiplerobots traversing constrained hallways. The agents are the robots, all with their own navigationalgoals. A robotâs state consists of a highly discretized representation of position and orientationresulting in an intractable number of states per robot. The global state is the joint states of all ofthe robots and is known by all the agents. The robotsâ actions are low-level navigation commands.We will construct an example of a useful restricted policy space for this domain that is statewiseconvex, and therefore satisï¬es the conditions of Theorem 2. Deï¬ne a set of key decision points, K, 367
BOWLING & VELOSO made up of the hallway intersections. These decision points deï¬ne a subset of the complete statespace, Kn â S. While between key decision points, each agent will use a ï¬xed policy assigningprobabilities to actions to reach the next decision point. The agentsâ policies are restricted so thatdecisions are only made when all agents have arrived at a key decision point, where they all simulta-neously choose an action contingent on the global state. The agents can independently assign actionprobabilities for all states in Kn, while all the action probabilities in S â Kn are ï¬xed. Hence, this isa statewise convex restricted policy space. Theorem 2 therefore applies and we conclude that thereexists restricted equilibria in this domain. Despite the above example, most rational limitations that allow reinforcement learning to scale are unlikely to be statewise convex restrictions. They often have a strong dependence betweenstates. For example, parameterized policies involve far less parameters than the number of states,which is often intractably large. Either the majority of states have ï¬xed action probabilities, as in theabove example, or there will be fart too many parameters to optimize. Similarly, subproblems forcewhole portions of the state space to follow the same subproblem solution. Therefore, these portionsof the state space do not select their actions independently. Abstraction and state aliasing use thesame action probabilities across multiple states, and so also are not statewise convex. We now lookat retaining a simple convexity condition on the restricted policy spaces, but examine constrainedspaces of stochastic games. 4.2.3 SUBCLASSES OF STOCHASTIC GAMES One way to relax from statewise convexity to general convexity is to consider only a subset ofstochastic games. Theorem 1 is one example, where restricted equilibria exist for the subclassof matrix games with convex restricted policy spaces. Matrix games have a highly constrainedâtransitionâ function allowing only self transitions, which are independent of the playersâ actions.We can generalize this idea beyond single-state games. Theorem 3 Consider no-control stochastic games, where all transitions are independent of theplayersâ actions, i.e., âs, s â S âa, b â A T (s, a, s ) = T (s, b, s ). If Î i is convex, then there exists a restricted equilibrium. Proof (Discounted Reward). This proof also makes use of Lemma 1, leaving us only to show thatBRi(Ïâi) is convex. Just as in the proof of Theorem 2 we will consider the MDP deï¬ned for playeri when the other players follow the ï¬xed policy Ïâi. As before it sufï¬ces to show that for this MDP,if x1, x2 â Î are optimal for this MDP, then the policy x3(s, a) = αx1(s, a) + (1 â α)x2(s, a) isalso optimal for any α â [0, 1]. Again we use the notation V x(s) to refer to the traditional value of a policy x at state s in this ï¬xed MDP. Since T (s, a, s ) is independent of a, we can simplify the Bellman equations (Equa-tion 8) to V x(s) = x(s, a)R(s, a) + γ x(s, a)T (s, a, s )V x(s ) (12) a s a = x(s, a)R(s, a) + γ T (s, ·, s )V x(s ). (13) a s 368
EXISTENCE OF MULTIAGENT EQUILIBRIA For the policy x3, the value of state s is then, V x3 (s) = α x1(s, a)R(s, a) + (1 â α) x2(s, a)R(s, a) + a a γ T (s, ·, s )V x3 (s ). (14) s Using equation 13 for both x1 and x2 we get, V x3 (s) = α(V x1 (s) â γ T (s, ·, s )V x1 (s )) + s (1 â α)(V x2 (s) â γ T (s, ·, s )V x2 (s )) + s γ T (s, ·, s )V x3 (s ) (15) s = αV x1 (s) + (1 â α)V x2 (s) + γ T (s, ·, s ) V x3 (s ) â αV x1 (s ) â (1 â α)V x2 (s ) (16) s Notice that a solution to these equations is V x3 (s) = αV x1 (s) + (1 â α)V x2 (s), and the Bellmanequations must have a unique solution. Hence, V x3 (s0) is equal to V x1(s0) and V x2(s0), which areequal since both are optimal. So x3 is optimal, and BRi(Ï) is convex. Applying Lemma 1 we getthat restricted equilibria exist. Proof (Average Reward). An equivalent deï¬nition to Equation 2 of a policyâs average reward is, V Ï i = dÏ(s) Ï(s, a)R(s, a), (17) sâS a where dÏ(s) deï¬nes the distribution over states visited while following Ï after inï¬nite time. For astochastic game or MDP that is unichain we know that this distribution is independent of the initialstate. In the case of no-control stochastic games or MDPs, this distribution becomes independentof the actions and policies of the players, and depends solely on the transition probabilities. SoEquation 17 can be written, V Ï i = d(s) Ï(s, a)R(s, a). (18) sâS a As before, we must show that BRi(Ïâi) is convex to apply Lemma 1. Consider the MDP deï¬nedfor player i when the other players follow the policy Ïâi. It sufï¬ces to show that for this MDP, ifx1, x2 â Î are optimal for this MDP, then the policy x3(s, a) = αx1(s, a) + (1 â α)x2(s, a) is alsooptimal for any α â [0, 1]. Using Equation 18, we can write the value of x3 as, V x3 = d(s) x i 3(s, a)R(s, a) (19) sâS a = d(s) (αx1(s, a) + (1 â α)x2(s, a)) R(s, a) (20) sâS a 369
BOWLING & VELOSO = d(s) αx1(s, a)R(s, a) + (1 â α)x2(s, a)R(s, a) (21) a a = α d(s) x1(s, a)R(s, a) + (1 â α) d(s) x2(s, a)R(s, a) (22) sâS a sâS a = αV x1 (s) + (1 â α)V x2 (s) (23) i i = αV x1 (s) + (1 â α)V x1 (s) = V x1 (s). (24) i i i Therefore x3 has the same average reward as x1 and so is also optimal. So BRi(Ïâi) is convex andby Lemma 1 there exists an equilibrium. Example: Space Probe. Consider a space probe traveling on a predeï¬ned planetary trajectory.Consider using a multiagent approach to coordinating the gathering of data from scientiï¬c instru-ments. Each instrument is an agent making independent decisions and maximizing an assignedreward function. The state of the system is the position and orientation of the probe, which fol-lows a ï¬xed transition function. The state determines the effectiveness of the various data-gatheringtasks (i.e., the agentsâ rewards), which also may be determined by the decisions of the other in-struments. For example, camera images may be occluded when collecting atmospheric samples, ortwo instruments may draw too much power if activated simultaneously, causing both to fail. Thisis a no-control stochastic game, as the instrumentsâ actions do not control the state, which is thepredeï¬ned trajectory of the probe. By Theorem 3, restricted equilibria exist for all convex restrictedpolicy spaces in this domain. We can now merge Theorem 2 and Theorem 3 allowing us to prove existence of equilibria for a general class of games where only one of the playerâs actions affects the next state. Sincethis theoremâs sufï¬cient conditions are the most general of those presented, multiple examples ofapplications follow. Theorem 4 Consider single-controller stochastic games (Filar & Vrieze, 1997), where all transi-tions depend solely on player 1âs actions, i.e., âs, s â S âa, b â A a1 = b1 â T (s, a, s ) = T (s, b, s ). In the discounted reward formulation, if Î 1 is statewise convex and Î i=1 is convex, then there existsa restricted equilibrium. Proof. This proof again makes use of Lemma 1, leaving us to show that BRi(Ïâi) is convex. Fori = 1 we use the argument from the proof of Theorem 2. For i = 1 we use the argument fromTheorem 3. Example: Haunted House Ride. An example that helps to illustrate the concept of a single-controllerstochastic game is an amusement park haunted house ride. Consider the system as consisting ofmultiple agents including the ride operator and the rideâs passengers. Suppose the ride operatorhas choices over different ride routes therefore determining the state of the world for the passengeragents. The passengers, on the other hand, can only choose at each state where to look and whento try and scare their fellow passengers. Therefore, one agent (the operator) controls the transitions,while all the agentsâ (operator and passengers) affect the resulting rewards. Consider that the ride 370
EXISTENCE OF MULTIAGENT EQUILIBRIA operator may be restricted to only making decisions at a few key decision points determined bythe ride designer. Therefore, the operatorâs policy space is statewise convex. Hence, accordingto Theorem 4 if the passengers have convex restricted policy spaces resulting from their imposedlimitations, then restricted equilibria exist. Example: Mobile Data Gathering. The general pattern of the haunted house ride can also be seenin practical situations. Consider mobile data gathering, such as an extended version of the spaceprobe example presented earlier. There is one navigational agent whose actions determine the state(e.g., position and orientation). The other agents are instruments choosing how to gather data at anygiven moment, as in the no-control version described above. The rewards are determined by all ofthe agents actions, but the state transition is deï¬ned only by the navigational agent. This ï¬ts thedeï¬nition of a single-controller stochastic game and the Theorem holds for appropriate restrictedpolicy spaces. Example: Asymmetric Competition The above examples, although not strictly team games, do havea cooperative nature to them. Single-controller stochastic games can also be competitive. Con-sider an economic competition between two ï¬rms, where one ï¬rmâs supply technology is primarilyinfrastructure-based, while the otherâs is very ï¬uid. For example, a transportation company usingrail lines versus a company using trucks. Since one agentâs actions are implemented on a timescalefar smaller than the other agent, we can effectively ignore that ï¬rmâs state. The global state is justthe built infrastructure of the ï¬rst agent, and is only inï¬uenced by that agentâs actions. The rewards,though, are determined by both actions of the agents, and may well be zero-sum. If the ï¬rst ï¬rmuses a statewise convex restricted policy space (e.g., using key decision points) and the other ï¬rmuses a convex restricted policy space, then restricted equilibria exist. The previous results have looked at stochastic games whose transition functions have particu- lar properties. Our ï¬nal theorem examines stochastic games where the rewards have a particularstructure. Speciï¬cally we address team games, where the agents all receive equal payoffs. Theorem 5 For team games, i.e., âi, j â 1, . . . , n âs â S âa â A Ri(s, a) = Rj(s, a), there exists a restricted equilibrium. Proof. The only constraints on the playersâ restricted policy spaces are those stated at the beginningof this section: non-empty and compact. Since Î is compact, being a Cartesian product of compactsets, and player oneâs value in either formulation is a continuous function of the joint policy, thenthe value function attains its maximum (Gaughan, 1993, Corollary 3.11). Speciï¬cally, there existsÏâ â Î such that, âÏ â Î V Ïâ ⥠1 V Ï 1 . Since âiVi = V1 we then get that the policy Ïâ maximizes all the playersâ rewards, and so eachmust be playing a restricted best-response to the othersâ policies. Example: Distributed Network Routing. Consider a distributed network routing domain wheremessage passing at nodes is controlled by an intelligent agent. The agent can only observe itsown queue of incoming messages and congestion on outgoing links. Hence each agent has analiased view of the global state. Therefore, policies mapping the observed state to actions are convex 371
BOWLING & VELOSO restricted policy spaces in the fully-observable stochastic game. All the agents seek to maximizeglobal throughput. Hence, they have an identical reward function based on global packet arrival. Aslong as further agent limitations preserve the convexity of their policy spaces, by Theorem 5 thereexists a restricted equilibrium. 4.3 Summary Facts 1 and 5 provide counterexamples that show the threat limitations pose to equilibria. Theo-rems 1, 2, 3, 4, and 5 give us ï¬ve general classes of stochastic games and restricted policy spaceswhere equilibria are guaranteed to exist. The fact that equilibria do not exist in general raisesconcerns about equilibria as a general basis for multiagent learning in domains where agents havelimitations. On the other hand, combined with the model of implicit games, the presented theoreti-cal results lays the initial groundwork for understanding when equilibria can be relied on and whentheir existence may be in question. These contributions also provide some formal foundation forapplying multiagent learning in limited agent problems. 5. Learning with Limitations In Section 2, we highlighted the importance of the existence of equilibria to multiagent learningalgorithms. This section presents results of applying a particular learning algorithm to a settingof limited agents. We use the best-response learner, WoLF-PHC (Bowling & Veloso, 2002). Thisalgorithm is rational, that is, it is guaranteed to converge to a best-response if the other playersâ poli-cies converge and appropriate decayed exploration is used (Singh, Jaakkola, Littman, & Szepesv´ari,2000). In addition, it has been empirically shown to converge in self-play, where both players useWoLF-PHC for learning. In this article we apply this algorithm in self-play to matrix games, bothwith and without player limitations. Since the algorithm is rational, if the players converge thentheir converged policies must be an equilibrium (Bowling & Veloso, 2002). The speciï¬c limitations we examine fall into both the restricted policy space model as well as the implicit game model. One player is restricted to playing strategies that are the convex hullof a subset of the available strategies. From Theorem 1, there exists a restricted equilibrium withthese limitations. For best-response learners, this amounts to a possible convergence point for theplayers. For the limited player, the WoLF-PHC algorithms were modiï¬ed slightly so that the playermaintains Q-values of its restricted set of available strategies and performs its usual hill-climbing inthe mixed space of these strategies. The unlimited player is unchanged and completely uninformedof the limitation of its opponent. For all the experiments, we use very small learning rates and alarge number of trials to better display how the algorithm learns and converges. 5.1 Rock-Paper-Scissors The ï¬rst game we examine is Rock-Paper-Scissors. Figure 3 shows the results of learning whenneither player is limited. Each graph shows the mixed policy the player is playing over time. Thelabels to the right of the graph signify the probabilities of each action in the gameâs unique Nashequilibrium. Observe that the playersâ strategies converge to this learning ï¬xed point. Figure 4 shows the results of restricting player 1 to a convex restricted policy space, deï¬ned by requiring the player to play âPaperâ exactly half the time. This is the same restriction as showngraphically in Figure 1. The graphs again show the playersâ strategies over time, and the labels 372
EXISTENCE OF MULTIAGENT EQUILIBRIA Player 1: Unlimited Player 2: Unlimited E(Reward) = 0 E(Reward) = 0 1 1 P(Rock) P(Rock) P(Paper) P(Paper) 0.8 P(Scissors) 0.8 P(Scissors) 0.6 0.6 0.4 Rock 0.4 Rock Paper Paper Scissors Action Probabilities Scissors 0.2 Action Probabilities 0.2 0 0 0 100000 200000 300000 0 100000 200000 300000 Number of Trials Number of Trials Figure 3: WoLF-PHC in Rock-Paper-Scissors. Neither player is limited. to the right now label the gameâs restricted equilibrium, which accounts for the limitation (seeFigure 1). The playerâs strategies now converge to this new learning ï¬xed point. If we examine theexpected rewards to the players, we see that the unrestricted player gets a higher expected reward inthe restricted equilibrium than in the gameâs Nash equilibrium (1/6 compared to 0.) In summary,both players learn optimal best-response policies with the unrestricted learner appropriately takingadvantage of the other playerâs limitation. Player 1: Limited Player 1: Unlimited E(Reward) = -0.167 E(Reward) = 0.167 1 1 P(Rock) P(Rock) P(Paper) P(Paper) 0.8 P(Scissors) 0.8 P(Scissors) Scissors 0.6 0.6 Paper 0.4 0.4 Rock Paper Action Probabilities 0.2 Action Probabilities Scissors 0.2 0 0 Rock 0 100000 200000 300000 0 100000 200000 300000 Number of Trials Number of Trials Figure 4: WoLF-PHC in Rock-Paper-Scissors. Player 1 must play âPaperâ with probability 1 . 2 5.2 Colonel Blotto The second game we examined is âColonel Blottoâ (Gintis, 2000), which is also a zero-sum matrixgame. In this game, players simultaneously allot regiments to one of two battleï¬elds. If one playerallots more armies to a battleï¬eld than the other, they receive a reward of one plus the number ofarmies defeated, and the other player loses this amount. If the players tie, then the reward is zerofor both. In the unlimited game, the row player has four regiments to allot, and the column playerhas only three. The matrix of payoffs for this game is shown in Figure 5. 373
BOWLING & VELOSO 3-0 2-1 1-2 0-3 3-0 2-1 1-2 0-3 4-0  4 2 1 0  4-0  -4 -2 -1 0  3-1 1 3 0 â1 â1 â3 0 1   3-1   R     r = 2-2  â2 2 2 â2  Rc = 2-2  2 â2 â2 2  1-3     ï£ â1 0 3 1  1-3 ï£ 1 0 â3 â1  0-4 0 1 2 4 0-4 0 â1 â2 â4 Figure 5: Colonel Blotto Game. Figure 6 shows experimental results with unlimited players. The labels on the right signify the probabilities associated with the Nash equilibrium to which the playersâ strategies converge.Player 1 is then given the limitation that it could only allot two of its armies, the other two wouldbe allotted randomly. This is also a convex restricted policy space and therefore by Theorem 1has a restricted equilibrium. Figure 7 shows the learning results. The labels to the right corre-spond to the action probabilities for the restricted equilibrium, which was computed by hand. As inRock-Paper-Scissors, the playersâ strategies converge to the new learning ï¬xed point. Similarly, theexpected reward for the unrestricted player resulting from the restricted equilibrium is considerablyhigher than that of the Nash equilibrium (0 to â14/9), as the player takes advantage of the otherâslimitation. Player 1: Unlimited Player 2: Unlimited E(Reward) = 1.56 E(Reward) = -1.56 1 1 P(4-0) P(3-0) P(3-1) P(2-1) 0.8 P(2-2) 0.8 P(1-2) P(1-3) P(0-3) P(0-4) 0.6 0.6 4-0, 0-4 2-1, 1-2 0.4 0.4 Action Probabilities 0.2 Action Probabilities 0.2 2-2 3-0, 0-3 0 3-1, 1-3 0 0 400000 800000 1.2e+06 0 400000 800000 1.2e+06 Number of Trials Number of Trials Figure 6: WoLF-PHC in Colonel Blotto. Neither player is limited. There is one ï¬nal observations about these results. In Section 3 we discussed the use of rational limitations to speed learning. Even in these very small single-state problems, our results demon-strate that limitations can be used to speed learning. Notice that convergence occurs more quicklyin the limited situations where one of the players has less parameters and less freedom in its policyspace. In the case of the Colonel Blotto game this is a dramatic difference. (Notice the x-axes differby a factor of four!) In games with very large state spaces this will be even more dramatic. Agentswill need to make use of rational limitations to do any learning at all, and similarly the less restrictedagents will likely be able to beneï¬t from taking advantage of the more limited learners.4 4. Strictly speaking, restricted agents are not always at a disadvantage. In zero-sum games, such as those for which results are presented here, and team-games where all players rewards are identical, restrictions can never beneï¬t theagent. In general-sum games, such as Bach or Stravinsky from Table 1(b), this is not true. In this case, an imposed 374
EXISTENCE OF MULTIAGENT EQUILIBRIA Player 2: Limited Player 2: Unlimited E(Reward) = 0 E(Reward) = 0 1 1 P(4-0) P(3-0) P(3-1) P(2-1) 0.8 P(2-2) 0.8 P(1-2) P(1-3) P(0-3) P(0-4) 0.6 0.6 3-0, 0-3 0.4 0.4 3-1, 2-2, Action Probabilities 0.2 1-3 Action Probabilities 0.2 4-0, 0-4 0 0 2-1, 1-2 0 100000 200000 300000 0 100000 200000 300000 Number of Trials Number of Trials Figure 7: WoLF-PHC in Colonel Blotto. Player one is forced to randomly allot two regiments. 6. Related Work It is commonly recognized that humans are not capable of being fully rational. There is a wholebody of work in the social sciences, particularly economics and game theory, relating to understand-ing humansâ bounded rationality (Simon, 1982). Bounded rationality is also a reality for artiï¬cialagents. As we discussed in Section 3, agent limitations are unavoidable in realistic problems. As aresult, there has been much interesting work in both the ï¬elds of game theory and artiï¬cial intelli-gence on how limitations effect agent behavior. A useful distinction presented by Simon (1976) is the difference between substantive rationality and procedural rationality. Substantive rationality is primarily concerned with whether a resultingbehavior is appropriate for the achievement of speciï¬c goals given speciï¬c constraints. Proceduralrationality is concerned with the behavior that is the result of appropriate deliberation. The dif-ference is in whether it is the outcome that is appropriate to the goals or the deliberation that isappropriate to the goals. The approach taken in this work is primarily procedural. We take the variety of well motivated learning procedures, both equilibrium learners and best-response learners, as given decision mech-anisms. We also take the variety of mechanisms for making learning tractable (e.g., abstraction andapproximation) as given mechanisms. The focus of our investigation is to see what is the resultingbehavior of these combined mechanisms. In particular, do equilibria exist for equilibrium learnersto learn and to which best-response learners can converge? This differs from other investigationswhich weâll brieï¬y review here. 6.1 Game Theory Much of the research in game theory on modeling bounded rationality, has focused on the inves-tigation of procedural rationality (Rubinstein, 1998). This approach is crucial given the goal ofmodeling, explaining, and predicting human behavior. An example of this approach is that taken byOsborne and Rubinstein (1997). They deï¬ne a decision procedure for mapping actions to discreteoutcomes, rather than distributions over outcomes. This mapping is generated stochastically ac- restriction, such as the row player being forced to select its ï¬rst action, can actually improve the agentâs equilibriumvalue (see also Littman & Stone, 2001; Gilboa & Samet, 1989). 375
BOWLING & VELOSO cording to the actual distribution of outcomes for given actions. Given a mapping, the agent selectsthe action whose associated outcome is preferred. Given the decision procedure, they investigatewhether an equilibrium ï¬xed point exists and also how well such a procedure corresponds to humanbehavior in particular games. Our work follows a similar approach of considering the outcome of using particular learning procedures. On the other hand, we are not dealing with the problem of human behavior but ratherthe behavior of artiï¬cial learning agents and how imposed limitations effect interactions betweenagents. By generalizing these effects into the models of implicit games and restricted policy spaces,we then ask similar questions. Do equilibrium ï¬xed points exist and what is the expected outcomeof employing such learning rules? This simpliï¬es our problem since weâre only analyzing theoutcomes of known machine learning techniques rather than explaining human behavior. In addition to the work along the lines of procedural rationality, there is a large body of tradi- tional work following along the lines of substantive rationality. The particular goals of this approachare to model the agentsâ limitations in the game itself and ï¬nd optimal behavior given the constraintsof these limitations. These results include Bayesian games where other agentsâ utilities are uncertaina priori. Extensive form games with imperfect information is also an interesting model accountingfor situations of state aliasing. Finally, strategies as ï¬nite state machines in repeated games tryto account for limitations on computation and memory. A summary of this work can be found instandard game theory texts (e.g., Osborne & Rubinstein, 1994)). 6.2 Artiï¬cial Agents While risking over generalizing, much of the work in the ï¬eld of artiï¬cial agents that examines theeffect of limitations on agent behavior has focused more on substantive rationality. In particular,a common thread of this work is the development of models of interaction and beliefs along withefï¬cient algorithms for computing optimal or near-optimal behavior given these models. This ap-proach focuses on what is appropriate behavior rather than what is the result of appropriate decisionprocedures. We will brieï¬y consider some of the recent work that consider limitations and agentbehavior, and give some comparison to the approach and results we have taken in this work. 6.2.1 COMPUTATIONAL LIMITATIONS Larson and Sandholm (2001) consider the problem of reasoning in the presence of agents withlimited computational resources. Speciï¬cally, they examine a two agent setting where agents haveintractable individual problems, but may gain an improvement on their solutions by solving the alsointractable joint problem. An agentâs strategy determines how to invest its limited computationalresources to improve its solution for its own problem, its opponentâs problem, or the joint problem.When the deadline is reached, the agents must decide whether to simply use its own computedsolution or bargain over a joint solution. Among other results, they demonstrate that the concept ofNash equilibrium can be extended to this setting, where the decision of computation is part of theagentâs strategy. They call this concept deliberation equilibria and present algorithms for ï¬ndingthese equilibria solutions in certain situations. The concept of deliberation equilibria has many similarities to the restricted equilibria concept we presented in Section 4. The core idea is to incorporate the agentsâ limitations into their choicesof strategies. Restricted equilibria, though, incorporate limitations as restrictions on the agentsâchoices, rather than incorporating limitations by adding computational decisions into their choices. 376
EXISTENCE OF MULTIAGENT EQUILIBRIA This difference, though, is a fundamental one as the two approaches solve different problems. Thecomplexity of decision-making in the Larson and Sandholm setting is directly due to the compu-tational limitations placed on the agents. In Chapter 3, the complexity of decision-making is partof the task, and agent limitations are required in order to simplify the problem and make learningtractable. This core difference also explains the difference in results, since restricted equilibria arenot guaranteed to exist in general. 6.2.2 GAME-TREE SEARCH WITH OPPONENT KNOWLEDGE Jansenâs Ph.D. thesis (1992) examined the issue of whether âoptimalâ play in the game-theoreticsense is in fact always âoptimalâ. In the case of a fallible opponent that may make an incor-rect evaluation of future states, the answer is naturally no. He proposed the idea of speculativeplay, which uses a prediction function for the opponent in determining its opponentâs moves whensearching a game-tree. The thesis focuses on deterministic, alternating-move, perfect informationgames, speciï¬cally a subgame of chess. He demonstrated that using speculative play with end-gamedatabases as knowledge of the opponent has the potential for improving play against humans. Inthis work, limitations on human rationality are recognized, and so game-theoretic optimal play isdiscarded for optimization against particulars of human play. 6.2.3 PLANNING WITH OPPONENT MODELS. Riley and Veloso (2000, 2002) present similar ideas for using opponent models to make decisions ina less abstract domain. They examine the problem of adapting to a speciï¬c opponent in simulatedrobotic soccer (Noda, Matsubara, Hiraki, & Frank, 1998). This work recognizes that modelingopponents is not a new idea and there has been a great deal of related research in the ï¬eld ofplan recognition. The work, though, contributes an actual technique for using opponent models toimprove the team behavior, hence learning. The team designer provides a set of opponent models of possible behavior of the other team. These models deï¬ne a distribution of playersâ positions as a function of the ballâs trajectory andtheir initial positions. During play, the actual opponent is matched to one of these models throughBayesian updating from actual observations. The model is then used to plan during set-play situa-tions, such as throw-ins or goal-kicks. The planning uses hill-climbing on the ball-trajectory to ï¬nda trajectory with low probability of being intercepted by the other team given the matched opponentmodel of their movement. The planner then determines the positions of players and their actionsso as to carry out the planned ball trajectory. This is compiled into a simple temporal networkwhich can be executed in a distributed manner by the players. They demonstrate through controlledempirical experiments that the plans generated depend on the opponent model that is used duringplanning. The effectiveness of this technique in actual play against a real opponents, that may ormay not match well with the models, is still unknown. This technique allows a team of agents to adapt their behavior in a principled manner to better address a speciï¬c team. This does not strictly address the problem of concurrent learning in amultiagent setting, which is the primary focus of this work. The critical assumption is that theopponentsâ play is not changing over time. The authors mention this brieï¬y and provide a smallchange to their Bayesian update through the addition of weight sharing. A small weight is addedto each model to keep its probability bounded from zero. This effectively introduces a non-zeroprobability that the other team may change its behavior after each observation. There are no results, 377
BOWLING & VELOSO though, of the outcome of this approach if the other agents are also adapting. For example, it isnot clear how robust the algorithm is to the other team employing a similar or identical approach todefending against a set-play. Will adaptation oscillate or is there an equilibrium point? How wouldsuch oscillation effect performance? 6.2.4 LIMITATIONS AS FINITE AUTOMATA. Carmel and Markovitch (1996, 1997) examined a model-based approach to multiagent learning.They introduced algorithms for efï¬ciently learning a model of the opponent as a deterministic ï¬-nite automaton (DFA). Using this learned model, they then compute the optimal best response au-tomaton. They examined this technique in the situation of repeated games and showed that theirtechnique can learn effectively against a variety of DFA opponents. They also showed that theiralgorithm learned much more quickly (i.e., was more data efï¬cient) than Q-learning, a model-freereinforcement learner. It is unknown how this compares with a model-based reinforcement learnersuch as prioritized sweeping (Moore & Atkeson, 1993). This technique also does not address the problem of concurrent learning in a multiagent envi- ronment. Although a learning rule can be be modeled by a DFA, the number of states that a rulerequires may be huge. For example, Q-learning with a decaying learning rate would be a completelyintractable DFA. Also, the model-based learning agent is required to be strictly more powerful thanits opponent, since its learning rule both learns the opponentâs DFA and also generates its own. It isunclear whether any sort of equilibrium ï¬xed point would be reached or even exists if two model-based DFA learners played each other. It is also not certain how DFA modeling scales to multiplestate stochastic games. 6.2.5 SUMMARY As we stated earlier, this body of work focuses primarily on substantive rationality: what is appro-priate behavior given these assumptions? In addition, most of the approaches include the assumptionof an asymmetric superiority over the other decision making agents. In the case of constructing DFAmodels of opponent behavior, there is an implicit assumption that the opponent behavior is simpleenough to be constructed by the adapting agent. In the case of planning with opponent models, itis assumed that the opponentâs behavior is not changing and so the learned model is accurate inpredicting their future behavior. Although the work recognizes that other agents have limitations,they do so in a way that expects other agents to be strictly inferior. There are no results, empirical ortheoretical, of applying these model-based techniques to situations of self-play, where all agents usethe same adaptation procedure. Larson and Sandholmâs work on computational limitations allowsfor limitations on the part of both agents. They do so by adding the complexity of computationaldecisions into the strategy space. Hence, their approach does not make this assumption. 7. Conclusion Nash equilibrium is a crucial concept in multiagent learning both for algorithms that directly learnan equilibrium and algorithms that learn best-responses. Agent limitations, though, are unavoidablein realistic settings and can prevent agents from playing optimally or playing the equilibrium. Inthis article, we introduce and answer two critical questions: Do equilibria exist when agents have 378
EXISTENCE OF MULTIAGENT EQUILIBRIA limitations? Not necessarily. Are there classes of domains or classes of limitations where equilibriaare guaranteed to exist? Yes. We have proven that for some classes of stochastic games and agent limitations equilibria are guaranteed to exist. We have also given counterexamples that help understand why equilibria do notexist in the general case. In addition to these theoretical results, we demonstrate the implications ofthese results in a real learning algorithm. We present empirical results that show that learning withlimitations is possible, and equilibria under limitations is relevant. There are two main future directions for this work. The ï¬rst is continuing to explore the the- oretical existence of equilibria. We have proven the existence of equilibria for some interestingclasses of stochastic games and restricted policy spaces. We have also established in Lemma 1 akey criterion, the convexity of best-response sets, as the basis for further theoretical results. It isstill unknown whether there are other general classes of games and limitations for which equilibriaexist. The second direction is the practical application of multiagent learning algorithms to real prob- lems when agents have real limitations. The theoretical results we have presented and the empiricalresults on simple matrix games, give a foundation as well as encouraging evidence. There are stillchallenging questions to answer. How do speciï¬c limitations map on to the models that we exploredin this article? Do equilibria exist for stochastic games and limitations typically faced in practice?What is the goal of learning when equilibria do not exist? This article lays the groundwork for ex-ploring these and other important learning issues that are relevant to realistic multiagent scenarios. Acknowledgments The authors are indebted to Martin Zinkevich for numerous insights as well as ï¬nding the exampledemonstrating Fact 5. The authors are grateful to Craig Boutilier and Tuomas Sandholm for helpfuldiscussions. In addition, we owe gratitude to the team of anonymous reviewers for their careful eyesand invaluable comments. This research was sponsored by the United States Air Force under Cooperative Agreements No. F30602-00-2-0549 and No. F30602-98-2-0135. The views and conclusions contained in thisdocument are those of the authors and should not be interpreted as necessarily representing theofï¬cial policies or endorsements, either expressed or implied, of the Defense Advanced ResearchProjects Agency (DARPA), the Air Force, or the US Government. Appendix A. Proof of Theorem 1 Proof. One might think of proving this theorem by appealing to implicit games as was used inFact 4. In fact, if Î i was a convex hull of a ï¬nite number of strategies, this would be the case. Inorder to prove it for any convex Î i we apply Rosenâs theorem about the existence of equilibria inconcave games (Rosen, 1965). In order to use this theorem we need to show the following: 1. Î i is non-empty, compact, and convex. 2. V Ï as a function over Ï â Î is continuous. i Ï ,Ïâi 3. For any Ï â Î , the function over Ï â Î i is concave. i i deï¬ned as Vi 379
BOWLING & VELOSO Condition 1 is by assumption. In matrix games, where S = s0, we can simplify the deï¬nition ofa policyâs value from Equations 1 and 2. For discounted reward we get, 1 V Ï i = Ri(s, a)Î n 1 â γ i=1Ïi(s0, ai), (25) aâA For average reward, the value is just the value of the one-shot matrix game given the playersâpolicies. This is equivalent to setting γ = 0 in the Equation 25. Equation 25 shows that forboth reward formulations the value is a multilinear function with respect to the joint policy and,therefore, is continuous. So Condition 2 is satisï¬ed. Observe that by ï¬xing the policies for all butone player, Equation 25 becomes a linear function over the remaining playerâs policy and so is alsoconcave, satisfying Condition 3. Therefore Rosenâs theorem applies and this game has a restrictedequilibrium. Appendix B. Proof of Lemma 1 Proof. The proof relies on Kakutaniâs ï¬xed point theorem. We ï¬rst need to show some facts aboutthe restricted best-response function. First, remember that Î i is non-empty and compact. Also, notethat the value (with both discounted and average reward) to a player at any state of a joint policyis a continuous function of that joint policy (Filar & Vrieze, 1997, Theorem 4.3.7 and Lemma5.1.4). Therefore, from basic analysis (Gaughan, 1993, Theorem 3.5 and Corollary 3.11), the set ofmaximizing (or optimal) points must be a non-empty and compact set. So BRi(Ïâi) is non-emptyand compact. Deï¬ne the set-valued function, F : Î â Î , F (Ï) = Ãn i=1 BRi(Ïâi). We want to show F has a ï¬xed point. To apply Kakutaniâs ï¬xed point theorem we must show thefollowing conditions to be true, 1. Î is a non-empty, compact, and convex subset of a Euclidean space, 2. F (Ï) is non-empty, 3. F (Ï) is compact and convex, and 4. F is upper hemi-continuous. Since the Cartesian product of non-empty, compact, and convex sets is non-empty, compact, andconvex we have condition (1) by the assumptions on Î i. By the facts of BRi from above and thelemmaâs assumptions we similarly get conditions (2) and (3). What remains is to show condition (4). Consider two sequences xj â x â Î and yj â y â Î such that âj yj â F (xj). It must be shown that y â F (x), or just yi â BRi(x). Let v be yiâs valueagainst x. For the purposes of contradiction assume that there exists a y with higher value, v , than i yi. Let δ = v â v. Since the value function is continuous, we can choose a point in the sequencewhere the policies are close enough to their limit points that they make an arbitrarily small changeto the value. Speciï¬cally, let N be large enough that the value of y against xN differs from v by at i most δ/4, and the value of yi against xN differs from v by at most δ/4, and the value of yN against i 380
EXISTENCE OF MULTIAGENT EQUILIBRIA xN differs from yi against xN by at most δ/4. Adding all of these together, we have a point in thesequence yN whose value against xN is less than the value of y against xN . So yN / â BR i i i i(xN ), and therefore yN / â F (xN ) creating our contradiction. The comparison of values of these various four joint policies is shown in Figure 8 and helps illustrate the resulting contradiction. (yi, x) (yi, xN ) (yN , xN ) (y , xN ) (y , x) i i i v v δ/4 δ/4 âδ/4 Increasing Value Figure 8: An illustration of the demonstration by contradiction that the best-response functions are upper hemi-continuous. We can now apply Kakutaniâs ï¬xed point theorem. So there exists Ï â Î such that Ï â F (Ï). This means Ïi â BRi(Ïâi), and therefore this is a restricted equilibrium. References Baxter, J., & Bartlett, P. L. (2000). Reinforcement learning in POMDPâs via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 41â48,Stanford University. Morgan Kaufman. Bowling, M., & Veloso, M. (1999). Bounding the suboptimality of reusing subproblems. In Pro- ceedings of the Sixteenth International Joint Conference on Artiï¬cial Intelligence, pp. 1340â1345, Stockholm, Sweden. Morgan Kaufman. An earlier version appeared in the Proceedingsof the NIPS Workshop on Abstraction in Reinforcement Learning, 1998. Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artiï¬cial Intelligence, 136, 215â250. Carmel, D. (1997). Model-based Learning of Interaction Strategies in Multi-agent Systems. Ph.D. thesis, Computer Science Department, Technion. Carmel, D., & Markovitch, S. (1996). Learning models of intelligent agents. In Proceedings of the Thirteenth National Conference on Artiï¬cial Intelligence, Menlo Park, CA. AAAI Press. Claus, C., & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artiï¬cial Intelligence, pp.746â752, Menlo Park, CA. AAAI Press. Filar, J., & Vrieze, K. (1997). Competitive Markov Decision Processes. Springer Verlag, New York. Fink, A. M. (1964). Equilibrium in a stochastic n-person game. Journal of Science in Hiroshima University, Series A-I, 28, 89â93. Gaughan, E. D. (1993). Introduction to Analysis, 4th Edition. Brooks/Cole Publishing Company, Paciï¬c Grove, CA. Gilboa, I., & Samet, D. (1989). Bounded versus unbounded rationality: The tyranny of the weak. Games and Economic Behavior, 213â221. 381
BOWLING & VELOSO Gintis, H. (2000). Game Theory Evolving. Princeton University Press. Greenwald, A., & Hall, K. (2002). Correlated Q-learning. In Proceedings of the AAAI Spring Symposium Workshop on Collaborative Learning Agents. Hauskrecht, M., Meuleau, N., Kaelbling, L. P., Dean, T., & Boutilier, C. (1998). Hierarchical so- lution of Markov decision processes using macro-actions. In Proceedings of the FourteenthAnnual Conference on Uncertainty in Artiï¬cial Intelligence (UAI-98). Hu, J., & Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning,pp. 242â250, San Francisco. Morgan Kaufman. Jaakkola, T., Singh, S. P., & Jordan, M. I. (1994). Reinforcement learning algorithm for partially ob- servable Markov decision problems. In Advances in Neural Information Processing Systems6. MIT Press. Jansen, P. J. (1992). Using Knowledge About the Opponent in Game-tree Search. Ph.D. thesis, Computer Science Department, Carnegie Mellon University. Kuhn, H. W. (1953). Extensive games and the problem of information. In Kuhn, H. W., & Tucker, A. W. (Eds.), Contributions to the Theory of Games II, pp. 193â216. Princeton UniversityPress. Reprinted in (Kuhn, 1997). Kuhn, H. W. (Ed.). (1997). Classics in Game Theory. Princeton University Press. Larson, K., & Sandholm, T. (2001). Bargaining with limited computation: Deliberation equilibrium. Artiï¬cial Intelligence, 132(2), 183â217. Littman, M. (2001). Friend-or-foe Q-learning in general-sum games. In Proceedings of the Eigh- teenth International Conference on Machine Learning, pp. 322â328. Morgan Kaufman. Littman, M., & Stone, P. (2001). Leading best-response strategies in repeated games. In Seven- teenth Annual International Joint Conference on Artiï¬cial Intelligence Workshop on Eco-nomic Agents, Models, and Mechanisms. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 157â163.Morgan Kaufman. Littman, M. L., & Szepesv´ari, G. (1996). A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the 13th International Conference on Machine Learning,pp. 310â318, Bari, Italy. Morgan Kaufmann. Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and em- pirical results. Machine Learning, 22, 159â196. Mataric, M. J. (1994). Reward functions for accelerated learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco. Morgan Kaufman. McGovern, A., & Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on MachineLearning, pp. 361â368. Morgan Kaufman. Mertens, J. F., & Neyman, A. (1981). Stochastic games. International Journal of Game Theory, 10, 53â56. 382
EXISTENCE OF MULTIAGENT EQUILIBRIA Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, 103â130. Nash, Jr., J. F. (1950). Equilibrium points in n-person games. PNAS, 36, 48â49. Reprinted in (Kuhn, 1997). Ng, A. Y., Parr, R., & Koller, D. (1999). Policy search via density estimation. In Advances in Neural Information Processing Systems 12, pp. 1022â1028. MIT Press. Noda, I., Matsubara, H., Hiraki, K., & Frank, I. (1998). Soccer server: a tool for research on multi- agent systems. Applied Artiï¬cial Intelligence, 12, 233â250. Osborne, M. J., & Rubinstein, A. (1994). A Course in Game Theory. The MIT Press. Osborne, M. J., & Rubinstein, A. (1997). Games with procedurally rational players. Working papers 9702, McMaster University, Department of Economics. Riley, P., & Veloso, M. (2000). On Behavior Classiï¬cation in Adversarial Environments, pp. 371â 380. Springer-Verlag. Riley, P., & Veloso, M. (2002). Planning for distributed execution through use of probabilistic opponent models. In Proceedings of the Sixth International Conference on AI Planning andScheduling, Toulouse, France. Robinson, J. (1951). An iterative method of solving a game. Annals of Mathematics, 54, 296â301. Reprinted in (Kuhn, 1997). Rosen, J. B. (1965). Existence and uniqueness of equilibrium points for concave n-person games. Econometrica, 33, 520â534. Rubinstein, A. (1998). Modeling Bounded Rationality. The MIT Press. Sen, S., Sekaran, M., & Hale, J. (1994). Learning to coordinate without sharing information. In Proceedings of the 13th National Conference on Artiï¬cial Intelligence. Shapley, L. S. (1953). Stochastic games. PNAS, 39, 1095â1100. Reprinted in (Kuhn, 1997). Simon, H. A. (1976). From substantive to procedural rationality. In Latis, S. J. (Ed.), Methods and Appraisals in Economics, pp. 129â148. Cambridge University Press, New York. Simon, H. A. (1982). Models of Bounded Rationality. MIT Press, Cambridge, MA. Singh, S., Jaakkola, T., Littman, M. L., & Szepesv´ari, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for re- inforcement learning with function approximation. In Advances in Neural Information Pro-cessing Systems 12, pp. 1057â1063. MIT Press. Sutton, R. S., Precup, D., & Singh, S. (1998). Intra-option learning about temporally abstract ac- tions. In Proceedings of the Fifteenth International Conference on Machine Learning, pp.556â564, San Francisco. Morgan Kaufman. Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In Pro- ceedings of the Tenth International Conference on Machine Learning, pp. 330â337, Amherst,MA. 383
BOWLING & VELOSO Uther, W., & Veloso, M. (1997). Adversarial reinforcement learning. Tech. rep., Carnegie Mellon University. Unpublished. Uther, W. T. B. (2002). Tree Based Hierarchical Reinforcement Learning. Ph.D. thesis, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. Available as technical re-port CMU-CS-02-169. Vrieze, O. J. (1987). Stochastic Games with Finite State and Action Spaces. No. 33 in CWI Tracts. Centrum voor Wiskunde en Informatica. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Kingâs College, Cam- bridge, UK. Williams, R. J., & Baird, L. C. (1993). Tight performance bounds on greedy policies based on imperfect value functions. Technical report, College of Computer Science, Northeastern Uni-versity. 384