N. L. Zhang and W. Liu
Volume 7, 1997
Links to Full Text:Journal of Artificial Intelligence Research 7 (1997) 199-230 Submitted 5/97; published 11/97 A Model Approximation Scheme for Planning in Partially Observable Stochastic Domains Nevin L. Zhang lzhang@cs.ust.hk Wenju Liu wliu@cs.ust.hk Department of Computer Science Hong Kong University of Science and Technology Hong Kong, China Abstract Partially observable Markov decision processes (POMDPs) are a natural model for planning problems where effects of actions are nondeterministic and the state of the world is not completely observable. It is difficult to solve POMDPs exactly. This paper proposes a new approximation scheme. The basic idea is to transform a POMDP into another one where additional information is provided by an oracle. The oracle informs the planning agent that the current state of the world is in a certain region. The transformed POMDP is consequently said to be region observable. It is easier to solve than the original POMDP. We propose to solve the transformed POMDP and use its optimal policy to construct an approximate policy for the original POMDP. By controlling the amount of additional information that the oracle provides, it is possible to find a proper tradeoff between computational time and approximation quality. In terms of algorithmic contributions, we study in details how to exploit region observability in solving the transformed POMDP. To facilitate the study, we also propose a new exact algorithm for general POMDPs. The algorithm is conceptually simple and yet is significantly more efficient than all previous exact algorithms. 1. Introduction In a completely observable and deterministic world, to plan is to find a sequence of actions that will lead an agent to achieve a goal. In real-world applications, the world is rarely completely observable and effects of actions are almost always nondeterministic. For this reason, a growing number of researchers concern themselves with planning in partially observable stochastic domains (e.g., Dean & Wellman, 1991; Cassandra et al., 1994; Parr & Russell, 1995; Boutilier & Poole, 1996). Partially observable Markov decision processes (POMDPs) can be used as a model for planning in such domains. In this model, nondeterminism in effects of actions is encoded by transition probabilities, partial observability of the world by observation probabilities, and goals and criteria for good plans by reward functions (see Section 2 for details). POMDPs are classified into finite horizon POMDPs and infinite horizon POMDPs depending on the number of time points considered. Infinite horizon POMDPs are usually used for planning because one typically does not know beforehand the number of steps it takes to achieve a goal. This paper is concerned with how to solve an infinite horizon POMDP. cfl1997 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved. Zhang & Liu 1.1 Difficulties in Solving POMDPs When the world is fully observable, a POMDP reduces to a Markov decision process (MDP). MDPs have been studied extensively in the dynamic-programming literature (e.g., Puterman, 1990; Bertsekas, 1987, White, 1993). Recent works have concentrated on how to deal with large state spaces (Dean et al., 1993; Boutilier et al., 1995; Dean & Lin, 1995). We are concerned with the partially observable case. This case is considerably more difficult than the fully observable case for two related reasons. First, when the agent knows exactly in which state the world currently is, information from the past (past observations and actions) is irrelevant to the current decision. This is the Markov property. On the other hand, when the agent does not fully observe the state of the world, past information becomes relevant because it can help the agent to better estimate the current state of the world. The problem is that the number of possible states of past information increases exponentially with time. Second, in MDPs the effects of an action are fully observed at the next time point. In POMDPs, on the other hand, the effects of an action are not fully observed at the next time point. Hence one cannot clearly tell the effects of the current action from those of the agent's future behaviors. To properly evaluate the effects of an action, one needs to look into the future and consider the combination of the action with each of the agent's possible behaviors in a, possibly large, number of future steps. The problem is that the number of ways the agent can behave is exponential in the number of future steps considered. 1.2 Previous Work Previous methods for solving POMDPs are usually classified into exact methods and approximate methods (Lovejoy, 1991a). They can also can be classified according to which of the aforementioned two difficulties they directly address. Most previous methods address the difficulty of exponential number of future behaviors (Sondik, 1971; Sondik & Mendelssohn, 1979; Monahan, 1982; Cheng, 1988; Lovejoy, 1991b; and Cassandra et al., 1994). They prune from consideration behaviors that can never be optimal no matter what the information state is (Section 4). Other methods deal with the problem of exponential number of past information states either by aggregating them (Platzman, 1977; White & Schere, 1994) or by considering only a subset of them (Lovejoy, 1992; Brafman, 1997; Hauskrecht, 1997). They are approximation methods in nature. 1.3 Model Approximation In previous approximation methods, approximation takes place in the process of solving a POMDP. We advocate model approximation methods. Such a method approximates a POMDP itself by another one that is easier to solve and uses the solution of the latter to construct an approximate solution to the original POMDP. Model approximation can be in the form of a more informative observation model, or a more deterministic action model, or a simpler state space, or a combination of two or all of the three alternatives. Cassandra et al. (1996) proposed to approximate POMDPs by using MDPs. This is an example of model approximation in the form of a more informative observation model. There is also some work on reducing the size of the state spaces of MDPs 200 Model Approximation for Planning under Uncertainty by aggregation (e.g., Bertsekas & Castanon, 1989; Dean & Lin, 1995; Dean & Givan, 1997). Such work can be conceivably extended to POMDPs, leading to model approximation in the form of a simpler state space. We are not aware of any model approximation schemes in the form of a more deterministic action model. 1.4 Our Proposal This paper proposes a new model approximation scheme in the form of a more informative observation model. It is a generalization of the idea of approximating POMDPs by using MDPs. We transform a POMDP by assuming that, in addition to the observations obtained by itself, the agent also receives a report from an oracle who knows the true state of the world. The oracle does not report the true state itself. Instead, it selects, from a predetermined list of candidate regions, a region that contains the true state and reports that region. The transformed POMDP is said to be region observable because the agent knows for sure that the true state is in the region reported by the oracle. When all candidate regions are singletons, the oracle actually reports the true state of the world. The region observable POMDP reduces to an MDP. MDPs are much easier to solve than POMDPs. One would expect the region observable POMDP to be solvable when all candidate regions are small. In terms of approximation quality, the larger the candidate regions, the less additional information the oracle provides and hence the more accurate the approximation. In the extreme case when there is only one candidate region and it consists of all possible states of the world, the oracle provides no additional information at all. Consequently, the region observable POMDP is identical to the original POMDP. A method for determining approximation quality will be described later in this paper. It allows one to make the tradeoff between approximation quality and computational time as follows: start with small candidate regions and increase their sizes gradually until the approximation becomes accurate enough or the region observable POMDP becomes untractable. Due to problem characteristics, accurate approximation can usually be achieved with small candidate regions. In many applications the agent often has a good idea about the state of the world (e.g., Simmons & Koenig, 1995). Take robot path planning as an example. Observing a landmark, a room number for instance, would imply that the robot is at the proximity of that landmark. Observing a feature about the world, a corridor T-junction for instance, might imply the robot is in one of several regions. Taking history into account, the robot might be able to determine a unique region for its current location. Also, an action usually moves the state of the world to only a few "nearby" states. Thus if the robot has a good idea about the current state of world, it should continue to have a good idea about it in the next few steps. When the agent has a good idea about the state of the world at all times, the oracle does not provide much additional information even with small candidate regions and hence approximation is accurate. Region observable POMDPs with small candidate regions are much easier to solve than general POMDPs. 201 Zhang & Liu 1.5 Organization We will first show how POMDPs can be used as a model for planning in partially observable stochastic domains (Section 2) and give a concise review of the theory of POMDPs (Sections 3 and 4). We will then propose a new method for dynamic-programming updates, a key step in algorithms that solve POMDPs via value iteration (Section 5). Thereafter, we will formally introduce the concept of region observable POMDPs (Section 6) and develop an algorithm for solving region observable POMDPs (Sections 7, 8, and 9). In Section 10, we will discuss decision making for the original POMDPs based on the solutions of their region observable approximations, followed by a method for determining approximation quality (Section 11) and a method to make the tradeoff between approximation quality and computational time (Section 12). Finally, empirical results will be reported in Section 13 and conclusions will be provided in Section 14. 2. Planning in Stochastic Domains and POMDPs To specify a planning problem, one needs to give a set S of possible states of the world, a set O of possible observations, and a set A of possible actions. The sets O and A are always assumed to be finite in the literature, while the state space S can be continuous as well as finite. In this paper, we consider only finite state space. One needs also to give an observation model which describes the relationship between observations and the state of the world, and an action model which describes effects of actions. As a background example, consider path planning for a robot who acts in an office environment. Here S is the set of all location-orientation pairs, O is the set of possible sensor readings, and A consists of actions move-forward, turn-left, turn-right, and declare-goal. The current observation o depends on the current state of the world s. Due to sensor noise, this dependency is uncertain in nature. The observation o sometimes also depends on the action that the robot has just taken a-. The minus sign in the subscript indicates the previous time point. In the POMDP model, the dependency of o upon s and a- is numerically characterized by a conditional probability P (ojs; a-), which is usually referred to as the observation probability. It is the observation model. In a region observable POMDP, the current observation also depends on the previous state of the world s-. The observation probability for this case can be written as P (ojs; a-; s-). The state s+ the world will be in at the next time point depends on the current action a and the current state s. The plus sign in the subscript indicates the next time point. This dependency is again uncertain in nature due to uncertainty in the actuator. In the POMDP model, the dependency of s+ upon s and a is numerically characterized by a conditional probability P (s+js; a), which is usually referred to as the transition probability. It is the action model. On many occasions, we need to consider the joint conditional probability P (s+; o+js; a) of the next state of the world and the next observation given the current state and the current action. It is given by P (s+; o+js; a) = P (s+js; a)P (o+js+; a; s): 202 Model Approximation for Planning under Uncertainty Knowledge about the initial state, if available, is represented as a probability distribution P0 over S. When the agent knows the initial state with certainty, P0 is 1 at the initial state and 0 everywhere else. The planning goal is encoded by a reward function such as the following: r(s; a) = ( 1 if a=declare-goal and s=goal,0 otherwise. The preference for short plans is encoded by discounting future rewards with respect to the current reward (see the next section). In summary, a POMDP consists of a set of possible states of the world, a set of possible observations, a set of possible actions, a observation probability, a transition probability, and a reward function. An MDP has the same ingredients as an POMDP except that it has no observation probability. This is because the state of the world is completely observed in an MDP. 3. Basics of POMDPs This section reviews several concepts and results related to POMDPs. 3.1 Belief States In a POMDP, an agent chooses and executes an action at each time point. The choice is made based on information from the past (past observations and past actions) and the current observation. The amount of memory required to store past observations and actions increases linearly with time. This makes it difficult to maintain past information after a long period of time. The standard way to overcome this difficulty is to maintain, instead of past information, the agent's belief state -- the probability distribution P (stjot; atGamma 1; otGamma 1; : : : ; a1; o1; P0) of the current state st of the world given past information and the current observation. It is well known that the belief state is a sufficient statistic in the sense that it captures all the information contained in past information and the current observation that is useful for action selection. Hence the agent can base its decision solely on the belief state. Compared with maintaining past information, maintaining the belief state is desirable because the number of possible states of the world is finite. One needs only to maintain a fixed and finite number of probability values1. The initial belief state is P0. One question is how the agent should update its belief state as time goes by. Following Littman (1994), we use b to denote a belief state. For any state s, b(s) is the probability that the world is in state s. The set of all possible belief states will be denoted by B. Suppose b is the current belief state, and a is the current action. If the observation o+ is obtained at the next time point, then the agent should update its belief state from b to a new belief state b+ given by b+(s+) = k X s P (s+; o+js; a)b(s); (1) 1. Storing these values exactly could require unbounded precision. Approximations are implicitly being made due to the fact that machine precision is bounded. 203 Zhang & Liu where k=1= Ps;s+ P (s+; o+js; a)b(s) is the normalization constant (e.g., Littman, 1994). 3.2 POMDPs as MDPs For any belief state b and any action a, define r(b; a) = X s b(s)r(s; a): (2) It is the expected immediate reward for taking action a in belief state b. For any belief state b, any action a, and any observation o+, define P (o+jb; a) = X s;s+ P (s+; o+js; a)b(s): (3) It is the probability of observing o+ at the next time point given that the current belief state is b and the current action is a. Let b+ be the belief state given by equation (1). P (o+jb; a) can also be understood as the probability of the next belief state being b+ given that the current action is a and the current belief state is b. A POMDP over world state space S can be viewed as an MDP over the belief state space B. The the reward function and the transition probability of the MDP are given by Equations (2) and (3) respectively. 3.3 Optimal Policies At each time point, the agent consults its belief state and chooses an action. A policy ss prescribes an action for each possible belief state. Formally it is a mapping from B to A. For each belief state b, ss(b) is the action prescribed by ss for b. Suppose b0 is the current belief state. If an agent follows a policy ss, then its current action is ss(b0) and the immediate reward is r0(b; ss(b0)); with probability P (o+jb0; ss(b0)), the agent's next belief state b1 will be as given by Equation (1), the next action will be ss(b1), and the next reward will be r1(b1; ss(b1)); and so on and so forth. The quality of a policy is measured by the expected discounted rewards it garners. Formally the value function of a policy ss is defined for each belief state b0 to be the following expectation: V ss(b0) = Eb0 [ 1X i=0 fliri(bi; ss(bi))]; (4) where 0^fl!1 is the discount factor. A policy ss1 dominates another policy ss2 if for each belief state b2B V ss1(b) * V ss2(b): (5) Domination is a partial ordering among policies. It is well known that there exists a policy that dominates all other policies (e.g., Puterman, 1990). Such a policy is called an optimal policy. The value function of an optimal policy is called the optimal value function and is denoted by V Lambda . 204 Model Approximation for Planning under Uncertainty 3.4 Value Iteration Value iteration is a standard way for solving infinite horizon MDPs (Bellman, 1957). It begins with an arbitrary initial function V Lambda 0 (b) and improves it iteratively by using the following equation V Lambda t (b) = maxa[r(b; a) + fl X o+ P (o+jb; a)V Lambda tGamma 1(b+)]; (6) where b+ is the belief state given by equation (1). If V Lambda 0 =0, then V Lambda t is called the t-step optimal value function. For any belief state b, V Lambda t (b) is the optimal expected reward the agent can get in t steps starting from b. The following theorem (Puterman, 1990, page 361) tells one when to terminate value iteration and how to construct a "good enough" policy. Theorem 1 Let ss the policy given by ss(b) = arg maxa[r(b; a) + fl X o+ P (o+jb; a)V Lambda t (b+)]: (7) If maxb2BjV Lambda t (b) Gamma V Lambda tGamma 1(b)j ^ ffl, then maxb2BjV ss(b) Gamma V Lambda (b)j ^ 2fflfl1 Gamma fl :2 (8) The quantity maxb2BjV Lambda t (b) Gamma V Lambda tGamma 1(b)j is sometimes called the Bellman residual and the policy ss is called the greedy policy based on V Lambda t . Algorithms for POMDPs are classified into exact or approximation algorithms depending on whether they compute the t-step optimal value function V Lambda t exactly (Lovejoy, 1991a). In the next two sections, we discuss the theoretical foundations of exact algorithms and develop a new exact algorithm. Thereafter, we propose a new approximation algorithm. 4. Piecewise Linearity and Implicit Value Iteration Since the belief space is continuous, exact value iteration cannot be carried out explicitly. Fortunately, it can be carried out implicitly due to the piecewise linearity of the t-step optimal value functions. To explain piecewise linearity, we need the concept of policy trees. 4.1 Policy Trees A t-step policy tree pt (Littman, 1994) prescribes an action for the current time point and an action for each possible information scenario (o1; : : : ; oi; a0; : : : ; aiGamma 1) at each of the next tGamma 1 time points i. Figure 1 shows a 3-step policy tree. The tree reads as follows. Move-forward at the current time point. At the next time point, if o1=0 is observed then turn-left. Thereafter if o2=0 is observed then turn-left again; else if o2=1 is observed then declare-goal; else if o2=2 is observed then move-forward. And so on and so forth. To relate back to the introduction, a t-step policy tree prescribes a way the agent might behave at the current and the next tGamma 1 time points. 205 Zhang & Liu O1 1a 0 1 2 0 1 2 2 1 1aa0 1a 0 forward turn-left turn-right forward O O2 2 2 2 2 2 2 2 2 2 2 2 O a a a a a a a a a 0 1 2 turn-left declare-goal forward declare-goal turn-right declare-goal turn-left declare-goal turn-left Figure 1: A 3-step policy tree. When t?1, the subtree rooted at the o1 node will be called a o-rooted tGamma 1-step policy tree, and will be denoted by ffitGamma 1. It is a mapping from the set of possible observations O to the set of all possible tGamma 1-step policy trees; it prescribes a tGamma 1 step policy tree ffitGamma 1(o) for each possible observation o. In our example, ffi2(o1=0) is the 2-step policy tree rooted at the uppermost a1 node. When t?1, a t-step policy tree pt has two components: an action a for the current time point and an o-rooted tGamma 1-step policy tree ffitGamma 1 for the next tGamma 1 time points. For this reason, we shall sometimes write pt as a pair (a; ffitGamma 1) and call a the first action of pt. By altering the actions on the edges out of the a-nodes, one obtains different t-step policy trees. The set of all possible t-step policy trees will be denoted by Pt. A 1-step policy tree is simply an action, and hence P1 is the same as the set of possible actions A. 4.2 State Value Functions of Policy Trees For any state s and any t-step policy tree pt=(a; ffitGamma 1), recursively define Vpt(s) = r(s; a) + fl X o+ Xs+ VffitGamma 1(o+)(s+)P (s+; o+js; a); (9) where the second term is to be understood as 0 when t=1. It is the expected discounted total reward the agent receives at the current time and during the next tGamma 1 time points if the world is currently in state s and the agent behaves according to the policy tree pt. We call Vpt the state value function of the t-step policy tree pt. Without mentioning the policy tree, we shall sometimes call Vpt a t-step state value function. The collection of all t-step state value functions will be denoted by Vt, i.e. Vt = fVptjpt2Ptg: For convenience, we let V0 consist of one single function of s that is zero for all s. 206 Model Approximation for Planning under Uncertainty 4.3 State Space Functions and Belief Space Functions It is worthwhile to point out that a t-step state value function is a state space function, i.e. a function over the state space S, while the t-step optimal value function is a belief space function, i.e. a function over the belief space B. We often use notations such as ff or fi to refer to state space functions. A state space function ff(s) induces a belief space function through ff(b) = X s ff(s)b(s): Regarding b as a vector with one component b(s) for each s, the induced belief space function is a linear combination of the components of b. For convenience, we simply say that ff(b) is linear in b. A collection V of state space functions induces a belief space function through V(b) = maxff2V ff(b): (10) Note that we are using V to denote both a set of state space functions and the belief space function it induces. When V is empty, V(b) is 0 by definition. The induced belief space function V(b) is piecewise linear in b in the sense that, for each ff2V, it equals ff(b) in the region fbjff(b)*fi(b) for any other fi2Vg of the belief space B and hence is linear in b in that region. 4.4 Piecewise Linearity of Optimal Value Functions The following theorem was first proved by Sondik (1971). It first appeared in its present form in Littman (1994). Theorem 2 (Piecewise Linearity) The t-step optimal value function V Lambda t is the same as the belief space function induced by the collection of all t-step state value functions Vt, i.e. for any belief state b V Lambda t (b) = Vt(b):2 The theorem is true for the following reasons. V Lambda t (b) is the reward the agent receives if it behaves optimally and for any policy tree pt, Vpt(b) is the reward the agent gets if it behaves according to pt. Because one of the policy trees must be optimal, V Lambda t (b) = maxptVpt(b)=Vt(b). Due to this theorem, we say that the collection Vt of state value functions is representation of V Lambda t . 4.5 Parsimonious Representations The size of Vt increases exponentially with t. As a matter of fact, the total number of t-step policy trees (Cassandra, 1994) is: jPtj = jAj jOjtGamma 1 jOjGamma 1 : There are potentially the same number of t-step state value functions. Fortunately, many of the state value functions can be pruned without affecting the induced belief space function. Let us make this property more explicit. 207 Zhang & Liu Let W and X be two sets of state space functions. We say that W covers X if it induces the same belief space function as X does, i.e. if W(b) = X (b) for any belief state b. We say that W parsimoniously covers X if W covers X and none of its proper subsets do. When W covers or parsimoniously covers X , we refer to W as a covering or a parsimonious covering of X . Theorem 3 All parsimonious coverings of a set of state space functions consist of the same number of state space functions. 2 The theorem has been known for sometime (e.g., Littman, 1994). Due to this theorem, one can also define a parsimonious covering as a covering that contains the minimum number of state space functions. A parsimonious covering ^Vt of Vt is also a representation of V Lambda t in the sense that ^Vt(b) = V Lambda t (b) for any belief state b. This representation is parsimonious because it consists of the fewest number of state space functions among all the representations of V Lambda t . 4.6 Dynamic-Programming Updates The question now is how to obtain a parsimonious covering of Vt. As will be shown in the next section, it is possible to obtain a parsimonious covering of Vt by starting from a parsimonious covering of VtGamma 1. The process of computing a parsimonious covering of Vt from a parsimonious covering of VtGamma 1 is called dynamic-programming updates (Littman et al., 1995). It is a key step in algorithms that solve POMDPs via value iteration. Previous algorithms for dynamic-programming updates include the enumeration and pruning algorithms by Monahan (1992), Eagle (1984), and Lark (White, 1991), the onepass algorithm by Sondik (1971), the linear support and relaxed region algorithms by Cheng (1988), and the witness algorithm by Cassandra et al. (1994) and Littman (1994). The witness algorithm has been empirically proved to be the most efficient among all those algorithms (Littman et al., 1995). 4.7 Implicit Value Iteration The procedure solvePOMDP shown in Figure 2 carries out value iteration implicitly: instead inductively computing the t-step optimal value function V Lambda t itself, it computes a parsimonious covering of Vt -- a set of state space functions that represents V Lambda t . In the procedure, the subroutine update( ^VtGamma 1) takes a parsimonious covering ^VtGamma 1 of VtGamma 1 and returns a parsimonious covering ^Vt of Vt. It can be implemented using any of the algorithms mentioned in the previous subsection. The subroutine stop( ^Vt; ^VtGamma 1; ffl) determines whether the Bellman residual has fallen below the threshold ffl from the parsimonious coverings ^VtGamma 1 and ^Vt of VtGamma 1 and Vt. See Littman (1994) for an implementation of this subroutine. Procedure solvePOMDP terminates when the Bellam residual falls below the threshold ffl and return a set of state space functions. The set ^Vt of state space functions returned represents the t-step optimal value function V Lambda t . It is the solution to the input POMDP. The planning agent keeps ^Vt in its memory. When it needs to make a decision, the agent consults its belief state b and chooses an action using (7) with V Lambda t (b+) replaced by ^Vt(b+). 208 Model Approximation for Planning under Uncertainty -------------------------------------------- Procedure solvePOMDP(M; ffl): ffl Input: M -- A POMDP, ffl -- A positive number. ffl Output: A set of state space functions. 1. t 0, ^V0 f0g. 2. Do ffl t=t+1. ffl ^Vt update( ^VtGamma 1): while stop( ^Vt; ^VtGamma 1; ffl) = no. 3. Return ^Vt. -------------------------------------------- Figure 2: Implicit value iteration. 5. A New Algorithm for Dynamic-Programming Updates This section proposes a new algorithm for dynamic-programming updates. There are four subsections. In the first three subsections, we show that a parsimonious covering of Vt can be obtained by starting from a parsimonious covering of VtGamma 1 and, while doing so, introduce concepts and results that are necessary for the development of the new algorithm. 5.1 Relationship Between VtGamma 1 and Vt Suppose W and X are two sets of state space functions. The cross sum WLX of W and X is the following set of state space functions: WMX = fff+fijff2W; fi2X g: It is evident that the cross sum operation is commutative and associative. Consequently, we can talk about the cross sum LNi=0 Wi of a list of sets W0, . . . , WN of state space functions. For any action a and any observation o+, define Qa;o+ = ffl X s+ ff(s+)P (s+; o+js; a)jff2VtGamma 1g: (11) Note that since a and o+ are given, each member fl Ps+ ff(s+)P (s+; o+js; a) of the above set is a function of s, in other words, a state space function. Hence Qa;o+ is a set of state space functions. Let 0, 1, . . . , N be an enumeration of all possible values of o+. We useL o+Qa;o+ to denote the cross sum L N i=0 Qa;i. Proposition 1 Vt = [a[fr(s; a)gL(Lo+Qa;o+)]: 209 Zhang & Liu Proof: By the definition of the set Vt and Equation (9), a state space function ff is in Vt if and only if there exist action a and fio+ 2 VtGamma 1 for each o+ 2 O such that ff(s) = r(s; a) + X o+ fl X s+ fio+ (s+)P (s+; o+js; a): The proposition follows. 2 5.2 Properties of Coverings Lemma 1 Suppose W, X , and Y are three sets of state space functions. If W parsimoniously covers X and X covers Y, then W parsimoniously covers of Y. 2 Lemma 2 Let W, W0, X , and X 0 be four sets of state space functions. If W0 covers W and X 0 covers X , then 1. W0LX 0 covers WLX . 2. W0[X 0 covers W[X .2 5.3 Coverings of Vt from Parsimonious Coverings of VtGamma 1 Let ^VtGamma 1 be a parsimonious covering of VtGamma 1. For any action a and any observation o+, define Q0a;o+ = ffl X s+ ff(s+)P (s+; o+js; a)jff2 ^VtGamma 1g: Note that the definition of Q0a;o+ is the same as that of Qa;o+ except that VtGamma 1 is replaced by ^VtGamma 1. Also define V0t = [a[fr(s; a)gM(Mo +Q 0a;o + )]: Proposition 2 The set V0t covers Vt. Formal proof of this proposition is given in Appendix A. Informally, the fact that ^VtGamma 1 covers VtGamma 1 implies that Q0a;o+ covers Qa;o+ , which in turn implies that V0t covers Vt due to Proposition 1 and Lemma 2. According to Proposition 2 and Lemma 1, one can obtain a parsimonious covering of Vt by finding a parsimonious covering of V0t and V0t is defined in terms of a parsimonious covering ^VtGamma 1 of VtGamma 1. This is why we said that a parsimonious covering of Vt can be obtained by starting from a parsimonious covering of VtGamma 1. Monahan's exhaustive method finds a parsimonious covering of V0t by enumerating all the state space functions in V0t one by one and detecting those that can be pruned by solving linear programs. Lark's algorithm works in a similar fashion except that its linear programs have fewer constraints. Since V0t consists of jAjj ^VtGamma 1jjOj state space functions, enumerating them one by one is very expensive. Other algorithms (Sondik, 1971; Cheng, 1988; Littman, 1994) avoid this difficulty by exploiting the structures of V0t. 210 Model Approximation for Planning under Uncertainty ---------------------------------------------- Procedure incrPruning(fW0; W1; : : : ; WN g): 1. WW0. 2. For i = 1 to N , W purge(WMWi): 3. Return W. Procedure update( ^VtGamma 1): ffl Input: ^VtGamma 1 -- a parsimonious covering of VtGamma 1. ffl Input: a parsimonious covering of Vt. 1. For each a2A, (a) Compute the sets Q0a;0, Q0a;1, . . . , and Q0a;N . (b) Wa incrPruning(fQ0a;0; Q0a;1; : : : ; Q0a;N g). 2. Return purge([a[fr(s; a)gLWa]). ---------------------------------------------- Figure 3: Incremental pruning and dynamic-programming updates. 5.4 A New Algorithm for Dynamic-Programming Updates Let purge(W) be a subroutine that takes a set W of state space functions and returns a parsimonious covering of W. An implementation of purge can be found in the appendix. Let W0, . . . , WN be sets of state space functions. Consider the procedure incrPruning shown in Figure 3. Let n be a number between 0 and N . Using Lemmas 1 and 2, one can easily show by induction that, at the end of the nth pass through the for-loop, the set W is a parsimonious covering of Lni=0 Wi. Consequently, the procedures returns a parsimonious covering of LNi=0 Wi. The procedure is named incremental pruning because pruning takes place after each cross sum. Let 0, 1, . . . , N be an enumeration of all the possible observations, i.e. possible instantiations of o+. For any action a, suppose the sets Q0a;0, Q0a;1, . . . , and Q0a;N have been computed. Applying incrPruning to those sets, we get a parsimonious covering ofL N i=0 Q0a;i. Denote it by Wa. According to Lemma 2, [a[fr(s; a)gLWa] covers V0t. Dueto Lemma 1, this fact implies that a parsimonious covering of [ a[fr(s; a)gLWa] is also a parsimonious covering of V0t and hence of Vt. Thus, a parsimonious covering of Vt can be found from a parsimonious covering of VtGamma 1 using the procedure update shown in Figure 3. We also use the term incremental pruning to refer to the above algorithm for dynamicprogramming updates. It has been shown elsewhere (Cassandra et al., 1977) that incremental pruning has the same asymptotic complexity as the witness algorithm and empirically it significantly outperforms the latter. 211 Zhang & Liu 6. Region-Based Model Approximation We have so far been concerned with exact algorithms. Experiments with incremental pruning, presently the most efficient exact algorithm, have revealed that it can solve only small POMDPs (Cassandra et al., 1997). One needs to resort to approximation in order to solve large real-world problems. Most previous approximation methods solve a POMDP directly; they approximate the t-step optimal value function of the POMDP. In the rest of this paper, we develop a new method that approximates a POMDP itself by another that has a more informative observation model and is hence easier to solve. The latter POMDP is solved and its solution is used to construct a solution to the original POMDP. 6.1 The Basic Idea We make the following assumption about problem characteristics. in a POMDP M, even though an agent does not know the true state of the world, it often has a good idea about the state. Justifications for this assumption were given in the introduction and empirical evidence is presented by Simmons & Koenig (1995). Consider another POMDP M0 that is the same as M except that in addition to the observation made by itself, the agent also receives a report from an oracle who knows the true state of the world. The oracle does not report the true state itself. Instead it selects, from a predetermined list of candidate regions, a region that contains the true state and reports that region. More information is available to the agent in M0 than in M; additional information is provided by the oracle. Since in M the agent already has a good idea about the true state of the world, the oracle might not provide much additional information even when the candidate regions are small. Consequently, M0 could be a good approximation of M. In M0, the agent knows for sure that the true state of the world is in the region reported by the oracle. For this reason, we say that it is region observable. The region observable POMDP M0 can be much easier to solve than M when the candidate regions are small. For example, if the oracle is allowed to report only singleton regions, then it actually reports the true state of the world and hence M0 is an MDP. MDPs are much easier to solve than POMDPs. One would expect the region observable POMDP M0 to be solvable when the candidate regions are small. 6.2 Spectrum of Approximations If the region reported by the oracle is always the set of all possible states, then no additional information is provided, because the report that the true state of the world is one of the possible states has no information content. In this case, M0 has the same solution as M and solving M0 is equivalent to solving M directly. This is one extreme of the spectrum. At the other extreme, if all the candidate regions are singletons, the oracle reports the true state of the world. Maximum amount of additional information is provided and M0 is actually an MDP. The MDP might not be a good approximation of M but it is much easier to solve than M. 212 Model Approximation for Planning under Uncertainty Previous methods for solving a POMDP either solve it directly or to approximate it by using a MDP. By allowing the oracle to report regions that are neither singletons nor the set of all possible states, this paper opens up the possibility of exploring the spectrum between those two extremes. One way to explore the spectrum is to start with singleton candidate regions and increase their sizes gradually. Approximation quality and computational time both increase as one goes along. One stops when the approximation is accurate enough or the region observable POMDP becomes intractable. A method for determining approximation quality will be described later. We now set out to make these ideas more concrete by starting with the concept of region systems. 6.3 Region Systems A region is simply a subset of states of the world. A region system is a collection of regions such that no region is a subset of another region in the collection and the union of all regions equals the set of all possible states of the world. We use R to denote a region and R to denote a region system. Region systems are used to restrict the regions that the oracle can choose to report. The choice of a region system determines the computational complexity of the region observable POMDP M0 and approximation quality. How to choose regions so as to make proper tradeoff between computational time and approximation quality is an open research issue. Here is a preliminary approach. The idea is to create a region for each state by including its "nearby" states. We say a state s0 is ideally reachable in one step from another state s if after executing a certain action in state s, the probability of the world ending up in state s0 is the highest. A state sk is ideally reachable in k steps from another state s0 if there are state s1, . . . , skGamma 1 such that si+1 is ideally reachable from si in one step for all 0^i^kGamma 1. Any state is ideally reachable from itself in 0 step. For any non-negative integer k, the radius-k region centered at a state s is the set of states that are ideally reachable from s in k or less steps. A radius-k region system is the one obtained by creating a radius-k region for each state and then removing, one after another, regions that are subsets of others. When k is 0, the radius-k region system consists of singleton regions. On the other hand, if each state is reachable from any other state in k or less steps, there is only one region in the radius-k region system -- the set of all possible states. 6.4 Region Observable POMDPs Given a region system R and a POMDP M, we construct a region observable POMDP M0 by assuming that at each time point the agent not only obtains an observation by itself but also receives a report from an oracle who knows the true state of the world. The oracle does not report the true state itself. Instead it chooses from R one region that contains the true state and reports that region. The amount of additional information provided by the oracle depends not only on the region system used but also on the way the oracle chooses regions. For example, if the 213 Zhang & Liu oracle always reports the region centered at the true state, then it implicitly reports the true state itself. In order to provide as little additional information as possible, the oracle should consider what the agent already knows. However, it cannot take the entire history of past actions and observations into account because if it did, M0 would not be a POMDP. The current observation would depend on the entire history. For any non-negative state space function f (s) and any region R, we call the quantity supp(f; R)= Ps2R f (s)= Ps2S f (s) the degree of support of f by R. Note that when f is a probability distribution, the denominator is 1. If R supports f to degree 1, we say that R fully supports f . We suggest the following region-selection rule for the oracle. Let s- be the previous true state of the world, a- be the previous action, and o be the current observation. The oracle should choose, among all the regions in R that contain the true state of the world, one that supports the function P (s; ojs-; a-) of s to the maximum degree. Where there is more than one such regions, choose the one that comes first in a predetermined ordering among the regions. Here are some arguments in support of the rule. If the previous world state s- were known to the agent, then its current belief state b(s), a function of s, would be proportional to P (s; ojs-; a-). In this case, the rule minimizes additional information in the sense that the region reported supports the current belief state to the maximum degree. If the previous world state is known to be around s-, the same is roughly true. Also if the current observation is informative enough, being a landmark for instance, to ensure that the world state is in a certain region, then the region chosen using the rule fully supports the current belief state. In such a case, no additional information is provided. Despite those arguments, we do not claim that the rule described above is optimal. Finding a rule that minimizes additional information is still an open problem. The probability P (Rjs; o; s-; a-) of a region R being chosen under the above scheme is given by P (Rjs; o; s-; a-) = 8?!?: 1 if R is the first region s.t. s2R and for any other region R0P s02R P (s0; ojs-; a-)* Ps02R0 P (s0; ojs-; a-)0 otherwise. The region observable POMDP M0 differs from the original POMDP M only in terms of observation; in addition to the observation o made by itself, the agent also receives a report R from the oracle. We shall denote an observation in M0 by z and write z=(o; R). The observation model of M0 is given by P (zjs; a-; s-) = P (o; Rjs; a-; s-) = P (ojs; a-)P (Rjs; o; s-; a-): The joint conditional probability P (s+; z+js; a) of the next state s+ of the world and the next observation z+ given the current state s and the current action a is P (s+; z+js; a) = P (s+js; a)P (z+js+; a; s): 214 Model Approximation for Planning under Uncertainty 7. Solving Region Observable POMDPs In principle, the region observable POMDP M0 can be solved in the same way as general POMDPs using the procedure solvePOMDP. It is not advisable to do so, however, since solvePOMDP does not automatically exploit region observability. This section and the next two sections develop an algorithm for solving M0 that takes advantage of region observability. 7.1 Restricted Value Iteration For any region R, let BR be the set of belief states that are fully supported by R. Let R be the region system underlying the region observable POMDP M0. Define BR = [R2RBR. It is easy to see that, in M0, no matter what the current belief state b is, the next belief state b+ must be in BR. We assume that the initial belief state is in BR. Then all possible belief states the agent might encounter are in BR. This implies that policies for M0 need only be defined over BR and value iteration for M0 can be restricted to the subset BR of B. We restrict value iteration for M0 to BR for the sake of efficiency. Doing so implies that the t-step optimal value function of M0, denoted by U Lambda t , is defined only over BR and the Bellman residual is now maxb2BRjU Lambda t (b) Gamma U Lambda tGamma 1(b)j. To avoid confusion, we call it the restricted Bellman residual and call U Lambda t the restricted t-step optimal value function. Since BR is continuous, restricted value iteration cannot be carried out implicitly. The next subsection shows how it can be carried implicitly. 7.2 Implicit Restricted Value Iteration Let W and X be two sets of state space functions and let R be a region. We say that W covers X in region R if, for any b2BR, W(b) = X (b): We say that W parsimoniously covers X in region R if W covers X in region R and none of its proper subsets do. When W covers or parsimoniously covers X in a region, we refer to W as a regional covering or a parsimonious regional covering of X . Let Ut be the set of all t-step state value functions of M0. According to Theorem 2, U Lambda t (b) = Ut(b) for any belief state b2BR. For each region R, suppose ^Ut;R is a set of state space functions that parsimoniously covers Ut in region R. Then the collection f ^Ut;RjR2Rg is a representation of U Lambda t in the sense that for any b2BR, U Lambda t (b) = ^Ut;Rb (b); (12) where Rb is such that b2BR, i.e. such that Rb fully supports b. As will be shown in the next section, parsimonious regional coverings of Ut can be obtained from parsimonious regional coverings of UtGamma 1. Let ROPOMDPupdate(R; ^UtGamma 1;R+jR+2Rg) be a procedure that takes a region R and parsimonious regional covering f ^UtGamma 1;R+ jR+2Rg 215 Zhang & Liu ------------------------------------------------------ Procedure solveROPOMDP(M0; ffl) ffl Input: M0 -- A region observable POMDP, ffl -- A positive number. ffl Output: A list of sets of state space functions. 1. t0. 2. For R2R, ^U0;R f0g. 3. Do ffl t t+1. ffl For each R2R, ^Ut;R ROPOMDPupdate(R; f ^UtGamma 1;R + jR+2Rg): while ROPOMDPstop(f ^Ut;RjR2Rg; f ^UtGamma 1;R+ jR+2Rg; ffl) = no. 4. Return f ^Ut;RjR2Rg. ------------------------------------------------------ Figure 4: Implicit restricted value iteration for region-observable POMDPs. of UtGamma 1 and returns a set of state space functions that parsimoniously covers Ut in region R 2. Let ROPOMDPstop be a procedure that determines, from parsimonious regional coverings of UtGamma 1 and Ut, whether the restricted Bellman residual has fallen below a predetermined threshold. The procedure solveROPOMDP shown in Figure 4 carries out restricted value iteration implicitly: instead inductively computing the restricted t-step optimal value function U Lambda t itself, it computes parsimonious regional coverings of Ut. In other words, it computes sets of state space functions that represent U Lambda t in the sense of (12). Let ss0 be the greedy policy for M0 based on U Lambda t . For any b2BR, ss0(b) is defined by Equation (7) with o+ replaced by z+=(o+; R+) and V Lambda t replaced by U Lambda t . Since the list f ^Ut;RjR2Rg of sets of state space functions returned by solveROPOMDP represents U Lambda t in the sense of (12), we have that for any b2BR ss0(b) = arg maxa[r(b; a) + fl X o+;R+ P ((o+; R+)jb; a) ^Ut;R+ (b+)]: (13) The next two sections show how to implement the procedures ROPOMDPupdate and ROPOMDPstop. 2. The string "ROPOMDP" in ROPOMDPupdate stands for region-observable POMDP. 216 Model Approximation for Planning under Uncertainty 8. Dynamic-Programming Updates for Region Observable POMDPs This section shows how the incremental pruning algorithm developed in Section 5 can be adapted to compute parsimonious regional coverings of Ut from parsimonious regional coverings of UtGamma 1. 8.1 Properties of Regional Coverings Lemma 3 Let R be a region and let W, X , and Y be three sets of state space functions. If W parsimoniously covers X in region R and X covers Y in region R, then W parsimoniously covers Y in region R. 2 Lemma 4 Let R be a region and let W, W0, X , and X 0 be four sets of state space functions. If W0 and X 0 respectively cover W and X in region R, then 1. W0LX 0 covers WLX in region R. 2. W0[X 0 covers W[X in region R. 2 8.2 Regional Coverings of Ut from Parsimonious Regional Coverings of UtGamma 1 From parsimonious regional coverings ^UtGamma 1;R+ (R+2R) of UtGamma 1, this subsection constructs, for each region R2R, a set Ut;R of state space functions and shows that it covers Ut in region R. For any action a and any observation z+=(o+; R+) of M0, let Qa;z+;R be the set of all state space functions fi that are of the following form: fi(s) = ( fl Ps+ ff(s +)P (s+; z+js; a) if s2R 0 otherwise. (14) where ff 2 ^UtGamma 1;R+ . Define Ut;R = [a[fr(s; a)gM(Mz +Qa;z+;R)]: Proposition 3 The set Ut;R covers Ut in region R. Formal proof of this proposition can be found in Appendix A. Informally, the fact that^ UtGamma 1;R+ covers UtGamma 1 in region R+ implies that Qa;z+;R covers Qa;z+ in region R, where Qa;z+ is given by (11) with o+ and VtGamma 1 replaced by z+ and UtGamma 1. This fact in turn implies that Ut;R covers Ut in region R because of Proposition 1 and Lemma 4. 8.3 Possible Observations at the Next Time Point In the definition of Ut;R, the cross sum is taken over all possible observations. This subsection shows that some of the possible observations can be skipped. For any action a and any region R, define Za;R = fz+j X s+ P (s+; z+js; a) ? 0 for some s2R g: 217 Zhang & Liu -------------------------------------------------------------------- Procedure ROPOMDPupdate(R; ^UtGamma 1;R+jR+2Rg): ffl Inputs: R -- A region, and for any region R+,^ UtGamma 1;R+ parsimoniously covers UtGamma 1 in region R+. ffl Output: A set of state space functions that parsimoniously covers Ut in region R. 1. For each action a, (a) Compute the set Za;R and enumerate its members as 0; 1; : : : ; M . (b) For i=0 to M , compute the set Qa;i;R. (c) WarestrictedIncrPruning(fQa;0;R; Qa;1;R; : : : ; Qa;M;Rg; R): 2. Return purge([a[fr(s; a)g L Wa]; R). Subroutine restrictedIncrPruning(fW0; W1; : : : ; WM g; R): 1. Let WW0. 2. For i=1 to M , Wpurge(WMWi; R): 3. Return W. -------------------------------------------------------------------- Figure 5: Dynamic-programming updates for region observable POMDPs. It is the set of observations that the agent can possibly receive at the next time point given that the current state of the world lies in region R and the current action is a. There are many observations outside this set. As a matter of fact, an observation z+=(o+; R+) is not in the set if it is not possible to reach region R+ from region R in one step. For any z+=(o+; R+), if z+ =2Za;R, then Ps+ P (s+; z+js; a) = 0 for all s2R. In such a case, Qa;z+;R = f0g according to (14). Since, f0gLW=W for any set W of state space functions, we have Ut;R = [a[fr(s; a)gM(Mz +2Za;RQa;z+;R)]: 8.4 Parsimonious Regional Covering of Ut Proposition 3 and Lemma 3 imply that, for any region R, a set of state space functions parsimoniously covers Ut in region R if and only if it parsimoniously covering Ut;R in region R. According to Lemmas 3 and 4, a set of state space functions that parsimoniously covers Ut;R in region R can be found using the procedure ROPOMDPupdate shown in Figure 5 (c.f. Section 5.4). In the procedure, the subroutine purge(W; R) takes a set W of state space 218 Model Approximation for Planning under Uncertainty ------------------------------------------------------ Procedure ROPOMDPstop(f ^Ut;RjR2Rg; f ^UtGamma 1;RjR2Rg; ffl) ffl Inputs: ffl -- A positive number, and for any region R^ Ut;R covers Ut in region R, and ^UtGamma 1;R covers UtGamma 1 in region R. ffl Outputs: yes -- If the restricted Bellman residual ^ ffl, no -- Otherwise. 1. For each region R, (a) flag yes. (b) For each ff2 ^UtGamma 1;R, flag no if dominate(ff; ^Ut;R; R; ffl) 6= nil: (c) For each ff2 ^Ut;R, flag no if dominate(ff; ^UtGamma 1;R; R; ffl) 6= nil: (d) Return no if flag = no. 2. Return yes. ------------------------------------------------------ Figure 6: Procedure for determining whether the restricted Bellman residual has fallen below a threshold. functions and region R, and returns a set of state space functions that parsimoniously covers W in region R. An implementation of this subroutine can be found in Appendix B. 9. The Stopping Condition This section shows how to determine whether the restricted Bellman residual has fallen below a predetermined threshold ffl from regional coverings of Ut and UtGamma 1. For any region R, let ^Ut;R and ^UtGamma 1;R be two sets of state space functions that respectively cover Ut and UtGamma 1 in region R. By the definition of regional coverings, we have Lemma 5 The restricted Bellman residual is no larger than ffl if and only if for any region R and any belief state b2BR, 1. For any ff2 ^Ut;R, ff(b) ^ ^UtGamma 1;R(b)+ffl; and 2. For any ff2 ^UtGamma 1;R, ff(b) ^ ^Ut;R(b)+ffl:2 219 Zhang & Liu Let dominate(ff; W; R; ffl) be a procedure that returns a belief state b in BR such that ff(b) ? W(b)+ffl. If such a belief state does not exist, it returns nil. An implementation of this procedure can be found in Appendix B. The procedure ROPOMDPupdate shown in Figure 6 returns yes if the restricted Bellman residual has fallen below ffl and no otherwise. A couple of notes are in order. First, when the reward function r(s; a) is non-negative, U Lambda t increases with t. In this case, the restricted Bellman residual becomes maxb2BR(U Lambda t (b)Gamma U Lambda tGamma 1(b)). Consequently, step (c) can be skipped. Second, when r(s; a) takes negative values for some s and some a, a constant can be added to it so that it becomes non-negative. Adding a constant to r(s; a) does not affect the optimal policy. However, it makes it easier to determine whether the restricted Bellman residual has fallen below a threshold. 10. Decision Making for the Original POMDP Suppose we have solved the region observable POMDP M0. The next step is to construct a policy ss for the original POMDP M based on the solution for M0. Even though it is our assumption that in the original POMDP M the agent has a good idea about the state of the world at all times, there is no guarantee that its belief state are always in BR. There is no oracle in M. A policy should prescribe actions for belief states in BR as well as for belief states outside BR. One issue here is that the policy ss0 for M0 is defined only for belief states in BR. Fortunately, ss0 can be naturally extended to the entire belief space by ignoring the constraint b2BR in Equation (13). We hence define a policy ss for M as follows: for any b2B, ss(b) = arg maxa[r(b; a) + fl X o+;R+ P ((o+; R+)jb; a) ^Ut;R+ (b+)]: (15) Let k be the radius of the region system underlying M0. The policy ss given above will be referred to as the radius-k approximate policy for M. The entire process of obtaining this policy, including the construction and solving of the region observable POMDP M0, will be referred to as region-based approximation. It is worthwhile to compare this equation with Equation (7). In Equation (7), there are two terms on the right hand side. The first term is the immediate reward for taking action a and the second term is the discounted future reward the agent can expect to receive if it behaves optimally. Their sum is the total expected reward for taking action a. The action with the highest total reward is chosen. The second term is difficult to obtain. In essence, Equation (15) approximates the second term using the optimal expected future reward the agent can receive with the help of the oracle, which is easier to compute. It should be emphasized that the presence of the oracle is assumed only in the process of computing the radius-k approximate policy. The oracle is not present when executing the policy. 11. Quality of Approximation and Simulation In general, the quality of an approximate policy ss is measured by the distance between its value function V ss(b) and the optimal value function V Lambda (b). This measurement does not 220 Model Approximation for Planning under Uncertainty consider what an agent might know about the initial state of the world. As such, it is not appropriate for a policy obtained through region-based approximation. One cannot expect such a policy to be of good quality if an agent is very uncertain about the initial state of the world because it is obtained under the assumption that an agent has a good idea about the state of the world at all times. This section describes a scheme for determining the quality of an approximate policy in cases where an agent knows the initial state of the world with certainty. The scheme can be generalized to cases where there is a small amount of uncertainty about the initial state; for example, cases where the initial state is known to be in some small region. An agent might need to reach a goal from different initial states at different times. Let P (s) be the frequency it will start from state s3. The quality of an approximate policy ss can be measured by Ps jV Lambda (s) Gamma V ss(s)jP (s), where V Lambda (s) and V ss(s) denote the rewards the agent can expect to receive starting from state s if it behaves optimally or according to ss respectively. By definition V Lambda (s)*V ss(s) for all s. Let U Lambda be the optimal value function of the region observable POMDP M0. Since more information is available to the agent in M0, U Lambda (s)*V Lambda (s) for all s. Therefore, Ps[U Lambda (s) Gamma V ss(s)]P (s) is an upper bound on Ps[V Lambda (s) Gamma V ss(s)]P (s). Let ss0 be the policy for M0 given by (13). When the restricted Bellman residual is small, ss0 is close to optimal for M0 and the value function V ss 0 of ss0 is close to U Lambda . Consequently,P s[V ss 0(s) Gamma V ss(s)]P (s) is an upper bound on P s[V Lambda (s) Gamma V ss(s)]P (s) when the restrictedBellman residual is small enough. One way to estimate the quantity Ps[V ss 0(s) Gamma V ss(s)]P (s) is to conduct a large number of simulation trials. In each trial, an initial state is randomly generated according to P (s). The agent is informed of the initial state. Simulation takes place in both M and M0. In M, the agent chooses, at each step, an action using ss based on its current belief state. The action is passed to a simulator which randomly generates the next state of the world and the next observation according to the transition and observation probabilities. The observation (but not the state) is passed to the agent, who updates its belief state and chooses the next action. And so on and so forth. The trial terminates when the agent chooses the action declare-goal or a maximum number of steps is reached. Simulation in M0 takes place in a similar manner except that the observations and the observation probabilities are different and actions are chosen using ss0. If the goal is correctly declared at the end of a trial, the agent receives a reward in the amount fln, where n is the number of steps. Otherwise, the agent receive no reward. The quantity Ps[V ss 0(s) Gamma V ss(s)]P (s) can be estimated using the difference between the average reward received in the trials for M0 and the average reward received in the trials for M. 12. Tradeoff Between Quality and Complexity Intuitively, the larger the radius of the region system, the less the amount of additional information the oracle provides. Hence the closer M0 is to M and the narrower the gap between Ps V ss 0(s)P (s) and P s V ss(s)P (s). Although we have not theoretically proved this, 3. This is not to be confused with the initial belief state P0, which represents the agent's knowledge about the ninitial state at a particular trial. 221 Zhang & Liu empirical results (see the next section) do suggest that Ps V ss(s)P (s) increases with the radius of the region system while Ps V ss 0 (s)P (s) decreases with it. In the extreme case when there is one region in the region system that contains all the possible states of the world, M and M0 are identical and hence so are Ps V ss 0(s)P (s) and P s V ss(s)P (s). These discussions lead to the following scheme for making the tradeoff between complexity and quality: Start with the radius-0 region system and increase the radius gradually until the quantity Ps[V ss 0(s)Gamma V ss(s)]P (s) becomes sufficiently small or the region observable POMDP M0 becomes untractable. 13. Simulation Experiments Simulation experiments have been carried out to show that (1) approximation quality increases with radius of region system and (2) where there is not much uncertainty, a POMDP can be accurately approximated by a region-observable POMDP that can be solved exactly. This section reports on the experiments. 13.1 Synthetic Office Environments Our experiments were conducted using two synthetic office environments borrowed from Cassandra et al. (1996) with some minor modifications. Layouts of the environments are shown in Figure 7, where squares represent locations. Each location is represented as four states in the POMDP model, one for each orientation. The dark locations are rooms connected to corridors by doorways. In each environment, a robot needs to reach the goal location with the correct orientation. At each step, the robot can execute one of the following actions: move-forward, turn-left, turn-right, and declare-goal. The two sets of action models given in Figure 7 were used in our experiments. For the action move-forward, the term F-F (0.01) means that with probability 0.01 the robot actually moves two steps forward. The other terms are to be interpreted similarly. If an outcome cannot occur in a certain state of the world, then the robot is left in the last state before the impossible outcome. In each state, the robot is able to perceive in each of three nominal directions (front, left, and right) whether there is a doorway, wall, open, or it is undetermined. The two sets of observation models shown in Figure 7 were used in our experiments. 13.2 Complexity of Solving the POMDPs One of the POMDPs has 280 possible states while the other has 200. They both have 64 possible observations and 4 possible actions. Since the largest POMDPs that researchers have been able to solve exactly so far have less than 20 states and 15 observations, it is safe to say no existing exact algorithms can solve those two POMDPs. We were able to solve the radius-0 and radius-1 approximations (region observable POMDPs) of the two POMDPs on a SUN SPARC20 computer. The threshold for the Bellman residual was set at 0.001 and the discount factor at 0.99. The amounts of time it took in CPU seconds are collected in the following table. 222 Model Approximation for Planning under Uncertainty N Goal(east) Environment A N Goal(south) Enviroment B Transition Probabilities Action Standard outcomes Noisy outcomes move-forward N (0.11), F (0.88), F-F (0.01) N (0.2), F (0.7), F-F (0.1) turn-left N (0.05), L (0.9), L-L (0.05) N (0.15), L (0.7), L-L (0.15) turn-right N (0.05), R (0.9), R-R (0.05) N (0.15), R (0.7), R-R (0.15) declare-goal N (1.0) N (1.0) Observation Probabilities Actual case Standard observations Noisy observations wall wall (0.90), open (0.04), doorway (0.04), undetermined (0.02) wall (0.70), open (0.19), doorway (0.09), undetermined (0.02) open wall (0.02), open (0.90), doorway (0.06), undetermined (0.02) wall (0.19), open (0.70), doorway (0.09), undetermined (0.02) doorway wall (0.15), open (0.15), doorway (0.69), undetermined (0.01) wall (0.15), open (0.15), doorway (0.69), undetermined (0.01) Figure 7: Synthetic Office Environments. Environment Standard models Noisy models Radius-0 Radius-1 Radius-0 Radius-1 A 1.26 3373 1.35 5984 B 0.61 2437 0.72 3952 We see that the radius-1 approximations took much longer time to solve than the radius-0 approximations. Also notice that the region observable POMDPs with noisy action and observation models took more time to solve that those with the standard models. This suggests that the more nondeterministic the actions and the less informative the observations, the more difficult it is to solve a POMDP. 223 Zhang & Liu We were unable to solve the radius-2 approximations. Other approximation techniques need to be incorporated in order to solve the approximations based on region systems with radius larger than or equal to 2. 13.3 Approximation Quality for Standard Models To determine the quality of the radius-0 and radius-1 approximate policies for the POMDPs with standard action and observation models, 1000 simulation trials were conducted using the scheme described in Section 11. It was assumed that the agent is equally likely to start from any state. Average rewards obtained in the original POMDPs M (i.e. without the help of the oracle) and in the corresponding region-observable POMDPs M0 (i.e. with the help of the oracle) are shown in the following table. Environment A Environment B radius-0 radius-1 radius-0 radius-1 Average reward in M 0.806535 0.815695 0.866118 0.868533 Average reward in M0 0.827788 0.818534 0.883271 0.876356 Difference 0.021253 0.002839 0.017153 0.007823 We see that, when the radius-0 policies were used, the differences between the rewards obtained in M and those obtained in M0 are very small in both environments. This indicates that the radius-0 region observable POMDPs (i.e. MDPs) are accurate approximations of the original POMDPs. The radius-0 approximate policies are close to optimal for the original POMDPs. When the radius-1 policies were used, the differences are even smaller; the rewards obtained in M and those obtained in M0 are essentially the same. Consider the rewards obtained in the original POMDPs. We see that they are larger when radius-1 policies were used than when radius-0 policies were used. This supports our claim that approximation quality increases with radius of region system. There is a another fact worth mentioning. The differences between rewards obtained in M and those obtained in M0 are larger in Environment B than in Environment A. This is because Environment B is more symmetric and consequently observations are less effective in disambiguating uncertainty in the agent's belief about the state of the world. 13.4 Approximation Quality for Noisy Models One thousand trials were also conducted for the POMDPs with noisy action and observation models. Results are shown in the following table. Environment A Environment B radius-0 radius-1 radius-0 radius-1 Average reward in M 0.596670 0.634934 0.445653 0.565099 Average reward in M0 0.812898 0.722441 0.871903 0.818365 Difference 0.214228 0.087507 0.426250 0.253266 224 Model Approximation for Planning under Uncertainty We see that the differences between rewards obtained in M and rewards obtained in M0 are significantly smaller when the radius-1 policies were used than when the radius-0 policies were used. This is the case especially in Environment A. Also the rewards obtained in M are larger when the radius-1 policies were used than when the radius-0 policies were used. Those again support our claim that approximation quality increases with radius of region system. As far as absolute approximation quality is concerned, the radius-0 POMDPs (i.e. MDPs) are obviously very poor approximations of the original POMDPs; when the radius-0 policies were used, the rewards obtained in M are significantly smaller than the rewards obtained in M0. For Environment A, the radius-1 approximation is fairly accurate. However, the radius-1 approximation remains poor for Environment B. The radius of region system needs to be increased. Tracing through the trials step by step, we observed some interesting facts. In Environment B, the agent, under the guidance of the radius-1 approximate policy, was able to quickly get to the neighborhood of the goal even when starting from far way. The fact that the Environment around the goal is highly symmetric was the cause of the poor performance. Often the agent was not able to determine whether it was at the goal location (room), or in the opposite room, or in the left most room, or in the room to the right of the goal location. The performance would be close to optimal if the goal location had some distinct features. In Environment A, the agent, again under the guidance of the radius-1 approximate policy, was able to reach and declare the goal successfully once it got to the neighborhood. However, it often took many unnecessarily steps before reaching the neighborhood due to uncertainty in the effects of the turning actions. For example, when the agent reached the lower left corner from above, it was facing downward. The agent executed the action turn.left. Fifteen percent of the time, it ended up facing upward instead of to the right. The agent then decided to move-forward, thinking that it was approaching the goal. But it was actually moving upward and did not realize this until a few steps later. The agent would perform much better if there were informative landmarks around the corners. 14. Conclusions We propose to approximate a POMDP by using a region observable POMDP. The region observable POMDP has more informative observations and hence is easier to solve. A method for determining approximation quality is described, which allows one to make the tradeoff between approximation quality and computational time by starting with a coarse approximation and refining it gradually. Simulation experiments have shown that when there is not much uncertainty in the effects of actions and observations are informative, a POMDP can be accurately approximated by a region observable POMDP that can be solved exactly. However, this becomes infeasible as the degree of uncertainty increases. Other approximate methods need to be incorporated in order to solve region observable POMDPs whose radiuses are not small. 225 Zhang & Liu Acknowledgements The paper has benefited from discussions with Anthony R. Cassandra and Michael Littman. We also thank the associate editor Thomas L. Dean and the three anonymous reviewers for their insightful comments and suggestions and pointers to references. Research was supported by Hong Kong Research Council under grants HKUST 658/95E and Hong Kong University of Science and Technology under grant DAG96/97.EG01(RI). Appendix A: Proofs of Propositions 2 and 3 Lemma 6 Suppose W and X are two sets of state space functions. If W covers X , then for any non-negative function f (s), maxff2W X s ff(s)f (s) = maxfi2X X s fi(s)f (s):2 Proof of Proposition 2: Because of Proposition 1 and Lemma 2, it suffices to show that Q0a;o+ covers Qa;o+. By the definition of Qa;o+ and Equation (10), we have, for any belief state b, that Qa;o+(b) = maxff2VtGamma 1 X s [fl X s+ ff(s+)P (s+; o+js; a)]b(s) = fl maxff2VtGamma 1 X s+ ff(s+)[X s b(s)P (s+; o+js; a)] Since ^VtGamma 1 covers VtGamma 1 and the term within the square brackets is a non-negative function of s+, by Lemma 6 we have Qa;o+(b) = fl maxff2 ^VtGamma 1 X s+ ff(s+)[X s b(s)P (s+; o+js; a)] = maxff2 ^VtGamma 1 X s [fl X s+ ff(s+)P (s+; o+js; a)]b(s) = Q0a;o+(b); where the last equation is due to the definition of Q0a;o+ and Equation (10). So, Q0a;o+ does cover Qa;o+. The proposition is proved. 2 Lemma 7 For any observation z+=(o+; R+) of the region observable POMDP M0, P (s+; z+js; a) = 0; for any s+ =2R+. 2 Informally, this lemma says that the true state of the world must be in the region reported by the oracle. Lemma 8 Let W and X be two sets of state space functions and R be a region. If W covers X in region R, then for non-negative function f (s) that is 0 when s =2R, we have maxff2W X s ff(s)f (s) = maxfi2X X s fi(s)f (s):2 226 Model Approximation for Planning under Uncertainty Proof of Proposition 3: Because of Proposition 1 and Lemma 4, it suffices to show that Qa;z+;R covers Qa;z+ in region R, where Qa;z+ is given by (11) with o+ and VtGamma 1 replaced by z+ and UtGamma 1. Let b be any belief state in BR. Similar to the proof of Theorem 2, we have Qa;z+(b) = fl maxff2UtGamma 1 X s+ ff(s+)[X s b(s)P (s+; z+js; a)] = fl maxff2 ^UtGamma 1;R+ Xs + ff(s+)[X s b(s)P (s+; z+js; a)] = maxff2 ^UtGamma 1;R+ Xs2R[fl Xs + ff(s+)P (s+; z+js; a)]b(s) = maxfi2Qa;z +;R Xs fi(s)b(s) = Qa;z+;R(b); where the second equation is true because of the fact that ^UtGamma 1;R+ covers UtGamma 1 in region R+ and of Lemma 8. The term within the square brackets is a non-negative function of s+ and it is 0 when s+ =2R+ because of Lemma 7. The fourth equation is true because that b(s)=0 when s =2R. The proposition is proved. 2 Appendix B: Domination and Pruning This appendix describes implementation of the procedures dominate(ff; W; R; ffl), purge(W; R), and purge(W). They were not given in the main text because they are minor adaptations of existing algorithms. The procedure dominate(ff; W; R; ffl) takes, as inputs, a state space function ff, a set of state space functions W, a region R, and a nonnegative number ffl. It returns a belief state b in BR such that ff(b)?W(b)+ffl. If such a belief state does not exist, it returns nil. It can be implemented as follows. Procedure dominate(ff; W; R; ffl) ffl Inputs: ff -- A state space function, W -- A set of state space functions, R -- A region, ffl -- A nonnegative number. ffl Output: A belief state in BR or nil. 1. If W=;, return an arbitrary belief state in BR. 2. Solve the following linear program: Variables: x, b(s) for each s2R. Maximize: x Constraints: X s2R ff(s)b(s) * x + X s2R fi(s)b(s) for all fi2W;X s2R b(s) = 1 b(s) * 0 for all s2R: 227 Zhang & Liu 3. If x^ffl, return nil, else return b. The procedure purge(W; R) takes a set of state space functions W and a region R and returns a set of state space functions that parsimoniously covers W in region R. To implement it, we need two subroutines. A state space function ff pointwise dominates another state space function fi in a region R if ff(s)*fi(s) for all s2R. The subroutine pointwisePurge(W; R) returns a minimal subset W0 of W such that each state space function in W is pointwise dominated in the region R by at least one state space function in W0. Implementation of this subroutine is straightforward. The subroutine best(b; W; R) returns a state space function ff in W such that Ps2R b(s)ff(s) *P s2R b(s)fi(s) for any other state space function fi in W. Implementation of the subroutineis straightforward except for the issue of tie breaking. If the ties are not broken properly, purge(W; R) might return a regional covering of W that is not parsimonious. A correct way to break ties is as follows: Fix an ordering among states in R. This induces a lexicographic ordering among all state space functions. Among the tied state space functions, chose the one that is the largest under the lexicographic ordering (Littman, 1994). The following implementation of purge is based on Lark's algorithm (White, 1991). Procedure purge(W; R) ffl Inputs: W -- A set of state space functions, R -- A region. ffl Output: A set of state space functions that parsimoniously covers W in region R. 1. WpointwisePurge(W; R). 2. X ;. 3. While W6=;, (a) Pick a state space function ff from W. (b) bdominate(ff; X ; R; 0). (c) If b =nil, remove ff from W. (d) Else remove best(b; W; R) from W and add it to X . 4. Return X . Finally, the procedure purge(W) takes a set of state space functions W and returns a parsimonious covering of W. It can be implemented simply as follows. Procedure purge(W): ffl purge(W; S). Here, S is the set of all possible states of the world. 228 Model Approximation for Planning under Uncertainty References Bellman, R. (1957). Dynamic Programming. Princeton University Press. Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall. Bertsekas, D. P., & Castanon, D. C. (1989). Adaptive Aggregation for Infinite Horizon Dynamic Programming. IEEE trans. on auto. control, vol 34, No 6. Boutilier, C., Dearden, R., & Goldszmidt, M. (1995). Exploiting structures in policy construction. In Proceedings of IJCAI-95, 1104-1111. Boutilier, C., & Poole, D. (1996). Computing optimal policies for partially observable decision processes using compact representations. In Proceedings of AAAI-96, 1168- 1175. Brafman, R. I. (1997). A heuristic variable grid solution method for POMDPs. In Proceedings of AAAI-97, 727-733. Cassandra, A. R. (1994). Optimal polices for partially observable Markov decision processes. TR CS-94-14, Department of Computer Science, Brown University, Providence, Rhode Island 02912, USA. Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In Proceedings of AAAI-94, 1023-1028. Cassandra, A. R., Kaelbling, L. P., & Kurien, J. (1996). Acting under uncertainty: Discrete Bayesian models for mobile-robot navigation. In Proceedings of IEEE/Robotics Society of Japan Conference on Intelligent Robotics and Systems (IROS-96). Cassandra, A. R., Littman, M. L., & Zhang, N. L. (1997). Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of Thirteenth Conference on Uncertainty in Artificial Intelligence, 54-61. Cheng, H. T. (1988). Algorithms for partially observable Markov decision processes. PhD thesis, University of British Columbia, Vancouver, BC, Canada. Dean, T. L., Givan, R., & Leach, S. (1997). Model reduction techniques for computing approximately optimal solution for Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, 124-131. Dean, T. L., Kaelbling, L. P., Kirman, J., & Nicholson A. (1993). Planning with deadlines in stochastic domains. In Proceedings of AAAI-93, 574-579. Dean T. L., & Lin, S. H. (1995). Decomposition techniques for planning in stochastic domains. TR CS-95-10, Department of Computer Science, Brown University, Providence, Rhode Island 02912, USA. Dean, T. L., & Wellman, M. P. (1991). Planning and Control. Morgan Kaufmann. 229 Zhang & Liu Eagle, J. N. (1984). The optimal search for a moving target when the search path is constrained. Operations Research, 32(5), 1107-1115. Hauskrecht, M. (1997). Incremental methods for computing bounds in partially observable Markov decision processes. In Proceedings of AAAI-97, 734-739. Littman, M. L. (1994). The witness algorithm: Solving partially observable Markov decision processes. TR CS-94-40, Department of Computer Science, Brown University, Providence, Rhode Island 02912, USA. Littman, M. L., Cassandra, A. R., & Kaelbling, L. P. (1995). Efficient dynamic-programming updates in partially observable Markov decision processes. TR CS-95-19, Department of Computer Science, Brown University, Providence, Rhode Island 02912, USA. Lovejoy, W. S. (1991a). A survey of algorithmic methods for solving partially observable Markov decision processes. Annals of Operations Research, 28 (1), 47-65. Lovejoy, W. S. (1991b). Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39 (1), 162-175. Monahan, G. E. (1982). A survey of partially observable Markov decision processes: theory, models, and algorithms. Management Science, 28 (1), 1-16. Parr, R., & Russell, S. (1995). Approximating optimal polices for partially observable stochastic domains. In Proceedings of IJCAI-95, 1088-1094. Platzman, L. K. (1977). Finite-memory estimation and control of finite probabilistic systems. Ph.D. Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. Puterman, M. L. (1990). Markov decision processes. In D. P. Heyman and M. J. Sobel (eds.), Handbooks in OR & MS., Elsevier Science Publishers, Vol. 2, 331-434. Sondik, E. J. (1971). The optimal control of partially observable Markov processes. PhD thesis, Stanford University, Stanford, California, USA. Sondik, E. J., & Mendelssohn, R. (1979). Information seeking in Markov decision processes, Southwest Fisheries Center Administrative Report H-79-13, National Marine Fisheries Service, Honolulu, Hawaii. White III, C. C. (1991). Partially observed Markov decision processes: A survey. Annals of Operations Research, 32. White, D. J. (1993). Markov Decision Processes. John Wiley & Sons. White III, C. C., & Scherer, W. T., (1994). Finite-memory suboptimal design for partially observed Markov decision processes. Operations Research, 42(3), 440-455. 230