
NEUROCONTROL BY REINFORCEMENT LEARNING

G. Schram*, B.J.A. Kröse**, R. Babuška*, A.J. Krijgsman*

* Department of Electrical Engineering, Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands. Email: g.schram@et.tudelft.nl
** Department of Computer Systems, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands. Email: krose@fwi.uva.nl

Abstract. Reinforcement learning (RL) is a model-free tuning and adaptation method for control of dynamic systems. Contrary to supervised learning, which is usually based on gradient-descent techniques, RL does not require any model or sensitivity function of the process. Hence, RL can be applied to systems that are poorly understood, uncertain, nonlinear, or for other reasons intractable with conventional methods. In reinforcement learning, the overall controller performance is evaluated by a scalar measure, called reinforcement. Depending on the type of control task, the reinforcement may represent an evaluation of the most recent control action or, more often, of an entire sequence of past control moves. In the latter case, the RL system learns to predict the outcome of each individual control action. This prediction is then used to adjust the parameters of the controller. The mathematical background of RL is closely related to optimal control and dynamic programming. This paper gives a comprehensive overview of RL methods and presents an application to the attitude control of a satellite. Some well-known applications from the literature are reviewed as well.

1. INTRODUCTION

In order to efficiently control complex processes, new methods are being sought which can cope with nonlinear, time-varying and uncertain behaviour. Methods based on neural networks and fuzzy logic systems have proved to be suitable for designing controllers for such systems (Sofge and White, 1992; Special Issue Journal A, 1995). The main problem in nonlinear control remains the tuning and adaptation of the controller. For this purpose, a model of the process is usually developed. Then, the controller is adjusted such that it approximates the inverse relation between the (desired) process outputs and the control actions. However, modeling a complex process can be a difficult and time-consuming task. Moreover, modeling errors and time-varying parameters of the process can degrade the controller performance considerably. Therefore, techniques for tuning the controller without an explicit model of the process are desirable.

An interesting approach to tuning nonlinear controllers without a model is reinforcement learning (RL) (Barto et al., 1983). The idea of RL originates from human and animal learning, based on repeated trials followed by a reward or punishment. Each trial may consist of a sequence of actions, while the feedback (reinforcement), related to the achieved performance, is received only at the end. In such a case, no direct evaluation of the individual control actions is available and the RL system must learn to predict the goodness of each control action in the sequence. This prediction is then used to tune the parameters of the controller.
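To make this mechanism concrete, the sketch below shows one common way such prediction-based tuning can be implemented: a critic learns to predict future reinforcement through a temporal-difference error, and the same error is used to adjust a simple state-feedback controller. The plant model (plant_step), reward, control law and all numerical values are hypothetical choices made for this illustration only; the paper does not prescribe them.

import numpy as np

# A minimal sketch of learning from delayed reinforcement (actor-critic style).
# The plant, reward and all constants below are assumptions for illustration.

np.random.seed(0)

def plant_step(x, u):
    # hypothetical integrator-like process
    return x + 0.5 * u

w_c = 0.0                        # critic weight: predicted return V(x) = w_c * x^2
w_a = 0.0                        # controller weight: control law u = w_a * x
alpha_c, alpha_a = 0.05, 0.01    # learning rates for critic and controller
gamma = 0.95                     # discount factor for future reinforcement
sigma = 0.3                      # exploration noise level

for trial in range(200):
    x = 1.0                                       # start each trial away from the setpoint (0)
    for t in range(20):
        u = w_a * x + sigma * np.random.randn()   # explore around the current control law
        x_next = plant_step(x, u)
        r = -x_next ** 2                          # reinforcement: penalise deviation from zero
        # temporal-difference error: did this action turn out better than predicted?
        delta = r + gamma * w_c * x_next ** 2 - w_c * x ** 2
        w_c += alpha_c * delta * x ** 2             # improve the prediction of future reinforcement
        w_a += alpha_a * delta * (u - w_a * x) * x  # reinforce better-than-predicted actions
        x = x_next

# the feedback gain should drift towards a negative (stabilising) value
print("learned feedback gain:", w_a)

A positive temporal-difference error indicates that the explored action turned out better than predicted, so the controller parameters are moved towards that action; this is how the controller can be tuned without an explicit model of the process.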

RL has been applied in the field of robot learning, enabling robots to improve their performance through interaction with the environment (Kröse, 1995). An illustrative example is an RL controller for a robot manipulator in the peg-in-hole insertion task (Gullapalli, 1995). In this application, the performance is evaluated on the basis of the position and the forces acting from the environment on the peg. Another example is an RL controller for a walking machine (Ilg and Berns, 1995), where the changes in the positions of the legs are used as the performance measure. Successful applications have also been reported in other fields. A manufacturing process for thermoplastic composite structures was controlled by an RL system (Sofge and White, 1992), capable of on-line tuning. Another interesting application is the control of a set of elevators (Crites and Barto, 1995), where the expected waiting time of all waiting passengers serves as a cost function.

The remainder of this paper is organised as follows. Section 2 presents the basic concepts of RL and describes the learning algorithms as well as some implementation aspects. Section 3 briefly discusses the