
## Issues

What types of problems are appropriate for RL? In a recent paper, Crites and Barto [3] applied RL to the elevator problem defined above and showed that it produces significantly better performance than the best solutions previously available. RL has also been successfully applied in many other areas, including process control, scheduling, resource allocation, queuing, and adaptive games. For example, we have applied RL to the problem of channel assignment in cellular telephone systems and shown that it yields better performance than previously available solutions (see Singh and Bertsekas [5]). Zhang and Dietterich [7] have shown similar results in a job-shop scheduling problem, and Tesauro [6] has used RL to develop the world's best computer backgammon player, which is nearly as good as the human champion. Numerous other applications are being presented at this year's Machine Learning and Neural Information Processing Systems (NIPS) conferences.

When is convergence to an optimal policy guaranteed? Technically, RL will converge for all finite, stationary, Markov environments. What does that mean? First (finite), it may not work if there are infinitely many states, because it cannot explore them all. Second (stationary), it may not work if the environment is constantly changing, although in practice small, slow changes in the environment are accommodated well. Finally (Markov), the mathematics breaks down if the result of an action depends on states other than the current state. These limitations affect only the guarantee of convergence; there are many instances of RL working well despite violating these conditions. These limitations are actively being addressed in current research.

Is it sensible to treat all preferences as numeric rewards on a single scale? Theoretically, yes. There is a theorem (North [4]) that if you believe four fairly simple axioms about preferences, then you can derive the existence of a real-valued utility function. (The only mildly controversial axiom is substitutability: that if you prefer A to B, then you must prefer a coin flip between A and C to a coin flip between B and C.) Practically, it depends. Users often find it hard to articulate their preferences as numbers. (Example: you have to design the controller for a nuclear power plant. How many dollars is a human life worth?)

How should the program store the utility function? The utility function maps state-action pairs to real numbers. If the size of the state-action space is small enough, this function can be stored in the form of a table; otherwise, some form of compact function approximation is used. Statisticians have a variety of representations for this purpose: decision trees, polynomial approximations, neural networks, etc. Using such a representation not only saves space, it also gives us generalization: the program can take a sensible action from a state it has never seen before by interpolating or extrapolating from known states.
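The contrast between a table and a compact approximation can be sketched as follows. This is a minimal illustration, not the representation used in any of the cited work: the `features` function, the sample utilities, and the learning rate are all made-up assumptions, and a linear model stands in for the richer approximators named above.

```python
from collections import defaultdict

# Tabular storage: one entry per (state, action) pair; unseen pairs default to 0.
q_table = defaultdict(float)
q_table[(3, "up")] = 1.5


def features(state, action):
    """Hypothetical feature vector: bias, numeric state, and an action flag."""
    return [1.0, float(state), 1.0 if action == "up" else 0.0]


def q_linear(weights, state, action):
    """Compact storage: utility is a weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features(state, action)))


def sgd_step(weights, state, action, target, lr=0.05):
    """Move the weights a small step toward a target utility."""
    error = target - q_linear(weights, state, action)
    return [w + lr * error * f for w, f in zip(weights, features(state, action))]


weights = [0.0, 0.0, 0.0]
# Made-up training utilities that happen to follow utility = state + 1.
samples = [(0, "up", 1.0), (2, "up", 3.0), (4, "up", 5.0)]
for _ in range(2000):
    for s, a, target in samples:
        weights = sgd_step(weights, s, a, target)

# State 3 never appeared in the samples, yet the approximator interpolates
# an estimate for it (it should land near 4.0 for this toy data).
estimate = q_linear(weights, 3, "up")
```

The table needs one entry per state-action pair and says nothing about pairs it has not seen, whereas the three weights cover every state at once, which is the generalization described above.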

How much information should be kept as state? In some situations, there may be too many sensors, some of them giving redundant information. Too much information increases the size of the state space needlessly, and makes learning the utility function slower. Progress is being made in techniques to detect and remove redundancy. In other situations, there may be too few sensors. Too little information can prevent learning from making much progress. Memory can be used to augment the limited information in these situations.
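One simple way memory can augment limited sensor information is to treat the last few observations, rather than only the latest one, as the state. The sketch below assumes a single position sensor that cannot report direction of travel; the `HistoryState` class and the observation names are hypothetical.

```python
from collections import deque


class HistoryState:
    """Builds a state from the last k observations (hypothetical helper)."""

    def __init__(self, k, pad=None):
        # Start with k placeholder entries so the state always has length k.
        self.buffer = deque([pad] * k, maxlen=k)

    def update(self, observation):
        self.buffer.append(observation)
        # A tuple is hashable, so it can serve as a key in a utility table.
        return tuple(self.buffer)


tracker = HistoryState(k=2)
going_up = [tracker.update(obs) for obs in ["floor1", "floor2"]][-1]

tracker = HistoryState(k=2)
going_down = [tracker.update(obs) for obs in ["floor3", "floor2"]][-1]
```

Both runs end with the same raw observation, `"floor2"`, but the augmented states differ, so the learner can tell upward travel from downward travel. The cost is a larger state space, which is the trade-off discussed above.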

What if you don't have a simulation of the elevator problem available? You could use state trajectories obtained by controlling and experimenting with the real elevator system to train the RL solution. In this case, methods for optimal experiment design (in choosing what actions to take) are important, because real-life elevators run so much slower than simulations, and because we don't want to alienate passengers while the system is being trained.
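Training from recorded trajectories can be sketched as replaying logged transitions through a learning update. The example below uses the standard tabular Q-learning update, whose notation may differ from the equations earlier in this document. The log format `(state, action, reward, next_state)`, the toy states, and all parameter values are assumptions for illustration. It does not address experiment design, that is, how the logged actions were chosen.

```python
from collections import defaultdict


def q_learning_from_log(transitions, actions, alpha=0.1, gamma=0.9, sweeps=200):
    """Replay logged (state, action, reward, next_state) tuples repeatedly.

    A next_state of None marks the end of an episode.
    """
    q = defaultdict(float)
    for _ in range(sweeps):
        for state, action, reward, next_state in transitions:
            if next_state is None:
                best_next = 0.0
            else:
                best_next = max(q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
    return q


# Made-up log: waiting earns nothing, going earns 1 and ends the episode.
log = [
    ("idle", "wait", 0.0, "idle"),
    ("idle", "go", 1.0, None),
]
q = q_learning_from_log(log, actions=["wait", "go"])
```

Because each real transition is slow and costly to collect, replaying the same log many times extracts more from each one. The estimates can only be as good as the coverage of the log.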

How should agents communicate and interact? We designed the elevator solution as a single agent that runs all three elevators. But consider the problem of controlling cars on a freeway. A central controlling agent would be unmanageable; we would need to use many interacting agents instead. Game theory and economic market theory can be used to extend the RL algorithm to handle these multi-agent situations.

So RL says I should always take the action with the highest estimated utility, right? Actually, no. If you know the true utility function, you should always take the action that maximizes it. But the estimated utility may be wrong. Taking non-optimal actions may get you "off the beaten track" enough to learn something new, thereby updating the estimate, and enabling much better actions in the future. So there is always a trade-off between exploitation of the best known action and exploration of the consequences of other actions.
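One common and simple way to balance the two is an epsilon-greedy rule: with a small probability, pick an action at random; otherwise pick the action with the highest estimated utility. This is only one of several exploration strategies and is not necessarily the one used in the work cited here. The function name, the action names, and the value of `epsilon` below are illustrative assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Choose an action from a dict mapping action -> estimated utility."""
    if rng.random() < epsilon:
        # Explore: any action, including ones that currently look poor.
        return rng.choice(list(q_values))
    # Exploit: the action with the highest estimated utility.
    return max(q_values, key=q_values.get)


estimates = {"up": 0.2, "down": 0.7, "stay": 0.1}
rng = random.Random(0)  # fixed seed so the run is repeatable
picks = [epsilon_greedy(estimates, 0.1, rng) for _ in range(1000)]
```

With `epsilon=0.1`, most picks are `"down"`, but the other actions are still tried occasionally, so a wrong estimate for `"up"` or `"stay"` has a chance to be corrected. Setting `epsilon=0` recovers pure exploitation.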
