Satinder Singh's Thesis

  • Singh S. (1994), Learning to Solve Markovian Decision Processes. Ph.D. Thesis.

    ABSTRACT: This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to autonomously improve the efficiency with which they achieve goals. Machine learning researchers have developed reinforcement learning (RL) algorithms based on dynamic programming (DP) that use the agent's experience in its environment to improve its decision policy incrementally. This is achieved by adapting an evaluation function in such a way that the decision policy that is "greedy" with respect to it improves with experience (a minimal sketch of this incremental update scheme appears after the abstract). This dissertation focuses on finite, stationary, and Markovian environments for two reasons: such environments allow the development and use of a strong theory of RL, and many challenging real-world RL tasks fall into this category.

    This dissertation establishes a novel connection between stochastic approximation theory and RL that provides a uniform framework for understanding all of the different RL algorithms proposed to date. It also highlights a dimension that clearly separates all RL research from prior work on DP. Two other theoretical results, showing how approximations affect performance in RL, provide partial justification for the use of compact function approximators. In addition, a new family of "soft" DP algorithms is presented; these algorithms converge to solutions that are more robust than those found by classical DP algorithms (a generic softened backup is sketched after the abstract).

    Despite all of this theoretical progress, conventional RL architectures scale too poorly to be practical for many real-world problems. This dissertation studies two aspects of the scaling issue: the need to accelerate RL, and the need to build RL architectures that can learn to solve multiple tasks. It presents three RL architectures, CQ-L, H-DYNA, and BB-RL, that accelerate learning by facilitating transfer of training from simple to complex tasks. Each architecture uses a different method to achieve transfer of training: CQ-L uses the evaluation functions for simple tasks as building blocks to construct the evaluation function for complex tasks; H-DYNA uses the evaluation functions for simple tasks to build an abstract environment model; and BB-RL uses the decision policies found for the simple tasks as the primitive actions for the complex tasks (the final sketch after the abstract illustrates this last idea). A mixture of theoretical and empirical results is presented to support the new RL architectures developed in this dissertation.
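
As a rough illustration of the incremental, DP-based update scheme described in the first paragraph of the abstract, the sketch below shows tabular Q-learning on a generic finite MDP: the evaluation (Q) function is adapted from experience so that the policy greedy with respect to it improves. The env interface (reset, step, actions) and all parameter values are assumptions made for this example; the thesis covers a broader family of algorithms than this one.

```python
# Sketch only: adapt an evaluation (Q) function from experience so that the
# policy that is greedy w.r.t. it improves. Tabular Q-learning as illustration.
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    # Assumed interface: env.reset() -> state, env.step(a) -> (next_state, reward, done),
    # env.actions is a finite list of actions. These names are illustrative only.
    Q = defaultdict(float)  # evaluation function over (state, action) pairs

    def greedy(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: mostly follow the policy that is greedy w.r.t. Q
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # incremental DP-style backup toward the one-step lookahead target
            target = r + (0.0 if done else gamma * max(Q[(s_next, b)] for b in env.actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q, greedy
```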
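
The "soft" DP idea mentioned in the second paragraph can be conveyed by replacing the hard max in the value-iteration backup with a temperature-controlled log-sum-exp. This is a generic softening shown only for flavor; it is not claimed to be the exact operator developed in the thesis, and the array shapes and parameter names are assumptions for the sketch.

```python
# Generic "soft" DP backup: replace the hard max over actions with a
# temperature-controlled log-sum-exp, which tends to the hard max as tau -> 0.
import numpy as np

def soft_value_iteration(P, R, gamma=0.95, tau=1.0, iters=1000, tol=1e-8):
    # Assumed shapes: P[a, s, s'] is the transition probability, R[s, a] the reward.
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        # one-step lookahead: Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        # soft backup: tau * log(sum_a exp(Q[s, a] / tau)), computed stably
        m = Q.max(axis=1)
        V_new = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```

As tau approaches zero this recovers the standard hard-max backup; larger tau smooths the backup across actions.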
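
Finally, a minimal sketch of one of the transfer mechanisms described in the last paragraph, in the spirit of BB-RL: a higher-level learner whose action set is the set of policies learned on simpler tasks. The interfaces and hyperparameters below are assumptions made for illustration, not the architecture from the thesis.

```python
# Sketch: reuse simple-task policies as the "primitive actions" of a
# higher-level learner on a complex task. Assumed interfaces throughout.
import random
from collections import defaultdict

def learn_over_subpolicies(env, sub_policies, num_episodes=200, alpha=0.1,
                           gamma=0.95, epsilon=0.1, max_sub_steps=10):
    # sub_policies: list of functions state -> primitive action, each learned
    # earlier on a simpler task; env follows the reset()/step() interface above.
    Q = defaultdict(float)  # evaluation function over (state, sub-policy index)
    choices = list(range(len(sub_policies)))

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            i = (random.choice(choices) if random.random() < epsilon
                 else max(choices, key=lambda j: Q[(s, j)]))
            # run the chosen sub-policy for a bounded number of primitive steps,
            # accumulating discounted reward along the way
            ret, discount, s_i = 0.0, 1.0, s
            for _ in range(max_sub_steps):
                s_i, r, done = env.step(sub_policies[i](s_i))
                ret += discount * r
                discount *= gamma
                if done:
                    break
            # Q-learning-style backup over the temporally extended "action" i
            target = ret + (0.0 if done else discount * max(Q[(s_i, j)] for j in choices))
            Q[(s, i)] += alpha * (target - Q[(s, i)])
            s = s_i
    return Q
```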