Papers Arranged by Topic

Click here to see papers arranged in reverse chronological order

Refereed Publications

Game Theory/Auctions

  • ATTac-2000: An Adaptive Autonomous Bidding Agent. With P. Stone, M. Littman, and M. Kearns. In Journal of Artificial Intelligence Research, Vol 15, pages 189-206, 2001.
    gzipped postscript pdf.

  • Graphical Models for Game Theory. With M. Kearns and M. Littman. In UAI 2001.
    gzipped postscript pdf.

  • An Efficient Exact Algorithm for Singly Connected Graphical Games. With M. Littman and M. Kearns. To appear in NIPS 2001.
    gzipped postscript pdf.

  • FAucS: An FCC Spectrum Auction Simulator for Autonomous Bidding Agents. With J.A. Csirik, M. Littman, and P. Stone. In WELCOM 2001 (Second International Workshop on Electronic Commerce).
    gzipped postscript pdf.

  • ATTac-2000: An Adaptive Autonomous Bidding Agent. With P. Stone, M. Littman, and M. Kearns. In AGENTS 2001.
    gzipped postscript.

  • Nash Convergence of Gradient Dynamics in General-Sum Games. With M. Kearns and Y. Mansour. In UAI 2000.
    gzipped postscript.

  • Fast Planning in Stochastic Games. With M. Kearns and Y. Mansour. In UAI 2000.
    gzipped postscript.

Human-Computer Interaction

    Spoken Dialogue Systems

  • Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System. With D. Litman, M. Kearns and M. Walker. In Journal of Artificial Intelligence Research (JAIR), Volume 16, pages 105-133, 2002.
    gzipped postscript pdf.

  • CobotDS: A Spoken Dialogue System for Chat. With M. Kearns, C. Isbell, D. Litman, and J. Howe. Submitted.
    gzipped postscript pdf.

  • Empirical Evaluation of a Reinforcement Learning Spoken Dialogue System. With M. Kearns, D. Litman, and M. Walker. In AAAI 2000.
    gzipped postscript.

  • Automatic Optimization of Dialogue Management. With D. Litman, M. Kearns and M. Walker. In COLING 2000.
    gzipped postscript.

  • Reinforcement Learning for Spoken Dialogue Systems. With M. Kearns, D. Litman and M. Walker. In NIPS 1999.
    gzipped postscript.

    Topic Spotting

  • A Boosting Approach to Topic Spotting on Subdialogues. With K. Myers, M. Kearns, and M. Walker. In ICML 2000.
    gzipped postscript.

    LambdaMOO Agents

  • Cobot: A Social Reinforcement Learning Agent. With C. Isbell, C. Shelton, M. Kearns, and P. Stone. To appear in NIPS 2001.
    gzipped postscript pdf.

  • A Social Reinforcement Learning Agent. With C. Isbell, C. Shelton, M. Kearns, and P. Stone. In Agents 2001. Winner of Best Paper Award.
    gzipped postscript.

  • Cobot in LambdaMOO: A Social Statistics Agent. With C. Isbell, M. Kearns, D. Korman, and P. Stone. In AAAI 2000.
    gzipped postscript.

  • Cobot in LambdaMOO: A Social Statistics Agent. With C. Isbell, M. Kearns, D. Korman, and P. Stone. In WIRE 2000 (this is an early workshop version of the AAAI paper with the same title).
    gzipped postscript.

New RL Representations

  • Learning Predictive State Representations. With M. Littman, R. Sutton, and P. Stone. Submitted.
    gzipped postscript pdf.

  • Predictive Representations of State. With M. Littman and R. Sutton. To appear in NIPS 2001.
    gzipped postscript pdf.

Convergence of Old and New Reinforcement Learning Algorithms

  • "Bias-Variance" Error Bounds for Temporal Difference Updates. With M. Kearns. In COLT 2000.
    gzipped postscript.

  • On the Complexity of Policy Iteration. With Y. Mansour. In UAI 1999.
    gzipped postscript.

  • Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms. With M. Kearns. In NIPS 1998.
    gzipped postscript.

  • Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. With T. Jaakkola, M.L. Littman, and C. Szepesvari. In Machine Learning Journal, vol 38(3), pages 287-308, 2000.
    gzipped postscript.

  • Near-Optimal Reinforcement Learning in Polynomial Time. With M. Kearns. In Machine Learning journal (shorter version appears in ICML 1998).
    gzipped postscript pdf.

  • Analytical Mean Squared Error Curves for Temporal Difference Learning. With P. Dayan. In Machine Learning Journal.
    gzipped postscript.
    A shorter version appears in the NIPS 10 Proceedings.

  • Reinforcement Learning with Replacing Eligibility Traces. With R.S. Sutton. Appears in Machine Learning journal, vol. 22, pages 123-158, 1996.
    gzipped postscript.

  • Learning Curve Bounds for Markov Decision Processes with Undiscounted Rewards. With L.K. Saul. Appears in Proceedings of 9th Annual Workshop on Computational Learning Theory, 1996.
    gzipped postscript.

  • Markov Decision Processes in Large State Spaces. With L.K. Saul. Appears in Proceedings of 8th Annual Workshop on Computational Learning Theory, pages 281-288, 1995.
    gzipped postscript.

  • On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. With T. Jaakkola and M. Jordan. In Neural Computation, vol 6, number 6, pages 1185-1201, 1994.
    gzipped postscript.

  • An Upper Bound on the Loss from Approximate Optimal-Value Functions. With R.C. Yee. In Machine Learning, vol 16, pages 227-233, 1994.
    gzipped postscript.

  • On Step-Size and Bias in Temporal-Difference Learning. With R.S. Sutton. Appears in Proceedings of Eighth Yale Workshop on Adaptive and Learning Systems, 1994.
    gzipped postscript.

  • Soft Dynamic Programming Algorithms: Convergence Proofs. In CLNL 1993.
    gzipped postscript.

New Reinforcement Learning Algorithms

  • Eligibility Traces for Off-Policy Policy Evaluation. With D. Precup and R. Sutton. In ICML 2000.
    gzipped postscript.

  • Approximate Planning for Factored POMDPs using Belief State Simplification. With D. McAllester. In UAI 1999.
    gzipped postscript.

  • Near-Optimal Reinforcement Learning in Polynomial Time. With M. Kearns. In ICML 1998.
    gzipped postscript.

  • Reinforcement Learning with Replacing Eligibility Traces. With R.S. Sutton. Appears in Machine Learning journal, vol. 22, pages 123-158, 1996.
    gzipped postscript.

  • Learning to Act using Real-Time Dynamic Programming. With A.G. Barto and S.J. Bradtke. In Artificial Intelligence, vol 72, pages 81-138, 1995.

  • Long Term Potentiation, Navigation and Dynamic Programming. With P. Dayan. In CNS 1996.
    gzipped postscript.

  • Improving Policies Without Measuring Merits. With P. Dayan. In NIPS 8, 1996.
    gzipped postscript.

  • Reinforcement Learning Algorithms for Average-Payoff Markovian Decision Processes. Appears in Proceedings of the Twelfth National Conference on Artificial Intelligence, 1994.
    gzipped postscript.

  • Soft Dynamic Programming Algorithms: Convergence Proofs. In CLNL 1993.
    gzipped postscript.

Temporal Abstraction in Reinforcement Learning

  • Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. With R. Sutton and D. Precup. In Artificial Intelligence Journal, vol 112, pp. 181-211, 1999.
    gzipped postscript.

  • Improved Switching Among Temporally Abstract Actions. With R.S. Sutton, D. Precup, and B. Ravindran. In NIPS 1998.
    gzipped postscript.

  • Intra-Option Learning about Temporally Abstract Actions. With R.S. Sutton and D. Precup. In ICML 1998.
    gzipped postscript.

  • Theoretical Results on Reinforcement Learning with Temporally Abstract Behaviors. With D. Precup and R.S. Sutton. In Machine Learning: ECML-98, Proceedings of the 10th European Conference on Machine Learning, Chemnitz, Germany, pages 382-393. Springer-Verlag.
    gzipped postscript.

  • Planning with Closed-Loop Macro Actions. With D. Precup and R.S. Sutton. In Proceedings of AAAI Fall Symposium, 1997.
    gzipped postscript.

  • Reinforcement Learning with a Hierarchy of Abstract Models. Appears in Proceedings of the Tenth National Conference on Artificial Intelligence, 1992.
    gzipped postscript.

  • Scaling Reinforcement Learning Algorithms by Learning Variable Temporal Resolution Models. Appears in Proceedings of the Ninth Machine Learning Conference, 1992.
    gzipped postscript.

  • Transfer of Learning by Composing Solutions of Elemental Sequential Tasks. In Machine Learning, vol 8, pages 323-339, 1992.
    gzipped postscript.

  • The Efficient Learning of Multiple Task Sequences. In NIPS 4, 1992.
    gzipped postscript.

  • Transfer of Learning Across Compositions of Sequential Tasks. Appears in Machine Learning: Proceedings of the Eighth International Workshop, 1991.
    gzipped postscript.

Dealing with Partial Observability and Function Approximation in RL

  • Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes. With J. Loch. In ICML 1998.
    gzipped postscript.

  • Reinforcement Learning With Soft State Aggregation. With T. Jaakkola and M. Jordan. In NIPS 7, 1995.
    gzipped postscript.

  • Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems. With T. Jaakkola and M. Jordan. In NIPS 7, 1995.
    gzipped postscript.

  • Learning Without State-Estimation in Partially Observable Markovian Decision Processes. With T. Jaakkola and M. Jordan. Appears in Machine Learning: Proceedings of the Eleventh International Conference, 1994.
    gzipped postscript.

Exploiting Structure in Reinforcement Learning Problems

  • How to Dynamically Merge Markov Decision Processes. With D. Cohn. In NIPS 11 (1999).
    gzipped postscript.

Application Papers

  • Optimizing admission control while ensuring quality of service in multimedia networks via reinforcement learning. With T.X. Brown and H. Tong. In NIPS 1998.
    gzipped postscript.

  • Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems. With D. Bertsekas. In NIPS 10, 1997.
    gzipped postscript.

  • Predicting Lifetimes in Dynamically Allocated Memory. With D. Cohn. In NIPS 10, 1997.
    gzipped postscript.

  • Robust Reinforcement Learning in Motion Planning. With A.G. Barto, R.A. Grupen, and C.I. Connolly. In NIPS 6, 1994.
    gzipped postscript (68 KBytes).

Biological Modeling

  • Distributed Representation of Limb Motor Programs in Arrays of Adjustable Pattern Generators. With N.E. Berthier, A.G. Barto, and J.C. Houk. Appears in Journal of Cognitive Neuroscience, vol 5:1, pages 56-78, 1993.

  • Reinforcement Learning with a Hierarchy of Abstract Models. Appears in Proceedings of the Tenth National Conference on Artificial Intelligence, 1992.
    gzipped postscript.

Unrefereed, Published Papers

  • Hierarchical Optimal Control of MDPs. With A. McGovern, D. Precup, B. Ravindran, and R.S. Sutton. In Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, 1998.
    gzipped postscript.

  • On Step-Size and Bias in Temporal-Difference Learning. With R.S. Sutton. Appears in Proceedings of Eighth Yale Workshop on Adaptive and Learning Systems, 1994.
    gzipped postscript.

  • Reinforcement Learning and Dynamic Programming. With A.G. Barto. Appears in Proceedings of Sixth Yale Workshop on Adaptive and Learning Systems, 1990.

  • On the Computational Economics of Reinforcement Learning. With A.G. Barto. In Proceedings of the Connectionist Models Summer School, 1990.
    gzipped postscript.

  • An Adaptive Sensorimotor Network Inspired by the Physiology of the Cerebellum. With J.C. Houk, C. Fisher, and A.G. Barto. Appears as a chapter in WT Miller, RS Sutton, and PJ Werbos, editors, Neural Networks for Control, pages 301-348, 1989.

My one paper in a non-technical journal!

  • How to Make Software Agents Do the Right Thing: An Introduction to Reinforcement Learning. With P. Norvig and D. Cohn. In Dr. Dobb's Journal, March issue, 1997.
    gzipped postscript [html version].

An Almost Tutorial on RL (extracted from my Thesis)

  • An (Almost) Tutorial on Reinforcement Learning. Extracted from my 1993 thesis.
    gzipped postscript.

Going Nowhere Papers

  • Asynchronous Modified Policy Iteration with Single-sided Updates. With V. Gullapalli. Working Paper, 1993.
    gzipped postscript.

Thesis

  • Learning to Solve Markovian Decision Processes. Department of Computer Science, University of Massachusetts, Amherst, 1993.
    Ph.D. Thesis.

Selected Abstracts

    Singh, S, Sutton, RS (1996) "Reinforcement Learning with Replacing Eligibility Traces". Appears in Machine Learning journal, vol. 22, pages 123-158, 1996.

    ABSTRACT: The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the long run. Computational results confirm these analyses and show that they are applicable more generally. In particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity on the "Mountain-Car" task, a full reinforcement-learning problem with a continuous state space, when using a feature-based function approximator.
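
    The following is a minimal illustrative sketch (written for this listing, not code from the paper; the function name td_lambda_episode and its argument format are invented for illustration) of tabular TD(lambda) prediction with the two kinds of trace compared in the abstract: the conventional accumulating trace increments the visited state's eligibility, while the replacing trace resets it to 1.

        # Minimal sketch: tabular TD(lambda) with conventional (accumulating)
        # versus replacing eligibility traces. Illustrative only.
        import numpy as np

        def td_lambda_episode(transitions, V, alpha=0.1, gamma=1.0, lam=0.9,
                              replacing=True):
            """Update the value estimates V (a float numpy array, one entry per
            state) in place from one episode. transitions is a list of
            (state, reward, next_state) tuples; next_state is None on the
            transition into the absorbing terminal state."""
            e = np.zeros_like(V)                   # one eligibility trace per state
            for s, r, s_next in transitions:
                v_next = 0.0 if s_next is None else V[s_next]
                delta = r + gamma * v_next - V[s]  # TD error for this step
                if replacing:
                    e[s] = 1.0                     # replacing trace: reset to 1
                else:
                    e[s] += 1.0                    # conventional trace: accumulate
                V += alpha * delta * e             # credit recently visited states
                e *= gamma * lam                   # decay all traces
            return V

    With replacing=False, repeated visits to a state pile up extra credit, the greater credit to repeated events noted in the abstract; with replacing=True, each state's eligibility never exceeds 1.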

    Sutton, RS, Singh, S (1994) "On Step-Size and Bias in Temporal-Difference Learning", Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pp. 91-96, Yale University, New Haven, CT. (66K)

    ABSTRACT: We present results for three new algorithms for setting the step-size parameters, alpha and lambda, of temporal-difference learning methods such as TD(lambda). The overall task is that of learning to predict the outcome of an unknown Markov chain based on repeated observations of its state trajectories. The new algorithms select step-size parameters online in such a way as to eliminate the bias normally inherent in temporal-difference methods. We compare our algorithms with conventional Monte Carlo methods. Monte Carlo methods have a natural way of setting the step size: for each state s they use a step size of 1/n_s, where n_s is the number of times state s has been visited. We seek and come close to achieving comparable step-size algorithms for TD(lambda). One new algorithm uses a lambda=1/n_s schedule to achieve the same effect as processing a state backwards with TD(0), but remains completely incremental. Another algorithm uses a lambda at each time equal to the estimated transition probability of the current transition. We present empirical results showing improvement in convergence rate over Monte Carlo methods and conventional TD(lambda). A limitation of our results at present is that they apply only to tasks whose state trajectories do not contain cycles.
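
    As a rough illustration of the Monte Carlo baseline described above (each state s updated with step size 1/n_s, where n_s is its visit count), here is a short sketch; it is written under stated assumptions rather than taken from the paper, and the name mc_predict and the episode format are invented for illustration.

        # Minimal sketch: every-visit Monte Carlo prediction in which each state s
        # is moved toward its observed return with step size 1/n_s.
        from collections import defaultdict

        def mc_predict(episodes, gamma=1.0):
            """episodes: iterable of episodes, each a list of (state, reward)
            pairs, where reward is the reward received on leaving that state."""
            V = defaultdict(float)   # value estimates, default 0.0
            n = defaultdict(int)     # visit counts n_s
            for episode in episodes:
                G = 0.0
                # walk the episode backwards so G is the return following each visit
                for s, r in reversed(episode):
                    G = r + gamma * G
                    n[s] += 1
                    V[s] += (G - V[s]) / n[s]   # step size 1/n_s
            return V

    The 1/n_s rule makes V[s] the running average of the returns observed from s; the algorithms in the paper seek comparably natural online step-size and lambda schedules for TD(lambda).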