Richard Sutton and Satinder Singh (1994) "On Bias and Step Size in Temporal-Difference Learning", Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pp. 91-96, Yale University, New Haven, CT. (66K)

ABSTRACT: We present results for three new algorithms for setting the step-size parameters, alpha and lambda, of temporal-difference learning methods such as TD(lambda). The overall task is that of learning to predict the outcome of an unknown Markov chain based on repeated observations of its state trajectories. The new algorithms select step-size parameters online in such a way as to eliminate the bias normally inherent in temporal-difference methods. We compare our algorithms with conventional Monte Carlo methods. Monte Carlo methods have a natural way of setting the step size: for each state s they use a step size of 1/n_s, where n_s is the number of times state s has been visited. We seek and come close to achieving comparable step-size algorithms for TD(lambda). One new algorithm uses a lambda=1/n_s schedule to achieve the same effect as processing a state backwards with TD(0), but remains completely incremental. Another algorithm uses a lambda at each time equal to the estimated transition probability of the current transition. We present empirical results showing improvement in convergence rate over Monte Carlo methods and conventional TD(lambda). A limitation of our results at present is that they apply only to tasks whose state trajectories do not contain cycles.