Utility (over a finite agent lifetime) is defined as the expected sum of the immediate reward and the long-term reward under the best possible policy:

$$U(s_0) = \max_{a} E\!\left[\, r(s_0, a) + \sum_{t=1}^{N} r_t \right] \tag{1}$$

where $s_t$ is the state at time step $t$, $r(s,a)$ is the
immediate reward of executing action $a$ in state $s$, $N$ is
the number of steps in the lifetime of the agent, and $r_t$ is the
reward at time step $t$. The operator $E$ stands for
taking an expectation over all sources of randomness in the system.

Utility (over an infinite agent lifetime) is defined similarly:

$$U(s_0) = \max_{a} E\!\left[\, r(s_0, a) + \sum_{t=1}^{\infty} \gamma^{t} r_t \right] \tag{2}$$

To avoid the mathematical awkwardness of infinite sums, we introduce a
**discount factor**, $\gamma$, with $0 \le \gamma < 1$, which weights future rewards less
heavily than immediate rewards. This is similar to the compound interest
used by banks.
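
To see why the discount tames the infinite sum, suppose for illustration that every reward satisfies $|r_t| \le r_{\max}$ for some constant $r_{\max}$ (an assumption made only for this check). The discounted sum in Equation 2 is then dominated by a geometric series and is therefore finite:

$$\sum_{t=1}^{\infty} \gamma^{t} r_t \;\le\; r_{\max} \sum_{t=1}^{\infty} \gamma^{t} \;=\; \frac{\gamma\, r_{\max}}{1-\gamma}.$$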

The utility of a state can be defined in terms of the utility of the next state:

$$U(s) = \max_{a} \left(\, r(s,a) + \gamma\, E\!\left[\, U(s') \,\right] \right) \tag{3}$$

where $s'$ is the state that follows $s$.
This gives a system of equations, called the *Bellman Optimality
Equations* (e.g., Bertsekas [2]), one for each
possible state, the solution to which is the utility
function.

The dynamic programming method for solving the Bellman equations is to iterate the following update, for every state $s$:

$$\hat{U}(s) \leftarrow \max_{a} \left(\, r(s,a) + \gamma\, E\!\left[\, \hat{U}(s') \,\right] \right) \tag{4}$$

The RL rule for updating the estimated utility is:

$$\hat{U}(s) \leftarrow (1-\alpha)\,\hat{U}(s) + \alpha \left(\, r(s,a) + \gamma\, \hat{U}(s') \,\right) \tag{5}$$

where $\alpha$ is a small number less than one that determines the
rate of change of the estimate. Notice that the second part of
Equation 5, the term $r(s,a) + \gamma\,\hat{U}(s')$, is a lot like Equation 3, except that
there are no expectation signs $E$ anywhere: it uses the single next
state $s'$ that was actually observed. See Barto, Bradtke, and
Singh [1] for additional information
on RL methods.
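
For comparison, here is a minimal sketch of the update mechanics of Equation 5 on the same hypothetical MDP (the tables are repeated so the sketch is self-contained). The uniformly random exploration policy, the step size $\alpha = 0.1$, and the step count are assumptions made only for illustration; the point is that each update uses the one transition actually observed, with no expectation.

```python
import random

# Minimal sketch of the RL update in Equation 5 on the same hypothetical MDP.
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)],           1: [(0, 0.5), (1, 0.5)]},
}
r = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}
gamma = 0.9   # discount factor
alpha = 0.1   # step size: the small number less than one in Equation 5

U_hat = {s: 0.0 for s in P}  # estimated utilities, initially zero

s = 0
for _ in range(20000):
    a = random.choice(list(P[s]))                   # explore at random (assumed policy)
    states, probs = zip(*P[s][a])
    s2 = random.choices(states, weights=probs)[0]   # the one observed next state
    # Equation 5: blend the old estimate with a single sampled backup;
    # unlike Equation 3, no expectation is taken over next states.
    U_hat[s] = (1 - alpha) * U_hat[s] + alpha * (r[s][a] + gamma * U_hat[s2])
    s = s2

print(U_hat)  # estimates learned from sampled experience alone
```

As written (without the max over actions inside the update), this sketch tracks the utility of the exploration policy; it is meant only to show the sampled, expectation-free form of Equation 5.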