Consider tasks in which an agent interacts with an environment and the goal is to select actions in a way that maximizes future rewards.
The optimal action-value function $Q^*(s,a)$ is defined as the maximum expected return achievable after seeing some sequence $s$ and then taking some action $a$,
$$Q^*(s,a) = \max_\pi \mathbb{E}\left[R_t \mid s_t = s,\, a_t = a,\, \pi\right],$$
where $\pi$ is a policy mapping sequences to actions and $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the future discounted return with discount factor $\gamma$.
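To make the return concrete, here is a minimal sketch of the discounted sum $R_t$; the reward sequence and discount value are hypothetical, not taken from the source:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k} for the rewards observed from time t onward."""
    total = 0.0
    # Accumulate from the last reward backwards: R = r + gamma * R_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical reward sequence observed after time t.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.99))  # 1.0 + 0.99*0.0 + 0.99**2 * 2.0 = 2.9602
```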
If the optimal value $Q^*(s',a')$ of the sequence $s'$ at the next time-step were known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ maximizing the expected value of $r + \gamma Q^*(s',a')$:
$$Q^*(s,a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s',a') \,\middle|\, s, a\right].$$
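This identity can be read as a one-step lookahead: given optimal next-step values, act greedily. A minimal tabular sketch of the corresponding Q-value iteration, assuming a small hypothetical MDP (the transition probabilities and rewards below are illustrative, not from the source):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] transition probabilities, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Repeatedly apply the Bellman optimality backup:
# Q(s, a) <- R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a').
Q = np.zeros((2, 2))
for _ in range(200):
    Q = R + gamma * P @ Q.max(axis=1)

# The optimal strategy acts greedily with respect to the converged Q-values.
greedy_policy = Q.argmax(axis=1)
print(Q, greedy_policy)
```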
In practice, estimating the action-value function this way is impractical, because it must be estimated separately for each sequence, without any generalization. Instead, a neural network with weights $\theta$ can be used as a Q-function approximator, $Q(s,a;\theta) \approx Q^*(s,a)$, trained by minimizing a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$:
$$L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\left[\left(y_i - Q(s,a;\theta_i)\right)^2\right],$$
where $y_i = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \mid s, a\right]$ is the target for iteration $i$ and $\rho(s,a)$ is a behaviour distribution over sequences and actions.
Differentiating the loss with respect to the weights gives:
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s'}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i)\right)\nabla_{\theta_i} Q(s,a;\theta_i)\right].$$
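Putting the loss and its gradient together, here is a minimal sketch of one stochastic semi-gradient update, using a linear Q-function approximator as a stand-in for the neural network; the feature vectors, learning rate, and transition values are hypothetical assumptions, not from the source:

```python
import numpy as np

def q_value(theta, phi_s, a):
    """Linear stand-in for the Q-network: Q(s, a; theta) = theta[a] . phi(s)."""
    return theta[a] @ phi_s

def td_update(theta, theta_prev, phi_s, a, r, phi_s_next, gamma=0.99, lr=0.01):
    """One semi-gradient step on (y - Q(s, a; theta))^2.

    The target y uses the previous iteration's weights theta_prev and is held
    fixed when differentiating, matching the gradient expression above.
    """
    n_actions = theta.shape[0]
    # y_i = r + gamma * max_a' Q(s', a'; theta_{i-1})
    y = r + gamma * max(q_value(theta_prev, phi_s_next, a2) for a2 in range(n_actions))
    td_error = y - q_value(theta, phi_s, a)
    # For a linear approximator, grad_theta Q(s, a; theta) is phi(s) on row a, zero elsewhere.
    theta = theta.copy()
    theta[a] += lr * td_error * phi_s
    return theta

# Hypothetical single transition with 4-dimensional state features and 2 actions.
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 4))
theta_prev = theta.copy()
phi_s, phi_s_next = rng.normal(size=4), rng.normal(size=4)
theta = td_update(theta, theta_prev, phi_s, a=1, r=1.0, phi_s_next=phi_s_next)
```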
Experience replay is used to store the agent's experience $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step in a dataset $\mathcal{D} = \{e_1, \ldots, e_N\}$, pooled over many episodes into a replay memory. Q-learning updates are applied to minibatches of experience drawn at random from this pool of stored samples. After performing experience replay, the agent selects an action according to an $\epsilon$-greedy policy.
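A minimal sketch of the replay memory and $\epsilon$-greedy action selection described above; the buffer capacity, batch size, and the q_values argument are hypothetical placeholders rather than values from the source:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity pool of transitions e_t = (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Minibatch of experiences drawn uniformly at random from the pool.
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, n_actions, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise act greedily on q_values."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_values[a])
```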