Deep Q-Learning

Consider tasks in which an agent interacts with an environment E and the goal is to select actions in a way that maximizes future rewards.

The optimal action-value function is defined as the maximum expected return achievable after seeing some sequence s and then taking some action a:

Q^*(s, a) = \max_\pi \mathbb{E}\left[ R_t \mid s_t = s,\ a_t = a,\ \pi \right]

where R_t is the sum of future rewards discounted by a factor of γ per time step, and π is a policy mapping sequences to actions.

If the optimal value Q^*(s', a') of the sequence s' at the next time step were known for all possible actions a', then the optimal strategy is to select the action a' that maximizes the expected value of r + γ Q^*(s', a'). This identity is the Bellman equation:

Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]
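As a small worked example of this backup (purely illustrative numbers, not taken from the paper), the one-step target for a single observed transition is the reward plus the discounted value of the best next action:

```python
import numpy as np

# Hypothetical transition (s, a, r, s') with made-up values.
r = 1.0                              # observed reward
gamma = 0.99                         # discount factor
q_next = np.array([0.2, 1.5, -0.3])  # assumed Q*(s', a') for each action a'

# Bellman backup: r + gamma * max_a' Q*(s', a')
target = r + gamma * q_next.max()
print(target)  # 1.0 + 0.99 * 1.5 = 2.485
```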

In practice, this basic approach is impractical, because the action-value function is estimated separately for each sequence, with no generalization across sequences. Instead, a neural network with weights θ (a Q-network) can be used as a function approximator, Q(s, a; θ) ≈ Q^*(s, a), and trained by minimizing a sequence of loss functions L_i(θ_i), one for each iteration i.
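For concreteness, here is a minimal sketch of such a Q-network in PyTorch. The state size, layer widths, and number of actions are placeholder assumptions, not the architecture used in the paper; the network maps a state to one estimated action value per action, so a single forward pass scores every action.

```python
import torch
import torch.nn as nn

STATE_DIM = 4    # assumed size of the encoded state/sequence
N_ACTIONS = 2    # assumed number of discrete actions

# Q(s, .; theta): one output per action.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)

state = torch.randn(1, STATE_DIM)  # dummy batch containing a single state
q_values = q_net(state)            # tensor of shape (1, N_ACTIONS)
```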

The loss for iteration i is

L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right],
\qquad y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]

where y_i is the target for iteration i and ρ(s,a) is the behavior distribution over sequences s and actions a. The parameters θ_{i-1} from the previous iteration are held fixed while optimizing L_i(θ_i).
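A minimal sketch of this loss, assuming a small fully connected Q-network and a random placeholder minibatch; a frozen copy of the network stands in for the previous-iteration parameters θ_{i-1}:

```python
import copy

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99  # assumed sizes and discount factor

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = copy.deepcopy(q_net)         # plays the role of theta_{i-1}, held fixed

# Placeholder minibatch of transitions (s, a, r, s').
B = 32
s = torch.randn(B, STATE_DIM)
a = torch.randint(N_ACTIONS, (B,))
r = torch.randn(B)
s_next = torch.randn(B, STATE_DIM)

# Targets y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}), computed without gradients.
with torch.no_grad():
    y = r + GAMMA * target_net(s_next).max(dim=1).values

# L_i(theta_i): squared error between the targets and the current estimates Q(s, a; theta_i).
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = ((y - q_sa) ** 2).mean()
```

Terminal transitions, whose target is simply r, are omitted here to keep the sketch short.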

Differentiating the loss with respect to the weights gives:

\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\ s' \sim \mathcal{E}}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right]
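To make this update concrete, the sketch below applies it with a linear approximator Q(s, a; θ) = θ[a] · s (a simplifying assumption, not the model used in the paper), for which ∇_θ Q(s, a; θ) is simply the state vector placed in row a; a single sampled transition stands in for the expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_ACTIONS = 4, 2   # assumed sizes
GAMMA, LR = 0.99, 0.01        # discount factor and learning rate

theta = rng.normal(size=(N_ACTIONS, STATE_DIM))  # current parameters theta_i
theta_old = theta.copy()                         # previous parameters theta_{i-1}

def q(params, s):
    """Linear Q-function: returns the vector Q(s, .; params)."""
    return params @ s

# One sampled transition (s, a, r, s').
s = rng.normal(size=STATE_DIM)
a = 1
r = 0.5
s_next = rng.normal(size=STATE_DIM)

# TD error: the bootstrap term uses the frozen parameters theta_{i-1}.
td_error = r + GAMMA * q(theta_old, s_next).max() - q(theta, s)[a]

# Gradient of Q(s, a; theta) w.r.t. theta: the state features in row a, zeros elsewhere.
grad_q = np.zeros_like(theta)
grad_q[a] = s

# Descending L_i moves theta along +td_error * grad_q (the factor of 2 is folded into LR).
theta += LR * td_error * grad_q
```

With a neural network the same gradient is obtained by backpropagation rather than written out by hand.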

Experience replay is used to store the agent's experience at each time step, e_t = (s_t, a_t, r_t, s_{t+1}), in a dataset D = {e_1, ..., e_N}, pooled over many episodes into a replay memory. Q-learning updates are applied to minibatches of experience drawn at random from this pool, which breaks the correlations between consecutive samples and allows each experience to be reused in many updates. The agent then selects an action according to an ϵ-greedy policy.
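A minimal sketch of these two pieces in plain Python; the capacity, ϵ, and batch size are placeholder values, and `q_values` is assumed to be the Q-network's output for the current state:

```python
import random
from collections import deque

CAPACITY = 100_000   # assumed replay memory size
EPSILON = 0.1        # assumed exploration rate
BATCH_SIZE = 32      # assumed minibatch size

replay_memory = deque(maxlen=CAPACITY)  # oldest experiences are dropped when full

def store(s, a, r, s_next):
    """Store one experience tuple e_t = (s_t, a_t, r_t, s_{t+1})."""
    replay_memory.append((s, a, r, s_next))

def sample_minibatch():
    """Draw a random minibatch of stored experiences for a Q-learning update."""
    return random.sample(replay_memory, BATCH_SIZE)

def epsilon_greedy(q_values, n_actions):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < EPSILON:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda i: q_values[i])
```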