Deep Q-Learning

Consider tasks in which an agent interacts with an environment $\mathcal{E}$ and the goal is to select actions in a way that maximizes future rewards.

The optimal action-value function is defined as the maximum expected return achievable by any policy after seeing some sequence $s$ and then taking some action $a$: $Q^*(s,a) = \max_{\pi} \ \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, where $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the future discounted return and $\gamma$ is the discount factor.

If the optimal value $Q^*(s',a')$ of the sequence $s'$ at the next time step were known for all possible actions $a'$, then the optimal strategy would be to select the action $a'$ that maximizes the expected value of $r + \gamma \ Q^*(s',a')$. This identity is the Bellman equation:

$$ Q^*(s,a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \ \max_{a'} \ Q^*(s', a') \,\middle|\, s, a \right] $$
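In practice the expectation on the right-hand side is estimated from sampled transitions. The sketch below illustrates this sample-based Bellman backup on a hypothetical tabular problem; the state/action counts, discount factor, and step size are illustrative assumptions, not values from the text.

```python
import numpy as np

# Hypothetical toy problem: 4 states, 2 actions, discount factor gamma.
n_states, n_actions, gamma = 4, 2, 0.99
Q = np.zeros((n_states, n_actions))

def bellman_backup(Q, s, a, r, s_next, step_size=0.1):
    """One sample-based Bellman optimality update for a single transition.

    The target r + gamma * max_a' Q(s', a') is a one-sample estimate of the
    expectation on the right-hand side of the equation above.
    """
    target = r + gamma * Q[s_next].max()
    # Move the current estimate toward the target.
    Q[s, a] += step_size * (target - Q[s, a])
    return Q
```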

In practice, this basic approach is impractical because the action-value function is estimated separately for each sequence, with no generalization across sequences. Instead, a neural network with weights $\theta$ can be used as a Q-function approximator, $Q(s,a;\theta) \approx Q^*(s,a)$, trained by minimizing a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$:

$$ L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)} \left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right] $$

where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \ \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]$ is the target for iteration $i$ and $\rho(s,a)$ is the behaviour distribution over sequences and actions.
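As a concrete illustration, here is a minimal PyTorch sketch of such an approximator and loss. It assumes a small fully connected network over a flat 8-dimensional state with 4 discrete actions (the original DQN uses a convolutional network over image frames), and a `target_net` holding the previous iteration's weights $\theta_{i-1}$; these choices, and the terminal-state flag `done`, are assumptions for the example.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: a flat 8-dimensional state and 4 discrete actions.
state_dim, n_actions, gamma = 8, 4, 0.99

# A small fully connected Q-network mapping a state to one value per action.
q_net = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)

def loss_fn(batch, target_net):
    """Squared error between Q(s, a; theta_i) and the target y_i."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); the target network
        # holds the previous iteration's weights and is not differentiated.
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)
```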

Differentiating the loss with respect to the weights gives:

$$ \nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s' \sim \mathcal{E}} \left[ \left( r + \gamma \ \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s,a;\theta_i) \right) \nabla_{\theta_i} Q(s,a;\theta_i) \right] $$
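Rather than computing the full expectation, this gradient is typically estimated on sampled minibatches and followed by a stochastic gradient step. A minimal sketch, reusing `q_net`, `target_net`, and `loss_fn` from the snippet above; the RMSprop optimizer and learning rate are assumptions for illustration.

```python
import torch

# One stochastic gradient step on L_i(theta_i) for a sampled minibatch.
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

def train_step(batch):
    loss = loss_fn(batch, target_net)   # L_i(theta_i) on the sampled minibatch
    optimizer.zero_grad()
    loss.backward()                      # autograd computes grad_theta L_i
    optimizer.step()                     # gradient descent on theta_i
    return loss.item()
```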

Experience replay stores the agent's experience at each time step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a dataset $\mathcal{D} = \{e_1, \ldots, e_N\}$ pooled over many episodes into a replay memory. Q-learning updates are applied to minibatches of experience drawn uniformly at random from this pool, and the agent then selects an action according to an $\epsilon$-greedy policy, as sketched below.
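A minimal sketch of the replay memory and $\epsilon$-greedy action selection, reusing `q_net` and `n_actions` from the snippets above; the memory capacity, minibatch size, $\epsilon$ value, and the stored `done` flag are illustrative assumptions.

```python
import random
from collections import deque

import torch

# Fixed-capacity replay memory; old experiences are discarded as new ones arrive.
replay_memory = deque(maxlen=100_000)

def store(s, a, r, s_next, done):
    """Append the experience tuple e_t = (s_t, a_t, r_t, s_{t+1})."""
    replay_memory.append((s, a, r, s_next, done))

def sample_minibatch(batch_size=32):
    """Draw a minibatch uniformly at random from the pool of stored experience."""
    batch = random.sample(replay_memory, batch_size)
    s, a, r, s_next, done = map(list, zip(*batch))
    return (torch.stack(s), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
            torch.stack(s_next), torch.tensor(done, dtype=torch.float32))

def epsilon_greedy(state, epsilon=0.1):
    """With probability epsilon explore; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```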