Deep Q-Learning

Consider tasks in which an agent interacts with an environment $\mathcal{E}$ and the goal is to select actions in a way that maximizes future rewards.

The optimal action-value function is defined as the maximum expected return achievable by any policy after seeing some sequence $s$ and then taking some action $a$: $Q^*(s,a) = \max_{\pi} \ \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, where $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the future discounted return and $\gamma$ is the discount factor.

If the optimal value $Q^*(s',a')$ of the sequence $s'$ at the next time step were known for all possible actions $a'$, then the optimal strategy would be to select the action $a'$ that maximizes the expected value of $r + \gamma \ Q^*(s',a')$. This identity is the Bellman equation:

$$ Q^*(s,a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \ \max_{a'} \ Q^*(s', a') \,\middle|\, s, a \right] $$
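In practice the expectation on the right-hand side is estimated from sampled transitions. The sketch below illustrates this sample-based Bellman backup on a hypothetical tabular problem; the state/action counts, discount factor, and step size are illustrative assumptions, not values from the text.

```python
import numpy as np

# Hypothetical toy problem: 4 states, 2 actions, discount factor gamma.
n_states, n_actions, gamma = 4, 2, 0.99
Q = np.zeros((n_states, n_actions))

def bellman_backup(Q, s, a, r, s_next, step_size=0.1):
    """One sample-based Bellman optimality update for a single transition.

    The target r + gamma * max_a' Q(s', a') is a one-sample estimate of the
    expectation on the right-hand side of the equation above.
    """
    target = r + gamma * Q[s_next].max()
    # Move the current estimate toward the target.
    Q[s, a] += step_size * (target - Q[s, a])
    return Q
```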

In practice, this basic approach is impractical because the action-value function is estimated separately for each sequence, with no generalization across sequences. Instead, a neural network with weights $\theta$ can be used as a Q-function approximator, $Q(s,a;\theta) \approx Q^*(s,a)$, trained by minimizing a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$:

$$ L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)} \left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right] $$

where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \ \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]$ is the target for iteration $i$ and $\rho(s,a)$ is the behaviour distribution over sequences and actions.
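As a concrete illustration, here is a minimal PyTorch sketch of such an approximator and loss. It assumes a small fully connected network over a flat 8-dimensional state with 4 discrete actions (the original DQN uses a convolutional network over image frames), and a `target_net` holding the previous iteration's weights $\theta_{i-1}$; these choices, and the terminal-state flag `done`, are assumptions for the example.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: a flat 8-dimensional state and 4 discrete actions.
state_dim, n_actions, gamma = 8, 4, 0.99

# A small fully connected Q-network mapping a state to one value per action.
q_net = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_actions),
)

def loss_fn(batch, target_net):
    """Squared error between Q(s, a; theta_i) and the target y_i."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); the target network
        # holds the previous iteration's weights and is not differentiated.
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, y)
```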

Differentiating the loss with respect to the weights gives:

$$ \nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s' \sim \mathcal{E}} \left[ \left( r + \gamma \ \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s,a;\theta_i) \right) \nabla_{\theta_i} Q(s,a;\theta_i) \right] $$
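Rather than computing the full expectation, this gradient is typically estimated on sampled minibatches and followed by a stochastic gradient step. A minimal sketch, reusing `q_net`, `target_net`, and `loss_fn` from the snippet above; the RMSprop optimizer and learning rate are assumptions for illustration.

```python
import torch

# One stochastic gradient step on L_i(theta_i) for a sampled minibatch.
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

def train_step(batch):
    loss = loss_fn(batch, target_net)   # L_i(theta_i) on the sampled minibatch
    optimizer.zero_grad()
    loss.backward()                      # autograd computes grad_theta L_i
    optimizer.step()                     # gradient descent on theta_i
    return loss.item()
```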

Experience replay stores the agent's experience at each time step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a dataset $\mathcal{D} = \{e_1, \ldots, e_N\}$ pooled over many episodes into a replay memory. Q-learning updates are applied to minibatches of experience drawn uniformly at random from this pool, and the agent then selects an action according to an $\epsilon$-greedy policy, as sketched below.
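A minimal sketch of the replay memory and $\epsilon$-greedy action selection, reusing `q_net` and `n_actions` from the snippets above; the memory capacity, minibatch size, $\epsilon$ value, and the stored `done` flag are illustrative assumptions.

```python
import random
from collections import deque

import torch

# Fixed-capacity replay memory; old experiences are discarded as new ones arrive.
replay_memory = deque(maxlen=100_000)

def store(s, a, r, s_next, done):
    """Append the experience tuple e_t = (s_t, a_t, r_t, s_{t+1})."""
    replay_memory.append((s, a, r, s_next, done))

def sample_minibatch(batch_size=32):
    """Draw a minibatch uniformly at random from the pool of stored experience."""
    batch = random.sample(replay_memory, batch_size)
    s, a, r, s_next, done = map(list, zip(*batch))
    return (torch.stack(s), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
            torch.stack(s_next), torch.tensor(done, dtype=torch.float32))

def epsilon_greedy(state, epsilon=0.1):
    """With probability epsilon explore; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())
```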