Things
I will going to build the algorithm known as DQN. I decide to use openai gym to reproduce the result described in Human-level control through deep reinforcement learning. The article can be download from here for free.
I notice that there are so many repositories about DQN, but a lot of them don’t generate any benchmark for algorithms. So I think I need a benchmark and algorithm comparision, and I decide to write my own repository.
I think writing code is going to be painful, so I need to write down the pains I have. By the way, I’m using tensorflow and gym together.
DQN
The main goal of DQN is to estimate accumulate reward, known as Q-value, using deep neural network. The algorithm consists of several parts.
- Preprocessing
Working directly with raw Atari 2600 frames
, we need to convert $210 \times 160 \times 3$ pixel images into $84 \times 84 \times 4$ input vectors. - Neural Network
Input demension of network is $84 \times 84 \times 4$.The output layer is a fully-connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.
- Experience Replay
Store the transition in a memory.
The whole algorithm in the paper.
Deep Q-learning with experience replay
Initialize replay memory $D$ to capacity $N$
Initialize action-value function $Q$ with random weights $\theta$
Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
For episode = $1, M$ do
Initialize sequence $s_1=\{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$
For $t$ = $1, T$ do
With probability $\varepsilon$ select a random action $a_t$, otherwise select $a_t = \arg\max_a Q(\phi(S_t), a; \theta)$
Execute action $a_t$ in emulator and observe reward $r_t$ and image $x_{t+1}$
Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$
Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$
Sample random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$
Set $y_j = \left\{ \begin{matrix} r_j & \mbox{if episode terminates at step j+1} \\ r_j+\gamma \max_{a'} \hat{Q}(\phi_j, a'; \theta^-) & \mbox{otherwise} \end{matrix} \right.$
Perform a gradient descent step on $(y_j-Q(\phi_j,a_j;\theta))^2$ with respect to the network parameters $\theta$
Every $C$ steps reset $\hat{Q}=Q$
End For
End For
Preprocessing
First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of sprites Atari 2600 can display at once.
So in order to encode a single frame, we need two frames, current frame and previous frame, which will be stored in a queue.
Neural Network
Model Architecture
- input layer $84 \times 84 \times 4$.
- conv2d, $8 \times 8$ kernel size with stride 4, 32 kernels, relu.
- conv2d, $4 \times 4$ kernel size with stride 2, 64 kernels, relu.
- conv2d, $3 \times 3$ kernel size with stride 1, 64 kernels, relu.
- full-connected, 512 rectifier units.
- output layer, a single output for each valid action, varied between 4 and 18 on the games.
Parameters Update
More precisely, every C updates we clone the network Q to obtain a target network $\hat{Q}$and use $\hat{Q}$ % for generating the Q-learning targets $y_j$ for the following C updates to Q.
Experience Replay
First, we use a technique known as experience replay in which we store the agent’s experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a dataset $D_t = \{e_1, \dots ,e_t\}$, pooled over many episodes (where the end of an episode occurs when a terminal state is reached) into a replay memory.
In practice, our algorithm only stores the last $N$ experience tuples in the replay memory,and samples uniformly at random from $D$ when performing updates.This approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing to the finite memory size $N$.
And there is a improvement of experience replay.
A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping.
Tricks
Replace network parameters
Here I find some elegant code about replacing the parameters of target network. First, set a collection name, example. Second, while usingget_varible
, set the collection name, example. Third, get all varibles in the collection usingget_collection
, example.Replay memory
Here is a lecture about reinforcemnt learning and its tensorflow implement. The whole lesson can be found here. This class is calledCS 20SI: Tensorflow for Deep Learning Research
, and you can find more information here.