Jan 2019
tl;dr: Trains a DQN to play Flappy Bird.
Overall impression
It is quite impressive that this is a course project paper by a Stanford undergrad. The project largely follows the DeepMind Nature 2015 DQN paper.
Key ideas
- Many of the ideas are standard from the Nature DQN paper
- Multiple time-stamped states as input: the agent needs to infer velocity from a sequence of frames, so the input is w × h × HistoryLength (see the frame-stacking sketch after this list).
- Experience replay: decorrelates the non-stationary, highly correlated stream of experiences by sampling minibatches from a buffer (see the replay-buffer sketch after this list).
- Reward shaping: adding a rewardAlive term to speed up training, since the original reward is sparse and only increments when an obstacle is passed.
- $\epsilon$-greedy action selection to balance exploration and exploitation.
- Target network: a periodically updated copy of the online network, used for training stability.
- A DQN trained directly on the hard task does not perform well on easy tasks. However, the author argues that training on easy tasks first and then fine-tuning on hard tasks solves this problem.
- The network is quite simple: 3 conv layers with large kernels and strides, followed by 1 fc layer (see the architecture sketch after this list).
- Training is not consistent. Training longer does not necessarily lead to better performance.
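A minimal sketch of the frame-stacking input, assuming 80×80 grayscale frames and HistoryLength = 4 (these sizes are my assumptions, not values quoted from the paper):

```python
from collections import deque
import numpy as np

HISTORY_LENGTH = 4  # number of past frames stacked along the channel axis (assumed value)

class FrameStack:
    """Keeps the last HISTORY_LENGTH preprocessed frames as a w x h x HistoryLength tensor."""

    def __init__(self, w=80, h=80):
        self.frames = deque(maxlen=HISTORY_LENGTH)
        self.w, self.h = w, h

    def reset(self, first_frame):
        # At episode start, repeat the first frame so the stack is full from step one.
        for _ in range(HISTORY_LENGTH):
            self.frames.append(first_frame)
        return self.state()

    def step(self, frame):
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Shape: (w, h, HISTORY_LENGTH); this is the network input.
        return np.stack(self.frames, axis=-1)
```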
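A similarly minimal replay-buffer sketch; the capacity and batch size below are placeholders, not the paper's values:

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples them uniformly."""

    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation of consecutive frames.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
        return states, actions, rewards, next_states, dones
```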
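A sketch of the network in TF 1.x style (matching the repos discussed in the Code section below). The paper only states 3 conv layers + 1 fc layer, so the filter counts and kernel/stride sizes here follow the Nature DQN convention and are assumptions:

```python
import tensorflow as tf  # TF 1.x API

def q_network(states, num_actions=2, scope="online"):
    """3 conv layers with large kernels/strides + 1 fc layer producing Q-values.
    Layer sizes are assumptions, not the paper's exact numbers."""
    with tf.variable_scope(scope):
        net = tf.layers.conv2d(states, 32, kernel_size=8, strides=4, activation=tf.nn.relu)
        net = tf.layers.conv2d(net, 64, kernel_size=4, strides=2, activation=tf.nn.relu)
        net = tf.layers.conv2d(net, 64, kernel_size=3, strides=1, activation=tf.nn.relu)
        net = tf.layers.flatten(net)
        q_values = tf.layers.dense(net, num_actions)  # one Q-value per action (flap / do nothing)
    return q_values
```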
Technical details
- The background of the game (city skyline?) is removed to reduce clutter.
- Exploration probability $\epsilon$ annealed from 1 to 0.1 (see the sketch after this list)
- Reward discount=0.95
- rewardAlive=0.1, rewardPipe=1.0
- 600,000 iterations
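A sketch of $\epsilon$-greedy action selection with the linear 1 → 0.1 annealing and the reward/discount values above. The annealing horizon is an assumption; only the start/end values, discount, and shaped rewards come from the paper:

```python
import random
import numpy as np

EPSILON_START, EPSILON_END = 1.0, 0.1
ANNEAL_STEPS = 500000  # assumed annealing horizon; the paper trains for 600,000 iterations in total
GAMMA = 0.95           # reward discount
REWARD_ALIVE, REWARD_PIPE = 0.1, 1.0  # shaped rewards

def epsilon_at(step):
    """Linearly anneal epsilon from 1.0 down to 0.1, then keep it fixed."""
    frac = min(step / ANNEAL_STEPS, 1.0)
    return EPSILON_START + frac * (EPSILON_END - EPSILON_START)

def select_action(q_values, step, num_actions=2):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon_at(step):
        return random.randrange(num_actions)
    return int(np.argmax(q_values))
```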
Notes
- The network would probably perform equally well with the background kept, since the background is stationary.
- The network architecture could be improved.
- Tables IV and V show the benefit of transfer learning, but they are a bit hard to interpret at a glance: the columns represent the different networks, and the rows represent the deployment environment. Notably, in Table V, the DQN (hard) network performs very well on the hard task but fails miserably on the easy and medium tasks.
- The benefit from transfer learning is very interesting, and it seems to contradict findings in supervised learning (such as those noted in ‘Overcoming catastrophic forgetting in neural networks’). The author says it was trained for only 199 iterations. Is this a typo?
Code
Code Notes
- The training code is heavily based on this repo, which trains a DQN on Atari Pong or Tetris. The main difference from that code is the frame-skipping strategy: the Atari code picks an action and repeats it for the next K-1 frames, whereas the Flappy Bird code stays idle for the next K-1 frames (see the sketch after this list). This is due to the nature of the game: flapping too often causes the bird to fly too high and crash.
- Another difference is that the Atari code uses random play to initialize the experience replay buffer, while the Flappy Bird repo uses the $\epsilon$-greedy training routine. This should not make a big difference.
- This repo does not implement the target network. To copy the online network's parameters into the target network, use `tf.assign(target_net_param, net_param)` for each pair of variables, as shown in this repo and this one (see the copy-op sketch after this list).
- Two things to remember when using the global step variable (see the sketch after this list):
    - Create the global step variable (either through `tf.Variable` or `tf.train.create_global_step`).
    - Pass it to `tf.train.Optimizer.minimize` via the `global_step` argument so it is incremented automatically.
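A sketch contrasting the two frame-skipping strategies described above. `env.step` is a placeholder interface returning `(frame, reward, done)`, and K = 4 is an assumed skip length:

```python
FRAME_SKIP = 4   # K; assumed value
IDLE_ACTION = 0  # "do nothing" action in Flappy Bird

def skip_frames_atari(env, action):
    """Atari-style: repeat the chosen action for all K frames."""
    total_reward, done = 0.0, False
    for _ in range(FRAME_SKIP):
        frame, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return frame, total_reward, done

def skip_frames_flappy(env, action):
    """Flappy-Bird-style: apply the chosen action once, then stay idle for the next
    K-1 frames, since flapping on every frame would send the bird too high."""
    frame, total_reward, done = env.step(action)
    for _ in range(FRAME_SKIP - 1):
        if done:
            break
        frame, reward, done = env.step(IDLE_ACTION)
        total_reward += reward
    return frame, total_reward, done
```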
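A sketch of the periodic target-network copy in TF 1.x, assuming the online and target networks are built under variable scopes named `online` and `target` (as in the `q_network` sketch above):

```python
import tensorflow as tf  # TF 1.x API

def build_target_update_op(online_scope="online", target_scope="target"):
    """Create an op that copies every online-network variable into the matching target variable."""
    online_vars = sorted(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=online_scope),
                         key=lambda v: v.name)
    target_vars = sorted(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=target_scope),
                         key=lambda v: v.name)
    copy_ops = [tf.assign(t, o) for o, t in zip(online_vars, target_vars)]
    return tf.group(*copy_ops)

# Usage: run this op every N training steps for a periodic hard update.
# update_target = build_target_update_op()
# sess.run(update_target)
```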
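And a sketch of the global step bookkeeping; the quadratic loss is a dummy placeholder standing in for the DQN TD-error loss, just to keep the snippet self-contained:

```python
import tensorflow as tf  # TF 1.x API

# 1. Create the global step variable
#    (equivalently: tf.Variable(0, trainable=False, name="global_step")).
global_step = tf.train.create_global_step()

# Dummy loss; in the DQN this would be the TD-error loss.
w = tf.Variable(1.0)
loss = tf.square(w)

# 2. Pass the global step to minimize() so it is incremented automatically
#    every time train_op runs.
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss, global_step=global_step)
```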