Using cuDNN to Speed Up DQN Training on Jetson TX1

Feb 22, 2017

I had an idea about speeding up training of DeepMind’s DQN with NVIDIA’s cuDNN. Then I found out it was really easy to do that with Torch7!

All I needed to do was convert the neural network to ‘cudnn’ after it has been created/loaded and cuda()’ed. More specifically, I added the following code to dqn/NeuralQLearner.lua. For an explanation of cudnn.benchmark and cudnn.fastest, please refer to the official cudnn.torch page.

    if self.gpu and self.gpu >= 0 then
        self.network:cuda()
        -- I added this part...
        if self.cudnn then
            cudnn.benchmark = true
            cudnn.fastest = true
            cudnn.convert(self.network, cudnn)
            print('*** Using cudnn ***')
        end
    else
        self.network:float()
    end

In addition to cudnn, I also wanted to reduce the time the DQN trainer spends displaying game images, so that it could spend more of its time doing useful training work. So I implemented that in the code too. After some trial and error, I picked 3 as the ‘display_freq’. That is, I let the DQN trainer display only 1 out of every 3 images during training. This way, I reduced the CPU time spent on image display while still being able to clearly see the progress of the game.
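The frame-skipping idea can be sketched in a few lines of plain Lua. This is only a minimal illustration, not the code from the repository: `should_display` and `count_displayed` are hypothetical helper names, and the real trainer would call its image display routine where this sketch merely counts frames.

```lua
-- Minimal sketch of display throttling (hypothetical helpers, not the repo's code).
local display_freq = 3  -- show 1 out of every 3 frames, as chosen in this post

-- Returns true when the frame at this (1-based) step should be drawn.
local function should_display(step, freq)
    return step % freq == 0
end

-- Counts how many frames would be drawn over n_steps training steps.
local function count_displayed(n_steps, freq)
    local shown = 0
    for step = 1, n_steps do
        if should_display(step, freq) then
            -- The real trainer would display the game screen here.
            shown = shown + 1
        end
    end
    return shown
end

print(count_displayed(1000, display_freq))  -- 333
```

With display_freq set to 1 every frame is shown, which corresponds to the baseline behavior.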

Here’s the result on Jetson TX1 after I implemented both cudnn and display_freq=3. (Note that DQN training does not really start until the trainer has run for ‘learn_start’ (5000) steps.) The numbers in the table below were all obtained while the DQN was being trained on the Atari ‘pong’ game, with the TX1 CPU running at max clock frequency (sudo ~/jetson_clocks.sh).

| Test Case | Time to train 1000 steps | % of baseline time |
|---|---|---|
| Baseline: display all frames, no cudnn | 56 s | 100 |
| Improvement #1: display 1/3 frames | 46 s | 82 |
| Improvement #2: 1/3 frames, with cudnn | 37 s | 66 |
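For reference, the percentage figures follow directly from the measured times. The short Lua sketch below recomputes them; the helper names `percent_of_baseline` and `speedup` are my own for illustration, and the speedup figure is simply derived from the same measurements.

```lua
-- Recompute the percentage figures from the measured times above.
local baseline_s = 56  -- baseline: display all frames, no cudnn

-- Time as a percentage of the baseline, rounded to the nearest integer.
local function percent_of_baseline(seconds)
    return math.floor(seconds / baseline_s * 100 + 0.5)
end

-- Throughput gain relative to the baseline (1.5 means 1.5x faster).
local function speedup(seconds)
    return baseline_s / seconds
end

print(percent_of_baseline(46))              -- 82
print(percent_of_baseline(37))              -- 66
print(string.format("%.2fx", speedup(37)))  -- 1.51x
```

So with both changes, 1000 training steps take about two thirds of the baseline time, or roughly 1.5 times the baseline throughput.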

When I looked deeper at the ‘Improvement #2’ case (as shown in the screenshot below), I saw that both the CPU (‘cpu’ in ~/tegrastats output) and the GPU (‘GR3D’) of the TX1 were far from fully loaded overall, while one of the 4 CPU cores was constantly at ~90% load. I think this likely indicates that the bottleneck is in the single-threaded Atari emulator (‘xitari’ and ‘alewrap’ in Lua). In other words, I think the DQN on TX1 could train much faster if the Atari emulator were able to generate game images at a higher rate. Hopefully this will work out better when I train the DQN with the Nintendo Famicom Mini console.

If you’d also like to run this cudnn accelerated DQN, you can refer to my earlier post, Training DeepMind’s DQN to Play Pong. And the repository is here:

https://github.com/jkjung-avt/DeepMind-Atari-Deep-Q-Learner
