Google open-sources SEED RL, a parallel computing framework for reinforcement learning that processes 2.4 million game frames per second and cuts AI training costs by 80%
Xiaocha from Aofei Temple
Quantum Bit Report | Public Account QbitAI
The most frustrating thing for the wealthy is owning a pile of hardware yet not even getting a 1+1=2 payoff from it.
That is exactly the situation with parallel computing in AI training: even with a thousand GPUs, you will not get a thousand times the performance of single-machine training.
Recently Google, which is hardly short of money, open-sourced SEED RL, a framework that makes it easier to run AI training across thousands of machines, with results up to 4 times better than previous methods.
If you can afford large-scale parallel computing in the cloud, it can cut training costs by 80%. Considering that training a large AI model can easily cost millions, that is a substantial saving.
When training an AI to play a soccer game, SEED RL can process 2.4 million frames per second. At 60 fps, that is equivalent to processing 11 hours of gameplay footage every second.
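A quick back-of-the-envelope check of that conversion:

```python
# Sanity check: 2.4 million frames/s measured against 60 fps gameplay.
frames_per_second = 2_400_000
gameplay_fps = 60

footage_seconds = frames_per_second / gameplay_fps  # 40,000 s of footage per second
footage_hours = footage_seconds / 3600              # ~11.1 hours
print(f"{footage_hours:.1f} hours of gameplay processed per second")
```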
SEED RL Architecture
The previous generation of distributed reinforcement learning agent, IMPALA, uses an architecture with two components: Actors and a Learner.
Actors typically run on CPUs, alternating between stepping through the environment and running inference on the model to predict the next action.
An Actor periodically updates the parameters of its inference model, and after collecting enough observations it sends the trajectory of observations and actions to the Learner, which uses them to optimize the model.
In this architecture, the Learner trains the model on GPUs using inputs gathered by distributed inference across hundreds of machines.
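As a rough illustration of that loop, here is a minimal sketch of an IMPALA-style Actor; `env`, `model`, and `learner_queue` are hypothetical stand-ins, not the actual API:

```python
# Minimal sketch of an IMPALA-style Actor (hypothetical objects, not the real API).
def impala_actor(env, model, learner_queue, unroll_length=100):
    obs = env.reset()
    while True:
        model.sync_parameters()            # pull the latest weights from the Learner
        trajectory = []
        for _ in range(unroll_length):
            action = model.inference(obs)  # neural-net inference on the Actor's CPU
            next_obs, reward, done = env.step(action)
            trajectory.append((obs, action, reward, done))
            obs = env.reset() if done else next_obs
        learner_queue.put(trajectory)      # ship the unroll to the Learner for training
```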
But IMPALA has several drawbacks:
1. Using CPUs for neural network inference is inefficient, and the problem grows worse as models become larger and more computationally demanding.
2. The bandwidth needed to ship model parameters between Actors and the Learner becomes a performance bottleneck.
3. Resource utilization is low. Actors alternate between environment steps and inference, and the two tasks have different computational requirements, making it hard to fully utilize the resources of a single machine.
The SEED RL architecture solves these shortcomings: neural network inference is moved onto AI hardware accelerators such as GPUs and TPUs, which speeds up inference and avoids data-transfer bottlenecks by keeping model parameters and state local.
In contrast to IMPALA, Actors in SEED RL only take actions in the environment; the Learner performs inference centrally on a hardware accelerator, using batched data from many Actors.
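A minimal sketch of that division of labor, with hypothetical `env`, `learner_client`, and `model` objects standing in for SEED RL's actual interfaces:

```python
import numpy as np

# Hypothetical SEED RL-style split: the Actor only steps the environment;
# all neural-net inference happens centrally on the Learner's accelerator.
def seed_actor(env, learner_client):
    obs = env.reset()
    while True:
        # One round trip per environment step: send the observation,
        # receive the next action computed on the Learner.
        action = learner_client.inference(obs)
        obs, reward, done = env.step(action)
        if done:
            obs = env.reset()

def seed_learner_inference_step(pending_requests, model):
    # Batch observations from many Actors into a single forward pass
    # on the accelerator, then reply to each Actor with its action.
    batch = np.stack([req.observation for req in pending_requests])
    actions = model.policy(batch)
    for req, action in zip(pending_requests, actions):
        req.reply(action)
```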
SEED RL uses the gRPC framework's networking library to keep latency low while streaming observations to the Learner at every environment step. This lets SEED RL handle up to a million queries per second on a single machine.
The Learner can scale to thousands of cores, and the number of Actors to thousands of machines, achieving training speeds of millions of frames per second.
SEED RL incorporates two state-of-the-art algorithms: V-trace and R2D2.
V-trace learns a distribution over actions from sampled actions, while R2D2 selects an action based on the predicted future value of that action.
V-trace is a policy-gradient-based method first adopted in IMPALA. Because Actors and the Learner run asynchronously, the data an Actor collects can lag behind the Learner's current policy; V-trace corrects for this lag, which makes it work well in asynchronous architectures.
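For reference, the n-step V-trace value target from the IMPALA paper, which corrects for the gap between the Actors' behavior policy μ and the Learner's current policy π:

```latex
% n-step V-trace target for V(x_s), in the notation of the IMPALA paper
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s}
      \Big( \prod_{i=s}^{t-1} c_i \Big) \delta_t V,
\qquad
\delta_t V = \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big),
% with truncated importance-sampling weights
\rho_t = \min\!\Big( \bar\rho,\; \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)} \Big),
\qquad
c_i = \min\!\Big( \bar c,\; \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \Big).
```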
The second algorithm is R2D2, a Q-learning method with which DeepMind improved the performance of reinforcement learning agents on Atari games by a factor of 4, surpassing human performance on 52 of the 57 games.
This approach allows the Q-learning algorithm to run on large-scale hardware while still using recurrent neural networks (RNNs).
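For reference, the n-step double Q-learning target used in the R2D2 paper, which rescales values with an invertible function h to keep them in a learnable range:

```latex
% R2D2's n-step double Q-learning target with invertible value rescaling h
\hat{y}_t = h\!\Big( \sum_{k=0}^{n-1} \gamma^{k} r_{t+k}
            + \gamma^{n} \, h^{-1}\!\big( Q(s_{t+n}, a^{*}; \theta^{-}) \big) \Big),
\qquad
a^{*} = \arg\max_a Q(s_{t+n}, a; \theta),
% where the rescaling function is
h(x) = \operatorname{sign}(x)\big( \sqrt{|x|+1} - 1 \big) + \varepsilon x .
```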
Experimental Results
Google ran its benchmark on Google Research Football, an open-source football (soccer) research environment recently released by Google Research.
Using 64 Cloud TPU cores, SEED RL processed 2.4 million frames per second, an 80x improvement over the previous state-of-the-art distributed agent, IMPALA.
To reach the same speed, IMPALA would need 14,000 CPUs, while SEED RL uses only 4,160; for the same throughput, IMPALA requires 3 to 4 times as many CPUs as SEED RL (14,000 / 4,160 ≈ 3.4).
Because the hardware accelerators now handle the parallel computation efficiently, the model size can be increased without hesitation.
In the football task above, increasing the model size and input resolution solved some previously unsolved difficulties and greatly improved training efficiency.
Portal
Paper address:
https://arxiv.org/abs/1910.06591
GitHub address:
https://github.com/google-research/seed_rl
-over-