

Google Brain AI quickly unlocks Atari, training takes less than two hours: prediction capabilities are "unprecedented"

Last updated: 2019-03-07
Li Zilan posted from Aofei Temple
Quantum Bit Report | Public Account QbitAI

It’s not two hours of training, it’s the equivalent of two hours of human play.

For AI to play a game, does it have to play hundreds of thousands or millions of games before it can learn it?

Google Brain has built a targeted, efficient learning environment for its reinforcement-learning agent: SimPLe (Simulated Policy Learning), a simulator based on video prediction.

The team says the simulator's predictive capability is unprecedented, at times matching the real game frame for frame:

Left: the simulator's prediction; middle: the ground truth; right: the difference between them.

With it, the agent's learning burden drops sharply: training equivalent to about two hours of human play is enough to unlock Atari games.

Compared with its strong model-free predecessor Rainbow, the model-based SimPLe needs an order of magnitude fewer environment interactions to reach the same performance.

Pong, 21:0

What kind of simulator?

Unlike many of its game-AI predecessors, SimPLe's agent is not trained in the real game.

Its policy is developed entirely inside the simulator.

Inside the simulator sits a video prediction model that predicts an outcome for every action the agent takes. In Google tradition, it is also called a World Model.

Why do we need this model?

In many Atari games, it is difficult to obtain sufficiently diverse data through random exploration:

There are some places the agent may not have been to, and there are some actions the agent may not have performed.

Without rich environmental data, AI cannot learn more effectively.

Therefore, to allow the agent to explore the world in a more efficient way:

The team used an iterative process that alternates between three phases: data collection, model training, and policy training.

In this way, the agent's policy keeps improving, and the simulator's predictive power grows with it.

The two reinforce each other, so the agent can unlock game skills faster without resorting to blind trial and error.
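The alternating loop described above can be sketched on a toy problem. This is a minimal, illustrative stand-in, not the paper's code: the "environment" is a 1-D counter with a constant drift, the "world model" fits that drift from real transitions, and the "policy" is trained purely against the learned model. All class and function names here are hypothetical.

```python
import random

class ToyEnv:
    """Toy 1-D environment standing in for Atari: next state = state + action + 1."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        self.state = self.state + action + 1
        return self.state

class WorldModel:
    """Learned simulator: assumes next = state + action + drift, and fits the drift."""
    def __init__(self):
        self.drift = 0.0
    def fit(self, transitions):
        if transitions:
            errors = [s2 - (s1 + a + self.drift) for s1, a, s2 in transitions]
            self.drift += sum(errors) / len(errors)
    def predict(self, state, action):
        return state + action + self.drift

class Policy:
    """One-step greedy policy trained entirely inside the learned model."""
    def __init__(self, actions=(-1, 1)):
        self.actions = actions
        self.best = actions[0]
    def improve(self, model, state=0):
        # pick the action whose *simulated* next state is highest (reward = state here)
        self.best = max(self.actions, key=lambda a: model.predict(state, a))
    def act(self):
        return self.best

def simple_loop(n_iterations=5, steps_per_iter=20, seed=0):
    rng = random.Random(seed)
    env, model, policy = ToyEnv(), WorldModel(), Policy()
    data = []
    for _ in range(n_iterations):
        for _ in range(steps_per_iter):          # 1. collect experience in the real env
            s = env.state
            a = rng.choice(policy.actions)
            s2 = env.step(a)
            data.append((s, a, s2))
        model.fit(data)                          # 2. train the world model on real data
        policy.improve(model)                    # 3. train the policy inside the model
    return model, policy
```

After a few iterations the model recovers the environment's drift and the policy converges on the higher-reward action, without the policy ever being trained on the real environment directly.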

How to predict?

After trying several different architectures, the team found that the best model was a feed-forward CNN that encodes a sequence of input frames using a stack of convolutions.

Given the actions taken by the agent, the model can decode the next frame through a stack of deconvolutions.
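As a rough sketch of how such a conv/deconv stack reshapes the input (the layer counts, frame size, and stride here are illustrative assumptions, not the paper's exact architecture), one can trace the spatial dimensions through an encoder of stride-2 convolutions and a mirrored decoder:

```python
def down(size, stride=2):
    """Spatial size after one stride-2 'same' convolution."""
    return (size + stride - 1) // stride

def encoder_decoder_shapes(h=105, w=80, n_layers=3):
    """Trace frame shapes through a conv stack and a mirrored deconv stack.
    105x80 is an illustrative downsampled Atari frame, not the paper's exact size."""
    dims = [(h, w)]
    trace = [(h, w)]
    for _ in range(n_layers):              # encoder: stack of convolutions
        h, w = down(h), down(w)
        dims.append((h, w))
        trace.append((h, w))
    for h, w in reversed(dims[:-1]):       # decoder: deconvolutions mirror the encoder
        trace.append((h, w))
    return trace
```

The trace shrinks the frame down to a small bottleneck and expands it back to full resolution, which is where the decoded next frame comes out.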

The researchers also found that introducing stochasticity into the model is very effective, allowing the policy to be trained on a richer range of scenarios.

Concretely, they introduce a latent variable and add its samples to the bottleneck representation.

In this study's setting, discrete latent variables worked best, encoded as sequences of bits.

It is a bit like a variational autoencoder: the posterior of the latent variable is approximated based on the entire sequence;

A value sampled from this posterior is then combined with the input frames and the agent's action to predict the next frame.
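The bit-sequence latent can be sketched in miniature. This is a hedged illustration of the idea, not the paper's implementation: the "posterior network" is replaced by a trivial function of the whole sequence, and the sampled bits are simply appended to the bottleneck vector. All names are hypothetical.

```python
import random

def approximate_posterior(frame_sequence, n_bits=4):
    """Illustrative stand-in for the posterior network: per-bit Bernoulli
    probabilities computed from the *entire* sequence (here just its sum)."""
    s = sum(frame_sequence)
    return [1.0 if (s >> i) & 1 else 0.0 for i in range(n_bits)]

def sample_bits(probs, rng):
    """Draw one bit per position from its Bernoulli probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def inject_latent(bottleneck, bits):
    """Append the sampled bits to the bottleneck representation; the decoder
    would then use this (together with the action) to predict the next frame."""
    return bottleneck + [float(b) for b in bits]
```

Because the bits are discrete, the model can commit to one of several distinct futures rather than blurring them together, which is the point of adding stochasticity.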

The end result is a Stochastic Discrete Model.

Promising results

Note that the team did not specifically tune the model or hyperparameters for different Atari games.

AI Player Performance

Over the course of training, the environment advanced 400,000 frames, yet the agent interacted with it only 100,000 times: the equivalent of about two hours of human play.

Already, the AI can beat its opponent decisively at Pong, and may even have found a weakness in the built-in opponent:

Even more interesting is the " Freeway " game.

It looks simple, but it demands a lot of exploration.

Here the agent is a chicken whose progress is slow because it keeps getting hit by cars.

Successfully crossing the road is hard, so reward is almost never observed.

Yet SimPLe captures such rare events, internalizes them into the prediction model, and learns a strong policy.

By comparison, in this road-crossing game SimPLe needed an order of magnitude fewer environment interactions than its predecessor Rainbow to achieve the same result.

And in most games, SimPLe needed less than half as many environment interactions as Rainbow.

Prediction Star

The simulator’s predictions played a major role in achieving this result.

The team found many perfectly predicted clips in the AI's gameplay videos, the longest reaching 50 time steps.

For example, in the road-crossing scene there are 11 consecutive seconds in which every frame the model predicts matches the ground truth exactly.

Similar clips were also found in Pong and Breakout.

The team says that extending this horizon of perfect prediction would be a worthwhile research direction.

There were also difficulties

In some games, the predictive model simply cannot generate useful predictions.

The most common cause, the researchers say, is very small objects that nonetheless decide the player's fate:

In Atlantis and Battle Zone, for example, bullets are small and fleeting.

To make such objects noticeable, the team suggests letting the video prediction model observe the game at a slower speed and higher resolution.

Paper:
https://arxiv.org/abs/1903.00374


-over-



