Industry News

Is AlphaZero worth playing?

2018-06-02

DeepMind, an artificial intelligence company owned by Google, has released a new paper describing how the team built on AlphaGo's machine learning system to create a new project called AlphaZero. AlphaZero relies on an AI technique called reinforcement learning: given only the basic rules and no human experience, it trains from scratch, and it has swept the field of board game AI.

AlphaZero first conquered Go and then went on to dominate other board games. Under the same conditions, after 8 hours of training it defeated the first AI to beat a top human player, the version of AlphaGo that played Lee Sedol; after 4 hours of training it defeated Stockfish, the strongest chess AI; and after 2 hours it defeated Elmo, the strongest shogi (Japanese chess) AI. Even the strongest Go player, AlphaGo Zero, was not spared: after 34 hours of training, AlphaZero beat an AlphaGo Zero that had trained for 72 hours.

Chart/Number of wins, draws, and losses from AlphaZero's perspective (from the DeepMind team's paper)

Reinforcement learning is so powerful. What is it?

Adit Deshpande, a well-known AI blogger from the University of California, Los Angeles (UCLA), published a series of articles called Deep Learning Research Review on his blog, one of which explains the technology behind AlphaGo's victory. In that article he explains that machine learning can be divided into three categories: supervised learning, unsupervised learning, and reinforcement learning. Reinforcement learning learns which actions to take in different situations or environments in order to achieve the best result.

Photo/Adit Deshpande's blog Deep Learning Research Review Week 2: Reinforcement Learning

Imagine a small robot in a small room. We have not programmed this robot to move, walk, or take any action; it just stands there. We want it to move to one corner of the room. It earns reward points when it gets there and loses points for every step it takes. We want the robot to reach the designated location as quickly as possible, and it can move in four directions: east, south, west, and north. The robot's reasoning is actually very simple. Which position is most valuable? The designated location, of course. To collect the greatest reward, the robot simply has to choose the actions that maximize value.
Photo/Adit Deshpande's blog Deep Learning Research Review Week 2: Reinforcement Learning
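
The robot example can be sketched in a few lines of Python. This is only an illustration under assumed numbers, not code from the blog or from DeepMind: the 4x4 room, the reward of 10 for reaching the corner, the penalty of 1 per step, and the discount factor of 0.9 are all hypothetical. The sketch uses value iteration, a standard way to compute how valuable each position is, and then lets the robot greedily follow the most valuable moves.

```python
# Value iteration on a tiny grid world. All numbers are illustrative assumptions.
SIZE = 4              # hypothetical 4x4 room
GOAL = (0, 0)         # the rewarded corner, as (row, col)
GOAL_REWARD = 10.0    # assumed reward for reaching the corner
STEP_PENALTY = -1.0   # assumed cost of every other move
GAMMA = 0.9           # assumed discount factor

# Rows grow southward, columns grow eastward.
MOVES = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}


def step(state, action):
    """Return (next_state, reward); walking into a wall leaves the robot in place."""
    dr, dc = MOVES[action]
    row = min(max(state[0] + dr, 0), SIZE - 1)
    col = min(max(state[1] + dc, 0), SIZE - 1)
    nxt = (row, col)
    return nxt, (GOAL_REWARD if nxt == GOAL else STEP_PENALTY)


def value_iteration(tol=1e-6):
    """Compute the value of every cell when the robot acts optimally."""
    values = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
    while True:
        delta = 0.0
        for s in values:
            if s == GOAL:
                continue  # the goal is terminal, so its value stays 0
            best = max(rew + GAMMA * values[nxt]
                       for nxt, rew in (step(s, a) for a in MOVES))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_action(values, state):
    """Pick the move with the largest reward plus discounted next-cell value."""
    def score(action):
        nxt, rew = step(state, action)
        return rew + GAMMA * values[nxt]
    return max(MOVES, key=score)


def rollout(values, start):
    """Follow the greedy moves from start until the goal is reached."""
    state, path = start, [start]
    while state != GOAL and len(path) <= SIZE * SIZE:
        state, _ = step(state, greedy_action(values, state))
        path.append(state)
    return path


values = value_iteration()
path = rollout(values, (3, 3))  # start in the opposite corner
print(path)
```

Cells next to the goal end up with the highest values, and the values fall off with distance, so always moving toward the higher-valued neighbor takes the robot to the corner by a shortest route.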

What is the value of AlphaZero's domination of board games?

AlphaGo Zero was a breakthrough. Is AlphaZero one as well? Experts abroad point to four technical breakthroughs in the latter:

First, AlphaGo Zero optimizes for the probability of winning and considers only two outcomes, win and loss, whereas AlphaZero optimizes for the expected game outcome and so also accounts for possibilities such as a draw.

Second, AlphaGo Zero rotates and reflects the board during reinforcement learning, while AlphaZero does not. The Go board is symmetric, whereas chess and shogi boards are not, so AlphaZero is more general.

Third, AlphaGo Zero keeps selecting the version with the best win rate and replacing the previous one with it, while AlphaZero continually updates a single neural network, reducing the risk of training toward a bad result.

Fourth, the hyperparameters of AlphaGo Zero's search were obtained through Bayesian optimization, and their selection has a great influence on the results. AlphaZero reuses the same hyperparameters for all games, so no game-specific tuning is needed.
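
The first two points can be made concrete with a small sketch. It is an illustration only, not DeepMind's code: the function names and the sample results are hypothetical. The first part scores a set of games by win probability alone and by expected outcome (win = +1, draw = 0, loss = -1). The second part generates the 8 rotations and reflections of a square board, the kind of symmetry that holds for Go but not for chess or shogi.

```python
# Part 1: win probability versus expected outcome (hypothetical sample results).
OUTCOME_VALUE = {"win": 1.0, "draw": 0.0, "loss": -1.0}


def win_probability(results):
    """Fraction of games won; draws are indistinguishable from losses here."""
    return sum(r == "win" for r in results) / len(results)


def expected_outcome(results):
    """Mean of +1 / 0 / -1 over the games, so draws count as a middle result."""
    return sum(OUTCOME_VALUE[r] for r in results) / len(results)


# Part 2: the 8 symmetries (4 rotations, each with its mirror image) of a board.
def rotate90(board):
    """Rotate a square board, given as a list of rows, 90 degrees clockwise."""
    return [list(row) for row in zip(*board[::-1])]


def mirror(board):
    """Reflect a board left to right."""
    return [list(row[::-1]) for row in board]


def symmetries(board):
    """Return all 8 rotated and reflected copies of a square board."""
    copies = []
    current = [list(row) for row in board]
    for _ in range(4):
        copies.append(current)
        copies.append(mirror(current))
        current = rotate90(current)
    return copies


mostly_draws = ["win", "draw", "draw", "draw"]   # made-up results
mostly_losses = ["win", "loss", "loss", "loss"]  # made-up results
print(win_probability(mostly_draws), win_probability(mostly_losses))
print(expected_outcome(mostly_draws), expected_outcome(mostly_losses))

tiny_board = [[1, 2], [3, 4]]  # stand-in for a real position
print(len(symmetries(tiny_board)))
```

Both sample records have the same win probability, but their expected outcomes differ, which is why the distinction matters in games where draws are common. For a game whose rules are unchanged by rotation and reflection, each position yields 8 equivalent training examples; for chess and shogi that shortcut is not available.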

Tu Weiwei, a senior machine learning architect at 4Paradigm, told GeekPark that AlphaZero has both breakthroughs and limitations:

First, the core of DeepMind's paper is to demonstrate the generality of the AlphaGo Zero approach on board game problems; the method itself has no special highlights. AlphaZero is essentially the AlphaGo Zero approach extended from Go to other similar board games, where it beats the board game AIs built on other techniques, which were previously the best.

Second, AlphaZero is a "universal" engine only for similar board games, that is, well-defined games of perfect information. It will still face difficulties with other, more complex problems.

Earlier, when Sun Jian of Megvii interpreted AlphaGo Zero, he said, "Even if reinforcement learning can be extended to many other fields, it is not so easy to use in the real world. For example, reinforcement learning can be used to research new drugs. The structure of a new drug has to be searched for. After the search, the drug has to be made, and then it has to be tested to see whether it is really effective. This closed loop is very expensive and very slow. It is very difficult to make it as simple as playing chess."

Third, AlphaZero still needs a great deal of computing resources to solve the relatively "simple" board game problem, and the cost is very high. According to GeekPark, DeepMind stated in the paper that it used 5,000 first-generation TPUs to generate self-play games and 64 second-generation TPUs to train the neural networks. Some experts previously told the media that although the TPU's performance is impressive, its cost is very high. An investor at an international venture capital firm also posted about it on social media, writing in part: "This expensive chip, I just look at..."

Fourth, the current AlphaZero may still be some distance from being a "Go God." Beating humans does not make it a god. The current network structure and training strategy are not necessarily optimal, and this is worth further study.

Although AlphaZero has certain limitations, its application scenarios are worth exploring. In the effort to make machine learning more general, there are many other research areas worth attention, such as AutoML and transfer learning. At the same time, how to obtain a more general AI engine at lower cost (both computational cost and the cost of domain experts), and so make AI more valuable in practical applications, also deserves attention.

Didi Chuxing is one such application area. According to GeekPark, Didi uses artificial intelligence to match drivers and passengers. It took a great deal of technical optimization to move from assigning cars by straight-line distance, which can be unreasonable (the line may cross a river), to assigning the car that will take the least time to reach the passenger. The team has also run into problems and worked hard on them: technologies such as GPU clusters can be used when training AI systems, but matching drivers and passengers has to happen in real time on reduced hardware, so how to preserve accuracy is an issue the researchers have been exploring.
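
The difference between the two matching rules can be shown with a toy sketch. It is not Didi's system: the driver names, coordinates, and pickup times below are invented, and in a real service the pickup time would come from a routing or prediction model rather than a fixed table.

```python
import math

# Hypothetical data: positions in km on a flat map, pickup times in minutes.
passenger = (0.0, 0.0)
drivers = {
    "driver_a": {"pos": (0.0, 0.5), "pickup_minutes": 18.0},  # close, but across a river
    "driver_b": {"pos": (2.0, 1.0), "pickup_minutes": 6.0},   # farther, same side
}


def straight_line_km(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def match_by_distance(passenger_pos, candidates):
    """Pick the driver with the shortest straight-line distance."""
    return min(candidates,
               key=lambda name: straight_line_km(passenger_pos, candidates[name]["pos"]))


def match_by_pickup_time(candidates):
    """Pick the driver with the shortest estimated pickup time."""
    return min(candidates, key=lambda name: candidates[name]["pickup_minutes"])


print(match_by_distance(passenger, drivers))
print(match_by_pickup_time(drivers))
```

With these made-up numbers the straight-line rule picks the driver on the far bank, while the time-based rule picks the driver who can actually arrive sooner.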

Even so, Tu Weiwei affirmed DeepMind's efforts in the direction of "artificial general intelligence."