In a paper published on the preprint server Arxiv.org, scientists at Alphabet’s DeepMind propose a new framework that learns an approximate best response to players within games of many kinds. They claim that it achieves consistently high performance against “worst-case opponents” — that is, players who aren’t good, yet at least play by the rules and actually complete the game — in a number of games including chess, Go, and Texas Hold’em.
DeepMind CEO Demis Hassabis often asserts that games are a convenient proving ground to develop algorithms that can be translated into the real world to work on challenging problems. Innovations like this new framework, then, could lay the groundwork for artificial general intelligence (AGI), which is the holy grail of AI — a decision-making AI system that automatically completes not only mundane, repetitive enterprise tasks like data entry, but which reasons about its environment. That’s the long-term goal of other research institutions, like OpenAI.
The level of performance against players is known as exploitability. Computing that exploitability is often computationally intensive because the number of actions players might take is so large. For example, one variant of Texas Hold’em — Heads-Up Limit Texas Hold’em — has roughly 1014 decision points, while Go has approximately 10170. One way to get around this is with a policy that can exploit a player to be evaluated, using reinforcement learning — an AI training technique that spurs software agents to complete goals via a system rewards — to compute the best response.
The framework the DeepMind researchers propose, which they call Approximate Best Response Information State Monte Carlo Tree Search (ABR IS-MCTS), approximates an exact best response on an information-state basis. Actors within the framework follow an algorithm to play a game while a learner derives information from various game outcomes to train a policy. Intuitively, ABR IS-MCTS tries to learn a strategy that, when the exploiter is given unlimited access to the strategy of the opponent, can create a valid and exploiting counterstrategy; it simulates what would happen if someone trained for years to exploit the opponent.
The researchers report that in experiments involving 200 actors (trained on a PC with 4 processors and 8GB of RAM) and a learner (10 processors and 20GB of RAM), ABR IS-MCTS achieved a win rate above 50% in every game it played and a rate above 70% in games other than Hex or Go (like Connect Four and Breakthrough). In backgammon, it won 80% of the time after training for 1 million episodes.
The coauthors say they see evidence of “substantial learning” in that when the actors’ learning steps are restricted, they tend to perform worse even after 100,000 episodes of training. They also note, however, that ABR IS-MCTS is quite slow in certain contexts, taking on average 150 seconds to calculate the exploitability of a particular kind of strategy (UniformRandom) in Kuhn poker, a simplified form of two-player poker.
Future work will involve extending the method to even more complex games.