3.2 Falsification with Monte Carlo Tree Search
3.2.1 Monte Carlo Tree Search
38 Chapter 3. Balancing Exploration and Exploitation Using MCTS
Figure 3.4: An example of the workflow of MCTS
who beat the world champion Lee Sedol1, leads to prevalence of the algorithm.
MCTS has a long history, but nowadays it mainly refers to theUpper Confidence Bounds applied to Trees (UCT)[93] algorithm. It concerns with a problem of search in a huge tree. In the basic setting, the tree yields a huge search space, but its number of nodes is still finite. The aim of MCTS is to select the most promising child of the root node to move, with which the search can gain the best reward. Here, the definition of reward depends on the problem context. For example, in the Go game, reward is correlated to the result of the game, i.e., winning or losing. In the following paragraphs, we use the example in Fig. 3.42to elaborate on how MCTS works. In the figure, there is a fraction attached with each node; the denominator is the number of total visits to that node, while the numerator is the number of winnings. Therefore, the fractions represent the rewards of nodes.
Selection The first step is to select a child node to reason about. Initially, only the root node is available; when there are several children that have already been expanded, the selection is based on theUpper Confidence Bound version 1 (UCB1)[95] algorithm. It selects the best candidate according to the following equation:
abest =arg max
a∈A
©
« Ra +c
s 2 lnN
Na
ª
®
¬
1https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
2The figure is from https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
3.2 Falsification with Monte Carlo Tree Search 39
whereAis the set of children,Ra is the reward of childa,cis a scalar parameter, and N,Na are the numbers of visits to the parent and childarespectively. This algorithm exemplifies the balance between exploration and exploitation: on the one hand, if a child holds a good reward, that one is more likely to be exploited further; on the other hand, if a child has not been visited much often, it will get a better chance to be explored in the future. This balance can also be adjusted by tuningc, so that the user can impose a bias on either exploration or exploitation. For example in Fig. 3.4, when the search visits the root, there are three children to be selected; by applying UCB1, it selects the one with the best reward 7/10.
Expansion If all the children of a selected node have been expanded, then it just continues selection; otherwise, an unexpanded child will be expanded. This is usually done by choosing randomly from the list of the unexpanded children. Then, the expanded child will be initialized with the reward 0/0, as shown in Fig. 3.4.
Simulation (a.k.a. Playout or Rollout) Simulation aims to evaluate a node that has just been expanded. This is by applying Monte Carlo methods: firstly, it collects the sequence of nodes from the root to the node just expanded; then this sequence is concatenated by the nodes randomly chosen from the lower layers of the tree, ending with a leaf node. The sequence can be evaluated according to the context. For example, in a board game, this sequence identifies a way how a player plays the game, and thus a result, either “winning” or “losing”, comes out. In the case of “winning”, we update the numerator of the reward with 1; otherwise, it will be 0. In both cases, the number of visits will be updated to 1. In the example of Fig. 3.4, it loses the game, so the reward is updated to 0/1.
IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 4, NO. 1, MARCH 2012 10
Fig. 3. Asymmetric tree growth [68].
any moment in time; allowing the algorithm to run for additional iterations often improves the result.
It is possible to approximate an anytime version of minimax using iterative deepening. However, the gran- ularity of progress is much coarser as an entire ply is added to the tree on each iteration.
3.4.3 Asymmetric
The tree selection allows the algorithm to favour more promising nodes (without allowing the selection proba- bility of the other nodes to converge to zero), leading to an asymmetric tree over time. In other words, the building of the partial tree is skewed towards more promising and thus more important regions. Figure 3 from [68] shows asymmetric tree growth using the BAST variation of MCTS (4.2).
The tree shape that emerges can even be used to gain a better understanding about the game itself. For instance, Williams [231] demonstrates that shape analysis applied to trees generated during UCT search can be used to distinguish between playable and unplayable games.
3.5 Comparison with Other Algorithms When faced with a problem, the a priori choice between MCTS and minimax may be difficult. If the game tree is of nontrivial size and no reliable heuristic exists for the game of interest, minimax is unsuitable but MCTS is applicable (3.4.1). If domain-specific knowledge is readily available, on the other hand, both algorithms may be viable approaches.
However, as pointed out by Ramanujan et al. [164], MCTS approaches to games such as Chess are not as successful as for games such as Go. They consider a class of synthetic spaces in which UCT significantly outperforms minimax. In particular, the model produces bounded trees where there is exactly one optimal action per state; sub-optimal choices are penalised with a fixed additive cost. The systematic construction of the tree
ensures that the true minimax values are known.12In this domain, UCT clearly outperforms minimax and the gap in performance increases with tree depth.
Ramanujan et al. [162] argue that UCT performs poorly in domains with manytrap states(states that lead to losses within a small number of moves), whereas iter- ative deepening minimax performs relatively well. Trap states are common in Chess but relatively uncommon in Go, which may go some way towards explaining the algorithms’ relative performance in those games.
3.6 Terminology
The terms MCTS and UCT are used in a variety of ways in the literature, sometimes inconsistently, poten- tially leading to confusion regarding the specifics of the algorithm referred to. For the remainder of this survey, we adhere to the following meanings:
•Flat Monte Carlo: A Monte Carlo method with uniform move selection and no tree growth.
•Flat UCB: A Monte Carlo method with bandit-based move selection (2.4) but no tree growth.
•MCTS: A Monte Carlo method that builds a tree to inform its policy online.
•UCT: MCTS with any UCB tree selection policy.
•Plain UCT: MCTS with UCB1 as proposed by Kocsis and Szepesv´ari [119], [120].
In other words, “plain UCT” refers to the specific algo- rithm proposed by Kocsis and Szepesv´ari, whereas the other terms refer more broadly to families of algorithms.
4 VARIATIONS
Traditional game AI research focusses on zero-sum games with two players, alternating turns, discrete ac- tion spaces, deterministic state transitions and perfect information. While MCTS has been applied extensively to such games, it has also been applied to other domain types such as single-player games and planning prob- lems, multi-player games, real-time games, and games with uncertainty or simultaneous moves. This section describes the ways in which MCTS has been adapted to these domains, in addition to algorithms that adopt ideas from MCTS without adhering strictly to its outline.
4.1 Flat UCB
Coquelin and Munos [68] proposeflat UCBwhich effec- tively treats the leaves of the search tree as a single multi- armed bandit problem. This is distinct fromflat Monte Carlosearch (2.3) in which the actions for a given state are uniformly sampled and no tree is built. Coquelin and Munos [68] demonstrate that flat UCB retains the adaptivity of standard UCT while improving its regret bounds in certain worst cases where UCT is overly optimistic.
12. This is related to P-game trees (7.3).
input signalu robustnessJM(u), 'K
u(2)u(1)
· · · u(i) hill-clim bing
u(i+1)?
Figure 3.5: Asymmetric tree growth
Backpropagation In this step, the information ob- tained from simulation is used to update the nodes along the path from the root to the expanded node.
Namely, for all the nodes along that path, their de- nominators of rewards will be added by 1, and their numerators of rewards will be added by 0 or 1 depend- ing on the simulation result.
40 Chapter 3. Balancing Exploration and Exploitation Using MCTS
MCTS repeats this loop for many times, until reaching the given budget. Then, it can select the best child of the root that has the best reward as the output of the algorithm. The generated tree in the end is an asymmetric tree, as the one shown in Fig. 3.51: it has rich branches in those promising areas; meanwhile it does take care of other areas, though barely.
Although MCTS is an effective search method, it is not trivial to apply it in the falsification context. There are several challenges as listed below:
• Falsification reasons about an infinite search space, which is a different setting from the original one of MCTS. Therefore, it is a problem how to build up the search tree.
• Intuitively, the concept of “winning” in falsification may refer to finding a falsifying input. However, the latter is the ultimate goal in falsification, which is rare and hard to reach. Therefore, rewards must be redefined.
• Following up the last item, it is also a challenge to perform simulation (playout) in this context in order to evaluate each node.
In the following sections, we address these issues with the proposal of our two- layered optimization framework.