Nah that's off base in a number of ways. The process is:
- It seeds its evaluation function with random weightings. This will produce random evaluations of positions and hence random play.
- In each position, it does a Monte-Carlo tree search, using its evaluation function to evaluate future positions.
- When the game ends, if it lost, it adjusts its weightings so that its evaluation function evaluates the positions from that game more negatively, and the opposite if it won.
- The evaluation function is now very slightly better than random. It plays another game.
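For flavour, here's a very rough sketch of that loop in Python. Everything in it (new_game, mcts_move, train_step, the game object's methods) is a placeholder of my own, not anything from the paper, and the real thing batches its updates and trains move probabilities as well as the value, but the shape of the loop is the same:

Code:
def self_play_training(net, new_game, mcts_move, train_step, num_games=10_000):
    """Rough sketch of the self-play loop described above (all names are placeholders)."""
    for _ in range(num_games):
        game = new_game()
        positions = []
        while not game.is_over():
            move = mcts_move(game, net)   # tree search guided by the current network
            positions.append(game.copy())
            game.play(move)
        z = game.result()                 # +1 win / 0 draw / -1 loss
        for position in positions:
            # Nudge the weights so the evaluation of each position in this game
            # moves toward the actual result: more negative after a loss, more
            # positive after a win. (Side-to-move sign handling omitted.)
            train_step(net, position, z)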
It doesn't learn heuristic rules in the way you suggest; it simply improves its evaluation function gradually over time. It isn't choosing between applying a rule and playing a random move: it just applies its evaluation function every time, and over time that function gradually becomes less random. It's likely that material advantage is one of the first things it learns to value, because across the set of all possible chess positions, material is one of the most obvious markers of a winning position. But as I say, it's not a heuristic; it's implicit in the weightings of the network, which is difficult to explain in a way that is conceptual rather than mathematical.

The only way to figure out what it is valuing is to change positions slightly and see how its evaluation changes. If you gave it a whole pile of quiet positions, removed a knight from one side and a bishop from the other, and asked it to re-evaluate each one, you could figure out what it thinks those pieces are worth relative to each other on average - but it's only an average.
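To make that concrete, here's the kind of probe I mean (entirely my own sketch; evaluate would be the trained network's scalar evaluation from White's point of view, and quiet_positions a pile of quiet positions as FEN strings - both stand-ins, not real APIs):

Code:
import chess

def probe_minor_piece_tradeoff(evaluate, quiet_positions):
    """Remove a White knight and a Black bishop from each quiet position and
    see how the evaluation shifts on average. evaluate and quiet_positions
    are hypothetical stand-ins."""
    deltas = []
    for fen in quiet_positions:
        board = chess.Board(fen)
        knights = board.pieces(chess.KNIGHT, chess.WHITE)
        bishops = board.pieces(chess.BISHOP, chess.BLACK)
        if not knights or not bishops:
            continue
        before = evaluate(board)
        board.remove_piece_at(next(iter(knights)))   # White gives up a knight
        board.remove_piece_at(next(iter(bishops)))   # Black gives up a bishop
        deltas.append(evaluate(board) - before)
    # Positive average: the network rates the bishop above the knight on average;
    # negative: the other way round. Either way it's only an average.
    return sum(deltas) / len(deltas)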
I think e4 and d4 would be preferred quite early because the number of possible moves your pieces have is an easy metric and e4 and d4 open up the queen and bishop.
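That mobility point is easy to check with python-chess - just count how many moves White would have after each candidate first move (purely an illustration of the metric, nothing to do with A0's internals):

Code:
import chess

def white_mobility_after(first_move_san):
    """Count White's legal moves after playing the given first move,
    handing the turn straight back to White with a null move."""
    board = chess.Board()
    board.push_san(first_move_san)
    board.push(chess.Move.null())  # give the move back to White
    return board.legal_moves.count()

for move in ["e4", "d4", "Nf3", "a4"]:
    print(move, white_mobility_after(move))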
Quote:
It also teaches itself how to prune searches. It looks 20 moves ahead each move which means looking 1 move, 2 moves etc. ahead first. It knows that sometimes "This move looks really ****ty if you look 3 moves ahead, but it's the best move looking 20 moves ahead". But it also knows *under what conditions such sacrifice type plays tend to occur*. So it teaches itself when to consider sacrifice a piece and when not to, and it's much better at recognising this (and pruning) that current engines.
None of this makes any sense. All A0 ever learns is its network (the evaluation and the move probabilities); it has no capacity to change the tree search algorithm itself. I read the paper and have a better understanding of this now:
Quote:
Instead of a handcrafted evaluation function and move ordering heuristics, AlphaZero utilises a deep neural network (p, v) = fθ(s) with parameters θ. This neural network takes the board position s as an input and outputs a vector of move probabilities p with components pa = Pr(a|s) for each action a, and a scalar value v estimating the expected outcome z from position s, v ≈ E[z|s]. AlphaZero learns these move probabilities and value estimates entirely from self-play; these are then used to guide its search.
Instead of an alpha-beta search with domain-specific enhancements, AlphaZero uses a general purpose Monte-Carlo tree search (MCTS) algorithm. Each search consists of a series of simulated games of self-play that traverse a tree from root to leaf. Each simulation proceeds by selecting in each state s a move a with low visit count, high move probability and high value (averaged over the leaf states of simulations that selected a from s) according to the current neural network fθ. The search returns a vector π representing a probability distribution over moves, either proportionally or greedily with respect to the visit counts at the root state.
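So the network is just a single function from a position to a pair (move probabilities, value). A toy version of that interface in PyTorch would look something like this - the layer sizes and input encoding are made up for illustration, the real thing is a deep residual convnet over board planes:

Code:
import torch
import torch.nn as nn

class ToyPolicyValueNet(nn.Module):
    """Toy (p, v) = f_theta(s): a shared body with a policy head and a value head.
    Sizes here are placeholders, not AlphaZero's actual architecture."""
    def __init__(self, input_size=773, num_moves=4672):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_size, 256), nn.ReLU())
        self.policy_head = nn.Linear(256, num_moves)                    # logits over encoded moves
        self.value_head = nn.Sequential(nn.Linear(256, 1), nn.Tanh())   # v in [-1, 1]

    def forward(self, s):
        h = self.body(s)
        p = torch.softmax(self.policy_head(h), dim=-1)  # move probabilities
        v = self.value_head(h)                          # estimate of the expected outcome
        return p, v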
The following all concerns actual play, not learning:
Here's a super contrived example to try to explain this, to the best of my understanding. Let's say you have a position A with only 2 legal moves, 1 and 2. AlphaZero's initial evaluation of A is some scalar value, say +0.5, plus a vector of move probabilities, say { 0.75, 0.25 }. But let's say that move 2 leads to a reasonably easily found forced mate. Initially, when it does Monte Carlo rollouts, it selects move 1 75% of the time and move 2 25% of the time, but it keeps track of visit counts as well. So if it happens to pick move 1 ten times in a row, move 2's low visit count boosts its chance of selection, making it more and more likely to be chosen. The exact details of the move selection algorithm aren't in the preprint, but will presumably be in the full paper.
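For what it's worth, the earlier AlphaGo Zero paper does spell out a selection rule (PUCT) that trades off exactly those three things - prior probability, visit count and average value. Assuming AlphaZero reuses it, it looks roughly like this (variable names and the constant are mine):

Code:
import math

def select_move(Q, P, N, c_puct=1.5):
    """PUCT-style selection at one node.
    Q: move -> mean value of simulations through that move
    P: move -> prior probability from the network
    N: move -> visit count
    High value, high prior and LOW visit count all push a move's score up."""
    total = sum(N.values())
    def score(move):
        # exploration bonus (+1 so the priors matter before any visits)
        u = c_puct * P[move] * math.sqrt(total + 1) / (1 + N[move])
        return Q[move] + u
    return max(P, key=score)

# The position A example: move 1 has been tried 10 times, move 2 only twice,
# so move 2's exploration bonus now outweighs move 1's bigger prior.
print(select_move(Q={"1": 0.0, "2": 0.0},
                  P={"1": 0.75, "2": 0.25},
                  N={"1": 10, "2": 2}))   # -> "2"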
Each time it selects a move, it plays out a full game from that point, using the same process recursively. The wins and losses in these games form the basis of a score it gives to that move - what they call "value" in the quote above. Over time, it will notice that move 2 is actually getting a really great value. That then becomes a basis for choosing it more often during the Monte Carlo rollouts, basically to investigate more thoroughly and see whether the value is legit. (That's what the quote means when it lists the three criteria for which moves are selected during MCTS as "low visit count, high move probability and high value" - exactly how these criteria are balanced against each other is unclear from the preprint.) In turn, that leads to position A's value rapidly improving.

You can see how, if position A was actually a bit further down the search tree, the good news about the forced win after move 2 would propagate up the tree, gradually making it more and more likely that the sequence of moves leading to it gets selected. So basically AlphaZero makes an initial guess about which moves are a good idea, that guess guides a series of self-play games, and its opinion of which moves are a good idea is modified based on the results of those games. Those modifications gradually propagate from the leaves of the tree back to the root.
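To make the "propagates from the leaves back to the root" part concrete, here's a minimal sketch of the backup step, with my own names (Node, N, Q); the mechanics are the same whether the leaf result comes from a played-out game or from a network evaluation:

Code:
from dataclasses import dataclass, field

@dataclass
class Node:
    # Per-move statistics at one tree node (my names, not the paper's)
    N: dict = field(default_factory=dict)  # move -> visit count
    Q: dict = field(default_factory=dict)  # move -> running mean value

def backup(path, leaf_value):
    """Propagate a leaf result back along the (node, move) pairs that were
    followed from the root. The sign flips each ply, because a position
    that's good for one player is bad for the other."""
    value = leaf_value
    for node, move in reversed(path):
        value = -value
        n = node.N.get(move, 0) + 1
        q = node.Q.get(move, 0.0)
        node.N[move] = n
        node.Q[move] = q + (value - q) / n  # running mean of values seen through this move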