Google's AlphaZero AI Learns Chess in Four Hours; Then Beats Stockfish 28-0-72

12-29-2017 , 11:55 PM
Nah that's off base in a number of ways. The process is:

- It seeds its evaluation function with random weightings. This will produce random evaluations of positions and hence random play.
- In each position, it does a Monte-Carlo tree search, using its evaluation function to evaluate future positions.
- When the game ends, if it lost, it alters its weightings such that its evaluation function will evaluate the positions in the game more negatively, and the opposite if it wins.
- The evaluation function is now very slightly better than random. It plays another game.

It doesn't learn heuristic rules in the way you suggest; it simply improves its evaluation function gradually over time. It isn't choosing between applying rules and playing a random move, either; it applies its evaluation function every time, and over time that evaluation gradually becomes less random. It's likely that material advantage is one of the first things it learns to value, because in the set of all possible chess positions, that's one of the most obvious markers of a winning position. But as I say, it's not a heuristic; it's inherent in the weightings of the network. It's difficult to explain that in a way that is conceptual rather than mathematical. The only way to figure out what it is valuing is to change positions slightly and see how its evaluation changes. If you gave it a whole pile of quiet positions, removed a knight from one side and a bishop from the other, and asked it to re-evaluate, you could figure out what it thinks those pieces are worth relative to each other on average, but it's only an average.
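To make that concrete, here's a rough Python sketch of the loop, to the best of my understanding. Everything in it (Game, mcts_move, network.update) is an illustrative placeholder, not anything from DeepMind's code; it just shows the shape of the cycle: play a game guided by the current network, then nudge the network's evaluations toward the actual result.

Code:
# Sketch of the self-play training loop described above.
# All names here are hypothetical placeholders, not DeepMind's API.

def self_play_training(network, num_games):
    for _ in range(num_games):
        game, seen = Game(), []              # placeholder game wrapper
        while not game.is_over():
            move = mcts_move(game, network)  # tree search guided by current net
            seen.append(game.state())
            game.play(move)
        z = game.result()                    # +1 win / 0 draw / -1 loss
        for state in seen:
            # Shift the weights so the network's value for each position
            # in this game moves a little toward the actual outcome z.
            network.update(state, target_value=z)
    return network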

I think e4 and d4 would be preferred quite early because the number of possible moves your pieces have is an easy metric and e4 and d4 open up the queen and bishop.

Quote:
It also teaches itself how to prune searches. It looks 20 moves ahead each move, which means looking 1 move, 2 moves etc. ahead first. It knows that sometimes "This move looks really ****ty if you look 3 moves ahead, but it's the best move looking 20 moves ahead". But it also knows *under what conditions such sacrifice-type plays tend to occur*. So it teaches itself when to consider sacrificing a piece and when not to, and it's much better at recognising this (and pruning) than current engines.
None of this makes any sense. All A0 ever changes is its evaluation function. It has no capacity to change anything about the way it does its tree search. I read the paper and have a better understanding of this now:

Quote:
Instead of a handcrafted evaluation function and move ordering heuristics, AlphaZero utilises a deep neural network (p, v) = f_θ(s) with parameters θ. This neural network takes the board position s as an input and outputs a vector of move probabilities p with components p_a = Pr(a|s) for each action a, and a scalar value v estimating the expected outcome z from position s, v ≈ E[z|s]. AlphaZero learns these move probabilities and value estimates entirely from self-play; these are then used to guide its search.

Instead of an alpha-beta search with domain-specific enhancements, AlphaZero uses a general-purpose Monte-Carlo tree search (MCTS) algorithm. Each search consists of a series of simulated games of self-play that traverse a tree from root to leaf. Each simulation proceeds by selecting in each state s a move a with low visit count, high move probability and high value (averaged over the leaf states of simulations that selected a from s) according to the current neural network f_θ. The search returns a vector π representing a probability distribution over moves, either proportionally or greedily with respect to the visit counts at the root state.
The following all concerns actual play, not learning:

Here's a super contrived example to try to explain this to the best of my understanding. Let's say you have a position A with only 2 legal moves, 1 and 2. AlphaZero's initial evaluation is some scalar value, let's say +0.5, and a vector of move probabilities for each move, say { 0.75, 0.25 }. But let's say that move 2 leads to a reasonably easily found forced mate. So initially, when it does Monte Carlo rollouts, it's selecting move 1 75% of the time and move 2 25% of the time, but it keeps track of visit counts as well. So if it randomly picks move 1 like 10 times in a row, that 25% chance of selection for move 2 will be modified to become more and more likely to be chosen. The exact details of the move selection algorithm aren't in the preprint, but will presumably be in the full paper.

Each time it selects a move, it plays out a full game from that point, using the same process recursively. The wins and losses in these games form the basis of a score which it gives to that move - what they call "value" in the quote above. Over time, it will notice that move 2 is actually getting a really great value. This then becomes a basis to choose it more often during the Monte Carlo rollouts, basically to investigate more thoroughly and see if the value is legit. (That's what the quote means when it says the three criteria for what moves are selected during MCTS are "low visit count, high move probability and high value" - exactly how these criteria are balanced against each other is unclear). In turn that leads to position A's value rapidly improving. You can see how if position A was actually down the search tree a bit, the good news about the forced win in move 2 would propagate up the tree, gradually making it more and more likely that the sequence of moves leading to it will be selected. So basically AlphaZero makes an initial evaluation of what moves are a good idea, that initial guess guides a series of self-play games, and its opinion of what moves are a good idea is modified based on the results of those games. Those modifications gradually propagate from the leaves of the tree back to the root.
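For what it's worth, the earlier AlphaGo Zero paper did publish a selection rule (PUCT) that balances exactly those three criteria, so it's a reasonable guess at what A0 does. Here's a sketch, using the contrived position A from above; the numbers and the c_puct constant are illustrative, not from the paper.

Code:
import math

def select_move(prior, value, visits, c_puct=1.5):
    # PUCT-style selection: prefer high prior probability, high average
    # value, and low visit count. All three dicts are keyed by move.
    total = sum(visits.values())
    def score(move):
        q = value[move]  # average value of simulations through this move
        u = c_puct * prior[move] * math.sqrt(total) / (1 + visits[move])
        return q + u     # exploitation plus exploration bonus
    return max(prior, key=score)

# Position A: move 1 has the higher prior (0.75), but once it has been
# visited 10 times and move 2 not at all, the exploration bonus makes
# move 2 the next pick.
prior, value, visits = {1: 0.75, 2: 0.25}, {1: 0.1, 2: 0.0}, {1: 10, 2: 0}
print(select_move(prior, value, visits))  # -> 2

The bonus term shrinks as a move's visit count grows, which is exactly the "investigate more thoroughly and see if the value is legit" behaviour: a move can't be starved forever, but it only keeps getting picked if its value holds up.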

Last edited by ChrisV; 12-30-2017 at 12:09 AM.
01-01-2018 , 08:58 AM
AlphaGo doc was just added to Netflix
01-01-2018 , 10:36 AM
Thanks, will definitely watch, looks interesting. Well reviewed.
01-01-2018 , 10:51 AM
lol, I looked at one of the reviews on RT. Here are the opening couple of sentences:

Quote:
The game of go is over 3,000 years old. In all that time, it has never been 'solved' as chess has; there are some preferred starting strategies but no optimal overall strategy has ever been developed.
Not off to a good start.
01-01-2018 , 12:17 PM
Quote:
Originally Posted by Yeti
AlphaGo doc was just added to Netflix
Thanks. I didn't know about it, and just watched it. I enjoyed it. There wasn't much in there about the computer science used or specific Go strategy - instead it focused on the human aspect, such as the drama of the outcome and the feelings, during the matches, of the two men who played against AlphaGo. I guess that's to be expected from a Netflix show, and I thought it was well done.
01-01-2018 , 01:45 PM
Quote:
Originally Posted by ChrisV
No, it starts its evaluation function with random weights, which will produce random play, but the evaluation function is updated as it learns.
Yeah, that sounds right. Thanks for the correction. I need to go back and read the article(s) Google published that I only skimmed before. Still, I'm having trouble with the idea of it getting stuck in some local part of the search space for bridge but not for chess. But at this point it's probably up to me to go read all that again, rather than on you to convince me.

Quote:
Originally Posted by ChrisV
In chess, good moves are good moves even if play thereafter is suboptimal.
Sure, but part of my point was that it doesn't know whether a move is good until the end of the game, as far as closing the loop on training feedback. Obviously the point is to develop the evaluation function which then does have an opinion on moves in isolation, but during training the only feedback is the end of the game.

Quote:
Originally Posted by ChrisV
In bridge bidding, 1S is a better opening bid with a strong hand and spades than 4S, but only with highly specific followups.
It sounds to me like you're saying that a particular opening is only good with certain follow-up variations? But it looked to me like it was able, over time, to work some of that out in training.

Quote:
Originally Posted by ChrisV
By the "easy path" I mean that neural networks will select for whatever gives the largest initial gains, rather than routes that will ultimately produce the best outcomes.
And yet this would seem to pose a fairly large problem for learning to play chess well too, but it seems to do much better in deep or closed positions like that than other engines do? That was my impression based on a couple of the annotated games I saw.

Anyway, it's all cool stuff. Sorry if I'm just being annoying out of ignorance. When I have more time I'll try to educate myself a little more, especially on bidding in bridge.
01-01-2018 , 02:23 PM
Quote:
Originally Posted by well named
Sure, but part of my point was that it doesn't know whether a move is good until the end of the game, as far as closing the loop on training feedback. Obviously the point is to develop the evaluation function which then does have an opinion on moves in isolation, but during training the only feedback is the end of the game.
Is it possible that A0 can do things like "if I make move x when I see the current pattern, it often leads to pattern y; pattern y is correlated with winning, therefore I should make move x"?
01-01-2018 , 02:55 PM
Quote:
Originally Posted by Louis Cyphre
Is it possible that A0 can do things like "if I make move x when I see the current pattern, it often leads to pattern y; pattern y is correlated with winning, therefore I should make move x"?
Yes, as I understand it. Here's how the Arxiv article puts it:

Quote:
Instead of an alpha-beta search with domain-specific enhancements, AlphaZero uses a general-purpose Monte-Carlo tree search (MCTS) algorithm. Each search consists of a series of simulated games of self-play that traverse a tree from root s_root to leaf. Each simulation proceeds by selecting in each state s a move a with low visit count, high move probability and high value (averaged over the leaf states of simulations that selected a from s) according to the current neural network f_θ. The search returns a vector π representing a probability distribution over moves, either proportionally or greedily with respect to the visit counts at the root state.
A "state s" means a complete board state, i.e. the position of every piece, and as I understand it pattern recognition is one of the main strengths of artificial neural networks, in the sense that the trained algorithm will favor similar kinds of moves across board states that are distinct but have various commonalities. It will come to analyze those positions as being similar, hence pattern recognition.

That paragraph is also probably pretty essential to the questions I was asking Chris. I'm reading the part about it returning a probability distribution for the next move 'a' as a way of avoiding the kind of pigeon-holing he thinks will occur, but he's pointing out that the probability distribution will become heavily weighted and might effectively prune a lot of options too early.
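To make that last sentence concrete, here's a tiny sketch (my own illustration, not code from the paper) of the two ways it says visit counts at the root can be turned into the distribution π. The proportional version keeps some probability on lightly-visited moves, which is the anti-pigeonholing effect I'm describing; the greedy version is the hard pruning Chris is worried about. The move names and counts are made up for the example.

Code:
def root_policy(visits, greedy=False):
    # Turn root visit counts into the move distribution pi: proportional
    # keeps lightly-visited moves alive; greedy puts all the probability
    # on the single most-visited move.
    if greedy:
        best = max(visits, key=visits.get)
        return {m: 1.0 if m == best else 0.0 for m in visits}
    total = sum(visits.values())
    return {m: n / total for m, n in visits.items()}

counts = {"e4": 620, "d4": 310, "Nf3": 70}       # illustrative numbers
print(root_policy(counts))               # {'e4': 0.62, 'd4': 0.31, 'Nf3': 0.07}
print(root_policy(counts, greedy=True))  # {'e4': 1.0, 'd4': 0.0, 'Nf3': 0.0}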
01-01-2018 , 05:17 PM
AlphaGo documentary was great

Magnus documentary not as good but still good
01-08-2018 , 08:12 PM
Just bumping this to agree that the AlphaGo Netflix movie was awesome, and is now available in most EU countries too...
01-11-2018 , 02:50 AM
Are there any plans for AlphaGo to play in the future? Surely someone must have looked up what it plays against the Najdorf...
01-12-2018 , 06:45 PM
You mean AlphaZero.

It's interesting that we've only been allowed to see 10 of the 100 games.

It played 44 million games against itself while learning, which isn't a huge number for a game such as chess with ~10^100 possible games. Is it possible that most of the other 90 games were substandard in some way that Google found embarrassing (e.g. drawing technically won endings, playing suboptimal openings), though not sufficiently so for Stockfish to be able to win?

(And as any Najdorf player knows, there's no good way of playing against it.)
01-22-2018 , 11:35 AM
Well, it also played like 1200 other games (iirc) vs. Stockfish in various fixed openings. It'd be interesting to see all of them!
12-06-2018 , 05:02 PM
http://science.sciencemag.org/conten...-Silver-SM.pdf

New paper published with new games and stats on openings (basically the French is very bad).

I found this very funny:

[embedded image not preserved]

All draws in a quite complicated position.
12-06-2018 , 05:56 PM
thanks for the link!
12-06-2018 , 06:01 PM
https://deepmind.com/research/alphag...ero-resources/

All 100 games from 2017 released, 210 new games released...
12-06-2018 , 09:44 PM
They also played it against Stockfish in several new matches, first of all a 1,000-game match in which AZ scored a smooth +155 -6 =839. They then tried a number of variants:

Quote:
To verify the robustness of AlphaZero, we played additional matches that started from common human openings (Fig. 3). AlphaZero defeated Stockfish in each opening, suggesting that AlphaZero has mastered a wide spectrum of chess play. The frequency plots in Fig. 3 and the time line in fig. S2 show that common human openings were independently discovered and played frequently by AlphaZero during self-play training. We also played a match that started from the set of opening positions used in the 2016 TCEC world championship; AlphaZero won convincingly in this match, too (26) (fig. S4). We played additional matches against the most recent development version of Stockfish (27) and a variant of Stockfish that uses a strong opening book (28). AlphaZero won all matches by a large margin (Fig. 2).
They also tried giving AZ 1/10th the thinking time of Stockfish; it won that match too.

Any doubts about whether AZ is stronger than Stockfish are now put to bed; it's much stronger. More here.
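For a sense of scale, +155 -6 =839 is a 57.45% score, which under the standard logistic Elo model corresponds to roughly a 50-point edge. Quick back-of-the-envelope check:

Code:
import math

wins, losses, draws = 155, 6, 839
score = (wins + 0.5 * draws) / (wins + losses + draws)  # 0.5745
# Invert the Elo expectation E = 1 / (1 + 10**(-d / 400)) for d.
elo_diff = -400 * math.log10(1 / score - 1)
print(f"score {score:.4f} -> about +{elo_diff:.0f} Elo")  # ~ +52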