Quote:
Originally Posted by suzzer99
I'd like to understand what "merging policy sets with value sets" means. Like I can understand that Deep Blue just tries every possible future move combination during each move (with priority tweaking for strategy, I'm assuming).
But obviously if that worked for Go they'd have done it a long time ago. What's different about how AI approaches the problem?
I got pretty into a dumbed-down video game version kind of like Go called Ataxx. It was different from a lot of games in that you had to keep pumping in quarters for the time you spent on your own turns, so you had a huge incentive to act fast. Being a broke college student helped with the motivation. But you also lost your quarter if you lost to the ever-toughening bad guys. So I kind of have a vague idea of the strategy concepts. Maybe.
My incredibly high-level and uninformed understanding is that the original AlphaGo basically had two parts. There was a "policy network," which was originally trained on a huge corpus of human games, then refined further with self-play, and whose job was to pick out promising candidate moves. The value network was (I think) trained entirely through self-play and was used to decide when the search had arrived at a good outcome and could stop looking. So basically, the AI would take the current game state, ask the policy network for some likely-looking moves, then check whether making any of those moves would put it in a great or terrible position, and, if not, iterate the process again until it ran out of time or found a winner. I don't really understand what it means to merge those two together either.
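To make that loop concrete, here's a toy Python sketch of roughly what I just described: a stub "policy net" proposes a few candidate moves, a stub "value net" scores the resulting positions, and the search keeps extending the most promising line until something looks decisively won or the budget runs out. Everything here is made up for illustration (the function names, the fake move IDs, the thresholds); the real AlphaGo used Monte Carlo tree search with actual trained networks, not random stubs.

[code]
import random

def policy_net(state, k=3):
    # Stand-in for the policy network: propose k promising candidate moves.
    return random.sample(range(10), k)  # fake move IDs

def value_net(state):
    # Stand-in for the value network: estimated win probability (random here).
    return random.random()

def apply_move(state, move):
    # Fake game state: just the tuple of moves played so far.
    return state + (move,)

def search(root, budget=50, win=0.95, lose=0.05):
    # Each frontier entry: (value, resulting state, first move of that line).
    frontier = [(value_net(apply_move(root, m)), apply_move(root, m), m)
                for m in policy_net(root)]
    best_move, best_value = None, -1.0
    for _ in range(budget):              # "until it ran out of time"
        if not frontier:
            break
        entry = max(frontier, key=lambda e: e[0])
        frontier.remove(entry)
        v, state, first = entry
        if v > best_value:
            best_move, best_value = first, v
        if v >= win:                     # "or found a winner"
            break
        if v > lose:                     # don't bother extending terrible lines
            for m in policy_net(state):  # iterate the process again
                child = apply_move(state, m)
                frontier.append((value_net(child), child, first))
    return best_move

print("chosen move:", search(()))
[/code]

The point of the split, as far as I understand it, is that the policy net keeps the branching factor tiny (a handful of candidates instead of the ~250 legal moves in a Go position), while the value net lets the search cut a line off early instead of playing every game out to the end.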
Here's a great article about the original (now obsolete) version of AlphaGo:
https://www.wired.com/2016/05/google-alpha-go-ai/