
RL 101 - Tic Tac Toe

source link: https://zhuanlan.zhihu.com/p/339721682

Tic-tac-toe can be thought of as a simplified version of Gomoku (five in a row):

Two players, one playing circles (◯) and the other crosses (✗), take turns placing their symbol on a 3×3 grid; the first to connect three in a row horizontally, vertically, or diagonally wins. If both sides play perfectly, the game ends in a draw.

Tic-tac-toe is a classic introductory example for reinforcement learning and can be classified as a two-player zero-sum game. A tabular RL solution is implemented in reinforcement-learning-an-introduction/tic_tac_toe.py:

def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        # play one complete game
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        # once a game ends the reward is known, so back up and update the values of all visited states (a sketch of this update follows the listing)
        player1.backup()
        player2.backup()
        # reset the board for the next game
        judger.reset()

    # save the learned state values, used later to select actions
    player1.save_policy()
    player2.save_policy()



# The game is a zero sum game. If both players are playing with an optimal strategy, every game will end in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        # play one game against the human
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")

In the same way, tic-tac-toe can be wrapped as an openai/gym environment and trained with hill-a/stable-baselines, but since the game has two players, the reward (and the turn handling) needs to be adapted accordingly, as sketched below.
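
As a rough sketch of what such a wrapper could look like, the snippet below folds the opponent into the environment so the learning agent sees an ordinary single-agent gym interface. The class name TicTacToeEnv, the random opponent, and the reward scheme (+1 win, -1 loss or illegal move, 0 otherwise) are illustrative assumptions, not an existing gym or stable-baselines environment; it uses the classic gym API (reset() -> obs, step() -> obs, reward, done, info).

# Hypothetical sketch: tic-tac-toe as a single-agent gym.Env, with the
# opponent's (random) move folded into step().
import numpy as np
import gym
from gym import spaces

class TicTacToeEnv(gym.Env):
    def __init__(self):
        self.observation_space = spaces.Box(low=-1, high=1, shape=(9,), dtype=np.int8)
        self.action_space = spaces.Discrete(9)    # cell index 0..8
        self.board = np.zeros(9, dtype=np.int8)

    def reset(self):
        self.board[:] = 0
        return self.board.copy()

    def step(self, action):
        if self.board[action] != 0:               # illegal move: immediate loss
            return self.board.copy(), -1.0, True, {}
        self.board[action] = 1                    # the agent plays +1
        if self._winner() == 1:
            return self.board.copy(), 1.0, True, {}
        if not (self.board == 0).any():           # board full, draw
            return self.board.copy(), 0.0, True, {}
        opp = np.random.choice(np.flatnonzero(self.board == 0))
        self.board[opp] = -1                      # the environment's opponent plays -1
        if self._winner() == -1:
            return self.board.copy(), -1.0, True, {}
        done = not (self.board == 0).any()
        return self.board.copy(), 0.0, done, {}

    def _winner(self):
        b = self.board.reshape(3, 3)
        lines = list(b) + list(b.T) + [b.diagonal(), np.fliplr(b).diagonal()]
        for line in lines:
            if line.sum() == 3:
                return 1
            if line.sum() == -3:
                return -1
        return 0

A stronger opponent (or self-play) can be plugged in at the same place; stable-baselines then treats the wrapper like any other environment.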

Beyond tabular methods, a neural network can be introduced to estimate the state value function, with the board state as input. For example, deepmind/open_spiel's AlphaZero can be used to play tic-tac-toe (a minimal value-network sketch follows the game transcript below):

# https://github.com/deepmind/open_spiel/blob/master/docs/alpha_zero.md
# Installation guide: https://github.com/deepmind/open_spiel/blob/master/docs/install.md
# Train the model & play against it
$ az_path=exp/tic_tac_toe_alpha_zero
$ python3 open_spiel/python/examples/tic_tac_toe_alpha_zero.py --path ${az_path}
$ python3 open_spiel/python/examples/mcts.py --game=tic_tac_toe --player1=human --player2=az --az_path=${az_path}/checkpoint-25

2020-12-26 21:26:57.202819: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-26 21:26:57.221343: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fcfa55182a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-26 21:26:57.221356: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
I1226 21:26:57.446192 4433972672 saver.py:1293] Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
Initial state:
...
...
...
Choose an action (empty to print legal actions): 4
Player 0 sampled action: x(1,1)
Next state:
...
.x.
...
Player 1 sampled action: o(2,2)
Next state:
...
.x.
..o
Choose an action (empty to print legal actions): 7
Player 0 sampled action: x(2,1)
Next state:
...
.x.
.xo
Player 1 sampled action: o(0,1)
Next state:
.o.
.x.
.xo
Choose an action (empty to print legal actions): 3
Player 0 sampled action: x(1,0)
Next state:
.o.
xx.
.xo
Player 1 sampled action: o(1,2)
Next state:
.o.
xxo
.xo
Choose an action (empty to print legal actions): 2
Player 0 sampled action: x(0,2)
Next state:
.ox
xxo
.xo
Player 1 sampled action: o(2,0)
Next state:
.ox
xxo
oxo
Choose an action (empty to print legal actions): 0
Player 0 sampled action: x(0,0)
Next state:
xox
xxo
oxo
Returns: 0.0 0.0 , Game actions: x(1,1) o(2,2) x(2,1) o(0,1) x(1,0) o(1,2) x(0,2) o(2,0) x(0,0)
Number of games played: 1
Number of distinct games played: 1
Players: human az
Overall wins [0, 0]
Overall returns [0.0, 0.0]
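
As a rough illustration of the neural-network idea mentioned above (and not the network open_spiel's AlphaZero actually uses), a tiny value network can map the flattened 3×3 board, encoded with values in {-1, 0, +1}, to a scalar value in [-1, 1] and be regressed toward the final game outcome; the PyTorch code below is a hypothetical sketch.

# Hypothetical sketch: an MLP that estimates the state value from the board.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(9, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),      # value from the current player's view
        )

    def forward(self, board):                     # board: (batch, 9) float tensor
        return self.net(board).squeeze(-1)

# Training target: the final game outcome z (+1 win / 0 draw / -1 loss),
# regressed with MSE over the states visited during self-play.
value_net = ValueNet()
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
states = torch.zeros(32, 9)                       # dummy batch of board states
z = torch.zeros(32)                               # dummy final outcomes
loss = nn.functional.mse_loss(value_net(states), z)
optimizer.zero_grad()
loss.backward()
optimizer.step()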

Besides AlphaZero, deepmind/open_spiel also provides an example where a DQN agent learns by playing against tabular Q-learning agents on tic-tac-toe:

# DQN agent vs Tabular Q-Learning agents trained on Tic Tac Toe.
$ python3 open_spiel/python/examples/tic_tac_toe_dqn_vs_tabular.py

AlphaZero applies equally to two-player games other than Go.

The cover image is taken from Welcome to Spinning Up in Deep RL!

