
RL 101 - Tic Tac Toe

source link: https://zhuanlan.zhihu.com/p/339721682

Tic-tac-toe can be thought of as a simplified version of Gomoku (five in a row):

Two players, one playing circles (◯) and the other crosses (✗), take turns placing their symbol on a 3×3 grid; the first to connect three in a row horizontally, vertically, or diagonally wins. If both sides play perfectly, the game ends in a draw.

Tic-tac-toe is a classic introductory example for reinforcement learning and can be classified as a two-player zero-sum game. A tabular RL solution is implemented in reinforcement-learning-an-introduction/tic_tac_toe.py:

def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        # play one complete game
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        # once a game ends the reward is known, so back up and update the values of all visited states (a sketch of this update follows the listing)
        player1.backup()
        player2.backup()
        # reset the board for the next game
        judger.reset()

    # save the learned state values, used later to select actions
    player1.save_policy()
    player2.save_policy()



# The game is a zero sum game. If both players are playing with an optimal strategy, every game will end in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        # play one game against the human
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")

In the same way, tic-tac-toe can be wrapped as an openai/gym environment and trained with hill-a/stable-baselines, but since the game has two players, the reward (and the turn handling) needs to be adapted accordingly, as sketched below.
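
As a rough sketch of what such a wrapper could look like, the snippet below folds the opponent into the environment so the learning agent sees an ordinary single-agent gym interface. The class name TicTacToeEnv, the random opponent, and the reward scheme (+1 win, -1 loss or illegal move, 0 otherwise) are illustrative assumptions, not an existing gym or stable-baselines environment; it uses the classic gym API (reset() -> obs, step() -> obs, reward, done, info).

# Hypothetical sketch: tic-tac-toe as a single-agent gym.Env, with the
# opponent's (random) move folded into step().
import numpy as np
import gym
from gym import spaces

class TicTacToeEnv(gym.Env):
    def __init__(self):
        self.observation_space = spaces.Box(low=-1, high=1, shape=(9,), dtype=np.int8)
        self.action_space = spaces.Discrete(9)    # cell index 0..8
        self.board = np.zeros(9, dtype=np.int8)

    def reset(self):
        self.board[:] = 0
        return self.board.copy()

    def step(self, action):
        if self.board[action] != 0:               # illegal move: immediate loss
            return self.board.copy(), -1.0, True, {}
        self.board[action] = 1                    # the agent plays +1
        if self._winner() == 1:
            return self.board.copy(), 1.0, True, {}
        if not (self.board == 0).any():           # board full, draw
            return self.board.copy(), 0.0, True, {}
        opp = np.random.choice(np.flatnonzero(self.board == 0))
        self.board[opp] = -1                      # the environment's opponent plays -1
        if self._winner() == -1:
            return self.board.copy(), -1.0, True, {}
        done = not (self.board == 0).any()
        return self.board.copy(), 0.0, done, {}

    def _winner(self):
        b = self.board.reshape(3, 3)
        lines = list(b) + list(b.T) + [b.diagonal(), np.fliplr(b).diagonal()]
        for line in lines:
            if line.sum() == 3:
                return 1
            if line.sum() == -3:
                return -1
        return 0

A stronger opponent (or self-play) can be plugged in at the same place; stable-baselines then treats the wrapper like any other environment.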

Beyond tabular methods, a neural network can be introduced to estimate the state value function, with the board state as input. For example, deepmind/open_spiel's AlphaZero can be used to play tic-tac-toe (a minimal value-network sketch follows the game transcript below):

# https://github.com/deepmind/open_spiel/blob/master/docs/alpha_zero.md
# Installation guide: https://github.com/deepmind/open_spiel/blob/master/docs/install.md
# Train the model & play against it
$ az_path=exp/tic_tac_toe_alpha_zero
$ python3 open_spiel/python/examples/tic_tac_toe_alpha_zero.py --path ${az_path}
$ python3 open_spiel/python/examples/mcts.py --game=tic_tac_toe --player1=human --player2=az --az_path=${az_path}/checkpoint-25

2020-12-26 21:26:57.202819: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-26 21:26:57.221343: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fcfa55182a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-26 21:26:57.221356: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
I1226 21:26:57.446192 4433972672 saver.py:1293] Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
Initial state:
...
...
...
Choose an action (empty to print legal actions): 4
Player 0 sampled action: x(1,1)
Next state:
...
.x.
...
Player 1 sampled action: o(2,2)
Next state:
...
.x.
..o
Choose an action (empty to print legal actions): 7
Player 0 sampled action: x(2,1)
Next state:
...
.x.
.xo
Player 1 sampled action: o(0,1)
Next state:
.o.
.x.
.xo
Choose an action (empty to print legal actions): 3
Player 0 sampled action: x(1,0)
Next state:
.o.
xx.
.xo
Player 1 sampled action: o(1,2)
Next state:
.o.
xxo
.xo
Choose an action (empty to print legal actions): 2
Player 0 sampled action: x(0,2)
Next state:
.ox
xxo
.xo
Player 1 sampled action: o(2,0)
Next state:
.ox
xxo
oxo
Choose an action (empty to print legal actions): 0
Player 0 sampled action: x(0,0)
Next state:
xox
xxo
oxo
Returns: 0.0 0.0 , Game actions: x(1,1) o(2,2) x(2,1) o(0,1) x(1,0) o(1,2) x(0,2) o(2,0) x(0,0)
Number of games played: 1
Number of distinct games played: 1
Players: human az
Overall wins [0, 0]
Overall returns [0.0, 0.0]
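
As a rough illustration of the neural-network idea mentioned above (and not the network open_spiel's AlphaZero actually uses), a tiny value network can map the flattened 3×3 board, encoded with values in {-1, 0, +1}, to a scalar value in [-1, 1] and be regressed toward the final game outcome; the PyTorch code below is a hypothetical sketch.

# Hypothetical sketch: an MLP that estimates the state value from the board.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(9, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh(),      # value from the current player's view
        )

    def forward(self, board):                     # board: (batch, 9) float tensor
        return self.net(board).squeeze(-1)

# Training target: the final game outcome z (+1 win / 0 draw / -1 loss),
# regressed with MSE over the states visited during self-play.
value_net = ValueNet()
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
states = torch.zeros(32, 9)                       # dummy batch of board states
z = torch.zeros(32)                               # dummy final outcomes
loss = nn.functional.mse_loss(value_net(states), z)
optimizer.zero_grad()
loss.backward()
optimizer.step()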

Besides AlphaZero, deepmind/open_spiel also provides an example where a DQN agent learns by playing against tabular Q-learning agents on tic-tac-toe:

# DQN agent vs Tabular Q-Learning agents trained on Tic Tac Toe.
$ python3 open_spiel/python/examples/tic_tac_toe_dqn_vs_tabular.py

AlphaZero applies equally to two-player games other than Go.

The cover image is taken from Welcome to Spinning Up in Deep RL!

