RL 101 - Tic Tac Toe
source link: https://zhuanlan.zhihu.com/p/339721682
Tic-tac-toe can be seen as a simplified version of Gomoku (five in a row):

Two players, one playing circles (◯) and one playing crosses (✗), take turns marking the cells of a 3x3 grid; the first to connect three of their own marks in a horizontal, vertical, or diagonal line wins. If both sides play perfectly, the game ends in a draw.
Tic-tac-toe is a classic introductory example for reinforcement learning and can be classified as a two-player zero-sum game. A tabular RL solution is implemented in reinforcement-learning-an-introduction/tic_tac_toe.py:
```python
def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        # play one full game
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f'
                  % (i, player1_win / i, player2_win / i))
        # once a game ends the reward is known, so back up the value of every visited state
        player1.backup()
        player2.backup()
        # re-initialize for the next game
        judger.reset()
    # save value(state), used later for action selection
    player1.save_policy()
    player2.save_policy()


# The game is a zero sum game. If both players are playing with an optimal strategy,
# every game will end in a tie. So we test whether the AI can guarantee at least a
# tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        # play against the human
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")
```
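The key step above is `backup()`: in the Sutton & Barto reference implementation it performs a temporal-difference update, V(s_t) ← V(s_t) + α·[V(s_{t+1}) − V(s_t)], over the states visited during the game, propagating the end-of-game reward backwards. A simplified sketch (argument names here are illustrative, not the exact attributes of the `Player` class):

```python
# Simplified TD(0) backup over one finished game, from last state to first.
def backup(states, estimations, greedy, step_size=0.1):
    """Update the value estimate of every state visited in one game.

    states:      hashes of the states visited this episode, in order
    estimations: dict mapping state hash -> estimated value
    greedy:      greedy[i] is 0/False if move i was exploratory; the TD
                 error of exploratory moves is not propagated
    """
    for i in reversed(range(len(states) - 1)):
        state = states[i]
        td_error = greedy[i] * (estimations[states[i + 1]] - estimations[state])
        estimations[state] += step_size * td_error
```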
Similarly, the game can be modeled with openai/gym and hill-a/stable-baselines, but since the game has two players, the reward logic (among other things) needs to be adapted accordingly.
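One way to do that adaptation is sketched below as a hypothetical gym-style environment (this is not an actual gym or stable-baselines API): `step()` applies the current player's move, returns the reward from the mover's perspective, and hands the turn to the other player.

```python
# Hypothetical self-play environment with a gym-like reset()/step() interface.
class TicTacToeEnv:
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def reset(self):
        self.board = [0] * 9   # 1 = X, -1 = O, 0 = empty
        self.to_play = 1
        return tuple(self.board)

    def step(self, action):
        assert self.board[action] == 0, "illegal move"
        self.board[action] = self.to_play
        if any(sum(self.board[i] for i in line) == 3 * self.to_play
               for line in self.LINES):
            return tuple(self.board), 1.0, True, {}   # the mover just won
        if all(c != 0 for c in self.board):
            return tuple(self.board), 0.0, True, {}   # draw
        self.to_play = -self.to_play                  # hand over the turn
        return tuple(self.board), 0.0, False, {}
```

To train a single stable-baselines agent against this, one would still wrap it so the opponent moves inside `step()`, and an opponent win would have to be credited back as a negative reward for the agent's preceding move.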
Beyond tabular methods, a neural network can be introduced to approximate the state-value function, taking the board state as input. For example, deepmind/open_spiel's AlphaZero can be used to play tic-tac-toe:
```shell
# https://github.com/deepmind/open_spiel/blob/master/docs/alpha_zero.md
# Install: https://github.com/deepmind/open_spiel/blob/master/docs/install.md
# Train a model & play against it
$ az_path=exp/tic_tac_toe_alpha_zero
$ python3 open_spiel/python/examples/tic_tac_toe_alpha_zero.py --path ${az_path}
$ python3 open_spiel/python/examples/mcts.py --game=tic_tac_toe --player1=human --player2=az --az_path=${az_path}/checkpoint-25
2020-12-26 21:26:57.202819: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-26 21:26:57.221343: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fcfa55182a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-26 21:26:57.221356: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
I1226 21:26:57.446192 4433972672 saver.py:1293] Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
Initial state:
...
...
...
Choose an action (empty to print legal actions): 4
Player 0 sampled action: x(1,1)
Next state:
...
.x.
...
Player 1 sampled action: o(2,2)
Next state:
...
.x.
..o
Choose an action (empty to print legal actions): 7
Player 0 sampled action: x(2,1)
Next state:
...
.x.
.xo
Player 1 sampled action: o(0,1)
Next state:
.o.
.x.
.xo
Choose an action (empty to print legal actions): 3
Player 0 sampled action: x(1,0)
Next state:
.o.
xx.
.xo
Player 1 sampled action: o(1,2)
Next state:
.o.
xxo
.xo
Choose an action (empty to print legal actions): 2
Player 0 sampled action: x(0,2)
Next state:
.ox
xxo
.xo
Player 1 sampled action: o(2,0)
Next state:
.ox
xxo
oxo
Choose an action (empty to print legal actions): 0
Player 0 sampled action: x(0,0)
Next state:
xox
xxo
oxo
Returns: 0.0 0.0 , Game actions: x(1,1) o(2,2) x(2,1) o(0,1) x(1,0) o(1,2) x(0,2) o(2,0) x(0,0)
Number of games played: 1
Number of distinct games played: 1
Players: human az
Overall wins [0, 0]
Overall returns [0.0, 0.0]
```
In addition, deepmind/open_spiel also provides an example of a DQN agent playing against tabular Q-learning agents:
```shell
# DQN agent vs Tabular Q-Learning agents trained on Tic Tac Toe.
$ python3 open_spiel/python/examples/tic_tac_toe_dqn_vs_tabular.py
```
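Conceptually, the tabular side of that script learns a Q-table over (state, action) pairs. Below is a self-contained sketch of tabular Q-learning for tic-tac-toe, independent of open_spiel and written from scratch for illustration: the learner plays X against a uniformly random O, who is folded into the environment, so each Q backup spans X's move plus O's reply.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 1 if X wins, -1 if O wins, 0 otherwise."""
    for a, b, c in LINES:
        s = board[a] + board[b] + board[c]
        if abs(s) == 3:
            return s // 3
    return 0

def legal(board):
    return [i for i, c in enumerate(board) if c == 0]

def train(episodes=5000, alpha=0.5, epsilon=0.1, seed=0):
    """Q-learning for X; the random O player is part of the environment."""
    rng = random.Random(seed)
    Q = {}  # (state tuple, action) -> estimated return for X
    for _ in range(episodes):
        board = [0] * 9
        while True:
            s, moves = tuple(board), legal(board)
            # epsilon-greedy action selection for X
            if rng.random() < epsilon:
                a = rng.choice(moves)
            else:
                a = max(moves, key=lambda m: Q.get((s, m), 0.0))
            board[a] = 1                                   # X moves
            if winner(board) == 1:
                target, done = 1.0, True                   # X just won
            elif not legal(board):
                target, done = 0.0, True                   # draw
            else:
                board[rng.choice(legal(board))] = -1       # random O reply
                if winner(board) == -1:
                    target, done = -1.0, True              # O won
                elif not legal(board):
                    target, done = 0.0, True               # draw
                else:
                    s2 = tuple(board)
                    target = max(Q.get((s2, m), 0.0) for m in legal(board))
                    done = False
            q = Q.get((s, a), 0.0)
            Q[(s, a)] = q + alpha * (target - q)           # Q-learning backup
            if done:
                break
    return Q

def evaluate(Q, games=200, seed=1):
    """Greedy play as X versus a random O; returns X's win rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(games):
        board = [0] * 9
        while True:
            s = tuple(board)
            a = max(legal(board), key=lambda m: Q.get((s, m), 0.0))
            board[a] = 1
            if winner(board) == 1:
                wins += 1
                break
            if not legal(board):
                break
            board[rng.choice(legal(board))] = -1
            if winner(board) != 0 or not legal(board):
                break
    return wins / games
```

After `Q = train()`, `evaluate(Q)` should show the greedy learned policy winning the large majority of games against the random opponent; against another learning agent rather than a random one, as in the open_spiel script, the two policies converge toward always drawing.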
AlphaZero likewise applies to two-player games other than Go.