Common problems when running TensorFlow distributed training on Kubernetes


source link: https://zjj2wry.github.io/post/tensorflow/dist/

2019-02-11

Workers do not exit cleanly after finishing their task (session close fails)

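TensorFlow distributed training can be driven by either Supervisor or MonitoredTrainingSession; the latter is the approach TensorFlow recommends. When distributed training runs under tf-operator, the workers do not exit cleanly once training has finished. Because every worker is a container, the result is that its resources cannot be reclaimed.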

See the issue for details.

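During distributed training a worker only needs to communicate with the ps tasks, so its session can be restricted to the ps job and the worker itself by setting device_filters; see the issue for details.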

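# Let this worker see only the ps job and itself, so closing the session
# does not depend on the other workers.
config_proto = tf.ConfigProto(device_filters=['/job:ps', '/job:worker/task:%d' % task_index])
...
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(task_index == 0),
                                       checkpoint_dir=FLAGS.working_dir,
                                       hooks=hooks,
                                       config=config_proto) as sess:
    # train_op stands for the training op built earlier (not shown in the original).
    while not sess.should_stop():
        sess.run(train_op)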
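
When the job runs under tf-operator, the task type and index usually come from the TF_CONFIG environment variable that the operator injects into each pod. The following is a minimal sketch of deriving the filter list from it. The helper name device_filters_from_tf_config is hypothetical, and returning an empty list for ps tasks is an assumption (ps tasks normally just call server.join() and open no session).

```python
import json
import os


def device_filters_from_tf_config(tf_config_json=None):
    """Build device filters so a task sees only the ps job and itself.

    Hypothetical helper: reads the TF_CONFIG JSON when no string is passed in.
    """
    if tf_config_json is None:
        tf_config_json = os.environ.get("TF_CONFIG", "{}")
    task = json.loads(tf_config_json).get("task", {})
    job_name = task.get("type", "worker")
    task_index = int(task.get("index", 0))
    if job_name == "ps":
        # Assumption: ps tasks open no session, so no filter is needed.
        return []
    return ["/job:ps", "/job:%s/task:%d" % (job_name, task_index)]
```

The returned list can be passed as tf.ConfigProto(device_filters=...) in place of the hand-written list above.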

No log output during training

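Python's stdout is buffered, and when it is not attached to a terminal (as in a container) it is block-buffered rather than line-buffered. With distributed training on k8s it is therefore common to see logs when the program runs locally but none when it runs in a container.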

Add -u when launching the Python program, for example python -u main.py
Or pass the environment variable PYTHONUNBUFFERED=1
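
If neither the command line nor the environment of the container can be changed, a third option is to flush explicitly from inside the training script. This is a minimal sketch rather than something the original post proposes; the helper name log is hypothetical.

```python
import sys


def log(message, stream=None):
    """Write one line and flush it immediately, bypassing stdout buffering."""
    stream = sys.stdout if stream is None else stream
    # flush=True forces the buffered text out right after the write.
    print(message, file=stream, flush=True)
```

Calling log("step 100 done") inside the training loop makes each line reach the container log as soon as it is written.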

Author zhengjiajin

LastMod 2019-02-11
