Do not underestimate the need for DevOps in AI.

Enter Deep Learning DevOps: DL Infrastructure Engineering.

Feb 18 · 3 min read


Why should you care?

As machine learning matures, the need for infrastructure that supports running these workflows keeps growing. In a large enterprise setting there are, on average, at least 200 data scientists and DL/ML engineers running model training and inference jobs. Ensuring that these users get easy access to the hardware and software they need to train their models is imperative. This sounds like an easy task. I’m here to tell you it is not.

There are multiple challenges. For example:

Abstracting away how jobs are run and giving data scientists easy access to hardware are both important.
Different users have different hardware and software requirements. Some users train their models with TensorFlow, some with PyTorch, and others use a framework built in-house.
One team runs large-scale XLNet/BERT training on, say, sixteen V100s, while another team uses a pre-trained EfficientNet on, say, two T4 GPUs.
How to manage and maintain these resources?
Who actually controls these hardware resources? Is it the data scientists or the DevOps team?
Who sets the priority of the jobs? (A job is anything that you want to run on the hardware, for example training or inference.)
How to keep resource allocations sane? We all know everybody wants their model to be trained first.
How to support data scientists who don’t know how to use their allocated resources fully, for example when their GPU utilization across 32 GPUs is less than 25%?
How to handle security issues, for example ensuring that only the intended users get access?
How to ensure all the accelerators and nodes are used effectively without taking a significant performance penalty?
Who helps data scientists profile their slow applications? This is sometimes an issue because data scientists are not necessarily building the most efficient model training pipelines. In other words, data scientists are not software engineers.
Some jobs are a chain of dependencies: perhaps they use, say, a parallel server-worker architecture where some workers must be kicked off before others. How to handle these jobs and maintain the queue?
Who installs and maintains new tools, for example MLflow, Kubeflow, Polyaxon, Seldon, Pachyderm, Domino Data Lab, Argo, and so on, when new ones show up as quickly as the next month?
If these problems are not already enough to cause headaches, think about installing and maintaining hardware-level drivers, for example CUDA drivers, new packages, and so on.
Some errors that data scientists run into are very specific to ML libraries, for example NCCL ring topology issues.
Deploying and supporting inference jobs is a beast of its own that needs its own infrastructure team.
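
To make the priority and dependency questions above a bit more concrete, here is a minimal sketch of how a queue could order jobs so that dependencies start first and priority decides among the jobs that are ready. Everything in it (the Job record, the schedule function, the job names and priority values) is hypothetical and made up for illustration; real clusters would normally lean on an existing scheduler rather than something hand-rolled like this.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    name: str
    priority: int                                   # lower number runs earlier
    depends_on: list = field(default_factory=list)  # names of jobs that must start first


def schedule(jobs):
    """Return job names in an order that respects dependencies.

    Among jobs whose dependencies are already satisfied, the one with the
    lowest priority number goes first; ties are broken by name.
    """
    by_name = {job.name: job for job in jobs}
    unmet = {job.name: len(job.depends_on) for job in jobs}
    dependents = {job.name: [] for job in jobs}
    for job in jobs:
        for dep in job.depends_on:
            dependents[dep].append(job.name)

    # Start with every job that has no dependencies at all.
    ready = [(job.priority, job.name) for job in jobs if not job.depends_on]
    heapq.heapify(ready)

    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        # Releasing a job may unblock the jobs that were waiting on it.
        for child in dependents[name]:
            unmet[child] -= 1
            if unmet[child] == 0:
                heapq.heappush(ready, (by_name[child].priority, child))

    if len(order) != len(jobs):
        raise ValueError("dependency cycle detected")
    return order


# Assumed example workload: a server that must be up before its workers,
# plus an unrelated lower-priority fine-tuning job.
jobs = [
    Job("bert-workers", priority=1, depends_on=["bert-server"]),
    Job("bert-server", priority=1),
    Job("efficientnet-finetune", priority=2),
]
print(schedule(jobs))
# prints ['bert-server', 'bert-workers', 'efficientnet-finetune']
```

Even this toy version shows why someone has to own the policy: the priority numbers and the dependency lists have to come from somewhere, and data scientists and the infrastructure team will not always agree on them.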

A typical DevOps engineer doesn’t necessarily have the expertise to support some very ML library-specific issues. On the other hand, data scientists themselves aren’t experts in managing large-scale clusters, and it’s not a good idea to hand off a large cluster to data scientists either. So who does the above work? At this point, you might be thinking, “hmm… this sounds a lot like a SysAdmin role”. In a way, it is! However, since the work also requires knowing ML concepts, it needs someone who is a SysAdmin + ML Engineer. Enter the ML/DL Infrastructure Engineer. On top of that, large on-prem clusters usually bring HPC along with them.

Conclusion

DL Infrastructure Engineers are responsible for managing and maintaining clusters. This is even more true when you move from cloud to on-prem. In the future (not really, we’re already seeing it happen), we’ll see an infrastructure branch, made up of people who understand both ML and DevOps concepts, that caters to data scientists. That way, the scientists can do their science and the infrastructure engineers can do their DL infrastructure ;)

