Modifying AWS Batch Job Definition in new console causes jobs to get stuck in Ru...
source link: https://ralphwillgoss.github.io/blog/2021/06/23/modify-aws-batch-job-definition-in-new-console-causes-jobs-to-get-stuck-in-runnable-status
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Modifying AWS Batch Job Definition in new console causes jobs to get stuck in Runnable status
I provision all my infrastructure using Infrastructure as Code, in this case AWS CloudFormation.
There are times though where I will use the GUI, in this case the AWS Console, to tweak things to help with experiments and iteration speed.
I recently made a change to an AWS Batch Job Definition which inadvertedly caused all my jobs to get stuck in Runnable status.
Batch Jobs getting stuck in a Runnable status is a known issue with AWS Batch and there’s a AWS guide to trouble shooting it.
However, I did not expect changing some container path mappings in a job definition, to cause any issues.
After sometime trouble shooting, I discovered a bug in the New Batch console UI.
I could reproduce the bug as follows.
Looking at the New Batch experience console, my job definition looked fine:
However, after clicking around for a while I decide to disable the New Batch experience.
My initial change was Revision 18 but I don’t remember changing the job type to be Multi-node?
So I check the New Batch Job Requirements UI to see if multi node is enabled, but its not:
I then switch back to the old Batch console Job Requirements, to see if its enabled there too but its not:
However, the Old Batch console experience shows me something interesting.
I’ve never assigned 4 GPU’s, yet the vCPU and Memory requirements are blank?
The fix I discovered was to Create a new Revision using the Old Batch console experience, entering my CPU and Memory requirements again. Upon creating this new revision, I noticed that vCPUs and Memory were now showing correctly in the Old Batch console experience.
Next, I saw EC2 provisioning an instance for my jobs that were stuck and eventually they were executed successfully.
Bug reported on the AWS Batch Forums.
References:
Infrastructure as Code
AWS CloudFormation
Why is my AWS Batch job stuck in RUNNABLE status?
Updated: June 23, 2021
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK