Nomad batch jobs with BeeGFS and E-Series
source link: https://scaleoutsean.github.io/2022/04/05/nomad-beegfs-eseries.html
05 Apr 2022 -
6 minute read
Introduction
When I wrote about HashiCorp Nomad with NetApp SolidFire last week, I mentioned that BeeGFS CSI with NetApp E-Series is one of my to-do items.
This post isn't about CSI, but about an intermediate step.
Now, there are many batch job schedulers out there. Some are HPC-focused and extremely good for that use case.
But there are non-HPC use cases for BeeGFS, such as - for the sake of an example - data cleansing, video conversion and even backups (we could run Restic backup jobs for generic hosts and containers using Nomad).
Some folks must use workflows specific to their organization or solution. Sometimes these can be initiated from another system; other times they're still executed the old way (crontab-driven shell scripts?).
BeeGFS with E-Series provides extreme scale-out in terms of performance and number of files.
As noted in the first Nomad post, HashiCorp Nomad can schedule batch jobs, containers, VMs and more. It can also allocate desired resources to each job.
The other point I made is that we don't always need CSI storage (and I'm not saying this just because I haven't done it yet!).
Setup
This simple example is about running such jobs at high speed. Normally we'd have multiple clients, but here I have just one.
- BeeGFS cluster (VMs)
- b1 - manager
- b2 - metadata server
- b3 - storage node 1
- b4 - storage node 2
- b5 - BeeGFS and Nomad client
- Nomad servers (VM)
- Parallel filesystem
- BeeGFS mounted at /mnt/beegfs on BeeGFS client (b5)
- Block storage
- NetApp E-Series 5760
On the Nomad client with BeeGFS, configure a Host Volume that points to the BeeGFS mount point or one of its subdirectories (e.g. /mnt/beegfs/nomad-mysql). beegfs-ctl gives us a view of our BeeGFS cluster.
$ sudo beegfs-ctl --listnodes --nodetype=meta
b2 [ID: 1]
$ sudo beegfs-ctl --listnodes --nodetype=storage
b3 [ID: 3]
b4 [ID: 4]
$ beegfs-ctl --listnodes --nodetype=client
A83C-624BAFAC-b5 [ID: 101]
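The Host Volume is declared in the Nomad client's agent configuration. A minimal sketch (the volume name "beegfs-vol" and the config file path are illustrative, not from the original setup):

```hcl
# Nomad client configuration on b5 (e.g. /etc/nomad.d/client.hcl).
# The host_volume name "beegfs-vol" is an illustrative choice.
client {
  enabled = true

  host_volume "beegfs-vol" {
    path      = "/mnt/beegfs/nomad-mysql"
    read_only = false
  }
}
```

Jobs that request this volume will then only be placed on clients that expose it.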
Visually:
Generic batch job with I/O to parallel filesystem
On Nomad server, create a batch task such as this:
task "script" {
  driver = "raw_exec"

  config {
    command = "/usr/bin/fio"
    args    = ["/mnt/beegfs/nomad-mysql/fio.txt"]
  }
}
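For context, a complete job file around that task might look like this - a sketch, assuming the host volume is named "beegfs-vol" (that name is illustrative):

```hcl
# batch.nomad - hypothetical full job file around the task above.
# The group-level volume constrains placement to clients that expose
# the "beegfs-vol" host volume; raw_exec tasks see the host filesystem
# directly, so no explicit volume_mount is needed here.
job "batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "example" {
    volume "beegfs" {
      type      = "host"
      source    = "beegfs-vol"
      read_only = false
    }

    task "script" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/fio"
        args    = ["/mnt/beegfs/nomad-mysql/fio.txt"]
      }
    }
  }
}
```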
I currently have just one BeeGFS client, but with multiple BeeGFS/Nomad clients this job would be scheduled to any BeeGFS client with the same Host Volume. In fact - because BeeGFS is a parallel file system - you could schedule jobs on several clients at once (for example, N jobs that process 1/N-th of the input file each).
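One way to express that fan-out is a Nomad parameterized job, dispatched once per slice. This is a sketch; the chunk metadata and the process-chunk.sh script are hypothetical:

```hcl
# Hypothetical parameterized job: each dispatch processes one slice.
job "split-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  parameterized {
    meta_required = ["chunk", "chunks"]  # e.g. chunk 2 of 8
  }

  group "example" {
    task "slice" {
      driver = "raw_exec"

      config {
        # process-chunk.sh is hypothetical: it would read the
        # 1/N-th of the input file selected by the chunk index
        command = "/usr/local/bin/process-chunk.sh"
        args    = ["${NOMAD_META_chunk}", "${NOMAD_META_chunks}"]
      }
    }
  }
}
```

Each slice would then be started with something like `nomad job dispatch -meta chunk=2 -meta chunks=8 split-batch`.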
And - as demonstrated in the post about NetApp HCI with BeeGFS VMs on an EF280 array - this gives you the ability to run such jobs at several GB/s per job.
$ nomad job run batch.nomad
==> 2022-04-05T05:09:55Z: Monitoring evaluation "17e84b74"
2022-04-05T05:09:55Z: Evaluation triggered by job "batch"
2022-04-05T05:09:55Z: Allocation "4293eab4" created: node "71a25827", group "example"
==> 2022-04-05T05:09:56Z: Monitoring evaluation "17e84b74"
2022-04-05T05:09:56Z: Allocation "4293eab4" status changed: "pending" -> "running" (Tasks are running)
2022-04-05T05:09:56Z: Evaluation status changed: "pending" -> "complete"
==> 2022-04-05T05:09:56Z: Evaluation "17e84b74" finished with status "complete"
It took 1 second to complete this job.
Job output is a 500 MiB file (rand-read.0.0). Although the workload was read-only, the file was created anew when the job executed.
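The contents of fio.txt aren't shown above. A hypothetical fio job file consistent with the observed output (a 500 MiB file named rand-read.0.0, random reads) could look like this - a reconstruction, not the original file:

```ini
; hypothetical fio.txt - reconstructed from the observed output
[rand-read]
directory=/mnt/beegfs/nomad-mysql
rw=randread
size=500m
bs=4k
```

fio names its data files after the job section, which would explain rand-read.0.0, and it lays out the data file before a read-only run, which would explain why the file was created anew.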
$ sudo dir -lat /mnt/beegfs/nomad-mysql/
total 512002
-rw-r--r-- 1 nomad nomad 524288000 Apr 5 05:09 rand-read.0.0
drwxrwxr-x 2 nomad nomad 2 Apr 5 05:09 .
-rw-r--r-- 1 root root 202 Apr 5 04:39 fio.txt
drwxrwxrwx 3 root root 7 Apr 5 03:33 ..
BeeGFS client (b5) with a successfully completed batch job:
Job overview:
File format conversion
Another example is video conversion - because recently somebody asked me about this. Our input is a video we want to convert to MP4.
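The conversion can be scheduled the same way as the fio job. A sketch of the task, assuming ffmpeg is installed on the client (the scaling arguments are illustrative):

```hcl
task "convert" {
  driver = "raw_exec"

  config {
    command = "/usr/bin/ffmpeg"
    # illustrative arguments: scale down and transcode to MP4
    args = [
      "-i", "/mnt/beegfs/nomad-mysql/sample_1280x720_surfing_with_audio.m2v",
      "-vf", "scale=640:480",
      "/mnt/beegfs/nomad-mysql/sample_1280x720_surfing_with_audio.mp4"
    ]
  }
}
```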
$ ll
total 540906
drwxrwxr-x 2 nomad nomad 3 Apr 5 05:36 ./
drwxrwxrwx 3 root root 7 Apr 5 03:33 ../
-rw-r--r-- 1 root root 202 Apr 5 04:39 fio.txt
-rw-r--r-- 1 nomad nomad 524288000 Apr 5 05:09 rand-read.0.0
-rw-r--r-- 1 nomad nomad 29597941 Mar 29 16:13 sample_1280x720_surfing_with_audio.m2v
When the job gets allocated to b5, it consumes CPU (as expected) but very little I/O, because it's CPU-constrained.
$ dstat -tcn
----system---- --total-cpu-usage-- -net/total-
time |usr sys idl wai stl| recv send
05-04 05:39:46| 1 1 98 0 0| 0 0
05-04 05:39:47|100 0 0 0 0|1034k 261k
05-04 05:39:48|100 0 0 0 0| 868B 514k
05-04 05:39:49|100 0 0 0 0| 516k 958B
05-04 05:39:50| 99 1 0 0 0| 517k 514k
05-04 05:39:51| 99 0 1 0 0| 163k 514k
05-04 05:39:52|100 0 0 0 0| 517k 2658B
05-04 05:39:53| 99 1 0 0 0| 804B 514k
05-04 05:39:54|100 0 0 0 0| 516k 760B
Because this BeeGFS filesystem uses storage devices on both storage nodes (b3 and b4), each sees roughly 50% of that I/O activity.
----system---- --total-cpu-usage-- -net/total- -dsk/total-
time |usr sys idl wai stl| recv send| read writ
05-04 05:39:33| 1 0 99 0 0| 130B 110B| 0 0
05-04 05:39:34| 0 0 99 1 0| 66B 118B| 0 20k
05-04 05:39:35| 1 0 99 0 0|1406B 513k| 0 0
05-04 05:39:36| 0 0 100 0 0| 514k 161k| 0 0
05-04 05:39:37| 0 0 100 0 0| 130B 110B| 0 0
05-04 05:39:38| 0 0 100 0 0| 434B 282B| 0 0
05-04 05:39:39| 0 0 100 0 0| 514k 513k| 0 0
05-04 05:39:40| 0 0 100 0 0| 345B 110B| 0 20k
Output (MP4 file sample_1280x720_surfing_with_audio.mp4):
$ ll
total 547563
drwxrwxr-x 2 nomad nomad 3 Apr 5 05:39 ./
drwxrwxrwx 3 root root 7 Apr 5 03:33 ../
-rw-r--r-- 1 root root 202 Apr 5 04:39 fio.txt
-rw-r--r-- 1 nomad nomad 524288000 Apr 5 05:09 rand-read.0.0
-rw-r--r-- 1 nomad nomad 29597941 Mar 29 16:13 sample_1280x720_surfing_with_audio.m2v
-rw-r--r-- 1 nomad nomad 6815792 Apr 5 05:39 sample_1280x720_surfing_with_audio.mp4
I hard-coded the video conversion parameters in the job (the target resolution was smaller, so the file should really have been named sample_640x480…). It's nothing to write home about, but it worked for the purpose of this demonstration.
Summary
Nomad makes it easy to schedule all sorts of jobs on various platforms. Some may even be jobs that run in (say) DB VMs and require complex steps that can benefit from other Nomad features, such as integration with Vault to avoid hard-coding passwords in your scripts.
If you have a workflow that can benefit from a better scheduler, you could dispatch jobs to Nomad which would schedule them for you.
You may be tempted to try Kubernetes, and that's fine, but from a job-scheduling perspective all you need is a single binary (nomad) that runs as a server or client depending on its role. The entire setup for a VM-based cluster (provision and configure VMs, BeeGFS, Nomad clients and servers) can be done in less than 20 minutes.
BeeGFS with E-Series gives you the ability to access data from any client - whether it's a VM, container, or physical host - at very high speed. The jobs in this post did not use Docker - I used the simplest approach, Nomad's raw_exec driver (the related exec driver provides task isolation for better security if you need it).
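If you want that isolation, the main change is the driver. Because exec sandboxes the task's filesystem, the host volume then needs an explicit mount - a sketch, with "beegfs-vol" as an illustrative host volume name:

```hcl
# Same fio task with the isolating exec driver instead of raw_exec.
group "example" {
  volume "beegfs" {
    type   = "host"
    source = "beegfs-vol"
  }

  task "script" {
    driver = "exec"

    volume_mount {
      volume      = "beegfs"
      destination = "/mnt/beegfs/nomad-mysql"
    }

    config {
      command = "/usr/bin/fio"
      args    = ["/mnt/beegfs/nomad-mysql/fio.txt"]
    }
  }
}
```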
BeeGFS CSI - a CSI-compatible driver for BeeGFS maintained by NetApp - is our next stop on this journey. BeeGFS CSI doesn't officially support Nomad (as of April 2022), but the BeeGFS CSI driver should be able to work with Nomad's CSI support. More on that in the next Nomad-related post. Until then, you may be interested in BeeGFS CSI with Kubernetes.