

source link: https://scaleoutsean.github.io/2022/12/09/directpv-topolvm-csi-lvm-das-k8s-with-eseries.html

E-Series in a Kubernetes environment with DirectPV, TopoLVM, CSI Driver LVM CSI drivers

09 Dec 2022 -

11 minute read

CSI choices for E-Series in a Kubernetes environment

NetApp currently develops and maintains a BeeGFS CSI driver for its BeeGFS with E-Series solution. NetApp Trident used to support E-Series with regular (non-parallel) filesystems, but that is no longer the case, so at this time E-Series users can get full support for both the filesystem (BeeGFS) and CSI by purchasing BeeGFS with E-Series from NetApp.

Most Linux applications work fine with BeeGFS, but some customers prefer single-host filesystems (XFS, for example) or must replicate data across multiple Persistent Volumes. What are their choices today?

Some would prefer to have Trident support for E-Series, but in my experience more are interested in CSI drivers for direct-attached storage (aka DAS).

What’s out there in terms of DAS CSI for Kubernetes

Some popular choices are listed below in alphabetical order:

  • CSI Driver LVM
  • DirectPV
  • TopoLVM

I won’t attempt to compare them here, but I’ll comment on one that stands out, which is DirectPV by MinIO: unlike the other two, which may be more “generic”, DirectPV is the first choice for MinIO users in Kubernetes environments with DAS.

That doesn’t mean it can’t work stand-alone - it can - or that you can’t run MinIO with the other two (I think you can). But if you’re using something other than MinIO, the other two may have better features or documentation tailored to your use case.

How does it work?

It works exactly as you imagine it works: each E-Series volume is presented to one, and only one, worker node. Worker nodes can be connected to such volumes directly (SAS or FC cables, for example) or even via SAN (iSCSI, for example).

Basic example for a Kubernetes cluster with 3 bare metal worker nodes connected to E-Series

Worker   DG/DDP   Volume   Device on worker
1        DDP1     vol1     /dev/dm-1
2        DDP1     vol2     /dev/dm-1
3        DDP1     vol3     /dev/dm-1

This makes it possible to create (usually DAS, i.e. switch-less) configurations like the second example from my Instaclustr post, which is VM (not Kubernetes) based but would work for Kubernetes clusters as well, using two disk groups, two volumes, and two workers in each AZ:

Instaclustr services with E-Series running across several sites

This configuration with RF3 on the application side results in application availability across three AZs, with cross-AZ application rebuilds in case an application server fails. Such rebuilds will happen even though we may have multiple workers in each AZ, because each worker is zoned to see only its own disk(s). One way to avoid cross-AZ rebuilds on worker failure is to run workers in VMs and use SAN from the VMs (iSCSI, for example).

This example also shows how a blanket statement such as “you shouldn’t use SAN” can lead to a higher cost, more operational complexity, or both.

Advantages of using DAS CSI with E-Series disk arrays

According to older NetApp E-Series Technical Reports, E-Series offers the following benefits for Ceph, Hadoop, and similar applications:

  • Better redundancy means less unplanned downtime
  • RF2 with RAID 6 is more cost-effective than RF3 on JBOD (or even RF3 with RAID 6)
  • Smaller performance impact on application and network traffic during data recovery
  • Easier storage management

The same applies to this use case as well - although our workload may be Cassandra, Kafka, MinIO, or Elasticsearch rather than Hadoop (HDFS).

Another advantage: in smaller environments it may be “better” (on the whole) to use DAS CSI with SAN, so that the same array can also serve other Kubernetes (e.g. VMware Tanzu) and non-Kubernetes applications (e.g. Instaclustr).

Redundancy and data protection with DAS CSI and E-Series

Client connectivity

Depending on the number of client-facing storage connections, you may be able to connect several workers to each E-Series array. Choosing non-redundant connections doubles the number of workers possible per array but eliminates storage connection redundancy, so normally we’d use redundant connections (one per E-Series controller).
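
As a back-of-the-envelope sketch of that trade-off (the port count below is a made-up example, not an E-Series specification):

```python
# Hypothetical sketch: how many directly-connected workers fit on one
# dual-controller array, given N client-facing host ports per controller.
# The port count is illustrative only.

def workers_per_array(ports_per_controller: int, redundant: bool = True) -> int:
    """With redundant connections, each worker uses one port on each
    controller; without redundancy, each port can serve its own worker."""
    if redundant:
        return ports_per_controller
    return ports_per_controller * 2

# Example: 4 host ports per controller
print(workers_per_array(4))                   # redundant connections: 4 workers
print(workers_per_array(4, redundant=False))  # non-redundant: 8 workers
```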

E-Series has dual controllers, and because data is external to workers, storage with redundant connections to both E-Series controllers is more reliable than disks internal to worker nodes.

RAID level or DDP

The second question is RAID or DDP configuration.

E-Series lets you build multiple volumes on a single RAID group or DDP (“pool”; see other posts such as this one for how DDP differs from traditional RAID). Recommended RAID levels include 1 (and 10), 5, and 6. For best flexibility - especially if you use multiple volumes per disk group - use DDP.

If your application protects data on its own by making multiple copies of data (RF2, RF3), with E-Series it is possible to go from RF3 on JBOD (or even RF3 on RAID6) to RF2 on E-Series RAID6/DDP.

Or, if you want to keep RF3 because you must operate service in three AZs, RF3 with E-Series DDP in each AZ may be more cost effective than RF3 with in-worker RAID6. Although this would require three E-Series arrays (one per AZ, like in the image above), you could run 2 workers per AZ with 1 DDP in E-Series rather than two JBODs or two RAID6 per AZ with server-based DAS.
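
To make the cost argument concrete, here is an illustrative usable-capacity calculation (the disk count, disk size, and 10+2 RAID 6 layout are assumptions for the example, not a sizing recommendation):

```python
# Usable capacity after RAID overhead and application-level replication.
# All numbers below are made-up examples.

def usable_tb(disks: int, disk_tb: float, data_disks: int, stripe_disks: int, rf: int) -> float:
    """Raw capacity, scaled by the RAID data/stripe ratio, divided by the
    application replication factor (RF)."""
    return disks * disk_tb * data_disks / stripe_disks / rf

disks, disk_tb = 12, 10.0  # 120 TB raw per AZ (illustrative)

rf3_jbod  = usable_tb(disks, disk_tb, 1, 1, 3)    # RF3 on JBOD: no RAID overhead
rf2_raid6 = usable_tb(disks, disk_tb, 10, 12, 2)  # RF2 on a 10+2 RAID 6 group

print(rf3_jbod, rf2_raid6)  # 40.0 50.0
```

On the same raw capacity, RF2 on RAID 6 yields more usable space than RF3 on JBOD, while the array (rather than the application) absorbs single-disk failures.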

E-Series also supports RAID 0, which is similar to JBOD in the sense that a failed disk results in a failed disk group. But thanks to E-Series controller redundancy, its RAID 0 is more reliable than server-based RAID 0, which becomes unavailable every time the server is rebooted as well as whenever a disk fails. Still, in most cases RAID 0 isn’t a great idea: DDP has a modest capacity overhead (20-30%), so it is a good trade-off between increased reliability and capacity overhead for those seeking to avoid network storms caused by RAID 0 or software RAID failures in workers.

Storage protocol choices

Below are the main I/O interface choices for E-Series arrays today.

  • EF300 and EF600 (end-to-end NVMe) - iSCSI, NVMe/FC, FC, NVMe/IB, NVMe/RoCE… (see datasheet)
  • E2800 and E5700 - SAS, iSCSI, FC, NVMe/FC…

Two things to highlight: EF300 and EF600 don’t provide SAS connectivity, but they deliver higher performance than E2800 and E5700. DAS connectivity is possible with all protocols (iSCSI, FC, IB and, on E2800/E5700, SAS).

DAS vs SAN

What if you already have E-Series in a SAN environment and all your E-Series’ client-facing controller ports are occupied?

You can use DAS CSI drivers just fine. Unlike with Trident, where all workers in a cluster (or at least in a zone) used to be zoned to see the same set of volumes, DAS CSI requires each worker to see only its own volumes. So for N workers you’d create N*2 zones in a SAN environment (two per worker, one to each E-Series controller).
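
The zone-count math is simple; as a sketch (two controllers assumed, as on current dual-controller E-Series models):

```python
def zones_needed(workers: int, controllers: int = 2) -> int:
    """One single-initiator zone per worker per controller, so each worker
    sees only its own volumes through both controllers."""
    return workers * controllers

print(zones_needed(6))  # 12 zones for a 6-worker cluster
```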

Should DAS CSI users ever choose SAN over DAS?

MinIO and others say no. But it really depends.

Let’s say you have a VMware Tanzu cluster and MinIO on bare metal (six workers in total). You may choose DirectPV in a SAN (iSCSI) environment if one E-Series can satisfy your requirements, and avoid buying extra E-Series arrays. Or avoid building one SAN for Tanzu and three DAS islands with in-worker node DAS just for MinIO.

Recovery from server failures

I’m now repeating what I already said above, but when a server fails, DirectPV won’t let you simply re-assign the volume to another worker in the same AZ and “import” it. That’s by design:

  • we’re not supposed to fiddle with storage (except for the initial provisioning); data management and application recovery largely take place at the application layer
  • “rescuing” data on a DAS PV by importing it from another worker may make sense in the simplest of scenarios, but in many cases data is sharded or erasure-coded by the application, so even if the volume were imported to another worker node, that would not mean the data has been rescued. Even if TopoLVM or CSI Driver LVM can import an orphaned PV, in most cases that would not make sense

Backup and restore

See the Instaclustr post linked above the image: all modern analytics applications have built-in backup and restore. Some have backup and dump (or export), where backup is “incremental forever” while a “full backup” is achieved with “dump” or “export”.

Some “traditional” users may complain that they want enterprise backup solutions for modern analytics applications, but the reality is that most enterprise backup applications do not support them. Why? Because it doesn’t add value. Even with Oracle Database, most users today use the built-in Data Guard.

Another objection is that “traditional” backup is slow. Well, if you’re doing a full backup of a 100 TB database, it won’t be fast. But on fast E-Series arrays you can run frequent incrementals, and then restore and test them on a warm stand-by database. Furthermore, I haven’t tried it, but I would bet that with EF600 it is possible to back up (export, dump) an entire 100 TB database to a hybrid E-Series array in under three hours.
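
As a quick sanity check on that guess (decimal units, 1 TB = 1000 GB), here is the sustained throughput such a backup would require:

```python
# Average throughput needed to move a given capacity in a given time.
# The 100 TB / 3 h figures come from the guess above; units are decimal.

def required_gbps(capacity_tb: float, hours: float) -> float:
    """GB/s of sustained throughput to move capacity_tb in `hours` hours."""
    return capacity_tb * 1000 / (hours * 3600)

print(round(required_gbps(100, 3), 1))  # 9.3 GB/s sustained
```

Roughly 9.3 GB/s sustained, which is in the territory a fast array and a wide enough network path can plausibly deliver.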

Application-native backups are well-behaving because everyone uses them, they’re easy to automate, and they make it easy to create cold replicas anywhere (including the public cloud) or even perform DR to the cloud (for example, Instaclustr provides such services, although at this time you’d restore a Kubernetes app to VMs because Instaclustr’s service is currently based on VMs).

Note that the LVM-based CSI Driver LVM and TopoLVM can take volume snapshots (of the PV, not the E-Series LUN), but again - when you have five workers using volumes from two E-Series arrays, are LVM snapshots better than native incremental backups stored on S3? Probably not.

Driver compatibility and support

These drivers are community-supported and may be commercially supported by your Kubernetes distribution.

It is the user’s responsibility to pick a Linux distribution and version from the E-Series interoperability site, which means a recent release of RHEL, Rocky, Ubuntu and such. Once connectivity to E-Series is established and (recommended) MPIO is in place, there is nothing else to worry about: as mentioned above, there’s no Kubernetes- or CSI-related failover.

In these setups E-Series volumes don’t need to be resized (although that may be possible the same way general DAS volumes can be resized) and there are no array snapshots involved. There may be PV snapshots performed by LVM and we can certainly take E-Series “cold snapshots” if we want to, but in normal operation E-Series just needs to present storage to supported Linux clients.

Why not resize E-Series volumes? Because capacity scaling is usually handled by the application: use the E-Series API or Ansible modules for E-Series and hosts to present a new volume to one or more workers, activate it with the CSI driver, inform the application and let it take care of data rebalancing.

Storage networking, automation and cloud-native workflows

Network automation is usually a sensitive topic because network administrators must be in the loop.

NetApp Trident (and many other CSI drivers) supports only iSCSI SANs (today with ONTAP and SolidFire), which is the easiest and most convenient option, but usually the worst from a “don’t mess with networking” perspective. Some users therefore ask for FC - not because iSCSI is slow (it’s not) but because FC helps them avoid operational issues with IP SANs.

DAS CSI with E-Series supports any I/O interface supported by OS and E-Series, and doesn’t use switches. While DAS CSI users also have the option to use them with SAN-connected E-Series, DAS CSI with E-Series gives them flexibility in operations and removes the hassle of dealing with switches and especially IP storage networks.

Because modern analytics applications usually use three or more servers, capacity is usually provisioned by adding three volumes at once:

  1. Create volume group/pool or use existing
  2. Create volume and present it to one worker
  3. Rescan storage from the worker and possibly configure MPIO (with iSCSI you would need to login to target after discovery)
  4. Use DAS CSI driver to refresh its view of disks, import/activate and format PV

Step 1 can be done with Ansible.

Steps 2 and 3 need to be done for/on each worker. For that use host collection automation for E-Series. Examples of using Ansible with E-Series can be found on the Internet.

Step 4 involves several commands inside of Kubernetes.

Disk removal works in reverse, and Ansible makes it easy to undo provisioning done in steps 2 and 3 (simply change state: present to state: absent).
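
As a rough Ansible sketch of steps 1 and 2 (module names are from the netapp_eseries.santricity collection; parameter names, sizes, credentials, and pool/volume/host names are illustrative and should be verified against the collection documentation):

```yaml
# Hypothetical playbook: create a DDP, then one volume per worker, and map
# each volume to its own worker. All names and values are placeholders.
- hosts: localhost
  gather_facts: false
  vars:
    eseries_api: &api
      ssid: "1"
      api_url: "https://array.example.com:8443/devmgr/v2"
      api_username: admin
      api_password: "{{ vault_eseries_password }}"
      validate_certs: false
  tasks:
    - name: Create (or reuse) a disk pool            # step 1
      netapp_eseries.santricity.na_santricity_storagepool:
        <<: *api
        state: present
        name: ddp1
        raid_level: raidDiskPool
        criteria_drive_count: 11

    - name: Create one volume per worker             # step 2
      netapp_eseries.santricity.na_santricity_volume:
        <<: *api
        state: present
        name: "vol{{ item }}"
        storage_pool_name: ddp1
        size: 4
        size_unit: tb
      loop: [1, 2, 3]

    - name: Map each volume to its own worker host   # step 2, continued
      netapp_eseries.santricity.na_santricity_lun_mapping:
        <<: *api
        state: present
        volume_name: "vol{{ item }}"
        target: "worker{{ item }}"
      loop: [1, 2, 3]
```

Steps 3 and 4 then happen on the workers themselves: storage rescan and MPIO configuration, followed by the CSI driver’s own discovery and format workflow.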

DirectPV walk-through with E-Series

TODO.

Summary

Users of modern analytics and other applications that use single host filesystems have a choice of CSI DAS drivers which are available today and simple to use.

This lets them avoid the hassle of dealing with JBODs and worker-captive RAIDs and possibly lower their application replication factor from three to two or even one.

