GPU Support for AI Workloads in Red Hat OpenShift 4

Red Hat OpenShift is an enterprise-grade Kubernetes platform for managing Kubernetes clusters at scale, developed and supported by Red Hat. It offers a path to transform how organizations manage complex infrastructures on-premises as well as across the hybrid cloud.

AI computing brings far-reaching transformations to modern business, including fraud detection in financial services and recommendation engines for entertainment and e-commerce. A 2018 CTA Market Research study shows that companies using AI technology at the core of their corporate strategy boosted their profit margin by 15% compared to companies not embracing AI.

The responsibility of providing infrastructure for these new, massively compute-intensive workloads falls on the shoulders of IT. Many organizations struggle with the complexity, time, and cost associated with deploying an IT-approved infrastructure for AI. Up to 40% of organizations wishing to deploy AI see infrastructure as a major blocker. Deploying a cluster for AI workloads often raises questions about network topology and the sizing of compute and storage resources. NVIDIA has therefore created reference architectures for typical applications to remove the guesswork.

The NVIDIA DGX POD, for example, consists of multiple DGX-1 systems and storage from named vendors. NVIDIA developed DGX POD from the experience of deploying thousands of nodes of leading-edge accelerated computing in the world’s largest research and enterprise environments. Ensuring AI success at scale, however, requires a software platform, such as Kubernetes™, that keeps the AI infrastructure manageable.

Red Hat OpenShift 4 is a major release incorporating technology from Red Hat’s acquisition of CoreOS. At its core (no pun intended) are immutable system images based on Red Hat Enterprise Linux CoreOS (RHCOS). It follows a new paradigm where installations are never modified or updated after they are deployed but are replaced with an updated version of the entire system image. This provides higher reliability and consistency with a more predictable deployment process.

This post takes a first look at OpenShift 4 and GPU Operators for AI workloads on NVIDIA reference architectures. We base this post on a software preview requiring some manual steps, which will be resolved in the final version.

Installing and running OpenShift requires a Red Hat account and additional subscriptions. The official installation instructions are available under Installing a cluster on bare metal.

Test Setup Overview

The minimum configuration for an OpenShift cluster consists of three master nodes and two worker nodes (also called compute nodes). Initial setup of the cluster requires an additional bootstrap node, which can be removed or repurposed once bootstrapping completes. The requirement of having three master nodes ensures high availability (avoiding split-brain situations) and allows for uninterrupted upgrades of the master nodes.
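
The case for three master nodes can be made concrete with a little arithmetic. The sketch below assumes the usual majority-quorum rule of etcd-style consensus; the function names are ours and not part of OpenShift.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can be offline while a majority remains."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} masters: quorum={quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

With three masters, one can be offline (for example during an upgrade) while the remaining two still form a majority. Two masters tolerate no failure at all, which is why two is not an improvement over one.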

We used virtual machines on a single x86 machine for the bootstrap and master nodes and two DGX-1 systems for the compute nodes (bare metal). A load balancer ran in a separate VM to distribute requests to the nodes. Round-robin DNS might also have worked, but getting the configuration correct turned out to be tricky. The libvirt network needs to be set to bridged mode instead of NAT so that the nodes can communicate with each other (see Libvirt Networking for more details).

Red Hat OpenShift 4 does not yet provide a fully automated installation method for bare-metal systems but requires external infrastructure for provisioning and performing the initial installation (the OpenShift documentation refers to it as User Provisioned Infrastructure – UPI). In our case, we used the x86 server for provisioning and for booting nodes through PXE boot. Once installed, nodes perform upgrades automatically.

Creating the System Configurations

Red Hat Enterprise Linux CoreOS uses Ignition for the system configuration. Ignition provides functionality similar to cloud-init and allows systems to be configured during the first boot.

The Ignition files are generated by the OpenShift installer from the configuration file install-config.yaml. This file describes the cluster through various parameters and also includes an SSH key and credentials for pulling containers from the Red Hat container repository. The OpenShift tools and the Pull Secret can be downloaded from the OpenShift Start Page.

apiVersion: v1
baseDomain: nvidia.com
compute:
- hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 2
controlPlane:
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: dgxpod
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  machineCIDR: 10.0.0.0/16
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
pullSecret: '{"auths": ….}'
sshKey: ssh-rsa ...

The parameters baseDomain and metadata:name form the domain name of the cluster (dgxpod.nvidia.com). The network parameters describe the internal network of the OpenShift cluster and only need to be modified if they conflict with the external network.
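
As a rough illustration of both points, the following sketch builds the cluster domain and checks the internal networks against an external one using Python's standard ipaddress module. The external network 10.33.3.0/24 is an assumption: only the 10.33.3.x host addresses appear in this post, and the prefix length is hypothetical. The function names are ours.

```python
import ipaddress

# Values taken from install-config.yaml.
BASE_DOMAIN = "nvidia.com"
CLUSTER_NAME = "dgxpod"
INTERNAL_NETWORKS = {
    "clusterNetwork": "10.128.0.0/14",
    "serviceNetwork": "172.30.0.0/16",
}

# Assumption: the external network is a /24 (prefix length is hypothetical).
EXTERNAL_NETWORK = "10.33.3.0/24"


def cluster_domain(name: str, base_domain: str) -> str:
    """Join metadata.name and baseDomain into the cluster domain."""
    return f"{name}.{base_domain}"


def conflicts(internal: dict, external: str) -> list:
    """Return the names of internal networks that overlap the external one."""
    ext = ipaddress.ip_network(external)
    return [key for key, cidr in internal.items()
            if ipaddress.ip_network(cidr).overlaps(ext)]


print(cluster_domain(CLUSTER_NAME, BASE_DOMAIN))
print(conflicts(INTERNAL_NETWORKS, EXTERNAL_NETWORK))
```

An empty list means the defaults can stay as they are; any name in the list marks a network whose CIDR would need to change.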

The following commands create the Ignition files for the nodes and the authentication file for the cluster. Because these commands delete install-config.yaml, we kept a copy of it outside the ignition directory. The generated authentication file (ignition/auth/kubeconfig) should be renamed and copied to $HOME/.kube/config.

mkdir ignition
cp install-config.yaml ignition
openshift-install --dir ignition create ignition-configs

DHCP and PXE Boot

Setting up PXE boot is certainly not an easy feat; providing detailed instructions is beyond the scope of this post. Readers should have knowledge of setting up PXE booting and DHCP. The following snippets only cover the DNS configuration for dnsmasq.

The address directive in the dnsmasq configuration file allows for using a wildcard to resolve any *.apps request with the address of the load balancer. The SRV entries allow the cluster to access the etcd service.

# Add hosts file
addn-hosts=/etc/hosts.dnsmasq

# Forward all *.apps.dgxpod.nvidia.com to the load balancer
address=/apps.dgxpod.nvidia.com/10.33.3.54

# SRV DNS records
srv-host=_etcd-server-ssl._tcp.dgxpod.nvidia.com,etcd-0.dgxpod.nvidia.com,2380,0,10
srv-host=_etcd-server-ssl._tcp.dgxpod.nvidia.com,etcd-1.dgxpod.nvidia.com,2380,0,10
srv-host=_etcd-server-ssl._tcp.dgxpod.nvidia.com,etcd-2.dgxpod.nvidia.com,2380,0,10
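
The three SRV entries differ only in the etcd index, so they can be generated. This is a small sketch, assuming the dnsmasq srv-host field order of service name, target, port, priority, and weight; the function name and its defaults are ours.

```python
def etcd_srv_records(cluster_domain: str, masters: int, port: int = 2380,
                     priority: int = 0, weight: int = 10) -> list:
    """Build one dnsmasq srv-host line per etcd member (one per master)."""
    service = f"_etcd-server-ssl._tcp.{cluster_domain}"
    return [
        f"srv-host={service},etcd-{i}.{cluster_domain},"
        f"{port},{priority},{weight}"
        for i in range(masters)
    ]


for line in etcd_srv_records("dgxpod.nvidia.com", masters=3):
    print(line)
```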

The corresponding /etc/hosts.dnsmasq file lists the IP addresses and host names. Note that OpenShift requires the first entry for each host to be the node name, such as master-0. The api-int and api entries point to the load balancer.

10.33.3.44  worker-0.dgxpod.nvidia.com
10.33.3.46  worker-1.dgxpod.nvidia.com

10.33.3.50  master-0.dgxpod.nvidia.com etcd-0.dgxpod.nvidia.com
10.33.3.51  master-1.dgxpod.nvidia.com etcd-1.dgxpod.nvidia.com
10.33.3.52  master-2.dgxpod.nvidia.com etcd-2.dgxpod.nvidia.com

10.33.3.53  bootstrap.dgxpod.nvidia.com

10.33.3.54  api-int.dgxpod.nvidia.com api.dgxpod.nvidia.com
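
Mistakes in this file are easy to make and hard to spot during installation, so a quick consistency check can help. The sketch below is only an illustration of the two rules stated above (node name first, api and api-int on the load balancer); the helper names are ours, and it does not cover everything OpenShift expects from DNS.

```python
HOSTS_DNSMASQ = """\
10.33.3.44  worker-0.dgxpod.nvidia.com
10.33.3.46  worker-1.dgxpod.nvidia.com

10.33.3.50  master-0.dgxpod.nvidia.com etcd-0.dgxpod.nvidia.com
10.33.3.51  master-1.dgxpod.nvidia.com etcd-1.dgxpod.nvidia.com
10.33.3.52  master-2.dgxpod.nvidia.com etcd-2.dgxpod.nvidia.com

10.33.3.53  bootstrap.dgxpod.nvidia.com

10.33.3.54  api-int.dgxpod.nvidia.com api.dgxpod.nvidia.com
"""

DOMAIN = "dgxpod.nvidia.com"
LOAD_BALANCER = "10.33.3.54"  # address used in the dnsmasq address directive


def parse_hosts(text: str) -> list:
    """Return (ip, [names]) tuples, skipping blank lines and comments."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            ip, *names = line.split()
            entries.append((ip, names))
    return entries


def check(entries: list, domain: str = DOMAIN,
          load_balancer: str = LOAD_BALANCER) -> list:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    by_name = {name: ip for ip, names in entries for name in names}
    for ip, names in entries:
        # The etcd alias must not come first: the first name is the node name.
        if names and names[0].startswith("etcd-"):
            problems.append(f"{ip}: first entry {names[0]} is not the node name")
    for name in (f"api.{domain}", f"api-int.{domain}"):
        if by_name.get(name) != load_balancer:
            problems.append(f"{name} does not point to the load balancer")
    return problems


print(check(parse_hosts(HOSTS_DNSMASQ)))
```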

The following pxelinux.cfg file is an example of a non-EFI PXE boot configuration. It defines the kernel and initial ramdisk, and provides additional command line arguments. Note that the arguments prefixed with coreos are passed to the CoreOS installer.

DEFAULT rhcos
PROMPT 0
TIMEOUT 0

LABEL rhcos
	kernel rhcos/rhcos-410.8.20190425.1-installer-kernel
	initrd rhcos/rhcos-410.8.20190425.1-installer-initramfs.img
	append ip=dhcp rd.neednet=1 console=tty0 console=ttyS0 coreos.inst=yes coreos.inst.install_dev=vda coreos.inst.image_url=http://10.33.3.18/rhcos/rhcos-410.8.20190412.1-metal-bios.raw coreos.inst.ignition_url=http://10.33.3.18/rhcos/ignition/master.ign
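
Each node role boots with its own Ignition file, so the append line differs per role. As a sketch, the line can be generated as shown below. The role names follow the bootstrap.ign, master.ign, and worker.ign files written by openshift-install; the function name, its parameters, and the defaults (copied from the example above) are ours.

```python
ROLES = ("bootstrap", "master", "worker")


def append_line(role: str,
                http_base: str = "http://10.33.3.18/rhcos",
                image: str = "rhcos-410.8.20190412.1-metal-bios.raw",
                install_dev: str = "vda") -> str:
    """Build the pxelinux append line for one node role."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    # Ordinary kernel arguments.
    kernel_args = ["ip=dhcp", "rd.neednet=1", "console=tty0", "console=ttyS0"]
    # Arguments prefixed with coreos are consumed by the CoreOS installer.
    installer_args = [
        "coreos.inst=yes",
        f"coreos.inst.install_dev={install_dev}",
        f"coreos.inst.image_url={http_base}/{image}",
        f"coreos.inst.ignition_url={http_base}/ignition/{role}.ign",
    ]
    return "append " + " ".join(kernel_args + installer_args)


for role in ROLES:
    print(append_line(role))
```

The install device vda matches the virtual machines used here; bare-metal nodes would likely need a different device name.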

Kernel, initramfs, and raw images are available from the OpenShift Mirror. The installation instructions, Installing a cluster on bare metal, provide the latest version and download path. The image files and the Ignition configurations from the previous step should be copied to the directory served by the HTTP server. Ensure that the proper HTTP SELinux label is set for all these files. Note that the DGX-1 systems only support UEFI for network booting and thus require a different set of files.

Load Balancer

The load balancer distributes requests across the online nodes. We ran an instance of CentOS in a separate virtual machine and used HAProxy with the following configuration.

listen ingress-http
    bind *:80
    mode tcp
    server worker-0 worker-0.dgxpod.nvidia.com:80 check
    server worker-1 worker-1.dgxpod.nvidia.com:80 check

listen ingress-https
    bind *:443
    mode tcp
    server worker-0 worker-0.dgxpod.nvidia.com:443 check
    server worker-1 worker-1.dgxpod.nvidia.com:443 check

listen api
    bind *:6443
    mode tcp
    server bootstrap bootstrap.dgxpod.nvidia.com:6443 check
    server master-0 master-0.dgxpod.nvidia.com:6443 check
    server master-1 master-1.dgxpod.nvidia.com:6443 check
    server master-2 master-2.dgxpod.nvidia.com:6443 check

listen machine-config-server
    bind *:22623
    mode tcp
    server bootstrap bootstrap.dgxpod.nvidia.com:22623 check
    server master-0 master-0.dgxpod.nvidia.com:22623 check
    server master-1 master-1.dgxpod.nvidia.com:22623 check
    server master-2 master-2.dgxpod.nvidia.com:22623 check
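
The four sections follow one pattern: ingress traffic goes to the workers, while the API and the machine config server go to the bootstrap and master nodes. A hypothetical generator such as the following makes that pattern explicit; the names and table layout are ours, and it reproduces only the directives used above.

```python
DOMAIN = "dgxpod.nvidia.com"
WORKERS = ["worker-0", "worker-1"]
CONTROL = ["bootstrap", "master-0", "master-1", "master-2"]

# (section name, port, backend hosts)
FRONTENDS = [
    ("ingress-http", 80, WORKERS),
    ("ingress-https", 443, WORKERS),
    ("api", 6443, CONTROL),
    ("machine-config-server", 22623, CONTROL),
]


def listen_block(name: str, port: int, hosts: list,
                 domain: str = DOMAIN) -> str:
    """Render one HAProxy listen section in TCP mode."""
    lines = [f"listen {name}", f"    bind *:{port}", "    mode tcp"]
    lines += [f"    server {h} {h}.{domain}:{port} check" for h in hosts]
    return "\n".join(lines)


def render(frontends: list = FRONTENDS) -> str:
    """Render all sections, separated by blank lines."""
    return "\n\n".join(listen_block(*f) for f in frontends)


print(render())
```

Removing "bootstrap" from CONTROL and rendering again gives the configuration for the time after the bootstrap node has been deleted.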

Creating the Bootstrap and Master Nodes

The virt-install command allows for easy deployment of the bootstrap and master nodes. NODE-NAME should be replaced with the actual name of the node and NODE-MAC with the corresponding network address (MAC) for the node.

virt-install --connect qemu:///system --name <NODE-NAME> --ram 8192 --vcpus 4 --os-type=linux --os-variant=virtio26 --disk path=/var/lib/libvirt/images/<NODE-NAME>.qcow2,device=disk,bus=virtio,format=qcow2,size=20 --pxe --network bridge=virbr0 -m <NODE-MAC> --graphics vnc,listen=0.0.0.0 --noautoconsole

After the initial installation completes, the VM exits and must be restarted manually. The state of the VMs can be monitored using sudo virsh list, which prints all active VMs. Once they disappear from the list, they can be restarted with virsh start <NODE-NAME>. In the prerelease version of OpenShift, resolv.conf wasn’t updated, which required rebooting each node one more time with virsh reset.
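
Watching the list by hand gets tedious with four VMs. The sketch below shows one way to automate the restart, assuming the VMs are named after the nodes (the names in the usage note are hypothetical). It relies on virsh list --name, which prints the names of active domains one per line, and on virsh start.

```python
import subprocess


def stopped_nodes(expected: list, virsh_names_output: str) -> list:
    """Return the expected VMs that are missing from `virsh list --name`."""
    running = {line.strip() for line in virsh_names_output.splitlines()
               if line.strip()}
    return [name for name in expected if name not in running]


def restart_stopped(expected: list) -> list:
    """Start every expected VM that is not currently active."""
    output = subprocess.run(["virsh", "list", "--name"], check=True,
                            capture_output=True, text=True).stdout
    stopped = stopped_nodes(expected, output)
    for name in stopped:
        subprocess.run(["virsh", "start", name], check=True)
    return stopped
```

Calling restart_stopped(["bootstrap", "master-0", "master-1", "master-2"]) periodically would bring each VM back once its installation has finished. Note that a VM that was never defined also counts as stopped here, in which case virsh start fails.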

The entire installation process of the cluster should take less than an hour, assuming correct settings and configurations. The initial bootstrapping progress can be monitored using the following command.

openshift-install --dir ignition wait-for bootstrap-complete

The bootstrap node can be deleted when the bootstrapping completes. Next, to wait for the entire installation process to complete, use:

openshift-install --dir ignition wait-for install-complete

The pre-release version of the installer reported errors at times but eventually completed successfully. Because it also didn’t automatically approve pending certificate signing requests (CSRs), we added the following crontab entry that ran every 10 minutes.

*/10 *  *  *  * dgxuser oc get csr -ojson | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve
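
The jq filter in this entry selects every CSR whose status is still an empty object, meaning it has been neither approved nor denied. For readers less familiar with jq, the same selection can be sketched in Python on the output of oc get csr -ojson. The sample data is made up for illustration and the function name is ours.

```python
import json


def pending_csr_names(csr_list: dict) -> list:
    """Names of CSRs whose status is an empty object (not yet decided)."""
    return [item["metadata"]["name"]
            for item in csr_list.get("items", [])
            if item.get("status") == {}]


# Made-up sample in the shape returned by `oc get csr -ojson`.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "csr-aaaaa"}, "status": {}},
    {"metadata": {"name": "csr-bbbbb"},
     "status": {"conditions": [{"type": "Approved"}]}}
  ]
}
""")

print(pending_csr_names(sample))
```

Only csr-aaaaa is selected, because csr-bbbbb already carries an Approved condition.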

GPU Support

NVIDIA and Red Hat continue to work together to provide a straightforward mechanism for deploying and managing GPU drivers. The Node Feature Discovery Operator (NFD) and GPU Operator build the foundation for this improved mechanism and will be available from the Red Hat Operator Hub. This allows having an optimized software stack deployed at all times. The following instructions describe the manual steps to install these operators.

The NFD detects hardware features and configurations in the OpenShift cluster, such as CPU type and extensions or, in our case, NVIDIA GPUs.

git clone https://github.com/openshift/cluster-nfd-operator
cd cluster-nfd-operator/manifests
oc create -f .
oc create -f cr/nfd_cr.yaml

After the installation completes, the NVIDIA GPU should show up in the feature list for the worker nodes; the final software will provide a human-readable name instead of the vendor ID (0x10de for NVIDIA).

oc describe node worker-0|grep 10de
                    feature.node.kubernetes.io/pci-10de.present=true
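
The same check can be run across all nodes at once. The following sketch works on the output of oc get nodes -o json and looks for the label shown above; the function names and the sample data are ours.

```python
NVIDIA_PCI_VENDOR_ID = "10de"


def has_pci_vendor(labels: dict,
                   vendor_id: str = NVIDIA_PCI_VENDOR_ID) -> bool:
    """True if NFD labeled the node with a PCI device from this vendor."""
    key = f"feature.node.kubernetes.io/pci-{vendor_id}.present"
    return labels.get(key) == "true"  # Kubernetes label values are strings


def gpu_nodes(node_list: dict) -> list:
    """Names of all nodes carrying the NVIDIA PCI vendor label."""
    return [node["metadata"]["name"]
            for node in node_list.get("items", [])
            if has_pci_vendor(node["metadata"].get("labels", {}))]


# Made-up sample in the shape returned by `oc get nodes -o json`.
sample = {"items": [
    {"metadata": {"name": "worker-0", "labels": {
        "feature.node.kubernetes.io/pci-10de.present": "true"}}},
    {"metadata": {"name": "master-0", "labels": {}}},
]}

print(gpu_nodes(sample))
```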

The Special Resource Operator (SRO) provides a template for accelerator cards. It activates when such a component is detected and installs the correct drivers and other software components.

The development version of the Special Resource Operator already includes support for NVIDIA GPUs and will merge into the NVIDIA GPU Operator when it becomes available. It manages the installation process of all required NVIDIA drivers and software components.

git clone https://github.com/zvonkok/special-resource-operator
cd special-resource-operator/manifests
oc create -f .
cd cr
oc create -f sro_cr_sched_none.yaml

The following nvidia-smi.yaml file defines a Kubernetes Pod that can be used for a quick validation. It allocates a single GPU and runs the nvidia-smi command.

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  containers:
  - image: nvidia/cuda
    name: nvidia-smi
    command: [ nvidia-smi ]
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1

The oc create -f nvidia-smi.yaml command creates and runs the pod. To monitor the progress of the pod creation, use oc describe pod nvidia-smi. When completed, the output of the nvidia-smi command can be viewed with oc logs nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   36C    P0    41W / 300W |      0MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Finally, the pod can be deleted with oc delete pod nvidia-smi.
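
The same validation can be repeated with more than one GPU, for example all eight GPUs of a DGX-1. The sketch below generates the pod definition as JSON, which oc create -f accepts just like YAML. The function name and the output file name in the usage note are ours. Requests and limits are kept identical because Kubernetes requires that for extended resources such as nvidia.com/gpu.

```python
import json


def nvidia_smi_pod(gpus: int = 1, name: str = "nvidia-smi") -> dict:
    """Build the validation pod from above for a given number of GPUs."""
    if gpus < 1:
        raise ValueError("at least one GPU must be requested")
    gpu_resources = {"nvidia.com/gpu": gpus}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "image": "nvidia/cuda",
                "name": name,
                "command": ["nvidia-smi"],
                "resources": {
                    "limits": dict(gpu_resources),
                    "requests": dict(gpu_resources),
                },
            }],
        },
    }


print(json.dumps(nvidia_smi_pod(gpus=8), indent=2))
```

Redirecting the output to a file such as nvidia-smi-8.json and passing it to oc create -f should list all allocated GPUs in the pod log, assuming the node has that many GPUs free.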

Conclusion

Introducing Operators and an immutable infrastructure built on top of Red Hat Enterprise Linux CoreOS brings exciting improvements to OpenShift 4. It simplifies deploying and managing an optimized software stack for multi-node, large-scale, GPU-accelerated data centers. Even in this preview, the new features already look solid, and we expect customers to benefit from them going forward.