Configure NVIDIA GPU Operator for PCI Passthrough
This is a pre-release feature and may not function as described in the StarlingX 5 documentation.
This section provides instructions for configuring NVIDIA GPU Operator.
About this task
Note
NVIDIA GPU Operator is supported only with the standard performance kernel profile. The low-latency performance kernel profile is not supported.
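One way to confirm the kernel profile, assuming the low-latency profile is surfaced through the host's subfunctions on this release, is to check that lowlatency does not appear in the host's subfunctions field:
~(keystone_admin)]$ system host-show <hostname> | grep subfunctions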
NVIDIA GPU Operator automates the installation, maintenance, and management of the NVIDIA software required to provision NVIDIA GPUs, and the provisioning of pods that request nvidia.com/gpu resources.
NVIDIA GPU Operator is delivered as a Helm chart that installs a number of services and pods to automate the provisioning of NVIDIA GPUs with the required NVIDIA software components. These components include:
NVIDIA drivers (to enable CUDA, NVIDIA's parallel computing platform)
Kubernetes device plugin for GPUs
NVIDIA Container Runtime
Automatic node labelling (a label check example follows this list)
DCGM (NVIDIA Data Center GPU Manager) based monitoring
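For example, after the operator is deployed, the automatic node labelling can be inspected on a GPU host. The exact label set varies by release; the nvidia-prefixed labels shown by this filter are only an illustrative check:
~(keystone_admin)]$ kubectl get node <hostname> --show-labels | grep nvidia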
Prerequisites
Download the gpu-operator-v3-1.6.0.3.tgz file from http://mirror.starlingx.cengn.ca/mirror/starlingx/.
Use the following steps to configure the GPU Operator:
Procedure
Lock the host(s).
~(keystone_admin)]$ system host-lock <hostname>
Configure the container runtime host path to the NVIDIA runtime, which will be installed by the GPU Operator Helm deployment.
~(keystone_admin)]$ system service-parameter-add platform container_runtime custom_container_runtime=nvidia:/usr/local/nvidia/toolkit/nvidia-container-runtime
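To confirm the parameter was stored before unlocking, you can list the service parameters; filtering the output with grep, as shown here, is only an illustrative choice:
~(keystone_admin)]$ system service-parameter-list | grep container_runtime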
Unlock the host(s). Once unlocked, the host will reboot automatically.
~(keystone_admin)]$ system host-unlock <hostname>
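You can watch for the host to come back with the following command; the reboot is complete when the host reports unlocked, enabled, and available:
~(keystone_admin)]$ system host-show <hostname> | grep -E 'administrative|operational|availability'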
Create the RuntimeClass resource definition and apply it to the system.
cat > nvidia.yml << EOF
kind: RuntimeClass
apiVersion: node.k8s.io/v1beta1
metadata:
  name: nvidia
handler: nvidia
EOF
~(keystone_admin)]$ kubectl apply -f nvidia.yml
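To verify that the RuntimeClass resource was created:
~(keystone_admin)]$ kubectl get runtimeclass nvidia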
Install the GPU Operator Helm charts.
~(keystone_admin)]$ helm install gpu-operator /path/to/gpu-operator-v3-1.6.0.3.tgz
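You can confirm the Helm release was created before checking the pods; the release should report a deployed status:
~(keystone_admin)]$ helm list | grep gpu-operator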
Check that the GPU Operator is deployed using the following command.
~(keystone_admin)]$ kubectl get pods -A
NAMESPACE                NAME                                                           READY   STATUS    RESTARTS   AGE
default                  gpu-operator-596c49cb9b-2tdlw                                  1/1     Running   1          24h
default                  gpu-operator-node-feature-discovery-master-7f87b4d6bb-wsbn4   1/1     Running   2          24h
default                  gpu-operator-node-feature-discovery-worker-hqzvw               1/1     Running   4          24h
gpu-operator-resources   nvidia-container-toolkit-daemonset-8f7nl                       1/1     Running   0          14h
gpu-operator-resources   nvidia-device-plugin-daemonset-g9lmk                           1/1     Running   0          14h
gpu-operator-resources   nvidia-device-plugin-validation                                0/1     Pending   0          24h
gpu-operator-resources   nvidia-driver-daemonset-9mnwr                                  1/1     Running   0          14h
Once validation succeeds, the nvidia-device-plugin-validation pod is marked Completed.
Check if the nvidia.com/gpu resources are available using the following command.
~(keystone_admin)]$ kubectl describe nodes <hostname> | grep nvidia
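If the resources are available, the matching lines include an nvidia.com/gpu entry under the node's Capacity and Allocatable sections. Output similar to the following is expected; the GPU count shown is illustrative:
  nvidia.com/gpu:  1
  nvidia.com/gpu:  1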
Create a pod that uses the NVIDIA RuntimeClass and requests an nvidia.com/gpu resource. Update the nvidia-usage-example-pod.yml file to launch a pod that uses an NVIDIA GPU. For example:
cat <<EOF > nvidia-usage-example-pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-usage-example-pod
spec:
  runtimeClassName: nvidia
  containers:
  - name: nvidia-usage-example-pod
    image: nvidia/samples:cuda10.2-vectorAdd
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
EOF
Create the pod using the following command.
~(keystone_admin)]$ kubectl create -f nvidia-usage-example-pod.yml
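Wait for the pod to reach the Running state before checking GPU access. Output similar to the following is expected, with the AGE value illustrative:
~(keystone_admin)]$ kubectl get pod nvidia-usage-example-pod
NAME                       READY   STATUS    RESTARTS   AGE
nvidia-usage-example-pod   1/1     Running   0          1m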
Check that the pod has been set up correctly. The status of the NVIDIA device is displayed in the nvidia-smi output table.
~(keystone_admin)]$ kubectl exec -it nvidia-usage-example-pod -- nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
| N/A   28C    P8    14W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
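As an additional optional check, the NVIDIA container runtime typically exposes the allocated device to the container through the NVIDIA_VISIBLE_DEVICES environment variable; this behavior is an assumption about the device plugin configuration, and the device value shown would be system-specific:
~(keystone_admin)]$ kubectl exec nvidia-usage-example-pod -- env | grep NVIDIA_VISIBLE_DEVICES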
For information on deleting the GPU Operator, see Delete the GPU Operator.