How to Configure the NVIDIA vGPU Drivers, CUDA Toolkit and Container Toolkit on Debian 12

As I’ve started building more GPU-enabled workloads in my home lab, I’ve found myself repeating a few steps to get the required software installed. It involved multiple tools, and I was referencing multiple sources in the vendor documentation.

I wanted to pull everything together into one document, both to capture my process so I can automate it and to share it with others who are looking at the same thing.

So this post covers the steps for installing and configuring the NVIDIA drivers, the CUDA Toolkit, and the Container Toolkit on vSphere virtual machines running Debian 12.

Install NVIDIA Driver Prerequisites

There are a few prerequisites required before installing the NVIDIA drivers: the kernel headers, the packages needed to compile the driver, and disabling the Nouveau driver. The NVIDIA CUDA repository will be added later as part of the CUDA Toolkit installation.

#Install Prerequisites
sudo apt-get install xfsprogs wget git python3 python3-venv python3-pip p7zip-full build-essential -y
sudo apt-get install linux-headers-$(uname -r) -y

#Disable Nouveau
lsmod | grep nouveau

cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u

Reboot the system after the initramfs build completes.

sudo reboot
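
After the reboot, you can confirm that Nouveau is no longer loaded. The same lsmod check from earlier should now return no output.

#Verify Nouveau is no longer loaded (no output expected)
lsmod | grep nouveau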

Install the NVIDIA Drivers

NVIDIA provides both .run and .deb installer options for Debian-based operating systems. I use the .run option because that is what I am most familiar with. The .run file needs to be made executable because it does not have execute permissions by default. I also install with the --dkms flag so the driver will be recompiled automatically whenever the kernel is updated.

The vGPU drivers are distributed through the NVIDIA Enterprise Software Licensing portal as part of the NVIDIA Virtual GPU (vGPU) or NVIDIA AI Enterprise product sets, and they require a license to use. If you are using PCI Passthrough instead of vGPU, you can download the NVIDIA Data Center/Tesla drivers from the data center driver download page.

I am using the NVAIE product set for some of my testing, so I will be installing a vGPU driver. The steps to install the Driver, CUDA Toolkit, and Container Toolkit are the same whether you are using a regular data center driver or the vGPU driver. You will not need to configure any licensing when using PCI Passthrough.

Download the driver, copy it over to the virtual machine, and set the executable flag on the file.

sudo chmod +x NVIDIA-Linux-x86_64-550.54.15-grid.run
sudo bash ./NVIDIA-Linux-x86_64-550.54.15-grid.run --dkms

Click OK for any messages that are displayed during the install. Once the installation is complete, reboot the server.

After the reboot, run the following command to verify that the driver is installed properly.

nvidia-smi

You should see the standard nvidia-smi table listing the GPU, the driver version, and the supported CUDA version.
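
Because the driver was installed with the --dkms flag, an optional extra check is to confirm that the kernel module was registered with DKMS:

#Optional: confirm the NVIDIA module is registered with DKMS
dkms status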

Installing the CUDA Toolkit

Like the GRID driver installer, NVIDIA distributes the CUDA Toolkit as both a .run and a .deb installer. For this step, I’ll be using the .deb installer as it works with Debian’s built-in package management, can handle upgrades when new CUDA versions are released, and offers multiple meta-package installation options that are documented in the CUDA installation documentation.

By default, the CUDA Toolkit installer will try to install an NVIDIA driver. Since this deployment is using a vGPU driver, we don’t want to use the driver included with CUDA. NVIDIA is very prescriptive about which driver versions work with vGPU, and installing a different driver, even one with the same version number, will result in errors.

The first step is to install the CUDA keyring and enable the contrib repository.  The keyring file contains the repository information and the GPG signing key.  Use the following commands to complete this step:

wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo add-apt-repository contrib
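
One note: on a minimal Debian 12 install, the add-apt-repository command may not be present. It is provided by the software-properties-common package, so you may need to install that first (or add contrib to your apt sources by hand).

#add-apt-repository is provided by software-properties-common on Debian
sudo apt-get install -y software-properties-common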

The next step is to update the apt package lists and install the CUDA Toolkit. The CUDA Toolkit requires a number of additional packages that will be installed alongside the main application.

sudo apt-get update && sudo apt-get -y install cuda-toolkit-12-5

The package installer does not add CUDA to the system PATH variable, so we need to do this manually. The way I’ve done this is to create a login script that applies to all users using the following command (note the quoted EOF, which keeps the ${PATH} expression from being expanded when the file is written). The CUDA folder path is versioned, so this script to set the PATH variable will need to be updated when the CUDA version changes.

cat <<'EOF' | sudo tee /etc/profile.d/nvidia.sh
export PATH="/usr/local/cuda-12.5/bin${PATH:+:${PATH}}"
EOF
sudo chmod +x /etc/profile.d/nvidia.sh

Once our script is created, we need to apply the updated PATH variable and test our CUDA Toolkit installation to make sure it is working properly.  

source /etc/profile.d/nvidia.sh
nvcc --version

If the PATH variable is updated properly, nvcc will report the installed CUDA compiler version.

If you receive a command not found error, the PATH variable has not been set properly, and you need to review and rerun the script that contains your export command.
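
A quick way to narrow down the problem is to check whether the CUDA bin directory is actually on your PATH and that nvcc exists at the versioned location used in the script:

#Check whether the CUDA bin directory is on the PATH
echo $PATH | tr ':' '\n' | grep cuda
#Confirm nvcc exists at the versioned location
ls /usr/local/cuda-12.5/bin/nvcc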

NVIDIA Container Toolkit

If you are planning to use container workloads with your GPU, you will need to install the NVIDIA Container Toolkit.  The Container Toolkit provides a container runtime library and utilities to configure containers to utilize NVIDIA GPUs.  The Container Toolkit is distributed from an apt repository.

Note: The CUDA toolkit is not required if you are planning to only use container workloads with the GPU.  An NVIDIA driver is still required on the host or VM.

The first step for installing the NVIDIA Container Toolkit on Debian is to import the Container Toolkit apt repository.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Update the apt package lists and install the container toolkit.

sudo apt-get update && sudo apt-get install nvidia-container-toolkit

Docker needs to be configured and restarted after the container toolkit is installed.  

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Note: Other container runtimes are supported.  Please see the documentation to see the supported container runtimes and their configuration instructions.
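
For example, if you are using containerd instead of Docker, the same nvidia-ctk tool generates the runtime configuration; the commands below are a sketch based on the Container Toolkit documentation.

#Configure containerd for the NVIDIA runtime and restart it
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd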

After restarting your container runtime, you can run a test workload to make sure the container toolkit is installed properly.

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Using NVIDIA GPUs with Docker Compose

GPUs can be utilized with container workloads managed by Docker Compose.  You will need to add the following lines, modified to fit your environment, to the container definition in your Docker Compose file.  Please see the Docker Compose documentation for more details.

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities:
            - gpu
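
To put that snippet in context, here is a minimal sketch of a complete Compose file that runs nvidia-smi inside a container, reusing the same ubuntu image as the Docker test above. The service name and file location are just placeholders.

#Write a minimal compose file and run it (sketch)
cat <<'EOF' > docker-compose.yml
services:
  gpu-test:
    image: ubuntu
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu
EOF
sudo docker compose up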

Configuring NVIDIA vGPU Licensed Features

Your machine will need to check out a license if NVIDIA vGPU or NVAIE are being used, and the NVIDIA vGPU driver will need to be configured with a license server.  The steps for setting up a cloud or local instance of the NVIDIA License System are beyond the scope of this post, but they can be found in the NVIDIA License System documentation.

Note: You do not need to complete these steps if you are using the Data Center Driver with PCI Passthrough. Licensing is only required if you are using vGPU or NVAIE features.

A client configuration token will need to be generated and downloaded once the license server instance has been set up. The steps for downloading the client configuration token can be found in the NVIDIA License System documentation for both CLS (cloud-hosted) instances and DLS (Delegated License Service, hosted locally) instances.

After generating and downloading the client configuration token, it needs to be placed on your virtual machine in the /etc/nvidia/ClientConfigToken directory. This directory is locked down by default and requires root or sudo access for any file operations, so you may need to copy the token file to your home directory on the VM first and then use sudo to move it into the ClientConfigToken directory. Alternatively, you can place the token file on a local web server and use wget/cURL to download it directly.

In my lab, I did the following:

sudo wget -P /etc/nvidia/ClientConfigToken https://web-server-placeholder-url/NVIDIA/License/client_configuration_token_05-22-2024-22-41-58.tok

The token file needs to be made readable by all users after downloading it into the /etc/nvidia/ClientConfigToken directory.

sudo chmod 744 /etc/nvidia/ClientConfigToken/client_configuration_token_*.tok

The final step is to enable the vGPU licensed features. This is done by creating the gridd.conf file from the supplied template and enabling the vGPU feature type. First, copy the gridd.conf.template file using the following command.

sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

The next step is to edit the file, find the line called FeatureType, and change the value from 0 to 1.

sudo nano /etc/nvidia/gridd.conf
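
If you would rather make the change non-interactively (for example, when scripting this build), a one-line sed edit works as well. This assumes the template ships with an uncommented FeatureType=0 line.

#Set FeatureType to 1 (NVIDIA vGPU) without opening an editor
sudo sed -i 's/^FeatureType=0/FeatureType=1/' /etc/nvidia/gridd.conf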

Finally, restart the NVIDIA GRID daemon.

sudo systemctl restart nvidia-gridd

You can check the service status with the sudo systemctl status nvidia-gridd command to see if a license was successfully checked out.  You can also log into your license service portal and review the logs to see licensing activity.
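
The license state is also reported by the driver itself, so a couple of quick queries can confirm the checkout without logging into the portal. The exact field names vary by driver version, so treat this as a sketch.

#Check the license status reported by the driver
nvidia-smi -q | grep -i -A2 license
#Review recent nvidia-gridd log messages
sudo journalctl -u nvidia-gridd --no-pager | tail -n 20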

Sources

While creating this post, I pulled from the following links and sources.

https://docs.nvidia.com/cuda/cuda-installation-guide-linux

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#meta-packages

https://docs.nvidia.com/grid/17.0/grid-vgpu-user-guide/index.html#installing-vgpu-drivers-linux-from-run-file

https://docs.nvidia.com/grid/17.0/grid-vgpu-user-guide/index.html#installing-vgpu-drivers-linux-from-debian-package

https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/nouveau.html

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

https://docs.docker.com/compose/gpu-support/

https://docs.nvidia.com/license-system/latest/index.html

Configuring a Headless CentOS Virtual Machine for NVIDIA GRID vGPU #blogtober

When IT administrators think of GPUs, the first thing that comes to mind for many is gaming.  But GPUs also have business applications.  They’re mainly found in high-end workstations to support graphics-intensive applications like 3D CAD and medical imaging.

But GPUs will have other uses in the enterprise.  Many of the emerging technologies, such as artificial intelligence and deep learning, utilize GPUs to perform compute operations.  These will start finding their way into the data center, either as part of line-of-business applications or as part of IT operations tools.  This could also allow the business to utilize GRID environments after hours for other forms of data processing.

This guide will show you how to build headless virtual machines that can take advantage of NVIDIA GRID vGPU for GPU compute and CUDA.  In order to do this, you will need to have a Pascal Series NVIDIA Tesla card such as the P4, P40, or P100 and the GRID 5.0 drivers.  The GRID components will also need to be configured in your hypervisor, and you will need to have the GRID drivers for Linux.

I’ll be using CentOS 7.x for this guide.  My base CentOS configuration is a minimal install with no graphical shell and a few additional packages like Nano and Open VM Tools.  I use Bob Plankers’ guide for preparing my VM as a template.

The steps for setting up a headless CentOS VM with GRID are:

  1. Deploy your CentOS VM.  This can be from an existing template or installed from scratch.  This VM should not have a graphical shell installed, or it should be in a run mode that does not execute the GUI.
  2. Attach a GRID profile to the virtual machine by adding a shared PCI device in vCenter.  The selected profile will need to be one of the Virtual Workstation profiles, and these all end with a Q.
  3. GRID requires a 100% memory reservation.  When you add an NVIDIA GRID shared PCI device, there will be an associated prompt to reserve all system memory.
  4. Update the VM to ensure all applications and components are the latest version using the following command:
    yum update -y
  5. In order to build the GRID driver for Linux, you will need to install a few additional packages.  Install these packages with the following command:
    yum install -y epel-release dkms libstdc++.i686 gcc kernel-devel 
  6. Copy the Linux GRID drivers to your VM using a tool like WinSCP.  I generally place the files in /tmp.
  7. Make the driver package executable with the following command:
    chmod +x NVIDIA-Linux-x86_64-384.73-grid.run
  8. Execute the driver package.  We will also add the --dkms flag for Dynamic Kernel Module Support.  This enables the system to automatically recompile the driver whenever a kernel update is installed.  The command to run the driver install is:
    bash ./NVIDIA-Linux-x86_64-384.73-grid.run --dkms
  9. When prompted to register the kernel module sources with DKMS, select Yes and press Enter.
  10. You may receive an error about the installer not being able to locate the X Server path.  Click OK.  It is safe to ignore this error.
  11. Install the 32-bit Compatibility Libraries by selecting Yes and pressing Enter.
  12. At this point, the installer will start to build the DKMS module and install the driver.
  13. After the install completes, you will be prompted to use the nvidia-xconfig utility to update your X Server configuration.  X Server should not be installed because this is a headless machine, so select No and press Enter.
  14. The install is complete.  Press Enter to exit the installer.
  15. To validate that the NVIDIA drivers are installed and running properly, run nvidia-smi to get the status of the video card.
  16. Next, we’ll need to configure GRID licensing.  We’ll need to create the GRID licensing file from a template supplied by NVIDIA with the following command:
    cp  /etc/nvidia/gridd.conf.template  /etc/nvidia/gridd.conf
  17. Edit the GRID licensing file using the text editor of your choice.  I prefer Nano, so the command I would use is:
    nano  /etc/nvidia/gridd.conf
  18. Fill in the ServerAddress and BackupServerAddress fields with the fully-qualified domain name or IP addresses of your licensing servers.
  19. Set the FeatureType to 2 to configure the system to retrieve a Virtual Workstation license.  The Virtual Workstation license is required to support the CUDA features for GPU Compute.
  20. Save the license file.
  21. Restart the GRID Service with the following command:
    service nvidia-gridd restart
  22. Validate that the machine retrieved a license with the following command:
    grep gridd /var/log/messages
  23. Download the NVIDIA CUDA Toolkit.
    wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run
  24. Make the toolkit installer executable.
    chmod +x cuda_9.0.176_384.81_linux-run
  25. Execute the CUDA Toolkit installer.
    bash cuda_9.0.176_384.81_linux-run
  26. Accept the EULA.
  27. You will be prompted to download the CUDA Driver.  Press N to decline the new driver. This driver does not match the NVIDIA GRID driver version, and it will break the NVIDIA setup.  The GRID driver in the VM has to match the GRID software that is installed in the hypervisor.
  28. When prompted to install the CUDA 9.0 toolkit, press Y.
  29. Accept the Default Location for the CUDA toolkit.
  30. When prompted to create a symlink at /usr/local/cuda, press Y.
  31. When prompted to install the CUDA 9.0 samples, press Y.
  32. Accept the default location for the samples.
  33. Reboot the virtual machine.
  34. Log in and run nvidia-smi again.  Validate that you get the table output similar to step 15.  If you get an error instead, you likely installed the driver that is included with the CUDA toolkit, and you will need to start over.  A quick CUDA check using the bundled samples is sketched after this list.
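
Since the CUDA samples were installed in step 31, one way to confirm that CUDA itself is working (and not just the driver) is to build and run the deviceQuery sample. This is a sketch that assumes the samples were installed to the default location under your home directory.

cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery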

At this point, you have a headless VM with the NVIDIA Drivers and CUDA Toolkit installed.  So what can you do with this?  Just about anything that requires CUDA.  You can experiment with deep learning frameworks like Tensorflow, build virtual render nodes for tools like Blender, or even use Matlab for GPU compute.

Challenges of Building A Mail Gateway on Linux When You’re Not a Linux Person…

I’m a Microsoft guy through and through.  I started on Windows 95 as a teenager, learned the Microsoft Server stack in college, and every professional environment I have worked in was Active Directory-based.

One of the goals that I want to accomplish before my son’s first birthday is completing my Exchange 2007 certification.  To accomplish that, I’ve started building an Exchange 2007 VM on my home server.  I aim to have it running as a live, Internet-connected email system.

That means dealing with spam.  And viruses.  I think you know where I’m going with this, and what better way to filter spam than to build your own mail gateway?

One thing I’m not is a Linux/Unix person, but I’d like to learn it so I can expand my skill set.  This would be the perfect project as it should be well documented, and it would use far fewer system resources than an Exchange Edge Server.

After a few Google searches for information, I have 10 different sets of instructions for setting up Amavis-New, ClamAV, and SpamAssassin (with or without pyzor, razor, and dcc) on a CentOS box.

Unfortunately, none of the instructions match up or work properly.  Some want to edit files that don’t exist.  Some try to install items that aren’t in the base repositories (they can be found elsewhere, but that isn’t covered in the instructions).  Almost all of them are two to three years old.

Not that there is anything wrong with this.  The benefit of using a VM is that I can quickly revert to a snapshot of a clean install and try again.  I know it’s a very Windows-like way of solving the problem, but if it makes me more comfortable with the Linux command line, I don’t see any harm in doing it that way for now.

And maybe…hopefully…I’ll have an updated set of instructions by the time I’m done.