How to Configure the NVIDIA vGPU Drivers, CUDA Toolkit and Container Toolkit on Debian 12

As I’ve started building more GPU-enabled workloads in my home lab, I’ve found myself repeating a few steps to get the required software installed. The process involves multiple tools, and I found myself referencing multiple sources in the vendor documentation.

I wanted to pull everything together into one document, both to capture my process so I can automate it later and to help others who are looking at the same thing.

So this post covers the steps for installing and configuring the NVIDIA drivers, CUDA toolkit, and/or the Container Toolkit on vSphere virtual machines.

Install NVIDIA Driver Prerequisites

There are a few prerequisites required before installing the NVIDIA drivers.  These include installing the kernel headers and the build tools needed to compile the NVIDIA kernel modules, and disabling the Nouveau driver.  The NVIDIA CUDA repository will be added later as part of the CUDA Toolkit installation.

#Install Prerequisites
sudo apt-get install xfsprogs wget git python3 python3-venv python3-pip p7zip-full build-essential -y
sudo apt-get install linux-headers-$(uname -r) -y

#Disable Nouveau
lsmod | grep nouveau

cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u

Reboot the system after the initramfs build completes.

sudo reboot
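After the system comes back up, you can confirm that Nouveau is no longer loaded by rerunning the same check from earlier; the command should return no output.

lsmod | grep nouveau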

Install the NVIDIA Drivers

NVIDIA includes .run and .deb installer options for Debian-based operating systems.  I use the .run option because that is what I am most familiar with.  The run file will need to be made executable as it does not have the execute permission set by default. I also install using the --dkms flag so the driver module will be recompiled automatically if the kernel is updated.

The vGPU drivers are distributed through the NVIDIA Enterprise Software Licensing portal as part of the NVIDIA Virtual GPU and NVIDIA AI Enterprise product sets, and they require a license to use.  If you are using PCI Passthrough instead of GRID, you can download the NVIDIA Data Center/Tesla drivers from the data center driver download page.

I am using the NVAIE product set for some of my testing, so I will be installing a vGPU driver. The steps to install the Driver, CUDA Toolkit, and Container Toolkit are the same whether you are using a regular data center driver or the vGPU driver. You will not need to configure any licensing when using PCI Passthrough.

The driver needs to be downloaded, copied over to the virtual machine, and made executable.

sudo chmod +x NVIDIA-Linux-x86_64-550.54.15-grid.run
sudo bash ./NVIDIA-Linux-x86_64-550.54.15-grid.run --dkms

Click OK for any messages that are displayed during install.  Once the installation is complete, reboot the server.
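Since part of my goal is to automate this process, it is worth noting that the installer can also run unattended. Here is a sketch, assuming the vGPU build of the installer accepts the standard nvidia-installer flags:

#Unattended install - no prompts are displayed
sudo bash ./NVIDIA-Linux-x86_64-550.54.15-grid.run --dkms --silent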

After the reboot, run the following command to verify that the driver installed properly.

nvidia-smi

You should see the standard nvidia-smi table listing the GPU, driver version, and CUDA version.

Installing the CUDA Toolkit

Like the GRID driver installer, NVIDIA distributes the CUDA Toolkit as both a .run and a .deb installer. For this step, I’ll be using the .deb installer because it works with Debian’s built-in package management, can handle upgrades when new CUDA versions are released, and offers multiple meta-package installation options that are documented in the CUDA installation documentation.

By default, the CUDA toolkit installer will try to install an NVIDIA driver.  Since this deployment is using a vGPU driver, we don’t want to use the driver included with CUDA.  NVIDIA is very prescriptive about which driver versions work with vGPU, and installing a different driver, even if it is the same version, will result in errors.  

The first step is to install the CUDA keyring and enable the contrib repository.  The keyring file contains the repository information and the GPG signing key.  Use the following commands to complete this step:

wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo add-apt-repository contrib
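One note: on a minimal Debian 12 install, the add-apt-repository command may not be present.  It is provided by the software-properties-common package, which can be installed first if needed.

sudo apt-get install software-properties-common -y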

The next step is to update the apt package lists and install the CUDA Toolkit. Installing the cuda-toolkit-12-5 meta package, rather than the full cuda package, installs only the toolkit and leaves the existing vGPU driver in place. The toolkit pulls in a number of additional packages that will be installed alongside the main application.

sudo apt-get update && sudo apt-get -y install cuda-toolkit-12-5

The package installer does not add CUDA to the system PATH variable, so we need to do this manually.  The way I’ve done this is to create a login script that applies to all users using the following command.  Note that the heredoc delimiter is quoted so that ${PATH} is written into the script literally instead of being expanded by the current shell.  The CUDA folder path is versioned, so this script to set the PATH variable will need to be updated when the CUDA version changes.

cat <<'EOF' | sudo tee /etc/profile.d/nvidia.sh
export PATH="/usr/local/cuda-12.5/bin${PATH:+:${PATH}}"
EOF
sudo chmod +x /etc/profile.d/nvidia.sh
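If your installation also created the /usr/local/cuda symlink (check with ls -l /usr/local/cuda), a version-independent variant of the same script might look like the following. This is only a sketch; I have not relied on it myself, and it assumes the symlink tracks the installed toolkit version.

#Version-independent PATH script (assumes /usr/local/cuda symlink exists)
cat <<'EOF' | sudo tee /etc/profile.d/nvidia.sh
export PATH="/usr/local/cuda/bin${PATH:+:${PATH}}"
EOF
sudo chmod +x /etc/profile.d/nvidia.sh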

Once our script is created, we need to apply the updated PATH variable and test our CUDA Toolkit installation to make sure it is working properly.  

source /etc/profile.d/nvidia.sh
nvcc --version

If the PATH variable is set properly, nvcc will print the CUDA compiler version information.

If you receive a command not found error, then the PATH variable has not been set properly, and you need to review the script containing your export command and source it again.

NVIDIA Container Toolkit

If you are planning to use container workloads with your GPU, you will need to install the NVIDIA Container Toolkit.  The Container Toolkit provides a container runtime library and utilities to configure containers to utilize NVIDIA GPUs.  The Container Toolkit is distributed from an apt repository.

Note: The CUDA toolkit is not required if you are planning to only use container workloads with the GPU.  An NVIDIA driver is still required on the host or VM.

The first step for installing the NVIDIA Container Toolkit on Debian is to import the Container Toolkit apt repository.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Update the apt package lists and install the Container Toolkit.

sudo apt-get update && sudo apt-get install nvidia-container-toolkit

Docker needs to be configured and restarted after the container toolkit is installed.  

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Note: Other container runtimes are supported.  Please see the documentation to see the supported container runtimes and their configuration instructions.
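For example, if you are using containerd instead of Docker, the equivalent configuration step would look like this:

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd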

After restarting your container runtime, you can run a test workload to make sure the container toolkit is installed properly.

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Using NVIDIA GPUs with Docker Compose

GPUs can be utilized with container workloads managed by Docker Compose.  You will need to add the following lines, modified to fit your environment, to the container definition in your Docker Compose file.  Please see the Docker Compose documentation for more details.

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities:
            - gpu
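To tie it together, here is a minimal sketch that wraps the snippet above in a hypothetical gpu-test service and reuses the same nvidia-smi test from earlier. It assumes the Docker Compose v2 plugin is installed.

#Write a minimal compose file with a hypothetical gpu-test service
cat <<'EOF' > compose.yaml
services:
  gpu-test:
    image: ubuntu
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu
EOF

#Run the test service; it should print the nvidia-smi table and exit
sudo docker compose up gpu-test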

Configuring NVIDIA vGPU Licensed Features

Your machine will need to check out a license if NVIDIA vGPU or NVAIE is being used, and the NVIDIA vGPU driver will need to be configured with a license server.  The steps for setting up a cloud or local instance of the NVIDIA License System are beyond the scope of this post, but they can be found in the NVIDIA License System documentation.

Note: You do not need to complete these steps if you are using the Data Center Driver with PCI Passthrough. Licensing is only required if you are using vGPU or NVAIE features.

A client configuration token will need to be configured once the license server instance has been set up.  The steps for downloading the client configuration token can be found in the NVIDIA License System documentation for both CLS (Cloud License Service) and DLS (Delegated License Service) instances.

After generating and downloading the client configuration token, it will need to be placed onto your virtual machine. The file needs to be copied from your local machine to the /etc/nvidia/ClientConfigToken directory.  This directory is locked down by default, and it requires root or sudo access to perform any file operations here. So you may need to copy the token file to your local home directory and use sudo to copy it into the ClientConfigToken directory.  Or you can place the token file on a local web server and use wget/cURL to download it.
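For the first approach, copying the token into place might look like this (using the same token filename as below):

sudo cp ~/client_configuration_token_05-22-2024-22-41-58.tok /etc/nvidia/ClientConfigToken/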

In my lab, I did the following:

sudo wget https://web-server-placeholder-url/NVIDIA/License/client_configuration_token_05-22-2024-22-41-58.tok

The token file needs to be made readable by all users after downloading it into the /etc/nvidia/ClientConfigToken directory.

sudo chmod 744 /etc/nvidia/ClientConfigToken/client_configuration_token_*.tok

The final step is to enable vGPU licensed features by editing the gridd.conf file.  First, copy the gridd.conf.template file using the following command.

sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

The next step is to edit the file, find the line called FeatureType, and change the value from 0 to 1.

sudo nano /etc/nvidia/gridd.conf
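If you would rather script this step than edit the file by hand, a sed one-liner along these lines should work, assuming the template ships with a FeatureType line (commented out or not):

#Set FeatureType=1 to enable vGPU licensing
sudo sed -i -E 's/^#?\s*FeatureType=.*/FeatureType=1/' /etc/nvidia/gridd.conf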

Finally, restart the NVIDIA GRID daemon.

sudo systemctl restart nvidia-gridd

You can check the service status with the sudo systemctl status nvidia-gridd command to see if a license was successfully checked out.  You can also log into your license service portal and review the logs to see licensing activity.
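In my experience, the vGPU guest driver also reports its licensing state through nvidia-smi, so a quick check like the following can confirm the checkout (assuming your driver version includes the licensing section in the query output):

nvidia-smi -q | grep -i -A2 "license"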

Sources

While creating this post, I pulled from the following links and sources.

https://docs.nvidia.com/cuda/cuda-installation-guide-linux

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#meta-packages

https://docs.nvidia.com/grid/17.0/grid-vgpu-user-guide/index.html#installing-vgpu-drivers-linux-from-run-file

https://docs.nvidia.com/grid/17.0/grid-vgpu-user-guide/index.html#installing-vgpu-drivers-linux-from-debian-package

https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/nouveau.html

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

https://docs.docker.com/compose/gpu-support/

https://docs.nvidia.com/license-system/latest/index.html

I’m Finally Building My AI Lab…

When I wrote a Home Lab update post back in January 2020, I talked about AI being one of the technologies that I wanted to focus on in my home lab.  

At that time, AI had unlimited possibilities but was hard to work with. Frameworks like PyTorch and TensorFlow existed, but they required a Python programming background and possibly an advanced mathematics or computer science degree to actually do something with them.  Easy-to-deploy self-hosted options like Stable Diffusion and Ollama were still a couple of years away.

Then the rest of 2020 happened.  Since I’m an EUC person by trade, my attention was diverted away from anything that wasn’t supporting work-from-home initiatives and recovering from the burnout that followed.

GPU-accelerated computing and AI started coming back onto my radar in 2022.  We had a few cloud provider partners asking about building GPU-as-a-Service with VMware Cloud Director.

Those conversations exploded when OpenAI released their technical marvel, technology demonstrator, and extremely expensive and sophisticated toy – ChatGPT.  That kickstarted the “AI ALL THE THINGS” hype cycle.

Toy might be too strong of a word there.  An incredible amount of R&D went into building ChatGPT.  OpenAI’s GPT models are an incredible technical achievement, and they showcase the everyday power and potential of artificial intelligence.  But it was a research preview that people were meant to use and play with. So my feelings about this only extend to the free and public ChatGPT service itself, not the GPT language models, large language models in general, or AI as a whole.

After testing out ChatGPT a bit, I pulled back from AI technology.  Part of this was driven by trying to find use cases for experimenting with AI, and part of it was driven by an anti-hype backlash.  But that anti-hype backlash, and my other thoughts on AI, are a story for another blog post.

Finding my Use Case

Whenever I do something in my lab, I try to anchor it in a use case.  I want to use the technology to solve a problem or challenge that I have.  And when it came to AI, I really struggled with finding a use case.

At least…at first.

But last year, Hasbro decided that they would burn down their community in an attempt to squeeze more money out of their customers.  I found myself with a growing collection of Pathfinder 2nd Edition and 3rd-party Dungeons and Dragons 5th Edition PDFs as I started to play the game with my son and some family friends. And I had a large PDF backlog of other gaming books from the old West End Games Star Wars D6 RPG and Battletech.

This started me down an AI rabbit hole.  At first, I just wanted to create some character art to go along with my character sheet.

Then I started to design my own fantasy and sci-fi settings, and I wanted to create some concept art for the setting I was building.  I had a bit of a vision, and I wanted to see it brought to life.

I tried Midjourney first, and after a month in which I used most of my credits, I decided to look at self-hosting options.  That led me to Stable Diffusion, which I tested out on my Mac and my Windows desktop.

I had a realization while trying to manage space on my MacBook.  Stable Diffusion is resource heavy and can use a lot of storage when you start experimenting with models. The user interfaces are basically web applications built on the Gradio framework. And I had slightly better GPUs sitting in one of my lab hosts.

So why not virtualize it to take advantage of my lab resources? And if I’m going to virtualize these AI projects, why not try out a few more things, like using an LLM to talk to my game book PDFs?

My Virtual AI Lab and Workloads

When I decided to build an AI lab, I wanted to start with resources I already had available. 

Back in 2015, I convinced my wife to let me buy a brand new PowerEdge R730 and a used NVIDIA GRID K1 card. I had to buy a brand new server because I wanted to test out the brand new (at the time) GPU virtualization in my lab VDI environment, and the stock servers were not configured to support GPUs. GPUs typically need 1100-watt power supplies and an enablement kit to deliver power to the GPU, neither of which is part of the standard server BOM. Most GPUs that you’d find in a data center are also passively cooled, so the server needs high-CFM fans and high-speed fan settings to increase airflow over them.

That R730 has a pair of Intel E5-2620 v3 CPUs, 192GB of RAM, and uses ESXi for the hypervisor.  Back in 2018, I upgraded the GRID K1 card to a pair of NVIDIA Tesla P4 GPUs.  The Tesla P4 is basically a data center version of a GTX 1080 – it has the same GP104 graphics processor and 8GB of video memory (also referred to as framebuffer) as the GTX 1080.  The main differences are that it is passively cooled and it only draws 75 watts, so it can draw all of its power from the PCIe slot without any additional power cabling.  

My first virtualized AI workload was the Forge WebUI for Stable Diffusion.  I deployed this on a Debian 12 VM and used PCI passthrough to present one of the P4 cards to the VM.  Image generation times were about 2-3 minutes per image, which is fine for a lab.  

I started to run into issues pretty quickly.  As I said before, the P4 only has 8GB of framebuffer, and I would start to hit out-of-memory errors when generating larger images, upscaling images, or attempting to use LoRAs or other fine-tuned models.

When I was researching LLMs, it seemed like the P4 would not be a good fit for even the smallest models. It didn’t have enough framebuffer, it had poor FP16 performance, and it lacked support for flash attention.  So the P4 gives an all-around poor experience.

So I decided that I needed to do a couple of upgrades.  First, I ordered a brand new NVIDIA L4 data center GPU.  The L4 is an Ada Lovelace generation data center GPU.  It’s a single-slot GPU with 24GB of framebuffer that only draws 75 watts.  It’s the most modern evolution of the P4 form factor.

But the L4 took a while to ship, and I was getting impatient.  So I went onto eBay and found a great deal on a pre-owned NVIDIA Tesla T4. The T4 is a Turing generation data center GPU, and it is the successor to the P4. It has 16GB of framebuffer, and most importantly, it has significantly improved performance and support for features like flash attention.  And it also only draws 75 watts.

The T4 and L4 were significant improvements over the P4.  I didn’t do any formal benchmarking, but image generation times went from 2-3 minutes to less than a minute and a half.  And I was able to start building out an LLM lab using Ollama and Open-WebUI.  

What’s Next

The initial version of this lab used PCI Passthrough to present the GPUs to my VMs.  I’m now in the process of moving to NVIDIA AI Enterprise (NVAIE) to take advantage of vGPU features.  NVIDIA has provided me with NFR licensing through the NGCA program, so thank you to NVIDIA for enabling this in my lab.  

NVAIE will allow me to create virtual GPUs using only a slice of the physical resources as some of my VMs don’t need a full GPU, and it will allow me to test out some different setups with services running on different VMs.  

I’m also in the process of building out and exploring my LLM environment.  The first iteration of this is being built using Ollama and Open-WebUI.  Open-WebUI seems like an easy on-ramp to testing out Retrieval Augmented Generation (RAG), and I’m trying to wrap my head around that.

I’m building my use case around Pathfinder 2nd Edition.  I’m using Pathfinder because it is probably the most complete ruleset that I have in PDF form.  Paizo, the Pathfinder publisher, also provides a website where all the game’s core rules are available for free (under a fairly permissive license), so I have a source I can scrape to supplement my PDFs. 

This has been kind of a fun challenge as I learn how to convert PDFs into text, chunk them, and import them into a RAG.  I also want to look at other RAG tools and possibly try to build a knowledge graph around this content.

This has turned into a fun, but at times frustrating, project.  I’ve learned a lot, and I’m going to keep digging into it.

Side Notes and Disclosures 

Before I went down the AI Art road, I did try to hire a few artists I knew or who had been referred to me.  They either didn’t do that kind of art or they didn’t get back to me…so I just started creating art for personal use only. I know how controversial AI Art is in creative spaces, so if I ever develop and publish these settings commercially, I would hire artists and the AI art would serve as concept art.

In full disclosure, one of the Tesla P4s was provided by NVIDIA as part of the NGCA program.  I purchased the other P4.

NVIDIA has provided NFR versions of their vGPU and NVAIE license SKUs through the NGCA program. My vSphere licensing is provided through the VMware by Broadcom vExpert program.  Thank you to NVIDIA and Broadcom for providing licensing.