Author Node: This post has been a few months in the making. While GRIDDays was back in March, I’ve had a few other projects that have kept this on the sidelines until now. This is Part 1. Part 2 will be coming at some point in the future. I figured 1200 words on this was good enough for one chunk.
The general rule of thumb is that if a virtual desktop requires some dedicated hardware – serial devices, hardware license dongles, or physical cards, for example – it’s probably not a good fit for virtualization. This was especially true of workloads that required high-end 3D acceleration. If a virtual workload required 3D graphics, multiple high-end Quadro cards had to be installed in the server and then passed through to the virtual machines that required them.
Since pass-through GPUs can’t be shared amongst VMs, this design doesn’t scale well. There is a limit to the number of cards that can be installed in a host, and that caps the number of 3D workloads the host can run. If I needed more, I would have to add hosts. It also limits flexibility in the environment: VMs with pass-through hardware can’t easily be moved to another host when maintenance is needed or a hardware failure occurs.
NVIDIA created the GRID products to address the challenges of GPU virtualization. GRID technology combines purpose-built graphics hardware, software, and drivers to allow multiple virtual machines to access a GPU.
I’ve always wondered how it worked, and how it ensured that all configured VMs had equal access to the GPU. I had the opportunity to learn about the technology and the underlying concepts a few weeks ago at NVIDIA GRID Days.
Disclosure: NVIDIA paid for my travel, lodging, and some of my meals while I was out in Santa Clara. This has not influenced the content of this post.
Note: All graphics in this slide are courtesy of NVIDIA.
How it Works – Hardware Layer
So how does a GRID card work? In order to understand it, we have to start with the hardware. A GRID card is a PCIe card with multiple GPUs on the board. The hardware includes the same features that many other NVIDIA products have, including a framebuffer (often referred to as video memory), graphics compute cores, and hardware dedicated to video encode and decode.
Interactions between an operating system and a PCIe hardware device happen through the base address register. Base address registers are used to hold memory addresses used by a physical device. Virtual machines don’t have full access to the GPU hardware, so they are allocated a subset of the GPU’s base address registers for communication with the hardware. This is called a virtual BAR.
Access to the GPU Base Address Registers, and by extension the Virtual BAR, is handled through the CPU’s Memory Management Unit. The MMU handles the translation of the virtual BAR memory addresses into the corresponding physical memory addresses used by the GPU’s BAR. The translation is facilitated by page tables managed by the hypervisor.
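The translation described above can be sketched in a few lines of Python. This is a conceptual model only – the class name, page size, and layout are my own illustrative assumptions, not NVIDIA’s implementation – but it shows the key property: each VM’s page table only contains its own slice of the physical BAR, so any access outside that slice faults instead of reaching another VM’s registers.

```python
# Conceptual sketch of virtual-BAR address translation.
# Names and layout are illustrative assumptions, not NVIDIA's implementation.

PAGE_SIZE = 4096

class VirtualBAR:
    """Maps a VM's virtual BAR pages to its assigned slice of the physical BAR."""

    def __init__(self, physical_base, size):
        # The hypervisor assigns this VM a contiguous subset of the GPU's
        # physical BAR and builds the page table for it.
        self.page_table = {
            vpage: physical_base + vpage * PAGE_SIZE
            for vpage in range(size // PAGE_SIZE)
        }

    def translate(self, virtual_addr):
        # The MMU walks the page table; an address outside the VM's
        # virtual BAR faults rather than touching another VM's registers.
        vpage, offset = divmod(virtual_addr, PAGE_SIZE)
        if vpage not in self.page_table:
            raise PermissionError("access outside assigned virtual BAR")
        return self.page_table[vpage] + offset

# Two VMs share one physical BAR, each seeing only its own 64 KB slice.
vm1 = VirtualBAR(physical_base=0x10000000, size=64 * 1024)
vm2 = VirtualBAR(physical_base=0x10010000, size=64 * 1024)

print(hex(vm1.translate(0x0)))  # 0x10000000
print(hex(vm2.translate(0x0)))  # 0x10010000
```

The isolation comes for free from the lookup: there is simply no entry in a VM’s page table that resolves to another VM’s portion of the BAR.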
The benefit of the virtual BAR and hardware-assisted translation is security. A VM can only access the registers it has been assigned, and it cannot reach any locations outside its virtual BAR.
The architecture described above – assigning each VM a virtual base address register space that corresponds to a subset of the physical base address registers – allows multiple VMs to securely share one physical hardware device. But that’s only one part of the story. How does work actually get from the guest OS driver to the GPU? And how does the GPU manage workloads from multiple VMs?
When the NVIDIA driver submits a job or workload to the GPU, it gets placed into a channel. A channel is essentially a queue or a line that is exposed through each VM’s virtual BAR. Each GPU has a fixed number of channels available, and channels are allocated to each VM by dividing the total number of channels by the number of users that can utilize a profile. So if I’m using a profile that can support 16 VMs per GPU, each VM would get 1/16th of the channels.
When a virtual desktop user opens an application that requires resources on the GPU, the NVIDIA driver in the VM will dedicate a channel to that application. When that application needs the GPU to do something, the NVIDIA driver will submit that job to channels allocated to the application on the GPU through the virtual BAR.
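The channel allocation arithmetic is simple integer division, which can be sketched as follows. The channel count here is a made-up example – the post doesn’t state how many channels a real GRID GPU exposes – but the split works the same way regardless of the actual number.

```python
# Sketch of per-VM channel allocation as described above.
# The total channel count (512) is a hypothetical example value,
# not a real GRID specification.

def channels_per_vm(total_channels, vms_per_gpu):
    """Each VM gets an equal share of the GPU's fixed channel pool."""
    return total_channels // vms_per_gpu

# A profile that supports 16 VMs per GPU gives each VM 1/16th
# of the channels.
share = channels_per_vm(512, 16)
print(share)  # 32
```

Within that per-VM share, the guest driver then hands out individual channels to applications as they start using the GPU.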
So now that the application’s work is queued up for execution, something needs to move it onto the GPU. That job is handled by the scheduler, which moves work from active channels onto the GPU engines. The GPU has four engines for handling different tasks: graphics/compute, video encode, video decode, and a copy engine. The GPU engines are timeshared (more on that below), and they execute jobs in parallel.
When active jobs are placed on an engine, they are executed sequentially. When a job is completed, the NVIDIA driver is signaled that the work has been completed, and the scheduler loads the next job onto the engine to begin processing.
There are two types of scheduling in the computing world – sequential and parallel. When sequential scheduling is used, a single processor executes each job that it receives in order. When it completes that job, it moves onto the next. This can allow a single fast processor to quickly move through jobs, but complex jobs can cause a backup and delay the execution of waiting jobs.
Parallel scheduling uses multiple processors to execute jobs at the same time. When a job on one processor completes, the next job in line is moved onto that processor. Individually, these processors are too slow to handle a complex job, but they prevent a single job from clogging the pipeline.
A good analogy to this would be the checkout lane at a department store. The cashier (and register) is the processor, and each customer is a job that needs to be executed. Customers are queued up in line, and as the cashier finishes checking out one customer, the next customer in the queue is moved up. The cashier can usually process customers efficiently and keep the line moving, but if a customer with 60 items walks into the 20 items or less lane, it would back up the line and prevent others from checking out.
This example works for parallel execution as well. Imagine that same department store at Christmas. Every cash register is open, and there is a person at the front of the line directing where people go. This person is the scheduler, and they are placing customers (jobs) on registers (GPU engines) as soon as they have finished with their previous customer.
So how does GRID ensure that all VMs have equal access to the GPU engines? How does it prevent one VM from hogging all the resources on a particular engine?
The answer comes in the way that the scheduler works. The scheduler uses a method called round-robin time slicing. Round-robin time slicing works by giving each channel a small amount of time on a GPU engine. The channel has exclusive access to the GPU engine until the timeslice expires or until there are no more work items in the channel.
If all of the work in a channel is completed before the timeslice expires, any spare cycles are redistributed to other channels or VMs. This ensures that the GPU isn’t sitting idle while jobs are queued in other channels.
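The scheduling behavior described in the last two paragraphs can be modeled in a short simulation. This is a toy model of round-robin time slicing, not NVIDIA’s scheduler; job sizes are measured in arbitrary “cycles,” and the only two rules it implements are the ones above: a channel keeps the engine until its timeslice expires or its queue empties, and a channel that finishes early simply hands the engine on, so its slack goes to channels that still have work.

```python
# Toy round-robin time-slicing simulation (illustrative model only,
# not NVIDIA's actual scheduler). Each channel holds a queue of jobs
# with remaining work measured in arbitrary "cycles".

from collections import deque

def run_engine(channels, timeslice):
    order = []  # (channel, cycles used) per turn, for inspection
    while any(channels.values()):
        for cid, jobs in channels.items():
            spent = 0
            # A channel owns the engine until its timeslice expires
            # or it runs out of queued work.
            while jobs and spent < timeslice:
                step = min(jobs[0], timeslice - spent)
                spent += step
                jobs[0] -= step
                if jobs[0] == 0:
                    jobs.popleft()  # job done; the driver would be signaled
            if spent:
                order.append((cid, spent))
            # An empty channel uses no engine time: its slack
            # effectively flows to the channels that still have work.
    return order

channels = {
    "vm1": deque([5, 3]),  # two queued jobs
    "vm2": deque([2]),     # finishes within its first slice
    "vm3": deque([8]),     # one large job, split across slices
}
order = run_engine(channels, timeslice=4)
print(order)
```

Tracing it by hand: with a timeslice of 4, vm2 only ever consumes 2 cycles, and the engine immediately moves on to vm3 rather than sitting idle for the remainder of vm2’s slice – which is the redistribution behavior the post describes.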
The next part of the Understanding vGPU series will cover memory management on the GRID cards.