Announcing Rubrik 4.1 – The Microsoft Release

Rubrik has made significant enhancements to their platform since coming out of stealth just over two years ago.  Thanks to an extremely aggressive release schedule, what started as an innovative way to bring together software and hardware to solve virtualization backup challenges has grown into a robust data protection platform.

Yesterday, Rubrik announced version 4.1.  The latest version builds on the already strong Alta release that came out just a few months ago.  This release is heavily focused on the Microsoft stack, with a strong emphasis on cloud as well.

So what’s new in Rubrik 4.1?

Multi-Tenancy

The major enhancement is multi-tenancy support.  Rubrik 4.1 will now support dividing up a single physical Rubrik cluster into multiple Organizations.  Organizations are logical management units inside a physical Rubrik cluster, and each organization can manage their own logical objects such as users, protected objects, SLA domains, and replication targets.  This new multi-tenancy model is designed to meet the needs of service provider organizations, where multiple customers may use Rubrik as a backup target, as well as large enterprises that have multiple IT organizations.

In order to support the new multi-tenancy feature, Rubrik is adding role-based access control with multiple levels of access.  This will allow application owners and administrators to be granted limited access to Rubrik so they can manage their particular resources.
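To make the model a bit more concrete, here is a minimal sketch of what provisioning a tenant might look like against Rubrik's REST API.  The organization endpoint and payload fields below are assumptions for illustration only, not documented calls, so check the 4.1 API documentation before relying on anything like this.

```python
# Hypothetical sketch of provisioning a tenant in Rubrik 4.1 -- the
# /organization endpoint and payload fields are assumptions for
# illustration, not documented API calls.
import requests

CLUSTER = "https://rubrik.example.com"
HEADERS = {"Authorization": "Bearer <api-token>"}

# Create a logical Organization inside the physical cluster
org = requests.post(
    f"{CLUSTER}/api/v1/organization",  # assumed endpoint
    headers=HEADERS,
    json={"name": "tenant-a"},
    verify=False,  # lab cluster with a self-signed certificate
).json()

# Grant a tenant administrator role-based access scoped to that Organization
requests.post(
    f"{CLUSTER}/api/v1/organization/{org['id']}/user",  # assumed endpoint
    headers=HEADERS,
    json={"username": "tenant-a-admin", "role": "org_admin"},
    verify=False,
)
```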

Azure, Azure Stack, and Hyper-V

One of the big focuses of the Rubrik 4.1 release is Microsoft, and Rubrik has enhanced their Microsoft platform support.

The first major enhancement to Rubrik’s Microsoft platform offering is Azure Stack support.  Rubrik will be able to integrate with Azure Stack and provide protection to customer workloads running on this platform.

The second major enhancement is to the CloudOn App Instantiation feature.  CloudOn was released in Alta, and it enables customers to power on VM snapshots in the public cloud.  The initial release supported AWS, and Rubrik is now adding support for Azure.

SQL Server Always-On Support

Rubrik is expanding its agent-based SQL Server backup support to Always-On Availability Groups.  In the current release, Rubrik will detect if a SQL Server is part of an availability group, but it requires an administrator to manually apply an SLA policy to databases.  If a failover occurs in the availability group, manual intervention is required to change the replica that is being protected.  This can be an issue with 2-node availability groups, where a node failure or server reboot causes a failover that could impact SLAs on the protected databases.

Rubrik 4.1 will now detect the configuration of a SQL Server, including availability groups.  Based on the configuration, Rubrik will dynamically select the replica to back up.  If a failover occurs, Rubrik will select a different replica in the availability group to use as a backup source.  This feature is only supported on synchronous commit availability groups.
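For a sense of what that detection involves, here is a minimal sketch (not Rubrik's actual implementation) that queries the standard Always-On DMVs to find a healthy replica to use as a backup source, assuming a synchronous-commit AG and the pyodbc driver.

```python
# Minimal sketch of availability group discovery -- not Rubrik's code.
# Queries the standard Always-On DMVs to find a healthy, synchronized
# replica to use as a backup source.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlnode1;DATABASE=master;Trusted_Connection=yes;"
)

query = """
SELECT ag.name            AS ag_name,
       ar.replica_server_name,
       rs.role_desc,
       rs.synchronization_health_desc,
       ar.availability_mode_desc
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar               ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states rs ON rs.replica_id = ar.replica_id
"""

replicas = conn.cursor().execute(query).fetchall()

# Prefer a healthy synchronous-commit secondary; fall back to the primary.
candidates = [r for r in replicas
              if r.synchronization_health_desc == "HEALTHY"
              and r.availability_mode_desc == "SYNCHRONOUS_COMMIT"]
secondaries = [r for r in candidates if r.role_desc == "SECONDARY"]
target = (secondaries or candidates)[0]
print(f"Backing up {target.ag_name} from {target.replica_server_name}")
```

The same DMVs also expose each replica's current role, which is what makes it possible to re-select a backup source automatically after a failover.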

Google Cloud Storage Support

Google Cloud Storage is now supported as a cloud archive target, across all of its storage tiers.

AWS Glacier and GovCloud Support

One feature that has been requested multiple times since Rubrik's initial release is support for AWS Glacier for long-term retention.  Rubrik 4.1 adds support for Glacier as an archive location.

Also in the 4.1 release is support for AWS GovCloud.  This will allow government entities running Rubrik to utilize AWS as a cloud archive.

Thoughts

Rubrik has had an aggressive release schedule since Day 1.  And they don’t seem to be letting up on quickly adding features.  The 4.1 release does not disappoint in this category.

The feature I’m most excited about is the enhanced support for SQL Always-On Availability Groups.  While Rubrik can detect if a database is part of an AG today, the ability to dynamically select the replica to back up is key for organizations that have smaller AGs or utilize the basic 2-node AG feature in SQL Server 2016.


vMotion Support for NVIDIA vGPU is Coming…And It’ll Be Bigger Than You Think

One of the cooler tech announcements at VMworld 2017 was on display at the NVIDIA booth.  It wasn’t really an announcement, per se, but more of a demonstration of a long-awaited solution to a very difficult challenge in the virtualization space.

NVIDIA displayed a tech demo of vMotion support for VMs with GRID vGPU running on ESXi.  Along with this demo was news that they had also solved the problem of suspend and resume on vGPU-enabled machines, and that these solutions would be included in future product releases.  NVIDIA announced live migration support for XenServer earlier this year.

Rob Beekmans (Twitter: @robbeekmans) also wrote about this recently, and his blog has video showing the tech demos in action.

I want to clarify that these are tech demos, not tech previews.  A tech preview, in VMware EUC terms, usually means a feature that is in beta or pre-release to gather real-world feedback.  These demos likely occurred on a development version of a future ESXi release, and there is no projected timeline for when they will be released as part of a product.

Challenges to Enabling vMotion Support for vGPU

So you’re probably thinking “What’s the big deal? vMotion is old hat now.”  But when vGPU is enabled on a virtual machine, it requires that VM to have direct, but shared, access to physical hardware on the system – in this case, a GPU.  And vMotion never worked if a VM had direct access to hardware – be it a PCI device that was passed through or something plugged into a USB port.

If we look at how vGPU works, each VM has a shared PCI device added to it.  This shared PCI device provides shared access to a physical card.  To facilitate this access, each VM gets a portion of the GPU’s Base Address Register (BAR), the hardware-level interface between the machine and the PCI card.  In order to make this portable, there has to be some method of virtualizing the BAR.  A VM that migrates may not get the same address space on the BAR when it moves to a new host, and any changes to that would likely cause issues for Windows or for any jobs that the VM has placed on the GPU.

There is another challenge to enabling vMotion support for vGPU.  Think about what a GPU is – it’s a massively parallel processor with dedicated RAM.  When you add a GPU to a VM, you’re essentially attaching a second system to it, and the data in the GPU framebuffer and processor queues needs to be migrated along with the CPU, system RAM, and system state.  This requires extra coordination to ensure that the GPU quiesces and releases that state so it can be migrated to the new host, and it has to be done in a way that doesn’t impact performance for other users or applications that may be sharing the GPU.

Suspend and Resume is another challenge that is very similar to vMotion support.  Suspending a VM basically hibernates the VM.  All current state information about the VM is saved to disk, and the hardware resources are released.  Instead of sending data to another machine, it needs to be written to a state file on disk.  This includes the GPU state.  When the VM is resumed, it may not get placed on the same host and/or GPU, but all the saved state needs to be restored.

Hardware Preemption and CUDA Support on Pascal

The August 2016 GRID release included support for the Pascal-series cards, which include hardware support for preemption.  This is important for GRID because it uses time-slicing to share access to the GPU across multiple VMs.  When a time-slice expires, the scheduler moves on to the next VM.

This can cause issues when using GRID to run CUDA jobs.  CUDA jobs can be very long running, and the job is stopped when its time-slice expires.  Hardware preemption enables long-running CUDA tasks to be interrupted and paused when the time-slice expires, and resumed when that VM gets a new time-slice.
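The scheduling behavior is easier to see with a toy model.  The sketch below is not GRID code; it simply simulates round-robin time-slicing across VMs, where a long-running job is paused at the end of its slice and resumed later instead of being killed, which is essentially what hardware preemption on Pascal makes possible for CUDA work.

```python
# Toy simulation of time-sliced GPU sharing with preemption -- not GRID code.
# Each VM's job needs some amount of GPU time; the "GPU" runs one job at a
# time for a fixed slice, then preempts it and moves on to the next VM.
from collections import deque

TIME_SLICE_MS = 25

def run_round_robin(jobs):
    """jobs: dict mapping VM name -> total GPU milliseconds required."""
    queue = deque(jobs.items())
    clock = 0
    while queue:
        vm, remaining = queue.popleft()
        work = min(TIME_SLICE_MS, remaining)
        clock += work
        remaining -= work
        if remaining > 0:
            # Preemption: the long-running job is paused, not aborted,
            # and rejoins the queue to continue in its next time-slice.
            queue.append((vm, remaining))
        else:
            print(f"{clock:5d} ms: {vm} finished")

# A short desktop rendering task alongside a long CUDA-style job on one GPU.
run_round_robin({"vdi-desktop": 40, "cuda-sim": 200})
```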

So why is this important?  In previous versions of GRID, CUDA was only available and supported on the largest profiles.  So to support applications that required CUDA in a virtual environment, an entire GPU would need to be dedicated to the VM.  This could be a significant overallocation of resources, and it significantly reduced the density on a host.  If a customer was using M60s, which have two GPUs per card, then they may have been limited to 4 machines with GPU access per host if those machines needed CUDA support.

With Pascal cards and the latest GRID software, CUDA support is enabled on all vDWS profiles (the ones that end with a Q).  Now customers can provide CUDA-enabled vGPU profiles to virtual machines without having to dedicate an entire GPU to one machine.

This has two benefits.  First, it enables more features in the high-end 3D applications that run on virtual workstations.  Not only can these machines be used for design, they can now utilize the GPU to run models or simulations.

The second benefit has nothing to do with virtual desktops or applications.  It allows GPU-enabled server applications to be fully virtualized.  This potentially means things like render farms or, looking further ahead, virtualized AI inference engines for business applications or infrastructure support services.  One potentially interesting use case is running MapD, a database that runs entirely in the GPU, on a virtual machine.

Analysis

GPUs have the ability to revolutionize enterprise applications in the data center.  They can potentially bring artificial intelligence, deep learning, and massively parallel computing to business apps.

vMotion support is critical in enabling enterprise applications in virtual environments.  The ability to move applications and servers around is important to keeping services available.

By enabling hardware preemption and vMotion support, it now becomes possible to virtualize the next generation of business applications, which will require a GPU and CUDA support to improve performance or to utilize deep learning algorithms.  These workloads can be moved around the datacenter without impacting what is running, maintaining availability and keeping active jobs alive so they do not have to be restarted.

This also opens up new opportunities to better utilize data center resources.  If I have a large VDI footprint that utilizes GRID, I can’t vMotion any running desktops today to consolidate them onto particular hosts.  If I could consolidate those desktops with vMotion, I could repurpose the remaining GPU hosts for other work, such as render farms or after-hours data processing.

This may not seem important now.  But I believe that deep learning/artificial intelligence will become a critical feature in business applications, and the ability to turn my VDI hosts into something else after-hours will help enable these next generation applications.