Introduction
This document provides a comprehensive step-by-step guide to configuring NVIDIA Virtual GPU (vGPU) on VMware ESXi, specifically tailored for environments where virtual machines (VMs) require GPU acceleration. This setup is crucial for workloads such as Artificial Intelligence (AI), Machine Learning (ML), high-performance computing (HPC), and advanced graphics virtualization. We will detail the process of enabling NVIDIA GPUs, such as those installed in a Dell PowerEdge MX760c server, to be shared among multiple VMs, enhancing resource utilization and performance.
While the concept of “GPU passthrough” often refers to dedicating an entire physical GPU to a single VM (DirectPath I/O), NVIDIA vGPU technology allows a physical GPU to be partitioned into multiple virtual GPUs. Each vGPU can then be assigned to a different VM, providing a more flexible and scalable solution. This guide focuses on the vGPU setup, which leverages NVIDIA’s drivers and management software in conjunction with VMware vSphere.
The instructions cover compatibility verification, hardware installation, ESXi host configuration, vGPU assignment to VMs, and driver installation within the guest operating systems. Following these steps will enable your virtualized environment to harness the power of NVIDIA GPUs for demanding applications. We will also briefly touch upon integrating this setup with VMware Private AI Foundation with NVIDIA for streamlined AI workload deployment.
Prerequisites
Before proceeding with the configuration, ensure the following prerequisites are met:
- Compatible Server Hardware: A server system that supports NVIDIA GPUs and is certified for the version of ESXi you are running. For instance, the Dell PowerEdge MX760c is supported for ESXi 8.0 Update 3 and is compatible with SR-IOV and NVIDIA GPUs.
- NVIDIA GPU: An NVIDIA GPU that supports vGPU technology. Refer to NVIDIA’s documentation for a list of compatible GPUs.
- VMware ESXi: A compatible version of VMware ESXi installed on your host server. This guide assumes ESXi 8.0 or a similar modern version.
- VMware vCenter Server: While some configurations might be possible without it, vCenter Server is highly recommended for managing vGPU deployments.
- NVIDIA vGPU Software: You will need the NVIDIA vGPU Manager VIB (vSphere Installation Bundle) for ESXi and the corresponding NVIDIA guest OS drivers for the VMs. These are typically available from the NVIDIA Licensing Portal.
- Network Connectivity: Ensure the ESXi host has network access to download necessary files or for management via SSH and vSphere Client.
- Appropriate Licensing: NVIDIA vGPU solutions require licensing. Ensure you have the necessary licenses for your deployment.
Step 1: Verify Compatibility
Ensuring hardware and software compatibility is the foundational step for a successful vGPU deployment. Failure to do so can lead to installation issues, instability, or suboptimal performance.
1.1 Check Server Compatibility
Your server must be certified to run the intended ESXi version and support the specific NVIDIA GPU model you plan to use. Server vendors often provide compatibility matrices.
- Action: Use the Broadcom Compatibility Guide (formerly VMware Compatibility Guide) to confirm your server model’s support for ESXi (e.g., ESXi 8.0 Update 3) and its compatibility with NVIDIA GPUs.
- Example: The Dell PowerEdge MX760c is listed as a supported server model for ESXi 8.0 Update 3 and is known to be compatible with SR-IOV and NVIDIA GPUs, making it suitable for vGPU deployments.
- Details: Compatibility verification includes checking for BIOS support for virtualization technologies (VT-d/IOMMU, SR-IOV), adequate power supply and cooling for the GPU, and physical PCIe slot availability.
1.2 Check GPU Compatibility
Not all NVIDIA GPUs support vGPU, and among those that do, compatibility varies with ESXi versions and NVIDIA vGPU software versions.
- Action: Consult the official NVIDIA vGPU documentation and the NVIDIA Virtual GPU Software Supported Products List. This documentation provides detailed information on which GPUs are supported, the required vGPU software versions, and compatible ESXi versions.
- Details: Pay close attention to the specific vGPU profiles supported by your chosen GPU, as these profiles determine how the GPU’s resources are partitioned and allocated to VMs. Ensure the GPU firmware is up to date as recommended by NVIDIA or your server vendor.
Note: Always use the latest available compatibility information from both VMware/Broadcom and NVIDIA, as these are updated regularly with new hardware and software releases.
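To know exactly what to look up in these compatibility guides, you can read the ESXi build and the server model directly from the host. A minimal sketch, assuming SSH access to the ESXi host:

```
# Print the exact ESXi version and build number to match against the Broadcom Compatibility Guide
vmware -vl

# Print the platform information (vendor and product name) that ESXi reports for the server
esxcli hardware platform get
```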
Step 2: Install NVIDIA GPU on the Host
Once compatibility is confirmed, the next step is to physically install the NVIDIA GPU into the ESXi host server and configure the server’s BIOS/UEFI settings appropriately.
2.1 Add the GPU as a PCI Device to the Host
- Action: Physically install the NVIDIA GPU into an appropriate PCIe slot in the PowerEdge MX760c or your compatible server.
- Procedure:
- Power down and unplug the server. Follow all electrostatic discharge (ESD) precautions.
- Open the server chassis according to the manufacturer’s instructions.
- Identify a suitable PCIe slot. High-performance GPUs usually require an x16 PCIe slot and may need auxiliary power connectors.
- Insert the GPU firmly into the slot and secure it. Connect any necessary auxiliary power cables directly from the server’s power supply to the GPU.
- Close the server chassis and reconnect power.
- Considerations: Ensure the server’s Power Supply Unit (PSU) can handle the additional power load from the GPU. Check server documentation for slot priority or specific slots designated for GPUs. Proper airflow and cooling are also critical for GPU stability and longevity.
2.2 Update Server BIOS/UEFI Settings
Several BIOS/UEFI settings must be enabled to support GPU passthrough and virtualization technologies like vGPU.
- Action: Boot the server and enter the BIOS/UEFI setup utility (commonly by pressing F2, DEL, or another designated key during startup).
- Key Settings to Enable:
- Virtualization Technology (VT-x / AMD-V): Usually enabled by default, but verify.
- SR-IOV (Single Root I/O Virtualization): This is critical for many vGPU deployments as it allows a PCIe device to appear as multiple separate physical devices. Locate this setting, often under “Integrated Devices,” “PCIe Configuration,” or “Processor Settings.”
- VT-d (Intel Virtualization Technology for Directed I/O) / AMD IOMMU: This technology enables direct assignment of PCIe devices to VMs and is essential for passthrough and vGPU functionality.
- Memory Mapped I/O above 4GB (Above 4G Decoding): Enable this if available, as GPUs require significant address space.
- Disable conflicting settings such as on-board graphics if they interfere, though in many servers they can coexist with a discrete GPU.
- Save and Exit: After making changes, save the settings and exit the BIOS/UEFI utility. The server will reboot.
Important: The exact naming and location of these settings can vary significantly between server manufacturers and BIOS versions. Consult your server’s technical documentation for specific instructions.
Step 3: Install NVIDIA VIB on ESXi Host
With the hardware installed and BIOS configured, the next phase involves installing the NVIDIA vGPU Manager VIB (vSphere Installation Bundle) on the ESXi host. This software component enables the ESXi hypervisor to recognize and manage the NVIDIA GPU for vGPU operations.
A detailed guide from Broadcom can be found here: Installing and configuring the NVIDIA VIB on ESXi.
3.1 Download the NVIDIA vGPU Manager VIB
- Action: Obtain the correct NVIDIA vGPU Manager VIB package for your ESXi version and GPU model. This software is typically downloaded from the NVIDIA Licensing Portal (accessed via the NVIDIA Enterprise Application Hub).
- Critical: Ensure the VIB version matches your ESXi host version (e.g., ESXi 8.0, 8.0 U1, 8.0 U2, 8.0 U3). Using an incompatible VIB can lead to installation failure or system instability. The VIB package will be a `.vib` file.
3.2 Upload the VIB to the ESXi Host
- Action: Transfer the downloaded `.vib` file to a datastore accessible by your ESXi host, or directly to a temporary location on the host (e.g., `/tmp`).
- Method: Use an SCP client (such as WinSCP for Windows, or the `scp` command-line utility for Linux/macOS) or the datastore browser in the vSphere Client to upload the VIB file.
- Example using SCP: `scp /path/to/local/vgpu-manager.vib root@your_esxi_host_ip:/vmfs/volumes/your_datastore/`
3.3 Install the VIB
- Action: Place the ESXi host into maintenance mode. This is crucial to ensure no VMs are running during the driver installation and subsequent reboot. You can do this via the vSphere Client (right-click host > Maintenance Mode > Enter Maintenance Mode).
- Procedure:
- Enable SSH on the ESXi host if it’s not already enabled (vSphere Client: Host > Configure > Services > SSH > Start).
- Connect to the ESXi host using an SSH client (e.g., PuTTY or command-line SSH).
- Navigate to the directory where you uploaded the VIB, or use the full path to the VIB file.
- Run the VIB installation command, replacing the path with the actual location of your VIB file: `esxcli software vib install -v /vmfs/volumes/your_datastore/vgpu-manager.vib`
- Alternatively, if the file was uploaded to `/tmp`: `esxcli software vib install -v /tmp/vgpu-manager.vib`
- This command might require the `--no-sig-check` flag if the VIB is not signed by a trusted source or if you encounter signature verification issues, though official NVIDIA VIBs should be signed.
- After successful installation, the command output will indicate that the VIB has been installed and a reboot is required.
- Reboot the ESXi host: `reboot`
- Once the host has rebooted, exit maintenance mode.
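For reference, the same sequence can be driven entirely from the ESXi shell. A minimal sketch, reusing the example datastore path from above:

```
# Put the host into maintenance mode (equivalent to the vSphere Client action)
esxcli system maintenanceMode set --enable true

# Install the vGPU Manager VIB (adjust the path to where you uploaded the file)
esxcli software vib install -v /vmfs/volumes/your_datastore/vgpu-manager.vib

# Reboot so the NVIDIA host driver is loaded
reboot

# After the host is back online, exit maintenance mode
esxcli system maintenanceMode set --enable false
```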
3.4 Verify VIB Installation
- Action: After the ESXi host reboots, verify that the NVIDIA VIB is installed correctly and the GPU is recognized.
- Command: SSH into the ESXi host and run `nvidia-smi`.
- Expected Output: This command should display information about the NVIDIA GPU(s) installed in the host, including GPU model, driver version, temperature, and memory usage. If this command executes successfully and shows your GPU details, the VIB installation was successful. If it returns an error like “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver,” there might be an issue with the VIB installation, GPU compatibility, or BIOS settings.
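A complementary check is to confirm the VIB itself is registered on the host. A sketch; note that the exact VIB name varies by vGPU software release (recent releases often use an NVD-prefixed name):

```
# List installed VIBs and filter for the NVIDIA vGPU Manager entry
esxcli software vib list | grep -i nvd

# Query the GPU through the freshly loaded host driver
nvidia-smi
```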
Step 4: Configure NVIDIA GPUs for vGPU Mode
After installing the NVIDIA VIB on the ESXi host and confirming the driver can communicate with the GPU (via nvidia-smi), you need to ensure the GPU is configured for the correct operational mode for vGPU. Some NVIDIA GPUs can operate in different modes (e.g., graphics/vGPU mode vs. compute mode).
4.1 Enable vGPU Mode (if applicable)
- Action: For certain NVIDIA GPU models (especially those in the Tesla or Data Center series), you might need to switch the GPU from its default "compute" mode to a "graphics" or "vGPU" mode. This change is typically done using tools provided by NVIDIA or via `nvidia-smi` commands on the ESXi host, if supported for that specific configuration.
- Guidance: Refer to the NVIDIA vGPU Deployment Guide specific to your GPU series and vGPU software version. It provides the exact commands or procedures if a mode change is necessary. Checking or changing the operating mode may involve specific `nvidia-smi` queries or other NVIDIA utilities; however, for many modern GPUs and vGPU software versions, the driver automatically selects the appropriate mode for vGPU when licensed correctly.
- Licensing: Ensure your NVIDIA vGPU licensing is correctly configured. The NVIDIA vGPU software relies on a license server to enable vGPU features. Without a valid license, vGPU functionality will be restricted or disabled. The license dictates which vGPU profiles are available.
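As a quick sanity check on the host, the driver itself can report the virtualization mode and the vGPU types it exposes. A sketch, assuming your vGPU software release supports the `nvidia-smi vgpu` subcommand:

```
# Full per-GPU query; on a host configured for vGPU, the report includes
# a GPU Virtualization Mode section (e.g., Host VGPU)
nvidia-smi -q | grep -i virtualization

# List the vGPU types (profiles) the host driver supports, if available in your release
nvidia-smi vgpu -s
```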
4.2 Verify GPU Availability and Passthrough Configuration
- Action: Confirm that ESXi recognizes the NVIDIA GPU(s) as available PCI devices that can be used for passthrough or vGPU.
- Command: On the ESXi host via SSH, run `esxcli hardware pci list | grep -i nvidia`.
- Expected Output: This command lists all PCI devices containing “nvidia” in their description. You should see entries corresponding to your installed NVIDIA GPU(s), including their vendor ID, device ID, and description. This confirms that the ESXi kernel is aware of the hardware.
- vSphere Client Check: You can also check this in the vSphere Client:
- Select the ESXi host in the inventory.
- Navigate to Configure > Hardware > PCI Devices.
- Filter or search for "NVIDIA". The GPUs should be listed here. You would only need to toggle passthrough for a device here if you were doing direct passthrough (DirectPath I/O); for vGPU, the installed VIB handles the GPU sharing mechanism, and the GPU should be listed as available for vGPU.
Note on Passthrough vs. vGPU: While the esxcli hardware pci list command is general, the key for vGPU is the installed NVIDIA VIB which enables the hypervisor to mediate access to the GPU and present virtualized instances (vGPUs) to the VMs, rather than passing through the entire physical device to a single VM.
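The host graphics mode can also be checked and set from the command line. A sketch, assuming ESXi 6.5 or later where the `esxcli graphics` namespace is available; in the vSphere Client this corresponds to the host's Graphics settings, where the default type should be "Shared Direct" for vGPU:

```
# Show the host's default graphics type; "SharedPassthru" corresponds to
# "Shared Direct" (vGPU) in the vSphere Client
esxcli graphics host get

# If the host is still in "Shared" (vSGA) mode, switch the default to vGPU mode
esxcli graphics host set --default-type SharedPassthru

# Restart the Xorg service so the change takes effect (or simply reboot the host)
/etc/init.d/xorg restart

# List the graphics devices and any vGPU-enabled VMs the host currently tracks
esxcli graphics device list
esxcli graphics vm list
```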
Step 5: Assign vGPU to Virtual Machines (VMs)
With the ESXi host properly configured and the NVIDIA vGPU Manager VIB installed, you can now assign vGPU resources to your virtual machines. This process involves editing the VM’s settings to add a shared PCI device, which represents the vGPU.
5.1 Create or Select an Existing Virtual Machine
- Action: In the vSphere Client, either create a new virtual machine or select an existing one that requires GPU acceleration.
- Guest OS Compatibility: Ensure the guest operating system (OS) you plan to use within the VM is supported by NVIDIA vGPU technology and that you have the corresponding NVIDIA guest OS drivers. Supported guest OSes typically include various versions of Windows, Windows Server, and Linux distributions.
5.2 Add vGPU Shared PCI Device to the VM
- Action: Edit the settings of the target virtual machine to add an NVIDIA vGPU.
- Procedure (via vSphere Client):
- Power off the virtual machine. vGPU assignment typically requires the VM to be powered off.
- Right-click the VM in the vSphere inventory and select Edit Settings.
- In the “Virtual Hardware” tab, click Add New Device.
- From the Add New Device menu, select NVIDIA vGPU if it is listed directly; otherwise select Shared PCI Device, which then lets you choose an NVIDIA vGPU profile. Click Add.
- A new device entry will appear in the hardware list. Expand it.
- From the "NVIDIA vGPU Profile" dropdown list, select the desired vGPU profile.
The user interface might vary slightly depending on the vSphere version, but the principle is to add a new device and choose the NVIDIA vGPU type and profile.
5.3 Configure the vGPU Profile
- Explanation: vGPU profiles define how the physical GPU’s resources (e.g., framebuffer/VRAM, number of supported display heads, compute capability) are allocated to the VM. NVIDIA provides a range of profiles (e.g., Q-series for Quadro features, C-series for compute, B-series for business graphics, A-series for virtual applications).
- Selection Criteria: Choose a profile that matches the workload requirements of the VM. For example:
- AI/ML or HPC: Typically require profiles with larger framebuffers and significant compute resources (e.g., C-series or high-end A/Q profiles).
- Virtual Desktops (VDI) / Graphics Workstations: Profiles vary based on the intensity of the graphics applications (e.g., B-series for knowledge workers, Q-series for designers/engineers).
- Resource Reservation: After adding the vGPU, you may need to reserve all guest memory for the VM. In the VM's "Edit Settings," go to the "VM Options" tab, expand "Advanced," and under "Configuration Parameters," ensure `pciPassthru.use64bitMMIO` is set to `TRUE` if required (see the configuration sketch after this list). Also ensure "Reserve all guest memory (All locked)" is checked in the Memory section of the "Virtual Hardware" tab. This is often a requirement for stable vGPU operation.
- Click OK to save the VM settings.
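For reference, the advanced parameters mentioned above end up as entries like the following in the VM's configuration (the `.vmx` file, or Configuration Parameters in Edit Settings). This is a sketch only; the MMIO size shown is a hypothetical value, so consult VMware and NVIDIA sizing guidance for your GPU and the number of profiles assigned:

```
# Enable 64-bit MMIO so large GPU BARs can be mapped above 4 GB
pciPassthru.use64bitMMIO = "TRUE"

# Hypothetical example value; size this according to the total GPU memory
# presented to the VM, per VMware/NVIDIA guidance
pciPassthru.64bitMMIOSizeGB = "128"
```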
5.4 Install NVIDIA Drivers in the Guest OS
- Action: Power on the virtual machine. Once the guest OS boots up, you need to install the appropriate NVIDIA guest OS drivers. These are different from the VIB installed on the ESXi host.
- Driver Source: Download the NVIDIA vGPU software guest OS drivers from the NVIDIA Licensing Portal. Ensure these drivers match the vGPU software version running on the host (VIB version) and are compatible with the selected vGPU profile and the guest OS.
- Installation: Install the drivers within the guest OS following standard driver installation procedures for that OS (e.g., running the setup executable on Windows, or using package managers/scripts on Linux). A reboot of the VM is typically required after driver installation.
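On a Linux guest, the vGPU driver is typically delivered as a `.run` installer. The sketch below uses a hypothetical file name (use the exact package downloaded from the NVIDIA Licensing Portal) and assumes the guest already has the kernel headers and build tools the installer requires:

```
# Make the installer executable and run it (hypothetical file name)
chmod +x NVIDIA-Linux-x86_64-xxx.xx.xx-grid.run
sudo sh ./NVIDIA-Linux-x86_64-xxx.xx.xx-grid.run

# Reboot the VM so the new driver is loaded
sudo reboot
```

After the reboot, `nvidia-smi` in the guest should list the assigned vGPU profile, as described in the next step.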
5.5 Verify vGPU Functionality in the Guest OS
- Action: After the guest OS driver installation and reboot, verify that the vGPU is functioning correctly within the VM.
- Verification on Windows:
- Open Device Manager. The NVIDIA GPU should be listed under “Display adapters” without errors.
- Run the `nvidia-smi` command from a command prompt (usually found in `C:\Program Files\NVIDIA Corporation\NVSMI\`). It should display details of the assigned vGPU profile and its status.
- Verification on Linux:
- Open a terminal and run the `nvidia-smi` command. It should show the vGPU details.
- Check `dmesg` or Xorg logs for any NVIDIA-related errors.
- License Status: Ensure the VM successfully acquires a vGPU license from your NVIDIA license server. The `nvidia-smi` output or the NVIDIA control panel within the guest can often show licensing status (a quick check is sketched below).
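On a Linux guest, a quick way to confirm that a license was acquired is to query the driver. A sketch, assuming the guest has been configured as a licensed client (for example, with the NVIDIA License System, by placing a client configuration token in the documented location and restarting the `nvidia-gridd` service):

```
# Show the licensing section of the driver report; a correctly licensed
# vGPU reports a "Licensed" status (with an expiry for leased licenses)
nvidia-smi -q | grep -i -A 3 license
```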
Step 6: (Optional) Deploy VMware Private AI Foundation with NVIDIA
For organizations looking to build an enterprise-grade AI platform, VMware Private AI Foundation with NVIDIA offers an integrated solution that leverages the vGPU capabilities you’ve just configured. This platform helps streamline the deployment and management of AI/ML workloads.
Key Aspects:
- Install VMware Private AI Foundation Components: This involves deploying specific VMware software components (like Tanzu for Kubernetes workloads, and AI-specific management tools) that are optimized to work with NVIDIA AI Enterprise software. Follow VMware’s official documentation for this deployment.
- Integrate with vGPUs: The vGPU-accelerated VMs become the workhorses for your AI applications. VMware Private AI Foundation provides tools and frameworks to efficiently manage these resources and schedule AI/ML jobs on them.
- Leverage APIs: Utilize both NVIDIA and VMware APIs for programmatic control, monitoring GPU performance, workload optimization, and dynamic resource management. This allows for automation and integration into MLOps pipelines.
This step is an advanced topic beyond basic vGPU setup but represents a common use case for environments that have invested in NVIDIA vGPU technology on VMware.
Troubleshooting Common Issues
While the setup process is generally straightforward if all compatibility and procedural guidelines are followed, issues can arise. Here are some common troubleshooting areas:
- `nvidia-smi` fails on the ESXi host:
- Ensure the VIB is installed correctly and matches the ESXi version.
- Verify BIOS settings (SR-IOV, VT-d).
- Check GPU seating and power.
- Consult `/var/log/vmkernel.log` on ESXi for NVIDIA-related errors (see the host-side check sketch after this list).
- vGPU option not available when editing VM settings:
- Confirm the NVIDIA VIB is installed and `nvidia-smi` works on the host.
- Ensure the GPU supports vGPU and is in the correct mode.
- Check host licensing for NVIDIA vGPU.
- NVIDIA driver fails to install in Guest OS or shows errors:
- Verify you are using the correct NVIDIA guest OS driver version that matches the host VIB and vGPU profile.
- Ensure the VM has sufficient resources (RAM, CPU) and that memory reservation is configured if required.
- Check for OS compatibility.
- VM fails to power on after vGPU assignment:
- Insufficient host GPU resources for the selected vGPU profile (e.g., trying to assign more vGPUs than the physical GPU can support).
- Memory reservation issues.
- Incorrect BIOS settings on the host.
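For the host-side items above, a few shell checks cover most cases. A sketch, run over SSH on the ESXi host:

```
# Is the vGPU Manager VIB actually installed? (name varies by release)
esxcli software vib list | grep -i nvd

# Did the NVIDIA kernel module load? Look for recent NVRM messages
grep -i nvrm /var/log/vmkernel.log | tail -n 20

# Is the GPU visible on the PCI bus at all?
esxcli hardware pci list | grep -i nvidia
```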
Conclusion
Configuring NVIDIA vGPU on VMware ESXi allows businesses to efficiently utilize powerful GPU resources across multiple virtual machines. This unlocks performance for demanding applications like AI/ML, VDI, and graphics-intensive workloads in a virtualized environment. By meticulously following the steps outlined in this guide—from compatibility checks and hardware installation to software configuration on both the host and guest VMs—administrators can create a robust and scalable GPU-accelerated infrastructure. Remember to consult official documentation from VMware and NVIDIA for the most current and detailed information specific to your hardware and software versions.
References
- NVIDIA vGPU Documentation (available on the NVIDIA website/portal)
- VMware Private AI Foundation with NVIDIA Documentation
- Broadcom Compatibility Guide (for VMware hardware compatibility)
- Dell PowerEdge MX760c Technical Documentation
- Broadcom KB Article: Installing and configuring the NVIDIA VIB on ESXi