Introduction
This document provides a comprehensive step-by-step guide to configuring NVIDIA Virtual GPU (vGPU) on VMware ESXi, specifically tailored for environments where virtual machines (VMs) require GPU acceleration. This setup is crucial for workloads such as Artificial Intelligence (AI), Machine Learning (ML), high-performance computing (HPC), and advanced graphics virtualization. We will detail the process of enabling NVIDIA GPUs, such as those installed in a Dell PowerEdge MX760c server, to be shared among multiple VMs, enhancing resource utilization and performance.
While the concept of “GPU passthrough” often refers to dedicating an entire physical GPU to a single VM (DirectPath I/O), NVIDIA vGPU technology allows a physical GPU to be partitioned into multiple virtual GPUs. Each vGPU can then be assigned to a different VM, providing a more flexible and scalable solution. This guide focuses on the vGPU setup, which leverages NVIDIA’s drivers and management software in conjunction with VMware vSphere.
The instructions cover compatibility verification, hardware installation, ESXi host configuration, vGPU assignment to VMs, and driver installation within the guest operating systems. Following these steps will enable your virtualized environment to harness the power of NVIDIA GPUs for demanding applications. We will also briefly touch upon integrating this setup with VMware Private AI Foundation with NVIDIA for streamlined AI workload deployment.
Prerequisites
Before proceeding with the configuration, ensure the following prerequisites are met:
- Compatible Server Hardware: A server system that supports NVIDIA GPUs and is certified for the version of ESXi you are running. For instance, the Dell PowerEdge MX760c is supported for ESXi 8.0 Update 3 and is compatible with SR-IOV and NVIDIA GPUs.
- NVIDIA GPU: An NVIDIA GPU that supports vGPU technology. Refer to NVIDIA’s documentation for a list of compatible GPUs.
- VMware ESXi: A compatible version of VMware ESXi installed on your host server. This guide assumes ESXi 8.0 or a similar modern version.
- VMware vCenter Server: While some configurations might be possible without it, vCenter Server is highly recommended for managing vGPU deployments.
- NVIDIA vGPU Software: You will need the NVIDIA vGPU Manager VIB (vSphere Installation Bundle) for ESXi and the corresponding NVIDIA guest OS drivers for the VMs. These are typically available from the NVIDIA Licensing Portal.
- Network Connectivity: Ensure the ESXi host has network access to download necessary files or for management via SSH and vSphere Client.
- Appropriate Licensing: NVIDIA vGPU solutions require licensing. Ensure you have the necessary licenses for your deployment.
Step 1: Verify Compatibility
Ensuring hardware and software compatibility is the foundational step for a successful vGPU deployment. Failure to do so can lead to installation issues, instability, or suboptimal performance.
1.1 Check Server Compatibility
Your server must be certified to run the intended ESXi version and support the specific NVIDIA GPU model you plan to use. Server vendors often provide compatibility matrices.
- Action: Use the Broadcom Compatibility Guide (formerly VMware Compatibility Guide) to confirm your server model’s support for ESXi (e.g., ESXi 8.0 Update 3) and its compatibility with NVIDIA GPUs.
- Example: The Dell PowerEdge MX760c is listed as a supported server model for ESXi 8.0 Update 3 and is known to be compatible with SR-IOV and NVIDIA GPUs, making it suitable for vGPU deployments.
- Details: Compatibility verification includes checking for BIOS support for virtualization technologies (VT-d/IOMMU, SR-IOV), adequate power supply and cooling for the GPU, and physical PCIe slot availability.
1.2 Check GPU Compatibility
Not all NVIDIA GPUs support vGPU, and among those that do, compatibility varies with ESXi versions and NVIDIA vGPU software versions.
- Action: Consult the official NVIDIA vGPU documentation and the NVIDIA Virtual GPU Software Supported Products List. This documentation provides detailed information on which GPUs are supported, the required vGPU software versions, and compatible ESXi versions.
- Details: Pay close attention to the specific vGPU profiles supported by your chosen GPU, as these profiles determine how the GPU’s resources are partitioned and allocated to VMs. Ensure the GPU firmware is up to date as recommended by NVIDIA or your server vendor.
Note: Always use the latest available compatibility information from both VMware/Broadcom and NVIDIA, as these are updated regularly with new hardware and software releases.
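To know exactly what to look up in these compatibility guides, you can read the ESXi build and the server model directly from the host. A minimal sketch, assuming SSH access to the ESXi host:

```
# Print the exact ESXi version and build number to match against the Broadcom Compatibility Guide
vmware -vl

# Print the platform information (vendor and product name) that ESXi reports for the server
esxcli hardware platform get
```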
Step 2: Install NVIDIA GPU on the Host
Once compatibility is confirmed, the next step is to physically install the NVIDIA GPU into the ESXi host server and configure the server’s BIOS/UEFI settings appropriately.
2.1 Add the GPU as a PCI Device to the Host
- Action: Physically install the NVIDIA GPU into an appropriate PCIe slot in the PowerEdge MX760c or your compatible server.
- Procedure:
- Power down and unplug the server. Follow all electrostatic discharge (ESD) precautions.
- Open the server chassis according to the manufacturer’s instructions.
- Identify a suitable PCIe slot. High-performance GPUs usually require an x16 PCIe slot and may need auxiliary power connectors.
- Insert the GPU firmly into the slot and secure it. Connect any necessary auxiliary power cables directly from the server’s power supply to the GPU.
- Close the server chassis and reconnect power.
- Considerations: Ensure the server’s Power Supply Unit (PSU) can handle the additional power load from the GPU. Check server documentation for slot priority or specific slots designated for GPUs. Proper airflow and cooling are also critical for GPU stability and longevity.
2.2 Update Server BIOS/UEFI Settings
Several BIOS/UEFI settings must be enabled to support GPU passthrough and virtualization technologies like vGPU.
- Action: Boot the server and enter the BIOS/UEFI setup utility (commonly by pressing F2, DEL, or another designated key during startup).
- Key Settings to Enable:
- Virtualization Technology (VT-x / AMD-V): Usually enabled by default, but verify.
- SR-IOV (Single Root I/O Virtualization): This is critical for many vGPU deployments as it allows a PCIe device to appear as multiple separate physical devices. Locate this setting, often under “Integrated Devices,” “PCIe Configuration,” or “Processor Settings.”
- VT-d (Intel Virtualization Technology for Directed I/O) / AMD IOMMU: This technology enables direct assignment of PCIe devices to VMs and is essential for passthrough and vGPU functionality.
- Memory Mapped I/O above 4GB (Above 4G Decoding): Enable this if available, as GPUs require significant address space.
- Disable conflicting settings such as on-board graphics if they interfere, though in many servers they can coexist with a discrete GPU.
- Save and Exit: After making changes, save the settings and exit the BIOS/UEFI utility. The server will reboot.
Important: The exact naming and location of these settings can vary significantly between server manufacturers and BIOS versions. Consult your server’s technical documentation for specific instructions.
Step 3: Install NVIDIA VIB on ESXi Host
With the hardware installed and BIOS configured, the next phase involves installing the NVIDIA vGPU Manager VIB (vSphere Installation Bundle) on the ESXi host. This software component enables the ESXi hypervisor to recognize and manage the NVIDIA GPU for vGPU operations.
A detailed guide from Broadcom can be found here: Installing and configuring the NVIDIA VIB on ESXi.
3.1 Download the NVIDIA vGPU Manager VIB
- Action: Obtain the correct NVIDIA vGPU Manager VIB package for your ESXi version and GPU model. This software is typically downloaded from the NVIDIA Licensing Portal (accessed via the NVIDIA Enterprise Application Hub).
- Critical: Ensure the VIB version matches your ESXi host version (e.g., ESXi 8.0, 8.0 U1, 8.0 U2, 8.0 U3). Using an incompatible VIB can lead to installation failure or system instability. The VIB package will be a `.vib` file.
3.2 Upload the VIB to the ESXi Host
- Action: Transfer the downloaded `.vib` file to a datastore accessible by your ESXi host, or directly to a temporary location on the host (e.g., `/tmp`).
- Method: Use an SCP client (such as WinSCP for Windows, or the `scp` command-line utility for Linux/macOS) or the datastore browser in the vSphere Client to upload the VIB file.
- Example using SCP: `scp /path/to/local/vgpu-manager.vib root@your_esxi_host_ip:/vmfs/volumes/your_datastore/`
3.3 Install the VIB
- Action: Place the ESXi host into maintenance mode. This is crucial to ensure no VMs are running during the driver installation and subsequent reboot. You can do this via the vSphere Client (right-click host > Maintenance Mode > Enter Maintenance Mode).
- Procedure:
- Enable SSH on the ESXi host if it’s not already enabled (vSphere Client: Host > Configure > Services > SSH > Start).
- Connect to the ESXi host using an SSH client (e.g., PuTTY or command-line SSH).
- Navigate to the directory where you uploaded the VIB, or use the full path to the VIB file.
- Run the VIB installation command, replacing the path with the actual location of your VIB file: `esxcli software vib install -v /vmfs/volumes/your_datastore/vgpu-manager.vib`
- Alternatively, if the file was uploaded to `/tmp`: `esxcli software vib install -v /tmp/vgpu-manager.vib`
- This command might require the `--no-sig-check` flag if the VIB is not signed by a trusted source or if you encounter signature verification issues, though official NVIDIA VIBs should be signed.
- After successful installation, the command output will indicate that the VIB has been installed and a reboot is required.
- Reboot the ESXi host: `reboot`
- Once the host has rebooted, exit maintenance mode.
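For reference, the same sequence can be driven entirely from the ESXi shell. A minimal sketch, reusing the example datastore path from above:

```
# Put the host into maintenance mode (equivalent to the vSphere Client action)
esxcli system maintenanceMode set --enable true

# Install the vGPU Manager VIB (adjust the path to where you uploaded the file)
esxcli software vib install -v /vmfs/volumes/your_datastore/vgpu-manager.vib

# Reboot so the NVIDIA host driver is loaded
reboot

# After the host is back online, exit maintenance mode
esxcli system maintenanceMode set --enable false
```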
3.4 Verify VIB Installation
- Action: After the ESXi host reboots, verify that the NVIDIA VIB is installed correctly and the GPU is recognized.
- Command: SSH into the ESXi host and run `nvidia-smi`.
- Expected Output: This command should display information about the NVIDIA GPU(s) installed in the host, including GPU model, driver version, temperature, and memory usage. If this command executes successfully and shows your GPU details, the VIB installation was successful. If it returns an error like “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver,” there might be an issue with the VIB installation, GPU compatibility, or BIOS settings.
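A complementary check is to confirm the VIB itself is registered on the host. A sketch; note that the exact VIB name varies by vGPU software release (recent releases often use an NVD-prefixed name):

```
# List installed VIBs and filter for the NVIDIA vGPU Manager entry
esxcli software vib list | grep -i nvd

# Query the GPU through the freshly loaded host driver
nvidia-smi
```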
Step 4: Configure NVIDIA GPUs for vGPU Mode
After installing the NVIDIA VIB on the ESXi host and confirming the driver can communicate with the GPU (via nvidia-smi), you need to ensure the GPU is configured for the correct operational mode for vGPU. Some NVIDIA GPUs can operate in different modes (e.g., graphics/vGPU mode vs. compute mode).
4.1 Enable vGPU Mode (if applicable)
- Action: For certain NVIDIA GPU models (especially those in the Tesla or Data Center series), you might need to switch the GPU from its default "compute" mode to a "graphics" or "vGPU" mode. This change is typically done using tools provided by NVIDIA or via `nvidia-smi` commands on the ESXi host, if supported for that specific configuration.
- Guidance: Refer to the NVIDIA vGPU Deployment Guide specific to your GPU series and vGPU software version. It provides the exact commands or procedures if a mode change is necessary. Checking or changing the operating mode may involve specific `nvidia-smi` queries or other NVIDIA utilities; however, for many modern GPUs and vGPU software versions, the driver automatically selects the appropriate mode for vGPU when licensed correctly.
- Licensing: Ensure your NVIDIA vGPU licensing is correctly configured. The NVIDIA vGPU software relies on a license server to enable vGPU features. Without a valid license, vGPU functionality will be restricted or disabled. The license dictates which vGPU profiles are available.
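As a quick sanity check on the host, the driver itself can report the virtualization mode and the vGPU types it exposes. A sketch, assuming your vGPU software release supports the `nvidia-smi vgpu` subcommand:

```
# Full per-GPU query; on a host configured for vGPU, the report includes
# a GPU Virtualization Mode section (e.g., Host VGPU)
nvidia-smi -q | grep -i virtualization

# List the vGPU types (profiles) the host driver supports, if available in your release
nvidia-smi vgpu -s
```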
4.2 Verify GPU Availability and Passthrough Configuration
- Action: Confirm that ESXi recognizes the NVIDIA GPU(s) as available PCI devices that can be used for passthrough or vGPU.
- Command: On the ESXi host via SSH, run `esxcli hardware pci list | grep -i nvidia`.
- Expected Output: This command lists all PCI devices containing “nvidia” in their description. You should see entries corresponding to your installed NVIDIA GPU(s), including their vendor ID, device ID, and description. This confirms that the ESXi kernel is aware of the hardware.
- vSphere Client Check: You can also check this in the vSphere Client:
- Select the ESXi host in the inventory.
- Navigate to Configure > Hardware > PCI Devices.
- Filter or search for "NVIDIA". The GPUs should be listed here. You would only need to toggle passthrough for a device here if you were doing direct passthrough (DirectPath I/O); for vGPU, the installed VIB handles the GPU sharing mechanism, and the GPU should be listed as available for vGPU.
Note on Passthrough vs. vGPU: While the esxcli hardware pci list command is general, the key for vGPU is the installed NVIDIA VIB which enables the hypervisor to mediate access to the GPU and present virtualized instances (vGPUs) to the VMs, rather than passing through the entire physical device to a single VM.
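The host graphics mode can also be checked and set from the command line. A sketch, assuming ESXi 6.5 or later where the `esxcli graphics` namespace is available; in the vSphere Client this corresponds to the host's Graphics settings, where the default type should be "Shared Direct" for vGPU:

```
# Show the host's default graphics type; "SharedPassthru" corresponds to
# "Shared Direct" (vGPU) in the vSphere Client
esxcli graphics host get

# If the host is still in "Shared" (vSGA) mode, switch the default to vGPU mode
esxcli graphics host set --default-type SharedPassthru

# Restart the Xorg service so the change takes effect (or simply reboot the host)
/etc/init.d/xorg restart

# List the graphics devices and any vGPU-enabled VMs the host currently tracks
esxcli graphics device list
esxcli graphics vm list
```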
Step 5: Assign vGPU to Virtual Machines (VMs)
With the ESXi host properly configured and the NVIDIA vGPU Manager VIB installed, you can now assign vGPU resources to your virtual machines. This process involves editing the VM’s settings to add a shared PCI device, which represents the vGPU.
5.1 Create or Select an Existing Virtual Machine
- Action: In the vSphere Client, either create a new virtual machine or select an existing one that requires GPU acceleration.
- Guest OS Compatibility: Ensure the guest operating system (OS) you plan to use within the VM is supported by NVIDIA vGPU technology and that you have the corresponding NVIDIA guest OS drivers. Supported guest OSes typically include various versions of Windows, Windows Server, and Linux distributions.
5.2 Add vGPU Shared PCI Device to the VM
- Action: Edit the settings of the target virtual machine to add an NVIDIA vGPU.
- Procedure (via vSphere Client):
- Power off the virtual machine. vGPU assignment typically requires the VM to be powered off.
- Right-click the VM in the vSphere inventory and select Edit Settings.
- In the “Virtual Hardware” tab, click Add New Device.
- From the Add New Device menu, select NVIDIA vGPU if it is listed directly; otherwise select Shared PCI Device, which then lets you choose an NVIDIA vGPU profile. Click Add.
- A new device entry will appear in the hardware list. Expand it.
- From the "NVIDIA vGPU Profile" dropdown list, select the desired vGPU profile.
The user interface might vary slightly depending on the vSphere version, but the principle is to add a new device and choose the NVIDIA vGPU type and profile.
5.3 Configure the vGPU Profile
- Explanation: vGPU profiles define how the physical GPU’s resources (e.g., framebuffer/VRAM, number of supported display heads, compute capability) are allocated to the VM. NVIDIA provides a range of profiles (e.g., Q-series for Quadro features, C-series for compute, B-series for business graphics, A-series for virtual applications).
- Selection Criteria: Choose a profile that matches the workload requirements of the VM. For example:
- AI/ML or HPC: Typically require profiles with larger framebuffers and significant compute resources (e.g., C-series or high-end A/Q profiles).
- Virtual Desktops (VDI) / Graphics Workstations: Profiles vary based on the intensity of the graphics applications (e.g., B-series for knowledge workers, Q-series for designers/engineers).
- Resource Reservation: After adding the vGPU, you may need to reserve all guest memory for the VM. In the VM's "Edit Settings," go to the "VM Options" tab, expand "Advanced," and under "Configuration Parameters," ensure `pciPassthru.use64bitMMIO` is set to `TRUE` if required (see the configuration sketch after this list). Also ensure "Reserve all guest memory (All locked)" is checked in the Memory section of the "Virtual Hardware" tab. This is often a requirement for stable vGPU operation.
- Click OK to save the VM settings.
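For reference, the advanced parameters mentioned above end up as entries like the following in the VM's configuration (the `.vmx` file, or Configuration Parameters in Edit Settings). This is a sketch only; the MMIO size shown is a hypothetical value, so consult VMware and NVIDIA sizing guidance for your GPU and the number of profiles assigned:

```
# Enable 64-bit MMIO so large GPU BARs can be mapped above 4 GB
pciPassthru.use64bitMMIO = "TRUE"

# Hypothetical example value; size this according to the total GPU memory
# presented to the VM, per VMware/NVIDIA guidance
pciPassthru.64bitMMIOSizeGB = "128"
```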
5.4 Install NVIDIA Drivers in the Guest OS
- Action: Power on the virtual machine. Once the guest OS boots up, you need to install the appropriate NVIDIA guest OS drivers. These are different from the VIB installed on the ESXi host.
- Driver Source: Download the NVIDIA vGPU software guest OS drivers from the NVIDIA Licensing Portal. Ensure these drivers match the vGPU software version running on the host (VIB version) and are compatible with the selected vGPU profile and the guest OS.
- Installation: Install the drivers within the guest OS following standard driver installation procedures for that OS (e.g., running the setup executable on Windows, or using package managers/scripts on Linux). A reboot of the VM is typically required after driver installation.
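On a Linux guest, the vGPU driver is typically delivered as a `.run` installer. The sketch below uses a hypothetical file name (use the exact package downloaded from the NVIDIA Licensing Portal) and assumes the guest already has the kernel headers and build tools the installer requires:

```
# Make the installer executable and run it (hypothetical file name)
chmod +x NVIDIA-Linux-x86_64-xxx.xx.xx-grid.run
sudo sh ./NVIDIA-Linux-x86_64-xxx.xx.xx-grid.run

# Reboot the VM so the new driver is loaded
sudo reboot
```

After the reboot, `nvidia-smi` in the guest should list the assigned vGPU profile, as described in the next step.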
5.5 Verify vGPU Functionality in the Guest OS
- Action: After the guest OS driver installation and reboot, verify that the vGPU is functioning correctly within the VM.
- Verification on Windows:
- Open Device Manager. The NVIDIA GPU should be listed under “Display adapters” without errors.
- Run the `nvidia-smi` command from a command prompt (usually found in `C:\Program Files\NVIDIA Corporation\NVSMI\`). It should display details of the assigned vGPU profile and its status.
- Verification on Linux:
- Open a terminal and run the `nvidia-smi` command. It should show the vGPU details.
- Check `dmesg` or Xorg logs for any NVIDIA-related errors.
- License Status: Ensure the VM successfully acquires a vGPU license from your NVIDIA license server. The `nvidia-smi` output or the NVIDIA control panel within the guest can often show licensing status (a quick check is sketched below).
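On a Linux guest, a quick way to confirm that a license was acquired is to query the driver. A sketch, assuming the guest has been configured as a licensed client (for example, with the NVIDIA License System, by placing a client configuration token in the documented location and restarting the `nvidia-gridd` service):

```
# Show the licensing section of the driver report; a correctly licensed
# vGPU reports a "Licensed" status (with an expiry for leased licenses)
nvidia-smi -q | grep -i -A 3 license
```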
Step 6: (Optional) Deploy VMware Private AI Foundation with NVIDIA
For organizations looking to build an enterprise-grade AI platform, VMware Private AI Foundation with NVIDIA offers an integrated solution that leverages the vGPU capabilities you’ve just configured. This platform helps streamline the deployment and management of AI/ML workloads.
Key Aspects:
- Install VMware Private AI Foundation Components: This involves deploying specific VMware software components (like Tanzu for Kubernetes workloads, and AI-specific management tools) that are optimized to work with NVIDIA AI Enterprise software. Follow VMware’s official documentation for this deployment.
- Integrate with vGPUs: The vGPU-accelerated VMs become the workhorses for your AI applications. VMware Private AI Foundation provides tools and frameworks to efficiently manage these resources and schedule AI/ML jobs on them.
- Leverage APIs: Utilize both NVIDIA and VMware APIs for programmatic control, monitoring GPU performance, workload optimization, and dynamic resource management. This allows for automation and integration into MLOps pipelines.
This step is an advanced topic beyond basic vGPU setup but represents a common use case for environments that have invested in NVIDIA vGPU technology on VMware.
Troubleshooting Common Issues
While the setup process is generally straightforward if all compatibility and procedural guidelines are followed, issues can arise. Here are some common troubleshooting areas:
- `nvidia-smi` fails on the ESXi host:
- Ensure the VIB is installed correctly and matches the ESXi version.
- Verify BIOS settings (SR-IOV, VT-d).
- Check GPU seating and power.
- Consult `/var/log/vmkernel.log` on ESXi for NVIDIA-related errors (see the host-side check sketch after this list).
- vGPU option not available when editing VM settings:
- Confirm the NVIDIA VIB is installed and `nvidia-smi` works on the host.
- Ensure the GPU supports vGPU and is in the correct mode.
- Check host licensing for NVIDIA vGPU.
- NVIDIA driver fails to install in Guest OS or shows errors:
- Verify you are using the correct NVIDIA guest OS driver version that matches the host VIB and vGPU profile.
- Ensure the VM has sufficient resources (RAM, CPU) and that memory reservation is configured if required.
- Check for OS compatibility.
- VM fails to power on after vGPU assignment:
- Insufficient host GPU resources for the selected vGPU profile (e.g., trying to assign more vGPUs than the physical GPU can support).
- Memory reservation issues.
- Incorrect BIOS settings on the host.
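For the host-side items above, a few shell checks cover most cases. A sketch, run over SSH on the ESXi host:

```
# Is the vGPU Manager VIB actually installed? (name varies by release)
esxcli software vib list | grep -i nvd

# Did the NVIDIA kernel module load? Look for recent NVRM messages
grep -i nvrm /var/log/vmkernel.log | tail -n 20

# Is the GPU visible on the PCI bus at all?
esxcli hardware pci list | grep -i nvidia
```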
Conclusion
Configuring NVIDIA vGPU on VMware ESXi allows businesses to efficiently utilize powerful GPU resources across multiple virtual machines. This unlocks performance for demanding applications like AI/ML, VDI, and graphics-intensive workloads in a virtualized environment. By meticulously following the steps outlined in this guide—from compatibility checks and hardware installation to software configuration on both the host and guest VMs—administrators can create a robust and scalable GPU-accelerated infrastructure. Remember to consult official documentation from VMware and NVIDIA for the most current and detailed information specific to your hardware and software versions.
References
- NVIDIA vGPU Documentation (available on the NVIDIA website/portal)
- VMware Private AI Foundation with NVIDIA Documentation
- Broadcom Compatibility Guide (for VMware hardware compatibility)
- Dell PowerEdge MX760c Technical Documentation
- Broadcom KB Article: Installing and configuring the NVIDIA VIB on ESXi