Using NVIDIA GPUs in VMware ESXi and Leveraging GPU for NFS Storage

Overview

This page provides a comprehensive guide on how to enable and utilize NVIDIA GPUs in a VMware ESXi environment, and how GPU acceleration can be leveraged to enhance NFS storage performance. The document includes steps for configuring GPU passthrough, attaching GPU resources to virtual machines (VMs), and utilizing GPU capabilities for NFS storage testing, thereby optimizing data transfer and storage workloads.

Background

  • VMware ESXi: A bare-metal hypervisor that enables virtualization of hardware resources to run multiple VMs.
  • NVIDIA GPU: Offers accelerated hardware computation, commonly used for graphics, compute tasks, and now storage offload and acceleration.
  • NFS (Network File System): A distributed file system protocol allowing clients to access data over a network.

Objectives

  • Enable NVIDIA GPU usage within an ESXi environment.
  • Leverage GPU acceleration to improve NFS storage performance.
  • Demonstrate and test the impact of GPU usage on NFS throughput and latency.

Prerequisites

  • VMware ESXi 7.0 or higher installed and running on supported hardware.
  • NVIDIA GPU installed and verified as compatible with the ESXi host.
  • VMware vSphere client access.
  • For vGPU deployments, the latest NVIDIA vGPU host driver (VIB) installed on the ESXi host; for DirectPath I/O passthrough, drivers are installed in the guest OS instead.
  • NFS storage configured and accessible on the network.
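
Quick sanity checks from the ESXi shell before starting:

# Confirm the ESXi version (should report 7.0 or later)
vmware -vl

# Confirm the host detects the NVIDIA GPU on the PCI bus
esxcli hardware pci list | grep -i nvidia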

Step 1: Enable GPU Passthrough on ESXi Host

  1. Log in to ESXi host using vSphere Client.
  2. Navigate to Host > Manage > Hardware > PCI Devices.
  3. Locate the installed NVIDIA GPU in the device list.
  4. Select the NVIDIA GPU device and toggle passthrough on.
  5. Reboot the ESXi host to enable passthrough mode.
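
On ESXi 7.0 and later, the same toggle can also be scripted from the ESXi shell; exact option syntax can vary slightly by build, and <address> below is a placeholder for your GPU's PCI address from the list output:

# List passthrough-capable PCI devices and their current state
esxcli hardware pci pcipassthru list

# Enable passthrough for the GPU (replace <address>, e.g. 0000:3b:00.0)
esxcli hardware pci pcipassthru set -d <address> -e true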

Step 2: Attach the GPU to a Virtual Machine

  1. Edit settings of the target virtual machine.
  2. Add a new PCI device, then select the NVIDIA GPU from the list of available devices.
  3. Ensure that the appropriate NVIDIA guest drivers (and the CUDA toolkit, if compute workloads require it) are installed inside the guest OS of the VM.
  4. Power on the VM and verify GPU detection with tools like nvidia-smi.
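
A minimal in-guest check on a Linux VM, assuming the NVIDIA driver package for the distribution is already installed:

# Confirm the GPU is visible on the VM's PCI bus
lspci | grep -i nvidia

# Confirm the NVIDIA driver is loaded and can query the GPU
nvidia-smi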

Step 3: Leverage GPU for NFS Storage Performance Testing

Modern GPUs can accelerate storage workloads using NVIDIA GPUDirect Storage (GDS) or similar technologies, which establish a direct data path between storage and GPU memory and offload data movement from the CPU.
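
If GPUDirect Storage is installed in the guest (it ships with recent CUDA toolkits), the gdscheck utility reports whether the platform supports it; the path below assumes the default CUDA install location:

# Verify GDS/cuFile support in the guest
/usr/local/cuda/gds/tools/gdscheck -p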

  1. Install or deploy a testing tool on the VM (e.g., NVIDIA's GDS tools for Linux such as gdsio, fio, or IOmeter).
  2. Configure tools to utilize the attached GPU for storage operations, if supported (via CUDA-accelerated paths or GDS).
  3. Mount the NFS storage on the VM and configure test parameters:
mount -t nfs <nfs-server>:/share /mnt/nfs
fio --name=read_test --directory=/mnt/nfs --ioengine=libaio --rw=read --bs=1M --size=10G --numjobs=4 --runtime=300 --time_based --direct=1 --output=read_results.txt

If the testing tool supports GPU acceleration (such as GDS), include the relevant options to utilize the GPU for data transfers. Consult the tool documentation for specific flags and parameters.
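
As one concrete example, fio builds compiled with cuFile support provide a libcufile I/O engine; the sketch below assumes such a build, GDS installed in the guest, and GPU device 0 attached:

# GPU-accelerated read test via GDS (requires fio built with cuFile support)
fio --name=gds_read --ioengine=libcufile --cuda_io=cufile --gpu_dev_ids=0 \
    --directory=/mnt/nfs --rw=read --bs=1M --size=10G --numjobs=4 \
    --runtime=300 --time_based --direct=1 --output=gds_read_results.txt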

Explanation of the Test and Expected Outcomes

  • Purpose of the Test: To benchmark NFS storage performance with and without GPU acceleration, measuring throughput and latency improvements when leveraging the GPU.
  • How It Works: When properly configured (for example, with GDS), data moves directly between storage and GPU memory, bypassing a bounce buffer in host memory; this reduces CPU overhead and boosts effective bandwidth.
  • Expected Benefits:
    • Increased I/O throughput for sequential and random reads/writes.
    • Reduced data movement latency, especially for workloads involving large files or datasets.
    • Optimized CPU utilization, freeing host resources for other tasks.

Result Analysis

Compare the fio (or other tool) output for runs with GPU acceleration enabled vs disabled. Look for improvements in:

  • Bandwidth (MB/s)
  • Average IOPS
  • Average latency (ms)

If GPU offload is effective, you should see measurable gains in these metrics, particularly as data size and throughput demands increase.
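
One way to make the comparison repeatable is to emit fio results as JSON and extract the headline numbers; this sketch assumes jq is installed on the VM:

# Re-run the read test with JSON output
fio --name=read_test --directory=/mnt/nfs --ioengine=libaio --rw=read --bs=1M \
    --size=10G --numjobs=4 --runtime=300 --time_based --direct=1 \
    --group_reporting --output-format=json --output=read_results.json

# Pull out bandwidth (KiB/s), IOPS, and mean completion latency (ns)
jq '.jobs[0].read | {bw_kib_s: .bw, iops: .iops, mean_lat_ns: .lat_ns.mean}' read_results.json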

Command Reference & Workflows

The reference sections below help administrators thoroughly test and optimize NFS storage performance and take full advantage of GPU resources for virtual machines (VMs).

1. NFS Performance Test on ESXi

NFS is commonly used as a shared datastore in ESXi environments. Testing its performance ensures the storage subsystem meets your requirements for throughput and latency.

Test Workflow

  1. Configure NFS Storage:
    • Add your NFS datastore to ESXi using the vSphere UI or CLI.
  2. Prepare the Test VM:
    • Deploy a lightweight Linux VM (such as Ubuntu or CentOS) on the NFS-backed datastore.
  3. Install Performance Testing Tools:
    • SSH into the VM and install fio and/or iozone for flexible I/O benchmarking.
  4. Run Performance Tests:
    • Execute a set of I/O tests to simulate various workloads (sequential read/write, random read/write, etc.).

Sample Commands

# Add NFS datastore (ESXi shell)
esxcli storage nfs add --host=<nfs-server> --share=<share-path> --volume-name=<datastore-name>

# On the test VM, install fio and run a sample test
sudo apt-get update && sudo apt-get install -y fio
fio --name=seqwrite --ioengine=libaio --rw=write --bs=1M --size=1G --numjobs=1 --runtime=60 --direct=1 --group_reporting

# On the test VM, install iozone and run a comprehensive test
sudo apt-get install -y iozone3
iozone -a -g 2G

Explanation

  • The esxcli storage nfs add command mounts the NFS datastore on your ESXi host.
  • Performance tools like fio and iozone mimic real-world I/O operations to test bandwidth, IOPS, and latency.
  • Test multiple block sizes, job counts, and I/O patterns to get a comprehensive view of performance (see the sweep sketch below).
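
A small shell loop covers that matrix; this sketch assumes the test runs on the VM with the NFS share mounted at /mnt/nfs:

# Sweep block sizes and I/O patterns against the NFS mount
for bs in 4k 64k 1m; do
  for rw in read write randread randwrite; do
    fio --name=${rw}_${bs} --directory=/mnt/nfs --ioengine=libaio --rw=$rw \
        --bs=$bs --size=1G --numjobs=4 --runtime=60 --time_based --direct=1 \
        --group_reporting --output=${rw}_${bs}.txt
  done
done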

Interpreting Results

  • Bandwidth (MB/s): Indicates the data transfer speed.
  • IOPS (Input/Output Operations per Second): Measures how many operations your system can perform per second.
  • Latency: The time taken to complete each I/O operation. Lower values are preferred.

By systematically running these tests, you identify the optimal NFS settings and network configurations for your workloads.

2. Using NVIDIA GPU on ESXi — Command Reference & Workflow

NVIDIA GPUs can be leveraged on ESXi hosts to accelerate VM workloads such as AI/ML, graphics rendering, and other computational tasks. NVIDIA vGPU shares a single physical GPU among multiple VMs, while DirectPath I/O (passthrough) dedicates the entire GPU to one VM.

A. List All NVIDIA GPUs on ESXi

# ESXi Shell command (requires the NVIDIA host driver VIB to be installed):
nvidia-smi

# Or via ESXi CLI:
esxcli hardware pci list | grep -i nvidia

B. Enable GPU Passthrough (DirectPath I/O)

  1. Enable the relevant PCI device for passthrough in the vSphere Web Client.
  2. Reboot the ESXi host if prompted.
  3. Edit the intended VM’s settings and add the PCI device corresponding to the NVIDIA GPU.
# List all PCI devices and their IDs
esxcli hardware pci list

# (Identify the appropriate device/vendor IDs for your GPU.)

C. Assign vGPU Profiles (for supported NVIDIA cards)

  1. Install the NVIDIA VIB on ESXi (download the latest package from the NVIDIA Licensing Portal).
  2. Reboot after installation:
# Place the host in maintenance mode before installing the VIB
esxcli system maintenanceMode set --enable true
esxcli software vib install -v /path/to/NVIDIA-VMware*.vib
reboot
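
After the host comes back up, confirm the VIB is present and the host driver can see the GPU, then exit maintenance mode:

# Confirm the NVIDIA VIB is installed
esxcli software vib list | grep -i nvidia

# Confirm the host driver detects the GPU
nvidia-smi

# Exit maintenance mode
esxcli system maintenanceMode set --enable false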

D. Validate GPU in Guest VM

  • After PCI passthrough, install the NVIDIA Driver inside the VM.
  • Validate by running nvidia-smi inside the guest OS.

Workflow Summary

  1. Identify the NVIDIA GPU device with esxcli hardware pci list or nvidia-smi.
  2. Enable passthrough or configure vGPU profiles via vSphere Client.
  3. Install NVIDIA VIB on ESXi for vGPU scenarios.
  4. Attach GPU (or vGPU profile) to selected VM and install guest drivers.
  5. Verify GPU availability within the guest VM.

3. Best Practices & Considerations

  • Ensure your ESXi version is compatible with the NVIDIA GPU model and vGPU software version.
  • Plan NFS storage for both throughput and latency, especially for GPU-accelerated workloads requiring fast data movement.
  • Monitor and troubleshoot using ESXi logs and NVIDIA tools to fine-tune performance.
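
A couple of host-side starting points for that monitoring (nvidia-smi requires the NVIDIA host driver VIB):

# Watch per-GPU utilization and memory from the ESXi shell
nvidia-smi dmon -s um

# Scan host logs for GPU/passthrough-related messages
grep -i nvidia /var/log/vmkernel.log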

