Guide: GPU and IOPS Performance Testing on a Virtual Machine with NVIDIA GPU Passthrough

Introduction

This document outlines the procedures for setting up the environment and conducting performance tests on a Virtual Machine (VM) running on an ESXi hypervisor. Specifically, it focuses on VMs configured with NVIDIA GPU passthrough. Proper testing is crucial to ensure that the GPU is correctly recognized, utilized, and performing optimally within the VM, and that storage Input/Output Operations Per Second (IOPS) meet the required levels for demanding applications. These tests help validate the stability and performance of the GPU passthrough configuration and overall VM health. We will cover environment setup for GPU-accelerated libraries like PyTorch and TensorFlow, followed by GPU stress testing and storage IOPS tuning.

Section 1: Environment Setup for GPU Accelerated Computing

Before running any performance benchmarks, it’s essential to configure the software environment within the VM correctly. This typically involves setting up isolated Python environments and installing necessary libraries like PyTorch or TensorFlow with CUDA support to leverage the NVIDIA GPU.

1.1 The Importance of Virtual Environments

Using virtual environments (e.g., with `venv`) is highly recommended to manage dependencies for different projects and to avoid conflicts between package versions. Each project can have its own isolated environment with specific libraries.

1.2 Setting up the Environment for PyTorch

The following steps guide you through creating a virtual environment and installing PyTorch with CUDA support, which is essential for GPU computation.

  1. Create a virtual environment (if not already in one):

This command creates a new virtual environment named venv-pytorch in your home directory.

python3 -m venv ~/venv-pytorch
  2. Activate the environment:

To start using the virtual environment, you need to activate it. Your shell prompt will typically change to indicate the active environment.

source ~/venv-pytorch/bin/activate

After activation, your prompt might look like this:

(venv-pytorch) root@gpu:~#
  3. Install PyTorch with CUDA support:

First, upgrade pip, the Python package installer. Then, install PyTorch, torchvision, and torchaudio, specifying the CUDA version (cu121 in this example) via PyTorch's index URL. Ensure the CUDA version matches the NVIDIA driver installed on your system and is compatible with your GPU.

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

1.3 Verifying PyTorch and CUDA Installation

After installation, verify that PyTorch can detect and use the CUDA-enabled GPU:

python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

This command should output True if CUDA is available, followed by the name of your NVIDIA GPU (e.g., NVIDIA GeForce RTX 3090).
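
For a fuller check, the short snippet below (a hedged sketch) also reports the installed PyTorch version, the CUDA version it was built against, and every GPU visible inside the VM:

import torch

# Report the PyTorch build and the CUDA version it was compiled against
print("PyTorch version:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)

# Confirm the passed-through GPU(s) are visible to PyTorch
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))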

1.4 Note on TensorFlow Setup

Setting up an environment for TensorFlow with GPU support follows a similar pattern:

  1. Create and activate a dedicated virtual environment.
  2. Install TensorFlow with GPU support (e.g., pip install tensorflow). TensorFlow’s GPU support is often bundled, but ensure your NVIDIA drivers and CUDA Toolkit versions are compatible with the TensorFlow version you install. Consult the official TensorFlow documentation for specific CUDA and cuDNN version requirements.
  3. Verify the installation by running a TensorFlow command that lists available GPUs (e.g., python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))").
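
For a quick sanity check of GPU execution in TensorFlow (a hedged sketch, assuming TensorFlow 2.x with GPU support installed), list the visible GPUs and run a small matrix multiplication pinned to the first one:

import tensorflow as tf

# List the GPUs TensorFlow can see inside the VM
gpus = tf.config.list_physical_devices('GPU')
print("Visible GPUs:", gpus)

if gpus:
    # Run a small matrix multiplication explicitly on the first GPU
    with tf.device('/GPU:0'):
        a = tf.random.normal((1024, 1024))
        b = tf.random.normal((1024, 1024))
        c = tf.matmul(a, b)
    print("Matmul on GPU succeeded, result shape:", c.shape)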

This document focuses on PyTorch for the GPU burn script, but the principles of environment setup are transferable.

Section 2: GPU Performance and Stability Testing (`gpu_burn.py`)

Once the environment is set up, you can proceed to test the GPU’s performance and stability. This is particularly important in a VM with GPU passthrough to ensure the GPU is functioning correctly under heavy load and that the passthrough configuration is stable. The `gpu_burn.py` script is designed for this purpose.

2.1 Purpose in a Passthrough Environment

Running a GPU stress test like `gpu_burn.py` helps to:

  • Confirm that the VM has full and correct access to the GPU’s capabilities.
  • Identify any potential overheating issues or power supply limitations under sustained load.
  • Detect driver instabilities or passthrough configuration errors that might only manifest under stress.
  • Get a qualitative measure of the GPU’s computational throughput.

2.2 Overview of `gpu_burn.py`

The `gpu_burn.py` script utilizes PyTorch to perform intensive computations on all available NVIDIA GPUs. It allocates significant memory on the GPU and then runs continuous matrix multiplication operations to stress the compute units.

2.3 Script Breakdown: `gpu_burn.py`


import torch
import time

def burn_gpu(duration_sec=120):
    print("Starting GPU burn for", duration_sec, "seconds...")

    # Get all GPUs
    if not torch.cuda.is_available():
        print("NVIDIA CUDA is not available. Exiting.")
        return
    
    device_count = torch.cuda.device_count()
    if device_count == 0:
        print("No NVIDIA GPUs detected. Exiting.")
        return
        
    print(f"Found {device_count} NVIDIA GPU(s).")
    devices = [torch.device(f'cuda:{i}') for i in range(device_count)]
    tensors = []

    # Allocate large tensors to occupy memory and compute
    for i, dev in enumerate(devices):
        torch.cuda.set_device(dev) # Explicitly set current device
        print(f"Initializing {dev} (GPU {i})...")
        try:
            # Create two large tensors per device
            # Matrix size can be adjusted based on GPU memory
            # An 8192x8192 float32 tensor is 8192*8192*4 bytes = 256 MiB
            a = torch.randn((8192, 8192), device=dev, dtype=torch.float32)
            b = torch.randn((8192, 8192), device=dev, dtype=torch.float32)
            tensors.append((a, b))
            print(f"Allocated tensors on {dev}.")
        except RuntimeError as e:
            print(f"Error allocating tensors on {dev}: {e}")
            print(f"Skipping {dev} due to allocation error. This GPU might have insufficient memory or other issues.")
            # Remove the device if allocation fails to avoid errors in the loop
            devices[i] = None # Mark as None
            continue # Move to the next device
    
    # Filter out devices that failed allocation. The tensors list was filled
    # in the same order as allocations succeeded, so a device's position in
    # this filtered list is also its index into tensors.
    active_devices_tensors = []
    for dev in devices:
        if dev is not None:  # Device was successfully initialized
            active_devices_tensors.append({'device': dev, 'tensors_idx': len(active_devices_tensors)})


    if not active_devices_tensors:
        print("No GPUs were successfully initialized with tensors. Exiting.")
        return

    print(f"Starting computation on {len(active_devices_tensors)} GPU(s).")
    start_time = time.time()
    loop_count = 0
    while time.time() - start_time < duration_sec:
        for item in active_devices_tensors:
            dev = item['device']
            # tensors is ordered by successful allocation, so the device's
            # position in active_devices_tensors is also its tensor index
            a, b = tensors[item['tensors_idx']]
            
            torch.cuda.set_device(dev) # Set current device for the operations
            # Heavy compute: Matrix multiplication
            c = torch.matmul(a, b)  
            
            if c.requires_grad:  # Only True if a/b were created with requires_grad=True
                c.mean().backward() 
        
        # Synchronize all active GPUs after operations on them in the loop
        for item in active_devices_tensors:
            torch.cuda.synchronize(item['device'])

        loop_count += 1
        if loop_count % 10 == 0: # Print a status update periodically
            elapsed_time = time.time() - start_time
            print(f"Loop {loop_count}, Elapsed time: {elapsed_time:.2f} seconds...")


    end_time = time.time()
    print(f"GPU burn finished after {end_time - start_time:.2f} seconds and {loop_count} loops.")

if __name__ == '__main__':
    burn_gpu(120)  # Default duration: 2 minutes (120 seconds)
        

Key functionalities:

  • CUDA Availability Check: Ensures CUDA is available and GPUs are detected.
  • GPU Initialization: Iterates through detected GPUs, setting each as the current device.
  • Tensor Allocation: Attempts to allocate two large (8192×8192) float32 tensors on each GPU. The two input tensors consume approximately 2 * (8192*8192*4 bytes) = 512 MiB of GPU memory per GPU, and the matrix multiplication result adds roughly another 256 MiB. The size can be adjusted based on available GPU memory. Error handling is included for GPUs where allocation might fail (e.g., due to insufficient memory).
  • Computation Loop: Continuously performs matrix multiplication (torch.matmul(a, b)) on the allocated tensors for each successfully initialized GPU. This operation is computationally intensive. An optional backward pass (c.mean().backward()) can be included for a more comprehensive workload if tensors are created with requires_grad=True.
  • Synchronization: torch.cuda.synchronize() is used to ensure all operations on a GPU complete before the next iteration or measurement.
  • Duration and Reporting: The test runs for a specified duration_sec (default 120 seconds). It prints status updates periodically and a final summary.

Note: The script has been slightly adapted in this documentation to correctly handle tensor indexing if some GPUs fail initialization, ensuring it uses tensors associated with successfully initialized devices. The original script might require minor adjustments in the main loop to correctly associate tensors if a GPU is skipped after the initial tensor list is populated.
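
As noted in the tensor allocation bullet above, the 8192×8192 matrix size can be tuned to the available GPU memory. A minimal sketch of automating that choice is shown below; it assumes a PyTorch version that provides torch.cuda.mem_get_info (roughly 1.11 or newer) and caps the dimension so the two input tensors use about a quarter of the currently free memory:

import math
import torch

def pick_matrix_size(device, budget_fraction=0.25, max_n=8192):
    # Choose a square matrix dimension so the two float32 input tensors fit
    # within a fraction of the GPU's free memory (the matmul result needs more).
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    budget = free_bytes * budget_fraction
    # Two n x n float32 tensors cost 2 * n * n * 4 bytes
    n = int(math.sqrt(budget / (2 * 4)))
    return min(n, max_n)

if torch.cuda.is_available():
    print("Chosen matrix size:", pick_matrix_size(torch.device('cuda:0')))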

2.4 How to Run and Interpret Results

1. Save the script as `gpu_burn.py` on your VM.

2. Ensure you are in the activated virtual environment where PyTorch is installed.

3. Run the script from the terminal:

python gpu_burn.py

You can change the duration by modifying the burn_gpu(120) call in the script or by parameterizing the script’s main function call.
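
If you prefer not to edit the script for each run, a small argparse wrapper (a hedged sketch that would replace the existing `if __name__ == '__main__':` block) lets you pass the duration on the command line:

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="PyTorch GPU burn test")
    parser.add_argument('--duration', type=int, default=120,
                        help="Test duration in seconds (default: 120)")
    args = parser.parse_args()
    burn_gpu(args.duration)

With this in place, `python gpu_burn.py --duration 300` would run a five-minute burn.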

Interpreting Results:

  • Monitor the console output for any errors, especially during tensor allocation or computation.
  • Observe the GPU utilization and temperature using tools like `nvidia-smi` in another terminal window. High, stable utilization is expected.
  • A successful run completes without errors for the specified duration, indicating stability. The number of loops completed can provide a rough performance metric, comparable across similar hardware or different configurations.
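
To capture utilization and temperature over the whole run rather than watching `nvidia-smi` interactively, a minimal telemetry logger along these lines (an illustrative sketch; the gpu_telemetry.csv filename and one-second interval are arbitrary) can run in a second terminal alongside the burn:

import subprocess
import time

QUERY = "--query-gpu=timestamp,name,utilization.gpu,temperature.gpu,memory.used"

with open("gpu_telemetry.csv", "w") as log:
    for _ in range(120):  # sample once per second for two minutes
        # Query nvidia-smi for a one-line CSV snapshot per GPU
        result = subprocess.run(
            ["nvidia-smi", QUERY, "--format=csv,noheader"],
            capture_output=True, text=True, check=True)
        log.write(result.stdout)
        log.flush()
        time.sleep(1)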

Section 3: Storage IOPS Performance Tuning (`fio_iops_tuner.sh`)

Storage performance, particularly Input/Output Operations Per Second (IOPS), is critical for many VM workloads, including databases, applications with heavy logging, or build systems. The `fio_iops_tuner.sh` script uses the Flexible I/O Tester (FIO) tool to measure and attempt to achieve a target IOPS rate.

3.1 Importance of IOPS for VM Workloads

In a virtualized environment, storage I/O often passes through multiple layers (guest OS, hypervisor, physical storage). Testing IOPS within the VM helps to:

  • Verify that the VM is achieving the expected storage performance from the underlying infrastructure.
  • Identify potential bottlenecks in the storage path.
  • Tune FIO parameters like `numjobs` (number of parallel I/O threads) and `iodepth` (number of I/O requests queued per job) to maximize IOPS for a given workload profile.

3.2 Overview of `fio_iops_tuner.sh`

This shell script automates the process of running FIO with varying parameters (`numjobs` and `iodepth`) to find a configuration that meets or exceeds a `TARGET_IOPS`. It iteratively increases the load until the target is met or a maximum number of attempts is reached.

3.3 Script Breakdown: `fio_iops_tuner.sh`


#!/bin/bash

TARGET_IOPS=30000
MAX_ATTEMPTS=10
FIO_FILE=/mnt/nfs/testfile # IMPORTANT: Ensure this path is writable and on the target storage
OUTPUT_FILE=fio_output.txt

echo "Starting dynamic IOPS tuner to reach $TARGET_IOPS IOPS..."
echo "Using fio file: $FIO_FILE"
echo

# Initial FIO parameters
bs=8k         # Block size
iodepth=32    # Initial I/O depth per job
numjobs=4     # Initial number of parallel jobs
size=2G       # Size of the test file per job
rw=randrw     # Random read/write workload
mix=70        # 70% read, 30% write

for attempt in $(seq 1 $MAX_ATTEMPTS); do
    echo "Attempt $attempt: Running fio with numjobs=$numjobs and iodepth=$iodepth..."

    # Use a unique test file for each attempt (handy if files are kept for later review)
    CURRENT_FIO_FILE="${FIO_FILE}_attempt${attempt}"

    fio --name=rand_iops_tune \
        --ioengine=libaio \
        --rw=$rw \
        --rwmixread=$mix \
        --bs=$bs \
        --iodepth=$iodepth \
        --numjobs=$numjobs \
        --runtime=30 \
        --time_based \
        --group_reporting \
        --size=$size \
        --filename=${CURRENT_FIO_FILE} \
        --output=$OUTPUT_FILE \
        --exitall_on_error # Stop all jobs if one errors

    # Parse total IOPS (read + write). fio output format varies by version:
    # fio 3.x with group_reporting prints one "read: IOPS=..." and one
    # "write: IOPS=..." line per group, and the values may carry a k suffix
    # (e.g. 45.2k). Adjust the patterns below if your fio version differs.
    iops=$(grep -oP '(read|write)\s*:\s*IOPS=\K[0-9.]+k?' $OUTPUT_FILE | \
           awk '{v=$1; if (v ~ /k$/) { sub(/k$/, "", v); v *= 1000 } s += v} END {printf "%d", s}')

    # Fallback for older fio releases that report lowercase "iops=" fields
    if [ -z "$iops" ] || [ "$iops" -eq 0 ]; then
        iops=$(grep -oP 'iops=\K[0-9]+' $OUTPUT_FILE | awk '{s+=$1} END {printf "%d", s}')
    fi
    iops=${iops:-0}


    echo "Result: Total IOPS = $iops"

    if (( iops >= TARGET_IOPS )); then
        echo "Target of $TARGET_IOPS IOPS achieved with numjobs=$numjobs and iodepth=$iodepth"
        # Optional: Clean up test files
        # rm -f ${FIO_FILE}_attempt*
        break
    else
        echo "Not enough. Increasing load..."
        # Increment strategy
        (( numjobs += 2 ))  # Increase number of jobs
        (( iodepth += 16 )) # Increase queue depth
        if [ $attempt -eq $MAX_ATTEMPTS ]; then
            echo "Maximum attempts reached. Target IOPS not achieved."
            # Optional: Clean up test files
            # rm -f ${FIO_FILE}_attempt*
        fi
    fi
    # Optional: Clean up individual attempt file if not needed for detailed review later
    rm -f ${CURRENT_FIO_FILE}
done

echo "Finished tuning. Check $OUTPUT_FILE for detailed output of the last successful or final attempt."
        

Key variables and parameters:

  • `TARGET_IOPS`: The desired IOPS rate the script aims to achieve (e.g., 30000).
  • `MAX_ATTEMPTS`: Maximum number of FIO runs with adjusted parameters (e.g., 10).
  • `FIO_FILE`: Path to the test file FIO will use. Crucially, this file should be on the storage system you intend to test. The script appends `_attemptN` to create unique files per run.
  • `OUTPUT_FILE`: File where FIO’s output is saved.
  • Initial FIO parameters:
    • `bs=8k`: Block size for I/O operations (8 kilobytes).
    • `iodepth=32`: Initial queue depth per job.
    • `numjobs=4`: Initial number of parallel jobs.
    • `size=2G`: Total size of data for each job to process.
    • `rw=randrw`: Specifies a random read/write workload.
    • `mix=70`: Defines the read percentage (70% reads, 30% writes).
    • `runtime=30`: Each FIO test runs for 30 seconds.
    • `ioengine=libaio`: Uses Linux asynchronous I/O.

Tuning Loop:

  1. The script runs FIO with the current `numjobs` and `iodepth`.
  2. It parses the `OUTPUT_FILE` to extract the achieved total IOPS. The `grep` and `awk` commands for IOPS parsing might need adjustment based on the exact FIO version and output format. The documented script includes a more robust parsing attempt for mixed workloads.
  3. If achieved IOPS meet `TARGET_IOPS`, the script reports success and exits.
  4. Otherwise, it increases `numjobs` (by 2) and `iodepth` (by 16) and retries, up to `MAX_ATTEMPTS`.

Note on FIO file path: The script uses `/mnt/nfs/testfile`. Ensure this path is valid, writable by the user running the script, and resides on the storage volume whose performance you want to test. If testing local VM disk, change this to an appropriate path like `/tmp/testfile` or a path on a mounted data disk.

Note on IOPS parsing: FIO output varies between versions. The script sums the read and write IOPS reported on FIO's summary lines (handling the k suffix that newer fio releases use), with a fallback pattern for older output formats that report lowercase iops= fields. If issues arise, manually inspect `fio_output.txt` from a single run to confirm the IOPS reporting format and adjust the `grep/awk` patterns accordingly.
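
An alternative that avoids grep patterns altogether is to have fio emit JSON (via --output-format=json) and sum the per-job read and write IOPS in Python. This is a hedged sketch based on fio's standard JSON layout; the fio_output.json filename is just an example:

import json

# Sum read + write IOPS across all jobs from fio's JSON output,
# produced with: fio ... --output-format=json --output=fio_output.json
with open("fio_output.json") as f:
    data = json.load(f)

total_iops = sum(job["read"]["iops"] + job["write"]["iops"]
                 for job in data["jobs"])
print(f"Total IOPS: {total_iops:.0f}")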

3.4 How to Run and Interpret Results

1. Ensure FIO is installed on your VM (e.g., `sudo apt-get install fio` or `sudo yum install fio`).

2. Save the script as `fio_iops_tuner.sh` and make it executable: `chmod +x fio_iops_tuner.sh`.

3. Modify `TARGET_IOPS`, `FIO_FILE`, and other parameters in the script as needed for your environment and goals.

4. Run the script:

./fio_iops_tuner.sh

Interpreting Results:

  • The script will print the IOPS achieved in each attempt.
  • If the `TARGET_IOPS` is reached, it will indicate the `numjobs` and `iodepth` that achieved this. These parameters can be valuable for configuring applications that are sensitive to storage I/O performance.
  • If the target is not met after `MAX_ATTEMPTS`, the script will indicate this. This might suggest a storage bottleneck or that the target is too high for the current configuration.
  • The `fio_output.txt` file contains detailed FIO output from the last run, which can be inspected for more in-depth analysis (e.g., latency, bandwidth).

Conclusion

Performing the environment setup, GPU stress testing with `gpu_burn.py`, and storage IOPS tuning with `fio_iops_tuner.sh` are vital steps in validating a VM configured with NVIDIA GPU passthrough on ESXi. These tests help ensure that the GPU is operating correctly and delivering expected performance, and that the storage subsystem can handle the I/O demands of your applications. Successful completion of these tests with satisfactory results provides confidence in the stability and capability of your virtualized high-performance computing environment.
