NFS APD on ESXi – End‑to‑End Troubleshooting Guide

What APD Means for NFS Datastores

For NFS, an APD essentially means the ESXi host has lost communication on the TCP connection to the NFS server for long enough that the storage stack starts treating the datastore as unavailable. An internal APD timer starts after a few seconds of no communication on the NFS TCP stream; if this continues for roughly 140 seconds (the default value of Misc.APDTimeout), the host declares an APD timeout for that datastore.
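You can confirm how a host is configured for APD handling from the ESXi shell. A minimal check using the standard esxcli advanced-settings commands (the set example simply re-applies the default of 140 seconds):

# Show the current APD handling settings
esxcli system settings advanced list -o /Misc/APDHandlingEnable
esxcli system settings advanced list -o /Misc/APDTimeout

# Adjust the timeout if needed (value in seconds; 140 is the default)
esxcli system settings advanced set -o /Misc/APDTimeout -i 140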

Once a datastore is in APD, VM I/O continues to retry while management operations such as browsing the datastore, mounting ISOs, or snapshot consolidation can start to fail quickly. From a vSphere client perspective the datastore may appear dimmed or inaccessible, and VMs can look hung if they rely heavily on that datastore.

How APD Shows Up in ESXi Logs

When an APD event occurs, vmkernel and vobd are the primary places to look. On recent ESXi versions the logs are typically under /var/run/log/, though many environments still collect from /var/log/vmkernel.log and /var/log/vobd.log.

The lifecycle of a single APD usually looks like this in vobd.log:

APD start, for example:
[APDCorrelator] ... [vob.storage.apd.start] Device or filesystem with identifier [8a5a1336-3d574c6d] has entered the All Paths Down state.

APD timeout, after about 140 seconds:
[APDCorrelator] ... [esx.problem.storage.apd.timeout] Device or filesystem with identifier [8a5a1336-3d574c6d] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.

APD exit when the host finally sees the storage again:
[APDCorrelator] ... [esx.problem.storage.apd.recovered] Device or filesystem with identifier [8a5a1336-3d574c6d] has exited the All Paths Down state.

In vmkernel.log the same event is reflected by the APD handler messages. A typical sequence is:

StorageApdHandler: 248: APD Timer started for ident [8a5a1336-3d574c6d]
StorageApdHandler: 846: APD Start for ident [8a5a1336-3d574c6d]!
StorageApdHandler: 902: APD Exit for ident [8a5a1336-3d574c6d]

On NFS datastores you usually see loss‑of‑connectivity messages around the same time, such as warnings from the NFS client that it has lost connection to the server or that latency has spiked:

WARNING: NFS: NFSVolumeLatencyUpdate: NFS volume <datastore> performance has deteriorated. I/O latency increased ... Exceeded threshold 10000(us)

These messages are often the early warning before APD actually triggers.

NFSv4‑Specific Errors in vmkernel and vobd

With NFSv4.1 the client maintains stateful sessions and slot tables, so problems are not always just simple timeouts. ESXi may log warnings such as NFS41SessionSlotUnassign when the available session slots drop too low; when this happens under heavy load it can lead to session resets and eventually to APD on that datastore if the session cannot be re‑established cleanly.

Another category of issue is NFSv4 errors such as NFS4ERR_SHARE_DENIED, which show up when an OPEN call conflicts with an existing share reservation on the same file. These errors do not in themselves mean APD, but they often appear in the same time window when applications are competing for locks or when the NFS server is under stress and struggling with state management; on the ESXi side the end result can be perceived as I/O hangs.

When reviewing logs, it is useful to separate pure connectivity problems (socket resets, RPC timeouts) from v4‑specific state problems (session slot issues, share or lock errors). The former almost always have a clear APD signature in vobd; the latter may manifest as intermittent stalls or file‑level errors without a full datastore APD.

What to Look For on the NFS Server

Once you have the APD start and exit timestamps from ESXi, the next step is to line those up with the storage array or NFS server logs. On an ONTAP‑style array, for example, APD windows on the ESXi side often correspond to connection reset entries such as:

kernel: Nblade.nfsConnResetAndClose:error]: Shutting down connection with the client ... network data protocol is NFS ... client IP address:port is x.x.x.x:yyyy ... reason is CSM error - Maximum number of rewind attempts has been exceeded

This type of message indicates that the NFS server terminated the TCP session to the ESXi host, typically due to internal error handling or congestion. If the server is busy or recovering from a failover, there might also be log lines for node failover, LIF migration, or high latency on the backend disks at the same time.

On general Linux NFS servers, the relevant information is usually in /var/log/messages or /var/log/syslog. Around the APD time you want to see whether there were RPC timeouts, transport errors, NIC resets, or NFS service restarts for the host IP that corresponds to the ESXi VMkernel interface. If the issue is configuration‑related (for example, export rules suddenly not matching, Kerberos failures, or NFSv4 grace periods), that also tends to show clearly in these logs.
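On such a server, a quick way to pull the relevant entries around the APD window is to filter the logs for NFS and RPC activity; log paths and service unit names vary by distribution, so treat the following as a sketch:

# Classic syslog locations (path depends on the distribution)
grep -Ei 'nfsd|rpc|mountd|lockd' /var/log/messages /var/log/syslog 2>/dev/null

# systemd-based servers: scope to the NFS service around the APD window
journalctl -u nfs-server --since '2024-05-01 10:00' --until '2024-05-01 10:10'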

Other platforms show similar patterns. Hyperconverged solutions may log controller failovers or filesystem service restarts in their own management logs at the same timestamps that ESXi reports APD. In many documented cases, APD is ultimately traced to a short loss of network connectivity or to the NFS service being restarted while ESXi still has active sessions.

Practical Troubleshooting Workflow

In practice, troubleshooting an NFS APD usually starts with a simple question: did all hosts and datastores see APD at the same time, or was the event limited to a single host, a single datastore, or a subset of the fabric? A single host and one datastore tends to point to a host‑side or network issue, such as a NIC problem or VLAN mis‑tag; simultaneous APDs across multiple hosts and the same datastore are more likely to be array‑side or network‑core events.

From the ESXi side, the first task is to build a clear timeline. Grab the vobd and vmkernel logs, extract all the vob.storage.apd messages, and list for each device or filesystem identifier when APD started, whether it hit the 140‑second timeout, and when it exited. Once you have the APD window, you can overlay any NFS warnings, networking errors, or TCP issues that appear in vmkernel around those times. This timeline is often more useful than individual error messages because it tells you exactly how long the host was blind to the datastore.
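A minimal way to extract that timeline from the ESXi shell, using the log locations mentioned earlier:

# Pull the APD lifecycle events from vobd
grep -i 'apd' /var/run/log/vobd.log

# Correlate with the APD handler and NFS client messages in vmkernel
grep -Ei 'StorageApdHandler|NFS' /var/run/log/vmkernel.log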

In parallel, check the current state of the environment. On an affected host, esxcli storage filesystem list will confirm whether the NFS datastore is still mounted, in an inaccessible state, or has recovered. If the datastore is still visible but VMs are sluggish, look for ongoing NFS latency messages or packet‑loss symptoms; if the datastore has disappeared entirely from the host view, then the focus shifts more to export definitions, DNS, routing, and the NFS service itself.
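The same check from the command line covers both NFSv3 and NFSv4.1 mounts:

# Show mounted filesystems and whether they are accessible
esxcli storage filesystem list

# List NFSv3 and NFSv4.1 datastore mounts specifically
esxcli storage nfs list
esxcli storage nfs41 list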

Once the ESXi view is clear, move to the NFS server and the switching infrastructure. Using the APD timestamps, review array or server logs for connection resets, session drops, failovers, or heavy latency. If, for instance, the array log shows that the connection from the ESXi IP was reset because of a TCP or congestion issue exactly at the APD start time, the root cause is probably somewhere between that controller and the host. In environments where network packet loss triggers slow‑start behavior and repeated retransmissions, the effective throughput can collapse to the point that ESXi perceives it as an APD even though the interface never technically goes down.

A common outcome of this analysis is that the real problem is either a transient network issue (link flap, misconfigured MTU, queue drops) or a storage‑side transient (controller failover, NFS daemon restart). Addressing that underlying cause usually prevents further APDs. If the APD condition persists or if the host has been stuck in APD for an extended period, many vendors recommend a controlled reboot of affected ESXi hosts after the storage problem has been resolved, to clear any stale device state and residual APD references.

PowerCLI Scripts for VMware Daily Administration and Reporting

Prerequisites

Before running these scripts, ensure you have the VMware PowerCLI module installed and are connected to your vCenter Server. You can connect by running the following command in your PowerShell terminal: Connect-VIServer -Server Your-vCenter-Server-Address

Script 1: General VM Inventory Report

This script gathers essential information about all virtual machines in your environment and exports it to a CSV file for easy analysis.

# Description: Exports a detailed report of all VMs to a CSV file.
# Usage: Run the script after connecting to vCenter.
Get-VM |
    Select-Object Name, PowerState, NumCpu, MemoryGB, UsedSpaceGB, ProvisionedSpaceGB,
        @{N='Datastore';E={[string]::Join(',', (Get-Datastore -Id $_.DatastoreIdList))}},
        @{N='ESXiHost';E={$_.VMHost.Name}},
        @{N='ToolsStatus';E={$_.ExtensionData.Guest.ToolsStatus}} |
    Export-Csv -Path .\VM_Inventory_Report.csv -NoTypeInformation

Write-Host 'VM Inventory Report has been generated: VM_Inventory_Report.csv'


Script 2: VM Performance Report (CPU & Memory)

This script checks the average CPU and memory usage for all powered-on VMs over the last 24 hours and exports any that exceed a defined threshold (e.g., 80%).

# Description: Identifies VMs with high CPU or Memory usage over the last day.
# Usage: Adjust the $threshold variable as needed.
$threshold = 80   # CPU/Memory usage percentage threshold

$vms = Get-VM | Where-Object { $_.PowerState -eq 'PoweredOn' }
$report = @()

foreach ($vm in $vms) {
    $stats = Get-Stat -Entity $vm -Stat cpu.usagemhz.average, mem.usage.average -Start (Get-Date).AddDays(-1) -IntervalMins 5
    $avgCpu = ($stats | Where-Object MetricId -eq 'cpu.usagemhz.average' | Measure-Object -Property Value -Average).Average
    $avgMem = ($stats | Where-Object MetricId -eq 'mem.usage.average'    | Measure-Object -Property Value -Average).Average

    if ($avgCpu -and $avgMem) {
        # cpu.usagemhz.average is in MHz; compare it to the VM's entitlement (vCPUs x per-core MHz of the host)
        $coreMhz         = $vm.VMHost.CpuTotalMhz / $vm.VMHost.NumCpu
        $cpuUsagePercent = [math]::Round(($avgCpu / ($vm.NumCpu * $coreMhz)) * 100, 2)
        # mem.usage.average is already reported as a percentage
        $memUsagePercent = [math]::Round($avgMem, 2)

        if ($cpuUsagePercent -gt $threshold -or $memUsagePercent -gt $threshold) {
            $report += [PSCustomObject]@{
                VMName            = $vm.Name
                AvgCPUUsagePct    = $cpuUsagePercent
                AvgMemoryUsagePct = $memUsagePercent
            }
        }
    }
}

$report | Export-Csv -Path .\VM_High_Performance_Report.csv -NoTypeInformation
Write-Host 'High Performance Report has been generated: VM_High_Performance_Report.csv'


Script 3: ESXi Host Compute Resources Left

This script reports on the available CPU and Memory resources for each ESXi host in your cluster, helping you plan for capacity.

# Description: Reports the remaining compute resources on each ESXi host.
# Usage: Run the script to get a quick overview of host capacity.
Get-VMHost | Select-Object Name,
    @{N='CpuUsageMHz';E={$_.CpuUsageMhz}},
    @{N='CpuTotalMHz';E={$_.CpuTotalMhz}},
    @{N='CpuAvailableMHz';E={$_.CpuTotalMhz - $_.CpuUsageMhz}},
    @{N='MemoryUsageGB';E={[math]::Round($_.MemoryUsageGB, 2)}},
    @{N='MemoryTotalGB';E={[math]::Round($_.MemoryTotalGB, 2)}},
    @{N='MemoryAvailableGB';E={[math]::Round($_.MemoryTotalGB - $_.MemoryUsageGB, 2)}} | Format-Table


Script 4: Report on Powered-Off VMs

This simple script quickly lists all virtual machines that are currently in a powered-off state.

# Description: Lists all VMs that are currently powered off.
# Usage: Run the script to find unused or decommissioned VMs.
Get-VM | Where-Object { $_.PowerState -eq 'PoweredOff' } |
    Select-Object Name, VMHost, @{N='LastModified';E={$_.ExtensionData.Config.Modified}} |
    Export-Csv -Path .\Powered_Off_VMs.csv -NoTypeInformation

Write-Host 'Powered Off VMs report has been generated: Powered_Off_VMs.csv'


Script 5: Audit Who Powered Off a VM

This script searches the vCenter event logs from the last 7 days to find who initiated a ‘power off’ task on a specific VM.

# Description: Finds the user who powered off a specific VM within the last week.
# Usage: Replace 'Your-VM-Name' with the actual name of the target VM.
$vmName = 'Your-VM-Name'
$vm = Get-VM -Name $vmName

Get-VIEvent -Entity $vm -MaxSamples ([int]::MaxValue) -Start (Get-Date).AddDays(-7) |
    Where-Object { $_.GetType().Name -eq 'VmPoweredOffEvent' } |
    Select-Object CreatedTime, UserName, FullFormattedMessage |
    Format-List


Script 6: Check for ESXi Host Crashes or Disconnections

This script checks for ESXi host disconnection or error events in vCenter over the past 30 days, which can indicate a host crash (Purple Screen of Death, PSOD) or a network issue.

# Description: Searches for host disconnection or error events in the last 30 days.
# Usage: Run this to investigate potential host stability issues.
Get-VIEvent -MaxSamples ([int]::MaxValue) -Start (Get-Date).AddDays(-30) |
    Where-Object { $_.GetType().Name -in ('HostCnxFailedEvent', 'HostDisconnectedEvent', 'HostEsxGenericPanicEvent', 'EnteredMaintenanceModeEvent') } |
    Select-Object CreatedTime, @{N='HostName';E={$_.Host.Name}}, FullFormattedMessage |
    Sort-Object CreatedTime -Descending |
    Export-Csv -Path .\Host_Crash_Events.csv -NoTypeInformation

Write-Host 'Host crash/disconnection event report has been generated: Host_Crash_Events.csv'


Best Practice Guide: Kubernetes and NAS on VMware

This guide provides a detailed, step-by-step approach to designing and implementing a robust Kubernetes environment that utilizes Network Attached Storage (NAS) on a VMware vSphere platform. Following these best practices will ensure a scalable, resilient, and performant architecture.

Core Design Principles

Separation of Concerns: Keep your storage (NAS), compute (VMware), and orchestration (Kubernetes) layers distinct but well-integrated. This simplifies management and troubleshooting.

Leverage the CSI Standard: Always use a Container Storage Interface (CSI) driver for integrating storage. This is the Kubernetes-native way to connect to storage systems and is vendor-agnostic.

Network Performance is Key: The network is the backbone connecting your K8s nodes (VMs) to the NAS. Dedicate sufficient bandwidth and low latency links for storage traffic.

High Availability (HA): Design for failure. This includes using a resilient NAS appliance, VMware HA for your K8s node VMs, and appropriate Kubernetes deployment strategies.

Granular Access Control: Implement strict permissions on your NAS exports and use Kubernetes Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) to manage access.

Step-by-Step Implementation Guide

Here is a detailed workflow for setting up your environment from the ground up.

1. VMware Environment Preparation

ESXi Hosts & vCenter: Ensure you are running a supported version of vSphere. Configure DRS and HA clusters for automatic load balancing and failover of your Kubernetes node VMs.

Virtual Machine Templates: Create a standardized VM template for your Kubernetes control plane and worker nodes. Use a lightweight, cloud-native OS like Ubuntu Server or Photon OS.

Networking: Create a dedicated vSwitch or Port Group for NAS storage traffic. This isolates storage I/O from other network traffic (management, pod-to-pod) and improves security and performance. Use Jumbo Frames (MTU 9000) on this network if your NAS and physical switches support it.
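If you enable jumbo frames, the MTU must match end to end on the vSwitch, the VMkernel interface, the physical switches, and the NAS. A quick sketch from the ESXi shell, with placeholder vSwitch, VMkernel, and NAS names:

# Set MTU 9000 on the standard vSwitch carrying NAS traffic (vSwitch1 is a placeholder)
esxcli network vswitch standard set -v vSwitch1 -m 9000

# Set MTU 9000 on the VMkernel interface used for NFS (vmk2 is a placeholder)
esxcli network ip interface set -i vmk2 -m 9000

# Validate end to end with a non-fragmenting ping (8972 = 9000 minus IP/ICMP headers)
vmkping -d -s 8972 <nas-ip>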

2. NAS Storage Preparation (NFS Example)

Create NFS Exports: On your NAS appliance, create dedicated NFS shares that will be used by Kubernetes. It’s better to have multiple smaller shares for different applications or teams than one monolithic share.

Set Permissions: Configure export policies to only allow access from the IP addresses of your Kubernetes worker nodes. Set `no_root_squash` if your containers require running as root, but be aware of the security implications.

Optimize for Performance: Enable NFSv4.1 or higher for better performance and features like session trunking. Ensure your NAS has sufficient IOPS capability for your workloads.
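As an illustration, on a generic Linux NFS server an export scoped to the worker-node storage subnet might look like the following in /etc/exports; the subnet and options are assumptions to adapt to your environment:

# /etc/exports -- restrict the share to the Kubernetes worker storage subnet
/exports/kubernetes 192.168.50.0/24(rw,sync,no_subtree_check)

# Apply the change
exportfs -ra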

3. Kubernetes Cluster Deployment

Provision VMs: Deploy your control plane and worker nodes from the template created in Step 1.

Install Kubernetes: Use a standard tool like `kubeadm` to bootstrap your cluster. Alternatively, leverage a VMware-native solution like VMware Tanzu for deeper integration.

Install CSI Driver: This is the most critical step for storage integration. Deploy the appropriate CSI driver for your NAS. For a generic NFS server, you can use the open-source NFS CSI driver. You typically install it using Helm or by applying its YAML manifests.
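A typical installation of the open-source NFS CSI driver uses Helm. The repository URL and chart name below follow the csi-driver-nfs project's documentation, so verify them against the release you deploy:

helm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm repo update
helm install csi-driver-nfs csi-driver-nfs/csi-driver-nfs --namespace kube-system

# Confirm the controller and node pods are running
kubectl -n kube-system get pods | grep csi-nfs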

4. Integrating and Using NAS Storage

Create a StorageClass: A StorageClass tells Kubernetes how to provision storage. You will create one that uses the NFS CSI driver. This allows developers to request storage dynamically without needing to know the underlying NAS details. Example StorageClass YAML:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: 192.168.10.100
  share: /exports/kubernetes
mountOptions:
  - "nfsvers=4.1"
reclaimPolicy: Retain
volumeBindingMode: Immediate

Request Storage with a PVC: Developers request storage by creating a PersistentVolumeClaim (PVC) that references the StorageClass. Example PVC YAML:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 10Gi

Mount the Volume in a Pod: Finally, mount the PVC as a volume in your application’s Pod definition. Example Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: my-nginx-pod
spec:
  containers:
  - name: nginx
    image: nginx:latest
    volumeMounts:
    - name: data-volume
      mountPath: /usr/share/nginx/html
  volumes:
  - name: data-volume
    persistentVolumeClaim:
      claimName: my-app-data

Important Dos and Don’ts

Do: use a CSI driver for dynamic provisioning. It automates PV creation and simplifies management.
Don’t: use static PV definitions or direct hostPath mounts to the NAS. This is brittle and not scalable.

Do: isolate NAS traffic on a dedicated VLAN and vSwitch/Port Group for security and performance.
Don’t: mix storage traffic with management or pod-to-pod traffic on the same network interface.

Do: use the `ReadWriteMany` (RWX) access mode for NFS to share a volume across multiple pods.
Don’t: assume all storage supports RWX. Block storage (iSCSI/FC) typically only supports `ReadWriteOnce` (RWO).

Do: implement a backup strategy for your persistent data on the NAS using snapshots or other backup tools.
Don’t: assume Kubernetes handles data backups. It only manages the volume lifecycle.

Do: monitor storage latency and IOPS from both the VMware and NAS side to identify bottlenecks.
Don’t: ignore storage performance until applications start failing.

Design Example: Web Application with a Shared Uploads Folder

Scenario: A cluster of web server pods that need to read and write to a common directory for user-uploaded content.

VMware Setup: A 3-node Kubernetes cluster (1 control-plane, 2 workers) running as VMs in a vSphere HA cluster. A dedicated “NAS-Traffic” Port Group is configured for a second vNIC on each worker VM.

NAS Setup: A NAS appliance provides an NFSv4 share at `192.168.50.20:/mnt/k8s_uploads`. The export policy is restricted to the IPs of the worker nodes on the NAS traffic network.

Kubernetes Setup:

The NFS CSI driver is installed in the cluster.

A `StorageClass` named `shared-uploads` is created, pointing to the NFS share.

A `PersistentVolumeClaim` named `uploads-pvc` requests 50Gi of storage using the `shared-uploads` StorageClass with `ReadWriteMany` access mode.

The web application’s `Deployment` is configured to mount `uploads-pvc` at the path `/var/www/html/uploads`.

Any of the web server pods can write a file to the uploads directory, and all other pods can immediately see and serve that file, because they are all connected to the same underlying NFS share. If a worker VM fails, VMware HA restarts it on another host, and Kubernetes reschedules the pod, which then re-attaches to its storage seamlessly.
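A minimal sketch of the claim and mount described above; the names and sizes follow the scenario, and the web server image is a placeholder:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uploads-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: shared-uploads
  resources:
    requests:
      storage: 50Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: web
        image: httpd:2.4   # placeholder web server image
        volumeMounts:
        - name: uploads
          mountPath: /var/www/html/uploads
      volumes:
      - name: uploads
        persistentVolumeClaim:
          claimName: uploads-pvc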

Deploying Time-Sensitive Applications on Kubernetes in VMware

Deploying time-sensitive applications, such as those in telecommunications (vRAN), high-frequency trading, or real-time data processing, on Kubernetes within a VMware vSphere environment requires careful configuration at both the hypervisor and Kubernetes levels. The goal is to minimize latency and jitter by providing dedicated resources and precise time synchronization.

Prerequisites: VMware vSphere Configuration

Before deploying pods in Kubernetes, the underlying virtual machine (worker node) and ESXi host must be properly configured. These settings reduce virtualization overhead and improve performance predictability.

Precision Time Protocol (PTP): Configure the ESXi host to use a PTP time source. This allows virtual machines to synchronize their clocks with high accuracy, which is critical for applications that depend on precise time-stamping and event ordering.

Latency Sensitivity: In the VM’s settings (VM Options -> Advanced -> Latency Sensitivity), set the value to High. This instructs the vSphere scheduler to reserve physical CPU and memory, minimizing scheduling delays and preemption.

CPU and Memory Reservations: Set a 100% reservation for both CPU and Memory for the worker node VM. This ensures that the resources are always available and not contended by other VMs.

Key Kubernetes Concepts

Kubernetes provides several features to control resource allocation and pod placement, which are essential for time-sensitive workloads.

Quality of Service (QoS) Classes: Kubernetes assigns pods to one of three QoS classes. For time-sensitive applications, the Guaranteed class is essential. A pod is given this class if every container in it has both a memory and CPU request and limit, and they are equal.

CPU Manager Policy: The kubelet’s CPU Manager can be configured with a ‘static’ policy, which gives pods in the Guaranteed QoS class that request whole (integer) CPUs exclusive access to those CPUs on the node (see the kubelet configuration sketch after this list).

HugePages: Using HugePages can improve performance by reducing the overhead associated with memory management (TLB misses).
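A sketch of the node-level settings these features rely on, assuming you manage the kubelet configuration file directly; Tanzu-managed clusters and other distributions expose these options differently:

# /var/lib/kubelet/config.yaml (excerpt) -- illustrative values
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static        # exclusive CPU assignment for Guaranteed pods with integer CPU requests
reservedSystemCPUs: "0,1"       # keep some cores for the OS and kubelet
# HugePages must be pre-allocated on the node, for example via sysctl:
#   vm.nr_hugepages = 1024      # 1024 x 2Mi = 2Gi of 2Mi HugePages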

Example 1: Basic Deployment with Guaranteed QoS

This example demonstrates how to create a simple Pod that qualifies for the ‘Guaranteed’ QoS class. This is the first step towards ensuring predictable performance.

apiVersion: v1
kind: Pod
metadata:
  name: low-latency-app
spec:
  containers:
  - name: my-app-container
    image: my-real-time-app:latest
    resources:
      requests:
        memory: "2Gi"
        cpu: "2"
      limits:
        memory: "2Gi"
        cpu: "2"

In this manifest, the CPU and memory requests are identical to their limits, ensuring the pod is placed in the Guaranteed QoS class.

Example 2: Advanced Deployment with CPU Pinning and HugePages

This example builds on the previous one by requesting exclusive CPUs and using HugePages. This configuration is suitable for high-performance applications that require dedicated CPU cores and efficient memory access. Note: This requires the node’s CPU Manager policy to be set to ‘static’ and for HugePages to be pre-allocated on the worker node.

apiVersion: v1
kind: Pod
metadata:
  name: high-performance-app
spec:
  containers:
  - name: my-hpc-container
    image: my-hpc-app:latest
    resources:
      requests:
        memory: "4Gi"
        cpu: "4"
        hugepages-2Mi: "2Gi"
      limits:
        memory: "4Gi"
        cpu: "4"
        hugepages-2Mi: "2Gi"
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage-volume
  volumes:
  - name: hugepage-volume
    emptyDir:
      medium: HugePages

This pod requests four dedicated CPU cores and 2Gi of 2-megabyte HugePages, providing a highly stable and low-latency execution environment.

Summary

Successfully deploying time-sensitive applications on Kubernetes in VMware is a multi-layered process. It starts with proper ESXi host and VM configuration to minimize virtualization overhead and concludes with specific Kubernetes pod specifications to guarantee resource allocation and scheduling priority. By combining these techniques, you can build a robust platform for your most demanding workloads.

Collecting Performance Metrics from vCenter When Investigating a Performance Issue

# Connect to vCenter (update credentials as needed)
Connect-VIServer -Server "your-vcenter-server"

# Define time window for stats (last 30 minutes)
$start = (Get-Date).AddMinutes(-30)
$end   = Get-Date

# Filter VDI VMs (update identification logic as appropriate)
$vdiVMs = Get-VM | Where-Object { $_.Name -like "VDI*" -and $_.PowerState -eq "PoweredOn" }

# Collect performance data
$results = foreach ($vm in $vdiVMs) {
    # Gather stats in batch for efficiency; verify metric IDs and units in your environment (see Notes below)
    $stats = Get-Stat -Entity $vm -Stat @(
        "cpu.ready.summation",
        "mem.latency.average",
        "disk.totalLatency.average",
        "disk.read.average",
        "disk.write.average"
    ) -Start $start -Finish $end

    # Extract and average metrics; protect against missing data
    $cpuReady   = $stats | Where-Object { $_.MetricId -eq "cpu.ready.summation" }
    $memLatency = $stats | Where-Object { $_.MetricId -eq "mem.latency.average" }
    $diskLat    = $stats | Where-Object { $_.MetricId -eq "disk.totalLatency.average" }
    $readIOPS   = $stats | Where-Object { $_.MetricId -eq "disk.read.average" }
    $writeIOPS  = $stats | Where-Object { $_.MetricId -eq "disk.write.average" }

    [PSCustomObject]@{
        VMName        = $vm.Name
        CPUReadyMS    = if ($cpuReady)   { ($cpuReady | Measure-Object -Property Value -Average).Average / 1000 } else { $null }
        MemLatencyMS  = if ($memLatency) { ($memLatency | Measure-Object -Property Value -Average).Average }     else { $null }
        DiskLatencyMS = if ($diskLat)    { ($diskLat | Measure-Object -Property Value -Average).Average }        else { $null }
        ReadIOPS      = if ($readIOPS)   { ($readIOPS | Measure-Object -Property Value -Average).Average }       else { $null }
        WriteIOPS     = if ($writeIOPS)  { ($writeIOPS | Measure-Object -Property Value -Average).Average }      else { $null }
        TotalIOPS     = (
            ((($readIOPS  | Measure-Object -Property Value -Sum).Sum) +
             (($writeIOPS | Measure-Object -Property Value -Sum).Sum))
        )
    }
}

# Display top 10 VMs by disk latency, show table and export to CSV
$timestamp = Get-Date -Format "yyyyMMdd_HHmmss"
$top10 = $results | Sort-Object -Property DiskLatencyMS -Descending | Select-Object -First 10
$top10 | Format-Table -AutoSize
$results | Export-Csv -Path "VDI_VM_Perf_Report_$timestamp.csv" -NoTypeInformation

Notes:

– For environments with large VM counts, consider running data collection in parallel using Start-Job/Runspaces.

– Always verify metric names using Get-Stat -IntervalMins 5 -MaxSamples 1 -Entity (Get-VM | Select-Object -First 1).

– Add additional VM filters (folders/tags) for more targeted results.

Mastering VMware Cloud Foundation: A Step-by-Step Guide with vSAN and NSX


In today’s dynamic IT landscape, building a robust, agile, and secure private cloud infrastructure is paramount. VMware Cloud Foundation (VCF) offers a comprehensive solution, integrating compute (vSphere), storage (vSAN), networking (NSX), and cloud management (vRealize Suite/Aria Suite) into a single, automated platform. This guide will walk you through the essential steps of deploying and managing VCF, focusing on the powerful synergy of vSAN for storage and NSX for network virtualization.

VCF streamlines the deployment and lifecycle management of your Software-Defined Data Center (SDDC), ensuring consistency and efficiency from day zero to day two operations and beyond.

Step-by-Step Guide to Use VCF with vSAN and NSX

1. Pre-Deployment Preparation

A successful VCF deployment begins with meticulous planning and preparation. Ensuring all prerequisites are met will save significant time and effort during the actual bring-up process.

  • Hardware Requirements: Ensure compatible hardware nodes (VMware vSAN Ready Nodes are highly recommended for optimal performance and support). Verify HCL (Hardware Compatibility List) compliance.
  • Network: Prepare dedicated VLANs for management, vSAN, vMotion, and NSX overlays (Geneve). Assign appropriate IP ranges for each. Make sure DNS (forward and reverse records), NTP (Network Time Protocol), and gateway configurations are meticulously planned and ready. Proper MTU (Jumbo Frames, typically 9000) configuration for vSAN and NSX overlay networks is crucial for performance; MTU and DNS can be validated with the commands shown after this list.
  • Licenses: Secure the necessary VMware Cloud Foundation license, VMware NSX license, and VMware vSAN license. Ensure these licenses are valid and ready for input during deployment.
  • vSphere Environment: Decide on an existing vCenter Server for the Cloud Builder deployment or prepare for a fresh set of ESXi hosts for the management and subsequent workload domains.
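Two of the checklist items above, MTU and name resolution, are worth validating from an ESXi host before starting bring-up; the interface names, hostnames, and addresses below are placeholders:

# Validate jumbo frames end to end on the vSAN VMkernel interface (vmk1 is a placeholder)
vmkping -I vmk1 -d -s 8972 <peer-vsan-ip>

# Confirm forward and reverse DNS for every component
nslookup sddc-manager.example.local
nslookup 10.0.0.50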

2. Deploy VMware Cloud Builder Appliance

The Cloud Builder appliance is the orchestrator for the VCF deployment, simplifying the entire bring-up process.

  • Download the Cloud Builder OVA from VMware Customer Connect (login required).
  • Deploy the OVA to an ESXi host or an existing vCenter Server environment.
  • Configure basic network settings (IP, DNS, gateway, NTP) for the Cloud Builder appliance.
  • Power on the appliance and log in to the Cloud Builder UI via a web browser.

3. Prepare the JSON Configuration File

The JSON configuration file is the blueprint for your VCF deployment, containing all the specifics of your SDDC design.

  • Create or download a JSON file template. This file will specify critical details like cluster names, network pools, IP ranges, ESXi host details, and domain information.
  • Include:
    • Management domain and workload domain details (if applicable).
    • Network segment names for vSAN, vMotion, NSX overlay, and Edge nodes.
    • Licensing information for all required VMware products.
    • Host profiles and resource pools where applicable.
    • User credentials for various components.

4. Start the VCF Bring-Up Process

With the configuration ready, initiate the automated deployment through Cloud Builder.

  • Upload the meticulously prepared JSON configuration file in the Cloud Builder UI.
  • Run the pre-checks and validation steps to ensure network connectivity, naming conventions, and host readiness. This step is crucial for identifying and resolving issues before deployment.
  • Start the deployment via Cloud Builder, which will orchestrate the following:
    • Deploy the management domain vCenter Server Appliance.
    • Deploy the SDDC Manager appliance, which serves as the central management console for VCF.
    • Deploy the NSX-T Manager cluster.
    • Configure NSX overlay and transport zones on the management domain hosts.
    • Prepare and enable the vSAN cluster on the ESXi hosts designated for the management domain.

5. Configure NSX in VCF

NSX-T is deeply integrated into VCF, providing robust network virtualization and security.

  • The NSX-T Manager cluster is automatically deployed in the management domain as part of the VCF bring-up.
  • Set up Transport Zones for VLAN-backed networks and Overlay (Geneve) networks.
  • Create Uplink profiles and assign them to hosts for NSX-T network connectivity.
  • Configure Tier-0 and Tier-1 routers for north-south (external) and east-west (internal) traffic routing, respectively.
  • Set up routing protocols (BGP or static routing) for Edge clusters to ensure proper external connectivity.
  • Set up firewall rules and security policies (Distributed Firewall) as needed to enforce micro-segmentation.

6. vSAN Configuration

vSAN provides the hyper-converged storage layer, fully integrated with vSphere and managed through VCF.

  • vSAN is enabled and configured automatically on the management and workload clusters during their creation.
  • Ensure disk groups are properly formed with dedicated cache devices (SSD/NVMe) and capacity devices (SSD/HDD).
  • Enable vSAN services like deduplication, compression, and fault domains if required, based on your performance and capacity needs.
  • Configure vSAN network traffic to use dedicated VMkernel ports with proper MTU (typically 9000 for jumbo frames) for optimal performance.
  • Monitor vSAN health and performance regularly in vCenter under the vSAN cluster settings.

7. Create Workload Domains

Workload domains are logical constructs that encapsulate compute, storage, and network resources for specific applications or departments.

  • Through the SDDC Manager UI, create additional workload domains if needed, separate from the management domain.
  • Assign available ESXi hosts to these new domains and specify vSAN or other storage options.
  • SDDC Manager will deploy dedicated vCenter Servers for these workload domains.
  • NSX is automatically integrated with these newly created workload domains for network virtualization and security.

8. Post-Deployment Tasks

After the core VCF deployment, several crucial post-deployment tasks refine your SDDC for production use.

  • Create Edge Clusters by deploying additional NSX Edge appliances. These are essential for north-south routing, NAT, VPN, and load balancing services.
  • Configure external routing and failover mechanisms for Edge clusters to ensure high availability for external connectivity.
  • Set up VMware Aria (formerly vRealize) Suite products like Aria Operations (for monitoring) and Aria Automation (for orchestration) for comprehensive management.
  • Enable Tanzu Kubernetes Grid (TKG) for container workloads, leveraging the integrated NSX and vSAN capabilities.
  • Perform initial lifecycle management and update automation via SDDC Manager to ensure your VCF stack is up-to-date and secure.

Note: The lifecycle management capabilities of VCF through SDDC Manager are a cornerstone feature, simplifying upgrades and patching across vSphere, vSAN, and NSX.

Summary Table of Core Components in VCF with vSAN and NSX

Pre-Deployment: Hardware readiness, VLANs, DNS, NTP, Licensing
Deploy Cloud Builder: Deploy OVA, configure network, prepare JSON config
Bring-up Process: vCenter, SDDC Manager, NSX-T Manager, vSAN cluster setup
NSX-T Configuration: Transport zones, Uplink profiles, Tier-0/1 gateways
vSAN Configuration: Disk groups, deduplication/compression, fault domains
Create Workload Domains: ESXi cluster creation, vCenter deployment, workload NSX integration
Post-Deployment: Edge clusters, routing, VMware Aria, Tanzu Kubernetes Grid

Post-Deployment Management of VCF with vSAN and NSX

After the successful deployment, ongoing management, monitoring, and optimization are crucial for maintaining a healthy and efficient VCF environment.

1. Monitoring and Health Checks

Proactive monitoring is key to preventing issues and ensuring optimal performance.

  • vCenter and SDDC Manager Dashboards: Regularly check the health status of clusters, hosts, vSAN, NSX, and workload domains through the vCenter UI and SDDC Manager. Utilize built-in alerts and dashboards to track anomalies and performance metrics.
  • vSAN Health Service: Continuously monitor hardware health, disk group status, capacity utilization, network health, and data services (deduplication, compression). Address any warnings or errors immediately.
  • NSX Manager and Controllers: Monitor NSX components’ status, including the Controller cluster, Edge nodes, and control plane communication. Use the extensive troubleshooting tools within NSX Manager to verify overlay networks and routing health.
  • Logs and Event Monitoring: Collect logs from vCenter, ESXi hosts, NSX Manager, and SDDC Manager. Integrate with VMware Aria Operations or third-party SIEM tools for centralized log analytics and faster issue resolution.

2. Routine Tasks

Regular maintenance ensures the long-term stability and security of your VCF infrastructure.

  • Patch and Update Lifecycle Management: Leverage SDDC Manager’s automated capabilities to manage patches and upgrades for the entire solution stack – vSphere, vSAN, NSX, and VCF components. Always follow the recommended upgrade sequence from VMware.
  • Capacity Management: Regularly track CPU, memory, and storage usage across management and workload domains to predict future needs, plan expansions, or rebalance workloads effectively.
  • Backup and Disaster Recovery: Implement a robust backup solution for vCenter, NSX, and SDDC Manager configurations. Consider native vSAN data protection features or integrate with third-party DR solutions to protect VMs and storage metadata.
  • User Access and Security: Manage roles and permissions diligently via vCenter and NSX RBAC (Role-Based Access Control). Regularly review user access and conduct audits for compliance.

Troubleshooting Best Practices

Effective troubleshooting requires understanding the interconnected components of VCF.

vSAN Troubleshooting

  • Common Issues: Be aware of issues like faulty disks, network partitioning, degraded disk groups, and bad capacity devices.
  • Diagnostic Tools: Utilize vSAN Health Service, esxcli vsan commands, and RVC (Ruby vSphere Console) for detailed diagnostics and troubleshooting. Example esxcli commands are shown after this list.
  • Network Troubleshooting: Validate MTU sizes (jumbo frames enabled on vSAN VMkernel interfaces), and verify multicast routing where applicable (for older vSAN versions or specific configurations).
  • Capacity and Performance: Check for congestion or latency spikes; monitor latency at physical disk, cache, and network layers using vSAN performance metrics.
  • Automated Remediation: Leverage automated tools in vSAN and collect VMware support bundles for efficient log collection when engaging support.
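A few of the host-side diagnostics referenced above, runnable from the ESXi shell:

# Cluster membership and health from the host's perspective
esxcli vsan cluster get
esxcli vsan health cluster list

# Disk group and vSAN network configuration
esxcli vsan storage list
esxcli vsan network list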

NSX Troubleshooting

  • Overlay and Tunnels: Check Geneve (or VXLAN for older deployments) tunnels between hosts and Edge nodes via NSX Manager monitoring. Verify host preparation status and successful VIB installation.
  • Routing Issues: Review Tier-0/Tier-1 router configurations, BGP or static routing neighbors, and route propagation status.
  • Firewall and Security Policies: Confirm that firewall rules are neither overly restrictive nor missing necessary exceptions, ensuring proper traffic flow.
  • Edge Node Health: Monitor for CPU/memory overload on Edge appliances; restart services if necessary.
  • Connectivity Testing: Use NSX CLI commands and common network tests (ping, traceroute, netcat) within NSX environments to verify connectivity.
  • NSX Logs: Collect and analyze logs from NSX Manager, Controllers, and Edge nodes for deeper insights.

Advanced NSX and vSAN Optimizations

Leverage the full power of VCF by utilizing advanced features for enhanced security, performance, and resilience.

NSX Advanced Features

  • Distributed Firewall (DFW) Micro-Segmentation: Enforce granular security policies per VM or workload group to prevent lateral threat movement within your data center.
  • NSX Intelligence: Utilize behavior-based analytics for threat detection, network visibility, and automated policy recommendations.
  • Load Balancing: Implement NSX native L4-L7 load balancing services directly integrated with your VM applications, ensuring high availability and performance.
  • Service Insertion and Chaining: Integrate third-party security and monitoring appliances transparently into the network flow.
  • Multi-Cluster and Federation: Plan and deploy NSX Federation for centralized management and disaster recovery across multiple geographic sites.

vSAN Advanced Tips

  • Storage Policy-Based Management (SPBM): Define VM storage policies for availability (RAID levels), stripe width, checksum, and failure tolerance levels to precisely tune performance and resilience per application.
  • Deduplication and Compression: Enable these space-saving features primarily on all-flash vSAN clusters, carefully considering the potential performance impact.
  • Encryption: Implement vSAN encryption for data-at-rest security without requiring specialized hardware, meeting compliance requirements.
  • QoS and IOPS Limits: Apply QoS (Quality of Service) policies to throttle “noisy neighbor” VMs or guarantee performance for critical workloads.
  • Fault Domains and Stretched Clusters: Configure fault domains to optimize failure isolation within a single site and deploy stretched clusters for site-level redundancy and disaster avoidance.
  • vSAN Performance Service: Utilize the vSAN performance monitoring service to gain deep insights into I/O patterns, bandwidth, and latency, aiding in performance tuning.


Top 10 VMware PowerShell Scripts for Admins

Master your VMware vSphere environment with these essential PowerCLI scripts. They simplify daily management, monitoring, and troubleshooting tasks, boosting your efficiency significantly.

1. List All VMs with Power State and Host

Get a quick overview of all virtual machines, their power state, and running host.

Get-VM | Select-Object Name, PowerState, @{Name='Host';Expression={$_.VMHost.Name}} | Format-Table -AutoSize

2. Check VM Tools Status for All VMs

Verify VMware Tools installation and running status across your VMs.

Get-VM | Select-Object Name, @{N="ToolsStatus";E={$_.ExtensionData.Guest.ToolsStatus}} | Format-Table -AutoSize

3. Get Datastore Usage Summary

Monitor datastore free space and capacity percentages to anticipate storage needs.

Get-Datastore | Select-Object Name, FreeSpaceGB, CapacityGB, @{N="FreePercent";E={[math]::Round(($_.FreeSpaceGB/$_.CapacityGB)*100,2)}} | Format-Table -AutoSize

4. Find Snapshots Older Than 30 Days

Identify old snapshots that may impact performance and storage. Cleanup is recommended.

Get-VM | Get-Snapshot | Where-Object {$_.Created -lt (Get-Date).AddDays(-30)} | Select-Object VM, Name, Created | Format-Table -AutoSize

5. Get List of Hosts with CPU and Memory Usage

Track resource utilization on ESXi hosts for capacity planning.

Get-VMHost | Select-Object Name, @{N="CPU_Usage(%)";E={[math]::Round(($_.CpuUsageMHz / $_.CpuTotalMHz)*100,2)}}, @{N="Memory_Usage(%)";E={[math]::Round(($_.MemoryUsageMB / $_.MemoryTotalMB)*100,2)}} | Format-Table -AutoSize

6. VMs with High CPU Usage in Last Hour

List VMs consuming more than 80% CPU average to spot bottlenecks.

$oneHourAgo = (Get-Date).AddHours(-1)
Get-Stat -Entity (Get-VM) -Stat cpu.usage.average -Start $oneHourAgo |
Group-Object -Property Entity | ForEach-Object {
  [PSCustomObject]@{
    VMName = $_.Name
    AvgCPU = ($_.Group | Measure-Object -Property Value -Average).Average
  }
} | Where-Object { $_.AvgCPU -gt 80 } | Sort-Object -Property AvgCPU -Descending | Format-Table -AutoSize

7. Power Off All VMs on a Specific Host

Useful for host maintenance or shutdown, ensuring controlled VM power-off.

# $host is a reserved automatic variable in PowerShell, so use a different name
$targetHost = "esxi-hostname"
Get-VM -VMHost $targetHost | Where-Object {$_.PowerState -eq "PoweredOn"} | Stop-VM -Confirm:$false

8. Create a New VM Folder and Move Specified VMs

Organize virtual machines into folders programmatically for better vCenter management.

$folderName = "NewFolder"
$vmNames = @("VM1", "VM2", "VM3")
$folder = Get-Folder -Name $folderName -ErrorAction SilentlyContinue
if (-not $folder) { $folder = New-Folder -Name $folderName -Location (Get-Datacenter) }
foreach ($vmName in $vmNames) {
    Get-VM -Name $vmName | Move-VM -Destination $folder
}

9. Export VM List to CSV with Key Info

Generate reports for auditing, capacity planning, or inventory by exporting VM details.

Get-VM | Select-Object Name, PowerState, NumCPU, MemoryGB, ProvisionedSpaceGB | Export-Csv -Path "C:\VMReport.csv" -NoTypeInformation

10. Check VM Network Adapters and IP Addresses

Assess VM connectivity, IP addresses, and network adapter configurations.

Get-VM | Select-Object Name, @{N="IPAddresses";E={$_.Guest.IPAddress -join ","}}, @{N="NetworkAdapters";E={$_.NetworkAdapters.Name -join ","}} | Format-Table -AutoSize

How to Use

First, install PowerCLI: Install-Module VMware.PowerCLI. Then, connect to your vCenter or ESXi host: Connect-VIServer -Server your_vcenter_server. Copy and run the scripts in your PowerShell console. Always test in a non-production environment first.
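The setup steps in one place, as a copy-and-paste block; the certificate setting is optional and intended only for lab environments:

# Install PowerCLI from the PowerShell Gallery (one time)
Install-Module VMware.PowerCLI -Scope CurrentUser

# Optional, lab only: ignore untrusted vCenter certificates
Set-PowerCLIConfiguration -InvalidCertificateAction Ignore -Confirm:$false

# Connect to vCenter or a standalone ESXi host
Connect-VIServer -Server your_vcenter_server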

These PowerCLI scripts are fundamental tools for any VMware administrator. Utilize them to enhance your operational efficiency and maintain a robust virtualized infrastructure. Happy scripting!

Step-by-Step Guide: Running Kubernetes Applications in a VMware Environment

This documentation provides a comprehensive walkthrough for deploying and managing modern, containerized applications with Kubernetes on a VMware vSphere foundation. By leveraging familiar VMware tools and infrastructure, organizations can accelerate their adoption of Kubernetes while maintaining enterprise-grade stability, security, and performance. This guide covers architecture, deployment options, networking design, and practical examples using solutions like VMware Tanzu and vSphere.

1. Understanding the Architecture

Running Kubernetes on VMware combines the power of cloud-native orchestration with the robustness of enterprise virtualization. This hybrid approach allows you to leverage existing investments in hardware, skills, and operational processes.

VMware Environment: The Foundation

The core infrastructure is your vSphere platform, which provides the compute, storage, and networking resources for the Kubernetes nodes. Key components include:

  • ESXi Hosts: The hypervisors that run the virtual machines (VMs) for Kubernetes control plane and worker nodes.
  • vCenter Server: The centralized management plane for your ESXi hosts and VMs. It’s essential for deploying, managing, and monitoring the cluster’s underlying infrastructure.
  • vSphere Storage: Datastores (vSAN, VMFS, NFS) that provide persistent storage for VMs and, through the vSphere CSI driver, for Kubernetes applications.

Kubernetes Installation: A Spectrum of Choices

VMware offers a range of options for deploying Kubernetes, from deeply integrated, turn-key solutions to flexible, do-it-yourself methods.

  • VMware vSphere with Tanzu (VKS): This is the premier, integrated solution that embeds Kubernetes directly into vSphere. It transforms a vSphere cluster into a platform for running both VMs and containers side-by-side. It simplifies deployment and provides seamless access to vSphere resources.
  • VMware Tanzu Kubernetes Grid (TKG): A standalone, multi-cloud Kubernetes runtime that you can deploy on vSphere (and other clouds). TKG is ideal for organizations that need a consistent Kubernetes distribution across different environments.
  • Kubeadm on VMs: The generic, open-source approach. You create Linux VMs on vSphere and use standard Kubernetes tools like kubeadm to bootstrap a cluster. This offers maximum flexibility but requires more manual configuration and lifecycle management.

Networking: The Critical Connector

Proper network design is crucial for security and performance. VMware provides powerful constructs for Kubernetes networking:

  • VMware NSX: An advanced network virtualization and security platform. When integrated with Kubernetes, NSX provides a full networking and security stack, including pod networking, load balancing, and micro-segmentation for “zero-trust” security between microservices.
  • vSphere Distributed Switch (vDS): Can be used to create isolated networks (VLANs) for different traffic types—such as management, pod, and service traffic—providing a solid and performant networking base.

2. Prerequisites

Before deploying a cluster, ensure your VMware environment is prepared and has sufficient resources.

  • Configured vSphere/vCenter: A healthy vSphere 7.0U2 or newer environment with available ESXi hosts in a cluster.
  • Sufficient Resources: Plan for your desired cluster size. A small test cluster (1 control plane, 3 workers) may require at least 16 vCPUs, 64GB RAM, and 500GB of storage. Production clusters will require significantly more.
  • Networking Infrastructure:
    • (For vDS) Pre-configured port groups and VLANs for management, workload, and external access.
    • (For NSX) NSX Manager deployed and configured with network segments and T0/T1 gateways.
    • A pool of available IP addresses for all required networks.
  • Tooling (Optional but Recommended): VMware Tanzu CLI, Rancher, or other management tools to simplify cluster lifecycle operations.

3. Cluster Deployment: Step by Step

Option 1: VMware Tanzu Kubernetes Grid (TKG) Standalone

TKG provides a streamlined CLI or UI experience for creating conformant Kubernetes clusters.

# Install prerequisites: Docker, Tanzu CLI, kubectl
# Start the UI-based installer for a guided experience
tanzu standalone-cluster create --ui

# Alternatively, use a YAML configuration file for repeatable deployments
tanzu standalone-cluster create -f my-cluster-config.yaml

The wizard or YAML file allows you to specify the vCenter endpoint, the number of nodes, VM sizes (e.g., small, medium, large), and network settings.

Option 2: vSphere with Tanzu (VKS)

This method is fully integrated into the vSphere Client.

  1. In the vSphere Client, navigate to Workload Management.
  2. Enable it on a vSphere cluster, which deploys a Supervisor Cluster.
  3. Configure control plane node sizes and worker node pools via VM Classes.
  4. Assign network segments for Pod and Service IP ranges.
  5. Once enabled, developers can provision their own “Tanzu Kubernetes Clusters” on-demand.

Option 3: Kubeadm on VMs (DIY)

This is the most manual but also most transparent method.

  1. Prepare Linux VMs on vSphere (e.g., Ubuntu 20.04). Best practice is to create a template.
  2. Install a container runtime (Containerd), kubeadm, kubelet, and kubectl on all VMs.
  3. Initialize the master node:
# Replace with your chosen Pod network range
sudo kubeadm init --pod-network-cidr=192.168.0.0/16
  4. Install a CNI (Container Network Interface) plugin like Calico or Antrea.
  5. Join worker nodes using the command provided by the kubeadm init output (general form shown below).
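The join command printed by kubeadm init has the following general form; the token and hash are placeholders generated by your own init run:

# Run on each worker node, using the values printed by 'kubeadm init'
sudo kubeadm join <control-plane-ip>:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>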

4. Networking Design Example

A segmented network topology is a best practice for security and manageability. NSX or vDS with VLANs enables this isolation.

Reference Network Topology

Each entry lists the component, its network / VLAN name, an example address range, and its purpose.

vSphere Management (mgmt-vlan, 10.0.0.0/24): Access to vCenter, ESXi management, and NSX Manager. Highly secured.
Kubernetes API (k8s-control-plane, 10.10.10.0/24): For `kubectl` access and external automation tools to reach the cluster API.
Pod Network, Overlay (k8s-pods-vxlan, 192.168.0.0/16): Internal, private network for all Pod-to-Pod communication. Managed by the CNI.
Service Network (k8s-svc-vlan, 10.20.20.0/24): Virtual IP range for Kubernetes services. Traffic is not routable externally.
External LB / Ingress (ext-lb-vlan, 10.30.30.0/24): Public-facing network where application IPs are exposed via LoadBalancers.

[External Users]
      |
[Firewall / Router]
      |
[Load Balancer / Ingress VIPs (10.30.30.x)]
      |
[K8s Service Network (10.20.20.x) - Internal]
      |
[Pods: Overlay Network (192.168.x.x)]
      |
[Worker Node VMs: Management Network on vSphere]
      |
[vSphere Mgmt (vCenter, NSX, ESXi)]

5. Deploying an Application Example

Once the cluster is running, you can deploy applications using standard Kubernetes manifest files.

Sample Deployment YAML (nginx-deployment.yaml)

This manifest creates a Deployment that ensures three replicas of an Nginx web server are always running.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3 # Desired number of pods
  selector:
    matchLabels:
      app: nginx # Connects the Deployment to the pods
  template: # Pod template
    metadata:
      labels:
        app: nginx # Label applied to each pod
    spec:
      containers:
      - name: nginx
        image: nginx:latest # The container image to use
        ports:
        - containerPort: 80 # The port the application listens on

Apply the configuration to your cluster:

kubectl apply -f nginx-deployment.yaml

6. Exposing the Application via a Service

A Deployment runs your pods, but a Service exposes them to the network. For production, a LoadBalancer service is recommended.

Sample LoadBalancer Service (nginx-service.yaml)

When deployed in an integrated environment like Tanzu with NSX, this automatically provisions an external IP from your load balancer pool.

apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  type: LoadBalancer # Asks the cloud provider for a load balancer
  selector:
    app: nginx # Forwards traffic to pods with this label
  ports:
    - protocol: TCP
      port: 80 # The port the service will be exposed on
      targetPort: 80 # The port on the pod to send traffic to

Apply the service and find its external IP:

kubectl apply -f nginx-service.yaml
kubectl get service nginx-service
# The output will show an EXTERNAL-IP once provisioned

You can then access your application at http://<EXTERNAL-IP>.

7. Scaling, Monitoring, and Managing

  • Scaling: Easily adjust the number of replicas to handle changing loads.
kubectl scale deployment/nginx-deployment --replicas=5
  • Monitoring: Combine vSphere monitoring (for VM health) with in-cluster tools like Prometheus and Grafana (for application metrics). VMware vRealize Operations provides a holistic view from app to infrastructure.
  • Storage: Use the vSphere CSI driver to provide persistent storage. Developers request storage with a PersistentVolumeClaim (PVC), and vSphere automatically provisions a virtual disk on a datastore (vSAN, VMFS, etc.) to back it.
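A minimal sketch of dynamic provisioning through the vSphere CSI driver: the provisioner name is the driver's standard csi.vsphere.vmware.com, while the storage policy name is an assumption to replace with one of your own SPBM policies:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-fast
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "vSAN Default Storage Policy"   # assumed SPBM policy name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce            # block-backed vSphere volumes are RWO
  storageClassName: vsphere-fast
  resources:
    requests:
      storage: 20Gi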

Best Practices & Further Reading

  • Use Resource Pools: In vSphere, use Resource Pools to guarantee CPU and memory for Kubernetes nodes, isolating them from other VM workloads.
  • Embrace NSX Security: Use NSX micro-segmentation to create firewall rules that control traffic between pods, enforcing a zero-trust security model.
  • Automate Everything: Leverage Terraform, Ansible, or PowerCLI to automate the deployment and configuration of your vSphere infrastructure and Kubernetes clusters.
  • Follow Validated Designs: For production, consult VMware’s official reference architectures to ensure a supportable and scalable deployment.


Document Version 1.0 | A foundational framework for enterprise Kubernetes on VMware.

Configuring NVIDIA vGPU for Virtual Machines on ESXi on a Dell PowerEdge MX760c

Introduction

This document provides a comprehensive step-by-step guide to configuring NVIDIA Virtual GPU (vGPU) on VMware ESXi, specifically tailored for environments where virtual machines (VMs) require GPU acceleration. This setup is crucial for workloads such as Artificial Intelligence (AI), Machine Learning (ML), high-performance computing (HPC), and advanced graphics virtualization. We will detail the process of enabling NVIDIA GPUs, such as those installed in a Dell PowerEdge MX760c server, to be shared among multiple VMs, enhancing resource utilization and performance.

While the concept of “GPU passthrough” often refers to dedicating an entire physical GPU to a single VM (DirectPath I/O), NVIDIA vGPU technology allows a physical GPU to be partitioned into multiple virtual GPUs. Each vGPU can then be assigned to a different VM, providing a more flexible and scalable solution. This guide focuses on the vGPU setup, which leverages NVIDIA’s drivers and management software in conjunction with VMware vSphere.

The instructions cover compatibility verification, hardware installation, ESXi host configuration, vGPU assignment to VMs, and driver installation within the guest operating systems. Following these steps will enable your virtualized environment to harness the power of NVIDIA GPUs for demanding applications. We will also briefly touch upon integrating this setup with VMware Private AI Foundation with NVIDIA for streamlined AI workload deployment.

Prerequisites

Before proceeding with the configuration, ensure the following prerequisites are met:

  • Compatible Server Hardware: A server system that supports NVIDIA GPUs and is certified for the version of ESXi you are running. For instance, the Dell PowerEdge MX760c is supported for ESXi 8.0 Update 3 and is compatible with SR-IOV and NVIDIA GPUs.
  • NVIDIA GPU: An NVIDIA GPU that supports vGPU technology. Refer to NVIDIA’s documentation for a list of compatible GPUs.
  • VMware ESXi: A compatible version of VMware ESXi installed on your host server. This guide assumes ESXi 8.0 or a similar modern version.
  • VMware vCenter Server: While some configurations might be possible without it, vCenter Server is highly recommended for managing vGPU deployments.
  • NVIDIA vGPU Software: You will need the NVIDIA vGPU Manager VIB (vSphere Installation Bundle) for ESXi and the corresponding NVIDIA guest OS drivers for the VMs. These are typically available from NVIDIA’s licensing portal.
  • Network Connectivity: Ensure the ESXi host has network access to download necessary files or for management via SSH and vSphere Client.
  • Appropriate Licensing: NVIDIA vGPU solutions require licensing. Ensure you have the necessary licenses for your deployment.

Step 1: Verify Compatibility

Ensuring hardware and software compatibility is the foundational step for a successful vGPU deployment. Failure to do so can lead to installation issues, instability, or suboptimal performance.

1.1 Check Server Compatibility

Your server must be certified to run the intended ESXi version and support the specific NVIDIA GPU model you plan to use. Server vendors often provide compatibility matrices.

  • Action: Use the Broadcom Compatibility Guide (formerly VMware Compatibility Guide) to confirm your server model’s support for ESXi (e.g., ESXi 8.0 Update 3) and its compatibility with NVIDIA GPUs.
  • Example: The Dell PowerEdge MX760c is listed as a supported server model for ESXi 8.0 Update 3 and is known to be compatible with SR-IOV and NVIDIA GPUs, making it suitable for vGPU deployments.
  • Details: Compatibility verification includes checking for BIOS support for virtualization technologies (VT-d/IOMMU, SR-IOV), adequate power supply and cooling for the GPU, and physical PCIe slot availability.

1.2 Check GPU Compatibility

Not all NVIDIA GPUs support vGPU, and among those that do, compatibility varies with ESXi versions and NVIDIA vGPU software versions.

  • Action: Consult the official NVIDIA vGPU documentation and the NVIDIA Virtual GPU Software Supported Products List. This documentation provides detailed information on which GPUs are supported, the required vGPU software versions, and compatible ESXi versions.
  • Details: Pay close attention to the specific vGPU profiles supported by your chosen GPU, as these profiles determine how the GPU’s resources are partitioned and allocated to VMs. Ensure the GPU firmware is up to date as recommended by NVIDIA or your server vendor.

Note: Always use the latest available compatibility information from both VMware/Broadcom and NVIDIA, as these are updated regularly with new hardware and software releases.

Step 2: Install NVIDIA GPU on the Host

Once compatibility is confirmed, the next step is to physically install the NVIDIA GPU into the ESXi host server and configure the server’s BIOS/UEFI settings appropriately.

2.1 Add the GPU as a PCI Device to the Host

  • Action: Physically install the NVIDIA GPU into an appropriate PCIe slot in the PowerEdge MX760c or your compatible server.
  • Procedure:
    1. Power down and unplug the server. Follow all electrostatic discharge (ESD) precautions.
    2. Open the server chassis according to the manufacturer’s instructions.
    3. Identify a suitable PCIe slot. High-performance GPUs usually require an x16 PCIe slot and may need auxiliary power connectors.
    4. Insert the GPU firmly into the slot and secure it. Connect any necessary auxiliary power cables directly from the server’s power supply to the GPU.
    5. Close the server chassis and reconnect power.
  • Considerations: Ensure the server’s Power Supply Unit (PSU) can handle the additional power load from the GPU. Check server documentation for slot priority or specific slots designated for GPUs. Proper airflow and cooling are also critical for GPU stability and longevity.

2.2 Update Server BIOS/UEFI Settings

Several BIOS/UEFI settings must be enabled to support GPU passthrough and virtualization technologies like vGPU.

  • Action: Boot the server and enter the BIOS/UEFI setup utility (commonly by pressing F2, DEL, or another designated key during startup).
  • Key Settings to Enable:
    • Virtualization Technology (VT-x / AMD-V): Usually enabled by default, but verify.
    • SR-IOV (Single Root I/O Virtualization): This is critical for many vGPU deployments as it allows a PCIe device to appear as multiple separate physical devices. Locate this setting, often under “Integrated Devices,” “PCIe Configuration,” or “Processor Settings.”
    • VT-d (Intel Virtualization Technology for Directed I/O) / AMD IOMMU: This technology enables direct assignment of PCIe devices to VMs and is essential for passthrough and vGPU functionality.
    • Memory Mapped I/O above 4GB (Above 4G Decoding): Enable this if available, as GPUs require significant address space.
    • Disable any conflicting settings like on-board graphics if they interfere, though often they can co-exist.
  • Save and Exit: After making changes, save the settings and exit the BIOS/UEFI utility. The server will reboot.

Important: The exact naming and location of these settings can vary significantly between server manufacturers and BIOS versions. Consult your server’s technical documentation for specific instructions.

Step 3: Install NVIDIA VIB on ESXi Host

With the hardware installed and BIOS configured, the next phase involves installing the NVIDIA vGPU Manager VIB (vSphere Installation Bundle) on the ESXi host. This software component enables the ESXi hypervisor to recognize and manage the NVIDIA GPU for vGPU operations.

A detailed guide from Broadcom can be found here: Installing and configuring the NVIDIA VIB on ESXi.

3.1 Download the NVIDIA vGPU Manager VIB

  • Action: Obtain the correct NVIDIA vGPU Manager VIB package for your ESXi version and GPU model. This software is typically downloaded from the NVIDIA Licensing Portal, accessed through the NVIDIA Enterprise Application Hub.
  • Critical: Ensure the VIB version matches your ESXi host version (e.g., ESXi 8.0, 8.0 U1, 8.0 U2, 8.0 U3). Using an incompatible VIB can lead to installation failure or system instability. The VIB package will be a .vib file.
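
Before downloading, it helps to confirm the exact ESXi version and build you are matching against. For example, over SSH on the host:

vmware -vl                 # ESXi version, build, and edition
esxcli system version get  # The same information in a scriptable form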

3.2 Upload the VIB to the ESXi Host

  • Action: Transfer the downloaded .vib file to a datastore accessible by your ESXi host, or directly to a temporary location on the host (e.g., /tmp).
  • Method: Use an SCP client (like WinSCP for Windows, or scp command-line utility for Linux/macOS) or the datastore browser in vSphere Client to upload the VIB file.
  • Example using SCP: scp /path/to/local/vgpu-manager.vib root@your_esxi_host_ip:/vmfs/volumes/your_datastore/

3.3 Install the VIB

  • Action: Place the ESXi host into maintenance mode. This is crucial to ensure no VMs are running during the driver installation and subsequent reboot. You can do this via the vSphere Client (right-click host > Maintenance Mode > Enter Maintenance Mode).
  • Procedure:
    1. Enable SSH on the ESXi host if it’s not already enabled (vSphere Client: Host > Configure > Services > SSH > Start).
    2. Connect to the ESXi host using an SSH client (e.g., PuTTY or command-line SSH).
    3. Navigate to the directory where you uploaded the VIB, or use the full path to the VIB file.
    4. Run the VIB installation command, replacing the datastore path with the actual location of your VIB file:

esxcli software vib install -v /vmfs/volumes/your_datastore/vgpu-manager.vib

Alternatively, if uploaded to /tmp:

esxcli software vib install -v /tmp/vgpu-manager.vib

This command might require the --no-sig-check flag if the VIB is not signed by a trusted source or if you encounter signature verification issues, though official NVIDIA VIBs should be signed.

    5. After successful installation, the command output will indicate that the VIB has been installed and a reboot is required.
    6. Reboot the ESXi host:

reboot

    7. Once the host has rebooted, exit maintenance mode.
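
For reference, the same maintenance-mode, install, and reboot sequence can also be driven entirely from the ESXi shell; a minimal sketch, with the datastore path and VIB file name as placeholders:

esxcli system maintenanceMode set --enable true
esxcli software vib install -v /vmfs/volumes/your_datastore/vgpu-manager.vib
reboot
# After the host is back online and you have reconnected over SSH:
esxcli system maintenanceMode set --enable false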

3.4 Verify VIB Installation

  • Action: After the ESXi host reboots, verify that the NVIDIA VIB is installed correctly and the GPU is recognized.
  • Command: SSH into the ESXi host and run: nvidia-smi
  • Expected Output: This command should display information about the NVIDIA GPU(s) installed in the host, including GPU model, driver version, temperature, and memory usage. If this command executes successfully and shows your GPU details, the VIB installation was successful. If it returns an error like “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver,” there might be an issue with the VIB installation, GPU compatibility, or BIOS settings.
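
In addition to nvidia-smi, you can confirm that the package itself is registered with the host’s image profile:

esxcli software vib list | grep -i nvidia
# The NVIDIA vGPU Manager VIB should appear with its version and installation date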

Step 4: Configure NVIDIA GPUs for vGPU Mode

After installing the NVIDIA VIB on the ESXi host and confirming the driver can communicate with the GPU (via nvidia-smi), you need to ensure the GPU is configured for the correct operational mode for vGPU. Some NVIDIA GPUs can operate in different modes (e.g., graphics/vGPU mode vs. compute mode).

4.1 Enable vGPU Mode (if applicable)

  • Action: For certain NVIDIA GPU models (especially those in the Tesla or Data Center series), you might need to set the GPU mode to “graphics” or “vGPU” mode. By default, they might be in a “compute” mode. This change is typically done using tools provided by NVIDIA or via nvidia-smi commands on the ESXi host if supported for that specific configuration.
  • Guidance: Refer to the NVIDIA vGPU Deployment Guide specific to your GPU series and vGPU software version. This guide will provide the exact commands or procedures if a mode change is necessary.

For example, to check the current mode or to change it, you might use specific nvidia-smi persistence mode commands or other NVIDIA utilities. However, for many modern GPUs and vGPU software versions, the driver automatically handles the appropriate mode for vGPU when licensed correctly.
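
As a hedged illustration of that kind of host-side usage (the exact output sections vary by GPU model and driver release):

nvidia-smi -q      # Full per-GPU report; mode, ECC, and vGPU-related sections depend on the driver
nvidia-smi -pm 1   # Enable persistence mode, where supported by the host driver build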

  • Licensing: Ensure your NVIDIA vGPU licensing is correctly configured. The NVIDIA vGPU software relies on a license server to enable vGPU features. Without a valid license, vGPU functionality will be restricted or disabled. The license dictates which vGPU profiles are available.

4.2 Verify GPU Availability and Passthrough Configuration

  • Action: Confirm that ESXi recognizes the NVIDIA GPU(s) as available PCI devices that can be used for passthrough or vGPU.
  • Command: On the ESXi host via SSH, run: esxcli hardware pci list | grep -i nvidia
  • Expected Output: This command lists all PCI devices containing “nvidia” in their description. You should see entries corresponding to your installed NVIDIA GPU(s), including their vendor ID, device ID, and description. This confirms that the ESXi kernel is aware of the hardware.
  • vSphere Client Check: You can also check this in the vSphere Client:
    1. Select the ESXi host in the inventory.
    2. Navigate to Configure > Hardware > PCI Devices.
    3. Filter or search for “NVIDIA”. The GPUs should be listed here. You might need to toggle passthrough for the device here if you were doing direct passthrough. For vGPU, the installed VIB handles the GPU sharing mechanism. The GPU should be listed as available for vGPU.

Note on Passthrough vs. vGPU: While the esxcli hardware pci list command is general, the key for vGPU is the installed NVIDIA VIB which enables the hypervisor to mediate access to the GPU and present virtualized instances (vGPUs) to the VMs, rather than passing through the entire physical device to a single VM.
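
On recent ESXi releases, the esxcli graphics namespace provides a vGPU-oriented view of the same hardware. The commands below are standard esxcli options; consult the NVIDIA deployment guide before changing the default graphics type on a production host.

esxcli graphics device list   # Physical GPUs known to the graphics subsystem
esxcli graphics host get      # Current default graphics type (Shared vs. SharedPassthru)
esxcli graphics host set --default-type SharedPassthru   # Shown as “Shared Direct” in the vSphere Client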

Step 5: Assign vGPU to Virtual Machines (VMs)

With the ESXi host properly configured and the NVIDIA vGPU Manager VIB installed, you can now assign vGPU resources to your virtual machines. This process involves editing the VM’s settings to add a shared PCI device, which represents the vGPU.

5.1 Create or Select an Existing Virtual Machine

  • Action: In the vSphere Client, either create a new virtual machine or select an existing one that requires GPU acceleration.
  • Guest OS Compatibility: Ensure the guest operating system (OS) you plan to use within the VM is supported by NVIDIA vGPU technology and that you have the corresponding NVIDIA guest OS drivers. Supported operating systems typically include various versions of Windows, Windows Server, and Linux distributions.

5.2 Add vGPU Shared PCI Device to the VM

  • Action: Edit the settings of the target virtual machine to add an NVIDIA vGPU.
  • Procedure (via vSphere Client):
    1. Power off the virtual machine. vGPU assignment typically requires the VM to be powered off.
    2. Right-click the VM in the vSphere inventory and select Edit Settings.
    3. In the “Virtual Hardware” tab, click Add New Device.
    4. From the device list, select NVIDIA vGPU if it is offered directly. On some vSphere versions the option appears instead as Shared PCI Device; in that case add it and then choose the NVIDIA vGPU type.
    5. A new “PCI device” entry will appear. Expand it.
    6. From the “NVIDIA vGPU Profile” dropdown list, select the desired vGPU profile.

The user interface might vary slightly depending on the vSphere version, but the principle is to add a new device and choose the NVIDIA vGPU type and profile.

5.3 Configure the vGPU Profile

  • Explanation: vGPU profiles define how the physical GPU’s resources (e.g., framebuffer/VRAM, number of supported display heads, compute capability) are allocated to the VM. NVIDIA provides a range of profiles (e.g., Q-series for Quadro features, C-series for compute, B-series for business graphics, A-series for virtual applications).
  • Selection Criteria: Choose a profile that matches the workload requirements of the VM. For example:
    • AI/ML or HPC: Typically require profiles with larger framebuffers and significant compute resources (e.g., C-series or high-end A/Q profiles).
    • Virtual Desktops (VDI) / Graphics Workstations: Profiles vary based on the intensity of the graphics applications (e.g., B-series for knowledge workers, Q-series for designers/engineers).
  • Resource Reservation: After adding the vGPU, reserve all guest memory for the VM: under the “Virtual Hardware” tab’s Memory section, check “Reserve all guest memory (All locked)”. If required for your GPU, also set pciPassthru.use64bitMMIO to TRUE via the “VM Options” tab, under Advanced > Configuration Parameters. Full memory reservation is generally a requirement for stable vGPU operation.
  • Click OK to save the VM settings.

5.4 Install NVIDIA Drivers in the Guest OS

  • Action: Power on the virtual machine. Once the guest OS boots up, you need to install the appropriate NVIDIA guest OS drivers. These are different from the VIB installed on the ESXi host.
  • Driver Source: Download the NVIDIA vGPU software guest OS drivers from the NVIDIA Licensing Portal. Ensure these drivers match the vGPU software version running on the host (VIB version) and are compatible with the selected vGPU profile and the guest OS.
  • Installation: Install the drivers within the guest OS following standard driver installation procedures for that OS (e.g., running the setup executable on Windows, or using package managers/scripts on Linux). A reboot of the VM is typically required after driver installation.
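
On a Linux guest, the installation typically looks like the sketch below. The driver file name is hypothetical; use the grid guest driver that shipped with your vGPU software bundle, and ensure build tools and kernel headers for the running kernel are present.

chmod +x NVIDIA-Linux-x86_64-xxx.xx.xx-grid.run     # hypothetical file name
sudo ./NVIDIA-Linux-x86_64-xxx.xx.xx-grid.run
sudo reboot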

5.5 Verify vGPU Functionality in the Guest OS

  • Action: After the guest OS driver installation and reboot, verify that the vGPU is functioning correctly within the VM.
  • Verification on Windows:
    • Open Device Manager. The NVIDIA GPU should be listed under “Display adapters” without errors.
    • Run the nvidia-smi command from a command prompt (typically on the system PATH, or under C:\Program Files\NVIDIA Corporation\NVSMI\ with older drivers). It should display details of the assigned vGPU profile and its status.
  • Verification on Linux:
    • Open a terminal and run the nvidia-smi command. It should show the vGPU details.
    • Check dmesg or Xorg logs for any NVIDIA-related errors.
  • License Status: Ensure the VM successfully acquires a vGPU license from your NVIDIA license server. The nvidia-smi output or NVIDIA control panel within the guest can often show licensing status.
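
In a Linux guest, a quick way to spot-check licensing from the command line (field names vary between vGPU software releases):

nvidia-smi -q | grep -i -A 2 license
# A correctly licensed VM typically reports a "Licensed" status for the vGPU software product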

Step 6: (Optional) Deploy VMware Private AI Foundation with NVIDIA

For organizations looking to build an enterprise-grade AI platform, VMware Private AI Foundation with NVIDIA offers an integrated solution that leverages the vGPU capabilities you’ve just configured. This platform helps streamline the deployment and management of AI/ML workloads.

Key Aspects:

  • Install VMware Private AI Foundation Components: This involves deploying specific VMware software components (like Tanzu for Kubernetes workloads, and AI-specific management tools) that are optimized to work with NVIDIA AI Enterprise software. Follow VMware’s official documentation for this deployment.
  • Integrate with vGPUs: The vGPU-accelerated VMs become the workhorses for your AI applications. VMware Private AI Foundation provides tools and frameworks to efficiently manage these resources and schedule AI/ML jobs on them.
  • Leverage APIs: Utilize both NVIDIA and VMware APIs for programmatic control, monitoring GPU performance, workload optimization, and dynamic resource management. This allows for automation and integration into MLOps pipelines.

This step is an advanced topic beyond basic vGPU setup but represents a common use case for environments that have invested in NVIDIA vGPU technology on VMware.

Troubleshooting Common Issues

While the setup process is generally straightforward if all compatibility and procedural guidelines are followed, issues can arise. Here are some common troubleshooting areas:

  • nvidia-smi fails on ESXi host:
    • Ensure the VIB is installed correctly and matches the ESXi version.
    • Verify BIOS settings (SR-IOV, VT-d).
    • Check GPU seating and power.
    • Consult /var/log/vmkernel.log on ESXi for NVIDIA-related errors.
  • vGPU option not available when editing VM settings:
    • Confirm NVIDIA VIB is installed and nvidia-smi works on the host.
    • Ensure the GPU supports vGPU and is in the correct mode.
    • Check host licensing for NVIDIA vGPU.
  • NVIDIA driver fails to install in Guest OS or shows errors:
    • Verify you are using the correct NVIDIA guest OS driver version that matches the host VIB and vGPU profile.
    • Ensure the VM has sufficient resources (RAM, CPU) and that memory reservation is configured if required.
    • Check for OS compatibility.
  • VM fails to power on after vGPU assignment:
    • Insufficient host GPU resources for the selected vGPU profile (e.g., trying to assign more vGPUs than the physical GPU can support).
    • Memory reservation issues.
    • Incorrect BIOS settings on the host.
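
A quick host-side checklist that covers most of the items above, run over SSH on the ESXi host:

esxcli software vib list | grep -i nvidia            # vGPU Manager VIB present, and its version
esxcli hardware pci list | grep -i nvidia            # GPU visible on the PCI bus
nvidia-smi                                           # Driver-to-GPU communication
grep -i nvidia /var/log/vmkernel.log | tail -n 50    # Recent NVIDIA-related vmkernel messages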

Conclusion

Configuring NVIDIA vGPU on VMware ESXi allows businesses to efficiently utilize powerful GPU resources across multiple virtual machines. This unlocks performance for demanding applications like AI/ML, VDI, and graphics-intensive workloads in a virtualized environment. By meticulously following the steps outlined in this guide—from compatibility checks and hardware installation to software configuration on both the host and guest VMs—administrators can create a robust and scalable GPU-accelerated infrastructure. Remember to consult official documentation from VMware and NVIDIA for the most current and detailed information specific to your hardware and software versions.

References

  • NVIDIA vGPU Documentation (available on the NVIDIA website/portal)
  • VMware Private AI Foundation with NVIDIA Documentation
  • Broadcom Compatibility Guide (for VMware hardware compatibility)
  • Dell PowerEdge MX760c Technical Documentation
  • Broadcom KB Article: Installing and configuring the NVIDIA VIB on ESXi

Guide: GPU and IOPS Performance Testing on a Virtual Machine with NVIDIA GPU Passthrough

Introduction

This document outlines the procedures for setting up the environment and conducting performance tests on a Virtual Machine (VM) running on an ESXi hypervisor. Specifically, it focuses on VMs configured with NVIDIA GPU passthrough. Proper testing is crucial to ensure that the GPU is correctly recognized, utilized, and performing optimally within the VM, and that storage Input/Output Operations Per Second (IOPS) meet the required levels for demanding applications. These tests help validate the stability and performance of the GPU passthrough configuration and overall VM health. We will cover environment setup for GPU-accelerated libraries like PyTorch and TensorFlow, followed by GPU stress testing and storage IOPS tuning.

Section 1: Environment Setup for GPU Accelerated Computing

Before running any performance benchmarks, it’s essential to configure the software environment within the VM correctly. This typically involves setting up isolated Python environments and installing necessary libraries like PyTorch or TensorFlow with CUDA support to leverage the NVIDIA GPU.

1.1 The Importance of Virtual Environments

Using virtual environments (for example, Python's built-in `venv` module) is highly recommended to manage dependencies for different projects and avoid conflicts between package versions. Each project can have its own isolated environment with specific libraries.

1.2 Setting up the Environment for PyTorch

The following steps guide you through creating a virtual environment and installing PyTorch with CUDA support, which is essential for GPU computation.

  1. Create a virtual environment (if not already in one):

This command creates a new virtual environment named venv-pytorch in your home directory.

python3 -m venv ~/venv-pytorch
  2. Activate the environment:

To start using the virtual environment, you need to activate it. Your shell prompt will typically change to indicate the active environment.

source ~/venv-pytorch/bin/activate

After activation, your prompt might look like this:

(venv-pytorch) root@gpu:~#
  3. Install PyTorch with CUDA support:

First, upgrade pip, the Python package installer. Then, install PyTorch, torchvision, and torchaudio, specifying the CUDA version (cu121 in this example) via the PyTorch index URL. Ensure the CUDA version matches the NVIDIA driver installed on your system and is compatible with your GPU.

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

1.3 Verifying PyTorch and CUDA Installation

After installation, verify that PyTorch can detect and use the CUDA-enabled GPU:

python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

This command should output True if CUDA is available, followed by the name of your NVIDIA GPU (e.g., NVIDIA GeForce RTX 3090).

1.4 Note on TensorFlow Setup

Setting up an environment for TensorFlow with GPU support follows a similar pattern:

  1. Create and activate a dedicated virtual environment.
  2. Install TensorFlow with GPU support (e.g., pip install tensorflow). TensorFlow’s GPU support is often bundled, but ensure your NVIDIA drivers and CUDA Toolkit versions are compatible with the TensorFlow version you install. Consult the official TensorFlow documentation for specific CUDA and cuDNN version requirements.
  3. Verify the installation by running a TensorFlow command that lists available GPUs (e.g., python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))").

This document focuses on PyTorch for the GPU burn script, but the principles of environment setup are transferable.

Section 2: GPU Performance and Stability Testing (`gpu_burn.py`)

Once the environment is set up, you can proceed to test the GPU’s performance and stability. This is particularly important in a VM with GPU passthrough to ensure the GPU is functioning correctly under heavy load and that the passthrough configuration is stable. The `gpu_burn.py` script is designed for this purpose.

2.1 Purpose in a Passthrough Environment

Running a GPU stress test like `gpu_burn.py` helps to:

  • Confirm that the VM has full and correct access to the GPU’s capabilities.
  • Identify any potential overheating issues or power supply limitations under sustained load.
  • Detect driver instabilities or passthrough configuration errors that might only manifest under stress.
  • Get a qualitative measure of the GPU’s computational throughput.

2.2 Overview of `gpu_burn.py`

The `gpu_burn.py` script utilizes PyTorch to perform intensive computations on all available NVIDIA GPUs. It allocates significant memory on the GPU and then runs continuous matrix multiplication operations to stress the compute units.

2.3 Script Breakdown: `gpu_burn.py`


import torch
import time

def burn_gpu(duration_sec=120):
    print("Starting GPU burn for", duration_sec, "seconds...")

    # Get all GPUs
    if not torch.cuda.is_available():
        print("NVIDIA CUDA is not available. Exiting.")
        return
    
    device_count = torch.cuda.device_count()
    if device_count == 0:
        print("No NVIDIA GPUs detected. Exiting.")
        return
        
    print(f"Found {device_count} NVIDIA GPU(s).")
    devices = [torch.device(f'cuda:{i}') for i in range(device_count)]
    tensors = []

    # Allocate large tensors to occupy memory and compute
    for i, dev in enumerate(devices):
        torch.cuda.set_device(dev) # Explicitly set current device
        print(f"Initializing {dev} (GPU {i})...")
        try:
            # Create two large tensors per device.
            # Matrix size can be adjusted based on GPU memory:
            # an 8192x8192 float32 tensor is 8192*8192*4 bytes = 256 MiB.
            a = torch.randn((8192, 8192), device=dev, dtype=torch.float32)
            b = torch.randn((8192, 8192), device=dev, dtype=torch.float32)
            tensors.append((a, b))
            print(f"Allocated tensors on {dev}.")
        except RuntimeError as e:
            print(f"Error allocating tensors on {dev}: {e}")
            print(f"Skipping {dev} due to allocation error. This GPU might have insufficient memory or other issues.")
            # Remove the device if allocation fails to avoid errors in the loop
            devices[i] = None # Mark as None
            continue # Move to the next device
    
    # Filter out devices that failed allocation. 'tensors' holds one (a, b) pair
    # per successfully initialized GPU, in order, so each surviving device is
    # paired with its position in that success order.
    active_devices_tensors = []
    for dev in devices:
        if dev is not None:  # Device was successfully initialized
            active_devices_tensors.append({'device': dev, 'tensors_idx': len(active_devices_tensors)})


    if not active_devices_tensors:
        print("No GPUs were successfully initialized with tensors. Exiting.")
        return

    print(f"Starting computation on {len(active_devices_tensors)} GPU(s).")
    start_time = time.time()
    loop_count = 0
    while time.time() - start_time < duration_sec:
        for item in active_devices_tensors:
            dev = item['device']
            # Retrieve this device's tensors by its position in the success order
            a, b = tensors[item['tensors_idx']]
            
            torch.cuda.set_device(dev) # Set current device for the operations
            # Heavy compute: Matrix multiplication
            c = torch.matmul(a, b)  
            
            if c.requires_grad: # Only taken if a/b were created with requires_grad=True (off by default here)
                c.mean().backward()
        
        # Synchronize all active GPUs after operations on them in the loop
        for item in active_devices_tensors:
            torch.cuda.synchronize(item['device'])

        loop_count += 1
        if loop_count % 10 == 0: # Print a status update periodically
            elapsed_time = time.time() - start_time
            print(f"Loop {loop_count}, Elapsed time: {elapsed_time:.2f} seconds...")


    end_time = time.time()
    print(f"GPU burn finished after {end_time - start_time:.2f} seconds and {loop_count} loops.")

if __name__ == '__main__':
    burn_gpu(120)  # Default duration: 2 minutes (120 seconds)
        

Key functionalities:

  • CUDA Availability Check: Ensures CUDA is available and GPUs are detected.
  • GPU Initialization: Iterates through detected GPUs, setting each as the current device.
  • Tensor Allocation: Attempts to allocate two large (8192×8192) float32 tensors on each GPU. This consumes approximately 2 * (8192*8192*4 bytes) = 512 MiB of GPU memory per GPU. The size can be adjusted based on available GPU memory. Error handling is included for GPUs where allocation might fail (e.g., due to insufficient memory).
  • Computation Loop: Continuously performs matrix multiplication (torch.matmul(a, b)) on the allocated tensors for each successfully initialized GPU. This operation is computationally intensive. An optional backward pass (c.mean().backward()) can be included for a more comprehensive workload if tensors are created with requires_grad=True.
  • Synchronization: torch.cuda.synchronize() is used to ensure all operations on a GPU complete before the next iteration or measurement.
  • Duration and Reporting: The test runs for a specified duration_sec (default 120 seconds). It prints status updates periodically and a final summary.

Note: The script has been slightly adapted in this documentation to correctly handle tensor indexing if some GPUs fail initialization, ensuring it uses tensors associated with successfully initialized devices. The original script might require minor adjustments in the main loop to correctly associate tensors if a GPU is skipped after the initial tensor list is populated.

2.4 How to Run and Interpret Results

1. Save the script as `gpu_burn.py` on your VM.

2. Ensure you are in the activated virtual environment where PyTorch is installed.

3. Run the script from the terminal:

python gpu_burn.py

You can change the duration by modifying the burn_gpu(120) call in the script or by parameterizing the script’s main function call.

Interpreting Results:

  • Monitor the console output for any errors, especially during tensor allocation or computation.
  • Observe the GPU utilization and temperature using tools like `nvidia-smi` in another terminal window (see the monitoring example after this list). High, stable utilization is expected.
  • A successful run completes without errors for the specified duration, indicating stability. The number of loops completed can provide a rough performance metric, comparable across similar hardware or different configurations.
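
A minimal way to watch the GPUs from a second terminal while the script runs (the query fields are standard `nvidia-smi` options):

watch -n 2 nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu,memory.used,memory.total --format=csv
# Alternatively, stream utilization once per second per GPU with: nvidia-smi dmon -s u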

Section 3: Storage IOPS Performance Tuning (`fio_iops_tuner.sh`)

Storage performance, particularly Input/Output Operations Per Second (IOPS), is critical for many VM workloads, including databases, applications with heavy logging, or build systems. The `fio_iops_tuner.sh` script uses the Flexible I/O Tester (FIO) tool to measure and attempt to achieve a target IOPS rate.

3.1 Importance of IOPS for VM Workloads

In a virtualized environment, storage I/O often passes through multiple layers (guest OS, hypervisor, physical storage). Testing IOPS within the VM helps to:

  • Verify that the VM is achieving the expected storage performance from the underlying infrastructure.
  • Identify potential bottlenecks in the storage path.
  • Tune FIO parameters like `numjobs` (number of parallel I/O threads) and `iodepth` (number of I/O requests queued per job) to maximize IOPS for a given workload profile.

3.2 Overview of `fio_iops_tuner.sh`

This shell script automates the process of running FIO with varying parameters (`numjobs` and `iodepth`) to find a configuration that meets or exceeds a `TARGET_IOPS`. It iteratively increases the load until the target is met or a maximum number of attempts is reached.

3.3 Script Breakdown: `fio_iops_tuner.sh`


#!/bin/bash

TARGET_IOPS=30000
MAX_ATTEMPTS=10
FIO_FILE=/mnt/nfs/testfile # IMPORTANT: Ensure this path is writable and on the target storage
OUTPUT_FILE=fio_output.txt

echo "Starting dynamic IOPS tuner to reach $TARGET_IOPS IOPS..."
echo "Using fio file: $FIO_FILE"
echo

# Initial FIO parameters
bs=8k         # Block size
iodepth=32    # Initial I/O depth per job
numjobs=4     # Initial number of parallel jobs
size=2G       # Size of the test file per job
rw=randrw     # Random read/write workload
mix=70        # 70% read, 30% write

for attempt in $(seq 1 $MAX_ATTEMPTS); do
    echo "Attempt $attempt: Running fio with numjobs=$numjobs and iodepth=$iodepth..."

    # Ensure unique filename for each attempt if files are not cleaned up by fio or needed for review
    CURRENT_FIO_FILE="${FIO_FILE}_attempt${attempt}"

    fio --name=rand_iops_tune \
        --ioengine=libaio \
        --rw=$rw \
        --rwmixread=$mix \
        --bs=$bs \
        --iodepth=$iodepth \
        --numjobs=$numjobs \
        --runtime=30 \
        --time_based \
        --group_reporting \
        --size=$size \
        --filename=${CURRENT_FIO_FILE} \
        --output=$OUTPUT_FILE \
        --exitall_on_error # Stop all jobs if one errors

    # Parse total IOPS (read + write). fio's text output format varies between
    # versions: newer releases print per-group "read: IOPS=..." / "write: IOPS=..."
    # lines, older ones print "iops=..." inline, and values may carry a k/M suffix.
    # The patterns below try to handle both styles; adjust them if your fio build differs.
    parsed_iops=$(grep -iE '^[[:space:]]*(read|write)[[:space:]]*:' "$OUTPUT_FILE" \
        | grep -ioP 'iops[[:space:]]*[=:][[:space:]]*\K[0-9.]+[kM]?' \
        | awk '{v=$1; if (v ~ /[kK]$/) v*=1000; else if (v ~ /[mM]$/) v*=1000000; s+=v} END {print int(s)}')
    iops=${parsed_iops:-0}


    echo "Result: Total IOPS = $iops"

    if (( iops >= TARGET_IOPS )); then
        echo "Target of $TARGET_IOPS IOPS achieved with numjobs=$numjobs and iodepth=$iodepth"
        # Optional: Clean up test files
        # rm -f ${FIO_FILE}_attempt*
        break
    else
        echo "Not enough. Increasing load..."
        # Increment strategy
        (( numjobs += 2 ))  # Increase number of jobs
        (( iodepth += 16 )) # Increase queue depth
        if [ $attempt -eq $MAX_ATTEMPTS ]; then
            echo "Maximum attempts reached. Target IOPS not achieved."
            # Optional: Clean up test files
            # rm -f ${FIO_FILE}_attempt*
        fi
    fi
    # Optional: Clean up individual attempt file if not needed for detailed review later
    rm -f ${CURRENT_FIO_FILE}
done

echo "Finished tuning. Check $OUTPUT_FILE for detailed output of the last successful or final attempt."
        

Key variables and parameters:

  • `TARGET_IOPS`: The desired IOPS rate the script aims to achieve (e.g., 30000).
  • `MAX_ATTEMPTS`: Maximum number of FIO runs with adjusted parameters (e.g., 10).
  • `FIO_FILE`: Path to the test file FIO will use. Crucially, this file should be on the storage system you intend to test. The script appends `_attemptN` to create unique files per run.
  • `OUTPUT_FILE`: File where FIO’s output is saved.
  • Initial FIO parameters:
    • `bs=8k`: Block size for I/O operations (8 kilobytes).
    • `iodepth=32`: Initial queue depth per job.
    • `numjobs=4`: Initial number of parallel jobs.
    • `size=2G`: Total size of data for each job to process.
    • `rw=randrw`: Specifies a random read/write workload.
    • `mix=70`: Defines the read percentage (70% reads, 30% writes).
    • `runtime=30`: Each FIO test runs for 30 seconds.
    • `ioengine=libaio`: Uses Linux asynchronous I/O.

Tuning Loop:

  1. The script runs FIO with the current `numjobs` and `iodepth`.
  2. It parses the `OUTPUT_FILE` to extract the achieved total IOPS. The `grep` and `awk` commands for IOPS parsing might need adjustment based on the exact FIO version and output format. The documented script includes a more robust parsing attempt for mixed workloads.
  3. If achieved IOPS meet `TARGET_IOPS`, the script reports success and exits.
  4. Otherwise, it increases `numjobs` (by 2) and `iodepth` (by 16) and retries, up to `MAX_ATTEMPTS`.

Note on FIO file path: The script uses `/mnt/nfs/testfile`. Ensure this path is valid, writable by the user running the script, and resides on the storage volume whose performance you want to test. If testing local VM disk, change this to an appropriate path like `/tmp/testfile` or a path on a mounted data disk.

Note on IOPS parsing: FIO output formats vary between versions. The script sums the read and write IOPS reported for the group, handling both the newer IOPS= and the older iops= reporting styles. If issues arise, manually inspect `fio_output.txt` from a single run to confirm the reporting format and adjust the `grep`/`awk` pattern accordingly.
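
If the text-format parsing proves brittle in your environment, a sketch using fio's JSON output avoids format drift entirely; the job options mirror the tuner script and the file paths are placeholders:

fio --name=iops_check --ioengine=libaio --rw=randrw --rwmixread=70 \
    --bs=8k --iodepth=32 --numjobs=4 --runtime=30 --time_based \
    --group_reporting --size=2G --filename=/mnt/nfs/testfile_json \
    --output-format=json --output=fio_output.json

python3 - <<'EOF'
import json

with open('fio_output.json') as f:
    data = json.load(f)

job = data['jobs'][0]  # group_reporting aggregates all jobs into a single entry
print(int(job['read']['iops'] + job['write']['iops']))
EOF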

3.4 How to Run and Interpret Results

1. Ensure FIO is installed on your VM (e.g., `sudo apt-get install fio` or `sudo yum install fio`).

2. Save the script as `fio_iops_tuner.sh` and make it executable: `chmod +x fio_iops_tuner.sh`.

3. Modify `TARGET_IOPS`, `FIO_FILE`, and other parameters in the script as needed for your environment and goals.

4. Run the script:

./fio_iops_tuner.sh

Interpreting Results:

  • The script will print the IOPS achieved in each attempt.
  • If the `TARGET_IOPS` is reached, it will indicate the `numjobs` and `iodepth` that achieved this. These parameters can be valuable for configuring applications that are sensitive to storage I/O performance.
  • If the target is not met after `MAX_ATTEMPTS`, the script will indicate this. This might suggest a storage bottleneck or that the target is too high for the current configuration.
  • The `fio_output.txt` file contains detailed FIO output from the last run, which can be inspected for more in-depth analysis (e.g., latency, bandwidth).

Conclusion

Performing the environment setup, GPU stress testing with `gpu_burn.py`, and storage IOPS tuning with `fio_iops_tuner.sh` are vital steps in validating a VM configured with NVIDIA GPU passthrough on ESXi. These tests help ensure that the GPU is operating correctly and delivering expected performance, and that the storage subsystem can handle the I/O demands of your applications. Successful completion of these tests with satisfactory results provides confidence in the stability and capability of your virtualized high-performance computing environment.