Troubleshooting vimService Creation Failed 403 in vCenter

The error vimService creation failed: error forbidden 403 during vSphere/vCenter operations typically indicates that the client cannot establish a VIM (vSphere API) session because the HTTP request was rejected with 403 (Forbidden), commonly due to proxy interference, a wrong endpoint, or access‑control policies.



vimService creation failed: error forbidden 403 – Troubleshooting Guide

This document explains causes and a structured troubleshooting approach for vimService creation failed: error forbidden 403 seen when deploying VCSA, adding vCenter endpoints, or using tools that talk to the vSphere API.


Scope and symptoms

Applies to

  • VCSA deployment (UI/CLI installer) failing in Stage 1 or login step with vimService creation failed: error forbidden 403 (or similar).
  • Third‑party or custom clients (Terraform, pyvmomi, backup tools, etc.) calling https://<vcenter_or_esxi>:443/sdk and receiving HTTP 403.
  • Browser access to vCenter working, but thick installers or automation tools failing with 403 errors to the same vCenter/ESXi.

Common symptoms

  • Installer/utility logs contain entries such as:
    • vimService creation failed: error forbidden 403
    • A problem occurred while logging in. Verify the connection details.
  • vSphere Web Client/HTML5 UI may still be accessible via browser.
  • Problem appears only from specific client machines (e.g., Windows jump host with a proxy configured).

Root cause overview

HTTP 403 “Forbidden” indicates that the HTTP request reached the server but was rejected due to authorization, endpoint, or access policies.

In the context of vSphere, typical root causes are:

  • Wrong endpoint or target
    • Deploying VCSA or connecting a client to a non‑vSphere server (e.g., Windows IIS, generic web server) instead of an ESXi or vCenter IP/FQDN.
    • Using the wrong URL path or port instead of /sdk on 443.
  • Proxy settings and man‑in‑the‑middle interception
    • Windows system/browser proxy is used by the vSphere installer; traffic goes through an HTTP proxy that returns 403 rather than forwarding to vCenter/ESXi.
  • Access control / permissions / IP restrictions
    • vCenter restricted to specific source IP ranges or firewall rules; client address not allowed.
    • API/service disabled or restricted on a security gateway or load balancer in front of vCenter.
  • Authentication/authorization problems (vSphere side)
    • User has no permissions on any vCenter instance in a linked group, resulting in login denial and HTTP 403 from the UI or API gateway.
    • Session or token issues (e.g., stale SSO session, NoPermission) causing the API gateway to respond with 403.

Data collection

Before changing anything, gather basic info:

  1. Error details from client/installer logs
    • vCenter/VCSA installer logs (e.g., vcsa-ui-installer\logs\ or \ProgramData\VMware\ on Windows): look for vimService creation failed and surrounding lines.
    • Third‑party tool logs (Terraform, backup, etc.) for HTTP status and exact URL.
  2. Target details
    • IP/FQDN entered in the installer or tool.
    • Confirm whether this IP is vCenter, ESXi, or something else (IIS, reverse proxy, load balancer, etc.).
  3. Network path
    • Whether a proxy is configured on the client (Internet Options, browser, system proxy, corporate PAC file).
    • Firewall or WAF devices between client and vCenter.
  4. vCenter logs (if accessible)
    • Check vCenter vpxd.log and appliance reverse‑proxy logs for any 403 entries tied to the client’s IP/time.

Step‑by‑step troubleshooting

Step 1 – Verify the target endpoint

  1. From the client machine, open a browser and go to:
    • https://<target>:443/ and https://<target>:443/sdk
    • Confirm that:
      • https://<target>/ shows vCenter or ESXi login page.
      • https://<target>/sdk presents the vSphere Web Services SDK (XML/WSDL style page).
  2. If you see an IIS/Apache/web‑app page instead of vCenter/ESXi, the IP or FQDN is wrong; correct it to point to vCenter or an ESXi host.
  3. If /sdk returns 404 or 403 while the vCenter UI works via another FQDN, check:
    • Load balancer / reverse proxy configuration.
    • Whether you should use the vCenter FQDN instead of an alias (a quick CLI probe is sketched below).
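If you prefer the command line, a quick probe of the SDK endpoint shows who is actually answering. A minimal sketch, assuming curl is available on the client and vcenter01.example.com stands in for your target:

# Probe the SDK endpoint; -k skips certificate validation for this test only
curl -vk https://vcenter01.example.com/sdk -o /dev/null 2>&1 | grep -E 'HTTP/|Server:'

A Server: header from IIS/Apache, or a 403 answered by something other than vCenter, suggests the request never reached a vSphere endpoint.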

Step 2 – Eliminate proxy interference

  1. On Windows client where the error occurs, check proxy:
    • Internet Options → Connections → LAN settings → Proxy server / automatic configuration script.
  2. If a proxy or PAC is configured:
    • Temporarily disable the proxy or add the vCenter/ESXi FQDN/IP to the proxy bypass list.
  3. Re‑run the installer or client:
    • In many documented cases, disabling the proxy resolves vimService creation failed: error forbidden 403.
  4. For tools like Terraform or custom scripts, ensure environment variables like HTTPS_PROXY and HTTP_PROXY are unset for vCenter connections, or use no‑proxy lists (see the sketch below).
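As a minimal sketch of step 4 (FQDNs are placeholders), clearing or bypassing the proxy for a single shell session looks like this:

# Remove proxy settings for this shell only
unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy
# Or keep the proxy but bypass it for vSphere endpoints (CIDR support in NO_PROXY varies by tool)
export NO_PROXY="vcenter01.example.com,esxi01.example.com,10.0.0.0/8"
curl -vk https://vcenter01.example.com/sdk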

Step 3 – Validate DNS, certificates, and SSL path

  1. Confirm DNS resolution:
    • nslookup <vcenter_fqdn> from the client; ensure it resolves to the correct IP.
  2. Confirm that no SSL inspection appliance is rewriting or blocking /sdk.
  3. If a reverse proxy/WAF is used:
    • Review its rules for 403 responses on /sdk or /api.

Although certificate problems usually surface as TLS handshake failures rather than HTTP status codes, some middleware can return 403 when certificate policies fail.

Step 4 – Check vCenter permissions and SSO

If you hit 403 only after successful TCP/SSL connection to the correct vCenter:

  1. Try the same user in the vSphere Client (H5/HTML5 UI).
    • If login fails with messages like “Unable to login because you do not have permission on any vCenter Server systems connected to this client,” you may have 0 effective permissions and an SSO federation problem.
  2. Review SSO and global permissions:
    • Ensure the user or group has at least read permissions on the vCenter inventory.
    • In linked‑mode scenarios, verify that the user has permissions on at least one linked vCenter, or the client receives a denial.
  3. Check vCenter logs for NoPermission or NotAuthenticated faults around the time of the 403:
    • These appear as vim.fault.NoPermission or NotAuthenticated in logs like vpxd.log and UI traces.
  4. If using tokens or external identity providers, validate token audience, scope, and expiration; invalid tokens can surface as HTTP 403 in the API gateway (a direct session test is sketched below).
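To separate authentication problems from authorization problems, it can help to request an API session directly. A minimal check against the vCenter REST session endpoint (this path exists on vSphere 6.5+; newer releases also expose /api/session; user and FQDN are placeholders):

# 200 with a token = credentials and access OK; 401 = bad credentials; 403 = blocked before auth
curl -k -u 'administrator@vsphere.local' -X POST https://vcenter01.example.com/rest/com/vmware/cis/session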

Step 5 – Review firewall and IP access restrictions

  1. Confirm that the client IP is allowed to reach vCenter on TCP 443:
    • Use telnet <vcenter> 443 or curl -vk https://<vcenter>/sdk from the client.
  2. Check for:
    • NSX distributed firewall rules blocking this source.
    • Perimeter firewalls or load balancers configured with IP allow‑lists; unauthorized IPs can receive 403 instead of 401/timeout.
  3. If the same operation works from another jump host or subnet, suspect IP‑based access control and adjust rules accordingly.

Step 6 – Retest VCSA deployment / client operation

After applying fixes (correct endpoint, disabled proxy, updated permissions):

  • Re‑run the VCSA deployment wizard or client:
    • Confirm that the login step succeeds and the deployment progresses beyond the previous failure.
  • For automation tools, re‑run plan or the API call and confirm that the HTTP status changes from 403 to 200 (or appropriate 2xx).

Known patterns and solutions

Pattern 1 – Windows proxy causing installer 403

  • VCSA installer launched from a Windows host with corporate proxy configured in IE/Internet Options.
  • Browser access to vCenter works through proxy, but the installer gets 403 from the proxy instead of reaching vCenter /sdk.

Resolution:

  • Disable system/IE proxy or exclude vCenter/ESXi FQDNs from proxy.
  • Restart installer and retry; error disappears.

Pattern 2 – Wrong target (non‑vSphere server)

  • User points VCSA installer to a Windows server or some other web server rather than an ESXi/vCenter endpoint.
  • /sdk belongs to IIS or is missing, returning 404/403.

Resolution:

  • Identify the correct ESXi or existing vCenter that will host the new appliance and use that IP/FQDN.

Pattern 3 – API access forbidden for automation user

  • REST or SOAP API calls fail with 403 while UI logins succeed using another administrative account.
  • The automation user has insufficient vCenter permissions or is blocked by IP restrictions.

Resolution:

  • Assign the required vSphere roles/privileges to that user (or group) at appropriate scope.
  • Confirm there is no IP allow‑list blocking the client.

General recommendations

  • Always test https://<vcenter>/sdk directly from the same system that runs the installer or automation to verify connectivity and routing.
  • Avoid using internet proxies for internal vCenter/ESXi access; where unavoidable, configure no‑proxy rules for vSphere endpoints.
  • Standardize vCenter access FQDN and ensure DNS, certificates, and firewall rules all align with that FQDN.
  • For automation accounts, create dedicated service principals with clearly defined roles, and test via API tools (curl/Postman/pyvmomi) before integrating into larger workflows.

VMKPing “Invalid Argument” While Testing vMotion Network

When testing vMotion network connectivity from an ESXi host, vmkping can return Unknown interface 'vmkX': Invalid argument even though the VMkernel adapter exists and works for vMotion.

This behavior is almost always related to the vMotion TCP/IP stack or another non‑default stack (vxlan, provisioning) being used on the VMkernel interface.


Scope and prerequisites

This document applies to:

  • ESXi hosts using dedicated vMotion VMkernel adapters (often on the vMotion TCP/IP stack).
  • Environments where vmkping from ESXi fails with an “Invalid argument” or “Unknown interface” error when specifying -I vmkX.

Prerequisites:

  • Shell/SSH access to the ESXi host (Tech Support Mode / SSH enabled).
  • vSphere Client access to verify VMkernel and TCP/IP stack configuration.

Problem description and symptoms

Typical error messages

When running vmkping from ESXi to test vMotion connectivity, you might see:

  • Unknown interface 'vmk1': Invalid argument
  • vmkping: sendto() failed: Invalid argument

This usually occurs with commands like:

  • vmkping -I vmk1 <peer_vmotion_IP>
  • vmkping <peer_vmotion_IP>

Functional impact

Despite the error:

  • vMotion may still work successfully because the vMotion TCP/IP stack is functioning correctly.
  • Standard ping from ESXi or from external devices to the vMotion IPs may fail because the vMotion stack is L2‑only or has no gateway.

Root cause

vMotion TCP/IP stack behavior

VMkernel adapters can be attached to different TCP/IP stacks:

  • defaultTcpipStack (usually Management, vMotion, vSAN in simple setups)
  • vmotion (dedicated vMotion TCP/IP stack)
  • vxlan, vSphereProvisioning, etc.

Key points:

  • When a VMkernel adapter is created on the vMotion stack, the gateway option disappears in the UI because this stack is designed as an L2 network for vMotion traffic.
  • vmkping uses the VMkernel’s TCP/IP stack, not the host’s management stack, and requires explicit stack selection for non‑default stacks.

If you call vmkping without telling it which TCP/IP stack to use, it assumes defaultTcpipStack.
When the interface is actually on vmotion, this mismatch causes the Unknown interface or Invalid argument error.


Identification: confirm VMkernel and TCP/IP stack

Perform these checks from an ESXi shell:

1. List VMkernel interfaces and stacks

esxcli network ip interface list
  • Look at Name (vmk0, vmk1, …) and Netstack Instance columns to see which stack each VMkernel uses (e.g., defaultTcpipStack, vmotion, vxlan).

Alternative older command:

esxcfg-vmknic -l
  • Check the NetStack column to identify which stack is bound to each vmk.

2. List available TCP/IP stacks

esxcli network ip netstack list
  • Confirm valid stack names such as defaultTcpipStack, vmotion, vxlan, vSphereProvisioning.

3. Validate vMotion tagging and IPs

In vSphere Client:

  • For each host, open Configure → Networking → VMkernel adapters.
  • Verify:
    • Which vmk is enabled for vMotion.
    • Whether it uses the vMotion TCP/IP stack or the default stack.
    • IP address, VLAN, and port group settings.

Correct test methods for vMotion network

Option 1 – esxcli (recommended for vMotion stack)

Use esxcli network diag ping and specify the VMkernel and the netstack:

esxcli network diag ping -I vmk1 --netstack=vmotion -H <target_vmotion_vmk_IP>
  • Replace vmk1 with the local vMotion VMkernel name.
  • Replace <target_vmotion_vmk_IP> with the vMotion VMkernel IP of the peer ESXi host.

This method works even when the vMotion stack is L2‑only with no gateway.

Jumbo frame / MTU test

For MTU 9000 testing, include payload size and do‑not‑fragment options:

esxcli network diag ping -I vmk1 --netstack=vmotion -H <target_vmotion_vmk_IP> -s 8972 -d
  • For MTU 9000, payload 8972 bytes plus headers approximates the full frame and validates end‑to‑end jumbo support.

Option 2 – vmkping with netstack parameter

If you prefer vmkping directly:

vmkping -S vmotion -I vmk1 <target_vmotion_vmk_IP>

or

vmkping ++netstack=vmotion -I vmk1 <target_vmotion_vmk_IP>

Key notes:

  • -S vmotion or ++netstack=vmotion selects the vMotion TCP/IP stack.
  • Stack name is case‑sensitive and must match the value from esxcli network ip netstack list.

For MTU testing:

vmkping ++netstack=vmotion -I vmk1 -s 8972 -d <target_vmotion_vmk_IP>

Step‑by‑step troubleshooting workflow

Use this procedure when you see Invalid argument while testing vMotion.

Step 1 – Confirm VMkernel stack and role

  1. On each host, run: esxcli network ip interface list
  2. Identify:
    • Which vmk has vMotion enabled.
    • Its Netstack Instance (e.g., vmotion).
  3. In vSphere Client, verify that vMotion is enabled on that vmk and that IPs are in the correct vMotion VLAN/subnet.

Step 2 – Reproduce the error with standard vmkping

From the ESXi shell:

vmkping -I vmk1 <target_vmotion_vmk_IP>
  • If you get Unknown interface 'vmk1': Invalid argument, this confirms the mismatch between vmkping’s default stack and the interface’s actual stack.

Step 3 – Test with correct netstack

Run:

vmkping -S vmotion -I vmk1 <target_vmotion_vmk_IP>
# or
esxcli network diag ping -I vmk1 --netstack=vmotion -H <target_vmotion_vmk_IP>
  • Successful responses (ICMP replies) indicate L2 connectivity on the vMotion network.

Step 4 – Validate MTU and physical network

If connectivity works but jumbo frame tests fail:

  1. Run a jumbo ping: vmkping -S vmotion -I vmk1 -s 8972 -d <target_vmotion_vmk_IP>
  2. If it fails:
    • Check vSwitch / vDS MTU configuration.
    • Check physical NIC MTU.
    • Check upstream physical switch ports and VLAN MTU.

Step 5 – Check routing/gateway expectations

  • For vMotion TCP/IP stack, Layer 3 routing typically requires configuring a default gateway for that stack in Networking → TCP/IP configuration on the host.
  • Without a gateway or static route, vMotion stack pings to other subnets will fail even if same‑subnet pings work (see the route commands below).
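If Layer 3 reachability is required, the gateway for the vMotion stack can also be set from the CLI. A sketch with a placeholder gateway of 10.10.20.1 (verify the exact option names on your ESXi build with --help):

# Add a default route on the vMotion netstack, then confirm it landed on the right stack
esxcli network ip route ipv4 add --gateway 10.10.20.1 --network default --netstack=vmotion
esxcli network ip route ipv4 list --netstack=vmotion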

Example command set (copy/paste)

Adjust interface names and IPs before using.

# 1. Inventory VMkernel interfaces and stacks
esxcli network ip interface list
esxcfg-vmknic -l
esxcli network ip netstack list

# 2. Test vMotion vmk using vMotion stack (esxcli)
esxcli network diag ping -I vmk1 --netstack=vmotion -H 10.10.20.12

# 3. Test vMotion vmk using vmkping with netstack
vmkping -S vmotion -I vmk1 10.10.20.12
vmkping ++netstack=vmotion -I vmk1 10.10.20.12

# 4. Jumbo frame tests (MTU 9000)
esxcli network diag ping -I vmk1 --netstack=vmotion -H 10.10.20.12 -s 8972 -d
vmkping -S vmotion -I vmk1 -s 8972 -d 10.10.20.12

ESXi NFSv4 Error Decoder & Log Analyzer

NFSv4 failures in ESXi appear as NFS4ERR_* codes, lockd/portmap conflicts, and RPC timeouts in vmkernel.log. This script suite extracts all NFSv4 errors with detailed explanations, affected datastores, and remediation steps directly from live logs.

NFSv4 Error Code Reference

  • NFS4ERR_LOCK_UNAVAIL (0x0000006c) – Lock denied. Common cause: lockd conflict, port 4045 used. Fix: kill conflicting process, restart lockd.
  • NFS4ERR_STALE (0x0000001f) – Stale file handle. Common cause: NAS reboot, export changed. Fix: remount datastore, check NAS exports.
  • NFS4ERR_EXPIRED (0x0000001e) – Lease expired. Common cause: network partition >90s. Fix: check MTU, firewall 2049/TCP.
  • NFS4ERR_DELAY (0x00000016) – Server busy. Common cause: NAS overloaded. Fix: increase nfs.maxqueuesize, NAS perf.
  • NFS4ERR_IO (0x00000011) – I/O error. Common cause: NAS disk failure. Fix: check NAS alerts, failover pool.
  • NFS4ERR_BADHANDLE (0x00000002) – Invalid handle. Common cause: corrupt mount. Fix: unmount/remount NFS datastore.

NFSv4 Log Parser (nfs4-analyzer.sh)

Purpose: Extracts ALL NFSv4 errors from vmkernel.log with timestamps, datastores, and RPC details.

#!/bin/bash
# nfs4-analyzer.sh - Extract NFSv4 failures from ESXi logs
LOG_FILE="/var/run/log/vmkernel.log"
REPORT="/tmp/nfs4-errors-$(date +%Y%m%d-%H%M).csv"

echo "Timestamp,Datastore,NFS4ERR_Code,Error_Type,Severity,Server_IP,Mount_Point,Raw_Log" > $REPORT

# NFS4ERR_* codes
grep -i "NFS4ERR\|nfs.*error\|rpc.*fail\|lockd\|portmap" $LOG_FILE | while read line; do
    timestamp=$(echo $line | sed 's/^\(.*\)cpu.*/\1/')
    datastore=$(echo $line | grep -o '/vmfs/volumes/[^ ]*' | head -1 || echo 'unknown')
    server_ip=$(echo $line | grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' | head -1)
    
    # Extract NFS4ERR code
    if [[ $line =~ NFS4ERR_([A-Z_]+) ]]; then
        code=${BASH_REMATCH[1]}
        severity="HIGH"
    elif [[ $line =~ (timeout|fail|dead|unavailable|refused) ]]; then
        code=$(echo $line | grep -oE 'timeout|fail|dead|unavailable|refused')
        severity="MEDIUM"
    else
        code="OTHER"
        severity="LOW"
    fi
    
    echo "$timestamp,$datastore,$code,$severity,$server_ip,$datastore,$line" >> $REPORT
done

# Summary stats (stdout only; keep the CSV clean for downstream parsing)
echo "=== NFSv4 Error Summary ==="
awk -F, 'NR>1 {print $3" "$4" "$5}' $REPORT | sort | uniq -c | sort -nr | head -10

# Critical alerts
echo "CRITICAL (HIGH severity):"
grep ",HIGH," $REPORT | cut -d, -f1,2,3,7-

echo "Report saved: $REPORT ($(wc -l < $REPORT) entries)"

Usage: ./nfs4-analyzer.sh → /tmp/nfs4-errors-*.csv

Detailed NFSv4 Error Decoder (nfs4-decoder.py)

Purpose: Maps NFSv4 error codes to RFC 5661 explanations + ESXi-specific fixes.

#!/usr/bin/env python3
# nfs4-decoder.py - Detailed NFSv4 error explanations
import re
import sys
import pandas as pd

NFS4_ERRORS = {
    'NFS4ERR_LOCK_UNAVAIL': {
        'rfc': 'RFC 5661 Sec 14.2.1 - Lock held by another client',
        'esxi_cause': 'portd/lockd conflict on 4045/TCP, Windows NFS client interference',
        'fix': '1. `esxcli system process list | grep lockd` → kill PID\n2. Check `netstat -an | grep 4045`\n3. Restart: `services.sh restart`',
        'severity': 'CRITICAL'
    },
    'NFS4ERR_STALE': {
        'rfc': 'RFC 5661 Sec 14.2.30 - File handle no longer valid',
        'esxi_cause': 'NAS export removed, filesystem ID changed, NAS failover',
        'fix': '`esxcli storage filesystem unmount -l DATASTORE && esxcli storage filesystem mount -v nfs -h NAS_IP -s /export`',
        'severity': 'HIGH'
    },
    'NFS4ERR_EXPIRED': {
        'rfc': 'RFC 5661 Sec 14.2.9 - Lease expired',
        'esxi_cause': 'Network blip >90s, firewall dropped TCP 2049',
        'fix': '1. `vmkping -I vmk0 NAS_IP -s 8972` (Jumbo)\n2. Check ESXi firewall: `esxcli network firewall ruleset list | grep nfs`',
        'severity': 'HIGH'
    },
    'NFS4ERR_DELAY': {
        'rfc': 'RFC 5661 Sec 14.2.7 - Server temporarily unavailable',
        'esxi_cause': 'NAS RPC queue full, nfs.maxqueuesize too low',
        'fix': '`esxcli system settings advanced set -o /NFS/MaxQueueSize -i 16` → rescan',
        'severity': 'MEDIUM'
    }
}

def decode_nfs4_error(log_line):
    for error, details in NFS4_ERRORS.items():
        if re.search(error, log_line):
            ts_match = re.search(r'\[([^\]]+)', log_line)  # guard: not every line has a [timestamp]
            return {
                **details,
                'raw_line': log_line,
                'timestamp': ts_match.group(1) if ts_match else 'unknown'
            }
    return {'error': 'UNKNOWN_NFS4', 'severity': 'INFO', 'raw_line': log_line}

# Process log file
if len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        # 'nfs.*error' is a regex, so match it with re.search rather than substring containment
        errors = [decode_nfs4_error(line) for line in f
                  if 'NFS4ERR' in line or re.search(r'nfs.*error', line, re.I)]

    df = pd.DataFrame(errors)
    print(df[['timestamp', 'severity', 'esxi_cause', 'fix']].to_string(index=False))

Live NFSv4 Monitor (nfs4-live-tail.sh)

Purpose: Real-time NFSv4 error detection with instant alerts.

#!/bin/bash
# nfs4-live-tail.sh - Watch NFSv4 errors live
tail -f /var/run/log/vmkernel.log | grep --line-buffered -iE "NFS4ERR|nfs.*(error|fail|timeout|lock)" | while read line; do
    echo "$(date): $line"
    
    # Auto-run decoder (it expects a file path, so feed the line via /dev/stdin)
    echo "$line" | python3 nfs4-decoder.py /dev/stdin | head -3
    
    # Alert on critical
    if echo "$line" | grep -q "LOCK_UNAVAIL\|STALE\|EXPIRED"; then
        echo "CRITICAL NFSv4 ERROR - Check /tmp/nfs4-errors-*.csv" | mail -s "ESXi NFSv4 Failure $(hostname)" oncall@company.com
    fi
done

Run: ./nfs4-live-tail.sh (Ctrl+C to stop)

Master NFSv4 Dashboard Generator (nfs4-dashboard.py)

#!/usr/bin/env python3
# nfs4-dashboard.py - HTML dashboard with error trends
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import subprocess

# Parse all reports (skiprows=1 drops each file's CSV header)
all_reports = []
for report in subprocess.check_output('ls /tmp/nfs4-errors-*.csv', shell=True).decode().split():
    df = pd.read_csv(report, names=['ts','ds','code','sev','ip','mp','raw'], skiprows=1)
    df['time'] = pd.to_datetime(df['ts'], errors='coerce')
    all_reports.append(df)

df = pd.concat(all_reports)

# Charts (a pie trace needs a 'domain'-type subplot cell)
fig = make_subplots(rows=2, cols=2,
                    specs=[[{}, {}], [{'type': 'domain'}, {}]],
                    subplot_titles=('NFSv4 Errors by Datastore', 'Error Timeline',
                                    'Top Error Codes', 'Server Response'))

by_ds = df.groupby('ds').size().reset_index(name='count')
fig.add_trace(go.Bar(x=by_ds['ds'], y=by_ds['count']), row=1, col=1)

timeline = df.groupby(['time', 'code']).size().reset_index(name='count')
for code, grp in timeline.groupby('code'):
    fig.add_trace(go.Scatter(x=grp['time'], y=grp['count'], name=code), row=1, col=2)

by_code = df.groupby('code').size().reset_index(name='count')
fig.add_trace(go.Pie(labels=by_code['code'], values=by_code['count']), row=2, col=1)

fig.write_html('nfs4-dashboard.html')
print("Dashboard: nfs4-dashboard.html")

Quick One-Liners for NFSv4 Issues

  • Lock conflicts: esxcli system process list | grep lockd
  • Port 4045 check: netstat -an | grep 4045
  • RPC timeouts: rpcinfo -T tcp NAS_IP nfs → should return prog 100003
  • Mount status: esxcli storage filesystem list
  • NFS firewall: esxcli network firewall ruleset set --ruleset-id nfsClient --enabled true
  • Force remount: esxcli storage filesystem unmount -l DATASTORE, then remount (see the NFS4ERR_STALE fix above)

Automated Cron Setup

# /etc/cron.d/nfs4-monitor
*/5 * * * * root /scripts/nfs4-analyzer.sh >> /var/log/nfs4-monitor.log
0 * * * * root python3 /scripts/nfs4-dashboard.py

Alert Thresholds:

  • LOCK_UNAVAIL > 5/min → Page oncall
  • STALE/EXPIRED > 2 → Check NAS failover
  • DELAY > 10 → Storage team

Sample Output (Tintri NFS)

2025-12-28 19:09:00: NFS4ERR_LOCK_UNAVAIL on /vmfs/volumes/tintrinfs-prod
Cause: Windows NFS client on port 4045 conflicts with ESXi lockd
Fix: Kill PID 12345 (lockd), restart services.sh

3x NFS4ERR_STALE on 10.0.1.50:/tintri-export
Cause: Tintri controller failover
Fix: Remount datastore

Confluence Deployment

1. SCP scripts → ESXi /scripts/
2. ./nfs4-analyzer.sh → instant CSV report
3. python3 nfs4-dashboard.py → embed HTML
4. Cron + ./nfs4-live-tail.sh for oncall shifts

Pro Tip: For Tintri NFS, grep -E 'tintri|10.0.1' in logs; check controller status in Tintri Global Center.

Run ./nfs4-analyzer.sh now for your current NFSv4 issues!

ESXi SCSI Decoder & VMkernel Log Analyzer

VMkernel logs contain SCSI sense codes, path states, and HBA errors in hex format that decode storage failures like LUN timeouts, reservation conflicts, and path flaps. This script suite parses /var/run/log/vmkernel.log, decodes SCSI status, and generates troubleshooting dashboards with failed paths/host details.

SCSI Sense Code Decoder (scsi-decoder.py)

Purpose: Converts hex sense data from vmkernel logs to human-readable errors.

#!/usr/bin/env python3
# scsi-decoder.py - Decode VMware SCSI sense codes
import re
SCSI_SENSE = {
    '0x0': 'No Sense (OK)',
    '0x2': 'Not Ready',
    '0x3': 'Medium Error',
    '0x4': 'Hardware Error',
    '0x5': 'Illegal Request',
    '0x6': 'Unit Attention',
    '0x7': 'Data Protect',
    '0xb': 'Aborted Command',
    '0xe': 'Overlapped Commands Attempted'
}

ASC_QUAL = {
    '0x2800': 'LUN Not Ready, Format in Progress',
    '0x3f01': 'Removed Target',
    '0x3f07': 'Multiple LUN Reported',
    '0x4700': 'Reservation Conflict',
    '0x4c00': 'Logical Unit Failed Self-Configuration',
    '0x5506': 'Illegal Message',
    '0x0800': 'Logical Unit Communication Failure'
}

def decode_scsi(line):
    """Parse vmkernel SCSI line: [timestamp] vmkwarning: CPUx: NMP: nmp_ThrottleLogForDevice: ... VMW_SCSIERR_0xX"""
    if 'VMW_SCSIERR' not in line:
        return None
    
    sense_match = re.search(r'VMW_SCSIERR_0x([0-9a-fA-F]+)', line)
    if sense_match:
        sense = f"0x{sense_match.group(1)}"
        naa = re.search(r'naa\.([0-9a-fA-F:]+)', line)
        lun = naa.group(1) if naa else 'Unknown'
        
        ts_match = re.search(r'\[(.*?)\]', line)  # guard: sample lines may lack a [timestamp]
        return {
            'lun': lun,
            'sense_key': SCSI_SENSE.get(sense, f'Unknown: {sense}'),
            'raw_line': line.strip(),
            'timestamp': ts_match.group(1) if ts_match else 'unknown'
        }
    return None

# Usage example
log_line = "2025-12-28T18:46:00.123Z cpu5:32:VMW_SCSIERR_0xb: naa.60a9800064824b4f4f4f4f4f4f4f4f4f"
print(decode_scsi(log_line))
# Output: {'lun': '60a980006482...', 'sense_key': 'Aborted Command', ...}

VMkernel Log Parser (vmk-log-analyzer.sh)

Purpose: Real-time parsing of SCSI errors, path states, and HBA failures.

#!/bin/bash
# vmk-log-analyzer.sh - Live SCSI/Path decoder
LOG_FILE="/var/run/log/vmkernel.log"
REPORT="/tmp/scsi-report-$(date +%Y%m%d-%H%M).csv"

echo "Host,LUN,SCSI_Error,Path_State,Timestamp,Count" > $REPORT

# SCSI Errors
grep -i "VMW_SCSIERR\|scsi\|LUN\|NMP\|path dead\|timeout" $LOG_FILE | while read line; do
    host=$(hostname)
    lun=$(echo $line | grep -o 'naa\.[0-9a-f:]*' | head -1 || echo 'unknown')
    error=$(echo $line | grep -o 'VMW_SCSIERR_[0-9a-f]*\|timeout\|dead\|failed' | head -1)
    timestamp=$(echo $line | grep -o '\[[0-9TZ.-]*' | sed 's/\[//;s/\]//')
    
    echo "$host,$lun,$error,$timestamp,1" >> $REPORT.tmp
done

# Path States
grep -i "path state\|working\|dead\|standby\|active" $LOG_FILE | while read line; do
    host=$(hostname)
    lun=$(echo $line | grep -o 'naa\.[0-9a-f:]*' | head -1)
    path_state=$(echo $line | grep -oE '(working|dead|standby|active|disabled)')
    timestamp=$(echo $line | grep -o '\[[0-9TZ.-]*' | sed 's/\[//;s/\]//')
    
    echo "$host,$lun,$path_state,$timestamp,1" >> $REPORT.tmp
done

# Aggregate & sort (uniq -c prefixes the count; move it to the final Count column)
sort $REPORT.tmp | uniq -c | sort -nr | awk '{print $2","$1}' >> $REPORT
rm $REPORT.tmp

echo "SCSI/Path Report: $(wc -l < $REPORT) entries"
tail -20 $REPORT

Cron: */5 * * * * /scripts/vmk-log-analyzer.sh (5-minute intervals).

Live Dashboard Generator (scsi-dashboard.py)

Purpose: Creates HTML table with SCSI errors, path status, and failure trends.

#!/usr/bin/env python3
# scsi-dashboard.py - Interactive SCSI failure dashboard
import glob
import pandas as pd
import plotly.express as px
from datetime import datetime

# read_csv does not expand wildcards; glob the analyzer reports and concatenate them
df = pd.concat(pd.read_csv(r) for r in glob.glob('/tmp/scsi-report-*.csv'))
df = df.rename(columns={'SCSI_Error': 'Error'})

# Top failing LUNs
top_luns = df.groupby('LUN')['Count'].sum().sort_values(ascending=False).head(10)

# Path state pie chart
path_pie = px.pie(df, names='Path_State', values='Count', title='Path States Distribution')

# Error timeline
df['Time'] = pd.to_datetime(df['Timestamp'], errors='coerce')
error_timeline = px.line(df.groupby(['Time','Error']).size().reset_index(name='Count'), 
                        x='Time', y='Count', color='Error', title='SCSI Errors Over Time')

# HTML Report
html = f"""
<html>
<head><title>ESXi SCSI Status - {datetime.now()}</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css">
</head>
<body>
<div class="container">
<h2>🔍 ESXi Storage Path & SCSI Errors</h2>
<div class="row">
<div class="col-md-6">{path_pie.to_html(full_html=False, include_plotlyjs='cdn')}</div>
<div class="col-md-6">{error_timeline.to_html(full_html=False, include_plotlyjs='cdn')}</div>
</div>

<h3>Top Failing LUNs</h3>
{df.pivot_table(values='Count', index='LUN', columns='Error', aggfunc='sum', fill_value=0).to_html()}
</div>
</body>
</html>
"""
with open('scsi-dashboard.html', 'w') as f:
    f.write(html)

Master Orchestrator (storage-scsi-monitor.py)

Purpose: Monitors all ESXi hosts, parses logs, generates alerts.

#!/usr/bin/env python3
import subprocess
import paramiko
import pandas as pd
from scsi_decoder import decode_scsi  # From above

ESXI_HOSTS = ['esxi1.example.com', 'esxi2.example.com']
ALERT_THRESHOLD = 10  # Errors per 5min

def analyze_host(host):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username='root', password='esxi_password')

    # Tail vmkernel log + run analyzer on the host
    stdin, stdout, stderr = ssh.exec_command("""
    tail -500 /var/run/log/vmkernel.log | grep -E 'SCSIERR|path dead|timeout' > /tmp/scsi.log &&
    /scripts/vmk-log-analyzer.sh
    """)
    stdout.channel.recv_exit_status()  # wait for the remote analyzer to finish

    # The report lives on the ESXi host; pull the newest one back before parsing
    sftp = ssh.open_sftp()
    reports = sorted(r for r in sftp.listdir('/tmp') if r.startswith('scsi-report-'))
    local_csv = f'/tmp/{host}-scsi-report.csv'
    if reports:
        sftp.get(f'/tmp/{reports[-1]}', local_csv)
    sftp.close()
    ssh.close()

    errors = pd.read_csv(local_csv)
    critical_luns = errors[errors['Count'] > ALERT_THRESHOLD]

    if not critical_luns.empty:
        print(f"ALERT {host}: {len(critical_luns)} critical LUNs")
        for _, row in critical_luns.iterrows():
            print(f" LUN {row['LUN']}: {row['SCSI_Error']} x{row['Count']}")

# Run across all hosts
for host in ESXI_HOSTS:
    analyze_host(host)

subprocess.run(['python3', 'scsi-dashboard.py'])
print("SCSI dashboard: scsi-dashboard.html")

One-Liner SCSI Commands

  • Live SCSI errors: tail -f /var/run/log/vmkernel.log | grep SCSIERR
  • Path status: esxcli storage core path list | grep -E 'dead|active'
  • LUN reservations: esxcli storage core device list | grep -i reservation
  • HBA errors: esxcli storage core adapter stats get -A vmhbaX
  • Decode sense manually: echo "0xb" | python3 scsi-decoder.py

Common SCSI Errors & Fixes

  • 0xb (Aborted Command) – Queue overflow, SAN busy. Fix: increase HBA queue depth, check SAN zoning.
  • 0x6 (Unit Attention) – LUN reset. Fix: rescan with esxcli storage core adapter rescan --all.
  • 0x47 (Reservation Conflict) – vMotion conflict. Fix: stagger migrations, check esxtop %RESV.
  • Path Dead – Cable/switch/HBA failure. Fix: esxcli storage core path set -p <path> -s active.

Alerting Integration

# Email critical LUNs (use cat so the shell glob expands across multiple reports)
if [ $(cat /tmp/scsi-report-*.csv | wc -l) -gt 5 ]; then
    cat /tmp/scsi-report-*.csv | mail -s "ESXi SCSI Errors $(hostname)" storage-team@company.com
fi

Confluence Deployment

1. SCP scripts to ESXi: `/scripts/`
2. Cron: `*/5 * * * * /scripts/vmk-log-analyzer.sh`
3. Jumpbox: Run `storage-scsi-monitor.py` hourly
4. Embed: `{html}http://scsi-dashboard.html{html}`

Sample Output:

esxi1: 12 critical LUNs
LUN naa.60a980...: Aborted Command x8
Path vmhba32:C0:T0:L0 dead x4

Pro Tip: Filter Tintri LUNs: grep -E 'tintri|naa.60a9' /var/run/log/vmkernel.log

Run python3 storage-scsi-monitor.py for an instant dashboard!

VMware Storage Performance Testing Suite

ESXi hosts provide fio, ioping, and esxtop for storage benchmarking directly from the CLI, while vCenter PowerCLI aggregates performance across clusters/datastores. This script suite generates IOPS, latency, and throughput charts viewable in Confluence/HTML dashboards.

Core Testing Engine (fio-perf-test.sh)

Purpose: Run standardized fio workloads (4K random, 64K seq) on VMFS/NFS datastores.

#!/bin/bash
# fio-perf-test.sh - Run on ESXi via SSH
# Pick the first mounted filesystem (skip the two header lines of esxcli output)
DATASTORE="$(esxcli storage filesystem list | awk 'NR>2 {print $1; exit}')"
TEST_DIR="$DATASTORE/perf-test"
# Path assumes a fio binary staged on the host; stock ESXi does not bundle fio
FIO_TEST="/usr/lib/vmware/fio/fio"

mkdir -p $TEST_DIR
cd $TEST_DIR

# fio job files use INI syntax
cat > fio-random-4k.ini << EOF
[global]
ioengine=libaio
direct=1
size=1G
time_based
runtime=60
group_reporting
directory=$TEST_DIR

[rand-read]
rw=randread
bs=4k
numjobs=4
iodepth=32
filename=testfile.dat

[rand-write]
rw=randwrite
bs=4k
numjobs=4
iodepth=32
filename=testfile.dat
EOF

# Run tests
$FIO_TEST fio-random-4k.ini > /tmp/fio-4k-results.txt
$FIO_TEST --name=seq-read --rw=read --bs=64k --size=4G --runtime=60 --direct=1 --numjobs=1 --iodepth=32 --filename=$TEST_DIR/testfile.dat >> /tmp/fio-seq-results.txt

# Cleanup
rm -rf $TEST_DIR/*
echo "$(hostname),$(date),$(grep read /tmp/fio-4k-results.txt | tail -1 | awk '{print $3}'),$(grep IOPS /tmp/fio-4k-results.txt | grep read | awk '{print $2}')" >> /tmp/storage-perf.csv

Cron schedule: 0 2 * * 1 /scripts/fio-perf-test.sh (weekly baseline).

vCenter PowerCLI Aggregator (StoragePerf.ps1)

Purpose: Collects historical perf + runs live esxtop captures across all hosts.

# StoragePerf.ps1 - vCenter Storage Performance Dashboard
Connect-VIServer vcenter.example.com

$Report = @()
$Clusters = Get-Cluster

foreach ($Cluster in $Clusters) {
    $Hosts = Get-VMHost -Location $Cluster
    foreach ($EsxHost in $Hosts) {   # $Host is a reserved automatic variable, so use another name
        # Live esxtop data (assumes remote execution is available on the host;
        # Invoke-VMScript targets guest VMs, so substitute SSH or another host-level method as needed)
        $Esxtop = Invoke-VMScript -VM $EsxHost -ScriptText {
            esxtop -b -a -d 30 | grep -E 'DAVG|%LAT|IOPS' | tail -20
        } -GuestCredential (Get-Credential)

        # Historical datastore stats (split the read and write series by MetricId)
        $Datastores = Get-Datastore -VMHost $EsxHost
        foreach ($DS in $Datastores) {
            $Perf = $DS | Get-Stat -Stat "datastore.read.average","datastore.write.average" -MaxSamples 24 -IntervalMins 5
            $ReadAvg = ($Perf | Where-Object MetricId -eq 'datastore.read.average' | Measure-Object -Property Value -Average).Average
            $WriteAvg = ($Perf | Where-Object MetricId -eq 'datastore.write.average' | Measure-Object -Property Value -Average).Average

            $Report += [PSCustomObject]@{
                Host = $EsxHost.Name
                Datastore = $DS.Name
                FreeGB = [math]::Round($DS.FreeSpaceGB,1)
                ReadAvgKBps = $ReadAvg
                WriteAvgKBps = $WriteAvg
                EsxtopLatency = ($Esxtop | Select-String "DAVG" | Select-Object -Last 1).ToString().Split()[2]
            }
        }
    }
}

# Export CSV for charts
$Report | Export-Csv "StoragePerf-$(Get-Date -f yyyy-MM-dd).csv" -NoTypeInformation

# Generate HTML dashboard
$Report | ConvertTo-Html -Property Host,Datastore,FreeGB,ReadAvgKBps,WriteAvgKBps,EsxtopLatency -Title "Storage Performance" | 
    Out-File "storage-dashboard.html"

Performance Chart Generator (perf-charts.py)

Purpose: Converts CSV data to interactive Plotly charts for Confluence.

#!/usr/bin/env python3
# perf-charts.py - Generate HTML charts from CSV
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import sys

df = pd.read_csv(sys.argv[1])

# IOPS vs Latency scatter
fig1 = px.scatter(df, x='ReadAvgKBps', y='EsxtopLatency', 
                 size='FreeGB', color='Host', hover_name='Datastore',
                 title='Storage Read Performance vs Latency',
                 labels={'ReadAvgKBps':'Read KBps', 'EsxtopLatency':'Avg Latency (ms)'})

# Throughput bar chart
fig2 = px.bar(df, x='Datastore', y=['ReadAvgKBps','WriteAvgKBps'], 
              barmode='group', title='Read/Write Throughput by Datastore')

# Combined dashboard
fig = make_subplots(rows=2, cols=1, subplot_titles=('IOPS vs Latency', 'Read/Write Throughput'))
fig.add_trace(fig1.data[0], row=1, col=1)
fig.add_trace(fig2.data[0], row=2, col=1)
fig.add_trace(fig2.data[1], row=2, col=1)

fig.write_html('storage-perf-dashboard.html')
print("Charts saved: storage-perf-dashboard.html")

Usage: python3 perf-charts.py StoragePerf-2025-12-28.csv

Master Orchestrator (storage-benchmark.py)

Purpose: Runs fio tests on all ESXi hosts + generates dashboard.

#!/usr/bin/env python3
import paramiko
import subprocess
import pandas as pd
from datetime import datetime

ESXI_HOSTS = ['esxi1.example.com', 'esxi2.example.com']
VCENTER = 'vcenter.example.com'

def run_fio(host):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username='root', password='your-esxi-password')

    # Copy & run fio script
    stdin, stdout, stderr = ssh.exec_command('wget -O /tmp/fio-test.sh https://your-confluence/scripts/fio-perf-test.sh && chmod +x /tmp/fio-test.sh && /tmp/fio-test.sh')
    result = stdout.read().decode()
    ssh.close()
    return result

# Execute tests
perf_data = []
for host in ESXI_HOSTS:
    print(f"Testing {host}...")
    run_fio(host)
    perf_data.append({'Host': host, 'TestTime': datetime.now()})

# Pull PowerCLI report
subprocess.run(['pwsh', '-File', 'StoragePerf.ps1'])

# Generate charts
subprocess.run(['python3', 'perf-charts.py', f'StoragePerf-{datetime.now().strftime("%Y-%m-%d")}.csv'])

print("Storage benchmark complete. View storage-perf-dashboard.html")

Confluence Chart Embedding

HTML Macro (paste storage-perf-dashboard.html content):

{html}
<!-- paste storage-perf-dashboard.html content here -->
{html}

CSV Table with Inline Charts:

||Host||Datastore||Read IOPS||Latency||Chart||
|esxi1|datastore1|2450|2.3ms|![Read IOPS|width=150px,height=100px](storage-esxi1.png)|

Automated Dashboard Cronjob

#!/bin/bash
# /etc/cron.d/storage-perf
# Daily 3AM: Test + upload to Confluence
0 3 * * * root /usr/local/bin/storage-benchmark.py >> /var/log/storage-perf.log 2>&1

Output Files:

  • /tmp/storage-perf.csv → Historical trends
  • storage-perf-dashboard.html → Interactive Plotly charts
  • /var/log/storage-perf.log → Audit trail

Sample Output Charts

Expected Results (Tintri VMstore baseline):

textDatastore: tintri-vmfs-01
4K Random Read: 12,500 IOPS @ 1.8ms
4K Random Write: 8,200 IOPS @ 2.4ms
64K Seq Read: 450 MB/s
64K Seq Write: 380 MB/s

Pro Tips & Alerts

☐ Alert if Latency > 5ms: Add to PowerCLI `if($EsxtopLatency -gt 5) {Send-MailMessage}`
☐ Tintri-specific: Add `esxtop` filter for Tintri LUN paths
☐ NFS tuning: Test with `nfs.maxqueuesize=8` parameter
☐ Compare baselines: Git commit CSV files weekly

Run first: python3 storage-benchmark.py --dry-run to validate hosts/configs.

Basic PS scripts for VMware Admin

VMware admins rely on PowerCLI, esxcli one-liners, and Python scripts for daily tasks like health checks, VM migrations, and storage monitoring, saving hours of manual work.

Daily Health Check Script (PowerCLI)

Purpose: Run every morning to spot issues across cluster, hosts, VMs, and datastores.

# VMware Daily Health Check - Save as HealthCheck.ps1
Connect-VIServer vcenter.example.com

$Report = @()
$Clusters = Get-Cluster
foreach ($Cluster in $Clusters) {
    $Hosts = Get-VMHost -Location $Cluster
    $Report += [PSCustomObject]@{
        Cluster = $Cluster.Name
        HostsDown = ($Hosts | Where {$_.State -ne 'Connected'}).Count
        VMsDown = (Get-VM -Location $Hosts | Where {$_.PowerState -ne 'PoweredOn'}).Count
        DatastoresFull = (Get-Datastore -Location $Hosts | Where {$_.FreeSpaceGB/($_.CapacityGB)*100 -lt 20}).Count
        HighCPUHosts = ($Hosts | Where {($_.CpuUsageMhz / $_.CpuTotalMhz * 100) -gt 80}).Count   # compare as a percentage, not raw MHz
    }
}
$Report | Export-Csv "DailyHealth-$(Get-Date -f yyyy-MM-dd).csv" -NoTypeInformation
Send-MailMessage -To admin@company.com -Subject "VMware Health Report" -Body "Check attached CSV" -Attachments "DailyHealth-$(Get-Date -f yyyy-MM-dd).csv"

Schedule: Windows Task Scheduler daily 8AM, outputs CSV + email summary.

Host Reboot & Maintenance Script

Purpose: Graceful host maintenance with VM evacuation.

# HostMaintenance.ps1
param($HostName, $MaintenanceReason)

$HostObj = Get-VMHost $HostName
if ((Get-VM -Location $HostObj).Count -gt 0) {
    Move-VM -VM (Get-VM -Location $HostObj) -Destination (Get-Cluster -VMHost $HostObj | Get-VMHost | Where {$_.State -eq 'Connected'} | Select -First 1)
}
Set-VMHost $HostName -State Maintenance -Confirm:$false
Write-Output "Rebooting $HostName for: $MaintenanceReason"   # Restart-VMHost has no -Reason parameter
Restart-VMHost $HostName -Confirm:$false

Usage: ./HostMaintenance.ps1 esxi-01 "Patching"

Storage Rescan & Connectivity Script (ESXi SSH)

Purpose: Fix “LUNs not visible” after SAN changes – run on all hosts.

#!/bin/bash
# storage-rescan.sh - Run via SSH or Ansible
for adapter in $(esxcli storage core adapter list | grep -E 'vmhba[0-9]+' | awk '{print $1}'); do
    esxcli storage core adapter rescan -A $adapter
done
esxcli storage filesystem list | grep -E 'Mounted|Accessible'
# Column position in the sendtarget list output may vary by ESXi version; verify before scripting
vmkping -I vmk1 $(esxcli iscsi adapter discovery sendtarget list | awk '{print $7}' | tail -1)
echo "Storage rescan complete on $(hostname)"

Cron: */30 * * * * /scripts/storage-rescan.sh >> /var/log/storage-rescan.log

VM Snapshot Cleanup Script

Purpose: Auto-delete snapshots >7 days old to prevent datastore exhaustion.

# SnapshotCleanup.ps1
$OldSnapshots = Get-VM | Get-Snapshot | Where {$_.CreateTime -lt (Get-Date).AddDays(-7)}
foreach ($Snap in $OldSnapshots) {
    Remove-Snapshot $Snap -Confirm:$false -RunAsync
    Write-Output "Deleted snapshot $($Snap.Name) on $($Snap.VM.Name)"
}

Alert on large snaps: Add Where {$_.SizeGB -gt 10} filter.

iSCSI Path Failover Test Script

Purpose: Verify multipath redundancy before maintenance.

#!/bin/bash
# iscsi-path-test.sh
echo "=== iSCSI Path Status ==="
esxcli iscsi session list
echo "=== Active Paths ==="
esxcli storage core path list | grep -E 'working|dead'
echo "=== LUN Paths ==="
esxcli storage nmp device list | grep -E 'path'

Run weekly: Documents path count for compliance audits.

NFS Mount Verification Script

Purpose: Check all NFS datastores connectivity.

#!/bin/bash
# nfs-check.sh
# esxcfg-nas -l prints: "<name> is <share> from <server> mounted ..." (field 5 = server address)
for server in $(esxcfg-nas -l | awk '{print $5}'); do
    vmkping -c 3 -I vmk0 $server || echo "NFS server $server NOT reachable!"
done
esxcli storage filesystem list | grep -i NFS || echo "No NFS datastores mounted!"

Performance Monitoring Script (esxtop Capture)

Purpose: Collect 5min esxtop data during issues.

#!/bin/bash
# perf-capture.sh
# 30 samples at 10-second intervals ≈ 5 minutes of data
esxtop -b -a -d 10 -n 30 > /tmp/esxtop-$(date +%Y%m%d-%H%M%S).csv
echo "Captured $(ls -lh /tmp/esxtop-*.csv | wc -l) perf files"

Analyze: esxtop replay or Excel pivot on %LATENCY, DAVG.

One-Liner Toolkit

  • List locked VMs: vmkfstools -D /vmfs/volumes/datastore/vm.vmdk | grep MAC
  • Check VMFS health: voma -m vmfs -f check -d /vmfs/devices/disks/naa.xxx:1
  • Reset iSCSI adapter: esxcli iscsi adapter set -A vmhbaXX -e false; esxcli iscsi adapter set -A vmhbaXX -e true
  • Host connection test: esxcli network ip connection list | grep ESTABLISHED
  • Datastore I/O: esxtop (press 'd'), look for %UTIL > 80%

Deployment Guide

  1. PowerCLI setup: Install-Module VMware.PowerCLI on the jumpbox.
  2. ESXi scripts: SCP to /scripts/, chmod +x, add to crontab.
  3. Confluence Integration: Embed scripts as <pre> blocks, add “Copy” buttons.
  4. Alerting: Pipe outputs to Slack/Teams via webhook or email.
  5. Version Control: Git repo per datacenter, tag releases.

NFS APD on ESXi – End‑to‑End Troubleshooting Guide

What APD Means for NFS Datastores

For NFS, an APD essentially means the ESXi host has lost communication on the TCP connection to the NFS server for long enough that the storage stack starts treating the datastore as unavailable. An internal APD timer starts after a few seconds of no communication on the NFS TCP stream; if this continues for roughly 140 seconds (the default value of Misc.APDTimeout), the host declares an APD timeout for that datastore.
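You can confirm the timeout value in effect on a host before building a timeline; 140 seconds is the default:

esxcli system settings advanced list -o /Misc/APDTimeout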

Once a datastore is in APD, VM I/O continues to retry while management operations such as browsing the datastore, mounting ISOs, or snapshot consolidation can start to fail quickly. From a vSphere client perspective the datastore may appear dimmed or inaccessible, and VMs can look hung if they rely heavily on that datastore.

How APD Shows Up in ESXi Logs

When an APD event occurs, vmkernel and vobd are the primary places to look. On recent ESXi versions the logs are typically under /var/run/log/, though many environments still collect from /var/log/vmkernel.log and /var/log/vobd.log.

The lifecycle of a single APD usually looks like this in vobd.log:

APD start, for example:
[APDCorrelator] ... [vob.storage.apd.start] Device or filesystem with identifier [8a5a1336-3d574c6d] has entered the All Paths Down state.

APD timeout, after about 140 seconds:
[APDCorrelator] ... [esx.problem.storage.apd.timeout] Device or filesystem with identifier [8a5a1336-3d574c6d] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.

APD exit when the host finally sees the storage again:
[APDCorrelator] ... [esx.problem.storage.apd.recovered] Device or filesystem with identifier [8a5a1336-3d574c6d] has exited the All Paths Down state.

In vmkernel.log the same event is reflected by the APD handler messages. A typical sequence is:

StorageApdHandler: 248: APD Timer started for ident [8a5a1336-3d574c6d]
StorageApdHandler: 846: APD Start for ident [8a5a1336-3d574c6d]!
StorageApdHandler: 902: APD Exit for ident [8a5a1336-3d574c6d]

On NFS datastores you usually see loss‑of‑connectivity messages around the same time, such as warnings from the NFS client that it has lost connection to the server or that latency has spiked:

WARNING: NFS: NFSVolumeLatencyUpdate: NFS volume <datastore> performance has deteriorated. I/O latency increased ... Exceeded threshold 10000(us)

These messages are often the early warning before APD actually triggers.

NFSv4‑Specific Errors in vmkernel and vobd

With NFSv4.1 the client maintains stateful sessions and slot tables, so problems are not always just simple timeouts. ESXi may log warnings such as NFS41SessionSlotUnassign when the available session slots drop too low; when this happens under heavy load it can lead to session resets and eventually to APD on that datastore if the session cannot be re‑established cleanly.

Another category of issues are NFSv4 errors like NFS4ERR_SHARE_DENIED that show up if an OPEN call conflicts with an existing share reservation on the same file. While these errors do not in themselves mean APD, they often appear in the same time window when applications are competing for locks or when the NFS server is under stress and struggling with state management; the end result can be perceived as I/O hangs on the ESXi side.

When reviewing logs, it is useful to separate pure connectivity problems (socket resets, RPC timeouts) from v4‑specific state problems (session slot issues, share or lock errors). The former almost always have a clear APD signature in vobd; the latter may manifest as intermittent stalls or file‑level errors without a full datastore APD.

What to Look For on the NFS Server

Once you have the APD start and exit timestamps from ESXi, the next step is to line those up with the storage array or NFS server logs. On an ONTAP‑style array, for example, APD windows on the ESXi side often correspond to connection reset entries such as:

kernel: Nblade.nfsConnResetAndClose:error]: Shutting down connection with the client ... network data protocol is NFS ... client IP address:port is x.x.x.x:yyyy ... reason is CSM error - Maximum number of rewind attempts has been exceeded

This type of message indicates that the NFS server terminated the TCP session to the ESXi host, typically due to internal error handling or congestion. If the server is busy or recovering from a failover, there might also be log lines for node failover, LIF migration, or high latency on the backend disks at the same time.

On general Linux NFS servers, the relevant information is usually in /var/log/messages or /var/log/syslog. Around the APD time you want to see whether there were RPC timeouts, transport errors, NIC resets, or NFS service restarts for the host IP that corresponds to the ESXi VMkernel interface. If the issue is configuration‑related (for example, export rules suddenly not matching, Kerberos failures, or NFSv4 grace periods), that also tends to show clearly in these logs.

Other platforms show similar patterns. Hyperconverged solutions may log controller failovers or filesystem service restarts in their own management logs at the same timestamps that ESXi reports APD. In many documented cases, APD is ultimately traced to a short loss of network connectivity or to the NFS service being restarted while ESXi still has active sessions.

Practical Troubleshooting Workflow

In practice, troubleshooting an NFS APD usually starts with a simple question: did all hosts and datastores see APD at the same time, or was the event limited to a single host, a single datastore, or a subset of the fabric? A single host and one datastore tends to point to a host‑side or network issue, such as a NIC problem or VLAN mis‑tag; simultaneous APDs across multiple hosts and the same datastore are more likely to be array‑side or network‑core events.

From the ESXi side, the first task is to build a clear timeline. Grab the vobd and vmkernel logs, extract all the vob.storage.apd messages, and list for each device or filesystem identifier when APD started, whether it hit the 140‑second timeout, and when it exited. Once you have the APD window, you can overlay any NFS warnings, networking errors, or TCP issues that appear in vmkernel around those times. This timeline is often more useful than individual error messages because it tells you exactly how long the host was blind to the datastore.
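A minimal pair of greps, using the log locations noted above, pulls the raw material for that timeline:

# APD lifecycle events (start / timeout / recovered) with timestamps
grep -iE 'apd.start|apd.timeout|apd.recovered' /var/run/log/vobd.log

# APD handler activity and NFS warnings around the same window
grep -E 'StorageApdHandler|WARNING: NFS' /var/run/log/vmkernel.log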

In parallel, check the current state of the environment. On an affected host, esxcli storage filesystem list will confirm whether the NFS datastore is still mounted, in an inaccessible state, or has recovered. If the datastore is still visible but VMs are sluggish, look for ongoing NFS latency messages or packet‑loss symptoms; if the datastore has disappeared entirely from the host view, then the focus shifts more to export definitions, DNS, routing, and the NFS service itself.

Once the ESXi view is clear, move to the NFS server and the switching infrastructure. Using the APD timestamps, review array or server logs for connection resets, session drops, failovers, or heavy latency. If, for instance, the array log shows that the connection from the ESXi IP was reset because of a TCP or congestion issue exactly at the APD start time, the root cause is probably somewhere between that controller and the host. In environments where network packet loss triggers slow‑start behavior and repeated retransmissions, the effective throughput can collapse to the point that ESXi perceives it as an APD even though the interface never technically goes down.

A common outcome of this analysis is that the real problem is either a transient network issue (link flap, misconfigured MTU, queue drops) or a storage‑side transient (controller failover, NFS daemon restart). Addressing that underlying cause usually prevents further APDs. If the APD condition persists or if the host has been stuck in APD for an extended period, many vendors recommend a controlled reboot of affected ESXi hosts after the storage problem has been resolved, to clear any stale device state and residual APD references.

PowerCLI Scripts for VMware Daily Administration and Reporting

Prerequisites

Before running these scripts, ensure you have the VMware PowerCLI module installed and are connected to your vCenter Server. You can connect by running the following command in your PowerShell terminal: Connect-VIServer -Server Your-vCenter-Server-Address

Script 1: General VM Inventory Report

This script gathers essential information about all virtual machines in your environment and exports it to a CSV file for easy analysis.

# Description: Exports a detailed report of all VMs to a CSV file.
# Usage: Run the script after connecting to vCenter.
Get-VM | Select-Object Name, PowerState, NumCpu, MemoryGB, UsedSpaceGB, ProvisionedSpaceGB,
    @{N='Datastore';E={[string]::Join(',', (Get-Datastore -Id $_.DatastoreIdList))}},
    @{N='ESXiHost';E={$_.VMHost.Name}},
    @{N='ToolsStatus';E={$_.ExtensionData.Guest.ToolsStatus}} |
    Export-Csv -Path .\VM_Inventory_Report.csv -NoTypeInformation
Write-Host 'VM Inventory Report has been generated: VM_Inventory_Report.csv'


Script 2: VM Performance Report (CPU & Memory)

This script checks the average CPU and memory usage for all powered-on VMs over the last 24 hours and exports any that exceed a defined threshold (e.g., 80%).

# Description: Identifies VMs with high CPU or Memory usage over the last day.
# Usage: Adjust the $threshold variable as needed.
$threshold = 80 # CPU/Memory Usage Percentage Threshold
$vms = Get-VM | Where-Object { $_.PowerState -eq 'PoweredOn' }
$report = @()
foreach ($vm in $vms) {
    $stats = Get-Stat -Entity $vm -Stat cpu.usagemhz.average, mem.usage.average -Start (Get-Date).AddDays(-1) -IntervalMins 5
    $avgCpu = ($stats | where MetricId -eq 'cpu.usagemhz.average' | Measure-Object -Property Value -Average).Average
    $avgMem = ($stats | where MetricId -eq 'mem.usage.average' | Measure-Object -Property Value -Average).Average
    if ($avgCpu -and $avgMem) {
        # Approximate CPU demand against host clock speed; mem.usage.average is already a percentage
        $cpuUsagePercent = [math]::Round(($avgCpu / ($vm.NumCpu * $vm.VMHost.CpuTotalMhz)) * 100, 2)
        $memUsagePercent = [math]::Round($avgMem, 2)
        if ($cpuUsagePercent -gt $threshold -or $memUsagePercent -gt $threshold) {
            $report += New-Object PSObject -Property @{
                VMName = $vm.Name
                AvgCPUUsagePct = $cpuUsagePercent
                AvgMemoryUsagePct = $memUsagePercent
            }
        }
    }
}
$report | Export-Csv -Path .\VM_High_Performance_Report.csv -NoTypeInformation
Write-Host 'High Performance Report has been generated: VM_High_Performance_Report.csv'


Script 3: ESXi Host Compute Resources Left

This script reports on the available CPU and Memory resources for each ESXi host in your cluster, helping you plan for capacity.

# Description: Reports the remaining compute resources on each ESXi host.
# Usage: Run the script to get a quick overview of host capacity.
Get-VMHost | Select-Object Name,
    @{N='CpuUsageMHz';E={$_.CpuUsageMhz}},
    @{N='CpuTotalMHz';E={$_.CpuTotalMhz}},
    @{N='CpuAvailableMHz';E={$_.CpuTotalMhz - $_.CpuUsageMhz}},
    @{N='MemoryUsageGB';E={[math]::Round($_.MemoryUsageGB, 2)}},
    @{N='MemoryTotalGB';E={[math]::Round($_.MemoryTotalGB, 2)}},
    @{N='MemoryAvailableGB';E={[math]::Round($_.MemoryTotalGB - $_.MemoryUsageGB, 2)}} | Format-Table


Script 4: Report on Powered-Off VMs

This simple script quickly lists all virtual machines that are currently in a powered-off state.

# Description: Lists all VMs that are currently powered off.
# Usage: Run the script to find unused or decommissioned VMs.
Get-VM | Where-Object { $_.PowerState -eq 'PoweredOff' } |
    Select-Object Name, VMHost, @{N='LastModified';E={$_.ExtensionData.Config.Modified}} |
    Export-Csv -Path .\Powered_Off_VMs.csv -NoTypeInformation
Write-Host 'Powered Off VMs report has been generated: Powered_Off_VMs.csv'


Script 5: Audit Who Powered Off a VM

This script searches the vCenter event logs from the last 7 days to find who initiated a ‘power off’ task on a specific VM.

# Description: Finds the user who powered off a specific VM within the last week.
# Usage: Replace 'Your-VM-Name' with the actual name of the target VM.
$vmName = 'Your-VM-Name'
$vm = Get-VM -Name $vmName
Get-VIEvent -Entity $vm -MaxSamples ([int]::MaxValue) -Start (Get-Date).AddDays(-7) |
    Where-Object { $_.GetType().Name -eq 'VmPoweredOffEvent' } |
    Select-Object CreatedTime, UserName, FullFormattedMessage | Format-List


Script 6: Check for ESXi Host Crashes or Disconnections

This script checks for ESXi host disconnection events or host error events in the vCenter logs over the past 30 days, which can indicate a crash or network issue (Purple Screen of Death – PSOD).

# Description: Searches for host disconnection or error events in the last 30 days.
# Usage: Run this to investigate potential host stability issues.
Get-VIEvent -MaxSamples ([int]::MaxValue) -Start (Get-Date).AddDays(-30) |
    Where-Object { $_.GetType().Name -in ('HostCnxFailedEvent', 'HostDisconnectedEvent', 'HostEsxGenericPanicEvent', 'EnteredMaintenanceModeEvent') } |
    Select-Object CreatedTime, @{N='HostName';E={$_.Host.Name}}, FullFormattedMessage |
    Sort-Object CreatedTime -Descending |
    Export-Csv -Path .\Host_Crash_Events.csv -NoTypeInformation
Write-Host 'Host crash/disconnection event report has been generated: Host_Crash_Events.csv'


Best Practice Guide: Kubernetes and NAS on VMware

This guide provides a detailed, step-by-step approach to designing and implementing a robust Kubernetes environment that utilizes Network Attached Storage (NAS) on a VMware vSphere platform. Following these best practices will ensure a scalable, resilient, and performant architecture.

Core Design Principles

Separation of Concerns: Keep your storage (NAS), compute (VMware), and orchestration (Kubernetes) layers distinct but well-integrated. This simplifies management and troubleshooting.

Leverage the CSI Standard: Always use a Container Storage Interface (CSI) driver for integrating storage. This is the Kubernetes-native way to connect to storage systems and is vendor-agnostic.

Network Performance is Key: The network is the backbone connecting your K8s nodes (VMs) to the NAS. Dedicate sufficient bandwidth and low latency links for storage traffic.

High Availability (HA): Design for failure. This includes using a resilient NAS appliance, VMware HA for your K8s node VMs, and appropriate Kubernetes deployment strategies.

Granular Access Control: Implement strict permissions on your NAS exports and use Kubernetes Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) to manage access.

Step-by-Step Implementation Guide

Here is a detailed workflow for setting up your environment from the ground up.

1. VMware Environment Preparation

ESXi Hosts & vCenter: Ensure you are running a supported version of vSphere. Configure DRS and HA clusters for automatic load balancing and failover of your Kubernetes node VMs.

Virtual Machine Templates: Create a standardized VM template for your Kubernetes control plane and worker nodes. Use a lightweight, cloud-native OS like Ubuntu Server or Photon OS.

Networking: Create a dedicated vSwitch or Port Group for NAS storage traffic. This isolates storage I/O from other network traffic (management, pod-to-pod) and improves security and performance. Use Jumbo Frames (MTU 9000) on this network if your NAS and physical switches support it.
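To make the cluster and networking preparation concrete, here is a minimal PowerCLI sketch. The datacenter, cluster, host, uplink, and port group names and the VLAN ID are hypothetical placeholders; substitute your own values.

# Create a cluster with DRS and HA enabled (names are example values)
New-Cluster -Name 'K8s-Cluster' -Location (Get-Datacenter -Name 'DC01') -DrsEnabled -HAEnabled

# Create a dedicated standard vSwitch with jumbo frames, plus a port group for NAS traffic
$vmHost  = Get-VMHost -Name 'esxi01.lab.local'
$vSwitch = New-VirtualSwitch -VMHost $vmHost -Name 'vSwitch-NAS' -Nic 'vmnic2' -Mtu 9000
New-VirtualPortGroup -VirtualSwitch $vSwitch -Name 'NAS-Traffic' -VLanId 50

On a vSphere Distributed Switch the equivalent settings live on the switch and distributed port group objects, but the principle is the same: a dedicated, jumbo-frame-enabled path for storage.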

2. NAS Storage Preparation (NFS Example)

Create NFS Exports: On your NAS appliance, create dedicated NFS shares that will be used by Kubernetes. It’s better to have multiple smaller shares for different applications or teams than one monolithic share.

Set Permissions: Configure export policies to only allow access from the IP addresses of your Kubernetes worker nodes. Set `no_root_squash` if your containers require running as root, but be aware of the security implications.
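As an illustration, on a generic Linux-based NFS server the export entry might look like the line below, reusing the example share from this guide and a hypothetical worker-node subnet (NAS appliances expose the same options through their own management UIs):

# /etc/exports – restrict the share to the worker nodes' storage subnet
/exports/kubernetes 192.168.10.0/24(rw,sync,no_subtree_check,no_root_squash)

Drop `no_root_squash` unless your containers genuinely need root ownership on the share.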

Optimize for Performance: Enable NFSv4.1 or higher for better performance and features like session trunking. Ensure your NAS has sufficient IOPS capability for your workloads.

3. Kubernetes Cluster Deployment

Provision VMs: Deploy your control plane and worker nodes from the template created in Step 1.

Install Kubernetes: Use a standard tool like `kubeadm` to bootstrap your cluster. Alternatively, leverage a VMware-native solution like VMware Tanzu for deeper integration.

Install CSI Driver: This is the most critical step for storage integration. Deploy the appropriate CSI driver for your NAS. For a generic NFS server, you can use the open-source NFS CSI driver. You typically install it using Helm or by applying its YAML manifests.

4. Integrating and Using NAS Storage

Create a StorageClass: A StorageClass tells Kubernetes how to provision storage. You will create one that uses the NFS CSI driver. This allows developers to request storage dynamically without needing to know the underlying NAS details. Example StorageClass YAML:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: 192.168.10.100
  share: /exports/kubernetes
mountOptions:  # a top-level StorageClass field, not a parameter
  - "nfsvers=4.1"
reclaimPolicy: Retain
volumeBindingMode: Immediate

Request Storage with a PVC: Developers request storage by creating a PersistentVolumeClaim (PVC) that references the StorageClass. Example PVC YAML:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 10Gi

Mount the Volume in a Pod: Finally, mount the PVC as a volume in your application’s Pod definition. Example Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: my-nginx-pod
spec:
  containers:
  - name: nginx
    image: nginx:latest
    volumeMounts:
    - name: data-volume
      mountPath: /usr/share/nginx/html
  volumes:
  - name: data-volume
    persistentVolumeClaim:
      claimName: my-app-data

Important Dos and Don’ts

| Do | Don't |
| --- | --- |
| Use a CSI driver for dynamic provisioning; it automates PV creation and simplifies management. | Use static PV definitions or direct hostPath mounts to the NAS; this is brittle and does not scale. |
| Isolate NAS traffic on a dedicated VLAN and vSwitch/Port Group for security and performance. | Mix storage traffic with management or pod-to-pod traffic on the same network interface. |
| Use the `ReadWriteMany` (RWX) access mode for NFS to share a volume across multiple pods. | Assume all storage supports RWX; block storage (iSCSI/FC) typically supports only `ReadWriteOnce` (RWO). |
| Implement a backup strategy for persistent data on the NAS using snapshots or other backup tools. | Assume Kubernetes handles data backups; it only manages the volume lifecycle. |
| Monitor storage latency and IOPS from both the VMware and NAS sides to identify bottlenecks. | Ignore storage performance until applications start failing. |

Design Example: Web Application with a Shared Uploads Folder

Scenario: A cluster of web server pods that need to read and write to a common directory for user-uploaded content.

VMware Setup: A 3-node Kubernetes cluster (1 control-plane, 2 workers) running as VMs in a vSphere HA cluster. A dedicated “NAS-Traffic” Port Group is configured for a second vNIC on each worker VM.
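Adding the second vNIC to each worker can be scripted as well; a minimal PowerCLI sketch, assuming hypothetical worker VM names and the "NAS-Traffic" port group from the setup above:

# Attach a VMXNET3 adapter on the NAS-Traffic port group to each worker VM
foreach ($worker in Get-VM -Name 'k8s-worker-01', 'k8s-worker-02') {
    New-NetworkAdapter -VM $worker -NetworkName 'NAS-Traffic' -Type Vmxnet3 -StartConnected
}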

NAS Setup: A NAS appliance provides an NFSv4 share at `192.168.50.20:/mnt/k8s_uploads`. The export policy is restricted to the IPs of the worker nodes on the NAS traffic network.

Kubernetes Setup:

The NFS CSI driver is installed in the cluster.

A `StorageClass` named `shared-uploads` is created, pointing to the NFS share.

A `PersistentVolumeClaim` named `uploads-pvc` requests 50Gi of storage using the `shared-uploads` StorageClass with `ReadWriteMany` access mode.

The web application’s `Deployment` is configured to mount `uploads-pvc` at the path `/var/www/html/uploads`.
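A minimal sketch of that `Deployment`, assuming the `uploads-pvc` claim already exists (the image and replica count are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: web
        image: nginx:latest
        volumeMounts:
        - name: uploads
          mountPath: /var/www/html/uploads
      volumes:
      - name: uploads
        persistentVolumeClaim:
          claimName: uploads-pvc

Because the claim uses `ReadWriteMany`, all replicas can mount the same NFS-backed volume at once.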

Any of the web server pods can write a file to the uploads directory, and all other pods can immediately see and serve that file, because they are all connected to the same underlying NFS share. If a worker VM fails, VMware HA restarts it on another host, and Kubernetes reschedules the pod, which then re-attaches to its storage seamlessly.

Deploying Time-Sensitive Applications on Kubernetes in VMware

Deploying time-sensitive applications, such as those in telecommunications (vRAN), high-frequency trading, or real-time data processing, on Kubernetes within a VMware vSphere environment requires careful configuration at both the hypervisor and Kubernetes levels. The goal is to minimize latency and jitter by providing dedicated resources and precise time synchronization.

Prerequisites: VMware vSphere Configuration

Before deploying pods in Kubernetes, the underlying virtual machine (worker node) and ESXi host must be properly configured. These settings reduce virtualization overhead and improve performance predictability.

Precision Time Protocol (PTP): Configure the ESXi host to use a PTP time source. This allows virtual machines to synchronize their clocks with high accuracy, which is critical for applications that depend on precise time-stamping and event ordering.

Latency Sensitivity: In the VM’s settings (VM Options -> Advanced -> Latency Sensitivity), set the value to High. This instructs the vSphere scheduler to reserve physical CPU and memory, minimizing scheduling delays and preemption (a PowerCLI sketch follows after this list).

CPU and Memory Reservations: Set a 100% reservation for both CPU and Memory for the worker node VM. This ensures that the resources are always available and not contended by other VMs.
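Both VM-level settings can be applied with PowerCLI. A minimal sketch follows; the VM name is a hypothetical placeholder, and the latency-sensitivity change goes through the vSphere API object model since PowerCLI does not, to our knowledge, expose it as a simple cmdlet parameter:

# Set Latency Sensitivity to High via the vSphere API
$vm = Get-VM -Name 'k8s-rt-worker-01'
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.LatencySensitivity = New-Object VMware.Vim.LatencySensitivity
$spec.LatencySensitivity.Level = 'high'
$vm.ExtensionData.ReconfigVM($spec)

# Reserve 100% of the VM's configured CPU (per-core host MHz x vCPU count) and memory
$fullCpuMhz = [int]($vm.NumCpu * ($vm.VMHost.CpuTotalMhz / $vm.VMHost.NumCpu))
Get-VMResourceConfiguration -VM $vm |
    Set-VMResourceConfiguration -CpuReservationMhz $fullCpuMhz -MemReservationGB $vm.MemoryGB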

Key Kubernetes Concepts

Kubernetes provides several features to control resource allocation and pod placement, which are essential for time-sensitive workloads.

Quality of Service (QoS) Classes: Kubernetes assigns pods to one of three QoS classes. For time-sensitive applications, the Guaranteed class is essential. A pod is given this class if every container in it has both a memory and CPU request and limit, and they are equal.

CPU Manager Policy: The kubelet’s CPU Manager can be configured with the `static` policy, which grants pods in the Guaranteed QoS class that request whole (integer) CPUs exclusive access to those CPUs on the node (see the sketch after this list).

HugePages: Using HugePages can improve performance by reducing the overhead associated with memory management (TLB misses).
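Enabling the static CPU Manager policy is a node-level (kubelet) setting rather than a pod-level one. A minimal KubeletConfiguration sketch, with example reservation sizes (the static policy requires some CPU to be explicitly reserved for system and Kubernetes daemons):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# Reserve CPU for the OS and Kubernetes components so the remaining
# whole cores can be pinned exclusively to Guaranteed pods
systemReserved:
  cpu: "500m"
kubeReserved:
  cpu: "500m"

Changing the policy on an existing node typically also requires removing the kubelet’s cpu_manager_state file and restarting the kubelet.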

Example 1: Basic Deployment with Guaranteed QoS

This example demonstrates how to create a simple Pod that qualifies for the ‘Guaranteed’ QoS class. This is the first step towards ensuring predictable performance.

apiVersion: v1
kind: Pod
metadata:
  name: low-latency-app
spec:
  containers:
  - name: my-app-container
    image: my-real-time-app:latest
    resources:
      requests:
        memory: "2Gi"
        cpu: "2"
      limits:
        memory: "2Gi"
        cpu: "2"

In this manifest, the CPU and memory requests are identical to their limits, ensuring the pod is placed in the Guaranteed QoS class.

Example 2: Advanced Deployment with CPU Pinning and HugePages

This example builds on the previous one by requesting exclusive CPUs and using HugePages. This configuration is suitable for high-performance applications that require dedicated CPU cores and efficient memory access. Note: This requires the node’s CPU Manager policy to be set to ‘static’ and for HugePages to be pre-allocated on the worker node.

apiVersion: v1
kind: Pod
metadata:
  name: high-performance-app
spec:
  containers:
  - name: my-hpc-container
    image: my-hpc-app:latest
    resources:
      requests:
        memory: "4Gi"
        cpu: "4"
        hugepages-2Mi: "2Gi"
      limits:
        memory: "4Gi"
        cpu: "4"
        hugepages-2Mi: "2Gi"
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage-volume
  volumes:
  - name: hugepage-volume
    emptyDir:
      medium: HugePages

This pod requests four dedicated CPU cores and 2Gi of 2-megabyte HugePages, providing a highly stable and low-latency execution environment.

Summary

Successfully deploying time-sensitive applications on Kubernetes in VMware is a multi-layered process. It starts with proper ESXi host and VM configuration to minimize virtualization overhead and concludes with specific Kubernetes pod specifications to guarantee resource allocation and scheduling priority. By combining these techniques, you can build a robust platform for your most demanding workloads.