Troubleshooting vimService Creation Failed 403 in vCenter

The error vimService creation failed: error forbidden 403 during vSphere/vCenter operations typically indicates that the client cannot establish a VIM (vSphere API) session because of an HTTP 403 (Forbidden) condition such as proxy interference, a wrong endpoint, or an access-control policy.

vimService creation failed: error forbidden 403 – Troubleshooting Guide

This document explains causes and a structured troubleshooting approach for vimService creation failed: error forbidden 403 seen when deploying VCSA, adding vCenter endpoints, or using tools that talk to the vSphere API.


Scope and symptoms

Applies to

  • VCSA deployment (UI/CLI installer) failing in Stage 1 or login step with vimService creation failed: error forbidden 403 (or similar).
  • Third‑party or custom clients (Terraform, pyvmomi, backup tools, etc.) calling https://<vcenter_or_esxi>:443/sdk and receiving HTTP 403.
  • Browser access to vCenter working, but thick installers or automation tools failing with 403 errors to the same vCenter/ESXi.

Common symptoms

  • Installer/utility logs contain entries such as:
    • vimService creation failed: error forbidden 403
    • A problem occurred while logging in. Verify the connection details.
  • vSphere Web Client/HTML5 UI may still be accessible via browser.
  • Problem appears only from specific client machines (e.g., Windows jump host with a proxy configured).

Root cause overview

HTTP 403 “Forbidden” indicates that the HTTP request reached the server but was rejected due to authorization, endpoint, or access policies.

In the context of vSphere, typical root causes are:

  • Wrong endpoint or target
    • Deploying VCSA or connecting a client to a non‑vSphere server (e.g., Windows IIS, generic web server) instead of an ESXi or vCenter IP/FQDN.
    • Using the wrong URL path or port instead of /sdk on 443.
  • Proxy settings and man‑in‑the‑middle interception
    • Windows system/browser proxy is used by the vSphere installer; traffic goes through an HTTP proxy that returns 403 rather than forwarding to vCenter/ESXi.
  • Access control / permissions / IP restrictions
    • vCenter restricted to specific source IP ranges or firewall rules; client address not allowed.
    • API/service disabled or restricted on a security gateway or load balancer in front of vCenter.
  • Authentication/authorization problems (vSphere side)
    • User has no permissions on any vCenter instance in a linked group, resulting in login denial and HTTP 403 from the UI or API gateway.
    • Session or token issues (e.g., stale SSO session, NoPermission) causing the API gateway to respond with 403.

Data collection

Before changing anything, gather basic info:

  1. Error details from client/installer logs
    • vCenter/VCSA installer logs (e.g., vcsa-ui-installer\logs\ or \ProgramData\VMware\ on Windows): look for vimService creation failed and surrounding lines.
    • Third‑party tool logs (Terraform, backup, etc.) for HTTP status and exact URL.
  2. Target details
    • IP/FQDN entered in the installer or tool.
    • Confirm whether this IP is vCenter, ESXi, or something else (IIS, reverse proxy, load balancer, etc.).
  3. Network path
    • Whether a proxy is configured on the client (Internet Options, browser, system proxy, corporate PAC file).
    • Firewall or WAF devices between client and vCenter.
  4. vCenter logs (if accessible)
    • Check vCenter vpxd.log and appliance reverse‑proxy logs for any 403 entries tied to the client’s IP/time.

Step‑by‑step troubleshooting

Step 1 – Verify the target endpoint

  1. From the client machine, open a browser and go to:
    • https://<target>:443/ and https://<target>:443/sdk
    • Confirm that:
      • https://<target>/ shows vCenter or ESXi login page.
      • https://<target>/sdk presents the vSphere Web Services SDK (XML/WSDL style page).
  2. If you see an IIS/Apache/web‑app page instead of vCenter/ESXi, the IP or FQDN is wrong; correct it to point to vCenter or an ESXi host.
  3. If /sdk returns 404 or 403 while the vCenter UI works via another FQDN, check:
    • Load balancer / reverse proxy configuration.
    • Whether you should use the vCenter FQDN instead of an alias.
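
A quick way to run both checks from the client is curl; the FQDN below is a placeholder, and -k only skips certificate validation for this connectivity test.

# Check the base URL and the SDK endpoint from the same client that fails
curl -k -s -o /dev/null -w "GET /    -> HTTP %{http_code}\n" https://vc01.example.com/
curl -k -s -o /dev/null -w "GET /sdk -> HTTP %{http_code}\n" https://vc01.example.com/sdk
# A 403 here that you do not see in the browser usually points at a proxy, WAF, or
# wrong target answering instead of vCenter/ESXi.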

Step 2 – Eliminate proxy interference

  1. On Windows client where the error occurs, check proxy:
    • Internet Options → Connections → LAN settings → Proxy server / automatic configuration script.
  2. If a proxy or PAC is configured:
    • Temporarily disable the proxy or add the vCenter/ESXi FQDN/IP to the proxy bypass list.
  3. Re‑run the installer or client:
    • In many documented cases, disabling the proxy resolves vimService creation failed: error forbidden 403.
  4. For tools like Terraform or custom scripts, ensure environment variables like HTTPS_PROXY and HTTP_PROXY are unset for vCenter connections, or use no‑proxy lists.
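
For shell-based tools, the sketch below shows one way to clear or bypass the proxy for a single session; the FQDNs are placeholders.

# Clear proxy variables for the current shell session only
unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy

# Or keep the proxy but bypass it for vSphere endpoints
export NO_PROXY="vc01.example.com,esxi01.example.com"
export no_proxy="$NO_PROXY"

# Verify whether curl still routes the request through a proxy
curl -k -v https://vc01.example.com/sdk 2>&1 | grep -iE "proxy|HTTP/"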

Step 3 – Validate DNS, certificates, and SSL path

  1. Confirm DNS resolution:
    • nslookup <vcenter_fqdn> from the client; ensure it resolves to the correct IP.
  2. Confirm that no SSL inspection appliance is rewriting or blocking /sdk.
  3. If a reverse proxy/WAF is used:
    • Review its rules for 403 responses on /sdk or /api.

Although certificate problems usually surface as SSL/TLS handshake errors rather than HTTP status codes, some middleware can return 403 when certificate policies fail.
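
To see which certificate the client actually receives on this path (and therefore whether an SSL-inspection device sits in between), a quick check with nslookup and openssl can help; the FQDN is a placeholder.

# Confirm name resolution first
nslookup vc01.example.com

# Inspect the certificate presented to this client; an issuer belonging to a proxy or
# SSL-inspection appliance instead of the VMCA/your CA indicates interception
echo | openssl s_client -connect vc01.example.com:443 -servername vc01.example.com 2>/dev/null \
  | openssl x509 -noout -subject -issuer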

Step 4 – Check vCenter permissions and SSO

If you hit 403 only after successful TCP/SSL connection to the correct vCenter:

  1. Try the same user in the vSphere Client (H5/HTML5 UI).
    • If login fails with messages like “Unable to login because you do not have permission on any vCenter Server systems connected to this client,” you may have 0 effective permissions and an SSO federation problem.
  2. Review SSO and global permissions:
    • Ensure the user or group has at least read permissions on the vCenter inventory.
    • In linked‑mode scenarios, verify that the user has permissions on at least one linked vCenter; otherwise the client denies the login.
  3. Check vCenter logs for NoPermission or NotAuthenticated faults around the time of the 403:
    • These appear as vim.fault.NoPermission or NotAuthenticated in logs like vpxd.log and UI traces.
  4. If using tokens or external identity providers, validate token audience, scope, and expiration; invalid tokens can surface as HTTP 403 in the API gateway.
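
If the environment runs vSphere 7.0 U2 or later, a quick REST session request can help separate authentication failures (401) from authorization or policy blocks (403); the user, password, and FQDN below are placeholders.

# Attempt to create an API session as the affected user.
# 201 = credentials and basic access OK, 401 = bad credentials / SSO problem,
# 403 = request reached vCenter (or a device in front of it) but was rejected by policy.
# On older releases the endpoint is /rest/com/vmware/cis/session.
curl -k -s -o /dev/null -w "POST /api/session -> HTTP %{http_code}\n" \
  -X POST -u 'svc-automation@vsphere.local:ChangeMe123!' \
  https://vc01.example.com/api/session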

Step 5 – Review firewall and IP access restrictions

  1. Confirm that the client IP is allowed to reach vCenter on TCP 443:
    • Use telnet <vcenter> 443 or curl -vk https://<vcenter>/sdk from the client.
  2. Check for:
    • NSX distributed firewall rules blocking this source.
    • Perimeter firewalls or load balancers configured with IP allow‑lists; unauthorized IPs can receive 403 instead of 401/timeout.
  3. If the same operation works from another jump host or subnet, suspect IP‑based access control and adjust rules accordingly.

Step 6 – Retest VCSA deployment / client operation

After applying fixes (correct endpoint, disabled proxy, updated permissions):

  • Re‑run the VCSA deployment wizard or client:
    • Confirm that the login step succeeds and the deployment progresses beyond the previous failure.
  • For automation tools, re‑run plan or the API call and confirm that the HTTP status changes from 403 to 200 (or appropriate 2xx).

Known patterns and solutions

Pattern 1 – Windows proxy causing installer 403

  • VCSA installer launched from a Windows host with corporate proxy configured in IE/Internet Options.
  • Browser access to vCenter works through proxy, but the installer gets 403 from the proxy instead of reaching vCenter /sdk.

Resolution:

  • Disable system/IE proxy or exclude vCenter/ESXi FQDNs from proxy.
  • Restart the installer and retry; the error should disappear.

Pattern 2 – Wrong target (non‑vSphere server)

  • User points VCSA installer to a Windows server or some other web server rather than an ESXi/vCenter endpoint.
  • /sdk belongs to IIS or is missing, returning 404/403.

Resolution:

  • Identify the correct ESXi or existing vCenter that will host the new appliance and use that IP/FQDN.

Pattern 3 – API access forbidden for automation user

  • REST or SOAP API calls fail with 403 while UI logins succeed using another administrative account.
  • The automation user has insufficient vCenter permissions or is blocked by IP restrictions.

Resolution:

  • Assign the required vSphere roles/privileges to that user (or group) at appropriate scope.
  • Confirm there is no IP allow‑list blocking the client.

Prevention and best practices

  • Always test https://<vcenter>/sdk directly from the same system that runs the installer or automation to verify connectivity and routing.
  • Avoid using internet proxies for internal vCenter/ESXi access; where unavoidable, configure no‑proxy rules for vSphere endpoints.
  • Standardize vCenter access FQDN and ensure DNS, certificates, and firewall rules all align with that FQDN.
  • For automation accounts, create dedicated service principals with clearly defined roles, and test via API tools (curl/Postman/pyvmomi) before integrating into larger workflows.

VMKPing “Invalid Argument” While Testing vMotion Network

When testing vMotion network connectivity from an ESXi host, vmkping can return Unknown interface 'vmkX': Invalid argument even though the VMkernel adapter exists and works for vMotion.

This behavior is almost always related to the vMotion TCP/IP stack or another non‑default stack (vxlan, provisioning) being used on the VMkernel interface.


Scope and prerequisites

This document applies to:

  • ESXi hosts using dedicated vMotion VMkernel adapters (often on the vMotion TCP/IP stack).
  • Environments where vmkping from ESXi fails with an “Invalid argument” or “Unknown interface” error when specifying -I vmkX.

Prerequisites:

  • Shell/SSH access to the ESXi host (Tech Support Mode / SSH enabled).
  • vSphere Client access to verify VMkernel and TCP/IP stack configuration.

Problem description and symptoms

Typical error messages

When running vmkping from ESXi to test vMotion connectivity, you might see:

  • Unknown interface 'vmk1': Invalid argument
  • vmkping: sendto() failed: Invalid argument

This usually occurs with commands like:

  • vmkping -I vmk1 <peer_vmotion_IP>
  • vmkping <peer_vmotion_IP>

Functional impact

Despite the error:

  • vMotion may still work successfully because the vMotion TCP/IP stack is functioning correctly.
  • Standard ping from ESXi or from external devices to the vMotion IPs may fail because the vMotion stack is L2‑only or has no gateway.

Root cause

vMotion TCP/IP stack behavior

VMkernel adapters can be attached to different TCP/IP stacks:

  • defaultTcpipStack (usually Management, vMotion, vSAN in simple setups)
  • vmotion (dedicated vMotion TCP/IP stack)
  • vxlan, vSphereProvisioning, etc.

Key points:

  • When a VMkernel adapter is created on the vMotion stack, the gateway option disappears in the UI because this stack is designed as an L2 network for vMotion traffic.
  • vmkping uses the VMkernel’s TCP/IP stack, not the host’s management stack, and requires explicit stack selection for non‑default stacks.

If you call vmkping without telling it which TCP/IP stack to use, it assumes defaultTcpipStack.
When the interface is actually on vmotion, this mismatch causes the Unknown interface or Invalid argument error.


Identification: confirm VMkernel and TCP/IP stack

Perform these checks from an ESXi shell:

1. List VMkernel interfaces and stacks

esxcli network ip interface list
  • Look at Name (vmk0, vmk1, …) and Netstack Instance columns to see which stack each VMkernel uses (e.g., defaultTcpipStack, vmotion, vxlan).

Alternative older command:

esxcfg-vmknic -l
  • Check the NetStack column to identify which stack is bound to each vmk.

2. List available TCP/IP stacks

esxcli network ip netstack list
  • Confirm valid stack names such as defaultTcpipStack, vmotion, vxlan, and vSphereProvisioning.

3. Validate vMotion tagging and IPs

In vSphere Client:

  • For each host, open Configure → Networking → VMkernel adapters.
  • Verify:
    • Which vmk is enabled for vMotion.
    • Whether it uses the vMotion TCP/IP stack or the default stack.
    • IP address, VLAN, and port group settings.
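
The same information can be read from the ESXi shell; the tag commands below assume ESXi 5.1 or later, and vmk1 is a placeholder.

# Show which services (Management, VMotion, vSAN, ...) are tagged on a VMkernel adapter
esxcli network ip interface tag get -i vmk1

# Show the IPv4 address/netmask configured on the same adapter
esxcli network ip interface ipv4 get -i vmk1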

Correct test methods for vMotion network

Option 1 – esxcli (recommended for vMotion stack)

Use esxcli network diag ping and specify the VMkernel and the netstack:

esxcli network diag ping -I vmk1 --netstack=vmotion -H <target_vmotion_vmk_IP>
  • Replace vmk1 with the local vMotion VMkernel name.
  • Replace <target_vmotion_vmk_IP> with the vMotion VMkernel IP of the peer ESXi host.

This method works even when the vMotion stack is L2‑only with no gateway.

Jumbo frame / MTU test

For MTU 9000 testing, include payload size and do‑not‑fragment options:

esxcli network diag ping -I vmk1 --netstack=vmotion -H <target_vmotion_vmk_IP> -s 8972 -d
  • For MTU 9000, payload 8972 bytes plus headers approximates the full frame and validates end‑to‑end jumbo support.

Option 2 – vmkping with netstack parameter

If you prefer vmkping directly:

vmkping -S vmotion -I vmk1 <target_vmotion_vmk_IP>

or

vmkping ++netstack=vmotion -I vmk1 <target_vmotion_vmk_IP>

Key notes:

  • -S vmotion or ++netstack=vmotion selects the vMotion TCP/IP stack.
  • Stack name is case‑sensitive and must match the value from esxcli network ip netstack list.

For MTU testing:

vmkping ++netstack=vmotion -I vmk1 -s 8972 -d <target_vmotion_vmk_IP>

Step‑by‑step troubleshooting workflow

Use this procedure when you see Invalid argument while testing vMotion.

Step 1 – Confirm VMkernel stack and role

  1. On each host, run: esxcli network ip interface list
  2. Identify:
    • Which vmk has vMotion enabled.
    • Its Netstack Instance (e.g., vmotion).
  3. In vSphere Client, verify that vMotion is enabled on that vmk and that IPs are in the correct vMotion VLAN/subnet.

Step 2 – Reproduce the error with standard vmkping

From the ESXi shell:

vmkping -I vmk1 <target_vmotion_vmk_IP>
  • If you get Unknown interface 'vmk1': Invalid argument, this confirms the mismatch between vmkping’s default stack and the interface’s actual stack.

Step 3 – Test with correct netstack

Run:

vmkping -S vmotion -I vmk1 <target_vmotion_vmk_IP>
# or
esxcli network diag ping -I vmk1 --netstack=vmotion -H <target_vmotion_vmk_IP>
  • Successful responses (ICMP replies) indicate L2 connectivity on the vMotion network.

Step 4 – Validate MTU and physical network

If connectivity works but jumbo frame tests fail:

  1. Run a jumbo ping: vmkping -S vmotion -I vmk1 -s 8972 -d <target_vmotion_vmk_IP>
  2. If it fails, check the following (the command sketch after this list shows where to read each MTU value):
    • Check vSwitch / vDS MTU configuration.
    • Check physical NIC MTU.
    • Check upstream physical switch ports and VLAN MTU.
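
A minimal set of esxcli commands for reading those MTU values from the host:

# MTU on standard vSwitches
esxcli network vswitch standard list

# MTU on distributed switches, if used
esxcli network vswitch dvs vmware list

# MTU configured on each VMkernel adapter
esxcli network ip interface list

# MTU/link state on the physical uplinks
esxcli network nic list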

Step 5 – Check routing/gateway expectations

  • For vMotion TCP/IP stack, Layer 3 routing typically requires configuring a default gateway for that stack in Networking → TCP/IP configuration on the host.
  • Without a gateway or static route, vMotion stack pings to other subnets will fail even if same‑subnet pings work.
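
Routing for a non-default stack can also be inspected, and if needed configured, from the shell; the gateway IP below is a placeholder and the netstack option assumes a reasonably recent ESXi release.

# Show routes known to the vMotion TCP/IP stack
esxcli network ip route ipv4 list -N vmotion

# Add a default gateway for the vMotion stack only (placeholder gateway address)
esxcli network ip route ipv4 add -N vmotion -n default -g 10.10.20.1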

Example command set (copy/paste)

Adjust interface names and IPs before using.

# 1. Inventory VMkernel interfaces and stacks
esxcli network ip interface list
esxcfg-vmknic -l
esxcli network ip netstack list

# 2. Test vMotion vmk using vMotion stack (esxcli)
esxcli network diag ping -I vmk1 --netstack=vmotion -H 10.10.20.12

# 3. Test vMotion vmk using vmkping with netstack
vmkping -S vmotion -I vmk1 10.10.20.12
vmkping ++netstack=vmotion -I vmk1 10.10.20.12

# 4. Jumbo frame tests (MTU 9000)
esxcli network diag ping -I vmk1 --netstack=vmotion -H 10.10.20.12 -s 8972 -d
vmkping -S vmotion -I vmk1 -s 8972 -d 10.10.20.12

ESXi NFSv4 Error Decoder & Log Analyzer

NFSv4 failures in ESXi appear as NFS4ERR_* codes, lockd/portmap conflicts, and RPC timeouts in vmkernel.log. This script suite extracts all NFSv4 errors with detailed explanations, affected datastores, and remediation steps directly from live logs.

NFSv4 Error Code Reference

NFS4ERR Code         | Hex        | Meaning           | Common Cause                   | Fix
NFS4ERR_LOCK_UNAVAIL | 0x0000006c | Lock denied       | lockd conflict, port 4045 used | Kill conflicting process, restart lockd
NFS4ERR_STALE        | 0x0000001f | Stale file handle | NAS reboot, export changed     | Remount datastore, check NAS exports
NFS4ERR_EXPIRED      | 0x0000001e | Lease expired     | Network partition >90s         | Check MTU, firewall 2049/TCP
NFS4ERR_DELAY        | 0x00000016 | Server busy       | NAS overloaded                 | Increase nfs.maxqueuesize, NAS perf
NFS4ERR_IO           | 0x00000011 | I/O error         | NAS disk failure               | Check NAS alerts, failover pool
NFS4ERR_BADHANDLE    | 0x00000002 | Invalid handle    | Corrupt mount                  | Unmount/remount NFS datastore
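
Before running the full parser, a quick count of which codes dominate the live log can be taken with a one-liner (same log path the scripts below use).

# Count NFS4ERR_* occurrences in the live vmkernel log, most frequent first
grep -o 'NFS4ERR_[A-Z_]*' /var/run/log/vmkernel.log | sort | uniq -c | sort -nr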

NFSv4 Log Parser (nfs4-analyzer.sh)

Purpose: Extracts ALL NFSv4 errors from vmkernel.log with timestamps, datastores, and RPC details.

#!/bin/bash
# nfs4-analyzer.sh - Extract NFSv4 failures from ESXi logs
LOG_FILE="/var/run/log/vmkernel.log"
REPORT="/tmp/nfs4-errors-$(date +%Y%m%d-%H%M).csv"

echo "Timestamp,Datastore,NFS4ERR_Code,Error_Type,Severity,Server_IP,Mount_Point,Raw_Log" > $REPORT

# NFS4ERR_* codes
grep -i "NFS4ERR\|nfs.*error\|rpc.*fail\|lockd\|portmap" $LOG_FILE | while read line; do
    timestamp=$(echo $line | sed 's/^\(.*\)cpu.*/\1/')
    datastore=$(echo $line | grep -o '/vmfs/volumes/[^ ]*' | head -1 || echo 'unknown')
    server_ip=$(echo $line | grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' | head -1)
    
    # Extract NFS4ERR code
    if [[ $line =~ NFS4ERR_([A-Z_]+) ]]; then
        code=${BASH_REMATCH[1]}
        severity="HIGH"
    elif [[ $line =~ (timeout\|fail|dead|unavailable) ]]; then
        code=$(echo $line | grep -oE 'timeout|fail|dead|unavailable|refused')
        severity="MEDIUM"
    else
        code="OTHER"
        severity="LOW"
    fi
    
    echo "$timestamp,$datastore,$code,$severity,$server_ip,$datastore,$line" >> $REPORT
done

# Summary stats (to stdout, so the CSV itself stays machine-readable)
echo "=== NFSv4 Error Summary ==="
awk -F, 'NR>1 {print $3" "$4" "$5}' $REPORT | sort | uniq -c | sort -nr | head -10

# Critical alerts
echo "CRITICAL (HIGH severity):"
grep ",HIGH," $REPORT | cut -d, -f1,2,3,7-

echo "Report saved: $REPORT ($(wc -l < $REPORT) entries)"

Usage: ./nfs4-analyzer.sh → /tmp/nfs4-errors-*.csv

Detailed NFSv4 Error Decoder (nfs4-decoder.py)

Purpose: Maps NFSv4 error codes to RFC 5661 explanations + ESXi-specific fixes.

#!/usr/bin/env python3
# nfs4-decoder.py - Detailed NFSv4 error explanations
import re
import sys
import pandas as pd

NFS4_ERRORS = {
    'NFS4ERR_LOCK_UNAVAIL': {
        'rfc': 'RFC 5661 Sec 14.2.1 - Lock held by another client',
        'esxi_cause': 'portd/lockd conflict on 4045/TCP, Windows NFS client interference',
        'fix': '1. `esxcli system process list | grep lockd` → kill PID\n2. Check `netstat -an | grep 4045`\n3. Restart: `services.sh restart`',
        'severity': 'CRITICAL'
    },
    'NFS4ERR_STALE': {
        'rfc': 'RFC 5661 Sec 14.2.30 - File handle no longer valid',
        'esxi_cause': 'NAS export removed, filesystem ID changed, NAS failover',
        'fix': '`esxcli storage filesystem unmount -l DATASTORE`, then re-add: `esxcli storage nfs add -H NAS_IP -s /export -v DATASTORE`',
        'severity': 'HIGH'
    },
    'NFS4ERR_EXPIRED': {
        'rfc': 'RFC 5661 Sec 14.2.9 - Lease expired',
        'esxi_cause': 'Network blip >90s, firewall dropped TCP 2049',
        'fix': '1. `vmkping -I vmk0 NAS_IP -s 8972` (Jumbo)\n2. Check ESXi firewall: `esxcli network firewall ruleset list | grep nfs`',
        'severity': 'HIGH'
    },
    'NFS4ERR_DELAY': {
        'rfc': 'RFC 5661 Sec 14.2.7 - Server temporarily unavailable',
        'esxi_cause': 'NAS RPC queue full, nfs.maxqueuesize too low',
        'fix': '`esxcli system settings advanced set -o /NFS/MaxQueueSize -i 16` → rescan',
        'severity': 'MEDIUM'
    }
}

def decode_nfs4_error(log_line):
    for error, details in NFS4_ERRORS.items():
        if re.search(error, log_line):
            ts_match = re.search(r'\[([^\]]+)', log_line)
            return {
                **details,
                'raw_line': log_line,
                'timestamp': ts_match.group(1) if ts_match else 'unknown'
            }
    return {'error': 'UNKNOWN_NFS4', 'severity': 'INFO', 'raw_line': log_line}

# Process log file
if len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        errors = [decode_nfs4_error(line) for line in f
                  if 'NFS4ERR' in line or re.search(r'nfs.*error', line, re.I)]
    
    df = pd.DataFrame(errors)
    print(df[['timestamp', 'severity', 'esxi_cause', 'fix']].to_string(index=False))

Live NFSv4 Monitor (nfs4-live-tail.sh)

Purpose: Real-time NFSv4 error detection with instant alerts.

#!/bin/bash
# nfs4-live-tail.sh - Watch NFSv4 errors live
tail -f /var/run/log/vmkernel.log | grep --line-buffered -iE "NFS4ERR|nfs.*(error|fail|timeout|lock)" | while read -r line; do
    echo "$(date): $line"

    # Auto-run decoder (nfs4-decoder.py expects a file path argument)
    echo "$line" > /tmp/nfs4-last-line.log
    python3 nfs4-decoder.py /tmp/nfs4-last-line.log | head -3
    
    # Alert on critical
    if echo "$line" | grep -q "LOCK_UNAVAIL\|STALE\|EXPIRED"; then
        echo "CRITICAL NFSv4 ERROR - Check /tmp/nfs4-errors-*.csv" | mail -s "ESXi NFSv4 Failure $(hostname)" oncall@company.com
    fi
done

Run: ./nfs4-live-tail.sh (Ctrl+C to stop)

Master NFSv4 Dashboard Generator (nfs4-dashboard.py)

#!/usr/bin/env python3
# nfs4-dashboard.py - HTML dashboard with error trends
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import subprocess

# Parse all reports (skip the header row written by nfs4-analyzer.sh)
all_reports = []
for report in subprocess.check_output('ls /tmp/nfs4-errors-*.csv', shell=True).decode().split():
    df = pd.read_csv(report, skiprows=1, names=['ts','ds','code','sev','ip','mp','raw'])
    df['time'] = pd.to_datetime(df['ts'], errors='coerce')
    all_reports.append(df)

df = pd.concat(all_reports)

# Charts (row 2, col 1 is a domain-type cell so a pie trace can be placed in it)
fig = make_subplots(rows=2, cols=2,
                    specs=[[{}, {}], [{'type': 'domain'}, {}]],
                    subplot_titles=('NFSv4 Errors by Datastore', 'Error Timeline',
                                    'Top Error Codes', 'Server Response'))

by_ds = df.groupby('ds').size().reset_index(name='count')
fig.add_trace(px.bar(by_ds, x='ds', y='count').data[0], row=1, col=1)

timeline = px.line(df.groupby(['time', 'code']).size().reset_index(name='count'),
                   x='time', y='count', color='code')
for trace in timeline.data:
    fig.add_trace(trace, row=1, col=2)

by_code = df.groupby('code').size().reset_index(name='count')
fig.add_trace(px.pie(by_code, names='code', values='count').data[0], row=2, col=1)

fig.write_html('nfs4-dashboard.html')
print("Dashboard: nfs4-dashboard.html")

Quick One-Liners for NFSv4 Issues

  • Lock conflicts: esxcli system process list | grep lockd
  • Port 4045 check: netstat -an | grep 4045
  • RPC timeouts: rpcinfo -T tcp NAS_IP nfs (should return prog 100003)
  • Mount status: esxcli storage filesystem list
  • NFS firewall: esxcli network firewall ruleset set --ruleset-id nfsClient --enabled true
  • Force remount: esxcli storage filesystem unmount -l DATASTORE, then re-add the NFS mount (see the NFS4ERR_STALE fix above)

Automated Cron Setup

# /etc/cron.d/nfs4-monitor
*/5 * * * * root /scripts/nfs4-analyzer.sh >> /var/log/nfs4-monitor.log
0 * * * * root python3 /scripts/nfs4-dashboard.py

Alert Thresholds:

  • LOCK_UNAVAIL > 5/min → Page oncall
  • STALE/EXPIRED > 2 → Check NAS failover
  • DELAY > 10 → Storage team
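
A minimal sketch of the first threshold is shown below; it counts LOCK_UNAVAIL hits per analyzer run (the analyzer runs every 5 minutes via cron), and the mail address and threshold are taken from the list above as assumptions.

#!/bin/bash
# lock-unavail-threshold.sh - page when LOCK_UNAVAIL shows up repeatedly in the newest report
REPORT=$(ls -t /tmp/nfs4-errors-*.csv 2>/dev/null | head -1)
[ -z "$REPORT" ] && exit 0

COUNT=$(grep -c "LOCK_UNAVAIL" "$REPORT")
if [ "$COUNT" -gt 5 ]; then
    echo "LOCK_UNAVAIL seen $COUNT times in $REPORT" \
        | mail -s "ESXi NFSv4 lock alert $(hostname)" oncall@company.com
fi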

Sample Output (Tintri NFS)

2025-12-28 19:09:00: NFS4ERR_LOCK_UNAVAIL on /vmfs/volumes/tintrinfs-prod
Cause: Windows NFS client on port 4045 conflicts with ESXi lockd
Fix: Kill PID 12345 (lockd), restart services.sh

3x NFS4ERR_STALE on 10.0.1.50:/tintri-export
Cause: Tintri controller failover
Fix: Remount datastore

Confluence Deployment

1. SCP scripts → ESXi /scripts/
2. ./nfs4-analyzer.sh → instant CSV report
3. python3 nfs4-dashboard.py → embed HTML
4. Cron + ./nfs4-live-tail.sh for oncall shifts

Pro Tip: For Tintri NFS, grep 'tintri\|10.0.1' in the logs and check controller status in Tintri Global Center.

Run ./nfs4-analyzer.sh now for your current NFSv4 issues!

ESXi SCSI Decoder & VMkernel Log Analyzer

VMkernel logs contain SCSI sense codes, path states, and HBA errors in hex format that decode storage failures like LUN timeouts, reservation conflicts, and path flaps. This script suite parses /var/run/log/vmkernel.log, decodes SCSI status, and generates troubleshooting dashboards with failed paths/host details.

SCSI Sense Code Decoder (scsi-decoder.py)

Purpose: Converts hex sense data from vmkernel logs to human-readable errors.

#!/usr/bin/env python3
# scsi-decoder.py - Decode VMware SCSI sense codes
import re
SCSI_SENSE = {
    '0x0': 'No Sense (OK)',
    '0x2': 'Not Ready',
    '0x3': 'Medium Error',
    '0x4': 'Hardware Error',
    '0x5': 'Illegal Request',
    '0x6': 'Unit Attention',
    '0x7': 'Data Protect',
    '0xb': 'Aborted Command',
    '0xe': 'Overlapped Commands Attempted'
}

ASC_QUAL = {
    '0x2800': 'LUN Not Ready, Format in Progress',
    '0x3f01': 'Removed Target',
    '0x3f07': 'Multiple LUN Reported',
    '0x4700': 'Reservation Conflict',
    '0x4c00': 'Snapshot Failed',
    '0x5506': 'Illegal Message',
    '0x0800': 'Logical Unit Communication Failure'
}

def decode_scsi(line):
    """Parse vmkernel SCSI line: [timestamp] vmkwarning: CPUx: NMP: nmp_ThrottleLogForDevice: ... VMW_SCSIERR_0xX"""
    if 'VMW_SCSIERR' not in line:
        return None
    
    sense_match = re.search(r'VMW_SCSIERR_(0x[0-9a-fA-F]+)', line)
    if sense_match:
        sense = sense_match.group(1).lower()
        naa = re.search(r'naa\.([0-9a-fA-F:]+)', line)
        lun = naa.group(1) if naa else 'Unknown'
        ts_match = re.search(r'^(\S+)', line)

        return {
            'lun': lun,
            'sense_key': SCSI_SENSE.get(sense, f'Unknown: {sense}'),
            'raw_line': line.strip(),
            'timestamp': ts_match.group(1) if ts_match else 'unknown'
        }
    return None

# Usage example
log_line = "2025-12-28T18:46:00.123Z cpu5:32:VMW_SCSIERR_0xb: naa.60a9800064824b4f4f4f4f4f4f4f4f4f"
print(decode_scsi(log_line))
# Output: {'lun': '60a980006482...', 'sense_key': 'Aborted Command', ...}

VMkernel Log Parser (vmk-log-analyzer.sh)

Purpose: Real-time parsing of SCSI errors, path states, and HBA failures.

#!/bin/bash
# vmk-log-analyzer.sh - Live SCSI/Path decoder
LOG_FILE="/var/run/log/vmkernel.log"
REPORT="/tmp/scsi-report-$(date +%Y%m%d-%H%M).csv"

echo "Host,LUN,SCSI_Error,Path_State,Timestamp,Count" > $REPORT

# SCSI Errors
grep -i "VMW_SCSIERR\|scsi\|LUN\|NMP\|path dead\|timeout" $LOG_FILE | while read line; do
    host=$(hostname)
    lun=$(echo $line | grep -o 'naa\.[0-9a-f:]*' | head -1 || echo 'unknown')
    error=$(echo $line | grep -o 'VMW_SCSIERR_[0-9a-f]*\|timeout\|dead\|failed' | head -1)
    timestamp=$(echo $line | grep -o '\[[0-9TZ.-]*' | sed 's/\[//;s/\]//')
    
    echo "$host,$lun,$error,$timestamp,1" >> $REPORT.tmp
done

# Path States
grep -i "path state\|working\|dead\|standby\|active" $LOG_FILE | while read line; do
    host=$(hostname)
    lun=$(echo $line | grep -o 'naa\.[0-9a-f:]*' | head -1)
    path_state=$(echo $line | grep -oE '(working|dead|standby|active|disabled)')
    timestamp=$(echo $line | grep -o '\[[0-9TZ.-]*' | sed 's/\[//;s/\]//')
    
    echo "$host,$lun,$path_state,$timestamp,1" >> $REPORT.tmp
done

# Aggregate & sort
cat $REPORT.tmp | sort | uniq -c | sort -nr | awk '{print $2","$3","$4","$5","$1}' >> $REPORT
rm $REPORT.tmp

echo "SCSI/Path Report: $(wc -l < $REPORT) entries"
tail -20 $REPORT

Cron: */5 * * * * /scripts/vmk-log-analyzer.sh (5-minute intervals).

Live Dashboard Generator (scsi-dashboard.py)

Purpose: Creates HTML table with SCSI errors, path status, and failure trends.

#!/usr/bin/env python3
# scsi-dashboard.py - Interactive SCSI failure dashboard
import glob
import pandas as pd
import plotly.express as px
from datetime import datetime

# pandas cannot expand wildcards itself, so glob the analyzer reports and concatenate them
frames = [pd.read_csv(f, skiprows=1,
                      names=['Host','LUN','Error','Path_State','Timestamp','Count'])
          for f in glob.glob('/tmp/scsi-report-*.csv')]
df = pd.concat(frames)

# Top failing LUNs
top_luns = df.groupby('LUN')['Count'].sum().sort_values(ascending=False).head(10)

# Path state pie chart
path_pie = px.pie(df, names='Path_State', values='Count', title='Path States Distribution')

# Error timeline
df['Time'] = pd.to_datetime(df['Timestamp'], errors='coerce')
error_timeline = px.line(df.groupby(['Time','Error']).size().reset_index(name='Count'), 
                        x='Time', y='Count', color='Error', title='SCSI Errors Over Time')

# HTML Report
html = f"""
<html>
<head><title>ESXi SCSI Status - {datetime.now()}</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css">
</head>
<body>
<div class="container">
<h2>🔍 ESXi Storage Path & SCSI Errors</h2>
<div class="row">
<div class="col-md-6">{path_pie.to_html(full_html=False, include_plotlyjs='cdn')}</div>
<div class="col-md-6">{error_timeline.to_html(full_html=False, include_plotlyjs='cdn')}</div>
</div>

<h3>Top Failing LUNs</h3>
{df.pivot_table(values='Count', index='LUN', columns='Error', aggfunc='sum', fill_value=0).to_html()}
</div>
</body>
</html>
"""
with open('scsi-dashboard.html', 'w') as f:
    f.write(html)

Master Orchestrator (storage-scsi-monitor.py)

Purpose: Monitors all ESXi hosts, parses logs, generates alerts.

#!/usr/bin/env python3
# storage-scsi-monitor.py - Poll all ESXi hosts, parse SCSI reports, raise alerts
import paramiko
import subprocess
import pandas as pd
from scsi_decoder import decode_scsi  # scsi-decoder.py from above (rename to scsi_decoder.py so it can be imported)

ESXI_HOSTS = ['esxi1.example.com', 'esxi2.example.com']
ALERT_THRESHOLD = 10  # Errors per 5min

def analyze_host(host):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username='root', password='esxi_password')

    # Tail vmkernel log + run analyzer on the host
    stdin, stdout, stderr = ssh.exec_command("""
    tail -500 /var/run/log/vmkernel.log | grep -E 'SCSIERR|path dead|timeout' > /tmp/scsi.log &&
    /scripts/vmk-log-analyzer.sh
    """)
    stdout.channel.recv_exit_status()  # wait for the remote analyzer to finish

    # Fetch the newest report back from the host before parsing it locally
    _, out, _ = ssh.exec_command("ls -t /tmp/scsi-report-*.csv | head -1")
    remote_report = out.read().decode().strip()
    local_report = f'/tmp/{host}-scsi-report.csv'
    ssh.open_sftp().get(remote_report, local_report)
    ssh.close()

    errors = pd.read_csv(local_report)
    critical_luns = errors[errors['Count'] > ALERT_THRESHOLD]

    if not critical_luns.empty:
        print(f"ALERT {host}: {len(critical_luns)} critical LUNs")
        for _, row in critical_luns.iterrows():
            print(f"  LUN {row['LUN']}: {row['SCSI_Error']} x{row['Count']}")

# Run across all hosts
for host in ESXI_HOSTS:
    analyze_host(host)

subprocess.run(['python3', 'scsi-dashboard.py'])
print("SCSI dashboard: scsi-dashboard.html")

One-Liner SCSI Commands

  • Live SCSI errors: tail -f /var/run/log/vmkernel.log | grep SCSIERR
  • Path status: esxcli storage core path list | grep -iE 'dead'
  • LUN reservations: esxcli storage core device list | grep reservation
  • HBA errors: esxcli storage core adapter stats get -A vmhbaX
  • Decode sense manually: echo "0xb" | python3 scsi-decoder.py

Common SCSI Errors & Fixes

Sense Code           | Meaning                  | Fix
0xb (Aborted)        | Queue overflow, SAN busy | Increase HBA queue depth, check SAN zoning
0x6 (Unit Attention) | LUN reset                | Rescan: esxcli storage core adapter rescan --all
0x47 (Reservation)   | vMotion conflict         | Stagger migrations, check esxtop %RESV
Path Dead            | Cable/switch/HBA fail    | esxcli storage core path set -p <path> -s active

Alerting Integration

# Email critical LUNs
if [ $(cat /tmp/scsi-report-*.csv | wc -l) -gt 5 ]; then
    cat /tmp/scsi-report-*.csv | mail -s "ESXi SCSI Errors $(hostname)" storage-team@company.com
fi

Confluence Deployment

1. SCP scripts to ESXi: `/scripts/`
2. Cron: `*/5 * * * * /scripts/vmk-log-analyzer.sh`
3. Jumpbox: Run `storage-scsi-monitor.py` hourly
4. Embed: `{html}http://scsi-dashboard.html{html}`

Sample Output:

esxi1: 12 critical LUNs
LUN naa.60a980...: Aborted Command x8
Path vmhba32:C0:T0:L0 dead x4

Pro Tip: Filter Tintri LUNs: grep 'tintri\|naa.60a9' /var/run/log/vmkernel.log

Run ./storage-scsi-monitor.py for instant dashboard!

VMware Storage Performance Testing Suite

ESXi hosts provide fio, ioping, and esxtop for storage benchmarking directly from the CLI, while vCenter PowerCLI aggregates performance across clusters/datastores. This script suite generates IOPS, latency, and throughput charts viewable in Confluence/HTML dashboards.

Core Testing Engine (fio-perf-test.sh)

Purpose: Run standardized fio workloads (4K random, 64K seq) on VMFS/NFS datastores.

#!/bin/bash
# fio-perf-test.sh - Run on ESXi via SSH
# Pick the first mounted datastore path; adjust to target a specific datastore
DATASTORE=$(esxcli storage filesystem list | grep '^/vmfs/volumes/' | head -1 | awk '{print $1}')
TEST_DIR="$DATASTORE/perf-test"
FIO_TEST="/usr/lib/vmware/fio/fio"

mkdir -p $TEST_DIR
cd $TEST_DIR

cat > fio-random-4k.fio << EOF
[global]
ioengine=libaio
direct=1
size=1G
time_based
runtime=60
group_reporting
directory=$TEST_DIR

[rand-read]
rw=randread
bs=4k
numjobs=4
iodepth=32
filename=testfile.dat

[rand-write]
rw=randwrite
bs=4k
numjobs=4
iodepth=32
filename=testfile.dat
EOF

# Run tests
$FIO_TEST fio-random-4k.fio > /tmp/fio-4k-results.txt
$FIO_TEST --name=seq-read --rw=read --bs=64k --size=4G --runtime=60 --direct=1 --numjobs=1 --iodepth=32 --filename=$TEST_DIR/testfile.dat >> /tmp/fio-seq-results.txt

# Cleanup
rm -rf $TEST_DIR/*
echo "$(hostname),$(date),$(grep read /tmp/fio-4k-results.txt | tail -1 | awk '{print $3}'),$(grep IOPS /tmp/fio-4k-results.txt | grep read | awk '{print $2}')" >> /tmp/storage-perf.csv

Cron Schedule: 0 2 * * 1 /scripts/fio-perf-test.sh (weekly baseline).

vCenter PowerCLI Aggregator (StoragePerf.ps1)

Purpose: Collects historical perf + runs live esxtop captures across all hosts.

# StoragePerf.ps1 - vCenter Storage Performance Dashboard
Connect-VIServer vcenter.example.com

$Report = @()
$Clusters = Get-Cluster

foreach ($Cluster in $Clusters) {
    $Hosts = Get-VMHost -Location $Cluster
    foreach ($EsxHost in $Hosts) {   # $Host is a reserved automatic variable, so use a different name
        # Live esxtop capture (assumes esxtop/SSH access on the host; Invoke-VMScript itself targets guest VMs)
        $Esxtop = Invoke-VMScript -VM $EsxHost -ScriptText {
            esxtop -b -a -d 30 | grep -E 'DAVG|%LAT|IOPS' | tail -20
        } -GuestCredential (Get-Credential)

        # Historical datastore stats (average read and write separately)
        $Datastores = Get-Datastore -VMHost $EsxHost
        foreach ($DS in $Datastores) {
            $Stats = $DS | Get-Stat -Stat "datastore.read.average","datastore.write.average" -Realtime -MaxSamples 24
            $ReadAvg  = ($Stats | Where {$_.MetricId -eq 'datastore.read.average'}  | Measure -Property Value -Average).Average
            $WriteAvg = ($Stats | Where {$_.MetricId -eq 'datastore.write.average'} | Measure -Property Value -Average).Average

            $Report += [PSCustomObject]@{
                Host = $EsxHost.Name
                Datastore = $DS.Name
                FreeGB = [math]::Round($DS.FreeSpaceGB,1)
                ReadAvgKBps = [math]::Round($ReadAvg,2)
                WriteAvgKBps = [math]::Round($WriteAvg,2)
                EsxtopLatency = ($Esxtop | Select-String "DAVG" | Select-Object -Last 1).ToString().Split()[2]
            }
        }
    }
}
}

# Export CSV for charts
$Report | Export-Csv "StoragePerf-$(Get-Date -f yyyy-MM-dd).csv" -NoTypeInformation

# Generate HTML dashboard
$Report | ConvertTo-Html -Property Host,Datastore,FreeGB,ReadAvgKBps,WriteAvgKBps,EsxtopLatency -Title "Storage Performance" |
    Out-File "storage-dashboard.html"

Performance Chart Generator (perf-charts.py)

Purpose: Converts CSV data to interactive Plotly charts for Confluence.

#!/usr/bin/env python3
# perf-charts.py - Generate HTML charts from CSV
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import sys

df = pd.read_csv(sys.argv[1])

# IOPS vs Latency scatter
fig1 = px.scatter(df, x='ReadAvgKBps', y='EsxtopLatency', 
                 size='FreeGB', color='Host', hover_name='Datastore',
                 title='Storage Read Performance vs Latency',
                 labels={'ReadAvgKBps':'Read KBps', 'EsxtopLatency':'Avg Latency (ms)'})

# Throughput bar chart
fig2 = px.bar(df, x='Datastore', y=['ReadAvgKBps','WriteAvgKBps'], 
              barmode='group', title='Read/Write Throughput by Datastore')

# Combined dashboard
fig = make_subplots(rows=2, cols=1, subplot_titles=('IOPS vs Latency', 'Read/Write Throughput'))
fig.add_trace(fig1.data[0], row=1, col=1)
fig.add_trace(fig2.data[0], row=2, col=1)
fig.add_trace(fig2.data[1], row=2, col=1)

fig.write_html('storage-perf-dashboard.html')
print("Charts saved: storage-perf-dashboard.html")

Usage: python3 perf-charts.py StoragePerf-2025-12-28.csv

Master Orchestrator (storage-benchmark.py)

Purpose: Runs fio tests on all ESXi hosts + generates dashboard.

#!/usr/bin/env python3
import paramiko
import subprocess
import pandas as pd
from datetime import datetime

ESXI_HOSTS = ['esxi1.example.com', 'esxi2.example.com']
VCENTER = 'vcenter.example.com'

def run_fio(host):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username='root', password='your-esxi-password')

    # Copy & run fio script
    stdin, stdout, stderr = ssh.exec_command('wget -O /tmp/fio-test.sh https://your-confluence/scripts/fio-perf-test.sh && chmod +x /tmp/fio-test.sh && /tmp/fio-test.sh')
    result = stdout.read().decode()
    ssh.close()
    return result

# Execute tests
perf_data = []
for host in ESXI_HOSTS:
    print(f"Testing {host}...")
    run_fio(host)
    perf_data.append({'Host': host, 'TestTime': datetime.now()})

# Pull PowerCLI report
subprocess.run(['pwsh', '-File', 'StoragePerf.ps1'])

# Generate charts
subprocess.run(['python3', 'perf-charts.py', f'StoragePerf-{datetime.now().strftime("%Y-%m-%d")}.csv'])

print("Storage benchmark complete. View storage-perf-dashboard.html")

Confluence Chart Embedding

HTML Macro (paste storage-perf-dashboard.html content):

{html}

{html}

CSV Table with Inline Charts:

||Host||Datastore||Read IOPS||Latency||Chart||
|esxi1|datastore1|2450|2.3ms|![Read IOPS|width=150px,height=100px](storage-esxi1.png)|

Automated Dashboard Cronjob

#!/bin/bash
# /etc/cron.d/storage-perf
# Daily 3AM: Test + upload to Confluence
0 3 * * * root /usr/local/bin/storage-benchmark.py >> /var/log/storage-perf.log 2>&1

Output Files:

  • /tmp/storage-perf.csv → Historical trends
  • storage-perf-dashboard.html → Interactive Plotly charts
  • /var/log/storage-perf.log → Audit trail

Sample Output Charts

Expected Results (Tintri VMstore baseline):

Datastore: tintri-vmfs-01
4K Random Read: 12,500 IOPS @ 1.8ms
4K Random Write: 8,200 IOPS @ 2.4ms
64K Seq Read: 450 MB/s
64K Seq Write: 380 MB/s

Pro Tips & Alerts

☐ Alert if Latency > 5ms: Add to PowerCLI `if($EsxtopLatency -gt 5) {Send-MailMessage}`
☐ Tintri-specific: Add `esxtop` filter for Tintri LUN paths
☐ NFS tuning: Test with `nfs.maxqueuesize=8` parameter
☐ Compare baselines: Git commit CSV files weekly (see the sketch below)
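
For the baseline item, a minimal sketch of the weekly commit job is shown below; the repository path, remote, and branch are assumptions.

#!/bin/bash
# baseline-commit.sh - keep weekly storage performance CSVs under version control
REPO=/opt/perf-baselines             # assumed local clone of the baseline repo
cp /tmp/storage-perf.csv "$REPO/storage-perf-$(date +%Y-%m-%d).csv"
cd "$REPO" || exit 1
git add .
git commit -m "Storage perf baseline $(date +%Y-%m-%d)"
git push origin main                 # assumed remote/branch

# Example cron entry (weekly, Monday 04:00): 0 4 * * 1 /scripts/baseline-commit.sh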

Run First: python3 storage-benchmark.py --dry-run to validate hosts/configs.

Basic PS scripts for VMware Admin

VMware admins rely on PowerCLI, esxcli one-liners, and Python scripts for daily tasks like health checks, VM migrations, and storage monitoring to save hours of manual work.

Daily Health Check Script (PowerCLI)

Purpose: Run every morning to spot issues across cluster, hosts, VMs, and datastores.

# VMware Daily Health Check - Save as HealthCheck.ps1
Connect-VIServer vcenter.example.com

$Report = @()
$Clusters = Get-Cluster
foreach ($Cluster in $Clusters) {
    $Hosts = Get-VMHost -Location $Cluster
    $Report += [PSCustomObject]@{
        Cluster = $Cluster.Name
        HostsDown = ($Hosts | Where {$_.State -ne 'Connected'}).Count
        VMsDown = (Get-VM -Location $Hosts | Where {$_.PowerState -ne 'PoweredOn'}).Count
        DatastoresFull = (Get-Datastore -Location $Hosts | Where {$_.FreeSpaceGB/($_.CapacityGB)*100 -lt 20}).Count
        HighCPUHosts = ($Hosts | Where {($_.CpuUsageMhz / $_.CpuTotalMhz) * 100 -gt 80}).Count
    }
}
$Report | Export-Csv "DailyHealth-$(Get-Date -f yyyy-MM-dd).csv" -NoTypeInformation
Send-MailMessage -To admin@company.com -Subject "VMware Health Report" -Body "Check attached CSV" -Attachments "DailyHealth-$(Get-Date -f yyyy-MM-dd).csv"

Schedule: Windows Task Scheduler daily 8AM, outputs CSV + email summary.

Host Reboot & Maintenance Script

Purpose: Graceful host maintenance with VM evacuation.

# HostMaintenance.ps1
param($HostName, $MaintenanceReason)

$HostObj = Get-VMHost $HostName
if ((Get-VM -Location $HostObj).Count -gt 0) {
    Move-VM -VM (Get-VM -Location $HostObj) -Destination (Get-Cluster -VMHost $HostObj | Get-VMHost | Where {$_.State -eq 'Connected' -and $_.Name -ne $HostObj.Name} | Select -First 1)
}
Set-VMHost $HostName -State Maintenance -Confirm:$false
Restart-VMHost $HostName -Confirm:$false -Reason $MaintenanceReason

Usage: ./HostMaintenance.ps1 esxi-01 "Patching"

Storage Rescan & Connectivity Script (ESXi SSH)

Purpose: Fix “LUNs not visible” after SAN changes – run on all hosts.

#!/bin/bash
# storage-rescan.sh - Run via SSH or Ansible
for adapter in $(esxcli storage core adapter list | grep -E 'vmhba[0-9]+' | awk '{print $1}'); do
    esxcli storage core adapter rescan -A $adapter
done
esxcli storage filesystem list | grep -E 'Mounted|Accessible'
vmkping -I vmk1 $(esxcli iscsi adapter discovery sendtarget list | awk '{print $7}' | tail -1)
echo "Storage rescan complete on $(hostname)"

Cron: */30 * * * * /scripts/storage-rescan.sh >> /var/log/storage-rescan.log

VM Snapshot Cleanup Script

Purpose: Auto-delete snapshots >7 days old to prevent datastore exhaustion.

# SnapshotCleanup.ps1
$OldSnapshots = Get-VM | Get-Snapshot | Where {$_.CreateTime -lt (Get-Date).AddDays(-7)}
foreach ($Snap in $OldSnapshots) {
    Remove-Snapshot $Snap -Confirm:$false -RunAsync
    Write-Output "Deleted snapshot $($Snap.Name) on $($Snap.VM.Name)"
}

Alert on large snaps: Add Where {$_.SizeGB -gt 10} filter.

iSCSI Path Failover Test Script

Purpose: Verify multipath redundancy before maintenance.

#!/bin/bash
# iscsi-path-test.sh
echo "=== iSCSI Path Status ==="
esxcli iscsi session list
echo "=== Active Paths ==="
esxcli storage core path list | grep -E 'working|dead'
echo "=== LUN Paths ==="
esxcli storage nmp device list | grep -E 'path'

Run weekly: Documents path count for compliance audits.
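
To capture the per-device path count for that audit record, the summary below can be appended to the same script run; it assumes the standard "Device:" field in the esxcli path list output.

echo "=== Paths per device ==="
esxcli storage core path list | grep "   Device:" | sort | uniq -c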

NFS Mount Verification Script

Purpose: Check all NFS datastores connectivity.

#!/bin/bash
# nfs-check.sh - list NFS datastores and ping each NAS server
# Column 2 of 'esxcli storage nfs list' is the NAS host/IP; assumes volume names without spaces
esxcli storage nfs list
for nas_ip in $(esxcli storage nfs list | awk 'NR>2 {print $2}'); do
    vmkping -c 3 -I vmk0 $nas_ip || echo "NFS server $nas_ip NOT reachable from vmk0!"
done

Performance Monitoring Script (esxtop Capture)

Purpose: Collect 5min esxtop data during issues.

#!/bin/bash
# perf-capture.sh
# 30 samples at 10-second intervals = roughly 5 minutes of batch-mode data
esxtop -b -a -d 10 -n 30 > /tmp/esxtop-$(date +%Y%m%d-%H%M%S).csv
echo "Captured $(ls -lh /tmp/esxtop-*.csv | wc -l) perf files"

Analyze: esxtop replay or Excel pivot on %LATENCY, DAVG.

One-Liner Toolkit

  • List locked VMs: vmkfstools -D /vmfs/volumes/datastore/vm.vmdk | grep MAC
  • Check VMFS health: voma -m vmfs -f check -d /vmfs/devices/disks/naa.xxx:1
  • Reset iSCSI adapter: esxcli iscsi adapter set -A vmhbaXX -e false; esxcli iscsi adapter set -A vmhbaXX -e true
  • Host connection test: esxcli network ip connection list | grep ESTABLISHED
  • Datastore I/O: esxtop (press 'd'), look for %UTIL > 80%

Deployment Guide

  1. PowerCLI Setup: Install-Module VMware.PowerCLI on the jumpbox.
  2. ESXi Scripts: SCP to /scripts/, chmod +x, add to crontab.
  3. Confluence Integration: Embed scripts as <pre> blocks, add “Copy” buttons.
  4. Alerting: Pipe outputs to Slack/Teams via webhook or email (see the webhook sketch below).
  5. Version Control: Git repo per datacenter, tag releases.
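
For the alerting item, a minimal webhook sketch is shown below; the webhook URL is a placeholder you would generate in Slack (Teams works the same way with its own connector URL).

#!/bin/bash
# notify-slack.sh - post a one-line alert to an incoming webhook (placeholder URL)
WEBHOOK_URL="https://hooks.slack.com/services/T000/B000/XXXX"
MESSAGE="${1:-VMware health check finished on $(hostname)}"

curl -s -X POST -H 'Content-Type: application/json' \
     -d "{\"text\": \"${MESSAGE}\"}" \
     "$WEBHOOK_URL"

# Example: ./notify-slack.sh "Daily health check complete - see DailyHealth CSV"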