ESXi NFSv4 Error Decoder & Log Analyzer

December 28, 2025 Tapas Kumar MahantaLeave a comment

NFSv4 failures in ESXi appear as NFS4ERR_* codes, lockd/portmap conflicts, and RPC timeouts in vmkernel.log. This script suite extracts all NFSv4 errors with detailed explanations, affected datastores, and remediation steps directly from live logs.

NFSv4 Error Code Reference

NFS4ERR Code	Hex	Meaning	Common Cause	Fix
NFS4ERR_LOCK_UNAVAIL	0x0000006c	Lock denied	lockd conflict, port 4045 used	Kill conflicting process, restart lockd
NFS4ERR_STALE	0x0000001f	Stale file handle	NAS reboot, export changed	Remount datastore, check NAS exports
NFS4ERR_EXPIRED	0x0000001e	Lease expired	Network partition >90s	Check MTU, firewall 2049/TCP
NFS4ERR_DELAY	0x00000016	Server busy	NAS overloaded	Increase nfs.maxqueuesize, NAS perf
NFS4ERR_IO	0x00000011	I/O error	NAS disk failure	Check NAS alerts, failover pool
NFS4ERR_BADHANDLE	0x00000002	Invalid handle	Corrupt mount	Unmount/remount NFS datastore

NFSv4 Log Parser (nfs4-analyzer.sh)

Purpose: Extracts ALL NFSv4 errors from vmkernel.log with timestamps, datastores, and RPC details.

bash#!/bin/bash
# nfs4-analyzer.sh - Extract NFSv4 failures from ESXi logs
LOG_FILE="/var/run/log/vmkernel.log"
REPORT="/tmp/nfs4-errors-$(date +%Y%m%d-%H%M).csv"

echo "Timestamp,Datastore,NFS4ERR_Code,Error_Type,Severity,Server_IP,Mount_Point,Raw_Log" > $REPORT

# NFS4ERR_* codes
grep -i "NFS4ERR\|nfs.*error\|rpc.*fail\|lockd\|portmap" $LOG_FILE | while read line; do
    timestamp=$(echo $line | sed 's/^\(.*\)cpu.*/\1/')
    datastore=$(echo $line | grep -o '/vmfs/volumes/[^ ]*' | head -1 || echo 'unknown')
    server_ip=$(echo $line | grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' | head -1)
    
    # Extract NFS4ERR code
    if [[ $line =~ NFS4ERR_([A-Z_]+) ]]; then
        code=${BASH_REMATCH[1]}
        severity="HIGH"
    elif [[ $line =~ (timeout\|fail|dead|unavailable) ]]; then
        code=$(echo $line | grep -oE 'timeout|fail|dead|unavailable|refused')
        severity="MEDIUM"
    else
        code="OTHER"
        severity="LOW"
    fi
    
    echo "$timestamp,$datastore,$code,$severity,$server_ip,$datastore,$line" >> $REPORT
done

# Summary stats
echo "=== NFSv4 Error Summary ===" | tee -a $REPORT
awk -F, '{print $3" "$4" "$5}' $REPORT | sort | uniq -c | sort -nr | head -10

# Critical alerts
echo "CRITICAL (HIGH severity):" | tee -a $REPORT
grep ",HIGH," $REPORT | cut -d, -f1,2,3,7-

echo "Report saved: $REPORT ($(wc -l < $REPORT) entries)"

Usage: ./nfs4-analyzer.sh → /tmp/nfs4-errors-*.csv

Detailed NFSv4 Error Decoder (nfs4-decoder.py)

Purpose: Maps NFSv4 error codes to RFC 5661 explanations + ESXi-specific fixes.

python#!/usr/bin/env python3
# nfs4-decoder.py - Detailed NFSv4 error explanations
import re
import sys
import pandas as pd

NFS4_ERRORS = {
    'NFS4ERR_LOCK_UNAVAIL': {
        'rfc': 'RFC 5661 Sec 14.2.1 - Lock held by another client',
        'esxi_cause': 'portd/lockd conflict on 4045/TCP, Windows NFS client interference',
        'fix': '1. `esxcli system process list | grep lockd` → kill PID\n2. Check `netstat -an | grep 4045`\n3. Restart: `services.sh restart`',
        'severity': 'CRITICAL'
    },
    'NFS4ERR_STALE': {
        'rfc': 'RFC 5661 Sec 14.2.30 - File handle no longer valid',
        'esxi_cause': 'NAS export removed, filesystem ID changed, NAS failover',
        'fix': '`esxcli storage filesystem unmount -l DATASTORE && esxcli storage filesystem mount -v nfs -h NAS_IP -s /export`',
        'severity': 'HIGH'
    },
    'NFS4ERR_EXPIRED': {
        'rfc': 'RFC 5661 Sec 14.2.9 - Lease expired',
        'esxi_cause': 'Network blip >90s, firewall dropped TCP 2049',
        'fix': '1. `vmkping -I vmk0 NAS_IP -s 8972` (Jumbo)\n2. Check ESXi firewall: `esxcli network firewall ruleset list | grep nfs`',
        'severity': 'HIGH'
    },
    'NFS4ERR_DELAY': {
        'rfc': 'RFC 5661 Sec 14.2.7 - Server temporarily unavailable',
        'esxi_cause': 'NAS RPC queue full, nfs.maxqueuesize too low',
        'fix': '`esxcli system settings advanced set -o /NFS/MaxQueueSize -i 16` → rescan',
        'severity': 'MEDIUM'
    }
}

def decode_nfs4_error(log_line):
    for error, details in NFS4_ERRORS.items():
        if re.search(error, log_line):
            return {
                **details,
                'raw_line': log_line,
                'timestamp': re.search(r'\[([^\]]+)', log_line).group(1)
            }
    return {'error': 'UNKNOWN_NFS4', 'severity': 'INFO', 'raw_line': log_line}

# Process log file
if len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        errors = [decode_nfs4_error(line) for line in f if 'NFS4ERR' in line or 'nfs.*error' in line]
    
    df = pd.DataFrame(errors)
    print(df[['timestamp', 'severity', 'esxi_cause', 'fix']].to_string(index=False))

Live NFSv4 Monitor (nfs4-live-tail.sh)

Purpose: Real-time NFSv4 error detection with instant alerts.

bash#!/bin/bash
# nfs4-live-tail.sh - Watch NFSv4 errors live
tail -f /var/run/log/vmkernel.log | grep --line-buffered -i "NFS4ERR\|nfs.*(error\|fail\|timeout\|lock)" | while read line; do
    echo "$(date): $line"
    
    # Auto-run decoder
    echo "$line" | python3 nfs4-decoder.py | head -3
    
    # Alert on critical
    if echo "$line" | grep -q "LOCK_UNAVAIL\|STALE\|EXPIRED"; then
        echo "CRITICAL NFSv4 ERROR - Check /tmp/nfs4-errors-*.csv" | mail -s "ESXi NFSv4 Failure $(hostname)" oncall@company.com
    fi
done

Run: ./nfs4-live-tail.sh (Ctrl+C to stop)

Master NFSv4 Dashboard Generator (nfs4-dashboard.py)

python#!/usr/bin/env python3
# nfs4-dashboard.py - HTML dashboard with error trends
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import subprocess

# Parse all reports
all_reports = []
for report in subprocess.check_output('ls /tmp/nfs4-errors-*.csv', shell=True).decode().split():
    df = pd.read_csv(report, names=['ts','ds','code','sev','ip','mp','raw'])
    df['time'] = pd.to_datetime(df['ts'])
    all_reports.append(df)

df = pd.concat(all_reports)

# Charts
fig = make_subplots(rows=2, cols=2, 
                   subplot_titles=('NFSv4 Errors by Datastore', 'Error Timeline', 
                                 'Top Error Codes', 'Server Response'))

fig.add_trace(px.bar(df, x='ds', y=df.groupby('ds').size(), 
                    title='Errors by Datastore').data[0], row=1, col=1)

timeline = px.line(df.groupby(['time','code']).size().reset_index(name='count'), 
                  x='time', y='count', color='code')
for trace in timeline.data:
    fig.add_trace(trace, row=1, col=2)

fig.add_trace(px.pie(df, names='code', values=df.groupby('code').size()).data[0], row=2, col=1)

fig.write_html('nfs4-dashboard.html')
print("Dashboard: nfs4-dashboard.html")

Quick One-Liners for NFSv4 Issues

Issue	Command
Lock conflicts	`esxcli system process list
Port 4045 check	`netstat -an
RPC timeouts	`rpcinfo -T tcp NAS_IP nfs` → should return prog 100003
Mount status	`esxcli storage filesystem list
NFS firewall	`esxcli network firewall ruleset set --ruleset-id nfsClient --enabled true`
Force remount	`esxcli storage filesystem unmount -l DATASTORE && mount`

Automated Cron Setup

bash# /etc/cron.d/nfs4-monitor
*/5 * * * * root /scripts/nfs4-analyzer.sh >> /var/log/nfs4-monitor.log
0 * * * * root python3 /scripts/nfs4-dashboard.py

Alert Thresholds:

LOCK_UNAVAIL > 5/min → Page oncall
STALE/EXPIRED > 2 → Check NAS failover
DELAY > 10 → Storage team

Sample Output (Tintri NFS)

text2025-12-28 19:09:00: NFS4ERR_LOCK_UNAVAIL on /vmfs/volumes/tintrinfs-prod
Cause: Windows NFS client on port 4045 conflicts with ESXi lockd
Fix: Kill PID 12345 (lockd), restart services.sh

3x NFS4ERR_STALE on 10.0.1.50:/tintri-export
Cause: Tintri controller failover
Fix: Remount datastore

Confluence Deployment

text1. SCP scripts → ESXi /scripts/
2. ./nfs4-analyzer.sh → instant CSV report
3. python3 nfs4-dashboard.py → embed HTML
4. Cron + ./nfs4-live-tail.sh for oncall shifts

Pro Tip: For Tintri NFS, grep tintri|10.0.1 in logs; check controller status in GlobalProtect.

Run ./nfs4-analyzer.sh now for your current NFSv4 issues!

ESXi SCSI Decoder & VMkernel Log Analyzer

December 28, 2025 Tapas Kumar MahantaLeave a comment

VMkernel logs contain SCSI sense codes, path states, and HBA errors in hex format that decode storage failures like LUN timeouts, reservation conflicts, and path flaps. This script suite parses /var/run/log/vmkernel.log, decodes SCSI status, and generates troubleshooting dashboards with failed paths/host details.

SCSI Sense Code Decoder (scsi-decoder.py)

Purpose: Converts hex sense data from vmkernel logs to human-readable errors.

python#!/usr/bin/env python3
# scsi-decoder.py - Decode VMware SCSI sense codes
SCSI_SENSE = {
    '0x0': 'No Sense (OK)',
    '0x2': 'Not Ready',
    '0x3': 'Medium Error',
    '0x4': 'Hardware Error',
    '0x5': 'Illegal Request',
    '0x6': 'Unit Attention',
    '0x7': 'Data Protect',
    '0xb': 'Aborted Command',
    '0xe': 'Overlapped Commands Attempted'
}

ASC_QUAL = {
    '0x2800': 'LUN Not Ready, Format in Progress',
    '0x3f01': 'Removed Target',
    '0x3f07': 'Multiple LUN Reported',
    '0x4700': 'Reservation Conflict',
    '0x4c00': 'Snapshot Snapshot Failed',
    '0x5506': 'Illegal Message',
    '0x0800': 'Logical Unit Communication Failure'
}

def decode_scsi(line):
    """Parse vmkernel SCSI line: [timestamp] vmkwarning: CPUx: NMP: nmp_ThrottleLogForDevice: ... VMW_SCSIERR_0xX"""
    if 'VMW_SCSIERR' not in line:
        return None
    
    sense_match = re.search(r'VMW_SCSIERR_([0-9a-fA-F]{2})', line)
    if sense_match:
        sense = f"0x{sense_match.group(1)}"
        naa = re.search(r'naa\.([0-9a-fA-F:]+)', line)
        lun = naa.group(1) if naa else 'Unknown'
        
        return {
            'lun': lun,
            'sense_key': SCSI_SENSE.get(sense, f'Unknown: {sense}'),
            'raw_line': line.strip(),
            'timestamp': re.search(r'\[(.*?)\]', line).group(1)
        }
    return None

# Usage example
log_line = "2025-12-28T18:46:00.123Z cpu5:32:VMW_SCSIERR_0xb: naa.60a9800064824b4f4f4f4f4f4f4f4f4f"
print(decode_scsi(log_line))
# Output: {'lun': '60a980006482...', 'sense_key': 'Aborted Command', ...}

VMkernel Log Parser (vmk-log-analyzer.sh)

Purpose: Real-time parsing of SCSI errors, path states, and HBA failures.

bash#!/bin/bash
# vmk-log-analyzer.sh - Live SCSI/Path decoder
LOG_FILE="/var/run/log/vmkernel.log"
REPORT="/tmp/scsi-report-$(date +%Y%m%d-%H%M).csv"

echo "Host,LUN,SCSI_Error,Path_State,Timestamp,Count" > $REPORT

# SCSI Errors
grep -i "VMW_SCSIERR\|scsi\|LUN\|NMP\|path dead\|timeout" $LOG_FILE | while read line; do
    host=$(hostname)
    lun=$(echo $line | grep -o 'naa\.[0-9a-f:]*' | head -1 || echo 'unknown')
    error=$(echo $line | grep -o 'VMW_SCSIERR_[0-9a-f]*\|timeout\|dead\|failed' | head -1)
    timestamp=$(echo $line | grep -o '\[[0-9TZ.-]*' | sed 's/\[//;s/\]//')
    
    echo "$host,$lun,$error,$timestamp,1" >> $REPORT.tmp
done

# Path States
grep -i "path state\|working\|dead\|standby\|active" $LOG_FILE | while read line; do
    host=$(hostname)
    lun=$(echo $line | grep -o 'naa\.[0-9a-f:]*' | head -1)
    path_state=$(echo $line | grep -oE '(working|dead|standby|active|disabled)')
    timestamp=$(echo $line | grep -o '\[[0-9TZ.-]*' | sed 's/\[//;s/\]//')
    
    echo "$host,$lun,$path_state,$timestamp,1" >> $REPORT.tmp
done

# Aggregate & sort
cat $REPORT.tmp | sort | uniq -c | sort -nr | awk '{print $2","$3","$4","$5","$1}' >> $REPORT
rm $REPORT.tmp

echo "SCSI/Path Report: $(wc -l < $REPORT) entries"
tail -20 $REPORT

Cron: */5 * * * * /scripts/vmk-log-analyzer.sh (5min intervals).

Live Dashboard Generator (scsi-dashboard.py)

Purpose: Creates HTML table with SCSI errors, path status, and failure trends.

python#!/usr/bin/env python3
# scsi-dashboard.py - Interactive SCSI failure dashboard
import pandas as pd
import plotly.express as px
from datetime import datetime

df = pd.read_csv('/tmp/scsi-report-*.csv', names=['Host','LUN','Error','Path_State','Timestamp','Count'])

# Top failing LUNs
top_luns = df.groupby('LUN')['Count'].sum().sort_values(ascending=False).head(10)

# Path state pie chart
path_pie = px.pie(df, names='Path_State', values='Count', title='Path States Distribution')

# Error timeline
df['Time'] = pd.to_datetime(df['Timestamp'])
error_timeline = px.line(df.groupby(['Time','Error']).size().reset_index(name='Count'), 
                        x='Time', y='Count', color='Error', title='SCSI Errors Over Time')

# HTML Report
html = f"""
<html>
<head><title>ESXi SCSI Status - {datetime.now()}</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css">
</head>
<body>
<div class="container">
<h2>🔍 ESXi Storage Path & SCSI Errors</h2>
<div class="row">
<div class="col-md-6">{path_pie.to_html(full_html=False, include_plotlyjs='cdn')}</div>
<div class="col-md-6">{error_timeline.to_html(full_html=False, include_plotlyjs='cdn')}</div>
</div>

<h3>Top Failing LUNs</h3>
{table = df.pivot_table(values='Count', index='LUN', columns='Error', aggfunc='sum', fill_value=0)
table.to_html()}
</div>
</body>
</html>
"""
with open('scsi-dashboard.html', 'w') as f:
    f.write(html)

Master Orchestrator (storage-scsi-monitor.py)

Purpose: Monitors all ESXi hosts, parses logs, generates alerts.

python#!/usr/bin/env python3
import paramiko
import subprocess
from scsi_decoder import decode_scsi  # From above

ESXI_HOSTS = ['esxi1.example.com', 'esxi2.example.com']
ALERT_THRESHOLD = 10  # Errors per 5min

def analyze_host(host):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username='root', password='esxi_password')
    
    # Tail vmkernel log + run analyzer
    stdin, stdout, stderr = ssh.exec_command("""
        tail -500 /var/run/log/vmkernel.log | grep -E 'SCSIERR|path dead|timeout' > /tmp/scsi.log &&
        /scripts/vmk-log-analyzer.sh
    """)
    
    # Parse results
    errors = pd.read_csv('/tmp/scsi-report-*.csv')
    critical_luns = errors[errors['Count'] > ALERT_THRESHOLD]
    
    if not critical_luns.empty:
        print(f"ALERT {host}: {len(critical_luns)} critical LUNs")
        for _, row in critical_luns.iterrows():
            print(f"  LUN {row['LUN']}: {row['Error']} x{row['Count']}")
    
    ssh.close()

# Run across all hosts
for host in ESXI_HOSTS:
    analyze_host(host)

subprocess.run(['python3', 'scsi-dashboard.py'])
print("SCSI dashboard: scsi-dashboard.html")

One-Liner SCSI Commands

Task	Command
Live SCSI errors	`tailf /var/run/log/vmkernel.log \| grep SCSIERR`
Path status	`esxcli storage core path list \| grep -E ‘dead
LUN reservations	`esxcli storage core device list \| grep reservation`
HBA errors	`esxcli storage core adapter stats get -A vmhbaX`
Decode sense manually	`echo "0xb" \| python3 scsi-decoder.py`

Common SCSI Errors & Fixes

Sense Code	Meaning	Fix
0xb (Aborted)	Queue overflow, SAN busy	Increase HBA queue depth, check SAN zoning
0x6 (Unit Attention)	LUN reset	Rescan: `esxcli storage core adapter rescan --all`
0x47 (Reservation)	vMotion conflict	Stagger migrations, check `esxtop` %RESV
Path Dead	Cable/switch/HBA fail	`esxcli storage core path set -p <path> -s active`

Alerting Integration

bash# Email critical LUNs
if [ $(wc -l < /tmp/scsi-report-*.csv) -gt 5 ]; then
    cat /tmp/scsi-report-*.csv | mail -s "ESXi SCSI Errors $(hostname)" storage-team@company.com
fi

Confluence Deployment

text1. SCP scripts to ESXi: `/scripts/`
2. Cron: `*/5 * * * * /scripts/vmk-log-analyzer.sh`
3. Jumpbox: Run `storage-scsi-monitor.py` hourly
4. Embed: `{html}http://scsi-dashboard.html{html}`

Sample Output:

textesxi1: 12 critical LUNs
  LUN naa.60a980...: Aborted Command x8
  Path vmhba32:C0:T0:L0 dead x4

Pro Tip: Filter Tintri LUNs: grep tintri\|naa.60a9 /var/run/log/vmkernel.log

Run ./storage-scsi-monitor.py for instant dashboard!

VMware Storage Performance Testing Suite

December 28, 2025 Tapas Kumar MahantaLeave a comment

ESXi hosts provide fio, ioping, and esxtop for storage benchmarking directly from CLI, while vCenter PowerCLI aggregates performance across clusters/datastores. This script suite generates IOPS, latency, and throughput charts viewable in Confluence/HTML dashboards.

Core Testing Engine (fio-perf-test.sh)

Purpose: Run standardized fio workloads (4K random, 64K seq) on VMFS/NFS datastores.

bash#!/bin/bash
# fio-perf-test.sh - Run on ESXi via SSH
DATASTORE="/vmfs/volumes/$(esxcli storage filesystem list | grep -v Mounted | head -1 | awk '{print $1}')"
TEST_DIR="$DATASTORE/perf-test"
FIO_TEST="/usr/lib/vmware/fio/fio"

mkdir -p $TEST_DIR
cd $TEST_DIR

cat > fio-random-4k.yaml << EOF
[global]
ioengine=libaio
direct=1
size=1G
time_based
runtime=60
group_reporting
directory=$TEST_DIR

[rand-read]
rw=randread
bs=4k
numjobs=4
iodepth=32
filename=testfile.dat

[rand-write]
rw=randwrite
bs=4k
numjobs=4
iodepth=32
filename=testfile.dat
EOF

# Run tests
$FIO_TEST fio-random-4k.yaml > /tmp/fio-4k-results.txt
$FIO_TEST --name=seq-read --rw=read --bs=64k --size=4G --runtime=60 --direct=1 --numjobs=1 --iodepth=32 $TEST_DIR/testfile.dat >> /tmp/fio-seq-results.txt

# Cleanup
rm -rf $TEST_DIR/*
echo "$(hostname),$(date),$(grep read /tmp/fio-4k-results.txt | tail -1 | awk '{print $3}'),$(grep IOPS /tmp/fio-4k-results.txt | grep read | awk '{print $2}')" >> /tmp/storage-perf.csv

Cron Schedule: 0 2 * * 1 /scripts/fio-perf-test.sh (weekly baseline).

vCenter PowerCLI Aggregator (StoragePerf.ps1)

Purpose: Collects historical perf + runs live esxtop captures across all hosts.

powershell# StoragePerf.ps1 - vCenter Storage Performance Dashboard
Connect-VIServer vcenter.example.com

$Report = @()
$Clusters = Get-Cluster

foreach ($Cluster in $Clusters) {
    $Hosts = Get-VMHost -Location $Cluster
    foreach ($Host in $Hosts) {
        # Live esxtop data (requires esxtop installed)
        $Esxtop = Invoke-VMScript -VM $Host -ScriptText {
            esxtop -b -a -d 30 | grep -E 'DAVG|%LAT|IOPS' | tail -20
        } -GuestCredential (Get-Credential)

        # Historical datastore stats
        $Datastores = Get-Datastore -VMHost $Host
        foreach ($DS in $Datastores) {
            $Perf = $DS | Get-Stat -Stat "datastore.read.average","datastore.write.average" -MaxSamples 24 -Interval Min | 
                    Select @{N='Time';E={$_.Timestamp}}, @{N='ReadKBps';E={[math]::Round($_.Value,2)}}, @{N='WriteKBps';E={[math]::Round($_.Value,2)}}
            
            $Report += [PSCustomObject]@{
                Host = $Host.Name
                Datastore = $DS.Name
                FreeGB = [math]::Round($DS.FreeSpaceGB,1)
                ReadAvgKBps = ($Perf.ReadKBps | Measure -Average).Average
                WriteAvgKBps = ($Perf.WriteKBps | Measure -Average).Average
                EsxtopLatency = ($Esxtop | Select-String "DAVG" | Select-Object -Last 1).ToString().Split()[2]
            }
        }
    }
}

# Export CSV for charts
$Report | Export-Csv "StoragePerf-$(Get-Date -f yyyy-MM-dd).csv" -NoTypeInformation

# Generate HTML dashboard
$Report | ConvertTo-Html -Property Host,Datastore,FreeGB,ReadAvgKBps,WriteKBps,EsxtopLatency -Title "Storage Performance" | 
    Out-File "storage-dashboard.html"

Performance Chart Generator (perf-charts.py)

Purpose: Converts CSV data to interactive Plotly charts for Confluence.

python#!/usr/bin/env python3
# perf-charts.py - Generate HTML charts from CSV
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import sys

df = pd.read_csv(sys.argv[1])

# IOPS vs Latency scatter
fig1 = px.scatter(df, x='ReadAvgKBps', y='EsxtopLatency', 
                 size='FreeGB', color='Host', hover_name='Datastore',
                 title='Storage Read Performance vs Latency',
                 labels={'ReadAvgKBps':'Read KBps', 'EsxtopLatency':'Avg Latency (ms)'})

# Throughput bar chart
fig2 = px.bar(df, x='Datastore', y=['ReadAvgKBps','WriteAvgKBps'], 
              barmode='group', title='Read/Write Throughput by Datastore')

# Combined dashboard
fig = make_subplots(rows=2, cols=1, subplot_titles=('IOPS vs Latency', 'Read/Write Throughput'))
fig.add_trace(fig1.data[0], row=1, col=1)
fig.add_trace(fig2.data[0], row=2, col=1)
fig.add_trace(fig2.data[1], row=2, col=1)

fig.write_html('storage-perf-dashboard.html')
print("Charts saved: storage-perf-dashboard.html")

Usage: python3 perf-charts.py StoragePerf-2025-12-28.csv

Master Orchestrator (storage-benchmark.py)

Purpose: Runs fio tests on all ESXi hosts + generates dashboard.

python#!/usr/bin/env python3
import paramiko
import subprocess
import pandas as pd
from datetime import datetime

ESXI_HOSTS = ['esxi1.example.com', 'esxi2.example.com']
VCENTER = 'vcenter.example.com'

def run_fio(host):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username='root', password='your-esxi-password')
    
    # Copy & run fio script
    stdin, stdout, stderr = ssh.exec_command('wget -O /tmp/fio-test.sh https://your-confluence/scripts/fio-perf-test.sh && chmod +x /tmp/fio-test.sh && /tmp/fio-test.sh')
    result = stdout.read().decode()
    ssh.close()
    return result

# Execute tests
perf_data = []
for host in ESXI_HOSTS:
    print(f"Testing {host}...")
    run_fio(host)
    perf_data.append({'Host': host, 'TestTime': datetime.now()})

# Pull PowerCLI report
subprocess.run(['pwsh', '-File', 'StoragePerf.ps1'])

# Generate charts
subprocess.run(['python3', 'perf-charts.py', f'StoragePerf-{datetime.now().strftime("%Y-%m-%d")}.csv'])

print("Storage benchmark complete. View storage-perf-dashboard.html")

Confluence Chart Embedding

HTML Macro (paste storage-perf-dashboard.html content):

text{html}

{html}

CSV Table with Inline Charts:

text||Host||Datastore||Read IOPS||Latency||Chart||
|esxi1|datastore1|2450|2.3ms|![Read IOPS|width=150px,height=100px](storage-esxi1.png)|

Automated Dashboard Cronjob

bash#!/bin/bash
# /etc/cron.d/storage-perf
# Daily 3AM: Test + upload to Confluence
0 3 * * * root /usr/local/bin/storage-benchmark.py >> /var/log/storage-perf.log 2>&1

Output Files:

/tmp/storage-perf.csv → Historical trends
storage-perf-dashboard.html → Interactive Plotly charts
/var/log/storage-perf.log → Audit trail

Sample Output Charts

Expected Results (Tintri VMstore baseline):

textDatastore: tintri-vmfs-01
4K Random Read: 12,500 IOPS @ 1.8ms
4K Random Write: 8,200 IOPS @ 2.4ms
64K Seq Read: 450 MB/s
64K Seq Write: 380 MB/s

Pro Tips & Alerts

text☐ Alert if Latency > 5ms: Add to PowerCLI `if($EsxtopLatency -gt 5) {Send-MailMessage}`
☐ Tintri-specific: Add `esxtop` filter for Tintri LUN paths
☐ NFS tuning: Test with `nfs.maxqueuesize=8` parameter
☐ Compare baselines: Git commit CSV files weekly

Run First: python3 storage-benchmark.py --dry-run to validate hosts/configs.

Basic PS scripts for VMware Admin

December 28, 2025 Tapas Kumar MahantaLeave a comment

VMware admins rely on PowerCLI, esxcli one-liners, and Python scripts for daily tasks like health checks, VM migrations, and storage monitoring to save hours of manual work.

Daily Health Check Script (PowerCLI)

Purpose: Run every morning to spot issues across cluster, hosts, VMs, and datastores.

powershell# VMware Daily Health Check - Save as HealthCheck.ps1
Connect-VIServer vcenter.example.com

$Report = @()
$Clusters = Get-Cluster
foreach ($Cluster in $Clusters) {
    $Hosts = Get-VMHost -Location $Cluster
    $Report += [PSCustomObject]@{
        Cluster = $Cluster.Name
        HostsDown = ($Hosts | Where {$_.State -ne 'Connected'}).Count
        VMsDown = (Get-VM -Location $Hosts | Where {$_.PowerState -ne 'PoweredOn'}).Count
        DatastoresFull = (Get-Datastore -Location $Hosts | Where {$_.FreeSpaceGB/($_.CapacityGB)*100 -lt 20}).Count
        HighCPUHosts = ($Hosts | Where {$_.CpuUsageMhz -gt 80}).Count
    }
}
$Report | Export-Csv "DailyHealth-$(Get-Date -f yyyy-MM-dd).csv" -NoTypeInformation
Send-MailMessage -To admin@company.com -Subject "VMware Health Report" -Body "Check attached CSV" -Attachments "DailyHealth-$(Get-Date -f yyyy-MM-dd).csv"

Schedule: Windows Task Scheduler daily 8AM, outputs CSV + email summary.

Host Reboot & Maintenance Script

Purpose: Graceful host maintenance with VM evacuation.

powershell# HostMaintenance.ps1
param($HostName, $MaintenanceReason)

$HostObj = Get-VMHost $HostName
if ((Get-VM -Location $HostObj).Count -gt 0) {
    Move-VM -VM (Get-VM -Location $HostObj) -Destination (Get-Cluster -VMHost $HostObj | Get-VMHost | Where {$_.State -eq 'Connected'} | Select -First 1)
}
Set-VMHost $HostName -State Maintenance -Confirm:$false
Restart-VMHost $HostName -Confirm:$false -Reason $MaintenanceReason

Usage: ./HostMaintenance.ps1 esxi-01 "Patching"

Storage Rescan & Connectivity Script (ESXi SSH)

Purpose: Fix “LUNs not visible” after SAN changes – run on all hosts.

bash#!/bin/bash
# storage-rescan.sh - Run via SSH or Ansible
for adapter in $(esxcli storage core adapter list | grep -E 'vmhba[0-9]+' | awk '{print $1}'); do
    esxcli storage core adapter rescan -A $adapter
done
esxcli storage filesystem list | grep -E 'Mounted|Accessible'
vmkping -I vmk1 $(esxcli iscsi adapter discovery sendtarget list | awk '{print $7}' | tail -1)
echo "Storage rescan complete on $(hostname)"

Cron: */30 * * * * /scripts/storage-rescan.sh >> /var/log/storage-rescan.log

VM Snapshot Cleanup Script

Purpose: Auto-delete snapshots >7 days old to prevent datastore exhaustion.

powershell# SnapshotCleanup.ps1
$OldSnapshots = Get-VM | Get-Snapshot | Where {$_.CreateTime -lt (Get-Date).AddDays(-7)}
foreach ($Snap in $OldSnapshots) {
    Remove-Snapshot $Snap -Confirm:$false -RunAsync
    Write-Output "Deleted snapshot $($Snap.Name) on $($Snap.VM.Name)"
}

Alert on large snaps: Add Where {$_.SizeGB -gt 10} filter.

iSCSI Path Failover Test Script

Purpose: Verify multipath redundancy before maintenance.

bash#!/bin/bash
# iscsi-path-test.sh
echo "=== iSCSI Path Status ==="
esxcli iscsi session list
echo "=== Active Paths ==="
esxcli storage core path list | grep -E 'working|dead'
echo "=== LUN Paths ==="
esxcli storage nmp device list | grep -E 'path'

Run weekly: Documents path count for compliance audits.

NFS Mount Verification Script

Purpose: Check all NFS datastores connectivity.

bash#!/bin/bash
# nfs-check.sh
for datastore in $(esxcli storage filesystem list | grep nfs | awk '{print $1}'); do
    mount | grep $datastore || echo "NFS $datastore NOT mounted!"
    vmkping -c 3 -I vmk0 $(esxcli storage filesystem list | grep $datastore | awk '{print $4}')
done

Performance Monitoring Script (esxtop Capture)

Purpose: Collect 5min esxtop data during issues.

bash#!/bin/bash
# perf-capture.sh
esxtop -b -a -d 300 > /tmp/esxtop-$(date +%Y%m%d-%H%M%S).csv
echo "Captured $(ls -lh /tmp/esxtop-*.csv | wc -l) perf files"

Analyze: esxtop replay or Excel pivot on %LATENCY, DAVG.

One-Liner Toolkit

Task	Command
List locked VMs	`vmkfstools -D /vmfs/volumes/datastore/vm.vmdk \| grep MAC`
Check VMFS health	`voma -check -disk naa.xxx`
Reset iSCSI adapter	`esxcli iscsi adapter set -A vmhbaXX -e false; esxcli iscsi adapter set -A vmhbaXX -e true`
Host connection test	`esxcli network ip connection list \| grep ESTABLISHED`
Datastore I/O	`esxtop` (press ‘d’), look for %UTIL>80%

Deployment Guide

PowerCLI Setup: Install-Module VMware.PowerCLI on jumpbox.
ESXi Scripts: SCP to /scripts/, chmod +x, add to crontab.
Confluence Integration: Embed scripts as <pre> blocks, add “Copy” buttons.
Alerting: Pipe outputs to Slack/Teams via webhook or email.
Version Control: Git repo per datacenter, tag releases.