ESXi NFSv4 Error Decoder & Log Analyzer

NFSv4 failures in ESXi appear as NFS4ERR_* codes, lockd/portmap conflicts, and RPC timeouts in vmkernel.log. This script suite extracts all NFSv4 errors with detailed explanations, affected datastores, and remediation steps directly from live logs.

NFSv4 Error Code Reference

NFS4ERR Code	Hex	Meaning	Common Cause	Fix
NFS4ERR_LOCK_UNAVAIL	n/a	Lock denied (not an RFC 5661 code; the RFC's lock-conflict codes are NFS4ERR_DENIED, 0x0000271a, and NFS4ERR_LOCKED, 0x0000271c)	lockd conflict, port 4045 used	Kill conflicting process, restart lockd
NFS4ERR_STALE	0x00000046	Stale file handle	NAS reboot, export changed	Remount datastore, check NAS exports
NFS4ERR_EXPIRED	0x0000271b	Lease expired	Network partition >90s	Check MTU, firewall 2049/TCP
NFS4ERR_DELAY	0x00002718	Server busy	NAS overloaded	Tune NFS.MaxQueueDepth, check NAS performance
NFS4ERR_IO	0x00000005	I/O error	NAS disk failure	Check NAS alerts, failover pool
NFS4ERR_BADHANDLE	0x00002711	Invalid handle	Corrupt mount	Unmount/remount NFS datastore
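
When a log line carries only the numeric status, the lookup can be scripted. The sketch below is a minimal example, assuming the status values from the nfsstat4 enum in RFC 5661 (Section 15); it covers only the RFC-defined codes listed above, and the function name `decode_status` is hypothetical.

```python
# Numeric nfsstat4 values (RFC 5661, Section 15) for the RFC-defined codes above.
NFS4_STATUS = {
    5: "NFS4ERR_IO",
    70: "NFS4ERR_STALE",
    10001: "NFS4ERR_BADHANDLE",
    10008: "NFS4ERR_DELAY",
    10011: "NFS4ERR_EXPIRED",
}

def decode_status(value):
    """Return the NFS4ERR_* name for a status given as an int or a hex/decimal string."""
    if isinstance(value, str):
        value = int(value, 0)  # base 0 accepts "0x46" as well as "70"
    return NFS4_STATUS.get(value, f"UNKNOWN({value})")

print(decode_status("0x46"))
print(decode_status(10011))
```

Anything missing from the dictionary comes back as `UNKNOWN(n)`, so an unexpected status is reported rather than silently dropped.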

NFSv4 Log Parser (nfs4-analyzer.sh)

Purpose: Extracts ALL NFSv4 errors from vmkernel.log with timestamps, datastores, and RPC details.

#!/bin/bash
# nfs4-analyzer.sh - Extract NFSv4 failures from ESXi logs
# Requires bash for [[ =~ ]]. Pass a log path as $1 to analyze a copied log
# (default: the live ESXi log).
LOG_FILE="${1:-/var/run/log/vmkernel.log}"
REPORT="/tmp/nfs4-errors-$(date +%Y%m%d-%H%M).csv"

# Seven columns, matching the fields written for each row below
echo "Timestamp,Datastore,Code,Severity,Server_IP,Mount_Point,Raw_Log" > "$REPORT"

# Scan for NFS4ERR_* codes and related NFS/RPC/lock messages
grep -iE 'NFS4ERR|nfs.*error|rpc.*fail|lockd|portmap' "$LOG_FILE" | while IFS= read -r line; do
    timestamp=${line%% *}    # vmkernel.log lines start with the timestamp
    datastore=$(echo "$line" | grep -o '/vmfs/volumes/[^ ]*' | head -1)
    datastore=${datastore:-unknown}
    server_ip=$(echo "$line" | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -1)

    # Classify: explicit NFS4ERR code, generic failure keyword, or other
    if [[ $line =~ NFS4ERR_([A-Z_]+) ]]; then
        code=${BASH_REMATCH[1]}
        severity="HIGH"
    elif [[ $line =~ (timeout|fail|dead|unavailable|refused) ]]; then
        code=${BASH_REMATCH[1]}
        severity="MEDIUM"
    else
        code="OTHER"
        severity="LOW"
    fi

    # Quote the raw line (doubling embedded quotes) so its commas stay in one field
    raw=$(printf '%s' "$line" | sed 's/"/""/g')
    echo "$timestamp,$datastore,$code,$severity,$server_ip,$datastore,\"$raw\"" >> "$REPORT"
done

# Summary stats (printed only; appending them to the report would corrupt the CSV)
echo "=== NFSv4 Error Summary ==="
awk -F, 'NR > 1 {print $3" "$4" "$5}' "$REPORT" | sort | uniq -c | sort -nr | head -10

# Critical alerts
echo "CRITICAL (HIGH severity):"
grep ",HIGH," "$REPORT" | cut -d, -f1,2,3,7-

echo "Report saved: $REPORT ($(($(wc -l < "$REPORT") - 1)) entries)"

Usage: ./nfs4-analyzer.sh → /tmp/nfs4-errors-*.csv
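
The field extraction the parser performs can also be sketched in Python, which makes it easier to test against individual lines. This is an illustrative sketch only: it assumes each log line starts with its timestamp, the function name `parse_line` is hypothetical, and the sample line is made up for the example rather than taken from a real host.

```python
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DS_RE = re.compile(r"/vmfs/volumes/\S+")
CODE_RE = re.compile(r"NFS4ERR_[A-Z_]+")

def parse_line(line):
    """Split one log line into the fields used by the CSV report."""
    ts = line.split(None, 1)[0] if line.strip() else ""
    ds = DS_RE.search(line)
    ip = IP_RE.search(line)
    code = CODE_RE.search(line)
    return {
        "timestamp": ts,
        "datastore": ds.group(0) if ds else "unknown",
        "server_ip": ip.group(0) if ip else "",
        "code": code.group(0) if code else "OTHER",
    }

# Hypothetical log line, invented for illustration only.
sample = ("2025-12-28T19:09:00.123Z cpu4:2097234)NFS41: op failed with "
          "NFS4ERR_STALE on /vmfs/volumes/tintrinfs-prod server 10.0.1.50")
print(parse_line(sample))
```

Keeping each pattern in its own compiled regex means a missing field falls back to a default instead of aborting the whole line.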

Detailed NFSv4 Error Decoder (nfs4-decoder.py)

Purpose: Maps NFSv4 error codes to RFC 5661 explanations + ESXi-specific fixes.

#!/usr/bin/env python3
# nfs4-decoder.py - Detailed NFSv4 error explanations
# Standard library only. Reads the log file given as argv[1], or stdin if none is given.
import re
import sys

NFS4_ERRORS = {
    'NFS4ERR_LOCK_UNAVAIL': {
        'rfc': 'Not an RFC 5661 code (closest: NFS4ERR_DENIED, NFS4ERR_LOCKED) - Lock held by another client',
        'esxi_cause': 'portmap/lockd conflict on 4045/TCP, Windows NFS client interference',
        'fix': '1. `esxcli system process list | grep lockd` → kill PID\n2. Check `esxcli network ip connection list | grep 4045`\n3. Restart: `services.sh restart`',
        'severity': 'CRITICAL'
    },
    'NFS4ERR_STALE': {
        'rfc': 'RFC 5661 Sec 15.1 - File handle no longer valid',
        'esxi_cause': 'NAS export removed, filesystem ID changed, NAS failover',
        'fix': '`esxcli storage nfs41 remove -v DATASTORE && esxcli storage nfs41 add -H NAS_IP -s /export -v DATASTORE`',
        'severity': 'HIGH'
    },
    'NFS4ERR_EXPIRED': {
        'rfc': 'RFC 5661 Sec 15.1 - Lease expired',
        'esxi_cause': 'Network blip >90s, firewall dropped TCP 2049',
        'fix': '1. `vmkping -I vmk0 -d -s 8972 NAS_IP` (jumbo frames, no fragmentation)\n2. Check ESXi firewall: `esxcli network firewall ruleset list | grep -i nfs`',
        'severity': 'HIGH'
    },
    'NFS4ERR_DELAY': {
        'rfc': 'RFC 5661 Sec 15.1 - Server temporarily unavailable',
        'esxi_cause': 'NAS RPC queue full, NFS.MaxQueueDepth not tuned for the array',
        'fix': '`esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64` (example value; follow the NAS vendor guidance)',
        'severity': 'MEDIUM'
    }
}

# Lines worth decoding: any NFS4ERR_* code or a generic NFS error message
NFS_LINE = re.compile(r'NFS4ERR|nfs.*error', re.IGNORECASE)

def decode_nfs4_error(log_line):
    log_line = log_line.rstrip('\n')
    parts = log_line.split(None, 1)
    timestamp = parts[0] if parts else ''  # log lines start with the timestamp
    for error, details in NFS4_ERRORS.items():
        if error in log_line:
            return {'error': error, **details, 'timestamp': timestamp, 'raw_line': log_line}
    # Same keys as a known error, so the report loop never hits a missing field
    return {'error': 'UNKNOWN_NFS4', 'rfc': '', 'esxi_cause': '', 'fix': '',
            'severity': 'INFO', 'timestamp': timestamp, 'raw_line': log_line}

def main():
    source = open(sys.argv[1]) if len(sys.argv) > 1 else sys.stdin
    with source:
        errors = [decode_nfs4_error(line) for line in source if NFS_LINE.search(line)]
    for e in errors:
        print(f"{e['timestamp']}  {e['severity']}  {e['error']}")
        print(f"  Cause: {e['esxi_cause']}")
        print(f"  Fix: {e['fix']}")

if __name__ == '__main__':
    main()

Live NFSv4 Monitor (nfs4-live-tail.sh)

Purpose: Real-time NFSv4 error detection with instant alerts.

#!/bin/bash
# nfs4-live-tail.sh - Watch NFSv4 errors live
tail -f /var/run/log/vmkernel.log | grep --line-buffered -iE 'NFS4ERR|nfs.*(error|fail|timeout|lock)' | while IFS= read -r line; do
    echo "$(date): $line"

    # Auto-run decoder on this line (the decoder takes a file path, so hand it stdin)
    echo "$line" | python3 nfs4-decoder.py /dev/stdin | head -3

    # Alert on critical codes; assumes a working `mail` command, which ESXi does not ship
    if echo "$line" | grep -qE 'LOCK_UNAVAIL|STALE|EXPIRED'; then
        echo "CRITICAL NFSv4 ERROR - Check /tmp/nfs4-errors-*.csv" | mail -s "ESXi NFSv4 Failure $(hostname)" oncall@company.com
    fi
done

Run: ./nfs4-live-tail.sh (Ctrl+C to stop)
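
Because the monitor alerts on every matching line, a burst of identical errors can flood the on-call mailbox. One common way to limit that is a per-code cooldown; the sketch below is a hypothetical helper (the class name `AlertCooldown` and the 300-second window are assumptions, not part of the scripts above).

```python
import time

class AlertCooldown:
    """Suppress repeat alerts for the same key within `window` seconds."""

    def __init__(self, window=300.0, clock=time.monotonic):
        self.window = window
        self.clock = clock          # injectable so the logic can be tested
        self._last_sent = {}

    def should_alert(self, key):
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False            # still inside the cooldown for this key
        self._last_sent[key] = now
        return True

cooldown = AlertCooldown(window=300)
for code in ["STALE", "STALE", "EXPIRED"]:
    if cooldown.should_alert(code):
        print(f"alert: NFS4ERR_{code}")
```

The second STALE event is suppressed while the first EXPIRED event still goes through, since each code keeps its own timer.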

Master NFSv4 Dashboard Generator (nfs4-dashboard.py)

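#!/usr/bin/env python3
# nfs4-dashboard.py - HTML dashboard with error trends
# Requires pandas and plotly (not part of ESXi's bundled Python).
import glob

import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

COLUMNS = ['ts', 'ds', 'code', 'sev', 'ip', 'mp', 'raw']

# Parse all reports; header=0 drops each file's header row, malformed rows are skipped
reports = [pd.read_csv(path, names=COLUMNS, header=0, on_bad_lines='skip')
           for path in sorted(glob.glob('/tmp/nfs4-errors-*.csv'))]
if not reports:
    raise SystemExit('No /tmp/nfs4-errors-*.csv reports found')

df = pd.concat(reports, ignore_index=True)
df['time'] = pd.to_datetime(df['ts'], errors='coerce', utc=True)
df = df.dropna(subset=['time'])

# Charts: a pie trace needs a 'domain' subplot, the others use regular x/y axes
fig = make_subplots(rows=2, cols=2,
                    specs=[[{'type': 'xy'}, {'type': 'xy'}],
                           [{'type': 'domain'}, {'type': 'xy'}]],
                    subplot_titles=('NFSv4 Errors by Datastore', 'Error Timeline',
                                    'Top Error Codes', 'Errors by Server'))

by_ds = df['ds'].value_counts()
fig.add_trace(go.Bar(x=by_ds.index, y=by_ds.values, name='Datastore'), row=1, col=1)

# One line per error code, counted per minute
timeline = (df.assign(minute=df['time'].dt.floor('min'))
              .groupby(['minute', 'code']).size().reset_index(name='count'))
for code, grp in timeline.groupby('code'):
    fig.add_trace(go.Scatter(x=grp['minute'], y=grp['count'],
                             mode='lines+markers', name=code), row=1, col=2)

by_code = df['code'].value_counts()
fig.add_trace(go.Pie(labels=by_code.index, values=by_code.values), row=2, col=1)

by_ip = df['ip'].value_counts()
fig.add_trace(go.Bar(x=by_ip.index, y=by_ip.values, name='Server'), row=2, col=2)

fig.write_html('nfs4-dashboard.html')
print("Dashboard: nfs4-dashboard.html")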
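
Where pandas and plotly are not installed, the same per-datastore counts can be produced with the standard library alone. The sketch below is a minimal example, assuming the seven-column layout the dashboard reads (ts, ds, code, sev, ip, mp, raw); the function name `summarize` and the demo rows are hypothetical.

```python
import csv
import io
from collections import Counter

FIELDS = ["ts", "ds", "code", "sev", "ip", "mp", "raw"]

def summarize(csv_file):
    """Count errors per (datastore, code) from a seven-column report."""
    counts = Counter()
    for row in csv.DictReader(csv_file, fieldnames=FIELDS):
        if row["ts"] == "Timestamp":  # skip the header row
            continue
        counts[(row["ds"], row["code"])] += 1
    return counts

# Hypothetical report content, for illustration only.
demo = io.StringIO(
    "Timestamp,Datastore,Code,Severity,Server_IP,Mount_Point,Raw_Log\n"
    "2025-12-28T19:09:00Z,/vmfs/volumes/ds1,STALE,HIGH,10.0.1.50,/vmfs/volumes/ds1,raw\n"
    "2025-12-28T19:10:00Z,/vmfs/volumes/ds1,STALE,HIGH,10.0.1.50,/vmfs/volumes/ds1,raw\n"
    "2025-12-28T19:11:00Z,/vmfs/volumes/ds2,DELAY,HIGH,10.0.1.50,/vmfs/volumes/ds2,raw\n"
)
for (ds, code), n in summarize(demo).most_common():
    print(n, ds, code)
```

To run it against a real report, pass an open file object, for example `summarize(open(path, newline=""))`.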

Quick One-Liners for NFSv4 Issues

Issue	Command
Lock conflicts	`esxcli system process list | grep lockd`
Port 4045 check	`esxcli network ip connection list | grep 4045`
RPC timeouts	`rpcinfo -T tcp NAS_IP nfs` → should return prog 100003
Mount status	`esxcli storage filesystem list`
NFS firewall	`esxcli network firewall ruleset set --ruleset-id nfs41Client --enabled true`
Force remount	`esxcli storage nfs41 remove -v DATASTORE && esxcli storage nfs41 add -H NAS_IP -s /export -v DATASTORE`
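
A basic TCP reachability test against port 2049 helps separate firewall or routing problems from NFS-level errors before digging into the logs. The sketch below is a minimal example meant to run from any machine on the storage network; the function name `nfs_port_open` is hypothetical, and 10.0.1.50 is just the example NAS address used in this post.

```python
import socket

def nfs_port_open(host, port=2049, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, unreachable, and timed-out connections
        return False

# Replace the address with your NAS; prints True or False.
print(nfs_port_open("10.0.1.50", timeout=1.0))
```

A `True` result only shows that the port accepts connections; it does not confirm that the export is mountable.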

Automated Cron Setup

# /var/spool/cron/crontabs/root (ESXi has no /etc/cron.d, and its crontab entries take no user field)
*/5 * * * * /scripts/nfs4-analyzer.sh >> /var/log/nfs4-monitor.log 2>&1
0 * * * * python3 /scripts/nfs4-dashboard.py

Alert Thresholds:

LOCK_UNAVAIL > 5/min → Page oncall
STALE/EXPIRED > 2 → Check NAS failover
DELAY > 10 → Storage team
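
A per-minute threshold such as the LOCK_UNAVAIL rule can be evaluated with a sliding time window. The sketch below is one possible way to do it, not part of the scripts above; the class name `RateWindow` and the event times are hypothetical.

```python
from collections import deque

class RateWindow:
    """Count events inside a sliding time window."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self._events = deque()

    def add(self, ts):
        """Record an event at time ts (seconds) and return the count in the window."""
        self._events.append(ts)
        # Drop events that are window_s or more older than the newest one
        while self._events and ts - self._events[0] >= self.window_s:
            self._events.popleft()
        return len(self._events)

LOCK_UNAVAIL_PER_MIN = 5  # threshold from the list above

window = RateWindow(60)
for t in [0, 5, 10, 15, 20, 25]:  # six hypothetical events within 25 s
    count = window.add(t)
    if count > LOCK_UNAVAIL_PER_MIN:
        print(f"t={t}s: {count} LOCK_UNAVAIL in the last minute, page on-call")
```

Only the sixth event crosses the threshold here, so the sketch prints a single page request at t=25s.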

Sample Output (Tintri NFS)

2025-12-28 19:09:00: NFS4ERR_LOCK_UNAVAIL on /vmfs/volumes/tintrinfs-prod
Cause: Windows NFS client on port 4045 conflicts with ESXi lockd
Fix: Kill PID 12345 (lockd), restart services.sh

3x NFS4ERR_STALE on 10.0.1.50:/tintri-export
Cause: Tintri controller failover
Fix: Remount datastore

Confluence Deployment

1. SCP scripts → ESXi /scripts/
2. ./nfs4-analyzer.sh → instant CSV report
3. python3 nfs4-dashboard.py → embed HTML
4. Cron + ./nfs4-live-tail.sh for oncall shifts

Pro Tip: For Tintri NFS, filter the logs with `grep -E 'tintri|10\.0\.1'`; check controller status in GlobalProtect.

Run ./nfs4-analyzer.sh now for your current NFSv4 issues!

VMwareBlogs

"Unlocking the Power of Virtualization: Explore the Latest Insights and Innovations with VMware Blogs"
