ESXi NFSv4 Error Decoder & Log Analyzer

NFSv4 failures on ESXi show up in vmkernel.log as NFS4ERR_* status codes, RPC timeouts, and (on hosts that also mount NFSv3 datastores) lockd/portmap conflicts. The scripts below extract those entries from the live log and pair each one with an explanation, the affected datastore, and remediation steps.

NFSv4 Error Code Reference

| NFS4ERR Code | Value (decimal, hex) | Meaning | Common Cause | Fix |
|---|---|---|---|---|
| NFS4ERR_LOCK_UNAVAIL | not in RFC 5661; a denied lock is NFS4ERR_DENIED, 10010 (0x0000271a) | Lock denied | lockd conflict, port 4045 in use | Kill conflicting process, restart lockd |
| NFS4ERR_STALE | 70 (0x00000046) | Stale file handle | NAS reboot, export changed | Remount datastore, check NAS exports |
| NFS4ERR_EXPIRED | 10011 (0x0000271b) | Lease expired | Network partition >90s | Check MTU, firewall 2049/TCP |
| NFS4ERR_DELAY | 10008 (0x00002718) | Server busy | NAS overloaded | Tune NFS.MaxQueueDepth, check NAS performance |
| NFS4ERR_IO | 5 (0x00000005) | I/O error | NAS disk failure | Check NAS alerts, failover pool |
| NFS4ERR_BADHANDLE | 10001 (0x00002711) | Invalid handle | Corrupt mount | Unmount/remount NFS datastore |
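
When a log line or packet capture shows only a number, a small lookup can translate it back to a status name. The sketch below uses the decimal values from the `nfsstat4` enumeration in RFC 7530 / RFC 5661; `NFS4ERR_DENIED` is included because it is the RFC status for a denied lock request. The helper names `describe_status` and `parse_status` are illustrative, not part of any ESXi tool.

```python
# Numeric status values as listed in the nfsstat4 enumeration (RFC 7530 / RFC 5661).
NFS4_STATUS = {
    "NFS4ERR_IO": 5,
    "NFS4ERR_STALE": 70,
    "NFS4ERR_BADHANDLE": 10001,
    "NFS4ERR_DELAY": 10008,
    "NFS4ERR_DENIED": 10010,
    "NFS4ERR_EXPIRED": 10011,
}

# Reverse map for decoding a raw number back to its name.
NAME_BY_VALUE = {value: name for name, value in NFS4_STATUS.items()}


def parse_status(text):
    """Accept decimal ('10011') or hex ('0x271b') text and return the integer."""
    return int(text, 0)


def describe_status(value):
    """Return 'NAME (decimal, hex)' for a known status, or a fallback string."""
    name = NAME_BY_VALUE.get(value)
    if name is None:
        return f"unknown status {value} ({value:#x})"
    return f"{name} ({value}, {value:#06x})"


if __name__ == "__main__":
    for raw in ("70", "0x271b", "12345"):
        print(raw, "->", describe_status(parse_status(raw)))
```

Only the statuses discussed in this post are listed; extend the dictionary from the RFC if other codes appear in your logs.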

NFSv4 Log Parser (nfs4-analyzer.sh)

Purpose: Extracts ALL NFSv4 errors from vmkernel.log with timestamps, datastores, and RPC details.

```bash
#!/bin/bash
# nfs4-analyzer.sh - Extract NFSv4 failures from ESXi logs
# Needs bash ([[ =~ ]]); the stock ESXi shell is BusyBox ash, so run this where
# bash exists, e.g. against a copied log: ./nfs4-analyzer.sh vmkernel.log
LOG_FILE="${1:-/var/run/log/vmkernel.log}"
REPORT="/tmp/nfs4-errors-$(date +%Y%m%d-%H%M).csv"

# Seven columns, matching the row written in the loop below
echo "Timestamp,Datastore,NFS4ERR_Code,Severity,Server_IP,Mount_Point,Raw_Log" > "$REPORT"

# NFS4ERR_* codes plus generic NFS/RPC/lock messages
grep -i "NFS4ERR\|nfs.*error\|rpc.*fail\|lockd\|portmap" "$LOG_FILE" | while IFS= read -r line; do
    timestamp=${line%% *}    # first space-delimited field of the log line
    datastore=$(echo "$line" | grep -o '/vmfs/volumes/[^ ,]*' | head -1)
    datastore=${datastore:-unknown}
    server_ip=$(echo "$line" | grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' | head -1)

    # Extract NFS4ERR code, else fall back to a generic failure keyword
    if [[ $line =~ NFS4ERR_([A-Z_]+) ]]; then
        code=${BASH_REMATCH[1]}
        severity="HIGH"
    elif [[ $line =~ (timeout|fail|dead|unavailable|refused) ]]; then
        code=${BASH_REMATCH[1]}
        severity="MEDIUM"
    else
        code="OTHER"
        severity="LOW"
    fi

    # Quote the raw line (doubling embedded quotes) so commas do not split the row
    q='"'
    raw=${line//$q/$q$q}
    echo "$timestamp,$datastore,$code,$severity,$server_ip,$datastore,\"$raw\"" >> "$REPORT"
done

# Summary stats (stdout only, so the CSV stays parseable)
echo "=== NFSv4 Error Summary ==="
tail -n +2 "$REPORT" | awk -F, '{print $3" "$4" "$5}' | sort | uniq -c | sort -nr | head -10

# Critical alerts
echo "CRITICAL (HIGH severity):"
grep ",HIGH," "$REPORT" | cut -d, -f1,2,3,7-

echo "Report saved: $REPORT ($(($(wc -l < "$REPORT") - 1)) entries)"
```

Usage: ./nfs4-analyzer.sh → /tmp/nfs4-errors-*.csv
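
The same per-line extraction can be sketched in Python, where the `csv` module takes care of quoting raw log lines that contain commas. This is a minimal illustration, not a replacement for the script: the sample log line is invented for the example, and the assumption that the timestamp is the first space-delimited field should be checked against your own vmkernel.log.

```python
import csv
import io
import re

DATASTORE_RE = re.compile(r"/vmfs/volumes/[^\s,]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
CODE_RE = re.compile(r"NFS4ERR_[A-Z_]+")
SOFT_RE = re.compile(r"timeout|fail|dead|unavailable|refused", re.IGNORECASE)


def parse_line(line):
    """Turn one log line into the seven report fields."""
    line = line.rstrip("\n")
    timestamp = line.split(" ", 1)[0]  # assumed: timestamp is the first field
    ds = DATASTORE_RE.search(line)
    ip = IPV4_RE.search(line)
    code = CODE_RE.search(line)
    soft = SOFT_RE.search(line)
    if code:
        label, severity = code.group(0)[len("NFS4ERR_"):], "HIGH"
    elif soft:
        label, severity = soft.group(0).lower(), "MEDIUM"
    else:
        label, severity = "OTHER", "LOW"
    datastore = ds.group(0) if ds else "unknown"
    return [timestamp, datastore, label, severity,
            ip.group(0) if ip else "", datastore, line]


def to_csv(rows):
    """Serialize rows with proper CSV quoting."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical log line, made up for this example.
    sample = ("2025-12-28T19:09:00.123Z cpu4:2098765)NFS41: op failed, "
              "NFS4ERR_STALE on /vmfs/volumes/tintrinfs-prod from 10.0.1.50")
    print(to_csv([parse_line(sample)]), end="")
```

Because the raw line is written through `csv.writer`, a reader such as `csv.reader` or `pandas.read_csv` recovers it as a single field even when it contains commas.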

Detailed NFSv4 Error Decoder (nfs4-decoder.py)

Purpose: Maps NFSv4 error codes to RFC 5661 explanations + ESXi-specific fixes.

```python
#!/usr/bin/env python3
# nfs4-decoder.py - Detailed NFSv4 error explanations
# Standard library only. Reads the log file named in argv[1], or stdin when
# no file is given (nfs4-live-tail.sh pipes single lines in).
import re
import sys

NFS4_ERRORS = {
    'NFS4ERR_LOCK_UNAVAIL': {
        'rfc': 'Not an RFC 5661 status; closest is NFS4ERR_DENIED (Sec 15.1) - Lock held by another client',
        'esxi_cause': 'portd/lockd conflict on 4045/TCP, Windows NFS client interference',
        'fix': '1. `esxcli system process list | grep lockd` → kill PID\n2. Check `esxcli network ip connection list | grep 4045`\n3. Restart: `services.sh restart`',
        'severity': 'CRITICAL'
    },
    'NFS4ERR_STALE': {
        'rfc': 'RFC 5661 Sec 15.1 - File handle no longer valid',
        'esxi_cause': 'NAS export removed, filesystem ID changed, NAS failover',
        'fix': '`esxcli storage nfs41 remove -v DATASTORE && esxcli storage nfs41 add -H NAS_IP -s /export -v DATASTORE`',
        'severity': 'HIGH'
    },
    'NFS4ERR_EXPIRED': {
        'rfc': 'RFC 5661 Sec 15.1 - Lease expired',
        'esxi_cause': 'Network blip >90s, firewall dropped TCP 2049',
        'fix': '1. `vmkping -I vmk0 -d -s 8972 NAS_IP` (Jumbo)\n2. Check ESXi firewall: `esxcli network firewall ruleset list | grep -i nfs`',
        'severity': 'HIGH'
    },
    'NFS4ERR_DELAY': {
        'rfc': 'RFC 5661 Sec 15.1 - Server temporarily unavailable',
        'esxi_cause': 'NAS RPC queue full; review NFS.MaxQueueDepth',
        'fix': '`esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 16`',
        'severity': 'MEDIUM'
    }
}

# Lines worth decoding: explicit NFS4ERR codes or generic "nfs ... error" messages
PATTERN = re.compile(r'NFS4ERR|nfs.*error', re.IGNORECASE)


def decode_nfs4_error(log_line):
    log_line = log_line.rstrip('\n')
    # The timestamp is taken as the first space-delimited field of the line
    timestamp = log_line.split(' ', 1)[0]
    for error, details in NFS4_ERRORS.items():
        if error in log_line:
            return {'error': error, **details, 'timestamp': timestamp, 'raw_line': log_line}
    # Unknown lines carry the same keys so the report loop never hits a KeyError
    return {'error': 'UNKNOWN_NFS4', 'severity': 'INFO', 'esxi_cause': '', 'fix': '',
            'timestamp': timestamp, 'raw_line': log_line}


if __name__ == '__main__':
    source = open(sys.argv[1]) if len(sys.argv) > 1 else sys.stdin
    with source:
        errors = [decode_nfs4_error(line) for line in source if PATTERN.search(line)]

    for e in errors:
        print(f"{e['timestamp']}  {e['severity']}  {e['error']}")
        print(f"  Cause: {e['esxi_cause']}")
        print(f"  Fix: {e['fix']}")
```

Live NFSv4 Monitor (nfs4-live-tail.sh)

Purpose: Real-time NFSv4 error detection with instant alerts.

```bash
#!/bin/bash
# nfs4-live-tail.sh - Watch NFSv4 errors live
# Assumes python3 and a configured `mail` command are on PATH.
tail -F /var/run/log/vmkernel.log | grep --line-buffered -iE "NFS4ERR|nfs.*(error|fail|timeout|lock)" | while IFS= read -r line; do
    echo "$(date): $line"

    # Auto-run decoder
    echo "$line" | python3 nfs4-decoder.py | head -3

    # Alert on critical
    if echo "$line" | grep -q "LOCK_UNAVAIL\|STALE\|EXPIRED"; then
        echo "CRITICAL NFSv4 ERROR - Check /tmp/nfs4-errors-*.csv" | mail -s "ESXi NFSv4 Failure $(hostname)" oncall@company.com
    fi
done
```

Run: ./nfs4-live-tail.sh (Ctrl+C to stop)

Master NFSv4 Dashboard Generator (nfs4-dashboard.py)

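```python
#!/usr/bin/env python3
# nfs4-dashboard.py - HTML dashboard with error trends
# Requires pandas and plotly, so run it on a workstation rather than on the ESXi host.
import glob

import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

COLUMNS = ['ts', 'ds', 'code', 'sev', 'ip', 'mp', 'raw']

# Parse all reports; header=0 swaps each CSV's header row for COLUMNS
reports = sorted(glob.glob('/tmp/nfs4-errors-*.csv'))
if not reports:
    raise SystemExit('No /tmp/nfs4-errors-*.csv reports found')

df = pd.concat((pd.read_csv(r, header=0, names=COLUMNS) for r in reports),
               ignore_index=True)
df['time'] = pd.to_datetime(df['ts'], errors='coerce')
df = df.dropna(subset=['time'])          # drop rows whose timestamp did not parse
df['minute'] = df['time'].dt.floor('min')

# Charts: the pie needs a 'domain' cell, the others are ordinary x/y plots
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=('NFSv4 Errors by Datastore', 'Error Timeline',
                                    'Top Error Codes', 'Errors by Server IP'),
                    specs=[[{'type': 'xy'}, {'type': 'xy'}],
                           [{'type': 'domain'}, {'type': 'xy'}]])

by_ds = df['ds'].value_counts()
fig.add_trace(go.Bar(x=by_ds.index, y=by_ds.values, name='Datastore'), row=1, col=1)

# One line per error code, counted per minute
timeline = df.groupby(['minute', 'code']).size().reset_index(name='count')
for code, grp in timeline.groupby('code'):
    fig.add_trace(go.Scatter(x=grp['minute'], y=grp['count'],
                             mode='lines+markers', name=str(code)), row=1, col=2)

by_code = df['code'].value_counts()
fig.add_trace(go.Pie(labels=by_code.index, values=by_code.values), row=2, col=1)

by_ip = df['ip'].value_counts()
fig.add_trace(go.Bar(x=by_ip.index, y=by_ip.values, name='Server IP'), row=2, col=2)

fig.write_html('nfs4-dashboard.html')
print("Dashboard: nfs4-dashboard.html")
```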
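
Where pandas and plotly are not installed, a rough text summary of the same reports can be produced with the standard library alone. The sketch below assumes the analyzer's column order (datastore in the second field, error code in the third) and skips any row with fewer than seven fields; `summarize` is a hypothetical helper name.

```python
import csv
from collections import Counter


def summarize(lines):
    """Count errors per code and per datastore from report CSV lines.

    `lines` is any iterable of CSV text lines, e.g. an open file object.
    """
    by_code = Counter()
    by_datastore = Counter()
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 7:
            continue  # not a data row (blank line, summary text, ...)
        by_datastore[row[1]] += 1
        by_code[row[2]] += 1
    return by_code, by_datastore


if __name__ == "__main__":
    import glob

    total_codes, total_ds = Counter(), Counter()
    for path in glob.glob("/tmp/nfs4-errors-*.csv"):
        with open(path, newline="") as handle:
            codes, stores = summarize(handle)
        total_codes += codes
        total_ds += stores
    for code, count in total_codes.most_common(10):
        print(f"{count:6d}  {code}")
    for store, count in total_ds.most_common(10):
        print(f"{count:6d}  {store}")
```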

Quick One-Liners for NFSv4 Issues

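| Issue | Command |
|---|---|
| Lock conflicts | `esxcli system process list \| grep lockd` |
| Port 4045 check | `esxcli network ip connection list \| grep 4045` |
| RPC timeouts | `rpcinfo -T tcp NAS_IP nfs` → should return prog 100003 (run from a host that has rpcinfo) |
| Mount status | `esxcli storage filesystem list` |
| NFS firewall | `esxcli network firewall ruleset set --ruleset-id nfs41Client --enabled true` |
| Force remount | `esxcli storage nfs41 remove -v DATASTORE && esxcli storage nfs41 add -H NAS_IP -s /export -v DATASTORE` |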

Automated Cron Setup

```bash
# /etc/cron.d/nfs4-monitor
*/5 * * * * root /scripts/nfs4-analyzer.sh >> /var/log/nfs4-monitor.log 2>&1
0 * * * * root python3 /scripts/nfs4-dashboard.py
```

Alert Thresholds:

  • LOCK_UNAVAIL > 5/min → Page oncall
  • STALE/EXPIRED > 2 → Check NAS failover
  • DELAY > 10 → Storage team
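
The thresholds above can be sketched as a small check over per-window error counts. Two points are assumptions of this example rather than statements from the list: STALE and EXPIRED are counted together, and all three rules are applied to the same one-minute window, since only the LOCK_UNAVAIL rule names a window. `evaluate_alerts` is a hypothetical helper.

```python
def evaluate_alerts(counts):
    """Return the actions triggered by one window of error counts.

    `counts` maps an error code without the NFS4ERR_ prefix
    (e.g. "STALE") to the number of occurrences in the window.
    """
    actions = []
    if counts.get("LOCK_UNAVAIL", 0) > 5:
        actions.append("Page oncall")
    # Assumption: STALE and EXPIRED share one combined threshold.
    if counts.get("STALE", 0) + counts.get("EXPIRED", 0) > 2:
        actions.append("Check NAS failover")
    if counts.get("DELAY", 0) > 10:
        actions.append("Storage team")
    return actions


if __name__ == "__main__":
    window = {"LOCK_UNAVAIL": 6, "STALE": 2, "EXPIRED": 1, "DELAY": 4}
    for action in evaluate_alerts(window):
        print(action)
```

Each comparison is strict, matching the ">" in the list, so a count exactly at the limit does not fire.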

Sample Output (Tintri NFS)

```text
2025-12-28 19:09:00: NFS4ERR_LOCK_UNAVAIL on /vmfs/volumes/tintrinfs-prod
Cause: Windows NFS client on port 4045 conflicts with ESXi lockd
Fix: Kill PID 12345 (lockd), restart services.sh

3x NFS4ERR_STALE on 10.0.1.50:/tintri-export
Cause: Tintri controller failover
Fix: Remount datastore
```

Confluence Deployment

```text
1. SCP scripts → ESXi /scripts/
2. ./nfs4-analyzer.sh → instant CSV report
3. python3 nfs4-dashboard.py → embed HTML
4. Cron + ./nfs4-live-tail.sh for oncall shifts
```

Pro Tip: For Tintri NFS, filter the logs with `grep -E 'tintri|10\.0\.1'` and check controller status in Tintri Global Center.

Run ./nfs4-analyzer.sh now for your current NFSv4 issues!
