VMkernel logs record SCSI sense codes, path states, and HBA errors as hex values. Decoding them identifies storage failures such as LUN timeouts, reservation conflicts, and path flaps. This script suite parses /var/run/log/vmkernel.log, decodes the SCSI status, and generates troubleshooting dashboards that list failed paths per host.
SCSI Sense Code Decoder (scsi-decoder.py)
Purpose: Converts hex sense data from vmkernel logs to human-readable errors.
#!/usr/bin/env python3
# scsi-decoder.py - Decode VMware SCSI sense codes
import re

# SCSI sense keys (SPC); 0xe is Miscompare, not "Overlapped Commands"
SCSI_SENSE = {
    '0x0': 'No Sense (OK)',
    '0x1': 'Recovered Error',
    '0x2': 'Not Ready',
    '0x3': 'Medium Error',
    '0x4': 'Hardware Error',
    '0x5': 'Illegal Request',
    '0x6': 'Unit Attention',
    '0x7': 'Data Protect',
    '0xb': 'Aborted Command',
    '0xe': 'Miscompare'
}

# ASC/ASCQ pairs, named per the T10 assignments. Reservation Conflict is a
# SCSI status (0x18), not an ASC, so it does not belong in this table.
ASC_QUAL = {
    '0x0404': 'Logical Unit Not Ready, Format in Progress',
    '0x0800': 'Logical Unit Communication Failure',
    '0x3f01': 'Microcode Has Been Changed',
    '0x3f0e': 'Reported LUNs Data Has Changed',
    '0x4700': 'SCSI Parity Error',
    '0x4900': 'Invalid Message Error',
    '0x4c00': 'Logical Unit Failed Self-Configuration'
}

def decode_scsi(line):
    """Parse vmkernel SCSI line: [timestamp] vmkwarning: CPUx: NMP: nmp_ThrottleLogForDevice: ... VMW_SCSIERR_0xX"""
    # The token carries a literal "0x" prefix followed by one or two hex digits
    sense_match = re.search(r'VMW_SCSIERR_0x([0-9a-fA-F]{1,2})', line)
    if not sense_match:
        return None
    sense = f"0x{sense_match.group(1).lower()}"
    naa = re.search(r'naa\.([0-9a-fA-F]+)', line)
    # Timestamps may or may not be wrapped in brackets, so accept both
    ts = re.search(r'\[?(\d{4}-\d{2}-\d{2}T[\d:.]+Z?)\]?', line)
    return {
        'lun': naa.group(1) if naa else 'Unknown',
        'sense_key': SCSI_SENSE.get(sense, f'Unknown: {sense}'),
        'raw_line': line.strip(),
        'timestamp': ts.group(1) if ts else 'Unknown'
    }

# Usage example
log_line = "2025-12-28T18:46:00.123Z cpu5:32:VMW_SCSIERR_0xb: naa.60a9800064824b4f4f4f4f4f4f4f4f4f"
print(decode_scsi(log_line))
# Output: {'lun': '60a980006482...', 'sense_key': 'Aborted Command', ...}
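The `ASC_QUAL` table is defined in the decoder but `decode_scsi` never consults it. vmkernel commonly reports a failed command in the form `H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0`, where the three trailing values are the sense key, ASC, and ASCQ. The following is a minimal sketch of decoding that form, assuming this line layout. `decode_sense_triplet`, the trimmed lookup tables, and the sample line are illustrative and not part of the original script.

```python
import re

# Illustrative subsets of the T10 tables; extend as needed.
SENSE_KEYS = {0x2: 'Not Ready', 0x5: 'Illegal Request',
              0x6: 'Unit Attention', 0xb: 'Aborted Command'}
ASC_ASCQ = {
    (0x04, 0x04): 'Logical Unit Not Ready, Format in Progress',
    (0x08, 0x00): 'Logical Unit Communication Failure',
    (0x29, 0x00): 'Power On, Reset, or Bus Device Reset Occurred',
    (0x3f, 0x0e): 'Reported LUNs Data Has Changed',
}

# Host/device/plugin status, then "Valid sense data:" or "Possible sense data:"
SENSE_RE = re.compile(
    r'H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)'
    r'.*?sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)'
)

def decode_sense_triplet(line):
    """Return decoded status fields, or None if the line has no sense data."""
    m = SENSE_RE.search(line)
    if m is None:
        return None
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in m.groups())
    return {
        'host_status': host,
        'device_status': device,
        'plugin_status': plugin,
        'sense_key': SENSE_KEYS.get(key, f'Unknown: {key:#x}'),
        'asc_ascq': ASC_ASCQ.get((asc, ascq), f'Unknown: {asc:#04x}/{ascq:#04x}'),
    }

# Simplified sample line (hypothetical device name)
sample = ('2025-12-28T18:46:00.123Z cpu5:32)NMP: nmp_ThrottleLogForDevice: '
          'Cmd 0x28 to dev "naa.60a9800064824b4f" failed '
          'H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0')
result = decode_sense_triplet(sample)
print(result)
```

Keying the ASC/ASCQ table on an integer pair avoids the string-formatting mismatches (`0x0` versus `0x00`) that a string-keyed table invites.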
VMkernel Log Parser (vmk-log-analyzer.sh)
Purpose: Near-real-time parsing (every 5 minutes via cron) of SCSI errors, path states, and HBA failures.
#!/bin/sh
# vmk-log-analyzer.sh - SCSI/path report from vmkernel.log
# ESXi ships BusyBox ash rather than bash, so this sticks to POSIX sh.
LOG_FILE="/var/run/log/vmkernel.log"
REPORT="/tmp/scsi-report-$(date +%Y%m%d-%H%M).csv"
TMP="$REPORT.tmp"
HOST=$(hostname)
# Timestamp at minute resolution, so repeated events aggregate into one row
TS_RE='^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}'

echo "Host,LUN,SCSI_Error,Path_State,Timestamp,Count" > "$REPORT"
: > "$TMP"

# SCSI errors (Path_State column left empty)
grep -iE 'VMW_SCSIERR|timeout|failed' "$LOG_FILE" | while IFS= read -r line; do
    lun=$(echo "$line" | grep -oE 'naa\.[0-9a-fA-F]+' | head -1)
    error=$(echo "$line" | grep -oiE 'VMW_SCSIERR_0x[0-9a-f]+|timeout|failed' | head -1)
    ts=$(echo "$line" | grep -oE "$TS_RE")
    echo "$HOST,${lun:-unknown},$error,,$ts" >> "$TMP"
done

# Path states (SCSI_Error column left empty)
grep -iE 'path state|working|dead|standby|active|disabled' "$LOG_FILE" | while IFS= read -r line; do
    lun=$(echo "$line" | grep -oE 'naa\.[0-9a-fA-F]+' | head -1)
    path_state=$(echo "$line" | grep -oiE 'working|dead|standby|active|disabled' | head -1)
    ts=$(echo "$line" | grep -oE "$TS_RE")
    echo "$HOST,${lun:-unknown},,$path_state,$ts" >> "$TMP"
done

# Aggregate: count identical rows, then move the count to the last CSV column
sort "$TMP" | uniq -c | sort -nr | awk '{n=$1; sub(/^ *[0-9]+ /, ""); print $0 "," n}' >> "$REPORT"
rm -f "$TMP"
echo "SCSI/Path Report: $(( $(wc -l < "$REPORT") - 1 )) entries"
tail -20 "$REPORT"
Cron: */5 * * * * /scripts/vmk-log-analyzer.sh (5min intervals).
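The `sort | uniq -c` aggregation is what turns individual log lines into the Count column. The same counting can be sketched in Python, which may be convenient on a jumpbox where the raw log has already been copied. `aggregate` and the sample lines are hypothetical, and the `VMW_SCSIERR_0x…` token follows this article's examples rather than a confirmed vmkernel format.

```python
import re
from collections import Counter

NAA_RE = re.compile(r'naa\.[0-9a-fA-F]+')
ERR_RE = re.compile(r'VMW_SCSIERR_0x[0-9a-fA-F]+|timeout|failed', re.IGNORECASE)
TS_RE = re.compile(r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}')  # minute resolution

def aggregate(lines):
    """Count (lun, error, minute) triples, most frequent first."""
    counts = Counter()
    for line in lines:
        err = ERR_RE.search(line)
        if err is None:
            continue  # not an error line
        lun = NAA_RE.search(line)
        ts = TS_RE.search(line)
        key = (lun.group(0) if lun else 'unknown',
               err.group(0),
               ts.group(0) if ts else '')
        counts[key] += 1
    return counts.most_common()

sample_lines = [
    '2025-12-28T18:46:00.123Z cpu5:32:VMW_SCSIERR_0xb: naa.60a9800064824b4f',
    '2025-12-28T18:46:31.900Z cpu1:32:VMW_SCSIERR_0xb: naa.60a9800064824b4f',
    '2025-12-28T18:47:10.000Z cpu2:40: command timeout on naa.600508b1001c',
    '2025-12-28T18:47:11.000Z cpu2:40: unrelated informational message',
]
top = aggregate(sample_lines)
for (lun, error, minute), count in top:
    print(f'{lun},{error},{minute},{count}')
```

Truncating the timestamp to the minute is a design choice: with full millisecond timestamps every row is unique and the count never rises above 1.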
Live Dashboard Generator (scsi-dashboard.py)
Purpose: Creates HTML table with SCSI errors, path status, and failure trends.
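#!/usr/bin/env python3
# scsi-dashboard.py - Interactive SCSI failure dashboard
import glob
from datetime import datetime

import pandas as pd
import plotly.express as px

COLUMNS = ['Host', 'LUN', 'Error', 'Path_State', 'Timestamp', 'Count']

# read_csv does not expand wildcards, so glob the report files first
files = sorted(glob.glob('/tmp/scsi-report-*.csv'))
if not files:
    raise SystemExit('No /tmp/scsi-report-*.csv files found')
# header=0 skips each file's own header row; names= applies ours instead
df = pd.concat((pd.read_csv(f, header=0, names=COLUMNS) for f in files), ignore_index=True)

# Top failing LUNs
top_luns = df.groupby('LUN')['Count'].sum().sort_values(ascending=False).head(10)

# Path state pie chart (rows without a path state are error rows)
path_pie = px.pie(df.dropna(subset=['Path_State']), names='Path_State', values='Count',
                  title='Path States Distribution')

# Error timeline: sum the counts per timestamp and error type
df['Time'] = pd.to_datetime(df['Timestamp'], errors='coerce')
timeline = (df.dropna(subset=['Time', 'Error'])
              .groupby(['Time', 'Error'], as_index=False)['Count'].sum())
error_timeline = px.line(timeline, x='Time', y='Count', color='Error',
                         title='SCSI Errors Over Time')

# Error counts per LUN, limited to the top failing LUNs.
# Built before the f-string: assignments are not allowed inside {...}.
table = df.pivot_table(values='Count', index='LUN', columns='Error', aggfunc='sum', fill_value=0)
table = table.loc[table.index.isin(top_luns.index)]

# HTML Report (plotly.js is loaded once, by the first chart)
html = f"""
<html>
<head><title>ESXi SCSI Status - {datetime.now():%Y-%m-%d %H:%M}</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css">
</head>
<body>
<div class="container">
<h2>🔍 ESXi Storage Path &amp; SCSI Errors</h2>
<div class="row">
<div class="col-md-6">{path_pie.to_html(full_html=False, include_plotlyjs='cdn')}</div>
<div class="col-md-6">{error_timeline.to_html(full_html=False, include_plotlyjs=False)}</div>
</div>
<h3>Top Failing LUNs</h3>
{table.to_html(classes='table table-striped')}
</div>
</body>
</html>
"""
with open('scsi-dashboard.html', 'w', encoding='utf-8') as f:
    f.write(html)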
Master Orchestrator (storage-scsi-monitor.py)
Purpose: Monitors all ESXi hosts, parses logs, generates alerts.
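#!/usr/bin/env python3
# storage-scsi-monitor.py - Run the analyzer on every host and flag noisy LUNs
import io
import os
import subprocess

import pandas as pd
import paramiko

ESXI_HOSTS = ['esxi1.example.com', 'esxi2.example.com']
ALERT_THRESHOLD = 10  # Errors per 5min
COLUMNS = ['Host', 'LUN', 'Error', 'Path_State', 'Timestamp', 'Count']
# Run the analyzer on the host, then print its newest report to stdout
REMOTE_CMD = ('/scripts/vmk-log-analyzer.sh >/dev/null && '
              'cat "$(ls -t /tmp/scsi-report-*.csv | head -1)"')

def analyze_host(host):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # lab convenience only
    # Assumption: the password is supplied via the ESXI_PASSWORD environment variable
    ssh.connect(host, username='root', password=os.environ['ESXI_PASSWORD'])
    try:
        _, stdout, stderr = ssh.exec_command(REMOTE_CMD)
        csv_text = stdout.read().decode()
        if stdout.channel.recv_exit_status() != 0:
            print(f"ERROR {host}: {stderr.read().decode().strip()}")
            return
    finally:
        ssh.close()

    # The report lives on the ESXi host; keep a local copy for scsi-dashboard.py
    with open(f'/tmp/scsi-report-{host}.csv', 'w') as f:
        f.write(csv_text)

    # Parse results
    errors = pd.read_csv(io.StringIO(csv_text), header=0, names=COLUMNS)
    critical_luns = errors[errors['Count'] > ALERT_THRESHOLD]
    if not critical_luns.empty:
        print(f"ALERT {host}: {len(critical_luns)} critical LUNs")
        for _, row in critical_luns.iterrows():
            print(f"  LUN {row['LUN']}: {row['Error']} x{row['Count']}")

if __name__ == '__main__':
    # Run across all hosts
    for host in ESXI_HOSTS:
        analyze_host(host)
    subprocess.run(['python3', 'scsi-dashboard.py'])
    print("SCSI dashboard: scsi-dashboard.html")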
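The alert rule in the orchestrator is a plain threshold on the Count column. The following is a minimal, pandas-free sketch of that check, assuming the six-column header that `vmk-log-analyzer.sh` writes. `critical_rows` and the inline sample report are illustrative only.

```python
import csv
import io

ALERT_THRESHOLD = 10  # same threshold as the orchestrator

def critical_rows(csv_text, threshold=ALERT_THRESHOLD):
    """Return report rows whose Count exceeds the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row['Count']) > threshold]

# Hypothetical report content
sample_report = """Host,LUN,SCSI_Error,Path_State,Timestamp,Count
esxi1,naa.60a9800064824b4f,VMW_SCSIERR_0xb,,2025-12-28T18:46,12
esxi1,naa.600508b1001c,timeout,,2025-12-28T18:47,3
"""

alerts = critical_rows(sample_report)
for row in alerts:
    print(f"LUN {row['LUN']}: {row['SCSI_Error']} x{row['Count']}")
# prints: LUN naa.60a9800064824b4f: VMW_SCSIERR_0xb x12
```

Because the comparison is strict (`>`), a LUN with exactly 10 errors in the window does not alert, which matches the orchestrator's `errors['Count'] > ALERT_THRESHOLD`.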
One-Liner SCSI Commands
Common SCSI Errors & Fixes
Alerting Integration
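# Email critical LUNs: send the newest report when it holds more than 5 lines
LATEST=$(ls -t /tmp/scsi-report-*.csv 2>/dev/null | head -1)
if [ -n "$LATEST" ] && [ "$(wc -l < "$LATEST")" -gt 5 ]; then
    mail -s "ESXi SCSI Errors $(hostname)" storage-team@company.com < "$LATEST"
fi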
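On systems without a `mail` binary, the same rule can be applied from the jumpbox in Python. This is a rough sketch using only the standard library. `build_alert`, `send_alert`, the sender address, and `smtp.example.com` are assumptions for illustration, and `send_alert` is defined but not called here because it needs a reachable SMTP relay.

```python
import smtplib
from email.message import EmailMessage

def build_alert(report_text, host, min_lines=5, to_addr='storage-team@company.com'):
    """Build an alert email if the report has more than min_lines lines, else None."""
    if len(report_text.strip().splitlines()) <= min_lines:
        return None
    msg = EmailMessage()
    msg['Subject'] = f'ESXi SCSI Errors {host}'
    msg['From'] = 'scsi-monitor@example.com'  # hypothetical sender
    msg['To'] = to_addr
    msg.set_content(report_text)
    return msg

def send_alert(msg, smtp_host='smtp.example.com'):
    """Send via an SMTP relay; the host name is a placeholder."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)

# Hypothetical report: header plus six rows (7 lines, above the 5-line rule)
report = 'Host,LUN,SCSI_Error,Path_State,Timestamp,Count\n' + ''.join(
    f'esxi1,naa.60a98000{i},timeout,,2025-12-28T18:46,1\n' for i in range(6)
)
msg = build_alert(report, 'esxi1')
print(msg['Subject'] if msg else 'below threshold, no mail')
```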
Confluence Deployment
1. SCP scripts to ESXi: `/scripts/`
2. Cron: `*/5 * * * * /scripts/vmk-log-analyzer.sh`
3. Jumpbox: Run `storage-scsi-monitor.py` hourly
4. Embed: `{html}http://scsi-dashboard.html{html}`
Sample Output:
esxi1: 12 critical LUNs
  LUN naa.60a980...: Aborted Command x8
  Path vmhba32:C0:T0:L0 dead x4
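Pro Tip: Filter Tintri LUNs: grep -E 'tintri|naa\.60a9' /var/run/log/vmkernel.log (quote the pattern; unquoted, the shell strips the backslash and grep then searches for a literal "|").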
Run ./storage-scsi-monitor.py for instant dashboard!