ESXi SCSI Decoder & VMkernel Log Analyzer

VMkernel logs record SCSI sense codes, path states, and HBA errors as hex values. Once decoded, these values point to storage failures such as LUN timeouts, reservation conflicts, and path flaps. This script suite parses /var/run/log/vmkernel.log, decodes SCSI status, and generates a troubleshooting dashboard that lists failed paths and per-host details.

SCSI Sense Code Decoder (scsi-decoder.py)

Purpose: Converts hex sense data from vmkernel logs to human-readable errors.

#!/usr/bin/env python3
# scsi-decoder.py - Decode VMware SCSI sense codes
import re

SCSI_SENSE = {
    '0x0': 'No Sense (OK)',
    '0x2': 'Not Ready',
    '0x3': 'Medium Error',
    '0x4': 'Hardware Error',
    '0x5': 'Illegal Request',
    '0x6': 'Unit Attention',
    '0x7': 'Data Protect',
    '0xb': 'Aborted Command',
    '0xe': 'Miscompare'
}

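# ASC/ASCQ pairs (T10 SPC assignments), keyed as 0x<ASC><ASCQ>.
# Reservation conflict is a SCSI device status (0x18), not an ASC/ASCQ pair.
ASC_QUAL = {
    '0x0404': 'Logical Unit Not Ready, Format in Progress',
    '0x0800': 'Logical Unit Communication Failure',
    '0x2800': 'Not Ready to Ready Change, Medium May Have Changed',
    '0x3f01': 'Microcode Has Been Changed',
    '0x3f0e': 'Reported LUNs Data Has Changed',
    '0x4700': 'SCSI Parity Error',
    '0x4c00': 'Logical Unit Failed Self-Configuration',
    '0x4e00': 'Overlapped Commands Attempted'
}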

def decode_scsi(line):
    """Parse vmkernel SCSI line: [timestamp] vmkwarning: CPUx: NMP: nmp_ThrottleLogForDevice: ... VMW_SCSIERR_0xX"""
    # Accept both VMW_SCSIERR_0xb and VMW_SCSIERR_0b
    sense_match = re.search(r'VMW_SCSIERR_(?:0x)?([0-9a-fA-F]{1,2})', line)
    if not sense_match:
        return None

    # Normalize to the '0xb' form used as SCSI_SENSE keys
    sense = f"0x{int(sense_match.group(1), 16):x}"
    naa = re.search(r'naa\.([0-9a-fA-F]+)', line)
    # The timestamp is either [bracketed] or a leading ISO 8601 field ending in Z
    ts = re.search(r'\[(.*?)\]', line) or re.match(r'\s*(\S+Z)', line)

    return {
        'lun': naa.group(1) if naa else 'Unknown',
        'sense_key': SCSI_SENSE.get(sense, f'Unknown: {sense}'),
        'raw_line': line.strip(),
        'timestamp': ts.group(1) if ts else None
    }

# Usage example (guarded so that importing decode_scsi prints nothing)
if __name__ == '__main__':
    log_line = "2025-12-28T18:46:00.123Z cpu5:32:VMW_SCSIERR_0xb: naa.60a9800064824b4f4f4f4f4f4f4f4f4f"
    print(decode_scsi(log_line))
    # Output: {'lun': '60a980006482...', 'sense_key': 'Aborted Command', ...}
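
The decoder above keys on a VMW_SCSIERR_0xN token and never consults the ASC/ASCQ table. ESXi's NMP layer usually logs failures in the form `H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0` instead, so a minimal sketch for that format may be useful. The function name `decode_nmp`, the lookup tables, and the sample log lines are illustrative assumptions, and the tables cover only a small subset of the T10 assignments.

```python
import re

# Subsets of the T10 tables; extend as needed.
SENSE_KEYS = {0x0: 'No Sense', 0x2: 'Not Ready', 0x3: 'Medium Error',
              0x4: 'Hardware Error', 0x5: 'Illegal Request',
              0x6: 'Unit Attention', 0x7: 'Data Protect', 0xb: 'Aborted Command'}
ASC_ASCQ = {(0x04, 0x04): 'Logical Unit Not Ready, Format in Progress',
            (0x08, 0x00): 'Logical Unit Communication Failure',
            (0x29, 0x00): 'Power On, Reset, or Bus Device Reset Occurred',
            (0x3f, 0x0e): 'Reported LUNs Data Has Changed'}
DEVICE_STATUS = {0x00: 'Good', 0x02: 'Check Condition', 0x08: 'Busy',
                 0x18: 'Reservation Conflict', 0x28: 'Task Set Full'}

# Device name, host/device/plugin status, and (optionally) valid sense data.
NMP_RE = re.compile(
    r'to dev "(?P<dev>[^"]+)".*?'
    r'H:0x(?P<h>[0-9a-f]+) D:0x(?P<d>[0-9a-f]+) P:0x(?P<p>[0-9a-f]+)'
    r'(?:.*?Valid sense data: 0x(?P<key>[0-9a-f]+) 0x(?P<asc>[0-9a-f]+) 0x(?P<ascq>[0-9a-f]+))?',
    re.IGNORECASE)

def decode_nmp(line):
    """Decode one NMP failure line; returns None if the line does not match."""
    m = NMP_RE.search(line)
    if not m:
        return None
    status = int(m.group('d'), 16)
    result = {
        'device': m.group('dev'),
        'host_status': int(m.group('h'), 16),
        'device_status': DEVICE_STATUS.get(status, f'Unknown: {status:#x}'),
        'sense_key': None,
        'additional': None,
    }
    if m.group('key') is not None:
        key = int(m.group('key'), 16)
        pair = (int(m.group('asc'), 16), int(m.group('ascq'), 16))
        result['sense_key'] = SENSE_KEYS.get(key, f'Unknown: {key:#x}')
        result['additional'] = ASC_ASCQ.get(pair, f'ASC/ASCQ {pair[0]:#04x}/{pair[1]:#04x}')
    return result

# Illustrative lines in the NMP format (not taken from a real host).
UNIT_ATTENTION = ('2025-12-28T18:46:00.123Z cpu5:2097220)NMP: nmp_ThrottleLogForDevice:3788: '
                  'Cmd 0x28 (0x459a40bfbec0, 0) to dev "naa.60a9800064824b4f4f4f4f4f4f4f4f4f" '
                  'on path "vmhba2:C0:T0:L3" Failed: H:0x0 D:0x2 P:0x0 '
                  'Valid sense data: 0x6 0x29 0x0. Act:NONE')
RESERVATION = ('2025-12-28T18:47:10.456Z cpu2:2097221)NMP: nmp_ThrottleLogForDevice:3788: '
               'Cmd 0x2a (0x459a40bfc1c0, 0) to dev "naa.60a9800064824b4f4f4f4f4f4f4f4f4f" '
               'on path "vmhba2:C0:T0:L3" Failed: H:0x0 D:0x18 P:0x0')

if __name__ == '__main__':
    print(decode_nmp(UNIT_ATTENTION))
    print(decode_nmp(RESERVATION))
```

Separating the device status (the `D:` field) from the sense key keeps reservation conflicts, which are a status rather than a sense code, from being misread as sense data.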

VMkernel Log Parser (vmk-log-analyzer.sh)

Purpose: Real-time parsing of SCSI errors, path states, and HBA failures.

#!/bin/sh
# vmk-log-analyzer.sh - Live SCSI/Path decoder (ESXi ships BusyBox sh, not bash)
LOG_FILE="/var/run/log/vmkernel.log"
REPORT="/tmp/scsi-report-$(date +%Y%m%d-%H%M).csv"
TMP="$REPORT.tmp"
HOST=$(hostname)

echo "Host,LUN,SCSI_Error,Path_State,Timestamp,Count" > "$REPORT"
: > "$TMP"

# SCSI Errors (Path_State column left empty)
grep -i "VMW_SCSIERR\|scsi\|LUN\|NMP\|path dead\|timeout" "$LOG_FILE" | while IFS= read -r line; do
    lun=$(echo "$line" | grep -o 'naa\.[0-9a-f]*' | head -1)
    error=$(echo "$line" | grep -o 'VMW_SCSIERR_[0-9a-fx]*\|timeout\|dead\|failed' | head -1)
    # Timestamps are unbracketed ISO 8601; truncate to the minute so repeats aggregate
    timestamp=$(echo "$line" | grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}')

    echo "$HOST,${lun:-unknown},$error,,$timestamp" >> "$TMP"
done

# Path States (SCSI_Error column left empty)
grep -i "path state\|working\|dead\|standby\|active" "$LOG_FILE" | while IFS= read -r line; do
    lun=$(echo "$line" | grep -o 'naa\.[0-9a-f]*' | head -1)
    path_state=$(echo "$line" | grep -oE '(working|dead|standby|active|disabled)' | head -1)
    timestamp=$(echo "$line" | grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}')

    echo "$HOST,${lun:-unknown},,$path_state,$timestamp" >> "$TMP"
done

# Aggregate & sort: move the uniq -c count to the last CSV column
sort "$TMP" | uniq -c | sort -nr | awk '{count=$1; sub(/^ *[0-9]+ /, ""); print $0 "," count}' >> "$REPORT"
rm -f "$TMP"

echo "SCSI/Path Report: $(wc -l < "$REPORT") lines"
tail -20 "$REPORT"

Cron: `*/5 * * * * /scripts/vmk-log-analyzer.sh` (5-minute intervals).
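
The aggregation step of the analyzer (count identical LUN/error rows) can also be sketched in Python, which makes it easier to test against saved log excerpts off-host. This is a minimal sketch, assuming per-minute grouping and unbracketed ISO 8601 timestamps; the function name `aggregate` and the sample lines are hypothetical.

```python
import re
from collections import Counter

TS_RE = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})')  # timestamp truncated to the minute
NAA_RE = re.compile(r'naa\.[0-9a-f]+', re.IGNORECASE)
ERR_RE = re.compile(r'VMW_SCSIERR_(?:0x)?[0-9a-f]+|timeout', re.IGNORECASE)

def aggregate(lines):
    """Count occurrences per (LUN, error token, minute)."""
    counts = Counter()
    for line in lines:
        err = ERR_RE.search(line)
        if not err:
            continue  # not an error line
        ts = TS_RE.match(line)
        naa = NAA_RE.search(line)
        key = (naa.group(0) if naa else 'unknown',
               err.group(0),
               ts.group(1) if ts else '')
        counts[key] += 1
    return counts

# Hypothetical log excerpt in the format used by the decoder's usage example.
SAMPLE = [
    '2025-12-28T18:46:00.123Z cpu5:32:VMW_SCSIERR_0xb: naa.60a9800064824b4f4f4f4f4f4f4f4f4f',
    '2025-12-28T18:46:41.900Z cpu3:32:VMW_SCSIERR_0xb: naa.60a9800064824b4f4f4f4f4f4f4f4f4f',
    '2025-12-28T18:47:02.010Z cpu1:32:VMW_SCSIERR_0x6: naa.60a9800064824b4f4f4f4f4f4f4f4f4f',
    '2025-12-28T18:47:05.000Z cpu1:32:heartbeat ok',
]

if __name__ == '__main__':
    for (lun, error, minute), count in aggregate(SAMPLE).most_common():
        print(f'{lun},{error},{minute},{count}')
```

Grouping by minute rather than by the full millisecond timestamp is what lets repeated errors accumulate into a count larger than one.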

Live Dashboard Generator (scsi-dashboard.py)

Purpose: Creates HTML table with SCSI errors, path status, and failure trends.

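#!/usr/bin/env python3
# scsi-dashboard.py - Interactive SCSI failure dashboard
import glob
from datetime import datetime

import pandas as pd
import plotly.express as px

# read_csv does not expand wildcards, so glob the reports and concatenate them
COLUMNS = ['Host', 'LUN', 'Error', 'Path_State', 'Timestamp', 'Count']
reports = sorted(glob.glob('/tmp/scsi-report-*.csv'))
df = pd.concat([pd.read_csv(p, names=COLUMNS, header=0) for p in reports], ignore_index=True)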

# Top failing LUNs
top_luns = df.groupby('LUN')['Count'].sum().sort_values(ascending=False).head(10)

# Path state pie chart (rows without a path state are skipped)
path_pie = px.pie(df.dropna(subset=['Path_State']), names='Path_State', values='Count',
                  title='Path States Distribution')

# Error timeline: sum the aggregated counts per timestamp and error
df['Time'] = pd.to_datetime(df['Timestamp'])
timeline = df.dropna(subset=['Error']).groupby(['Time', 'Error'])['Count'].sum().reset_index()
error_timeline = px.line(timeline, x='Time', y='Count', color='Error', title='SCSI Errors Over Time')

# HTML Report
# Build the pivot table first: an f-string placeholder cannot contain an assignment
table = (df[df['LUN'].isin(top_luns.index)]
         .pivot_table(values='Count', index='LUN', columns='Error', aggfunc='sum', fill_value=0))

html = f"""
<html>
<head><title>ESXi SCSI Status - {datetime.now():%Y-%m-%d %H:%M}</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css">
</head>
<body>
<div class="container">
<h2>🔍 ESXi Storage Path &amp; SCSI Errors</h2>
<div class="row">
<div class="col-md-6">{path_pie.to_html(full_html=False, include_plotlyjs='cdn')}</div>
<div class="col-md-6">{error_timeline.to_html(full_html=False, include_plotlyjs=False)}</div>
</div>

<h3>Top Failing LUNs</h3>
{table.to_html(classes='table table-striped')}
</div>
</body>
</html>
"""
with open('scsi-dashboard.html', 'w') as f:
    f.write(html)
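
Where pandas and plotly are not available (for example on a locked-down jumpbox), the "Top Failing LUNs" table can be produced with the standard library alone. The sketch below assumes the six-column report header written by the analyzer; the function name `top_luns_table` and the sample rows are hypothetical. Values are passed through `html.escape` because they originate from log text.

```python
import csv
import html
import io
from collections import defaultdict

def top_luns_table(csv_text, limit=10):
    """Return an HTML table of the LUNs with the highest summed Count."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row['LUN']] += int(row['Count'])
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]
    rows = ''.join(
        f'<tr><td>{html.escape(lun)}</td><td>{count}</td></tr>'
        for lun, count in ranked
    )
    return f'<table class="table"><tr><th>LUN</th><th>Count</th></tr>{rows}</table>'

# Hypothetical report content in the analyzer's CSV layout.
SAMPLE_CSV = """Host,LUN,SCSI_Error,Path_State,Timestamp,Count
esxi1,naa.aaa,VMW_SCSIERR_0xb,,2025-12-28T18:46,8
esxi1,naa.bbb,,dead,2025-12-28T18:46,4
esxi1,naa.bbb,timeout,,2025-12-28T18:47,9
"""

if __name__ == '__main__':
    print(top_luns_table(SAMPLE_CSV))
```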

Master Orchestrator (storage-scsi-monitor.py)

Purpose: Monitors all ESXi hosts, parses logs, generates alerts.

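#!/usr/bin/env python3
# storage-scsi-monitor.py - Run the analyzer on every host and alert on noisy LUNs
import io
import subprocess

import pandas as pd
import paramiko

ESXI_HOSTS = ['esxi1.example.com', 'esxi2.example.com']
ALERT_THRESHOLD = 10  # Errors per 5min
COLUMNS = ['Host', 'LUN', 'Error', 'Path_State', 'Timestamp', 'Count']

def analyze_host(host):
    ssh = paramiko.SSHClient()
    # AutoAddPolicy accepts unknown host keys; acceptable in a lab only
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # Placeholder credentials; prefer key-based authentication
    ssh.connect(host, username='root', password='esxi_password')
    try:
        # Run the analyzer on the host, then stream its newest report back
        stdin, stdout, stderr = ssh.exec_command(
            '/scripts/vmk-log-analyzer.sh >/dev/null && '
            'cat "$(ls -t /tmp/scsi-report-*.csv | head -1)"'
        )
        report = stdout.read().decode()
    finally:
        ssh.close()

    if not report.strip():
        print(f"WARN {host}: analyzer returned no report")
        return

    # Keep a local copy so scsi-dashboard.py can read it on this machine
    with open(f'/tmp/scsi-report-{host}.csv', 'w') as f:
        f.write(report)

    # Parse results
    errors = pd.read_csv(io.StringIO(report), names=COLUMNS, header=0)
    critical_luns = errors[errors['Count'] > ALERT_THRESHOLD]

    if not critical_luns.empty:
        print(f"ALERT {host}: {len(critical_luns)} critical LUNs")
        for _, row in critical_luns.iterrows():
            print(f"  LUN {row['LUN']}: {row['Error']} x{row['Count']}")

# Run across all hosts
for host in ESXI_HOSTS:
    analyze_host(host)

subprocess.run(['python3', 'scsi-dashboard.py'], check=True)
print("SCSI dashboard: scsi-dashboard.html")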
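
The loop over hosts runs serially, and one unreachable host raises an exception that stops the whole run. One possible refinement, sketched here under the assumption that the per-host function is passed in as a callable, is to fan out with a thread pool and collect failures separately. The names `run_all` and `fake_analyze` are hypothetical; the stub stands in for the SSH-based `analyze_host`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(hosts, analyze, max_workers=4):
    """Call analyze(host) for every host; return (results, failures) dicts."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(analyze, host): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:  # one bad host must not abort the others
                failures[host] = str(exc)
    return results, failures

def fake_analyze(host):
    """Stub for illustration: pretends esxi2 is unreachable."""
    if host == 'esxi2.example.com':
        raise ConnectionError('unreachable')
    return 3  # pretend three critical LUNs were found

if __name__ == '__main__':
    ok, failed = run_all(['esxi1.example.com', 'esxi2.example.com'], fake_analyze)
    print('results:', ok)
    print('failures:', failed)
```

Threads suit this workload because each call spends most of its time waiting on the network rather than on the CPU.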

One-Liner SCSI Commands

| Task | Command |
|---|---|
| Live SCSI errors | `tail -f /var/run/log/vmkernel.log \| grep SCSIERR` |
| Path status | `esxcli storage core path list \| grep -E 'dead…` |
| LUN reservations | `esxcli storage core device list \| grep reservation` |
| HBA errors | `esxcli storage core adapter stats get -A vmhbaX` |
| Decode sense manually | `echo "0xb" \| python3 scsi-decoder.py` |

Common SCSI Errors & Fixes

| Code | Meaning | Fix |
|---|---|---|
| Sense key 0xb (Aborted Command) | Queue overflow, SAN busy | Increase HBA queue depth, check SAN zoning |
| Sense key 0x6 (Unit Attention) | LUN reset | Rescan: `esxcli storage core adapter rescan --all` |
| Device status 0x18 (Reservation Conflict) | vMotion conflict | Stagger migrations, check esxtop RESV/s and CONS/s |
| Path Dead | Cable/switch/HBA failure | Repair the fault, then `esxcli storage core path set -p <path> -s active` |
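
The fixes in this section can be attached to alerts automatically. A minimal sketch, assuming alerts carry the decoded condition name as text; `FIXES` and `suggest_fix` are hypothetical names, and the hint strings paraphrase the fixes listed in this section.

```python
# Remediation hints keyed by lower-cased condition name.
FIXES = {
    'aborted command': 'Increase HBA queue depth, check SAN zoning',
    'unit attention': 'Rescan: esxcli storage core adapter rescan --all',
    'reservation conflict': 'Stagger migrations, check reservation counters in esxtop',
    'path dead': 'Check cable/switch/HBA, then re-enable the path with esxcli',
}

def suggest_fix(condition):
    """Return a remediation hint for a decoded condition (case-insensitive)."""
    return FIXES.get(condition.strip().lower(),
                     'No documented fix; inspect the raw vmkernel line')

if __name__ == '__main__':
    for name in ('Aborted Command', 'Path Dead', 'Medium Error'):
        print(f'{name}: {suggest_fix(name)}')
```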

Alerting Integration

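# Email critical LUNs (run where a mail client is installed, e.g. the jumpbox)
LATEST=$(ls -t /tmp/scsi-report-*.csv 2>/dev/null | head -1)
if [ -n "$LATEST" ] && [ "$(wc -l < "$LATEST")" -gt 5 ]; then
    mail -s "ESXi SCSI Errors $(hostname)" storage-team@company.com < "$LATEST"
fi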
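
If the orchestrator should send the alert itself instead of shelling out to `mail`, the standard library can build and send the message. This is a sketch under stated assumptions: the sender address and the SMTP relay `smtp.company.com` are hypothetical placeholders, and `build_alert` and `send_alert` are illustrative names.

```python
import smtplib
from email.message import EmailMessage

def build_alert(host, csv_text,
                sender='esxi-monitor@company.com',      # hypothetical sender
                recipient='storage-team@company.com'):
    """Build an alert email with the CSV report attached."""
    msg = EmailMessage()
    msg['Subject'] = f'ESXi SCSI Errors {host}'
    msg['From'] = sender
    msg['To'] = recipient
    data_rows = max(len(csv_text.strip().splitlines()) - 1, 0)  # exclude header
    msg.set_content(f'{data_rows} report rows from {host}; full CSV attached.')
    msg.add_attachment(csv_text, subtype='csv', filename=f'scsi-report-{host}.csv')
    return msg

def send_alert(msg, smtp_host='smtp.company.com'):      # hypothetical relay
    """Send the message through an SMTP relay."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)

REPORT = ("Host,LUN,SCSI_Error,Path_State,Timestamp,Count\n"
          "esxi1,naa.aaa,VMW_SCSIERR_0xb,,2025-12-28T18:46,12\n")

if __name__ == '__main__':
    message = build_alert('esxi1', REPORT)
    print(message['Subject'])
    # send_alert(message)  # uncomment once a real relay is configured
```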

Confluence Deployment

1. SCP scripts to ESXi: `/scripts/`
2. Cron: `*/5 * * * * /scripts/vmk-log-analyzer.sh`
3. Jumpbox: Run `storage-scsi-monitor.py` hourly
4. Embed: `{html}http://scsi-dashboard.html{html}`

Sample Output:

esxi1: 12 critical LUNs
LUN naa.60a980...: Aborted Command x8
Path vmhba32:C0:T0:L0 dead x4

Pro Tip: Filter Tintri LUNs: grep -E 'tintri|naa\.60a9' /var/run/log/vmkernel.log

Run ./storage-scsi-monitor.py for instant dashboard!
