ESXi SCSI Decoder & VMkernel Log Analyzer

VMkernel logs record SCSI sense codes, path states, and HBA errors in hex. Decoding them reveals storage failures such as LUN timeouts, reservation conflicts, and path flaps. This script suite parses /var/run/log/vmkernel.log, decodes SCSI status, and generates troubleshooting dashboards listing failed paths and host details.

SCSI Sense Code Decoder (scsi-decoder.py)

Purpose: Converts hex sense data from vmkernel logs to human-readable errors.

#!/usr/bin/env python3
# scsi-decoder.py - Decode VMware SCSI sense codes
import re

# SCSI sense keys
SCSI_SENSE = {
    '0x0': 'No Sense (OK)',
    '0x1': 'Recovered Error',
    '0x2': 'Not Ready',
    '0x3': 'Medium Error',
    '0x4': 'Hardware Error',
    '0x5': 'Illegal Request',
    '0x6': 'Unit Attention',
    '0x7': 'Data Protect',
    '0xb': 'Aborted Command',
    '0xe': 'Miscompare'
}

# Additional sense code + qualifier, keyed as 0x<ASC><ASCQ>.
# Reservation conflicts are a device status (0x18), not an ASC/ASCQ pair.
ASC_QUAL = {
    '0x0404': 'LUN Not Ready, Format in Progress',
    '0x0800': 'Logical Unit Communication Failure',
    '0x2800': 'Not Ready to Ready Change, Medium May Have Changed',
    '0x3f01': 'Microcode Has Been Changed',
    '0x3f0e': 'Reported LUNs Data Has Changed',
    '0x4700': 'SCSI Parity Error',
    '0x4c00': 'Logical Unit Failed Self-Configuration',
    '0x5506': 'Auxiliary Memory Out of Space'
}

def decode_scsi(line):
    """Parse vmkernel SCSI line: <timestamp> cpuX:...: VMW_SCSIERR_0xX: naa.<id>"""
    # The token carries a 0x prefix, so match it explicitly before the hex digits
    sense_match = re.search(r'VMW_SCSIERR_0x([0-9a-fA-F]{1,2})', line)
    if not sense_match:
        return None

    # Normalize so that 0x0B and 0xb hit the same table key
    sense = f"0x{int(sense_match.group(1), 16):x}"
    naa = re.search(r'naa\.([0-9a-fA-F:]+)', line)
    # Timestamps lead the line in ISO-8601 form and are not bracketed
    ts = re.search(r'\d{4}-\d{2}-\d{2}T[\d:.]+Z', line)

    return {
        'lun': naa.group(1) if naa else 'Unknown',
        'sense_key': SCSI_SENSE.get(sense, f'Unknown: {sense}'),
        'raw_line': line.strip(),
        'timestamp': ts.group(0) if ts else 'Unknown'
    }

# Usage example
if __name__ == '__main__':
    log_line = "2025-12-28T18:46:00.123Z cpu5:32:VMW_SCSIERR_0xb: naa.60a9800064824b4f4f4f4f4f4f4f4f4f"
    print(decode_scsi(log_line))
    # Output: {'lun': '60a9800064824b4f4f4f4f4f4f4f4f4f', 'sense_key': 'Aborted Command', ...}
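
Many vmkernel releases report failed commands as a status triplet followed by sense data, for example `H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0` (host, device, and plugin status, then sense key, ASC, and ASCQ). The sketch below handles that layout. It assumes that format; the function name `parse_sense_triplet` and the sample line are illustrative, and the lookup tables are small subsets rather than complete lists.

```python
import re

# Illustrative subsets of the standard SCSI tables, not complete lists.
SENSE_KEYS = {0x2: 'Not Ready', 0x5: 'Illegal Request',
              0x6: 'Unit Attention', 0xb: 'Aborted Command'}
DEVICE_STATUS = {0x0: 'Good', 0x2: 'Check Condition', 0x8: 'Busy',
                 0x18: 'Reservation Conflict', 0x28: 'Task Set Full'}
ASC_ASCQ = {(0x24, 0x0): 'Invalid Field in CDB',
            (0x29, 0x0): 'Power On, Reset, or Bus Device Reset Occurred'}

# Status triplet, optionally followed by "Valid/Possible sense data: key asc ascq"
PATTERN = re.compile(
    r'H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+)'
    r'(?: (?:Valid|Possible) sense data: (0x[0-9a-f]+) (0x[0-9a-f]+) (0x[0-9a-f]+))?',
    re.IGNORECASE)

def parse_sense_triplet(line):
    """Return decoded host/device/plugin status and sense data, or None if absent."""
    m = PATTERN.search(line)
    if not m:
        return None
    host, device, plugin = (int(x, 16) for x in m.group(1, 2, 3))
    result = {
        'host_status': host,
        'device_status': DEVICE_STATUS.get(device, hex(device)),
        'plugin_status': plugin,
    }
    if m.group(4):  # sense data is only present on some failures
        key, asc, ascq = (int(x, 16) for x in m.group(4, 5, 6))
        result['sense_key'] = SENSE_KEYS.get(key, hex(key))
        result['asc_ascq'] = ASC_ASCQ.get((asc, ascq), f'{asc:#04x}/{ascq:#04x}')
    return result

# Hypothetical sample line in the layout described above
sample = ('2025-12-28T18:46:00.123Z cpu5:2050)ScsiDeviceIO: 2315: Cmd(0x4124003edb00) 0x12, '
          'CmdSN 0x51 to dev "naa.60a9800064824b4f4f4f4f4f4f4f4f4f" failed '
          'H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.')
print(parse_sense_triplet(sample))
```

Keeping device status separate from the sense key matters here: a reservation conflict shows up as `D:0x18` with no sense data at all, so a decoder that looks only at sense keys would miss it.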

VMkernel Log Parser (vmk-log-analyzer.sh)

Purpose: Real-time parsing of SCSI errors, path states, and HBA failures.

#!/bin/sh
# vmk-log-analyzer.sh - Live SCSI/Path decoder
# ESXi ships BusyBox sh rather than bash, so the script sticks to POSIX syntax.
LOG_FILE="/var/run/log/vmkernel.log"
REPORT="/tmp/scsi-report-$(date +%Y%m%d-%H%M).csv"
TMP="$REPORT.tmp"
HOST=$(hostname)

echo "Host,LUN,SCSI_Error,Path_State,Timestamp,Count" > "$REPORT"
: > "$TMP"

# SCSI Errors (Path_State column left empty)
grep -iE "VMW_SCSIERR|scsi|LUN|NMP|path dead|timeout" "$LOG_FILE" | while read -r line; do
    lun=$(echo "$line" | grep -o 'naa\.[0-9a-f:]*' | head -1)
    error=$(echo "$line" | grep -oE 'VMW_SCSIERR_0x[0-9a-f]+|timeout|dead|failed' | head -1)
    # Timestamps lead the line unbracketed; keep minute precision so repeats aggregate
    timestamp=$(echo "$line" | grep -oE '^[0-9TZ:.-]+' | cut -c1-16)

    echo "$HOST,${lun:-unknown},$error,,$timestamp" >> "$TMP"
done

# Path States (SCSI_Error column left empty)
grep -iE "path state|working|dead|standby|active" "$LOG_FILE" | while read -r line; do
    lun=$(echo "$line" | grep -o 'naa\.[0-9a-f:]*' | head -1)
    path_state=$(echo "$line" | grep -oiE 'working|dead|standby|active|disabled' | head -1)
    timestamp=$(echo "$line" | grep -oE '^[0-9TZ:.-]+' | cut -c1-16)

    echo "$HOST,${lun:-unknown},,$path_state,$timestamp" >> "$TMP"
done

# Aggregate & sort: uniq -c prefixes each row with its count, awk moves it to the last column
sort "$TMP" | uniq -c | sort -nr | awk '{c=$1; sub(/^ *[0-9]+ /, ""); print $0 "," c}' >> "$REPORT"
rm -f "$TMP"

# Subtract the header line from the entry count
echo "SCSI/Path Report: $(($(wc -l < "$REPORT") - 1)) entries"
tail -20 "$REPORT"

Cron: */5 * * * * /scripts/vmk-log-analyzer.sh (5min intervals).
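
Because the analyzer runs every five minutes, it can help to count events per five-minute window rather than per raw timestamp. The following is a minimal sketch of that bucketing, assuming timestamps in the `2025-12-28T18:46:00.123Z` form used in the example log line; the helper names `bucket_5min` and `count_events` and the sample events are illustrative.

```python
from collections import Counter
from datetime import datetime

def bucket_5min(ts):
    """Floor an ISO-8601 UTC timestamp to the start of its 5-minute window."""
    dt = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S.%fZ')
    return dt.replace(minute=dt.minute - dt.minute % 5, second=0, microsecond=0)

def count_events(events):
    """events: iterable of (timestamp, lun, error) -> Counter keyed by (window, lun, error)."""
    return Counter((bucket_5min(ts), lun, err) for ts, lun, err in events)

LUN = 'naa.60a9800064824b4f4f4f4f4f4f4f4f4f'
# Hypothetical events: two fall in the 18:45 window, one in the 18:50 window
events = [
    ('2025-12-28T18:46:00.123Z', LUN, 'timeout'),
    ('2025-12-28T18:48:30.000Z', LUN, 'timeout'),
    ('2025-12-28T18:51:02.500Z', LUN, 'timeout'),
]
counts = count_events(events)
for (window, lun, err), n in sorted(counts.items()):
    print(window.isoformat(), lun, err, n)
```

Aligning the window size with the cron interval means each run contributes to at most two buckets, which keeps per-window counts comparable against a fixed alert threshold.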

Live Dashboard Generator (scsi-dashboard.py)

Purpose: Creates HTML table with SCSI errors, path status, and failure trends.

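#!/usr/bin/env python3
# scsi-dashboard.py - Interactive SCSI failure dashboard
import glob
from datetime import datetime

import pandas as pd
import plotly.express as px

# read_csv does not expand wildcards, so glob the report files and concatenate them
COLUMNS = ['Host', 'LUN', 'Error', 'Path_State', 'Timestamp', 'Count']
reports = sorted(glob.glob('/tmp/scsi-report-*.csv'))
if not reports:
    raise SystemExit('No /tmp/scsi-report-*.csv files found')
# header=0 discards each file's own header row in favor of COLUMNS
df = pd.concat((pd.read_csv(p, names=COLUMNS, header=0) for p in reports), ignore_index=True)

# Top failing LUNs
top_luns = df.groupby('LUN')['Count'].sum().sort_values(ascending=False).head(10)

# Path state pie chart (rows without a path state are SCSI error rows)
path_pie = px.pie(df.dropna(subset=['Path_State']), names='Path_State', values='Count',
                  title='Path States Distribution')

# Error timeline: sum the aggregated counts instead of counting report rows
df['Time'] = pd.to_datetime(df['Timestamp'], errors='coerce')
timeline = df.groupby(['Time', 'Error'])['Count'].sum().reset_index()
error_timeline = px.line(timeline, x='Time', y='Count', color='Error',
                         title='SCSI Errors Over Time')
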
# HTML Report
html = f"""
<html>
<head><title>ESXi SCSI Status - {datetime.now()}</title>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css">
</head>
<body>
<div class="container">
<h2>🔍 ESXi Storage Path & SCSI Errors</h2>
<div class="row">
<div class="col-md-6">{path_pie.to_html(full_html=False, include_plotlyjs='cdn')}</div>
<div class="col-md-6">{error_timeline.to_html(full_html=False, include_plotlyjs='cdn')}</div>
</div>

<h3>Top Failing LUNs</h3>
{df.pivot_table(values='Count', index='LUN', columns='Error', aggfunc='sum', fill_value=0).to_html()}
</div>
</body>
</html>
"""
with open('scsi-dashboard.html', 'w') as f:
    f.write(html)
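
LUN names and error strings in the report come straight from log lines, so escaping them before embedding in HTML is a sensible precaution. The sketch below renders a table using only the standard library; the function name `render_table` and the sample rows are illustrative and not part of the scripts above.

```python
from html import escape

def render_table(headers, rows):
    """Render rows as an HTML table, escaping every cell value."""
    head = ''.join(f'<th>{escape(str(h))}</th>' for h in headers)
    body = ''.join(
        '<tr>' + ''.join(f'<td>{escape(str(cell))}</td>' for cell in row) + '</tr>'
        for row in rows)
    return (f'<table class="table"><thead><tr>{head}</tr></thead>'
            f'<tbody>{body}</tbody></table>')

# Hypothetical rows; the second one contains characters that must be escaped
table_html = render_table(
    ['LUN', 'Error', 'Count'],
    [('naa.60a9800064824b4f4f4f4f4f4f4f4f4f', 'Aborted Command', 8),
     ('unknown', '<timeout>', 4)])
print(table_html)
```

This is also a fallback for hosts where pandas is unavailable, since it needs nothing beyond the standard library.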

Master Orchestrator (storage-scsi-monitor.py)

Purpose: Monitors all ESXi hosts, parses logs, generates alerts.

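#!/usr/bin/env python3
# storage-scsi-monitor.py - Run the analyzer on each ESXi host and alert on noisy LUNs
import io
import os
import subprocess

import pandas as pd
import paramiko

ESXI_HOSTS = ['esxi1.example.com', 'esxi2.example.com']
ALERT_THRESHOLD = 10  # Errors per 5min
COLUMNS = ['Host', 'LUN', 'Error', 'Path_State', 'Timestamp', 'Count']

def analyze_host(host):
    ssh = paramiko.SSHClient()
    # AutoAddPolicy trusts unknown host keys; prefer load_system_host_keys() outside a lab
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # ESXI_PASSWORD is an assumed environment variable, used to avoid a hardcoded password
    ssh.connect(host, username='root', password=os.environ['ESXI_PASSWORD'])
    try:
        # Run the analyzer remotely, then print the newest report it wrote
        _, stdout, _ = ssh.exec_command(
            '/scripts/vmk-log-analyzer.sh >/dev/null && '
            'cat "$(ls -t /tmp/scsi-report-*.csv | head -1)"')
        report = stdout.read().decode()
        if stdout.channel.recv_exit_status() != 0 or not report.strip():
            print(f"WARN {host}: analyzer produced no report")
            return
    finally:
        ssh.close()

    # Keep a local copy so scsi-dashboard.py can read it on this machine
    with open(f'/tmp/scsi-report-{host}.csv', 'w') as f:
        f.write(report)

    # Parse results
    errors = pd.read_csv(io.StringIO(report), names=COLUMNS, header=0)
    critical_luns = errors[errors['Count'] > ALERT_THRESHOLD]

    if not critical_luns.empty:
        print(f"ALERT {host}: {len(critical_luns)} critical LUNs")
        for _, row in critical_luns.iterrows():
            print(f"  LUN {row['LUN']}: {row['Error']} x{row['Count']}")

# Run across all hosts
for host in ESXI_HOSTS:
    analyze_host(host)

subprocess.run(['python3', 'scsi-dashboard.py'], check=True)
print("SCSI dashboard: scsi-dashboard.html")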
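
The threshold check is easier to test when it is separated from the SSH plumbing. The sketch below works on report text directly, using only the standard library and assuming the six-column CSV header written by vmk-log-analyzer.sh. The function name `critical_luns` and the sample report are illustrative. Unlike the per-row comparison above, it sums counts per LUN before applying the threshold.

```python
import csv
import io
from collections import defaultdict

def critical_luns(report_csv, threshold=10):
    """Sum Count per LUN and return the LUNs whose total exceeds the threshold."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(report_csv)):
        totals[row['LUN']] += int(row['Count'])
    return {lun: total for lun, total in totals.items() if total > threshold}

# Hypothetical report: one LUN with 7 + 5 = 12 events, one with 2
report = """Host,LUN,SCSI_Error,Path_State,Timestamp,Count
esxi1,naa.60a9800064824b4f4f4f4f4f4f4f4f4f,timeout,,2025-12-28T18:46,7
esxi1,naa.60a9800064824b4f4f4f4f4f4f4f4f4f,,dead,2025-12-28T18:47,5
esxi1,unknown,timeout,,2025-12-28T18:46,2
"""
print(critical_luns(report))
# Output: {'naa.60a9800064824b4f4f4f4f4f4f4f4f4f': 12}
```

Summing per LUN catches a device that logs several different errors, each below the threshold on its own.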

One-Liner SCSI Commands

| Task | Command |
|---|---|
| Live SCSI errors | `tail -f /var/run/log/vmkernel.log \| grep SCSIERR` |
| Path status | `esxcli storage core path list \| grep -E 'dead'` |
| LUN reservations | `esxcli storage core device list \| grep reservation` |
| HBA errors | `esxcli storage core adapter stats get -A vmhbaX` |
| Decode sense manually | `echo "0xb" \| python3 scsi-decoder.py` |

Common SCSI Errors & Fixes

| Code | Meaning | Fix |
|---|---|---|
| Sense key 0xb (Aborted Command) | Queue overflow, SAN busy | Increase HBA queue depth, check SAN zoning |
| Sense key 0x6 (Unit Attention) | LUN reset | Rescan: `esxcli storage core adapter rescan --all` |
| Device status 0x18 (Reservation Conflict) | vMotion conflict | Stagger migrations, check `esxtop` RESV/s and CONS/s |
| Path Dead | Cable/switch/HBA failure | Fix the physical fault, then `esxcli storage core path set -p <path> -s active` |

Alerting Integration

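# Email critical LUNs
# Pick the newest report; redirecting from a wildcard fails when several files match
LATEST=$(ls -t /tmp/scsi-report-*.csv 2>/dev/null | head -1)
if [ -n "$LATEST" ] && [ "$(wc -l < "$LATEST")" -gt 5 ]; then
    mail -s "ESXi SCSI Errors $(hostname)" storage-team@company.com < "$LATEST"
fi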
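
Where the `mail` command is unavailable, the same alert can be composed in Python. The sketch below builds the message only and does not send it. It mirrors the more-than-five-lines check above; the function name `build_alert` and the sender address are assumptions for illustration.

```python
from email.message import EmailMessage

def build_alert(host, report_csv, recipient='storage-team@company.com', min_lines=5):
    """Return an EmailMessage with the report attached, or None if the report is short."""
    lines = report_csv.splitlines()
    if len(lines) <= min_lines:
        return None
    msg = EmailMessage()
    msg['Subject'] = f'ESXi SCSI Errors {host}'
    msg['From'] = f'scsi-monitor@{host}'  # assumed sender address
    msg['To'] = recipient
    msg.set_content(f'{len(lines) - 1} report rows from {host}; full report attached.')
    msg.add_attachment(report_csv.encode(), maintype='text', subtype='csv',
                       filename='scsi-report.csv')
    return msg

# Hypothetical report with a header and five data rows
header = 'Host,LUN,SCSI_Error,Path_State,Timestamp,Count'
rows = [f'esxi1,unknown,timeout,,2025-12-28T18:4{i},1' for i in range(5)]
alert = build_alert('esxi1.example.com', '\n'.join([header] + rows))
print(alert['Subject'])
```

To deliver the message, pass it to `smtplib.SMTP(...).send_message(msg)` with whatever mail relay the environment provides.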

Confluence Deployment

1. SCP scripts to ESXi: `/scripts/`
2. Cron: `*/5 * * * * /scripts/vmk-log-analyzer.sh`
3. Jumpbox: Run `storage-scsi-monitor.py` hourly
4. Embed: `{html}http://scsi-dashboard.html{html}`

Sample Output:

esxi1: 12 critical LUNs
  LUN naa.60a980...: Aborted Command x8
  Path vmhba32:C0:T0:L0 dead x4

Pro Tip: Filter Tintri LUNs: `grep -E 'tintri|naa\.60a9' /var/run/log/vmkernel.log`

Run ./storage-scsi-monitor.py for instant dashboard!
