PSOD (Purple Screen of Death) is a critical error in VMware ESXi that occurs when the hypervisor encounters a severe issue that prevents it from continuing normal operations. When a PSOD occurs, the entire ESXi host halts, and a purple diagnostic screen is displayed with error information. PSODs are usually caused by low-level hardware or software issues and require careful troubleshooting to identify and resolve the root cause.
How to Fix a PSOD:
Example of PSOD Troubleshooting:
Let’s say you encounter a PSOD with the following error message:
PSOD: PCPU 1 locked up. Failed to ack TLB invalidate request. #PF Exception 14 in world 34150:TestVM
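A message like this packs several fields into one line: the physical CPU, the exception, and the world (process) that was running. The following is a minimal sketch of how those fields could be pulled apart for a ticket or a search query. The regular expressions and the `parse_psod_message` helper are assumptions based only on the sample message above, not on a documented ESXi message grammar.

```python
import re

# Assumption: these patterns are derived from the single sample message
# in this article, not from an official description of the PSOD format.
PCPU_RE = re.compile(r"PCPU (\d+)")
EXCEPTION_RE = re.compile(r"#(\w+) Exception (\d+)")
WORLD_RE = re.compile(r"world (\d+):(\S+)")


def parse_psod_message(message: str) -> dict:
    """Extract PCPU, exception, and world fields; missing fields become None."""
    pcpu = PCPU_RE.search(message)
    exc = EXCEPTION_RE.search(message)
    world = WORLD_RE.search(message)
    return {
        "pcpu": int(pcpu.group(1)) if pcpu else None,
        "exception_name": exc.group(1) if exc else None,
        "exception_vector": int(exc.group(2)) if exc else None,
        "world_id": int(world.group(1)) if world else None,
        "world_name": world.group(2) if world else None,
    }


sample = (
    "PSOD: PCPU 1 locked up. Failed to ack TLB invalidate request. "
    "#PF Exception 14 in world 34150:TestVM"
)
fields = parse_psod_message(sample)
print(fields)
```

For the sample, this yields PCPU 1, exception `PF` with vector 14 (the x86 page-fault vector), and world 34150 named `TestVM`. Messages that do not follow the same shape simply produce `None` for the fields that are not found.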
Troubleshooting steps might include:
- Reviewing the error message and understanding the context of the PSOD (PCPU 1 locked up).
- Checking the VMkernel log (/var/log/vmkernel.log) to see if there were any hardware-related issues on CPU 1 leading up to the PSOD.
- Verifying that the CPU is functioning correctly and is not overheating.
- Checking for any BIOS/UEFI updates for the server’s motherboard and updating if necessary.
- Reviewing VMware’s Knowledge Base for any known issues related to “Failed to ack TLB invalidate request” errors.
- If the issue persists, engaging VMware Support for further analysis and assistance.
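The log review step above can be partly automated. The sketch below filters log lines for a few keywords so that the entries leading up to the PSOD are easier to spot. The keyword list and the `find_suspect_lines` helper are illustrative assumptions, and the sample lines are hypothetical placeholders rather than real VMkernel log output.

```python
from typing import Iterable, List, Sequence

# Assumption: keywords chosen from the error text discussed in this article.
KEYWORDS = ("TLB invalidate", "locked up", "NMI", "MCE", "Machine Check")


def find_suspect_lines(
    lines: Iterable[str], keywords: Sequence[str] = KEYWORDS
) -> List[str]:
    """Return the lines that contain any keyword, compared case-insensitively."""
    lowered = [k.lower() for k in keywords]
    return [
        line.rstrip("\n")
        for line in lines
        if any(k in line.lower() for k in lowered)
    ]


# Hypothetical excerpt; real log entries carry timestamps and other fields.
sample_log = [
    "example entry: storage path state changed\n",
    "example entry: PCPU 1 locked up\n",
    "example entry: Failed to ack TLB invalidate request\n",
]
hits = find_suspect_lines(sample_log)
for hit in hits:
    print(hit)
```

To run it against a log copied from the host, pass an open file instead of the sample list, for example `find_suspect_lines(open("vmkernel.log"))`. Simple substring matching can produce false positives, so treat the output as a starting point for reading the surrounding entries, not as a diagnosis.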
Two common types of events that can lead to PSODs are NMI (Non-Maskable Interrupt) and MCE (Machine Check Exception). Both are raised at the hardware level and can indicate serious issues with the underlying physical hardware.
NMI (Non-Maskable Interrupt): An NMI is a type of interrupt that software cannot disable or mask through the normal interrupt-masking mechanisms. It is typically used for critical hardware events that require immediate attention. When an NMI occurs, the CPU immediately stops executing the current task and jumps to the NMI handler, which is responsible for handling the critical event.
Example NMI PSOD message:
PSOD: NMI received for unknown reason 3c on CPU 0.
MCE (Machine Check Exception): MCE is a hardware exception generated by the CPU when it detects a hardware-related error, such as memory errors, cache errors, or other internal CPU errors. MCEs are typically raised when the CPU detects an error that cannot be corrected, indicating a potential hardware problem.
Example MCE PSOD message:
PSOD: MCE Exception 0x21 in world 1234:TestVM
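When triaging several hosts, it can help to sort PSOD messages into NMI, MCE, and everything else before deciding who to involve. The following is a minimal sketch that does this by keyword. The `classify_psod` helper and its matching rules are assumptions based on the example messages in this article; real messages may be worded differently.

```python
import re


def classify_psod(message: str) -> str:
    """Label a PSOD message as 'NMI', 'MCE', or 'other' by keyword."""
    # Word boundaries avoid matching the letters inside longer words.
    if re.search(r"\bNMI\b", message):
        return "NMI"
    if re.search(r"\bMCE\b|Machine Check", message):
        return "MCE"
    return "other"


messages = [
    "PSOD: NMI received for unknown reason 3c on CPU 0.",
    "PSOD: MCE Exception 0x21 in world 1234:TestVM",
    "PSOD: PCPU 1 locked up. Failed to ack TLB invalidate request. "
    "#PF Exception 14 in world 34150:TestVM",
]
labels = [classify_psod(m) for m in messages]
print(labels)  # ['NMI', 'MCE', 'other']
```

An "other" result only means that neither keyword appeared. It does not rule out a hardware cause, so the log review and hardware checks still apply.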
Troubleshooting NMI and MCE PSODs: Since both NMI and MCE PSODs are hardware-related errors, troubleshooting them requires a thorough analysis of the physical hardware. Here are some general steps for troubleshooting NMI and MCE PSODs:
- Collect PSOD Details: Note down the exact PSOD error message and any associated error codes. This information will be valuable for troubleshooting.
- Check Hardware Health: Use the server’s integrated management tools or vendor-specific utilities to check the health of the CPU, memory, storage, and other hardware components. Look for any error indications or hardware faults.
- Update Firmware and Drivers: Ensure that the server’s firmware (BIOS/UEFI) and hardware drivers are up-to-date. Outdated firmware or drivers can lead to hardware compatibility issues.
- Run Hardware Diagnostics: Many server vendors provide hardware diagnostic tools that can help identify hardware issues. Run comprehensive hardware diagnostics to detect any problems with the CPU, memory, or other components.
- Check for Known Issues: Search VMware’s Knowledge Base and community forums for any known issues related to the specific PSOD error messages you encountered.
- Review VM Configurations: If the PSOD is associated with a specific VM, review the VM’s configurations, such as CPU and memory settings, to ensure they are within supported limits.
- Monitor Hardware Temperature: Monitor the hardware temperature to ensure that the server is not overheating, as overheating can lead to hardware errors.
- Review Physical Connections: Verify that all physical connections, such as memory modules and expansion cards, are seated properly.
- Engage Vendor Support: If you are unable to resolve the issue, engage the server vendor’s support team for further assistance and hardware validation.
NMI and MCE PSODs are low-level hardware errors, and resolving them may require in-depth knowledge of server hardware and firmware. If you are unsure about any of the steps, seek help from experienced VMware administrators or the server vendor’s support team. Keeping the server’s hardware and firmware up-to-date also reduces the risk of encountering NMI and MCE errors.