Validate all the service profiles in Cisco UCS

To validate all the service profiles in Cisco UCS (Unified Computing System) using PowerShell, you can use the UCS PowerTool module. UCS PowerTool is a set of PowerShell cmdlets that allows you to manage and automate tasks on Cisco UCS infrastructure. Here’s a PowerShell script to validate all the service profiles in Cisco UCS:

Prerequisites:

  1. Ensure you have the UCS PowerTool module installed. You can download it from the Cisco website and follow the installation instructions provided, or install it from the PowerShell Gallery with Install-Module -Name Cisco.UCSManager.

PowerShell Script to Validate All Service Profiles:

# Import the UCS PowerTool module for UCS Manager
Import-Module Cisco.UCSManager

# Set the UCS Manager IP address or hostname and the login name
$UCSManagerIP = "UCS_MANAGER_IP_ADDRESS_OR_HOSTNAME"
$Username = "USERNAME"

# Prompt for the password instead of storing it in the script
$Credential = Get-Credential -UserName $Username -Message "Enter UCS Manager Password"

# Connect to the UCS Manager
Connect-Ucs -Name $UCSManagerIP -Credential $Credential | Out-Null

# Get all the service profiles (instances only, not templates) from UCS Manager
$ServiceProfiles = Get-UcsServiceProfile -Type instance

# Loop through each service profile and validate
foreach ($ServiceProfile in $ServiceProfiles) {
    # Name holds the profile name; Dn holds the full path (for example org-root/ls-NAME)
    $ServiceProfileName = $ServiceProfile.Name

    # Add your validation checks here
    # For example, you can check if the service profile is associated with a server, if it's in a correct state, etc.
    # Here's a simple example to print the name of each service profile:
    Write-Host "Service Profile: $ServiceProfileName"
}

# Disconnect from the UCS Manager
Disconnect-Ucs

Important Notes:

  • The script above connects to the UCS Manager using the provided credentials and retrieves all the service profiles.
  • Inside the foreach loop, you can add your custom validation checks based on your specific requirements. For example, you can check if the service profile is associated with a server, check its operating state, check for specific configurations, etc.
  • Customize the script with appropriate values for your UCS Manager, such as IP address, credentials, and any additional validation checks you want to perform.

Please be cautious when running any scripts against your production UCS infrastructure. Always test the script in a lab or non-production environment first to ensure it behaves as expected.
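
As one possible way to fill in the validation step, the sketch below flags service profiles that are not associated or not in a healthy state. It assumes each service profile object exposes Name, AssocState, and OperState properties with values such as "associated" and "ok"; confirm the property names and values with Get-Member against your PowerTool version. The function name Test-ServiceProfileState and the sample objects are illustrative, and the sample data lets the sketch run without a UCS connection.

```powershell
# Hypothetical helper: reports whether a service profile looks healthy.
# Property names (Name, AssocState, OperState) are assumptions about the
# objects returned by Get-UcsServiceProfile; verify them in your environment.
function Test-ServiceProfileState {
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        [object]$ServiceProfile
    )
    process {
        $issues = @()
        if ($ServiceProfile.AssocState -ne 'associated') {
            $issues += "not associated (AssocState=$($ServiceProfile.AssocState))"
        }
        if ($ServiceProfile.OperState -ne 'ok') {
            $issues += "unexpected OperState '$($ServiceProfile.OperState)'"
        }
        [pscustomobject]@{
            Name   = $ServiceProfile.Name
            Passed = ($issues.Count -eq 0)
            Issues = ($issues -join '; ')
        }
    }
}

# Made-up sample objects standing in for Get-UcsServiceProfile output.
$sampleProfiles = @(
    [pscustomobject]@{ Name = 'esx-host-01'; AssocState = 'associated';   OperState = 'ok' }
    [pscustomobject]@{ Name = 'esx-host-02'; AssocState = 'unassociated'; OperState = 'unassociated' }
)

$results = @($sampleProfiles | Test-ServiceProfileState)
$results | Format-Table -AutoSize
```

Against a live system, the same function could be fed from the pipeline, for example by piping the service profiles retrieved in the script above into Test-ServiceProfileState inside or instead of the foreach loop.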

Cisco Data Center Network Manager (DCNM)

Cisco Data Center Network Manager (DCNM) is a management solution that provides centralized management and monitoring capabilities for Cisco data center infrastructure. It offers comprehensive features to manage and troubleshoot Cisco Nexus switches, Cisco MDS (Multilayer Director Switches) storage switches, and other Cisco data center devices. Let’s explore some key features and examples of how to use Cisco DCNM:

1. Discovery and Inventory: DCNM can automatically discover and inventory the network devices in the data center. This provides administrators with a centralized view of the network topology and device details. To initiate the discovery process, navigate to “Inventory” > “Discovery” and follow the steps to add the devices to the DCNM inventory.

2. Device Configuration: DCNM allows administrators to manage the configuration of Cisco Nexus switches and MDS storage switches from a single interface. You can make changes to device configurations, push configurations to multiple devices simultaneously, and roll back configurations if needed. To access device configurations, go to “Configure” > “Templates” and create or modify templates for various device settings.

3. Monitoring and Alerts: DCNM provides real-time monitoring of network devices and interfaces. It can generate alerts and notifications for specific events or threshold violations. To configure alerts, navigate to “Monitor” > “Events and Alerts” and define the conditions and actions for different events.

4. Virtual Machine Manager (VMM): DCNM offers integration with VMware vSphere, enabling administrators to manage virtual and physical networking resources in a coordinated manner. This integration allows seamless provisioning and management of virtual machines and network resources. The VMM feature requires proper configuration and integration with VMware vCenter.

5. SAN (Storage Area Network) Management: For Cisco MDS storage switches, DCNM provides storage management capabilities, including zone management, VSAN (Virtual SAN) management, and monitoring of storage paths and devices. To manage SAN components, navigate to “SAN” > “SAN Configuration” or “SAN” > “Monitoring.”

6. Performance Monitoring: DCNM allows administrators to monitor and analyze the performance of network devices and interfaces. You can view real-time performance statistics, historical data, and generate reports. To access performance monitoring, go to “Monitor” > “Performance.”

7. Traffic Analyzer: DCNM’s Traffic Analyzer feature enables administrators to capture and analyze network traffic for troubleshooting and performance optimization. It allows packet captures and packet analysis within the DCNM interface.

8. Fabric and VLAN Management: DCNM simplifies fabric management for Cisco Nexus switches, including fabric provisioning, configuration, and troubleshooting. It also provides VLAN management capabilities for virtual LAN segmentation.

9. Troubleshooting with Logs: DCNM captures and stores logs from network devices, which can be valuable for troubleshooting network issues. Logs can be viewed and downloaded from the “Admin” > “Logs” section.

10. Multi-Site Management: DCNM supports multi-site management, allowing administrators to manage and monitor multiple data centers from a centralized DCNM instance.

Example: Configuring Interface Profile and VLAN in DCNM: Let’s walk through a basic example of how to configure an interface profile and VLAN using DCNM.

  1. Log in to the DCNM web interface.
  2. Navigate to “Configure” > “Interfaces.”
  3. Click on “Interface Profiles” and then click “Create” to create a new profile.
  4. Configure the interface profile settings, such as speed, duplex, and VLAN assignment.
  5. Save the profile.
  6. Navigate to “Configure” > “Interfaces” > “Interfaces.”
  7. Select the interfaces you want to assign to the new profile.
  8. Click “Assign” and select the newly created profile from the list.
  9. Save the configuration.

The above steps demonstrate how to create an interface profile in DCNM and assign interfaces to it. This simplifies the configuration process by applying common settings to multiple interfaces simultaneously.

Please note that the above example is a basic demonstration, and the actual configuration may vary depending on your specific network requirements and DCNM version.

Conclusion: Cisco DCNM offers a wide range of features to streamline data center network management and monitoring. From device discovery and inventory to performance monitoring and traffic analysis, DCNM provides comprehensive tools for efficient data center operations. With its user-friendly web interface, administrators can easily configure and manage Cisco Nexus switches and MDS storage switches, simplifying day-to-day network management tasks.

SEL log reading (CISCO Switches)

Cisco SEL (System Event Log) analysis is a critical task for network administrators and engineers to identify and troubleshoot issues in Cisco networking devices. The SEL captures various system events and messages, providing valuable insights into the health, performance, and security of Cisco devices. In this comprehensive guide, we’ll explore the importance of SEL analysis, the structure of SEL messages, common SEL entries, and examples of SEL log analysis to demonstrate how it can be performed effectively.

1. Importance of Cisco SEL Analysis:

Cisco devices, such as routers, switches, and servers, generate SEL messages to record events related to hardware, software, and system operations. SEL analysis is essential for the following reasons:

  • Troubleshooting: SEL messages can help identify the root cause of network issues, hardware failures, or abnormal behaviors in Cisco devices.
  • Monitoring: Monitoring SEL logs enables proactive maintenance, detecting potential hardware failures or system events that may affect network performance.
  • Security: SEL logs can provide evidence of unauthorized access attempts or security breaches in Cisco devices.
  • Compliance: Many industry regulations require logging and monitoring of critical events, making SEL analysis crucial for compliance purposes.

2. Structure of Cisco SEL Messages:

SEL messages are stored in the device’s SEL log, usually in non-volatile memory (NVRAM). The SEL log follows the IPMI (Intelligent Platform Management Interface) v2.0 standard and consists of timestamped records, each containing various fields, including:

  • Record ID: A unique identifier for the SEL record.
  • Timestamp: The date and time when the event occurred.
  • Sensor Type: Identifies the source of the event, such as temperature, voltage, fan, power supply, etc.
  • Event Type: Describes the nature of the event, such as threshold crossing, sensor-specific, or OEM-specific events.
  • Event Data: Specific data related to the event, such as sensor values, device status, or error codes.
  • Sensor ID: Identifies the specific sensor generating the event.
  • Entity ID: Identifies the device or component that the sensor belongs to.

3. Common SEL Entries and Examples:

a) Temperature Threshold Crossing:

ID | TimeStamp                | Sensor Type       | Event Type              | Event Data         | Sensor ID | Entity ID
-------------------------------------------------------------------------------------------------------------------
1  | 2023-07-20T09:32:15.123Z | Temperature       | Threshold Crossed Upper | Temperature = 80C  | 7Fh       | System Board

Explanation: In this example, the SEL entry indicates that the temperature sensor on the system board (Entity ID) has crossed the upper threshold of 80°C.

b) Power Supply Status Change:

ID | TimeStamp                | Sensor Type       | Event Type              | Event Data                 | Sensor ID | Entity ID
---------------------------------------------------------------------------------------------------------------------------
2  | 2023-07-20T09:45:22.789Z | Power Supply      | State Deasserted        | Power Supply Status = Off  | 01h       | Power Supply 1

Explanation: This SEL entry indicates that Power Supply 1 (Entity ID) has been deasserted, and its status is now “Off.”

c) Fan Speed Threshold Crossing:

ID | TimeStamp                | Sensor Type       | Event Type              | Event Data                  | Sensor ID | Entity ID
---------------------------------------------------------------------------------------------------------------------------
3  | 2023-07-20T10:15:55.456Z | Fan               | Threshold Crossed Lower | Fan Speed = 1200 RPM        | 04h       | Fan Module 2

Explanation: The SEL entry shows that the fan speed (Sensor ID) in Fan Module 2 (Entity ID) has crossed the lower threshold of 1200 RPM.

d) Security Event: Login Attempt Failure

ID | TimeStamp                | Sensor Type       | Event Type              | Event Data         | Sensor ID | Entity ID
-------------------------------------------------------------------------------------------------------------------
4  | 2023-07-20T10:30:08.987Z | Security Audit    | Login Failed            | User = admin       | 70h       | Management Controller

Explanation: This SEL entry indicates a security event where a login attempt by the user “admin” (Event Data) to the Management Controller (Entity ID) has failed.

4. Analyzing SEL Logs:

  • Identify Critical Events: Analyze SEL logs regularly to identify critical events or alarms, such as temperature threshold crossings, power supply failures, or fan speed anomalies.
  • Cross-Reference with Other Logs: Correlate SEL logs with other system logs, such as syslog or SNMP traps, to get a comprehensive view of the system’s behavior and diagnose issues effectively.
  • Automated Monitoring: Use SNMP-based monitoring tools or SIEM (Security Information and Event Management) solutions to automate SEL log analysis and receive real-time alerts for critical events.
  • Track Hardware Changes: Monitor SEL logs during hardware upgrades or replacements to ensure new components are functioning correctly and detect any compatibility issues.
  • Regular Maintenance: Archive SEL logs on a regular schedule and clear them only after they have been archived, so that the log does not fill up and stop recording new events.
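
The "Identify Critical Events" step above can be sketched in PowerShell. The example below parses pipe-delimited rows in the same layout as the sample entries in section 3 and keeps only hardware-related events. The function name ConvertFrom-SelTable and the choice of which sensor types count as critical are assumptions made for illustration; real SEL exports differ by platform.

```powershell
# Sample rows in the same layout as the examples above (header row first).
$selText = @'
ID | TimeStamp                | Sensor Type    | Event Type              | Event Data                | Sensor ID | Entity ID
1  | 2023-07-20T09:32:15.123Z | Temperature    | Threshold Crossed Upper | Temperature = 80C         | 7Fh       | System Board
2  | 2023-07-20T09:45:22.789Z | Power Supply   | State Deasserted        | Power Supply Status = Off | 01h       | Power Supply 1
3  | 2023-07-20T10:15:55.456Z | Fan            | Threshold Crossed Lower | Fan Speed = 1200 RPM      | 04h       | Fan Module 2
4  | 2023-07-20T10:30:08.987Z | Security Audit | Login Failed            | User = admin              | 70h       | Management Controller
'@

# Hypothetical helper: turns the pipe-delimited text into objects.
function ConvertFrom-SelTable {
    param ([Parameter(Mandatory = $true)][string]$Text)

    # Drop empty lines and separator lines made only of dashes.
    $lines = @($Text -split '\r?\n' | Where-Object { $_.Trim() -and $_ -notmatch '^-+$' })

    # Header cells become property names with spaces removed (e.g. SensorType).
    $header = @($lines[0].Split('|') | ForEach-Object { $_.Trim() -replace '\s', '' })

    foreach ($line in ($lines | Select-Object -Skip 1)) {
        $cells = @($line.Split('|') | ForEach-Object { $_.Trim() })
        $record = [ordered]@{}
        for ($i = 0; $i -lt $header.Count; $i++) {
            $record[$header[$i]] = $cells[$i]
        }
        [pscustomobject]$record
    }
}

$events = @(ConvertFrom-SelTable -Text $selText)

# Assumed definition of "critical" for this sketch: hardware sensor types only.
$criticalTypes = 'Temperature', 'Power Supply', 'Fan'
$critical = @($events | Where-Object { $criticalTypes -contains $_.SensorType })

$critical | Format-Table ID, TimeStamp, SensorType, EventData -AutoSize
```

Once the entries are objects, the usual cmdlets (Where-Object, Group-Object, Export-Csv) can be used for correlation with other logs or for archiving.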

Troubleshooting Maximum Transmission Unit (MTU) issues on Cisco switches

Troubleshooting Maximum Transmission Unit (MTU) issues on Cisco switches using PowerShell requires the use of SSH or Telnet to interact with the switches’ command-line interface (CLI). PowerShell has no built-in cmdlets for driving a network device CLI over SSH or Telnet, but we can leverage external modules or utilities to achieve this. In this guide, we’ll use the popular SSH module called “Posh-SSH” to demonstrate how to troubleshoot MTU-related problems on Cisco switches using PowerShell.

Before proceeding, make sure you have the following prerequisites:

  1. Install the Posh-SSH module for PowerShell.
  2. Enable SSH access on the Cisco switches and ensure you have the necessary credentials.

Step 1: Install Posh-SSH Module: The Posh-SSH module can be installed from the PowerShell Gallery using the following command:

Install-Module -Name Posh-SSH

Step 2: Create the PowerShell Script: Below is a PowerShell script to connect to a Cisco switch, retrieve MTU-related information, and troubleshoot MTU issues:

# Import the Posh-SSH module
Import-Module Posh-SSH

# Cisco Switch Details
$SwitchIP = "192.168.1.1"
$Username = "your_username"

# Prompt for the password instead of storing it in plain text
$Credential = Get-Credential -UserName $Username -Message "Enter the switch password"

# Function to Connect to Cisco Switch
function Connect-CiscoSwitch {
    param (
        [Parameter(Mandatory=$true)]
        [string]$SwitchIP,
        [Parameter(Mandatory=$true)]
        [System.Management.Automation.PSCredential]$Credential
    )

    try {
        # Create a new SSH session to the Cisco switch
        # -AcceptKey trusts the switch's host key automatically; remove it to confirm the key manually
        $session = New-SSHSession -ComputerName $SwitchIP -Credential $Credential -AcceptKey -ErrorAction Stop
    }
    catch {
        Write-Host "Failed to connect to the Cisco switch: $($_.Exception.Message)"
        return $null
    }

    Write-Host "Connected to the Cisco switch."
    return $session
}

# Function to Run Commands on Cisco Switch
function Invoke-CommandOnCiscoSwitch {
    param (
        [Parameter(Mandatory=$true)]
        $Session,
        [Parameter(Mandatory=$true)]
        [string]$Command
    )

    # The returned object carries the command's text in its Output property
    $result = Invoke-SSHCommand -SSHSession $Session -Command $Command

    if ($null -eq $result) {
        Write-Host "Command execution failed."
        return $null
    }

    return $result
}

# Connect to the Cisco switch
$session = Connect-CiscoSwitch -SwitchIP $SwitchIP -Credential $Credential

if ($null -ne $session) {
    # Check the MTU settings on the switch interfaces
    $mtuOutput = Invoke-CommandOnCiscoSwitch -Session $session -Command "show interface | include MTU"

    if ($null -ne $mtuOutput) {
        Write-Host "MTU Information on the Cisco switch:"
        $mtuOutput.Output | ForEach-Object { Write-Host $_ }
    }

    # Additional MTU troubleshooting commands can be executed here

    # Close the SSH session
    Remove-SSHSession -SSHSession $session | Out-Null
}

Step 3: Run the PowerShell Script: Save the PowerShell script with a .ps1 extension and execute it in a PowerShell session. The script will connect to the Cisco switch using SSH, retrieve MTU-related information using the show interface command, and display the output.

Please note that this script provides a basic example of how to troubleshoot MTU issues on Cisco switches using PowerShell and the Posh-SSH module. Depending on your specific requirements and the complexity of the MTU-related problems, you may need to customize the script or include additional commands for more in-depth troubleshooting.

Remember to exercise caution when working with network devices, and always test the script in a lab or non-production environment before using it in a production environment. Additionally, ensure that you have the necessary permissions and credentials to access the Cisco switches using SSH.
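
Once the command output has been collected, a next step might be to compare each interface's MTU against the value you expect. The sketch below parses a hypothetical excerpt of show interface output (real formatting varies by platform and software release) and lists interfaces whose MTU differs from an assumed baseline of 1500 bytes. The function name Get-InterfaceMtu and the sample text are illustrative only.

```powershell
# Hypothetical excerpt of "show interface" output; adjust the patterns to your platform.
$showInterface = @'
GigabitEthernet0/1 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 0011.2233.4455
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
GigabitEthernet0/2 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 0011.2233.4456
  MTU 9000 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
Vlan10 is up, line protocol is up
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec,
'@

# Pairs each interface name with the MTU line that follows it.
function Get-InterfaceMtu {
    param ([Parameter(Mandatory = $true)][string]$Text)

    $current = $null
    foreach ($line in ($Text -split '\r?\n')) {
        if ($line -match '^(\S+) is .*line protocol') {
            # Interface header line: remember the interface name.
            $current = $Matches[1]
        }
        elseif ($current -and $line -match '^\s+MTU (\d+) bytes') {
            [pscustomobject]@{ Interface = $current; Mtu = [int]$Matches[1] }
        }
    }
}

$expectedMtu = 1500   # assumed baseline for this example
$mtus = @(Get-InterfaceMtu -Text $showInterface)
$mismatched = @($mtus | Where-Object { $_.Mtu -ne $expectedMtu })
$mismatched | Format-Table -AutoSize

# Largest ICMP payload that fits in one IPv4 packet without fragmentation:
# MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.
$maxPingPayload = $expectedMtu - 28
```

The payload figure is useful for host-side ping tests with the don't-fragment bit set; if a ping with that payload fails while a smaller one succeeds, a device on the path probably has a smaller MTU than expected.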

Troubleshooting Cisco UCS with Log Examples using PowerShell

Introduction: Cisco Unified Computing System (UCS) is a powerful and complex infrastructure platform that combines computing, networking, and storage resources. Troubleshooting UCS-related issues can be challenging, especially when dealing with large-scale deployments. Fortunately, PowerShell, along with the Cisco UCS PowerTool module, provides a comprehensive set of tools to troubleshoot UCS and analyze logs. In this article, we will explore how to use PowerShell to troubleshoot UCS issues by examining log examples and leveraging the Cisco UCS PowerTool module.

1. Establishing Connection to UCS Manager: To begin troubleshooting UCS using PowerShell, we need to establish a connection to the UCS Manager using the Cisco UCS PowerTool module. This module allows us to interact with the UCS API and retrieve log information. Use the following commands to connect to the UCS Manager:

# Import the Cisco UCS PowerTool module
Import-Module Cisco.UCSManager

# Connect to the UCS Manager (prompts for the username and password)
Connect-Ucs -Name <UCS_Manager_IP> -Credential (Get-Credential)

Replace <UCS_Manager_IP> with the address or hostname of your UCS Manager, and enter your username and password when prompted.

2. Retrieving UCS System Logs: The UCS Manager maintains various logs, such as faults, events, and audit records, that can provide valuable insights into system events and errors. PowerShell, combined with the Cisco UCS PowerTool module, allows us to retrieve and analyze these logs. The rest of this article works with the fault log. Use the following script to retrieve it:

# Retrieve the UCS fault records
$logs = Get-UcsFault

The `Get-UcsFault` cmdlet retrieves the fault records from the UCS Manager, and the script stores them in the `$logs` variable.

3. Analyzing UCS Logs: Once we have retrieved the UCS fault records, we can analyze them to identify potential issues or errors. PowerShell provides powerful string manipulation capabilities that can help extract relevant information from the logs. Use the following script as an example to analyze UCS logs:

# Loop through each log entry
foreach ($log in $logs) {
    Write-Host "Log ID: $($log.Id)"
    Write-Host "Timestamp: $($log.Created)"
    Write-Host "Severity: $($log.Severity)"
    Write-Host "Message: $($log.Descr)"
    Write-Host "---------------------------------------"

    # Add your analysis logic here
    # Example: Check for specific keywords or patterns in the fault description

    # Example: Extract additional information from the description using regular expressions
    if ($log.Descr -match "(?i)error") {
        $errorDetails = [regex]::Match($log.Descr, "(?i)error: (.+)")
        if ($errorDetails.Success) {
            Write-Host "Error Details: $($errorDetails.Groups[1].Value)"
        }
    }
}

This script loops through each log entry and displays relevant information such as the log ID, timestamp, severity, and message (the Id, Created, Severity, and Descr properties of each fault). It also demonstrates two examples of log analysis: checking for specific keywords or patterns in the description and extracting additional information using regular expressions.

4. Taking Action Based on Log Analysis: Once you have analyzed the UCS logs and identified potential issues, you can take appropriate actions to resolve them. PowerShell allows you to automate these actions using the Cisco UCS PowerTool module. Use the following script as a starting point to take action based on log analysis:

# Example: Taking action based on log analysis
foreach ($log in $logs) {
    # Add your analysis logic here

    # Example: Check for specific keywords or patterns in the description and take action
    if ($log.Descr -match "(?i)error") {
        # Add your action logic here
        # Example: Send an email notification, generate an alert, or execute a remediation script
        Write-Host "Error detected! Taking action..."
    }
}

This script demonstrates an example where it checks for specific keywords or patterns in the description (e.g., “error”) and takes action accordingly. You can customize the logic to suit your environment and requirements, such as sending email notifications, generating alerts, or executing remediation scripts.
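
Before acting on individual entries, it can help to summarize what has been retrieved. The sketch below groups entries by severity and pulls out the text that follows "error:" in each description. It runs against made-up sample objects, and it assumes each entry has Severity and Descr properties; adjust the property names to match the objects you actually retrieve.

```powershell
# Made-up records standing in for entries retrieved from UCS Manager.
# Severity and Descr are assumed property names.
$sampleLogs = @(
    [pscustomobject]@{ Severity = 'major';   Descr = 'DIMM 3 on server 1/2 operability: inoperable' }
    [pscustomobject]@{ Severity = 'warning'; Descr = 'Power supply 1 in chassis 1 error: input lost' }
    [pscustomobject]@{ Severity = 'major';   Descr = 'Fan module 2 error: speed below threshold' }
    [pscustomobject]@{ Severity = 'info';    Descr = 'Service profile esx-host-01 associated' }
)

# Count entries per severity, most frequent first.
$summary = @($sampleLogs |
    Group-Object -Property Severity |
    Sort-Object -Property Count -Descending |
    Select-Object -Property Name, Count)

# Collect the text after "error:" from every description that contains it.
$errorDetails = @(foreach ($entry in $sampleLogs) {
    if ($entry.Descr -match '(?i)error:\s*(.+)$') { $Matches[1] }
})

$summary | Format-Table -AutoSize
$errorDetails
```

A summary like this makes it easier to decide which severities deserve an automated action and which can simply be logged.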

5. Automating UCS Log Analysis: To continuously monitor UCS logs and automate the troubleshooting process, you can schedule the PowerShell script to run at regular intervals using the Windows Task Scheduler or any other automation tool. By doing so, you can proactively detect and address UCS-related issues, minimizing their impact on your infrastructure.

Conclusion: Troubleshooting Cisco UCS issues can be a complex task, but PowerShell, along with the Cisco UCS PowerTool module, provides a powerful toolset to simplify the process. By leveraging PowerShell’s capabilities, administrators can establish connections to UCS managers, retrieve system logs, analyze log entries, and take appropriate actions based on their findings. This enables efficient troubleshooting and resolution of UCS-related issues, ensuring the availability and stability of the infrastructure.

UCS LSI Adapter Beeps Continuously for the UCS C-Series Rack servers

A Cisco UCS server beeps continuously; this beep originates from the LSI RAID adapter.

You can view the LSI MegaRAID Card Beep Codes in order to identify the specific alarm.

If there are no failed drives reported by the Cisco Integrated Management Controller (CIMC), this problem might occur due to an issue described in this LSI MegaRAID firmware bug: LSIP200139764.

Note: The LSI defect is reported as fixed in MegaRAID firmware 21.0.1-0111 and 21.0.1-0110. This issue affects any server vendor that uses these adapters (for example, Cisco, HP, and Dell).

You can silence the alarm in the LSI MegaRAID BIOS Config Utility. However, be aware that this process requires a server reboot in order to access the WebBIOS. The alarm control is located in Controller Properties.

Also, if you want to avoid an LSI firmware upgrade, you can complete these steps in order to resolve this issue:

  1. Install MegaCLI. Refer to the LSI Documents & Downloads page in order to locate documentation on this procedure.
  2. Run this command in MegaCLI in order to silence the alarm:
    # MegaCli -AdpSetProp -AlarmSilence -aALL

LSI MegaRAID Card Beep Codes:

These beep codes indicate activity and changes from the optimal state of your RAID array. For full documentation on the LSI MegaRAID cards and the LSI utilities, refer to the LSI documentation for your card.

I hope this helps.

Troubleshoot Memory issues in Cisco UCS Box

Types of memory errors:

    • DIMM Error
      • ECC(Error Correcting Code) Error
        • Multi-bit = Uncorrectable
          • At POST: the DIMM is mapped out by the BIOS and the OS does not see it
          • At runtime: usually causes an OS reboot
        • Single-bit = Correctable
          • The OS continues to see the memory; performance could degrade
      • Parity Error
      • SPD (Serial Presence Detect) Error
    • Configuration Error
      • Unpaired DIMMs
      • Mismatch errors
        • Not supported DIMMs
        • Not supported DIMM population
    • Identity unestablishable error
      • Check and update the catalog

To troubleshoot any memory-related issue on a UCS box, we first need to understand the difference between correctable and uncorrectable errors.

Whether a particular error is correctable or uncorrectable depends on the strength of the ECC code employed within the memory system. Dedicated hardware is able to fix correctable errors when they occur with no impact on program execution.

DIMMs with correctable errors are not disabled and remain available for the OS to use. Total Memory and Effective Memory are the same (taking memory mirroring into account). UCSM reports these DIMMs with an operability state of Degraded, while overall operability shows Operable with correctable errors.

Uncorrectable errors generally cannot be fixed, and may make it impossible for the application or operating system to continue execution. DIMMs with uncorrectable errors are disabled, and the OS does not see that memory. The UCSM operState changes to “Inoperable” in this case.

To Check Errors from CLI

These commands are useful when troubleshooting errors from CLI.

scope server x/y -> show memory detail
scope server x/y -> show memory-array detail
scope server x/y -> scope memory-array x -> show stats history memory-array-env-stats detail

From the memory array scope you can also access an individual DIMM.

scope server x/y -> scope memory-array z -> scope dimm n

From there you can obtain per-DIMM statistics or reset the error counters.

bdsol-6248-06-B /chassis/server/memory-array/dimm # reset-errors                
bdsol-6248-06-B /chassis/server/memory-array/dimm* # commit-buffer               
bdsol-6248-06-B /chassis/server/memory-array/dimm # show stats memory-error-state

If you see a correctable error reported that matches the information above, the problem can be corrected by resetting the BMC instead of reseating or resetting the blade server. Resetting the BMC does not impact the OS running on the blade. Use these Cisco UCS Manager CLI commands:

UCS1-A# scope server x/y
UCS1-A /chassis/server # scope bmc
UCS1-A /chassis/server/bmc # reset
UCS1-A /chassis/server/bmc* # commit-buffer

With UCSM releases 3.1 and 2.2.7, the thresholds for memory corrected errors have been removed.

Therefore, memory modules (DIMM) shall no longer be reported as “Inoperable” or “Degraded” solely due to corrected memory errors.

As per whitepaper http://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-manager/whitepaper-c11-736116.pdf

Industry demands for greater capacity, greater bandwidth, and lower operating voltages lead to increased memory error rates. Traditionally, the industry has treated correctable errors in the same way as uncorrectable errors, requiring the module to be replaced immediately upon alert. Given extensive research that correctable errors are not correlated with uncorrectable errors, and that correctable errors do not degrade system performance, the Cisco UCS team recommends against immediate replacement of modules with correctable errors. Customers who experience a Degraded memory alert for correctable errors should reset the memory error and resume operation. Following this recommendation avoids unnecessary server disruption. Future enhancements to error management are coming and will help distinguish among various types of correctable errors and identify the appropriate actions, if any, needed.

A minimum of release 2.1(3c) or 2.2(1b) is recommended; these releases include enhancements to UCS memory error management.
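
The guidance above can be expressed as a small triage rule. The sketch below is only an illustration of that reasoning: the function name Get-DimmRecommendation, the action labels, and the sample error counts are all made up, and the handling of uncorrectable errors (investigate, and replace if the DIMM stays disabled) is an assumption rather than something stated in the whitepaper.

```powershell
# Hypothetical triage helper encoding the guidance above:
# correctable errors alone do not justify replacing a DIMM.
function Get-DimmRecommendation {
    param (
        [Parameter(Mandatory = $true)][int]$CorrectableErrors,
        [Parameter(Mandatory = $true)][int]$UncorrectableErrors
    )

    if ($UncorrectableErrors -gt 0) {
        # The DIMM is expected to be disabled (Inoperable); assumed follow-up action.
        'InvestigateOrReplace'
    }
    elseif ($CorrectableErrors -gt 0) {
        # Reset the memory error counters and resume operation.
        'ResetCounters'
    }
    else {
        'NoAction'
    }
}

# Made-up DIMM error counts for illustration.
$dimms = @(
    [pscustomobject]@{ Dimm = 'A1'; Correctable = 0;  Uncorrectable = 0 }
    [pscustomobject]@{ Dimm = 'B2'; Correctable = 14; Uncorrectable = 0 }
    [pscustomobject]@{ Dimm = 'C3'; Correctable = 3;  Uncorrectable = 1 }
)

$triage = @(foreach ($d in $dimms) {
    [pscustomobject]@{
        Dimm   = $d.Dimm
        Action = Get-DimmRecommendation -CorrectableErrors $d.Correctable -UncorrectableErrors $d.Uncorrectable
    }
})

$triage | Format-Table -AutoSize
```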

Methods to Clear DIMM Blacklisting Errors:

UCSM GUI

UCSM CLI

UCS-B/chassis/server # reset-all-memory-errors

If the above troubleshooting does not help, raise a support request for assistance.

Troubleshooting Memory Errors in UCS

Memory errors are encountered when an attempt is made to read a memory location. The value read from the memory does not match the value that is supposed to be there.

Classification of Memory Errors

Detected Versus Undetected Errors:

A system without error-correcting code (ECC) memory will not detect hardware errors. Hence, memory errors will silently lead to data corruption, incorrect processing of the operating system or application, and eventually system failures. Cisco Unified Computing System™ (Cisco UCS®) servers use ECC memory. Therefore, powerful error correcting codes such as those provided by the Intel® Xeon® processors in Cisco UCS servers can detect memory errors so that silent data corruption does not occur.

Hard Versus Soft Errors:

Errors that are caused by a persistent physical defect are traditionally referred to as “hard” errors. A hard error may be caused by an assembly defect such as a solder bridge or cracked solder joint, or may be the result of a defect in the memory chip itself. Rewriting the memory location and retrying the read access will not eliminate a hard error. This error will continue to repeat. Errors caused by a brief electrical disturbance, either inside the DRAM chip or on an external interface, are referred to as “soft” errors. Soft errors are transient and do not continue to repeat. If the soft error was the result of a disturbance during the read operation, then simply retrying the read may yield correct data. If the soft error was caused by a disturbance that upset the contents of the memory array, then rewriting the memory location will correct the error. Hard errors are typically detected by memory tests run by the Cisco UCS BIOS at boot time, and any modules containing hard errors are mapped out so that they cannot cause errors during runtime. Cisco UCS servers employ memory patrol scrubbing to automatically detect and correct soft errors during runtime.

Correctable Versus Uncorrectable Errors:

Whether a particular error is correctable or uncorrectable depends on the strength of the ECC code employed in the memory system. Dedicated hardware is able to fix correctable errors when they occur with no impact on program processing. Uncorrectable errors generally cannot be fixed and may make it impossible for the application or operating system to continue processing.

Cisco UCS B-Series and C-Series Operating in UCSM 2.2 and 3.1:

To reset memory-error counters on a Cisco UCS B-Series or C-Series server in UCSM 2.2 and 3.1, run the following commands in the CLI:

ca-1-A# scope server 1/8
ca-1-A /chassis/server # reset-all-memory-errors
ca-1-A /chassis/server* # commit

Cisco UCS B-Series and C-Series Operating in UCSM 2.1:

To reset memory-error counters on a Cisco UCS B-Series or C-Series server in UCSM 2.1, run the following commands in the CLI:

Switch-A # scope server 1/1
Switch-A /chassis/server # scope memory-array 1
Switch-A /chassis/server/memory-array # scope dimm 2
Switch-A /chassis/server/memory-array/dimm # reset-errors
Switch-A /chassis/server/memory-array/dimm* # commit-buffer

Cisco UCS C-Series Rack Servers Operating in Standalone Mode:

To reset memory-error counters on a Cisco UCS C-Series Rack Server operating in standalone mode, run the following commands in the CLI:

C240-FCH092779J# scope reset-ecc
C240-FCH092779J /reset-ecc # set enabled yes
C240-FCH092779J /reset-ecc *# commit

For additional information about memory, please refer to these resources: DIMMs: Reasons to Use Only Cisco Qualified Memory on Cisco UCS Servers

Cisco UCS Manager and Cisco UCS C-Series

Most x86-architecture servers today include a management function commonly known as a baseboard management controller (BMC). The BMC is usually embedded on the motherboard or main circuit board of the server and includes a specialized service processor and firmware to monitor and manage the physical state of the server hardware. BMC functions and standards are defined in the IPMI specifications, originally developed jointly by Intel, Hewlett-Packard Enterprise, Dell, and NEC. The specification is maintained and published at Intel’s corporate website, helping ensure that BMC functions are consistently implemented on all x86 managed server platforms.

Intel includes BMCs on its customer reference board (CRB) designs, which are given to original equipment manufacturers (OEMs) and original design manufacturers (ODMs) to accelerate time to market and help ensure compliance with industry standards such as IPMI. Cisco has added value to the basic BMC functions by reengineering the BMC to make it an important part of the Cisco UCS architecture. This integration helps enable powerful, industry-leading unified computing features and the use of service profiles for server provisioning and change management.

Cisco UCS Manager runs on the Cisco UCS 6100, 6200, and 6300 Series Fabric Interconnects. It provides a wide range of features for the integrated and unified computing, networking, and storage environment of Cisco UCS. These include the rapid provisioning of infrastructure from shared pools of computing, networking, and storage resources, and the rapid scaling of IT infrastructure through the model-based management approach of Cisco service profiles.

Service profiles are used to provision and manage Cisco UCS C-Series Rack Servers and their I/O properties within a single management domain. They are created by server, network, and storage administrators and are stored in the fabric interconnects. Infrastructure policies needed to deploy applications are encapsulated in the service profile. The policies coordinate and automate element management at every layer of the hardware stack, including RAID levels, BIOS settings, firmware revisions and settings, server identities, adapter settings, VLAN and VSAN network settings, network quality of service (QoS), and data center connectivity.

Service profile templates are used to simplify the creation of new service profiles, helping ensure consistent policies within the system for a given service or application. Whereas a service profile is a description of a logical server and there is a one-to-one relationship between the profile and the physical server, a service profile template can be used to define multiple servers. The template approach makes it just as easy to configure one server as it is to configure hundreds of servers with perhaps thousands of virtual machines. This automation reduces the number of manual steps needed, helping reduce the opportunities for human error, improving consistency, and further reducing server and network deployment times.
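The relationship between a template and its profiles can be illustrated with a small data model. This is only a conceptual sketch in Python, not Cisco UCS Manager code: the class names, the fields `bios_policy`, `boot_policy`, and `vlans`, and the helpers `instantiate` and `associate` are all hypothetical and chosen to mirror the description above (one template, many profiles, each profile bound to exactly one physical server).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ServiceProfileTemplate:
    # Hypothetical subset of the policies a template might carry.
    name: str
    bios_policy: str
    boot_policy: str
    vlans: tuple


@dataclass(frozen=True)
class ServiceProfile:
    name: str
    template: str
    bios_policy: str
    boot_policy: str
    vlans: tuple
    server: str = ""  # physical server this profile is bound to (one-to-one)


def instantiate(template, prefix, count):
    """Create `count` profiles that all inherit the template's policies."""
    return [
        ServiceProfile(
            name=f"{prefix}{i}",
            template=template.name,
            bios_policy=template.bios_policy,
            boot_policy=template.boot_policy,
            vlans=template.vlans,
        )
        for i in range(1, count + 1)
    ]


def associate(profiles, servers):
    """Bind each profile to exactly one server, in order."""
    if len(servers) < len(profiles):
        raise ValueError("not enough servers for the requested profiles")
    return [replace(p, server=s) for p, s in zip(profiles, servers)]


if __name__ == "__main__":
    tmpl = ServiceProfileTemplate("web-tmpl", "bios-default", "boot-san", ("vlan10",))
    for profile in associate(instantiate(tmpl, "web-", 3), ["1/1", "1/2", "1/3"]):
        print(profile)
```

Because every profile copies its policies from the template, creating one profile or hundreds takes the same single call, which is the consistency benefit the paragraph describes.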

Cisco IMC communicates vital information about each individual server to Cisco UCS Manager. Cisco IMC provides many diagnostic and health monitoring services that contribute to the holistic management environment enabled by Cisco UCS.

Diagnostic and health monitoring features provided with Cisco IMC include:

●   SNMP

●   XML API event subscription and configurable alerts

●   System event log

●   Audit log

●   Monitoring of field-replaceable units (FRUs), HDD faults, dual inline memory module (DIMM) faults, network interface card (NIC) MAC addresses, CPU, and thermal faults

●   Configurable alerts and thresholds

●   Watchdog timer

●   RAID configuration and monitoring

●   Predictive failure analysis of HDD and DIMM

●   Converged network adapters (CNAs)

●   Reliability, availability, and serviceability (RAS)

●   Network Time Protocol (NTP)

●   Graphical and command-line client

Cisco IMC in Standalone Mode on Cisco UCS C-Series Servers

Many customers deploy Cisco UCS C-Series servers in a standalone environment as x86 servers (see Figure above). In such a deployment, the servers are not integrated with other Cisco UCS components, such as the fabric interconnects, fabric extenders, or Cisco UCS Manager. When Cisco UCS C-Series servers are operating in standalone mode, administrators can use Cisco IMC as an industry-standard BMC through a web-based GUI, a secure shell (SSH)–based command-line interface (CLI), or the native API to configure, administer, and monitor the server. IMC provides users full control of the server, allowing complete configuration and management. It can be configured to operate in several different network modes, taking advantage of the dedicated management port or sharing the same physical interface as the host in Shared LOM or Cisco Card mode. With Cisco IMC, administrators can perform the following server management tasks:

●   Power on, power off, power cycle, reset, and shut down the server

●   Toggle the locator LED

●   Configure the server boot order

●   View server properties and sensors

●   Complete out-of-band storage configuration

●   Manage remote presence

●   Manage firmware

●   Create and manage local user accounts and enable authentication through Active Directory and LDAP

●   Configure network-related settings, including NIC properties, IPv4, IPv6, VLANs, and network security

●   Configure communication services, including HTTP, SSH, and IPMI over LAN

●   Manage certificates

●   Configure platform event filters

●   Monitor faults, alarms, and server status

Cisco IMC is included with each Cisco UCS C-Series server at no additional cost to customers.

The latest release of IMC, version 3.0(1), introduces a number of new features to better align with customer needs. The most notable of these new capabilities are the HTML5 web UI and KVM, Redfish support, and XML API transactional support. HTML5 provides customers with a simplified user interface and, along with a reliable KVM, eliminates the need for Java to use IMC.

The IPMI interface cannot meet scale-out and cloud requirements for the simplicity and security that modern programming interfaces offer, so Intel and other server vendors have developed the new Redfish standard. Redfish is an open industry standard specification and schema that specifies a RESTful interface and uses JSON and OData to help customers integrate solutions within their existing tool chains. It establishes a new management standard for system control that is scalable, easy to use, and secure. Redfish is sponsored by the Distributed Management Task Force (DMTF), a peer-review standards body recognized throughout the industry. Cisco UCS has adopted support for the Redfish standard on IMC version 3.0.
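To make the "RESTful interface with JSON and OData" description concrete, here is a minimal Python sketch that builds a Redfish request and reads a collection response, using only the standard library. The `/redfish/v1/Systems` path and the `Members` / `@odata.id` fields come from the DMTF Redfish schema; the host name, the credentials, the sample payload, and the helper names `build_request` and `member_paths` are placeholders invented for illustration, and the sample is not captured from a real IMC.

```python
import base64
import json
import urllib.request


def build_request(host, path="/redfish/v1/Systems", user="admin", password="password"):
    """Build (but do not send) an HTTPS GET request with HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"https://{host}{path}",
        headers={"Accept": "application/json", "Authorization": f"Basic {token}"},
        method="GET",
    )


def member_paths(collection_json):
    """Return the @odata.id link of every member in a Redfish collection."""
    doc = json.loads(collection_json)
    return [member["@odata.id"] for member in doc.get("Members", [])]


# Illustrative payload shaped like a Redfish collection (placeholder values).
SAMPLE = """
{
  "@odata.id": "/redfish/v1/Systems",
  "Members@odata.count": 1,
  "Members": [{"@odata.id": "/redfish/v1/Systems/EXAMPLE-SERIAL"}]
}
"""

if __name__ == "__main__":
    req = build_request("imc.example.com")
    print(req.get_method(), req.full_url)
    print(member_paths(SAMPLE))
```

Against a live controller, the request would be passed to `urllib.request.urlopen` with a TLS context appropriate to the controller's certificate, and each returned link could then be fetched in the same way.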

Redfish introduces a RESTful API to the IMC and is a simple, secure replacement for IPMI. Finally, XML API transactional support caters to users who rely on the programmability of the IMC. Users can now configure multiple managed objects in a single transaction, allowing for quicker, simpler deployments.
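As a rough illustration of grouping several managed objects into one request, the Python sketch below assembles a single XML document that carries two configuration changes. Everything specific in it is an assumption to be verified against the Cisco IMC XML API programmer's guide for your release: the `configConfMos` method name and its `inConfigs`/`pair` layout, the class names `computeRackUnit` and `equipmentLocatorLed`, their attributes, and the distinguished names. The helper `build_transaction` is hypothetical, and the cookie value is a placeholder for a real session cookie.

```python
import xml.etree.ElementTree as ET


def build_transaction(cookie, configs):
    """Build one XML request that configures several managed objects.

    `configs` maps a distinguished name (dn) to a (class_id, attributes) tuple.
    Method and element names are assumptions, not confirmed IMC API details.
    """
    root = ET.Element("configConfMos", {"cookie": cookie, "inHierarchical": "false"})
    in_configs = ET.SubElement(root, "inConfigs")
    for dn, (class_id, attrs) in configs.items():
        pair = ET.SubElement(in_configs, "pair", {"key": dn})
        ET.SubElement(pair, class_id, {"dn": dn, **attrs})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    request_body = build_transaction(
        "PLACEHOLDER-COOKIE",
        {
            # Assumed class names and attributes, for illustration only.
            "sys/rack-unit-1": ("computeRackUnit", {"adminPower": "up"}),
            "sys/rack-unit-1/locator-led": ("equipmentLocatorLed", {"adminState": "on"}),
        },
    )
    print(request_body)
```

The point of the sketch is the shape: one request body, several objects, so the controller can apply them together rather than through a series of separate calls.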

Along with the many new software capabilities, Cisco has enhanced several of the utilities that rely on the IMC:

●   Cisco IMC Emulator

●   Non-Interactive Server Configuration Utility (NI-SCU)

●   Separation of SCU and diagnostics

●   Driver Update (Linux)

For More Information:

●   Cisco UCS Services: Accelerate Your Transition to a Unified Computing Architecture: http://www.cisco.com/en/US/services/ps2961/ps10312/Unified_Computing_Services_Overview.pdf

●   Cisco UCS C-Series Servers Integrated Management Controller CLI Configuration Guide, Release 3.0: http://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/c/sw/cli/config/guide/3_0/b_Cisco_UCS_C-Series_CLI_Configuration_Guide_301.html

●   Setup for Cisco IMC on Cisco UCS C-Series Servers: http://www.cisco.com/en/US/partner/products/ps10493/products_configuration_example09186a0080b10d66.shtml

●   Cisco UCS C-Series Rack-Mount Servers: http://www.cisco.com/en/US/partner/products/ps10493/tsd_products_support_series_home.html

●   Unified computing: http://www.cisco.com/en/US/partner/netsol/ns944/index.html

●   Intelligent Platform Management Interface (IPMI) Specifications: http://www.intel.com/design/servers/ipmi/spec.htm