Root Cause Analysis (RCA) - Alert Based Workflows
Overview of Framework
The Root Cause Analysis (RCA) workflows are a new framework that has been deployed to your organizations. The framework assists in the remediation of alerts that arise in the Dashboard. RCAs for the various products are designed and deployed regularly, as alerts are identified for which a guided workflow can lead the network admin to a resolution more efficiently. This document lists the RCAs that are currently generally available in the framework, with an explanation of each workflow.
How do you engage?
There are three entry points to the RCA process: the Alert Hub, the Device Details page, and the Organization Alerts page. Both the Alert Details and Take Action links/buttons take you into the RCA side drawer workflow. These are generic examples showing where in the Dashboard you can begin the RCA troubleshooting workflow.
What do you see? (Example images of framework)
When you enter the RCA workflow, a new side drawer opens to guide you through the process that has been curated for this specific alert. Below the alert title there are two tabs, Alert details and Suggested Actions. Depending on which link/button you selected previously, you will land in one of these sections to begin, and you can move freely between them. Each RCA is customized with product- and alert-specific content, but the overall framework retains a comparable user experience.
Note: As RCA workflows are built, they will be added to this list. The following workflows are currently available in the Dashboard.
Ethernet Uplink Speed Degraded
Alert Description - Access point connected to a 10/100 Mbps Ethernet connection
The performance of an Ethernet connection can significantly influence a wireless network's overall effectiveness and efficiency. Ethernet connections can have various speed and duplex permutations, such as 10 Mbps, 100 Mbps, 1 Gbps, 2.5 Gbps, and 5 Gbps, with either Half or Full Duplex modes. These parameters are established through a negotiation process between connected devices. However, if this negotiation fails, the devices may not achieve their optimal speed and duplex settings, leading to suboptimal network performance.
Modern Wi-Fi standards, such as Wi-Fi 6 and 6E, support speeds exceeding 1 Gbps, with Wi-Fi 6E Access Points (APs) typically requiring a minimum Ethernet speed of 2.5 Gbps to function effectively. If an AP's Ethernet connection is limited to 10 Mbps or 100 Mbps, it will significantly impact wireless performance. This creates a bottleneck, preventing the AP from fully utilizing its high-speed wireless capabilities and ultimately degrading overall network efficiency and user experience.
The image below shows a trend view of the Ethernet port's current speed. In the top line graph, red highlights mark the periods when the AP alerted about the negotiation failure. This page also shows the number of clients impacted when the alert was first reported. Clicking Clients Impacted displays the clients that were connected to this particular AP. A second trend chart shows the average wireless data rate of all clients that were connected to this AP during the alert period. A network administrator can assess the actual performance impact and bottleneck by comparing the connected wireless data rate with the Ethernet uplink data rate.
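The comparison between the wireless data rate and the Ethernet uplink rate can be illustrated with a minimal Python sketch. The function name and the sample rates are hypothetical and are not taken from the Dashboard. The sketch also treats the average client data rate as a rough stand-in for demand, which is a simplification.

```python
def uplink_is_bottleneck(avg_wireless_rate_mbps: float, uplink_speed_mbps: float) -> bool:
    """Return True when the average client data rate exceeds the Ethernet uplink speed.

    Both arguments are in Mbps. This is an illustrative comparison only;
    it is not how the Dashboard computes impact.
    """
    return avg_wireless_rate_mbps > uplink_speed_mbps


# Hypothetical readings: clients average 433 Mbps on the wireless side.
print(uplink_is_bottleneck(433.0, 100.0))   # uplink negotiated at 100 Mbps
print(uplink_is_bottleneck(433.0, 1000.0))  # uplink negotiated at 1 Gbps
```

With these made-up numbers, the 100 Mbps uplink is flagged as a bottleneck and the 1 Gbps uplink is not.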
Guided Troubleshooting Flow
Suggested Actions for AP Ethernet uplink degraded on Cisco Meraki MS Switches:
Troubleshooting assistance for each suggested action can only be performed to its full capability while the AP is connected to a Cisco Meraki MS Switch. Third-party switches will not have active test capabilities, but the alert details can display connected third-party switch details and suggested actions.
Make sure you have the network access required to take the suggested actions.
1. Cable Test: This tests the cable and the switch port to which the AP is connected.
Once the cable test has run, it lists all the parameters it was able to identify and the results of each test, as shown below:
2. Update link negotiation settings: This action allows the network administrator to force specific speed and negotiation settings on the switch port connected to the AP without leaving the suggested action page.
If the negotiation is successfully changed on the switch and the link establishes at a speed faster than 1 Gbps full duplex, the alert is automatically moved to the resolved state.
3. Cycle port on switch: This action turns the switch port off and on again, forcing the AP to reboot and restart Ethernet negotiation.
This action will power down the access point momentarily. Make sure to run it during a maintenance window.
Once the port has powered back on, the access point renegotiates its speed. If the port cycling helped negotiate the right speed for the AP, the alert will be resolved.
Suggested Actions and Test Assistance for 3rd Party Switches:
1. Cable Test: This tests the cable and the switch port to which the AP is connected. The test identifies the uplink switch and prompts you to check whether the cable is damaged.
2. Check auto-negotiation in switch port settings: This recommendation is to verify that your switch port is configured for auto-negotiation or for the right speed.
3. Cycle port on switch: This recommendation is to power cycle the switch port to which the access point is connected.
Cyclic Redundancy Check (CRC) Errors Detected
CRC is used as a means of detecting errors in transmitted data.
The sending device generates a value derived from the remainder of a polynomial division of its data contents. The receiving device compares the recalculated CRC value with the one received along with the data. If the two CRC values match, it indicates that the data hasn't been corrupted during transmission.
When the receiving device detects a mismatch between the received CRC value and the recalculated CRC value, it flags a CRC error. Seeing CRC errors reported by the Meraki switch on the dashboard indicates that the data may have been altered or corrupted during transmission.
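The generate-and-compare process described above can be sketched in a few lines of Python. This is a minimal illustration only: it uses the standard library's zlib.crc32, the helper names make_frame and crc_matches are hypothetical, and appending the CRC as four big-endian bytes is an assumption made for readability rather than the on-the-wire Ethernet format. In practice, switches compute and check the frame check sequence in hardware.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 of the payload (illustrative layout)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def crc_matches(frame: bytes) -> bool:
    """Receiver side: recompute the CRC over the payload and compare it with the received value."""
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received_crc


frame = make_frame(b"hello switch")
# Flip one bit in the first byte to simulate corruption in transit.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]

print(crc_matches(frame))      # intact frame
print(crc_matches(corrupted))  # corrupted frame: would be counted as a CRC error
```

The intact frame passes the check, while the frame with a single flipped bit fails it. That failure is the condition a receiving port counts as a CRC error.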
Note that a port experiencing CRC errors can be shown as Red or Amber. Amber indicates a high amount of L1 packet errors: the port is sending or receiving a high amount (greater than 100 hits/hour or greater than 1% of traffic) of CRC align errors, fragments, and/or collisions. Red indicates a very high amount of L1 packet errors: the port is sending or receiving a very high amount (greater than 1000 hits/hour or greater than 10% of traffic) of CRC align errors, fragments, and/or collisions.
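The Amber and Red thresholds above can be expressed as a small classification function. This is a sketch under stated assumptions: the function name, the "ok" label for ports below both thresholds, and passing the error share as a fraction of traffic are all hypothetical choices, and the Dashboard's actual implementation is not documented here.

```python
def port_error_status(errors_per_hour: float, error_fraction: float) -> str:
    """Classify a port by its L1 packet error volume.

    errors_per_hour: count of CRC align errors, fragments, and collisions per hour.
    error_fraction: share of traffic affected, as a fraction (0.01 means 1%).
    """
    # Red: more than 1000 hits/hour or more than 10% of traffic.
    if errors_per_hour > 1000 or error_fraction > 0.10:
        return "red"
    # Amber: more than 100 hits/hour or more than 1% of traffic.
    if errors_per_hour > 100 or error_fraction > 0.01:
        return "amber"
    return "ok"


print(port_error_status(150, 0.001))  # amber: hit count above 100/hour
print(port_error_status(10, 0.2))     # red: 20% of traffic affected
```

Red is checked first because any port that meets the Red threshold also meets the Amber threshold.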
Guided Troubleshooting Flow
This feature is designed to reduce troubleshooting effort, make issue resolution more intuitive, and save time for our customers. The guided CRC troubleshooting flow automates and outlines the suggested actions (refer to the flow diagram and the short video below) to perform to resolve CRC error alerts. It is designed to help network administrators efficiently identify the root cause of CRC errors on switch ports and resolve it.
The issue/alert will be shown in several areas, one of them being the switch details page itself:
The dashboard will also prompt and highlight the issue within the Alert Hub -> drop down. This drop down allows you to troubleshoot the issue regardless of the page you are currently viewing. From here you can view the alert details and suggested actions.
The details section highlights the timeframes within the last 2 weeks during which this alert was triggered.
The suggested actions section will allow you to perform several actions to help rectify the alert.
1. The first item is a validation of link negotiation between the two devices. If a configuration mismatch is found, it can be corrected from this drop down itself, instead of having to navigate to each switch and switch port page to make the required changes.
2. If the link negotiation configurations match between the connected devices, we will then suggest a cable test to verify the integrity of the physical cable itself.
The cable test cannot be run on your uplink port because it would disrupt traffic.
3. More suggestions will be listed within the drop down to help you identify the root cause of the issue:
Unplanned Low Power Mode in Access Points
When an Access Point operates in low power mode, it may face potential situations that can compromise its performance and reliability. The low power mode is typically a fallback state that occurs when the AP does not receive sufficient power to support its full range of functionalities. This can happen due to several underlying issues, often related to the physical infrastructure supporting the device.
Risks and Implications:
- Potential Risk of Unplanned Resets: The AP is more susceptible to unplanned resets in low power mode, especially under heavy network loads. This occurs because the device struggles to maintain its operations without adequate power, leading to instability and potential disruptions in connectivity.
- Disabled Hardware Features: Several hardware features may be impacted to conserve power. This includes:
- Air Marshal: This security feature uses the access point's dedicated scanning radio to help detect and mitigate rogues and other wireless threats. Disabling Air Marshal can leave the network vulnerable to security breaches.
- Radios: The AP may shut down one or more of its radios or reduce the number of spatial streams, reducing its ability to provide wireless coverage and handle client connections. This can result in decreased network performance and coverage gaps.
- USB Interface: The GNSS receiver and third-party ESL gateway module could turn off while the AP operates in low-power mode, which would result in losing access to the USB interface's data.
The primary causes of low power mode are usually physical issues, which can include:
- Low-Quality Cables: Substandard Ethernet cables can lead to insufficient power delivery. These cables may not meet the necessary specifications to carry Power over Ethernet (PoE), resulting in power constraints.
- Low PoE Budget: The PoE switch or injector may not provide enough power to support all the AP's features. This can happen if the power budget is not properly calculated or the switch is overburdened with too many connected devices.
- Loose RJ-45 Connections: A loose or improperly connected RJ-45 plug can lead to intermittent power delivery. This can cause AP power supply fluctuations, triggering low power mode.
- Cable Damage: Physical damage to the Ethernet cable, such as cuts, kinks, or excessive bending, can impair the cable’s ability to deliver power effectively. This damage can result from environmental factors or poor installation practices.
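The low PoE budget cause listed above comes down to simple arithmetic: the switch's total PoE budget minus the power already drawn by connected devices must cover what the AP needs. The following Python sketch shows that check. The function names and all wattage figures are hypothetical and are not taken from any Meraki datasheet.

```python
from typing import Sequence


def poe_headroom_watts(switch_budget_watts: float, port_draws_watts: Sequence[float]) -> float:
    """Power left in the switch's PoE budget after all current port draws."""
    return switch_budget_watts - sum(port_draws_watts)


def can_run_at_full_power(
    switch_budget_watts: float,
    port_draws_watts: Sequence[float],
    ap_required_watts: float,
) -> bool:
    """True when the remaining budget covers the AP's full-power requirement."""
    return poe_headroom_watts(switch_budget_watts, port_draws_watts) >= ap_required_watts


# Made-up example: a 120 W budget with three devices drawing 30 W each.
budget = 120
draws = [30, 30, 30]

print(poe_headroom_watts(budget, draws))         # 30 W of headroom remain
print(can_run_at_full_power(budget, draws, 25))  # an AP needing 25 W fits
print(can_run_at_full_power(budget, draws, 45))  # an AP needing 45 W does not
```

In the last case the AP would not receive its full requirement, which is the situation in which a fallback to low power mode would be expected.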
Guided Troubleshooting Flow
Suggested Actions for AP Unplanned low power mode on Cisco Meraki MS Switches:
Troubleshooting assistance for each suggested action can only be performed to its full capability while the AP is connected to a Cisco Meraki MS Switch. Third-party switches will not have active test capabilities, but alert details can display connected third-party switch details and suggested actions.
Ensure you have the network access required to take the suggested actions.
1. The first action is to check the PoE budget on the switch. This ensures that the switch can budget enough power for the access point to operate at its full potential.
If the PoE budget is exceeded, you will see recommendations as shown in the image below.
2. If the link negotiation configurations between the connected devices match, we will suggest a cable test to verify the integrity of the physical cable itself.
3. The next item is to cycle the switch port to which the AP is connected.
4. The last option is to capture packets on the access point port to look for LLDP negotiation failures.
Make sure you have the right admin privileges to run a packet capture.
Once the packet capture has run successfully, you can download or view the PCAP right here, as shown below.
Please refer to the Low power mode KB for more information: https://documentation.meraki.com/MR/Monitoring_and_Reporting/Low_Power_Mode
Suggested Actions and Test Assistance for 3rd Party Switches:
1. Perform a cable test to verify the connection to the switch.
2. Check that LLDP is configured correctly on the switch.
3. Power cycle the switch port to which the AP is connected.
4. Check the PoE budget on the switch console to make sure enough power is available to operate the AP at the minimum required budget.
5. Run a packet capture on the AP Ethernet uplink to find LLDP negotiation failures.