Routed HA Failover Behavior

Overview

The Cisco Meraki MX offers seamless hardware failover using a warm spare high availability (HA) configuration. This article details how an HA pair of MXs uses Virtual Router Redundancy Protocol (VRRP) to fail over and maintain connectivity for downstream clients.

Please note, this article assumes working knowledge of VRRP and Routed HA.

For more information about VRRP, please reference the RFC.

For more information about Routed HA, please reference our documentation.

VRRP Mechanics for HA

A pair of MXs in an HA configuration uses VRRP advertisements to monitor the status of the current active MX. In a working state, the active MX sends VRRP advertisements out to the LAN every second. If the passive MX does not receive any advertisements for 3 seconds, it assumes that the active MX has failed and takes over as the new active (including sending its own advertisements). This mechanism allows a spare MX to take over in the event of a hardware failure.
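The heartbeat logic above can be sketched in a few lines of Python. This is an illustrative model only, not the MX implementation: the class name `PassiveMonitor` and its methods are hypothetical, and only the 1-second and 3-second values come from this article.

```python
ADVERT_INTERVAL_S = 1.0  # the active MX advertises every second (from the article)
DEAD_INTERVAL_S = 3.0    # the passive MX takes over after 3 s of silence (from the article)


class PassiveMonitor:
    """Hypothetical model of the passive MX watching for advertisements."""

    def __init__(self, now: float) -> None:
        self.last_advert = now
        self.is_active = False

    def on_advertisement(self, now: float) -> None:
        # Any advertisement from the active MX resets the silence timer.
        self.last_advert = now

    def tick(self, now: float) -> bool:
        # Take over once no advertisement has been seen for 3 seconds.
        if not self.is_active and now - self.last_advert >= DEAD_INTERVAL_S:
            self.is_active = True
        return self.is_active


monitor = PassiveMonitor(now=0.0)
monitor.on_advertisement(now=1.0)
print(monitor.tick(now=3.5))  # only 2.5 s of silence so far
print(monitor.tick(now=4.0))  # 3 s of silence: the passive MX takes over
```

Timestamps are passed in explicitly so the model can be exercised without waiting on a real clock.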

In addition to this simple heartbeat mechanism, the active MX also reports its VRRP priority in the advertisements it sends. For reference, each MX uses the following priority values, which depend on its role (primary or spare) and on whether it has uplink connectivity:

VRRP Priority Values set on the MX

Scenario	Primary MX	Spare MX
Working Uplinks on primary	255	235
No Uplink Connection on primary	75	235
No Uplink Connection on spare	255	55

VRRP Priority Values sent by the MX

Scenario	Primary MX	Spare MX
Working Uplinks on primary	255	Nothing
No Uplink Connection on primary	0 (single advertisement)	235
No Uplink Connection on spare	255	Nothing

If either MX sees a VRRP advertisement with a lower priority than its own, that MX will take over as active.

For example: If the active/primary MX loses all uplink connectivity, it changes its own internal VRRP priority to 75 and sends a one-time advertisement with a priority of 0. A priority of 0 indicates that the sender will no longer be sending advertisements. When the spare MX receives the advertisement with priority 0, it sees that its own priority (235) is greater than the priority within the advertisement, so the spare takes over as the current active and begins sending advertisements with a priority of 235. The primary MX stops sending advertisements until it returns to a working state.

This mechanism allows a spare MX to take over in the event of an upstream failure on the primary MX.
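As a rough illustration of the priority comparison described above, the following Python sketch encodes the internal priority values from the first table and the "lower advertised priority than my own" rule. The function names are hypothetical and the code is a simplified model, not the MX firmware logic.

```python
def internal_priority(role: str, has_uplink: bool) -> int:
    """Internal VRRP priority per the 'set on the MX' table in this article."""
    table = {
        ("primary", True): 255,
        ("primary", False): 75,
        ("spare", True): 235,
        ("spare", False): 55,
    }
    return table[(role, has_uplink)]


def should_take_over(own_priority: int, advertised_priority: int) -> bool:
    """An MX takes over when it sees an advertisement with a lower priority than its own."""
    return advertised_priority < own_priority


# Primary loses its uplinks and sends a single advertisement with priority 0.
spare = internal_priority("spare", has_uplink=True)
print(should_take_over(spare, advertised_priority=0))    # spare takes over

# Primary (now at 75) sees the spare advertising 235 and stays passive.
primary = internal_priority("primary", has_uplink=False)
print(should_take_over(primary, advertised_priority=235))
```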

Eventlog

Message	Meaning
VRRP Transition	Event log message denoting a change
if_up	VRRP interface state after the event
old_if_up	VRRP interface state before the event
mode	Mode after the event
old_mode	Mode before the event
prio	VRRP Priority after the event
old_prio	VRRP Priority before the event
elector_state	State of MX after the event
last_state_change_reason	Reason for the state change

Additional VRRP Notes

Only the current active MX will send VRRP advertisements. In addition to the VRRP priority, there are two key values used by the HA pair:

VRRP Router ID - A shared router ID used by both MXs in the warm spare pair.
VRRP MAC address - The virtual MAC address used on the LAN by both MX.

Together, these two fields indicate that a VRRP advertisement was sent by the other MX in the pair; each MX ignores any VRRP advertisements that do not match these values.

In addition, the VRRP MAC address is shared by both MXs for LAN communication. Clients on the LAN will associate this shared MAC address with the MX's LAN IPs. As such, in the event of failover, LAN clients won't need to update their ARP table with a new MAC address.
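The matching rule above can be modeled as a simple filter. This is a sketch under assumptions: the `Advertisement` type, the `is_from_peer` function, and the router ID and MAC values shown are placeholders for illustration and are not taken from an actual MX.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advertisement:
    router_id: int
    source_mac: str
    priority: int


def is_from_peer(advert: Advertisement, shared_router_id: int, shared_mac: str) -> bool:
    """Accept an advertisement only if both the router ID and the virtual MAC match."""
    return (
        advert.router_id == shared_router_id
        and advert.source_mac.lower() == shared_mac.lower()
    )


# Placeholder values, chosen only for this example.
SHARED_ROUTER_ID = 1
SHARED_MAC = "02:00:00:00:00:01"

peer = Advertisement(router_id=1, source_mac="02:00:00:00:00:01", priority=255)
other = Advertisement(router_id=7, source_mac="02:00:00:00:00:99", priority=100)

print(is_from_peer(peer, SHARED_ROUTER_ID, SHARED_MAC))   # matches both values
print(is_from_peer(other, SHARED_ROUTER_ID, SHARED_MAC))  # ignored
```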

MX vs MS Advertisement Timers

MXs use a 1-second timer for VRRP advertisements. In contrast, MS switches send advertisements every 0.3 seconds. That is why MS switches fail over in 0.9 seconds, as opposed to the expected 3-second failover for MX.
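The arithmetic behind these figures can be checked with a short sketch. It assumes, as the numbers in this article imply, that both platforms declare a failure after three missed advertisement intervals; the function name is hypothetical. `Decimal` is used so that 3 × 0.3 comes out as exactly 0.9 rather than a floating-point approximation.

```python
from decimal import Decimal

MISSED_ADVERTS = 3  # assumption: failure is declared after three silent intervals


def failover_time(advert_interval_s: str) -> Decimal:
    """Expected failover time in seconds for a given advertisement interval."""
    return MISSED_ADVERTS * Decimal(advert_interval_s)


print(failover_time("1"))    # MX: 1-second advertisement timer
print(failover_time("0.3"))  # MS: 0.3-second advertisement timer
```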

Typical Failover Scenario

The following sections walk step-by-step through a common HA failover scenario, wherein the primary MX loses all uplink connectivity and the spare MX takes over.

The following scenario assumes that the primary and spare MXs are connected on the LAN side, and that they are able to exchange VRRP advertisements across all configured VLANs.

Normal State

Starting from a baseline working state, both the primary and spare MX are online with dual uplinks. Everything is normal, so the primary MX is the current active:

Diagram depicting a Primary and Spare MX HA pair. Both MXs have dual ISP connections and are connected to the same LAN switch. The primary MX is active.

In this state, the primary MX sends VRRP advertisements (with a priority of 255) every second:

Screenshot of a Wireshark packet capture showing VRRP messages sent by the primary MX. The value of the VRRP Priority field is 255.
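For readers who want to pull the priority out of a capture programmatically, here is a minimal sketch that decodes the first three bytes of a VRRP payload. The first bytes have the same layout in VRRPv2 (RFC 3768) and VRRPv3 (RFC 5798): a 4-bit version, a 4-bit type, the virtual router ID, then the priority. The sample bytes below are constructed for illustration and are not taken from an MX capture; which VRRP version the MX uses is not stated in this article.

```python
def parse_vrrp_header(payload: bytes) -> dict:
    """Decode version, type, virtual router ID, and priority from a VRRP payload."""
    if len(payload) < 3:
        raise ValueError("VRRP payload too short")
    return {
        "version": payload[0] >> 4,   # high nibble of the first byte
        "type": payload[0] & 0x0F,    # low nibble (1 = advertisement)
        "vrid": payload[1],
        "priority": payload[2],
    }


# Illustrative bytes: version 2, type 1, VRID 1, priority 255.
sample = bytes([0x21, 0x01, 0xFF, 0x01])
print(parse_vrrp_header(sample))
```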

Primary Uplink Failure

After the primary MX loses all uplink connectivity, it will send a VRRP advertisement with a priority of 0.

The priority value zero (0) has special meaning indicating that the current active has stopped participating in VRRP. This is used to trigger backup routers to immediately transition to active without having to wait for the current active to timeout.

Screenshot of a Wireshark packet capture showing VRRP messages sent by the primary MX after losing uplink connectivity. The value of the VRRP Priority field is now 0.

Failover to Spare MX

Once the spare MX receives the 0-priority VRRP advertisement, it will become the new active.

Diagram depicting a Primary and Spare MX HA pair. Both MXs have dual ISP connections and are connected to the same LAN switch. The primary has lost connectivity to both its ISP connections, so the spare MX is currently the active member.

As the new active, the spare MX takes over the LAN by sending its own advertisements with a priority of 235:

Screenshot of a Wireshark packet capture showing VRRP messages sent by the spare MX after taking over as the active member. The value of the VRRP Priority field is 235.

Additional Failover Scenarios

The following sections outline some less common failover scenarios:

Both MXs Lose Uplink Connectivity

Assume the end of the scenario above, where the primary MX has no uplink connectivity and the spare MX is the current active.

If the spare MX also loses all uplink connectivity, it will send a VRRP message with a priority of 0:

Screenshot of a Wireshark packet capture showing VRRP messages sent by the spare MX after taking over as the active member then losing uplink connectivity itself. The value of the VRRP Priority field is 0.

In this scenario, the primary MX will transition back into the current active role. Without any working uplinks, it will only provide LAN routing:

Diagram depicting a Primary and Spare MX HA pair. Both MXs have dual ISP connections and are connected to the same LAN switch. The primary and spare have both lost connectivity to both ISP connections, so the primary MX is the active member.

When the primary MX receives the 0-priority VRRP advertisement, the primary starts sending its own VRRP advertisements with a priority of 75, indicating that it does not have uplink connectivity:

Screenshot of a Wireshark packet capture showing VRRP messages sent by the primary MX after receiving a VRRP advertisement from the spare with a priority of 0. The value of the VRRP Priority field is now 75.

Uplinks and Primary MX Down

In the unlikely scenario that the primary MX's hardware goes down entirely while the spare has no working uplinks, the spare will transition back to the current active role in order to provide LAN routing:

Diagram depicting a Primary and Spare MX HA pair. Both MXs have dual ISP connections and are connected to the same LAN switch. The primary has lost power while the spare has no working uplinks. The spare transitions back to the current active role.

When the spare MX stops seeing any VRRP messages from the primary, the spare MX takes over the LAN by sending its own advertisements with a priority of 55, indicating that it does not have uplink connectivity:

Screenshot of a Wireshark packet capture showing VRRP messages sent by the spare MX after it stops receiving VRRP advertisements from the primary. The value of the VRRP Priority field is now 55.
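The scenarios in this article are consistent with a simple rule of thumb: among the MXs that are powered on, the one with the highest internal priority ends up as the active member. The sketch below models that steady-state outcome. It is a simplification for reasoning about the scenarios, not a description of the MX election code, and the function name and the use of `None` for a powered-off MX are assumptions made for this example.

```python
from typing import Optional, Tuple

# Internal priorities from the "set on the MX" table: (role, has_uplink) -> priority.
PRIORITY = {
    ("primary", True): 255,
    ("primary", False): 75,
    ("spare", True): 235,
    ("spare", False): 55,
}


def active_member(
    primary_uplink: Optional[bool], spare_uplink: Optional[bool]
) -> Optional[Tuple[str, int]]:
    """Return (role, advertised priority) of the active MX.

    Pass True/False for uplink state, or None if that MX is powered off.
    """
    candidates = [
        (PRIORITY[(role, uplink)], role)
        for role, uplink in (("primary", primary_uplink), ("spare", spare_uplink))
        if uplink is not None
    ]
    if not candidates:
        return None
    priority, role = max(candidates)
    return role, priority


print(active_member(True, True))    # normal state
print(active_member(False, True))   # primary uplink failure
print(active_member(False, False))  # both MXs lose uplink connectivity
print(active_member(None, False))   # primary down, spare without uplinks
```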

Cellular Failover Behavior

Meraki supports cellular failover with a high-availability (HA) pair, limited to the MX67C and MX68CW models with embedded cellular modules. To support HA, customers must be running firmware MX 14.53, MX 15.42, or MX 16.11 or higher. At this time, if a cellular uplink is used in an HA pair, failover occurs in the following order:

Primary MX WAN 1+2 fails > fails over to secondary MX
Secondary MX WAN 1+2 fails > fails over to primary MX cellular
Primary MX cellular fails > fails over to secondary MX cellular
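The order above amounts to a fixed preference list. The following sketch picks the first available path from that list; the names `FAILOVER_ORDER` and `select_uplink` are hypothetical, and "wan" here means that at least one of WAN 1 or WAN 2 is working on that MX.

```python
from typing import Optional, Set, Tuple

Path = Tuple[str, str]  # (MX role, uplink type)

# Preference order taken from the failover sequence in this article.
FAILOVER_ORDER = [
    ("primary", "wan"),
    ("secondary", "wan"),
    ("primary", "cellular"),
    ("secondary", "cellular"),
]


def select_uplink(available: Set[Path]) -> Optional[Path]:
    """Return the most preferred path that is currently available, if any."""
    for candidate in FAILOVER_ORDER:
        if candidate in available:
            return candidate
    return None


# Both MXs have lost WAN 1 and WAN 2; both cellular modems still work.
print(select_uplink({("primary", "cellular"), ("secondary", "cellular")}))
```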

Note: While it is possible to use cellular failover as described above with other MX models and a USB cellular dongle, that configuration is not officially supported by Meraki.

While DDNS is enabled, if all Primary MX WAN links fail but cellular is still active, DDNS will resolve to the Primary MX cellular uplink IP. It is recommended that you perform a swap of the Secondary MX to the Primary to prevent any issues with DDNS updates while the original primary MX WAN links are offline.