Cisco Meraki Documentation

Connection Monitoring for WAN Failover

Connection Monitor Overview

When the primary uplink goes down on an MX Security Appliance, events will appear under Network-wide > Monitor > Event log indicating a change in the primary uplink status. In the example below, "uplink: 0" indicates that internet 1 is being used, while "uplink: 1" indicates that internet 2 is being used. 

[Image: event log entries showing a change in the primary uplink status]
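As a small illustration of the index convention described above, the following sketch translates the zero-based `uplink` value into the interface name. The function name and the sample strings are hypothetical; only the `uplink: 0` / `uplink: 1` convention comes from this article, and a real event log entry may contain additional fields.

```python
import re

# Zero-based uplink index as shown in the event log -> interface name.
UPLINK_NAMES = {0: "internet 1", 1: "internet 2"}


def describe_uplink(event_text: str) -> str:
    """Return the interface name for the 'uplink: N' value in an event string."""
    match = re.search(r"uplink:\s*(\d+)", event_text)
    if match is None:
        raise ValueError("no 'uplink: N' field found")
    index = int(match.group(1))
    return UPLINK_NAMES.get(index, f"unknown uplink index {index}")
```

For example, `describe_uplink("uplink: 1")` returns `"internet 2"`.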

 

The preferred primary uplink can be configured in the dashboard, but the preference only matters when both uplinks are functioning. The MX will use the non-preferred uplink as the primary if it is the only one available. The MX monitors all uplinks and stops using an uplink when it determines that the uplink has no connectivity.

Note: If the MX is using the non-preferred uplink as the primary and the preferred uplink comes back online, the MX will wait about 15 seconds before switching the primary uplink to the preferred one to prevent the primary connection from flapping in the event of intermittent failure or an unreliable link. 
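The hold-down behavior in the note above can be sketched as a simple timer. This is an illustrative model only, not the MX implementation: the class name is hypothetical, the 15-second value is the approximate figure from the note, and restarting the wait whenever the preferred uplink drops again is an assumption.

```python
from typing import Optional

FAILBACK_HOLD_SECONDS = 15  # approximate value taken from the note above


class FailbackHoldDown:
    """Delay failback until the preferred uplink has stayed up long enough."""

    def __init__(self, hold_seconds: float = FAILBACK_HOLD_SECONDS) -> None:
        self.hold_seconds = hold_seconds
        self.up_since: Optional[float] = None  # when the preferred uplink came up

    def update(self, preferred_is_up: bool, now: float) -> bool:
        """Record the preferred uplink state; return True once failback is allowed."""
        if not preferred_is_up:
            self.up_since = None  # assumption: any drop restarts the wait
            return False
        if self.up_since is None:
            self.up_since = now
        return now - self.up_since >= self.hold_seconds
```

A link that comes up at t=0 and drops at t=12 never triggers failback in this model. A link that comes up at t=13 and stays up allows failback from t=28 onward.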

Failover Connectivity Tests

The MX runs the following tests to determine uplink status:

DNS test

  • Query the DNS servers (primary or secondary) configured on the internet interface for the following hosts:
    • meraki.com
    • google.com
    • yahoo.com

Internet test

  • Pings to either 209.206.55.10 or 8.8.8.8. One ping per second.
  • Uses a round-robin technique to send an HTTP GET to http://meraki.com or http://canireachthe.net. An HTTP response of any kind will result in a success.

ARP test

  • ARP for the default gateway and its own IP (to detect a conflict).  
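The HTTP part of the internet test can be sketched as follows. The two URLs come from the list above; the function names, the starting order of the rotation, and the use of `None` to model "no response received" are assumptions made for illustration.

```python
from itertools import cycle
from typing import Iterator, Optional

# HTTP GET targets named in the internet test above.
HTTP_TARGETS = ("http://meraki.com", "http://canireachthe.net")


def http_target_rotation() -> Iterator[str]:
    """Yield the HTTP targets in round-robin order, indefinitely."""
    return cycle(HTTP_TARGETS)


def http_probe_succeeded(status_code: Optional[int]) -> bool:
    """Any HTTP response counts as success; None means nothing came back."""
    return status_code is not None
```

Under this model a 404 or 500 response still counts as a success, because the test only checks that some HTTP response was received.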

Connection Monitoring Test Process

Connection monitoring runs on the uplink once it is activated, meaning a carrier is detected and an IP address is assigned (static or dynamic).

When the first test DNS query is sent and a DNS response is received, DNS is marked as good for 300 seconds on that uplink. During this time, the MX continues running the DNS test every 150 seconds. Each successful DNS query test results in DNS being marked as good for another 300 seconds.

If a test DNS query times out at any point, the MX decreases the testing interval to 30 seconds. If the DNS test continues to fail for more than 300 seconds after the last successful test, DNS will be marked as failed on the uplink. Otherwise, a successful test will again mark DNS as good for another 300 seconds. Once DNS is marked as good, the test is run every 150 seconds.

Note: The MX will only decrease the DNS testing interval to 30 seconds if a test DNS query times out. Any record-type response to a test DNS query will result in a successful DNS test.
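The DNS test timing described above can be modeled as a small state tracker. This is a minimal sketch, not the MX implementation: the class and method names are hypothetical, and treating DNS as not good before the first successful response is an assumption.

```python
from typing import Optional

GOOD_FOR_SECONDS = 300   # one success keeps DNS marked good this long
NORMAL_INTERVAL = 150    # seconds between tests while DNS is good
RETRY_INTERVAL = 30      # seconds between tests after a timeout


class DnsMonitor:
    """Track DNS test results for a single uplink."""

    def __init__(self, now: float = 0.0) -> None:
        self.last_success: Optional[float] = None
        self.next_test_at = now

    def record_result(self, got_response: bool, now: float) -> None:
        """Store one test result and schedule the next test."""
        if got_response:
            self.last_success = now
            self.next_test_at = now + NORMAL_INTERVAL
        else:
            # A timeout shortens the interval until a test succeeds again.
            self.next_test_at = now + RETRY_INTERVAL

    def is_good(self, now: float) -> bool:
        """DNS is good while the last success is at most 300 seconds old."""
        if self.last_success is None:
            return False  # assumption: no success seen yet
        return now - self.last_success <= GOOD_FOR_SECONDS
```

After a success at t=0 and timeouts from t=150 onward, this model retries every 30 seconds and reports DNS as failed once more than 300 seconds have passed since t=0.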

The MX then begins performing the internet test. If either the ICMP or the HTTP test is successful, the internet test is marked as good for 300 seconds on that uplink. During this time, the MX continues running the internet test every 150 seconds. Each successful internet test (meaning either a successful ICMP test or a successful HTTP test) results in the internet being marked as good for another 300 seconds. If any test within the internet group fails, the MX decreases the testing interval to 20 seconds. If the tests continue to fail for a time period exceeding 300 seconds from the last successful test, the internet will be marked as failed on the uplink. Otherwise, any successful ICMP or HTTP test will mark the internet test as good for another 300 seconds. Once marked as good, the test is run every 150 seconds.

When both the HTTP and ICMP tests have been unsuccessful for a period of time that exceeds 300 seconds, the uplink will be failed over. Therefore, it can take approximately five minutes for failover to occur in the event of a soft failure (where the physical link is still up but provides no internet access).
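To see where the "approximately five minutes" figure comes from, the sketch below lists probe times for a soft failure that begins right after a successful internet test. It assumes the failed state is evaluated at probe time and that the first probe after the outage still runs on the normal 150-second schedule; both are modeling assumptions, and the function names are hypothetical.

```python
from typing import List


def internet_probe_ok(icmp_ok: bool, http_ok: bool) -> bool:
    """The internet test passes if either the ICMP or the HTTP test passes."""
    return icmp_ok or http_ok


def soft_failure_timeline(good_for: int = 300,
                          normal_interval: int = 150,
                          retry_interval: int = 20) -> List[int]:
    """Return probe times (seconds after the last success) up to and
    including the first probe that falls outside the good-for window."""
    probe_times = []
    t = normal_interval  # first failing probe, still on the normal schedule
    while True:
        probe_times.append(t)
        if t > good_for:
            return probe_times
        t += retry_interval  # after a failure, probes run at the short interval
```

With the default values, the probes fall at 150, 170, 190 and so on, and the first probe later than 300 seconds is at 150 + 8 × 20 = 310 seconds, which is consistent with a failover time of roughly five minutes.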

If a physical link is not detected, failover takes place immediately.

Note: An MX will only fail over to a backup cellular connection if all three tests (internet, DNS, and ARP) are marked as failed.

Note: Please be aware of the failover traffic flow behavior between the primary and secondary uplinks.

Traffic is mapped to an internet interface by source and destination IP address and port. Any newly initialized IP traffic matching the source and destination IP address and port of an existing mapping will be sent over the same internet interface. This is done to preserve the connection state of certain flows that require the source and destination to remain the same for the duration of the connection.

Each of these traffic mappings expires after 300 seconds (five minutes) of no traffic matching the mapping. This duration is reset each time new traffic is generated that matches the mapping. With frequent communication between a pair of hosts, this can result in traffic consistently using a single uplink for communication, as the mapping is continuously refreshed.

In summary, if the primary uplink goes down, all traffic will fail over to the secondary uplink. When the primary uplink is back up, traffic that doesn't have a mapping will use the primary uplink, and all traffic with an existing mapping will continue to use the secondary uplink. This is the graceful failback behavior.
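The mapping behavior described above can be illustrated with a small flow table. This is a sketch under stated assumptions rather than the MX data structure: the class name, the uplink labels, and the layout of the flow key (any hashable tuple of addresses and port) are hypothetical, and the case where a mapped uplink is itself down is not modeled.

```python
from typing import Dict, Hashable, Tuple

MAPPING_IDLE_TIMEOUT = 300  # seconds without matching traffic before expiry


class FlowTable:
    """Pin each flow to an uplink until the mapping has been idle for 300 s."""

    def __init__(self) -> None:
        # flow key -> (uplink, time the mapping was last refreshed)
        self._mappings: Dict[Hashable, Tuple[str, float]] = {}

    def uplink_for(self, flow: Hashable, primary_uplink: str, now: float) -> str:
        """Return the uplink for this traffic and refresh the mapping."""
        entry = self._mappings.get(flow)
        if entry is not None and now - entry[1] < MAPPING_IDLE_TIMEOUT:
            uplink = entry[0]        # live mapping: stay on the same uplink
        else:
            uplink = primary_uplink  # new or expired mapping: use the primary
        self._mappings[flow] = (uplink, now)  # matching traffic resets the timer
        return uplink
```

A flow first seen while `"wan2"` is the primary stays on `"wan2"` after `"wan1"` recovers for as long as it sends traffic at least once every 300 seconds. It moves to `"wan1"` only after the mapping has been idle for the full timeout.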

These mappings can't be cleared by support. To clear them, you can temporarily remove the non-primary uplink, reboot the MX/Z, or prevent the client device from sending traffic to the MX/Z for 300 seconds (five minutes).

Enhanced WAN Failover and Failback

Note: Enhanced WAN failover and failback requires MX17+ firmware.

Note: Template-bound networks are not currently supported.

Current implementation (without “Enhanced WAN failover and failback feature”)

  • During failover (when primary link is lost) 

    • Existing flows will remain on the primary link until expiry

    • New flows will route via the backup path

  • During failback (when primary link is recovered) 

    • Existing flows will remain on the backup path until expiry

    • New flows will route via the primary path

New implementation (with “Enhanced WAN failover and failback feature”)


Navigate to Security & SD-WAN > SD-WAN & traffic shaping > Uplink selection page to configure the Enhanced WAN failover and failback feature (not available in template parent and template child networks at this time). WAN failover and failback feature behavior is currently limited to the primary uplink and not flow preferences.

  • Graceful

    • During failover (when preferred link is lost)

      • Existing flows will remain on the primary link until expiry

      • New flows will route via the backup path

    • During failback (when primary link is recovered)

      • Existing flows will remain on the backup path until expiry

      • New flows will route via the primary path

  • Immediate

    • During failover (when preferred link is lost)

      • Existing and new flows will route via the backup path (flows will be disrupted)

    • During failback (when primary link is recovered)

      • Existing and new flows will route via the primary path (flows will be disrupted)
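The two behaviors listed above differ only in how existing flows are treated: one keeps them on their current path until expiry, the other moves them together with new flows. A minimal sketch of that decision follows. The enum, its member names, the function name, and the uplink labels are hypothetical and chosen for this illustration.

```python
from enum import Enum
from typing import Optional


class FailoverMode(Enum):
    GRACEFUL = "graceful"    # existing flows stay where they are until expiry
    IMMEDIATE = "immediate"  # existing flows move at once and are disrupted


def uplink_for_flow(mode: FailoverMode,
                    mapped_uplink: Optional[str],
                    current_uplink: str) -> str:
    """Pick the uplink for a flow after a failover or failback event.

    mapped_uplink is None for a new flow with no existing mapping.
    current_uplink is the uplink new traffic should use after the event.
    """
    if mapped_uplink is None or mode is FailoverMode.IMMEDIATE:
        return current_uplink
    return mapped_uplink
```

During failback from `"wan2"` to `"wan1"`, an existing flow mapped to `"wan2"` stays on `"wan2"` in the first mode and moves to `"wan1"` in the second. New flows use `"wan1"` in both.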

Cellular

Cellular connection testing is reduced in an effort to minimize the overall data usage on the link.

Cellular as Primary

  • ICMP to 209.206.48.0/20 and/or 8.8.8.8/32 every 4 hours

  • DNS queries for “meraki.com, google.com, yahoo.com” every 150 to 300 seconds

  • Uses a round-robin technique to send an HTTP GET to http://google.com, http://yahoo.com, or http://meraki.com every 30 minutes

  • Periodic ARP tests for default gateway and its own IP to detect conflict

Cellular as Backup

  • ICMP to 209.206.48.0/20 and/or 8.8.8.8/32 every 4 hours

  • DNS queries for “meraki.com, google.com, yahoo.com” every 4 hours

  • Uses a round-robin technique to send an HTTP GET to http://google.com, http://yahoo.com, or http://meraki.com every 4 hours

  • Periodic ARP tests for default gateway and its own IP to detect conflict

Note: It is important to understand that if the tests fail, the MX will continue to perform them every few seconds until they succeed. In such cases, depending on the failure and retry, there may be more utilization than expected on the link.
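For a rough sense of steady-state test volume on a cellular link, the intervals listed in this section can be converted to test cycles per day. The sketch counts cycles, not packets or bytes, assumes every test succeeds (so no retries), and omits the ARP tests because their period is not stated. The names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
FOUR_HOURS = 4 * 60 * 60
THIRTY_MINUTES = 30 * 60


def cycles_per_day(interval_seconds: int) -> int:
    """Number of whole test cycles that fit in one day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


# Cellular as primary: DNS runs every 150 to 300 seconds, so report both bounds.
primary = {
    "icmp": cycles_per_day(FOUR_HOURS),
    "dns_max": cycles_per_day(150),
    "dns_min": cycles_per_day(300),
    "http": cycles_per_day(THIRTY_MINUTES),
}

# Cellular as backup: every listed test runs on a 4-hour interval.
backup = {name: cycles_per_day(FOUR_HOURS) for name in ("icmp", "dns", "http")}
```

With these intervals, cellular as primary works out to 6 ICMP cycles, between 288 and 576 DNS cycles, and 48 HTTP cycles per day. Cellular as backup runs 6 cycles per day for each of the three tests.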

SD-WAN Monitoring

Refer to the article on SD-WAN monitoring for more information.