The impact of router outages on the AS-level internet
Luckie, M. J., & Beverly, R. (2017). The impact of router outages on the AS-level internet. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (pp. 488–501). Conference held Los Angeles, USA: ACM. https://doi.org/10.1145/3098822.3098858
Permanent Research Commons link: https://hdl.handle.net/10289/13169
We propose and evaluate a new metric for understanding the dependence of the AS-level Internet on individual routers. Whereas prior work uses large volumes of reachability probes to infer outages, we design an efficient active probing technique that directly and unambiguously reveals router restarts. We use our technique to survey 149,560 routers across the Internet for 2.5 years. 59,175 of the surveyed routers (40%) experience at least one reboot, and we quantify the resulting impact of each router outage on global IPv4 and IPv6 BGP reachability. Our technique complements existing data and control plane outage analysis methods by providing a causal link from BGP reachability failures to the responsible router(s) and multi-homing configurations. While we found the Internet core to be largely robust, we identified specific routers that were single points of failure for the prefixes they advertised. In total, 2,385 routers -- 4.0% of the routers that restarted over the course of 2.5 years of probing -- were single points of failure for 3,396 IPv6 prefixes announced by 1,708 ASes. We inferred 59% of these routers were the customer-edge border router. 2,374 (70%) of the withdrawn prefixes were not covered by a less specific prefix, so 1,726 routers (2.9%) of those that restarted were single points of failure for at least one network. However, a covering route did not imply reachability during a router outage, as no previously-responsive address in a withdrawn more specific prefix responded during a one-week sample. We validate our reboot and single point of failure inference techniques with four networks, finding no false positive or false negative reboots, but find some false negatives in our single point of failure inferences.
© 2018 ACM. This is the author's accepted version.