From Crisis to Recovery: Reddit's Pi-Day Outage and the Power of Effective Kubernetes Management

From Crisis to Recovery: Reddit's Pi-Day Outage and the Power of Effective Kubernetes Management

Table of contents

No heading

No headings in the article.

On Pi Day 2023, Reddit experienced an outage that lasted 314 minutes. The outage was caused by an upgrade from Kubernetes 1.23 to 1.24 on one of the most important clusters in the company. In this blog post, I will discuss the outage and the steps Reddit took to restore the site.

The outage began when the site came to a halt two minutes after the cluster was upgraded. The Compute team immediately opened an incident call, but there was no supported downgrade procedure for Kubernetes, so the team decided to restore from a backup and state reload. However, the restore procedure was written several years ago and had not been updated with changes in the environment. After 30 minutes, the team decided to restore from backup, which took about 20 minutes.

The procedure had been written against a now end-of-life Kubernetes version and pre-dated the switch to CRI-O, so it had to be rewritten live to accommodate the changes. The first node was restored successfully, but the team was not ready to start scaling back up yet, so the autoscaler was turned off to buy some time to get things back to a known state. However, the new nodes wouldn't join, and the team had to split off into a separate call to look at why the connection was failing.

After a few minutes, the problem was discovered: the TLS certificates presented by the working node weren't valid for anything else to talk to it. The main group of responders started bringing traffic back online conservatively, and eventually, the site was live again.

Further investigation was needed to find the exact moment of failure, and luckily, the logging agent was unaffected, so the team could dig into the logs. It was discovered that the route reflectors were configured to use the control plane nodes as reflectors, which caused the outage. The cause of the outage was the terminology change from "master" to "control-plane" in the 1.20 series of Kubernetes, but the actual cause was more systemic: Inconsistency.

In conclusion, the outage on Pi Day 2023 was caused by a Kubernetes upgrade, which revealed inconsistencies in the cluster. The outage lasted for 314 minutes, and the team had to restore from a backup, which took about 20 minutes. The team then had to deal with various issues to bring the site back online. Despite the progress that has been made in improving Reddit's availability, there is still work to be done to prevent future outages.