What could make the world’s biggest social media platform go down?
IT outages happen all the time, and as consumers, we rarely give them a second thought. But if you’re running a business, an IT outage is very bad news. For the hours or days your site is down, you can’t engage with your customers, and every minute a consumer is unable to access your services or buy your product directly cuts into your ability to do business and generate revenue.
This is exactly what happened in October 2021, when Facebook, arguably the world’s biggest social media platform, went down. Over roughly six hours, the company lost an estimated $100 million in revenue, and some 3.5 billion users around the world were affected.
So, what happened?
What happened at Facebook?
As it turned out, the cause was a change so seemingly inconsequential that the maintenance teams had overlooked its potential to cause damage.
During a routine maintenance check, somebody on the back-end team entered the wrong command, which led to an error in the system. Normally, this error would have been corrected by a fail-safe, but on this occasion, the fail-safe… failed.
Before anyone knew what had occurred, the issue had snowballed, spreading from network to network until the entire Facebook ecosystem came crashing down: not just the social media website but also Facebook Messenger, WhatsApp, and Instagram.
But how does something so small create so much chaos?
Incident management—what we call the process of detecting an issue and correcting it—is actually an exceedingly complex process. A small pebble can cause ripples that spread across an entire pond. Similarly, a minute error can affect an entire ecosystem of applications, networks, and systems.
I can imagine the panic that must have spread across the different dev, maintenance, security, and operations teams as they scrambled to find the root cause. Is it malware in the code? Was there a cybersecurity breach? Maybe it’s QA’s fault. Maybe the dev team was responsible.
Between the cascading failures and the mounting pressure of unhappy users, the techies at Facebook would have been going through hell.
We can’t know exactly what was happening at Facebook that day, but imagine what it would be like if an unexpected outage were to occur in your organization.
To understand why even minor incidents can pose a real challenge to an SRE team, let’s take a closer look at what incidents are, why they occur, and why they can be so confusing.
What Are Incidents, and Why Do They Occur?
In site reliability engineering, an incident refers to any unexpected event or condition that disrupts the normal operation of a system. SRE teams need to address and resolve it immediately in order to restore the system and prevent further complications.
The challenge in dealing with incidents lies in their unpredictable nature and the interwoven complexities of modern IT systems. This is why SRE and DevOps professionals usually gauge how well incidents are handled by how long they take to detect and resolve, rather than by any other measure. Specifically, they use two metrics:
- MTTD (Mean-Time-To-Detect): How long did it take to detect the problem?
- MTTR (Mean-Time-To-Repair/Resolve): How long did it take to fix the problem?
The more convoluted the issue, the higher the MTTD and MTTR.
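To make these metrics concrete, here is a minimal Python sketch of how a team might compute MTTD and MTTR from its incident records. The field names (`occurred_at`, `detected_at`, `resolved_at`) are assumptions for illustration, not any particular tool’s schema, and some teams anchor MTTR to occurrence rather than detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class Incident:
    occurred_at: datetime   # when the fault actually began
    detected_at: datetime   # when monitoring (or a user) flagged it
    resolved_at: datetime   # when normal service was restored

def mttd(incidents: list[Incident]) -> timedelta:
    """Mean-Time-To-Detect: average gap between occurrence and detection."""
    return timedelta(seconds=mean(
        (i.detected_at - i.occurred_at).total_seconds() for i in incidents))

def mttr(incidents: list[Incident]) -> timedelta:
    """Mean-Time-To-Repair: average gap between detection and resolution."""
    return timedelta(seconds=mean(
        (i.resolved_at - i.detected_at).total_seconds() for i in incidents))
```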
So, what constitutes an incident? Let’s look at a few examples:
- Software Glitches: Sometimes a minor bug or coding error, as small as a missing semicolon, can cause significant problems across an application or an entire network.
- Hardware Failures: Physical hardware can break down. A malfunctioning server, a broken cooling fan, or a defective hard drive can bring down critical systems.
- Security Breaches: A slight vulnerability in the system could allow malicious attackers to gain access, leading to potential data theft or system damage.
- Network Failures: A small misconfiguration in network settings can lead to outages in connectivity across different parts of a system or an entire organization.
These incidents can occur for a myriad of reasons, but the most common include:
- Human Error: As with Facebook, a mere slip in judgment or a typo can lead to chaos. Human error is often the most perplexing since it can slip past automated checks and balances.
- Complex Interdependencies: Modern systems are complex. A change in one part might unexpectedly affect another. It’s like a precarious house of cards; touch one, and the whole structure might collapse.
- Lack of Proper Testing: Sometimes, inadequately tested updates are rushed into production, only for teams to discover that they introduce new, unforeseen issues.
- Environmental Factors: Even factors like temperature, humidity, and physical environment can impact the hardware.
Challenges to incident resolution
To keep workflows running smoothly, you need to be able to detect and resolve incidents in real time, or as close to it as possible.
But each stage of the incident lifecycle comes with its own set of challenges and delays.
Stage 1: Detection, Logging, and Triage
In the first stage of the incident lifecycle, the incident occurs and is detected, either by a user or by the system itself, and is flagged as a potential problem. The system then generates alerts to let the appropriate professionals know that a problem needs their attention.
We can further subdivide this stage into:
Detection: The incident is first identified and then pinpointed through various means, such as monitoring systems, customer complaints, or internal reports.
And remember: if users are reporting problems, the user experience has already been impacted. The goal should be to know about an incident before users start noticing it.
Once detected, the incident is logged, and a triage is conducted to categorize its severity and assign it a priority. A smooth workflow at this stage is crucial to keeping Mean-Time-To-Detect low.
Logging: The identified incident is documented in an incident management system. Important details like time of occurrence, source, symptoms, and any other relevant data are recorded.
Triage: Based on the initial information, the incident is categorized by its nature, severity, and potential impact.
Prioritization: The incident is prioritized based on its impact and urgency. This step helps determine the order of incident resolution activities.
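To make logging, triage, and prioritization concrete, here is a minimal sketch of what an incident record and a priority rule might look like. The categories and the impact-times-urgency scoring are illustrative assumptions, not a standard; most teams define their own matrix.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative scales; real teams define their own severity matrix.
IMPACT = {"low": 1, "medium": 2, "high": 3}
URGENCY = {"low": 1, "medium": 2, "high": 3}

@dataclass
class IncidentRecord:
    source: str        # e.g. "monitoring", "customer complaint", "internal report"
    symptoms: str      # what was observed
    category: str      # e.g. "network", "hardware", "security", "software"
    impact: str        # how widely the incident is felt
    urgency: str       # how quickly it must be handled
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def priority(self) -> int:
        """Simple impact x urgency score: higher means resolve it sooner."""
        return IMPACT[self.impact] * URGENCY[self.urgency]

# Example: a connectivity incident flagged by monitoring
outage = IncidentRecord(
    source="monitoring",
    symptoms="DNS lookups failing across regions",
    category="network",
    impact="high",
    urgency="high",
)
print(outage.priority())  # 9 -> goes to the top of the queue
```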
Challenges in Stage 1
Incomplete or incorrect monitoring
Without comprehensive and accurate monitoring in place, incidents might not be detected promptly, leading to delays in resolution.
False positives and alert storms
An alert storm is when a system generates a large volume of alerts in a short time, often due to a single root cause. This can happen when one failure triggers a cascade of alerts from various interconnected components or services. It can be challenging to identify the root cause due to the sheer volume of alerts.
These can distract teams and lead to resources being wasted on non-issues.
This is essentially what happened during the Facebook outage. Once the fail-safe failed, it set off a storm of alerts from programs across every Facebook platform that the erroneous command had touched.
Is it a network error? Is it a DNS failure? Infrastructure? Security?
The maintenance teams would have been overwhelmed by the alert storm, delaying triage considerably.
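One common way to blunt an alert storm is to collapse the flood into something a responder can reason about before triage begins. The sketch below is a deliberately simplified, assumed approach that just counts alerts per component; real correlation engines also group by time, topology, and causality.

```python
from collections import Counter
from datetime import datetime

# Each alert is (timestamp, component, message).
Alert = tuple[datetime, str, str]

def summarize_storm(alerts: list[Alert]) -> list[tuple[str, int]]:
    """Collapse an alert storm into per-component counts, noisiest first.

    A single upstream failure (DNS, a core router, a bad config push)
    often surfaces as hundreds of downstream alerts. Counting by
    component gives responders a shortlist of suspects instead of a
    wall of noise.
    """
    counts = Counter(component for _, component, _ in alerts)
    return counts.most_common()
```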
Misclassification of the incident
This is an issue that often arises when dealing with alert storms, and it delays both your triage process and your MTTR.
Stage 2: Response and Resolution
Having detected the incident and logged it, we now come to the second stage of the incident lifecycle, in which we look at how the incident is resolved.
Assignment: The incident is assigned to a response team or individual based on its category and priority, as well as the skills needed to resolve it.
Diagnosis: The team or individual conducts a preliminary investigation to understand the incident better, identify the cause, and determine potential solutions.
Escalation (if required): If the incident is beyond the capability of the current team, it is escalated to a higher-level team.
Resolution: The team attempts to resolve the incident using the identified solution. They may need to test several solutions before finding one that works.
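As a rough illustration of the assignment and escalation steps, here is a minimal routing sketch. The team names, the routing table, and the escalation rule (escalate anything still open past a priority-based deadline) are all assumptions for the example.

```python
from datetime import datetime, timedelta, timezone

# Illustrative category-to-team routing table (hypothetical team names).
ROUTING = {
    "network": "network-ops",
    "hardware": "datacenter-ops",
    "security": "security-response",
    "software": "backend-dev",
}

def assign_team(category: str) -> str:
    """Route the incident by category; fall back to the on-call SREs."""
    return ROUTING.get(category, "on-call-sre")

def needs_escalation(opened_at: datetime, priority: int,
                     now: datetime | None = None) -> bool:
    """Escalate if the incident is still open past its priority deadline.

    Assumed policy: higher priority means a shorter window before the
    incident is handed to a higher-level team.
    """
    now = now or datetime.now(timezone.utc)
    deadline = opened_at + timedelta(minutes=120 // max(priority, 1))
    return now > deadline
```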
Challenges in Stage 2
Skillset mismatch
If the incident is misdiagnosed, or if the Ops team cannot identify its true cause amid cascading failures, the wrong team may be assigned to resolve the issue.
Delays in escalation
Another fallout from delayed diagnoses could be delays in bringing in the right Dev team to fix the issue.
Unintended consequences
Solving one problem might lead to side effects and more issues arising.
Stage 3: Post-resolution review
Finally, with the incident resolved, SRE teams need to make sure the same problem doesn’t reoccur, and that if it does, a solution is already strategized and ready to implement. That is done through:
Verification: Once the incident is resolved, the solution is verified to ensure the incident doesn’t reoccur.
Closure: After resolution verification, the incident is officially closed. The resolution is documented in the incident management system for future reference.
Review (Lessons Learned): After closure, a review is carried out to understand what caused the incident, how it was resolved, and how similar incidents can be prevented in the future. Lessons learned are documented and shared with relevant teams.
Prevention: Based on the post-incident review, preventive measures are implemented to avoid similar incidents in the future.
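As a small illustration of the verification step, the sketch below refuses to close an incident until an assumed health-check function has passed several times in a row. The check count and interval are arbitrary; the point is that closing on the first green check is how prematurely closed incidents come back.

```python
import time
from typing import Callable

def verify_before_close(health_check: Callable[[], bool],
                        passes_required: int = 3,
                        interval_seconds: float = 60.0) -> bool:
    """Allow closure only after consecutive clean health checks.

    Returns True if the incident can be closed, False if the fix has
    regressed and the incident should stay open.
    """
    for attempt in range(passes_required):
        if not health_check():
            return False  # fix regressed: keep the incident open
        if attempt < passes_required - 1:
            time.sleep(interval_seconds)
    return True
```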
Challenges in Stage 3
Regressions
The fix worked once. It may not work every time. Or the fault may reoccur, especially if the incident is closed prematurely.
Ignoring lessons learned
If the lessons aren’t integrated into future work, the same mistakes might be repeated.
Resistance to change
If teams are resistant to implementing changes based on the lessons learned, prevention efforts may be ineffective.
What is the solution?
Introducing AIOps (Artificial Intelligence for IT Operations) to SRE gives us a transformative solution to the challenges of IT incidents. By leveraging machine learning and predictive analytics, AIOps can analyze vast amounts of data, detect anomalies, and predict potential issues before they occur.
This proactivity not only minimizes human error but also allows for faster, more precise incident response. In an environment where seconds can mean significant revenue loss, the automation and intelligence offered by AIOps stand as a vital tool in maintaining system stability, thereby enhancing efficiency and reliability across the entire technological landscape.
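To give a flavor of what detecting anomalies before they become incidents can look like, here is a deliberately simple rolling z-score check on a metric stream. Real AIOps platforms use far richer models across many signals; treat this only as an illustration of the idea.

```python
from collections import deque
from statistics import mean, stdev
from typing import Iterable, Iterator

def detect_anomalies(values: Iterable[float], window: int = 30,
                     threshold: float = 3.0) -> Iterator[tuple[int, float]]:
    """Yield (index, value) for points far outside the recent baseline.

    A point is flagged when it sits more than `threshold` standard
    deviations from the mean of the previous `window` points.
    """
    history: deque[float] = deque(maxlen=window)
    for i, v in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(v - mu) > threshold * sigma:
                yield i, v
        history.append(v)

# Example: flag a latency spike in a stream of response times (ms)
latencies = [40.0, 44.0] * 25 + [400.0] + [43.0] * 10
print(list(detect_anomalies(latencies)))  # -> [(50, 400.0)]
```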
In Part 2 of this article, we will explore how AIOps gives us the tools to automate a large part of the incident management and resolution process, vastly reducing both MTTD and MTTR and minimizing the impact on your business.
Conclusion
The incident at Facebook serves as a stark reminder that no system is invincible. Even with the best minds and the most sophisticated technologies at hand, incidents can still occur, baffling those who might believe they have everything under control. It underscores the need for robust incident management processes, thorough testing, and a culture that learns from these inevitable technological hiccups.
In the end, incidents are not just technical problems to be solved. They are valuable lessons that push organizations to evolve, adapt, and continually strive for excellence in an ever-changing digital landscape. A multi-billion-dollar organization like Facebook can eat a loss like this and walk away. But if you’re running an SME and especially if you depend on your digital services for revenue, you need to be paying attention and learning.