3 Combat Sports Principles that Apply to Site Reliability Engineering

November 18, 2021

Growing up watching my father's passion for boxing and martial arts turned me into a boxing and MMA fan myself.

My old man, John Marsicovetere, the martial artist in action, mid/late 1970s Melbourne, Australia

While I don't have posters of Bruce Lee and Rocky Marciano or VHS tapes of Ali vs. Foreman like my old man, I still find myself tuning in to watch the big fights when they happen. As a Senior Cloud Infrastructure Engineer, I find that my life is vastly different from that of athletes competing in combat sports, for many obvious reasons. However, some principles from the combat sports world apply in interesting ways to my professional life in Site Reliability Engineering (SRE). And no, it is not simply "Protect yourself at all times." 😉

1. Sometimes you are the hammer, sometimes you are the nail

This is a common combat sport saying/principle, given each competitor is never closer to their next loss than when they face their opponent to begin a fight. In SRE, due to the complex nature of cloud infrastructure and the modern web, you too are never far away from your next outage or difficult incident at any given time. However, it is important for you to remember that when things are going very well or very poorly, the inverse is always right around the corner.

There are times in an SRE's life when things are going incredibly well and every project is turning to gold. You may have added an extra Availability Zone that provides redundancy and protects against single-AZ failures. You could have optimized that extremely slow database query, improving both the memory availability and overall performance of a service. You might finally have implemented the perfect backup and retention service that gives you peace of mind knowing you have copies of your data in case of a ransomware attack.

All these things could be going well and you are "winning the fight," as they say. However, an outage can suddenly knock you off your feet, changing the direction of the fight. It could be minor, like a single service from a cloud vendor becoming unavailable. Or it could be catastrophic, like a full region outage or an entire DNS outage at an upstream provider.

As in combat sports, in SRE you need to "roll with the punches." Your next outage is never far away and, try as you might, you cannot prevent or architect around every one of them. There are going to be busy and complex days, situations, and incidents where you are clobbered from many angles. It is important to remember that the days when awesome improvements land are always just around the corner, so never be too hard on yourself.

2. Don't throw and hope; aim and fire

In combat sports, it can be tempting to "throw haymakers" and hope that a knockout shot lands to win you the fight. Similar situations occur in SRE. When faced with a slow-performing website, you may just increase the infrastructure memory or CPU availability to see if that resolves the issue. You might start terminating running services to free up available memory and CPU resources to see if that helps. You could even perform the classic "turn it off and on again" tactic and hope that fixes the issue.

These are some SRE examples of "throwing haymakers" rather than properly investigating and debugging why a system or service is failing and then making the necessary changes to fix the underlying issue. Sure, the "haymakers" thrown might solve an issue completely or buy you time during a particular incident or outage. However, always defaulting to those responses every time an incident occurs is not a valid long-term tactical response. It is far better for you to aim at the root cause(s) or reason(s) why an issue may be occurring and then fire your remediations via monitoring, alerting, or code fix deployments.

3. Don't react, respond

This principle reminds combat sports athletes to keep their emotions in check when something is happening, as it can help control the outcome of what happens next. Uncontrolled, a reaction may cost an athlete a victory or expose a flaw that could be used against them in the future. A response is always more calculated and tactical in the long term.

This particularly applies to SRE, especially during an incident or outage. When faced with an unfortunate situation, it is far too easy to think that the world is crumbling or that the incident is completely unresolvable. Some folks may even start to blame their colleagues, vendors, or the software/application logic. Blame is very easy to assign but very hard to take back, and its consequences linger long after the incident is over.

A more measured and mature response during an incident or outage has several advantages. It allows your SRE team to focus clearly on the task at hand and not be distracted by external noise. It also shows great leadership and calmness under pressure to those involved with the incident or outage. No one wants to be on the receiving end of a negative reaction to an incident or outage, and a measured response will almost always produce a better outcome than a knee-jerk reaction. After all, we in SRE call this "Incident Response" and not "Incident Reaction" for a reason. Take a moment to breathe, think, plan, and then respond accordingly. At the end of the day, remember that all problems have solutions.

Bonus: Everyone has a plan until they are punched in the face

A sometimes flippant remark that has crossed over into everyday advice, this principle still rings true for SRE. The best-laid plans for launching a service or project can go awry quickly for a whole host of reasons, from failing tests to incorrectly sized scaling groups or even an expired SSL certificate that no one was notified about. SRE teams need to "stay on their toes" and stay focused for when they are metaphorically "punched in the face." In tech, not just SRE, plans often change quickly and drastically, and adapting to that becomes easier over time and with experience. However, the threat of seemingly random errors, incidents, and outages that alter the original plan will forever remain.
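The expired SSL certificate is one punch you can at least see coming. As a minimal sketch, assuming you want a warning some number of days before expiry, the check below uses only the Python standard library. The function names and the 30-day `warn_days` default are my own illustration, not something from a particular tool.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  1 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field from a live server's certificate.

    Note: with the default context, an already-expired certificate fails
    the handshake and raises ssl.SSLCertVerificationError, which is itself
    a useful signal to alert on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run on a schedule, something like `needs_renewal(fetch_not_after("example.com"), datetime.now(timezone.utc))` could feed whatever alerting you already have, so the notification that "no one" received actually goes to someone.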

Until next time, UFC Hall of Famer Urijah Faber and I wish you all a good clean fight in the SRE ring.

Me (being completely star-struck), running into UFC Hall of Famer Urijah Faber in Kahului, 2018
