The 5 anti-patterns of SLO adoption

Common pitfalls I've observed when teams try to introduce SLOs

Feb 14, 2024

SLOs are a powerful tool when used well, as they allow Engineering Leaders to introduce objective metrics and processes to ensure a good balance between moving fast in delivering features and ensuring a good quality of service for the end users.

Adopting SLOs in practice is full of challenges, and in my experience, 5 anti-patterns come up again and again:

#𝟭: 🙉 𝗨𝗻𝗶𝗹𝗮𝘁𝗲𝗿𝗮𝗹𝗹𝘆 𝗲𝗻𝗳𝗼𝗿𝗰𝗶𝗻𝗴 𝗦𝗟𝗢𝘀 𝗳𝗿𝗼𝗺 𝘁𝗵𝗲 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝘀𝗶𝗱𝗲

#𝟮: 🙇‍♂️ 𝗖𝗮𝗿𝗴𝗼-𝗰𝘂𝗹𝘁 𝗦𝗟𝗢𝘀

#𝟯: 🚛 𝗤𝘂𝗮𝗻𝘁𝗶𝘁𝘆 𝗼𝘃𝗲𝗿 𝗤𝘂𝗮𝗹𝗶𝘁𝘆

#𝟰: 🔕 𝗛𝗮𝘃𝗶𝗻𝗴 𝗦𝗟𝗜𝘀 𝗱𝗶𝘀𝗴𝘂𝗶𝘀𝗲𝗱 𝗮𝘀 𝗦𝗟𝗢𝘀

#𝟱: 🔧 𝗙𝗮𝗻𝗰𝘆 𝘁𝗼𝗼𝗹𝘀 𝗯𝘂𝘁 𝗻𝗼 𝗽𝗿𝗼𝗰𝗲𝘀𝘀𝗲𝘀

In today's article, we'll walk through these 5 anti-patterns, along with recommendations on how to avoid the most common pitfalls.

This article is not intended as a general introduction to SLIs, SLOs, SLAs, and Error Budgets.

There is a lot of literature available online on the subject.

If you want to know more about the topic of Resilience Engineering, I recommend you check out Alex Ewerlöf Notes as he regularly posts about this topic!

Let's dive into it!

Source: https://imgflip.com/i/8f8nwb?lerp=1707476220481

🙉 #1: Unilaterally enforcing SLOs from the Engineering side

SLOs are meant to set clear expectations for the quality of service we want to provide to our users: availability, latency, data quality, transaction processing time, and so on.

It would be very naïve to assume that the engineering team alone has all the knowledge and insights into the company's users to set these expectations without input from other areas.

Doing this properly will require you to go through a lot of conversations with various peers, many of whom will struggle with the concept.

You will need to spend time explaining those key concepts, repeatedly.

When faced with this prospect, many Engineering Leaders might hesitate. You might think it is too much of an investment just to get the first MVP out. The anticipation of those tedious and potentially painful conversations might lure you into the common trap of “it'll just be faster if we do it on our end, without involving too many people”.

If you've gone down this path, you might find yourself in a situation that shows some of the following signs:

👥 Your SLOs are disconnected from your users' and business needs. You might focus on API latency, while payment errors might be more important.
🧸 Your SLOs are just engineering toys. Likely nobody outside of the engineering organization even knows they exist.
🤷🏼‍♂️ Your SLOs are misunderstood. Other parts of the organization might know they exist, but won't understand their essence. For them, SLO is just another acronym in the esoteric tech lingo they tend to ignore.
👎🏼 Work on improving SLOs is rarely prioritized. It's constantly in the do-it-later bucket, and it takes a lot of heroic efforts from the engineering leaders to get them into the roadmap.

If you want your SLOs to be useful tools to make your company, your team, and yourself more successful, you will need to start on the right foot. You will need to involve peers from your organization who own decisions around user experience and business priorities.

This is not to say you should just ask them to set the SLOs for you.

It is a collaborative effort, and you'll likely be the one driving it. You'll need to spend time explaining the benefits of adopting SLOs and will have to remind people to get back to you on pending discussions more often than you might like.

That's a good investment of your time, as it will lead to SLOs that are well understood across the organization, tied to business and user expectations, and are part of the standard operating procedures across the company.

🙇‍♂️ #2: Cargo-cult SLOs

This is often combined with Anti-pattern #1, but not necessarily.

When you're getting started with SLOs, you typically have no idea about how to set meaningful targets. It's a new domain, and unless you have someone on your team who has extensive experience with SLOs, you might be on your own here.

Setting appropriate SLOs for your case might require you to look at a lot of historical data, make assumptions on business impacts, and get input from different areas of your company.

That sounds long and tedious, and as you've never done this before, you might even struggle to figure out where to start.

As you feel the pressure to deliver, you might look for shortcuts as you tell yourself you will always have time to improve them later.

That's when you start looking at so-called Industry Standards or Industry Best Practices.

As you're ambitious, you want to look at the best in the industry: Big Tech, FAANG, etc. Especially as they're among the few companies that communicate broadly about these topics.

You take whatever targets they're using for their services, and adopt them for your case. Who are you to think you know better than them?

There is value in being humble, but when this turns into cargo-culting, the results can be very disappointing.

If you've gone down this path, you might observe the following:

📈 You have unrealistic targets that you're consistently missing. Every minor glitch in your systems or bumpy deployment pushes you even further from them.
💰 You spend way too much time and resources trying to achieve those unreasonable targets. You're deploying massive cloud resources or spending more than 50% of your team's capacity just to achieve those ambitious targets.
😞 Your team feels disempowered. Members of your team feel that those targets were imposed on them as they were not consulted in the process. This can lead to issues with motivation and satisfaction.
👎🏼 People around you think you are under-delivering. They'll get used to seeing your “red” numbers and assume you're just unable to deliver on your promises. This can severely impact your reputation within the firm.
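To make the cost of a borrowed target concrete, it helps to translate each extra "nine" into the downtime it actually allows. The snippet below is a minimal sketch of that arithmetic, assuming a plain availability SLO measured over a 30-day window; the function name and the example targets are mine, for illustration only.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Total downtime a given availability target tolerates within the window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    # The tolerated downtime is simply the share of the window not covered by the target.
    return window * (1.0 - target)


if __name__ == "__main__":
    # Hypothetical targets, from "two nines" up to "four nines".
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} leaves {allowed_downtime(target)} of downtime per 30 days")
```

A 99.9% target leaves roughly 43 minutes per 30 days, and 99.99% leaves a bit more than 4 minutes. If a single bumpy deployment can cost you 10 minutes, the second target is out of reach before the month has even started.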

There are ways to avoid these pitfalls, and they all start with taking the time to set good targets. Instead of rushing the process, go through the following activities:

Track SLIs over a long period to understand where you stand, and whether there are any significant past trends you need to be aware of.
When looking at outside references, pick companies that target a similar population as you do. Expectations can be very different from industry to industry and across different socio-demographic groups.
Involve your team to get a reality check on how feasible it would be to meet certain requirements.
Discuss with your key peers in the company to help them understand the tradeoffs between resiliency and cost.
Start with low targets, and gradually increase them as you observe that the team is perfectly capable of meeting them without heroic efforts.
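As an illustration of the first and last points, here is a rough sketch of how you could derive a modest starting target from your own history rather than from someone else's blog post. It assumes you already have one SLI value per day (for example, the daily success ratio); the function name, the 90% default, and the sample numbers are all hypothetical.

```python
def suggest_starting_target(daily_sli: list[float], met_percent: int = 90) -> float:
    """Return the SLI level that was reached on at least `met_percent`% of the days."""
    if not daily_sli:
        raise ValueError("need at least one day of SLI history")
    if not 0 < met_percent <= 100:
        raise ValueError("met_percent must be in (0, 100]")
    ordered = sorted(daily_sli)
    # Skip the worst days; integer math avoids floating-point surprises in the index.
    worst_days_to_skip = len(ordered) * (100 - met_percent) // 100
    return ordered[worst_days_to_skip]


if __name__ == "__main__":
    # Made-up history: ten days of availability, including one bad day.
    history = [0.999, 0.998, 0.9995, 0.97, 0.9991, 0.9985, 0.9999, 0.9992, 0.9988, 0.9993]
    print(suggest_starting_target(history))
```

This is only a conversation starter, not a target by itself. What the number tells you is what the team already achieves without heroic efforts, which is a much safer place to start from than a figure taken from a Big Tech blog post.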

🚛 #3: Quantity over Quality

Humans have a bias for preferring more over less in many aspects of their lives. Unfortunately, metrics, targets, and SLOs are no exception to this.

If you're interested in the general issue, I recommend listening to this podcast episode. I found it very interesting, and Leidy Klotz's book discussed in the podcast is on my reading list for this year.

When you start looking into defining SLIs and SLOs, you might be tempted to try to cover it all. You can easily get carried away with the possibilities offered by modern observability tools.

You might end up going both too broad and too granular.

Often we choose to go with a lot of metrics as a way to compensate for a lack of deeper understanding of what matters for our users and our company. You assume or hope that good metrics will arise over time.

As you can't have a good SLI without an SLO attached to it, you end up defining dozens of Service Level Objectives all at once.

While you might be priding yourself on such an impressive level of coverage, you and your team will likely be experiencing the following:

🤯 Overwhelm and tiredness. Just maintaining such a setup and understanding all the SLOs will add a non-trivial amount of cognitive load on your team. A lot of overhead will be generated, distracting them from their main purpose of delivering value.
😫 Alert Fatigue. Chances are you'll always have one or two SLOs being violated once you have too many in place. Your team might constantly be jumping from one alert to the next one.
↕️ Unclear priorities. As you're trying to focus on many fronts at the same time, it becomes very difficult to decide which SLO to prioritize when several of them are suffering at once.
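The alert fatigue point can be backed by some quick arithmetic. The sketch below assumes, as a simplification, that every SLO is healthy in a given week with the same probability and that SLOs fail independently of each other. Real systems rarely behave that neatly, so treat the numbers as an order of magnitude only.

```python
def chance_of_any_violation(n_slos: int, p_healthy: float) -> float:
    """Probability that at least one of n independent SLOs is violated."""
    if n_slos < 0:
        raise ValueError("n_slos must be non-negative")
    if not 0.0 <= p_healthy <= 1.0:
        raise ValueError("p_healthy must be a probability")
    # All SLOs are healthy with probability p^n; the complement is "at least one is red".
    return 1.0 - p_healthy ** n_slos


if __name__ == "__main__":
    # Hypothetical: each SLO is healthy in 95% of weeks.
    for n in (1, 5, 20):
        print(f"{n:>2} SLOs -> {chance_of_any_violation(n, 0.95):.0%} chance of a red week")
```

Under these assumptions, a single SLO is red in 5% of weeks, while a set of 20 has at least one red SLO in about 64% of weeks. The individual SLOs did not get any worse; there are simply too many of them.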

What I've seen working better is to start with a minimal number of SLOs. I even recommend starting with a single SLO. Something simple that is easy to track and understand. It will allow you to start building this new muscle without overwhelming the team.

Once the team has gotten to a place where they are comfortable with the current set of SLOs, you might consider gradually adding new SLOs.

Adding more SLOs should not be a goal in and of itself.

Less is more, and you should consider replacing existing SLOs with new ones instead.

🔕 #4: Having SLIs disguised as SLOs

You've done the work of setting your SLIs and defining their respective SLOs.

You have built fancy dashboards that track each SLI, with clear threshold lines that indicate whether or not the SLO is healthy.

You are the only one looking at those dashboards periodically, and the team relies on you to tell them if and when they're supposed to do something about those metrics. There is no automation in place; it all relies on eyeballs looking at charts.

You might even try to nudge your team to share the burden of checking those dashboards, to no avail.

This approach has high friction, as it requires both you and your team to go through a repetitive and tedious process, which is something humans aren't good at.

It won't take long before people just stop doing it.

These are some symptoms you might observe if you're in this situation:

🤷🏼 The team lacks awareness of how their systems are performing. They look at those dashboards rarely, and often only after a nudge from your side.
⏱️ You tend to react very late to SLO violations. As you rely on humans to check for potential issues, you'll find yourself dealing with problems after they have happened as opposed to working to prevent them.
🏝️ You often come back from holidays to bad surprises. As the team is implicitly relying on you - or someone else - to periodically check those SLOs, you might find that nobody took up that responsibility during your absence.

The solution here is simple, though not necessarily easy to implement.

As part of defining your SLOs, you want to establish clear rules for automated alerting: what triggers an alert, who receives it, and how it is expected to be handled.

This is where starting with a very small set of SLOs or a single one, as mentioned previously, will make it easier to implement a full end-to-end experience.
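To give an idea of what such an alerting rule can look like, here is a minimal sketch of a burn-rate check, a technique commonly used for SLO alerting. The burn rate tells you how fast you are consuming your error budget: a value of 1 means you would use exactly the whole budget over the SLO window. Everything here is an assumption for illustration: the `Window` type, the function names, and the 14.4 threshold, which is a commonly cited value that corresponds to burning 2% of a 30-day budget within one hour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one look-back window (hypothetical shape)."""
    total: int
    bad: int


def burn_rate(window: Window, target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    return (window.bad / window.total) / (1.0 - target)


def should_page(short: Window, long: Window, target: float, threshold: float = 14.4) -> bool:
    """Page only if both the short and the long window burn too fast.

    Requiring both windows filters out brief blips (long window stays calm)
    and incidents that have already recovered (short window stays calm).
    """
    return burn_rate(short, target) >= threshold and burn_rate(long, target) >= threshold


if __name__ == "__main__":
    # Made-up numbers: last 5 minutes and last hour against a 99.9% target.
    print(should_page(Window(total=1_000, bad=20), Window(total=10_000, bad=150), target=0.999))
```

The code is the easy part. The value comes from agreeing up front on who gets paged when `should_page` returns true, and what they are expected to do about it.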

🔧 #5: Fancy tools but no processes

You went a step further compared to anti-pattern #4. You have set up alerts that fire proactively when an SLO is at risk of being breached. But you stopped there.

You assumed that having all the tools in place would be enough for your team to reap the benefits of this approach, and missed an important step: defining the relevant processes to deal with SLO violations.

You likely also didn't think about how often to review your SLO targets, and when and how to adjust them.

Setting those processes will require aligning with your product counterpart on an Error Budget policy, and then agreeing on applying it automatically whenever a problem arises. Those conversations will take time and will not always be a walk in the park, so the temptation to skip them can be strong.

If you have skipped this last important step, you might observe a combination of the following symptoms:

🗣️ Deciding what to prioritize when an SLO is breached requires a lot of discussion. As there is no agreed-upon mechanism in place, every decision becomes an ad-hoc decision, sometimes requiring a lot of back and forth with your product counterpart to decide whether to act or not.
🏋🏼 Team overload due to heroic efforts. As there is no clear prioritization mechanism in place, you might implicitly request your team to resort to heroic efforts to handle those alerts while still delivering all their planned work.
🦋 Set and forget SLOs / Volatile SLOs. You set your targets once, and you never review them. Even worse, you end up implicitly questioning them every time there is an issue, effectively turning SLOs into mere indicators.
🤷🏼 Lack of ownership of SLOs within the team. As SLOs are either not part of the team's defined processes or they just come on top, there is a high chance your team will see them as unnecessary bureaucracy or burden.

One of the main benefits of SLOs is their ability to enable quick decision-making at scale through pre-defined agreements. That can only happen if you've gone through the steps required to establish clear and solid processes around them.

Spend the time to discuss and agree on how to deal with Error Budgets, document those agreements, and give your team the mandate to make prioritization decisions on the spot based on them.
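A documented agreement can be as simple as a small table that maps the remaining Error Budget to a pre-agreed action. The sketch below shows one possible shape, assuming a request-based SLO; the thresholds and the wording of the actions are purely hypothetical and should come out of the discussion with your product counterpart, not out of this snippet.

```python
def remaining_budget(target: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the error budget still available (negative once overspent)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so the budget is untouched
    allowed_bad = (1.0 - target) * total_requests
    return 1.0 - bad_requests / allowed_bad


def policy_action(remaining: float) -> str:
    """Map the remaining budget to a pre-agreed action (hypothetical policy)."""
    if remaining <= 0.0:
        return "freeze feature work; reliability only"
    if remaining < 0.25:
        return "prioritize reliability items in the next sprint"
    return "ship features as planned"


if __name__ == "__main__":
    # Made-up numbers: a 99.9% target, 1M requests so far, 400 of them failed.
    left = remaining_budget(0.999, 1_000_000, 400)
    print(f"{left:.0%} of the budget left -> {policy_action(left)}")
```

Once such a mapping is agreed and written down, the team no longer needs a meeting to decide what to do when the budget runs low. They can look up the answer and act on it.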

🏁 Conclusions

In this week's article, we've explored the 5 anti-patterns of SLO adoption that I've observed in my own experience.

You might have observed part or all of them in your situation.

You might have encountered other issues that are not captured by this list.

I don't expect this list to be comprehensive, and I do welcome input and contributions in the comments section.

What's your suggestion for Anti-pattern #6?

🎉 Submit your suggestions, and I will update the article with the most valuable and relevant contributions!

See you all next week!
