What is Site Reliability Engineering (SRE)?

“What does SRE mean to you?” is one of the questions we ask when we interview people for our SRE team. Just like “DevOps”, the term means different things to different people, so I always find it interesting to hear the answers.

Interestingly, even though there is no right or wrong answer, I remember my manager telling me in our first one-on-one that my definition of SRE is not SRE :) He was right. It’s not that I didn’t know what SRE is; it just wasn’t clear to me what the difference between SRE and DevOps is. I was mumbling about bringing developers and operations together, breaking down silos, etc. - which is essentially DevOps.

Now, a few months in, I have a more concrete answer to what SRE means and how it relates to DevOps.

SRE implements DevOps

This is explained by Googlers in this video.

In a nutshell, DevOps as a philosophy has 5 principles that are “concretely” implemented by SRE:

DevOps Principles | SRE Implementation
Reduce silos | Share ownership
Accept failures as normal | SLOs & blameless postmortems
Implement gradual change | Reduce cost of failure, for example: canary releases
Tooling & automation | Eliminate toil
Measure everything | Measure toil & reliability

SRE and DevOps have a different focus

This SRE video from Atlassian is a must-watch if you want to understand SRE.

In this video, Nick Wright explains that SRE’s primary focus is reliability, while DevOps’s primary focus is delivery speed.

SRE interests are: operations, incident response, postmortems, monitoring, alerting, capacity planning.

DevOps interests are: delivery, release automation, environment builds, config management, infra as code.

There are definitely overlaps between the two disciplines, and for me personally SRE is a continuation of DevOps once reliability becomes important enough for the system.

It is a continuation because DevOps is the foundation of SRE: if you don’t have a DevOps foundation in place, it’d be hard to do SRE. To put it another way, I believe all companies, big and small, should implement DevOps practices of some sort. However, not every company should have SRE (although they can still benefit from following a few SRE principles).

SRE needs business buy-in

If we take Google’s model of SRE, where SRE has the right to stop production deployments in certain cases, such as SLOs being broken, this obviously needs strong buy-in from the business, because in that case reliability is prioritised over deployment velocity.
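To make the idea of stopping deployments when an SLO is broken a bit more concrete, here is a minimal Python sketch of such a gate. It assumes the SLO is expressed as a request success ratio over a measurement window and that the allowed failures are tracked as an "error budget". The names `SLOWindow` and `can_deploy`, and all the numbers, are hypothetical and only for illustration; this is not Google's actual tooling or policy.

```python
from dataclasses import dataclass


@dataclass
class SLOWindow:
    """Request counts for one SLO measurement window (hypothetical shape)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    def error_budget(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        budget = self.error_budget()
        if budget <= 0:
            # No failures are tolerated (100% target or no traffic):
            # the budget is intact only if nothing has failed.
            return 1.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / budget


def can_deploy(window: SLOWindow) -> bool:
    """Allow feature deployments only while some error budget is left."""
    return window.budget_remaining() > 0.0


if __name__ == "__main__":
    healthy = SLOWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    broken = SLOWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
    print("healthy window, deploy allowed:", can_deploy(healthy))
    print("broken window, deploy allowed:", can_deploy(broken))
```

In this sketch a 99.9% target over one million requests tolerates roughly 1,000 failures, so the window with 400 failures still allows deployments and the one with 1,200 does not. The code is the easy part: the business has to agree up front that a blocked deployment is the right outcome when the budget is spent.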

Conclusion

If you ever interview for an SRE role and you don’t mention reliability in your answer, then maybe you don’t fully understand what SRE is (as I didn’t when I started). Hopefully this article has shed some light on the subject.