This is just a summary of Chapter 32 of the SRE book. Oh! I actually work as an SRE now too :D
But first of all, I want to describe how SRE teams work at Google, as it wasn’t clear to me until recently (after reading more of the book and discussing it with my colleagues). Google SRE teams look after the deployment and production responsibilities of a number of Google services. These services are developed and owned by different teams (product teams?).
With that understanding, the most striking lesson that I got from reading this chapter is: not all services will be accepted by the SRE team, because:
- Not all services need high availability and reliability.
- The number of development teams requesting SRE support exceeds the available bandwidth of SRE teams.
Thus, if I draw a connection to the non-Google world, SRE is a model that fits companies/services whose scale necessitates high availability and reliability; in other words, it’s not for everyone. Having said that, the principles of SRE would still be applicable in most cases, even if the Google implementation is not.
Back to the topic of the engagement model: for services to be onboarded into SRE, they need to go through a Production Readiness Review (PRR).
There are 3 PRR models:
- Simple PRR model - for applications already in production.
  - Engagement: establish the SLO/SLA for the service and plan for changes that improve reliability.
  - Analysis: the SRE team goes through its checklist to gauge the maturity of the service.
  - The next steps are Improvements & Refactoring, Training, Onboarding & Continuous Improvement.
- Early Engagement model. With this engagement model, SRE is involved early in the development process, which means SRE takes part in the design, build, launch & post-launch phases. An interesting note: sometimes after launch, a service is identified as not suitable for full-fledged SRE support. In this case, SRE hands the service back to the development team to support.
- Frameworks & SRE Platform.
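To make the Analysis step of the Simple PRR model a bit more concrete for myself, here is a minimal sketch of what "going through a checklist to gauge maturity" could look like. The checklist items, the `ChecklistItem` class and the `maturity_report` function are all my own hypothetical examples, not Google's actual PRR checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One hypothetical PRR checklist entry (names are illustrative only)."""
    name: str
    passed: bool


def maturity_report(items):
    """Return (fraction of items passed, names of the failing items)."""
    if not items:
        raise ValueError("checklist is empty")
    failing = [item.name for item in items if not item.passed]
    score = (len(items) - len(failing)) / len(items)
    return score, failing


# Made-up checklist for an imaginary service.
checklist = [
    ChecklistItem("SLO defined and agreed", True),
    ChecklistItem("Monitoring and alerting in place", True),
    ChecklistItem("Capacity plan documented", False),
    ChecklistItem("Rollback procedure tested", False),
]

score, gaps = maturity_report(checklist)
print(f"maturity: {score:.0%}")
print("gaps:", gaps)
```

The failing items are roughly what would feed the Improvements & Refactoring step that follows the analysis.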
Frameworks & SRE Platform
The Simple PRR & Early Engagement models have some limitations:
- High lead time for SREs (at Google, SRE headcount is less than 10% of all engineers)
- Not all services can be re-implemented to meet PRR standards.
- Microservices need a lower lead time for deployment, which again is not compatible with the PRR model.
A solution to those limitations is an SRE framework. A framework is a prescriptive implementation using a set of software components and a certain way of using these components.
The framework principles:
- Codified best practices
- Reusable solutions
- Common production platform with a common control surface: a uniform interface to production facilities, operational controls, monitoring, logging and configuration.
- Easier automation & smarter systems
SRE created a set of supported platform and service frameworks (maybe like boilerplate repos?), one for each environment (Java, C++ & Go). These frameworks standardised metrics, instrumentation, monitoring dimensions, log format, load management etc.
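To picture what "codified best practices" with a uniform interface might mean in code, here is a toy sketch. It is only my guess at the idea, not how Google's frameworks work: `ServiceFramework`, its `handler` decorator and the log fields are hypothetical names I made up. The point is that every handler registered through the framework gets the same request counters and the same structured log format without the service author writing any of it.

```python
import functools
import json
import time
from collections import Counter


class ServiceFramework:
    """Hypothetical mini-framework: every handler gets the same metrics and logs."""

    def __init__(self, service_name):
        self.service_name = service_name
        self.request_counts = Counter()  # keyed by (handler name, status)
        self.log_lines = []              # structured log records as JSON strings

    def handler(self, func):
        """Decorator that instruments a request handler in a uniform way."""
        @functools.wraps(func)
        def wrapped(*args, **kwargs):
            start = time.monotonic()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs for both success and failure, so metrics never get skipped.
                elapsed_ms = (time.monotonic() - start) * 1000
                self.request_counts[(func.__name__, status)] += 1
                self.log_lines.append(json.dumps({
                    "service": self.service_name,
                    "handler": func.__name__,
                    "status": status,
                    "latency_ms": round(elapsed_ms, 3),
                }))
        return wrapped


framework = ServiceFramework("checkout")


@framework.handler
def get_cart(user_id):
    return {"user": user_id, "items": []}


@framework.handler
def broken():
    raise RuntimeError("boom")


get_cart("u1")
try:
    broken()
except RuntimeError:
    pass

print(framework.request_counts)
for line in framework.log_lines:
    print(line)
```

Because the instrumentation lives in one place, a change to the log format or the metric dimensions only has to happen in the framework, which is what makes automation across many services easier.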
With this approach, services that don’t receive SRE support can still use the same production features as those that do.
This introduces a new engagement model: shared responsibility. In this engagement model, SRE is responsible for the platform infrastructure while the development team provides on-call support for the application itself. I actually have experience working in a similar model with ScentreGroup and Fairfax.
I am not clear on how the SRE team is structured under this model. I guess there would be a “services” SRE team and a “platform” SRE team?