This is a work-in-progress note; I will keep updating it as I read the book.
This is one of the books I have wanted to read for a while. I have made a couple of attempts already, but I want to commit to finishing it this year, especially as I consider SRE as a path for my future career.
You can read the online version for free from Google: here.
I am hoping to get the principles and the practical applications from this book.
- SRE is what happens when you ask a software engineer to design an operations team.
- 50-60% of the SRE team at Google are Google Software Engineers. The rest have 85-99% of the required software engineering skills, plus other skills that are rare in software engineers, for example UNIX system internals and networking.
- Google places a 50% cap on “ops” work. This is an upper bound; over time it should decrease, as the “dev” work leads to automation and, eventually, to a system that runs and repairs itself.
The cost of building a 100% reliable service would be too high (if such a target is even achievable), so we must embrace risk and failure.
Risk should be measured by identifying objective metrics that represent the property of the system to be optimised.
Google uses request success rate to measure availability rather than time-based metrics (e.g. uptime). This is due to the nature of their globally distributed services, which are almost never down everywhere at the same time. Request success rate also works for batch jobs (systems that run periodically), while uptime doesn't.
My comment: time-based availability metrics would still make sense for the rest of us (non-Google). I also think failed requests / total requests should be tracked anyway, separately from the time-based metrics, perhaps as the error rate (one of the four golden signals).
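To make the two availability measures above concrete for myself, here is a minimal Python sketch. The function names and all the numbers are my own made-up illustration, not something taken from the book or from any Google tooling; the formulas are simply uptime / (uptime + downtime) and successful requests / total requests.

```python
def time_based_availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Availability as the fraction of the observation window the service was up."""
    total = uptime_seconds + downtime_seconds
    if total == 0:
        raise ValueError("observation window is empty")
    return uptime_seconds / total


def request_based_availability(successful_requests: int, total_requests: int) -> float:
    """Availability as the fraction of requests that succeeded."""
    if total_requests == 0:
        raise ValueError("no requests observed")
    return successful_requests / total_requests


# Hypothetical numbers for one day, for illustration only.
time_avail = time_based_availability(uptime_seconds=86_100, downtime_seconds=300)
request_avail = request_based_availability(
    successful_requests=999_500, total_requests=1_000_000
)
# The error rate (failed / total) is the complement of request-based availability.
error_rate = 1 - request_avail

print(f"time-based availability:    {time_avail:.4%}")
print(f"request-based availability: {request_avail:.4%}")
print(f"error rate:                 {error_rate:.4%}")
```

The two numbers can disagree for the same day: five minutes of downtime during a quiet period may cost very few requests, while a short partial outage at peak traffic can cost many. That is one reason I would track both.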
- In an SRE environment, all engineers (SRE & product) are rostered to be on-call.
- SRE is data driven: it relies on finding objective metrics for the system.
- Frontend: at Google this means the reverse proxies & load balancers running close to the edge of the network.