Book Review: Site Reliability Engineering

Work in progress: I will keep updating this note as I read the book.

This is one of the books I have long wanted to read. I have made a couple of attempts already, but I want to commit to finishing it this year, especially as I consider SRE as a path for my future career.

You can read the online version for free from Google: here.

I am hoping to learn both the principles and their practical applications from this book.

Key Points

  • SRE is what happens when you ask a software engineer to design an operations team.
  • 50-60% of the SRE team at Google are Google Software Engineers. The remaining 40-50% have 85-99% of the required software engineering skill set, plus other skills that are rare in software engineers, for example UNIX system internals and networking.
  • Google places a 50% cap on “ops” work. This is an upper bound; over time the share should decrease, as the “dev” work leads to automation, and that in turn leads to a system that runs and repairs itself.
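To make the 50% cap concrete for myself, here is a minimal sketch of how a team might check it. This is my own illustration, not something from the book: the function names, the idea of tracking hours, and the `OPS_CAP` constant are all assumptions.

```python
# Hypothetical helper: the names and the hours-based accounting are my own
# assumptions for illustration, not Google's actual tooling or process.
OPS_CAP = 0.50  # upper bound on the share of time spent on "ops" work


def ops_fraction(ops_hours: float, dev_hours: float) -> float:
    """Return the share of total logged time that went to "ops" work."""
    total = ops_hours + dev_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return ops_hours / total


def exceeds_ops_cap(ops_hours: float, dev_hours: float, cap: float = OPS_CAP) -> bool:
    """True when the "ops" share is strictly above the cap."""
    return ops_fraction(ops_hours, dev_hours) > cap
```

For example, a week with 25 hours of tickets and on-call and 15 hours of project work would be over the cap, while an even 20/20 split sits exactly at the upper bound.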

Principles

Practices

My Observations

  • In an SRE environment, all engineers are rostered to be on call. This matches my experience at Fairfax/Nine.
  • From a quick look at SRE job ads on SEEK, most ask for experience with provisioning tools such as Terraform, Helm, etc., which a software engineer typically won’t have much experience with.