My recording of tidbits I read from various newsletters, twitter (it’s always twitter for me - not X), training and other sources. It also contains my own original thoughts (if there is such a thing as an original thought).
Monitoring is hard
So let’s go fishing? I read the Monitoring is a Pain article last week - a good reflective read, with a lot of learning from someone who has done monitoring for years. This is the kind of article that I really love reading: a distilled form of years of experience.
I started to jot down some key takeaways, but there is so much in that article that I won’t bother. I do want to steal his conclusion here as a reminder for myself:
My experience has been monitoring is an unloved internal service of the worst kind. It requires a lot of work, costs a lot of money and never makes the company any money.
Gaming FinOps for your career progression
So I had this naughty thought the other day (but I’m pretty sure this is already a thing).
What if you purposely run your project cost (say, infrastructure cost) high in the beginning? Over-provision everything!
Then a couple of quarters later, you propose and lead a cost-saving initiative. Then you slap this on your resume/LinkedIn - “led a FinOps initiative that saved infrastructure cost by x%”. Winning!
Incentives and alignment - ah, how easy they are to game.
On Prometheus
So I am currently doing the PromLabs training, and here are some of my notes:
Limits of Prometheus
As a rule of thumb, a single large Prometheus server can ingest up to 1 million time series samples per second and uses 1-2 bytes for the storage of each sample on disk. It can handle several million concurrently active (present in one scrape iteration of all targets) time series at once.
Although I am not sure what constitutes a “large server” here.
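To make those numbers a bit more concrete, here is a rough back-of-envelope calculation (my own illustrative figures, not from the training) of what that ingest rate would mean for disk:

```python
# Back-of-envelope maths for the rule of thumb above.
# Illustrative numbers only - not official Prometheus sizing guidance.
samples_per_second = 1_000_000   # upper bound from the rule of thumb
bytes_per_sample = 1.5           # midpoint of the quoted 1-2 bytes per sample
retention_days = 15              # Prometheus' default retention period

bytes_per_day = samples_per_second * bytes_per_sample * 60 * 60 * 24
total_bytes = bytes_per_day * retention_days

print(f"~{bytes_per_day / 1e9:.0f} GB of samples per day")
print(f"~{total_bytes / 1e12:.1f} TB over {retention_days} days of retention")
```

That comes out to roughly 130 GB a day, or about 2 TB over the default retention window. It ignores index, WAL and churn overhead, so treat it purely as an order-of-magnitude sanity check.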
Metrics vs Logs
You can store more in logs - effectively unlimited cardinality. Use metrics for a high-level view of system health; use logs to deep dive and troubleshoot.
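A small sketch of that split using the Python prometheus_client library (the metric, label and field names here are made up for illustration): keep the label set on metrics small and bounded, and push the unbounded detail - user IDs, request paths, error messages - into logs.

```python
import logging
from prometheus_client import Counter, start_http_server

# Metric: low, bounded cardinality - only a handful of method/status combinations.
HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["method", "status"],
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("app")

def handle_request(method: str, path: str, user_id: str) -> None:
    status = "200"  # pretend the request succeeded
    # Count the request against a small, fixed label set.
    HTTP_REQUESTS.labels(method=method, status=status).inc()
    # The unbounded detail (path, user id) goes into the log line,
    # not into metric labels where it would explode cardinality.
    logger.info("handled request method=%s path=%s user_id=%s status=%s",
                method, path, user_id, status)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    handle_request("GET", "/orders/12345", "user-42")
```

The counter can only ever produce a handful of time series (methods × statuses), while the log line carries the per-request detail you would grep for when troubleshooting.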
Note: A 5-second scrape interval is quite aggressive, but useful for demonstration purposes where you want data to be available quickly. In real-world scenarios, intervals between 10 and 60 seconds are more common.
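To see why the interval matters for capacity, a quick sketch (the fleet numbers are hypothetical) relating target count, series per target and scrape interval back to the samples-per-second figure above:

```python
# Hypothetical fleet - illustrative numbers, not from the training.
targets = 2_000
series_per_target = 1_000   # active time series exposed by each target

for scrape_interval_s in (5, 15, 60):
    samples_per_second = targets * series_per_target / scrape_interval_s
    print(f"{scrape_interval_s:>2}s interval -> ~{samples_per_second:,.0f} samples/sec")
```

Going from a 5-second to a 15-second interval cuts the ingest rate by 3x for the same fleet, which is usually a much cheaper lever than shedding time series.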