Monitoring @ Mettle
When I arrived at Mettle, infrastructure was being monitored using Zabbix, which meant that neither Kubernetes itself nor the services running on it were being monitored. My main objective was to increase visibility into platform performance using a mixture of Slack alerts and Grafana dashboards, leveraging Prometheus as the centralized mechanism for collating metrics.
We started by deploying a basic, custom-built Prometheus Helm chart, with the configuration of Prometheus itself and the alerts tightly coupled to the chart.
All alerts were fired to a single Slack channel (one per environment), which meant these channels became extremely noisy and issues ended up with no ownership. This is something we needed to fix!
Our new monitoring architecture
Over the past few weeks, the team has been designing our new monitoring architecture. We finally converged on something that offers robustness, high availability, long-term storage and, most importantly, a setup we feel is easily extensible. Our new architecture looks like this:
Note: PagerDuty is greyed out as it's the next receiver we want to implement.
The platform team has a paradigm of running everything on Kubernetes, so each dark box represents a "pod" and the lighter boxes represent individual containers.
As with all the platform components at Mettle, we architect for scale and high availability. All components can be scaled horizontally to handle increased load and ensure we're resilient to failure.
What's in the Prometheus Pod?
Prometheus Server
The Prometheus server is the core component in the monitoring stack. It's responsible for scraping targets, storing metrics, and providing the interface which allows us to query the data. We're running a completely standard server at version v2.12.0.
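For context, a stripped-down version of the kind of scrape configuration a server like this runs might look like the sketch below; the labels and the opt-in annotation convention are illustrative rather than our exact config.

```yaml
# Minimal sketch of a Prometheus scrape config for Kubernetes pods (illustrative only).
global:
  scrape_interval: 30s
  external_labels:
    environment: production   # hypothetical label; useful later when Thanos merges data across servers
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod              # discover scrape targets from the Kubernetes API
    relabel_configs:
      # only scrape pods that opt in via the common prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```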
Thanos Sidecar
This will be covered in the Thanos section of this blog post.
Config Reloader
As we're running on Kubernetes, we use a ConfigMap to store the configuration for Prometheus. When we apply a change to the ConfigMap, the file mounted within the Prometheus server is automatically updated, but by default, it requires a manual action to trigger the server to reload.
To automate this step, we deploy a sidecar (https://github.com/kubernetes/git-sync) which mounts the configuration file and syncs changes from the repository storing this file. With this in place, updating the configuration file in GitHub is a sufficient action for Prometheus to be updated and reloaded.
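For illustration, a git-sync sidecar wired up this way might look roughly like the following; the repository URL, image tag and paths are placeholders rather than our actual manifest.

```yaml
# Hypothetical git-sync sidecar that keeps the Prometheus config in sync with a Git repo.
- name: config-sync
  image: k8s.gcr.io/git-sync:v3.1.1        # illustrative image tag
  env:
    - name: GIT_SYNC_REPO
      value: https://github.com/example-org/prometheus-config.git   # placeholder repo
    - name: GIT_SYNC_BRANCH
      value: master
    - name: GIT_SYNC_ROOT
      value: /config                        # where the repo is checked out
    - name: GIT_SYNC_WAIT
      value: "30"                           # poll the repo every 30 seconds
  volumeMounts:
    - name: prometheus-config
      mountPath: /config                    # shared with the Prometheus server container
```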
Alerting Rules Fetcher
We use Prometheus as our primary tool for alerting and want to make the experience of defining new alerts as straightforward as possible. We store all of our alert definitions in a Git repo. The process for adding a new alert is as follows:
- Raise a PR for a new alert
- Validate on CI: Check alert syntax with amtool
- Merge the PR into the master branch
- The new alert is live.
The rules fetcher watches the master branch and automatically pulls new alerts into a volume mounted within the Prometheus pod. When a change is synced, the Thanos sidecar triggers a reload of the Prometheus server.
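To make this concrete, here's a rough sketch of the kind of alert definition that lives in that repo; the metric, threshold and team label below are made up for illustration.

```yaml
# Hypothetical alert definition; service name, expression and threshold are illustrative.
groups:
  - name: payments-service
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5..", service="payments"}[5m])) > 1
        for: 10m
        labels:
          severity: warning
          team: payments        # picked up by Alertmanager to route the alert to the owning team
        annotations:
          summary: "Payments service is returning an elevated rate of 5xx responses"
```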
Currently, we only have one receiver configured in AlertManager, which is Slack; in the next iteration we would like to include PagerDuty as well.
We route alerts based on the team responsible for the given service. We also send a copy of all alerts to a single Slack channel, on-call-alerts, which serves as a platform-wide alert stream (see below).
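The routing configuration looks roughly like the sketch below; team names, channels and webhook URLs are placeholders rather than our real config. The matcher-less route with continue: true is what sends a copy of every alert to the platform-wide stream before the team-specific routes are evaluated.

```yaml
# Sketch of team-based routing with a platform-wide copy (receivers and channels are placeholders).
route:
  receiver: slack-on-call-alerts            # default if no team route matches
  group_by: ['alertname', 'team']
  routes:
    - receiver: slack-on-call-alerts        # copy of every alert to the platform-wide stream
      continue: true                        # keep evaluating the team routes below
    - match:
        team: payments
      receiver: slack-payments
    - match:
        team: platform
      receiver: slack-on-call-alerts

receivers:
  - name: slack-on-call-alerts
    slack_configs:
      - channel: '#on-call-alerts'
        api_url: https://hooks.slack.com/services/REPLACE_ME   # placeholder webhook
  - name: slack-payments
    slack_configs:
      - channel: '#payments-alerts'
        api_url: https://hooks.slack.com/services/REPLACE_ME
```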
We have a custom template for our alerts; I plan to write a separate post covering it in more detail.
What's in the AlertManager Pod?
We're running AlertManager at version v0.18.0 without modification. We use the same configuration reload mechanism as in the Prometheus server, such that ConfigMap updates trigger an update and reload of the system.
How does Thanos work?
In our initial architecture, we stored metrics for 7–14 days (depending upon the environment) because the data was written to block storage, in our case an EBS volume. However, we saw this as an issue because an EBS volume has a fixed size.
Thanos solves this by introducing a complementary system to run alongside existing Prometheus setups. Using a combination of object storage and a smart query layer, it provides a single view of all our metrics for an unlimited amount of time. Additionally, it reduces our dependency on block storage, which makes our monitoring stack truly stateless. So let's dig in and understand a little more about the Thanos components.
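Every Thanos component that talks to S3 shares a small object-store configuration file. A minimal sketch, assuming an S3 bucket in eu-west-1 (the bucket name is a placeholder):

```yaml
# objstore.yaml passed to the Thanos sidecar, store and compactor (bucket/region are placeholders).
type: S3
config:
  bucket: example-thanos-metrics
  endpoint: s3.eu-west-1.amazonaws.com
  region: eu-west-1
  # credentials typically come from the pod's IAM role rather than static keys
```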
Thanos Sidecar
The Thanos sidecar container runs in the Prometheus "pod" and mounts the Prometheus data directory through a shared volume. Prometheus periodically writes its data for a fixed time window to immutable block files. The sidecar is responsible for backing up these files to object storage (S3 in our case) and acting as a local data source for the global Thanos Query component.
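A sketch of what that sidecar container might look like in the Prometheus pod spec; the version tag and paths are illustrative.

```yaml
# Illustrative Thanos sidecar container running alongside the Prometheus server.
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.8.1              # illustrative version
  args:
    - sidecar
    - --tsdb.path=/prometheus                      # shared volume with the Prometheus server
    - --prometheus.url=http://localhost:9090       # local Prometheus to proxy queries to
    - --objstore.config-file=/etc/thanos/objstore.yaml   # the S3 config shown earlier
  volumeMounts:
    - name: prometheus-data
      mountPath: /prometheus
    - name: thanos-objstore-config
      mountPath: /etc/thanos
```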
Thanos Query
The Thanos query component provides a Prometheus-native interface for requesting data from multiple distinct Prometheus servers. Queries made through this component fan out to other Prometheus pods and request data through the Thanos sidecar. If the data is available on the local volume, it's returned to Thanos Query, where results are merged and returned to the system making the original request. The combination of Sidecar and Query components solves the single-view problem, as well as backing up the data to S3.
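A sketch of the query component, assuming the sidecars and the store gateway are discoverable through DNS SRV records; the service names are placeholders.

```yaml
# Illustrative Thanos Query container fanning out to sidecars and the store gateway.
- name: thanos-query
  image: quay.io/thanos/thanos:v0.8.1
  args:
    - query
    - --http-address=0.0.0.0:10902                              # Prometheus-compatible query API/UI
    - --store=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc   # discover sidecars via DNS
    - --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc     # and the store gateway
```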
Thanos Store
The Thanos store acts just like a sidecar, in that it represents a source for metrics data. The main difference is that, while the sidecar acts as a proxy for data locally available from a Prometheus server, Thanos store acts as a gateway for data stored remotely in S3.
With this component in place, queries made through Thanos Query fan out to both the Thanos Store and Thanos Sidecar components, providing a seamless view not only across multiple Prometheus servers but also across a long time window (for as long as we keep data in S3).
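A sketch of the store gateway, reusing the same object-store configuration as the sidecar; the data directory and version are illustrative.

```yaml
# Illustrative Thanos Store gateway serving blocks from S3 over the gRPC Store API.
- name: thanos-store
  image: quay.io/thanos/thanos:v0.8.1
  args:
    - store
    - --data-dir=/var/thanos/store                       # local cache of index data
    - --objstore.config-file=/etc/thanos/objstore.yaml   # same S3 bucket config as the sidecar
```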
Thanos Compactor
The final piece of the puzzle is the Thanos Compactor, which is responsible for compacting and downsampling data stored in S3 to help with efficient querying over long time periods. I won't go into the details of how and why, but if you're interested, read this!
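For completeness, a sketch of the compactor with example retention settings for raw and downsampled data; the values are illustrative, not our actual policy.

```yaml
# Illustrative Thanos Compactor; retention values below are examples only.
- name: thanos-compactor
  image: quay.io/thanos/thanos:v0.8.1
  args:
    - compact
    - --data-dir=/var/thanos/compact
    - --objstore.config-file=/etc/thanos/objstore.yaml
    - --wait                              # keep running and compact continuously
    - --retention.resolution-raw=30d      # example retention for raw samples
    - --retention.resolution-5m=90d       # ...for 5-minute downsampled data
    - --retention.resolution-1h=1y        # ...for 1-hour downsampled data
```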
Conclusion
While this new setup took some time, we are very pleased with the outcome and with the user-friendly experience our engineers now have when monitoring and alerting on their services.