Migrating from Prometheus + Grafana to Datadog
I was using Prometheus to monitor basically everything in my stack (Kafka clusters, Debezium, EC2, Kafka Connect, Kafka MirrorMaker, …), with Grafana for dashboards and alerts.
But after three months things started to fall apart as I added new monitoring targets: EC2 instances, EMR clusters with spot instances, MySQL databases, …
- Spot instances can't be tracked reliably because of the way Prometheus collects data by pulling from targets.
- CPU usage kept climbing as I added more targets to monitor, and I had to increase the scrape interval to 60s to 120s (see the snippet after this list).
- Grafana is not great for alerting, and I had a lot of trouble creating alerts with it.
- SLAs and alerts become critical as the infrastructure grows …
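For reference, this is roughly the kind of change I ended up making in prometheus.yml to keep CPU usage under control; the interval values, job name, and target host here are illustrative, not my exact config:

global:
  scrape_interval: 120s      # raised from the default 15s to reduce scrape load
  evaluation_interval: 120s

scrape_configs:
  - job_name: 'kafka-connect'
    static_configs:
      - targets: ['kafka-connect-1:7073']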
I was looking for an alternative that was reliable, cheap, and easy to use, because in my view you shouldn't spend a lot of effort setting up and maintaining a big monitoring service. I picked Datadog over New Relic because it leans more toward infrastructure than application monitoring, which is my focus; both support infrastructure and application monitoring, but I felt Datadog has the better UI.
Compared to Prometheus, Datadog has a lot of advantages:
- Reasonable price
- Easier to create dashboards, prebuilt dashboards …
- Easier to create monitors and alerts
- Push model instead of pull model
- Lots of built-in integrations; one agent handles a lot of things out of the box
- SLOs, Notebooks, Logs, APM …
Migration to Datadog
Datadog has lots of built-in dashboards, and because it is very easy to create new ones, recreating the dashboards I had in Grafana (Debezium, Kafka Connect, MirrorMaker) was pretty fast.
Most things worked well except JMX metrics.
The way Datadog collects JMX metrics is different from Prometheus:
- You need to list explicitly which metrics you want to collect
- Regex is not supported for JMX (at the time of writing this article)
It is a pain in the ass to migrate a huge set of JMX metrics for things like Kafka, Kafka Connect, Debezium, …; Kafka alone may have 2,000 to 3,000 metrics you want to collect.
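To give an idea of what that explicit listing looks like, here is a rough sketch of a Datadog JMX check config where every bean and attribute has to be spelled out one by one; the beans, port, and aliases below are just examples, not my actual config:

init_config:
  is_jmx: true
  collect_default_metrics: true

instances:
  - host: localhost
    port: 9999
    conf:
      - include:
          domain: kafka.server
          bean: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
          attribute:
            Count:
              metric_type: counter
              alias: kafka.messages_in.count
      - include:
          domain: kafka.server
          bean: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
          attribute:
            Count:
              metric_type: counter
              alias: kafka.bytes_in.count
      # ... one include block per bean you care about

Multiply that by every broker, connector, and topic metric and it quickly gets out of hand.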
Since Datadog supports the OpenMetrics integration, there is a much simpler way to collect JMX metrics: keep the existing JMX Exporter and point the Datadog Agent at it. That's it.
init_config:
  service: debezium-kafka-connect

instances:
  ## @param prometheus_url - string - required
  ## The URL where your application metrics are exposed by Prometheus.
  - prometheus_url: http://localhost:7073/metrics
    namespace: kafka-connect
    metrics:
      - "debezium*"
    type_overrides:
      "debezium*": gauge
    tags:
      - env:production
The thing to consider here is `type_overrides`. The supported metric types are `gauge`, `counter`, `histogram`, and `summary`.
SLOs can monitor things like service uptime, data quality, custom metrics, … For example, to track the daily ingestion success rate I created a metric called data.dq.daily_ingestion, tagged with the database and table I want to track. When a pipeline run succeeds it sends the metric to Datadog with value 1, and 0 when it fails.
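A minimal sketch of how a pipeline could send that metric through the local Datadog Agent's DogStatsD endpoint; the table value and the helper function name are placeholders, not from my actual pipeline:

from datadog import initialize, statsd

# Point the client at the DogStatsD server bundled with the local Datadog Agent.
initialize(statsd_host="localhost", statsd_port=8125)

def report_daily_ingestion(db: str, table: str, success: bool) -> None:
    # 1 for a successful pipeline run, 0 for a failed one.
    statsd.gauge(
        "data.dq.daily_ingestion",
        1 if success else 0,
        tags=[f"db:{db}", f"table:{table}"],
    )

# Example: call this at the end of each pipeline run.
report_daily_ingestion("cdm", "orders", success=True)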
Then I get the SLO for daily ingestion with:
(sum:data.dq.daily_ingestion{db:cdm}) / (count:data.dq.daily_ingestion{db:cdm})
Another good feature is Notebooks, which help you deep dive into metrics or investigate an incident.
Conclusion
After migrating to Datadog the results have been positive; there is some learning curve, but it is quick to pick up.
- Less time spent maintaining the Prometheus/Grafana cluster
- Easy to set up new monitoring targets
- Easy to create monitors, dashboards, and alerts
- SLOs help with tracking SLAs and OKRs
- Good docs and support from Datadog if you run into issues