Migrating from Prometheus + Grafana to Datadog
I was using Prometheus to monitor basically everything in my stack (Kafka clusters, Debezium, EC2, Kafka Connect, Kafka MirrorMaker, …), with Grafana for dashboards and alerts.
But after three months things started to fall apart as I added new monitoring targets: EC2 instances, EMR clusters with spot instances, MySQL databases, …
- Spot instances can't be tracked reliably because of the way Prometheus collects data by pulling from targets.
- CPU usage kept climbing as I added more targets to monitor, and I had to increase the scrape interval to 60s to 120s (see the snippet after this list).
- Grafana is not great for alerting, and I had a lot of trouble creating alerts with it.
- SLAs and alerts become critical as the infrastructure grows …
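For reference, this is roughly the kind of change I ended up making in prometheus.yml to keep CPU usage under control; the interval values, job name, and target host here are illustrative, not my exact config:

global:
  scrape_interval: 120s      # raised from the default 15s to reduce scrape load
  evaluation_interval: 120s

scrape_configs:
  - job_name: 'kafka-connect'
    static_configs:
      - targets: ['kafka-connect-1:7073']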
I was looking for an alternative that was reliable, cheap, and easy to use, because in my view you shouldn't spend a lot of effort setting up and maintaining a big monitoring service. I picked Datadog over New Relic because it leans more toward infrastructure than application monitoring, which is my focus; both support infrastructure and application monitoring, but I felt Datadog has the better UI.
Compared to Prometheus, Datadog has a lot of advantages:
- Reasonable price
- Easier to create dashboards, prebuilt dashboards …
- Easier to create monitors and alerts
- Push model instead of pull model
- Lots of built-in integrations; one agent handles a lot of things out of the box
- SLOs, Notebooks, Logs, APM …
Migration to Datadog
Datadog has lots of built-in dashboards, and because it is very easy to create new ones, recreating the dashboards I had in Grafana (Debezium, Kafka Connect, MirrorMaker) was pretty fast.
Most things worked well except JMX metrics.
The way Datadog collects JMX metrics is different from Prometheus:
- You need to list explicitly which metrics you want to collect
- Regex is not supported for JMX (at the time of writing this article)
It is a pain in the ass to migrate a huge set of JMX metrics for things like Kafka, Kafka Connect, Debezium, …; Kafka alone may have 2,000 to 3,000 metrics you want to collect.
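To give an idea of what that explicit listing looks like, here is a rough sketch of a Datadog JMX check config where every bean and attribute has to be spelled out one by one; the beans, port, and aliases below are just examples, not my actual config:

init_config:
  is_jmx: true
  collect_default_metrics: true

instances:
  - host: localhost
    port: 9999
    conf:
      - include:
          domain: kafka.server
          bean: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
          attribute:
            Count:
              metric_type: counter
              alias: kafka.messages_in.count
      - include:
          domain: kafka.server
          bean: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
          attribute:
            Count:
              metric_type: counter
              alias: kafka.bytes_in.count
      # ... one include block per bean you care about

Multiply that by every broker, connector, and topic metric and it quickly gets out of hand.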
Since Datadog supports the OpenMetrics integration, there is a much simpler way to collect JMX metrics: keep the existing JMX Exporter and point the Datadog Agent at it. That's it.
init_config:
  service: debezium-kafka-connect

instances:
  ## @param prometheus_url - string - required
  ## The URL where your application metrics are exposed by Prometheus.
  - prometheus_url: http://localhost:7073/metrics
    namespace: kafka-connect
    metrics:
      - "debezium*"
    type_overrides:
      "debezium*": gauge
    tags:
      - env:production
The thing to consider here is `type_overrides`. The supported metric types are `gauge`, `counter`, `histogram`, and `summary`.
SLOs can monitor things like service uptime, data quality, custom metrics, … For example, to track the daily ingestion success rate I created a metric called data.dq.daily_ingestion, tagged with the database and table I want to track. When a pipeline run succeeds it sends the metric to Datadog with value 1, and 0 when it fails.
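A minimal sketch of how a pipeline could send that metric through the local Datadog Agent's DogStatsD endpoint; the table value and the helper function name are placeholders, not from my actual pipeline:

from datadog import initialize, statsd

# Point the client at the DogStatsD server bundled with the local Datadog Agent.
initialize(statsd_host="localhost", statsd_port=8125)

def report_daily_ingestion(db: str, table: str, success: bool) -> None:
    # 1 for a successful pipeline run, 0 for a failed one.
    statsd.gauge(
        "data.dq.daily_ingestion",
        1 if success else 0,
        tags=[f"db:{db}", f"table:{table}"],
    )

# Example: call this at the end of each pipeline run.
report_daily_ingestion("cdm", "orders", success=True)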
Then I get the SLO for daily ingestion with:
(sum:data.dq.daily_ingestion{db:cdm}) / (count:data.dq.daily_ingestion{db:cdm})
Another good feature is Notebooks, which help you deep dive into metrics or investigate an incident.
Conclusion
After migrating to Datadog the results have been positive; there is some learning curve, but it is quick to pick up.
- Less time spent maintaining the Prometheus/Grafana cluster
- Easy to set up new monitoring targets
- Easy to create monitors, dashboards, and alerts
- SLOs help with tracking SLAs and OKRs
- Good docs and support from Datadog if you run into issues