
Integrating Graph Manager with Datadog

The Apollo Datadog integration enables you to forward Graph Manager performance metrics to your Datadog account. Datadog supports an advanced function API, which enables you to create sophisticated graphs and alerts for GraphQL metrics.


To integrate with Datadog, you provide your Datadog API key to Graph Manager. A Datadog account with administrator privileges is required to obtain an API key.

  1. Go to your Datadog Integrations page and select Apollo Engine from the list:


    Then go to the Configuration tab and click Install Integration at the bottom.

  2. Go to your Datadog APIs page and create an API key:


  3. In Graph Manager, go to your graph's Integrations page:


  4. Toggle the Datadog integration to turn it on. Paste your API key and click Save.


    You can use the same Datadog API key for all of your graphs, because all forwarded metrics are tagged with the corresponding graph's ID (service:<graph-id>).

  5. That's it! After about five minutes, your Datadog metrics explorer will begin showing metrics forwarded from Graph Manager.

Forwarded metrics

Graph Manager forwards the following metrics to Datadog:

apollo.engine.operations.count: The number of GraphQL operations that were executed. This includes queries, mutations, and operations that resulted in an error.
apollo.engine.operations.error_count: The number of GraphQL operations that resulted in an error. This includes both GraphQL execution errors and HTTP errors if Graph Manager failed to connect to your server.
apollo.engine.operations.cache_hit_count: The number of GraphQL queries for which the result was served from Graph Manager's full query cache.
apollo.engine.operations.latency.* (such as apollo.engine.operations.latency.95percentile): A histogram of GraphQL operation response times, measured in milliseconds. Because of Graph Manager's aggregation method (logarithmic binning), these values are accurate to +/- 5%.

These metrics are aggregated in 60-second intervals and tagged with the GraphQL operation name as operation:<query-name>. Unique query signatures with the same operation name are merged, and queries without an operation name are ignored.

These metrics are also tagged with both the associated Graph Manager graph ID (as service:<graph-id>) and the associated variant name (as variant:<variant-name>). If you haven't set a variant name, then current is used.
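To see why logarithmic binning can bound the relative error of the latency values, consider the following minimal Python sketch. It assumes a bin growth factor of 1.1 and reports each bin's geometric midpoint. Both choices, and all function names, are illustrative assumptions rather than Graph Manager's actual parameters.

```python
import math

# Assumption: each bin is 10% wider than the previous one.
# This is an illustrative value, not Graph Manager's real setting.
BASE = 1.1


def bin_index(latency_ms: float) -> int:
    """Return the index of the logarithmic bin that contains latency_ms."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return math.floor(math.log(latency_ms) / math.log(BASE))


def bin_representative(index: int) -> float:
    """Return the geometric midpoint of the bin, used as the reported value."""
    return BASE ** (index + 0.5)


def relative_error(latency_ms: float) -> float:
    """Relative difference between the true latency and its binned value."""
    approx = bin_representative(bin_index(latency_ms))
    return abs(approx - latency_ms) / latency_ms


if __name__ == "__main__":
    for latency in (1.0, 12.5, 250.0, 9999.0):
        print(latency, round(relative_error(latency), 4))
```

With these assumed parameters, the worst-case relative error is sqrt(1.1) - 1, which is a little under 5%, regardless of how large the latency is.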

Exploring metrics

In the Datadog metrics explorer, all Graph Manager metrics are tagged with the graph ID (service:<graph-id>), the variant name (variant:<variant-name>), and the operation name (operation:<query-name>). These values are normalized according to Datadog naming requirements (all letters are lowercase, and illegal symbols are converted to underscores).
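The normalization described above can be approximated with a short Python sketch, which may help when predicting the tag values to filter on. The set of permitted characters below follows Datadog's documented tag rules (alphanumerics, underscores, minuses, colons, periods, and slashes). Whether Graph Manager applies exactly this rule is an assumption, and the function names are hypothetical.

```python
import re

# Assumption: any character outside this set is treated as illegal
# and replaced with an underscore.
_ILLEGAL_CHARS = re.compile(r"[^a-z0-9_\-:./]")


def normalize_tag_value(value: str) -> str:
    """Lowercase the value and replace illegal characters with underscores."""
    return _ILLEGAL_CHARS.sub("_", value.lower())


def graph_manager_tags(graph_id: str, variant: str, operation: str) -> list[str]:
    """Build the three tags described in the text for one operation."""
    return [
        f"service:{normalize_tag_value(graph_id)}",
        f"variant:{normalize_tag_value(variant)}",
        f"operation:{normalize_tag_value(operation)}",
    ]


if __name__ == "__main__":
    print(graph_manager_tags("MyGraph", "Staging", "GetUser Profile"))
```

Under these assumptions, an operation named GetUser Profile would appear in Datadog as operation:getuser_profile.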

Tagging enables you to see data at any level of granularity, whether you want to aggregate across all operations or zoom in to a particular operation. You can control granularity by choosing a relevant set of operation tags for filtering, along with appropriate functions for time aggregation and space aggregation. Similarly, if you want to compare metrics across staging and production environments, you can filter with the appropriate variant tags.


Suppose you want to see the 95th percentile request latency averaged across all operations for a staging and a production service.

In the Datadog metrics explorer:

  1. In the Graph field, select apollo.engine.operations.latency.95percentile.
  2. In the Over field, select the name of the service to graph.
  3. In the One graph per field, select variant. Choose the variants for your production and staging environments.
  4. In the On each graph, aggregate with field, select Average of reported values.

At Apollo, we use Graph Manager to monitor Graph Manager itself, so this graph for us looks like the following:

(Image: Compare p95)
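The selections in the steps above correspond roughly to a Datadog metric query string. The Python sketch below builds such a string for a given graph. The helper name is hypothetical, and the query groups by every variant rather than only the production and staging variants chosen in the explorer.

```python
def p95_by_variant_query(service: str) -> str:
    """Average p95 latency across operations, one series per variant."""
    metric = "apollo.engine.operations.latency.95percentile"
    # "avg:" is the space aggregation, the braces hold the scope,
    # and "by {variant}" splits the result into one line per variant.
    return f"avg:{metric}{{service:{service}}} by {{variant}}"


if __name__ == "__main__":
    print(p95_by_variant_query("mygraph"))
```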

To generate more advanced reports, open up a Datadog notebook.

Alerting with Datadog

You can configure complex alerts with Datadog monitors.

Example #1

Graph Manager's Notifications feature supports alerts that trigger when the percentage of requests with an error in the last 5 minutes exceeds some threshold for a specific operation. Suppose that instead of alerting on a specific operation in the last 5 minutes, we want to alert on the error percentage over all operations in some graph in the last 10 minutes, such as when the percentage exceeds 1% for a graph mygraph with variant staging.

The Datadog metric alert query needed here is:

sum(last_10m):sum:apollo.engine.operations.error_count{service:mygraph,variant:staging}.as_count().rollup(sum).fill(null) / sum:apollo.engine.operations.count{service:mygraph,variant:staging}.as_count().rollup(sum).fill(null) > 0.01

The .rollup(sum).fill(null) is necessary because apollo.engine.operations.count is a Datadog gauge, which means it defaults to using avg for time aggregation and defaults to linear interpolation during space aggregation and query arithmetic. The .as_count() is necessary to ensure that operation counts are summed before the division and not after.
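The difference between summing before and after the division can be checked with plain arithmetic. The Python sketch below uses made-up per-minute samples and hypothetical function names. It illustrates the arithmetic only and does not reproduce how Datadog evaluates a query.

```python
def ratio_of_sums(errors: list[int], counts: list[int]) -> float:
    """Sum both series first, then divide (the intended error percentage)."""
    total = sum(counts)
    return sum(errors) / total if total else 0.0


def sum_of_ratios(errors: list[int], counts: list[int]) -> float:
    """Divide each interval first, then sum (the misleading alternative)."""
    return sum(e / c for e, c in zip(errors, counts) if c)


# Made-up samples: one quiet minute with a burst of errors
# between two busy, error-free minutes.
errors_per_minute = [0, 5, 0]
operations_per_minute = [1000, 10, 1000]

if __name__ == "__main__":
    print(ratio_of_sums(errors_per_minute, operations_per_minute))
    print(sum_of_ratios(errors_per_minute, operations_per_minute))
```

In this sample, 5 errors out of 2010 operations is well under the 1% threshold when the sums are taken first, while dividing each minute separately lets the single quiet minute dominate the result.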

Example #2

Consider the error percentage monitor from the previous example. When the number of operations is small, a few errors might cause the error percentage to exceed the threshold, resulting in a noisy monitor during periods of low traffic. We want to alert only when the number of operations isn't small (e.g., more than 10 in the last 10 minutes).

You can use Datadog composite monitors to support this kind of alert. First, create a monitor with the following metric alert query:

sum(last_10m):sum:apollo.engine.operations.count{service:mygraph,variant:staging}.rollup(sum).fill(null) > 10

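Then create a composite monitor that combines the two monitors as a && b, where a is the error percentage monitor and b is the operation count monitor. The composite alerts only when both conditions hold, which is the desired behavior.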
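As a rough sketch, the two metric alert queries from these examples can be generated for any graph and variant, and the composite condition can be reasoned about locally. The Python below is illustrative only: the helper names are hypothetical, and should_alert mirrors the a && b logic on totals you supply rather than calling Datadog.

```python
def _scope(service: str, variant: str) -> str:
    """Build the tag scope shared by both queries."""
    return "{" + f"service:{service},variant:{variant}" + "}"


def error_rate_query(service: str, variant: str, threshold: float,
                     window: str = "last_10m") -> str:
    """Query for monitor a: error percentage over all operations."""
    s = _scope(service, variant)
    suffix = ".as_count().rollup(sum).fill(null)"
    return (
        f"sum({window}):sum:apollo.engine.operations.error_count{s}{suffix}"
        f" / sum:apollo.engine.operations.count{s}{suffix} > {threshold}"
    )


def volume_query(service: str, variant: str, minimum: int,
                 window: str = "last_10m") -> str:
    """Query for monitor b: enough operations in the window."""
    s = _scope(service, variant)
    return (
        f"sum({window}):sum:apollo.engine.operations.count{s}"
        f".rollup(sum).fill(null) > {minimum}"
    )


def should_alert(errors: int, operations: int,
                 max_error_rate: float = 0.01, min_operations: int = 10) -> bool:
    """Local model of the composite a && b for one evaluation window."""
    a = operations > 0 and errors / operations > max_error_rate
    b = operations > min_operations
    return a and b


if __name__ == "__main__":
    print(error_rate_query("mygraph", "staging", 0.01))
    print(volume_query("mygraph", "staging", 10))
    print(should_alert(errors=1, operations=5))    # low traffic
    print(should_alert(errors=5, operations=100))  # real problem
```

Generating the queries from one helper keeps the scope identical in both monitors, which matters because the composite only makes sense when a and b describe the same graph and variant.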
