February 10, 2026

Deep Dive: Telemetry cardinality in the Apollo GraphOS Router

Nick Marsh

The GraphOS Router has been built with comprehensive observability in mind, using OpenTelemetry as its telemetry backbone. While we strive to provide defaults that work well for most use cases, understanding how telemetry works within the router can be crucial for high-scale production graphs. This post provides a deep technical exploration of the router’s telemetry system, from initial instrumentation through to export to various observability backends.

Router Instrumentation

Let’s start where the metrics and traces are generated in the router. At its core, the GraphOS Router uses OpenTelemetry (OTel) for all telemetry instrumentation. The router uses the OpenTelemetry Rust SDK, which provides the foundational primitives for metrics and traces. Metrics and traces flow through OpenTelemetry’s standardized APIs before being exported to various backends.

Measurement tools like counters, histograms, and gauges that record telemetry data throughout the router codebase are known in OTel terminology as Instruments, which are created by Meters, which are in turn created by Meter Providers. A Metric Reader periodically collects the measurements recorded by the Instruments. Finally, the reader passes the collected data to a Metric Exporter that serializes and transmits it to a telemetry backend.

A critical limitation of the version of the OpenTelemetry SDK currently in use by GraphOS Router (SDK version 0.24.0 in Router version v2.10.0) is that it has a hard-coded metric cardinality limit of 2000. This limit exists at the SDK level, not in the router’s code itself.

In the context of metrics, cardinality refers to the number of unique metrics created by combining a metric name with all possible values of its attributes. For example, if you have a cache metric (which tracks cache usage by schema type) that includes attributes for one of 10 subgraphs, 1000 operation names, and 100 type names, you would end up with a maximum cardinality of 1,000,000 unique combinations of attributes for that metric.

This limit of 2000 unique attribute combinations applies to each batch of metrics that is exported by the router. If this limit is reached, the OpenTelemetry library emits a warning such as:

Metrics error: Warning: Maximum data points for metric stream exceeded.
Entry added to overflow. Subsequent overflows to same metric until next
collect will not be logged.

The router detects this error condition and increments a special counter metric: apollo.router.telemetry.metrics.cardinality_overflow. If you see this counter incrementing, you know you’ve hit the limit. Once you pass the limit, metrics will still be recorded, but any attributes recorded after the limit is reached will be removed and can no longer be used for grouping or filtering.

More recent versions of the OpenTelemetry Rust SDK remove this hard-coded limit, allowing for configurable cardinality constraints. We are working to upgrade the OTel SDK in future versions of the Router to unlock this configurability. Doing so is complex, and the newer SDK versions introduce breaking changes. In the meantime, it is valuable to understand how to mitigate this cardinality limit. We will share a few strategies for doing so below, but first let’s take a look at how these metrics are sent to Apollo or to your metric or trace service of choice.

Metrics Exporters

Once metrics are being generated, they are sent by the router to various endpoints using one or more exporters, each serving a different purpose. The behavior of these exporters can be configured using a few different settings (note that this is only a subset of all config options; see the Router YAML Configuration Reference for a complete reference):

telemetry:
  apollo:
    metrics:
      usage_reports:
        batch_processor:
          max_export_timeout: 30s
          scheduled_delay: 5s
          max_queue_size: 2048
      otlp:
        batch_processor:
          scheduled_delay: 5s
          max_export_timeout: 30s
  exporters:
    metrics:
      otlp:
        batch_processor:
          max_export_timeout: 30s
          scheduled_delay: 5s
        enabled: true
        endpoint: example_endpoint
      prometheus:
        enabled: true
        listen: 0.0.0.0:9080
        path: /metrics

These exporters share common options in their batch_processor config:

  • scheduled_delay is the delay between metric exports (default 5 seconds).
  • max_export_timeout is the amount of time that the exporter will wait before cancelling the export (default: 30s).
  • max_queue_size is the maximum number of unique metrics that can be held in the buffer (default: 2048). If the queue fills up completely, metrics will be dropped.

Since router v2.7.0, the configured values for these batch processors are logged, e.g.:

configuring Apollo usage report metrics: ApolloUsageReportsBatchProcessorConfiguration { scheduled_delay=12s, max_queue_size=2042, max_export_timeout=32s }

Let’s explore each of these exporters in detail.

Apollo Usage Reports

Apollo usage reports are the router’s specialized metrics format designed for GraphOS Studio. They have been in use since early versions of Apollo Server and pre-date the OpenTelemetry standard. While all new metrics are being sent using OpenTelemetry, we still have some important metrics that use these usage reports. These reports aggregate detailed operation and field-level statistics including:

  • Request latency histograms with custom bucketing optimized for GraphQL operations
  • Fields referenced by each operation
  • Field-level execution statistics (when field-level instrumentation is enabled)
  • Error rates and error paths within GraphQL responses
  • Operation type and subtype classification
  • Client identification (name and version)

Apollo OTel Metrics

While baseline metrics are sent via usage reports, newer metrics are also sent to Apollo via OpenTelemetry. This includes metrics such as feature usage, enhanced error details, and subgraph insights.

Standard OTel Metrics

The router can export standard OpenTelemetry metrics to any OTLP-compatible endpoint (OpenTelemetry Collector, Datadog, New Relic, Grafana Cloud, etc.). This exporter uses the standard OTLP protocol over gRPC or HTTP.
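
For example, here is a minimal sketch of the standard OTLP metrics exporter configuration. The endpoint is a placeholder for your own collector, and the protocol option mirrors the one shown for the tracing OTLP exporter later in this post:

telemetry:
  exporters:
    metrics:
      otlp:
        enabled: true
        # Placeholder endpoint: point this at your collector
        # (4317 is the conventional OTLP/gRPC port)
        endpoint: http://my-otel-collector:4317
        # OTLP can be exported over gRPC or HTTP
        protocol: grpc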

Prometheus Metrics

The Prometheus exporter exposes metrics via a pull-based HTTP endpoint, following Prometheus conventions. This exporter creates a registry that accumulates metrics, which are then scraped by Prometheus servers. Since Prometheus is pull-based, there are no batch_processor configuration options for this exporter.

Trace Exporters

Tracing in the router captures the complete request lifecycle as a distributed trace, with spans representing each stage of query processing. As with the metrics exporters, the behavior of these trace exporters can be configured using a few different settings (again, note that this is only a subset of all config options; see the Router YAML Configuration Reference for a complete reference):

telemetry:
  apollo:
    tracing:
      batch_processor:
        max_export_timeout: 30s
        scheduled_delay: 5s
        max_export_batch_size: 512
        max_concurrent_exports: 1
        max_queue_size: 2048
  exporters:
    tracing:
      common:
        sampler: always_on
      datadog:
        batch_processor:
          max_concurrent_exports: 1
          max_export_batch_size: 512
          max_export_timeout: 30s
          max_queue_size: 2048
          scheduled_delay: 5s
        enabled: true
        endpoint: example_endpoint
      otlp:
        batch_processor:
          max_concurrent_exports: 1
          max_export_batch_size: 512
          max_export_timeout: 30s
          max_queue_size: 2048
          scheduled_delay: 5s
        enabled: true
        endpoint: example_endpoint
        protocol: grpc
      propagation:
        zipkin: true
      zipkin:
        batch_processor:
          max_concurrent_exports: 1
          max_export_batch_size: 512
          max_export_timeout: 30s
          max_queue_size: 2048
          scheduled_delay: 5s
        enabled: true
        endpoint: example_endpoint

These exporters share the same batch_processor options as the metrics exporters, but the limits apply to trace spans rather than to unique metric attribute combinations. The tracing exporters also include some additional configuration options:

  • max_concurrent_exports (default 1), which is the maximum number of concurrent export operations that can run simultaneously. This limits the number of spawned tasks for exports and thus memory consumed by an exporter. A value of 1 causes exports to be performed synchronously on the batch processor task.
  • max_export_batch_size: the maximum number of spans to include in a single export batch (default: 512). Once the queue reaches this size, the batch will be exported immediately even if the scheduled_delay period has not been reached. This must be set to less than or equal to max_queue_size.

An excessive number of traces is usually not useful, so you may want to modify the telemetry.exporters.tracing.common.sampler configuration so that fewer traces are sampled. This defaults to always_on, which is equivalent to a value of 1, or 100%. Sampling fewer traces (1-10 percent, depending on your requirements) reduces cardinality and also has a beneficial impact on CPU and memory usage.
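
For example, here is a minimal sketch of a ratio-based sampler (the 5% value is illustrative; choose a ratio that fits your traffic):

telemetry:
  exporters:
    tracing:
      common:
        # Sample roughly 5% of requests; accepts a ratio between 0 and 1,
        # or always_on / always_off
        sampler: 0.05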

Let’s explore each of these exporters in detail as well.

Apollo Usage Report Traces

Similar to metrics, Apollo usage reports include trace data for sampled operations. These traces are transformed into Apollo’s proprietary trace format, which includes:

  • Query plan tree structure (parallel, sequence, fetch, flatten nodes)
  • Subgraph request/response details
  • Field-level execution timing (when instrumented)
  • HTTP request/response metadata
  • Error details with redaction based on configuration

The exporter implements an LRU cache to store recent spans, keyed by parent span ID. This allows field-level statistics to be aggregated before export.

Apollo OTel Traces

The router can send traces in OpenTelemetry format to Apollo’s OTLP endpoint, enabling trace visualization in Apollo Studio with full OpenTelemetry semantics. These traces include all of the details that exist in the Apollo Usage Report Traces, but also contain additional spans that capture internal router behavior.

The percentage of traces sent via OTel versus those sent via Apollo Usage Reports can be configured using the telemetry.apollo.otlp_tracing_sampler configuration, which has defaulted to 1 (100% OTel) since router v2.0.0.
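
As a sketch, the default described above corresponds to the following configuration:

telemetry:
  apollo:
    # 1 sends 100% of sampled traces to Apollo via OTLP;
    # lower values send the remainder as Apollo usage report traces
    otlp_tracing_sampler: 1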

Standard OTel Traces

This exporter exports traces to any OpenTelemetry collector using standard OTLP protocol. This can be used to send traces to services like Datadog, Jaeger, or your own OpenTelemetry Collector.

Zipkin Traces

The router also supports exporting traces in Zipkin format for compatibility with Zipkin-based tracing systems.

Managing Cardinality

As mentioned previously, the version of the OTel library currently used by the router has a hard-coded cardinality limit of 2000. If you notice that some metric data points or attributes are missing, the likely reason is that you have hit this limit.

The first step to fixing this is confirming that you have hit the limit. You can do this in a number of ways:

  • The router emits the apollo.router.telemetry.metrics.cardinality_overflow metric. If the value of this metric is greater than 0, you have hit the limit.
  • The router will output log messages that look like:
    • Metrics error: Warning: Maximum data points for metric stream exceeded. Entry added to overflow. Subsequent overflows to same metric until next collect will not be logged.
  • Your metrics will include data points with the attribute otel.metric.overflow=true. These represent measurements that exceeded the cardinality limit. The metric values are preserved but their original attribute labels are lost.

One way to see the apollo.router.telemetry.metrics.cardinality_overflow metric and the otel_metric_overflow attribute is to enable the Prometheus exporter in your router config:

telemetry:
  exporters:
    metrics:
      prometheus:
        enabled: true
        listen: 0.0.0.0:9080
        path: /metrics

When this exporter is enabled, metrics are visible at http://<your router IP address>:9080/metrics. The overflow metric will look like:

apollo_router_telemetry_metrics_cardinality_overflow_total{otel_scope_name="apollo/router"} 3

Metrics with the overflow attribute will look like:

apollo_router_operations_entity_cache_total{otel_metric_overflow="true",otel_scope_name="apollo/router"} 1992

If you have high-cardinality metrics, you may need to update your metric exporter batch config by decreasing scheduled_delay. Metric batches will then be sent more often, so the time window covered by each batch is less likely to hit the cardinality limit. However, reducing scheduled_delay to values lower than 1 second can result in dropped metrics, as you may be sending metrics faster than they can be received.
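
Here is a minimal sketch of that change for the standard OTLP metrics exporter (the 2s value is illustrative, and the same batch_processor option exists on the other metrics exporters shown earlier):

telemetry:
  exporters:
    metrics:
      otlp:
        batch_processor:
          # Export more frequently so each batch covers a shorter window;
          # values below 1s risk dropping metrics, as noted above
          scheduled_delay: 2s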

For very high cardinality attributes, you may reach the cardinality limit extremely quickly. In this case, the best way to work around the OTel cardinality limit is to audit the attributes that you are adding to your metrics using your router config. By default, the router does not emit high-cardinality attributes, so you are unlikely to see an overflow unless you have customized the attributes in some way.

A detailed description of the possible configuration for metrics and attributes can be found in the Instruments documentation, but an example of a potentially problematic configuration is:

telemetry:
  instrumentation:
    instruments:
      default_requirement_level: required
      router:
        http.server.request.body.size:
          attributes:
            client_name:
              request_header: "apollographql-client-name"
            client_version:
              request_header: "apollographql-client-version"

This configuration adds the client name and version as attributes to all http.server.request.body.size metrics. If your client names or versions are highly unique, or you’re using some other request header that has a large number of distinct values, you will end up with a separate time series for each combination of header values. The fact that the attributes are on a histogram metric exacerbates the issue further, since a data point is emitted for each histogram bucket:

http_server_request_body_size_bytes_bucket{client_name="some-client",client_version="some-version",http_request_method="POST",http_response_status_code="200",server_address="1.2.3.4",server_port="4000",otel_scope_name="apollo/router",le="0.001"} 0
http_server_request_body_size_bytes_bucket{client_name="some-client",client_version="some-version",http_request_method="POST",http_response_status_code="200",server_address="1.2.3.4",server_port="4000",otel_scope_name="apollo/router",le="0.005"} 0
http_server_request_body_size_bytes_bucket{client_name="some-client",client_version="some-version",http_request_method="POST",http_response_status_code="200",server_address="1.2.3.4",server_port="4000",otel_scope_name="apollo/router",le="0.015"} 0
etc.

The general recommendation is that any attributes you add to metrics should have a well-constrained list of values and should definitely not come directly from a value that a user of your graph specifies.
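
For comparison, here is a hedged sketch of the earlier example restricted to well-bounded attributes. The standard attribute names below follow the OpenTelemetry HTTP semantic conventions; check the Instruments documentation for the full list available in your router version:

telemetry:
  instrumentation:
    instruments:
      router:
        http.server.request.body.size:
          attributes:
            # Bounded value sets: a handful of HTTP methods and status codes
            http.request.method: true
            http.response.status_code: true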

Apollo’s Cardinality Protection

After metrics are sent to Apollo GraphOS, we also implement our own cardinality protection. When certain cardinality thresholds are exceeded, Apollo may replace high-cardinality attribute values with the value # CardinalityLimitExceeded.

This protection can affect:

  • Client names
  • Client versions
  • Operation names

If you observe # CardinalityLimitExceeded appearing in your GraphOS Studio metrics, it indicates that your graph has exceeded Apollo’s cardinality limits for that dimension. This is a protective measure to ensure the stability and performance of Apollo’s aggregation infrastructure.

If you see this, here are some things you can check:

  1. Review your client identification strategy – are you inadvertently creating unique client names/versions per request? Ensure your client instrumentation is correctly identifying client name and version (not using request IDs, timestamps, or GUIDs).
  2. Consider consolidating operation names – dynamically generated operation names create unbounded cardinality.
  3. Pass argument values as GraphQL variables rather than hard-coding strings or numbers in the operation body. The entire operation body is hashed to generate a unique ID, so the following queries will create two distinct query IDs:
query AuthorQuery {
  author(id: "1") {
    name
  }
}
query AuthorQuery {
  author(id: "2") {
    name
  }
}

If you pass this value as a variable instead, the query ID will be the same for both executions:

query AuthorQuery($authorId: ID!) {
  author(id: $authorId) {
    name
  }
}
  4. When using variables, ensure that the variable names are static and not unique like $authorId_<some GUID>, as variable names form part of the query body and contribute to the hash.

Conclusion

By understanding the OpenTelemetry foundation, the various exporters, and the configuration options available, you can build a robust observability strategy that scales with your GraphQL infrastructure. The key is balancing telemetry completeness with cardinality management, using the batch processor and router configuration to tune performance for your specific workload.

To learn more about GraphOS Router telemetry capabilities check out our documentation.

Written by

Nick Marsh
