
Collecting metrics in the Apollo Router


The Apollo Router provides built-in support for metrics collection via Prometheus and the OpenTelemetry Collector.

Common configuration

Service name

router.yaml
telemetry:
  metrics:
    common:
      # (Optional) Set the service name to easily find metrics related to the apollo-router in your metrics dashboards
      service_name: "router"

Service name discovery is handled in the following order:

  1. OTEL_SERVICE_NAME environment variable
  2. OTEL_RESOURCE_ATTRIBUTES environment variable
  3. router.yaml service_name
  4. router.yaml resource

If none of the above are found, the service name is set to unknown_service:router, or to unknown_service if the executable name cannot be determined.
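
Because the environment variable takes precedence over router.yaml, you can override the service name per deployment without touching the config file. A minimal sketch using Docker Compose (the service layout, image tag placeholder, and service name value are illustrative):

docker-compose.yaml
services:
  router:
    image: ghcr.io/apollographql/router:<version>
    environment:
      # Takes precedence over service_name in router.yaml
      OTEL_SERVICE_NAME: "router-production"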

Resource

A Resource is a set of key-value pairs that provide additional information to an exporter. APMs may interpret and display resource information.

router.yaml
telemetry:
  metrics:
    common:
      resource:
        "environment.name": "production"
        "environment.namespace": "${env.MY_K8_NAMESPACE_ENV_VARIABLE}"

Using Prometheus

You can use Prometheus and Grafana to collect and visualize metrics.

router.yaml
telemetry:
  metrics:
    prometheus:
      # By setting this endpoint you enable the Prometheus exporter
      # All our endpoints exposed by plugins are namespaced by the name of the plugin
      enabled: true
      listen: 127.0.0.1:9090
      path: /metrics

Using Prometheus in a container environment

The Prometheus endpoint listens on 127.0.0.1 by default, which doesn't accept connections from other hosts on the network. While this is a safe default, it also prevents other containers and pods from reaching the Prometheus endpoint, which makes metric scraping impossible.

You can change this by setting:

router.yaml
telemetry:
  metrics:
    prometheus:
      # By setting this endpoint you enable other containers and pods to access the Prometheus endpoint
      enabled: true
      listen: 0.0.0.0:9090
      path: /metrics
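
A Prometheus server then needs a scrape job that targets this listen address. Below is a minimal sketch of a scrape configuration, assuming the router is reachable from Prometheus at the hostname router (adjust the target, port, and path to match your deployment):

prometheus.yml
scrape_configs:
  - job_name: "apollo-router"
    # Must match the path configured under telemetry.metrics.prometheus in router.yaml
    metrics_path: /metrics
    static_configs:
      - targets: ["router:9090"]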

Assuming you're running locally:

  1. Run a query against the router.
  2. Navigate to http://localhost:9090/metrics to see something like:
# HELP apollo_router_http_request_duration_seconds Total number of HTTP requests made.
# TYPE apollo_router_http_request_duration_seconds histogram
apollo_router_http_request_duration_seconds_bucket{le="0.5"} 1
apollo_router_http_request_duration_seconds_bucket{le="0.9"} 1
---SNIP---

Note that if you haven't run a query against the router yet, you'll see a blank page because no metrics have been generated!

Datadog via OTLP

To use Datadog, you must configure the Datadog agent to accept OTLP metrics. This can be done by adding the following to your datadog.yaml:

datadog.yaml
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: <dd-agent-ip>:4317

The router must also be configured to send metrics to the Datadog agent:

router.yaml
telemetry:
  metrics:
    otlp:
      enabled: true
      # Temporality MUST be set to delta. Failure to do this will result in incorrect metrics.
      temporality: delta
      # Optional endpoint, either 'default' or a URL (Defaults to http://127.0.0.1:4317)
      endpoint: "${env.DATADOG_AGENT_HOST}:4317"

See Datadog Agent configuration for more details.

Available metrics

The following metrics are available for Prometheus and OpenTelemetry. Attributes are listed where applicable.

HTTP

  • apollo_router_http_request_duration_seconds_bucket - HTTP request duration
  • apollo_router_http_request_duration_seconds_bucket - HTTP request duration, attributes:
    • subgraph: (Optional) The subgraph being queried
  • apollo_router_http_requests_total - Total number of HTTP requests by HTTP status
  • apollo_router_timeout - Number of triggered timeouts
  • apollo_router_http_request_retry_total - Number of requests retried, attributes:
    • subgraph: The subgraph being queried
    • status: If the retry was aborted (aborted)
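
As a worked example, the duration histogram above can be turned into a latency percentile with a standard histogram_quantile query. Here is a sketch of a Prometheus recording rule (the group and record names are arbitrary, and the rule assumes the Prometheus exporter shown earlier):

prometheus-rules.yml
groups:
  - name: apollo-router-latency
    rules:
      # 95th percentile of router HTTP request duration over the last 5 minutes
      - record: apollo_router:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(apollo_router_http_request_duration_seconds_bucket[5m])) by (le))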

Session

  • apollo_router_session_count_total - Number of currently connected clients
  • apollo_router_session_count_active - Number of in-flight GraphQL requests

Cache

  • apollo_router_cache_size - Number of entries in the cache
  • apollo_router_cache_hit_count - Number of cache hits
  • apollo_router_cache_miss_count - Number of cache misses
  • apollo_router_cache_hit_time - Time to hit the cache in seconds
  • apollo_router_cache_miss_time - Time to miss the cache in seconds

All cache metrics listed above have the following attributes:

  • kind: The cache being queried (apq, query planner, introspection)
  • storage: The backend storage of the cache (memory, redis)
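
As an example of combining these counters, a per-cache hit ratio can be computed with a recording rule. This is a sketch under the assumption that the counters are exposed to Prometheus under the names listed above (some exporters append a _total suffix, so check your /metrics output first):

prometheus-rules.yml
groups:
  - name: apollo-router-cache
    rules:
      # Fraction of cache lookups that were hits, per cache kind and storage backend
      - record: apollo_router:cache_hit_ratio
        expr: |
          sum(rate(apollo_router_cache_hit_count[5m])) by (kind, storage)
          /
          (
            sum(rate(apollo_router_cache_hit_count[5m])) by (kind, storage)
            + sum(rate(apollo_router_cache_miss_count[5m])) by (kind, storage)
          )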

Coprocessor

  • apollo_router_operations_coprocessor_total - Total number of operations with coprocessors enabled.
  • apollo_router_operations_coprocessor.duration - Time spent waiting for the coprocessor to answer, in seconds.

The coprocessor operations metric has the following attributes:

  • coprocessor.stage: string (RouterRequest, RouterResponse, SubgraphRequest, SubgraphResponse)
  • coprocessor.succeeded: bool

Performance

  • apollo_router_processing_time - Time spent processing a request (outside of waiting for external or subgraph requests) in seconds.
  • apollo_router_query_planning_time - Time spent planning queries in seconds.
  • apollo_router_query_planning_warmup_duration - Time spent warming up the query planner in seconds.
  • apollo_router_schema_load_duration - Time spent loading the schema in seconds.
  • apollo_router_uplink_fetch_duration_seconds_bucket - Uplink request duration, attributes:

    • url: The Uplink URL that was polled
    • query: The query that the router sent to Uplink (SupergraphSdl or License)
    • kind: (new, unchanged, http_error, uplink_error)
    • code: The error code depending on type (if an error occurred)
    • error: The error message (if an error occurred)
  • apollo_router_uplink_fetch_count_total

    • status: (success, failure)
    • query: The query that the router sent to Uplink (SupergraphSdl or License)

Note that the initial call to Uplink during startup will not be reflected in metrics.

Subscription

  • apollo_router_opened_subscriptions - Number of distinct opened subscriptions (not the number of clients with an opened subscription, since subscriptions can be deduplicated)
  • apollo_router_deduplicated_subscriptions_total - Number of subscriptions that have been deduplicated
  • apollo_router_skipped_event_count - Number of events that have been skipped because too many events were received from the subgraph but not yet sent to the client.

Batching

  • apollo_router.operations.batching - A counter of the number of query batches received by the router.
  • apollo_router.operations.batching.size - A histogram tracking the number of queries contained within a query batch.

Studio

  • apollo.router.telemetry.studio.reports - The number of reports submitted to Studio by the router.
    • type: The type of report submitted: "traces" or "metrics"

Using OpenTelemetry Collector

You can send metrics to the OpenTelemetry Collector for processing and reporting.

router.yaml
telemetry:
  metrics:
    otlp:
      # Enable the OpenTelemetry exporter
      enabled: true
      # Optional endpoint, either 'default' or a URL (Defaults to http://127.0.0.1:4317 for gRPC and http://127.0.0.1:4318 for HTTP)
      endpoint: default
      # Optional protocol. Only grpc is supported currently.
      # Setting to http will result in configuration failure.
      protocol: grpc
      # Optional gRPC configuration
      grpc:
        domain_name: "my.domain"
        key: ""
        ca: ""
        cert: ""
        metadata:
          foo: bar
      # Optional batch_processor configuration
      batch_processor:
        scheduled_delay: 100ms
        max_concurrent_exports: 1000
        max_export_batch_size: 10000
        max_export_timeout: 100s
        max_queue_size: 10000

Remember that the file. and env. prefixes can be used for variable expansion in the configuration YAML, e.g. ${file.ca.txt}.
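
On the collector side, a matching pipeline needs an OTLP receiver and at least one exporter. The following is a minimal sketch of an OpenTelemetry Collector configuration, assuming the collector listens on its default gRPC port and re-exposes the metrics for Prometheus (the exporter choice and ports are illustrative):

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  # Illustrative choice: expose received metrics for a Prometheus server to scrape
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]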

Adding custom attributes/labels

You can add custom attributes (OpenTelemetry) and labels (Prometheus) to your generated metrics. You can apply these across all requests, or you can selectively apply them based on the details of a particular request. These details include:

  • The presence of a particular HTTP header
  • The value at a particular JSON path within a request or response body (either from a subgraph or from the router itself)
  • A custom value provided via the plugin context

Examples of all of these are shown in the file below:

router.yaml
telemetry:
  metrics:
    common:
      attributes:
        supergraph: # Attribute configuration for requests to/responses from the router
          static:
            - name: "version"
              value: "v1.0.0"
          request:
            header:
              - named: "content-type"
                rename: "payload_type"
                default: "application/json"
              - named: "x-custom-header-to-add"
          response:
            body:
              # Apply the value of the provided path of the router's response body as an attribute
              - path: .errors[0].extensions.http.status
                name: error_from_body
              # Use the unique extension code to identify the kind of error
              - path: .errors[0].extensions.code
                name: error_code
          context:
            # Apply the indicated element from the plugin chain's context as an attribute
            - named: my_key
        subgraph: # Attribute configuration for requests to/responses from subgraphs
          all:
            static:
              # Always apply this attribute to all metrics for all subgraphs
              - name: kind
                value: subgraph_request
            errors: # Only works if it's a valid GraphQL error (for example if the subgraph returns an http error or if the router can't reach the subgraph)
              include_messages: true # Will include the error message in a message attribute
              extensions: # Include extensions data
                - name: subgraph_error_extended_type # Name of the attribute
                  path: .type # JSON query path to fetch data from extensions
                - name: message
                  path: .reason
              # Will create this kind of metric for example apollo_router_http_requests_error_total{message="cannot contact the subgraph",subgraph="my_subgraph_name",subgraph_error_extended_type="SubrequestHttpError"}
          subgraphs:
            my_subgraph_name: # Apply these rules only for the subgraph named `my_subgraph_name`
              request:
                header:
                  - named: "x-custom-header"
                body:
                  # Apply the value of the provided path of the router's request body as an attribute (here it's the query)
                  - path: .query
                    name: query
                    default: UNKNOWN

Example JSON path queries

Let's say you have a JSON request body with the following structure:

{
  "items": [
    {
      "unwanted": 7,
      "wanted": { "x": 3, "y": 7 },
      "array": [3, 2, 1]
    },
    {
      "isImportant": true
    }
  ]
}

To fetch the value of isImportant, the corresponding path is .items[1].isImportant.

To fetch the value of x, the corresponding path is .items[0].wanted.x.

JSON path queries always begin with a period (.).
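
To tie this back to the attribute configuration shown earlier, such a path can be used directly in a body rule. A small sketch, assuming the request body has the structure above (the attribute name and default are arbitrary):

router.yaml
telemetry:
  metrics:
    common:
      attributes:
        supergraph:
          request:
            body:
              # Record the isImportant flag from the request body as an attribute
              - path: .items[1].isImportant
                name: is_important
                default: false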

Changing default buckets for histograms

You can customize the buckets for all generated histograms:

router.yaml
telemetry:
  metrics:
    common:
      buckets:
        - 0.05
        - 0.10
        - 0.25
        - 0.50
        - 1.00
        - 2.50
        - 5.00
        - 10.00
        - 20.00