Collecting metrics in the Apollo Router
The Apollo Router provides built-in support for metrics collection via Prometheus and OpenTelemetry Collector.
Using Prometheus
You can use Prometheus and Grafana to collect and visualize the router's metrics.
telemetry:
  metrics:
    common:
      # (Optional, default to "apollo-router") Set the service name to easily find metrics related to the apollo-router in your metrics dashboards
      service_name: "apollo-router"
      # (Optional)
      service_namespace: "apollo"
    prometheus:
      # By setting this endpoint you enable the prometheus exporter
      # All our endpoints exposed by plugins are namespaced by the name of the plugin
      enabled: true
      listen: 127.0.0.1:9090
      path: /metrics
Using Prometheus in a container environment
The Prometheus endpoint listens on 127.0.0.1 by default, which accepts connections only from the local host. While this is a safe default, other containers can't reach the endpoint, so metrics can't be scraped.
You can change this by setting:
telemetry:
  metrics:
    prometheus:
      # By setting this endpoint you enable other containers and pods to access the prometheus endpoint
      enabled: true
      listen: 0.0.0.0:9090
      path: /metrics
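On the Prometheus side, a scrape job then has to point at the router's address. The fragment below is a minimal sketch of a prometheus.yml scrape job, not router configuration. It assumes the router container is reachable under the hypothetical hostname router; the job name and scrape interval are placeholders to adjust.

```yaml
# prometheus.yml (Prometheus server configuration, not the router's YAML)
scrape_configs:
  - job_name: "apollo-router"      # placeholder job name
    metrics_path: /metrics         # matches the router's `path` setting
    scrape_interval: 15s           # placeholder interval
    static_configs:
      # "router" is a hypothetical hostname for the router container;
      # 9090 matches the router's `listen` port shown above.
      - targets: ["router:9090"]
```

Prometheus itself also listens on port 9090 by default, so if both run on the same host, one of the two ports needs to change.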
Assuming you're running locally:
- Run a query against the router.
- Navigate to http://localhost:9090/metrics to see something like:
# HELP apollo_router_http_request_duration_seconds Total number of HTTP requests made.
# TYPE apollo_router_http_request_duration_seconds histogram
apollo_router_http_request_duration_seconds_bucket{le="0.5"} 1
apollo_router_http_request_duration_seconds_bucket{le="0.9"} 1
---SNIP---
Note that if you haven't run a query against the router yet, you'll see a blank page because no metrics have been generated!
Available metrics
The following metrics are available when using Prometheus. Attributes are listed where applicable.
HTTP
- apollo_router_http_request_duration_seconds_bucket - HTTP router request duration
- apollo_router_http_request_duration_seconds_bucket - HTTP subgraph request duration, attributes:
  - subgraph: (Optional) The subgraph being queried
- apollo_router_http_requests_total - Total number of HTTP requests by HTTP status
- apollo_router_timeout - Number of triggered timeouts
- apollo_router_http_request_retry_total - Number of subgraph requests retried, attributes:
  - subgraph: The subgraph being queried
  - status: If the retry was aborted (aborted)
Session
- apollo_router_session_count_total - Number of currently connected clients
- apollo_router_session_count_active - Number of in-flight GraphQL requests
Cache
- apollo_router_cache_size - Number of entries in the cache
- apollo_router_cache_hit_count - Number of cache hits
- apollo_router_cache_miss_count - Number of cache misses
- apollo_router_cache_hit_time - Time to hit the cache in seconds
- apollo_router_cache_miss_time - Time to miss the cache in seconds

All cache metrics listed above have the following attributes:
- kind: The cache being queried (apq, query planner, introspection)
- storage: The backend storage of the cache (memory, redis)
Performance
- apollo_router_processing_time - Time spent processing a request (outside of waiting for external or subgraph requests) in seconds.
- apollo_router_query_planning_time - Time spent planning queries in seconds.
Uplink
- apollo_router_uplink_fetch_duration_seconds_bucket - Uplink request duration, attributes:
  - url: The Uplink URL that was polled
  - query: The query that the router sent to Uplink (SupergraphSdl or Entitlement)
  - kind: (new, unchanged, http_error, uplink_error)
  - code: The error code depending on type (if an error occurred)
  - error: The error message (if an error occurred)
- apollo_router_uplink_fetch_count_total, attributes:
  - status: (success, failure)
  - query: The query that the router sent to Uplink (SupergraphSdl or Entitlement)
Note that the initial call to Uplink during router startup is not reflected in these metrics.
Using OpenTelemetry Collector
You can send metrics to OpenTelemetry Collector for processing and reporting.
telemetry:
  metrics:
    otlp:
      # Either 'default' or a URL
      endpoint: default
      # Optional protocol. Only grpc is supported currently.
      # Setting to http will result in configuration failure.
      protocol: grpc
      # Optional Grpc configuration
      grpc:
        domain_name: "my.domain"
        key: ""
        ca: ""
        cert: ""
        metadata:
          foo: bar
      # Optional batch_processor configuration
      batch_processor:
        scheduled_delay: 100ms
        max_concurrent_exports: 1000
        max_export_batch_size: 10000
        max_export_timeout: 100s
        max_queue_size: 10000
Remember that the file. and env. prefixes can be used for expansion in the YAML config, for example ${file.ca.txt}.
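As a rough sketch of how expansion might be used in the OTLP configuration, the fragment below reads the CA certificate from the file ca.txt (the file named in the example above) and a metadata value from an environment variable. The variable name MY_OTLP_TOKEN and the metadata key api-key are hypothetical, chosen only for illustration.

```yaml
telemetry:
  metrics:
    otlp:
      endpoint: default
      protocol: grpc
      grpc:
        # Expanded from the contents of the file ca.txt
        ca: ${file.ca.txt}
        metadata:
          # MY_OTLP_TOKEN is a hypothetical environment variable name
          api-key: ${env.MY_OTLP_TOKEN}
```

Keeping certificates and tokens out of the YAML file this way lets the same config be shared across environments.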
Adding custom attributes/labels
You can add custom attributes (OpenTelemetry) and labels (Prometheus) to your generated metrics. You can apply these across all requests, or you can selectively apply them based on the details of a particular request. These details include:
- The presence of a particular HTTP header
- The value at a particular JSON path within a request or response body (either from a subgraph or from the router itself)
- A custom value provided via the router plugin context
Examples of all of these are shown in the file below:
telemetry:
  metrics:
    common:
      attributes:
        supergraph: # Attribute configuration for requests to/responses from the router
          static:
            - name: "version"
              value: "v1.0.0"
          request:
            header:
              - named: "content-type"
                rename: "payload_type"
                default: "application/json"
              - named: "x-custom-header-to-add"
          response:
            body:
              # Apply the value of the provided path of the router's response body as an attribute
              - path: .errors[0].extensions.http.status
                name: error_from_body
              # Use the unique extension code to identify the kind of error
              - path: .errors[0].extensions.code
                name: error_code
          context:
            # Apply the indicated element from the plugin chain's context as an attribute
            - named: my_key
        subgraph: # Attribute configuration for requests to/responses from subgraphs
          all:
            static:
              # Always apply this attribute to all metrics for all subgraphs
              - name: kind
                value: subgraph_request
            errors: # Only work if it's a valid GraphQL error (for example if the subgraph returns an http error or if the router can't reach the subgraph)
              include_messages: true # Will include the error message in a message attribute
              extensions: # Include extensions data
                - name: subgraph_error_extended_type # Name of the attribute
                  path: .type # JSON query path to fetch data from extensions
                - name: message
                  path: .reason
              # Will create this kind of metric for example apollo_router_http_requests_error_total{message="cannot contact the subgraph",service_name="apollo-router",subgraph="my_subgraph_name",subgraph_error_extended_type="SubrequestHttpError"}
          subgraphs:
            my_subgraph_name: # Apply these rules only for the subgraph named `my_subgraph_name`
              request:
                header:
                  - named: "x-custom-header"
                body:
                  # Apply the value of the provided path of the router's request body as an attribute (here it's the query)
                  - path: .query
                    name: query
                    default: UNKNOWN
Example JSON path queries
Let's say you have a JSON request body with the following structure:
{
  "items": [
    {
      "unwanted": 7,
      "wanted": { "x": 3, "y": 7 },
      "array": [3, 2, 1]
    },
    {
      "isImportant": true
    }
  ]
}
To fetch the value of the field isImportant, the corresponding path is .items[1].isImportant.
To fetch the value of the field x, the corresponding path is .items[0].wanted.x.
JSON path queries always begin with a period (.).
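To connect these paths back to the attribute configuration, here is a minimal sketch that applies both of them to the request body of the subgraph named my_subgraph_name, reusing the request body structure from the larger example earlier. The attribute names is_important and wanted_x are hypothetical.

```yaml
telemetry:
  metrics:
    common:
      attributes:
        subgraph:
          subgraphs:
            my_subgraph_name:
              request:
                body:
                  # Value of "isImportant" in the second element of "items"
                  - path: .items[1].isImportant
                    name: is_important   # hypothetical attribute name
                  # Value of "x" inside "wanted" in the first element of "items"
                  - path: .items[0].wanted.x
                    name: wanted_x       # hypothetical attribute name
                    default: UNKNOWN     # used when the path has no value
```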
Changing default buckets for histograms
You can customize the buckets for all generated histograms:
telemetry:
  metrics:
    common:
      buckets:
        - 0.05
        - 0.10
        - 0.25
        - 0.50
        - 1.00
        - 2.50
        - 5.00
        - 10.00
        - 20.00
Adding custom resources
Resources are similar to attributes, but they are more global. They're configured directly on the metrics exporter, which means they're always present on each of your metrics.
As an example, it can be useful to set an environment_name resource to help you identify metrics related to a particular environment:
telemetry:
  metrics:
    common:
      resources:
        environment_name: "production"
See OpenTelemetry conventions for resources.
For example, if you want to use a Datadog agent and specify a service name, you should set the service.name resource in the same way as environment_name above, following the conventions document.
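A sketch of what that could look like, assuming the resources map accepts the dotted service.name key like any other resource name. The value my-router is a placeholder.

```yaml
telemetry:
  metrics:
    common:
      resources:
        # Resource name from the OpenTelemetry conventions;
        # "my-router" is a placeholder value
        service.name: "my-router"
        environment_name: "production"
```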