Graph Security

Best practices and GraphOS features for securing GraphQL APIs

GraphQL APIs benefit from the same standard methods you use to reduce the attack surface of any API. In addition, there are GraphQL-specific actions your organization should take to limit your graph's exposure to potential threats. These threats are mostly related to denial-of-service (DoS) attacks, and they fall under the categories of API discoverability and malicious operations.

By following the best practices outlined below, you can deploy a defense-in-depth strategy using the security features of the GraphOS Platform and the GraphOS Router.

API discoverability

One of the most important ways to protect a GraphQL API against would-be attackers is to limit the API's discoverability in production. Although the inherent discoverability of a GraphQL API enhances developer experience when working locally, it's best to restrict discoverability in a production environment for non-public APIs. The following sections explore some of the key ways to limit API discoverability.

Turn off introspection in production

GraphQL's built-in introspection query is the fastest way for bad actors to learn about your schema. To prevent this, turn off introspection in your production graph and limit access to any staging environments where introspection is enabled.

This video about GraphQL API abuse provides a deep dive into how introspection can facilitate API exploitation and why it's important to layer additional measures to limit discoverability.

Note: When using the GraphOS Router, introspection is turned off by default.
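Some servers in your stack may have no setting for disabling introspection. In that case, one fallback is to reject operations that select the introspection meta-fields `__schema` or `__type`. The TypeScript below is a minimal sketch that works on a hypothetical, already-parsed selection tree. `SelectedField` is an assumed simplified type, not the graphql-js AST. The sketch only inspects root fields and ignores fragments, so treat it as an illustration rather than a complete guard. If you run Apollo Server, its `introspection` constructor option is the simpler route.

```typescript
// Hypothetical, simplified representation of a parsed field selection.
// This is NOT the graphql-js AST; fragments and directives are omitted.
interface SelectedField {
  fieldName: string;
  alias?: string;
  selections?: SelectedField[];
}

// Meta-fields that exist only to serve introspection queries.
// `__typename` is deliberately not listed: clients rely on it routinely.
const INTROSPECTION_ROOT_FIELDS: ReadonlySet<string> = new Set(["__schema", "__type"]);

// True if any root-level field is an introspection meta-field.
// The check uses the underlying field name, so aliases can't hide it.
function isIntrospectionOperation(rootFields: SelectedField[]): boolean {
  return rootFields.some((field) => INTROSPECTION_ROOT_FIELDS.has(field.fieldName));
}

// Throws before execution when introspection is disabled for this environment.
function assertIntrospectionAllowed(rootFields: SelectedField[], introspectionEnabled: boolean): void {
  if (!introspectionEnabled && isIntrospectionOperation(rootFields)) {
    throw new Error("Introspection is disabled");
  }
}
```

The check uses the underlying field name rather than the alias. As a result, a probe such as `s: __schema { types { name } }` is still caught, while `__typename` remains available to clients.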

Obfuscate error details in production

Many GraphQL servers improve developer experience by providing detailed error information in operation responses. Be sure to remove verbose error details from API responses in your production graph.

For example, by default, an Apollo-Server-based subgraph provides the exception.stacktrace property under the errors key in a response. This value is useful while developing and debugging your server, but you shouldn't expose stack trace details to public-facing clients.

In your production environment, you might want to selectively expose error details to clients. You can do this by combining the GraphOS Router's include_subgraph_errors option with Rhai scripts for response manipulation.

Note: The GraphOS Router omits all error data by default.
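On a subgraph or standalone server, you can apply the same principle at the point where errors are serialized. The sketch below assumes a simplified error shape: `ClientError` is an assumed type that loosely matches entries of a response's `errors` array. In production, the sketch keeps only an allowlisted message and `extensions.code`. The allowlisted codes are a hypothetical choice. In Apollo Server, logic like this would typically go in the `formatError` option.

```typescript
// Simplified, assumed shape of one entry in a GraphQL response's `errors` array.
interface ClientError {
  message: string;
  path?: ReadonlyArray<string | number>;
  extensions?: { code?: string; [key: string]: unknown };
}

// Hypothetical allowlist: codes whose messages are safe to show to clients.
const CLIENT_SAFE_CODES: ReadonlySet<string> = new Set(["BAD_USER_INPUT", "UNAUTHENTICATED", "FORBIDDEN"]);

function maskError(error: ClientError, isProduction: boolean): ClientError {
  if (!isProduction) {
    // Keep full details (including stack traces) for local debugging.
    return error;
  }
  const code = error.extensions?.code;
  const safe = code !== undefined && CLIENT_SAFE_CODES.has(code);
  return {
    // Replace messages that might leak internals with a generic one.
    message: safe ? error.message : "Internal server error",
    path: error.path,
    // Drop everything in `extensions` (such as stack traces) except the code.
    extensions: safe ? { code } : { code: "INTERNAL_SERVER_ERROR" },
  };
}
```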

Avoid autogenerating schemas

Another strategy to reduce your GraphQL API's discoverability is to avoid autogenerating schemas, especially the fields on the root operation types. Many developer tools enable you to autogenerate a GraphQL schema based on a set of initial object type definitions in a schema or existing database tables. Although these approaches to schema generation can speed up initial API development, they also make it easier for bad actors to guess generic CRUD-related fields based on commonly used patterns.

An autogenerated schema also increases the risk of accidentally exposing sensitive data. As a schema design best practice, you should deliberately design your schema to serve client use cases and product requirements. Intentional, demand-driven schema design helps your organization get the most out of your graph.

Allow only the router to query subgraphs

As a best practice for supergraphs, only the router should query individual subgraphs directly. The Apollo Federation subgraph specification states that each subgraph schema includes _entities and _service root fields on the Query type to help with composition and query planning. These fields expose the subgraph to additional security concerns if a client accesses them directly:

The Query._service object includes an sdl field, which includes the full SDL representation of the subgraph's schema. This field exposes as much data about a subgraph's schema as a standard introspection query, which means it should not be accessible in production.
The Query._entities field enables the router to resolve fields of any type marked with @key by providing the ID for that entity. If this field is exposed publicly, any client can circumvent internal resolver logic and fetch any entity data by mimicking the router. Because your subgraph library automatically provides the resolver for _entities, you can't modify that logic. That means you would have to manually check all operations that include the _entities field and block any malicious operations.

Another reason to restrict access to subgraphs is related to the collection of field-level traces. This tracing data is included in the extensions key of a subgraph's response to the router, where the data is aggregated into a trace shape based on the query plan and then sent to GraphOS. That means any client that can query your subgraphs directly can see this data in the operation response and make inferences about a subgraph based on it.

Aside from the above security concerns, restricting direct access to subgraphs offers another benefit. It ensures that clients, including well-meaning ones, route all operations to the consolidated graph. This prevents them from inventing unintended use cases for subgraph types and fields meant only for executing the router's query plan.
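Besides network-level controls such as private networking, a common way to enforce this is a shared secret. The router attaches a secret header to every subgraph request, and each subgraph rejects requests that lack it. The sketch below shows only the subgraph-side check. The header name `x-router-secret` and the way the secret is provisioned are hypothetical. You would still need to configure the router to send the header, which we assume is done through its header configuration.

```typescript
// Hypothetical header the router is configured to add to every subgraph request.
const ROUTER_SECRET_HEADER = "x-router-secret";

// Assumes header names were lower-cased before lookup (HTTP names are case-insensitive).
type HeaderMap = Record<string, string | undefined>;

// Compares every character instead of returning at the first mismatch,
// so response timing leaks less about how much of the secret was guessed.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; treat that as 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

function isFromRouter(headers: HeaderMap, expectedSecret: string): boolean {
  const provided = headers[ROUTER_SECRET_HEADER];
  return provided !== undefined && constantTimeEqual(provided, expectedSecret);
}

// Run before parsing or executing anything, so _entities and _service are
// never reachable by clients that bypass the router.
function guardSubgraphRequest(headers: HeaderMap, expectedSecret: string): void {
  if (!isFromRouter(headers, expectedSecret)) {
    throw new Error("Forbidden: this subgraph accepts requests from the router only");
  }
}
```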

Malicious operations

After implementing measures to limit API discoverability in public-facing environments, the next step is to guard your GraphQL API against malicious operations. The following sections explore a variety of ways to mitigate the impact of malicious operations for any GraphQL API.

Validate and sanitize data

Validating and sanitizing client-submitted data is important for any API, and a graph is no exception. In the GraphQL context, the usual rules for validating and sanitizing untrusted inputs apply when resolving fields based on user-provided inputs. And as previously stated, when clients supply invalid values as operation arguments, the resulting errors should provide as few details as possible in production environments.

A well-designed GraphQL schema can also help guard against injection attacks by codifying validation and sanitization directly into types. For example, enum values can limit what clients can submit for argument values, and custom scalars or directives can also help validate, escape, or normalize values. However, custom scalars should be used carefully because misusing them might create other vulnerabilities, such as a JSON scalar type enabling a NoSQL injection attack.
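As an illustration, the parsing step of a custom scalar can reject malformed input before any resolver sees it. The sketch below is framework-agnostic TypeScript. `parseUsername` stands in for the value-parsing function of a hypothetical `Username` scalar, and the allowed pattern is an assumption you would replace with your own rules.

```typescript
// Hypothetical rule: 3-32 characters of lowercase letters, digits, or underscores.
const USERNAME_PATTERN = /^[a-z0-9_]{3,32}$/;

// What a custom scalar's parse step might do: check the runtime type,
// normalize, then validate. Throwing here surfaces as a validation error
// instead of letting the raw value reach a database query.
function parseUsername(value: unknown): string {
  if (typeof value !== "string") {
    // Objects such as { $ne: null } are rejected outright.
    throw new TypeError("Username must be a string");
  }
  const normalized = value.trim().toLowerCase();
  if (!USERNAME_PATTERN.test(normalized)) {
    // Keep the message generic; don't echo the rejected input back.
    throw new TypeError("Invalid username");
  }
  return normalized;
}
```

Consider an object such as `{ "$ne": null }`. A permissive JSON scalar could forward it unchanged into a NoSQL query, whereas a string-only scalar like this one rejects it.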

Paginate fields where appropriate

Paginating fields is an important mechanism to control how many items a client can request at once. For example, a Posts subgraph might have no problem resolving a thousand total Post objects in this request:

GraphQL

query {
  authors(first: 10) {
    name
    posts(last: 100) {
      title
      content
    }
  }
}

What happens when each field argument grows by an order of magnitude and a hundred thousand Posts are requested?

GraphQL

query {
  authors(first: 100) {
    name
    posts(last: 1000) {
      title
      content
    }
  }
}

When paginating fields, it's important to set a maximum number of items to return in a single response. In the example above, you might want to return a GraphQL error when executing the posts field resolver instead of attempting to return a thousand posts for each of the hundred authors.
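A resolver-side guard can enforce the maximum before any data is fetched. In this sketch, `MAX_PAGE_SIZE`, the default page size of 20, and the `first`/`last` argument names are assumptions taken from the example queries. With these values, the first query passes and the second is rejected at `posts(last: 1000)`.

```typescript
// Hypothetical server-side ceiling for any single page.
const MAX_PAGE_SIZE = 100;

interface PageArgs {
  first?: number;
  last?: number;
}

// Validates pagination arguments before a resolver fetches anything.
// Returns the page size to use, or throws when the client asks for too much.
function resolvePageSize(args: PageArgs, defaultSize = 20): number {
  const requested = args.first ?? args.last ?? defaultSize;
  if (!Number.isInteger(requested) || requested < 0) {
    throw new Error("Page size must be a non-negative integer");
  }
  if (requested > MAX_PAGE_SIZE) {
    throw new Error(`Page size cannot exceed ${MAX_PAGE_SIZE}`);
  }
  return requested;
}
```

Throwing matches the approach described above. Silently clamping to `MAX_PAGE_SIZE` is an alternative, but it can surprise clients that expect exactly the number of items they asked for.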


Authentication and authorization in the router

Enforcing authentication and authorization in the router protects your underlying APIs from malicious operations. Dropping unauthenticated or unauthorized operations at the entry point of your supergraph frees your downstream services to process only valid requests, which reduces load and improves performance.

Hardening access to your supergraph at the router also adds another layer of security when implementing zero-trust and defense-in-depth strategies. The router centralizes authentication and authorization logic, which downstream services can reinforce with their own checks.

To enforce authentication and authorization in the router, refer to:

JSON Web Token (JWT) authentication.
Access control to fields and types with authorization directives.
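The router handles these checks declaratively, but the underlying logic of scope-based authorization is simple to state. The TypeScript below is a sketch, not the router's implementation. It takes scopes from a JWT whose signature we assume was already verified, which is out of scope here. It then checks those scopes against a list of alternative scope sets: holding every scope in any one set grants access. This OR-of-ANDs shape is modeled on the `@requiresScopes` directive. The scope names and the space-delimited `scope` claim are assumptions.

```typescript
// Claims from a JWT whose signature has ALREADY been verified elsewhere.
interface VerifiedClaims {
  sub: string;
  scope?: string; // assumed OAuth-style space-delimited scopes, e.g. "read:posts write:posts"
}

// Each inner array lists scopes that must ALL be present;
// access is granted if ANY inner array is satisfied.
type ScopeRequirement = ReadonlyArray<ReadonlyArray<string>>;

function hasRequiredScopes(claims: VerifiedClaims | undefined, required: ScopeRequirement): boolean {
  if (claims === undefined) {
    // Unauthenticated requests never satisfy a scope requirement.
    return false;
  }
  const granted = new Set((claims.scope ?? "").split(" ").filter((s) => s.length > 0));
  // An empty requirement list denies access (fail closed).
  return required.some((allOf) => allOf.every((scope) => granted.has(scope)));
}
```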

Set operation limits

GraphQL enables clients to traverse a graph and express complex relationships between the nodes in an operation's selection set. However, this can quickly overwhelm backing data sources without guardrails to limit query depth. For example:

GraphQL

query DeepBlogQuery {
  author(id: 42) {
    posts {
      author {
        posts {
          author {
            posts {
              author {
                # and so on...
              }
            }
          }
        }
      }
    }
  }
}

One of the most straightforward protections against deeply nested operations such as this one is to set a maximum query depth. And because an operation can specify multiple root fields, you may also consider limiting query breadth at the root level.

The GraphOS Router supports limiting requests with configurations like max_depth, max_root_fields, and more.
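To make these two limits concrete, here is a minimal TypeScript sketch that measures depth and root-field count over a hypothetical, simplified selection tree. It assumes fragments have already been inlined. This is not the router's algorithm, and the router may count depth differently. The `maxDepth` and `maxRootFields` names only echo the router's `max_depth` and `max_root_fields` settings.

```typescript
// Hypothetical simplified selection tree (fragments already inlined).
interface SelectionNode {
  fieldName: string;
  selections?: SelectionNode[];
}

interface OperationLimits {
  maxDepth: number;
  maxRootFields: number;
}

// Depth of a field: 1 for a leaf, otherwise 1 + the depth of its deepest child.
function fieldDepth(node: SelectionNode): number {
  const children = node.selections ?? [];
  return 1 + children.reduce((deepest, child) => Math.max(deepest, fieldDepth(child)), 0);
}

// Returns the limits an operation violates (empty when it is acceptable).
function checkOperationLimits(rootFields: SelectionNode[], limits: OperationLimits): string[] {
  const violations: string[] = [];
  if (rootFields.length > limits.maxRootFields) {
    violations.push(`root fields ${rootFields.length} > ${limits.maxRootFields}`);
  }
  const depth = rootFields.reduce((deepest, field) => Math.max(deepest, fieldDepth(field)), 0);
  if (depth > limits.maxDepth) {
    violations.push(`depth ${depth} > ${limits.maxDepth}`);
  }
  return violations;
}
```

Under these assumptions, the `DeepBlogQuery` shape exceeds any small `maxDepth` after a few levels of `author { posts { ... } }`.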

Consider operation costs when setting limits

For GraphQL APIs consumed by third-party clients, pagination and operation limits may not provide enough demand control. For these cases, rate-limiting API requests may be warranted.

Enforcing rate limits for a GraphQL API is more complicated than for a REST API because GraphQL operations can vary widely in size and complexity. Therefore, rate limits shouldn't be based on request counts alone. Instead, they should account for how much of the graph an operation may traverse in a single request.

The GraphOS Router lets you protect your graph from high-cost operations by calculating operation costs and using them to configure demand control.
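A rough static cost estimate multiplies costs by the list sizes taken from pagination arguments. The sketch below uses a simple weighting: composite fields cost 1, scalar fields cost 0, and list fields multiply by their requested size. These weights, the `CostNode` shape, and the budget are assumptions for illustration, not the router's exact cost algorithm.

```typescript
// Hypothetical field description annotated with type information.
interface CostNode {
  fieldName: string;
  isComposite: boolean; // object, interface, or union type
  listSize?: number; // from `first`/`last`; omitted for non-list fields
  selections?: CostNode[];
}

// Cost of one field: (own weight + cost of its selections) x list size.
function estimateCost(node: CostNode): number {
  const ownWeight = node.isComposite ? 1 : 0;
  const childCost = (node.selections ?? []).reduce((sum, child) => sum + estimateCost(child), 0);
  return (node.listSize ?? 1) * (ownWeight + childCost);
}

function operationCost(rootFields: CostNode[]): number {
  return rootFields.reduce((sum, field) => sum + estimateCost(field), 0);
}

// Hypothetical budget; rejection happens before any subgraph is called.
const MAX_OPERATION_COST = 5000;

function assertAffordable(rootFields: CostNode[]): number {
  const cost = operationCost(rootFields);
  if (cost > MAX_OPERATION_COST) {
    throw new Error(`Operation cost ${cost} exceeds limit ${MAX_OPERATION_COST}`);
  }
  return cost;
}
```

The earlier pagination examples show how the weights play out. The first query costs 10 × (1 + 100) = 1,010, and the second costs 100 × (1 + 1,000) = 100,100. A budget of a few thousand therefore admits the first and rejects the second.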

Safelisting with persisted queries

Beyond operation limits, GraphOS enables first-party apps to register trusted operations in a persisted query list (PQL) or safelist. The GraphOS Router then checks incoming requests against the PQL and either rejects unregistered operations or allows registered ones.

In addition to the security benefits, safelisting can improve performance by enabling clients to request operations by their PQL-specified ID.
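Conceptually, the safelist check is a lookup. The TypeScript sketch below models a persisted query list as a map from ID to operation body. It accepts a request only if the request names a registered ID or sends a body that exactly matches a registered one. The request shape (`id` or `query`) and the example entry are hypothetical, and the router's real matching rules may differ.

```typescript
// Hypothetical incoming request: either a PQL ID or a full operation string.
interface IncomingOperation {
  id?: string;
  query?: string;
}

class SafelistChecker {
  private readonly byId: ReadonlyMap<string, string>;
  private readonly bodies: ReadonlySet<string>;

  constructor(entries: ReadonlyArray<{ id: string; body: string }>) {
    this.byId = new Map(entries.map((e) => [e.id, e.body] as const));
    this.bodies = new Set(entries.map((e) => e.body));
  }

  // Returns the operation text to execute, or throws for unregistered operations.
  resolve(op: IncomingOperation): string {
    if (op.id !== undefined) {
      const body = this.byId.get(op.id);
      if (body === undefined) {
        throw new Error("Unknown persisted query ID");
      }
      return body;
    }
    // Full-text requests must match a registered body exactly.
    if (op.query !== undefined && this.bodies.has(op.query)) {
      return op.query;
    }
    throw new Error("Operation is not on the safelist");
  }
}
```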

Batched requests

Batched requests are another potential attack vector for malicious operations. There are two different flavors of batching attacks to consider. The first threat is related to GraphQL's inherent ability to "batch" requests by allowing multiple root fields in an operation document:

GraphQL

query {
  astronaut(id: "1") {
    name
  }
  second: astronaut(id: "2") {
    name
  }
  third: astronaut(id: "3") {
    name
  }
}

Without any restrictions in place, clients could effectively enumerate through all nodes in a single request like the one above while slipping past other brute-force protections. Limiting query breadth or using operation cost analysis can help protect a GraphQL API from this abuse.
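A check that complements a cap on root fields is to limit how many times the same field can be selected under different aliases. This targets the enumeration pattern above directly. The sketch below works on a hypothetical flat list of root selections, and the per-field limit is an assumption.

```typescript
// Hypothetical root selection: the schema field name plus an optional alias.
interface RootSelection {
  fieldName: string;
  alias?: string;
}

// Counts selections per underlying field, ignoring aliases, and returns
// every field that is selected more often than allowed.
function findAliasAbuse(rootFields: RootSelection[], maxPerField: number): string[] {
  const counts = new Map<string, number>();
  for (const field of rootFields) {
    counts.set(field.fieldName, (counts.get(field.fieldName) ?? 0) + 1);
  }
  const offenders: string[] = [];
  counts.forEach((count, fieldName) => {
    if (count > maxPerField) {
      offenders.push(fieldName);
    }
  });
  return offenders;
}
```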

Another form of batching occurs when a client sends batches of full operations in a single request, which can be helpful for performance reasons in some scenarios. In this form of batching, clients send an array of operations and your GraphQL service or the router sends back an array of responses to be parsed by the client:

JSON

[
  {
    "operationName": "FirstAstronaut",
    "variables": {},
    "query": "query FirstAstronaut {\n  astronaut(id: \"1\") {\n    name\n  }\n}\n"
  },
  {
    "operationName": "SecondAstronaut",
    "variables": {},
    "query": "query SecondAstronaut {\n  astronaut(id: \"2\") {\n    name\n  }\n}\n"
  },
  {
    "operationName": "ThirdAstronaut",
    "variables": {},
    "query": "query ThirdAstronaut {\n  astronaut(id: \"3\") {\n    name\n  }\n}\n"
  }
]

With batched operations, it's important to consider how an entire batch affects rate limit calculations and query cost analysis, so that clients can't circumvent rate limits through race conditions.
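One way to keep a batch from slipping past a per-request limit is to price the whole batch up front. The batch is then charged against the client's budget in a single step, so its operations can't each pass the check independently. The sketch below is a simplified fixed-window budget. The budget, the window length, and the per-operation costs (however you compute them) are hypothetical.

```typescript
// Simplified fixed-window budget per client; all numbers are hypothetical.
class BatchBudget {
  private readonly spent = new Map<string, { windowStart: number; used: number }>();
  private readonly budgetPerWindow: number;
  private readonly windowMs: number;

  constructor(budgetPerWindow: number, windowMs: number) {
    this.budgetPerWindow = budgetPerWindow;
    this.windowMs = windowMs;
  }

  // Charges the summed cost of every operation in the batch at once:
  // either the whole batch fits and is charged, or nothing is charged.
  // The check and the update run synchronously, so within one Node.js process
  // no other request can interleave between them. A multi-instance deployment
  // would need an atomic operation in a shared store instead.
  tryCharge(clientId: string, operationCosts: number[], now: number): boolean {
    const total = operationCosts.reduce((sum, cost) => sum + cost, 0);
    const entry = this.spent.get(clientId);
    const current = entry !== undefined && now - entry.windowStart < this.windowMs
      ? entry
      : { windowStart: now, used: 0 };
    if (current.used + total > this.budgetPerWindow) {
      return false; // reject the entire batch
    }
    this.spent.set(clientId, { windowStart: current.windowStart, used: current.used + total });
    return true;
  }
}
```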

Finally, beyond batching fields and operations, some forms of GraphQL-related batching can help mitigate DoS attacks and generally make your API more performant. Even with depth limiting in place, GraphQL operations can easily lead to exponential growth of requests to backing data sources. DataLoaders are one way to help make as few requests as possible to backing data sources from resolver functions in a single operation.
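The core idea behind DataLoader is to collect the keys requested while resolving one level of an operation and then fetch them in a single call. The sketch below is a deliberately stripped-down, synchronous stand-in with an explicit `dispatch()` step. The real `dataloader` package schedules dispatch automatically and returns promises, so don't read this as its API. `KeyBatcher` and the batch function in the usage are hypothetical.

```typescript
// Hypothetical batch function: fetches many records in one round trip.
type BatchFn<K, V> = (keys: K[]) => Map<K, V>;

// Create one instance per operation so cached results never leak across requests.
class KeyBatcher<K, V> {
  private readonly pending = new Set<K>(); // de-duplicated keys requested so far
  private readonly cache = new Map<K, V>(); // results already fetched for this operation
  private readonly batchFn: BatchFn<K, V>;

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  // Registers interest in a key; nothing is fetched yet.
  request(key: K): void {
    if (!this.cache.has(key)) {
      this.pending.add(key);
    }
  }

  // Issues ONE backend call covering every pending key.
  dispatch(): void {
    if (this.pending.size === 0) {
      return;
    }
    const keys: K[] = [];
    this.pending.forEach((key) => keys.push(key));
    this.pending.clear();
    this.batchFn(keys).forEach((value, key) => this.cache.set(key, value));
  }

  get(key: K): V | undefined {
    return this.cache.get(key);
  }
}
```

Suppose five `posts` resolvers ask for authors 1, 2, 1, 3, and 2. This batcher turns those requests into a single backend call for `[1, 2, 3]`.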

Set timeouts

Timeouts are another useful tool for stopping GraphQL operations that consume more server resources than expected. In a supergraph, timeouts are commonly applied at any combination of three different levels:

At the highest level, you can set a timeout on the router's HTTP server or an idle timeout on a load balancer in front of it.
At an intermediate level, you can set a timeout on the router's requests to individual subgraphs. You can configure timeouts at both the HTTP and subgraph levels using the GraphOS Router's traffic shaping configuration.
At the most granular level, subgraphs can set a timeout for individual operations. The duration of the request can be checked against this timeout as each field resolver function is called. You might accomplish this using resolver middleware or an Apollo Server plugin in a subgraph.
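The most granular option can be sketched as a wrapper that checks a per-operation deadline before each resolver runs. In this TypeScript sketch, `OperationContext`, its `deadline` field, `withDeadline`, and `createContext` are hypothetical names. In a subgraph, you might set the deadline when a request starts and apply the wrapper through resolver middleware. The clock is injectable to keep the example deterministic.

```typescript
// Hypothetical per-operation context carrying an absolute deadline (ms since epoch).
interface OperationContext {
  deadline: number;
}

type Resolver<Args, Result> = (args: Args, ctx: OperationContext) => Result;

// Wraps a resolver so it refuses to start once the operation's deadline has passed.
// It does not interrupt a resolver that is already running.
function withDeadline<Args, Result>(
  resolver: Resolver<Args, Result>,
  now: () => number = Date.now,
): Resolver<Args, Result> {
  return (args, ctx) => {
    if (now() > ctx.deadline) {
      throw new Error("Operation timed out");
    }
    return resolver(args, ctx);
  };
}

// Called once per operation, when the request starts.
function createContext(timeoutMs: number, now: () => number = Date.now): OperationContext {
  return { deadline: now() + timeoutMs };
}
```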

Additional best practices

Securing your GraphQL API involves more than just blocking bad actors and safeguarding private data. You also need visibility into who is using your API and how. With GraphOS' schema registry, access controls, and observability features, you can control who changes your API, track usage, and receive alerts when something goes wrong.

Know who's using your graph (and how)

To improve trace insights, it's a best practice to require every client to identify itself and assign a name to every operation it executes. Apollo Client's web and mobile SDKs provide straightforward APIs for setting custom headers for a client's name and version. These help you segment traces and metrics in GraphOS by client. Other API clients can set the apollographql-client-name and apollographql-client-version request headers manually to provide client awareness. Client awareness also helps you identify which clients might be impacted by a proposed breaking change to your API when running schema checks.
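For clients that don't use Apollo Client, adding client awareness takes a few lines per request. The sketch below builds the two headers named above and a JSON request body for a named operation. The client name, version, and operation are hypothetical. The HTTP call itself (for example, with `fetch`) is left out so the snippet stays self-contained.

```typescript
// Client awareness headers; the two apollographql-* names come from the text above.
function clientAwarenessHeaders(clientName: string, clientVersion: string): Record<string, string> {
  return {
    "content-type": "application/json",
    "apollographql-client-name": clientName,
    "apollographql-client-version": clientVersion,
  };
}

// Always send an operationName so traces and metrics can be grouped per operation.
function buildRequestBody(operationName: string, query: string, variables: Record<string, unknown> = {}): string {
  return JSON.stringify({ operationName, query, variables });
}

// Usage sketch (not executed here):
// fetch(url, { method: "POST", headers: clientAwarenessHeaders("ios-app", "2.3.1"),
//   body: buildRequestBody("GetAstronaut", "query GetAstronaut { astronaut(id: \"1\") { name } }") });
```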

Tracing data in GraphOS can also help you monitor API performance and errors. You can configure alerts to push notifications to your team when something goes wrong, whether it's an increase in requests per minute, changes in your p50, p95, or p99 response times, or errors in operations run against your graph. For example, a notification about a sudden increase in the error percentage might indicate that a bad actor is trying to circumvent introspection that's been turned off and learn about a graph's schema by rapidly guessing and testing different field names. You can also use GraphOS Router's support for OpenTelemetry to integrate with other APM tools.

Restrict write access to your graph

You should manage internal access to your graph as thoughtfully as you manage communication with external clients. GraphOS provides both graph API keys and personal API keys to restrict access to the graphs within an organization. It also supports SSO integration and different member roles so that team members can be assigned appropriate permissions when contributing to the graph.

Beyond member roles, GraphOS also allows certain variants to be designated as protected variants to further restrict who can make changes to their schemas, which is especially important in production environments.

Additional resources

Refer to the Supergraph Architecture Framework's security pillar for principles on securing a federated GraphQL API. For further reading on GraphQL-related security concerns, the OWASP GraphQL Cheat Sheet is an excellent resource to help you review the security posture of your graph. For a more generic resource, the OWASP Top Ten list summarizes some of the most common security risks developers face.
