Launch Apollo Studio

Securing a federated data graph


As data graph contributions and consumption expands across teams in a company, the imperative to secure potential vulnerabilities and protect the graph from bad actors grows as well. The tactics required to secure a GraphQL API overlap with those used to secure any kind of API, but there are some important additional measures to consider due to the client-focused nature of GraphQL. What's more, at an enterprise scale, you'll want to ensure that internal access to your API is managed as thoroughly as access from external clients.

Concerning your GraphQL security posture for a federated data graph, there are three main areas we’ll consider. First, we'll explore the attack surface of GraphQL APIs and effective mitigation strategies. We’ll also discuss a common approach for handling authentication and authorization with a federated data graph. And lastly, we’ll see how managed federation and Apollo Studio can enhance your current approach to application performance monitoring and controlling internal access to your data graph.

GraphQL API attack surface

Generally speaking, many of the standard measures you would implement to reduce the attack surface of any API will apply to a GraphQL API in some way (the OWASP Top Ten list is a good place to start to understand some of the common risks). That said, there are some additional GraphQL-specific considerations you must make to limit your data graph's exposure to potential threats, which will be explored in greater depth below. These threats are mostly related to denial-of-service (DoS) attacks and fall under the categories of API discoverability and malicious queries.

API discoverability

One of the most important ways to protect a GraphQL API against would-be attackers who are trying to learn what vulnerabilities may exist within it is to limit API discoverability in production. While the inherent discoverability of a GraphQL API enhances developer experience when working locally, we typically don't want to offer these same affordances in a production environment for non-public APIs. The following sections explore some of the key ways to limit API discoverability.

Turn off introspection in production

Turning off introspection and access to GraphQL IDEs in production environments (as well as any publicly exposed staging environments) is the first step toward limiting the discoverability of the API because an introspection query is the quickest and easiest way for bad actors to learn about your schema. When using Apollo Server, introspection is turned off by default when the NODE_ENV environment variable is set to production. You may also consider disabling introspection manually for any publicly exposed staging environments, or at a minimum, use URLs for these environment endpoints that are difficult to guess.

This video about GraphQL API abuse provides a deep dive into how introspection can facilitate API exploitation and why it's also important to layer additional measures on top of disabled introspection to limit discoverability.

Obfuscate error details in production

Beyond introspection, many GraphQL servers attempt to enhance developer experience further by providing detailed error information in responses when something goes wrong. In a production environment, you’ll want to ensure that verbose error details are removed from API responses.

For example, Apollo Server provides the exception.stacktrace property under the errors key in a response, which can be useful while developing and debugging your server. In most cases, you shouldn't expose these stack trace details to public-facing clients, so Apollo Server will omit this field if the NODE_ENV variable is set to either production or test by default.

You may also wish to mask other error messages sent back to clients in production environments, which may be done selectively using Apollo Server's formatError option. For example, graphql.js (an underlying dependency of Apollo Server) will provide hints in error responses when fields in an operation appear to be misspelled, so it's a good practice to mask these errors for publicly exposed endpoints to further limit API discoverability.

Avoid autogenerating subgraph schemas

Another important consideration when making a GraphQL API less discoverable is to avoid autogenerating schemas where possible, and particularly for the fields on the root operation types. There are many tools currently available that allow you to autogenerate a GraphQL schema quickly based on a handful of initial object type definitions in a schema or some existing database tables. While these approaches to schema generation may speed up initial API development, they also make it easier for bad actors to guess what kind of generic CRUD-related fields were autogenerated based on commonly used patterns.

What's more, an autogenerated schema increases the risk that you expose sensitive data unintentionally via your GraphQL API. And related to schema design best practices, you've also missed out on an opportunity to design your schema deliberately around client use cases and product requirements, which is one of the most compelling features of GraphQL.

Only allow the gateway to query subgraphs directly

As a best practice for federated data graphs, Apollo recommends that subgraphs services may only be queried by the gateway Apollo Server directly. The Apollo Federation specification outlines that subgraph schemas will have _entities and _service root fields on the Query type to assist with composition and query planning. An sdl field may be queried in the _service selection set and the output of that field will include the SDL representation of the subgraph's schema. Because this query will expose as much data about a subgraph's schema as a standard introspection query would, you should ensure this information isn't accessible to the outside world.

Another reason to restrict access to subgraphs is related to the collection of field-level traces. This tracing data is included in the extensions key of the response from a subgraph to the gateway where the data is aggregated into a trace shape based on the query plan and then sent to Apollo Studio. That means that any client that can query your subgraphs directly would be able to view this data in the operation response and make inferences about a subgraph based on it.

Lastly, beyond these security concerns, preventing the outside world from accessing subgraphs directly also helps you ensure that clients (even well-meaning ones) route operations to the consolidated data graph only and don't invent unintended use cases for the types and fields in a subgraph schema that are solely meant for executing the gateway's query plan.

Malicious queries

Once there are measures in place to limit API discoverability in public-facing environments, the next step in protecting a GraphQL API is to guard it against both intentionally and unintentionally malicious queries. Again, many GraphQL-related vulnerabilities have to do with how an unprotected API may be exploited in DoS attacks, but there are other considerations as well.

In the sections that follow, we will explore measures that can help mitigate the impact of malicious queries for any GraphQL API, such as limiting query depth and amount, paginating list fields where appropriate, validating and sanitizing data, setting timeouts, and guarding against batched query abuse. And for GraphQL APIs with third-party clients, we will also explore using query cost analysis to support rate-limiting.

Limit query depth

GraphQL allows clients to traverse through a data graph and express complex relationships between the nodes in an operation's selection set. But as far as backing data sources are concerned, this can quickly turn into too much of a good thing when there are no guardrails to restrict how deeply queries can be nested. For example:

query DeepBlogQuery {
  author(id: 42) {
    posts {
      author {
        posts {
          author {
            posts {
              author {
                # and so on...
              }
            }
          }
        }
      }
    }
  }
}

One of the easiest things to do is to guard against deeply nested queries such as this one is to set a maximum query depth. This can be done using an Apollo Server plugin in your gateway. And because an operation can specify multiple root fields, you may also consider limiting query breadth at the root level as well.

Paginate fields where appropriate

Paginating fields is another important mechanism to control how many items a client can request at once. For example, a posts subgraph service may have no problem resolving 1,000 post nodes in this request:

query {
  authors(first: 10) {
    name
    posts(last: 100) {
      title
      content
    }
  }
}

But what will happen when the orders of magnitude increase for each field argument and 1,000,000 posts are requested at once?

query {
  authors(first: 100) {
    name
    posts(last: 1000) {
      title
      content
    }
  }
}

When paginating fields, it's important to set a maximum number of items that may be returned in a single response. In the example above, you may wish to return a GraphQL error when executing the posts field resolver instead of attempting to return 1000 posts for each of the 100 authors.

And as mentioned in the federated schema design section, with federation, it’s a best practice to consider how you standardize these pagination characteristics of your fields across subgraphs so that client developers have a predictable experience querying different fields of the API. That’s not to say that all fields should be paginated in the same way (for example, only using cursor-based pagination or only using offset-based pagination), but their arguments and their outputs should provide an intuitive experience on the client-side.

Validate and sanitize data

Validating and sanitizing client-submitted data is important for any API, and a federated data graph is no exception. In general, the usual rules for validation and sanitization of untrusted inputs will apply to GraphQL when resolving fields based on user-provided inputs. And as previously discussed, when users supply invalid values as operation arguments the resulting errors should provide as little details as possible in production environments.

A well-designed GraphQL schema can also help guard against injection attacks by codifying validation and sanitization directly into types. For example, enum values can limit the range of what may be submitted for argument values and custom scalars or directives may be used to validate, escape, or normalize values as well. However, custom scalars should be handled with care as their misuse may create other vulnerabilities, such as a JSON scalar type enabling a NoSQL injection attack.

Set timeouts

Timeouts are another useful tool for stopping GraphQL operations that are consuming more server resources than expected. With federation, timeouts may commonly be applied at three different levels. At the highest level, a timeout may be set on the gateway's HTTP server (or an idle timeout on a load balancer in front of it).

The second level where you may configure timeouts is on the gateway's requests to subgraph services. This can be done by defining a custom fetcher for ApolloGateway in a RemoteGraphQLDataSource as follows:

import { ApolloGateway, RemoteGraphQLDataSource } from "@apollo/gateway";
import fetch from "node-fetch";

const gateway = new ApolloGateway({
  // ...
  buildService({ name, url }) {
    const fetcher = (input, init) => {
      if (init) {
        init.timeout = 3000;
      } else {
        init = { timeout: 3000 };
      }
      return fetch(input, init);
    };
    return new RemoteGraphQLDataSource({ url, fetcher });
  }
});

In the code above, if any request to a subgraph takes longer than three seconds, then the gateway will return an error to the client.

Finally, at a more granular level, subgraphs may set a timeout for individual operations. The duration of the request can be checked against this timeout as each field resolver function is called. This could be accomplished using resolver middleware or an Apollo Server plugin in a subgraph.

Use rate-limiting as needed

Particularly for GraphQL APIs that are consumed by third-party clients, depth and breadth limiting and paginated fields may not provide enough demand control. For these cases, rate-limiting API requests may be warranted. Enforcing rate limits for a GraphQL API is more complicated than a REST API because GraphQL operations may vary widely in size and complexity so the rate limit can't be based on individual requests alone. Instead, we have to think about how much of the data graph an operation may traverse in the context of a single request.

There's no one-size-fits-all approach to implementing rate limits for a GraphQL API. For example, the GitHub GraphQL API sets a maximum node limit as well as a point score based on the field connections in a query. It then counts this score against a maximum of 5,000 points per hour.

The Shopify API, on the other hand, assigns different point values to various types and connection fields (also considering the number of items returned by the connection field), while assigning mutation operations a higher value due to the server resources they typically consume. They then use a leaky bucket algorithm that allocates 50 points per second (up to a maximum of 1,000 points) to accommodate sudden bursts in API traffic from a client.

Both the GitHub and Shopify rate-limiting approaches lead to the very interesting topic of query cost analysis (also known as query complexity analysis). As we can see from these examples, how the cost of a query is analyzed is somewhat complicated and nuanced and should be done in a way that suits the API in question. There are several query cost-related packages on npm that can be added to a GraphQL server, but before using any of them, be sure that the assumptions that these libraries have made on your behalf will make sense for your API.

For example, you may want to set fixed costs for different kinds of nodes, or you may manually set costs on a per type or per-field basis by annotating them with directives (or do some combination of both). Additionally, you may have different considerations for how type complexity (the cost returning the number of fields requested) and response complexity (the cost of providing responses for the requested fields) are handled. Or for a completely different approach that doesn't explicitly count types and fields, you could set and iterate query costs based on field tracing data and set a maximum time budget per query.

Given the potential scope of developing a bespoke query cost analysis solution, you will first want to ensure that your API truly needs it. For data graphs that are consumed by first-party clients only, other demand control mechanisms may be sufficient. If you do need to add comprehensive query cost analysis to your GraphQL API, then the work that IBM has done in this area to develop the GraphQL Cost Directives specification may be instructive. Their work in this area was originally published in a paper (with a supplemental video to highlight some of the key concepts) and is further explored in this series of blog posts.

Batched requests

Batched requests are another potential attack vector for malicious queries. There are two different flavors of batching attacks to consider. The first threat is related to GraphQL's inherent ability to "batch" requests by allowing multiple root fields in an operation document:

query {
  astronaut(id: "1") {
    name
  }
  second:astronaut(id: "2") {
    name
  }
  third:astronaut(id: "3") {
    name
  }
}

Without any restrictions in place, clients could effectively enumerate through all nodes in a single request like the one above while slipping past other brute force protections. Limiting query breadth or using query cost analysis will help protect GraphQL APIs from this type of abuse.

Another form of batching occurs when a client sends batches of full operations in a single request, which can be helpful for performance reasons in some scenarios. When operations are batched, Apollo Server receives an array of operations and sends back an array of responses to be parsed by the client (there's a batch link directly available in Apollo Client to facilitate this):

[
  {operationName: "FirstAstronaut"variables":{},
    "query":"query FirstAstronaut {\n  astronaut(id: \"1\") {\n    name\n  }\n}\n},
  {operationName: "SecondAstronanut"variables":{},
    "query":"query SecondAstronanut {\n  astronaut(id: \"2\") {\n    name\n  }\n}\n},
  {operationName: "ThirdAstronaut"variables":{},
    "query":"query ThirdAstronaut {\n  astronaut(id: \"3\") {\n    name\n  }\n}\n}
]

With batched operations, a key consideration to keep in mind is how they may impact rate limit calculations and query cost analysis to ensure that clients won't be able to cheat rate limits through race conditions.

Finally, beyond batching of fields and operations, some forms of GraphQL-related batching can help mitigate DoS attacks and generally make your API more performant overall. Even with depth limiting in place, GraphQL queries can easily lead to exponential growth of requests to backing data sources. Dataloaders are one way that you can help make as few requests as possible to backing data sources from resolver functions within the context of a single operation.

Authentication and authorization

There is no one "right" way to handle access control with GraphQL, and consolidated data graphs are no exception to this. How authentication and authorization are handled for requests to any APIs you have in use currently will likely influence how you approach these concerns with federation as well.

However, one key question to address with federation is the matter of where you will validate a token sent from a client. For example, you may use middleware to intercept and validate a token in a gateway Apollo Server, add it to the request object, and then get that validated token and add it to the Apollo Server context:

const server = ApolloServer({
  gateway,
  context: ({ req }) => {
    const userId = req.user?.sub || null;
    return { userId };
  }
});

The token can then be forwarded as a header in requests to the subgraph services during query execution and the subgraphs can then authorize field usage as needed using a RemoteGraphQLDataSource:

const gateway = new ApolloGateway({
  // ...
  buildService({ name, url }) {
    return new RemoteGraphQLDataSource({
      url,
      willSendRequest({ request, context }) {
        request.http.headers.set("user-id", context.userId);
      }
    });
  }
});

When taking this approach, a very important consideration here is network security. If you’re decoding and validating the token at the gateway and passing some of that data along as a header, then you will want to make sure that subgraphs aren’t directly accessible to the outside world. Ideally, communication between the gateway and subgraph services should be encrypted as well.

Alternatively, you may opt for passing a token through and validating at the level of each subgraph. There will be a trade-off for security over performance with this approach because you may end up validating the same token multiple times in a single subgraph for particularly complex query plans, but that tradeoff may be worthwhile.

Security with managed federation

Apart from protecting your GraphQL API from bad actors and locking down private data, you also need a window into how your API is being used and by whom to harden your GraphQL security posture. This is where a schema registry and observability tooling (such as Apollo Studio) come into play to help control who makes changes to your API and also monitor API usage and send alerts when something isn’t right.

Know who’s using your graph (and how)

To enhance the utility of traces collected in your observability tooling, it's a best practice to require clients to identify themselves and that they consistently name operations. The web and mobile versions of Apollo Client make it easy to set custom headers for the name and version of a specific client to segment traces and metrics in Apollo Studio by client. Other API clients can set the apollographql-client-name and apollographql-client-version request headers manually to provide client awareness as well. (As a bonus, client awareness also makes it easier to identify what clients may be impacted by a proposed breaking change to your API when running schema checks.)

Additionally, tracing data in Apollo Studio can be used to monitor API performance and errors. You can also configure alerts to push notifications to your team when something goes wrong, whether it’s an increase in requests per minute, changes in your p50, p95, or p99 response times, or errors in operations run against your graph. For example, a notification about a sudden increase in the error percentage may indicate that a bad actor is trying to get around disabled introspection and learn about a data graph's schema by rapidly guessing and testing different field names. And if you want to leverage error data outside of Studio as well, you can also use Apollo Server's support for OpenTelemetry or Apollo Server plugins to integrate with other APM tools.

Restrict write access to your data graph

Just as important as it is to set some limitations around how the outside world interacts with your data graph, internal access to your GraphQL API should be managed thoughtfully as well. Apollo Studio provides both graph API keys and personal API keys to restrict access to the graphs within an organization. It also supports SSO integration and different member roles so that team members can be assigned appropriate permissions when contributing the data graph.

Beyond member roles, Apollo Studio also allows certain variants to be designated as protected variants to further restrict who can make changes to their schemas, which is especially important in production environments.

Summary

Throughout this section, we covered some essential GraphQL-related security considerations and how they are related to federated data graphs. We learned how to make internal APIs less discoverable to the outside world to make it harder for bad actors to exploit them and how to guard against malicious queries to help prevent DoS attacks. We also explored how authentication and authorization may be handled in a federated data graph and saw how Apollo Studio can support your security-minded observability needs.

For further reading on GraphQL-related security concerns, the OWASP GraphQL Cheat Sheet is an excellent resource to help you review the security posture of your data graph.

Edit on GitHub