Overview
Our GraphQL API is already equipped to serve up some basic soundtrack data. We can run a query for featured playlists, or ask for one playlist in particular. We can see data about the playlist itself, along with the tracks it contains.
Furthermore, for each track, we can also query data for the Artist that created it. But right now, we're facing a big performance issue with how this is implemented.
In this lesson, we will:
- Learn about the n+1 problem
- Discuss how to resolve it
Playlists, tracks, and artists
To see our performance bottleneck in action, let's run a test query against our GraphQL API.
Make sure the app is running by executing the following command in the root of the project.
./gradlew bootRun
Now, let's navigate to Apollo Sandbox Explorer, and paste in the address of our locally running server in the input at the top of the screen. By default, our server should be running on http://localhost:8080/graphql.
Let's begin our query by selecting the playlist field from our Query type in the Documentation panel. For the playlist we query, we'll request the basics: just an id and name, along with a list of its tracks.
For each Track object in the playlist, we'll return id, name, and durationMs. Then, we'll ask for its artist field. This field returns an Artist type, from which we'll request id, name, followers, genres, and uri.
Here's what our query should look like.
query GetPlaylist($playlistId: ID!) {
  playlist(id: $playlistId) {
    id
    name
    tracks {
      id
      name
      durationMs
      artist {
        id
        name
        followers
        genres
        uri
      }
    }
  }
}
And in the Variables panel:
{"playlistId": "6Fl8d6KF0O4V5kFdbzalfW"}
Let's take this query for a spin and... we get data back! Great. So what's the problem, exactly?
To find out, we'll take a closer look at our terminal where our server is running. Run the query again, and... did you catch that? The terminal filled up with statements logging out:
I am calling GET /artists/{artist_id} for 3GBPw9NK25X1Wt2OUvOwY3
I am calling GET /artists/{artist_id} for 33QmoCkSqADuQEtMCysYLh
I am calling GET /artists/{artist_id} for 6H1RjVyNruCmrBEWRbD0VZ
I am calling GET /artists/{artist_id} for 2JY5qzEozvTdogkDTkkOMf
I am calling GET /artists/{artist_id} for 3WrFJ7ztbogyGnTHbHJFl2
I am making...
We see one line printed out here for each track's artist ID, and each of these represents a single request across the network to our data source. Many more requests than we probably expected from our lean and precise GraphQL query! Let's dive into what's happening here.
Different tracks, same artist
We can try this again with another playlist ID—this time, we'll use one that contains tracks by the same artist. Keeping the query in Sandbox the same, update the Variables panel with the following.
{"playlistId": "5evmObkq06UCWmtlcxK4Ev"}
And when we run the query... three identical requests are being made for the same artist ID!
I am calling GET /artists/{artist_id} for 3WrFJ7ztbogyGnTHbHJFl2
I am calling GET /artists/{artist_id} for 3WrFJ7ztbogyGnTHbHJFl2
I am calling GET /artists/{artist_id} for 3WrFJ7ztbogyGnTHbHJFl2
This is worse than making lots of network requests to the same endpoint: here, we're making multiple identical requests for the same information!
For every artist, a new request
Let's back up and review the endpoints at work in our application.
GET /browse/featured-playlists
GET /playlists/{playlist_id}
GET /playlists/{playlist_id}/tracks
GET /artists/{artist_id}
Our datafetchers call these endpoints when certain pieces of our GraphQL API are requested.
Here's a breakdown of how our query for a single playlist, its tracks, and each track's artist is resolved.
To get our single playlist, our datafetcher first makes a request to the GET /playlists/{playlist_id} endpoint. This returns a big JSON object containing our playlist details, along with data for each of its tracks.
But we need more granular detail for each track's primary artist! This means for each track in the playlist, we make a request to GET /artists/{artist_id} using the track's primary artist ID.
This extra request gets us the artist data we need, but at a cost: the Track.artist datafetcher is executed for every track in the query response, as expected, but this means it calls the REST API endpoint for each track's artist ID. So depending on how many tracks there are, we might have a lot of extra requests to the API on our hands!
The n+1 problem
This is the n+1 problem in action. We start with an initial request (the 1 in the n+1 equation), and this first request determines how many follow-up requests will be necessary (the n in the n+1 equation). The number of required follow-up requests, n, is not known until our first request is executed.
We saw this in action: our first request gave us our playlist and its associated tracks, but we then needed a follow-up request per track to get the track's associated artist data.
This doesn't look too bad with just one or two additional requests, but it leads to some troubling situations as our queries scale. Imagine our playlist has fifty tracks; this means we'll send a total of 51 requests! One request to fetch playlist and track data, and 50 additional requests to get the artist information for each track!
Even worse, this can also lead to duplicate requests. A playlist could contain multiple tracks by the same artist, but the Track.artist datafetcher doesn't know the difference; it will still call the data source for every track, resulting in multiple identical requests for the same artist.
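To make the request count concrete, here is a minimal sketch of the naive resolution path in plain Java. It is not the course's datafetcher code: NPlusOneDemo and its methods are hypothetical stand-ins that only record which requests would be sent, so we can count them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the REST client: it records requests instead of sending them.
class NPlusOneDemo {
    static final List<String> requestLog = new ArrayList<>();

    // Simulates GET /playlists/{playlist_id}: one request that yields each track's artist ID.
    static List<String> fetchTrackArtistIds(List<String> artistIdsInPlaylist) {
        requestLog.add("GET /playlists/{playlist_id}");
        return artistIdsInPlaylist;
    }

    // Simulates GET /artists/{artist_id}: one request per call.
    static String fetchArtist(String artistId) {
        requestLog.add("GET /artists/" + artistId);
        return "artist:" + artistId;
    }

    // Resolves a playlist the naive way and returns the total number of requests recorded.
    static int resolveNaively(List<String> artistIdsInPlaylist) {
        requestLog.clear();
        for (String artistId : fetchTrackArtistIds(artistIdsInPlaylist)) {
            fetchArtist(artistId); // runs once per track, even when the ID repeats
        }
        return requestLog.size();
    }

    public static void main(String[] args) {
        // Three tracks by the same artist still cost 1 + 3 requests.
        List<String> sameArtist = List.of(
            "3WrFJ7ztbogyGnTHbHJFl2", "3WrFJ7ztbogyGnTHbHJFl2", "3WrFJ7ztbogyGnTHbHJFl2");
        System.out.println(resolveNaively(sameArtist)); // prints 4
        requestLog.forEach(System.out::println);
    }
}
```

With fifty artist IDs, resolveNaively returns 51, matching the arithmetic above: one playlist request plus one artist request per track.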
Data loaders
To solve the n+1 problem in our application, we'll use data loaders. A data loader's job is to replace multiple similar requests with a single batched request.
We use data loaders inside of our datafetcher methods. When the process of resolving a query requires a datafetcher method to be called multiple times for different parameters, the data loader can batch the parameters together and make a single network request with them.
In our example, this means that when the Track.artist datafetcher is called using every track's artist ID, it won't call our REST API directly anymore; instead, it will pass the parameters to the data loader to collect.
Once the individual artist IDs are gathered in one list, the data loader can assume the responsibility of calling the data source. It's able to dispatch a single request to the REST API endpoint for all of the IDs at once—a huge performance boost over letting the datafetcher make a network request for each!
Best of all, with DGS, our data loaders automatically deduplicate the identifiers we pass them. This means even if our playlist contains multiple tracks by the same artist, we'll only ever request that artist once.
Data loaders work great when a single resource can provide data for multiple identifiers simultaneously. They're also scoped to the life of a single query; this means that if we send two queries back-to-back (each requesting a different list of playlists, tracks, and artists), the data loader will not try to batch artist IDs from both queries together. Instead, it will handle them separately, resolving each request independently of the other.
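The collect, deduplicate, and dispatch behavior described above can be sketched in plain Java. This is a simplified illustration, not the DGS data loader API: BatchingSketch, load, and dispatch are hypothetical names, and a real data loader hands back results asynchronously rather than through an explicit dispatch call.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified batching helper. Creating one instance per query
// mirrors the per-query scoping: keys from separate queries never mix.
class BatchingSketch {
    final List<String> requestLog = new ArrayList<>();

    // A LinkedHashSet drops duplicate IDs and keeps the order they arrived in.
    private final Set<String> pendingArtistIds = new LinkedHashSet<>();

    // Called where the datafetcher would previously have made its own request.
    void load(String artistId) {
        pendingArtistIds.add(artistId);
    }

    // Records one batched request for everything collected so far
    // and returns the IDs that were included in it.
    List<String> dispatch() {
        List<String> ids = new ArrayList<>(pendingArtistIds);
        pendingArtistIds.clear();
        if (!ids.isEmpty()) {
            requestLog.add("batched artist request for " + ids);
        }
        return ids;
    }

    public static void main(String[] args) {
        BatchingSketch loader = new BatchingSketch();
        loader.load("3WrFJ7ztbogyGnTHbHJFl2");
        loader.load("6H1RjVyNruCmrBEWRbD0VZ");
        loader.load("3WrFJ7ztbogyGnTHbHJFl2"); // duplicate, collected only once
        System.out.println(loader.dispatch());          // two unique IDs
        System.out.println(loader.requestLog.size());   // prints 1
    }
}
```

Three calls to load end in a single recorded request that covers two unique artist IDs, which is the saving we are after.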
What a data loader needs
There's just one big requirement for data loaders to work as expected: the endpoint that receives the batched request needs to have the ability to provide data for multiple objects. This requires a change in our application; right now, the Track.artist datafetcher sends each artist ID individually to the GET /artists/{artist_id} endpoint, which only returns data for a single provided value.
Fortunately, we have a different endpoint in our REST API that we can use: GET /artists. It accepts multiple artist IDs joined as a single string, and returns data for them all at once.
We've provided a method that utilizes this new endpoint in our data source. Jump into SpotifyClient to take a closer look.
public List<MappedArtist> multipleArtistsRequest(List<String> artistIds) {
    System.out.println("I am making a call to the artists endpoint with artists " + artistIds);
    ArtistCollection artistCollection = client
        .get()
        .uri(uriBuilder -> uriBuilder
            .path("/artists")
            .queryParam("artists_ids", String.join(",", artistIds))
            .build())
        .retrieve()
        .body(ArtistCollection.class);
    if (artistCollection != null) {
        return artistCollection.getArtists();
    }
    return null;
}
This method is set up to accept a List of artist IDs. It makes a request to the GET /artists endpoint, passing the IDs as a comma-separated artists_ids query parameter, then receives the response body as an instance of the ArtistCollection class. If the response body is not null, we return the result of calling the ArtistCollection class's getArtists method, which gives us our list of MappedArtist instances. Otherwise, the method returns null.
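To see what the batched request looks like on the wire, here is a small sketch of how the IDs end up in the query string. ArtistsUriSketch and buildArtistsPath are hypothetical names used only for illustration. The sketch shows the value before any URL encoding, which the real HTTP client may apply to the commas.

```java
import java.util.List;

// Hypothetical helper that mirrors the String.join call in multipleArtistsRequest.
class ArtistsUriSketch {
    // Joins the artist IDs with commas and attaches them as the artists_ids query parameter.
    static String buildArtistsPath(List<String> artistIds) {
        return "/artists?artists_ids=" + String.join(",", artistIds);
    }

    public static void main(String[] args) {
        List<String> ids = List.of("3GBPw9NK25X1Wt2OUvOwY3", "33QmoCkSqADuQEtMCysYLh");
        // prints /artists?artists_ids=3GBPw9NK25X1Wt2OUvOwY3,33QmoCkSqADuQEtMCysYLh
        System.out.println(buildArtistsPath(ids));
    }
}
```

However many IDs the list holds, they travel in one request, which is what lets a data loader swap n artist calls for a single one.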
Now that we have a data source method that accepts multiple artist IDs, we can update our Track.artist datafetcher and benefit from the power of a data loader!
Key takeaways
- The n+1 problem occurs when we make an initial request, followed by some unknown number of follow-up requests.
- Data loaders let us batch a list of identifiers (such as IDs) in a single request rather than sending an individual request for each.
- Before data loaders can work properly, our data source (whether another API, or a database) needs to implement a method that accepts multiple keys (such as IDs), and returns multiple objects.
Up next
We've learned about data loaders and the problem that they solve in our application. We also have a new data source method that accepts multiple artist IDs, and resolves multiple artist objects. Next up, we'll implement the data loader logic that gathers up multiple artist IDs in a single request.
