February 23, 2023

How we deploy the Apollo GraphOS monorepo

Ryan Wilcox

Did you know that the majority of Apollo GraphOS’s features are shipped via a monorepo?

Yes, these mythic beasts still roam the earth. Ours is a humble but temperamental creature that, until recently, required Apollo engineers to do a lot of manual deployment to get the fleet entirely rolled. This year our team undertook the effort to get things continuously deployed to our staging environment using our continuous integration (CI) provider, CircleCI.

What follows is a step-by-step guide to what happens behind the scenes here at Apollo when our developers push their code, get it approved, and merge it. It’s also a nice little introduction to using our internal orb (CircleCI’s name for a package of CircleCI config).

First, a few little disclaimers…

This isn’t supposed to be an endorsement for CircleCI’s monorepo support, our orb, or us saying you should move all of your code into a monorepo. Assessing whether any of these are right for your organization will require thought about your team’s structure, processes, time, and needs.

Monorepos (ours included) have a nasty habit of turning into/masking the distributed monolith. But, when you already have one, it doesn’t make practices like continuous deployment any less valuable, just possibly a lil’ trickier. Our orb also isn’t the strict definition of “open source”. We want others to benefit from what we’ve done, but we’re probably going to be poor maintainers for folks who aren’t working internally here at Apollo 🤷.

Now, let’s get into it.

🔮 Dynamic Workflow: building the builders

A few years back, CircleCI released a feature set called dynamic configuration.

In the regular CircleCI development flow, when you push a commit, a webhook is picked up and your static configuration is executed by CircleCI.

For our monorepo, we leverage CircleCI’s dynamic configuration. The initial “commit to webhook” flow is the same, but the key difference is that the first configuration is marked as a setup configuration, which lets you submit another static configuration after it:

version: 2.1
setup: true

The first workflow that runs is the setup workflow. The setup workflow can generate more workflows branching off of it – and generate it does! We named our setup workflow root to make it obvious what it does.

Inside of root is a job we named setup (again, something we picked to make things obvious). This job does two things for us:

  1. Detects changes and realizes change scope: Compares the current commit against the head of main (or against the previous commit when the current commit is itself the head of main, i.e., you merged). Using this comparison, it builds up a simple dependency graph of the modules in the repo to figure out which nodes in the graph are dirty and thus need to be dealt with.
  2. Generates workflows: Enriches a number of YAML files using Mustache templating, and then packs those into a CircleCI config using FYAML.
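The comparison rule in step 1 can be sketched in a few lines of Python. This is only an illustration, not the actual setup job: the function names and the origin/main remote ref are assumptions.

```python
def diff_base(branch: str, main_branch: str = "main") -> str:
    """Pick the git revision that HEAD should be compared against."""
    # On main itself (i.e. right after a merge) compare with the previous
    # commit; on any other branch compare with the head of main.
    # "origin/main" is an assumed remote ref name.
    return "HEAD~1" if branch == main_branch else f"origin/{main_branch}"


def diff_command(branch: str) -> list[str]:
    """Build the git invocation that lists the changed file paths."""
    return ["git", "diff", "--name-only", diff_base(branch), "HEAD"]


print(diff_command("feature/my-change"))
print(diff_command("main"))
```

The resulting list can be handed to subprocess.run to get the changed paths that feed the dependency graph.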

Detecting Changes and Realizing Change Scope

Our monorepo is mainly a large multi-module Gradle build. In this setup, anything that impacts production is a Gradle module, including things you might not commonly put inside a Gradle project, but that’s a discussion for another time.

Changes are detected using a Gradle task generated by the Changed Projects plugin. It operates using the same core logic as CircleCI’s path-filtering orb: file paths from git diff get matched against regexes.

In Changed Projects the root of every module is the regex to match. If any file under that path has changed, we mark that node as dirty. We then recursively dirty the rest of the dependency graph and write all the changed modules into a JSON file. As the monorepo contains both libraries and microservices that use those libraries, a change to a common library may result in a large number of microservices needing to be rebuilt and redeployed.
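To make the "mark dirty, then dirty the dependents" idea concrete, here is a rough Python sketch. It is not the Changed Projects plugin's code: the module names, paths, and the JSON shape are made up, and a plain path-prefix check stands in for the regex match.

```python
import json


def directly_changed(changed_files, module_roots):
    """Modules whose root directory contains at least one changed file."""
    return {
        name
        for name, root in module_roots.items()
        for path in changed_files
        if path.startswith(root.rstrip("/") + "/")
    }


def dirty_modules(changed, depends_on):
    """Expand the changed modules to everything that depends on them."""
    # Invert "module -> its dependencies" into "module -> its dependents".
    dependents = {}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(module)
    dirty, stack = set(changed), list(changed)
    while stack:
        for dependent in dependents.get(stack.pop(), ()):
            if dependent not in dirty:
                dirty.add(dependent)
                stack.append(dependent)
    return dirty


# Hypothetical repo layout and dependency graph.
module_roots = {
    "common": "libs/common",
    "registry": "services/registry",
    "mailer": "services/mailer",
    "docs": "docs",
}
depends_on = {"registry": ["common"], "mailer": ["common"], "common": [], "docs": []}

changed = directly_changed(["libs/common/src/Util.kt"], module_roots)
result = sorted(dirty_modules(changed, depends_on))
print(json.dumps({"changedModules": result}))
```

A single file change in the shared library dirties both services that use it, while the unrelated docs module stays clean.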

Generating Workflows

Under our .circleci directory, we have the following layout (again, thanks FYAML):

$ ls .circleci/

config.yml

config.fyaml/
  @base.yml
  commands/ ...
  executors/ ...
  jobs/ ...
  parameters/ ...
  workflows/
    flyway.yml
    kotlin.yml
    odyssey.yml
    kubernetes.yml
    registry.yml
    terraform.yml
    ...

But, our secret is that the files in the workflows folder aren’t just YAML, they are YAML-enhanced Mustache templates. Using the Python chevron package for Mustache templating, we turn the templates in .circleci/config.fyaml/ into fully-fledged CircleCI workflow configs.

Our templates can’t know how many microservices to rebuild – that’s something detected at build time by Changed Projects. However, Mustache does have looping features! For example, if 3 microservices (odyssey, registry, and mailer) have changed, our Mustache code looks like this:

{{#modules}}
- run:
    name: deploy {{name}} to dev environment
    command: gradle deploy:{{name}}:toDev
{{/modules}}

And it will output syntax CircleCI is expecting, dynamically:

- run:
    name: deploy odyssey to dev environment
    command: gradle deploy:odyssey:toDev
- run:
    name: deploy registry to dev environment
    command: gradle deploy:registry:toDev
- run:
    name: deploy mailer to dev environment
    command: gradle deploy:mailer:toDev

At this point, a developer only sees a bunch of generated workflows being run by Circle, and then only for microservices. We realized very quickly that we needed developers to also understand the scope of their changes. So in every pull request, CircleCI posts a comment that is a table of the scope of the change: what libraries changed and what microservices changed.
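A scope table like this is easy to build once you have the list of dirty modules. The sketch below is an assumption about what such a comment could look like, not the orb's actual output; the column names and the library/microservice split are illustrative.

```python
def scope_comment(libraries, services):
    """Render the change scope as a Markdown table for a PR comment."""
    rows = [("library", name) for name in sorted(libraries)]
    rows += [("microservice", name) for name in sorted(services)]
    lines = ["| Kind | Module |", "| --- | --- |"]
    lines += [f"| {kind} | `{name}` |" for kind, name in rows]
    return "\n".join(lines)


comment = scope_comment(["common"], ["registry", "mailer"])
print(comment)
```

The returned string can be posted as the body of a pull request comment through whichever GitHub client the pipeline already uses.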

In November we turned this table into a graph, which helps engineers understand the scope of their change and visualize which dependencies could be clipped to reduce the scope of future changes.
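The post doesn't say how the graph is drawn. One low-effort option, shown here purely as an assumption, is to emit Mermaid text, which GitHub renders inside comments. The module names are hypothetical.

```python
def mermaid_graph(depends_on, dirty):
    """Emit Mermaid edges (dependency --> dependent) between dirty modules."""
    lines = ["graph LR"]
    for module in sorted(dirty):
        for dep in sorted(depends_on.get(module, [])):
            if dep in dirty:
                lines.append(f"  {dep} --> {module}")
    return "\n".join(lines)


# Hypothetical dependency graph and set of dirty modules.
depends_on = {"registry": ["common"], "mailer": ["common"], "common": []}
graph = mermaid_graph(depends_on, {"common", "registry", "mailer"})
print(graph)
```

Seeing every arrow that fans out from a shared library makes it obvious which dependency, if clipped, would shrink the blast radius of a change.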

NOTE: Turns out, dynamic workflows generated this way can exceed some Circle-imposed limits! If you need more detail on this, see the CircleCI discussion post we wrote on Circle’s “compiled config is too large” error.

What we build

Now, let’s see what we’re actually building with the workflows we’ve generated…

Terraform

Detecting that a Terraform workflow needs to be run is pretty easy for us. All of our Terraform code exists inside a single root directory: terraform/. If a file changes in there, we run a plan then (on merge) run an apply.
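As a minimal sketch of that rule (the function name and return values are assumptions, not our real job names):

```python
def terraform_action(changed_files, branch, root="terraform/", main_branch="main"):
    """Decide which Terraform step, if any, a pipeline should run."""
    if not any(path.startswith(root) for path in changed_files):
        return None  # nothing under terraform/ changed, so skip the workflow
    # Pull request branches only plan; the merge to main applies.
    return "apply" if branch == main_branch else "plan"


print(terraform_action(["terraform/gcp/main.tf"], "feature/new-bucket"))
print(terraform_action(["terraform/gcp/main.tf"], "main"))
print(terraform_action(["services/registry/build.gradle.kts"], "main"))
```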

One of the little tricks we employed here was to have our human and machine readable plans output to the GitHub pull request. Now, when there’s a Terraform change, reviewers can see everything in one place. We know that there are services that do this for you, but this is kind of the genius of CircleCI: simple workflows are like LEGO bricks.

Flyway

Any persistent datastore’s migrations (Spanner, PostgreSQL) are run here. We’ve wrapped the normal Flyway plugin in Gradle so that we can even leverage GCP’s new IAM Service Account authentication from Flyway. This is nice because it means we’re no longer having to tunnel into our clusters/projects. Instead, our Terraform workflow puts rotating tokens into a CircleCI context.

Kotlin

The majority of the code in our monorepo is written in Kotlin. If the module is written in Kotlin, its tests and subsequent deployment steps will be in this workflow.

A fun job that runs inside this workflow is a gradle-dependencies checker. We leverage a version catalog in our monorepo to keep dependencies in line. This has the advantage of centralized and consistent versions, but the downside is that detecting changes is much harder (our Gradle plugin for dirty modules doesn’t play nice with version catalogs for a number of reasons).

To sidestep this, we use Gradle lockfiles! Every module has its own lil’ lockfile and thus, if that’s changed, we know that a version change impacts said module.

To prevent devs from having to constantly rebuild these files (and to allow us to still be able to use things like Renovate), this job will rebuild the lockfiles on every commit, then diff the results against what’s checked in. If there are changes, it stages them and pushes them to the branch, thus canceling the currently running Circle build and triggering a new one with all of the new lockfiles.
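The "rebuild, then diff" step boils down to comparing two sets of file contents. Here is a simplified sketch that treats each lockfile as an opaque string keyed by module; the module names and lockfile contents are invented for illustration, and the real job works on files in the checkout.

```python
def stale_lockfiles(checked_in, regenerated):
    """Modules whose freshly generated lockfile differs from the committed one."""
    return sorted(
        module
        for module, content in regenerated.items()
        if checked_in.get(module) != content
    )


# Made-up lockfile contents, keyed by module name.
checked_in = {
    "registry": "com.example:http:1.0.0=runtimeClasspath\n",
    "mailer": "com.example:smtp:2.1.0=runtimeClasspath\n",
}
regenerated = {
    "registry": "com.example:http:1.0.0=runtimeClasspath\n",
    "mailer": "com.example:smtp:2.2.0=runtimeClasspath\n",
}

stale = stale_lockfiles(checked_in, regenerated)
should_push = bool(stale)  # a push cancels this build and starts a fresh one
print(stale, should_push)
```

Because each module owns its lockfile, the list of stale files doubles as the list of modules affected by a version bump.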

Odyssey

Yup, our tutorials are federated into our monorepo too! The team that owns this lands more solidly in our Education department rather than Engineering. As such, their workflow looks a lot different from the others. Having the flexibility to support this was something we designed for from day 1.

Deployments

You’ve gotten approval and you’ve merged; onto Staging!

Once your code gets approved by peers/codeowners you merge your code to main and all of a sudden your workflows look a smidge different. There are now a number of deployment options. Only code merged to main is allowed to be promoted to our staging or production environments.

Deployments to staging are NOT optional and take off immediately. We alert developers in Slack when a deployment is happening. We created a simple hack to make sure that folks would be alerted when things were happening that pertained to them: in the message, we include the committer’s GitHub handle! For example, gh:examplehandle for GitHub user examplehandle.

Our standard operating practice is that everyone adds their GitHub name, using this format, to their Slack Keywords (some of our devs have the handle HQ, or CY… so Slack keywords alone wouldn’t work for them 😅). This allowed us to not have to maintain some complicated mapping between GitHub and Slack.
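To see why the gh: prefix matters for short handles, here is a toy Python illustration. The message wording and service names are made up, and the substring check is a simplification rather than Slack's real keyword matching.

```python
def deploy_alert(service, handle, environment="staging"):
    """Format a deployment alert that tags the committer as gh:<handle>."""
    return f"Deploying {service} to {environment} for gh:{handle}"


def mentions(message, keyword):
    """Simplified, case-insensitive stand-in for a keyword notification."""
    return keyword.lower() in message.lower()


mine = deploy_alert("registry", "HQ")
someone_elses = deploy_alert("hq-dashboard", "examplehandle")

print(mentions(mine, "gh:HQ"))           # the prefixed keyword finds my deploy
print(mentions(someone_elses, "HQ"))     # a bare keyword also fires on noise
print(mentions(someone_elses, "gh:HQ"))  # the prefixed keyword stays quiet
```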

You’re running in staging…and you forgot to promote your build to production 🤦

Once your code has shipped successfully to staging, an approval job for promoting to production is unlocked!

Once your code has sat in staging for a while, you really should be clicking that button… but you know, it’s hard to remember to do this. There are many posts online talking about how click-button deployments are fraught since there is a delay between clicking the merge button and the time that you want to deploy.

Our solution? For now, we wrote a reminder bot (the star of another blog post!) that observes all workflows on main which are in the pending state and pokes the developer responsible for the commit to “push the button”. This is a simple cron bot that uses scheduled triggers in CircleCI to periodically look and poke.
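The bot's core loop is a filter plus a group-by. The sketch below assumes the workflows have already been fetched and flattened into plain dictionaries; the field names, the one-hour grace period, and the sample data are all hypothetical, not CircleCI's API shape.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def pending_promotions(workflows, now, min_age=timedelta(hours=1)):
    """Group stale, still-pending workflows on main by committer handle."""
    reminders = defaultdict(list)
    for wf in workflows:
        waiting = now - wf["created_at"]
        if wf["branch"] == "main" and wf["status"] == "pending" and waiting >= min_age:
            reminders[wf["committer"]].append(wf["id"])
    return dict(reminders)


now = datetime(2023, 2, 23, 12, 0, tzinfo=timezone.utc)
workflows = [
    {"id": "wf-1", "branch": "main", "status": "pending",
     "committer": "examplehandle", "created_at": now - timedelta(hours=3)},
    {"id": "wf-2", "branch": "main", "status": "success",
     "committer": "examplehandle", "created_at": now - timedelta(hours=5)},
    {"id": "wf-3", "branch": "main", "status": "pending",
     "committer": "otherhandle", "created_at": now - timedelta(minutes=5)},
]

reminders = pending_promotions(workflows, now)
print(reminders)
```

Each key in the result is a handle to poke, which pairs naturally with the gh: Slack keyword convention described earlier.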

Oh no! main is down!

The other thing the aforementioned bot does is observe whether any commits in main are reporting as failed so that all hands get on deck and get things fixed. Keeping main healthy is everyone’s responsibility.

We want to share these tools with the world!

We’ve released these features in a public CircleCI orb, our platform-internal-orb. This library is open source, but not officially supported by Apollo. However, it contains jobs and commands to do everything mentioned in this blog: from enriching pipelines with Mustache, to commenting on GitHub PRs, to reminder bot commands.

Conclusion

That’s all folks! We spent 2022 building this system up, using it day to day, and discovering what worked and what didn’t. We hope that these ideas – and those ideas codified and isolated for the general public to use – come in handy in your efforts to build your best self! (Maybe they’ll even help you build a supergraph 😉 )

Happy hacking!
