Optimizing CI/CD with CircleCI for data engineering

Right away, I will admit: CircleCI is not my favorite CI/CD tool. A new generation of tooling has exploded into this product space, and not all of it is hype: much of it offers developers better ergonomics, functionality, and pricing than CircleCI.

However, sometimes, the correct choice isn’t what we want but what we have.

Given the client’s longstanding familiarity with CircleCI as a platform and the task at hand, a monorepo orchestrated with CircleCI seemed a suitable choice for encouraging code sharing and enforcing a consistent set of practices across business units.

And so, dear reader, I have identified and navigated all the foot-guns and false-starts so that you may learn from my begrudging, grumbling hours spent accomplishing this task using my not-favorite CI/CD tool.

The project structure

To begin, let’s imagine a repository with the following structure:

$ tree -a myproject

myproject
├── .python-version
├── __init__.py
├── common
│   ├── common
│   │   └── __init__.py
│   ├── poetry.lock
│   └── pyproject.toml
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── subproject_one
│   ├── Dockerfile
│   ├── poetry.lock
│   ├── pyproject.toml
│   └── subproject_one
│       └── __init__.py
└── subproject_two
    ├── Dockerfile
    ├── poetry.lock
    ├── pyproject.toml
    └── subproject_two
        └── __init__.py

Two projects (subproject_one, subproject_two) are independently deployable services, both of which consume a common package of library-level code. I wanted to orchestrate a CI/CD pipeline such that merging changes to our main branch would automatically deploy to staging and merging changes to a prod branch would automatically deploy to production. Further, I had build and validation steps that were common to all three directories, plus build, validation, and deployment steps that were unique to each directory. Nothing exotic here: for example, I might run a linter across all files, and I might build and push an image for subproject_one to a different Docker repository than the one used for subproject_two.

Don’t fight their APIs

My initial inclination was to create three configuration files. One for tasks that might be common across all projects, for example, running the tests across a project or validating that the code is properly formatted. And two others, each corresponding to our subprojects, where we could place logic specific to those projects.

I recognized that this was not the official recommendation but attempted it anyway. Orbs (CircleCI’s word for packages) bundle functionality to 1) filter based on paths and 2) invoke a “continuation” of a pipeline in order to run another file. These can be stitched together to create separate files for each project, and I ran with that setup for a time. In retrospect, I would not recommend it, and I have since migrated away from it. It was finicky, error-prone, and difficult to maintain. You live and learn, right?

CircleCI’s dynamic configuration prescribes creating two files: a config.yml, where we can author jobs common to all projects and invoke our project-based workflows, and a continue_config.yml, where we can author our project-based jobs and workflows.

You may be wondering: won’t that become a huge mess of a file? Particularly, if many subprojects are present in our monorepo, one file containing many mixed concerns would make most software engineers eager to refactor.

Well, you’re right. It could become a huge mess of a file.

But! I have identified a few techniques we can use to keep it modular, DRY (don’t-repeat-yourself), and maintainable.

So where does that leave us?

First, we need a directory for our CircleCI configuration files and some proprietary setup:

├── .circleci
│   ├── config.yml
│   └── continue_config.yml

Your config.yml file must include a setup: true block alongside some CircleCI-specific configuration.

From there, we can move on to the aforementioned techniques.

Use the path-filtering orb, Luke

CircleCI’s path filtering orb provides functionality to continue a pipeline based on the paths of changed files. The mapping parameter allows us to pass variables to our continuation configuration for use in when clauses of our workflow. This provides a mechanism to trigger particular workflow branches. In practice, this will look like:

version: 2.1

setup: true

orbs:
  path-filtering: circleci/path-filtering@1.0.0

jobs:
  validate-source-code:
    steps:
      ...

workflows:
  always-run:
    jobs:
      - validate-source-code
      - path-filtering/filter:
          name: check-updated-files
          mapping: |
            common/.* run-common-workflow true
            subproject_one/.* run-subproject-one-workflow true
            subproject_two/.* run-subproject-two-workflow true
          base-revision: main
          config-path: .circleci/continue_config.yml

...

parameters:
  run-common-workflow:
    type: boolean
    default: false
  run-subproject-one-workflow:
    type: boolean
    default: false
  run-subproject-two-workflow:
    type: boolean
    default: false

...

workflows:
  subproject-one:
    when:
      or:
        - equal: [true, << pipeline.parameters.run-subproject-one-workflow >>]
        - equal: [true, << pipeline.parameters.run-common-workflow >>]
    jobs:
      ...
  subproject-two:
    when:
      or:
        - equal: [ true, << pipeline.parameters.run-subproject-two-workflow >> ]
        - equal: [true, << pipeline.parameters.run-common-workflow >>]
    jobs:
      ...

Notably, this provides the flexibility to run all workflows when a change occurs in common and only run a particular workflow when changes occur in its subdirectory.
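To make that behavior concrete, here is a minimal Python sketch of the decision the mapping and the when clauses encode together. It is an illustration, not the orb's actual implementation: the function names are my own, and it assumes each mapping pattern must match a changed path in full.

```python
import re

# Mapping lines copied from config.yml: "<path regex> <parameter> <value>".
MAPPING = """
common/.* run-common-workflow true
subproject_one/.* run-subproject-one-workflow true
subproject_two/.* run-subproject-two-workflow true
"""


def parameters_for(changed_files, mapping=MAPPING):
    """Return the pipeline parameters that the changed paths would set."""
    params = {}
    for line in mapping.strip().splitlines():
        pattern, name, value = line.split()
        # Assumption: a pattern must match the whole path, not just a prefix.
        if any(re.fullmatch(pattern, path) for path in changed_files):
            params[name] = value == "true"
    return params


def workflows_to_run(params):
    """Mirror the `when: or:` clauses from continue_config.yml."""
    common = params.get("run-common-workflow", False)
    return {
        "subproject-one": common or params.get("run-subproject-one-workflow", False),
        "subproject-two": common or params.get("run-subproject-two-workflow", False),
    }


print(workflows_to_run(parameters_for(["subproject_one/Dockerfile"])))
print(workflows_to_run(parameters_for(["common/common/__init__.py"])))
```

A change under subproject_one enables only its own workflow, while a change under common enables both, which is the fan-out described above.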

Keep things DRY with the tooling available

YAML isn’t a programming language, but it is a declarative configuration language with some seldom-explored advanced features. Some of my favorites are anchors, aliases, and merge keys. Combined, they allow us to author reusable snippets in our CircleCI template (and in most YAML documents in general):

common_settings: &common_settings
  executor:
    name: python/default
    tag: 3.10.8

subproject_one_common_settings: &subproject_one_common_settings
  working_directory: ~/myproject/subproject_one
  <<: *common_settings
  
...

jobs:
  subproject-one-validate:
    <<: *subproject_one_common_settings
    steps:
      - myproject-checkout
      - install-acme-cli
      - validate

So, if you have repeated snippets of orchestration (and you likely do, given you’re working in a monorepo), create a common block of configuration, anchor it, and then reference that anchor via aliases and merge keys. This lets you write it once and use it everywhere, DRYing up your configuration file.
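If merge keys are new to you, it may help to see roughly what they resolve to. The sketch below uses plain Python dicts as an analogy for the YAML mappings above; it is not a YAML parser, and the merge helper is a name I made up for illustration. The one rule worth remembering is that keys written directly in a mapping take precedence over keys pulled in through the merge.

```python
# Plain dicts standing in for the YAML mappings above.
common_settings = {
    "executor": {"name": "python/default", "tag": "3.10.8"},
}


def merge(anchor, own_keys):
    """Approximate `<<: *anchor`: keys written in the mapping itself win."""
    return {**anchor, **own_keys}


subproject_one_common_settings = merge(
    common_settings,
    {"working_directory": "~/myproject/subproject_one"},
)

subproject_one_validate = merge(
    subproject_one_common_settings,
    {"steps": ["myproject-checkout", "install-acme-cli", "validate"]},
)

# The job ends up with the executor, the working directory, and its own steps.
print(sorted(subproject_one_validate))
```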

Use filters for branch-based logic

I am more familiar with GitHub Actions-style workflow triggers, which invoke particular workflows based on branch conditions. CircleCI offers similar functionality via filters.

For our example project, I wanted to create three different workflows based on branching:

First, for every pull request and merge, run some common tasks (such as validating the change has no syntax errors).
Second, when a change is merged to main, and has no git tag, deploy it to a staging environment.
Third, when a change is merged to prod and has a tag of the form v$.$.$ (such as v1.0.0), deploy it to the production environment.

In practice, this looks like:

stg-filters: &stg-filters
  filters:
    branches:
      only: main
    tags:
      ignore: /.*/

prod-filters: &prod-filters
  filters:
    branches:
      only: prod
    tags:
      only: /^v.*/

...

workflows:
  subproject-one:
    jobs:
      - subproject-one-validate
      - subproject-one-deploy-stg:
          requires:
            - subproject-one-validate
          <<: *stg-filters
      - subproject-one-deploy-prod:
          requires:
            - subproject-one-validate
          <<: *prod-filters

Combined with the aforementioned anchoring, aliasing, and merge keys, we can compose a common set of branch-based rules to use in our workflows for each subproject included in our monorepo.
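One detail worth checking is how permissive the tag pattern is. The /^v.*/ filter above accepts any tag that starts with v, not only tags shaped like v1.0.0. Here is a minimal Python sketch comparing it with a stricter pattern; the stricter regex is a hypothetical alternative of my own, and the sketch assumes the filter must match the entire tag.

```python
import re

LOOSE = r"^v.*"               # the pattern used in prod-filters
STRICT = r"^v\d+\.\d+\.\d+$"  # hypothetical: v<major>.<minor>.<patch> only


def tag_matches(pattern, tag):
    # Assumption: the filter regex must match the entire tag.
    return re.fullmatch(pattern, tag) is not None


for tag in ["v1.0.0", "v2", "vnext", "1.0.0"]:
    print(f"{tag:8} loose={tag_matches(LOOSE, tag)} strict={tag_matches(STRICT, tag)}")
```

If only release-shaped tags should ever reach production, a tighter pattern along these lines is something to test against your own tagging scheme before relying on it.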

Don’t be afraid to offload complex logic into scripts

If you’re struggling to fit a complicated step into your job or workflow declarations, offload that logic into a script. This can be authored with bash, or even your favorite programming language, for example:

#!/usr/bin/env python

import os

NAME = os.environ["NAME"]

print(f"Hello, {NAME}!")

For my purposes, this was helpful for orchestrating a sequence of steps that required an API client, given that my deployment target did not have a CircleCI orb available. I know I would rather debug a Python script than a hobbling of bash in a CI configuration file when it (inevitably) breaks.
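As a slightly more realistic sketch of what such a script might look like, the example below gathers its inputs from the environment and fails loudly when something is missing. CIRCLE_BRANCH and CIRCLE_SHA1 are variables CircleCI sets for a job; SERVICE_NAME, the function names, and the shape of the request are hypothetical, and the actual API call is left out because it depends entirely on your deployment target.

```python
import os
import sys

# Branch-to-environment rules taken from the filters described earlier.
ENVIRONMENTS = {"main": "staging", "prod": "production"}


def build_deploy_request(env=os.environ):
    """Collect what a deploy call needs, failing loudly on missing input."""
    branch = env.get("CIRCLE_BRANCH", "")
    if branch not in ENVIRONMENTS:
        raise ValueError(f"no deployment target for branch {branch!r}")
    return {
        "service": env["SERVICE_NAME"],  # hypothetical variable set in the job
        "revision": env["CIRCLE_SHA1"],
        "environment": ENVIRONMENTS[branch],
    }


def main():
    try:
        request = build_deploy_request()
    except (KeyError, ValueError) as error:
        sys.exit(f"deploy aborted: {error}")
    # A real script would hand `request` to the deployment target's API client here.
    print(f"Would deploy {request}")
```

Calling main() from the usual `if __name__ == "__main__":` guard turns this into a script a job step can run, and passing a plain dict to build_deploy_request keeps the logic easy to test outside of CI.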

RTFM!

This sounds naive, but consulting the official documentation for a CircleCI configuration file proved to be the best source of information while exploring the tools available. Further, it informed me of what options were available to me and provided brief examples for their implementation.

Googling for answers tended to lead to outdated community answers. And using ChatGPT for CircleCI was often flat-out wrong. So, in this instance, doing things the old-fashioned way paid the most dividends.

If you’ve made it to the end of this post, you’ve either (hopefully) added new tools to your toolbox or (unfortunately) continued to search for answers.

Feel free to reach out to mavrick.laakso@testdouble.com in either case with feedback, praise, or condemnation (maybe you really like CircleCI, no judgement!). Until next time.

For my current engagement, I was tasked with setting up a new repository and change management processes to support an enterprise-ready data engineering and machine learning platform. My client has dozens of repositories successfully validating pull requests, promoting changes, and orchestrating deployments into their respective environments using CircleCI.
