
How to optimize CI/CD with CircleCI for data engineering and ML

Discover the best practices and techniques for using CircleCI to streamline your data engineering and machine learning workflows.
Mavrick Laakso | February 11, 2024

Right away, I will admit: CircleCI is not my favorite CI/CD tool. This product space has seen an explosion of new-generation tooling, and it isn’t all hype: many of the newer tools offer better ergonomics, functionality, and pricing for developers than CircleCI.

However, sometimes, the correct choice isn’t what we want but what we have.

Given the client’s longstanding familiarity with CircleCI as a platform and the task at hand, a monorepo orchestrated with CircleCI seemed a suitable choice for encouraging code sharing and enforcing a consistent set of practices across business units.

And so, dear reader, I have identified and navigated all the foot-guns and false-starts so that you may learn from my begrudging, grumbling hours spent accomplishing this task using my not-favorite CI/CD tool.

The project structure

To begin, let’s imagine a repository with the following structure:

$ tree -a myproject

myproject
├── .python-version
├── __init__.py
├── common
│   ├── common
│   │   └── __init__.py
│   ├── poetry.lock
│   └── pyproject.toml
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── subproject_one
│   ├── Dockerfile
│   ├── poetry.lock
│   ├── pyproject.toml
│   └── subproject_one
│       └── __init__.py
└── subproject_two
    ├── Dockerfile
    ├── poetry.lock
    ├── pyproject.toml
    └── subproject_two
        └── __init__.py

Two projects (subproject_one, subproject_two) are independently deployable services, both of which consume a common package of library-level code. I wanted to orchestrate a CI/CD pipeline such that merging changes to our main branch would automatically deploy to staging and that deploying changes to a prod branch would automatically deploy to production. Further, I had build and validation steps that were common to all three directories and build, validation, and deployment steps that were unique to each directory. Nothing exotic here - for example, I might run a linter across all files and I might build and push an image for subproject_one to a particular Docker repository that is different from subproject_two.

Don’t fight their APIs

My initial inclination was to create three configuration files. One for tasks that might be common across all projects, for example, running the tests across a project or validating that the code is properly formatted. And two others, each corresponding to our subprojects, where we could place logic specific to those projects.

I recognized that this was not the official recommendation but attempted it anyway. Orbs (CircleCI’s word for packages) bundle functionality to 1) filter based on paths and 2) invoke a “continuation” of a pipeline in order to run another file. These can be stitched together to create separate files for each project, and I did so for a time. However, in retrospect, I would not recommend it, and I migrated away from this approach. It was finicky, error-prone, and difficult to maintain. You live and learn, right?

CircleCI’s dynamic configuration prescribes creating two files: a config.yml, where we can author jobs common to all projects and invoke our project-based workflows, and a continue_config.yml, where we can author our project-based jobs and workflows.

You may be wondering: won’t that become a huge mess of a file? Particularly, if many subprojects are present in our monorepo, one file containing many mixed concerns would make most software engineers eager to refactor.

Well, you’re right. It could become a huge mess of a file.

But! I have identified a few techniques we can use to keep it modular, DRY (don’t-repeat-yourself), and maintainable.

So where does that leave us?

First, we need a directory for our CircleCI configuration files and some proprietary setup:

├── .circleci
│   ├── config.yml
│   └── continue_config.yml

Your config.yml file must include a top-level setup: true key alongside some CircleCI-specific configuration.

From there, we can move on to the aforementioned techniques.

Use the path-filtering orb, Luke

CircleCI’s path filtering orb provides functionality to continue a pipeline based on the paths of changed files. The mapping parameter allows us to pass variables to our continuation configuration for use in when clauses of our workflow. This provides a mechanism to trigger particular workflow branches. In practice, this will look like:

# .circleci/config.yml
version: 2.1

setup: true

orbs:
  path-filtering: circleci/path-filtering@1.0.0

jobs:
  validate-source-code:
    steps:
      ...

workflows:
  always-run:
    jobs:
      - validate-source-code
      - path-filtering/filter:
          name: check-updated-files
          mapping: |
            common/.* run-common-workflow true
            subproject_one/.* run-subproject-one-workflow true
            subproject_two/.* run-subproject-two-workflow true
          base-revision: main
          config-path: .circleci/continue_config.yml


And in .circleci/continue_config.yml:

...

parameters:
  run-common-workflow:
    type: boolean
    default: false
  run-subproject-one-workflow:
    type: boolean
    default: false
  run-subproject-two-workflow:
    type: boolean
    default: false

...

workflows:
  subproject-one:
    when:
      or:
        - equal: [true, << pipeline.parameters.run-subproject-one-workflow >>]
        - equal: [true, << pipeline.parameters.run-common-workflow >>]
    jobs:
      ...
  subproject-two:
    when:
      or:
        - equal: [true, << pipeline.parameters.run-subproject-two-workflow >>]
        - equal: [true, << pipeline.parameters.run-common-workflow >>]
    jobs:
      ...

Notably, this provides the flexibility to run all workflows when a change occurs in common and only run a particular workflow when changes occur in its subdirectory.
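
To make the mapping concrete, here is a small Python model of the decision it encodes. It is a sketch of the idea rather than the orb's actual implementation: the function names are hypothetical, and treating each pattern as a regular expression that must match the whole path is an assumption.

```python
import re

# The same mapping as in config.yml: "<path regex> <parameter> <value>" per line.
MAPPING = """
common/.* run-common-workflow true
subproject_one/.* run-subproject-one-workflow true
subproject_two/.* run-subproject-two-workflow true
"""


def parse_mapping(text):
    """Turn the whitespace-delimited mapping into (regex, name, value) rules."""
    rules = []
    for line in text.strip().splitlines():
        pattern, name, value = line.split()
        rules.append((re.compile(pattern), name, value == "true"))
    return rules


def parameters_for(changed_files, rules):
    """Set a parameter when any changed path matches its pattern (assumed full match)."""
    params = {}
    for pattern, name, value in rules:
        if any(pattern.fullmatch(path) for path in changed_files):
            params[name] = value
    return params


def workflows_to_run(params):
    """Mirror the `when: or:` clauses in continue_config.yml."""
    common = params.get("run-common-workflow", False)
    selected = []
    if common or params.get("run-subproject-one-workflow", False):
        selected.append("subproject-one")
    if common or params.get("run-subproject-two-workflow", False):
        selected.append("subproject-two")
    return selected
```

Under this model, a change to common/common/__init__.py selects both workflows, a change to subproject_two/Dockerfile selects only subproject-two, and a change outside the three directories selects nothing.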


Keep things DRY with the tooling available

YAML isn’t a programming language, but it is a declarative configuration language with some rarely explored advanced features. Some of my favorites are anchors, aliases, and merge keys. Combined, they allow us to author reusable snippets in our CircleCI template (and in most YAML documents in general):

common_settings: &common_settings
  executor:
    name: python/default
    tag: 3.10.8

subproject_one_common_settings: &subproject_one_common_settings
  working_directory: ~/myproject/subproject_one
  <<: *common_settings
  
...

jobs:
  subproject-one-validate:
    <<: *subproject_one_common_settings
    steps:
      - myproject-checkout
      - install-acme-cli
      - validate

So, if you have repeated snippets of orchestration (and you likely do, given you’re working in a monorepo), you can create a common block of configuration, anchor it, and then reference that anchor via aliases and merge keys. This lets you write it once and use it everywhere, DRYing up your configuration file.
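
As a rough mental model, a merge key behaves like a shallow dictionary merge in which keys written directly in the mapping win over the merged-in ones. The Python sketch below reproduces the anchors from the snippet above as plain dictionaries; the merge helper is hypothetical and only illustrates the resolution order, it is not how a YAML parser is implemented.

```python
def merge(base, overrides):
    """Model `<<: *base`: start from the anchored mapping, let local keys win."""
    return {**base, **overrides}


# &common_settings
common_settings = {
    "executor": {"name": "python/default", "tag": "3.10.8"},
}

# &subproject_one_common_settings, which merges in *common_settings
subproject_one_common_settings = merge(
    common_settings,
    {"working_directory": "~/myproject/subproject_one"},
)

# jobs.subproject-one-validate, which merges in *subproject_one_common_settings
subproject_one_validate = merge(
    subproject_one_common_settings,
    {"steps": ["myproject-checkout", "install-acme-cli", "validate"]},
)
```

The resolved job carries the executor, the working directory, and its own steps. Because the merge is shallow, overriding executor in a job replaces the whole nested mapping rather than a single field inside it.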

Use filters for branch-based logic

I am more familiar with GitHub Actions-style workflow triggers, which invoke particular workflows based on branch conditions. CircleCI offers similar functionality via filters.

For our example project, I wanted to create three different workflows based on branching:

  • First, for every pull request and merge, run some common tasks (such as validating the change has no syntax errors).
  • Second, when a change is merged to main, and has no git tag, deploy it to a staging environment.
  • Third, when a change is merged to prod and has a tag of the form v$.$.$ (such as v1.0.0), deploy it to the production environment.

In practice, this looks like:

stg-filters: &stg-filters
  filters:
    branches:
      only: main
    tags:
      ignore: /.*/

prod-filters: &prod-filters
  filters:
    branches:
      only: prod
    tags:
      only: /^v.*/

...

workflows:
  subproject-one:
    jobs:
      - subproject-one-validate
      - subproject-one-deploy-stg:
          requires:
            - subproject-one-validate
          <<: *stg-filters
      - subproject-one-deploy-prod:
          requires:
            - subproject-one-validate
          <<: *prod-filters

Combined with the aforementioned anchoring, aliasing, and merge keys, we can compose a common set of branch-based rules to use in our workflows for each subproject included in our monorepo.
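
Note that the /^v.*/ pattern accepts any tag that begins with v, which is looser than the v$.$.$ form described above. If a stricter check is wanted, one option is to validate the tag inside a script step. The sketch below is an illustration under assumptions, not part of the original setup: the function names are hypothetical, and it assumes the tag would be read from CircleCI's CIRCLE_TAG environment variable and that filter patterns must match the whole tag.

```python
import re

# The pattern used in prod-filters: any tag starting with "v".
LOOSE_TAG = re.compile(r"^v.*")

# A stricter pattern for tags like v1.0.0 (hypothetical, not in the original config).
RELEASE_TAG = re.compile(r"v\d+\.\d+\.\d+")


def passes_prod_filter(tag):
    """True when the tag would satisfy the /^v.*/ filter (assumed full match)."""
    return LOOSE_TAG.fullmatch(tag) is not None


def is_release_tag(tag):
    """True only for tags of the form v<major>.<minor>.<patch>."""
    return RELEASE_TAG.fullmatch(tag) is not None


# In a CI step the tag would typically come from the environment, for example:
#   tag = os.environ.get("CIRCLE_TAG", "")
```

A tag such as vnext passes the loose filter but fails the stricter check, so a deploy script could exit early before touching production.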

Don’t be afraid to offload complex logic into scripts

If you’re struggling to fit a complicated step into your job or workflow declarations, offload that logic into a script. This can be authored with bash, or even your favorite programming language, for example:

#!/usr/bin/env python

import os

NAME = os.environ["NAME"]

print(f"Hello, {NAME}!")


For my purposes, this was helpful for orchestrating a sequence of steps that required an API client, since my deployment target did not have a CircleCI orb available. I know I would rather debug a Python script than a cobbled-together pile of bash in a CI configuration file when it (inevitably) breaks.
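
A sequence of steps like that might be structured as in the sketch below. It is a minimal outline under assumptions: run_steps and the step names in the usage comment are hypothetical, and the real steps would call whatever API client the deployment target provides. The point is that the script stops at the first failure and reports a non-zero exit code, which is what causes the CI step to be marked as failed.

```python
import sys


def run_steps(steps):
    """Run (name, callable) pairs in order and stop at the first failure.

    Returns 0 when every step succeeds and 1 otherwise, so the result can be
    passed straight to sys.exit().
    """
    for name, step in steps:
        print(f"--> {name}")
        try:
            step()
        except Exception as error:
            print(f"step '{name}' failed: {error}", file=sys.stderr)
            return 1
    return 0


# Example usage from a script entry point (step functions are hypothetical):
#   sys.exit(run_steps([
#       ("build image", build_image),
#       ("push image", push_image),
#       ("trigger deployment", trigger_deployment),
#   ]))
```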

RTFM!

This sounds naive, but consulting the official documentation for a CircleCI configuration file proved to be the best source of information while exploring the tools available. Further, it informed me of what options were available to me and provided brief examples for their implementation.

Googling for answers tended to lead to outdated community answers. And using ChatGPT for CircleCI was often flat-out wrong. So, in this instance, doing things the old-fashioned way paid the most dividends.

If you’ve made it to the end of this post, you’ve either (hopefully) added new tools to your toolbox or (unfortunately) continued to search for answers.

Feel free to reach out to mavrick.laakso@testdouble.com in either case with feedback, praise, or condemnation (maybe you really like CircleCI - no judgement!). Until next time.

For my current engagement, I was tasked with setting up a new repository and change management processes to support an enterprise-ready data engineering and machine learning platform. My client has dozens of repositories successfully validating pull requests, promoting changes, and orchestrating deployments into their respective environments using CircleCI.
