Amid the rise of object-relational mapping (ORM) tools like ActiveRecord in the mid-aughts, many developers learned in a hurry what “N+1” queries were and word quickly spread on how to avoid the crippling performance problems they can cause.
Thanks to broader awareness, smarter API design, and tools that detect them for you, N+1 queries for selecting a batch of records are typically spotted and fixed pretty quickly these days.
And yet, when it comes to updating a batch of records, we’ve been slow to learn the same lesson. Most apps that use an ORM will naively find-or-insert-and-then-update each record individually, effectively executing a multiple of N+1 database queries. For context, I’m referring to code like this, which follows a pattern frequently found in Rails apps:
category_names.map do |name|
Category.find_or_initialize_by(name: name).tap do |c|
c.update!(last_fetched_at: Time.zone.now)
end
end
As we’ll see momentarily, this approach will result in a lot of unnecessary round trips to the database.
Inserting and updating records in bulk
So, if the naive approach of individually finding-or-creating and then updating each item is slow, what would be faster?
Sadly, prior to Rails 6, ActiveRecord itself didn’t offer any real alternatives short of dropping down to raw SQL (Rails has long had an update_all method, but it merely updates every record in a relation with an identical set of attributes). As usual, our community stepped up to fill the gap with projects like our longtime friend & colleague Zach Dennis’s activerecord-import and Seamus Abshere’s upsert gem.
What libraries like these have always been able to do is batch up multiple updates into a few queries: one to figure out which records already exist, one to insert new records, and one to update the existing ones. But, we live in exciting times! Recent versions of many databases have distilled this very common dance of find-or-insert-and-then-update into a single command, colloquially referred to as an “upsert”. In Postgres, this is exposed via an ON CONFLICT clause that can be added to any INSERT
command.
Enter upsert_all
Given all this, I was surprised and delighted to find—buried in the Rails 6 release notes—a new upsert_all method that provides a faster way to update a large number of records.
The class method is called with an array of model attribute hashes and usually (at least for Postgres & SQLite) a unique_by
option for specifying how to discern new records from existing ones. The brief example above could be rewritten using upsert_all
like so:
Category.upsert_all(category_names.map { |name|
{name: name, last_fetched_at: Time.zone.now}
}, unique_by: [:name])
How does it perform?
To serve as a real-ish example of using upsert_all
, I threw together this little demo app that you can pull down and play with. It simulates a simple batch process for downloading-and-upserting the charitable organizations in a given city, using the orgHunter Charity API.
In addition to Category
above, you can read its code for upserting a Charity
model as well:
def upsert_charities(city, charities_json, categories)
charity_attrs = charities_json.map { |charity|
{
city_id: city.id,
category_id: categories.find { |c| c.name == charity["category"] }&.id,
name: charity["charityName"],
ein: charity["ein"],
mission_statement: charity["missionStatement"],
zip_code: charity["zipCode"],
accepting_donations: charity["acceptingDonations"] == 1,
org_hunter_url: charity["url"],
donation_url: charity["donationUrl"],
web_site_url: charity["website"],
last_fetched_at: Time.zone.now
}
}
Charity.upsert_all(charity_attrs, unique_by: [:city_id, :ein])
end
In addition to inspecting the upsert code, the repo also provides an analogous class that implements a naive find-or-insert-and-then-update approach, as well as a script that uses query_diet to compare the query count and time of each approach:
$ ./script/compare_update_and_upsert_all
Downloading charities in Columbus, OH
Found 9025 charities on orghunter.com
Run #1: Inserting charities with upsert_all:
Upserting 27 categories
Upserting 9025 charities
Charities and categories upserted in 11 queries 567ms
Run #2: Updating charities with upsert_all:
Upserting 27 categories
Upserting 9025 charities
Charities and categories upserted in 3 queries 404ms
Run #3: Inserting charities by finding and updating each one:
Finding and updating 27 categories
Updating 9025 charities
Charities and categories updated in 18104 queries 11344ms
Run #4: Updating charities by finding and updating each one:
Finding and updating 27 categories
Updating 9025 charities
Charities and categories updated in 27129 queries 17150ms
As you can see, 14 queries in 0.97 seconds seems an awful lot better than 45,233 queries in 28.5 seconds. And this is a pretty small dataset of less than 10,000 records—imagine if you had a daily batch process for updating hundreds of thousands or millions of records.
If you’ve got some time, maybe you can think of something cool to build with this charity API—feel free to use this repository as a starting point!
Notes & caveats
There are a few prerequisites and limitations to consider with upsert_all
:
- Whatever attribute or attributes uniquely identify each record (e.g. an ID from a third-party API or a combination of associated foreign keys), a unique index is required in order to use
upsert_all
- ActiveRecord validations & callbacks are not called (adding to the long list of reasons not to rely too heavily on either)
- Timestamp columns
created_at
/updated_at
are not handled automatically, but (in recent versions of ActiveRecord) are both non-nullable by default. As a result, setting a default value of the current time (e.g.t.timestamps default: -> { "CURRENT_TIMESTAMP" }
in Postgres) might simplify things
Let’s make upserts more of a thing
ORMs like ActiveRecord get a bad rap when they make it easy for developers to write code that does the wrong thing, so we should applaud Rails core and its contributors for working to make it easier to perform complex tasks efficiently while retaining such a simple API.
Since ActiveRecord’s feature set is pretty mature at this point, new features like this one often go unnoticed by the majority of developers—so let’s spread the word!
Right now, the economic freeze is resulting in some teams having more time than usual to pause and consider important-but-non-urgent facets of the systems they maintain; things like performance, testing, and accessibility. If you find yourself with some time to devote to refactoring, you can start taking advantage of this feature by searching for calls to find_or_initialize_by
or find_or_create_by
and consider if they might be candidates for replacing with a more performant call to upsert_all
.
And if you think your team could benefit from some outside help identifying or implementing improvements and upgrades to your application, consider reaching out to us—helping you is what we’re here for!