Recently, a situation came up where some alphanumeric data I was working with needed to be sorted in a way that felt more “natural.” Normally, when you’re sorting an alphanumeric list of data, it sorts by character code points. For example,
BAC012
ABC102
CAB210
ABC103
gets sorted as
ABC102
ABC103
BAC012
CAB210
That seems right … so what do I mean by “natural?” The “natural” descriptor is meant to provide an alternative label for how the numeric part of the alphanumeric data gets ordered. When we sort numbers, it’s sometimes awkward to sort them by character code points.
101
11
2
2003
31
That ordering isn’t wrong; but it’s usually not what we want. When sorting numbers, it’s more “natural” to sort numbers as numbers.
2
11
31
101
2003
Naturally, numbers, by default, get treated as numbers and sorted from low to high value, but when they are specifically part of an alphanumeric value, they’re sorted according to the code point, or value tied to that character.
Postgres to the rescue
With that understanding of “natural,” the problem I was working to solve led me to learning that you can create your own collation objects in Postgres. Within that section of the documentation, I found a nifty example provided.
Numeric ordering, sorts sequences of digits by their numeric value, for example: A-21 < A-123 (also known as natural sort).
CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');
CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');
Apparently this was a common enough desire, that they simply give an example in the docs. It seemed like exactly what I was looking for; the only problem was that I didn’t really know what anything after numeric
meant in the CREATE COLLATION
clause. And I wanted to understand it!
This is where the small task of simply creating and using the collation became a rabbit hole to Wonderland.
Down the rabbit hole of Postgres docs
To get a better understanding, I read all the previous context in the Postgres docs and I followed the links about Unicode to discover much more than I expected. One example: I thought I knew what a collation is for, but if asked, I wouldn’t have had an answer. Now I do.
In case you’re like me and collations are a fuzzy concept, hopefully I can explain by simply saying it’s used for determining the sort order of a character set. Before, I think I saw collations as something that defined a character set. But according to Unicode, it’s specifically for defining how characters in a character set should get ordered. This sounds perfect if I want to treat numbers different from non-numbers in an alphanumeric string.
ICU but do UC me?
Now that I have that title out of my system, we can start breaking down what the CREATE COLLATION
clause is even saying.
But! Before I could look more into what ICU is, my brain needed to know if it was an acronym. So, for anyone with a brain like mine, I eventually found it (after more digging than I ever expected it to require). ICU is an acronym for “International Components for Unicode.”.
With that out of the way, the ICU provider
gives context about how to interpret the locale
value given. It’s a common Unicode syntax, and it’s also helpful for defining collations that are more than just language+country. There’s a lot about ICU, but for our purposes, the locale
value is where I’ll shift focus.
Nested rabbit holes
Both of the CREATE COLLATION
clauses use the same provider, but have two separate locale values that allegedly do the same thing: en-u-kn-true
and en@colNumeric=yes
. The Postgres docs mention that en-u-kn-true
is a “’language tag’ per BCP 47” while 'en@colNumeric=yes'
is the “traditional ICU-specific locale syntax.” They also mention that the language tag is preferred but isn’t supported by older ICU versions. To be entirely honest, I didn’t look further into it. If you choose the one that makes the most sense to you right now, it might be fine. YMMV.
With that said, it’s possible you’re thinking what I was: what does that mean?
[Note: The separator can also be an underscore for backward compatibility reasons only. A dash is the preferred approach if your database supports it.]
en-u-kn-true
The en-u-kn-true
value is known as a Unicode locale identifier. Each section is preferably separated by a -
.
-
en
: The first part of the string refers to the language this locale string is being created for. If you wanted a region, you could doen_US
oren_GB
, buten
is also acceptable (case also seems to matter). -
u
: Figuring this out took some effort. Despite there being a lot written, so much of it had implicit assumptions. Eventually I tracked it down here and subsequently RFC6067 to determine that it is a Unicode locale language tag extension. By itself, it doesn’t mean much. -
kn
: With theu-
prefix before it, I looked at more extensions under that BCP 47 specification and foundkn
to be the Unicode locale extension for numeric ordering here. -
true
: This last part simply enables thekn
extension to treat any sequence of decimal digits as digits for sorting numerically.
Overall, the u-kn-true
is the key section in the language tag that enables a numeric ordering within alphanumeric strings. For the second string, I’d assume it’s just a different way to say the same thing. (Spoiler alert: it is.)
en@colNumeric=yes
-
en
: This is the same as above. -
colNumeric=yes
: The best place I’ve found for finding this information is in the ICU GitHub repository. Within that file, the available values are broken down along with a small note about what each do. I tried a few sources, but nothing I found went over the traditional ICU-specific syntax as clearly. Per this page, the@
performs a similar duty tou
in the BCP 47 language tag by specifying what follows is an extension.
Information for determining what is right is sadly not cohesive. You really have to know what you’re looking for first. So if you’re needing to create a collation with different requirements, it may take some time to track down exactly what you want. Hopefully, I’ll have saved you some time.
Applying the collation
Now for the straightforward part! Once I had the collation created, I applied it to the columns I wanted with:
ALTER TABLE my_table ALTER COLUMN my_column TYPE character varying(255) COLLATE numeric
Originally, I assumed I would need to group numeric values together and sort more manually, but thankfully Postgres is exceptional and manages this ordering..