Bring order to Python data chaos
In the real world, data is messy. Required information is left out, a number is typed with an O instead of a zero, and don't even get me started on date formatting. Python has a reputation as being the language for wrangling info; however, how can we protect our apps from all that data debris?
Pydantic is a library for modeling data, but that simple task belies its power to bring order to the data chaos. If you've ever asked yourself, "what's in this dictionary again," this talk is for you. Validation, serialization, mapping, conversions … we'll cover a variety of ways to wrangle your bits and bytes into data zen.
Kyle Adams delivered this talk at the IndyPy November 2025 meetup, hosted by Six Feet Up.
More Pydantic for Python resources
We also have a Pydantic for Python blog series unfolding:
- A beginner's guide to Pydantic for Python type safety
- Seamlessly handle non-pythonic naming conventions
- Normalize legacy data
- Declare rich validation rules
- Field report in progress: Build shareable domain types
- Field report in progress: Add your custom logic
- Field report in progress: Apply alternative validation rules
- Field report in progress: Validate your app configuration
- Field report in progress: Put it all together with a FHIR example
Transcript
0:01
[Music] Hi, I'm Kyle Adams, and tonight
0:09
we're going to talk about Pydantically perfect in every way. This presentation is going to be a deep dive into
0:16
using Pydantic, which is a speedy data modeling and validation library, to
0:22
wrangle ornery data. Now, first I want to give you a little
0:28
bit about me. I uh I'm a staff software consultant at a company called Test Double
0:35
and uh we're a software consultancy based in Columbus, Ohio.
0:40
And even though we're based in Ohio, we have
0:46
remote workers across the US and Canada. Now, we're known primarily for Ruby and
0:53
JavaScript, but we actually work across a wide swath of tech, including Python,
0:59
which is why we're here tonight. So,
1:05
our story starts as you sit down at your keyboard to
1:13
work on your next task, and you take a look at it. Doesn't look too bad. You're going to be pulling in
1:20
patient data and it's probably going to look something like this. So, we've got an ID, first name, then
1:28
another record with an ID, the first name. Pretty sensible, straightforward,
1:34
short and sweet. Now, let's open the attached example to see what it actually looks like.
1:43
Well, we've got a problem. We've got Pascal snakes uh as I refer to this uh
1:51
casing scheme. Uh we've got nested data. We've got a
1:56
value that is nested within an object that is nested within a list instead of just being directly
2:04
uh associated with the ID key. And we've got whatever's happening here where
2:10
we've had an ID that's in one place and then in another place it's called MRN.
2:16
So we've got some data cats to herd. Fortunately, we have a cat and his name
2:24
is Pandandy. Uh now some of you may be wondering
2:32
uh and so I want to take a moment uh before we get into the really technical advanced wizardry stuff
2:39
to give a brief overview of what Pydantic is
2:45
and there may be people who are asking what is this Pydantic library? Well, as
2:51
I've mentioned, uh, Pydantic is a Python library for data modeling and
2:57
validation. But let's take a look at a basic example to see that in action.
3:05
So the first thing uh that we have is we have to define our model.
3:11
The first thing we're going to do is define our model. Now I feel like I'm too close, but uh if if there was more
3:18
feedback from the shot, let me know. So we've we've defined a a name, which is a
3:24
string, a breed, which can either be a Labrador or a Chihuahua, and an ID,
3:32
which is an integer. The next thing we do is we load our data. And we're doing that here using
3:40
model_validate, and that's going to both parse your data
3:46
and validate it. Now, there are actually two ways to load data into a Pydantic model depending on
3:54
the context. The first way is with a function and that's what we've seen already in this
4:00
example because this is a a good way to load your data in if you don't control
4:05
your data. For example, if you're getting it from a third party API.
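A minimal sketch of the kind of model and load step described here, assuming Pydantic v2. The slide itself is not part of the transcript, so the `Dog` class name and the sample record are illustrative guesses.

```python
from typing import Literal

from pydantic import BaseModel


class Dog(BaseModel):
    name: str                                # any string
    breed: Literal["Labrador", "Chihuahua"]  # only these two values are accepted
    id: int


# Hypothetical payload, e.g. the decoded body of a third-party API response.
raw = {"name": "Rex", "breed": "Labrador", "id": 1}

# model_validate parses the dict and validates every field in one step.
dog = Dog.model_validate(raw)
print(dog)
```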
4:12
Now the second way is with the constructor and keyword args, and you see this a lot
4:19
of times in blog posts about Pydantic or uh documentation.
4:26
This is a really good way to go if you control the data. Another good example would be for tests. Once we've
4:34
loaded our data in, then we can use it to do whatever it is that we want to do for either our uh you know, maybe we've
4:42
got a uh brilliant money-making idea, but in this case, it's moving healthcare
4:47
data around. So once we get the data into the model,
4:54
importantly, we can also get it back out of the model using model_dump, and that
4:59
will serialize it to either a Python dictionary uh or a JSON string.
5:07
And the last thing I want to highlight is that we're protected. So if we try to pass in bad data and in this case I try
5:14
to pass in a dog whose name is Spot, has a breed of Boxer, and an ID of two,
5:23
then it's going to throw a validation error
5:28
and I want to go through that validation error a little bit because there's some good info in here.
5:33
So the first thing is it tells me how many validation errors there were. That's really important when you're
5:40
dealing with a complex object with a lot of data in it. Uh our example is pretty
5:46
simple. It's pretty straightforward to see where the error is at. But when you have large objects, that can be a lot
5:53
harder thing to track down. Not just where it is, but how many because you might have multiple validation errors
5:59
too. So the next thing is it shows me
6:07
exactly the the the kind of information that I need to see in order to debug
6:13
where my validation error is at. It tells me what field is problematic. In
6:18
this case, it's breed. Tells me what the input should be. In this case, I'm
6:24
limited to Labrador or Chihuahua. and it tells me what I actually sent in
6:29
which was boxer. So that's my error.
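The error walkthrough can be reproduced with a sketch like the following, assuming Pydantic v2 and the same hypothetical `Dog` model. `ValidationError.error_count()` and `.errors()` expose the count and the per-field details that the printed message summarizes.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class Dog(BaseModel):
    name: str
    breed: Literal["Labrador", "Chihuahua"]
    id: int


try:
    # "Boxer" is not one of the allowed breeds, so validation fails.
    Dog.model_validate({"name": "Spot", "breed": "Boxer", "id": 2})
except ValidationError as exc:
    error_count = exc.error_count()  # how many fields failed
    errors = exc.errors()            # one dict per failure
    for err in errors:
        # loc = which field, msg = what was expected, input = what was sent
        print(err["loc"], err["msg"], err["input"])
```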
6:35
Now when we were playing around with Pydantic the first time um as we were learning it
6:42
this was kind of a light bulb moment for us because we realized in Python
6:50
type hints give you build-time type enforcement
6:55
but paired with Pydantic, that build-time type enforcement now becomes runtime
7:03
type enforcement. And that was a like we didn't realize it at the time, but that became a very
7:11
helpful thing when dealing with APIs that would spew a lot of really messy data to have that kind of runtime data
7:18
enforcement uh data typing enforcement in place. So now you all have graduated Pydantic
7:26
101. Don't expect any diplomas in the mail. Uh, so we're going to move on to
7:31
the more advanced wizardry. Back to our problems.
7:38
We're going to take them one at a time. We're going to start with Pascal snakes.
7:45
So, Pydantic's solution to this is something called aliases.
7:51
Let's take a look at what we
7:58
might do if we didn't know about aliases. So we could just mimic the same casing
8:07
structure in our model. Why would this be bad? Well, we're going
8:14
to see this. This is going to be a recurring theme tonight. It's allowing the complexity from the data to
8:20
seep into our code. And then so that means any code that interacts with our patient is also going to have to use
8:26
these kind of non-Pythonic uh uh naming conventions
8:32
case conventions. Aliases on the other hand
8:38
they let us confine the external complexity to the Pydantic layer.
8:44
Here we're using the Pydantic Field to define the alias as part of the
8:53
attribute's metadata. These aliases are alternative names
8:58
available to Pydantic when validating or serializing.
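A minimal sketch of field-level aliases, assuming Pydantic v2. The key names `ID` and `First_Name` and the flat record are assumptions made for illustration; the talk's real payload is not shown in the transcript.

```python
from pydantic import BaseModel, Field


class Patient(BaseModel):
    # Pythonic attribute names inside the code, external names as aliases.
    id: int = Field(alias="ID")
    first_name: str = Field(alias="First_Name")


# Hypothetical flat record using the external casing.
patient = Patient.model_validate({"ID": 1, "First_Name": "Jim"})
print(patient.first_name)
```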
9:04
The problem with a field-level alias is that in our example we only have two
9:13
fields. It's very easy to do, but when you have 20 different models
9:18
with 10 to 15 fields each, it gets to be a lot of typing. So, there's it'd be
9:25
really nice if there were a way to automate all of this typing. Of course,
9:30
Pydantic has a way to automate all of that. Uh, and it's called alias generators.
9:36
So our third pass here is going to be using an alias generator.
9:44
Every Pydantic model has a model_config attribute in it. And this model_config
9:50
attribute contains kind of the default configuration if you don't touch it at all. But we can override that uh with
9:57
the ConfigDict object, and we can pass in an alias generator.
10:04
uh and this alias generator will be used to create aliases for every field in the
10:11
model. Now, Pydantic also offers a few
10:16
out-of-the-box alias generators for our convenience, including the to_camel
10:22
function that we see here which transforms the names into camel case.
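A sketch of a model-wide alias generator, assuming Pydantic v2. `to_camel` ships in `pydantic.alias_generators`; the `Patient` fields and the sample record are illustrative.

```python
from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel


class Patient(BaseModel):
    # Every field gets an alias computed from its name: first_name -> firstName.
    model_config = ConfigDict(alias_generator=to_camel)

    id: int
    first_name: str


# Hypothetical camelCase record.
patient = Patient.model_validate({"id": 1, "firstName": "Jim"})
```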
10:28
However, our actual data isn't in camel case.
10:36
It's in Pascal snake case. So, what are we going to do about that?
10:41
Well, it turns out we can also create our own alias generators.
10:46
Alias generators are just functions that take in strings and return other strings. So here we're taking in our
10:55
snake case field name and we're converting it to a Pascal
11:01
snake case uh field alias.
11:07
Oops. Uh so once the custom alias generator is
11:13
ready to go, we can pass it into our alias_generator setting.
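One way such a custom generator might look, assuming Pydantic v2. The function name `to_pascal_snake` is a guess at the spelling used on the slide, and the `Id` / `First_Name` keys are hypothetical.

```python
from pydantic import BaseModel, ConfigDict


def to_pascal_snake(field_name: str) -> str:
    """Convert snake_case to Pascal_Snake_Case, e.g. first_name -> First_Name."""
    return "_".join(part.capitalize() for part in field_name.split("_"))


class Patient(BaseModel):
    # An alias generator is just a callable from str to str.
    model_config = ConfigDict(alias_generator=to_pascal_snake)

    id: int
    first_name: str


# Hypothetical record in Pascal snake case.
patient = Patient.model_validate({"Id": 1, "First_Name": "Jim"})
```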
11:20
Now, aliases are awesome. We should definitely do more of them,
11:27
but there are some gotchas to know about them.
11:33
So, Pydantic's default behavior with regards to aliases is inconsistent
11:39
across validation, which is when you're reading data in, and serialization, which is when you're dumping the data
11:45
back out again. When we're doing validation, Pydantic
11:50
prefers the field alias. But when you're doing serialization or
11:57
writing out, it wants to use the field name. I'm sorry, I keep I'm trying to
12:04
hold this at just the right place. Um so the first trap happens when
12:11
validating where Pydantic's default is to use the field alias
12:17
and that trap is constructing. Uh remember when we talked about how we could load data into a model with
12:24
keyword arguments in the constructor.
12:30
Anyone want to venture a guess as to what's going to happen here where we have our to_pascal_snake alias
12:38
generator and we're trying to construct a patient using snake case in the
12:43
constructor. If you guess validation error, pat
12:48
yourself on the back and let's take a look at what's going on here.
12:57
So we have
13:03
uh our input is using snake case
13:09
but Pydantic is expecting Pascal snake case. So it's saying, hey,
13:18
first name is required and you didn't give it to me. We gave it to them. We just used the
13:24
wrong case for it. So how do we fix this?
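The trap can be reproduced with a sketch like this, assuming Pydantic v2 defaults (validation by alias only). `to_pascal_snake` is a hypothetical reconstruction of the talk's helper.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


def to_pascal_snake(field_name: str) -> str:
    """first_name -> First_Name (assumed behavior of the talk's helper)."""
    return "_".join(part.capitalize() for part in field_name.split("_"))


class Patient(BaseModel):
    model_config = ConfigDict(alias_generator=to_pascal_snake)

    id: int
    first_name: str


try:
    # Field names are not accepted by default once aliases exist,
    # so both fields are reported as missing under their aliases.
    Patient(id=1, first_name="Jim")
except ValidationError as exc:
    error_types = [err["type"] for err in exc.errors()]
    error_locs = [err["loc"] for err in exc.errors()]
    print(exc)
```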
13:31
We can switch the validation behavior to use field names by default rather than the aliases by setting
13:40
validate_by_name in the model_config. The trade-off is that anytime we want to
13:46
validate by alias, we now have to explicitly set it using that by_alias
13:53
setting. So as you can see
14:00
um we'll need to anytime that we want to do
14:08
a model_validate, we now need to explicitly set by_alias. So there's another way, and that's
14:16
that we can say, okay, we want to validate by name or by alias, by setting those both
14:21
to true and the trade-off here
14:26
is that uh this is a great option if you don't mind your validation being a little bit
14:34
less strict. Uh, and what Pydantic does
14:39
is it's checking both the names and the aliases
14:45
to see if they exist and if they are set to a valid value. If they do and
14:50
they're set to a valid value, it'll use that value.
14:55
Now, bit of foreshadowing here, but I want to remind you that when serializing, Pydantic uses the field
15:02
name. The second trap is when we try to take the data in our model and serialize it
15:09
back to another format like a Python dictionary or a JSON string. Here's what
15:14
that behavior looks like in the code. Now, using a different case scheme when
15:21
dumping out could be a problem if we have downstream systems that still
15:26
expected to be in Pascal snake case. We need to override Pydantic's default
15:33
behavior when doing that model_dump. We can do that by setting a by_alias
15:39
argument. Uh we can do that at the function level or as we've seen uh previously we can
15:48
set it at the model level by setting this serialize_by_alias to True.
15:57
So now we can safely serialize and get the results that we want with the Pascal snake case in the output
16:04
um and avoid that that gotcha. Now for our next spell, we're going to
16:12
look at nested data. So what do we do when the data that we
16:18
want isn't at the level in the data structure that we would like it at?
16:24
Pydantic's solution should look familiar, as alias paths are adjacent to aliases.
16:31
As with our Pascal snakes issue, a first pass might
16:37
be to represent the nested values.
16:42
So here we have Jim nested inside of value, nested inside of a list, nested inside of first name, as nested models.
16:50
So here now we're um nesting our name inside of a value model which gets
16:57
nested inside of a list which gets nested inside first name which gets nested inside of patient.
17:04
Now I will admit there's a certain amount of simplicity to this approach
17:10
but let's look at what happens when we try to access our patient’s first name.
17:16
We have this long value
17:23
where, instead of just being patient.first_name, it's patient.first_name[0].value.
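A sketch of the nested-model approach and the awkward access it produces, assuming Pydantic v2. The `ValueItem` class name and the lowercase keys are illustrative simplifications.

```python
from pydantic import BaseModel


class ValueItem(BaseModel):
    value: str


class Patient(BaseModel):
    # Mirrors the payload shape: a list of objects that each hold a value.
    first_name: list[ValueItem]


# Hypothetical nested record.
patient = Patient.model_validate({"first_name": [{"value": "Jim"}]})

# Every caller has to know about the list and the wrapper object.
first_name = patient.first_name[0].value
```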
17:31
So alias paths let us dig into our data structure and pluck out just the values
17:36
that we want. Let's take a closer look at how that works. We define a path that navigates from the
17:43
root of the object down to the data that we want. In this case, we need to go down to the first name field, then to
17:50
index zero in our list, and then into the value field.
17:56
Once we have our path, we can pass that path into the AliasPath constructor.
18:01
Two notes here. For this example, I've defined the path and the alias path separately
18:08
so I could fit them onto one slide. But in real world code, you'd almost always
18:14
just define the path in line. For my second note, some of you may be
18:19
wondering why we're using validation_alias here rather than alias.
18:24
Unfortunately, as you get deeper into more advanced alias features, you lose the ability to be bidirectional. What I
18:31
mean by bidirectional is that when you read data in, it does the same things as when you write data out.
18:38
So in this case, we can no longer serialize our first name field back out to a nested path. So if I do
18:45
this model_dump, it's just going to put it in first_name: Jim. Consequently, that's why we're using validation_alias
18:51
here: AliasPath only works on input,
18:58
which is validation, and it doesn't apply to output, which is serialization.
19:04
So back to our updated code, how does it look to access a first name now that
19:12
we're using AliasPath? Much better. Now it's just the patient.first_name that we expected.
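A minimal sketch of an alias path, assuming Pydantic v2. The `First_Name` and `Value` key names are assumptions about the payload; `AliasPath` and `validation_alias` are the real Pydantic names.

```python
from pydantic import AliasPath, BaseModel, Field


class Patient(BaseModel):
    # Walk First_Name -> index 0 -> Value while validating.
    first_name: str = Field(validation_alias=AliasPath("First_Name", 0, "Value"))


# Hypothetical nested record.
patient = Patient.model_validate({"First_Name": [{"Value": "Jim"}]})

flat_name = patient.first_name  # no list or wrapper object in sight
dumped = patient.model_dump()   # output stays flat; the path is input-only
```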
19:20
All right, we are at our final uh incantation here and this one is going
19:26
to deal with the discrepancy between ID in our first patient and MRN in our
19:33
second patient. And I will say MRN here is a medical record number. It's a common identifier in healthcare systems.
19:41
Uh so that's what MRN is and it's time to solve the problem of
19:49
multiple paths to the data that we want. How do we tell Pydantic about
19:54
these multiple paths? Well, we're going to use AliasChoices, often in
19:59
conjunction with AliasPath. We could try to deal with the problem by creating two different models.
20:08
The first model would deal with uh the ID that's deeply nested
20:14
and the second would handle our MRN model.
20:21
The problem here is the same as with our other examples. We're letting the complexity of the data structure seep
20:28
into our code. And so now every bit of code that deals with the patient has to know, does this patient have an MRN? Am
20:34
I checking the MRN for the ID? or does it have a deeply nested ID?
20:40
So instead, let's write a single model and let alias choices abstract away that
20:46
complexity. Here we can see that alias choices lets us specify multiple paths at which we
20:53
might find the user's ID. In this case, either under an MRN attribute or
20:59
following an alias path to a more deeply nested number. And again, I've split alias choices onto
21:07
its own line for brevity's sake. However, you'd likely inline it into that field definition in the real code.
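A sketch of alias choices, assuming Pydantic v2. The two record shapes (`MRN` at the top level, or `ID` holding a list of `Value` objects) are guesses based on the description in the talk.

```python
from pydantic import AliasChoices, AliasPath, BaseModel, Field


class Patient(BaseModel):
    # Try MRN first, then fall back to the deeply nested ID.
    id: int = Field(
        validation_alias=AliasChoices("MRN", AliasPath("ID", 0, "Value"))
    )


# Two hypothetical records that carry the identifier in different places.
from_nested_id = Patient.model_validate({"ID": [{"Value": 1}]})
from_mrn = Patient.model_validate({"MRN": 2})
```

Client code reads `patient.id` in both cases and never needs to know which source supplied it.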
21:16
So, now we've talked about how to address all of our problems.
21:22
We can kick back and relax, right? Well, what does the full solution look like,
21:27
though? So, let's step through this. uh we have all of our imports and we have a few
21:33
utility functions here. We've already talked about to Pascal case. So I just want to point out this gen
21:41
path function. Uh we have a pattern with our alias paths that you may have noticed.
21:48
Uh specifically we navigate a list and then we get the value uh from the first
21:53
item in that list for each of our attributes in our model. and gen path
21:59
helps dry up that pattern a little bit. Uh so looking at the model
22:06
in our model_config, we use to_pascal_snake to deal with making sure
22:13
that we can use nice Pythonic attribute names and we use alias path as generated by
22:21
our gen path function to give us access to the deeply nested data. And then
22:27
finally, we use alias choices to normalize across multiple paths access
22:34
to the user's ID. Now we're back to our problematic data
22:40
here. Let's see how it does when we run it through our new Pydantic model.
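A reconstruction of what the combined solution might look like, assuming Pydantic v2. The helper names `to_pascal_snake` and `gen_path`, the key names, and both sample records are assumptions; the actual slide code is not in the transcript.

```python
from pydantic import AliasChoices, AliasPath, BaseModel, ConfigDict, Field


def to_pascal_snake(field_name: str) -> str:
    """first_name -> First_Name."""
    return "_".join(part.capitalize() for part in field_name.split("_"))


def gen_path(key: str) -> AliasPath:
    """Path to the value of the first item in the list stored under key."""
    return AliasPath(key, 0, "Value")


class Patient(BaseModel):
    # The generator covers any field that has no explicit validation alias.
    model_config = ConfigDict(alias_generator=to_pascal_snake)

    id: int = Field(validation_alias=AliasChoices("MRN", gen_path("ID")))
    first_name: str = Field(validation_alias=gen_path("First_Name"))


# Hypothetical messy records: nested values, and ID vs. MRN.
records = [
    {"ID": [{"Value": 1}], "First_Name": [{"Value": "Jim"}]},
    {"MRN": 2, "First_Name": [{"Value": "Pam"}]},
]

patients = [Patient.model_validate(record) for record in records]
for patient in patients:
    print(patient.id, patient.first_name)
```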
22:47
Perfect. This is exactly what we want. We've abstracted away the complexity of
22:52
the third party API, the third party data schema, and now our client code
22:58
doesn't have to know anything about that. Just ask for the patient ID, gets the ID, asks for the first name, gets
23:05
the first name. So, we've hit the portion of my talk
23:10
here where I'm going to open up for questions. And while we're talking, I'm going to throw my contact info up there.
23:16
Uh if there are any questions that you guys have that we don't get to talk about, uh feel free to reach out to me
23:21
on LinkedIn or email me. I'd love to set up a Zoom call. Uh this is the the kind
23:27
of stuff that I love to talk about. Uh so any questions
23:34
actually here, I'll give you this mic and you can ask us everyone can hear it.
23:39
Have you created any Pydantic models for OWL files? For what kind of files? OWL, O-W-L.
23:46
they're uh I have not okay they're a transfer mechanism for medical terminology records from one
23:52
system to another. Yeah, that seemed relevant. That's interesting, because I have not created Pydantic models for OWL
23:58
files, but I've created lots of Pydantic models for FHIR, for HL7 FHIR stuff,
24:04
and that's another data transfer um schema that's common in the healthcare
24:09
world. And I will say, modeling FHIR
24:14
in pretty much anything is a lot of work. FHIR is a very big specification if you've ever dug into
24:20
it. All right, more questions.
24:26
Well, you all got more questions. Have there been downsides to
24:33
the static typing with Pydantic? Has it caused you to spend more time someplace, or has
24:41
it always been a total upside? Yeah, I I think probably the most difficult thing uh is the learning
24:48
curve. It's the human aspect of it. Um and it honestly it's it's less to do
24:54
with Pydantic and more to do with with uh like runtime enforcement of typing. Uh
24:59
like that's kind of a shift if you've spent a big portion of your career working in a very uh I I don't want to
25:06
say Python's loosely typed, because it's very complicated. It's strictly typed, but it's dynamically typed. And
25:12
that's a whole other conversation. Uh but yeah, it is most people come into
25:18
using Pydantic with uh a mindset uh that it's a bit of a shift uh and it can be a
25:25
bit jarring to see all these red squigglies show up in your code that weren't there before.
25:32
Other questions for Kyle? I will also say uh sorry to to build on that a little bit more. This talk
25:39
actually comes out of a series of of exercises that we created when we were
25:44
rolling Pydantic out at the healthcare organization that I was working at. Uh so this actually came out of that
25:51
challenge of uh getting people up to speed on Pydantic.
25:58
How many people in the room are using static typing and Pydantic?
26:04
Oh, so lots of people still to adopt. have a question.
26:12
My question is like I I use different integration products
26:17
like MuleSoft, Dell Boomi, and stuff like that. Like if I have to deal with the
26:23
data which comes from different source systems like you have all kinds of crazy data coming but we want a kind of a
26:30
single schema. Yeah. Like if I have to use this library
26:36
like am I running it in like a lambda like like how would I use this library
26:43
Python library on top of what my integration platform is to kind of validate the data
26:50
maybe you know unify it. Yeah. And put it somewhere else like you know CSV or JSON. Um
26:58
so what we did uh is to say okay we want to keep the
27:05
mess as close to the boundaries of our systems as possible and then this this
27:10
particular uh engagement um it was a collection of microservices that were all talking with each other. So we
27:17
wanted to keep when when messy data came in from all the different platforms that we were integrating with, we wanted to
27:24
clean up that data as soon as possible like right at that boundary. And we used
27:29
Pydantic to define a whole set of models
27:35
uh that then we packaged up and it became uh like a schema that could be
27:40
used across multiple applications. you could just in every application, every microservice that you spun up, you could
27:47
download this this one library and would define all of the essentially domain
27:52
models uh for the whole system. So patients, uh facilities, doctors, all of
28:00
those domain models were defined in this library and they were all defined in Pydantic. Uh and so then that gave us a
28:07
tool for any microservice, any integration I did: it could clean up
28:13
that data, put it into our internal uh schema, our internal domain models, and
28:19
then our internal systems kind of all worked seamlessly because they all had the same view. They all knew what a
28:26
doctor was. They all knew what an employee was. They all knew what all these domain models were. So does that
28:32
does that answer your question? Yeah. Any other questions or
28:42
so? Python for a bit now for the last couple of years is bringing in uh typing
28:47
into the language and we're seeing it more and more featurerich in that area
28:54
with 3.13 and 3.14. Yeah. Now rolling out. Do
29:00
you think Pydantic's uh significance will devolve now with the typing that's
29:08
coming into Python itself? Yeah. So I would say Pydantic is actually built to take really good advantage of
29:14
all of that new stuff. And that goes back to that those yellow slides where I talked about the the light bulb moment.
29:21
uh and that all of these new things that are being added by Python are all
29:26
build-time things, like you have to run mypy or Pyright to, you know, do static analysis on your code and flag anything
29:33
that might be a problem. Uh if you run it without doing that static analysis in production, Python's not going to
29:38
complain if you store an integer in a uh
29:44
variable with a string uh type on it. Uh and so what Pydantic does is it kind of
29:50
fills in that gap. Python's defining all of these build-time
29:57
uh type hints and features around those. Pydantic is taking those over and enforcing them at runtime.
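The gap described in this answer can be seen in a small sketch, assuming Pydantic v2. The variable and the `Person` model are illustrative.

```python
from pydantic import BaseModel, ValidationError

# A static checker such as mypy or Pyright flags this line,
# but plain Python executes it without complaint.
name: str = 42


class Person(BaseModel):
    name: str


rejected = False
try:
    # Pydantic enforces the same annotation when the code actually runs.
    Person(name=42)
except ValidationError:
    rejected = True
```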
30:07
There's still no performance improvements because you're using typing in your code like like compiled
30:13
languages like yeah it is there there's no performance improvements. What we did find and and
30:19
there are lots of great talks online about this uh we found it reduced the
30:25
number of bugs uh that we could run into and and one of the analogies I I've seen
30:32
that I love for this um unit testing and and various types of testing if you
30:37
think about your bugs, it's like a big circle, like a Venn diagram. Unit tests punch holes in that big
30:45
circle of all possible bugs So it looks like a little bit like Swiss cheese, but when you have uh um types that are
30:54
enforced at runtime, that like slices a whole slice out of that uh realm of that
31:00
circle of all the possible bugs. Uh so that you don't even have to worry about these kinds of problems.
31:06
Yeah. Whole whole classes of issues just go away. Yeah. Yep. Now, there are there are other issues. I don't want to I'm not
31:13
here to be a strong typing uh evangelist. Uh but we did find that in
31:20
our real world use. Excellent. Any other questions for Kyle?
31:26
Can you repeat repeat the question? Yeah, go ahead. Is there a
31:36
Oh, like a competitor. Is there a competitor to Pydantic? Yeah. Is there an open source library that does similar things to
31:42
Pydantic? There is. I think it's called Marshmallow. So we had initially
31:48
built our microservices on Flask, specifically using a library for Flask called Flask-RESTX that gave, like,
31:56
REST API capabilities to Flask. Flask-RESTX is currently rewriting because
32:02
they had a lot of these capabilities baked in. they realized they weren't able to keep up with the rate of
32:09
innovation. They were rewriting to um specifically to support Marshmallow, but
32:14
also with kind of second-class support for Pydantic. Now, I'm not sure if
32:21
they're maybe rethinking those plans because that was way back when Marshmallow and Pydantic were both getting started and Pydantic's kind of
32:27
come out on top. Um, but yeah, that's there are other libraries out there that
32:34
do this. Uh, they're just not they don't have the community that Pydantic has.
32:40
Awesome. There's any other questions? I think we're good. Let's give a big round of applause for Kyle. Oh, uh, one more thing here.
32:48
One more thing here. Uh I I want to leave you guys if if you leave here with only one thing in your head, I want that
32:54
to be: Pydantic aliases are useful and I should read through the docs. Or we
32:59
can simplify that to aliases are useful. Uh now still read the docs.
33:05
I will also say uh this is not the end. Um I mentioned that this came out of a
33:11
series of exercises we did and we actually turned them into a series of blog posts. So I I tell everyone, you
33:17
know, I don't expect you to remember anything that I said here tonight. Um, but what I do hope you remember is if
33:22
you forget, you remember that our blog is at testdouble.com/insights
33:28
and you can go read this Pydantically perfect blog series to find uh the the stuff I covered tonight is in the first
33:34
three blog posts. We actually have six more that are in the works right now. So there's going to be a lot of information
33:40
on kind of advanced Pydantic details out there. So, when you're dealing with gnarly data, come back to uh the
33:47
Test Double blog and figure out uh if we've got anything in there that might help
33:52
you. Excellent. Super awesome. All right, thank you very much. Let's give a big round of applause for Kyle.
33:59
[Music]
Kyle Adams is a staff software consultant at Test Double who lives for that light bulb moment when a solution falls perfectly in place or an idea takes root.