Pydantically perfect: Normalize legacy data in Python

Learn how to normalize inconsistent data structures in Python with Pydantic. The post guides you through different approaches and pitfalls, using Pydantic's alias path and alias choices features.
Gabriel Côté-Carrier
Kyle Adams

Welcome to Pydantically Perfect, the blog series where we explore how to solve data-related problems in Python using Pydantic, a feature-rich data validation library whose core validation engine is written in Rust. Whether you're a seasoned developer or just starting out, we hope to give you actionable insights you can apply right now to make your code more robust and reliable with stronger typing.

If you're a newcomer here, we encourage you to take a look at our first installment: Pydantically perfect: A beginner’s guide to Pydantic for Python type safety.

Where you are in the Pydantic for Python blog series:

  • A beginner's guide to Pydantic for Python type safety
  • Seamlessly handle non-pythonic naming conventions
  • You are here: Normalize legacy data in Python
  • Field report in progress: Declare rich validation rules
  • Field report in progress: Build shareable domain types
  • Field report in progress: Add your custom logic
  • Field report in progress: Apply alternative validation rules
  • Field report in progress: Validate your app configuration
  • Field report in progress: Put it all together with a FHIR example

The problem: inconsistent data

We're trying to parse error responses from a legacy system. The issue is that the response structure varies depending on the endpoint queried and where the error occurred internally. We can't predict in advance which structure we'll receive.

The inconsistent part for us is where the user ID lives, and we need that user ID to log the error properly in our systems.

There are three different formats:

1. The user ID is an attribute of the root.

2. The user ID is nested inside a user object.

3. The user ID is nested inside a list of user objects. It's unclear why, but there's always exactly one user object.

data_format_one = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "user_id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
}

data_format_two = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "user": {
        "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
    },
}

data_format_three = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "users": [
        {
            "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
        }
    ],
}

What are our goals here?

1. We want to handle this complexity exactly once rather than letting it spread in our application. Repeatedly checking for where the user ID is stored will muddle our business logic that should only care about the value of the user ID.

2. We want to lean on Pydantic to contain that complexity as much as possible. That will limit the amount of custom validation logic we’ll need to write without compromising on robustness or quality because Pydantic is a dedicated validation library.

We'll get there step by step, making things slightly better each time.

First step: mapping the models

We can start by creating the first version of our models in Pydantic. The three models below directly map the existing structures:

from datetime import datetime
from uuid import UUID
from pydantic import BaseModel

class ErrorFormatOne(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user_id: UUID

class NestedUser(BaseModel):
    id: UUID

class ErrorFormatTwo(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user: NestedUser

class ErrorFormatThree(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    users: list[NestedUser]


Notice how the models contain more advanced types like datetime, UUID, and other Pydantic models? Natively handling advanced types is one of Pydantic's biggest strengths, and we'll dive deeper into them in a later installment of the series.
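For instance, Pydantic coerces ISO 8601 timestamp strings and UUID strings into real datetime and UUID objects during validation. A quick sketch (the Stamp model here is hypothetical, just for illustration):

```python
from datetime import datetime
from uuid import UUID

from pydantic import BaseModel

class Stamp(BaseModel):
    at: datetime
    user_id: UUID

# Plain strings in, rich Python objects out
stamp = Stamp.model_validate({
    "at": "2025-09-08T15:16:03Z",
    "user_id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
})
# stamp.at is now a timezone-aware datetime; stamp.user_id is a UUID
```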

Why aren't these models satisfying yet?

They simply map the structure without any abstraction, which means the responsibility of handling the structure discrepancies would be forwarded to the rest of our code. That's exactly the opposite of what we want.

The rest of our code shouldn't have to check whether to use error_model.user_id, error_model.user.id, or error_model.users[0].id. Let's address this inconsistency in accessing the user ID.

The trap: combining everything as optional

From here, it could be tempting to directly combine the three models, but mark the different sources of user ID as optional to accommodate the fact that we'd only ever have one of them at once. It would look something like this:

from datetime import datetime
from uuid import UUID
from pydantic import BaseModel, Field

class NestedUser(BaseModel):
    id: UUID

class Error(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user_id: UUID | None = Field(default=None)
    user: NestedUser | None = Field(default=None)
    users: list[NestedUser] | None = Field(default=None)

# This data has no user ID and won't raise an error
invalid_data = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
}

Error.model_validate(invalid_data)
# Error(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)),
#     error_message="'NoneType' object has no attribute 'lower'",
#     error_type="AttributeError",
#     user_id=None,
#     user=None,
#     users=None,
# )


While the code does combine all structures into a single model, there are two main drawbacks:

  1. We're still not handling the different structure discrepancies. The main issue of muddling our business logic with the inconsistent ways of accessing the user ID is just as present as in the beginning and that would make the rest of our code unnecessarily complex.
  2. We've lost the validation that we have a user ID. As you can see in the invalid_data example, no validation error is raised to let us know that there’s no valid user ID. Our own custom logic would have to double-check that there is a valid user ID, which is the opposite of what we want to do by bringing in Pydantic validation. We want to lean on Pydantic’s tried and true validation process so we can focus on business logic instead.

Let's hold off on combining the models for now and focus on normalizing the structure.

Step two: flattening the models

Among Pydantic's alias features, AliasPath lets us flatten the structure. We provide the AliasPath with one or more keys, and Pydantic will follow those keys into the nested objects when populating an attribute.

Let's update the relevant error models to reach inside their nested structure and extract a user_id attribute directly. We'll also test the changed models with our example data:

from datetime import datetime
from uuid import UUID
from pydantic import AliasPath, BaseModel, Field

class ErrorFormatOne(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user_id: UUID 

class ErrorFormatTwo(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    # Equivalent to reaching for `input_data["user"]["id"]`
    user_id: UUID = Field(validation_alias=AliasPath("user", "id"))

class ErrorFormatThree(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    # Equivalent to reaching for `input_data["users"][0]["id"]`
    user_id: UUID = Field(validation_alias=AliasPath("users", 0, "id"))

data_format_two = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "user": {
        "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
    },
}

ErrorFormatTwo.model_validate(data_format_two)
# ErrorFormatTwo(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)),
#     error_message="'NoneType' object has no attribute 'lower'",
#     error_type='AttributeError',
#     user_id=UUID('e1c3cd56-ed1f-4291-9dea-fd54f9b379c2')
# )

data_format_three = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "users": [
        {
            "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
        }
    ],
}

ErrorFormatThree.model_validate(data_format_three)
# ErrorFormatThree(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)), 
#     error_message="'NoneType' object has no attribute 'lower'", 
#     error_type='AttributeError', 
#     user_id=UUID('e1c3cd56-ed1f-4291-9dea-fd54f9b379c2')
# )

Why use validation_alias rather than alias?

In our previous post, Pydantically perfect: Seamlessly handle non-Pythonic naming conventions, we used the alias argument because it assigns its value to both the validation and serialization aliases. However, the more advanced validation alias features don't conceptually make sense as serialization aliases. That's why alias only accepts string values, while validation_alias accepts richer alias types.
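Because a validation_alias only affects input, a field aliased this way still serializes under its real attribute name. A small sketch (Report is a hypothetical model):

```python
from uuid import UUID

from pydantic import AliasPath, BaseModel, Field

class Report(BaseModel):
    # The alias only applies when validating input data
    user_id: UUID = Field(validation_alias=AliasPath("user", "id"))

report = Report.model_validate({"user": {"id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2"}})
# On the way out, the field keeps its real name: no nested "user"
# object is reconstructed during serialization.
print(report.model_dump())
```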

With the changes above, all models' user IDs can now be accessed directly with error_model.user_id. 

We're not quite there yet, though, because there are still three different models. We want Pydantic to handle all of that complexity at once with a single model.

Step three: combining the models

The only complexity in combining all three models into one is how user_id is populated. All the other fields are the same.

We can lean into the other main Pydantic alias feature: AliasChoices. 

AliasChoices lets us provide a list of potential sources for the field value, and the first one to exist will be the one used in the validation process. The best part? It also accepts AliasPath values, so we can provide one option per format and Pydantic will handle it all:

from datetime import datetime
from uuid import UUID
from pydantic import AliasChoices, AliasPath, BaseModel, Field

class Error(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user_id: UUID = Field(
        validation_alias=AliasChoices(
            "user_id",
            AliasPath("user", "id"),
            AliasPath("users", 0, "id"),
        )
    )

data_format_one = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "user_id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
}

Error.model_validate(data_format_one)
# Error(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)),
#     error_message="'NoneType' object has no attribute 'lower'",
#     error_type='AttributeError',
#     user_id=UUID('e1c3cd56-ed1f-4291-9dea-fd54f9b379c2')
# )

data_format_two = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "user": {
        "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
    },
}

Error.model_validate(data_format_two)
# Error(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)),
#     error_message="'NoneType' object has no attribute 'lower'",
#     error_type="AttributeError",
#     user_id=UUID("e1c3cd56-ed1f-4291-9dea-fd54f9b379c2"),
# )

data_format_three = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "users": [
        {
            "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
        }
    ],
}

Error.model_validate(data_format_three)
# Error(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)),
#     error_message="'NoneType' object has no attribute 'lower'",
#     error_type="AttributeError",
#     user_id=UUID("e1c3cd56-ed1f-4291-9dea-fd54f9b379c2"),
# )

At this point, we have a single model that can accept and normalize all three different structures we can receive. The Pydantic engine will fully encapsulate this complexity for us and none of our code past this point will have to know about structural discrepancies. Mission accomplished!
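And unlike the all-optional trap from earlier, this combined model still fails loudly when no user ID shows up in any of the three locations (a quick check; the same Error model as above, repeated here to stay self-contained):

```python
from datetime import datetime
from uuid import UUID

from pydantic import AliasChoices, AliasPath, BaseModel, Field, ValidationError

class Error(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user_id: UUID = Field(
        validation_alias=AliasChoices(
            "user_id",
            AliasPath("user", "id"),
            AliasPath("users", 0, "id"),
        )
    )

# No user ID anywhere: validation fails instead of silently passing
invalid_data = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
}

try:
    Error.model_validate(invalid_data)
except ValidationError as exc:
    # Pydantic reports user_id as missing -- no manual double-check needed
    print(exc.errors()[0]["type"])
```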

Conclusion: what's next for the Pydantically Perfect series?

With this, we've expanded our coverage of alias features to include normalizing inconsistent legacy data. The next post will move away from aliases to take a deeper look into describing models with a rich set of validation rules and constraints strictly with Pydantic features.

Our goal isn't to go through all of Pydantic's features, but rather to provide a curated list of Pydantic features we found helpful when adopting it.

If you're looking for a larger overview or want to know more without waiting for future posts, we encourage you to take a look at the official Pydantic documentation.

Gabriel Côté-Carrier is a senior software consultant at Test Double with experience in full-stack development, leading teams, and teaching others.

Kyle Adams is a staff software consultant at Test Double who lives for that light bulb moment when a solution falls perfectly in place or an idea takes root.
