Pydantically perfect: Normalize legacy data in Python

Learn how to normalize inconsistent data structures in Python with Pydantic. The post guides you through different approaches and pitfalls, using Pydantic's alias path and alias choices features.
Gabriel Côté-Carrier
Kyle Adams

Welcome to Pydantically Perfect, the blog series where we explore how to solve data-related problems in Python using Pydantic, a feature-rich data validation library whose core validation engine is written in Rust. Whether you're a seasoned developer or just starting out, we hope to give you actionable insights you can apply right now to make your code more robust and reliable with stronger typing.

If you're a newcomer here, we encourage you to take a look at our first installment: Pydantically perfect: A beginner’s guide to Pydantic for Python type safety.

Where you are in the Pydantic for Python blog series:

  • A beginner's guide to Pydantic for Python type safety
  • Seamlessly handle non-pythonic naming conventions
  • You are here: Normalize legacy data in Python
  • Field report in progress: Declare rich validation rules
  • Field report in progress: Build shareable domain types
  • Field report in progress: Add your custom logic
  • Field report in progress: Apply alternative validation rules
  • Field report in progress: Validate your app configuration
  • Field report in progress: Put it all together with a FHIR example

The problem: inconsistent data

We're trying to parse error responses from a legacy system. The issue is that the response structure varies depending on the endpoint queried and where the error occurred internally. We can't predict in advance which structure we'll receive.

The inconsistent part for us is where the user ID lives, and we need that user ID to log the error properly in our systems.

There are three different formats:

1. The user ID is an attribute of the root.

2. The user ID is nested inside a user object.

3. The user ID is nested inside a list of user objects. It's unclear why, but there's always exactly one user object.

data_format_one = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "user_id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
}

data_format_two = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "user": {
        "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
    },
}

data_format_three = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "users": [
        {
            "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
        }
    ],
}

What are our goals here?

1. We want to handle this complexity exactly once rather than letting it spread in our application. Repeatedly checking for where the user ID is stored will muddle our business logic that should only care about the value of the user ID.

2. We want to lean on Pydantic to contain that complexity as much as possible. That will limit the amount of custom validation logic we’ll need to write without compromising on robustness or quality because Pydantic is a dedicated validation library.

We'll get there step by step, making things slightly better each time.

First step: mapping the models

We can start by creating the first version of our models in Pydantic. The three models below directly map the existing structures:

from datetime import datetime
from uuid import UUID
from pydantic import BaseModel

class ErrorFormatOne(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user_id: UUID

class NestedUser(BaseModel):
    id: UUID

class ErrorFormatTwo(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user: NestedUser

class ErrorFormatThree(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    users: list[NestedUser]


Notice how the models contain more advanced types like datetime, UUID, and other Pydantic models? Natively handling advanced types is one of Pydantic's biggest strengths, and we'll dive deeper into them in a later installment of the series.
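For instance, Pydantic coerces ISO 8601 timestamp strings and UUID strings into real datetime and UUID objects during validation. A quick sketch (the Stamp model here is hypothetical, just for illustration):

```python
from datetime import datetime
from uuid import UUID

from pydantic import BaseModel

class Stamp(BaseModel):
    at: datetime
    user_id: UUID

# Plain strings in, rich Python objects out
stamp = Stamp.model_validate({
    "at": "2025-09-08T15:16:03Z",
    "user_id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
})
# stamp.at is now a timezone-aware datetime; stamp.user_id is a UUID
```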

Why aren't these models satisfying yet?

They simply map the structure without any abstraction, which means the responsibility of handling the structure discrepancies would be forwarded to the rest of our code. That's exactly the opposite of what we want.

The rest of our code shouldn't have to check whether to use error_model.user_id, error_model.user.id, or error_model.users[0].id. Let's address this inconsistency in accessing the user ID.

The trap: combining everything as optional

From here, it could be tempting to directly combine the three models, but mark the different sources of user ID as optional to accommodate the fact that we'd only ever have one of them at once. It would look something like this:

from datetime import datetime
from uuid import UUID
from pydantic import BaseModel, Field

class NestedUser(BaseModel):
    id: UUID

class Error(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user_id: UUID | None = Field(default=None)
    user: NestedUser | None = Field(default=None)
    users: list[NestedUser] | None = Field(default=None)

# This data has no user ID and won't raise an error
invalid_data = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
}

Error.model_validate(invalid_data)
# Error(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)),
#     error_message="'NoneType' object has no attribute 'lower'",
#     error_type="AttributeError",
#     user_id=None,
#     user=None,
#     users=None,
# )


While the code does combine all structures into a single model, there are two main drawbacks:

  1. We're still not handling the different structure discrepancies. The main issue of muddling our business logic with the inconsistent ways of accessing the user ID is just as present as in the beginning and that would make the rest of our code unnecessarily complex.
  2. We've lost the validation that we have a user ID. As you can see in the invalid_data example, no validation error is raised to let us know that there’s no valid user ID. Our own custom logic would have to double-check that there is a valid user ID, which is the opposite of what we want to do by bringing in Pydantic validation. We want to lean on Pydantic’s tried and true validation process so we can focus on business logic instead.

Let's hold off on combining the models for now and focus on normalizing the structure.

Step two: flattening the models

Among Pydantic's alias features, AliasPath lets us flatten the structure. We provide the AliasPath with one or more keys, and Pydantic will follow those keys into the nested objects when populating an attribute.

Let's update the relevant error models to reach inside their nested structure and extract a user_id attribute directly. We'll also test the changed models with our example data:

from datetime import datetime
from uuid import UUID
from pydantic import AliasPath, BaseModel, Field

class ErrorFormatOne(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user_id: UUID 

class ErrorFormatTwo(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    # Equivalent to reaching for `input_data["user"]["id"]`
    user_id: UUID = Field(validation_alias=AliasPath("user", "id"))

class ErrorFormatThree(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    # Equivalent to reaching for `input_data["users"][0]["id"]`
    user_id: UUID = Field(validation_alias=AliasPath("users", 0, "id"))

data_format_two = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "user": {
        "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
    },
}

ErrorFormatTwo.model_validate(data_format_two)
# ErrorFormatTwo(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)),
#     error_message="'NoneType' object has no attribute 'lower'",
#     error_type='AttributeError',
#     user_id=UUID('e1c3cd56-ed1f-4291-9dea-fd54f9b379c2')
# )

data_format_three = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "users": [
        {
            "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
        }
    ],
}

ErrorFormatThree.model_validate(data_format_three)
# ErrorFormatThree(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)), 
#     error_message="'NoneType' object has no attribute 'lower'", 
#     error_type='AttributeError', 
#     user_id=UUID('e1c3cd56-ed1f-4291-9dea-fd54f9b379c2')
# )

Why use validation_alias rather than alias?

In our previous post, Pydantically perfect: Seamlessly handle non-Pythonic naming conventions, we used the alias argument because it assigns its value to both the validation and serialization aliases. However, the more advanced validation alias features don't conceptually make sense as serialization aliases. That's why alias only accepts string values, while validation_alias accepts richer alias types.
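Because a validation_alias only affects input, a field aliased this way still serializes under its real attribute name. A small sketch (Report is a hypothetical model):

```python
from uuid import UUID

from pydantic import AliasPath, BaseModel, Field

class Report(BaseModel):
    # The alias only applies when validating input data
    user_id: UUID = Field(validation_alias=AliasPath("user", "id"))

report = Report.model_validate({"user": {"id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2"}})
# On the way out, the field keeps its real name: no nested "user"
# object is reconstructed during serialization.
print(report.model_dump())
```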

With the changes above, all models' user IDs can now be accessed directly with error_model.user_id. 

We're not quite there yet, though, because there are still three different models. We want Pydantic to handle all of that complexity at once with a single model.

Step three: combining the models

The only complexity in combining all three models into one is how user_id is populated. All the other fields are the same.

We can lean into the other main Pydantic alias feature: AliasChoices. 

AliasChoices lets us provide a list of potential sources for the field value, and the first one to exist will be the one used in the validation process. The best part? It also accepts AliasPath values, so we can provide one option per format and Pydantic will handle it all:

from datetime import datetime
from uuid import UUID
from pydantic import AliasChoices, AliasPath, BaseModel, Field

class Error(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user_id: UUID = Field(
        validation_alias=AliasChoices(
            "user_id",
            AliasPath("user", "id"),
            AliasPath("users", 0, "id"),
        )
    )

data_format_one = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "user_id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
}

Error.model_validate(data_format_one)
# Error(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)),
#     error_message="'NoneType' object has no attribute 'lower'",
#     error_type='AttributeError',
#     user_id=UUID('e1c3cd56-ed1f-4291-9dea-fd54f9b379c2')
# )

data_format_two = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "user": {
        "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
    },
}

Error.model_validate(data_format_two)
# Error(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)),
#     error_message="'NoneType' object has no attribute 'lower'",
#     error_type="AttributeError",
#     user_id=UUID("e1c3cd56-ed1f-4291-9dea-fd54f9b379c2"),
# )

data_format_three = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
    "users": [
        {
            "id": "e1c3cd56-ed1f-4291-9dea-fd54f9b379c2",
        }
    ],
}

Error.model_validate(data_format_three)
# Error(
#     timestamp=datetime.datetime(2025, 9, 8, 15, 16, 3, tzinfo=TzInfo(UTC)),
#     error_message="'NoneType' object has no attribute 'lower'",
#     error_type="AttributeError",
#     user_id=UUID("e1c3cd56-ed1f-4291-9dea-fd54f9b379c2"),
# )

At this point, we have a single model that can accept and normalize all three different structures we can receive. The Pydantic engine will fully encapsulate this complexity for us and none of our code past this point will have to know about structural discrepancies. Mission accomplished!
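And unlike the all-optional trap from earlier, this combined model still fails loudly when no user ID shows up in any of the three locations (a quick check; the same Error model as above, repeated here to stay self-contained):

```python
from datetime import datetime
from uuid import UUID

from pydantic import AliasChoices, AliasPath, BaseModel, Field, ValidationError

class Error(BaseModel):
    timestamp: datetime
    error_message: str
    error_type: str
    user_id: UUID = Field(
        validation_alias=AliasChoices(
            "user_id",
            AliasPath("user", "id"),
            AliasPath("users", 0, "id"),
        )
    )

# No user ID anywhere: validation fails instead of silently passing
invalid_data = {
    "timestamp": "2025-09-08T15:16:03Z",
    "error_message": "'NoneType' object has no attribute 'lower'",
    "error_type": "AttributeError",
}

try:
    Error.model_validate(invalid_data)
except ValidationError as exc:
    # Pydantic reports user_id as missing -- no manual double-check needed
    print(exc.errors()[0]["type"])
```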

Conclusion: what's next for the Pydantically Perfect series?

With this, we've expanded our coverage of alias features to include normalizing inconsistent legacy data. The next post will move away from aliases to take a deeper look into describing models with a rich set of validation rules and constraints strictly with Pydantic features.

Our goal isn't to go through all of Pydantic's features, but rather to provide a curated list of Pydantic features we found helpful when adopting it.

If you're looking for a larger overview or want to know more without waiting for future posts, we encourage you to take a look at the official Pydantic documentation.

Gabriel Côté-Carrier is a senior software consultant at Test Double with experience in full-stack development, leading teams, and teaching others.

Kyle Adams is a staff software consultant at Test Double who lives for that light bulb moment when a solution falls perfectly in place or an idea takes root.
