Introduction: observability has a human dimension
We don't always need to start with a complicated plan to make things better. Sometimes all it takes is having a clear direction, taking the first step and seeing where it leads you.
Developers at my previous client had a hard time finding out what was going wrong in their services. Investigations took a long time and the results weren't always conclusive. Observability was a pain point for us.
Observability is the ability to see what's happening inside software systems. It's typically achieved through querying and visualizing logs, metrics and traces.
I felt that pain alongside them and raised my hand when the team wanted to do something about it. It became a much bigger project than I could have envisioned at the start, and solving that technical problem turned out to have a big human component.
The trigger: time-consuming bugs
We had a couple of recurring bugs in different services in a short period of time. Attempts to fix them took time and could fail when our investigations into the error logs didn't yield the right root cause. At that point, leadership decided to invest engineering time in making logs more robust in these services.
The Test Double consultants at the client (we call ourselves double agents) made the case that it'd be better to approach the issue holistically instead of only addressing the problematic services. After that, the ball was in our court to show what that could look like.
From that point on, I worked on and off toward making observability better for several months. I repeatedly addressed the next most painful parts of our observability and adjusted course based on the team's feedback.
Technical wins: making things easier
The first thing I did was build a path to structured logging. It enabled logs to contain arbitrary key-value pairs, making them much easier to query and visualize through a lot of new dimensions like unique identifiers, route names, status codes, downstream services, etc.
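To make that concrete, here's a minimal sketch of structured logging using Python's standard logging module. It's an illustration rather than the client's code: the language, the JsonFormatter and build_logger names, and the convention of passing key-value pairs through extra={"fields": ...} are all assumptions of mine.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so fields can be queried by key."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers attach arbitrary key-value pairs
        # with extra={"fields": {...}} on the logging call.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("orders-api")
    log.info(
        "downstream call failed",
        extra={"fields": {"route": "/orders", "status_code": 502, "downstream": "billing"}},
    )
```

Each log line becomes a single JSON object, so a log platform can filter or group on route, status_code or downstream instead of parsing free text.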
After that, I integrated our logging with our HTTP server framework and our HTTP request library. The team could now send context-rich logs with just a few lines of code. The best part of this was that it standardized most logs across our different services. Querying behavior for a single web application or across different web applications could now be done easily without cross-referencing with application code.
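As a rough idea of what a server-side integration can look like, here's a sketch of a WSGI middleware in Python that emits one structured log line per request. The framework, the RequestLoggingMiddleware name, the field names and the X-Request-Id header are assumptions for illustration, not the integration we actually shipped.

```python
import json
import logging
import time
import uuid


class RequestLoggingMiddleware:
    """WSGI middleware that logs one JSON line per request (illustrative sketch)."""

    def __init__(self, app, service_name, log=None):
        self.app = app
        self.service_name = service_name
        self.log = log if log is not None else logging.getLogger("http.access")

    def __call__(self, environ, start_response):
        started = time.perf_counter()
        captured = {"status_code": None}

        def recording_start_response(status, headers, exc_info=None):
            # The status arrives as e.g. "404 Not Found"; keep the numeric part.
            captured["status_code"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, recording_start_response)
        finally:
            # Limitation of this sketch: apps that call start_response lazily
            # (for example generators) will be logged with status_code None.
            fields = {
                "service": self.service_name,
                "request_id": environ.get("HTTP_X_REQUEST_ID") or str(uuid.uuid4()),
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "status_code": captured["status_code"],
                "duration_ms": round((time.perf_counter() - started) * 1000, 2),
            }
            self.log.info(json.dumps(fields))
```

Wrapping an application is then a single line, such as app = RequestLoggingMiddleware(app, "orders-api"), and every service that does so logs the same fields under the same names.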
The next biggest pain point was how complex our log queries were. We couldn't jump on the log platform and query something without finding an example somewhere to copy and tweak. I built a log query function that abstracted away the querying complexity. Queries could now be written from scratch in seconds.
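The idea behind that kind of helper can be sketched in a few lines of Python. The query syntax below (key:value clauses joined with AND and OR) is made up for the example, as are the build_log_query name and its keyword arguments. A real helper would emit whatever syntax your log platform expects.

```python
def _format_value(value) -> str:
    """Quote strings and leave numbers and booleans bare."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_log_query(**filters) -> str:
    """Turn keyword filters into a query string in a hypothetical syntax.

    A list or tuple of values means "any of these" for that field.
    """
    clauses = []
    for key, value in sorted(filters.items()):
        if isinstance(value, (list, tuple)):
            options = " OR ".join(_format_value(v) for v in value)
            clauses.append(f"{key}:({options})")
        else:
            clauses.append(f"{key}:{_format_value(value)}")
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_log_query(service="orders-api", status_code=[500, 502]))
```

The caller only names the fields and values they care about. Quoting, escaping and operator placement live in one place, which is what removes the need to hunt for an example to copy.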
After that, I made it easy to provision new dashboards to all of our environments. We could now build and refine graphs displaying all of the new rich logs and metrics we had. Dashboards made it much easier to grasp problematic patterns in our services at a glance.
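One way to think about provisioning is to define a dashboard once and render it for every environment. The Python sketch below shows only that shape: the environment names, the Panel and Dashboard types, the env filter and the output format are all hypothetical, and the step that would push each definition to a real platform is left out.

```python
from dataclasses import dataclass

# Assumed environment names, for illustration only.
ENVIRONMENTS = ("development", "staging", "production")


@dataclass(frozen=True)
class Panel:
    title: str
    query: str  # query text in whatever syntax the platform uses


@dataclass(frozen=True)
class Dashboard:
    name: str
    panels: tuple


def render_for_environment(dashboard: Dashboard, environment: str) -> dict:
    """Build one environment-specific dashboard definition as a plain dict."""
    return {
        "title": f"{dashboard.name} ({environment})",
        "panels": [
            {"title": panel.title, "query": f'env:"{environment}" AND {panel.query}'}
            for panel in dashboard.panels
        ],
    }


def render_all(dashboard: Dashboard, environments=ENVIRONMENTS) -> dict:
    """Render the same dashboard for every environment, keyed by environment."""
    return {env: render_for_environment(dashboard, env) for env in environments}
```

Because every environment is rendered from the same definition, refining a graph once updates it everywhere on the next rollout.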
Beyond the technical: paving a road for others
While building these technical capabilities, I realized that they would only be part of the solution.
One of our values I resonate with the most at Test Double is leaving our clients better than we found them. As a consultant, I knew I'd eventually leave, so keeping all the knowledge of operating our tools in my head would be doing my client a disservice. I repeatedly asked myself how I could pass on that know-how.
The result? I wrote extensive documentation with code samples, tutorials and upgrade paths. Every new capability came with release notes and demos to the team. I jumped on calls to share knowledge and help debug issues.
This took a lot more effort than I thought, but that work was rewarding because it was essential to achieving the best outcome for the team. I knew deep down that elevating the team's practices mattered much more than elevating their tooling. I was elated the first time I heard another senior engineer mention they'd deployed a new dashboard without my involvement. I knew at that point that I'd accomplished that personal goal.
New outcomes: reframing observability
With all the new additions in technical capabilities, I realized that our observability platform could be used for much more than assisting developers. Being an internal service team, we often had to collaborate with other teams and I planted the idea that we could be building dashboards for them.
The first application of that idea was in a new project with a fair amount of risk involved. We knew we'd have a lot of back-and-forth to fine-tune the result properly and to that end we built a dashboard that would provide answers for most of our collaborators' questions.
Shifting our observability platform from team tooling to something we could offer was hugely helpful for the project. It removed our team from being a bottleneck for our collaborators. They could now progress on their work and fix bugs while rarely needing to wait for our responses or book meetings with us.
Stakeholders especially enjoyed having access to these dashboards because it gave them accurate metrics on-demand. That transparency strengthened our stakeholders' trust in our systems.
The end result: find and fix issues quickly
I don't have an exact date, but observability stopped being a source of complaints within a few months. I still kept building more capabilities past that point because I saw a lot more potential and the leadership at the client trusted me to keep delivering value in that space. We wouldn't have achieved these new outcomes without that trust and I'm grateful for it.
I ended up staying at the client for roughly two years after I finished that initiative, and it was clear that observability was no longer a weakness. If anything, it had become a strength. Our stakeholders could trust that we'd find and fix issues quickly. We often did it so quickly that we'd let them know of a fix before they raised any issue with us.
Looking back: what I'd do differently
If I had to do it all over again, I would build hands-on workshops on the new observability capabilities. Demos are helpful, but there's a big qualitative difference between watching someone and trying something yourself. It would have sped up adoption of the new practices and made the collective expertise more consistent.
I would also deliberately ask internal collaborators to champion this work going forward. I regret not telling a few developers that I wouldn't be around forever and that I'd love to give them more ownership. That conversation would have helped turn this observability initiative into a collective one sooner.
Getting started: observability resources
If you’re feeling observability pains in your systems, my first suggestion would be to have a serious look at your observability platform’s documentation. For me, building a deeper understanding of our platform’s features helped me identify all sorts of wins that kept us moving in the right direction. Sometimes it’s as easy as knowing what you have access to and figuring out how to connect your systems to it.
If you’re looking for a soup-to-nuts understanding of observability, I’d recommend reading the book Observability Engineering. I read it after my observability initiative, and I would have had an easier time if I'd had the information in that book from the start. It not only goes in depth on the technical aspects and reasoning behind observability, but also covers the human challenges involved in adopting it.
Closing out: focusing on the direction
I wasn't an expert in structured logging at the time, nor did I have a big plan. I did, however, have a rough north star: observability should be powerful and easy to use. From there, I built more and more expertise and leaned on the people around me for feedback and suggestions. Every step forward gave me enough visibility to see where to go next.
I rotated away from this client earlier this year and I have every confidence that they'll be able to sustain and increase that expertise without me. That outcome came from keeping in mind that software problems are also human problems and only tackling the technical part wouldn't have taken us all the way there.