At least for me, that’s how it happened.
In 2017 I was working as a Data Engineer for the Mexican federal government.
I had previously been a developer lead, then moved into data analysis, and was starting to realize that what I really enjoyed was building data systems.
The requirement was simple: build a solution to publish open geographic data.
There was no additional context, no clear constraints, no architecture discussions. So I did what I knew how to do: I designed a system that met the requirement and, along the way, added some extra value. The idea was to be able to publish any dataset regardless of its original format, and also provide WMS/WFS services so the data could be consumed directly in web maps.
I deployed a GeoServer service.
I wrote scripts to install and configure it.
After a few weeks, the system was running on-premise.
I gave a demo showing how users could consume the WMS services to run their own analyses. Everything worked. The requirement was met, with added value.
Everything works until it doesn’t
One day, the entire service went down. Production was on fire.
There was no staging environment.
There were no snapshots.
There was no clear recovery plan.
And most importantly: I had no answers ready when someone asked what happened.
I had scripts, yes. But when I ran them, they only returned errors. That’s when I understood something that seems obvious now, but wasn’t back then: there’s a huge difference between running a script on a clean system and running it on a half-configured, half-broken one.
I made an inelegant but effective decision: I took the entire service down, deleted all my files, and rebuilt everything from scratch. After a few hours, it was back up.
It worked.
But it felt more like a patch than a win.
It wasn’t my fault, but I was responsible
That day I understood that everything eventually fails: hardware, software, internet providers, Cloudflare, or AWS us-east-1.
And I understood something more uncomfortable: even if I wasn’t at fault for the outage, I was still responsible for the system’s health.
The pressure, the confusion, being in the spotlight without clear answers—it was an experience I didn’t want to repeat.
That pushed me to face two major problems I had been ignoring:
- I didn’t have real system replicability. My scripts existed, but they depended on too many implicit things: versions, repositories, intermediate states. All it took was MySQL moving to a different repository or a version no longer being available for everything to break.
- I had no fault-tolerance mechanisms. The system worked… as long as everything worked.
Tools were a response
While trying to solve the first problem, I discovered Docker. Not because it was trendy, but because it was exactly what I needed at that moment: a way to package the system and replicate it.
With the few tutorials available in Spanish at the time, I wrote my first Dockerfile. I used Docker Compose to deploy services, restart them if needed, or destroy them entirely. For the first time, I felt I could get the same result more than once without crossing my fingers.
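For a sense of what that buys you, a Compose file for a GeoServer setup like this one can be as small as the following sketch (it is not the original file; the image, port, and volume path are illustrative):

```yaml
version: "3"

# Minimal sketch of a single-service GeoServer deployment, not the original setup.
# Image, port, and volume path are illustrative; pin an exact version in practice.
services:
  geoserver:
    image: kartoza/geoserver                    # one of several community GeoServer images
    ports:
      - "8080:8080"                             # GeoServer's default HTTP port
    volumes:
      - geoserver_data:/opt/geoserver/data_dir  # keep the data directory outside the container
    restart: unless-stopped                     # come back up after crashes and reboots

volumes:
  geoserver_data:
```

From there, `docker-compose up -d` and `docker-compose down` covered the whole deploy, restart, and destroy cycle.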
But another inevitable question came up: what happens when the server fails?
If I was assuming everything could fail, then the server would fail at some point too. I needed to run the service on more than one machine.
That’s when Kubernetes showed up. Not as a goal, but as a consequence. It solved a real problem, but it also brought much more complexity. Docker Compose could be installed with a single command. Kubernetes required deploying etcd, SSHing into servers, configuring hostnames, networks, certificates… and I didn’t want to trust all of that to scripts that had already failed me once.
That led me to Ansible. I wrote my playbooks and deployed Kubernetes on bare metal.
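Those playbooks grew over time, but the core of a bare-metal bootstrap looks roughly like this minimal sketch (the inventory group name, package setup, and pod CIDR are assumptions; a real run also handles the container runtime, pod networking, certificates, and worker joins):

```yaml
# Simplified sketch of bootstrapping the first control-plane node, not a full playbook.
# Assumes the Kubernetes apt repository is already configured on the hosts;
# the inventory group name and pod CIDR are placeholders.
- hosts: control_plane
  become: true
  tasks:
    - name: Install kubeadm, kubelet and kubectl
      apt:
        name: [kubeadm, kubelet, kubectl]
        state: present
        update_cache: true

    - name: Initialize the control plane
      command: kubeadm init --pod-network-cidr=10.244.0.0/16
      args:
        creates: /etc/kubernetes/admin.conf   # skips the init if the cluster already exists
```

The `creates` guard is what makes the play safe to re-run: if `/etc/kubernetes/admin.conf` already exists, Ansible skips the init instead of fighting a cluster that is already up.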
Getting myself out of the way
Everything was working.
But there was still friction.
Adding or removing nodes was still a manual process. I wanted the system to adjust itself, even if the on-premise environment didn’t change much. That’s how I got into service discovery with Consul—and, eventually, the HashiCorp stack.
The idea was simple: let the machine take on as much responsibility as possible. I only had to run a small Consul script. It didn’t matter if I ran it once or a hundred times—the result was the same: the node was added to the network only once. Later I learned that this was called idempotency, but at the time I just knew it gave me peace of mind.
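That script boiled down to a single join step. Written as an Ansible task, for consistency with the sketches above (the original was a plain Consul script, and the address is a placeholder), it looks roughly like this:

```yaml
# Sketch only: the original was a small script around the Consul CLI, not Ansible.
# Joining a node that is already a cluster member is a no-op, so running this
# once or a hundred times leaves the membership in the same state.
- name: Join this node to the Consul cluster
  command: consul join 10.0.0.10    # placeholder address of an existing member
  changed_when: false               # the join itself is safe to repeat
```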
What I was really learning
Over time, I realized something simple: there are many ways to meet a requirement, but not all of them survive the first production failure.
I didn’t know this was DevOps.
I wasn’t trying to “do DevOps.”
I was just trying to make systems fail without breaking myself in the process.