Guardrails in Software

Published by   on Fri Nov 08 2024

learning in public

Guardrails

Today I was going to cover instruction sets and my observations of what makes good instructions vs bad ones.

But then I made a mistake at work

I didn’t bring down production or anything, but I did nuke one of the pre-production compute units in our deployment pipeline. Fortunately, the AWS account remained and the infrastructure is defined in CDK. However, it caused our integration tests to fail in that pipeline stage and blocked changes from flowing through without manual intervention.

So instead, I want to talk about something that is both a blessing and a curse to developers: guardrails.

For context: I have been wiring up a new service and setting up all of its infrastructure. That includes messing with its compute units for the deployment pipeline. I realized I had misconfigured the compute units for the pipeline stages in the new service and needed to re-create them. So I went to delete one of the stages, clicked the confirm button… and made the unfortunate discovery that I was editing the compute units of the wrong service.

The scary part is that I also needed to delete the prod compute unit for the new service’s deployment pipeline. So if I hadn’t realized I deleted the compute unit in the wrong service as soon as I did, I would have probably brought down production.

Guardrails are bad

I want to start off by talking about why guardrails are bad. Not necessarily bad in an objective sense, but bad in my own opinionated view.

Guardrails slow you down.

Nothing is worse than trying to do one simple task and facing the throttling effect of guardrails. Two-person reviews, checking boxes, filling out forms, and generating new credentials every 8 hours can all feel like obstacles to an otherwise trivial job. Developers love speed and automation, but guardrails are the opposite of that. They’re manual steps that slow us down and make us frustrated.

Sometimes, they don’t work

The interface I used to delete compute units had a guardrail in place. It didn’t work. The guardrail was just a checkbox: “I’m sure I want to delete the compute unit for XYZ.” This guardrail suffers from two problems:

  1. If the service XYZ is named similarly to another service, I may not catch that I’m editing the wrong service.

  2. If I want to go fast, I just check the box and move on.

After I made the mistake, I did feel a bit frustrated that I was even allowed to make such an impactful mistake so easily. Why was the guardrail so weak, if people could potentially bring down production?

They’re obstacles for those who repeat tasks frequently

For someone who goes through a specific workflow frequently, it can be very frustrating to encounter guardrails. You know what you’re doing, right? Why slow you down?

They feel like a non-solution

If something needs guardrails, we should probably be asking if there’s a different solution to the problem. Why is this a manual task? If the task needs guardrails, is it because it’s a one-way task that can’t be undone? If so, can we lower the stakes instead? Can we design the system such that any decision is not irrecoverable and catastrophic?

Guardrails are good

Now that I’ve gone over why guardrails can be frustrating and feel wrong in certain circumstances, I want to go over why they’re good. First of all, when I realized what I had done, my heart sank. I immediately messaged a more senior engineer on the team with these exact words:

I messed up

call???

We also discussed the insufficient guardrails and how all of this could have been avoided with the right ones.

The right guardrails will save you

It was Friday, and I was technically working after hours. My brain was getting that buzzing feeling telling me to stop working and thinking because it was time to relax and enjoy my weekend. I was rushing, and I was tired.

The guardrail I wish I had was this:

To confirm you want to delete this compute unit, type the name of the service you are editing: ___

This would have saved me. I would have typed in the name of the service I thought I was making changes to, and the system would have rejected me. I see this guardrail a lot, and I do like how it demands my focus. It works best when the textbox placeholder only shows an example of what to type, and the real name is not displayed anywhere I can simply copy and paste it from. Because yes, I have also done that where possible, and it defeats the purpose of the guardrail.
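Here is a minimal sketch of that type-to-confirm check in Python. The function names `confirm_delete` and `delete_compute_unit` and the service names are hypothetical, made up for illustration; this is not how the tool I was using works.

```python
def confirm_delete(target_service: str, typed_name: str) -> bool:
    """Return True only if the typed name exactly matches the service being edited."""
    # Compare exactly (after trimming surrounding whitespace) so that a
    # similarly named service does not pass by accident.
    return typed_name.strip() == target_service


def delete_compute_unit(target_service: str, typed_name: str) -> str:
    """Hypothetical delete flow: refuse unless the typed name matches."""
    if not confirm_delete(target_service, typed_name):
        raise PermissionError(
            f"Typed name {typed_name!r} does not match the service being edited."
        )
    # A real implementation would call the actual delete operation here.
    return f"deleted compute unit for {target_service}"
```

In my case, I would have typed the name of the new service while the page was showing the old one, and the call would have raised instead of deleting anything.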

I had been working in the weeds of this system for an entire week. I knew the ins and outs of it. Even then, I made a mistake at a moment when I was tired and rushed. Guardrails aren’t there to slow you down 99% of the time. They’re there to slow you down the 1% of the time you really need to slow down.

Guardrails will make you frustrated, but never fearful

Messing up feels bad. Depending on what you did, you might feel a little scared. I definitely felt scared. That service I messed up was deploying important changes, and what I did could have ruined the deployment pipeline…

It’s times like that when I am reminded of all the frustration that seemingly “stupid” guardrails had caused me, and how much I wish I had felt that instead of feeling the fear of the consequences of my mistake.

Conclusion

Guardrails can be painful. However, when balanced correctly they can save your ass. Guardrails should reflect the severity of the worst possible outcome a particular action can have. If an action might cause a minor issue, a warning could suffice. But if an action can take down prod, gate it with something meaningful. Like a game of Wordle where the solution is the name of the resource you’re modifying. I don’t know, make it fun!
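As a rough sketch of matching the guardrail to the worst-case outcome, something like the following could work. The `Severity` levels, the `guard` function, and the resource names are all assumptions for illustration, not an existing API.

```python
from enum import Enum


class Severity(Enum):
    MINOR = 1     # worst case: a small, easily fixed issue
    CRITICAL = 2  # worst case: production goes down


def guard(severity: Severity, resource_name: str, typed_name: str = "") -> bool:
    """Decide whether an action may proceed, given its worst-case severity."""
    if severity is Severity.MINOR:
        # A warning is enough; the action still proceeds.
        print(f"Warning: you are modifying {resource_name}")
        return True
    # Critical actions require typing the exact resource name.
    return typed_name.strip() == resource_name
```

A minor action only prints a warning, while a critical one is blocked until the exact resource name is typed.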