Managing mistakes with Continuous Delivery
One of the realities of adopting Continuous Delivery is that you will ship software more often, but you will also ship mistakes more often. The good news is that Continuous Delivery also reduces the complexity of each mistake, because each release carries a smaller change, and that leads to faster recovery.
Developer error is still the number one cause of production outages. This is to be expected: We all make mistakes.
If your team hasn’t practiced Continuous Delivery before, you likely don’t spend a lot of time planning how to handle failure. After shipping a lot of code all at once, you may budget time for fixes or simply scramble to resolve any issues that affect your customers. But the nature of Continuous Delivery should make your team more deliberate about its process for restoring a good working state when you make mistakes.
The best thing you can do when you deploy a mistake live to production is to keep calm and remember this rule: We are not fixing the issue, we are restoring the system to a known good state.
Why you should never prioritize fixing the issue
Not immediately fixing the issue sounds counterintuitive. If all the changes heading to production are simpler, so too should be the fixes. But restoring a good working state before problem solving is actually one of the best things you can do to help arrive at a well-reasoned solution. A handful of reasons:
- The increased pressure and stress of trying to analyze the right fix while production is degraded or down will not help your brain think. In fact, this pressure will increase the odds that you will make another mistake.
- The solution is almost assuredly less trivial than you think. If it was entirely trivial you would have caught it in the first place. And if it’s not trivial, the solution is likely to be non-obvious.
- Finding the right way to restore a known good working state is guaranteed to be quicker than finding the right solution and then implementing, peer reviewing, testing and delivering it.
Of course, once you do restore a known good state and the issue is no longer affecting customers, then you can slow down and analyze the situation to identify the cause and work on a solution – all with less pressure and clearer heads.
3 options for restoring to a known good state (and how to decide which to use)
Hopefully we’ve sold you on the importance of restoring to a known good state rather than fixing the issue. How do you actually do that? Three common options exist, each of which has its pros and cons. The option that’s right for your team depends in part on how you’ve set up your infrastructure, but the right choice is the option with the fastest time to recovery and least impact on the development team. Here’s what you need to know:
If you’re confident that the issue is with code that resides behind a feature flag, then you can simply turn off that feature flag temporarily.
This approach is almost assuredly the quickest way to restore a known good state, and it will only impact the customers that had the feature turned on. It also has the benefit of not affecting the rest of your delivery teams because it doesn’t create any code changes that they will have to merge into their downstream development branches.
Pros: Near instantaneous solution to restoring good working order that minimizes impact for delivery teams and allows you to work on a fix in a calm and comfortable manner.
Cons: Customers can no longer use the new feature, which may slow down any information gathering.
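To make the feature flag option concrete, here is a minimal sketch of a flag check guarding a code path. Everything in it is an assumption for illustration: the plain-text flag file, the `flag_enabled` and `set_flag` helpers, and the `new-checkout` flag name are hypothetical, and a real team would more likely use a feature flag service or its framework’s own mechanism.

```shell
#!/bin/sh
# Hypothetical flag store: one "name=on" or "name=off" entry per line.
FLAGS_FILE="${FLAGS_FILE:-./flags.conf}"

flag_enabled() {
  # Succeeds only when the flag is explicitly "on"; a missing flag counts as off.
  grep -qxF -- "$1=on" "$FLAGS_FILE" 2>/dev/null
}

set_flag() {
  # $1: flag name, $2: "on" or "off".
  # Rewrite the file without the old entry, then append the new value.
  tmp="$(mktemp)"
  grep -v -- "^$1=" "$FLAGS_FILE" 2>/dev/null > "$tmp" || true
  printf '%s=%s\n' "$1" "$2" >> "$tmp"
  mv "$tmp" "$FLAGS_FILE"
}

render_checkout() {
  # The new code path only runs while its flag is on.
  if flag_enabled new-checkout; then
    echo "new checkout flow"
  else
    echo "existing checkout flow"
  fi
}
```

With this shape, running `set_flag new-checkout off` sends everyone back to the existing code path without a deploy or a code change, which is why this option leaves downstream branches untouched.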
If your delivery tooling allows you to roll back directly to a previous version of the code, you can make that rollback and then freeze or gate other deliveries from going to production.
This type of rollback is a bit more involved with legacy solutions, but with Kubernetes it’s quite simple. That’s because Kubernetes keeps a rollout history for each Deployment that records the revisions you have rolled out (by default the ten most recent), so you can query that history and restore a known good revision. It’s often as simple as a couple of commands:
    $ kubectl rollout history deploy/spaceship-blog
    REVISION  CHANGE-CAUSE
    21        Merge pull request #21 from onspaceship/post-02-02
    22        Merge pull request #22 from onspaceship/post-02-09
    23        Merge pull request #24 from onspaceship/post-02-16
    24        Merge pull request #23 from onspaceship/post-02-23

    $ kubectl rollout undo deploy/spaceship-blog --to-revision=23
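The two commands above can be wrapped in a small helper that also waits for the rollback to finish. This is only a sketch: the `rollback_to` function name and the two-minute timeout are assumptions, not part of any team’s actual tooling, and it presumes `kubectl` is already pointed at the right cluster and namespace.

```shell
#!/bin/sh
# Sketch only: roll a Deployment back to a given revision and wait for it to settle.
rollback_to() {
  # $1: Deployment reference, e.g. deploy/spaceship-blog
  # $2: revision number taken from `kubectl rollout history`
  kubectl rollout undo "$1" --to-revision="$2" || return 1
  # Block until the rolled-back pods are ready, or give up after two minutes.
  kubectl rollout status "$1" --timeout=120s
}
```

For the history shown above, that would be `rollback_to deploy/spaceship-blog 23`. Waiting on `kubectl rollout status` gives you a clear signal that the known good revision is actually serving traffic before you turn to the root cause.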
Pros: You can avoid the overhead of putting a new code change through the process of peer review, CI checks, testing and delivery pipeline.
Cons: You need to gate other deliveries to production until you can restore the offending code to a known good state. As a result, if the fix seems like it may take some time to develop, it will be a judgment call for the team on whether to revert the code until it can be fixed or to fix it immediately. Either way, you’ve already restored functionality for your customers while you make that decision.
Additionally, database migrations can cause issues with a rollback. Ideally, you’ll deploy database migrations in isolation and you should be able to easily identify whether or not the previous version of the code is compatible with the existing schema. If it is not, then you’ll need to migrate down to a previous version of the schema and confirm that you will not lose any data in the process.
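One way to make that compatibility check quick is to have each release declare the newest schema version it can run against. The sketch below assumes exactly that convention; the integer schema versions and the `schema_compatible` and `rollback_plan` helpers are hypothetical and only illustrate the decision, they do not touch a real database.

```shell
#!/bin/sh
# Hypothetical convention: schema versions are integers, and each release
# declares the newest schema version it is known to work with.
schema_compatible() {
  # $1: schema version currently applied to the database
  # $2: newest schema version the rollback target supports
  [ "$1" -le "$2" ]
}

rollback_plan() {
  # Print which kind of rollback the current schema allows.
  if schema_compatible "$1" "$2"; then
    echo "roll back code only"
  else
    echo "migrate schema down to $2 first, after confirming no data loss"
  fi
}
```

Calling `rollback_plan` with the applied schema version and the version supported by the rollback target tells you, before you touch production, whether the rollback also needs a down migration.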
If the affected functionality is neatly wrapped up in a pull request, you might consider reverting the whole pull request. A revert is a code change to undo a previous code change, and this method will be the slowest of all options other than actually researching and fixing the issue.
That said, GitHub accommodates this method nicely in its web UI. If you do revert, you should treat it the same as any other code change in your delivery pipeline and your process, but you prioritize it ahead of anything else.
Pros: It’s generally easier to revert an entire pull request than it is to find the specific problematic code and revert it. If you prefer the command line, you can also generate a revert commit with git.
Cons: Reverting code is a lengthier process since you are still making a code change and should ideally go through the typical QA process (including peer review, light testing and CI suite checks) as a result – all of which leads to more downtime.
This approach is also typically riskier than a rollback. A rollback will often restore code-adjacent pieces, such as application configuration, to a known good working state along with the code. A revert will not do that, so any incompatibility between the reverted code and the configuration still in place could cause additional errors.
Finally, database migrations can still cause issues with reverts in the same way they do with rollbacks.
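For teams that prefer the command line over the GitHub web UI, the revert option can be sketched with `git revert`. The `revert_commit` helper name is an assumption, and the sketch also assumes that a pull request merged with a merge commit should be reverted against its first parent, which is normally the branch it was merged into.

```shell
#!/bin/sh
# Sketch only: create a revert commit for a single commit or a merged pull request.
revert_commit() {
  # $1: SHA (or ref) of the commit to undo
  sha="$1"
  if git rev-parse --verify --quiet "$sha^2" >/dev/null; then
    # A second parent means a merge commit: revert relative to the first parent.
    git revert --no-edit -m 1 "$sha"
  else
    git revert --no-edit "$sha"
  fi
}
```

The resulting revert commit is an ordinary code change, so it still goes through peer review, CI and the delivery pipeline like anything else, just at the front of the queue.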
What to do next: recover and retro
Once you’ve restored to a known good state, your team can begin to recover from the mistake by properly identifying the issue and finding a permanent solution. This work should be prioritized, but can follow your typical planning process. At this point, you can also safely ungate your production deliveries to keep things moving forward.
Finally, you should schedule a blameless retro following the complete resolution of any incident so that your team can properly understand what happened and avoid making the same mistake again.