
9 Steps for Rapid Recovery when a Deployment Breaks Badly

At some point in every programmer’s career, a deployment does not go well. The result is extended downtime, business disruption, and a breakdown in trust among the people who count on the software. If you’re in the middle of a deployment gone wrong, the following steps can help get you back on track, and make you better prepared going forward.

1. Don’t try to fix it on the fly – roll back.

This is the worst time to make fixes. Don’t. It’s too stressful and the stakes are too high for ad hoc fixes. You risk introducing more errors and downtime if you try to fix things in a crisis. In fact, Capers Jones shows that almost 10% of all defects come from bad fixes. Instead, roll back to the previous working version of the system as soon as you can. Regroup from there. Make sure every step in your software processes is small and reversible.
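One way to picture "small and reversible" is to pair every deployment step with the action that undoes it. The sketch below is only an illustration: the `Step` class and `deploy` function are hypothetical names, not part of any particular deployment tool, and real steps would call your own scripts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One small deployment step paired with the action that undoes it (hypothetical)."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


def deploy(steps: List[Step]) -> bool:
    """Apply steps in order; if one fails, undo the completed steps in reverse order."""
    done: List[Step] = []
    for step in steps:
        try:
            step.apply()
            done.append(step)
        except Exception as exc:
            print(f"step {step.name!r} failed: {exc}; rolling back")
            for finished in reversed(done):
                finished.undo()
            return False
    return True
```

The point of the design is that nobody has to invent a fix under pressure: the way back was written down before the deployment started.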

2. Prioritize the preservation of data.

Protect data at all costs. Users and customers need to trust the system. Outages are understandable. Lost data is a disaster. As soon as you detect that a deployment is failing, take another backup of your data stores immediately. Compare it with your pre-deployment backup. Then, put the database back where it needs to be. If you don’t, every customer or user segment will be triaging and guessing what records are missing. They won’t be sure, and they will have ongoing doubt, especially in complex data situations.
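Comparing the two backups can be as simple as diffing records by primary key. The sketch below assumes each backup has already been loaded into a dictionary of the form `{table: {primary_key: row}}`; that shape, the `diff_snapshots` name, and the sample table are assumptions made for illustration, and a real comparison would read from your own backup format.

```python
from typing import Any, Dict, List

# Assumed shape: {table name: {primary key: row tuple}}
Snapshot = Dict[str, Dict[Any, tuple]]


def diff_snapshots(before: Snapshot, after: Snapshot) -> Dict[str, Dict[str, List[Any]]]:
    """Report, per table, which keys are missing, added, or changed since `before`."""
    report: Dict[str, Dict[str, List[Any]]] = {}
    for table in sorted(set(before) | set(after)):
        old = before.get(table, {})
        new = after.get(table, {})
        missing = sorted(k for k in old if k not in new)
        added = sorted(k for k in new if k not in old)
        changed = sorted(k for k in old if k in new and old[k] != new[k])
        if missing or added or changed:
            report[table] = {"missing": missing, "added": added, "changed": changed}
    return report


pre_deploy = {"orders": {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}}
post_failure = {"orders": {1: ("alice", 10), 2: ("bob", 99), 4: ("dave", 40)}}
print(diff_snapshots(pre_deploy, post_failure))
```

A report like this gives customers a concrete list of affected records instead of leaving them to guess.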

3. Begin data restores right away; they take a long time.

If your data has been changed, migrated, or corrupted, you’ll have to roll back using a backup. A good practice is to keep mirrored and incremental backups so that you can avoid a complete restore – even the fastest of networks can take days to copy large volumes of data. Consider restoring to a separate database alongside the new one in order to save time and keep your options open. Begin this task quickly. It takes a while. If you can be certain that data was not impacted, you are ahead of the game.
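A rough calculation shows why the restore has to start early. The numbers below are illustrative assumptions, not measurements: a 50 TB data set, a 1 Gbit/s link, and 70% effective throughput.

```python
def restore_hours(data_bytes: float, link_bits_per_second: float,
                  efficiency: float = 0.7) -> float:
    """Estimate hours needed to copy `data_bytes` over a link.

    `efficiency` is an assumed fraction of the nominal link speed that is
    actually achieved; measure your own environment for a real figure.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


hours = restore_hours(50e12, 1e9)  # 50 TB over 1 Gbit/s at 70% efficiency
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
```

Under these assumptions the copy alone takes roughly 159 hours, or about 6.6 days, before any verification or reindexing.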

4. Communicate quickly and honestly.

Devote most of your energy to fixing the system, but take the time to tell customers what is going on. Keep them apprised. That way, they’ll let you work. Otherwise, they’ll be all over you, demanding even more communication and disrupting the recovery, which makes the situation worse. Instead, be open, honest and transparent. If you have an online status page, keep it up to date and be specific; this will save you time processing inquiries.

5. Make the application function first, then scale.

If you have multiple nodes in a web farm, or are distributed across lots of servers, focus first on making the previous version of the application function properly. Once that’s accomplished, restore the high-scale or high-throughput characteristics. If you put a high level of traffic on the application too early, it can fail from overload, and you won’t be able to tell that failure apart from the original root cause, which causes further delays.
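One way to bring traffic back without overloading the restored application is to ramp it up in stages and step back at the first sign of trouble. This is a minimal sketch: `set_percent` and `is_healthy` are hypothetical callbacks standing in for your load balancer and health checks, and the levels are arbitrary.

```python
from typing import Callable, Sequence


def ramp_traffic(set_percent: Callable[[int], None],
                 is_healthy: Callable[[], bool],
                 levels: Sequence[int] = (1, 10, 50, 100)) -> int:
    """Raise traffic level by level; fall back to the last healthy level on failure.

    Returns the traffic percentage left in place. In real use you would wait
    and watch metrics between setting a level and checking health.
    """
    stable = 0
    for level in levels:
        set_percent(level)
        if not is_healthy():
            set_percent(stable)  # step back to the last level that held
            return stable
        stable = level
    return stable
```

Because each stage is checked separately, a failure at 50% traffic points at load rather than at the rollback itself.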

6. Recreate environments.

Configuring new servers may be quicker than figuring out why the existing servers have a bad configuration. Consider running both recovery paths in parallel. If your existing environments come up before the new ones are created, that’s even better. This is straightforward in cloud environments. It can also be done on-premises if you have Hyper-V or VMware capabilities that include on-demand VM creation, and if you have the capacity. Use the same process that your DevOps pipeline uses to create pre-production environments on demand. If you don’t have environment creation scripts, it can still be worthwhile to begin manually recreating the environments in case the attempts to fix the existing ones fail.
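Running recovery paths in parallel amounts to a race: start every path and take whichever succeeds first. The sketch below uses Python's standard `concurrent.futures`; the `first_successful` helper and the task names in the usage line are hypothetical, and each task would wrap your own repair or environment-creation script.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, Tuple


def first_successful(tasks: Dict[str, Callable[[], Any]]) -> Tuple[str, Any]:
    """Start every recovery path at once and return (name, result) of the first success.

    Paths that raise are recorded and skipped. Note that leaving the `with`
    block waits for any still-running path to finish; threads are not cancelled.
    """
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {pool.submit(fn): name for name, fn in tasks.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                return name, future.result()
            except Exception as exc:
                errors[name] = exc
    raise RuntimeError(f"all recovery paths failed: {errors}")
```

A call might look like `first_successful({"repair": repair_existing, "recreate": build_new_environment})`, where both functions are placeholders for your own procedures.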

7. Learn from the experience. Be brutally honest.

Once you have the system back up and running, you’ll want to rest. Don’t rest for too long. Get the team back together while the memories are fresh to learn from the event. You need to harden your process to make it nearly impossible for this disruption to happen again. Specific tips:

  • Write down the sequence of activities you took
  • Give your customers/investors an after-action report
  • Note the activities that took longer than you expected when you decided to take the action
  • Analyze and identify preventative quality control measures that can be introduced to your DevOps process to catch similar types of errors
  • Practice recovering from this type of failure in a pre-production environment

You should be able to stand up a new environment with a push of a button. If you found you didn’t have that capability, add it to your backlog and make it a priority.

8. Recognize that your automated process has failed.

If you are used to automated deployments, you might trust your automated process too much. In the middle of a failed deployment, recognize that this version of your deployment pipeline has just failed. Don’t count on the rollback logic you designed into it to recover from a situation nobody anticipated. Some database migration tools can migrate both forward and backward, and Object-Relational Mappers often offer the same through the schema migrations they run during deployment. Now is not the time to lean on them. Get your system back online, then fine-tune your automation.

9. Set a team vision for how deployments should go.

Now that you have the system back online, plan for the future. Set a vision. Be bold. Challenge your team to be able to do deployments in the middle of the day without your customers noticing. Encourage them to build routines and scripts that will allow the recreation of environments on demand. This is the silver bullet of recovering from a failed deployment, if there ever was one.

Unfortunately, bad deployments are a fact of life in software development. With the DevOps-Centered Software Engineering principles we help our clients implement, you can be fully prepared to recover in minutes.

About Clear Measure

At Clear Measure, we believe great software development is the lifeblood of a great business. We’re a DevOps-Centered Software Engineering company that specializes in Microsoft technologies and cloud-based platforms. Our focus is to help you and your team move fast, build smart and run with confidence by taking full advantage of DevOps automation, processes and skillsets.