Your Disaster Recovery Plan Probably Doesn't Work and Here's How to Fix It


I ran a disaster recovery test for a financial services firm six weeks ago. Their DR plan was 140 pages long, reviewed annually, and signed off by the board. It looked thorough and professional. It failed in under 20 minutes.

The first step—restoring the primary database from backup—couldn’t be completed because the backup target had been decommissioned eight months earlier during a storage migration. Nobody updated the DR plan. Nobody tested the new backup path. The 140-page document was fiction.

This isn’t unusual. In my experience, roughly 70% of DR plans fail their first genuine test. Not because the plans are poorly written—most are actually quite good on paper. They fail because the environment they describe no longer matches the environment that actually exists.

Why DR Plans Drift

Enterprise IT environments change constantly. New applications get deployed, old servers get decommissioned, network paths change, and cloud services replace on-premises components. Each change is individually small. But your DR plan doesn’t get updated with every change ticket.

After six months, your DR plan describes an environment that’s materially different from production. After twelve months, it’s a historical document. After two years, it’s dangerous—because people still believe it works.

What Breaks in Real Tests

I’ve run enough DR exercises to see the patterns. Here’s what fails most often:

DNS and networking. Your failover site has different IP ranges and DNS servers. Applications break because of hardcoded IP addresses, cached DNS entries, or missing firewall rules. This accounts for about 40% of DR failures I see.

Authentication and certificates. Service accounts have different permissions at the DR site. SSL certificates expired six months ago because nobody’s monitoring them. Users can’t log in, services can’t authenticate, everything cascades.

Data dependencies. You can restore the database, but the application also needs a Redis cache, an Elasticsearch index, and flat files on an NFS share. The DR plan covers the database. The other dependencies aren’t mentioned.

Third-party integrations. Your payment provider, SMS gateway, and shipping API all have IP whitelisting. Your DR site has different public IPs. None of the third parties have your DR IPs whitelisted.

People. The person who knows how to restore SAP left last year. The runbook assumes knowledge they never documented. DR plans assume specific humans will be available with specific knowledge. That assumption fails regularly.
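
Several of these failure modes are detectable long before a test ever runs. As one illustration rather than a prescription, here is a minimal Python sketch (standard library only) that checks whether DR-site hostnames still resolve and how many days their TLS certificates have left. The hostnames in `DR_ENDPOINTS` are hypothetical placeholders for your own endpoints.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical DR-site endpoints -- substitute your own.
DR_ENDPOINTS = ["db.dr.example.com", "app.dr.example.com"]

def check_resolves(hostname):
    """Return True if the hostname resolves at all (catches stale DNS
    entries and decommissioned targets)."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def days_until_cert_expiry(hostname, port=443, timeout=5):
    """Connect over TLS and return days until the certificate expires,
    so an expiry can be flagged months before it bites."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Jun  1 12:00:00 2025 GMT"
    expiry = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    expiry = expiry.replace(tzinfo=timezone.utc)
    return (expiry - datetime.now(timezone.utc)).days
```

Run on a weekly schedule, checks like these catch the decommissioned-target and expired-certificate failures while they are still cheap to fix.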

How to Test Properly

Stop running tabletop exercises and calling them DR tests. Tabletops have value for coordination planning, but they don’t validate whether your technology actually recovers.

Start with component-level recovery tests. Pick one critical system and test recovering it from backup to a running state. Time it. Document every problem. Fix them. Move to the next system.

Test monthly, not annually. Annual DR tests are compliance theatre. Monthly component tests, rotating through critical systems, keep your capabilities sharp.

Use a different team. If only one person can restore your database, you don’t have a DR plan—you have a single point of failure wearing a lanyard. The test should validate that a competent engineer can follow the runbook without calling the system owner.

Measure recovery time honestly. Your DR plan states an RTO of 8 hours. How long did recovery actually take in your last test? Most organisations have never measured this. They’re guessing.
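
The measurement step is easy to automate. Here is a hedged sketch of a small harness that runs a list of recovery steps as shell commands, times the full run, and compares the result against the stated RTO. The step commands in the usage comment are hypothetical; substitute your own runbook steps.

```python
import subprocess
import time

def timed_recovery(name, steps, rto_hours):
    """Run each recovery step, time the full run, and compare the
    measured recovery time against the stated RTO."""
    start = time.monotonic()
    for step in steps:
        # A failing step is a test finding; stop and record it.
        subprocess.run(step, shell=True, check=True)
    elapsed_hours = (time.monotonic() - start) / 3600
    met_rto = elapsed_hours <= rto_hours
    print(f"{name}: measured {elapsed_hours:.2f}h against stated "
          f"{rto_hours}h RTO -> {'PASS' if met_rto else 'FAIL'}")
    return elapsed_hours, met_rto

# Hypothetical usage:
# timed_recovery("orders-db", ["./restore-db.sh", "./smoke-test.sh"], rto_hours=8)
```

Keeping the printed results from every monthly run gives you a measured trend line instead of a guess when someone asks whether the RTO is achievable.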

The Cloud Doesn’t Save You

Moving to the cloud doesn’t solve DR problems—it changes them. Instead of worrying about hardware failure at your DR site, you’re worrying about region failover, cross-region replication lag, and the cost of running hot standbys in a second region.

I’ve seen organisations assume that “being in the cloud” means DR is handled. It’s not. Cloud providers guarantee their infrastructure is resilient. They don’t guarantee your application recovers correctly when you fail over to a different region. That’s still your problem.

The Minimum Viable DR Program

If you’re starting from scratch, here’s the minimum:

Document your actual critical systems. Just the 10-15 systems that cause significant business impact if down for more than 4 hours.

Verify your backups weekly. Not “check the backup job succeeded”—actually restore a backup and confirm the data is usable.

Run a component recovery test monthly. Rotate through critical systems. Half a day each, not a week-long exercise.

Update the DR plan after every significant change. Make it part of your change management process. Not later. Not at the annual review. Now.
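
“Actually restore a backup and confirm the data is usable” can be made concrete. The sketch below restores a SQL dump into a scratch database and runs basic sanity checks; it uses SQLite purely so the example is self-contained, and `expected_tables` and `min_rows` are placeholders for whatever checks fit your data. A real weekly check would drive your database’s native restore tooling (pg_restore, RMAN, and so on) the same way.

```python
import sqlite3
import tempfile
from pathlib import Path

def verify_backup(dump_sql, expected_tables, min_rows):
    """Restore a SQL dump into a scratch database and confirm the data
    is actually queryable -- not just that the backup job succeeded."""
    with tempfile.TemporaryDirectory() as tmp:
        db = sqlite3.connect(str(Path(tmp) / "restore_test.db"))
        try:
            db.executescript(dump_sql)  # the actual restore
            for table in expected_tables:
                # Table names come from your own checklist, not untrusted input.
                count = db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
                if count < min_rows.get(table, 1):
                    return False
            return True
        except sqlite3.Error:
            return False
        finally:
            db.close()
```

The row-count floor matters: a restore that succeeds but produces a near-empty table is exactly the kind of silent failure a “job succeeded” check misses.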

The Real Question

Ask yourself this: if your primary data centre went offline right now, how confident are you—genuinely confident, not “well, we have a plan” confident—that your business would be operating again within your stated RTO?

If the answer isn’t “very confident because we tested it last month,” you have work to do. The time to find out your DR plan doesn’t work is during a test, not during a crisis. Every month you don’t test is a month you’re betting your business on a document nobody’s validated.

That’s not risk management. That’s hope. And hope isn’t a strategy.