Backup & recovery plan · RPO/RTO, schedule, storage, restore testing

A backup you have never restored is just a hope.

A fill-in template for a database backup and recovery plan: recovery objectives, backup schedule, encrypted off-account storage, a tested restore runbook, and clear ownership. Snapshots running quietly in your cloud account are not a plan. This template turns them into one you can rely on when a bad deploy or a ransomware event hits.

6 sections, fill-in template

Draft it in an afternoon

For engineering & ops leads

Why a written recovery plan matters

Almost every team believes it has backups. Far fewer can state their recovery objectives, where the backups live, who restores them, or when anyone last tried. The gap between “we take snapshots” and “we can be back online in under four hours with at most an hour of data loss” is the whole point of a plan. The first time you exercise a restore should never be during a real incident.

This template gives you the structure to write that plan down. It is the same discipline we apply when we manage cloud infrastructure for clients. If you want the underlying concepts, the encryption at rest and infrastructure as code glossary entries explain how backups should be protected and provisioned.

1. Recovery objectives

Recovery point objective (RPO): the maximum acceptable data loss, expressed in time. State it per database or dataset, since not all data carries equal risk.
Recovery time objective (RTO): the maximum acceptable downtime during recovery. Be honest about what the business can actually tolerate.
Criticality tier: rank each datastore so the most important systems get the tightest objectives and the most testing.
Scope: list exactly which databases, schemas, and dependent stores (caches, object storage, search indexes) the plan covers.
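
One way to keep these objectives reviewable is to record them as data rather than prose. The following is a minimal sketch, not a prescribed format: the datastore names, tiers, and time values are hypothetical placeholders, and `tier_inversions` is an illustrative helper that flags cases where a more critical datastore has looser targets than a less critical one.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """Recovery targets for one datastore."""
    datastore: str
    tier: int        # criticality tier, 1 = most critical
    rpo: timedelta   # maximum acceptable data loss
    rto: timedelta   # maximum acceptable downtime


# Hypothetical entries for illustration; replace with your own scope.
OBJECTIVES = [
    RecoveryObjective("orders-db", 1, timedelta(minutes=15), timedelta(hours=1)),
    RecoveryObjective("search-index", 2, timedelta(hours=1), timedelta(hours=4)),
    RecoveryObjective("analytics-db", 3, timedelta(hours=24), timedelta(hours=8)),
]


def tier_inversions(objectives):
    """Return (more_critical, less_critical) name pairs where the more
    critical datastore has a looser RPO or RTO than the less critical one."""
    pairs = []
    for a in objectives:
        for b in objectives:
            if a.tier < b.tier and (a.rpo > b.rpo or a.rto > b.rto):
                pairs.append((a.datastore, b.datastore))
    return pairs


for o in sorted(OBJECTIVES, key=lambda o: o.tier):
    print(f"tier {o.tier}: {o.datastore} RPO={o.rpo} RTO={o.rto}")
print("inversions:", tier_inversions(OBJECTIVES))
```

An empty inversion list means the tightest objectives sit with the most critical tiers. A non-empty list is a prompt to revisit either the tiering or the targets with the people who own the business risk.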

2. Backup strategy & schedule

Backup types: define your mix of full backups, incremental backups, and point-in-time recovery, and how they combine to meet your RPO.
Schedule: state the frequency of each backup type and the time window in which it runs.
Retention: define how long each backup is kept, and any longer-term archival required for compliance or contracts.
Automation: backups run automatically on a schedule, never as a manual task someone has to remember.
Verification: every backup job is checked for success, and failures raise an alert that a named person sees.
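
To check a schedule against an RPO, it can help to work out the worst-case exposure explicitly. The sketch below assumes a simplified model in which each backup captures the state at the moment it starts and is only usable once it finishes. The function names and the example intervals are illustrative, not taken from any particular tool.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, job_duration: timedelta) -> timedelta:
    """Worst-case loss when each backup captures state at its start time.

    A failure just before a running backup finishes leaves only the previous
    backup usable, so the exposure is one interval plus one job duration.
    """
    return interval + job_duration


def meets_rpo(interval: timedelta, job_duration: timedelta, rpo: timedelta) -> bool:
    """True if the schedule's worst-case loss stays within the RPO."""
    return worst_case_data_loss(interval, job_duration) <= rpo


rpo = timedelta(hours=1)
# An hourly backup that takes 10 minutes exposes up to 70 minutes of data.
print(meets_rpo(timedelta(hours=1), timedelta(minutes=10), rpo))
# Shortening the interval to 45 minutes brings the exposure down to 55 minutes.
print(meets_rpo(timedelta(minutes=45), timedelta(minutes=10), rpo))
```

Under this model, a backup interval equal to the RPO is not enough on its own. Continuous point-in-time recovery changes the arithmetic, since the effective interval becomes how often changes are shipped off the primary.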

3. Storage & encryption

Location: store backups in a separate account or region from the primary database so one failure cannot destroy both.
Encryption: encrypt backups at rest and in transit, and document where the keys live and who can use them.
Access control: restrict who can read, write, or delete backups, and log all access to them.
Immutability: where supported, use write-once or object-lock storage so backups cannot be deleted or altered by a compromised account.
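
The storage requirements above can be expressed as a small checklist that runs against whatever inventory of backup copies you keep. This is a sketch under assumptions: the `Location` and `BackupCopy` types, the account names, and the finding messages are hypothetical, and a real check would read these properties from your cloud provider rather than from hand-written values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    account: str
    region: str


@dataclass(frozen=True)
class BackupCopy:
    location: Location
    encrypted_at_rest: bool
    encrypted_in_transit: bool
    immutable: bool  # e.g. write-once or object-lock storage


def storage_findings(primary: Location, copy: BackupCopy) -> list:
    """Return a list of problems with how a backup copy is stored."""
    findings = []
    same_account = copy.location.account == primary.account
    same_region = copy.location.region == primary.region
    if same_account and same_region:
        findings.append("backup shares account and region with the primary")
    if not copy.encrypted_at_rest:
        findings.append("backup is not encrypted at rest")
    if not copy.encrypted_in_transit:
        findings.append("backup is not encrypted in transit")
    if not copy.immutable:
        findings.append("backup is not on immutable storage")
    return findings


# Hypothetical values for illustration only.
primary = Location(account="prod", region="region-a")
copy = BackupCopy(Location("prod", "region-a"), True, True, False)
for finding in storage_findings(primary, copy):
    print(finding)
```

The immutability finding is worth treating as advisory where the storage service does not support it, in line with the "where supported" qualifier above.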

4. Restoration & testing

Restore runbook: a step-by-step procedure anyone on call can follow under pressure, not tribal knowledge.
Test cadence: restore to a clean environment at least quarterly and after major changes, and record how long it took against your RTO.
Validation: after a test restore, verify data integrity and application functionality, not just that the job completed.
Partial recovery: document how to restore a single table or a point in time, not only a full rebuild.
Dependencies: confirm the restore brings back everything the app needs, including schema, data, extensions, and related stores.
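
Recording each test restore in a consistent shape makes it easy to compare results against the RTO and to notice when a test is overdue. The sketch below is one possible shape, with assumptions labeled: the `RestoreTest` fields are hypothetical, and "quarterly" is approximated as 92 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 92 days between tests.
MAX_TEST_AGE = timedelta(days=92)


@dataclass(frozen=True)
class RestoreTest:
    run_on: date
    duration: timedelta    # wall-clock time from start to a usable database
    integrity_ok: bool     # e.g. row counts or checksums matched
    application_ok: bool   # e.g. the app's smoke tests passed on the restore


def restore_findings(test: RestoreTest, rto: timedelta, today: date) -> list:
    """Return a list of problems with the most recent restore test."""
    findings = []
    if test.duration > rto:
        findings.append(f"restore took {test.duration}, over the RTO of {rto}")
    if not test.integrity_ok:
        findings.append("data integrity check failed")
    if not test.application_ok:
        findings.append("application check failed")
    if today - test.run_on > MAX_TEST_AGE:
        findings.append("last restore test is more than a quarter old")
    return findings


# Hypothetical test record for illustration.
last_test = RestoreTest(date(2024, 1, 10), timedelta(hours=5), True, False)
for finding in restore_findings(last_test, timedelta(hours=4), date(2024, 6, 1)):
    print(finding)
```

A restore that completes but fails the application check still counts as a failed test here, which matches the validation item above: the job finishing is not the same as the data being usable.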

5. Monitoring & change control

Alerting: failed or missed backups, growing backup duration, and storage nearing capacity all trigger alerts.
Reporting: a regular review confirms backups are succeeding and objectives are still being met as data grows.
Change control: the plan is updated whenever the database, schedule, or infrastructure changes.
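
The three alert conditions can each be reduced to a simple predicate. The following is a rough sketch rather than a monitoring implementation: the grace period, the growth threshold, and the capacity threshold are assumed defaults to tune, and the inputs would come from your backup job history.

```python
from datetime import datetime, timedelta
from statistics import mean


def backup_is_overdue(last_success: datetime, now: datetime,
                      interval: timedelta,
                      grace: timedelta = timedelta(minutes=30)) -> bool:
    """True if no backup has succeeded within one interval plus a grace period."""
    return now - last_success > interval + grace


def duration_is_growing(durations_minutes, window: int = 3,
                        threshold: float = 1.5) -> bool:
    """True if the mean of the last `window` runs exceeds `threshold` times
    the mean of all earlier runs. Needs at least 2 * window data points."""
    if len(durations_minutes) < 2 * window:
        return False
    recent = durations_minutes[-window:]
    baseline = durations_minutes[:-window]
    return mean(recent) > threshold * mean(baseline)


def storage_near_capacity(used: float, total: float,
                          threshold: float = 0.8) -> bool:
    """True if used storage is at or above the threshold fraction of total."""
    return used / total >= threshold


now = datetime(2024, 1, 1, 2, 0)
print(backup_is_overdue(datetime(2024, 1, 1, 0, 0), now, timedelta(hours=1)))
print(duration_is_growing([10, 10, 11, 10, 16, 17, 18]))
print(storage_near_capacity(850, 1000))
```

Checking for a missing success, rather than only for a reported failure, catches the case where the backup job silently stopped running.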

6. Roles & responsibilities

Plan owner: the named person accountable for keeping this plan current and tested.
On-call responder: who executes a recovery, and how they are reached at 3 a.m.
Escalation path: who is informed and who decides during a major data-loss event.
Review schedule: the recurring date on which this plan and its assumptions are revisited.

How to use this template

Start with the objectives section, because everything else flows from it. Decide your RPO and RTO per datastore with the people who own the business risk, not in isolation — those two numbers determine your backup frequency and how fast your restore process has to be. Then fill in the strategy, storage, and testing sections to actually meet the targets you set, and assign every responsibility to a named person.

The section that teams skip and later regret is restore testing. Schedule a real restore to a clean environment at least quarterly, time it against your RTO, and verify the data is correct — not just that the job finished. Pair this plan with a broader incident response plan, since data loss is one of the incidents your team has to be ready to handle.

How this connects to our work

A tested backup and recovery plan is part of how we set up and run cloud infrastructure and how we approach data engineering work. When we manage production systems with our DevOps engineering service, backups are codified, monitored, and exercised — not left to a default snapshot setting nobody reviews.

If you want help defining realistic recovery objectives, hardening where backups live, or building a restore process you have actually tested, see how we scope and price the work or reach out to talk it through.

Frequently asked questions

What is the difference between RPO and RTO?

RPO, the recovery point objective, is how much data you can afford to lose, measured in time. An RPO of one hour means you must be able to restore to a state no more than an hour older than the moment of failure. RTO, the recovery time objective, is how long you can be down while you recover. RPO drives backup frequency; RTO drives how fast your restore process has to be.

Isn't a managed database's automatic backup enough?

Automatic snapshots are a good foundation, but they are not a plan. They do not define your recovery objectives, on some managed services they are deleted along with the instance they protect, and their value is unknown until someone has tested a restore. The plan ties the mechanism to objectives, ownership, off-account copies, and tested procedures.

How often should we test a restore?

At least quarterly, and after any significant change to the database or backup configuration. A backup you have never restored is a hypothesis, not a safeguard. Many teams discover their backups are corrupt, incomplete, or missing a critical dependency only the first time they try to use them — which should never be during a real incident.

Where should backups be stored?

In a separate location from the primary database — ideally a different account or region — so a compromised account, an accidental deletion, or a regional outage cannot take both at once. Backups should be encrypted at rest and in transit, and access to them should be tightly restricted and logged.

Related resources & reading

Incident Response Plan Template

The broader plan a data-loss event plugs into.

SaaS Security Checklist

Where backups fit in a full SaaS security baseline.

Cloud Cost Optimization Checklist

Right-size storage and retention without over-paying.

Cloud Infrastructure

How we set up resilient, recoverable infrastructure.
