Disaster Recovery Standard Operating Procedure (SOP)
Platform: Smartsapp
Scope: Kubernetes clusters running PostgreSQL via StackGres, with Velero for cluster backup and restore
Audience: Platform / SRE / Infrastructure operators
Last updated: 2026-02-08
1. Purpose
This document defines the standard operating procedure (SOP) for responding to incidents and performing disaster recovery (DR) for:
- Kubernetes clusters
- PostgreSQL databases managed by StackGres
The goal is to ensure predictable, repeatable, and safe recovery from infrastructure or data failures with minimal ambiguity during incidents.
2. Recovery Philosophy (Read This First)
Core principles:
- Clusters are disposable
  - We do not repair broken clusters in place
  - We restore into a new or clean cluster when needed
- We restore data, not disks
  - PersistentVolumes are not migrated
  - PostgreSQL data is restored using StackGres backups and PITR
- Disaster Recovery ≠ High Availability
  - This SOP assumes minutes-to-hours RTO
  - Zero-downtime failover is out of scope
- Backups must live outside the cluster
  - Object storage (S3 / MinIO) is the source of truth
3. Tools & Responsibilities
3.1 Tooling
| Tool | Responsibility |
|---|---|
| Velero | Backup & restore Kubernetes objects and PVC metadata |
| StackGres | PostgreSQL backups, WAL archiving, PITR |
| Object Storage (S3/MinIO) | Durable backup storage |
| Git (Helm/GitOps) | Desired cluster and app state |
3.2 Operator Responsibilities
Operators are responsible for:
- Declaring incident severity
- Choosing the correct recovery path
- Executing restores safely
- Communicating status updates
4. Incident Classification
4.1 Infrastructure Incidents
Examples:
- Kubernetes cluster deleted or unreachable
- Control plane failure
- Multiple node loss
- etcd corruption
4.2 Data Incidents
Examples:
- Accidental DELETE / UPDATE
- Bad migration
- Application bug corrupting data
- Partial data loss
4.3 Combined Incidents
Examples:
- Cluster loss + data loss
- Failed upgrade affecting DB and apps
5. Standard Response Flow (All Incidents)
1. Stop the blast radius
   - Freeze deployments
   - Disable CI/CD auto-deploys
   - Pause write traffic if required
2. Identify incident type
   - Infrastructure-only
   - Data-only
   - Combined
3. Choose recovery path (Sections 6–8)
4. Communicate
   - Internal status update
   - ETA and impact assessment
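The "freeze deployments" step can be sketched with plain kubectl. This is a minimal illustration, assuming a hypothetical application namespace `smartsapp` and workload names; if a GitOps controller (e.g. Argo CD or Flux) manages the cluster, suspend reconciliation through that tool instead.

```shell
# Pause rollouts for every Deployment in the application namespace
# ("smartsapp" is a placeholder); paused Deployments ignore new
# pod-template changes until explicitly resumed.
kubectl get deploy -n smartsapp -o name \
  | xargs -I{} kubectl rollout pause -n smartsapp {}

# If write traffic must stop entirely, scale the write-path workload
# to zero ("api-writer" is a placeholder name).
kubectl scale deploy/api-writer -n smartsapp --replicas=0
```

After recovery, reverse these with `kubectl rollout resume` and by scaling the workloads back up.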
6. Infrastructure Recovery (Kubernetes Cluster)
6.1 When to Use
Use this procedure if:
- The cluster is lost or unrecoverable
- Control plane integrity is compromised
6.2 Important Assumption: MinIO
MinIO running inside the cluster is always assumed LOST during infrastructure disasters unless explicitly proven otherwise.
- Do NOT attempt to recover or reattach old MinIO PersistentVolumes
- Do NOT trust in-cluster MinIO as a backup source
- The offsite object storage copy (AWS S3 / DigitalOcean Spaces) is the source of truth
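Copying backup data from the offsite source of truth into a fresh in-cluster MinIO can be sketched with the MinIO client (`mc`). The alias names, endpoints, bucket names, and credential variables below are placeholders.

```shell
# Register both endpoints (URLs and credential variables are placeholders).
mc alias set offsite https://s3.amazonaws.com "$OFFSITE_KEY" "$OFFSITE_SECRET"
mc alias set dr-minio https://minio.dr-cluster.example "$MINIO_KEY" "$MINIO_SECRET"

# Mirror the backup buckets from the offsite copy into the fresh MinIO.
mc mirror offsite/velero-backups dr-minio/velero-backups
mc mirror offsite/sgbackups dr-minio/sgbackups
```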
6.3 Procedure
1. Provision a new Kubernetes cluster
   - Same Kubernetes version (or compatible)
   - Networking and storage configured
2. Deploy MinIO as a fresh, empty installation
   - Do not attach previous volumes
   - Configure credentials and buckets
3. Restore MinIO data from offsite storage
4. Validate presence of:
   - Velero backups
   - PostgreSQL base backups
   - PostgreSQL WAL archives
5. Install Velero
   - Point Velero to the restored MinIO buckets
6. Restore Kubernetes objects
   - Restore namespaces, CRDs, Deployments, and Services
   - Exclude PostgreSQL resources at this stage if a data incident is suspected
7. Verify cluster health
   - Nodes Ready
   - Core services running
8. Proceed to database recovery if needed
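The Velero install and restore steps might look like the following. This is a sketch, not the canonical invocation: the plugin image tag, bucket, secret file, MinIO URL, backup name, and namespace are placeholders to adapt to the environment.

```shell
# Install Velero pointed at the restored MinIO buckets
# (plugin version, bucket, and URL are placeholders).
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --secret-file ./minio-credentials \
  --use-volume-snapshots=false \
  --backup-location-config region=minio,s3ForcePathStyle=true,s3Url=https://minio.dr-cluster.example

# Confirm the backups synced from object storage are visible.
velero backup get

# Restore Kubernetes objects, excluding the database namespace when a
# data incident is suspected ("daily-full" and "postgres" are placeholders).
velero restore create dr-restore-$(date +%Y%m%d) \
  --from-backup daily-full \
  --exclude-namespaces postgres
```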
7. PostgreSQL Recovery – Latest State
7.1 When to Use
Use this if:
- Cluster was lost
- Data itself is known to be correct
7.2 Procedure
1. Create a new StackGres Postgres cluster
2. Restore from the latest successful backup
3. Wait for the restore to complete
4. Validate:
   - Database connections
   - Schema presence
   - Row counts / basic queries
5. Update application connection strings if needed
6. Resume write traffic
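This restore can be expressed as a StackGres `SGCluster` manifest with an `initialData.restore` section. The sketch below assumes StackGres 1.x CRDs; the cluster name, namespace, Postgres version, sizing, and backup name are placeholders to verify against your environment.

```shell
# Find the most recent completed SGBackup ("postgres" namespace is a placeholder).
kubectl get sgbackups -n postgres

# Create a new cluster initialized from that backup.
kubectl apply -f - <<'EOF'
apiVersion: stackgres.io/v1
kind: SGCluster
metadata:
  name: smartsapp-db-restored   # placeholder cluster name
  namespace: postgres
spec:
  instances: 2
  postgres:
    version: "16"
  pods:
    persistentVolume:
      size: 50Gi
  initialData:
    restore:
      fromBackup:
        name: daily-backup-20260208   # placeholder SGBackup name
EOF
```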
8. PostgreSQL Recovery – Point-in-Time Recovery (PITR)
8.1 When to Use
Use PITR if:
- Data was corrupted
- Accidental delete/update occurred
- A specific moment before failure is known
8.2 Procedure
1. Identify the target recovery timestamp
   - Prefer the last known-good application event
2. Create a new StackGres Postgres cluster
3. Enable PITR restore
   - Specify the recovery timestamp
4. Monitor the restore process
5. Validate restored data
   - Confirm corrupted data is absent
   - Spot-check business-critical tables
6. Switch applications to the restored database
7. Archive or delete the old database cluster
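A PITR restore is the same `SGCluster` manifest with a `pointInTimeRecovery` block under the restore source. Again a sketch assuming StackGres 1.x CRDs; the names, sizing, base backup, and timestamp are placeholders.

```shell
kubectl apply -f - <<'EOF'
apiVersion: stackgres.io/v1
kind: SGCluster
metadata:
  name: smartsapp-db-pitr   # placeholder cluster name
  namespace: postgres
spec:
  instances: 2
  postgres:
    version: "16"
  pods:
    persistentVolume:
      size: 50Gi
  initialData:
    restore:
      fromBackup:
        name: daily-backup-20260207   # placeholder base backup taken BEFORE the target time
        pointInTimeRecovery:
          restoreToTimestamp: "2026-02-08T09:15:00Z"   # placeholder last known-good moment (UTC)
EOF
```

The base backup must predate the recovery timestamp; WAL archives replay forward from the backup to the target time.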
9. Post-Recovery Checklist
Before declaring the incident resolved, all applicable sections below must be explicitly verified and checked off.
9.1 Platform & Infrastructure (Required)
- [ ] Kubernetes cluster healthy (all nodes `Ready`)
- [ ] Control plane stable (API server responsive, no crash loops)
- [ ] Core system namespaces healthy (`kube-system`, `monitoring`, `ingress`)
- [ ] Network policies enforced and functioning
9.2 MinIO & Backup Storage (Required)
- [ ] MinIO deployed as fresh installation (no reused PVCs)
- [ ] MinIO data restored from offsite source of truth
- [ ] Offsite bucket reachable and readable
- [ ] Object versioning confirmed enabled on offsite bucket
- [ ] Sample backup objects verified readable
9.3 PostgreSQL – StackGres (Required)
- [ ] Database restored using correct method (latest backup or PITR)
- [ ] PITR restore timestamp recorded (if applicable)
- [ ] Database accepting writes
- [ ] Application schemas present
- [ ] Business-critical tables spot-checked
- [ ] WAL archiving and scheduled backups resumed
9.4 Applications & Workloads (Required)
- [ ] All application pods running and stable
- [ ] No crash loops or prolonged `Pending` states
- [ ] Read paths validated
- [ ] Write paths validated
- [ ] Background workers / schedulers operational
9.5 Observability & Alerting (If Applicable)
- [ ] Metrics ingestion restored
- [ ] Logs flowing to logging backend
- [ ] Traces visible (if enabled)
- [ ] At least one test alert successfully fired
9.6 CI/CD & Platform Operations (Required)
- [ ] CI/CD pipelines re-enabled
- [ ] Automatic deployments resumed
- [ ] Velero backup schedule confirmed active
- [ ] Successful Velero backup observed post-recovery
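The last two items can be verified from the CLI; the schedule name and backup name below are placeholders.

```shell
# Confirm the backup schedule is active, then trigger and inspect a
# post-recovery backup ("daily-full" is a placeholder schedule name).
velero schedule get
velero backup create post-dr-verify --from-schedule daily-full
velero backup describe post-dr-verify
```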
9.7 Security & Access (If Applicable)
- [ ] Secrets restored correctly
- [ ] Database credentials rotated (if compromise suspected)
- [ ] Access logs reviewed for anomalies
9.8 Communication & Follow-Up (Required)
- [ ] Incident resolved notification sent
- [ ] Incident timeline documented
- [ ] Root cause analysis scheduled or completed
- [ ] Follow-up action items created and tracked
Do not close the incident until every required checklist section above is completed.
10. Testing & Drills
10.1 Required Practice
- Full restore test at least once per quarter
- PITR restore test at least once per quarter
10.2 Test Procedure
- Restore into a non-production namespace
- Validate data correctness
- Record:
  - Time to restore
  - Issues encountered
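One way to run such a drill with Velero is to remap the restored namespace so production objects are untouched; the backup and namespace names below are placeholders.

```shell
# Restore a production backup into a drill namespace without touching
# the original ("daily-full" and "smartsapp" are placeholders).
velero restore create drill-$(date +%Y%m%d) \
  --from-backup daily-full \
  --include-namespaces smartsapp \
  --namespace-mappings smartsapp:smartsapp-drill

# Record restore duration and status for the drill report.
velero restore describe drill-$(date +%Y%m%d) --details
```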
11. What This SOP Does NOT Cover
- Zero-downtime failover
- Cross-region active-active Postgres
- Real-time replication
These require separate HA architecture.
12. Final Rule (Non-Negotiable)
If a recovery path is unclear during an incident, STOP and escalate. Guessing during DR causes more damage than downtime.
End of document