Disaster Recovery Standard Operating Procedure (SOP)
Platform: Smartsapp
Scope: Kubernetes clusters running PostgreSQL via StackGres, with Velero for cluster backup and restore
Audience: Platform / SRE / Infrastructure operators
Last updated: 2026-02-08
1. Purpose
This document defines the standard operating procedure (SOP) for responding to incidents and performing disaster recovery (DR) for:
- Kubernetes clusters
- PostgreSQL databases managed by StackGres
The goal is to ensure predictable, repeatable, and safe recovery from infrastructure or data failures with minimal ambiguity during incidents.
2. Recovery Philosophy (Read This First)
Core principles:
- Clusters are disposable
  - We do not repair broken clusters in place
  - We restore into a new or clean cluster when needed
- We restore data, not disks
  - PersistentVolumes are not migrated
  - PostgreSQL data is restored using StackGres backups and PITR
- Disaster Recovery ≠ High Availability
  - This SOP assumes minutes-to-hours RTO
  - Zero-downtime failover is out of scope
- Backups must live outside the cluster
  - Object storage (S3 / MinIO) is the source of truth
3. Tools & Responsibilities
3.1 Tooling
| Tool | Responsibility |
|---|---|
| Velero | Backup & restore Kubernetes objects and PVC metadata |
| StackGres | PostgreSQL backups, WAL archiving, PITR |
| Object Storage (S3/MinIO) | Durable backup storage |
| Git (Helm/GitOps) | Desired cluster and app state |
3.2 Operator Responsibilities
Operators are responsible for:
- Declaring incident severity
- Choosing the correct recovery path
- Executing restores safely
- Communicating status updates
4. Incident Classification
4.1 Infrastructure Incidents
Examples:
- Kubernetes cluster deleted or unreachable
- Control plane failure
- Multiple node loss
- etcd corruption
4.2 Data Incidents
Examples:
- Accidental DELETE / UPDATE
- Bad migration
- Application bug corrupting data
- Partial data loss
4.3 Combined Incidents
Examples:
- Cluster loss + data loss
- Failed upgrade affecting DB and apps
5. Standard Response Flow (All Incidents)
1. Stop the blast radius
   - Freeze deployments
   - Disable CI/CD auto-deploys
   - Pause write traffic if required
2. Identify incident type
   - Infrastructure-only
   - Data-only
   - Combined
3. Choose recovery path (Sections 6–8)
4. Communicate
   - Internal status update
   - ETA and impact assessment
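The "freeze deployments" step can be sketched with plain kubectl. This is a minimal illustration, assuming a hypothetical application namespace `smartsapp` and workload names; if a GitOps controller (e.g. Argo CD or Flux) manages the cluster, suspend reconciliation through that tool instead.

```shell
# Pause rollouts for every Deployment in the application namespace
# ("smartsapp" is a placeholder); paused Deployments ignore new
# pod-template changes until explicitly resumed.
kubectl get deploy -n smartsapp -o name \
  | xargs -I{} kubectl rollout pause -n smartsapp {}

# If write traffic must stop entirely, scale the write-path workload
# to zero ("api-writer" is a placeholder name).
kubectl scale deploy/api-writer -n smartsapp --replicas=0
```

After recovery, reverse these with `kubectl rollout resume` and by scaling the workloads back up.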
6. Infrastructure Recovery (Kubernetes Cluster)
6.1 When to Use
Use this procedure if:
- The cluster is lost or unrecoverable
- Control plane integrity is compromised
6.2 Important Assumption: MinIO
MinIO running inside the cluster is always assumed LOST during infrastructure disasters unless explicitly proven otherwise.
- Do NOT attempt to recover or reattach old MinIO PersistentVolumes
- Do NOT trust in-cluster MinIO as a backup source
- The offsite object storage copy (AWS S3 / DigitalOcean Spaces) is the source of truth
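Copying backup data from the offsite source of truth into a fresh in-cluster MinIO can be sketched with the MinIO client (`mc`). The alias names, endpoints, bucket names, and credential variables below are placeholders.

```shell
# Register both endpoints (URLs and credential variables are placeholders).
mc alias set offsite https://s3.amazonaws.com "$OFFSITE_KEY" "$OFFSITE_SECRET"
mc alias set dr-minio https://minio.dr-cluster.example "$MINIO_KEY" "$MINIO_SECRET"

# Mirror the backup buckets from the offsite copy into the fresh MinIO.
mc mirror offsite/velero-backups dr-minio/velero-backups
mc mirror offsite/sgbackups dr-minio/sgbackups
```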
6.3 Procedure
1. Provision a new Kubernetes cluster
   - Same Kubernetes version (or compatible)
   - Networking and storage configured
2. Deploy MinIO as a fresh, empty installation
   - Do not attach previous volumes
   - Configure credentials and buckets
3. Restore MinIO data from offsite storage
4. Validate presence of:
   - Velero backups
   - PostgreSQL base backups
   - PostgreSQL WAL archives
5. Install Velero
   - Point Velero to the restored MinIO buckets
6. Restore Kubernetes objects
   - Restore namespaces, CRDs, Deployments, and Services
   - Exclude PostgreSQL resources at this stage if a data incident is suspected
7. Verify cluster health
   - Nodes Ready
   - Core services running
8. Proceed to database recovery if needed
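The Velero install and restore steps might look like the following. This is a sketch, not the canonical invocation: the plugin image tag, bucket, secret file, MinIO URL, backup name, and namespace are placeholders to adapt to the environment.

```shell
# Install Velero pointed at the restored MinIO buckets
# (plugin version, bucket, and URL are placeholders).
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --secret-file ./minio-credentials \
  --use-volume-snapshots=false \
  --backup-location-config region=minio,s3ForcePathStyle=true,s3Url=https://minio.dr-cluster.example

# Confirm the backups synced from object storage are visible.
velero backup get

# Restore Kubernetes objects, excluding the database namespace when a
# data incident is suspected ("daily-full" and "postgres" are placeholders).
velero restore create dr-restore-$(date +%Y%m%d) \
  --from-backup daily-full \
  --exclude-namespaces postgres
```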
7. PostgreSQL Recovery – Latest State
7.1 When to Use
Use this if:
- Cluster was lost
- Data itself is known to be correct
7.2 Procedure
1. Create a new StackGres Postgres cluster
2. Restore from the latest successful backup
3. Wait for the restore to complete
4. Validate:
   - Database connections
   - Schema presence
   - Row counts / basic queries
5. Update application connection strings if needed
6. Resume write traffic
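This restore can be expressed as a StackGres `SGCluster` manifest with an `initialData.restore` section. The sketch below assumes StackGres 1.x CRDs; the cluster name, namespace, Postgres version, sizing, and backup name are placeholders to verify against your environment.

```shell
# Find the most recent completed SGBackup ("postgres" namespace is a placeholder).
kubectl get sgbackups -n postgres

# Create a new cluster initialized from that backup.
kubectl apply -f - <<'EOF'
apiVersion: stackgres.io/v1
kind: SGCluster
metadata:
  name: smartsapp-db-restored   # placeholder cluster name
  namespace: postgres
spec:
  instances: 2
  postgres:
    version: "16"
  pods:
    persistentVolume:
      size: 50Gi
  initialData:
    restore:
      fromBackup:
        name: daily-backup-20260208   # placeholder SGBackup name
EOF
```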
8. PostgreSQL Recovery – Point-in-Time Recovery (PITR)
8.1 When to Use
Use PITR if:
- Data was corrupted
- Accidental delete/update occurred
- A specific moment before failure is known
8.2 Procedure
1. Identify the target recovery timestamp
   - Prefer the last known-good application event
2. Create a new StackGres Postgres cluster
3. Enable PITR restore
   - Specify the recovery timestamp
4. Monitor the restore process
5. Validate restored data
   - Confirm corrupted data is absent
   - Spot-check business-critical tables
6. Switch applications to the restored database
7. Archive or delete the old database cluster
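A PITR restore is the same `SGCluster` manifest with a `pointInTimeRecovery` block under the restore source. Again a sketch assuming StackGres 1.x CRDs; the names, sizing, base backup, and timestamp are placeholders.

```shell
kubectl apply -f - <<'EOF'
apiVersion: stackgres.io/v1
kind: SGCluster
metadata:
  name: smartsapp-db-pitr   # placeholder cluster name
  namespace: postgres
spec:
  instances: 2
  postgres:
    version: "16"
  pods:
    persistentVolume:
      size: 50Gi
  initialData:
    restore:
      fromBackup:
        name: daily-backup-20260207   # placeholder base backup taken BEFORE the target time
        pointInTimeRecovery:
          restoreToTimestamp: "2026-02-08T09:15:00Z"   # placeholder last known-good moment (UTC)
EOF
```

The base backup must predate the recovery timestamp; WAL archives replay forward from the backup to the target time.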
9. Post-Recovery Checklist
Before declaring the incident resolved, all applicable sections below must be explicitly verified and checked off.
9.1 Platform & Infrastructure (Required)
- [ ] Kubernetes cluster healthy (all nodes `Ready`)
- [ ] Control plane stable (API server responsive, no crash loops)
- [ ] Core system namespaces healthy (`kube-system`, `monitoring`, `ingress`)
- [ ] Network policies enforced and functioning
9.2 MinIO & Backup Storage (Required)
- [ ] MinIO deployed as fresh installation (no reused PVCs)
- [ ] MinIO data restored from offsite source of truth
- [ ] Offsite bucket reachable and readable
- [ ] Object versioning confirmed enabled on offsite bucket
- [ ] Sample backup objects verified readable
9.3 PostgreSQL – StackGres (Required)
- [ ] Database restored using correct method (latest backup or PITR)
- [ ] PITR restore timestamp recorded (if applicable)
- [ ] Database accepting writes
- [ ] Application schemas present
- [ ] Business-critical tables spot-checked
- [ ] WAL archiving and scheduled backups resumed
9.4 Applications & Workloads (Required)
- [ ] All application pods running and stable
- [ ] No crash loops or prolonged `Pending` states
- [ ] Read paths validated
- [ ] Write paths validated
- [ ] Background workers / schedulers operational
9.5 Observability & Alerting (If Applicable)
- [ ] Metrics ingestion restored
- [ ] Logs flowing to logging backend
- [ ] Traces visible (if enabled)
- [ ] At least one test alert successfully fired
9.6 CI/CD & Platform Operations (Required)
- [ ] CI/CD pipelines re-enabled
- [ ] Automatic deployments resumed
- [ ] Velero backup schedule confirmed active
- [ ] Successful Velero backup observed post-recovery
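The last two items can be verified from the CLI; the schedule name and backup name below are placeholders.

```shell
# Confirm the backup schedule is active, then trigger and inspect a
# post-recovery backup ("daily-full" is a placeholder schedule name).
velero schedule get
velero backup create post-dr-verify --from-schedule daily-full
velero backup describe post-dr-verify
```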
9.7 Security & Access (If Applicable)
- [ ] Secrets restored correctly
- [ ] Database credentials rotated (if compromise suspected)
- [ ] Access logs reviewed for anomalies
9.8 Communication & Follow-Up (Required)
- [ ] Incident resolved notification sent
- [ ] Incident timeline documented
- [ ] Root cause analysis scheduled or completed
- [ ] Follow-up action items created and tracked
Do not close the incident until every required checklist section above is completed.
10. Testing & Drills
10.1 Required Practice
- Full restore test at least once per quarter
- PITR restore test at least once per quarter
10.2 Test Procedure
- Restore into a non-production namespace
- Validate data correctness
- Record:
  - Time to restore
  - Issues encountered
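One way to run such a drill with Velero is to remap the restored namespace so production objects are untouched; the backup and namespace names below are placeholders.

```shell
# Restore a production backup into a drill namespace without touching
# the original ("daily-full" and "smartsapp" are placeholders).
velero restore create drill-$(date +%Y%m%d) \
  --from-backup daily-full \
  --include-namespaces smartsapp \
  --namespace-mappings smartsapp:smartsapp-drill

# Record restore duration and status for the drill report.
velero restore describe drill-$(date +%Y%m%d) --details
```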
11. What This SOP Does NOT Cover
- Zero-downtime failover
- Cross-region active-active Postgres
- Real-time replication
These require separate HA architecture.
12. Final Rule (Non-Negotiable)
If a recovery path is unclear during an incident, STOP and escalate. Guessing during DR causes more damage than downtime.
End of document