
Disaster Recovery Standard Operating Procedure (SOP)

Platform: Smartsapp
Scope: Kubernetes clusters running PostgreSQL via StackGres, with Velero for cluster backup and restore
Audience: Platform / SRE / Infrastructure operators
Last updated: 2026-02-08


1. Purpose

This document defines the standard operating procedure (SOP) for responding to incidents and performing disaster recovery (DR) for:

  • Kubernetes clusters
  • PostgreSQL databases managed by StackGres

The goal is to ensure predictable, repeatable, and safe recovery from infrastructure or data failures with minimal ambiguity during incidents.


2. Recovery Philosophy (Read This First)

Core principles:

  1. Clusters are disposable
    • We do not repair broken clusters in place
    • We restore into a new or clean cluster when needed

  2. We restore data, not disks
    • PersistentVolumes are not migrated
    • PostgreSQL data is restored using StackGres backups and PITR

  3. Disaster Recovery ≠ High Availability
    • This SOP assumes minutes–hours RTO
    • Zero-downtime failover is out of scope

  4. Backups must live outside the cluster
    • Object storage (S3 / MinIO) is the source of truth


3. Tools & Responsibilities

3.1 Tooling

  • Velero: Backup & restore of Kubernetes objects and PVC metadata
  • StackGres: PostgreSQL backups, WAL archiving, PITR
  • Object storage (S3 / MinIO): Durable backup storage
  • Git (Helm / GitOps): Desired cluster and application state

3.2 Operator Responsibilities

Operators are responsible for:

  • Declaring incident severity
  • Choosing the correct recovery path
  • Executing restores safely
  • Communicating status updates

4. Incident Classification

4.1 Infrastructure Incidents

Examples:

  • Kubernetes cluster deleted or unreachable
  • Control plane failure
  • Multiple node loss
  • etcd corruption

4.2 Data Incidents

Examples:

  • Accidental DELETE / UPDATE
  • Bad migration
  • Application bug corrupting data
  • Partial data loss

4.3 Combined Incidents

Examples:

  • Cluster loss + data loss
  • Failed upgrade affecting DB and apps

5. Standard Response Flow (All Incidents)

  1. Stop the blast radius
    • Freeze deployments
    • Disable CI/CD auto-deploys
    • Pause write traffic if required (see the sketch after this list)

  2. Identify incident type
    • Infrastructure-only
    • Data-only
    • Combined

  3. Choose recovery path (Sections 6–8)

  4. Communicate
    • Internal status update
    • ETA and impact assessment
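
The exact commands for step 1 depend on how deployments are managed. The sketch below assumes Flux-driven GitOps and an illustrative smartsapp namespace and deployment name; adapt it to the actual setup.

    # Minimal sketch of "stop the blast radius"; all names below are illustrative.
    flux suspend kustomization apps -n flux-system                      # pause GitOps reconciliation (Flux only)
    kubectl scale deployment smartsapp-api --replicas=0 -n smartsapp    # pause write traffic if required
    kubectl get deployments -n smartsapp                                # confirm the replica counts took effect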

6. Infrastructure Recovery (Kubernetes Cluster)

6.1 When to Use

Use this procedure if:

  • The cluster is lost or unrecoverable
  • Control plane integrity is compromised

6.2 Important Assumption: MinIO

MinIO running inside the cluster is always assumed LOST during infrastructure disasters unless explicitly proven otherwise.

  • Do NOT attempt to recover or reattach old MinIO PersistentVolumes
  • Do NOT trust in-cluster MinIO as a backup source
  • The offsite object storage copy (AWS S3 / DigitalOcean Spaces) is the source of truth

6.3 Procedure

  1. Provision a new Kubernetes cluster
    • Same Kubernetes version (or compatible)
    • Networking and storage configured

  2. Deploy MinIO as a fresh, empty installation
    • Do not attach previous volumes
    • Configure credentials and buckets

  3. Restore MinIO data from offsite storage (see the sketch after this list)

  4. Validate presence of:
    • Velero backups
    • PostgreSQL base backups
    • PostgreSQL WAL archives

  5. Install Velero
    • Point Velero to the restored MinIO buckets

  6. Restore Kubernetes objects
    • Restore namespaces, CRDs, Deployments, Services
    • Exclude the PostgreSQL restore at this stage if a data incident is suspected

  7. Verify cluster health
    • Nodes Ready
    • Core services running

  8. Proceed to database recovery if needed
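
A command-level sketch of steps 3–7 follows. It assumes the MinIO mc client, the AWS object-store plugin for Velero, and illustrative alias, bucket, and namespace names (offsite, dr-minio, velero, sgbackups, postgres); it also assumes the commands run from a host that can reach the in-cluster MinIO endpoint (for example via port-forward).

    # Step 3: restore MinIO data from the offsite copy into the fresh in-cluster MinIO
    mc alias set offsite https://s3.amazonaws.com "$OFFSITE_KEY" "$OFFSITE_SECRET"
    mc alias set dr-minio http://minio.minio.svc.cluster.local:9000 "$MINIO_KEY" "$MINIO_SECRET"
    mc mirror offsite/velero dr-minio/velero
    mc mirror offsite/sgbackups dr-minio/sgbackups

    # Step 4: confirm backups are actually present
    mc ls dr-minio/velero/backups/
    mc ls dr-minio/sgbackups/

    # Step 5: install Velero pointed at the restored S3-compatible bucket
    velero install \
      --provider aws \
      --plugins velero/velero-plugin-for-aws \
      --bucket velero \
      --secret-file ./minio-credentials \
      --backup-location-config region=minio,s3ForcePathStyle=true,s3Url=http://minio.minio.svc.cluster.local:9000

    # Step 6: restore Kubernetes objects, excluding the database namespace if a data incident is suspected
    velero backup get
    velero restore create dr-restore --from-backup <latest-backup-name> --exclude-namespaces postgres
    velero restore describe dr-restore

    # Step 7: verify cluster health
    kubectl get nodes
    kubectl get pods -A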


7. PostgreSQL Recovery – Latest State

7.1 When to Use

Use this if:

  • Cluster was lost
  • Data itself is known to be correct

7.2 Procedure

  1. Create a new StackGres Postgres cluster

  2. Restore from latest successful backup (see the manifest sketch after this list)

  3. Wait for restore to complete

  4. Validate:
    • Database connections
    • Schema presence
    • Row counts / basic queries

  5. Update application connection strings if needed

  6. Resume write traffic
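
A minimal SGCluster manifest for steps 1–2 is sketched below. Field names follow the StackGres v1 CRD but may vary between StackGres versions; the cluster name, namespace, sizes, Postgres version, and backup name are placeholders. The latest successful backup name can be looked up with kubectl get sgbackups -n postgres.

    # Illustrative SGCluster that restores from the latest successful SGBackup.
    apiVersion: stackgres.io/v1
    kind: SGCluster
    metadata:
      name: smartsapp-db-restored
      namespace: postgres
    spec:
      instances: 2
      postgres:
        version: "15"
      pods:
        persistentVolume:
          size: 20Gi
      initialData:
        restore:
          fromBackup:
            name: smartsapp-db-backup-latest   # name of the latest successful SGBackup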


8. PostgreSQL Recovery – Point-in-Time Recovery (PITR)

8.1 When to Use

Use PITR if:

  • Data was corrupted
  • Accidental delete/update occurred
  • A specific moment before failure is known

8.2 Procedure

  1. Identify the target recovery timestamp
    • Prefer last known-good application event

  2. Create a new StackGres Postgres cluster

  3. Enable PITR restore
    • Specify recovery timestamp (see the manifest sketch after this list)

  4. Monitor restore process

  5. Validate restored data
    • Confirm corrupted data is absent
    • Spot-check business-critical tables

  6. Switch applications to the restored database

  7. Archive or delete the old database cluster
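
The PITR sketch below mirrors the latest-state restore and adds a recovery target timestamp. Field names follow the StackGres v1 CRD and may vary between versions; the names, Postgres version, and timestamp are placeholders.

    # Illustrative SGCluster restoring to a point in time just before the incident.
    apiVersion: stackgres.io/v1
    kind: SGCluster
    metadata:
      name: smartsapp-db-pitr
      namespace: postgres
    spec:
      instances: 2
      postgres:
        version: "15"
      pods:
        persistentVolume:
          size: 20Gi
      initialData:
        restore:
          fromBackup:
            name: smartsapp-db-backup-latest               # base backup taken before the incident
            pointInTimeRecovery:
              restoreToTimestamp: "2026-02-08T09:45:00Z"   # target recovery timestamp (UTC)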


9. Post-Recovery Checklist

Before declaring the incident resolved, all applicable sections below must be explicitly verified and checked off.


9.1 Platform & Infrastructure (Required)

  • [ ] Kubernetes cluster healthy (all nodes Ready)
  • [ ] Control plane stable (API server responsive, no crash loops)
  • [ ] Core system namespaces healthy (kube-system, monitoring, ingress)
  • [ ] Network policies enforced and functioning

9.2 MinIO & Backup Storage (Required)

  • [ ] MinIO deployed as fresh installation (no reused PVCs)
  • [ ] MinIO data restored from offsite source of truth
  • [ ] Offsite bucket reachable and readable
  • [ ] Object versioning confirmed enabled on offsite bucket
  • [ ] Sample backup objects verified readable
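
Hedged examples for the storage checks above; bucket names, aliases, and paths are illustrative, and the aws command may need --endpoint-url when the offsite provider is DigitalOcean Spaces.

    aws s3api get-bucket-versioning --bucket smartsapp-dr-offsite        # expect "Status": "Enabled"
    mc ls offsite/velero/backups/                                        # offsite bucket reachable and listable
    mc cp offsite/velero/backups/<backup-name>/velero-backup.json /tmp/  # sample backup object readable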

9.3 PostgreSQL – StackGres (Required)

  • [ ] Database restored using correct method (latest backup or PITR)
  • [ ] PITR restore timestamp recorded (if applicable)
  • [ ] Database accepting writes
  • [ ] Application schemas present
  • [ ] Business-critical tables spot-checked
  • [ ] WAL archiving and scheduled backups resumed
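
Possible spot checks for the database items above; the namespace, pod, and cluster names are illustrative, and the exec assumes the postgres-util container that StackGres adds to its pods.

    kubectl get sgbackups -n postgres          # scheduled backups producing new, completed entries
    kubectl exec -n postgres smartsapp-db-restored-0 -c postgres-util -- \
      psql -c "SELECT pg_is_in_recovery();"    # expect 'f' on the primary, i.e. writes accepted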

9.4 Applications & Workloads (Required)

  • [ ] All application pods running and stable
  • [ ] No crash loops or prolonged Pending states
  • [ ] Read paths validated
  • [ ] Write paths validated
  • [ ] Background workers / schedulers operational

9.5 Observability & Alerting (If Applicable)

  • [ ] Metrics ingestion restored
  • [ ] Logs flowing to logging backend
  • [ ] Traces visible (if enabled)
  • [ ] At least one test alert successfully fired

9.6 CI/CD & Platform Operations (Required)

  • [ ] CI/CD pipelines re-enabled
  • [ ] Automatic deployments resumed
  • [ ] Velero backup schedule confirmed active
  • [ ] Successful Velero backup observed post-recovery
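
One way to verify the last two items, assuming the Velero CLI is available; the backup name is illustrative.

    velero schedule get                          # schedule present and not paused
    velero backup create post-dr-verify --wait   # optional one-off backup to prove the pipeline end to end
    velero backup get                            # newest backup should show a Completed status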

9.7 Security & Access (If Applicable)

  • [ ] Secrets restored correctly
  • [ ] Database credentials rotated (if compromise suspected)
  • [ ] Access logs reviewed for anomalies

9.8 Communication & Follow-Up (Required)

  • [ ] Incident resolved notification sent
  • [ ] Incident timeline documented
  • [ ] Root cause analysis scheduled or completed
  • [ ] Follow-up action items created and tracked

Do not close the incident until every required checklist section above is completed.


10. Testing & Drills

10.1 Required Practice

  • Full restore test at least once per quarter
  • PITR restore test at least once per quarter

10.2 Test Procedure

  1. Restore into a non-production namespace (see the example after this list)
  2. Validate data correctness
  3. Record:
    • Time to restore
    • Issues encountered
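
A drill can reuse production backups without touching production workloads, for example by remapping the namespace on restore; the backup, namespace, and restore names below are illustrative.

    velero restore create drill-2026q1 \
      --from-backup <latest-backup-name> \
      --include-namespaces smartsapp \
      --namespace-mappings smartsapp:smartsapp-dr-test
    velero restore describe drill-2026q1 --details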

11. What This SOP Does NOT Cover

  • Zero-downtime failover
  • Cross-region active-active Postgres
  • Real-time replication

These scenarios require a separate high-availability (HA) architecture.


12. Final Rule (Non-Negotiable)

If a recovery path is unclear during an incident, STOP and escalate. Guessing during DR causes more damage than downtime.


End of document