# Postgres Restore Runbook

Procedures for restoring the self-hosted Hetzner Postgres database from Barman backups.

For the original install/setup of this DB, see ../../infrastructure/cloud_environment_setup/hetzner/postgres/setup.md.

Backup tool: Barman (EnterpriseDB). It replaced pgBackRest after the pgBackRest project was archived on 2026-04-27; see ../migration-to-hetzner.md §1.4 for the rationale.
## Scenarios at a glance
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Test restore (drill) | ~10 min | n/a | §1 — restore to a scratch directory, verify, discard |
| Roll back to a specific point in time | ~15 min | ~1 min (WAL streaming granularity) | §2 — PITR via --target-time |
| Full disaster recovery (DB VPS lost) | ~30 min | ~4 hours (off-host mirror cadence) | §3 — fresh VPS + restore from Storage Box mirror |
| Roll back to "before that bad migration" | ~15 min | ~1 min | §2 with --target-time set to just before deploy |
## 1. Test restore (drill — do this monthly)
The point: prove the backups are usable. A backup nobody has restored is not a backup.
```shell
# On the DB VPS — list available backups
sudo -u barman barman list-backup main
sudo -u barman barman show-backup main latest
```
Restore the latest backup into a scratch directory (does NOT touch the live DB):
```shell
sudo -u barman barman recover \
  --target-immediate \
  main latest \
  /var/lib/postgresql/restore-test
```

`--target-immediate` stops recovery at the end of the base backup (no WAL replay) — fastest drill. For a fuller drill that also exercises WAL replay, use `--target-time "$(date -u +%Y-%m-%dT%H:%M:%S)"`.
Spin up a temporary cluster on a different port:
```shell
sudo chown -R postgres:postgres /var/lib/postgresql/restore-test
sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl \
  -D /var/lib/postgresql/restore-test \
  -o "-p 5433" \
  -l /tmp/restore-test.log \
  start

sudo -u postgres psql -p 5433 -d system -c "SELECT count(*) FROM information_schema.tables;"

sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl -D /var/lib/postgresql/restore-test stop
sudo rm -rf /var/lib/postgresql/restore-test
```
If the row count looks sensible and the cluster started, the backup is good.
## 2. Point-in-time recovery (PITR)
Use this when a bad migration, accidental DROP, or app bug corrupted data and you need to roll back to just before the incident.
RPO is ~1 min thanks to continuous WAL streaming.
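That figure assumes the streaming client is actually attached. A quick health check before relying on it (a sketch; `main` is the Barman server name used throughout this runbook):

```shell
# Show WAL streaming clients attached to the primary (pg_receivewal should appear).
sudo -u barman barman replication-status main

# The overall check should also report the WAL archiving/streaming tasks as OK.
sudo -u barman barman check main
```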
### 2.1 Stop the app

In Dokploy, scale `system` to 0 replicas. (Frontends and Kafka can stay up.) This prevents the app from writing during the restore.
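Before wiping anything, confirm nothing is still connected (a sketch; `system` is the application database name used in this runbook):

```shell
# Count remaining client connections to the app database; should be 0
# once the Dokploy service is scaled down.
sudo -u postgres psql -Atc \
  "SELECT count(*) FROM pg_stat_activity WHERE datname = 'system' AND backend_type = 'client backend';"
```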
### 2.2 Stop Postgres and clear the data directory

```shell
sudo systemctl stop postgresql@16-main
sudo -u postgres rm -rf /var/lib/postgresql/16/main/*
```
### 2.3 Restore to the target time

```shell
# Format: YYYY-MM-DD HH:MM:SS+00 (UTC). Pick a timestamp BEFORE the bad event.
sudo -u barman barman recover \
  --target-time "2026-04-26 14:23:00+00" \
  --target-action promote \
  main latest \
  /var/lib/postgresql/16/main

sudo chown -R postgres:postgres /var/lib/postgresql/16/main
```
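The target-time string is easy to fumble under pressure. A small helper using GNU `date` (as shipped on the Debian/Ubuntu VPS this runbook assumes) builds a correctly formatted UTC target, here five minutes in the past:

```shell
# Build a Barman/Postgres-compatible recovery target in UTC.
# "5 minutes ago" is illustrative: substitute the moment just before the incident.
TARGET_TIME="$(date -u -d '5 minutes ago' '+%Y-%m-%d %H:%M:%S+00')"
echo "$TARGET_TIME"
```

The result can then be passed as `--target-time "$TARGET_TIME"`.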
### 2.4 Start Postgres and verify

```shell
sudo systemctl start postgresql@16-main
sudo -u postgres psql -d system -c "SELECT now(), max(created_at) FROM <some_table_with_timestamps>;"
```
The `max(created_at)` should be just before your target time. If correct, scale `system` back up in Dokploy.
### 2.5 Re-baseline backups

After PITR, take a fresh full backup so the next backup chain has a clean starting point:

```shell
sudo -u barman barman backup main
sudo -u barman barman check main
```
## 3. Full disaster recovery (DB VPS lost)
Use this when the DB VPS itself is unrecoverable (Hetzner outage, accidental destroy, disk corruption beyond Postgres).
### 3.1 Provision a replacement DB VPS

Re-run `terraform apply` from infrastructure/cloud_environment_setup/hetzner/terraform/ (the cloud-init/db-vps.yaml script reinstalls Postgres 16 + Barman automatically). If Terraform isn't available, follow the manual procedure in ../../infrastructure/cloud_environment_setup/hetzner/postgres/setup.md §1–4.
### 3.2 Pull the Barman home directory back from Storage Box

```shell
# On the new DB VPS, as root:
rsync -azP -e 'ssh -p 23' [email protected]:/home/barman-mirror/ /var/lib/barman/
chown -R barman:barman /var/lib/barman
```
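Before pointing Barman at the recovered home, sanity-check that the mirror came back intact (assumes Barman's default on-disk layout of `base/` and `wals/` under the server directory):

```shell
# Rough size check: should be in the same ballpark as the Storage Box mirror.
du -sh /var/lib/barman

# Base backups and archived WAL should both be present.
ls /var/lib/barman/main/base
ls /var/lib/barman/main/wals | tail
```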
### 3.3 Configure Barman pointing at the recovered home

Copy `/etc/barman.conf` and `/etc/barman.d/main.conf` from your config vault (or restore them from terraform/cloud-init/db-vps.yaml).

```shell
sudo -u barman barman list-backup main   # should list the existing backup history
sudo -u barman barman check main         # may show "PostgreSQL: FAILED" until the DB is restored — that's fine
```
### 3.4 Restore latest

```shell
sudo systemctl stop postgresql@16-main
sudo -u postgres rm -rf /var/lib/postgresql/16/main/*

sudo -u barman barman recover \
  --target-action promote \
  main latest \
  /var/lib/postgresql/16/main

sudo chown -R postgres:postgres /var/lib/postgresql/16/main
sudo systemctl start postgresql@16-main
```
Note: the recovered cluster will be at the timestamp of the last off-host rsync mirror — up to 4 hours old per the cron in setup.md §6.6. To reduce that window, run the rsync more frequently.
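For instance, an hourly mirror. This cron fragment is illustrative: the `/etc/cron.d` filename and `barman` run-as user are assumptions, and the host and paths reuse the placeholders from §3.2, so adjust everything to match the actual job in setup.md §6.6:

```shell
# /etc/cron.d/barman-mirror: push the Barman home to the Storage Box every hour.
0 * * * * barman rsync -az -e 'ssh -p 23' /var/lib/barman/ [email protected]:/home/barman-mirror/
```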
### 3.5 Reconnect the App VPS

If the new DB VPS has a different private IP, update Dokploy's `SPRING_DATASOURCE_URL` env var on the `system` service and redeploy.
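Before redeploying, confirm the app host can actually reach the replacement DB (a sketch; `<db-private-ip>` is a placeholder for the new VPS's private address, and `system`/`smartsapp` are the database and role used throughout this runbook):

```shell
# From the App VPS: check that the replacement Postgres accepts connections.
pg_isready -h <db-private-ip> -p 5432 -d system -U smartsapp
```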
### 3.6 Re-baseline + verify

```shell
sudo -u barman barman backup main
sudo -u barman barman cron    # restart WAL streaming
sudo -u barman barman check main
```
Run the Postman smoke suite against api-v3.smartsapp-staging.com to confirm end-to-end recovery.
## 4. Restore from DO managed Postgres (revert to standby)
During the 30-day standby window after Hetzner cutover, the DO managed Postgres still holds the cutover-snapshot data. To revert:
```shell
# From any machine with psql + the DO connection string saved at cutover:
pg_dump "postgresql://smartsapp:[email protected]:25060/system?sslmode=require" \
  > /tmp/do-snapshot.sql

# On the Hetzner DB VPS:
psql -U smartsapp -d system < /tmp/do-snapshot.sql
```
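Loading a plain-SQL dump into a database that already contains objects will produce a wall of "already exists" errors. One way to avoid that (destructive: be certain you are on the Hetzner DB VPS and really intend to discard its current contents) is to recreate the database first. A sketch assuming `smartsapp` owns the `system` database:

```shell
# On the Hetzner DB VPS: start from an empty database before loading the dump.
sudo -u postgres dropdb --if-exists system
sudo -u postgres createdb -O smartsapp system
psql -U smartsapp -d system < /tmp/do-snapshot.sql
```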
After 30 days the DO Postgres is paused/downsized — see ../migration-to-hetzner.md §1.8.
## 5. Common pitfalls
- **Wrong target timestamp format.** Use ISO `YYYY-MM-DD HH:MM:SS+TZ`; `+00` is UTC.
- **Forgot to clear `pg_data` before restore.** Barman's `recover` will fail if the target directory is non-empty; pre-clear it as in §2.2.
- **PITR stops at the most recent WAL.** If you target a time later than the latest archived WAL, Barman recovers up to the end of WAL and stops. Check `barman list-backup main` and `barman show-backup main latest` for the WAL range.
- **Restored cluster won't start with "missing WAL segment" errors.** A WAL gap, usually because `barman cron` wasn't running. Re-run `barman recover` with an earlier `--target-time` (before the gap), then take a fresh full backup.
- **App can't connect after restore.** The Hikari pool caches connections — restart the `system` container in Dokploy to force a clean reconnect.
- **Replication slot grows unbounded.** If `barman cron` stops for an extended period, the `barman` replication slot retains WAL on the primary, eventually filling the disk. Monitor with `SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;`.