# Postgres Restore Runbook

Procedures for restoring the self-hosted Hetzner Postgres database from Barman backups.

For the original install/setup of this DB, see ../../infrastructure/cloud_environment_setup/hetzner/postgres/setup.md.

Backup tool: Barman (EnterpriseDB). It replaced pgBackRest after the pgBackRest project was archived on 2026-04-27; see ../migration-to-hetzner.md §1.4 for the rationale.
## Scenarios at a glance
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Test restore (drill) | ~10 min | n/a | §1 — restore to a scratch directory, verify, discard |
| Roll back to a specific point in time | ~15 min | ~1 min (WAL streaming granularity) | §2 — PITR via --target-time |
| Full disaster recovery (DB VPS lost) | ~30 min | ~4 hours (off-host mirror cadence) | §3 — fresh VPS + restore from Storage Box mirror |
| Roll back to "before that bad migration" | ~15 min | ~1 min | §2 with --target-time set to just before deploy |
## 1. Test restore (drill — do this monthly)
The point: prove the backups are usable. A backup nobody has restored is not a backup.
```shell
# On the DB VPS — list available backups
sudo -u barman barman list-backup main
sudo -u barman barman show-backup main latest
```
Restore the latest backup into a scratch directory (does NOT touch the live DB):
```shell
sudo -u barman barman recover \
  --target-immediate \
  main latest \
  /var/lib/postgresql/restore-test
```

`--target-immediate` stops recovery at the end of the base backup (no WAL replay) — fastest drill. For a fuller drill that also exercises WAL replay, use `--target-time "$(date -u +%Y-%m-%dT%H:%M:%S)"`.
Spin up a temporary cluster on a different port:
```shell
sudo chown -R postgres:postgres /var/lib/postgresql/restore-test
sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl \
  -D /var/lib/postgresql/restore-test \
  -o "-p 5433" \
  -l /tmp/restore-test.log \
  start

sudo -u postgres psql -p 5433 -d system -c "SELECT count(*) FROM information_schema.tables;"

sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl -D /var/lib/postgresql/restore-test stop
sudo rm -rf /var/lib/postgresql/restore-test
```
If the row count looks sensible and the cluster started, the backup is good.
## 2. Point-in-time recovery (PITR)
Use this when a bad migration, accidental DROP, or app bug corrupted data and you need to roll back to just before the incident.
RPO is ~1 min thanks to continuous WAL streaming.
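That figure assumes the streaming client is actually attached. A quick health check before relying on it (a sketch; `main` is the Barman server name used throughout this runbook):

```shell
# Show WAL streaming clients attached to the primary (pg_receivewal should appear).
sudo -u barman barman replication-status main

# The overall check should also report the WAL archiving/streaming tasks as OK.
sudo -u barman barman check main
```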
### 2.1 Stop the app

In Dokploy, scale `system` to 0 replicas. (Frontends and Kafka can stay up.) This prevents the app from writing during the restore.
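Before wiping anything, confirm nothing is still connected (a sketch; `system` is the application database name used in this runbook):

```shell
# Count remaining client connections to the app database; should be 0
# once the Dokploy service is scaled down.
sudo -u postgres psql -Atc \
  "SELECT count(*) FROM pg_stat_activity WHERE datname = 'system' AND backend_type = 'client backend';"
```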
### 2.2 Stop Postgres and clear the data directory

```shell
sudo systemctl stop postgresql@16-main
sudo -u postgres rm -rf /var/lib/postgresql/16/main/*
```
### 2.3 Restore to the target time

```shell
# Format: YYYY-MM-DD HH:MM:SS+00 (UTC). Pick a timestamp BEFORE the bad event.
sudo -u barman barman recover \
  --target-time "2026-04-26 14:23:00+00" \
  --target-action promote \
  main latest \
  /var/lib/postgresql/16/main

sudo chown -R postgres:postgres /var/lib/postgresql/16/main
```
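The target-time string is easy to fumble under pressure. A small helper using GNU `date` (as shipped on the Debian/Ubuntu VPS this runbook assumes) builds a correctly formatted UTC target, here five minutes in the past:

```shell
# Build a Barman/Postgres-compatible recovery target in UTC.
# "5 minutes ago" is illustrative: substitute the moment just before the incident.
TARGET_TIME="$(date -u -d '5 minutes ago' '+%Y-%m-%d %H:%M:%S+00')"
echo "$TARGET_TIME"
```

The result can then be passed as `--target-time "$TARGET_TIME"`.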
### 2.4 Start Postgres and verify

```shell
sudo systemctl start postgresql@16-main
sudo -u postgres psql -d system -c "SELECT now(), max(created_at) FROM <some_table_with_timestamps>;"
```
The `max(created_at)` should be just before your target time. If correct, scale `system` back up in Dokploy.
### 2.5 Re-baseline backups

After PITR, take a fresh full backup so the next backup chain has a clean starting point:

```shell
sudo -u barman barman backup main
sudo -u barman barman check main
```
## 3. Full disaster recovery (DB VPS lost)
Use this when the DB VPS itself is unrecoverable (Hetzner outage, accidental destroy, disk corruption beyond Postgres).
### 3.1 Provision a replacement DB VPS

Re-run `terraform apply` from infrastructure/cloud_environment_setup/hetzner/terraform/ (the cloud-init/db-vps.yaml script reinstalls Postgres 16 + Barman automatically). If Terraform isn't available, follow the manual procedure in ../../infrastructure/cloud_environment_setup/hetzner/postgres/setup.md §1–4.
### 3.2 Pull the Barman home directory back from Storage Box

```shell
# On the new DB VPS, as root:
rsync -azP -e 'ssh -p 23' [email protected]:/home/barman-mirror/ /var/lib/barman/
chown -R barman:barman /var/lib/barman
```
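Before pointing Barman at the recovered home, sanity-check that the mirror came back intact (assumes Barman's default on-disk layout of `base/` and `wals/` under the server directory):

```shell
# Rough size check: should be in the same ballpark as the Storage Box mirror.
du -sh /var/lib/barman

# Base backups and archived WAL should both be present.
ls /var/lib/barman/main/base
ls /var/lib/barman/main/wals | tail
```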
### 3.3 Configure Barman pointing at the recovered home

Copy `/etc/barman.conf` and `/etc/barman.d/main.conf` from your config vault (or restore them from terraform/cloud-init/db-vps.yaml).

```shell
sudo -u barman barman list-backup main   # should list the existing backup history
sudo -u barman barman check main         # may show "PostgreSQL: FAILED" until the DB is restored — that's fine
```
### 3.4 Restore latest

```shell
sudo systemctl stop postgresql@16-main
sudo -u postgres rm -rf /var/lib/postgresql/16/main/*

sudo -u barman barman recover \
  --target-action promote \
  main latest \
  /var/lib/postgresql/16/main

sudo chown -R postgres:postgres /var/lib/postgresql/16/main
sudo systemctl start postgresql@16-main
```
Note: the recovered cluster will be at the timestamp of the last off-host rsync mirror — up to 4 hours old per the cron in setup.md §6.6. To reduce that window, run the rsync more frequently.
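For instance, an hourly mirror. This cron fragment is illustrative: the `/etc/cron.d` filename and `barman` run-as user are assumptions, and the host and paths reuse the placeholders from §3.2, so adjust everything to match the actual job in setup.md §6.6:

```shell
# /etc/cron.d/barman-mirror: push the Barman home to the Storage Box every hour.
0 * * * * barman rsync -az -e 'ssh -p 23' /var/lib/barman/ [email protected]:/home/barman-mirror/
```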
### 3.5 Reconnect the App VPS

If the new DB VPS has a different private IP, update Dokploy's `SPRING_DATASOURCE_URL` env var on the `system` service and redeploy.
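Before redeploying, confirm the app host can actually reach the replacement DB (a sketch; `<db-private-ip>` is a placeholder for the new VPS's private address, and `system`/`smartsapp` are the database and role used throughout this runbook):

```shell
# From the App VPS: check that the replacement Postgres accepts connections.
pg_isready -h <db-private-ip> -p 5432 -d system -U smartsapp
```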
### 3.6 Re-baseline + verify

```shell
sudo -u barman barman backup main
sudo -u barman barman cron    # restart WAL streaming
sudo -u barman barman check main
```
Run the Postman smoke suite against api-v3.smartsapp-staging.com to confirm end-to-end recovery.
## 4. Restore from DO managed Postgres (revert to standby)
During the 30-day standby window after Hetzner cutover, the DO managed Postgres still holds the cutover-snapshot data. To revert:
```shell
# From any machine with psql + the DO connection string saved at cutover:
pg_dump "postgresql://smartsapp:[email protected]:25060/system?sslmode=require" \
  > /tmp/do-snapshot.sql

# On the Hetzner DB VPS:
psql -U smartsapp -d system < /tmp/do-snapshot.sql
```
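Loading a plain-SQL dump into a database that already contains objects will produce a wall of "already exists" errors. One way to avoid that (destructive: be certain you are on the Hetzner DB VPS and really intend to discard its current contents) is to recreate the database first. A sketch assuming `smartsapp` owns the `system` database:

```shell
# On the Hetzner DB VPS: start from an empty database before loading the dump.
sudo -u postgres dropdb --if-exists system
sudo -u postgres createdb -O smartsapp system
psql -U smartsapp -d system < /tmp/do-snapshot.sql
```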
After 30 days the DO Postgres is paused/downsized — see ../migration-to-hetzner.md §1.8.
## 5. Common pitfalls
- **Wrong target timestamp format.** Use ISO `YYYY-MM-DD HH:MM:SS+TZ`; `+00` is UTC.
- **Forgot to clear `pg_data` before restore.** Barman's `recover` will fail if the target directory is non-empty; pre-clear it as in §2.2.
- **PITR stops at the most recent WAL.** If you target a time later than the latest archived WAL, Barman recovers up to the end of WAL and stops. Check `barman list-backup main` and `barman show-backup main latest` for the WAL range.
- **Restored cluster won't start with "missing WAL segment" errors.** A WAL gap, usually because `barman cron` wasn't running. Re-run `barman recover` with an earlier `--target-time` (before the gap), then take a fresh full backup.
- **App can't connect after restore.** The Hikari pool caches connections — restart the `system` container in Dokploy to force a clean reconnect.
- **Replication slot grows unbounded.** If `barman cron` stops for an extended period, the `barman` replication slot retains WAL on the primary, eventually filling the disk. Monitor with `SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;`.