
Postgres Restore Runbook

Procedures for restoring the self-hosted Hetzner Postgres database from Barman backups.

For the original install/setup of this DB, see ../../infrastructure/cloud_environment_setup/hetzner/postgres/setup.md.

Backup tool: Barman (EnterpriseDB). It replaced pgBackRest after that project was archived on 2026-04-27; see ../migration-to-hetzner.md §1.4 for the rationale.


Scenarios at a glance

  • Test restore (drill): RTO ~10 min, RPO n/a. Procedure: §1 (restore to a scratch directory, verify, discard).
  • Roll back to a specific point in time: RTO ~15 min, RPO ~1 min (WAL streaming granularity). Procedure: §2 (PITR via --target-time).
  • Full disaster recovery (DB VPS lost): RTO ~30 min, RPO ~4 hours (off-host mirror cadence). Procedure: §3 (fresh VPS + restore from Storage Box mirror).
  • Roll back to "before that bad migration": RTO ~15 min, RPO ~1 min. Procedure: §2 with --target-time set to just before the deploy.

1. Test restore (drill — do this monthly)

The point: prove the backups are usable. A backup nobody has restored is not a backup.

# On the DB VPS — list available backups
sudo -u barman barman list-backup main
sudo -u barman barman show-backup main latest

Restore the latest backup into a scratch directory (does NOT touch the live DB):

sudo -u barman barman recover \
  --target-immediate \
  main latest \
  /var/lib/postgresql/restore-test

--target-immediate stops recovery at the end of the base backup (no WAL replay) — fastest drill. For a fuller drill that also exercises WAL replay, use --target-time "$(date -u +%Y-%m-%dT%H:%M:%S)".

Spin up a temporary cluster on a different port:

sudo chown -R postgres:postgres /var/lib/postgresql/restore-test
sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl \
  -D /var/lib/postgresql/restore-test \
  -o "-p 5433" \
  -l /tmp/restore-test.log \
  start
sudo -u postgres psql -p 5433 -d system -c "SELECT count(*) FROM information_schema.tables;"
sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl -D /var/lib/postgresql/restore-test stop
sudo rm -rf /var/lib/postgresql/restore-test

If the row count looks sensible and the cluster started, the backup is good.
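To keep the monthly cadence honest, the steps above can be wrapped in one script and run from cron. This is a sketch under the same assumptions as the manual steps (server name main, database system, Postgres 16 paths); it is gated behind RUN_DRILL=1 so that merely sourcing it does nothing.

```shell
#!/usr/bin/env bash
# Monthly drill wrapper (sketch): mirrors the manual steps in this section.
set -euo pipefail

RESTORE_DIR=/var/lib/postgresql/restore-test
PG_BIN=/usr/lib/postgresql/16/bin

run_drill() {
  sudo -u barman barman recover --target-immediate main latest "$RESTORE_DIR"
  sudo chown -R postgres:postgres "$RESTORE_DIR"
  sudo -u postgres "$PG_BIN/pg_ctl" -D "$RESTORE_DIR" -o "-p 5433" \
    -l /tmp/restore-test.log start
  # Fail the drill if the restored cluster cannot answer a trivial query.
  sudo -u postgres psql -p 5433 -d system -Atc \
    "SELECT count(*) FROM information_schema.tables;" | grep -Eq '^[0-9]+$'
  sudo -u postgres "$PG_BIN/pg_ctl" -D "$RESTORE_DIR" stop
  sudo rm -rf "$RESTORE_DIR"
}

# Only execute when explicitly requested, e.g. RUN_DRILL=1 from a cron entry.
if [ "${RUN_DRILL:-0}" = "1" ]; then
  run_drill && echo "drill OK"
fi
```

Wire it into cron with RUN_DRILL=1 and alert on a non-zero exit so a silently rotting backup chain gets noticed.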


2. Point-in-time recovery (PITR)

Use this when a bad migration, accidental DROP, or app bug corrupted data and you need to roll back to just before the incident.

RPO is ~1 min thanks to continuous WAL streaming.
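Before trusting that RPO, confirm the streaming is actually healthy. A quick pre-flight sketch (assumes the Barman server is named main, as elsewhere in this runbook; the guard makes it a no-op on hosts without Barman):

```shell
BARMAN_SERVER=main
if command -v barman >/dev/null 2>&1; then
  # The streaming client should show as connected, with minimal lag.
  sudo -u barman barman replication-status "$BARMAN_SERVER"
  # All checks (WAL archiving, streaming, backups) should report OK.
  sudo -u barman barman check "$BARMAN_SERVER"
fi
```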

2.1 Stop the app

In Dokploy, scale system to 0 replicas. (Frontends and Kafka can stay up.) This prevents the app from writing during the restore.
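Optionally, verify nothing is still connected before touching the cluster. A sketch (assumes the application database is system; the guard skips it on machines without a postgres account):

```shell
APP_DB=system
# Guard: only run where psql and a postgres superuser account actually exist.
if command -v psql >/dev/null 2>&1 && id postgres >/dev/null 2>&1; then
  # Expect 0; investigate any lingering sessions before proceeding.
  sudo -u postgres psql -d "$APP_DB" -Atc \
    "SELECT count(*) FROM pg_stat_activity
     WHERE datname = '$APP_DB' AND pid <> pg_backend_pid();"
fi
```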

2.2 Stop Postgres and clear the data directory

sudo systemctl stop postgresql@16-main
sudo -u postgres rm -rf /var/lib/postgresql/16/main/*

2.3 Restore to the target time

# Format: YYYY-MM-DD HH:MM:SS+00 (UTC). Pick a timestamp BEFORE the bad event.
sudo -u barman barman recover \
  --target-time "2026-04-26 14:23:00+00" \
  --target-action promote \
  main latest \
  /var/lib/postgresql/16/main

sudo chown -R postgres:postgres /var/lib/postgresql/16/main
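Getting the timestamp format right is the most common stumble (see §5). A small helper sketch, assuming GNU date, that emits a target time N minutes in the past in the exact format used above:

```shell
# Print a UTC timestamp N minutes ago, formatted as YYYY-MM-DD HH:MM:SS+00.
minutes_before_now() {
  date -u -d "${1} minutes ago" '+%Y-%m-%d %H:%M:%S+00'
}

minutes_before_now 5   # usable directly as the --target-time argument
```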

2.4 Start Postgres and verify

sudo systemctl start postgresql@16-main
sudo -u postgres psql -d system -c "SELECT now(), max(created_at) FROM <some_table_with_timestamps>;"

The max(created_at) should be just before your target time. If correct, scale system back up in Dokploy.

2.5 Re-baseline backups

After PITR, take a fresh full backup so the next backup chain has a clean starting point:

sudo -u barman barman backup main
sudo -u barman barman check main

3. Full disaster recovery (DB VPS lost)

Use this when the DB VPS itself is unrecoverable (Hetzner outage, accidental destroy, disk corruption beyond Postgres).

3.1 Provision a replacement DB VPS

Re-run terraform apply from infrastructure/cloud_environment_setup/hetzner/terraform/ (the cloud-init/db-vps.yaml script reinstalls Postgres 16 + Barman automatically). Manual procedure if Terraform isn't available: see ../../infrastructure/cloud_environment_setup/hetzner/postgres/setup.md §1–4.
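The Terraform step, sketched (assumes Terraform >= 0.14 for -chdir; always review the plan so that only the DB VPS is recreated):

```shell
TF_DIR=infrastructure/cloud_environment_setup/hetzner/terraform
if command -v terraform >/dev/null 2>&1; then
  terraform -chdir="$TF_DIR" init
  terraform -chdir="$TF_DIR" plan    # review: only the DB VPS should change
  terraform -chdir="$TF_DIR" apply
fi
```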

3.2 Pull the Barman home directory back from Storage Box

# On the new DB VPS, as root:
rsync -azP -e 'ssh -p 23' [email protected]:/home/barman-mirror/ /var/lib/barman/
chown -R barman:barman /var/lib/barman

3.3 Configure Barman pointing at the recovered home

Copy /etc/barman.conf and /etc/barman.d/main.conf from your config vault (or restore them from terraform/cloud-init/db-vps.yaml).
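For orientation, this is the rough shape of /etc/barman.d/main.conf for a streaming setup like this one. Every value below is a placeholder; the vault copy remains the source of truth.

```ini
; /etc/barman.d/main.conf (illustrative sketch; restore the real file from the vault)
[main]
description = "Self-hosted Postgres 16 on the Hetzner DB VPS"
conninfo = host=localhost user=barman dbname=postgres
streaming_conninfo = host=localhost user=streaming_barman
backup_method = postgres
streaming_archiver = on
slot_name = barman
```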

sudo -u barman barman list-backup main   # should list the existing backup history
sudo -u barman barman check main         # may show "PostgreSQL: FAILED" until the DB is restored — that's fine

3.4 Restore latest

sudo systemctl stop postgresql@16-main
sudo -u postgres rm -rf /var/lib/postgresql/16/main/*
sudo -u barman barman recover \
  --target-action promote \
  main latest \
  /var/lib/postgresql/16/main
sudo chown -R postgres:postgres /var/lib/postgresql/16/main
sudo systemctl start postgresql@16-main

Note: the recovered cluster will be at the timestamp of the last off-host rsync mirror, i.e. up to 4 hours old per the cron in setup.md §6.6. To reduce that window, run the rsync more frequently.
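For instance, tightening the mirror to hourly would shrink the DR RPO to ~1 hour. An illustrative crontab entry (the authoritative entry and exact rsync flags live in setup.md §6.6; STORAGE_BOX stands in for the Storage Box address used in §3.2):

```
# /etc/cron.d/barman-mirror: hourly off-host mirror (illustrative)
0 * * * * root rsync -az -e 'ssh -p 23' /var/lib/barman/ STORAGE_BOX:/home/barman-mirror/
```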

3.5 Reconnect the App VPS

If the new DB VPS has a different private IP, update Dokploy's SPRING_DATASOURCE_URL env var on the system service and redeploy.
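The variable is a standard JDBC URL, so only the host portion should need to change. An illustrative value (the IP is a placeholder; port and database name match this runbook):

```
SPRING_DATASOURCE_URL=jdbc:postgresql://<new-db-private-ip>:5432/system
```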

3.6 Re-baseline + verify

sudo -u barman barman backup main
sudo -u barman barman cron       # restart WAL streaming
sudo -u barman barman check main

Run the Postman smoke suite against api-v3.smartsapp-staging.com to confirm end-to-end recovery.


4. Restore from DO managed Postgres (revert to standby)

During the 30-day standby window after Hetzner cutover, the DO managed Postgres still holds the cutover-snapshot data. To revert:

# From any machine with psql + the DO connection string saved at cutover:
pg_dump "postgresql://smartsapp:[email protected]:25060/system?sslmode=require" \
  > /tmp/do-snapshot.sql

# On the Hetzner DB VPS:
psql -U smartsapp -d system < /tmp/do-snapshot.sql

After 30 days the DO Postgres is paused/downsized — see ../migration-to-hetzner.md §1.8.


5. Common pitfalls

  • Wrong target timestamp format. Use ISO YYYY-MM-DD HH:MM:SS+TZ. +00 is UTC.
  • Forgot to clear the data directory (PGDATA) before restore. Restoring over a non-empty data directory leaves stale files from the previous cluster behind and can produce a corrupt mix of old and new files; always pre-clear it (as in §2.2 and §3.4) before running barman recover.
  • PITR stops at the most recent WAL. If you target a time later than the latest archived WAL, Barman recovers up to the end of WAL and stops. Check barman list-backup main and barman show-backup main latest for the WAL range.
  • Restored cluster won't start with "missing WAL segment" errors. A WAL gap, usually because barman cron wasn't running. Re-run barman recover with an earlier --target-time (before the gap), then take a fresh full backup.
  • App can't connect after restore. The Hikari pool caches connections — restart the system container in Dokploy to force a clean reconnect.
  • Replication slot grows unbounded. If barman cron stops for an extended period, the barman replication slot retains WAL on the primary, eventually filling the disk. Monitor with:

    SELECT slot_name, active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;