Migrate from DigitalOcean to Hetzner — Hybrid Dokploy → k3s

Context

DigitalOcean costs ~$63/mo (DOKS 2× s-2vcpu-4gb + managed Postgres db-s-1vcpu-1gb + container registry, optional $12 LB). The workload is single-replica everything, no HPA, dev-only — a single-VPS workload pretending to be K8s. docs/deployment.md already states "Production will run on Hetzner," so this catches the codebase up to its stated direction.

User decision: hybrid path — stand up Dokploy on Hetzner now to cut costs immediately, with a documented escape hatch to k3s/kube-hetzner when the roadmap (Temporal, Keycloak, Rocket.Chat, Prometheus/Loki/Tempo/Grafana, Velero) demands real K8s. Postgres self-hosted on Hetzner (separate VPS + Storage Box backups).

Target monthly cost: ~€23/mo Hetzner (CPX32 app box €15.32 + CX23 Postgres €4.51 + Storage Box €3.20) plus ~$5–10/mo DO standby (DOKS node pool scaled to 0, managed Postgres kept hot for 30 days, registry kept indefinitely), versus ~$63/mo today. Net savings ~60% with full revertibility; ~75% if DO is later torn down.

Reversibility: the DO setup is not deleted. CI/CD's old DO deploy steps live in git history; node pool can be scaled back up with one CLI call; DNS flip restores traffic. Reversion is a ~30-minute operation, not a rebuild.


Phase 0 — Persist this plan in the repo

Done by virtue of this file existing. The plan now lives at docs/migration-to-hetzner.md alongside deployment.md and infrastructure-overview.md, in git history, and Claude-locally at ~/.claude/plans/digital-ocean-costs-are-lively-fern.md.

Phase 1 — Dokploy on Hetzner (immediate)

1.1 Provision Hetzner

  • App VPS: CPX32 (4 vCPU AMD, 8GB RAM, 160GB NVMe) — runs Dokploy + backend + Redpanda + Redis + frontends + mkdocs + coverage. Falkenstein (FSN1) for proximity to current FRA1.
  • DB VPS: CX23 (2 vCPU, 4GB RAM, 40GB) — Postgres 16 only.
  • Storage Box: BX11 (1TB) — Barman WAL archive mirror + nightly Redpanda data snapshots. (Replaces pgBackRest, archived 2026-04-27 — see §1.4.)
  • Private network: create one Hetzner private network, attach both VPSs. DB binds Postgres only on the private interface; Redpanda + Redis stay container-internal on the app VPS.
  • Firewall: public 22/80/443 on app VPS (SSH key only); DB VPS exposes 5432 only on private network. Redpanda 9092 and Redis 6379 are never exposed publicly.

Provisioning: Terraform. Symmetry with infrastructure/cloud_environment_setup/digitalocean/terraform/ and reproducibility outweigh the small upfront cost. See infrastructure/cloud_environment_setup/hetzner/terraform/README.md for the apply procedure. Provider: hetznercloud/hcloud. State: local (terraform.tfstate, gitignored) for Phase 1; revisit when Phase 2's k3s state lands.

Storage Box exception: Hetzner Storage Box is a Robot product (separate API), not in hetznercloud/hcloud. Order it manually in the Console; reference its hostname/SSH credentials in postgres/setup.md §6. Everything else (servers, network, subnet, firewalls, primary IPs, SSH key) is Terraform-managed.
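A minimal sketch of the apply procedure, assuming a hcloud_token variable and the directory layout from §1.10 — the authoritative variable names live in variables.tf and the README:

```bash
# Sketch only — the variable name (hcloud_token) is an assumption; check variables.tf.
cd infrastructure/cloud_environment_setup/hetzner/terraform
export TF_VAR_hcloud_token="<hetzner-api-token>"   # read/write Cloud API token from the Hetzner Console
terraform init
terraform plan -out=hetzner.plan
terraform apply hetzner.plan
terraform output    # public/private IPs and ready-to-paste ssh commands (see outputs.tf)
```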

1.1a Where each stateful service lives

| Service | Host | Storage | Backup | Rationale |
| --- | --- | --- | --- | --- |
| Postgres 16 | DB VPS (CX23) | Local NVMe /var/lib/postgresql/16 | Barman → Storage Box (daily full + hourly incr + WAL streaming, rsync-over-SSH) | Isolated for blast-radius reasons; Postgres is the one thing that can't be rebuilt from scratch |
| Redpanda | App VPS (CPX32), Docker volume | Named volume redpanda-data on local NVMe | Nightly tar snapshot of volume → Storage Box via cron | Single-broker dev workload, not HA today (matches current DOKS shape: 1 replica, 2Gi PVC). Co-locating with backend cuts network hops and matches existing Spring Kafka tuning |
| Redis | App VPS (CPX32), Docker volume | Named volume redis-data (or fully ephemeral) | None — matches current behavior (DOKS Redis is ephemeral) | Used as cache/session, no durability requirement today |
| Spring backend | App VPS (CPX32) | Stateless | N/A | — |
| Frontends, mkdocs, coverage | App VPS (CPX32) | Stateless | N/A | — |

Why Redpanda + Redis on the same VPS as the backend: matches current K8s topology (single node anyway), reduces network round-trips, fits comfortably in 8GB RAM (backend ~1.5GB + Redpanda ~1GB + Redis ~128MB + frontends ~256MB + Dokploy/Traefik ~500MB ≈ 3.5GB used, 4.5GB headroom).

When this stops being OK: if Redpanda becomes a durability-critical component (real producers/consumers in prod with retention guarantees), move it to its own VPS or trigger Phase 2 (k3s) and run it under an operator. Today it's used as an event bus for a single-replica backend — co-location is correct.

1.2 Install Dokploy on app VPS

curl -sSL https://dokploy.com/install.sh | sh

Configure:

  • Traefik built-in for TLS (Let's Encrypt) and routing.
  • Domains: point the existing DO domains' DNS A records to the Hetzner app IP.
  • Add DO Container Registry pull credentials (see §1.5).
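A quick reachability check before configuring anything, assuming Dokploy's default UI port of 3000 (confirm against the installer's output):

```bash
# Assumes the Dokploy UI listens on port 3000 after install — verify in the installer output.
curl -fsS -o /dev/null -w "%{http_code}\n" "http://<app-vps-ip>:3000"
# Expect a 200 (or a 30x redirect to the setup page) before opening the UI in a browser.
```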

1.3 Convert K8s manifests → docker-compose

Source: infrastructure/cloud_environment_setup/digitalocean/k8s/

Create new directory: infrastructure/cloud_environment_setup/hetzner/dokploy/

One docker-compose.yml per Dokploy "project," or a single compose file with:

  • system (backend Spring Boot) — port 8080, Traefik label for api.<domain>
  • admin-portal — Traefik label for admin.<domain>
  • parent-portal — Traefik label for parent.<domain>
  • mkdocs — Traefik label for docs.<domain>
  • coverage — Traefik label for coverage.<domain>
  • redpanda — single-node, persistent volume /var/lib/redpanda/data, expose 9092 internally only
  • redis — alpine, internal only

Drop K8s probes; use Docker healthcheck: blocks. Keep Spring's application.yml Kafka tuning (rebalance delay, CooperativeStickyAssignor) — those still apply.
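For the backend container, the healthcheck command can mirror what the K8s readiness probe did. A sketch, assuming the Spring Boot Actuator health endpoint at /actuator/health on port 8080 — use whatever path the current probe actually hits:

```bash
# Candidate healthcheck test command for the system (backend) service.
# Assumes Actuator's /actuator/health on 8080 — mirror the existing K8s readiness probe path.
curl -fsS http://localhost:8080/actuator/health || exit 1
```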

1.4 Self-host Postgres on DB VPS

  • Install Postgres 16 from the PGDG repo (matching the DO managed version exactly, so pg_dump/pg_restore never has to cross major versions).
  • Configure pg_hba.conf to allow only the app VPS private IP.
  • Barman for WAL-based PITR backups to Storage Box (rsync over SSH).
  • Daily full + continuous WAL streaming + 4-hourly off-host rsync mirror to Storage Box.
  • Restore procedure: docs/operations/postgres-restore.md.
  • Tune postgresql.conf for 4GB RAM box (shared_buffers=1GB, effective_cache_size=3GB, work_mem=16MB).

Why Barman, not pgBackRest: pgBackRest was archived on 2026-04-27 after Crunchy Data was sold and the maintainer couldn't fund continued work (final release v2.58.0, 2026-01-19, still works but receives no future fixes). Barman is the conservative replacement: actively maintained by EnterpriseDB (last commit ~2 weeks ago), nearly identical conceptual model (server/stanza, full + incr + WAL), and uses native rsync-over-SSH which fits Hetzner Storage Box without changing storage backends. WAL-G was considered but its primary backend is S3, which would have forced a simultaneous swap to Hetzner Object Storage — two variables changing at once. WAL-G + Object Storage is a better Phase 2 target alongside Velero.
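Once Barman is configured on the DB VPS, a day-one sanity cycle looks like the sketch below, assuming the Barman server is named main (use whatever name postgres/setup.md settles on):

```bash
# Assumes a Barman server definition named "main" — match the name chosen in postgres/setup.md.
barman check main                   # connectivity, WAL archiving, and retention all green
barman switch-wal --archive main    # force a WAL switch and wait for it to reach the archive
barman backup main                  # first full backup
barman list-backup main             # confirm it appears with a size and an end time
```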

1.5 Container registry: keep DO Container Registry

Registries are just stateless storage — nothing forces them to live on the same cloud as compute. Keep pushing images to registry.digitalocean.com/smartsappregistry (already paid for, already authenticated in CI), and have Hetzner Dokploy pull from it.

  • CI: unchanged. Existing doctl registry login step using DIGITALOCEAN_ACCESS_TOKEN keeps working.
  • Hetzner Dokploy: add a Docker registry credential pointing at registry.digitalocean.com with a DO API token (read-only scope, separate from CI's token). Dokploy will use it to pull images on deploy.
  • Cost impact: zero — DO Registry basic tier is already in use and bundled.
  • Reinforces fallback: if Hetzner has issues, DOKS can pull the same images and resume serving with no registry migration needed.

If DO Registry is later torn down (deferred §1.9 decision), GHCR or Hetzner's registry become the swap-in. Not now.
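To verify the read-only token before handing it to Dokploy — a sketch; the image path is illustrative, substitute a real repository and tag:

```bash
# DO Container Registry accepts the API token as both username and password.
# DO_READONLY_TOKEN is the separate read-only token from above; the image path is illustrative.
echo "$DO_READONLY_TOKEN" | docker login registry.digitalocean.com \
  --username "$DO_READONLY_TOKEN" --password-stdin
docker pull registry.digitalocean.com/smartsappregistry/system:latest
```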

1.6 CI/CD — bitbucket-pipelines.yml

Before any edits: copy the current file to bitbucket-pipelines.digitalocean.yml at the repo root. This is the literal frozen snapshot of the working DO pipeline. Bitbucket only reads bitbucket-pipelines.yml, so the .digitalocean.yml copy is inert — it sits on disk as a one-command revert (cp bitbucket-pipelines.digitalocean.yml bitbucket-pipelines.yml). Add a header comment at the top of the archived file: # ARCHIVED — frozen DO pipeline as of <date>. To revert from Hetzner, copy this over bitbucket-pipelines.yml and restore DO Bitbucket variables.

Then modify bitbucket-pipelines.yml (keep build/test stages unchanged):

| Current | New |
| --- | --- |
| doctl registry login → push to registry.digitalocean.com/smartsappregistry | Unchanged — keep pushing to DO Registry |
| doctl kubernetes cluster kubeconfig save | Removed from active pipeline; preserved in bitbucket-pipelines.digitalocean.yml |
| kubectl apply -f infrastructure/.../k8s/ | Removed from active pipeline |
| kubectl set image deployment/... | Dokploy API call: curl -X POST https://dokploy.<domain>/api/deploy -H "Authorization: Bearer $DOKPLOY_TOKEN" -d '{"projectId":"..."}' |
| kubectl rollout status | Poll the Dokploy deploy status endpoint until success |

Bitbucket variables to add: DOKPLOY_TOKEN, DOKPLOY_HOST. Bitbucket variables to keep: DIGITALOCEAN_ACCESS_TOKEN (still used to push to DO Registry), REGISTRY_NAME. Bitbucket variables to leave alone but stop using: K8S_CLUSTER_NAME (no harm in leaving; pipeline no longer references it).

Nothing introduces GitHub. No GHCR_TOKEN. The DO API token already in CI continues to do exactly the job it does today: registry push.

Keep stale-commit optimization, parallel image builds, smoke tests — all still valid.
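The replacement deploy step in shell form — the /api/deploy call is the one from the table above; the status-polling path, its projectId parameter, and the status field are assumptions to confirm against Dokploy's API documentation:

```bash
# DOKPLOY_HOST and DOKPLOY_TOKEN are the new Bitbucket variables from §1.6;
# DOKPLOY_PROJECT_ID is a hypothetical variable holding the Dokploy project id.
curl -fsS -X POST "https://${DOKPLOY_HOST}/api/deploy" \
  -H "Authorization: Bearer ${DOKPLOY_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"projectId\":\"${DOKPLOY_PROJECT_ID}\"}"

# Poll until the deployment settles. The status path and JSON field below are assumptions —
# adjust to whatever Dokploy's API actually exposes.
for _ in $(seq 1 30); do
  status=$(curl -fsS -H "Authorization: Bearer ${DOKPLOY_TOKEN}" \
    "https://${DOKPLOY_HOST}/api/deploy/status?projectId=${DOKPLOY_PROJECT_ID}" | jq -r '.status')
  [ "$status" = "done" ] && exit 0
  [ "$status" = "error" ] && exit 1
  sleep 10
done
exit 1
```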

1.7 Cutover (non-destructive — DO stays standby, no temp subdomain)

No staging/temporary subdomain. The Hetzner box is validated via local /etc/hosts overrides, then DNS for all production subdomains is flipped together in a single batch.

  1. Deploy app to Hetzner Dokploy with DO managed Postgres still active (cross-cloud DB temporarily). The new pods come up but no DNS resolves to the Hetzner IP yet, so no traffic reaches them.
  2. Validate via /etc/hosts override (laptop only): map the api-v3, app, parent-portal, and docs-v3 subdomains of smartsapp-staging.com to the Hetzner App VPS IP. Run smoke tests with curl -k (the Let's Encrypt cert is not yet issued — public DNS still points at DO). Remove the override when done.
  3. Schedule a short maintenance window. pg_dump from DO → pg_restore to self-hosted Hetzner Postgres (sketched after this list). Switch the app's JDBC_URL env var via Dokploy. DO managed Postgres stays running — read-only fallback. Re-validate via the /etc/hosts override.
  4. Lower DNS TTL to 60s 24h before flip day.
  5. Flip DNS for all subdomains in one batch — DO LB IP → Hetzner App VPS IP. Within ~3 min Traefik issues fresh Let's Encrypt certs and HTTPS works cleanly.
  6. No DO resources are deleted. DOKS keeps running, DO Postgres keeps running, DO registry keeps its images. CI/CD now targets Hetzner; DO simply stops receiving traffic and stops being deployed to.
  7. Monitor 30 days. If green and stable, see §1.9 for optional standby/teardown.
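Step 3's dump/restore, sketched — the connection URI, database name, and role are placeholders; take the DO URI from the managed database's console:

```bash
# Placeholders throughout: DO_PG_URI, the smartsapp database name, and the smartsapp_app role.
# Custom-format dump from DO managed Postgres:
pg_dump "$DO_PG_URI" --format=custom --no-owner --file=smartsapp.dump

# Restore into the self-hosted Postgres 16 on the DB VPS:
createdb -U postgres smartsapp
pg_restore -U postgres --dbname=smartsapp --no-owner --role=smartsapp_app smartsapp.dump

# Then switch the backend's JDBC_URL in Dokploy to the DB VPS private IP and redeploy.
```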

1.8 Standby DO infra (default state after cutover)

  • DOKS: leave the cluster as-is, or scale the DOKS node pool to 0 nodes (the cluster control plane is free on DO; the node pool is the cost). One CLI flip restores full service (sketched after this list). Keeps Terraform state valid.
  • DO Managed Postgres: keep running for 30 days as a hot fallback. The pg_dump snapshot taken at cutover is the recovery point. After 30 days, downsize tier or take a final snapshot and pause.
  • DO Container Registry: keep active — Hetzner CI pushes here, Hetzner Dokploy pulls from here. Not standby; this is the live registry for both clouds.
  • DO Load Balancer: keep until DNS TTL fully propagates and you're confident no clients are hitting old IPs (~7 days). Then optionally remove (it's the most expensive idle resource).
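The standby/restore flip, sketched with doctl — cluster and pool names are placeholders (read them from doctl kubernetes cluster list), and DO may enforce a minimum node count on a cluster's only pool, so confirm --count 0 is accepted before relying on it:

```bash
# Placeholder names — substitute the real cluster and node-pool names.
# Scale down to standby:
doctl kubernetes cluster node-pool update smartsapp-cluster smartsapp-pool --count 0

# Restore full service (the "one CLI flip"):
doctl kubernetes cluster node-pool update smartsapp-cluster smartsapp-pool --count 2
```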

1.9 Optional teardown (deferred, explicit user decision only)

After 30+ days of stable Hetzner operation, the user can choose to:

  • Run terraform destroy in infrastructure/cloud_environment_setup/digitalocean/terraform/ (preserves the code, removes the cloud resources).
  • Or keep paying the standby cost (~$5–10/mo with node pool at 0) for indefinite revertibility.

Default: do not tear down. This step never runs without an explicit "yes, decommission DO."

1.10 Files to create/modify

| File | Action |
| --- | --- |
| infrastructure/cloud_environment_setup/hetzner/dokploy/docker-compose.yml | Create |
| infrastructure/cloud_environment_setup/hetzner/dokploy/.env.example | Create — document required env vars |
| infrastructure/cloud_environment_setup/hetzner/terraform/main.tf | Create — provider, network/subnet, firewalls, servers, SSH key data source |
| infrastructure/cloud_environment_setup/hetzner/terraform/variables.tf | Create — region, sizes, image, ssh-key name, private-network CIDR |
| infrastructure/cloud_environment_setup/hetzner/terraform/outputs.tf | Create — public/private IPs, ready-to-paste ssh commands |
| infrastructure/cloud_environment_setup/hetzner/terraform/cloud-init/app-vps.yaml | Create — admin user, Docker, Dokploy install one-liner |
| infrastructure/cloud_environment_setup/hetzner/terraform/cloud-init/db-vps.yaml | Create — admin user, Postgres 16 install, Barman install |
| infrastructure/cloud_environment_setup/hetzner/terraform/.gitignore | Create — gitignore terraform.tfstate*, .terraform/, *.tfvars |
| infrastructure/cloud_environment_setup/hetzner/postgres/setup.md | Create — Postgres install + Barman config |
| infrastructure/cloud_environment_setup/hetzner/README.md | Create — overview, cutover runbook |
| bitbucket-pipelines.digitalocean.yml | Create — verbatim copy of current bitbucket-pipelines.yml with archived header comment. Frozen, never auto-edited. |
| bitbucket-pipelines.yml | Modify deploy steps (§1.6) — only after the archive copy exists |
| docs/deployment.md | Replace DOKS section with Hetzner Dokploy section; keep DOKS as legacy/archived note |
| docs/operations/postgres-restore.md | Create — Barman restore runbook (drill, PITR, full DR, revert-from-DO) |
| infrastructure/cloud_environment_setup/digitalocean/ | Keep, do not rename, do not delete. Add a STATUS.md at the root noting "Standby — Hetzner is primary as of <date>. DO resources remain provisioned (or scaled to 0) for fallback. To revert: re-point DNS, scale the DOKS node pool back up, switch CI back to DO deploy steps (see git history of bitbucket-pipelines.yml)." |

App code changes: none expected. All connection strings come from env vars (app-configmap.yml shows JDBC_URL, Kafka bootstrap, Redis host externalized). The Spring Boot Kafka rebalance tuning, spring.threads.virtual.enabled: false, and Hikari pool settings stay as-is.


Phase 2 — k3s on Hetzner (deferred trigger)

Trigger criteria (any one):

  • Adding Temporal, Keycloak, or the observability stack (Prometheus/Loki/Tempo/Grafana).
  • Need for HA / rolling deploys / >1 replica of the backend.
  • Dokploy node hits >70% CPU or memory steady-state.

Approach when triggered:

  • Use the kube-hetzner Terraform module — 3-node HA k3s cluster (~€20–30/mo).
  • Re-apply the existing manifests in infrastructure/cloud_environment_setup/digitalocean/k8s/ (preserved exactly for this purpose) into the new cluster, swapping the ingress class and registry refs.
  • Migrate Postgres in-cluster via the CloudNativePG operator (or keep the external Postgres VPS — both viable).
  • Use the Hetzner Cloud Controller Manager + CSI for LBs and volumes.
  • Bitbucket pipeline reverts to kubectl apply + kubectl set image — same pattern as DOKS today.

This phase is documented but not implemented now. Don't pre-build it; the Dokploy compose files are throwaway when this triggers, and that's fine.


Verification

After Phase 1 cutover, before decommissioning DO:

  1. Smoke tests: Postman collection (already in pipeline) — all endpoints 200.
  2. Kafka: confirm backend's consumer group rebalance completes (kubectl logs equivalent: docker logs system | grep "partitions assigned").
  3. Postgres: Hikari pool reports healthy; run an explicit query through admin portal to confirm DB writes persist.
  4. Backups: barman backup main, then barman recover --target-time to a scratch directory on the DB VPS (sketched after this list) — verify the restored DB starts and barman check main reports OK. Full procedure: docs/operations/postgres-restore.md §1.
  5. Frontend reachability: load admin.<domain> and parent.<domain> over HTTPS, confirm Let's Encrypt cert valid.
  6. CI/CD end-to-end: push a no-op commit to a branch, watch Bitbucket pipeline → DO Registry push → Dokploy deploy → smoke test green.
  7. Cost check: after 7 days, confirm the Hetzner bill aligns with the ~€23/mo estimate; DO billing trending toward standby-only spend.
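Item 4's restore drill, sketched — the backup ID, target time, scratch directory, and port are placeholders; the authoritative runbook is docs/operations/postgres-restore.md:

```bash
# Placeholders: "latest", the target time, /var/tmp/restore-drill, and port 5433 are illustrative.
barman check main
barman recover main latest /var/tmp/restore-drill --target-time "2026-05-01 03:00:00"

# Start a throwaway instance against the recovered data directory, confirm it reaches
# a consistent state, then tear it down and remove the scratch directory.
pg_ctl -D /var/tmp/restore-drill -o "-p 5433" start
psql -p 5433 -U postgres -c "SELECT now();"
pg_ctl -D /var/tmp/restore-drill stop
rm -rf /var/tmp/restore-drill
```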

Critical files for execution