Backup & restore

What to back up, where the canonical state lives, and how to restore it.

Scrydon's state lives in three stores: PostgreSQL for relational state, object storage (SeaweedFS or S3) for blobs, and an OLAP warehouse (StarRocks) for analytics rows. This runbook covers backing up and restoring all three, plus the Dapr crypto master key that a working PostgreSQL restore depends on.

What's where

| Data | Store | Backup approach |
| --- | --- | --- |
| Auth state, identities, organisations, workspaces | PostgreSQL | Logical dump or PITR snapshot |
| Workflows, automations, knowledge-base index | PostgreSQL | Logical dump or PITR snapshot |
| Ontology schema, bindings, branches | PostgreSQL | Logical dump or PITR snapshot |
| Audit log | PostgreSQL | Logical dump or PITR snapshot |
| Secrets (encrypted) | PostgreSQL | Logical dump or PITR snapshot |
| Knowledge-base documents (blobs) | Object storage | Object replication or rsync |
| Managed-table rows | StarRocks | Backup snapshots via the StarRocks operator |
| Marimo notebooks | Object storage | Object replication |

Backup cadence

The recommended baseline:

| Store | Cadence | Retention |
| --- | --- | --- |
| PostgreSQL logical dump | Daily | 30 days |
| PostgreSQL PITR snapshot | Continuous WAL archive | 7 days |
| Object storage | Continuous replication to a secondary bucket / region | Forever (with delete protection) |
| StarRocks snapshots | Daily | 14 days |

For regulated environments, extend retention to whatever your compliance framework requires.
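The daily logical dump and its 30-day retention can be scripted roughly as follows. This is a minimal sketch, not a shipped Scrydon tool: the output directory, the file naming, and the assumption that the five databases are named auth, agentic, analytics, cortex, and ontology are all illustrative.

```shell
# Sketch only: paths and file naming are assumptions, not Scrydon defaults.
# Connection settings are expected in PGHOST / PGUSER / PGPASSWORD or ~/.pgpass.

# Write one custom-format dump per database into the directory given as $1.
dump_all_databases() {
  out_dir="$1"
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$out_dir" || return 1
  for db in auth agentic analytics cortex ontology; do
    # -Fc selects the compressed custom format that pg_restore reads.
    pg_dump -Fc -f "$out_dir/${db}-${stamp}.dump" "$db" || return 1
  done
}

# Delete dumps in $1 that are past the retention window ($2 days, default 30).
prune_old_dumps() {
  dump_dir="$1"
  retention_days="${2:-30}"
  # find rounds file age down to whole days, so +30 means 31 days or older.
  find "$dump_dir" -maxdepth 1 -type f -name '*.dump' \
    -mtime +"$retention_days" -delete
}
```

A daily job could call `dump_all_databases /backups/postgres` followed by `prune_old_dumps /backups/postgres 30` (the path is a placeholder). Each pg_dump run takes its own snapshot, so dumps of different databases are not guaranteed to be mutually consistent; where that matters, PITR is the safer basis for a restore.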

PostgreSQL backup

If you're using a managed Postgres (Azure Database for PostgreSQL, Amazon RDS, …), use the provider's PITR + snapshot tools.

If you're using the platform-bundled Postgres (not recommended for production), take full backups with pg_basebackup and archive WAL continuously for PITR. The Helm chart documents the volume mounts.
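A minimal sketch of the pg_basebackup half of that pattern is shown here. The target directory and the archive path are placeholders, and how you reach the bundled Postgres pod (port-forward, in-cluster job, and so on) is left to your environment.

```shell
# Sketch only: the target directory is a placeholder. pg_basebackup needs a
# role with the REPLICATION attribute and reads PGHOST / PGUSER like other
# libpq tools.
take_base_backup() {
  target_dir="$1"   # must be empty or not exist yet
  # -Ft        write tar archives
  # -z         gzip them
  # -X stream  stream the WAL generated during the backup alongside it
  # -P         report progress
  pg_basebackup -D "$target_dir" -Ft -z -X stream -P
}

# WAL archiving is configured on the server, not by this script. The
# PostgreSQL documentation's example settings look like this
# (/mnt/wal-archive is a placeholder path):
#   wal_level = replica
#   archive_mode = on
#   archive_command = 'test ! -f /mnt/wal-archive/%f && cp %p /mnt/wal-archive/%f'
```

The base backup gives PITR a starting point; the archived WAL is what lets you replay forward to a chosen target time, so both halves are needed.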

Object storage backup

If you're using SeaweedFS (the bundled blob store), set up volume replication to a secondary location. SeaweedFS supports cross-region replication via its filer's replication feature.

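If you're using S3 / Azure Blob / GCS, enable the provider's cross-region replication on the bucket or container, and pair it with the provider's delete protection so that an accidental delete is not replicated as a permanent loss.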

StarRocks backup

The platform-bundled StarRocks is a single-pod image with a snapshot mechanism for individual tables. For production HA, use the StarRocks Kubernetes operator with native FE/BE backup.

Restore — full disaster recovery

Order of operations for a full DR scenario:

  1. Provision the target cluster with the same Kubernetes version and Scrydon version.
  2. Restore PostgreSQL from your most recent backup (or PITR target). All five enabled databases (auth, agentic, analytics, cortex, ontology) must restore together — the license, organizations, integrations, workflows, and ontology packs are joined across them.
  3. Restore object storage to the same paths.
  4. Restore StarRocks snapshots.
  5. Apply the Helm chart pointing at the restored stores.
  6. Skip license re-application. The license lives in the platform_config row of the restored auth database, so api-platform reads it on the next request once Postgres is back. To rotate the license after the restore, use Settings → License → Update license in the UI (see License rotation).
  7. Validate by signing in and confirming workflows / tables / knowledge bases are visible.

The Dapr crypto master key (the master-key Secret, AES-256, in the Scrydon release namespace) must be restored before step 5, when the Helm chart is applied. Without the original key, the encrypted secret values stored via @scrydon/better-auth-secrets in the restored Postgres are unrecoverable. See the Dapr crypto master key section below for the backup procedure.
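Step 2 of the list above, restoring all five databases together, can be sketched as follows. It assumes custom-format dumps named `<database>-<stamp>.dump` in one directory and target databases that already exist under the names auth, agentic, analytics, cortex, and ontology; both are illustrative assumptions rather than documented Scrydon conventions.

```shell
# Sketch only: file naming and database names are assumptions.
restore_all_databases() {
  dump_dir="$1"
  stamp="$2"
  # Check every dump first, so a missing file never leaves a partial restore.
  for db in auth agentic analytics cortex ontology; do
    f="$dump_dir/${db}-${stamp}.dump"
    [ -s "$f" ] || { echo "missing or empty dump: $f" >&2; return 1; }
  done
  for db in auth agentic analytics cortex ontology; do
    # --clean --if-exists drops existing objects before recreating them;
    # --no-owner avoids failures when role names differ on the target.
    pg_restore --clean --if-exists --no-owner \
      -d "$db" "$dump_dir/${db}-${stamp}.dump" || return 1
  done
}
```

Checking all five files before touching any database reflects the requirement that the databases restore together: a restore that stops after three of five leaves cross-database joins pointing at rows that do not exist.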

Dapr crypto master key

The Dapr crypto component encrypts every secret value the platform writes (OAuth client secrets, vendor API keys, encrypted env vars passed to integration providers) before it lands in Postgres. The encryption key lives in a Kubernetes Secret named master-key in the Scrydon release namespace.

The chart auto-generates this key on fresh install via the Helm lookup function — but it is never part of any subsequent Helm release, which means a routine helm uninstall + helm install cycle, or any restore that re-applies the chart against a fresh namespace, will generate a new key. Any encrypted-secret rows in the restored Postgres are then unrecoverable.

Back up the Secret out-of-band and restore it before applying the chart:

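# Capture the key into a YAML file (run this on the source cluster).
# Replace scrydon-platform with your release namespace. The grep strips
# cluster-specific metadata so the file applies cleanly on another cluster.
kubectl get secret master-key -n scrydon-platform -o yaml \
  | grep -Ev '^[[:space:]]*(resourceVersion|uid|creationTimestamp):' \
  > master-key.backup.yaml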

# Store master-key.backup.yaml off-cluster — encrypted at rest. Treat it
# with the same protection class as your Postgres backups.

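# Restoring on the target cluster (BEFORE running helm install).
# The namespace must exist first; this creates it if it is missing.
kubectl create namespace scrydon-platform --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f master-key.backup.yaml -n scrydon-platform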
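To confirm that the restored Secret really is the original key without printing the key itself, you can compare a fingerprint taken on the source cluster with one taken on the target. The sketch below assumes the key material sits under the data field named `key` (inferred from the master-key.key reference on this page) and that `sha256sum` is available; on macOS, `shasum -a 256` is the usual substitute.

```shell
# Sketch only: the data field name "key" is an assumption.
# Prints a SHA-256 fingerprint of the base64-encoded key, never the key itself.
master_key_fingerprint() {
  ns="$1"
  encoded="$(kubectl get secret master-key -n "$ns" \
    -o jsonpath='{.data.key}')" || return 1
  # An empty value means the Secret or the field is missing.
  [ -n "$encoded" ] || return 1
  printf '%s' "$encoded" | sha256sum | cut -d' ' -f1
}
```

Run `master_key_fingerprint scrydon-platform` on the source cluster when you capture the backup and again on the target after `kubectl apply`. The two values should be identical; if they differ, stop before running helm install.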

Alternatives if your team already runs one of these:

  • Sealed Secrets — seal the master-key Secret with the cluster's controller key; the sealed file is safe to commit. Re-apply on disaster recovery before the chart.
  • External Secrets Operator — source master-key.key from your cloud key manager (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault). The chart's lookup-based auto-generate path will not overwrite a Secret that already carries the Helm ownership annotations.
  • etcd snapshots — captures every Kubernetes Secret including master-key. Useful for full-cluster DR; less useful for cross-cluster restore.

Compliance mapping. ISO 27001:2022 A.5.16 / A.8.24 — cryptographic-key management requires a documented backup, rotation, and recovery procedure for any key that protects production data. The master-key rotation procedure is covered in the internal Dapr ADR; for customer purposes, treat this backup as a hard prerequisite to a working PostgreSQL restore.

Restore — single store

If you only need to restore one store (e.g. a corrupted knowledge-base blob), do the targeted restore against that store. Other stores stay live.

For PostgreSQL, restore a single table either from a logical dump that targets it specifically (pg_dump -t) or by extracting it from a custom-format full dump with pg_restore -t.
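A targeted dump and restore of one table might look like the sketch below. The database and table names passed in are placeholders, and the restore drops the existing table first, so try it against a scratch database before running it on a live one.

```shell
# Sketch only: database and table names are supplied by the caller.
dump_single_table() {
  db="$1"; table="$2"; out_file="$3"
  # -t limits the dump to the named table; -Fc writes the custom format.
  pg_dump -Fc -t "$table" -f "$out_file" "$db"
}

restore_single_table() {
  db="$1"; table="$2"; dump_file="$3"
  # --clean --if-exists drops the existing table before recreating it.
  pg_restore --clean --if-exists -t "$table" -d "$db" "$dump_file"
}
```

For example, `dump_single_table ontology my_table /tmp/my_table.dump` followed by `restore_single_table ontology my_table /tmp/my_table.dump` (my_table is hypothetical). If other tables hold foreign keys to the one being restored, the drop may fail and the restore will need to be planned around those dependencies.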

Verifying a backup

A backup that hasn't been restored isn't a backup. Run a monthly drill:

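  1. Restore yesterday's PostgreSQL dump into a scratch cluster, together with the master-key Secret and the object-storage and StarRocks copies that step 3 reads from.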
  2. Sign in with a known admin account.
  3. Open a workflow, a knowledge-base document, and a managed table.
  4. Confirm the audit log shows yesterday's events.
  5. Tear the scratch cluster down.

If any step fails, escalate immediately. A backup that fails silently is usually discovered only when a restore is needed, which is the worst time to find out.
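Between drills, a cheap automated check can catch the most common silent failure, a dump job that stopped producing files. The sketch below assumes dumps are written as `*.dump` files into one directory; the 26-hour default (a daily job plus slack) is an illustrative threshold, not a Scrydon setting. It complements the monthly drill and does not replace it, because a fresh file can still be unrestorable.

```shell
# Sketch only: directory layout and threshold are assumptions.
# Succeeds if at least one non-empty *.dump file in $1 was modified within
# the last $2 minutes (default 1560, i.e. 26 hours).
check_dump_is_fresh() {
  dump_dir="$1"
  max_age_min="${2:-1560}"
  fresh="$(find "$dump_dir" -maxdepth 1 -type f -name '*.dump' \
    -size +0 -mmin -"$max_age_min" | head -n 1)"
  [ -n "$fresh" ]
}
```

Wired into monitoring, this could run as `check_dump_is_fresh /backups/postgres || echo "no fresh PostgreSQL dump" >&2`, with the path replaced by wherever your dumps land.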
