Backup & restore
What to back up, where the canonical state lives, and how to restore it.
Scrydon's state lives in three stores: PostgreSQL for relational state, object storage (SeaweedFS or S3) for blobs, and an OLAP warehouse (StarRocks) for analytics rows. This runbook covers backing up and restoring all three.
What's where
| Data | Store | Backup approach |
|---|---|---|
| Auth state, identities, organisations, workspaces | PostgreSQL | Logical dump or PITR snapshot |
| Workflows, automations, knowledge-base index | PostgreSQL | Logical dump or PITR snapshot |
| Ontology schema, bindings, branches | PostgreSQL | Logical dump or PITR snapshot |
| Audit log | PostgreSQL | Logical dump or PITR snapshot |
| Secrets (encrypted) | PostgreSQL | Logical dump or PITR snapshot |
| Knowledge-base documents (blobs) | Object storage | Object replication or rsync |
| Managed-table rows | StarRocks | Backup snapshots via the StarRocks operator |
| Marimo notebooks | Object storage | Object replication |
Backup cadence
The recommended baseline:
| Store | Cadence | Retention |
|---|---|---|
| PostgreSQL logical dump | Daily | 30 days |
| PostgreSQL PITR snapshot | Continuous WAL archive | 7 days |
| Object storage | Continuous replication to a secondary bucket / region | Forever (with delete protection) |
| StarRocks snapshots | Daily | 14 days |
For regulated environments, extend retention to whatever your compliance framework requires.
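The daily logical-dump row in the table above could be scripted along the following lines. This is a minimal sketch under stated assumptions, not the platform's own tooling: `BACKUP_DIR` and `RETENTION_DAYS` are hypothetical names, the database list is taken from the restore section of this runbook, and connection settings are assumed to come from the standard `PGHOST` / `PGUSER` / `PGPASSWORD` environment variables.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical settings; adjust to your environment.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/scrydon/postgres}"
RETENTION_DAYS="${RETENTION_DAYS:-30}"

# Path of the dump file for one database on one day (YYYY-MM-DD).
dump_path() {
  printf '%s/%s-%s.dump\n' "$BACKUP_DIR" "$1" "$2"
}

# Dump each database in custom format (readable by pg_restore).
dump_all() {
  local day db
  day="$(date -u +%F)"
  mkdir -p "$BACKUP_DIR"
  for db in auth agentic analytics cortex ontology; do
    pg_dump --format=custom --file="$(dump_path "$db" "$day")" "$db"
  done
}

# Remove dumps last modified more than RETENTION_DAYS days ago.
prune_old() {
  find "$BACKUP_DIR" -type f -name '*.dump' -mtime +"$RETENTION_DAYS" -delete
}
```

A daily cron job or Kubernetes CronJob would call `dump_all` followed by `prune_old`. Pruning by modification time is the simplest option; a regulated environment may prefer retention enforced by the backup target itself.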
PostgreSQL backup
If you're using a managed Postgres (Azure Database for PostgreSQL, Amazon RDS, …), use the provider's PITR + snapshot tools.
If you're using the platform-bundled Postgres (not recommended for production), use pg_basebackup for full backups plus WAL archiving for PITR. The Helm chart documents the volume mounts.
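For the bundled Postgres, the base-backup half of that pattern might look like the sketch below. The target directory and the WAL archive path are hypothetical, and the flags shown are standard `pg_basebackup` options rather than anything specific to the Scrydon chart.

```shell
#!/usr/bin/env bash
set -eu

# Take a full base backup as gzip-compressed tar files, streaming the WAL
# needed to make the backup consistent. The target directory must be empty.
base_backup() {
  pg_basebackup \
    --pgdata="$1" \
    --format=tar --gzip \
    --wal-method=stream \
    --checkpoint=fast
}

# For PITR the server must also archive WAL continuously. In postgresql.conf
# (the archive path is a hypothetical example):
#   archive_mode = on
#   archive_command = 'test ! -f /mnt/wal-archive/%f && cp %p /mnt/wal-archive/%f'
```

Calling `base_backup /backups/base/2024-01-02` would produce `base.tar.gz` and `pg_wal.tar.gz` in that directory. The archived WAL segments, not the base backup alone, are what make recovery to an arbitrary point in time possible.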
Object storage backup
If you're using SeaweedFS (the bundled blob store), set up volume replication to a secondary location. SeaweedFS supports cross-region replication via its filer's replication feature.
If you're using S3 / Azure Blob / GCS, use the provider's cross-region replication feature.
StarRocks backup
The platform-bundled StarRocks is a single-pod image with a snapshot mechanism for individual tables. For production HA, use the StarRocks Kubernetes operator with native FE/BE backup.
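A per-table snapshot could be submitted through the MySQL-protocol client roughly as follows. This sketch assumes a backup repository has already been created with `CREATE REPOSITORY`, that the database, snapshot, repository and table names are placeholders, and that the `BACKUP SNAPSHOT` syntax matches your StarRocks version (it has changed between releases, so check the reference for yours).

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical connection settings; 9030 is the default FE query port.
SR_HOST="${SR_HOST:-starrocks-fe}"
SR_PORT="${SR_PORT:-9030}"
SR_USER="${SR_USER:-root}"

# Build the BACKUP statement: database, snapshot name, repository, table.
backup_sql() {
  printf 'BACKUP SNAPSHOT %s.%s TO %s ON (%s);' "$1" "$2" "$3" "$4"
}

# Submit the statement through the mysql client.
# The password, if any, is read from the MYSQL_PWD environment variable.
run_backup() {
  mysql -h "$SR_HOST" -P "$SR_PORT" -u "$SR_USER" -e "$(backup_sql "$@")"
}
```

For example, `run_backup analytics nightly_20240102 backup_repo managed_rows` submits one snapshot job. The statement is asynchronous, so progress has to be checked separately with `SHOW BACKUP`.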
Restore — full disaster recovery
Order of operations for a full DR scenario:
- Provision the target cluster with the same Kubernetes version and Scrydon version.
- Restore PostgreSQL from your most recent backup (or PITR target). All five enabled databases (auth, agentic, analytics, cortex, ontology) must restore together: the license, organizations, integrations, workflows, and ontology packs are joined across them.
- Restore object storage to the same paths.
- Restore StarRocks snapshots.
- Apply the Helm chart pointing at the restored stores.
- No license re-application step. The license lives in the restored auth database's platform_config row; once Postgres comes back, api-platform reads it on the next request. If you need to rotate the license post-restore, use Settings → License → Update license in the UI (see License rotation).
- Validate by signing in and confirming workflows / tables / knowledge bases are visible.
The Dapr crypto master key (master-key Secret, AES-256, in the Scrydon release namespace) must be restored together with Postgres and before the Helm chart is applied. Without it, encrypted secret values stored via @scrydon/better-auth-secrets are unrecoverable. See the Dapr crypto master key section below for the backup procedure.
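The ordering constraints above can be made explicit in a small wrapper script. The sketch below is illustrative only: the release name, chart reference and dump directory are hypothetical, the dumps are assumed to be custom-format archives named `<db>.dump`, and the target databases are assumed to exist already. Object storage and StarRocks restores are store-specific and are left as comments.

```shell
#!/usr/bin/env bash
set -eu

NAMESPACE="${NAMESPACE:-scrydon-platform}"  # namespace used in this runbook's examples
RELEASE="${RELEASE:-scrydon}"               # hypothetical Helm release name
CHART="${CHART:-./scrydon-chart}"           # hypothetical chart reference
DUMP_DIR="${DUMP_DIR:-./dumps}"             # hypothetical directory of <db>.dump files

# Restore all five databases together from custom-format dumps.
restore_postgres() {
  local db
  for db in auth agentic analytics cortex ontology; do
    pg_restore --clean --if-exists --dbname="$db" "$DUMP_DIR/$db.dump"
  done
}

# Re-create the Dapr crypto master key from the out-of-band backup.
restore_master_key() {
  kubectl apply -f master-key.backup.yaml -n "$NAMESPACE"
}

apply_chart() {
  helm upgrade --install "$RELEASE" "$CHART" -n "$NAMESPACE"
}

full_restore() {
  restore_postgres
  # Restore object storage and StarRocks snapshots here (store-specific).
  restore_master_key   # must exist before the chart is applied
  apply_chart
}
```

Because the script runs with `set -e`, a failure in any step stops the sequence before the chart is applied, which avoids the chart generating a fresh master key against a half-restored environment.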
Dapr crypto master key
The Dapr crypto component encrypts every secret value the platform writes (OAuth client secrets, vendor API keys, encrypted env vars passed to integration providers) before it lands in Postgres. The encryption key lives in a Kubernetes Secret named master-key in the Scrydon release namespace.
The chart auto-generates this key on fresh install via the Helm lookup function — but it is never part of any subsequent Helm release, which means a routine helm uninstall + helm install cycle, or any restore that re-applies the chart against a fresh namespace, will generate a new key. Any encrypted-secret rows in the restored Postgres are then unrecoverable.
Back up the Secret out-of-band and restore it before applying the chart:
# Capture the key into a YAML file (run this on the source cluster).
kubectl get secret master-key -n scrydon-platform -o yaml \
| grep -v '^\s*resourceVersion\|^\s*uid\|^\s*creationTimestamp' \
> master-key.backup.yaml
# Store master-key.backup.yaml off-cluster — encrypted at rest. Treat it
# with the same protection class as your Postgres backups.
# Restoring on the target cluster (BEFORE running helm install):
kubectl apply -f master-key.backup.yaml -n scrydon-platform
Alternatives if your team already runs one of these:
- Sealed Secrets: seal the master-key Secret with the cluster's controller key; the sealed file is safe to commit. Re-apply on disaster recovery before the chart. The controller on the target cluster needs the same sealing key to decrypt it.
- External Secrets Operator: source master-key.key from your cloud key manager (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault). The chart's lookup-based auto-generate path will not overwrite a Secret that already carries the Helm ownership annotations.
- etcd snapshots: these capture every Kubernetes Secret, including master-key. Useful for full-cluster DR; less useful for cross-cluster restore.
Compliance mapping. ISO 27001:2022 A.5.16 / A.8.24: cryptographic-key management requires a documented backup, rotation, and recovery procedure for any key that protects production data. The master-key rotation procedure is covered in the internal Dapr ADR; for customer purposes, treat this backup as a hard prerequisite to a working PostgreSQL restore.
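Since a missing master-key Secret silently leads to a freshly generated key, a preflight check before `helm install` can help. The function below is a hypothetical helper, not part of the chart; it only confirms that the Secret exists in the given namespace and does not verify that its contents match the backup.

```shell
#!/usr/bin/env bash
set -eu

# Succeeds only if the master-key Secret already exists in the namespace.
# Argument: namespace (defaults to the one used in this runbook's examples).
require_master_key() {
  local ns="${1:-scrydon-platform}"
  if kubectl get secret master-key -n "$ns" >/dev/null 2>&1; then
    echo "master-key present in $ns"
  else
    echo "master-key missing in $ns: restore it before applying the chart" >&2
    return 1
  fi
}
```

In a restore script this would be used as `require_master_key scrydon-platform && helm install ...`, so the chart is never applied against a namespace that lacks the restored key.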
Restore — single store
If you only need to restore one store (e.g. a corrupted knowledge-base blob), do the targeted restore against that store. Other stores stay live.
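For PostgreSQL, restoring a single table requires a logical dump: either one that targets the table specifically (pg_dump -t), or a custom-format dump of the whole database from which pg_restore -t can extract it.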
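A single-table dump and restore might be wrapped as follows. The table name in the usage note is hypothetical, and the sketch assumes connection settings come from the standard `PG*` environment variables.

```shell
#!/usr/bin/env bash
set -eu

# Dump one table. Arguments: database, table, output file.
dump_table() {
  pg_dump --format=custom --table="$2" --file="$3" "$1"
}

# Restore one table from a custom-format archive.
# Arguments: database, table, archive file.
# --clean --if-exists drops the existing table first, so run this against a
# scratch database unless you intend to overwrite the live table.
restore_table() {
  pg_restore --dbname="$1" --table="$2" --clean --if-exists "$3"
}
```

For example, `dump_table agentic workflows workflows.dump` followed by `restore_table agentic_scratch workflows workflows.dump`. Note that `pg_restore --table` restores the table itself but not objects that depend on it, so foreign keys and other dependencies may need separate attention.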
Verifying a backup
A backup that hasn't been restored isn't a backup. Run a monthly drill:
- Restore yesterday's PostgreSQL dump into a scratch cluster.
- Sign in with a known admin account.
- Open a workflow, a knowledge-base document, and a managed table.
- Confirm the audit log shows yesterday's events.
- Tear the scratch cluster down.
If any step fails, escalate immediately — a silently failing backup is a known operational hazard.
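Between drills, a cheap automated pre-check can catch dumps that are empty or unreadable. The helper below is a hypothetical sketch and is not a substitute for the restore drill: it only confirms that the file is non-empty and that `pg_restore` can read its table of contents.

```shell
#!/usr/bin/env bash
set -eu

# Sanity-check one custom-format dump file without restoring it.
verify_dump() {
  if [ ! -s "$1" ]; then
    echo "FAIL: $1 is missing or empty" >&2
    return 1
  fi
  # pg_restore --list reads the archive's table of contents only.
  if ! pg_restore --list "$1" >/dev/null; then
    echo "FAIL: $1 is not a readable archive" >&2
    return 1
  fi
  echo "OK: $1"
}
```

Running `verify_dump` on each new dump right after it is written, and alerting on a non-zero exit status, turns a silently failing backup into a visible one.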
Related
- Upgrade runbook: restore is one of the rollback paths.
- Database migrations: schema-aware aspects of restore.
- Licensing: rotating the license post-restore (no re-application is needed).