Database migrations
How schema migrations are applied on Scrydon upgrades — what runs, how to verify, and how to handle failures.
Every Scrydon upgrade may include database migrations. This runbook covers what runs, how to verify, and how to handle a failed migration.
How migrations run
On upgrade, each Scrydon service that owns a schema runs its own migrations on startup. The first pod in each rolling update applies the migration; subsequent pods see the migration as applied and skip it.
| Service | Schema owner |
|---|---|
| Platform | Authentication, organisations, workspaces, audit, secrets metadata |
| Agentic | Workflows, automations, knowledge bases, chats |
| Analytics | Managed-table catalogue, profiles, policy bundles |
| Ontology | Ontology schema, bindings, branches |
Migrations are forward-only by design. Down migrations are not provided; if you need to roll back, restore from backup.
What a migration does
Migrations are typically:
- Adding a new column (ADD COLUMN ... NULL).
- Backfilling derived data.
- Adding an index in CONCURRENTLY mode.
- Soft-deprecating a column (renamed, then read from new + old, then read-from-new only, then drop).
Migrations that would block writes for more than a few seconds are split across multiple releases — the platform never ships a single-step lock-the-table migration on a large table.
Verifying a migration ran
# List the five most recently applied migration versions
kubectl exec -it deploy/api-platform -n scrydon-platform -- \
  psql "$DATABASE_URL" -c "SELECT version, applied_at FROM platform_migrations ORDER BY applied_at DESC LIMIT 5;"
Each service has its own migrations table.
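The check above can be wrapped in a small helper so it is easy to repeat per service. This is a minimal sketch, not part of the product: the function names are hypothetical, and only the api-platform / scrydon-platform / platform_migrations combination comes from this runbook, so the deployment, namespace, and table names for other services have to be supplied by you. In the command above, "$DATABASE_URL" is expanded by your local shell; the sketch instead assumes the variable is set inside the container and that the container provides sh.

```shell
# Build the "latest migrations" query for one tracking table (hypothetical helper).
latest_migrations_sql() {
  local table="$1" limit="${2:-5}"
  # Both values are interpolated into SQL, so accept only a plain identifier
  # and a plain number.
  case "$table" in
    ''|*[!A-Za-z0-9_]*) echo "invalid table name: $table" >&2; return 1 ;;
  esac
  case "$limit" in
    ''|*[!0-9]*) echo "invalid limit: $limit" >&2; return 1 ;;
  esac
  printf 'SELECT version, applied_at FROM %s ORDER BY applied_at DESC LIMIT %s;\n' \
    "$table" "$limit"
}

# Run the query inside a service pod. DATABASE_URL is resolved inside the
# container (assumption: it is set there and sh is available).
show_latest_migrations() {
  local deploy="$1" namespace="$2" table="$3" sql
  sql="$(latest_migrations_sql "$table")" || return 1
  kubectl exec "deploy/$deploy" -n "$namespace" -- \
    sh -c 'psql "$DATABASE_URL" -c "$1"' sh "$sql"
}

# Example, using the values from this runbook:
# show_latest_migrations api-platform scrydon-platform platform_migrations
```

Comparing the newest version in the output against the version expected for the release is usually enough to confirm that the migration ran.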
Handling a migration failure
If a pod fails to apply a migration, the pod stays in CrashLoopBackOff. The migration error appears in the pod logs.
Do not kubectl rollout restart blindly. A failing migration that's retried can leave the schema in a partially-applied state on some databases. Investigate the error first.
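One way to investigate without restarting anything is to list the crash-looping pods and read the logs of their previous container run. The following is a sketch under stated assumptions: the function names are hypothetical, and the namespace in the example is the one used in the verification command in this runbook. It relies only on the default column layout of kubectl get pods and on kubectl logs --previous.

```shell
# Filter `kubectl get pods --no-headers` output down to crash-looping pods.
# Default columns are NAME READY STATUS RESTARTS AGE, so STATUS is field 3.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Print the tail of the previous (crashed) container's logs for each such pod.
# This is read-only: nothing is restarted or deleted.
show_crash_logs() {
  local namespace="$1" pod
  kubectl get pods -n "$namespace" --no-headers | crashlooping_pods |
    while read -r pod; do
      echo "=== $pod ==="
      kubectl logs "$pod" -n "$namespace" --previous --tail=100
    done
}

# Example:
# show_crash_logs scrydon-platform
```

The --previous flag matters here: the current container of a crash-looping pod may have only just started, so the migration error is usually in the logs of the run that crashed.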
Recovery procedure:
- Read the failing pod's logs.
- Identify the migration version and the SQL statement that failed.
- Decide between three paths:
  - Fix forward: apply a hot-patch to the failing migration. Restart the pod.
  - Skip the migration (if you've decided it's safe): mark the migration applied in the migrations table manually and restart.
  - Restore from backup: roll back to the pre-upgrade state. Restore PostgreSQL from the snapshot you took before the upgrade.
Always take a PostgreSQL snapshot before starting an upgrade. The upgrade runbook calls this out explicitly.
Migration tracking desync
The migration job verifies its tracking table against the actual database schema on every run. If the tracking table claims migrations are applied whose objects do not exist, the job fails with:
[migrate-bootstrap] FATAL: the migration tracking table is out of sync with
the actual database schema — it claims migrations are applied whose objects
do not exist. This is a tracking desync, not a migration bug.
When the message offers MIGRATE_REPAIR_TRACKING=1 ("Re-run with MIGRATE_REPAIR_TRACKING=1 to rewind tracking from ..."), the state is provably auto-repairable: the affected migrations left no objects behind, so re-applying them cannot conflict with existing data. Enable the repair via the matching service's values flag:
# values override for the affected service only — auth, analytics,
# cortex, apiOntology, or agentic
analytics:
  migration:
    repairTracking: true
Then re-run helm upgrade. The migration job rewinds the falsely-tracked rows and re-applies the migrations in order.
repairTracking is a one-shot, break-glass flag. Set it for the repair upgrade only, confirm the migration job succeeds, then set it back to false. Leaving it enabled permanently silences a safety check that exists to catch database/schema drift loudly.
If the failure message instead reports a partial or non-contiguous state, the auto-repair refuses to run by design — contact support before touching the tracking table manually.
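For the auto-repairable case, the one-shot nature of the flag can be made explicit by scripting both upgrades together. This is a sketch only: the function names are hypothetical, passing the value with --set instead of a values file is an assumption about how you manage overrides, and the release, chart, and namespace arguments are placeholders you supply. The service keys and the migration.repairTracking path are the ones shown in the values override above.

```shell
# Build the --set argument for one service's repair flag (hypothetical helper).
repair_flag() {
  local service="$1" enabled="$2"
  case "$service" in
    auth|analytics|cortex|apiOntology|agentic) ;;
    *) echo "unknown service: $service" >&2; return 1 ;;
  esac
  case "$enabled" in
    true|false) ;;
    *) echo "enabled must be true or false" >&2; return 1 ;;
  esac
  printf '%s.migration.repairTracking=%s\n' "$service" "$enabled"
}

# Enable the flag for a single upgrade, then switch it off again.
# If the repair upgrade fails, stop and investigate instead of retrying.
repair_tracking() {
  local release="$1" chart="$2" namespace="$3" service="$4" on off
  on="$(repair_flag "$service" true)" || return 1
  off="$(repair_flag "$service" false)" || return 1
  helm -n "$namespace" upgrade "$release" "$chart" --reuse-values --set "$on" || return 1
  helm -n "$namespace" upgrade "$release" "$chart" --reuse-values --set "$off"
}

# Example (release, chart, and namespace are placeholders):
# repair_tracking <release> <chart> <ns> analytics
```

The second upgrade is deliberate. With --reuse-values, Helm carries the previous release's values forward, so the flag would stay true on later upgrades unless it is explicitly set back to false.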
Long-running backfills
Some upgrades introduce a backfill (populating a new column from existing data). For large datasets, the backfill runs as a separate background job, not blocking the rolling update.
Progress is visible in the audit log as MIGRATION_BACKFILL_* events. The job can be paused, resumed, or restarted from where it left off.
What helm upgrade doesn't touch
helm upgrade orchestrates the rollout but doesn't itself touch the database. The services apply their own migrations on startup. This separation means:
- A Helm rollback doesn't un-apply migrations.
- A failed Helm upgrade where pods refused to start typically leaves the schema unchanged.
- The migration history is owned by the service, not by Helm.
Migrations never ran on a fresh install
On a first install, each service's migrations run as a Helm hook job (post-install,pre-upgrade). Post-install hooks only execute after helm install --wait reports every resource Ready. If any Deployment never becomes ready, the install is marked failed and the migration hooks are skipped — leaving the databases empty. The visible symptom is usually "Auth server unreachable" in the platform UI and relation "…" does not exist errors in the service logs.
Confirm the schema is missing:
kubectl -n <ns> exec deploy/db -c postgres -- psql -U postgres -d auth -c '\dt' | head
# "Did not find any tables." => migrations never ran
Recover. First fix whatever kept a pod from becoming ready (see Install troubleshooting). Then re-run the migrations in one of two ways:
- Re-run via Helm (preferred). The migration hook is pre-upgrade, which runs before the readiness wait, so it completes even if a slow service is still settling:
  helm -n <ns> upgrade <release> <chart> --reuse-values
- Run the migration jobs manually (if you can't run helm upgrade). The chart's migration jobs are registered as hooks; render and apply them as ordinary Jobs:
  helm -n <ns> get hooks <release> > /tmp/hooks.yaml
  # extract the *-migration-* Job manifests into /tmp/migrations.yaml, then:
  kubectl -n <ns> apply -f /tmp/migrations.yaml
  kubectl -n <ns> wait --for=condition=complete job -l app.kubernetes.io/name=scrydon --timeout=300s
The jobs are idempotent — they detect a fresh database and apply every migration from scratch, and become a no-op once the schema is present.
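The manual path leaves the extraction of the migration Job manifests to the operator. One possible way to do it is a small awk filter over the multi-document YAML that helm get hooks prints. This is a sketch, not a supported tool: the function name is hypothetical, and it assumes that documents are separated by --- lines, that kind: Job sits at the top level of each Job manifest, and that the migration Jobs contain -migration- somewhere in their text, following the name pattern mentioned above.

```shell
# Read multi-document YAML on stdin and print only the documents that are
# Jobs and mention "-migration-". The match is on the document text, not on a
# parsed metadata.name, so review the output before applying it.
extract_migration_jobs() {
  awk '
    function emit() {
      if (doc ~ /(^|\n)kind: Job(\n|$)/ && doc ~ /-migration-/)
        printf "---\n%s", doc
      doc = ""
    }
    /^---[ \t]*$/ { emit(); next }   # document separator: decide on the previous document
    { doc = doc $0 "\n" }            # accumulate the current document
    END { emit() }                   # the last document has no trailing separator
  '
}

# Example:
# helm -n <ns> get hooks <release> | extract_migration_jobs > /tmp/migrations.yaml
```

Inspect /tmp/migrations.yaml before running kubectl apply; if the filter selected nothing, the file is empty and the name pattern needs adjusting for your chart.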
Related
- Install troubleshooting — first-install failures and recovery.
- Upgrade runbook — full upgrade procedure.
- Backup & restore — the rollback path.