Operations
Day-2 runbooks for operating Scrydon — backup, restore, migrations, license rotation, observability, SIEM forwarding, supply-chain verification, and upgrades.
Day-1 covers getting Scrydon installed. Day-2 covers keeping it running: the routine tasks listed below and the runbooks your on-call team needs.
Account recovery & re-running setup
Regain admin access after a lost password, and how to re-open the first-run setup wizard.
Backup & restore
What to back up, where the canonical state lives, and how to restore it.
Database migrations
How schema migrations are applied on upgrade.
License rotation
Apply a renewed license without downtime.
Observability
Platform metrics, dashboards, and the SLOs to track.
SIEM forwarding
Wire the audit log into Splunk, Datadog, Elastic, Sumo, or Sentinel.
Supply-chain verification
Verify signed Helm charts, signed images, and the SBOM.
Upgrade runbook
The order of operations for an in-place upgrade.
Day-2 expectations
For a steady-state Scrydon deployment, plan on:
| Activity | Cadence |
|---|---|
| License heartbeat health check | Daily (automated) |
| Audit log review (focus events) | Weekly |
| Backup verification (restore-from-yesterday drill) | Monthly |
| Disaster recovery exercise | Quarterly |
| Minor version upgrade | Quarterly |
| Major version upgrade | Annually |
| Vulnerability scan + patch cycle | Continuous |
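One way to keep this cadence honest is to record when each activity was last completed and flag anything that has slipped. The sketch below is illustrative only: the day counts assigned to "monthly", "quarterly", and "annually" are assumptions, and `CADENCE_DAYS` and `overdue` are hypothetical names, not part of Scrydon.

```python
from datetime import date, timedelta

# Cadences from the table above, expressed as a maximum gap in days.
# The day counts are assumptions (monthly = 30, quarterly = 91, annually = 365).
# The continuous vulnerability scan has no interval, so it is not tracked here.
CADENCE_DAYS = {
    "License heartbeat health check": 1,
    "Audit log review (focus events)": 7,
    "Backup verification (restore-from-yesterday drill)": 30,
    "Disaster recovery exercise": 91,
    "Minor version upgrade": 91,
    "Major version upgrade": 365,
}


def overdue(last_done: dict[str, date], today: date) -> list[str]:
    """Return activities that were never recorded or are past their interval."""
    late = []
    for activity, days in CADENCE_DAYS.items():
        done = last_done.get(activity)
        if done is None or today - done > timedelta(days=days):
            late.append(activity)
    return late


if __name__ == "__main__":
    today = date(2024, 6, 1)
    # Hypothetical completion record: everything done today except the audit review.
    record = {name: today for name in CADENCE_DAYS}
    record["Audit log review (focus events)"] = date(2024, 5, 20)
    print(overdue(record, today))
```

A check like this could run alongside the daily automated heartbeat check, so that missed drills surface without anyone having to remember them.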
What requires planned downtime
Most operations are non-disruptive. The exceptions:
- Major version upgrades that touch the OLAP warehouse (StarRocks) typically require a brief read-only window while indexes rebuild.
- Encryption-strategy changes (LOCAL → BYOK → HYOK) require re-encryption of secrets in place — typically a few minutes for normal-sized vaults.
- Replacing PostgreSQL with a managed instance requires a one-time migration window.
Each is covered in the relevant runbook with the expected duration.
Where to look if something's wrong
- Check the audit log first; most failures show up there with a structured `*_FAILED` event.
- Check the platform metrics dashboard for the affected subsystem.
- Check the relevant subsystem's logs (workflow runtime, analytics, ontology, copilot).
- If the issue isn't visible in any of the above, contact Scrydon support with the audit log and dashboard screenshots.
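The first triage step can be sketched as a small filter over an exported audit log. This is a minimal sketch that assumes the log is available as JSON Lines with the event type in an `event` field. The export format, the field name, and the sample event names are all assumptions for illustration, not Scrydon's documented schema.

```python
import json
from collections import Counter
from typing import Iterable


def failed_events(lines: Iterable[str], type_field: str = "event") -> list[dict]:
    """Return parsed audit records whose event type ends with _FAILED."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        if str(record.get(type_field, "")).endswith("_FAILED"):
            hits.append(record)
    return hits


def summarize(records: list[dict], type_field: str = "event") -> Counter:
    """Count failures per event type to show which subsystem to inspect next."""
    return Counter(r[type_field] for r in records)


if __name__ == "__main__":
    # Hypothetical sample lines; real event names will differ.
    sample = [
        '{"event": "LICENSE_HEARTBEAT_FAILED", "subsystem": "license"}',
        '{"event": "WORKFLOW_RUN_COMPLETED", "subsystem": "workflow runtime"}',
        '{"event": "BACKUP_FAILED", "subsystem": "backup"}',
        "not json",
    ]
    for name, count in summarize(failed_events(sample)).items():
        print(name, count)
```

Grouping by event type first gives a quick indication of which subsystem dashboard and logs to open in the following steps.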