Observability
Metrics, logs, and traces Scrydon emits — and the SLOs to track.
Scrydon emits metrics, logs, and OpenTelemetry traces. This page documents what's available and the SLOs you should care about.
Metrics
Each subsystem exposes Prometheus-compatible metrics. The recommended dashboards group them by concern:
Platform dashboard
- auth.signin.success / auth.signin.failure: per minute, per provider.
- audit_log.events_per_minute: by action namespace.
- secrets.access.count: by strategy (LOCAL / BYOK / HYOK).
- permission_check.duration_p95: latency of the policy decision point.
Agentic dashboard
- workflow.runs_started / workflow.runs_completed / workflow.runs_failed: per minute, per workflow.
- workflow.run.duration_p95: by workflow.
- block.executed: count per block type.
- tool.call.duration_p95: by vendor.
- cortex.llm_call.tokens_in / cortex.llm_call.tokens_out: by model.
- cortex.llm_call.cost: by model.
- cortex.llm_call.latency_p95: by model.
Analytics dashboard
- managed_table.read.count: per minute, per table.
- managed_table.write.count: per minute, per table.
- managed_table.query.duration_p95: by table.
- policy.evaluation.duration_p95: Rego policy decision latency.
Voice dashboard
- voice.session.active: current count.
- voice.session.duration_p95: session length.
- voice.stt.latency_p95 / voice.tts.latency_p95: pipeline latency.
Suggested SLOs
| SLO | Target | Why |
|---|---|---|
| Sign-in success rate | ≥ 99.5% | Anything lower indicates an IdP / network issue. |
| Workflow run success rate | ≥ 99% | Per-workflow; some failure is expected for evaluator-gated runs. |
| Audit log forwarding lag | ≤ 60 s | Downstream audit tooling needs near-real-time data. |
| Permission check p95 | ≤ 50 ms | Higher latency cascades into every API call. |
| Managed table query p95 | ≤ 2 s | For dashboard-style queries; analytical queries can be longer by intent. |
These are guideline numbers — adjust to your workload.
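As a rough illustration of how the targets in the table could be checked against measured values, here is a minimal Python sketch. The `Slo` class, the dictionary keys, and the `success_rate` and `evaluate` helpers are hypothetical names made up for this example; they are not part of Scrydon. Only the target numbers come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float
    higher_is_better: bool  # True for success rates, False for latency or lag

    def is_met(self, observed: float) -> bool:
        # Success rates must reach the target; latency and lag must stay under it.
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


# Targets copied from the table above. The keys are hypothetical labels.
SLOS = {
    "signin_success_rate": Slo("Sign-in success rate", 0.995, True),
    "workflow_success_rate": Slo("Workflow run success rate", 0.99, True),
    "audit_forwarding_lag_s": Slo("Audit log forwarding lag (s)", 60.0, False),
    "permission_check_p95_ms": Slo("Permission check p95 (ms)", 50.0, False),
    "managed_table_query_p95_s": Slo("Managed table query p95 (s)", 2.0, False),
}


def success_rate(successes: int, failures: int) -> float:
    """Ratio of successes to all attempts; an empty window counts as healthy."""
    total = successes + failures
    return 1.0 if total == 0 else successes / total


def evaluate(observed: dict[str, float]) -> dict[str, bool]:
    """Map each observed SLO key to whether its target is currently met."""
    return {key: SLOS[key].is_met(value) for key, value in observed.items()}
```

For example, `evaluate({"signin_success_rate": success_rate(990, 10)})` reports the sign-in SLO as missed, because 99.0% is below the 99.5% target. In practice the inputs would come from whatever queries you run against the metrics listed earlier on this page.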
Tracing
OpenTelemetry traces are emitted for:
- Every API request (entry to exit).
- Every workflow run (parent span) with child spans per block execution.
- Every LLM call through Cortex (with model, provider, latency, cost attributes).
- Every managed-table read with the policy decision as an attribute.
Configuring Dapr distributed traces
Scrydon uses Dapr for service-to-service communication. By default,
Dapr spans are written to the sidecar logs (exporter: stdout) — they are not shipped to
a collector.
To send Dapr distributed traces to an OTLP-compatible collector (SigNoz, Grafana Tempo,
Jaeger, an OpenTelemetry Collector, Honeycomb, Datadog OTLP endpoint, etc.), set the
dapr.tracing values in your scrydon Helm chart:

    dapr:
      tracing:
        samplingRate: "1"   # "1" = 100%; "0" = disabled; fractional values supported
        exporter: otel      # stdout (default, spans to sidecar logs) | otel (ship to collector)
        otel:
          endpointAddress: "my-otel-collector.observability.svc.cluster.local:4317"
          protocol: grpc    # grpc | http
          isSecure: false   # true if the collector endpoint requires TLS

| Value | Default | Description |
|---|---|---|
| dapr.tracing.samplingRate | "1" | Dapr trace sampling rate. "1" = 100%, "0" = disabled. |
| dapr.tracing.exporter | stdout | stdout writes spans to sidecar logs (no collector needed). otel ships to an OTLP endpoint. |
| dapr.tracing.otel.endpointAddress | "" | Required when exporter: otel. The host:port of your OTLP-compatible collector. |
| dapr.tracing.otel.protocol | grpc | grpc (recommended) or http. |
| dapr.tracing.otel.isSecure | false | Set true when the collector endpoint requires TLS. |
Important: exporter: otel with an empty endpointAddress is an error at chart render
time — the chart fails closed to prevent misconfigured tracing silently dropping spans.
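If you want to catch the same mistake before running Helm, a pre-flight check along the following lines may help. This is a sketch only: the chart's own validation lives in its templates and is not shown on this page, and `validate_dapr_tracing` is a hypothetical helper. It takes the `dapr.tracing` mapping as a plain Python dict (for example, after loading your values file with a YAML parser). The rule that `samplingRate` must fall between 0 and 1 is an assumption inferred from the documented "0" and "1" values.

```python
def validate_dapr_tracing(tracing: dict) -> list[str]:
    """Return a list of problems found in a dapr.tracing values mapping."""
    errors: list[str] = []

    exporter = tracing.get("exporter", "stdout")  # documented default
    if exporter not in ("stdout", "otel"):
        errors.append(f"exporter must be 'stdout' or 'otel', got {exporter!r}")

    # Assumption: a valid sampling rate is a number between 0 and 1 inclusive.
    try:
        rate = float(tracing.get("samplingRate", "1"))
    except (TypeError, ValueError):
        errors.append("samplingRate must be numeric")
    else:
        if not 0.0 <= rate <= 1.0:
            errors.append("samplingRate must be between 0 and 1")

    if exporter == "otel":
        otel = tracing.get("otel") or {}
        # Mirrors the documented rule: otel requires a non-empty endpoint.
        if not str(otel.get("endpointAddress") or "").strip():
            errors.append("otel.endpointAddress is required when exporter is otel")
        if otel.get("protocol", "grpc") not in ("grpc", "http"):
            errors.append("otel.protocol must be 'grpc' or 'http'")

    return errors
```

An empty list means the mapping passed these checks. It does not guarantee the collector is reachable.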
Example — send to an OpenTelemetry Collector sidecar/DaemonSet:

    dapr:
      tracing:
        exporter: otel
        otel:
          endpointAddress: "otel-collector.monitoring.svc.cluster.local:4317"
          protocol: grpc
          isSecure: false

Example — send to Grafana Tempo:

    dapr:
      tracing:
        exporter: otel
        otel:
          endpointAddress: "tempo.monitoring.svc.cluster.local:4317"
          protocol: grpc
          isSecure: false

These examples use in-cluster addresses; adjust endpointAddress to match wherever your
collector is hosted. The exporter setting applies to all Dapr-enabled services in the
deployment.
To send application-layer (non-Dapr) spans to your collector (Jaeger, Tempo, Datadog,
Honeycomb, Lightstep, …), set the OTLP endpoint under observability.otlp.endpoint.
Log access
Logs are structured JSON, emitted to stdout. Configure your log pipeline (Loki, Cloud Logging, Datadog logs, …) to ingest pod logs in the Scrydon namespaces.
Important fields on every log line:
- service: which subsystem emitted it.
- level: info / warn / error / fatal.
- requestId: correlation ID across services.
- actorId (when in a user context): the user.
- organizationId (when in a tenant context): the tenant.
Logs never contain secret values or PII — those are redacted at emission. See Redaction.
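Because every line is JSON and carries requestId, following one request across services can be done with a small filter. The sketch below is illustrative only: `filter_logs` is a hypothetical helper, the sample records and their values are invented, and the severity ordering info < warn < error < fatal is an assumption based on the level names listed above.

```python
import json
from typing import Iterable, Iterator

# Assumed severity ordering for the documented level values.
LEVEL_ORDER = {"info": 0, "warn": 1, "error": 2, "fatal": 3}


def filter_logs(
    lines: Iterable[str], request_id: str, min_level: str = "info"
) -> Iterator[dict]:
    """Yield parsed log records for one requestId at or above min_level."""
    threshold = LEVEL_ORDER[min_level]
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(record, dict):
            continue
        if record.get("requestId") != request_id:
            continue
        # Unknown or missing levels rank below every known level.
        if LEVEL_ORDER.get(str(record.get("level")), -1) < threshold:
            continue
        yield record


# Invented sample lines using only the documented field names.
SAMPLE = [
    '{"service": "cortex", "level": "info", "requestId": "req-1"}',
    '{"service": "workflow", "level": "error", "requestId": "req-1", "organizationId": "org-9"}',
    '{"service": "auth", "level": "error", "requestId": "req-2"}',
    "not json",
]
```

Calling `list(filter_logs(SAMPLE, "req-1", min_level="warn"))` keeps only the workflow error record. The same function works on a file handle or any other iterable of lines from your log pipeline.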
Related
- SIEM forwarding — for the audit-event side.
- Audit logging — the queryable event log.