Observability

Metrics, logs, and traces Scrydon emits — and the SLOs to track.

Scrydon emits metrics, logs, and OpenTelemetry traces. This page documents what's available and the SLOs you should care about.

Metrics

Each subsystem exposes Prometheus-compatible metrics. The recommended dashboards group them by concern:

Platform dashboard

  • auth.signin.success / auth.signin.failure — per minute, per provider.
  • audit_log.events_per_minute — by action namespace.
  • secrets.access.count — by strategy (LOCAL / BYOK / HYOK).
  • permission_check.duration_p95 — latency of the policy decision point.

Agentic dashboard

  • workflow.runs_started / workflow.runs_completed / workflow.runs_failed — per minute, per workflow.
  • workflow.run.duration_p95 — by workflow.
  • block.executed — count per block type.
  • tool.call.duration_p95 — by vendor.
  • cortex.llm_call.tokens_in / cortex.llm_call.tokens_out — by model.
  • cortex.llm_call.cost — by model.
  • cortex.llm_call.latency_p95 — by model.

Analytics dashboard

  • managed_table.read.count — per minute, per table.
  • managed_table.write.count — per minute, per table.
  • managed_table.query.duration_p95 — by table.
  • policy.evaluation.duration_p95 — Rego policy decision latency.

Voice dashboard

  • voice.session.active — current count.
  • voice.session.duration_p95 — session length.
  • voice.stt.latency_p95 / voice.tts.latency_p95 — pipeline latency.

Suggested SLOs

| SLO | Target | Why |
| --- | --- | --- |
| Sign-in success rate | ≥ 99.5% | Anything lower indicates an IdP or network issue. |
| Workflow run success rate | ≥ 99% | Measured per workflow; some failures are expected for evaluator-gated runs. |
| Audit log forwarding lag | ≤ 60 s | Downstream audit tooling needs near-real-time data. |
| Permission check p95 | ≤ 50 ms | Higher latency cascades into every API call. |
| Managed table query p95 | ≤ 2 s | Applies to dashboard-style queries; analytical queries can take longer by design. |

These are guideline numbers — adjust to your workload.
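
As a rough sketch, the sign-in and permission-check guidelines could be expressed as Prometheus alerting rules along the following lines. The metric names are assumptions: they take the dotted names from this page and swap dots for underscores. The rules also assume that the sign-in metrics are counters and that the p95 metric is a gauge reported in seconds. Check the names, types, and units your deployment actually exposes before adapting anything like this.

```yaml
# Hypothetical alerting rules. Metric names, types, and units are assumptions.
groups:
  - name: scrydon-slo-sketch
    rules:
      - alert: SigninSuccessRateBelowTarget
        # Success ratio over 5 minutes, assuming both metrics are counters.
        expr: |
          sum(rate(auth_signin_success[5m]))
            /
          (sum(rate(auth_signin_success[5m])) + sum(rate(auth_signin_failure[5m])))
            < 0.995
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Sign-in success rate is below the 99.5% guideline"
      - alert: PermissionCheckP95AboveTarget
        # Assumes a gauge in seconds: 0.05 s = 50 ms.
        expr: max(permission_check_duration_p95) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Permission check p95 is above the 50 ms guideline"
```

The 5-minute rate window and the 10-minute `for` duration are arbitrary starting points, not values taken from Scrydon; tune them alongside the targets themselves.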

Tracing

OpenTelemetry traces are emitted for:

  • Every API request (entry to exit).
  • Every workflow run (parent span) with child spans per block execution.
  • Every LLM call through Cortex (with model, provider, latency, cost attributes).
  • Every managed-table read with the policy decision as an attribute.

Configuring Dapr distributed traces

Scrydon uses Dapr for service-to-service communication. By default, Dapr spans are written to the sidecar logs (exporter: stdout) — they are not shipped to a collector.

To send Dapr distributed traces to an OTLP-compatible collector (SigNoz, Grafana Tempo, Jaeger, an OpenTelemetry Collector, Honeycomb, Datadog OTLP endpoint, etc.), set the dapr.tracing values in your scrydon Helm chart:

dapr:
  tracing:
    samplingRate: "1"   # "1" = 100%; "0" = disabled; fractional values supported
    exporter: otel      # stdout (default, spans to sidecar logs) | otel (ship to collector)
    otel:
      endpointAddress: "my-otel-collector.observability.svc.cluster.local:4317"
      protocol: grpc    # grpc | http
      isSecure: false   # true if the collector endpoint requires TLS

| Value | Default | Description |
| --- | --- | --- |
| dapr.tracing.samplingRate | "1" | Dapr trace sampling rate. "1" = 100%, "0" = disabled. |
| dapr.tracing.exporter | stdout | stdout writes spans to sidecar logs (no collector needed). otel ships to an OTLP endpoint. |
| dapr.tracing.otel.endpointAddress | "" | Required when exporter: otel. The host:port of your OTLP-compatible collector. |
| dapr.tracing.otel.protocol | grpc | grpc (recommended) or http. |
| dapr.tracing.otel.isSecure | false | Set true when the collector endpoint requires TLS. |

Important: exporter: otel with an empty endpointAddress is an error at chart render time — the chart fails closed to prevent misconfigured tracing silently dropping spans.

Example — send to an OpenTelemetry Collector sidecar/DaemonSet:

dapr:
  tracing:
    exporter: otel
    otel:
      endpointAddress: "otel-collector.monitoring.svc.cluster.local:4317"
      protocol: grpc
      isSecure: false

Example — send to Grafana Tempo:

dapr:
  tracing:
    exporter: otel
    otel:
      endpointAddress: "tempo.monitoring.svc.cluster.local:4317"
      protocol: grpc
      isSecure: false

These examples use in-cluster addresses — adjust endpointAddress to match wherever your collector is hosted. The exporter setting applies to all Dapr-enabled services in the deployment.
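
For a collector hosted outside the cluster, the same values can be combined with TLS and a lower sampling rate. This is a sketch only: the hostname is a placeholder, and 10% sampling is an arbitrary illustration of a fractional samplingRate, not a recommendation.

```yaml
dapr:
  tracing:
    samplingRate: "0.1"   # fractional value: sample roughly 10% of traces
    exporter: otel
    otel:
      endpointAddress: "otlp.example.com:4317"   # placeholder host:port
      protocol: grpc
      isSecure: true      # the collector endpoint requires TLS
```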

To send application-layer (non-Dapr) spans to your collector (Jaeger, Tempo, Datadog, Honeycomb, Lightstep, …), set the OTLP endpoint under observability.otlp.endpoint.
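
In values-file form, that key path would look roughly like the following, assuming it lives in the same Helm values as the dapr.tracing settings. Only the observability.otlp.endpoint path comes from this page. The address is a placeholder, and whether the value takes a bare host:port or a full URL is an assumption to confirm against the chart's values reference.

```yaml
observability:
  otlp:
    # Placeholder address. The expected format (host:port vs. URL) is an assumption.
    endpoint: "otel-collector.monitoring.svc.cluster.local:4317"
```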

Log access

Logs are structured JSON, emitted to stdout. Configure your log pipeline (Loki, Cloud Logging, Datadog logs, …) to ingest pod logs in the Scrydon namespaces.

Important fields on every log line:

  • service: the subsystem that emitted the line.
  • level: info / warn / error / fatal.
  • requestId: correlation ID shared across services.
  • actorId: the acting user (present in a user context).
  • organizationId: the tenant (present in a tenant context).

Logs never contain secret values or PII — those are redacted at emission. See Redaction.
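
If your pipeline is Loki fed by Promtail, the fields listed above could be parsed with a JSON stage along these lines. This is a sketch that assumes Promtail's pipeline_stages syntax and a CRI container runtime such as containerd; it is not a Scrydon-provided configuration.

```yaml
# Fragment of a Promtail scrape config. Assumes one JSON object per log line.
pipeline_stages:
  - cri: {}            # unwrap the container runtime log format first
  - json:
      expressions:
        service: service
        level: level
  - labels:
      service:         # promote the extracted fields to Loki labels
      level:
```

Only service and level are promoted to labels here. requestId, actorId, and organizationId are likely to be high-cardinality, so it is usually safer to leave them in the log body and filter on them at query time.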
