
Realtime voice

The low-latency WebRTC pipeline for voice agents — STT, LLM, TTS streaming through one path.

The realtime voice subsystem runs voice agents end-to-end: microphone in, transcript through the LLM, audio out, with sub-second latency. It runs in the same namespace as the workflow engine and shares the same authorisation, audit, and integration registry.

Realtime voice pipeline — microphone audio streams to STT, transcript drives an LLM agent that may call tools mid-stream, generated text streams to TTS, audio returns to the speaker over WebRTC. Barge-in truncates output when the user starts speaking again.
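
The flow described above can be pictured as a chain of streams, each stage consuming the previous stage's output as it arrives. The following is only a minimal sketch under stated assumptions: `microphone`, `stt`, `agent`, `tts`, and `run_turn` are hypothetical names, every stage is a placeholder rather than a real provider call, and tool calls and WebRTC transport are left out.

```python
import asyncio
from typing import AsyncIterator, Iterable, List


async def microphone(frames: Iterable[bytes]) -> AsyncIterator[bytes]:
    """Stand-in for the WebRTC ingress: yields audio frames as they arrive."""
    for frame in frames:
        await asyncio.sleep(0)  # hand control back to the event loop between frames
        yield frame


async def stt(frames: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Placeholder STT: pretends each frame decodes to one transcript word."""
    async for frame in frames:
        yield frame.decode("utf-8")


async def agent(transcript: AsyncIterator[str]) -> AsyncIterator[str]:
    """Placeholder agent: collects the utterance, then streams a reply word by word."""
    words = [word async for word in transcript]
    for token in ["you", "said"] + words:
        yield token


async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Placeholder TTS: emits one 'audio' chunk per token as soon as it is available."""
    async for token in tokens:
        yield token.upper().encode("utf-8")


async def run_turn(frames: Iterable[bytes]) -> List[bytes]:
    """Chain the stages so each one consumes the previous stage's stream."""
    speaker: List[bytes] = []
    async for chunk in tts(agent(stt(microphone(frames)))):
        speaker.append(chunk)  # a real deployment would send this out over WebRTC
    return speaker


if __name__ == "__main__":
    # prints [b'YOU', b'SAID', b'HELLO', b'THERE']
    print(asyncio.run(run_turn([b"hello", b"there"])))
```

Because every stage is an async generator, TTS can start emitting audio as soon as the agent yields its first token instead of waiting for the full reply.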

What it does

Capability | Detail
--- | ---
WebRTC ingress | Low-latency bidirectional audio between the browser and your cluster.
Streaming STT | Speech-to-text via any installed STT-capable integration (Azure Speech, Whisper, Voxtral, …).
Agent loop | The transcript drives an Agent block run; tool calls are issued mid-stream.
Streaming TTS | Text-to-speech via any installed TTS-capable integration.
Barge-in | The user can interrupt the agent mid-sentence; audio output truncates cleanly.
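
One common way to get barge-in behaviour is to run audio playback as a cancellable task and cancel it when speech is detected. The sketch below illustrates that idea with `asyncio`; it is not a description of how the platform implements it. `play` and `agent_turn` are hypothetical names, and the `barge_in_after` timer stands in for a voice-activity detector.

```python
import asyncio
from typing import List, Optional


async def play(chunks: List[str], played: List[str], chunk_seconds: float) -> None:
    """Send TTS chunks one at a time; each await is a point where cancellation can land."""
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)  # stands in for writing one chunk to the audio track
        played.append(chunk)


async def agent_turn(
    chunks: List[str],
    barge_in_after: Optional[float] = None,
    chunk_seconds: float = 0.01,
) -> List[str]:
    """Play the agent's reply, truncating it if the user starts speaking."""
    played: List[str] = []
    playback = asyncio.create_task(play(chunks, played, chunk_seconds))
    if barge_in_after is not None:
        await asyncio.sleep(barge_in_after)  # assumption: a VAD would signal here instead
        playback.cancel()                    # stop output at the next chunk boundary
    try:
        await playback
    except asyncio.CancelledError:
        pass  # expected when the user interrupted
    return played


if __name__ == "__main__":
    reply = [f"chunk-{i}" for i in range(20)]
    heard = asyncio.run(agent_turn(reply, barge_in_after=0.035))
    print(f"user heard {len(heard)} of {len(reply)} chunks before interrupting")
```

Cancelling at a chunk boundary means the user hears a prefix of the reply and never a partial chunk, which is one reading of "truncates cleanly".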

How it routes capabilities

Voice agents follow the same capability-resolution model as everything else. Each call selects a provider through the integration registry:

  • STT — picked from installed STT-capable integrations. Default model is the org's STT default.
  • LLM — picked from installed LLM-capable integrations. Routed through Cortex.
  • TTS — picked from installed TTS-capable integrations. Default voice is the org's TTS default.

Each can be self-hosted or cloud. An org can run a fully on-cluster voice stack (e.g. Whisper + vLLM + a local TTS) or mix self-hosted STT with a cloud LLM and a cloud voice.
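
To make the selection rule concrete, here is a small sketch of capability resolution. It is an illustration under assumptions, not the registry's actual API: `Integration`, `resolve`, the `self_hosted_only` flag, and the `local-tts` integration name are hypothetical. The precedence shown (per-call choice, then org default, then first installed match) is also an assumption. Routing LLM calls through Cortex is not modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Integration:
    name: str
    capabilities: FrozenSet[str]  # e.g. {"stt"}, {"llm"}, {"stt", "tts"}
    self_hosted: bool


def resolve(
    installed: List[Integration],
    capability: str,
    org_default: Optional[str] = None,
    requested: Optional[str] = None,
    self_hosted_only: bool = False,
) -> Integration:
    """Pick a provider for one capability from the installed integrations."""
    candidates = [
        i for i in installed
        if capability in i.capabilities and (i.self_hosted or not self_hosted_only)
    ]
    if not candidates:
        raise LookupError(f"no installed integration provides {capability!r}")
    # Assumed precedence: explicit per-call choice, then the org default.
    for wanted in (requested, org_default):
        for integration in candidates:
            if integration.name == wanted:
                return integration
    return candidates[0]  # otherwise fall back to the first installed match


INSTALLED = [
    Integration("whisper", frozenset({"stt"}), self_hosted=True),
    Integration("azure-speech", frozenset({"stt", "tts"}), self_hosted=False),
    Integration("vllm", frozenset({"llm"}), self_hosted=True),
    Integration("local-tts", frozenset({"tts"}), self_hosted=True),  # hypothetical name
]

if __name__ == "__main__":
    # A fully on-cluster stack: restrict every capability to self-hosted providers.
    for cap in ("stt", "llm", "tts"):
        print(cap, "->", resolve(INSTALLED, cap, self_hosted_only=True).name)
```

With `self_hosted_only=False` and `org_default="azure-speech"`, the same call would return the cloud provider for STT and TTS, which corresponds to the mixed setup described above.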

Where it sits

Voice runs in the scrydon-agentic namespace as a sibling to the standard workflow engine. It depends on:

  • Platform — for user authentication and session validation.
  • Cortex — for every LLM call.
  • Integration registry — for STT and TTS provider selection.
  • Coturn / Traefik — for WebRTC signalling and TURN if you need NAT traversal.
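
For the NAT-traversal dependency, a browser client typically receives an ICE server list that points at the TURN server. The sketch below builds a dictionary in the shape of the standard WebRTC `RTCConfiguration` (`iceServers`, `urls`, `username`, `credential`). The `build_ice_config` helper, the hostname, and the credentials are hypothetical, and the ports are the conventional STUN/TURN defaults rather than values taken from this platform.

```python
import json
from typing import Optional


def build_ice_config(
    turn_host: Optional[str],
    username: str = "",
    credential: str = "",
) -> dict:
    """Return an RTCConfiguration-shaped dict; empty server list if no TURN is needed."""
    if turn_host is None:
        return {"iceServers": []}  # direct connectivity, no relay configured
    return {
        "iceServers": [
            # Coturn also answers STUN requests on its listening port.
            {"urls": [f"stun:{turn_host}:3478"]},
            {
                "urls": [
                    f"turn:{turn_host}:3478?transport=udp",   # plain TURN over UDP
                    f"turns:{turn_host}:5349?transport=tcp",  # TURN over TLS
                ],
                "username": username,
                "credential": credential,
            },
        ]
    }


if __name__ == "__main__":
    # Hypothetical host and short-lived credentials, for illustration only.
    config = build_ice_config("turn.example.internal", "voice-user", "s3cret")
    print(json.dumps(config, indent=2))
```

The resulting JSON could be handed to the browser's `RTCPeerConnection` constructor; how the platform actually distributes these settings is not covered here.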

Customer concerns

  • Latency. Self-hosted STT + LLM on co-located GPU yields the lowest end-to-end latency. Cloud providers add network RTT.
  • Egress. If your network policy denies outbound traffic to cloud voice / LLM endpoints, choose self-hosted variants for STT, the LLM, and TTS.
  • Audio retention. The platform does not retain raw audio by default. Transcripts follow your workflow's audit-log policy.
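
The latency trade-off can be reasoned about as a simple budget: time to first audio is roughly the sum of each stage's time to first output plus any network round trip to a cloud provider. The sketch below shows that arithmetic only. `Stage` and `time_to_first_audio` are hypothetical names, and all millisecond values are made-up placeholders for illustration, not measurements of any provider.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    first_output_ms: float       # time until the stage emits its first output
    network_rtt_ms: float = 0.0  # 0 for a co-located, self-hosted stage


def time_to_first_audio(stages: List[Stage]) -> float:
    """Sum per-stage time to first output plus any network round trip."""
    return sum(s.first_output_ms + s.network_rtt_ms for s in stages)


# Placeholder numbers, chosen only to show the arithmetic.
ON_CLUSTER = [
    Stage("stt", 150.0),
    Stage("llm", 250.0),
    Stage("tts", 120.0),
]
MIXED = [
    Stage("stt", 150.0),
    Stage("llm", 250.0, network_rtt_ms=80.0),  # cloud LLM adds a round trip
    Stage("tts", 120.0),
]

if __name__ == "__main__":
    print("on-cluster:", time_to_first_audio(ON_CLUSTER), "ms")  # 520.0 ms
    print("mixed:     ", time_to_first_audio(MIXED), "ms")       # 600.0 ms
```

Replacing the placeholders with your own measured values per stage gives a quick check on whether a given provider mix stays within a sub-second target.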