Realtime voice

The low-latency WebRTC pipeline for voice agents — STT, LLM, TTS streaming through one path.

The realtime voice subsystem runs voice agents end-to-end: microphone in, transcript through the LLM, audio out, with sub-second latency. It runs in the same namespace as the workflow engine and shares the same authorisation, audit, and integration registry.

Realtime voice pipeline — microphone audio streams to STT, transcript drives an LLM agent that may call tools mid-stream, generated text streams to TTS, audio returns to the speaker over WebRTC. Barge-in truncates output when the user starts speaking again.
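
The stages described above can be pictured as a chain of streaming transforms. The following is a minimal sketch, not the platform's API: `fake_stt`, `fake_agent`, `fake_tts`, and `run_pipeline` are hypothetical stand-ins, and a real deployment would call the installed integrations asynchronously over WebRTC.

```python
from typing import Iterable, Iterator


def fake_stt(frames: Iterable[bytes]) -> Iterator[str]:
    """Hypothetical STT stage: emits one transcript word per audio frame."""
    for frame in frames:
        yield frame.decode("utf-8")


def fake_agent(words: Iterable[str]) -> Iterator[str]:
    """Hypothetical agent stage: reads the utterance, then streams reply tokens."""
    utterance = " ".join(words)
    for token in f"You said: {utterance}".split():
        yield token


def fake_tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """Hypothetical TTS stage: emits one audio chunk per reply token."""
    for token in tokens:
        yield token.encode("utf-8")


def run_pipeline(frames: Iterable[bytes]) -> list[bytes]:
    """Chain the stages as generators; TTS emits audio as soon as the
    agent yields its first token, without waiting for the full reply."""
    return list(fake_tts(fake_agent(fake_stt(frames))))


audio_out = run_pipeline([b"hello", b"there"])
```

Generators are used here only to show that each stage consumes the previous stage's output incrementally; tool calls and barge-in are left out of this sketch.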

What it does

Capability	Detail
WebRTC ingress	Low-latency bidirectional audio between the browser and your cluster.
Streaming STT	Speech-to-text via any installed STT-capable integration (Azure Speech, Whisper, Voxtral, …).
Agent loop	The transcript drives an Agent block run; tool calls are issued mid-stream.
Streaming TTS	Text-to-speech via any installed TTS-capable integration.
Barge-in	The user can interrupt the agent mid-sentence; audio output truncates cleanly.
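
One way to picture barge-in is as an outbound audio queue that is flushed when the user starts speaking. The sketch below assumes a hypothetical `PlaybackBuffer` class; the names and the chunk-level granularity are illustrative and do not describe how the platform implements truncation.

```python
from collections import deque


class PlaybackBuffer:
    """Hypothetical outbound audio queue with barge-in support."""

    def __init__(self) -> None:
        self._chunks: deque[bytes] = deque()
        self.played: list[bytes] = []

    def enqueue(self, chunk: bytes) -> None:
        """Queue a synthesized audio chunk for playback."""
        self._chunks.append(chunk)

    def play_next(self) -> bool:
        """Send one queued chunk to the speaker; return False if none is left."""
        if not self._chunks:
            return False
        self.played.append(self._chunks.popleft())
        return True

    def barge_in(self) -> int:
        """Drop all unplayed audio and return how many chunks were discarded."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped


buffer = PlaybackBuffer()
for chunk in (b"Sure,", b"your", b"order", b"is"):
    buffer.enqueue(chunk)

buffer.play_next()           # the agent starts speaking
dropped = buffer.barge_in()  # the user interrupts: discard the rest
```

Audio that has already been played stays in `played`; only the queued remainder is dropped, which is what makes the cut-off clean rather than abrupt mid-chunk.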

How it routes capabilities

Voice agents follow the same capability-resolution model as everything else. Each call selects a provider through the integration registry:

STT — picked from installed STT-capable integrations. Default model is the org's STT default.
LLM — picked from installed LLM-capable integrations. Routed through Cortex.
TTS — picked from installed TTS-capable integrations. Default voice is the org's TTS default.

Each can be self-hosted or cloud. An org can run a fully on-cluster voice stack (e.g. Whisper + vLLM + a local TTS) or mix self-hosted STT with a cloud LLM and a cloud voice.
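
The resolution model above can be sketched as a lookup over installed integrations. This is an illustration under stated assumptions: the `Integration` type, the `resolve` function, the per-call `requested` override, and the integration names are all hypothetical, and the real registry may resolve providers differently.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Integration:
    """Hypothetical registry entry."""
    name: str
    capabilities: frozenset[str]  # e.g. {"stt"}, {"llm"}, {"stt", "tts"}
    self_hosted: bool


def resolve(
    capability: str,
    installed: Iterable[Integration],
    org_default: str,
    requested: Optional[str] = None,
) -> Integration:
    """Pick a provider: an explicit per-call request wins, else the org default."""
    candidates = {i.name: i for i in installed if capability in i.capabilities}
    choice = requested or org_default
    if choice not in candidates:
        raise LookupError(f"no installed {capability} integration named {choice!r}")
    return candidates[choice]


# Illustrative names only; they mirror the examples in the text.
installed = [
    Integration("whisper", frozenset({"stt"}), self_hosted=True),
    Integration("azure-speech", frozenset({"stt", "tts"}), self_hosted=False),
    Integration("vllm", frozenset({"llm"}), self_hosted=True),
]

stt = resolve("stt", installed, org_default="whisper")
override = resolve("stt", installed, org_default="whisper", requested="azure-speech")
```

Raising an error when the chosen provider is not installed, instead of silently falling back, is a design choice made for this sketch so that a misconfigured default is visible immediately.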

Where it sits

Voice runs in the scrydon-agentic namespace as a sibling to the standard workflow engine. It depends on:

Platform — for user authentication and session validation.
Cortex — for every LLM call.
Integration registry — for STT and TTS provider selection.
Coturn / Traefik — Traefik for WebRTC signalling, Coturn for TURN relay when clients need NAT traversal.
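
For the TURN dependency, a browser client is usually handed a list of ICE servers when it creates its peer connection. The sketch below builds a dictionary in the shape of the standard WebRTC `RTCConfiguration` (`iceServers` with `urls`, `username`, `credential`). The function name, host name, and credentials are hypothetical, and ports 3478 and 5349 are the conventional TURN and TURN-over-TLS defaults, not values confirmed for this platform.

```python
import json
from typing import Optional


def ice_servers(turn_host: Optional[str], username: str = "", credential: str = "") -> dict:
    """Build an RTCConfiguration-shaped dict.

    A TURN entry is included only when a host is given, i.e. when
    clients need a relay for NAT traversal.
    """
    servers = []
    if turn_host:
        servers.append({
            "urls": [
                f"turn:{turn_host}:3478?transport=udp",
                f"turns:{turn_host}:5349?transport=tcp",
            ],
            "username": username,
            "credential": credential,
        })
    return {"iceServers": servers}


# Hypothetical host and credentials, serialized for delivery to the browser.
config = ice_servers("turn.example.internal", "voice", "secret")
payload = json.dumps(config)
```

On the browser side, the parsed payload would be passed to the `RTCPeerConnection` constructor; with no TURN host, the list is empty and only direct candidates are used.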

Customer concerns

Latency. Self-hosted STT and LLM on co-located GPUs yield the lowest end-to-end latency. Each cloud-hosted stage adds network round-trip time (RTT).
Egress. If your network policy denies outbound traffic to cloud voice / LLM endpoints, choose self-hosted variants for STT, LLM, and TTS.
Audio retention. The platform does not retain raw audio by default. Transcripts follow your workflow's audit-log policy.
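
The egress concern can be checked mechanically before a voice agent is deployed. The following is a minimal sketch, assuming a hypothetical `egress_violations` helper and a stack described as a mapping from stage name to whether its provider is self-hosted; it is not a platform feature.

```python
def egress_violations(stack: dict[str, bool], egress_allowed: bool) -> list[str]:
    """Return the stages that need outbound access the policy denies.

    `stack` maps a stage name to True when its provider is self-hosted.
    """
    if egress_allowed:
        return []
    return sorted(stage for stage, self_hosted in stack.items() if not self_hosted)


# Illustrative stacks: one mixed, one fully on-cluster.
mixed = {"stt": True, "llm": False, "tts": False}
on_cluster = {"stt": True, "llm": True, "tts": True}

blocked = egress_violations(mixed, egress_allowed=False)
```

An empty result means every stage can run without leaving the cluster; any listed stage would have to be switched to a self-hosted integration under a deny-egress policy.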

Related

Integrations → Capabilities — STT / TTS capability definitions.
Architecture → Cortex — LLM gateway behind the voice loop.
Security → Audit logging — voice sessions emit the same audit events as text-based runs.
