Realtime voice
The low-latency WebRTC pipeline for voice agents — STT, LLM, TTS streaming through one path.
The realtime voice subsystem runs voice agents end-to-end: microphone in, transcript through the LLM, audio out, with sub-second latency. It runs in the same namespace as the workflow engine and shares the same authorisation, audit, and integration registry.
What it does
| Capability | Detail |
|---|---|
| WebRTC ingress | Low-latency bidirectional audio between the browser and your cluster. |
| Streaming STT | Speech-to-text via any installed STT-capable integration (Azure Speech, Whisper, Voxtral, …). |
| Agent loop | The transcript drives an Agent block run; tool calls are issued mid-stream. |
| Streaming TTS | Text-to-speech via any installed TTS-capable integration. |
| Barge-in | The user can interrupt the agent mid-sentence; audio output truncates cleanly. |
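The barge-in row can be illustrated with a small cancellation sketch. It is not the platform's implementation: `speak`, `run_turn`, and the timings are hypothetical stand-ins, and a real pipeline would truncate a WebRTC audio track rather than a Python list. It only shows the general pattern of racing playback against an interrupt signal and cancelling whichever loses.

```python
import asyncio


async def speak(chunks, played):
    """Stand-in for streaming TTS playback: 'plays' one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0.05)  # simulated playback time per chunk
        played.append(chunk)


async def run_turn(chunks, barge_in: asyncio.Event):
    """Play the agent reply until it finishes or the user interrupts."""
    played = []
    playback = asyncio.create_task(speak(chunks, played))
    interrupt = asyncio.create_task(barge_in.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = interrupt in done and playback not in done
    # Cancel whichever task is still pending, then let both settle.
    for task in (playback, interrupt):
        task.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played, interrupted


async def demo():
    barge_in = asyncio.Event()
    # Simulate the user starting to speak 120 ms into the reply.
    asyncio.get_running_loop().call_later(0.12, barge_in.set)
    return await run_turn(["Sure,", "your", "order", "ships", "Monday."], barge_in)


played, interrupted = asyncio.run(demo())
print(played, interrupted)
```

Only the chunks already played are kept, so whatever the agent records as "said" matches what the user actually heard before interrupting.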
How it routes capabilities
Voice agents follow the same capability-resolution model as everything else. Each call selects a provider through the integration registry:
- STT — picked from installed STT-capable integrations. Default model is the org's STT default.
- LLM — picked from installed LLM-capable integrations. Routed through Cortex.
- TTS — picked from installed TTS-capable integrations. Default voice is the org's TTS default.
Each capability can be self-hosted or cloud-hosted. An org can run a fully on-cluster voice stack (e.g. Whisper + vLLM + a local TTS) or mix self-hosted STT with a cloud LLM and a cloud voice.
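A minimal sketch of the resolution model described above, under stated assumptions: the `Integration` type, the `resolve` function, and the provider names are hypothetical and do not reflect the registry's real API. The sketch also simplifies the org default to a provider name per capability, whereas the text describes a default model (STT) and a default voice (TTS).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Integration:
    name: str                 # illustrative provider name
    capabilities: frozenset   # subset of {"stt", "llm", "tts"}
    self_hosted: bool


def resolve(capability: str,
            installed: list[Integration],
            org_defaults: dict[str, str],
            requested: Optional[str] = None) -> Integration:
    """Pick a provider: an explicit request wins, otherwise the org default."""
    candidates = {i.name: i for i in installed if capability in i.capabilities}
    if not candidates:
        raise LookupError(f"no installed integration provides {capability!r}")
    wanted = requested or org_defaults.get(capability)
    if wanted is None:
        raise LookupError(f"no request and no org default for {capability!r}")
    if wanted not in candidates:
        raise LookupError(f"{wanted!r} is not installed for {capability!r}")
    return candidates[wanted]


# Mixed stack: self-hosted STT and LLM, cloud TTS (all names are made up).
installed = [
    Integration("whisper", frozenset({"stt"}), self_hosted=True),
    Integration("vllm", frozenset({"llm"}), self_hosted=True),
    Integration("cloud-voice", frozenset({"tts"}), self_hosted=False),
]
org_defaults = {"stt": "whisper", "llm": "vllm", "tts": "cloud-voice"}

stack = {cap: resolve(cap, installed, org_defaults) for cap in ("stt", "llm", "tts")}
print({cap: i.name for cap, i in stack.items()})
```

Resolving per call rather than once per deployment is what lets a single org mix self-hosted and cloud providers across different voice agents.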
Where it sits
Voice runs in the scrydon-agentic namespace as a sibling to the standard workflow engine. It depends on:
- Platform — for user authentication and session validation.
- Cortex — for every LLM call.
- Integration registry — for STT and TTS provider selection.
- Coturn / Traefik — Traefik for WebRTC signalling; Coturn for TURN relay if you need NAT traversal.
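For the TURN dependency, the browser client typically needs an `iceServers` list in its WebRTC configuration. The sketch below builds one as plain JSON-serialisable data. The host name and credentials are placeholders, `ice_servers` is a hypothetical helper, and the ports are Coturn's usual defaults (3478 plain, 5349 TLS) rather than values confirmed for this platform.

```python
import json


def ice_servers(turn_host: str, username: str, credential: str,
                need_relay: bool) -> list[dict]:
    """Build an RTCConfiguration-style iceServers list for the browser client."""
    # STUN alone is enough when peers can reach each other directly.
    servers = [{"urls": f"stun:{turn_host}:3478"}]
    if need_relay:
        # TURN relays media when NAT or firewalls block a direct path.
        servers.append({
            "urls": [
                f"turn:{turn_host}:3478?transport=udp",
                f"turns:{turn_host}:5349?transport=tcp",
            ],
            "username": username,
            "credential": credential,
        })
    return servers


# Placeholder host and credentials, for illustration only.
config = {"iceServers": ice_servers("turn.example.internal", "demo-user",
                                    "demo-secret", need_relay=True)}
print(json.dumps(config, indent=2))
```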
Customer concerns
- Latency. Self-hosted STT and LLM on co-located GPUs yield the lowest end-to-end latency; each cloud provider in the path adds network round-trip time.
- Egress. If your network policy denies outbound traffic to cloud voice / LLM endpoints, choose self-hosted variants for STT, TTS, and the LLM.
- Audio retention. The platform does not retain raw audio by default. Transcripts follow your workflow's audit-log policy.
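The egress concern can be checked mechanically before a voice agent is deployed. The following is a sketch only: `Provider`, `egress_violations`, and the provider names are hypothetical, and the platform may expose this information differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    capability: str   # "stt", "llm" or "tts"
    name: str         # illustrative provider name
    self_hosted: bool


def egress_violations(stack: list[Provider], egress_allowed: bool) -> list[str]:
    """Return the providers that need outbound traffic the policy denies."""
    if egress_allowed:
        return []
    return [f"{p.capability}:{p.name}" for p in stack if not p.self_hosted]


stack = [
    Provider("stt", "whisper", self_hosted=True),
    Provider("llm", "cloud-llm", self_hosted=False),
    Provider("tts", "local-tts", self_hosted=True),
]
print(egress_violations(stack, egress_allowed=False))  # ['llm:cloud-llm']
```

An empty result means the selected stack stays entirely on-cluster.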
Related
- Integrations → Capabilities — STT / TTS capability definitions.
- Architecture → Cortex — LLM gateway behind the voice loop.
- Security → Audit logging — voice sessions emit the same audit events as text-based runs.