STT
Implement Speech-to-Text: batch upload plus optional realtime WebSocket streaming.
Batch mode is required; realtime is optional. Realtime sessions stream chunks over WebSocket or SSE.
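
As a rough illustration of what "streaming chunks" means in practice, the sketch below splits a buffer of audio bytes into fixed-size pieces that a realtime session could send one at a time. The helper name `chunkAudio` and the use of `Uint8Array` are assumptions made for this example; they are not part of the SDK.

```typescript
// Hypothetical helper (not part of the SDK): split audio bytes into
// fixed-size chunks suitable for sending over a realtime session.
function chunkAudio(audio: Uint8Array, chunkSize: number): Uint8Array[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < audio.length; offset += chunkSize) {
    // subarray() creates a view over the same memory, so no bytes are copied;
    // the final chunk may be shorter than chunkSize.
    chunks.push(audio.subarray(offset, offset + chunkSize));
  }
  return chunks;
}
```

With a session object that exposes a `send` method, like the one returned by `createSession` in the capability definition on this page, usage might look like `for (const chunk of chunkAudio(audio, 3200)) session.send(chunk);`. The value 3200 corresponds to 100 ms of 16 kHz, 16-bit mono PCM and is only an example; the right chunk size depends on the vendor.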
Define the capability

    import { defineCapabilitySTT } from "@scrydon/sdk-authoring/integrations/define";

    const sttCapability = defineCapabilitySTT({
      models: [
        { id: "whisper-v3", name: "Whisper V3", benchmarks: [{ name: "WER", score: 4.2, source: "Artificial Analysis" }] },
        { id: "fast-transcribe", name: "Fast Transcribe", benchmarks: [{ name: "WER", score: 5.1, source: "Internal" }] },
      ],
      defaultModel: "whisper-v3",

      // Batch mode: upload audio, get text back
      async transcribe(request) {
        const response = await fetch("https://api.example.com/v1/transcribe", {
          method: "POST",
          headers: { Authorization: `Bearer ${request.apiKey}` },
          body: request.audioData,
        });
        // Surface vendor errors instead of parsing an error body as a transcript
        if (!response.ok) {
          throw new Error(`Transcription failed: HTTP ${response.status}`);
        }
        const data = await response.json();
        return { transcript: data.text, language: data.language, confidence: data.confidence };
      },

      // Realtime streaming (optional)
      realtime: {
        protocol: "websocket",
        async createSession(config) {
          const ws = new WebSocket("wss://api.example.com/v1/realtime-stt");
          // Wait for the socket to open so send() never runs while it is still connecting
          await new Promise((resolve, reject) => {
            ws.addEventListener("open", resolve, { once: true });
            ws.addEventListener("error", reject, { once: true });
          });
          return {
            sessionId: crypto.randomUUID(),
            send: (chunk) => ws.send(chunk),
            onMessage: (handler) => ws.addEventListener("message", (e) => handler(e.data)),
            close: async () => ws.close(),
          };
        },
        features: { interimResults: true, endpointDetection: true, multiLanguage: false },
      },
    });

Model metadata
| Field | Type | Required | Description |
|---|---|---|---|
| `id` | `string` | yes | Unique model identifier (e.g. `"whisper-v3"`) |
| `name` | `string` | yes | Display name in the UI |
| `benchmarks` | `BenchmarkScore[]` | no | WER scores (lower is better) |
The standard STT benchmark is WER (Word Error Rate), a percentage where lower is better. Include `source` to indicate where the score was measured.
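
For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration; `wordErrorRate` is a hypothetical helper, not part of the SDK, and its tokenization (lowercasing and splitting on whitespace) is a simplifying assumption. Published benchmarks typically apply more careful text normalization.

```typescript
// Hypothetical helper: WER as a percentage, using a word-level Levenshtein distance.
// Tokenization here is deliberately naive (lowercase + whitespace split).
function wordErrorRate(reference: string, hypothesis: string): number {
  const tokenize = (s: string): string[] =>
    s.toLowerCase().split(/\s+/).filter(Boolean);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference transcript must contain at least one word");
  }
  // prev[j] holds the edit distance between the reference words processed
  // so far and the first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion: reference word missing from hypothesis
        curr[j - 1] + 1,    // insertion: extra word in hypothesis
        prev[j - 1] + cost, // match or substitution
      );
    }
    prev = curr;
  }
  return (prev[hyp.length] / ref.length) * 100;
}
```

For example, `wordErrorRate("the quick brown fox", "the quick fox")` returns 25: one deleted word out of four reference words.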
Realtime feature flags
| Flag | Meaning |
|---|---|
| `interimResults` | Vendor streams partial transcripts as audio arrives |
| `endpointDetection` | Vendor signals when the speaker stops |
| `multiLanguage` | A single session can detect / mix languages |
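
To show how a consumer might act on these flags, the sketch below accumulates a transcript from realtime messages. Everything about the message format is an assumption made for illustration: the JSON encoding, the `partial` / `final` / `endpoint` message types, and the `createTranscriptAccumulator` helper are hypothetical, and real vendors will differ.

```typescript
// Hypothetical vendor message shape; real vendors use their own formats.
type RealtimeMessage =
  | { type: "partial"; text: string } // interim transcript, may still change
  | { type: "final"; text: string }   // stable transcript segment
  | { type: "endpoint" };             // speaker stopped talking

interface RealtimeFeatures {
  interimResults: boolean;
  endpointDetection: boolean;
  multiLanguage: boolean;
}

// Hypothetical helper: builds up a transcript, honoring the feature flags.
function createTranscriptAccumulator(features: RealtimeFeatures) {
  const finals: string[] = [];
  let interim = "";
  let utteranceEnded = false;
  return {
    handle(raw: string): void {
      const msg = JSON.parse(raw) as RealtimeMessage;
      if (msg.type === "partial") {
        // Only track partials when the vendor is declared to send them.
        if (features.interimResults) interim = msg.text;
      } else if (msg.type === "final") {
        finals.push(msg.text);
        interim = ""; // the final segment supersedes any pending partial
      } else if (msg.type === "endpoint" && features.endpointDetection) {
        utteranceEnded = true;
      }
    },
    snapshot() {
      const text = [...finals, interim].filter(Boolean).join(" ");
      return { text, utteranceEnded };
    },
  };
}
```

An accumulator like this could be wired to a session with `session.onMessage((data) => accumulator.handle(data))`, assuming the vendor delivers JSON text frames. When `interimResults` is `false`, the snapshot only ever contains finalized segments.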