STT

Implement Speech-to-Text (STT): batch upload plus optional realtime streaming.

Batch mode is required; realtime is optional. Realtime sessions stream audio chunks over WebSocket or SSE.

Define the capability

import { defineCapabilitySTT } from "@scrydon/sdk-authoring/integrations/define";

const sttCapability = defineCapabilitySTT({
  models: [
    { id: "whisper-v3", name: "Whisper V3", benchmarks: [{ name: "WER", score: 4.2, source: "Artificial Analysis" }] },
    { id: "fast-transcribe", name: "Fast Transcribe", benchmarks: [{ name: "WER", score: 5.1, source: "Internal" }] },
  ],
  defaultModel: "whisper-v3",

  // Batch mode — upload audio, get text back
  async transcribe(request) {
    const response = await fetch("https://api.example.com/v1/transcribe", {
      method: "POST",
      headers: { Authorization: `Bearer ${request.apiKey}` },
      body: request.audioData,
    });
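    // Surface HTTP errors instead of trying to parse an error body as a transcript
    if (!response.ok) {
      throw new Error(`Transcription request failed: ${response.status} ${response.statusText}`);
    }
    const data = await response.json();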
    return { transcript: data.text, language: data.language, confidence: data.confidence };
  },

  // Realtime streaming (optional)
  realtime: {
    protocol: "websocket",
    async createSession(config) {
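      const ws = new WebSocket("wss://api.example.com/v1/realtime-stt");
      // Wait for the connection to open so the first send() cannot throw
      await new Promise((resolve, reject) => {
        ws.addEventListener("open", resolve, { once: true });
        ws.addEventListener("error", reject, { once: true });
      });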
      return {
        sessionId: crypto.randomUUID(),
        send: (chunk) => ws.send(chunk),
        onMessage: (handler) => ws.addEventListener("message", (e) => handler(e.data)),
        close: async () => ws.close(),
      };
    },
    features: { interimResults: true, endpointDetection: true, multiLanguage: false },
  },
});
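
The session object returned by `createSession` can be exercised without a vendor connection. The following is a minimal sketch, assuming the session shape shown in the example above (`sessionId`, `send`, `onMessage`, `close`). The `RealtimeSession` interface, `createFakeSession`, and `streamChunks` are hypothetical names introduced for illustration and are not part of the SDK.

```typescript
// Shape inferred from the example above; the SDK's real types may differ.
interface RealtimeSession {
  sessionId: string;
  send(chunk: Uint8Array): void;
  onMessage(handler: (data: unknown) => void): void;
  close(): Promise<void>;
}

// Hypothetical in-memory session standing in for a vendor WebSocket.
// It reports the byte length of every chunk it receives.
function createFakeSession(): RealtimeSession {
  const handlers: Array<(data: unknown) => void> = [];
  return {
    sessionId: "fake-session",
    send: (chunk) => {
      handlers.forEach((handler) => handler(`received ${chunk.byteLength} bytes`));
    },
    onMessage: (handler) => {
      handlers.push(handler);
    },
    close: async () => {
      handlers.length = 0; // drop listeners, as closing a socket would
    },
  };
}

// Hypothetical helper: register a listener, push every chunk, then close.
async function streamChunks(
  session: RealtimeSession,
  chunks: Uint8Array[],
): Promise<unknown[]> {
  const messages: unknown[] = [];
  session.onMessage((data) => {
    messages.push(data);
  });
  for (const chunk of chunks) {
    session.send(chunk);
  }
  await session.close();
  return messages;
}
```

A fake like this makes it possible to test the code that consumes a session separately from the vendor transport.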

Model metadata

Field	Type	Required	Description
`id`	string	yes	Unique model identifier (e.g. `"whisper-v3"`)
`name`	string	yes	Display name in the UI
`benchmarks`	`BenchmarkScore[]`	no	WER scores — lower is better

The standard STT benchmark is WER (Word Error Rate), a percentage where lower is better. Set `source` to indicate where the score was measured.
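
For reference, WER is conventionally computed as the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that calculation; `wordErrorRate` is a hypothetical helper, not an SDK function, and it splits on whitespace without normalizing case or punctuation.

```typescript
// WER as a percentage:
// (substitutions + deletions + insertions) / reference word count * 100
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] holds the edit distance between the reference words processed
  // so far and the first j hypothesis words (rolling-row Levenshtein).
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion: reference word missing from the hypothesis
        curr[j - 1] + 1, // insertion: extra word in the hypothesis
        prev[j - 1] + cost, // match or substitution
      );
    }
    prev = curr;
  }
  return (prev[hyp.length] / ref.length) * 100;
}
```

Under this definition a `score` of 4.2 corresponds to roughly four word errors per 100 reference words. WER can exceed 100 when the hypothesis contains many inserted words.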

Realtime feature flags

Flag	Meaning
`interimResults`	Vendor streams partial transcripts as audio arrives
`endpointDetection`	Vendor signals when the speaker stops
`multiLanguage`	A single session can detect / mix languages
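
One way a consumer might act on these flags is sketched below. The flag names come from the table above; everything else is an assumption made for illustration, including the `SttMessage` shape (vendors differ in how they label partial, final, and end-of-speech events), the `TranscriptState` type, and the `applyMessage` helper.

```typescript
// Flag names from the table above.
interface RealtimeFeatures {
  interimResults: boolean;
  endpointDetection: boolean;
  multiLanguage: boolean;
}

// Hypothetical normalized message shape; not defined by the SDK.
type SttMessage =
  | { type: "interim"; text: string }
  | { type: "final"; text: string }
  | { type: "endpoint" };

interface TranscriptState {
  committed: string[]; // finalized segments
  pending: string; // latest partial transcript, if any
  utteranceEnded: boolean;
}

// Fold one message into the transcript state, honoring the advertised flags.
function applyMessage(
  state: TranscriptState,
  msg: SttMessage,
  features: RealtimeFeatures,
): TranscriptState {
  switch (msg.type) {
    case "interim":
      // Ignore partials when the integration does not advertise them
      return features.interimResults ? { ...state, pending: msg.text } : state;
    case "final":
      // A final segment replaces whatever partial text was pending
      return {
        committed: [...state.committed, msg.text],
        pending: "",
        utteranceEnded: false,
      };
    case "endpoint":
      // Only trust end-of-speech signals when the flag is set
      return features.endpointDetection ? { ...state, utteranceEnded: true } : state;
  }
}
```

Keeping the flags as an explicit argument lets the same consumer code work with integrations that support only a subset of realtime features.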
