
STT

Implement Speech-to-Text (STT): batch upload plus optional realtime streaming.

Batch mode is required; realtime is optional. Realtime sessions stream audio chunks over WebSocket or SSE.

Define the capability

import { defineCapabilitySTT } from "@scrydon/sdk-authoring/integrations/define";

const sttCapability = defineCapabilitySTT({
  models: [
    { id: "whisper-v3", name: "Whisper V3", benchmarks: [{ name: "WER", score: 4.2, source: "Artificial Analysis" }] },
    { id: "fast-transcribe", name: "Fast Transcribe", benchmarks: [{ name: "WER", score: 5.1, source: "Internal" }] },
  ],
  defaultModel: "whisper-v3",

  // Batch mode: upload audio, get text back
  async transcribe(request) {
    const response = await fetch("https://api.example.com/v1/transcribe", {
      method: "POST",
      headers: { Authorization: `Bearer ${request.apiKey}` },
      body: request.audioData,
    });
    // Surface HTTP failures instead of parsing an error body as a transcript
    if (!response.ok) {
      throw new Error(`Transcription failed: ${response.status} ${response.statusText}`);
    }
    const data = await response.json();
    return { transcript: data.text, language: data.language, confidence: data.confidence };
  },

  // Realtime streaming (optional)
  realtime: {
    protocol: "websocket",
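    async createSession(config) {
      const ws = new WebSocket("wss://api.example.com/v1/realtime-stt");
      // Wait for the connection to open; send() throws while the socket is still connecting
      await new Promise((resolve, reject) => {
        ws.addEventListener("open", resolve, { once: true });
        ws.addEventListener("error", reject, { once: true });
      });
      return {
        sessionId: crypto.randomUUID(),
        send: (chunk) => ws.send(chunk),
        onMessage: (handler) => ws.addEventListener("message", (e) => handler(e.data)),
        close: async () => ws.close(),
      };
    },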
    features: { interimResults: true, endpointDetection: true, multiLanguage: false },
  },
});
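
As a rough illustration of how a caller might drive a realtime session, the sketch below pushes chunks through a session object and collects the messages that come back. The RealtimeSession type name, the streamChunks helper, and the in-memory fake session are all assumptions made for this example; only the four members (sessionId, send, onMessage, close) come from the createSession return value above.

```typescript
// Shape mirrors the object returned by createSession above; the type name is hypothetical.
interface RealtimeSession {
  sessionId: string;
  send: (chunk: Uint8Array) => void;
  onMessage: (handler: (data: string) => void) => void;
  close: () => Promise<void>;
}

// Send every chunk, collect whatever arrives, and always close the session,
// even if send() throws. A real vendor replies asynchronously, so a production
// caller would wait for the final transcript before closing.
async function streamChunks(session: RealtimeSession, chunks: Uint8Array[]): Promise<string[]> {
  const received: string[] = [];
  session.onMessage((data) => received.push(data));
  try {
    for (const chunk of chunks) session.send(chunk);
  } finally {
    await session.close();
  }
  return received;
}

// In-memory stand-in for a vendor connection (illustration only):
// it answers each chunk synchronously with the chunk's size.
function createFakeSession(): RealtimeSession {
  const handlers: Array<(data: string) => void> = [];
  return {
    sessionId: "fake-session",
    send: (chunk) => handlers.forEach((h) => h(`received ${chunk.byteLength} bytes`)),
    onMessage: (handler) => { handlers.push(handler); },
    close: async () => { handlers.length = 0; },
  };
}
```

Closing in a finally block keeps a failed send from leaking the underlying connection.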

Model metadata

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | yes | Unique model identifier (e.g. "whisper-v3") |
| name | string | yes | Display name in the UI |
| benchmarks | BenchmarkScore[] | no | WER scores (lower is better) |

The standard STT benchmark is WER (Word Error Rate), a percentage where lower is better. Set the source field on each benchmark entry to indicate where the score was measured.
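
For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below is one minimal way to compute it. The wordErrorRate function is not part of the SDK, and it skips the text normalization (casing, punctuation) that published benchmarks usually apply.

```typescript
// Word-level Levenshtein distance between reference and hypothesis,
// divided by the reference word count, expressed as a percentage.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) throw new Error("reference must contain at least one word");

  // prev[j] = edit distance between the reference words seen so far
  // and the first j hypothesis words
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // match or substitution
      );
    }
    prev = curr;
  }
  return (prev[hyp.length] / ref.length) * 100;
}
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.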

Realtime feature flags

| Flag | Meaning |
| --- | --- |
| interimResults | Vendor streams partial transcripts as audio arrives |
| endpointDetection | Vendor signals when the speaker stops |
| multiLanguage | A single session can detect / mix languages |
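
When interimResults is enabled, a consumer typically has to tell provisional text apart from committed text. The sketch below shows one way to do that, assuming a hypothetical message shape with text and isFinal fields. Real vendors define their own payloads, so both the TranscriptMessage type and the TranscriptAccumulator class are illustrative, not part of the SDK.

```typescript
// Hypothetical message shape; each vendor defines its own payload.
interface TranscriptMessage {
  text: string;
  isFinal: boolean;
}

// Keeps committed (final) segments separate from the current interim guess.
class TranscriptAccumulator {
  private committed: string[] = [];
  private pending = "";

  push(message: TranscriptMessage): void {
    if (message.isFinal) {
      // A final result commits the segment and clears the interim guess.
      this.committed.push(message.text);
      this.pending = "";
    } else {
      // Interim results supersede each other rather than append.
      this.pending = message.text;
    }
  }

  // Text to display: all final segments plus the latest interim guess.
  get text(): string {
    return [...this.committed, this.pending].filter(Boolean).join(" ");
  }
}
```

Replacing the interim text on every update, instead of appending it, prevents a revised partial result from being shown twice.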