
TTS

Implement Text-to-Speech: batch synthesis, plus optional realtime streaming.

Batch mode synthesises a complete audio buffer in a single request; realtime mode streams chunks in both directions over a persistent connection.

Define the capability

import { defineCapabilityTTS } from "@scrydon/sdk-authoring/integrations/define";

const ttsCapability = defineCapabilityTTS({
  models: [
    { id: "tts-standard", name: "Standard TTS", voices: 6, benchmarks: [{ name: "MOS", score: 3.5, source: "Internal" }] },
    { id: "tts-hd",       name: "HD TTS",       voices: 6, benchmarks: [{ name: "MOS", score: 4.2, source: "Internal" }] },
  ],
  defaultModel: "tts-standard",

  // Batch mode: send text, get audio back
  async synthesize(request) {
    const response = await fetch("https://api.example.com/v1/synthesize", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${request.apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        text: request.text,
        voice: request.voice,
        model: request.model,
        speed: request.speed,
      }),
    });
    // Surface HTTP failures instead of returning an error body as audio
    if (!response.ok) {
      throw new Error(`TTS request failed: ${response.status} ${response.statusText}`);
    }
    return {
      audioBuffer: await response.arrayBuffer(),
      format: "mp3",
      mimeType: "audio/mpeg",
    };
  },

  // Realtime streaming (optional)
  realtime: {
    protocol: "websocket",
    async createSession(config) {
      // Mirror the STT realtime pattern
    },
    features: { streamingInput: true, streamingOutput: true },
  },
});
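
Because realtime is optional, a caller may need to fall back to batch mode when a capability does not define it. The sketch below shows one way to make that choice. TTSCapabilityLike, RealtimeSupport, and chooseMode are hypothetical names for illustration, not part of the SDK; only the realtime and features fields come from the example above.

```typescript
// Hypothetical minimal view of a TTS capability (an assumption, not the SDK's type).
// Only the optional `realtime` block from the example above is modelled.
interface RealtimeSupport {
  protocol: string;
  features: { streamingInput: boolean; streamingOutput: boolean };
}

interface TTSCapabilityLike {
  realtime?: RealtimeSupport;
}

type TTSMode = "batch" | "realtime";

// Use realtime only when the caller wants streaming AND the capability
// declares streaming output; otherwise fall back to batch synthesis.
function chooseMode(capability: TTSCapabilityLike, wantsStreaming: boolean): TTSMode {
  const canStreamOut = capability.realtime?.features.streamingOutput === true;
  return wantsStreaming && canStreamOut ? "realtime" : "batch";
}

const batchOnly: TTSCapabilityLike = {};
const withRealtime: TTSCapabilityLike = {
  realtime: {
    protocol: "websocket",
    features: { streamingInput: true, streamingOutput: true },
  },
};

console.log(chooseMode(batchOnly, true));     // "batch"
console.log(chooseMode(withRealtime, true));  // "realtime"
console.log(chooseMode(withRealtime, false)); // "batch"
```

Falling back to batch keeps a batch-only integration usable even when the caller would prefer streaming.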

Model metadata

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | yes | Unique model identifier (e.g. "tts-hd") |
| name | string | yes | Display name in the UI |
| voices | number | no | Number of available voice presets |
| benchmarks | BenchmarkScore[] | no | MOS scores; higher is better |

The standard TTS benchmark is MOS (Mean Opinion Score, on a 1–5 scale); higher is better. The voices field tells the UI how many voice presets to surface.
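
As an illustration of how this metadata might be consumed, the sketch below picks the model with the highest MOS. The interface shapes are inferred from the example and the field list above and may not match the SDK's real types; TTSModelMetadata, mosOf, and bestByMos are hypothetical names.

```typescript
// Shapes inferred from this page (assumptions; the SDK's own types may differ).
interface BenchmarkScore {
  name: string;   // e.g. "MOS"
  score: number;
  source: string; // e.g. "Internal"
}

interface TTSModelMetadata {
  id: string;
  name: string;
  voices?: number;
  benchmarks?: BenchmarkScore[];
}

// Returns the model's MOS, or undefined when it is missing or outside the 1-5 scale.
function mosOf(model: TTSModelMetadata): number | undefined {
  const mos = model.benchmarks?.find((b) => b.name === "MOS")?.score;
  return mos !== undefined && mos >= 1 && mos <= 5 ? mos : undefined;
}

// Picks the model with the highest valid MOS; models without one are ignored.
function bestByMos(models: TTSModelMetadata[]): TTSModelMetadata | undefined {
  let best: TTSModelMetadata | undefined;
  let bestScore = -Infinity;
  for (const model of models) {
    const mos = mosOf(model);
    if (mos !== undefined && mos > bestScore) {
      best = model;
      bestScore = mos;
    }
  }
  return best;
}

const models: TTSModelMetadata[] = [
  { id: "tts-standard", name: "Standard TTS", voices: 6, benchmarks: [{ name: "MOS", score: 3.5, source: "Internal" }] },
  { id: "tts-hd", name: "HD TTS", voices: 6, benchmarks: [{ name: "MOS", score: 4.2, source: "Internal" }] },
];

console.log(bestByMos(models)?.id); // "tts-hd"
```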
