Live TTS (WebSocket)

Bidirectional WebSocket endpoint for real-time text-to-speech. Push text as you have it; receive audio as it’s synthesized, in strict segment order. Designed for live captions, narration over streaming LLM output, interactive voice apps — anywhere you want playback to start before the writer is finished writing.

wss://client.camb.ai/apis/live-tts/ws

Authenticate with your CambAI API key via the x-api-key header (or ?api_key=... query parameter for clients that can’t set headers).
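For clients that cannot set headers, the query-parameter form can be built as in the sketch below. The helper name build_ws_url is hypothetical; only the endpoint URL and the api_key parameter name come from the text above.

```python
from urllib.parse import urlencode

LIVE_TTS_URL = "wss://client.camb.ai/apis/live-tts/ws"


def build_ws_url(api_key: str, base_url: str = LIVE_TTS_URL) -> str:
    """Return the endpoint URL with the key passed as ?api_key=... (percent-encoded)."""
    # urlencode escapes characters that would otherwise break the query string.
    return f"{base_url}?{urlencode({'api_key': api_key})}"
```

Prefer the x-api-key header where your client allows it, since URLs are more likely than headers to end up in logs.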

Quickstart

A complete, copy-pasteable client. Connect → configure → stream text → write the audio to a file.

import asyncio
import json
import websockets


async def synthesize(api_key: str, text: str, out_path: str = "out.mp3") -> None:
    url = "wss://client.camb.ai/apis/live-tts/ws"
    async with websockets.connect(
        url,
        additional_headers=[("x-api-key", api_key)],
    ) as ws:
        # 1. First frame: session config.
        await ws.send(json.dumps({
            "type": "session.start",
            "voice_id": 6460,
            "language": "en-us",
            "output_format": "mp3",
            "word_timestamps": True,
        }))

        # 2. Server confirms with session.ready.
        ready = json.loads(await ws.recv())
        assert ready["type"] == "session.ready", ready
        print(f"session {ready['session_id']} run_id={ready['run_id']}")

        # 3. Stream text. You can call text.chunk many times; the server
        #    segments based on content (and on a 1s idle flush).
        await ws.send(json.dumps({"type": "text.chunk", "text": text}))
        await ws.send(json.dumps({"type": "text.done"}))

        # 4. Receive ordered audio + json frames until the session ends.
        with open(out_path, "wb") as f:
            async for msg in ws:
                if isinstance(msg, bytes):
                    f.write(msg)
                    continue

                frame = json.loads(msg)
                kind = frame["type"]
                if kind == "segment.start":
                    print(f"  segment {frame['segment_id']}: {frame['text']!r}")
                    for w in frame.get("word_timestamps", []):
                        print(f"    {w['start']:6.2f}s → {w['end']:6.2f}s  {w['word']}")
                elif kind == "segment.skipped":
                    print(f"  ! skipped segment {frame['segment_id']}: {frame['text']!r}")
                elif kind == "session.done":
                    print("done")
                    break
                elif kind == "session.error":
                    raise RuntimeError(frame["error"])


asyncio.run(synthesize(
    api_key="your-camb-api-key",
    text="Hello, world. This is a streaming text-to-speech demo.",
))

That’s the whole integration surface. Everything below is reference for the messages you’ll exchange.

Integration in 4 steps

Open the socket with your API key

async with websockets.connect(
    "wss://client.camb.ai/apis/live-tts/ws",
    additional_headers=[("x-api-key", "your-camb-api-key")],
) as ws:
    ...

Missing or invalid key → server closes with code 4401.

Send `session.start` as the first frame

{
  "type": "session.start",
  "voice_id": 6460,
  "language": "en-us",
  "output_format": "mp3",
  "word_timestamps": true,
  "idle_timeout": 1.0,

  "enhance_named_entities_pronunciation": false,
  "apply_enhancement": null,
  "enhance_reference_audio_quality": false,
  "maintain_source_accent": false,
  "speaking_rate": null,
  "inference_steps": null
}

voice_id is the only required field; everything else has a sensible default. The tuning knobs mirror the regular POST /tts-stream API one-for-one (enhance_named_entities_pronunciation, apply_enhancement, enhance_reference_audio_quality, maintain_source_accent, speaking_rate), so you can port a working /tts-stream payload directly. See Start Session (first frame) under Messages for field types and descriptions.

Wait for the session.ready reply (carries session_id and run_id). A malformed first frame, forbidden voice, or unsupported language → session.error then close 4400.
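Because only voice_id is required, a small builder can keep the first frame minimal. This is a sketch, not part of the API: build_session_start is a hypothetical helper, and it assumes that omitting a field has the same effect as sending it as null.

```python
import json
from typing import Any


def build_session_start(voice_id: int, **options: Any) -> str:
    """Serialize a session.start frame, leaving unset options to server defaults."""
    frame: dict[str, Any] = {"type": "session.start", "voice_id": voice_id}
    # Drop None values so the server applies its own default for that field
    # (assumption: an omitted field behaves like an explicit null).
    frame.update({k: v for k, v in options.items() if v is not None})
    return json.dumps(frame)
```

For example, build_session_start(6460, language="en-us", speaking_rate=None) produces a frame with only type, voice_id, and language.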

Stream text in

{"type": "text.chunk", "text": "Hello, "}
{"type": "text.chunk", "text": "world."}

Push as fast or as slowly as you like. The server segments by content (sentence boundaries) and idle-flushes after idle_timeout seconds of silence (default 1.0), so for live use cases (LLM token stream, transcribed mic input) you don’t need to send text.done until the session is truly over.

idle_timeout is only a fallback flush for trailing fragments without a boundary. A complete sentence (terminal punctuation, paragraph break, etc.) is flushed immediately; it never waits on idle_timeout. Bump the value on session.start (e.g. 2.5) if your producer routinely stalls mid-sentence (slower LLMs, token-level jitter) to avoid splitting one sentence across two segments.

Slow producers fragment sentences. If your LLM (or other source) is not producing text fast enough to land consecutive chunks within idle_timeout (default 1s), each chunk will be flushed as its own segment — even if together they would have formed a single sentence. The result is choppier audio and prosody that resets at each fragment boundary. Raise idle_timeout to cover the worst-case gap between your producer’s tokens.
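One way to choose a value is to measure the gaps between your producer's chunks and add headroom. The sketch below is a heuristic, not a documented rule: suggest_idle_timeout, the 1.25 margin, and the 1.0 floor are all assumptions for illustration.

```python
def suggest_idle_timeout(gaps_s: list[float], margin: float = 1.25, floor: float = 1.0) -> float:
    """Pick an idle_timeout that covers the largest observed gap, plus headroom."""
    if not gaps_s:
        return floor  # nothing measured yet: fall back to the default
    # Cover the worst-case gap with some margin, but never go below the floor.
    return round(max(floor, max(gaps_s) * margin), 2)
```

With observed gaps of 0.2 s and 2.0 s this suggests 2.5, which you would pass as idle_timeout on session.start. A larger value trades tail latency for fewer mid-sentence splits.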

Read ordered audio + lifecycle frames

For each segment N, the server emits, in order:

segment.start N → <binary audio chunks> → segment.done N

Segment N’s frames are completely emitted before any of segment N+1’s, even though synthesis runs concurrently behind the scenes. Concatenate the binary frames per segment_id and you have playable audio.

When everything is done you’ll receive session.done, followed by a clean close.
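The ordering guarantee can be checked against a recorded list of frames, which is handy in client tests. This is a minimal sketch; check_frame_order is a hypothetical helper that treats str messages as JSON frames and bytes messages as audio.

```python
import json
from typing import Iterable, Union


def check_frame_order(messages: Iterable[Union[bytes, str]]) -> list[int]:
    """Return completed segment IDs in arrival order; raise ValueError on interleaving."""
    completed: list[int] = []
    open_segment = None  # segment_id between segment.start and segment.done
    for msg in messages:
        if isinstance(msg, bytes):
            if open_segment is None:
                raise ValueError("binary frame outside a segment")
            continue
        frame = json.loads(msg)
        kind = frame["type"]
        if kind == "segment.start":
            if open_segment is not None:
                raise ValueError(f"segment {frame['segment_id']} started before {open_segment} finished")
            open_segment = frame["segment_id"]
        elif kind == "segment.done":
            if frame["segment_id"] != open_segment:
                raise ValueError(f"unexpected segment.done for {frame['segment_id']}")
            completed.append(open_segment)
            open_segment = None
    return completed
```

Other frame types (segment.skipped, session.done) are ignored by this check.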

Common patterns

Stream from an LLM

Push tokens straight from the model. You don’t need text.done between tokens; the idle flush handles in-flight buffering. Send text.done once, when the LLM is finished, to flush any tail and end the session cleanly.

async for token in llm_stream():
    await ws.send(json.dumps({"type": "text.chunk", "text": token}))

# LLM finished; flush any tail and end cleanly.
await ws.send(json.dumps({"type": "text.done"}))

Play audio while it’s still synthesizing

Hand each segment to your player as soon as segment.done arrives:

buffers: dict[int, bytearray] = {}
current_segment: int | None = None

async for msg in ws:
    if isinstance(msg, bytes):
        if current_segment is not None:
            buffers.setdefault(current_segment, bytearray()).extend(msg)
        continue

    frame = json.loads(msg)
    if frame["type"] == "segment.start":
        current_segment = frame["segment_id"]
    elif frame["type"] == "segment.done":
        sid = frame["segment_id"]
        player.enqueue(bytes(buffers.pop(sid)))   # play this segment
        current_segment = None
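    elif frame["type"] == "session.done":
        break
    elif frame["type"] == "session.error":
        raise RuntimeError(frame["error"])  # fatal; the server closes next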

Recover from a skipped segment

segment.skipped means TTS retries (3 by default, exponential backoff) were exhausted for that segment. The session keeps running — re-send the text in a new text.chunk if you need the audio:

if frame["type"] == "segment.skipped":
    await ws.send(json.dumps({"type": "text.chunk", "text": frame["text"]}))
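Re-sending unconditionally can loop forever if the same text keeps failing. A bounded variant might look like the sketch below; SkipRetryBudget and its default of 2 re-sends are hypothetical and not part of the API.

```python
from collections import Counter


class SkipRetryBudget:
    """Allow each skipped text to be re-sent at most `max_resends` times."""

    def __init__(self, max_resends: int = 2) -> None:
        self.max_resends = max_resends
        self._resends = Counter()  # text -> number of re-sends so far

    def should_resend(self, text: str) -> bool:
        """Return True and count the attempt if the budget for this text allows it."""
        if self._resends[text] >= self.max_resends:
            return False
        self._resends[text] += 1
        return True
```

In the receive loop you would call should_resend(frame["text"]) before sending the new text.chunk. Since a re-sent chunk is appended to the input like any other text, its audio will presumably arrive after text that was already queued.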

Word-level timestamps

Set "word_timestamps": true in session.start. When resolution succeeds, segment.start carries a word_timestamps array:

{
  "type": "segment.start",
  "segment_id": 0,
  "text": "Hello, world.",
  "word_timestamps": [
    {"word": "Hello", "start": 0.04, "end": 0.32},
    {"word": "world", "start": 0.38, "end": 0.71}
  ]
}

Word-timestamp failures (timeout, 5xx, network) are silently swallowed; the segment is still delivered without the word_timestamps field. Treat it as best-effort — don’t block playback on it.
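Since the field is best-effort, consumers should tolerate its absence. A minimal sketch for caption highlighting, using a hypothetical word_at helper and the segment-relative start/end values shown above:

```python
from typing import Optional


def word_at(frame: dict, t: float) -> Optional[str]:
    """Return the word being spoken t seconds into the segment, or None."""
    # .get() covers segments delivered without word_timestamps.
    for w in frame.get("word_timestamps", []):
        if w["start"] <= t < w["end"]:
            return w["word"]
    return None
```

Here t is the playback position within the current segment, so reset it whenever a new segment starts playing.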

Reference

The Messages section documents every message type and field. Quick lookup:

Close codes

Code	Reason
`4400`	Bad first frame, forbidden voice, or unsupported language.
`4401`	Missing or invalid API key.
`4402`	Insufficient credits.
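On the client side, the table can be turned into a lookup for logging or error reporting. This is a sketch: describe_close is a hypothetical helper, and it covers only the codes listed above.

```python
CLOSE_REASONS = {
    4400: "Bad first frame, forbidden voice, or unsupported language.",
    4401: "Missing or invalid API key.",
    4402: "Insufficient credits.",
}


def describe_close(code: int) -> str:
    """Translate a WebSocket close code into the reason from the table above."""
    # Codes outside the table (including a normal close) fall through to a generic message.
    return CLOSE_REASONS.get(code, f"Close code {code} is not listed in the table above.")
```

With the websockets library, the received close code is typically available on the ConnectionClosed exception (for example exc.rcvd.code when a close frame was received); verify this against the version you run.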

Auth & billing

API key auth is identical to the rest of /apis/*.
A TTS_API Run is created on session.start; its run_id is in session.ready and can be queried later via the standard run endpoints.
Credits are deducted per segment, immediately before that segment is synthesized. If you run out mid-session, the server emits a single session.error and closes with 4402.

Voice & language

Voice access uses the same rules as /tts-stream. The session is pinned to the mars-8.1-flash-beta speech model — see the streaming TTS docs for the supported BCP-47 locales. For best results, supply a reference voice in the same language/accent as language.

Server-side TTS retries

ConnectionError / TimeoutError / OSError / aiohttp.ClientError against the underlying TTS engine trigger up to 3 retries per segment with exponential backoff. On exhaustion the segment becomes segment.skipped (see Recover from a skipped segment above) and the rest of the session continues normally.

Messages

{
  "type": "<string>",
  "voice_id": 123,
  "language": "<string>",
  "output_format": "<string>",
  "word_timestamps": true,
  "idle_timeout": 123,
  "enhance_named_entities_pronunciation": true,
  "apply_enhancement": true,
  "enhance_reference_audio_quality": true,
  "maintain_source_accent": true,
  "speaking_rate": 123,
  "sample_rate": 123,
  "inference_steps": 123
}

Session Accepted

type:object

Sent immediately after session.start is accepted.

type

type:string

required

session.ready

session_id

type:string

required

run_id

type:integer

required

ID of the TTS_API Run created for this session.

config

type:object

required

Echo of the resolved session configuration (without reference_audio).

Segment Start

type:object

Marks the beginning of a synthesized segment. Followed by one or more binary audio frames and then segment.done.

type

type:string

required

segment.start

segment_id

type:integer

required

text

type:string

required

The exact text that produced this segment's audio.

word_timestamps

type:array

Per-word timing data. Present only when word_timestamps=true was set on session.start and resolution succeeded.

word

type:string

required

start

type:number

required

Start time in seconds, relative to the segment.

end

type:number

required

End time in seconds, relative to the segment.

Binary Audio Frame

type:string

Raw audio bytes for the current segment. Up to LIVE_TTS_AUDIO_FRAME_MAX_BYTES (default 65536) per frame.

Segment Done

type:object

All audio for the current segment has been emitted.

type

type:string

required

segment.done

segment_id

type:integer

required

Segment Skipped

type:object

TTS retries were exhausted for this segment. The session continues; resend the text via text.chunk if needed.

type

type:string

required

segment.skipped

segment_id

type:integer

required

text

type:string

required

Session Done

type:object

Pipeline drained, all segments emitted. Followed by a normal close.

type

type:string

required

session.done

Session Error

type:object

Fatal session-level error. Followed by a close with code 4400 / 4401 / 4402.

type

type:string

required

session.error

error

type:string

required

Start Session (first frame)

type:object

Must be the very first message sent on the WebSocket. Configures the synthesis run.

type

type:string

required

session.start

voice_id

type:integer

required

CambAI voice ID. Validated using the same rules as /tts-stream.

language

type:string

BCP-47 locale (e.g. en-us, hi-in, zh-cn). Must be supported by mars-8.1-flash-beta.

output_format

type:enum

Available options: mp3, wav, flac, aac

word_timestamps

type:boolean

When true, the server includes per-word timing data (word_timestamps) on each segment.start.

idle_timeout

type:number

Fallback flush, in seconds, for trailing text fragments that don't end in a sentence boundary. Complete sentences (terminal punctuation, paragraph break, etc.) are flushed immediately and never wait on this timer. Bump up (e.g. 2.5) when the producer stalls mid-sentence — slower LLMs, token-level jitter — to avoid splitting a sentence across two segments. Lower it for tighter tail-latency on live captioning / mic input.

enhance_named_entities_pronunciation

type:boolean

If true, improves pronunciation of names, brands, and other named entities. Mirrors /tts-stream.

apply_enhancement

type:boolean

If true, applies output audio enhancement (loudness, denoising, polish). Defaults to the speech-model's per-engine default when omitted (off for the speed-oriented mars-flash and mars-8.1-flash-beta models, on otherwise). Mirrors /tts-stream output_configuration.apply_enhancement.

enhance_reference_audio_quality

type:boolean

If true, removes noise/compression from the reference audio before cloning. Mirrors /tts-stream voice_settings.enhance_reference_audio_quality.

maintain_source_accent

type:boolean

If true, preserves the accent of the reference voice. Mirrors /tts-stream voice_settings.maintain_source_accent.

speaking_rate

type:number

Speech pace multiplier (e.g. 1.5). Mirrors /tts-stream voice_settings.speaking_rate. Pass-through to the TTS engine.

sample_rate

type:integer

Output sample rate in Hz. Mirrors /tts-stream output_configuration.sample_rate.

inference_steps

type:integer

TTS quality/latency knob.

Append Text

type:object

Push more text into the synthesis buffer. The server segments based on content, not chunk boundaries.

type

type:string

required

text.chunk

text

type:string

required

index

type:integer

Optional informational ordering hint.
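If you want to populate index, a running counter is enough. The sketch below uses a hypothetical indexed_chunks helper; skipping empty pieces is an assumption (token streams often yield empty strings), not a documented requirement.

```python
import json
from typing import Iterable, Iterator


def indexed_chunks(pieces: Iterable[str], start: int = 0) -> Iterator[str]:
    """Yield text.chunk frames with a running index, skipping empty pieces."""
    index = start
    for piece in pieces:
        if not piece:
            continue  # nothing to synthesize
        yield json.dumps({"type": "text.chunk", "text": piece, "index": index})
        index += 1
```

Each yielded string can be passed straight to ws.send().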

End of Input

type:object

Flush whatever is buffered and finish. Optional — the server also flushes after LIVE_TTS_IDLE_FLUSH_SECONDS (default 1s) of silence.

type

type:string

required

text.done

Last modified on May 21, 2026
