Live TTS (WebSocket)
Stream text in, receive synthesized speech audio + optional word-level timestamps in real time over a single WebSocket connection.
WSS
Bidirectional WebSocket endpoint for real-time text-to-speech. Push text as you have it; receive audio as it’s synthesized, in strict segment order. Designed for live captions, narration over streaming LLM output, interactive voice apps — anywhere you want playback to start before the writer is finished writing.
Authenticate with your CambAI API key via the x-api-key header (or the ?api_key=... query parameter for clients that can’t set headers).
That’s the whole integration surface. Everything below is reference for the four message types you’ll exchange.
Push as fast or as slowly as you like. The server segments by content (sentence boundaries) and idle-flushes after idle_timeout seconds of silence.
Word-timestamp failures (timeout, 5xx, network) are silently swallowed; the segment is still delivered without the word_timestamps field.
Quickstart
A complete, copy-pasteable client. Connect → configure → stream text → write the audio to a file.
Integration in 4 steps
Send `session.start` as the first frame
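A first frame could be built along the lines of the sketch below. The `type` envelope key and the helper name `build_session_start` are assumptions made for illustration; the option names are the ones this page mentions for session.start, and their types and defaults are not reproduced here.

```python
import json

# Tuning knobs this page lists as shared with POST /tts-stream.
TUNING_KNOBS = {
    "enhance_named_entities_pronunciation",
    "apply_enhancement",
    "enhance_reference_audio_quality",
    "maintain_source_accent",
    "speaking_rate",
}
# Other session.start fields mentioned elsewhere on this page.
SESSION_OPTIONS = TUNING_KNOBS | {"language", "idle_timeout", "word_timestamps"}


def build_session_start(voice_id, **options):
    """Serialize a session.start frame.

    ASSUMPTION: messages are JSON objects discriminated by a "type" key.
    """
    unknown = set(options) - SESSION_OPTIONS
    if unknown:
        raise ValueError(f"unrecognized session.start options: {sorted(unknown)}")
    frame = {"type": "session.start", "voice_id": voice_id}
    frame.update(options)
    return json.dumps(frame)
```

Rejecting unknown keys locally is a design choice of this sketch; it surfaces typos before the server has a chance to answer with session.error.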
voice_id is the only required field; everything else has a sensible default. The tuning knobs mirror the regular POST /tts-stream API one-for-one (enhance_named_entities_pronunciation, apply_enhancement, enhance_reference_audio_quality, maintain_source_accent, speaking_rate), so you can port a working /tts-stream payload directly. See the full reference at the top of the page for types and defaults.
Wait for the session.ready reply (carries session_id and run_id). A malformed first frame, forbidden voice, or unsupported language → session.error then close 4400.
Stream text in
The idle flush fires after idle_timeout seconds of silence (default 1.0), so for live use cases (LLM token stream, transcribed mic input) you don’t need to send text.done until the session is truly over.
idle_timeout is only a fallback flush for trailing fragments without a boundary. A complete sentence (terminal punctuation, paragraph break, etc.) is flushed immediately; it never waits on idle_timeout. Raise the value in session.start (e.g. 2.5) if your producer routinely stalls mid-sentence (slower LLMs, token-level jitter) to avoid splitting one sentence across two segments.
Read ordered audio + lifecycle frames
For each segment N, the server emits its frames in order: segment.start, the binary audio frames, then segment.done. Segment N’s frames are completely emitted before any of segment N+1’s, even though synthesis runs concurrently behind the scenes. Concatenate the binary frames per segment_id and you have playable audio.
When everything is done you’ll receive session.done, followed by a clean close.
Common patterns
Stream from an LLM
Push tokens straight from the model. Don’t call text.done; let the idle flush handle in-flight buffering, then close when the LLM is done.
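One way to sketch this pattern: wrap each token in its own text.chunk frame and send the frames as they are produced. The `type` and `text` keys are assumed field names, and `text_chunk_frames` is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Iterable, Iterator


def text_chunk_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one text.chunk frame per non-empty token, preserving order.

    ASSUMPTION: a text.chunk message is {"type": "text.chunk", "text": ...}.
    """
    for token in tokens:
        if token:  # skip empty deltas that some token streams emit
            yield json.dumps({"type": "text.chunk", "text": token})
```

Each yielded string would be sent over the open socket as soon as it is produced. No client-side sentence buffering is needed, since the server segments on content.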
Play audio while it’s still synthesizing
Hand each segment to your player as soon as segment.done arrives:
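A minimal sketch of that hand-off, relying on the strict per-segment ordering. It assumes JSON lifecycle frames carry `type` and `segment_id` keys and that binary frames belong to the most recently started segment; `SegmentPlayer` and its callback are hypothetical names.

```python
import json


class SegmentPlayer:
    """Buffer binary audio per segment and release it on segment.done."""

    def __init__(self, on_segment):
        self.on_segment = on_segment  # callback(segment_id, audio_bytes)
        self._current = None          # id of the segment being received
        self._buffers = {}            # segment_id -> bytearray

    def feed(self, frame):
        # Binary frame: append to the segment that is currently open.
        if isinstance(frame, (bytes, bytearray)):
            if self._current is not None:
                self._buffers[self._current] += frame
            return
        # Text frame: ASSUMED to be JSON with "type" and "segment_id".
        event = json.loads(frame)
        kind = event.get("type")
        if kind == "segment.start":
            self._current = event["segment_id"]
            self._buffers[self._current] = bytearray()
        elif kind == "segment.done":
            segment_id = event.get("segment_id", self._current)
            audio = self._buffers.pop(segment_id, bytearray())
            self._current = None
            self.on_segment(segment_id, bytes(audio))
```

Every frame read from the socket is passed to `feed`; the callback receives one complete segment at a time, so playback of segment N can begin while later segments are still being synthesized.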
Recover from a skipped segment
segment.skipped means TTS retries (3 by default, exponential backoff) were exhausted for that segment. The session keeps running — re-send the text in a new text.chunk if you need the audio:
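The sketch below shows one possible recovery loop. It assumes, without confirmation from this page, that the segment.skipped event echoes the skipped text in a `text` field; if it does not, the client would have to keep its own record of what it sent. `SkipRecovery` is a hypothetical helper, and the resend cap is a design choice of the sketch.

```python
import json


class SkipRecovery:
    """Turn segment.skipped events into re-send frames, with a resend cap."""

    def __init__(self, max_resends=1):
        self.max_resends = max_resends
        self._resends = {}  # text -> number of re-sends so far

    def handle(self, event):
        """Return a text.chunk frame to re-send, or None if nothing to do."""
        if event.get("type") != "segment.skipped":
            return None
        text = event.get("text")  # ASSUMED field; not confirmed by this page
        if not text:
            return None
        count = self._resends.get(text, 0)
        if count >= self.max_resends:
            return None  # give up rather than loop on a failing segment
        self._resends[text] = count + 1
        return json.dumps({"type": "text.chunk", "text": text})
```

Capping re-sends matters because the server has already retried the segment before reporting it as skipped; text that keeps failing is unlikely to succeed on an immediate resubmission.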
Word-level timestamps
Set "word_timestamps": true in session.start. When resolution succeeds, segment.start carries a word_timestamps array:
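A client might read the array defensively, as in this sketch. The entry shape (`word`, `start`, `end`) is an assumption made for illustration, since the exact fields are not shown here, and `extract_word_timings` is a hypothetical helper.

```python
def extract_word_timings(segment_start_event):
    """Return [(word, start, end), ...], or [] when timestamps are absent.

    ASSUMPTION: each entry looks like {"word": str, "start": float, "end": float}.
    """
    entries = segment_start_event.get("word_timestamps") or []
    return [(e["word"], e["start"], e["end"]) for e in entries]
```

Returning an empty list when the field is missing lets caption rendering degrade quietly while audio playback proceeds unchanged.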
Treat the word_timestamps field as best-effort; don’t block playback on it.
Reference
The AsyncAPI spec above documents every message type and field. Quick lookup:
Close codes
| Code | Reason |
|---|---|
| 4400 | Bad first frame, forbidden voice, or unsupported language. |
| 4401 | Missing or invalid API key. |
| 4402 | Insufficient credits. |
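
For logging or error reporting, the table can be mirrored in a small lookup. This is a sketch: `describe_close` is a hypothetical helper, and 1000 is the standard WebSocket normal-closure code rather than something this page specifies.

```python
# Application close codes, copied from the table above.
CLOSE_REASONS = {
    4400: "Bad first frame, forbidden voice, or unsupported language.",
    4401: "Missing or invalid API key.",
    4402: "Insufficient credits.",
}


def describe_close(code: int) -> str:
    """Map a WebSocket close code to a human-readable reason."""
    if code == 1000:  # standard WebSocket normal closure (RFC 6455)
        return "Normal closure."
    return CLOSE_REASONS.get(code, f"Unrecognized close code {code}.")
```

All three application codes point at something the caller has to change (the first frame, the key, or the credit balance), so reconnecting with the same inputs is unlikely to help.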
Auth & billing
- API key auth is identical to the rest of /apis/*.
- A TTS_APIRun is created on session.start; its run_id is in session.ready and can be queried later via the standard run endpoints.
- Credits are deducted per segment, immediately before that segment is synthesized. If you run out mid-session, the server emits a single session.error and closes with 4402.
Voice & language
Voice access uses the same rules as /tts-stream. The session is pinned to the mars-8.1-flash-beta speech model; see the streaming TTS docs for the supported BCP-47 locales. For best results, supply a reference voice in the same language/accent as language.
Server-side TTS retries
ConnectionError / TimeoutError / OSError / aiohttp.ClientError against the underlying TTS engine trigger up to 3 retries per segment with exponential backoff. On exhaustion the segment becomes segment.skipped (see Recover from a skipped segment above) and the rest of the session continues normally.