Realtime Speech To Speech
Stream microphone audio into a realtime translation session and receive transcripts, translated text, and translated audio.
WSS
Beta. The Speech to Speech WebSocket is generally available for testing but session events, configuration, and audio formats may change in backwards-incompatible ways before GA.
Served by realtime-api-server at wss://realtime.camb.ai/v1/realtime, separate from the /apis/live-tts/ws and /apis/transcription/listen WebSocket endpoints.
Uses the iris model.
Authenticate with the x-api-key WebSocket request header. If your client cannot set WebSocket headers, send credentials in the first session.update event instead.
Quickstart
Use the SDK (Python or TypeScript): it handles the session lifecycle (including the session.starting cold-boot wait), normalizes binary and base64 audio frames, and exposes typed events. Input and output audio are PCM16, mono, 24 kHz. The example below streams a WAV file and writes the translated speech to another WAV.
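The WAV side of that workflow can be sketched with Python's standard wave module. This covers only reading and writing PCM16, mono, 24 kHz audio; no SDK calls are shown, and the helper names read_pcm16_wav and write_pcm16_wav are hypothetical.

```python
import wave

SAMPLE_RATE = 24_000  # Hz, the rate stated for input and output audio
SAMPLE_WIDTH = 2      # bytes per sample (PCM16)
CHANNELS = 1          # mono


def read_pcm16_wav(path: str) -> bytes:
    """Return the raw PCM frames, rejecting files that are not PCM16 mono 24 kHz."""
    with wave.open(path, "rb") as wav:
        fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        if fmt != (CHANNELS, SAMPLE_WIDTH, SAMPLE_RATE):
            raise ValueError(f"expected PCM16 mono 24 kHz, got {fmt}")
        return wav.readframes(wav.getnframes())


def write_pcm16_wav(path: str, pcm: bytes) -> None:
    """Write raw PCM16 mono 24 kHz frames to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
```

Checking the format before streaming avoids sending audio the session is not documented to accept; resampling, if needed, would have to happen before this step.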
Integration in 4 steps
Open the realtime socket
Connect to wss://realtime.camb.ai/v1/realtime. This endpoint is not under the client.camb.ai/apis namespace used by the other WebSocket API references.
Send `session.update` as the first message
The first WebSocket message must be a JSON session.update event. The server waits up to 10 seconds for it. The server responds with session.created, then session.updated.
Stream input audio
Send microphone audio as base64-encoded bytes in input_audio_buffer.append. Each decoded audio payload can be up to 256 KiB. Only text WebSocket messages are parsed as realtime events.
Authentication
Prefer the WebSocket request header, x-api-key. The session.update event can also carry credentials in an auth object. If both the request header and the auth object are present, the request header credential is used.
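The handshake described above can be sketched without a network connection. The x-api-key header, the endpoint URL, and the session.created then session.updated order come from this page. The message envelope (a "type" field with the configuration nested under "session") is an assumption, and the function names are hypothetical. The shape of the auth object is not shown here, so it is left out.

```python
import json

REALTIME_URL = "wss://realtime.camb.ai/v1/realtime"


def auth_headers(api_key: str) -> dict[str, str]:
    """Headers to pass to a WebSocket client when opening REALTIME_URL."""
    return {"x-api-key": api_key}


def first_message(source_language: str, target_language: str) -> str:
    """Serialize the session.update event that must be the first message.

    ASSUMPTION: the "type"/"session" envelope is illustrative, not confirmed.
    """
    event = {
        "type": "session.update",
        "session": {
            "source_language": source_language,
            "target_language": target_language,
        },
    }
    return json.dumps(event)


def handshake_ok(server_event_types: list[str]) -> bool:
    """True when the server replied with session.created, then session.updated."""
    return server_event_types[:2] == ["session.created", "session.updated"]
```

A client would send first_message(...) right after connecting, since the server waits only 10 seconds for it.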
Reference
The AsyncAPI spec above documents every client and server event. Quick lookup:
Session configuration
| Field | Type | Required | Notes |
|---|---|---|---|
| source_language | string | Yes | Source language tag, for example en-us. Must be a supported language. |
| target_language | string | Yes | Target language tag, for example de-de. Must be a supported language. |
| output_modalities | string[] | No | Defaults to ["text", "audio"]. |
| voice | object | No | Output voice selection. Defaults to { "type": "default" }. See Voice selection. |
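As an illustration of the table above, the following sketch builds a session configuration with the documented defaults filled in explicitly. The function name build_session_config is hypothetical; only the four field names and their defaults come from the table.

```python
from typing import Any, Optional, Sequence


def build_session_config(
    source_language: str,
    target_language: str,
    output_modalities: Optional[Sequence[str]] = None,
    voice: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Return a session configuration dict, applying the documented defaults."""
    if not source_language or not target_language:
        raise ValueError("source_language and target_language are required")
    return {
        "source_language": source_language,
        "target_language": target_language,
        # Documented default: both text and audio output.
        "output_modalities": (
            list(output_modalities) if output_modalities is not None else ["text", "audio"]
        ),
        # Documented default: the built-in voice.
        "voice": voice if voice is not None else {"type": "default"},
    }
```

Spelling out the defaults is optional, since the server applies them when the fields are omitted, but it makes the effective configuration visible in client logs.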
Voice selection
By default, translated speech is synthesized with a built-in voice for the target language. To synthesize the translation with one of your own cloned voices, include a voice object in the session configuration:
| Field | Type | Required | Notes |
|---|---|---|---|
| type | string | Yes | "default" to use the built-in voice, or "cloned" to use one of your voices. |
| voice_id | integer | When type is "cloned" | ID of a voice you own. Get it from List Voices or Create Custom Voice. |
An invalid voice selection returns an error event. Omitting voice (or sending { "type": "default" }) uses the built-in voice. The resolved selection is echoed back in session.created.
If you use the SDK, pass voice_id (Python) or voiceId (TypeScript) to realtime.connect() and it builds this voice object for you.
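Without the SDK, the voice object can be assembled by hand from the two fields in the table above. This is a minimal sketch; the helper name build_voice is hypothetical and is not part of the SDK.

```python
from typing import Any, Optional


def build_voice(voice_id: Optional[int] = None) -> dict[str, Any]:
    """Return the voice object for the session configuration.

    No voice_id selects the built-in voice; an integer selects a cloned voice.
    """
    if voice_id is None:
        return {"type": "default"}
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(voice_id, bool) or not isinstance(voice_id, int):
        raise TypeError("voice_id must be an integer")
    return {"type": "cloned", "voice_id": voice_id}
```

For example, build_voice(1234) yields a cloned-voice selection, where 1234 stands in for an ID of a voice you own.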
Supported languages
source_language and target_language accept the BCP-47 tags below (case-insensitive). Pick any supported language as the source and any supported language as the target.
Supported realtime languages (14)
| Code | Language |
|---|---|
| ar-ae | Arabic (United Arab Emirates) |
| ar-eg | Arabic (Egypt) |
| ar-sa | Arabic (Saudi Arabia) |
| de-de | German (Germany) |
| en-gb | English (United Kingdom) |
| en-us | English (United States) |
| es-es | Spanish (Spain) |
| fr-ca | French (Canada) |
| fr-fr | French (France) |
| hi-in | Hindi (India) |
| ja-jp | Japanese (Japan) |
| ko-kr | Korean (Korea) |
| pt-br | Portuguese (Brazil) |
| zh-cn | Chinese (Mandarin, Simplified) |
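A client can check language tags against this list before opening a session. The sketch below mirrors the 14 codes in the table and lowercases input, since the tags are case-insensitive. The function name normalize_language is hypothetical, and the hard-coded set would need updating if the supported list changes.

```python
SUPPORTED_LANGUAGES = frozenset({
    "ar-ae", "ar-eg", "ar-sa", "de-de", "en-gb", "en-us", "es-es",
    "fr-ca", "fr-fr", "hi-in", "ja-jp", "ko-kr", "pt-br", "zh-cn",
})


def normalize_language(tag: str) -> str:
    """Lowercase a language tag and verify it is in the supported list."""
    normalized = tag.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported realtime language: {tag!r}")
    return normalized
```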
Client events
| Event | Support |
|---|---|
| session.update | Required first message. |
| input_audio_buffer.append | Supported after activation. |
| input_audio_buffer.clear | Recognized but not supported; returns error. |
| input_audio_buffer.commit | Recognized but not supported; returns error. |
| response.cancel | Recognized but not supported; returns error. |
Server events
| Event | Description |
|---|---|
| session.created | Session has been authorized, started, and activated. |
| session.updated | Active session configuration. |
| conversation.item.input_audio_transcription.completed | Completed user transcript. |
| response.text.delta | Additive translated text delta. |
| response.text.done | Final translated text. |
| response.audio.delta | Base64-encoded translated audio bytes. |
| response.audio.done | Current translated audio response is complete. |
| error | Unsupported recognized event or billing stop decision. |
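One way to consume these events is a small dispatcher that accumulates text deltas, decodes audio deltas, and records errors. This is a sketch under assumptions: the event names and error.message come from this page, but the "type", "delta", and "transcript" payload field names are guesses that should be checked against the AsyncAPI spec. TranslationState and handle_server_event are hypothetical names.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class TranslationState:
    transcripts: list[str] = field(default_factory=list)   # completed user transcripts
    translations: list[str] = field(default_factory=list)  # finished translated texts
    text: str = ""                                         # translation in progress
    audio: bytearray = field(default_factory=bytearray)    # decoded translated PCM
    errors: list[str] = field(default_factory=list)


def handle_server_event(state: TranslationState, raw: str) -> None:
    """Apply one JSON server event to the state."""
    event = json.loads(raw)
    kind = event.get("type")  # ASSUMPTION: discriminator field is "type"
    if kind == "conversation.item.input_audio_transcription.completed":
        state.transcripts.append(event.get("transcript", ""))  # ASSUMED field name
    elif kind == "response.text.delta":
        state.text += event.get("delta", "")  # deltas are additive; field name ASSUMED
    elif kind == "response.text.done":
        # Close out the current translation using the accumulated deltas.
        state.translations.append(state.text)
        state.text = ""
    elif kind == "response.audio.delta":
        state.audio.extend(base64.b64decode(event.get("delta", "")))  # ASSUMED field name
    elif kind == "error":
        state.errors.append(event.get("error", {}).get("message", ""))
```

Unknown event types are ignored on purpose, which keeps the client tolerant of event additions during the beta.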
Limits
| Limit | Value |
|---|---|
| Initial session.update timeout | 10 seconds |
| Maximum client event text size | 1 MiB |
| Maximum decoded audio payload per input_audio_buffer.append | 256 KiB |
| Audio encoding in input_audio_buffer.append.audio | Base64-encoded audio bytes |
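These limits translate into simple arithmetic for PCM16 mono 24 kHz audio: one second is 24,000 × 2 = 48,000 bytes, so a 256 KiB (262,144-byte) payload holds about 5.46 seconds. The sketch below splits raw PCM into input_audio_buffer.append events that stay under both limits. The audio field name comes from the table above, the "type" envelope field is an assumption, and append_events is a hypothetical helper. The 4,800-byte default (100 ms) is an arbitrary illustrative choice.

```python
import base64
import json
from typing import Iterator

MAX_DECODED_BYTES = 256 * 1024     # per input_audio_buffer.append
MAX_EVENT_TEXT_BYTES = 1024 * 1024  # per client event
BYTES_PER_SECOND = 24_000 * 2       # PCM16 mono at 24 kHz


def append_events(pcm: bytes, chunk_bytes: int = 4_800) -> Iterator[str]:
    """Yield JSON append events, each carrying at most chunk_bytes of audio."""
    if not 0 < chunk_bytes <= MAX_DECODED_BYTES or chunk_bytes % 2:
        raise ValueError("chunk_bytes must be even and between 2 and 256 KiB")
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        message = json.dumps({
            "type": "input_audio_buffer.append",  # ASSUMPTION: envelope field name
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
        # Base64 grows the payload by about 4/3, still well under 1 MiB here.
        if len(message.encode("utf-8")) > MAX_EVENT_TEXT_BYTES:
            raise ValueError("serialized event exceeds the 1 MiB text limit")
        yield message
```

Each yielded string would be sent as a text WebSocket message, since only text messages are parsed as realtime events.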
Billing
Active sessions are charged in billing windows and finalized on close, failure, or billing stop. If billing stops a session, the server sends an error event whose error.message is the billing close reason, then ends the realtime loop.