Realtime Speech To Speech

Beta. The Speech to Speech WebSocket is available for testing, but session events, configuration, and audio formats may change in backwards-incompatible ways before general availability (GA).

Bidirectional WebSocket endpoint for real-time speech translation. This endpoint is served by realtime-api-server at wss://realtime.camb.ai/v1/realtime, separate from the /apis/live-tts/ws and /streaming-transcription/listen WebSocket endpoints.

GET /v1/realtime
Host: realtime.camb.ai
x-api-key: <YOUR_API_KEY>

The realtime endpoint uses the low-latency iris model. Authenticate with the x-api-key WebSocket request header. If your client cannot set WebSocket headers, send credentials in the first session.update event instead.

Quickstart

Use the SDK (Python or TypeScript): it handles the session lifecycle (including the session.starting cold-boot wait), normalizes binary and base64 audio frames, and exposes typed events. Input and output audio are PCM16, mono, 24 kHz. The examples below, first in Python and then in TypeScript, stream a WAV file and save the translated speech. The Python version writes a WAV file; the TypeScript version writes raw PCM.

import asyncio
import os
import wave

from camb.client import CambAI
from camb.realtime import ServerEventType
from camb.live_transcription import FileAudioSource


async def main():
    client = CambAI(api_key=os.environ["CAMB_API_KEY"])
    session = await client.realtime.connect(
        source_language="en-us",
        target_language="de-de",
    )

    out_audio = bytearray()

    @session.on(ServerEventType.TEXT_DONE)
    def _(event):
        print("translation:", event.text)

    @session.on(ServerEventType.AUDIO_DELTA)
    def _(event):
        out_audio.extend(event.data)  # raw PCM16 mono 24 kHz

    async with session:
        await session.wait_until_ready()
        # Input WAV must be 16-bit PCM, mono, 24 kHz.
        await session.stream_audio(FileAudioSource("input_24k_mono.wav", real_time=True))

    with wave.open("translated.wav", "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(24000)
        out.writeframes(bytes(out_audio))


asyncio.run(main())

import fs from "node:fs";

import { CambClient, RealtimeServerEventType } from "@camb-ai/sdk";

const client = new CambClient({ apiKey: process.env.CAMB_API_KEY });
const session = await client.realtime.connect({
  sourceLanguage: "en-us",
  targetLanguage: "de-de",
});

const outChunks = [];
let resolveDone;
const audioDone = new Promise((r) => (resolveDone = r));

session.on(RealtimeServerEventType.TextDone, (event) =>
  console.log("translation:", event.text),
);
session.on(RealtimeServerEventType.AudioDelta, (event) =>
  outChunks.push(Buffer.from(event.data)), // raw PCM16 mono 24 kHz
);
session.on(RealtimeServerEventType.AudioDone, () => resolveDone());

await session.waitUntilReady();

// Input WAV must be 16-bit PCM, mono, 24 kHz. This skips the standard 44-byte
// header and streams the PCM data at real-time pace in 100 ms slices.
const data = fs.readFileSync("input_24k_mono.wav").subarray(44);
const chunkSize = 24000 * 2 * 0.1;
for (let i = 0; i < data.length; i += chunkSize) {
  await session.sendAudio(data.subarray(i, i + chunkSize));
  await new Promise((r) => setTimeout(r, 100));
}

await Promise.race([audioDone, new Promise((r) => setTimeout(r, 30_000))]);
await session.close();

// Raw PCM16 mono 24 kHz; wrap in a WAV header to play it (see the tutorial).
fs.writeFileSync("translated.pcm", Buffer.concat(outChunks));

See the Realtime Speech Translation tutorial for the microphone quickstart, the full event list, and configuration. The sections below document the underlying wire protocol for reference (for example, if you are building a client in a language without an SDK).

Integration in 4 steps

Open the realtime socket

Connect to wss://realtime.camb.ai/v1/realtime. This endpoint is not under the client.camb.ai/apis namespace used by the other WebSocket API references.

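import asyncio
import websockets  # older releases name the header keyword extra_headers

async def main():
    async with websockets.connect(
        "wss://realtime.camb.ai/v1/realtime",
        additional_headers=[("x-api-key", "YOUR_API_KEY")],
    ) as ws:
        ...  # send session.update as the first message

asyncio.run(main())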

Send `session.update` as the first message

The first WebSocket message must be a JSON session.update event. The server waits up to 10 seconds for it.

{
  "type": "session.update",
  "session": {
    "source_language": "en-us",
    "target_language": "de-de",
    "output_modalities": ["text", "audio"]
  }
}

The server responds with session.created, then session.updated.
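As a minimal sketch of this first message, the helper below builds the session.update payload as JSON text. The name build_session_update is hypothetical and not part of the SDK; only the field names come from the protocol described here.

```python
import json


def build_session_update(source_language, target_language, output_modalities=None):
    """Return the JSON text of the first client message (hypothetical helper)."""
    session = {
        "source_language": source_language,
        "target_language": target_language,
    }
    # output_modalities is optional; leave it out to rely on the server default.
    if output_modalities is not None:
        session["output_modalities"] = list(output_modalities)
    return json.dumps({"type": "session.update", "session": session})


first_message = build_session_update("en-us", "de-de", ["text", "audio"])
print(first_message)
```

With the websockets library, passing this string to ws.send() transmits it as a text message, which is the message type the server parses.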

Stream input audio

Send input audio (PCM16, mono, 24 kHz) as base64-encoded bytes in the audio field of input_audio_buffer.append. Send each event as a text WebSocket message; only text messages are parsed as realtime events.

{
  "type": "input_audio_buffer.append",
  "audio": "<base64_audio_bytes>"
}

Each decoded audio payload can be up to 256 KiB.
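The sketch below shows one way to slice raw PCM16 bytes into input_audio_buffer.append events. The helper name append_events and the 100 ms default slice are assumptions chosen for illustration (the TypeScript quickstart uses the same slice length); the format and the 256 KiB ceiling come from this page.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000          # PCM16 mono 24 kHz, per the Quickstart
BYTES_PER_SAMPLE = 2             # 16-bit samples
MAX_DECODED_BYTES = 256 * 1024   # per-event limit on the decoded payload


def append_events(pcm, chunk_ms=100):
    """Yield input_audio_buffer.append JSON strings for raw PCM16 bytes."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    chunk_bytes -= chunk_bytes % BYTES_PER_SAMPLE  # stay on a sample boundary
    if not 0 < chunk_bytes <= MAX_DECODED_BYTES:
        raise ValueError("chunk size must be between 1 byte and 256 KiB")
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
events = list(append_events(one_second_of_silence))
print(len(events))  # 10 events of 100 ms each
```

For live input, send each event as it is produced; for file input, pacing the sends at roughly the slice duration mimics a real-time source.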

Read translated output

Listen for transcript, translated text, and translated audio events. response.text.delta values are additive for the current response, and response.audio.delta contains base64-encoded synthesized audio bytes.
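A minimal sketch of the receiving side follows. It assumes "additive" means each response.text.delta is appended to the text accumulated so far for the current response, and that error events carry their text in error.message as described under Billing. The class name TranslationCollector is hypothetical.

```python
import base64
import json


class TranslationCollector:
    """Accumulate server events from one session (illustrative only)."""

    def __init__(self):
        self.transcripts = []     # completed source-language transcripts
        self.partial_text = ""    # translated text of the response in progress
        self.final_texts = []     # one entry per response.text.done
        self.audio = bytearray()  # decoded translated audio bytes
        self.errors = []

    def handle(self, raw_message):
        """Process one text message from the server and return its event type."""
        event = json.loads(raw_message)
        kind = event.get("type")
        if kind == "conversation.item.input_audio_transcription.completed":
            self.transcripts.append(event["transcript"])
        elif kind == "response.text.delta":
            self.partial_text += event["delta"]
        elif kind == "response.text.done":
            self.final_texts.append(event["text"])
            self.partial_text = ""
        elif kind == "response.audio.delta":
            self.audio.extend(base64.b64decode(event["delta"]))
        elif kind == "error":
            self.errors.append(event["error"]["message"])
        return kind
```

A client would call handle() on every text message it receives. Event types the collector does not recognize, such as session.created, are returned to the caller without changing its state.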

Authentication

Prefer the WebSocket request header:

x-api-key: <YOUR_API_KEY>

The initial session.update event can also carry credentials:

{
  "type": "session.update",
  "session": {
    "source_language": "en-us",
    "target_language": "de-de",
    "output_modalities": ["text", "audio"]
  },
  "auth": {
    "api_key": "<YOUR_API_KEY>"
  }
}

If both the request header and auth object are present, the request header credential is used.
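The two credential paths can be sketched as a single helper that returns the handshake headers and the first message. The function connection_plan and its can_set_headers flag are hypothetical names used only for illustration.

```python
def connection_plan(api_key, session, can_set_headers=True):
    """Return (headers, first_message) for one of the two credential paths."""
    message = {"type": "session.update", "session": dict(session)}
    if can_set_headers:
        # Preferred path: the key travels in the WebSocket request header.
        return [("x-api-key", api_key)], message
    # Fallback path: the key travels in the auth object of the first message.
    message["auth"] = {"api_key": api_key}
    return [], message


headers, first_message = connection_plan(
    "YOUR_API_KEY",
    {"source_language": "en-us", "target_language": "de-de"},
    can_set_headers=False,
)
print(headers, first_message["auth"])
```

Sending the key through only one path avoids relying on the precedence rule.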

Reference

The AsyncAPI message reference under Messages documents every client and server event. Quick lookup:

Session configuration

Field	Type	Required	Notes
`source_language`	string	Yes	Source language tag, for example `en-us`. Must be a supported language.
`target_language`	string	Yes	Target language tag, for example `de-de`. Must be a supported language.
`output_modalities`	string[]	No	Defaults to `["text", "audio"]`.
`voice`	object	No	Output voice selection. Defaults to `{ "type": "default" }`. See Voice selection.

Voice selection

By default, translated speech is synthesized with a built-in voice for the target language. To synthesize the translation with one of your own cloned voices, include a voice object in the session configuration:

{
  "type": "session.update",
  "session": {
    "source_language": "en-us",
    "target_language": "de-de",
    "output_modalities": ["text", "audio"],
    "voice": { "type": "cloned", "voice_id": 147320 }
  }
}

Field	Type	Required	Notes
`type`	string	Yes	`"default"` to use the built-in voice, or `"cloned"` to use one of your voices.
`voice_id`	integer	When `type` is `"cloned"`	ID of a voice you own. Get it from List Voices or Create Custom Voice.

The voice must belong to your account; stock (built-in) voice IDs are rejected with an error event. Omitting voice, or sending { "type": "default" }, uses the built-in voice. The resolved selection is echoed back in session.created.

If you use the SDK, pass voice_id (Python) or voiceId (TypeScript) to realtime.connect() and it builds this voice object for you:

session = await client.realtime.connect(
    source_language="en-us",
    target_language="de-de",
    voice_id=147320,
)

const session = await client.realtime.connect({
  sourceLanguage: "en-us",
  targetLanguage: "de-de",
  voiceId: 147320,
});

For the most natural-sounding results, choose a voice whose reference language matches your target_language. A large mismatch between the voice’s native language and the translation language can reduce clarity and accent accuracy.
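For clients that speak the wire protocol directly, the voice object can be built with a small helper such as the sketch below. The function name voice_object is hypothetical; the two shapes it returns are the ones documented in the table above.

```python
def voice_object(voice_id=None):
    """Build the session "voice" value: default, or one of your cloned voices."""
    if voice_id is None:
        return {"type": "default"}
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(voice_id, bool) or not isinstance(voice_id, int):
        raise TypeError("voice_id must be an integer")
    return {"type": "cloned", "voice_id": voice_id}


print(voice_object())        # {'type': 'default'}
print(voice_object(147320))  # {'type': 'cloned', 'voice_id': 147320}
```

The result goes under the voice key of the session object in session.update. Whether the ID belongs to your account is checked by the server, not by this helper.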

Supported languages

source_language and target_language accept the BCP-47 tags below (case-insensitive). Pick any supported language as the source and any supported language as the target.

Supported realtime languages (14)

Code	Language
`ar-ae`	Arabic (United Arab Emirates)
`ar-eg`	Arabic (Egypt)
`ar-sa`	Arabic (Saudi Arabia)
`de-de`	German (Germany)
`en-gb`	English (United Kingdom)
`en-us`	English (United States)
`es-es`	Spanish (Spain)
`fr-ca`	French (Canada)
`fr-fr`	French (France)
`hi-in`	Hindi (India)
`ja-jp`	Japanese (Japan)
`ko-kr`	Korean (Korea)
`pt-br`	Portuguese (Brazil)
`zh-cn`	Chinese (Mandarin, Simplified)
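A client can check language tags before opening a session. The sketch below copies the 14 codes from the table above and lowercases input, which is safe because the tags are case-insensitive. The names SUPPORTED_LANGUAGES and normalize_language are hypothetical.

```python
SUPPORTED_LANGUAGES = frozenset({
    "ar-ae", "ar-eg", "ar-sa", "de-de", "en-gb", "en-us", "es-es",
    "fr-ca", "fr-fr", "hi-in", "ja-jp", "ko-kr", "pt-br", "zh-cn",
})


def normalize_language(tag):
    """Return the lowercase form of a supported tag, or raise ValueError."""
    code = tag.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported realtime language: {tag!r}")
    return code


print(normalize_language("en-US"))  # en-us
```

Because the API is in beta, the hard-coded set may fall out of date; treat the server's response as the authority.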

Client events

Event	Support
`session.update`	Required first message.
`input_audio_buffer.append`	Supported after activation.
`input_audio_buffer.clear`	Recognized but not supported; returns `error`.
`input_audio_buffer.commit`	Recognized but not supported; returns `error`.
`response.cancel`	Recognized but not supported; returns `error`.

Server events

Event	Description
`session.created`	Session has been authorized, started, and activated.
`session.updated`	Active session configuration.
`conversation.item.input_audio_transcription.completed`	Completed user transcript.
`response.text.delta`	Additive translated text delta.
`response.text.done`	Final translated text.
`response.audio.delta`	Base64-encoded translated audio bytes.
`response.audio.done`	Current translated audio response is complete.
`error`	Unsupported recognized event or billing stop decision.

Limits

Limit	Value
Initial `session.update` timeout	10 seconds
Maximum client event text size	1 MiB
Maximum decoded audio payload per `input_audio_buffer.append`	256 KiB
Audio encoding in `input_audio_buffer.append.audio`	Base64-encoded audio bytes
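The arithmetic behind these limits can be worked through as follows. Base64 expands every 3 bytes into 4 characters, so a full 256 KiB payload fits well inside the 1 MiB event limit. The sketch assumes the 1 MiB limit is measured on the UTF-8 bytes of the JSON text, which this page does not state; the helper names are hypothetical.

```python
import base64
import json

MAX_EVENT_TEXT_BYTES = 1024 * 1024     # 1 MiB per client event
MAX_DECODED_AUDIO_BYTES = 256 * 1024   # 256 KiB of audio per append
BYTES_PER_SECOND = 24_000 * 2          # PCM16 mono at 24 kHz


def base64_length(n_bytes):
    """Length of the padded base64 text that encodes n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


def checked_append_event(pcm_chunk):
    """Build an append event, raising ValueError if a limit is exceeded."""
    if len(pcm_chunk) > MAX_DECODED_AUDIO_BYTES:
        raise ValueError("decoded audio payload exceeds 256 KiB")
    text = json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })
    if len(text.encode("utf-8")) > MAX_EVENT_TEXT_BYTES:
        raise ValueError("client event text exceeds 1 MiB")
    return text


print(base64_length(MAX_DECODED_AUDIO_BYTES))                 # 349528
print(round(MAX_DECODED_AUDIO_BYTES / BYTES_PER_SECOND, 2))   # 5.46
```

A maximum-size append therefore carries about 5.46 seconds of audio in roughly 341 KiB of base64 text. Much smaller slices, on the order of 100 ms, keep latency low.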

Billing

Active sessions are charged in billing windows and finalized on close, failure, or billing stop. If billing stops a session, the server sends an error event whose error.message is the billing close reason, then ends the realtime loop.

Messages

Session Created

Sent after authorization, startup, and activation complete.

Field	Type	Required	Notes
`type`	string	Yes	Always `session.created`.
`session`	object	Yes	Session configuration.
`session.source_language`	string	Yes	Source language tag, for example en-US.
`session.target_language`	string	Yes	Target language tag, for example de-DE.
`session.output_modalities`	string[]	No	Each item is `text` or `audio`.
`session.voice`	object	No	Output voice selection. Use the built-in voice or one of your cloned voices.

The event also includes a required string field that holds the durable realtime session ID.

Session Updated

Sent immediately after session.created with the active session configuration.

Field	Type	Required	Notes
`type`	string	Yes	Always `session.updated`.
`session`	object	Yes	Active session configuration.
`session.source_language`	string	Yes	Source language tag, for example en-US.
`session.target_language`	string	Yes	Target language tag, for example de-DE.
`session.output_modalities`	string[]	No	Each item is `text` or `audio`.
`session.voice`	object	No	Output voice selection. Use the built-in voice or one of your cloned voices.

Input Audio Transcription Completed

Completed user transcript produced by the realtime pipeline.

Field	Type	Required	Notes
`type`	string	Yes	Always `conversation.item.input_audio_transcription.completed`.
`transcript`	string	Yes	Completed user transcript.

Response Text Delta

Incremental translated text. The delta is additive for the current response.

Field	Type	Required	Notes
`type`	string	Yes	Always `response.text.delta`.
`delta`	string	Yes	Additive translated text delta for the current response.

Response Text Done

Final translated text for the current response.

Field	Type	Required	Notes
`type`	string	Yes	Always `response.text.done`.
`text`	string	Yes	Final translated text.

Response Audio Delta

Base64-encoded synthesized output audio bytes.

Field	Type	Required	Notes
`type`	string	Yes	Always `response.audio.delta`.
`delta`	string	Yes	Base64-encoded synthesized output audio bytes.

Response Audio Done

Current translated audio response is complete.

Field	Type	Required	Notes
`type`	string	Yes	Always `response.audio.done`.

Error

Structured error for unsupported recognized events and billing stop decisions.

Field	Type	Required	Notes
`type`	string	Yes	Always `error`.
`error`	object	Yes	Error details.
`error.message`	string	Yes	Error message. For a billing stop, this is the billing close reason.

Update Session

First client event. Authorizes and activates the realtime session.

Field	Type	Required	Notes
`type`	string	Yes	Always `session.update`.
`session`	object	Yes	Session configuration.
`session.source_language`	string	Yes	Source language tag, for example en-US.
`session.target_language`	string	Yes	Target language tag, for example de-DE.
`session.output_modalities`	string[]	No	Each item is `text` or `audio`.
`session.voice`	object	No	Output voice selection. Use the built-in voice or one of your cloned voices.
`auth`	object	No	Credentials for clients that cannot set WebSocket headers.
`auth.api_key`	string	When `auth` is present	Your API key.

Append Input Audio

Append base64-encoded microphone audio bytes to the realtime input stream.

Field	Type	Required	Notes
`type`	string	Yes	Always `input_audio_buffer.append`.
`audio`	string	Yes	Base64-encoded audio bytes. The decoded payload can be up to 256 KiB.

Clear Input Audio Buffer

Recognized but not supported in this version. The server responds with an error event.

Field	Type	Required	Notes
`type`	string	Yes	Always `input_audio_buffer.clear`.

Commit Input Audio Buffer

Recognized but not supported in this version. The server responds with an error event.

Field	Type	Required	Notes
`type`	string	Yes	Always `input_audio_buffer.commit`.

Cancel Response

Recognized but not supported in this version. The server responds with an error event.

Field	Type	Required	Notes
`type`	string	Yes	Always `response.cancel`.

Last modified on June 30, 2026
