WSS /v1/realtime
Beta. The Speech to Speech WebSocket is available for testing, but session events, configuration, and audio formats may change in backwards-incompatible ways before GA.
Bidirectional WebSocket endpoint for real-time speech translation. This endpoint is served by realtime-api-server at wss://realtime.camb.ai/v1/realtime, separate from the /apis/live-tts/ws and /apis/transcription/listen WebSocket endpoints.
GET /v1/realtime
Host: realtime.camb.ai
x-api-key: <YOUR_API_KEY>
The realtime endpoint uses the low-latency iris model. Authenticate with the x-api-key WebSocket request header. If your client cannot set WebSocket headers, send credentials in the first session.update event instead.

Quickstart

Use the SDK (Python or TypeScript): it handles the session lifecycle (including the session.starting cold-boot wait), normalizes binary and base64 audio frames, and exposes typed events. Input and output audio are PCM16, mono, 24 kHz. The example below streams a WAV file and writes the translated speech to another WAV.
import asyncio
import os
import wave

from camb.client import CambAI
from camb.realtime import ServerEventType
from camb.live_transcription import FileAudioSource


async def main():
    client = CambAI(api_key=os.environ["CAMB_API_KEY"])
    session = await client.realtime.connect(
        source_language="en-us",
        target_language="de-de",
    )

    out_audio = bytearray()

    @session.on(ServerEventType.TEXT_DONE)
    def _(event):
        print("translation:", event.text)

    @session.on(ServerEventType.AUDIO_DELTA)
    def _(event):
        out_audio.extend(event.data)  # raw PCM16 mono 24 kHz

    async with session:
        await session.wait_until_ready()
        # Input WAV must be 16-bit PCM, mono, 24 kHz.
        await session.stream_audio(FileAudioSource("input_24k_mono.wav", real_time=True))

    with wave.open("translated.wav", "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(24000)
        out.writeframes(bytes(out_audio))


asyncio.run(main())
See the Realtime Speech Translation tutorial for the microphone quickstart, the full event list, and configuration. The sections below document the underlying wire protocol for reference (for example, if you are building a client in a language without an SDK).

Integration in 4 steps

1. Open the realtime socket

Connect to wss://realtime.camb.ai/v1/realtime. This endpoint is not under the client.camb.ai/apis namespace used by the other WebSocket API references.
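import websockets  # the legacy websockets client names this argument extra_headers

# Run inside an async function.
async with websockets.connect(
    "wss://realtime.camb.ai/v1/realtime",
    additional_headers=[("x-api-key", "YOUR_API_KEY")],
) as ws:
    ...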
2. Send session.update as the first message

The first WebSocket message must be a JSON session.update event. The server waits up to 10 seconds for it.
{
  "type": "session.update",
  "session": {
    "source_language": "en-us",
    "target_language": "de-de",
    "output_modalities": ["text", "audio"]
  }
}
The server responds with session.created, then session.updated.
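The handshake described above can be sketched as a small wait loop. This is a minimal illustration, not the SDK's implementation: wait_for_session, its recv parameter, and the 30-second client-side timeout are hypothetical, and the loop simply skips any other event (such as session.starting) that arrives before session.updated.

```python
import asyncio
import json


async def wait_for_session(recv, timeout: float = 30.0) -> dict:
    """Wait for session.created followed by session.updated.

    recv is any zero-argument coroutine function that returns the next
    text message (for example, ws.recv from the websockets client).
    The timeout value is an assumption, not a documented limit.
    """

    async def _wait() -> dict:
        created = False
        while True:
            event = json.loads(await recv())
            kind = event.get("type")
            if kind == "error":
                # Only error.message is read; other fields are not assumed.
                raise RuntimeError(event.get("error", {}).get("message", "error"))
            if kind == "session.created":
                created = True
            elif kind == "session.updated" and created:
                return event  # carries the active session configuration

    return await asyncio.wait_for(_wait(), timeout)
```

Call it right after sending session.update, and start streaming audio only once it returns.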
3. Stream input audio

Send microphone audio as base64-encoded bytes in input_audio_buffer.append. Only text WebSocket messages are parsed as realtime events.
{
  "type": "input_audio_buffer.append",
  "audio": "<base64_audio_bytes>"
}
Each decoded audio payload can be up to 256 KiB.
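A minimal sketch of how a client might split raw PCM bytes into append events while staying under the 256 KiB ceiling. The helper name append_events and the 32 KiB default chunk size are illustrative choices, not part of the API; only the event shape and the size limit come from this page.

```python
import base64
import json

MAX_DECODED_BYTES = 256 * 1024  # documented ceiling on decoded audio per event


def append_events(pcm: bytes, chunk_bytes: int = 32 * 1024):
    """Yield JSON input_audio_buffer.append events for a PCM byte string.

    chunk_bytes is a hypothetical tuning knob; it must be even so that
    16-bit samples are never split across two events.
    """
    if not 0 < chunk_bytes <= MAX_DECODED_BYTES:
        raise ValueError("chunk_bytes must be between 1 byte and 256 KiB")
    if chunk_bytes % 2:
        raise ValueError("chunk_bytes must be even for 16-bit samples")
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each yielded string would be sent as a text WebSocket message, for example with await ws.send(event) in the websockets client.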
4. Read translated output

Listen for transcript, translated text, and translated audio events. response.text.delta values are additive for the current response, and response.audio.delta contains base64-encoded synthesized audio bytes.
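The sketch below shows one way to accumulate the output events for a single response. It is an illustration under a stated assumption: this page does not name the payload fields, so the code assumes both delta events carry their payload in a field called "delta". Check the message schemas before relying on it. ResponseAccumulator is a hypothetical helper, not an SDK class.

```python
import base64
import json


class ResponseAccumulator:
    """Collects translated text and audio for the current response."""

    def __init__(self):
        self.text = ""
        self.audio = bytearray()
        self.text_done = False
        self.audio_done = False

    def feed(self, raw_message: str) -> None:
        event = json.loads(raw_message)
        kind = event.get("type")
        if kind == "response.text.delta":
            # ASSUMPTION: the text increment is carried in a "delta" field.
            self.text += event["delta"]  # deltas are additive, so append
        elif kind == "response.audio.delta":
            # ASSUMPTION: the base64 audio is also carried in "delta".
            self.audio.extend(base64.b64decode(event["delta"]))
        elif kind == "response.text.done":
            self.text_done = True
        elif kind == "response.audio.done":
            self.audio_done = True
```

Because text deltas are additive, appending each one reconstructs the translation so far without waiting for response.text.done.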

Authentication

Prefer the WebSocket request header:
x-api-key: <YOUR_API_KEY>
The initial session.update event can also carry credentials:
{
  "type": "session.update",
  "session": {
    "source_language": "en-us",
    "target_language": "de-de",
    "output_modalities": ["text", "audio"]
  },
  "auth": {
    "api_key": "<YOUR_API_KEY>"
  }
}
If both the request header and auth object are present, the request header credential is used.
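For clients that cannot set WebSocket headers, the first message can be built with the auth object attached. The following is a minimal sketch; build_session_update is a hypothetical helper name, and the message shape is taken from the example above.

```python
import json
from typing import Optional


def build_session_update(source_language: str, target_language: str,
                         api_key: Optional[str] = None) -> str:
    """Return the first-message JSON for a realtime session.

    Pass api_key only when the client cannot send the x-api-key header;
    if both are present, the header credential is the one that is used.
    """
    message = {
        "type": "session.update",
        "session": {
            "source_language": source_language,
            "target_language": target_language,
            "output_modalities": ["text", "audio"],
        },
    }
    if api_key is not None:
        message["auth"] = {"api_key": api_key}
    return json.dumps(message)
```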

Reference

The AsyncAPI spec above documents every client and server event. Quick lookup:

Session configuration

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| source_language | string | Yes | Source language tag, for example en-us. Must be a supported language. |
| target_language | string | Yes | Target language tag, for example de-de. Must be a supported language. |
| output_modalities | string[] | No | Defaults to ["text", "audio"]. |
| voice | object | No | Output voice selection. Defaults to { "type": "default" }. See Voice selection. |

Voice selection

By default, translated speech is synthesized with a built-in voice for the target language. To synthesize the translation with one of your own cloned voices, include a voice object in the session configuration:
{
  "type": "session.update",
  "session": {
    "source_language": "en-us",
    "target_language": "de-de",
    "output_modalities": ["text", "audio"],
    "voice": { "type": "cloned", "voice_id": 147320 }
  }
}
| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| type | string | Yes | "default" to use the built-in voice, or "cloned" to use one of your voices. |
| voice_id | integer | When type is "cloned" | ID of a voice you own. Get it from List Voices or Create Custom Voice. |
The voice must belong to your account; stock/built-in voice IDs are rejected with an error event. Omitting voice (or sending { "type": "default" }) uses the built-in voice. The resolved selection is echoed back in session.created. If you use the SDK, pass voice_id (Python) or voiceId (TypeScript) to realtime.connect() and it builds this voice object for you:
session = await client.realtime.connect(
    source_language="en-us",
    target_language="de-de",
    voice_id=147320,
)
For the most natural-sounding results, choose a voice whose reference language matches your target_language. A large mismatch between the voice's native language and the translation language can reduce clarity and accent accuracy.
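If you are writing your own client, the voice object can be derived from an optional voice ID in the same spirit as the SDK parameter. This is a sketch of that mapping, not the SDK's actual code; voice_config is a hypothetical helper.

```python
from typing import Optional


def voice_config(voice_id: Optional[int] = None) -> dict:
    """Build the session "voice" object from an optional cloned-voice ID."""
    if voice_id is None:
        # Equivalent to omitting voice: the built-in voice is used.
        return {"type": "default"}
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(voice_id, bool) or not isinstance(voice_id, int):
        raise TypeError("voice_id must be an integer")
    return {"type": "cloned", "voice_id": voice_id}
```

The returned dictionary goes under session.voice in the session.update event. Ownership of the voice can only be checked by the server, which rejects IDs that do not belong to your account.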

Supported languages

source_language and target_language accept the BCP-47 tags below (case-insensitive). Pick any supported language as the source and any supported language as the target.
| Code | Language |
| --- | --- |
| ar-ae | Arabic (United Arab Emirates) |
| ar-eg | Arabic (Egypt) |
| ar-sa | Arabic (Saudi Arabia) |
| de-de | German (Germany) |
| en-gb | English (United Kingdom) |
| en-us | English (United States) |
| es-es | Spanish (Spain) |
| fr-ca | French (Canada) |
| fr-fr | French (France) |
| hi-in | Hindi (India) |
| ja-jp | Japanese (Japan) |
| ko-kr | Korean (Korea) |
| pt-br | Portuguese (Brazil) |
| zh-cn | Chinese (Mandarin, Simplified) |
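A client can catch unsupported tags before opening a socket. The sketch below hard-codes the list above, so it will drift if the server's list changes; normalize_language is a hypothetical helper, and trimming whitespace is a convenience added here rather than documented server behavior.

```python
SUPPORTED_LANGUAGES = frozenset({
    "ar-ae", "ar-eg", "ar-sa", "de-de", "en-gb", "en-us", "es-es",
    "fr-ca", "fr-fr", "hi-in", "ja-jp", "ko-kr", "pt-br", "zh-cn",
})


def normalize_language(tag: str) -> str:
    """Lowercase a language tag and check it against the supported list."""
    code = tag.strip().lower()  # tags are case-insensitive
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language tag: {tag!r}")
    return code
```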

Client events

| Event | Support |
| --- | --- |
| session.update | Required first message. |
| input_audio_buffer.append | Supported after activation. |
| input_audio_buffer.clear | Recognized but not supported; returns error. |
| input_audio_buffer.commit | Recognized but not supported; returns error. |
| response.cancel | Recognized but not supported; returns error. |

Server events

| Event | Description |
| --- | --- |
| session.created | Session has been authorized, started, and activated. |
| session.updated | Active session configuration. |
| conversation.item.input_audio_transcription.completed | Completed user transcript. |
| response.text.delta | Additive translated text delta. |
| response.text.done | Final translated text. |
| response.audio.delta | Base64-encoded translated audio bytes. |
| response.audio.done | Current translated audio response is complete. |
| error | Unsupported recognized event or billing stop decision. |

Limits

| Limit | Value |
| --- | --- |
| Initial session.update timeout | 10 seconds |
| Maximum client event text size | 1 MiB |
| Maximum decoded audio payload per input_audio_buffer.append | 256 KiB |
| Audio encoding in input_audio_buffer.append.audio | Base64-encoded audio bytes |
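To put these limits in concrete terms, the arithmetic below works out how much audio fits in one append event and how large its base64 form is. It assumes the wire audio uses the PCM16, mono, 24 kHz format stated in the Quickstart.

```python
import math

SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2                 # PCM16, mono
MAX_DECODED_BYTES = 256 * 1024       # per input_audio_buffer.append
MAX_EVENT_BYTES = 1024 * 1024        # per client event text

# Longest audio span that fits in a single append event.
max_seconds = MAX_DECODED_BYTES / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)

# Base64 emits 4 characters for every 3 input bytes, rounded up.
base64_chars = 4 * math.ceil(MAX_DECODED_BYTES / 3)

print(f"{max_seconds:.2f}")            # 5.46
print(base64_chars)                    # 349528
print(base64_chars < MAX_EVENT_BYTES)  # True
```

A full 256 KiB payload therefore holds about 5.46 seconds of audio and, once encoded, stays well under the 1 MiB event limit. Low-latency clients would normally send much smaller chunks.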

Billing

Active sessions are charged in billing windows and finalized on close, failure, or billing stop. If billing stops a session, the server sends an error event whose error.message is the billing close reason, then ends the realtime loop.
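A receive loop can surface error events as exceptions so that a billing stop is not silently ignored. This is a minimal sketch: RealtimeError and raise_on_error are hypothetical names, and only error.message is read because it is the only error field this page documents.

```python
import json


class RealtimeError(Exception):
    """Raised when the server sends an error event (hypothetical helper)."""


def raise_on_error(raw_message: str) -> dict:
    """Parse a server message, raising RealtimeError for error events."""
    event = json.loads(raw_message)
    if event.get("type") == "error":
        # For a billing stop, error.message is the billing close reason.
        message = event.get("error", {}).get("message", "unknown error")
        raise RealtimeError(message)
    return event
```

Error events are also sent for unsupported client events, so a caller may prefer to catch RealtimeError and continue in that case instead of closing the connection.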
Messages
Session Created
type:object

Sent after authorization, startup, and activation complete.

Session Updated
type:object

Sent immediately after session.created with the active session configuration.

Input Audio Transcription Completed
type:object

Completed user transcript produced by the realtime pipeline.

Response Text Delta
type:object

Incremental translated text. The delta is additive for the current response.

Response Text Done
type:object

Final translated text for the current response.

Response Audio Delta
type:object

Base64-encoded synthesized output audio bytes.

Response Audio Done
type:object

Current translated audio response is complete.

Error
type:object

Structured error for unsupported recognized events and billing stop decisions.

Update Session
type:object

First client event. Authorizes and activates the realtime session.

Append Input Audio
type:object

Append base64-encoded microphone audio bytes to the realtime input stream.

Clear Input Audio Buffer
type:object

Recognized but not supported in this version. The server responds with an error event.

Commit Input Audio Buffer
type:object

Recognized but not supported in this version. The server responds with an error event.

Cancel Response
type:object

Recognized but not supported in this version. The server responds with an error event.