Model Capabilities

Speech to Text

Transcribe audio files into text with a single API call, or stream audio in real time over WebSocket. The API supports 12 audio formats, word-level timestamps, multichannel transcription, and text formatting.

Quick Start

Transcribe an audio file with a single API call:

curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F format=true \
  -F language=en \
  -F "keyterm=Understand The Universe" \
  -F file=@audio.mp3

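import os
import requests

# Open the audio in a context manager so the handle is closed after the upload.
with open("audio.mp3", "rb") as audio:
    response = requests.post(
        "https://api.x.ai/v1/stt",
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        # requests encodes the `data` fields before `files`, so `file` is sent last.
        files={"file": ("audio.mp3", audio, "audio/mpeg")},
        data=[
            ("format", "true"),
            ("language", "en"),
            ("keyterm", "Understand The Universe"),
        ],
    )
response.raise_for_status()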

result = response.json()
print(result["text"])
print(f"Duration: {result['duration']}s")
for word in result.get("words", []):
    print(f"  {word['start']:.2f}s - {word['end']:.2f}s: {word['text']}")

import fs from "fs";

const formData = new FormData();
formData.append("format", "true");
formData.append("language", "en");
formData.append("keyterm", "Understand The Universe");
formData.append("file", new Blob([fs.readFileSync("audio.mp3")]), "audio.mp3");

const response = await fetch("https://api.x.ai/v1/stt", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.XAI_API_KEY}`,
  },
  body: formData,
});

if (!response.ok) throw new Error(`STT error ${response.status}`);

const result = await response.json();
console.log(result.text);
console.log(`Duration: ${result.duration}s`);
for (const word of result.words ?? []) {
  console.log(`  ${word.start.toFixed(2)}s - ${word.end.toFixed(2)}s: ${word.text}`);
}

Note: The file parameter must be provided after all other parameters in the multipart form.

Supported Languages

The model transcribes speech in any of the following languages regardless of the language parameter. Setting language, together with format=true, enables formatting of numbers, currencies, and units into their written form for that language.

Language	Code	Language	Code
Arabic	`ar`	Macedonian	`mk`
Czech	`cs`	Malay	`ms`
Danish	`da`	Persian	`fa`
Dutch	`nl`	Polish	`pl`
English	`en`	Portuguese	`pt`
Filipino	`fil`	Romanian	`ro`
French	`fr`	Russian	`ru`
German	`de`	Spanish	`es`
Hindi	`hi`	Swedish	`sv`
Indonesian	`id`	Thai	`th`
Italian	`it`	Turkish	`tr`
Japanese	`ja`	Vietnamese	`vi`
Korean	`ko`

Request Body

The request uses multipart/form-data. Either file or url must be provided.

Parameter	Type	Default	Required	Description
`audio_format`	string			Format hint for raw/headerless audio: `pcm`, `mulaw`, `alaw`. Container formats are auto-detected — do not set this field for MP3, WAV, etc.
`channels`	integer			Number of audio channels (2–8). Only required for multichannel raw audio. Auto-detected for container formats.
`diarize`	boolean	`false`		When `true`, enables speaker diarization. Each word in the response includes a `speaker` field (integer) identifying the detected speaker.
`file`	file		✓†	Audio file to transcribe. Max 500 MB. See Supported Audio Formats. Must be the last field in the multipart form.
`filler_words`	boolean	`false`		When `true`, filler words (e.g. "uh", "um", "er") are included in the transcript. When `false` (default), filler words are automatically removed from the transcript text and the `words` array.
`format`	boolean	`false`		When `true`, enables Inverse Text Normalization — converts spoken numbers/currency to written form (e.g. "one hundred dollars" → "$100"). Requires `language`.
`keyterm`	string			A key term to bias transcription toward (e.g. product names, proper nouns). Repeat the field for multiple terms (e.g. `keyterm=Understand+The+Universe`). Max 100 terms, each up to 50 characters.
`language`	string			Language code (e.g. `en`, `fr`, `de`). Used with `format=true` to enable text formatting. See Supported Languages.
`multichannel`	boolean	`false`		When `true`, transcribes each audio channel independently. Results returned in the `channels` array.
`sample_rate`	integer			Sample rate in Hz. Only required for raw audio (`pcm`, `mulaw`, `alaw`). Supported: `8000`, `16000`, `22050`, `24000`, `44100`, `48000`.
`url`	string		✓†	URL of an audio file to download and transcribe (server-side).

† Either file or url must be provided.
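The constraints in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the helper name `build_stt_fields` and its arguments are hypothetical, and it assumes boolean fields are sent as the strings `true` and `false`, as in the Quick Start examples.

```python
from typing import List, Optional, Tuple

MAX_KEYTERMS = 100       # "Max 100 terms" in the parameter table
MAX_KEYTERM_CHARS = 50   # "each up to 50 characters"


def build_stt_fields(
    language: Optional[str] = None,
    format_text: bool = False,
    keyterms: Optional[List[str]] = None,
    url: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Return ordered (name, value) form fields, excluding the file part.

    Hypothetical helper: it only enforces rules stated in the table above.
    """
    keyterms = keyterms or []
    if format_text and not language:
        raise ValueError("format=true requires language")
    if len(keyterms) > MAX_KEYTERMS:
        raise ValueError(f"at most {MAX_KEYTERMS} keyterms are allowed")
    for term in keyterms:
        if len(term) > MAX_KEYTERM_CHARS:
            raise ValueError(f"keyterm longer than {MAX_KEYTERM_CHARS} characters: {term!r}")

    fields: List[Tuple[str, str]] = []
    if format_text:
        fields.append(("format", "true"))
    if language:
        fields.append(("language", language))
    # The keyterm field is repeated once per term.
    for term in keyterms:
        fields.append(("keyterm", term))
    if url:
        fields.append(("url", url))
    return fields
```

The returned list can be passed as `data=` to `requests.post`, with the audio supplied through `files=` as in the Quick Start Python example, which keeps the file part last in the form.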

Example with text formatting

curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F format=true \
  -F language=en \
  -F "keyterm=Understand The Universe" \
  -F file=@audio.mp3

The file parameter must be provided after all other parameters in the multipart form.

Response

The response includes the full transcript, audio duration, and word-level timestamps.

{
  "text": "The balance is $167,983.15.",
  "language": "English",
  "duration": 3.45,
  "words": [
    { "text": "The", "start": 0.24, "end": 0.48 },
    { "text": "balance", "start": 0.48, "end": 0.96 },
    { "text": "is", "start": 0.96, "end": 1.12 },
    { "text": "$167,983.15.", "start": 1.12, "end": 3.20 }
  ]
}

Field	Type	Description
`channels`	array	Per-channel transcripts (only when `multichannel=true`). Each entry has `index`, `text`, and `words`.
`duration`	number	Audio duration in seconds, rounded to two decimal places.
`language`	string	Detected language name (e.g. `"English"`, `"French"`).
`text`	string	Full transcript text.
`words`	array	Word-level segments with `text`, `start`, `end`, and `speaker` (integer, only when `diarize=true`).
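When `diarize=true`, the per-word `speaker` field can be folded into speaker turns. The following is a rough sketch under that assumption: the function name `speaker_turns` is hypothetical, and the sample words are made-up illustrative data, not real API output.

```python
from itertools import groupby
from typing import Any, Dict, List, Tuple


def speaker_turns(words: List[Dict[str, Any]]) -> List[Tuple[int, float, float, str]]:
    """Merge consecutive words from the same speaker into (speaker, start, end, text)."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        chunk = list(group)
        text = " ".join(w["text"] for w in chunk)
        turns.append((speaker, chunk[0]["start"], chunk[-1]["end"], text))
    return turns


# Illustrative data shaped like the `words` array described above.
sample_words = [
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker": 0},
    {"text": "there.", "start": 0.4, "end": 0.8, "speaker": 0},
    {"text": "Hi.", "start": 1.0, "end": 1.3, "speaker": 1},
]

for speaker, start, end, text in speaker_turns(sample_words):
    print(f"[speaker {speaker}] {start:.2f}s - {end:.2f}s: {text}")
```

Grouping only consecutive words keeps the turns in time order, so a speaker who talks twice produces two separate turns.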

Supported Audio Formats

Container formats (auto-detected)

Format	Extension	Description
AAC	`.aac`	Advanced Audio Coding
FLAC	`.flac`	Free Lossless Audio Codec — lossless compression
M4A	`.m4a`	MPEG-4 Audio — Apple ecosystem standard
MKV	`.mkv`	Matroska container — supports MP3, AAC, and FLAC audio codecs
MP3	`.mp3`	MPEG Audio Layer 3 — widely supported
MP4	`.mp4`	MPEG-4 container
OGG	`.ogg`	Ogg container — open format
Opus	`.opus`	Opus codec — low-latency, high quality
WAV	`.wav`	Waveform Audio — lossless, best quality input

Raw formats (require `audio_format` and `sample_rate`)

Format	`audio_format` value	Description
A-law	`alaw`	G.711 A-law (1 byte/sample)
PCM	`pcm`	Signed 16-bit little-endian (2 bytes/sample)
µ-law	`mulaw`	G.711 µ-law (1 byte/sample)
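Because raw audio has no header, its duration follows from the byte count and the values passed in `audio_format`, `sample_rate`, and `channels`. A small sketch of that arithmetic is shown below. The function name is hypothetical, and it assumes the bytes-per-sample figures in the table apply per channel.

```python
# Bytes per sample for each raw format, taken from the table above.
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}


def raw_audio_duration(num_bytes: int, audio_format: str, sample_rate: int, channels: int = 1) -> float:
    """Estimate the duration in seconds of headerless audio.

    Assumes every channel contributes one sample per sampling instant.
    """
    bytes_per_second = BYTES_PER_SAMPLE[audio_format] * sample_rate * channels
    return num_bytes / bytes_per_second


# One second of 16 kHz mono PCM is 16000 samples * 2 bytes = 32000 bytes.
print(raw_audio_duration(32000, "pcm", 16000))
```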

Limits

Max file size: 500 MB
Channels: Mono, stereo, or up to 8 channels (with multichannel=true)
Sample rates: 8000, 16000, 22050, 24000, 44100, 48000 Hz
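These limits can be checked locally before uploading. The sketch below is illustrative only: `check_limits` is a hypothetical helper, and it assumes "500 MB" means 500,000,000 bytes, which is the more conservative reading if the service actually counts in mebibytes.

```python
from typing import List, Optional

MAX_FILE_BYTES = 500 * 1000 * 1000  # assumption: decimal megabytes
SUPPORTED_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
MAX_CHANNELS = 8


def check_limits(
    size_bytes: int,
    sample_rate: Optional[int] = None,
    channels: Optional[int] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no limit is exceeded."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the {MAX_FILE_BYTES}-byte limit")
    if sample_rate is not None and sample_rate not in SUPPORTED_SAMPLE_RATES:
        problems.append(f"unsupported sample rate: {sample_rate} Hz")
    if channels is not None and not 1 <= channels <= MAX_CHANNELS:
        problems.append(f"unsupported channel count: {channels}")
    return problems
```

For a file on disk, `check_limits(os.path.getsize("audio.mp3"))` would cover the size check. The sample rate and channel arguments only matter for raw audio, since container formats are auto-detected.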

Streaming Speech-to-Text (WebSocket)

For real-time transcription, use the WebSocket API at wss://api.x.ai/v1/stt. The client streams raw audio as binary WebSocket frames and receives JSON transcript events as the audio is processed.

Endpoint: wss://api.x.ai/v1/stt

Configuration is done via URL query parameters — no setup message required. Audio is sent as raw binary frames (no base64 encoding).
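One way to prepare raw binary frames is to slice a PCM buffer into fixed-duration chunks and send each chunk as one binary WebSocket message. The sketch below assumes 16-bit PCM. The generator name `pcm_frames` and the 100 ms frame length are arbitrary illustrative choices, since this section does not specify a frame size.

```python
from typing import Iterator

PCM_BYTES_PER_SAMPLE = 2  # signed 16-bit little-endian


def pcm_frames(
    audio: bytes,
    sample_rate: int = 16000,
    channels: int = 1,
    frame_ms: int = 100,
) -> Iterator[bytes]:
    """Yield consecutive chunks of raw PCM, each covering about frame_ms of audio."""
    samples_per_frame = sample_rate * frame_ms // 1000
    # Keep whole samples for every channel in each frame.
    bytes_per_frame = samples_per_frame * channels * PCM_BYTES_PER_SAMPLE
    for offset in range(0, len(audio), bytes_per_frame):
        yield audio[offset:offset + bytes_per_frame]


# One second of silent 16 kHz mono PCM splits into ten 3200-byte frames.
one_second = bytes(16000 * PCM_BYTES_PER_SAMPLE)
frames = list(pcm_frames(one_second))
print(len(frames), len(frames[0]))
```

The last frame may be shorter than the others when the buffer length is not a multiple of the frame size.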

Note: Never expose your API key in client-side code. Always proxy WebSocket connections through your backend.

Query Parameters

Parameter	Type	Default	Description
`channels`	integer	`1`	Number of interleaved audio channels (max 8).
`diarize`	boolean		When `true`, enables speaker diarization. Words include a `speaker` field identifying the detected speaker.
`encoding`	string	`pcm`	Audio encoding: `pcm`, `mulaw`, or `alaw`.
`endpointing`	integer	`10`	Silence duration (ms) before the utterance-final event fires. Range: 0–5000. `0` fires on any voice activity detection (VAD) silence boundary.
`filler_words`	boolean	`false`	When `true`, filler words (e.g. `uh`, `um`, `er`) are included in the transcript. When `false` (default), filler words are automatically removed.
`interim_results`	boolean	`false`	When `true`, emits partial transcripts (`is_final=false`) approximately every 500 ms.
`keyterm`	string		A key term to bias transcription toward (e.g. product names, proper nouns). Repeat the parameter for multiple terms (e.g. `keyterm=Understand+The+Universe`). Max 100 terms, each up to 50 characters.
`language`	string		Language code for text formatting. See Supported Languages.
`multichannel`	boolean	`false`	Per-channel transcription. Requires `channels` ≥ 2.
`sample_rate`	integer	`16000`	Audio sample rate in Hz.
`smart_turn_timeout`	integer		Maximum silence duration (ms) before forcing `speech_final`, even when the Smart Turn model predicts the speaker hasn't finished. Range: 1–5000. Only applies when `smart_turn` is enabled. See Smart Turn.
`smart_turn`	number		End-of-turn detection threshold (0.0–1.0). When set, enables Smart Turn — an ML model predicts whether the speaker has finished their thought at each silence boundary. See Smart Turn.
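Since all configuration travels in the query string, the connection URL can be assembled with the standard library. The following is a minimal sketch: `build_stream_url` is a hypothetical helper, and it assumes booleans are written as `true` and `false`. `urlencode` encodes spaces as `+`, which matches the `keyterm=Understand+The+Universe` form shown in the table.

```python
from typing import Iterable
from urllib.parse import urlencode

STT_WS_ENDPOINT = "wss://api.x.ai/v1/stt"


def build_stream_url(keyterms: Iterable[str] = (), **params) -> str:
    """Build the streaming URL from query parameters, repeating keyterm per term."""
    pairs = []
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        pairs.append((name, str(value)))
    pairs.extend(("keyterm", term) for term in keyterms)
    if not pairs:
        return STT_WS_ENDPOINT
    return f"{STT_WS_ENDPOINT}?{urlencode(pairs)}"


print(build_stream_url(
    keyterms=["Understand The Universe"],
    sample_rate=16000,
    interim_results=True,
    language="en",
))
```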

Server Events

Event	Description
`error`	Error with `message` field. Connection stays open.
`transcript.created`	Server ready — wait for this before sending audio.
`transcript.done`	Final transcript after `audio.done`. `duration` always present. Includes `channel_index` when `multichannel=true` — one event sent per channel. Connection closes after this.
`transcript.partial`	Transcript result with `text`, `words`, `is_final`, `speech_final`, `start`, `duration`. Includes `channel_in
…
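The events listed above could be consumed by a small state holder like the one below. This is a sketch under stated assumptions: the class `TranscriptCollector` is hypothetical, and because this section does not show how the event name is carried in the JSON message, the sketch takes the name as a separate argument. It also treats a `transcript.partial` event with `is_final` set to true as a finalized segment.

```python
from typing import Any, Dict, List


class TranscriptCollector:
    """Track server readiness, finalized text, and errors from server events."""

    def __init__(self) -> None:
        self.ready = False           # set once transcript.created arrives
        self.done = False            # set once transcript.done arrives
        self.finals: List[str] = []  # text of finalized segments
        self.errors: List[str] = []  # error messages; the connection stays open

    def handle(self, name: str, payload: Dict[str, Any]) -> None:
        if name == "transcript.created":
            self.ready = True
        elif name == "transcript.partial":
            # Interim results carry is_final=false and are skipped here.
            if payload.get("is_final"):
                self.finals.append(payload.get("text", ""))
        elif name == "error":
            self.errors.append(payload.get("message", ""))
        elif name == "transcript.done":
            self.done = True


collector = TranscriptCollector()
collector.handle("transcript.created", {})
collector.handle("transcript.partial", {"text": "The bal", "is_final": False})
collector.handle("transcript.partial", {"text": "The balance is", "is_final": True})
collector.handle("transcript.done", {"duration": 3.45})
print(" ".join(collector.finals))
```

A client would wait until `ready` is true before sending audio and stop reading once `done` is true, in line with the event descriptions above.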
