Speech to Text
Transcribe audio files into text with a single API call, or stream audio in real time over WebSocket. The API supports 12 audio formats, word-level timestamps, multichannel transcription, and text formatting.
Quick Start
Transcribe an audio file with a single API call:
```bash
curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F format=true \
  -F language=en \
  -F "keyterm=Understand The Universe" \
  -F file=@audio.mp3
```
```python
import os

import requests

response = requests.post(
    "https://api.x.ai/v1/stt",
    headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
    files={"file": ("audio.mp3", open("audio.mp3", "rb"), "audio/mpeg")},
    data=[
        ("format", "true"),
        ("language", "en"),
        ("keyterm", "Understand The Universe"),
    ],
)
response.raise_for_status()
result = response.json()

print(result["text"])
print(f"Duration: {result['duration']}s")
for word in result.get("words", []):
    print(f"  {word['start']:.2f}s - {word['end']:.2f}s: {word['text']}")
```
```javascript
import fs from "fs";

const formData = new FormData();
formData.append("format", "true");
formData.append("language", "en");
formData.append("keyterm", "Understand The Universe");
formData.append("file", new Blob([fs.readFileSync("audio.mp3")]), "audio.mp3");

const response = await fetch("https://api.x.ai/v1/stt", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.XAI_API_KEY}`,
  },
  body: formData,
});
if (!response.ok) throw new Error(`STT error ${response.status}`);

const result = await response.json();
console.log(result.text);
console.log(`Duration: ${result.duration}s`);
for (const word of result.words ?? []) {
  console.log(`  ${word.start.toFixed(2)}s - ${word.end.toFixed(2)}s: ${word.text}`);
}
```
Note: The file parameter must be provided after all other parameters in the multipart form.
Supported Languages
The model transcribes speech in any of the languages below regardless of the language parameter. Setting language, together with format=true, enables formatting of numbers, currencies, and units into their written form for that language.
| Language | Code | Language | Code |
|---|---|---|---|
| Arabic | ar | Macedonian | mk |
| Czech | cs | Malay | ms |
| Danish | da | Persian | fa |
| Dutch | nl | Polish | pl |
| English | en | Portuguese | pt |
| Filipino | fil | Romanian | ro |
| French | fr | Russian | ru |
| German | de | Spanish | es |
| Hindi | hi | Swedish | sv |
| Indonesian | id | Thai | th |
| Italian | it | Turkish | tr |
| Japanese | ja | Vietnamese | vi |
| Korean | ko | | |
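A client may want to check a language code against the table above before asking for formatting. The following is a minimal sketch of such a check; `SUPPORTED_LANGUAGES` and `formatting_fields` are hypothetical names used for illustration and are not part of the API.

```python
# Language codes copied from the Supported Languages table.
SUPPORTED_LANGUAGES = frozenset({
    "ar", "cs", "da", "nl", "en", "fil", "fr", "de", "hi", "id", "it", "ja", "ko",
    "mk", "ms", "fa", "pl", "pt", "ro", "ru", "es", "sv", "th", "tr", "vi",
})


def formatting_fields(language):
    """Return the form fields that turn on text formatting for `language`.

    Hypothetical client-side helper: raises ValueError for a code that is
    not listed in the table, so the mistake surfaces before the request.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return [("format", "true"), ("language", language)]
```

The returned pairs can be merged into the other form fields of a request, as in the Quick Start example.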
Request Body
The request uses multipart/form-data. Either file or url must be provided.
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| audio_format | string | | | Format hint for raw/headerless audio: pcm, mulaw, alaw. Container formats are auto-detected, so do not set this field for MP3, WAV, etc. |
| channels | integer | | | Number of audio channels (2–8). Only required for multichannel raw audio. Auto-detected for container formats. |
| diarize | boolean | false | | When true, enables speaker diarization. Each word in the response includes a speaker field (integer) identifying the detected speaker. |
| file | file | | ✓† | Audio file to transcribe. Max 500 MB. See Supported Formats. Must be the last field in the multipart form. |
| filler_words | boolean | false | | When true, filler words (e.g. "uh", "um", "er") are included in the transcript. When false (default), filler words are automatically removed from the transcript text and the words array. |
| format | boolean | false | | When true, enables Inverse Text Normalization, which converts spoken numbers and currency to written form (e.g. "one hundred dollars" → "$100"). Requires language. |
| keyterm | string | | | A key term to bias transcription toward (e.g. product names, proper nouns). Repeat the field for multiple terms (e.g. keyterm=Understand+The+Universe). Max 100 terms, each up to 50 characters. |
| language | string | | | Language code (e.g. en, fr, de). Used with format=true to enable text formatting. See Supported Languages. |
| multichannel | boolean | false | | When true, transcribes each audio channel independently. Results are returned in the channels array. |
| sample_rate | integer | | | Sample rate in Hz. Only required for raw audio (pcm, mulaw, alaw). Supported: 8000, 16000, 22050, 24000, 44100, 48000. |
| url | string | | ✓† | URL of an audio file to download and transcribe (server-side). |
† Either file or url must be provided.
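The constraints in the table (repeated keyterm fields, the 100-term and 50-character limits, and format requiring language) can be checked on the client before sending. Below is a rough sketch under stated assumptions: `build_stt_fields` and its argument names are hypothetical, and "50 characters" is interpreted as Python string length, which the documentation does not confirm.

```python
def build_stt_fields(language=None, format_text=False, keyterms=(),
                     diarize=False, filler_words=False, multichannel=False,
                     url=None):
    """Build the multipart form fields (everything except `file`).

    Hypothetical helper; returns a list of (name, value) pairs so that
    `keyterm` can appear more than once.
    """
    keyterms = list(keyterms)
    if len(keyterms) > 100:
        raise ValueError("at most 100 keyterms are allowed")
    for term in keyterms:
        # Assumption: the 50-character limit is measured like len() in Python.
        if len(term) > 50:
            raise ValueError(f"keyterm longer than 50 characters: {term!r}")
    if format_text and not language:
        raise ValueError("format=true requires a language code")

    fields = []
    if format_text:
        fields.append(("format", "true"))
    if language:
        fields.append(("language", language))
    # Repeat the keyterm field once per term.
    fields.extend(("keyterm", term) for term in keyterms)
    for name, enabled in (("diarize", diarize),
                          ("filler_words", filler_words),
                          ("multichannel", multichannel)):
        if enabled:
            fields.append((name, "true"))
    if url:
        fields.append(("url", url))
    return fields
```

With the requests library, the result can be passed as `data=` alongside `files={"file": ...}`, mirroring the Quick Start example; when transcribing from a URL, pass `url=` here and omit the file.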
Example with text formatting
```bash
curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F format=true \
  -F language=en \
  -F "keyterm=Understand The Universe" \
  -F file=@audio.mp3
```
The file parameter must be provided after all other parameters in the multipart form.
Response
The response includes the full transcript, audio duration, and word-level timestamps.
```json
{
  "text": "The balance is $167,983.15.",
  "language": "English",
  "duration": 3.45,
  "words": [
    { "text": "The", "start": 0.24, "end": 0.48 },
    { "text": "balance", "start": 0.48, "end": 0.96 },
    { "text": "is", "start": 0.96, "end": 1.12 },
    { "text": "$167,983.15.", "start": 1.12, "end": 3.20 }
  ]
}
```
| Field | Type | Description |
|---|---|---|
| channels | array | Per-channel transcripts (only when multichannel=true). Each entry has index, text, and words. |
| duration | number | Audio duration in seconds, to 2 decimal places. |
| language | string | Detected language name (e.g. "English", "French"). |
| text | string | Full transcript text. |
| words | array | Word-level segments with text, start, end, and speaker (integer, only when diarize=true). |
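When diarize=true, the per-word speaker field can be folded into speaker turns on the client. The sketch below shows one way to do that; `speaker_turns` is a hypothetical helper, the sample words are made up for illustration (not real API output), and joining words with a space is an assumption that suits space-delimited languages only.

```python
def speaker_turns(words):
    """Merge consecutive words from the same speaker into turns."""
    turns = []
    for word in words:
        speaker = word.get("speaker")  # absent unless diarize=true
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            turns.append({
                "speaker": speaker,
                "start": word["start"],
                "end": word["end"],
                "text": word["text"],
            })
    return turns


# Illustrative input only; shaped like the words array described above.
sample_words = [
    {"text": "Hello", "start": 0.10, "end": 0.40, "speaker": 0},
    {"text": "there.", "start": 0.40, "end": 0.75, "speaker": 0},
    {"text": "Hi!", "start": 1.00, "end": 1.30, "speaker": 1},
]

for turn in speaker_turns(sample_words):
    print(f"[{turn['start']:.2f}s - {turn['end']:.2f}s] "
          f"speaker {turn['speaker']}: {turn['text']}")
```

Without diarization every word has no speaker field, so the whole transcript collapses into a single turn with speaker set to None.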
Supported Audio Formats
Container formats (auto-detected)
| Format | Extension | Description |
|---|---|---|
| AAC | .aac | Advanced Audio Coding |
| FLAC | .flac | Free Lossless Audio Codec, lossless compression |
| M4A | .m4a | MPEG-4 Audio, Apple ecosystem standard |
| MKV | .mkv | Matroska container, supports MP3, AAC, and FLAC audio codecs |
| MP3 | .mp3 | MPEG Audio Layer 3, widely supported |
| MP4 | .mp4 | MPEG-4 container |
| OGG | .ogg | Ogg container, open format |
| Opus | .opus | Opus codec, low-latency, high quality |
| WAV | .wav | Waveform Audio, lossless, best quality input |
Raw formats (require audio_format and sample_rate)
| Format | audio_format value | Description |
|---|---|---|
| A-law | alaw | G.711 A-law (1 byte/sample) |
| PCM | pcm | Signed 16-bit little-endian (2 bytes/sample) |
| µ-law | mulaw | G.711 µ-law (1 byte/sample) |
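Because raw audio has no header, its duration follows directly from the byte count, the bytes per sample listed above, the sample rate, and the channel count. A small sketch of that arithmetic follows; `raw_audio_duration` is a hypothetical helper, and it assumes multichannel raw audio carries one sample per channel in each frame.

```python
# Bytes per sample, taken from the raw formats table.
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}


def raw_audio_duration(num_bytes, audio_format, sample_rate, channels=1):
    """Return the duration in seconds of a headerless audio buffer.

    Assumes each frame holds one sample per channel.
    """
    bytes_per_frame = BYTES_PER_SAMPLE[audio_format] * channels
    return num_bytes / (bytes_per_frame * sample_rate)
```

For example, 32,000 bytes of mono pcm at 16000 Hz is 32000 / (2 × 16000) = 1 second, while the same byte count of mulaw at 8000 Hz is 4 seconds.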
Limits
- Max file size: 500 MB
- Channels: Mono, stereo, or up to 8 channels (with multichannel=true)
- Sample rates: 8000, 16000, 22050, 24000, 44100, 48000 Hz
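These limits can be checked locally before uploading. The following is a minimal sketch; `check_limits` and the constant names are hypothetical, and reading "500 MB" as 500,000,000 bytes is an assumption, since the documentation does not say whether the limit is decimal or binary.

```python
MAX_FILE_BYTES = 500 * 1000 * 1000  # assumption: "500 MB" as decimal megabytes
SUPPORTED_SAMPLE_RATES = (8000, 16000, 22050, 24000, 44100, 48000)
MAX_CHANNELS = 8


def check_limits(size_bytes, sample_rate=None, channels=None):
    """Return a list of problems; an empty list means every check passed.

    `sample_rate` and `channels` are only checked when given, since they
    matter mainly for raw audio.
    """
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_FILE_BYTES}")
    if sample_rate is not None and sample_rate not in SUPPORTED_SAMPLE_RATES:
        problems.append(f"unsupported sample rate: {sample_rate} Hz")
    if channels is not None and not 1 <= channels <= MAX_CHANNELS:
        problems.append(f"channel count {channels} is outside 1-{MAX_CHANNELS}")
    return problems
```

A typical call would pass `os.path.getsize("audio.mp3")` as `size_bytes` and skip the request when the returned list is not empty.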
Streaming Speech-to-Text (WebSocket)
For real-time transcription, use the WebSocket API at wss://api.x.ai/v1/stt. The client streams raw audio as binary WebSocket frames and receives JSON transcript events as the audio is processed.
Endpoint: wss://api.x.ai/v1/stt
Configuration is done via URL query parameters — no setup message required. Audio is sent as raw binary frames (no base64 encoding).
[!NOTE]
Never expose your API key in client-side code. Always proxy WebSocket connections through your backend.
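Since audio is sent as raw binary frames, a client needs to cut its audio buffer into pieces before sending. The sketch below splits 16-bit PCM into fixed-duration chunks; `pcm_chunks` is a hypothetical helper and the 100 ms chunk length is an arbitrary choice, not a documented requirement. Sending each chunk is left to whichever WebSocket library the backend uses.

```python
def pcm_chunks(audio, sample_rate=16000, channels=1, chunk_ms=100):
    """Yield consecutive slices of 16-bit PCM audio, each about `chunk_ms` long.

    Chunks are cut on frame boundaries so a sample is never split in two;
    the last chunk may be shorter than the rest.
    """
    bytes_per_frame = 2 * channels  # 16-bit samples, one per channel
    frames_per_chunk = sample_rate * chunk_ms // 1000
    chunk_size = frames_per_chunk * bytes_per_frame
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]
```

Each yielded `bytes` object would be sent as one binary message, with no base64 encoding.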
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| channels | integer | 1 | Number of interleaved audio channels (max 8). |
| diarize | boolean | | When true, enables speaker diarization. Words include a speaker field identifying the detected speaker. |
| encoding | string | pcm | Audio encoding: pcm, mulaw, or alaw. |
| endpointing | integer | 10 | Silence duration (ms) before the utterance-final event. Range: 0–5000. 0 = fire on any VAD silence boundary. |
| filler_words | boolean | false | When true, filler words (e.g. uh, um, er) are included in the transcript. When false (default), filler words are automatically removed. |
| interim_results | boolean | false | When true, emits partial transcripts (is_final=false) roughly every 500 ms. |
| keyterm | string | | A key term to bias transcription toward (e.g. product names, proper nouns). Repeat the parameter for multiple terms (e.g. keyterm=Understand+The+Universe). Max 100 terms, each up to 50 characters. |
| language | string | | Language code for text formatting. See Supported Languages. |
| multichannel | boolean | false | Per-channel transcription. Requires channels ≥ 2. |
| sample_rate | integer | 16000 | Audio sample rate in Hz. |
| smart_turn_timeout | integer | | Maximum silence duration (ms) before forcing speech_final, even when the Smart Turn model predicts the speaker hasn't finished. Range: 1–5000. Only applies when smart_turn is enabled. See Smart Turn. |
| smart_turn | number | | End-of-turn detection threshold (0.0–1.0). When set, enables Smart Turn: an ML model predicts whether the speaker has finished their thought at each silence boundary. See Smart Turn. |
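Because all configuration travels in the query string, the connection URL can be assembled with the standard library. The following is a sketch, not an official client: `build_stream_url` is a hypothetical helper, and sending booleans as lowercase `true`/`false` is an assumption based on how the values are written in this document.

```python
from urllib.parse import urlencode

STT_WS_ENDPOINT = "wss://api.x.ai/v1/stt"


def build_stream_url(keyterms=(), **params):
    """Return the streaming endpoint with query parameters appended."""
    query = []
    for name, value in params.items():
        if value is None:
            continue  # leave the server default in place
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase strings.
            value = "true" if value else "false"
        query.append((name, str(value)))
    # Repeat the keyterm parameter once per term.
    query.extend(("keyterm", term) for term in keyterms)
    if not query:
        return STT_WS_ENDPOINT
    return f"{STT_WS_ENDPOINT}?{urlencode(query)}"


url = build_stream_url(
    keyterms=["Understand The Universe"],
    sample_rate=16000,
    encoding="pcm",
    interim_results=True,
    language="en",
)
print(url)
```

`urlencode` encodes spaces as `+`, which matches the `keyterm=Understand+The+Universe` form shown in the table.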
Server Events
| Event | Description |
|---|---|
| error | Error with message field. Connection stays open. |
| transcript.created | Server ready. Wait for this event before sending audio. |
| transcript.done | Final transcript after audio.done. duration is always present. Includes channel_index when multichannel=true, with one event sent per channel. Connection closes after this. |
| transcript.partial | Transcript result with text, words, is_final, speech_final, start, duration. Includes `channel_in |
| … |