xai-docs/latest/content · Jun 27, 00:17 UTC

pages/developers/model-capabilities/audio/speech-to-text.md

MD·20 KB·222 lines

Model Capabilities

Speech to Text

Transcribe audio files into text with a single API call, or stream audio in real time over WebSocket. The API supports 12 audio formats, word-level timestamps, multichannel transcription, and text formatting.

Quick Start

Transcribe an audio file with a single API call:

curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F format=true \
  -F language=en \
  -F "keyterm=Understand The Universe" \
  -F file=@audio.mp3

Python:

import os
import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("audio.mp3", "rb") as audio:
    response = requests.post(
        "https://api.x.ai/v1/stt",
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        files={"file": ("audio.mp3", audio, "audio/mpeg")},
        data=[
            ("format", "true"),
            ("language", "en"),
            ("keyterm", "Understand The Universe"),
        ],
    )
response.raise_for_status()

result = response.json()
print(result["text"])
print(f"Duration: {result['duration']}s")
for word in result.get("words", []):
    print(f"  {word['start']:.2f}s - {word['end']:.2f}s: {word['text']}")

JavaScript:

import fs from "fs";

const formData = new FormData();
formData.append("format", "true");
formData.append("language", "en");
formData.append("keyterm", "Understand The Universe");
formData.append("file", new Blob([fs.readFileSync("audio.mp3")]), "audio.mp3");

const response = await fetch("https://api.x.ai/v1/stt", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.XAI_API_KEY}`,
  },
  body: formData,
});

if (!response.ok) throw new Error(`STT error ${response.status}`);

const result = await response.json();
console.log(result.text);
console.log(`Duration: ${result.duration}s`);
for (const word of result.words ?? []) {
  console.log(`  ${word.start.toFixed(2)}s - ${word.end.toFixed(2)}s: ${word.text}`);
}

Note: The file parameter must be provided after all other parameters in the multipart form.
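The ordering rule in the note above can be illustrated with a hand-rolled multipart encoder. This is a minimal sketch using only the standard library; `encode_multipart` is a hypothetical helper (not part of any xAI SDK), and the placeholder bytes stand in for real audio. It writes every text field first and appends the file part last.

```python
import uuid


def encode_multipart(fields, file_name, file_bytes, content_type="audio/mpeg"):
    """Encode text fields first and the audio file last.

    `fields` is a list of (name, value) pairs so that repeated names such as
    `keyterm` keep their order. Returns (body, Content-Type header value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # The file part goes after every other field, as the note requires.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    parts.append(file_header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


body, content_type_header = encode_multipart(
    [("format", "true"), ("language", "en"), ("keyterm", "Understand The Universe")],
    "audio.mp3",
    b"\x00\x01",  # placeholder bytes, not real audio
)
print(content_type_header.split(";")[0])
```

Most HTTP clients handle this for you; the sketch only makes the field order explicit for cases where the body is assembled by hand.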

Supported Languages

The model transcribes speech in any of the following languages whether or not the language parameter is set. Setting language together with format=true enables formatting of numbers, currencies, and units into their written form.

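| Language | Code | Language | Code |
| --- | --- | --- | --- |
| Arabic | ar | Macedonian | mk |
| Czech | cs | Malay | ms |
| Danish | da | Persian | fa |
| Dutch | nl | Polish | pl |
| English | en | Portuguese | pt |
| Filipino | fil | Romanian | ro |
| French | fr | Russian | ru |
| German | de | Spanish | es |
| Hindi | hi | Swedish | sv |
| Indonesian | id | Thai | th |
| Italian | it | Turkish | tr |
| Japanese | ja | Vietnamese | vi |
| Korean | ko | | |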
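A client might check a language code against this list before asking for formatting. The following is a sketch under that assumption; `FORMATTING_LANGUAGES` and `formatting_fields` are hypothetical names, and returning no fields for an unknown code is a design choice of this example, not documented API behavior.

```python
# Codes copied from the Supported Languages table.
FORMATTING_LANGUAGES = {
    "ar": "Arabic", "cs": "Czech", "da": "Danish", "nl": "Dutch",
    "en": "English", "fil": "Filipino", "fr": "French", "de": "German",
    "hi": "Hindi", "id": "Indonesian", "it": "Italian", "ja": "Japanese",
    "ko": "Korean", "mk": "Macedonian", "ms": "Malay", "fa": "Persian",
    "pl": "Polish", "pt": "Portuguese", "ro": "Romanian", "ru": "Russian",
    "es": "Spanish", "sv": "Swedish", "th": "Thai", "tr": "Turkish",
    "vi": "Vietnamese",
}


def formatting_fields(language):
    """Return the form fields that enable formatting, or [] if unsupported."""
    code = language.strip().lower()
    if code not in FORMATTING_LANGUAGES:
        # Transcription still works without these fields; only formatting is skipped.
        return []
    return [("format", "true"), ("language", code)]


print(formatting_fields("EN"))
print(formatting_fields("xx"))
```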

Request Body

The request uses multipart/form-data. Either file or url must be provided.

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| audio_format | string | | | Format hint for raw/headerless audio: pcm, mulaw, alaw. Container formats are auto-detected; do not set this field for MP3, WAV, etc. |
| channels | integer | | | Number of audio channels (2–8). Only required for multichannel raw audio. Auto-detected for container formats. |
| diarize | boolean | false | | When true, enables speaker diarization. Each word in the response includes a speaker field (integer) identifying the detected speaker. |
| file | file | | ✓† | Audio file to transcribe. Max 500 MB. See Supported Formats. Must be the last field in the multipart form. |
| filler_words | boolean | false | | When true, filler words (e.g. "uh", "um", "er") are included in the transcript. When false (default), filler words are automatically removed from the transcript text and the words array. |
| format | boolean | false | | When true, enables Inverse Text Normalization, which converts spoken numbers and currency to written form (e.g. "one hundred dollars" → "$100"). Requires language. |
| keyterm | string | | | A key term to bias transcription toward (e.g. product names, proper nouns). Repeat the field for multiple terms (e.g. keyterm=Understand+The+Universe). Max 100 terms, each up to 50 characters. |
| language | string | | | Language code (e.g. en, fr, de). Used with format=true to enable text formatting. See Supported Languages. |
| multichannel | boolean | false | | When true, transcribes each audio channel independently. Results returned in the channels array. |
| sample_rate | integer | | | Sample rate in Hz. Only required for raw audio (pcm, mulaw, alaw). Supported: 8000, 16000, 22050, 24000, 44100, 48000. |
| url | string | | ✓† | URL of an audio file to download and transcribe (server-side). |

† Either file or url must be provided.
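The constraints in this table (format requires language, at most 100 key terms of up to 50 characters each) can be checked client-side before a request is sent. The helper below is a sketch; `build_stt_fields` and its argument names are hypothetical, and serializing booleans as the string "true" follows the curl example rather than any stated rule.

```python
def build_stt_fields(*, language=None, apply_format=False, keyterms=(),
                     diarize=False, filler_words=False, url=None):
    """Build the non-file form fields as ordered (name, value) pairs."""
    if apply_format and not language:
        raise ValueError("format=true requires a language code")
    if len(keyterms) > 100:
        raise ValueError("at most 100 keyterm values are allowed")
    for term in keyterms:
        if len(term) > 50:
            raise ValueError(f"keyterm longer than 50 characters: {term!r}")

    fields = []
    if apply_format:
        fields.append(("format", "true"))
    if language:
        fields.append(("language", language))
    if diarize:
        fields.append(("diarize", "true"))
    if filler_words:
        fields.append(("filler_words", "true"))
    # keyterm is repeated once per term.
    fields.extend(("keyterm", term) for term in keyterms)
    if url:
        fields.append(("url", url))
    return fields


print(build_stt_fields(language="en", apply_format=True,
                       keyterms=["Understand The Universe"]))
```

The returned list can be passed as the `data` argument in the Quick Start Python example, with the audio file supplied separately so that it is encoded last.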

Example with text formatting

curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F format=true \
  -F language=en \
  -F "keyterm=Understand The Universe" \
  -F file=@audio.mp3

The file parameter must be provided after all other parameters in the multipart form.

Response

The response includes the full transcript, audio duration, and word-level timestamps.

{
  "text": "The balance is $167,983.15.",
  "language": "English",
  "duration": 3.45,
  "words": [
    { "text": "The", "start": 0.24, "end": 0.48 },
    { "text": "balance", "start": 0.48, "end": 0.96 },
    { "text": "is", "start": 0.96, "end": 1.12 },
    { "text": "$167,983.15.", "start": 1.12, "end": 3.20 }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| channels | array | Per-channel transcripts (only when multichannel=true). Each entry has index, text, and words. |
| duration | number | Audio duration in seconds, to 2 decimal places. |
| language | string | Detected language name (e.g. "English", "French"). |
| text | string | Full transcript text. |
| words | array | Word-level segments with text, start, end, and speaker (integer, only when diarize=true). |
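When diarize=true, the words array can be folded into speaker turns. The sketch below assumes only the word fields listed above (text, start, end, speaker); `speaker_turns` is a hypothetical helper and the sample words are invented for illustration.

```python
def speaker_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for word in words:
        speaker = word.get("speaker")  # None when diarize was not enabled
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            turns.append({
                "speaker": speaker,
                "text": word["text"],
                "start": word["start"],
                "end": word["end"],
            })
    return turns


# Invented sample data in the documented word shape.
sample_words = [
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker": 0},
    {"text": "there.", "start": 0.4, "end": 0.8, "speaker": 0},
    {"text": "Hi.", "start": 1.0, "end": 1.2, "speaker": 1},
]
for turn in speaker_turns(sample_words):
    print(f"Speaker {turn['speaker']} [{turn['start']:.2f}s - {turn['end']:.2f}s]: {turn['text']}")
```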

Supported Audio Formats

Container formats (auto-detected)

| Format | Extension | Description |
| --- | --- | --- |
| AAC | .aac | Advanced Audio Coding |
| FLAC | .flac | Free Lossless Audio Codec; lossless compression |
| M4A | .m4a | MPEG-4 Audio; Apple ecosystem standard |
| MKV | .mkv | Matroska container; supports MP3, AAC, and FLAC audio codecs |
| MP3 | .mp3 | MPEG Audio Layer 3; widely supported |
| MP4 | .mp4 | MPEG-4 container |
| OGG | .ogg | Ogg container; open format |
| Opus | .opus | Opus codec; low-latency, high quality |
| WAV | .wav | Waveform Audio; lossless, best quality input |

Raw formats (require audio_format and sample_rate)

| Format | audio_format value | Description |
| --- | --- | --- |
| A-law | alaw | G.711 A-law (1 byte/sample) |
| PCM | pcm | Signed 16-bit little-endian (2 bytes/sample) |
| µ-law | mulaw | G.711 µ-law (1 byte/sample) |
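Because raw audio has no header, its duration follows directly from the byte count, the bytes per sample listed above, the sample rate, and the channel count. A small sketch of that arithmetic, assuming interleaved channels; `raw_audio_seconds` is a hypothetical helper.

```python
# Bytes per sample, taken from the raw formats table.
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}


def raw_audio_seconds(num_bytes, audio_format, sample_rate, channels=1):
    """Duration in seconds of headerless audio with interleaved channels."""
    bytes_per_second = BYTES_PER_SAMPLE[audio_format] * sample_rate * channels
    return num_bytes / bytes_per_second


# 32,000 bytes of mono 16 kHz PCM is 16,000 samples, i.e. one second.
print(raw_audio_seconds(32_000, "pcm", 16_000))
```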

Limits

  • Max file size: 500 MB
  • Channels: Mono, stereo, or up to 8 channels (with multichannel=true)
  • Sample rates: 8000, 16000, 22050, 24000, 44100, 48000 Hz
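These limits can be checked locally before uploading. The following sketch is an assumption-laden pre-flight check: `limit_violations` is a hypothetical helper, and treating 500 MB as 500,000,000 bytes is a conservative guess, since the source does not say whether the limit is decimal or binary.

```python
MAX_FILE_BYTES = 500 * 1000 * 1000  # assumption: decimal megabytes
SUPPORTED_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
MAX_CHANNELS = 8


def limit_violations(size_bytes, sample_rate=None, channels=1):
    """Return a list of human-readable problems; empty means within limits."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the 500 MB limit")
    if sample_rate is not None and sample_rate not in SUPPORTED_SAMPLE_RATES:
        problems.append(f"sample rate {sample_rate} Hz is not in the supported list")
    if not 1 <= channels <= MAX_CHANNELS:
        problems.append(f"{channels} channels is outside the 1 to {MAX_CHANNELS} range")
    return problems


print(limit_violations(600_000_000, sample_rate=11025, channels=2))
```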

Streaming Speech-to-Text (WebSocket)

For real-time transcription, use the WebSocket API at wss://api.x.ai/v1/stt. The client streams raw audio as binary WebSocket frames and receives JSON transcript events as the audio is processed.

Endpoint: wss://api.x.ai/v1/stt

Configuration is done via URL query parameters — no setup message required. Audio is sent as raw binary frames (no base64 encoding).
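Since audio travels as raw binary frames, a client typically slices its audio buffer into fixed-duration chunks and sends each chunk as one binary message. The sketch below shows only the slicing; `audio_frames` is a hypothetical helper, and the 100 ms frame length is an arbitrary choice for illustration, as the source does not specify a frame size.

```python
def audio_frames(audio, sample_rate=16000, channels=1, bytes_per_sample=2, frame_ms=100):
    """Yield successive chunks of raw audio, each covering about frame_ms."""
    sample_bytes = bytes_per_sample * channels
    frame_bytes = sample_rate * sample_bytes * frame_ms // 1000
    # Round down so a frame never splits a sample across two messages.
    frame_bytes -= frame_bytes % sample_bytes
    if frame_bytes <= 0:
        raise ValueError("frame_ms is too small for this sample rate")
    for offset in range(0, len(audio), frame_bytes):
        yield audio[offset:offset + frame_bytes]


silence = bytes(8000)  # 0.25 s of mono 16 kHz PCM silence
print([len(frame) for frame in audio_frames(silence)])
```

Each yielded chunk would be passed to the WebSocket client's binary send call after the server signals readiness.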

Note: Never expose your API key in client-side code. Always proxy WebSocket connections through your backend.

Query Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| channels | integer | 1 | Number of interleaved audio channels (max 8). |
| diarize | boolean | | When true, enables speaker diarization. Words include a speaker field identifying the detected speaker. |
| encoding | string | pcm | Audio encoding: pcm, mulaw, or alaw. |
| endpointing | integer | 10 | Silence duration (ms) before the utterance-final event. Range: 0–5000. 0 = fire on any VAD silence boundary. |
| filler_words | boolean | false | When true, filler words (e.g. uh, um, er) are included in the transcript. When false (default), filler words are automatically removed. |
| interim_results | boolean | false | When true, emits partial transcripts (is_final=false) roughly every 500 ms. |
| keyterm | string | | A key term to bias transcription toward (e.g. product names, proper nouns). Repeat the parameter for multiple terms (e.g. keyterm=Understand+The+Universe). Max 100 terms, each up to 50 characters. |
| language | string | | Language code for text formatting. See Supported Languages. |
| multichannel | boolean | false | Per-channel transcription. Requires channels ≥ 2. |
| sample_rate | integer | 16000 | Audio sample rate in Hz. |
| smart_turn_timeout | integer | | Maximum silence duration (ms) before forcing speech_final, even when the Smart Turn model predicts the speaker hasn't finished. Range: 1–5000. Only applies when smart_turn is enabled. See Smart Turn. |
| smart_turn | number | | End-of-turn detection threshold (0.0–1.0). When set, enables Smart Turn: an ML model predicts whether the speaker has finished their thought at each silence boundary. See Smart Turn. |
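Because all configuration lives in the query string, the connection URL can be assembled with the standard library. This is a sketch: `build_stt_ws_url` is a hypothetical helper, only a subset of the parameters is covered, and sending booleans as the string "true" is an assumption carried over from the REST examples.

```python
from urllib.parse import urlencode

STT_WS_ENDPOINT = "wss://api.x.ai/v1/stt"


def build_stt_ws_url(*, sample_rate=16000, encoding="pcm", language=None,
                     interim_results=False, endpointing=None, smart_turn=None,
                     keyterms=()):
    """Assemble the streaming URL, validating the documented ranges."""
    params = [("sample_rate", sample_rate), ("encoding", encoding)]
    if language:
        params.append(("language", language))
    if interim_results:
        params.append(("interim_results", "true"))
    if endpointing is not None:
        if not 0 <= endpointing <= 5000:
            raise ValueError("endpointing must be between 0 and 5000 ms")
        params.append(("endpointing", endpointing))
    if smart_turn is not None:
        if not 0.0 <= smart_turn <= 1.0:
            raise ValueError("smart_turn must be between 0.0 and 1.0")
        params.append(("smart_turn", smart_turn))
    # keyterm is repeated once per term; urlencode turns spaces into "+".
    params.extend(("keyterm", term) for term in keyterms)
    return f"{STT_WS_ENDPOINT}?{urlencode(params)}"


print(build_stt_ws_url(language="en", interim_results=True,
                       keyterms=["Understand The Universe"]))
```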

Server Events

| Event | Description |
| --- | --- |
| error | Error with message field. Connection stays open. |
| transcript.created | Server ready; wait for this before sending audio. |
| transcript.done | Final transcript after audio.done. duration always present. Includes channel_index when multichannel=true (one event is sent per channel). Connection closes after this. |
| transcript.partial | Transcript result with text, words, is_final, speech_final, start, duration. Includes `channel_in
…