• Tech Dev NotesTech Dev Notes
Apps
  • App lookup
  • App compare
Market movement
  • App charts
  • App rankings
Visual proof
  • App screens
  • App listing screenshots
  • App icons
Build intelligence
  • App tech stacks
  • Tool releases
  • Developers
More
  • X feature flags
  • Grokipedia
  • Blog
  • Follow on X
Skip to content
All content/ filesChangelog

gemini-docs/latest/content · Jun 26, 14:03 UTC

pages/audio.txt

TXT·10.4 KB·298 lines

content/

  • pages

    • agent-environment.txt
    • agents.txt
    • ai-studio-quickstart.txt
    • aistudio-agents.txt
    • aistudio-android.txt
    • aistudio-build-mode.txt
    • aistudio-deploying.txt
    • aistudio-fullstack.txt
    • antigravity-agent.txt
    • api-key.txt
    • api-versions.txt
    • audio.txt
    • available-regions.txt
    • background-execution.txt
    • batch-api.txt
    • billing.txt
    • caching.txt
    • changelog.txt
    • code-execution.txt
    • coding-agents.txt
    • computer-use.txt
    • crewai-example.txt
    • custom-agents.txt
    • deep-research.txt
    • deprecations.txt
    • document-processing.txt
    • embeddings.txt
    • feedback-policies.txt
    • file-input-methods.txt
    • file-search.txt
    • files.txt
    • flex-inference.txt
    • function-calling.txt
    • gemini-3.txt
    • gemini-for-research.txt
    • get-started.txt
    • google-search.txt
    • image-generation.txt
    • image-understanding.txt
    • imagen.txt
    • index.txt
    • interactions-breaking-changes-may-2026.txt
    • interactions-overview.txt
    • langgraph-example.txt
    • learnlm.txt
    • libraries.txt
    • live-api.txt
    • llama-index.txt
    • logs-datasets.txt
    • logs-policy.txt
    • long-context.txt
    • managed-agents-quickstart.txt
    • maps-grounding.txt
    • media-resolution.txt
    • migrate-to-cloud.txt
    • migrate-to-interactions.txt
    • migrate.txt
    • model-tuning.txt
    • models.txt
    • music-generation.txt
    • oauth.txt
    • openai.txt
    • optimization.txt
    • partner-integration.txt
    • pricing.txt
    • priority-inference.txt
    • prompting-strategies.txt
    • rate-limits.txt
    • realtime-music-generation.txt
    • robotics-overview.txt
    • safety-guidance.txt
    • safety-settings.txt
    • speech-generation.txt
    • streaming.txt
    • structured-output.txt
    • temporal-example.txt
    • text-generation.txt
    • thinking.txt
    • thought-signatures.txt
    • tokens.txt
    • tool-combination.txt
    • tools.txt
  • pages/generate-content

    • api-key.txt
    • audio.txt
    • caching.txt
    • code-execution.txt
    • computer-use.txt
    • document-processing.txt
    • file-input-methods.txt
    • file-search.txt
    • files.txt
    • flex-inference.txt
    • function-calling.txt
    • gemini-3.txt
    • get-started.txt
    • google-search.txt
    • image-generation.txt
    • image-understanding.txt
    • maps-grounding.txt
    • media-resolution.txt
    • music-generation.txt
    • priority-inference.txt
    • speech-generation.txt
    • structured-output.txt
    • text-generation.txt
    • thinking.txt
    • thought-signatures.txt
    • tokens.txt
    • tool-combination.txt
    • url-context.txt
    • video-understanding.txt
    • webhooks.txt
    • whats-new-gemini-3.5.txt
  • pages/live-api

    • best-practices.txt
    • capabilities.txt
    • ephemeral-tokens.txt
    • get-started-sdk.txt
    • get-started-websocket.txt
    • live-translate.txt
    • session-management.txt
    • tools.txt
  • pages/models

    • antigravity-preview-05-2026.txt
    • deep-research-max-preview-04-2026.txt
    • deep-research-preview-04-2026.txt
    • deep-research-pro-preview-12-2025.txt
    • gemini-2.0-flash-lite.txt
    • gemini-2.0-flash.txt
    • gemini-2.5-computer-use-preview-10-2025.txt
    • gemini-2.5-flash-image.txt
    • gemini-2.5-flash-lite-preview-09-2025.txt
    • gemini-2.5-flash-lite.txt
    • gemini-2.5-flash-native-audio-preview-12-2025.txt
    • gemini-2.5-flash-preview-09-2025.txt
    • gemini-2.5-flash-preview-tts.txt
    • gemini-2.5-flash.txt
    • gemini-2.5-pro-preview-tts.txt
    • gemini-2.5-pro.txt
    • gemini-3-flash-preview.txt
    • gemini-3-pro-image.txt
    • gemini-3-pro-preview.txt
    • gemini-3.1-flash-image.txt
    • gemini-3.1-flash-lite-preview.txt
    • gemini-3.1-flash-lite.txt
    • gemini-3.1-flash-live-preview.txt
    • gemini-3.1-flash-tts-preview.txt
    • gemini-3.1-pro-preview.txt
    • gemini-3.5-flash.txt
    • gemini-3.5-live-translate-preview.txt
    • gemini-embedding-001.txt
    • gemini-embedding-2-preview.txt
    • gemini-embedding-2.txt
    • gemini-robotics-er-1.5-preview.txt
    • gemini-robotics-er-1.6-preview.txt
    • imagen.txt
    • lyria-3-clip-preview.txt
    • lyria-3-pro-preview.txt
    • lyria-realtime-exp.txt
    • veo-2.0-generate-001.txt
    • veo-3.1-generate-preview.txt
    • veo-3.1-lite-generate-preview.txt
route: /gemini-api/docs/audio
title: Audio understanding
description: 

Note: This version of the page covers the Interactions API. You can use the toggle on this page to switch to the generateContent API version of this page.
Gemini can analyze audio input and generate text responses.
Python
from google import genai
import base64
client = genai.Client()
uploaded_file = client.files.upload(file="path/to/sample.mp3")
interaction = client.interactions.create(
model="gemini-3.5-flash",
input=[
{"type": "text", "text": "Describe this audio clip"},
{
"type": "audio",
"uri": uploaded_file.uri,
"mime_type": uploaded_file.mime_type
}
]
)
print(interaction.output_text)
JavaScript
import { GoogleGenAI } from "@google/genai";
const client = new GoogleGenAI({});
const uploadedFile = await client.files.upload({
file: "path/to/sample.mp3",
config: { mime_type: "audio/mp3" }
});
const interaction = await client.interactions.create({
model: "gemini-3.5-flash",
input: [
{type: "text", text: "Describe this audio clip"},
{
type: "audio",
uri: uploadedFile.uri,
mime_type: uploadedFile.mimeType
}
]
});
console.log(interaction.output_text);
REST
# First upload the file, then use the URI:
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "gemini-3.5-flash",
"input": [
{"type": "text", "text": "Describe this audio clip"},
{
"type": "audio",
"uri": "YOUR_FILE_URI",
"mime_type": "audio/mp3"
}
]
}'
Overview
Gemini can analyze and understand audio input and generate text responses,
enabling use cases like:
Describe, summarize, or answer questions about audio content
Transcription and translation (speech to text)
Speaker diarization (identifying different speakers)
Emotion detection in speech and music
Analyzing specific segments with timestamps
For real-time voice and video interactions, see the
Live API.
For dedicated speech to text models with support for real-time transcription,
use the Google Cloud Speech-to-Text API.
Transcribe speech to text
This example shows how to transcribe, translate, and summarize speech with
timestamps, speaker diarization, and emotion detection using
structured outputs.
Python
from google import genai
client = genai.Client()
YOUTUBE_URL = "https://www.youtube.com/watch?v=ku-N-eS1lgM"
prompt = """
Process the audio file and generate a detailed transcription.
Requirements:
1. Identify distinct speakers (e.g., Speaker 1, Speaker 2).
2. Provide accurate timestamps for each segment (Format: MM:SS).
3. Detect the primary language of each segment.
4. If not English, provide the English translation.
5. Identify the primary emotion: Happy, Sad, Angry, or Neutral.
6. Provide a brief summary at the beginning.
"""
response_schema = {
"type": "object",
"properties": {
"summary": {"type": "string"},
"segments": {
"type": "array",
"items": {
"type": "object",
"properties": {
"speaker": {"type": "string"},
"timestamp": {"type": "string"},
"content": {"type": "string"},
"language": {"type": "string"},
"emotion": {
"type": "string",
"enum": ["happy", "sad", "angry", "neutral"]
}
},
"required": ["speaker", "timestamp", "content", "emotion"]
}
},
"required": ["summary", "segments"]
}
interaction = client.interactions.create(
model="gemini-3.5-flash",
input=[
{"type": "video", "uri": YOUTUBE_URL, "mime_type": "video/mp4"},
{"type": "text", "text": prompt}
],
response_format=response_schema,
)
print(interaction.output_text)
JavaScript
import { GoogleGenAI } from "@google/genai";
const client = new GoogleGenAI({});
const YOUTUBE_URL = "https://www.youtube.com/watch?v=ku-N-eS1lgM";
const prompt = `
Process the audio file and generate a detailed transcription.
Requirements:
1. Identify distinct speakers (e.g., Speaker 1, Speaker 2).
2. Provide accurate timestamps for each segment (Format: MM:SS).
3. Detect the primary language of each segment.
4. If not English, provide the English translation.
5. Identify the primary emotion: Happy, Sad, Angry, or Neutral.
6. Provide a brief summary at the beginning.
`;
const responseSchema = {
type: "object",
properties: {
summary: { type: "string" },
segments: {
type: "array",
items: {
type: "object",
properties: {
speaker: { type: "string" },
timestamp: { type: "string" },
content: { type: "string" },
language: { type: "string" },
emotion: {
type: "string",
enum: ["happy", "sad", "angry", "neutral"]
}
},
required: ["speaker", "timestamp", "content", "emotion"]
}
},
required: ["summary", "segments"]
};
const interaction = await client.interactions.create({
model: "gemini-3.5-flash",
input: [
{ type: "video", uri: YOUTUBE_URL, mime_type: "video/mp4" },
{ type: "text", text: prompt }
],
response_format: responseSchema,
});
console.log(JSON.parse(interaction.output_text));
REST
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "gemini-3.5-flash",
"input": [
{
"type": "video",
"uri": "https://www.youtube.com/watch?v=ku-N-eS1lgM",
"mime_type": "video/mp4"
},
{
"type": "text",
"text": "Transcribe with speaker diarization and emotion detection."
}
],
"response_format": {
"type": "object",
"properties": {
"summary": {"type": "string"},
"segments": {
"type": "array",
"items": {
"type": "object",
"properties": {
"speaker": {"type": "string"},
"timestamp": {"type": "string"},
"content": {"type": "string"},
"emotion": {"type": "string", "enum": ["happy", "sad", "angry", "neutral"]}
}
}'
Input audio
You can provide audio data in the following ways:
Upload an audio file before making a request.
Pass inline audio data with the request.
Upload an audio file
Use the Files API for files larger than 20 MB.
Python
from google import genai
client = genai.Client()
uploaded_file = client.files.upload(file="path/to/sample.mp3")
interaction = client.interactions.create(
model="gemini-3.5-flash",
input=[
{"type": "text", "text": "Describe this audio clip"},
{
"type": "audio",
"uri": uploaded_file.uri,
"mime_type": uploaded_file.mime_type
}
]
)
print(interaction.output_text)
JavaScript
import { GoogleGenAI } from "@google/genai";
const client = new GoogleGenAI({});
const uploadedFile = await client.files.upload({
file: "path/to/sample.mp3",
config: { mimeType: "audio/mp3" }
});
const interaction = await client.interactions.create({
model: "gemini-3.5-flash",
input: [
{type: "text", text: "Describe this audio clip"},
{
type: "audio",
uri: uploadedFile.uri,
mime_type: uploadedFile.mimeType
}
]
});
console.log(interaction.output_text);
REST
# First upload the file using the Files API, then use the URI:
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "gemini-3.5-flash",
"input": [
{"type": "text", "text": "Describe this audio clip"},
{
"type": "audio",
"uri": "YOUR_FILE_URI",
"mime_type": "audio/mp3"
}
]
}'
Pass audio data inline
For small audio files under 20MB total request size:
Python
from google import genai
import base64
client = genai.Client()
with open('path/to/small-sample.mp3', 'rb') as f:
audio_bytes = f.read()
interaction = client.interactions.create(
model="gemini-3.5-flash",
input=[
{"type": "text", "text": "Describe this audio clip"},
{
"type": "audio",
"data": base64.b64encode(audio_bytes).decode('utf-8'),
"mime_type": "audio/mp3"
}
]
)
print(interaction.output_text)
JavaScript
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const client = new GoogleGenAI({});
const audioData = fs.readFileSync("path/to/small-sample.mp3", {
encoding: "base64"
});
const interaction = await client.interactions.create({
model: "gemini-3.5-flash",
input: [
{type: "text", text: "Describe this audio clip"},
{
type: "audio",
data: audioData,
mime_type: "audio/mp3"
}
]
});
console.log(interaction.output_text);
REST
AUDIO_PATH="path/to/sample.mp3"
if [[ "$(base64 --version 2>&1)" = *"FreeB
…
Previouspages/api-versions.txtNextpages/available-regions.txt

© 2026 Tech Dev Notes

RSSAboutAPIPrivacyTermsSitemap@techdevnotes