
gemini-docs/latest/content · Jun 26, 14:03 UTC

pages/flex-inference.txt

TXT·6.9 KB·213 lines

route: /gemini-api/docs/flex-inference
title: Flex inference
description: Learn how to optimize costs with the Flex inference tier in the Interactions API

Note: This version of the page covers the Interactions API. You can use the toggle on this page to switch to the generateContent API version of this page.
Preview: The Gemini Flex API is in Preview.
The Gemini Flex API is an inference tier that offers a 50% cost reduction
compared to standard rates, in exchange for variable latency and best-effort
availability. It's designed for latency-tolerant workloads that require
synchronous processing but don't need the real-time performance of the standard
API.
How to use Flex
To use the Flex tier, set service_tier to flex in your request. If the field is omitted, the request uses the standard tier.
Python
from google import genai

client = genai.Client()

interaction = client.interactions.create(
    model="gemini-3.5-flash",
    input="Analyze this dataset for trends...",
    service_tier="flex",
)
print(interaction.output_text)
JavaScript
import { GoogleGenAI } from '@google/genai';

const client = new GoogleGenAI({});

async function main() {
  const interaction = await client.interactions.create({
    model: 'gemini-3.5-flash',
    input: 'Analyze this dataset for trends...',
    service_tier: 'flex'
  });
  console.log(interaction.output_text);
}

await main();
REST
curl -X POST "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "Content-Type: application/json" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -d '{
    "model": "gemini-3.5-flash",
    "input": "Analyze this dataset for trends...",
    "service_tier": "flex"
  }'
How Flex inference works
Gemini Flex inference bridges the gap between the standard API and the 24-hour
turnaround of the Batch API. It utilizes off-peak,
"sheddable" compute capacity to provide a cost-effective solution for background
tasks and sequential workflows.
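| Feature     | Flex                      | Priority                   | Standard           | Batch                 |
|-------------|---------------------------|----------------------------|--------------------|-----------------------|
| Pricing     | 50% discount              | 75-100% more than Standard | Full price         | 50% discount          |
| Latency     | Minutes (1–15 min target) | Low (Seconds)              | Seconds to minutes | Up to 24 hours        |
| Reliability | Best-effort (Sheddable)   | High (Non-sheddable)       | High / Medium-high | High (for throughput) |
| Interface   | Synchronous               | Synchronous                | Synchronous        | Asynchronous          |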
Key benefits
Cost efficiency: Substantial savings for non-production evals, background agents, and data enrichment.
Low friction: Simply add a single parameter to your existing requests.
Synchronous workflows: Ideal for sequential API chains where the next request depends on the output of the previous one, making it more flexible than Batch for agentic workflows.
Use cases
Offline evaluations: Running "LLM-as-a-judge" regression tests or leaderboards.
Background agents: Sequential tasks like CRM updates, profile building, or content moderation where minutes of delay are acceptable.
Budget-constrained research: Academic experiments that require high token volume on a limited budget.
Rate limits
Flex inference traffic counts towards your general rate limits; it doesn't
offer extended rate limits like the Batch API.
Sheddable capacity
Flex traffic is treated with lower priority. If there is a spike in
standard traffic, Flex requests may be preempted or evicted to ensure capacity
for high-priority users. If you're looking for high-priority inference, see
Priority inference.
Error codes
When Flex capacity is unavailable or the system is congested, the API will
return standard error codes:
503 Service Unavailable: The system is currently at capacity.
429 Too Many Requests: Rate limits or resource exhaustion.
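As a minimal sketch of how a client might act on these two codes, the helper below decides whether a failed request is worth retrying. It assumes you have already extracted the HTTP status code from the error; the name is_retryable_status and the choice to treat both codes as retryable are illustrative assumptions, not part of the API.

```python
# Status codes this section lists for Flex capacity problems.
# Treating both as retryable is an assumption of this sketch.
RETRYABLE_STATUS_CODES = {429, 503}


def is_retryable_status(status_code: int) -> bool:
    """Return True if the HTTP status suggests trying again later."""
    return status_code in RETRYABLE_STATUS_CODES


for code in (503, 429, 400):
    action = "retry with backoff" if is_retryable_status(code) else "do not retry"
    print(f"{code}: {action}")
```

Any other status (for example a 400 caused by a malformed request) is unlikely to succeed on a retry, so this sketch surfaces it immediately.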
Client responsibility
No server-side fallback: To prevent unexpected charges, the system won't
automatically upgrade a Flex request to the Standard tier if Flex capacity is
full.
Retries: You must implement your own client-side retry logic with
exponential backoff.
Timeouts: Because Flex requests may sit in a queue, we recommend
increasing client-side timeouts to 10 minutes or more to avoid premature
connection closure.
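To make the backoff recommendation concrete, the sketch below computes the waits an exponential backoff schedule would produce. The function name backoff_delays and the max_delay cap are hypothetical choices for illustration; the doubling rule itself is the standard exponential backoff pattern.

```python
def backoff_delays(max_attempts: int, base_delay: float, max_delay: float) -> list:
    """Seconds to sleep between attempts: base, 2*base, 4*base, ..., capped at max_delay."""
    # There is one fewer sleep than there are attempts.
    return [min(base_delay * (2 ** i), max_delay) for i in range(max_attempts - 1)]


delays = backoff_delays(max_attempts=5, base_delay=5, max_delay=60)
print(delays)       # [5, 10, 20, 40]
print(sum(delays))  # 75 seconds of waiting in the worst case
```

Summing the schedule shows how long a caller could spend sleeping before giving up, which is worth comparing against the client-side timeout. Many clients also add random jitter to each delay so that simultaneous callers do not retry in lockstep.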
Adjust timeout windows
You can configure per-request timeouts for the REST API and client libraries.
Make sure your client-side timeout is at least as long as you expect the server
to hold the request (for example, 600 seconds or more for Flex wait queues).
The SDKs expect timeout values in milliseconds.
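Because the recommendation is stated in seconds while the SDKs take milliseconds, a small conversion helper can prevent unit mistakes. This is a sketch; seconds_to_millis and the constant name are hypothetical and not part of any SDK.

```python
def seconds_to_millis(seconds: float) -> int:
    """Convert a timeout in seconds to the whole-millisecond value the SDKs expect."""
    return round(seconds * 1000)


FLEX_TIMEOUT_MS = seconds_to_millis(600)  # the 10-minute minimum suggested for Flex
print(FLEX_TIMEOUT_MS)                    # 600000
print(seconds_to_millis(15 * 60))         # 900000
```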
Per-request timeouts
Python
from google import genai

# 900000 ms = 15 minutes. Set on the client, so it applies to every request it sends.
client = genai.Client(http_options={"timeout": 900000})

interaction = client.interactions.create(
    model="gemini-3.5-flash",
    input="why is the sky blue?",
    service_tier="flex",
)
JavaScript
import { GoogleGenAI } from '@google/genai';

const client = new GoogleGenAI({});

async function main() {
  const interaction = await client.interactions.create({
    model: "gemini-3.5-flash",
    input: "why is the sky blue?",
    service_tier: "flex",
  }, {timeout: 900000});
}

await main();
Implement retries
Because Flex capacity is sheddable, requests can fail with 503 errors. The
following example retries a failed request with exponential backoff and, once
the retries are exhausted, falls back to the Standard tier:
Python
import time
from google import genai

client = genai.Client()

def call_with_retry(max_retries=3, base_delay=5):
    for attempt in range(max_retries):
        try:
            return client.interactions.create(
                model="gemini-3.5-flash",
                input="Analyze this batch statement.",
                service_tier="flex",
            )
        # Catches every error for brevity; consider retrying only on 429 and 503.
        except Exception:
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)  # Exponential backoff, in seconds
                print(f"Flex busy, retrying in {delay}s...")
                time.sleep(delay)
            else:
                # Last Flex attempt failed: send the request once on the Standard tier.
                print("Flex exhausted, falling back to Standard...")
                return client.interactions.create(
                    model="gemini-3.5-flash",
                    input="Analyze this batch statement."
                )

interaction = call_with_retry()
print(interaction.output_text)
JavaScript
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({});

async function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function callWithRetry(maxRetries = 3, baseDelay = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      console.log(`Attempt ${attempt + 1}: Calling Flex tier...`);
      const interaction = await ai.interactions.create({
        model: "gemini-3.5-flash",
        input: "Analyze this batch statement.",
        service_tier: 'flex',
      });
      return interaction;
    } catch (e) {
      if (attempt < maxRetries - 1) {
        const delay = baseDelay * (2 ** attempt); // Exponential backoff, in seconds
        console.log(`Flex busy, retrying in ${delay}s...`);
        await sleep(delay * 1000);
      } else {
        // Last Flex attempt failed: send the request once on the Standard tier.
        console.log("Flex exhausted, falling back to Standard...");
        return await ai.interactions.create({
          model: "gemini-3.5-flash",
          input: "Analyze this batch statement.",
        });
      }
    }
  }
}

async function main() {
  const interaction = await callWithRetry();
  console.log(interaction.output_text);
}

await main();
Pricing
Flex inference is priced at 50% of the standard API
and billed per token.
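The arithmetic can be sketched as follows. The price of $1.00 per million tokens is a made-up figure for illustration only, and using a single blended rate is a simplification; check the pricing page for actual rates.

```python
FLEX_DISCOUNT = 0.5  # Flex is billed at 50% of the standard rate


def flex_cost(tokens: int, standard_price_per_million: float) -> float:
    """Estimate the Flex cost in dollars for a token count at a given standard rate."""
    standard_cost = tokens / 1_000_000 * standard_price_per_million
    return standard_cost * FLEX_DISCOUNT


# Hypothetical rate: $1.00 per 1M tokens on the standard tier.
print(flex_cost(2_000_000, 1.00))  # 1.0, versus 2.0 on the standard tier
```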
Supported models
The following models support Flex inference:
| Model                  | Flex inference |
|------------------------|----------------|
| Gemini 3.5 Flash       | ✔️             |
| Gemini 3.1 Flash-Lite  | ✔️             |
| Gemini 3.1 Pro Preview | ✔️             |
| Gemini 3 Flash Preview | ✔️             |
| Gemini 2.5 Pro         | ✔️             |
| Gemini 2.5 Flash       | ✔️             |
| Gemini 2.5 Flash-Lite  | ✔️             |
What's next
Priority inference for ultra-low latency.
Tokens: Understand tokens.