gemini-docs/latest/content · Jun 26, 14:03 UTC

pages/computer-use.txt

TXT45 KB187 lines

route: /gemini-api/docs/computer-use
title: Computer Use
description: Learn how to use the Gemini API computer use feature.

Note: This version of the page covers the Interactions API. You can use the
toggle on this page to switch to the generateContent API version of this
page.
The Computer Use tool lets you build browser, mobile, and desktop control agents
that interact with and automate tasks. Using screenshots, the model can "see" a
computer screen, and "act" by generating specific UI actions like mouse clicks
and keyboard inputs. Similar to function calling, you will need to implement the
client-side execution environment to receive and execute the Computer Use
actions.
Gemini 3.5 Flash is the recommended model for Computer Use, and introduces
several new capabilities:
Multi-environment support: build agents for browser, mobile, and desktop environments.
Streamlined actions with intents: actions include an intent field that explains the model's reasoning behind each step.
Configurable safety policies: fine-tune safety behavior with built-in policy categories and overrides.
Prompt injection detection: opt-in screenshot scanning to detect hidden adversarial instructions.
With Computer Use, you can build agents that:
Automate repetitive data entry or form filling on websites.
Perform automated testing of web applications and user flows.
Conduct research across various websites (e.g., gathering product information, prices, and reviews from ecommerce sites to inform a purchase).
Here's a minimal example of initializing the client and sending a prompt to the model with the computer_use tool enabled for a browser environment:
Python
from google import genai
client = genai.Client()
interaction = client.interactions.create(
    model="gemini-3.5-flash",
    input="Search for 'Gemini API' on Google.",
    tools=[{"type": "computer_use", "environment": "browser"}]
)
print(interaction)
JavaScript
import { GoogleGenAI } from '@google/genai';
// Pass an options object; with {} the SDK reads the API key from the environment.
const ai = new GoogleGenAI({});
const interaction = await ai.interactions.create({
  model: 'gemini-3.5-flash',
  input: "Search for 'Gemini API' on Google.",
  tools: [{ type: "computer_use", environment: "browser" }]
});
console.log(interaction);
Note: As a Preview capability, Computer Use may contain errors and security
vulnerabilities. We recommend supervising closely for important tasks, and that
you avoid using the Computer Use capability for tasks involving critical
decisions, sensitive data, or actions where serious errors cannot be corrected.
We encourage you to review the Safety best practices, the Prohibited Use Policy,
and the Gemini API Additional Terms of Service.
How Computer Use works
To build an agent with the Computer Use model, you need to set up a
continuous loop between your application and the API. Here is what your code
will do at each step:
1. Send a request to the model
Your application sends an API request containing the Computer Use tool,
your configuration settings (like the target environment), the user's
prompt, and a screenshot of the current screen.
2. Receive the model response
The model analyzes the screen and the prompt, returning a response
which includes a suggested function_call representing a UI action (such
as a click, scroll, or keystroke).
For Gemini 3.5 Flash, the response also includes a reasoning intent
explaining why the model chose that action.
The response may also include a safety_decision from an internal safety
system that classifies the action as regular/allowed,
require_confirmation (requiring user approval), or blocked.
3. Execute the received action
If the action is allowed (or the user confirms it), your client-side
code parses the function_call, scales the normalized coordinates to match
your viewport, and executes the action in your target environment using
automation tools (such as Playwright). If the action is blocked, your
client should halt the execution or handle the interruption.
4. Capture the new environment state
After the action finishes executing, your application captures a new
screenshot and sends it back to the model in a function_result to
request the next step.
This process then repeats from step 2, continually soliciting the next action
from the model until the task is completed or terminated.
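The loop described above can be sketched without any API calls by passing the four responsibilities in as callables. This is a minimal sketch under stated assumptions: ProposedAction, run_agent_loop, and the next_action, execute, screenshot, and confirm parameters are hypothetical names for illustration and are not SDK types. The only details taken from this page are the three safety outcomes (allowed, require_confirmation, blocked).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProposedAction:
    """Hypothetical, simplified view of one model turn (not an SDK type)."""
    name: Optional[str]               # None means the model proposed no further action
    args: dict
    safety_decision: str = "allowed"  # "allowed", "require_confirmation", or "blocked"


def run_agent_loop(
    next_action: Callable[[Optional[bytes]], ProposedAction],
    execute: Callable[[ProposedAction], None],
    screenshot: Callable[[], bytes],
    confirm: Callable[[ProposedAction], bool],
    max_steps: int = 20,
) -> str:
    """Drive the request/execute/capture loop and return why it stopped."""
    latest_screenshot: Optional[bytes] = None
    for _ in range(max_steps):
        # Ask the model for the next action, passing the latest screen state.
        action = next_action(latest_screenshot)
        if action.name is None:
            return "completed"
        if action.safety_decision == "blocked":
            return "blocked"
        if action.safety_decision == "require_confirmation" and not confirm(action):
            return "declined"
        # Perform the action, then capture the new environment state.
        execute(action)
        latest_screenshot = screenshot()
    return "max_steps_reached"
```

In a real agent, next_action would wrap the API request and response parsing, execute would call your automation tool, and confirm would prompt the user. The max_steps cap is a design choice that keeps a confused agent from looping indefinitely.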
How to implement Computer Use
Before building with the Computer Use tool, you will need to set up:
Secure execution environment: Run your agent in a sandboxed VM or
container to isolate it from your host system and limit its potential impact.
The reference implementation includes a ready-to-use Docker-based sandbox you
can use as a starting point.
Client-side action handler: Implement client-side logic to click at coordinates, type text, and take screenshots.
The examples below use a web browser as the execution environment and
Playwright as the client-side handler.
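A client-side action handler can be as simple as a lookup table from action name to a function that drives the page. The sketch below is illustrative only: the action names "click" and "type_text" and the argument keys "x", "y", and "text" are hypothetical placeholders, so substitute whatever names and arguments the model actually returns. It assumes coordinates have already been converted to viewport pixels, and it relies only on the Playwright calls page.mouse.click, page.keyboard.type, and page.screenshot.

```python
from typing import Any, Callable, Dict


def handle_click(page: Any, args: Dict[str, Any]) -> None:
    # Playwright's mouse.click takes pixel coordinates within the viewport.
    page.mouse.click(args["x"], args["y"])


def handle_type(page: Any, args: Dict[str, Any]) -> None:
    # keyboard.type sends the text to whichever element currently has focus.
    page.keyboard.type(args["text"])


# Hypothetical action names; replace with the names the model returns.
HANDLERS: Dict[str, Callable[[Any, Dict[str, Any]], None]] = {
    "click": handle_click,
    "type_text": handle_type,
}


def execute_action(page: Any, name: str, args: Dict[str, Any]) -> bytes:
    """Run one action on the page, then return a fresh screenshot as bytes."""
    handler = HANDLERS.get(name)
    if handler is None:
        raise ValueError(f"Unsupported action: {name}")
    handler(page, args)
    return page.screenshot()
```

Raising on unknown action names, rather than silently skipping them, makes it obvious when the model proposes something the handler does not yet support.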
0. Set up Playwright
First, install the required packages:
pip install google-genai playwright
playwright install chromium
Then, initialize a Playwright browser instance to use for execution:
from playwright.sync_api import sync_playwright
# 1. Configure screen dimensions for the target environment
SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900
# 2. Start the Playwright browser
# In production, utilize a sandboxed environment.
playwright = sync_playwright().start()
# Set headless=False to see the actions performed on your screen
browser = playwright.chromium.launch(headless=False)
# 3. Create a context and page with the specified dimensions
context = browser.new_context(
    viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT}
)
page = context.new_page()
# 4. Navigate to an initial page to start the task
page.goto("https://www.google.com")
# The 'page', 'SCREEN_WIDTH', and 'SCREEN_HEIGHT' variables
# will be used in the steps below.
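The overview mentions scaling normalized coordinates to match your viewport, which is where SCREEN_WIDTH and SCREEN_HEIGHT come in. A minimal sketch follows, assuming coordinates arrive on a 0-999 grid; this page does not state the range, so treat NORMALIZED_RANGE as an assumption and confirm it for your model. The to_pixels helper is a hypothetical name.

```python
from typing import Tuple

SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900
NORMALIZED_RANGE = 1000  # Assumption: coordinates arrive on a 0-999 grid.


def to_pixels(
    x: int,
    y: int,
    width: int = SCREEN_WIDTH,
    height: int = SCREEN_HEIGHT,
) -> Tuple[int, int]:
    """Map a normalized (x, y) point onto the viewport, clamped to its bounds."""
    px = int(x / NORMALIZED_RANGE * width)
    py = int(y / NORMALIZED_RANGE * height)
    # Clamp so that out-of-range input never produces an off-screen click.
    px = max(0, min(px, width - 1))
    py = max(0, min(py, height - 1))
    return px, py
```

For example, under this assumption to_pixels(500, 500) maps to the center of a 1440x900 viewport, (720, 450).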
1. Send a request to the model
Initialize the client library and configure the Computer Use tool. Note that there is no need to specify the display size when issuing a request; the model predicts pixel coordinates scaled to the height and width of the screen.
Gemini 3.5 Flash (Recommended)
Python
Use the google-genai Python SDK (version 2.7.0 or higher) to configure a request targeting the browser environment:
from google import genai
client = genai.Client()
interaction = client.interactions.create(
    model='gemini-3.5-flash',
    input="Find a flight from SF to Hawaii on Jun 30th, coming back on Jul 6th",
    tools=[
        {
            "type": "computer_use",
            "environment": "browser",
            "enable_prompt_injection_detection": True
        }
    ]
)
print(interaction)
JavaScript
Use the @google/genai Node.js SDK to configure a request targeting the browser environment:
import { GoogleGenAI } from '@google/genai';
// Pass an options object; with {} the SDK reads the API key from the environment.
const ai = new GoogleGenAI({});
const interaction = await ai.interactions.create({
  model: 'gemini-3.5-flash',
  input: "Find a flight from SF to Hawaii on Jun 30th, coming back on Jul 6th",
  tools: [
    {
      type: "computer_use",
      environment: "browser",
      enable_prompt_injection_detection: true
    }
  ]
});
console.log(interaction);
REST
Use curl to send a request:
curl -X POST \
  "https://generativelanguage.googleapis.com/v1beta/interactions" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-3.5-flash",
    "input": "Find me a flight from SF to Hawaii on Jun 30th, coming back on Jul 6th. Start by navigating directly to flights.google.com",
    "tools": [
      {
        "type": "computer_use",
        "environment": "browser",
        "enable_prompt_injection_detection": true
      }
    ]
  }'
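The tool entries in the requests on this page share a small set of fields, so it can be convenient to build them in one place. The helper below is a hypothetical convenience function, not part of the SDK. It uses only field names that appear on this page (type, environment, enable_prompt_injection_detection, and excluded_predefined_functions from the Gemini 2.5 example). Which options each model accepts is not stated here, so optional fields are included only when you set them.

```python
from typing import Any, Dict, List, Optional


def build_computer_use_tool(
    environment: str = "browser",
    enable_prompt_injection_detection: Optional[bool] = None,
    excluded_predefined_functions: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Return a computer_use tool entry suitable for the tools list."""
    tool: Dict[str, Any] = {"type": "computer_use", "environment": environment}
    # Only send optional fields that the caller explicitly set.
    if enable_prompt_injection_detection is not None:
        tool["enable_prompt_injection_detection"] = enable_prompt_injection_detection
    if excluded_predefined_functions:
        tool["excluded_predefined_functions"] = list(excluded_predefined_functions)
    return tool
```

The result can be passed as tools=[build_computer_use_tool(enable_prompt_injection_detection=True)] in the Python requests shown on this page.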
Gemini 2.5 (Legacy)
Python
from google import genai
client = genai.Client()
# Specify predefined functions to exclude (optional)
excluded_functions = ["drag_and_drop"]
interaction = client.interactions.create(
    model='gemini-2.5-computer-use-preview-10-2025',
    input="Search for highly rated smart fridges on Google Shopping.",
    tools=[
        {
            "type": "computer_use",
            "environment": "browser",
            "excluded_predefined_functions": excluded_functions
        }
    ]
)
print(interaction)
JavaScript
import { GoogleGenAI } from '@google/genai';
const ai = new GoogleGenAI();
// Specify predefined functio
…
