AI Browser and Computer Use Agents Cheat Sheet

Updated 2026-05-21


AI browser and computer use agents are autonomous AI systems that perceive graphical user interfaces through screenshots or accessibility trees and act through simulated mouse, keyboard, and browser controls, which lets them complete multi-step tasks on real computers without custom API integrations. The field is driven by production deployments from Anthropic (Claude Computer Use), OpenAI (Operator/CUA), and Google (Project Mariner, now folded into Gemini), along with open-source frameworks like browser-use and Stagehand. A critical mental model to internalize early: because these agents treat every rendered pixel as potential input, the line between data and instruction blurs, making prompt injection the dominant security risk and sandboxed execution environments a non-negotiable prerequisite for safe deployment.

What This Cheat Sheet Covers

This topic spans 14 focused tables and 97 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Core Agent Architectures
Table 2: Screenshot Grounding and Coordinate Mapping
Table 3: Anthropic Computer Use API
Table 4: OpenAI Operator and CUA Model
Table 5: Google Project Mariner and Gemini Browser Agents
Table 6: Evaluation Benchmarks
Table 7: Sandboxed Execution Environments
Table 8: Browser Automation Frameworks
Table 9: Security Risks and Attacks
Table 10: Prompt Injection Defenses
Table 11: Human-in-the-Loop and Oversight Patterns
Table 12: Deployment Patterns and Best Practices
Table 13: Reinforcement Learning for GUI Agents
Table 14: Microsoft UFO Windows Agent

Table 1: Core Agent Architectures

Agents are built around a perception-reasoning-action cycle that repeats until task completion or a stopping condition is met. Understanding the fundamental loop patterns helps you choose the right design for latency, safety, and task complexity.
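The cycle can be sketched as a small host-side loop. This is a minimal illustration under stated assumptions, not any vendor's API: `Action`, `run_agent`, `scripted_policy`, and `fake_execute` are hypothetical names, and the model call is replaced by a scripted stub so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One GUI action requested by the model (hypothetical shape)."""
    name: str   # e.g. "left_click", "type"
    args: dict


# Hypothetical signatures: a policy maps the observation history to the
# next action (or None when it considers the task complete); an executor
# performs the action and returns the resulting observation.
Policy = Callable[[List[str]], Optional[Action]]
Executor = Callable[[Action], str]


def run_agent(policy: Policy, execute: Executor,
              max_iter: int = 10) -> Tuple[List[str], str]:
    """Repeat perceive -> reason -> act until done or max_iter is hit."""
    history = ["initial screenshot"]       # perception: first observation
    for _ in range(max_iter):
        action = policy(history)           # reasoning: choose next action
        if action is None:                 # stopping condition: no action
            return history, "done"
        history.append(execute(action))    # action, then new observation
    return history, "max_iter"


def scripted_policy(history: List[str]) -> Optional[Action]:
    """Stand-in for a model call: replays a fixed two-step script."""
    script = [
        Action("left_click", {"x": 740, "y": 520}),
        Action("type", {"text": "hello"}),
    ]
    step = len(history) - 1                # one observation per past step
    return script[step] if step < len(script) else None


def fake_execute(action: Action) -> str:
    """Stand-in for real mouse/keyboard control inside a sandbox."""
    return f"executed {action.name}"


if __name__ == "__main__":
    print(run_agent(scripted_policy, fake_execute))
```

In a real deployment the policy would wrap a model request and the executor would drive a sandboxed browser or desktop; the `max_iter` cap is what keeps a confused agent from looping indefinitely.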

Architecture	Example	Description
Agent loop (sampling loop)	`for _ in range(max_iter):` `resp = client.beta.messages.create(...)` `if not tool_results: break`	Repeating cycle where the model requests a tool action, the host executes it, returns results, and the model decides the next step; terminates when no tool is requested or max iterations is reached.
ReAct pattern	Thought: "I need to find the submit button" Action: screenshot → left_click [740, 520] Observation: page changed	Interleaves Reason and Act at every step; the model thinks, acts, observes, then thinks again; best for exploratory tasks where the next step depends on intermediate observations.
Plan-and-execute	1. Open browser 2. Navigate to URL 3. Fill form 4. Submit	Separates planning (full task decomposition upfront) from sequential execution; superior for well-defined multi-step tasks; replanning triggered only on failure; consumes more tokens than ReAct per task.
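To make the plan-and-execute row concrete, the sketch below plans once upfront, runs steps in order, and replans only when a step fails. It is an illustration under assumptions rather than a reference implementation: `plan_and_execute`, `static_planner`, `make_flaky_executor`, and the `max_replans` cap are hypothetical, and the planner is a fixed list standing in for a model call.

```python
from typing import Callable, List

# Hypothetical signatures: the planner returns the remaining steps given
# the task and the steps already completed; the executor reports success.
Planner = Callable[[str, List[str]], List[str]]
Executor = Callable[[str], bool]

STEPS = ["Open browser", "Navigate to URL", "Fill form", "Submit"]


def plan_and_execute(task: str, planner: Planner, executor: Executor,
                     max_replans: int = 2) -> List[str]:
    """Plan upfront, execute sequentially, replan only on failure."""
    completed: List[str] = []
    plan = planner(task, completed)        # full decomposition upfront
    replans = 0
    while plan:
        step = plan[0]
        if executor(step):
            completed.append(step)
            plan = plan[1:]                # advance to the next step
        elif replans < max_replans:
            replans += 1
            plan = planner(task, completed)  # replanning on failure
        else:
            raise RuntimeError(
                f"step failed after {replans} replans: {step}")
    return completed


def static_planner(task: str, completed: List[str]) -> List[str]:
    """Stand-in for a model-generated plan: whatever is not yet done."""
    return [s for s in STEPS if s not in completed]


def make_flaky_executor() -> Executor:
    """Executor stub whose 'Fill form' step fails once, then succeeds."""
    failed_once = set()

    def execute(step: str) -> bool:
        if step == "Fill form" and step not in failed_once:
            failed_once.add(step)
            return False
        return True

    return execute


if __name__ == "__main__":
    print(plan_and_execute("submit the form", static_planner,
                           make_flaky_executor()))
```

Bounding the number of replans is a design choice made here for safety; without it, a step that can never succeed would trigger replanning forever.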
