AI browser and computer use agents are autonomous AI systems that perceive graphical user interfaces through screenshots or accessibility trees and take actions via simulated mouse, keyboard, and browser controls, enabling them to complete multi-step tasks on real computers without custom API integrations. The field is driven by production deployments from Anthropic (Claude Computer Use), OpenAI (Operator/CUA), and Google (Project Mariner, now folded into Gemini), along with open-source frameworks like browser-use and Stagehand. A critical mental model to internalize early: these agents blur the line between data and instruction at every rendered pixel, making prompt injection the dominant security risk and sandboxed execution environments a non-negotiable prerequisite for safe deployment.
What This Cheat Sheet Covers
This topic spans 14 focused tables and 97 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core Agent Architectures
Agents are built around a perception-reasoning-action cycle that repeats until task completion or a stopping condition is met. Understanding the fundamental loop patterns helps you choose the right design for latency, safety, and task complexity.
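The cycle described above can be sketched without any model or real screen. This is a minimal illustration, not any vendor's implementation: `Action`, `run_agent`, `perceive`, `decide`, and `act` are hypothetical names, and the toy stand-ins at the bottom replace the screenshot capture, the model call, and the input simulation that a real agent would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """One GUI action, e.g. kind="left_click", args={"x": 740, "y": 520}."""
    kind: str
    args: dict


def run_agent(
    perceive: Callable[[], str],
    decide: Callable[[str, list], Optional[Action]],
    act: Callable[[Action], None],
    max_iter: int = 10,
) -> list:
    """Run the perceive -> reason -> act cycle.

    Stops when `decide` returns None (task judged complete) or after
    `max_iter` steps (the stopping condition).
    """
    history = []
    for _ in range(max_iter):
        observation = perceive()               # perception: screenshot or accessibility tree
        action = decide(observation, history)  # reasoning: a model call would go here
        if action is None:                     # nothing left to do, so stop
            break
        act(action)                            # action: simulated mouse/keyboard input
        history.append((observation, action))
    return history


# Toy stand-ins (assumptions for illustration only).
screen = {"page": "form"}


def perceive() -> str:
    return screen["page"]


def decide(observation: str, history: list) -> Optional[Action]:
    if observation == "form":
        return Action("left_click", {"x": 740, "y": 520})
    return None


def act(action: Action) -> None:
    screen["page"] = "submitted"


steps = run_agent(perceive, decide, act)
```

The `max_iter` cap matters as much as the completion check: without it, a model that never stops requesting actions would loop indefinitely.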
| Architecture | Example | Description |
|---|---|---|
| Agent loop (tool use) | `for _ in range(max_iter):` `resp = client.beta.messages.create(...)` `if not tool_results: break` | Repeating cycle where the model requests a tool action, the host executes it, returns results, and the model decides the next step; terminates when no tool is requested or the maximum number of iterations is reached. |
| ReAct | `Thought: "I need to find the submit button"` `Action: screenshot → left_click [740, 520]` `Observation: page changed` | Interleaves Reason and Act at every step; the model thinks, acts, observes, then thinks again; best for exploratory tasks where the next step depends on intermediate observations. |
| Plan-and-execute | `1. Open browser` `2. Navigate to URL` `3. Fill form` `4. Submit` | Separates planning (full task decomposition upfront) from sequential execution; superior for well-defined multi-step tasks; replanning triggered only on failure; consumes more tokens than ReAct per task. |
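The plan-then-execute pattern in the last row, with replanning triggered only on failure, might look roughly like the sketch below. All names (`plan_and_execute`, `plan`, `execute`, `replan`, `max_replans`) are hypothetical, and the planner and executor are toy functions standing in for model calls and real browser actions.

```python
from typing import Callable, List


def plan_and_execute(
    task: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], bool],
    replan: Callable[[str, List[str], str], List[str]],
    max_replans: int = 2,
) -> List[str]:
    """Plan once up front, run steps in order, and replan only when a step fails."""
    steps = plan(task)           # full task decomposition before any action
    done: List[str] = []
    replans = 0
    while steps:
        step = steps.pop(0)
        if execute(step):        # step succeeded, move on without consulting the planner
            done.append(step)
            continue
        if replans >= max_replans:
            raise RuntimeError(f"giving up after {replans} replans at step: {step}")
        replans += 1
        steps = replan(task, done, step)  # new remaining plan, given what already succeeded
    return done


# Toy stand-ins (assumptions for illustration only).
banner_open = [True]  # hypothetical obstacle: a banner covering the form


def plan(task: str) -> List[str]:
    return ["Open browser", "Navigate to URL", "Fill form", "Submit"]


def execute(step: str) -> bool:
    if step == "Dismiss banner":
        banner_open[0] = False
        return True
    if step == "Fill form" and banner_open[0]:
        return False             # the form cannot be filled while the banner is open
    return True


def replan(task: str, done: List[str], failed_step: str) -> List[str]:
    return ["Dismiss banner", failed_step, "Submit"]


completed = plan_and_execute("submit the form", plan, execute, replan)
```

In this toy run the planner is consulted twice: once up front and once after "Fill form" fails. A ReAct-style agent would instead reason before every step.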