Browser agents combine a large language model (LLM) with a browser automation library, giving the model the ability to navigate websites, click elements, fill forms, and extract content. Unlike traditional web scraping (which depends on brittle CSS selectors or XPath expressions) or RPA (which replays recorded interaction sequences), a browser agent works from the intent of a task and can adapt to UI variations at run time.
How Browser Agents Work
The core loop is simple: take a screenshot (or the page's accessibility tree), pass it to the LLM with the current task, receive an action (click, type, navigate, extract), execute the action in the browser, and repeat until the task is complete.
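The loop described above can be sketched without any browser or LLM dependency. This is a minimal illustration, not any library's API: the `Action` structure, the `run_agent_loop` function, and the `observe`, `decide`, and `execute` callables are hypothetical names introduced here, and the step budget of 20 is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """One step chosen by the model (hypothetical structure for this sketch)."""
    kind: str           # "click", "type", "navigate", "extract", or "done"
    argument: str = ""  # e.g. a URL, text to type, or the final answer


def run_agent_loop(
    task: str,
    observe: Callable[[], str],
    decide: Callable[[str, str, list], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> Optional[str]:
    """Observe, decide, act, repeat; stop on "done" or when the budget runs out."""
    history: list = []
    for _ in range(max_steps):
        observation = observe()                      # screenshot or accessibility tree
        action = decide(task, observation, history)  # the LLM call in a real agent
        if action.kind == "done":
            return action.argument
        execute(action)                              # browser automation in a real agent
        history.append(action)
    return None  # step budget exhausted without finishing


# Stub components so the loop can run without a browser or an LLM.
pages = {"start": "home page", "https://example.com": "example page"}
state = {"url": "start"}


def observe() -> str:
    return pages[state["url"]]


def decide(task: str, observation: str, history: list) -> Action:
    if observation == "home page":
        return Action("navigate", "https://example.com")
    return Action("done", f"reached {observation}")


def execute(action: Action) -> None:
    if action.kind == "navigate":
        state["url"] = action.argument


result = run_agent_loop("open example.com", observe, decide, execute)
print(result)
```

Keeping the three roles as separate callables makes it possible to swap the observation source (screenshot versus accessibility tree) or the model without touching the loop itself.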
browser-use is one of the most widely adopted open-source Python libraries for this pattern. It integrates with Playwright and exposes a high-level agent interface:
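import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic


async def main():
    # The task is plain English; the agent decides which pages to visit and what to click.
    agent = Agent(
        task="Go to Hacker News and find the top 5 posts about AI agents today. Return the titles and URLs.",
        llm=ChatAnthropic(model="claude-opus-4-5"),
    )
    # run() drives the observe/decide/act loop until the agent reports completion.
    result = await agent.run()
    print(result)


asyncio.run(main())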
browser-use handles screenshots, action execution, and loop management; the LLM handles task understanding and action selection. Under the hood, Playwright controls a real Chromium browser.
For more control, you can use Playwright directly with an LLM:
import base64

import anthropic
from playwright.async_api import async_playwright


async def browser_agent(task: str, max_steps: int = 20):
    # Async client, so API calls do not block the event loop.
    client = anthropic.AsyncAnthropic()
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            for _ in range(max_steps):
                # Observe: capture the current page as a base64-encoded PNG.
                screenshot = await page.screenshot()
                screenshot_b64 = base64.b64encode(screenshot).decode()

                # Decide: ask the model for the next action.
                response = await client.messages.create(
                    model="claude-opus-4-5",
                    max_tokens=1024,
                    messages=[{
                        "role": "user",
                        "content": [
                            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": screenshot_b64}},
                            {"type": "text", "text": f"Task: {task}\nWhat action should I take next? Reply with: CLICK x,y or TYPE text or NAVIGATE url or DONE result"},
                        ],
                    }],
                )
                action = response.content[0].text.strip()

                # Act: execute the chosen action in the browser.
                if action.startswith("DONE"):
                    return action[len("DONE"):].strip()
                elif action.startswith("NAVIGATE"):
                    await page.goto(action[len("NAVIGATE"):].strip())
                elif action.startswith("CLICK"):
                    x, y = action[len("CLICK"):].split(",")
                    await page.mouse.click(int(x), int(y))
                elif action.startswith("TYPE"):
                    await page.keyboard.type(action[len("TYPE"):].strip())
            return None  # step budget exhausted without a DONE reply
        finally:
            # Close the browser on every exit path, including an early DONE return.
            await browser.close()
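This is a simplified version: it keeps no conversation history between steps, does not retry, and does not handle malformed replies. It does, however, illustrate the core loop without framework abstraction.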
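One fragile point in a hand-rolled loop is interpreting the model's reply, since an LLM may answer in lowercase, add stray whitespace, or ignore the format entirely. The following is a minimal sketch of a stricter parser for the `CLICK x,y`, `TYPE text`, `NAVIGATE url`, and `DONE result` reply format used in the prompt above. The `ParsedAction` type and the `parse_action` function are hypothetical names introduced for illustration; returning `None` for unrecognized replies is one possible design, which lets the caller decide whether to re-prompt or abort.

```python
import re
from typing import NamedTuple, Optional


class ParsedAction(NamedTuple):
    kind: str                # "CLICK", "TYPE", "NAVIGATE", or "DONE"
    text: str = ""           # payload for TYPE, NAVIGATE, and DONE
    x: Optional[int] = None  # CLICK coordinates in pixels
    y: Optional[int] = None


# Two non-negative integers separated by a comma, with optional spaces.
_CLICK_RE = re.compile(r"^\s*(\d+)\s*,\s*(\d+)\s*$")


def parse_action(reply: str) -> Optional[ParsedAction]:
    """Parse a model reply; return None when it does not match the protocol."""
    parts = reply.strip().split(maxsplit=1)
    if not parts:
        return None
    verb = parts[0].upper()
    rest = parts[1].strip() if len(parts) > 1 else ""

    if verb == "DONE":
        return ParsedAction("DONE", text=rest)  # an empty result is allowed
    if verb in ("TYPE", "NAVIGATE"):
        return ParsedAction(verb, text=rest) if rest else None
    if verb == "CLICK":
        match = _CLICK_RE.match(rest)
        if match:
            return ParsedAction("CLICK", x=int(match.group(1)), y=int(match.group(2)))
    return None


for reply in ["CLICK 120, 340", "navigate https://example.com", "DONE", "Let me think..."]:
    print(repr(reply), "->", parse_action(reply))
```

In the loop, a `None` result could be handled by sending the model a short reminder of the expected format and counting the attempt against the step budget.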
Practical Use Cases
Competitive intelligence is one of the highest-value use cases: monitoring competitor pricing, tracking product changes, and collecting market data from sources with no public API. A browser agent can log in, navigate to the right page, extract the data, and format it for analysis.
Form filling at scale: submitting the same information to multiple portals (insurance quotes, vendor applications, government registrations). A browser agent handles the variation in form design that breaks scripted automation.
Web scraping when APIs do not exist: extracting structured data from websites that block programmatic requests or do not expose an API. Because a browser agent drives a real browser, its traffic is generally harder to distinguish from a human visitor's than requests sent by a plain HTTP client.
Testing web applications with natural language test cases: describe user flows in plain English and have the agent execute them, reporting failures. This is slower than Playwright scripts but avoids most of the script maintenance that UI changes would otherwise require.
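As a rough illustration of the testing use case, the sketch below runs a set of plain-English flows through an agent and collects a pass/fail report. It assumes the agent is available as a callable that takes a task string and returns its final text. The `run_flows` and `FlowResult` names, the PASS/FAIL reply convention, and the `fake_agent` stand-in are all hypothetical and exist only so the example runs without a browser.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FlowResult:
    name: str
    passed: bool
    detail: str


def run_flows(flows: Dict[str, str], run_agent: Callable[[str], str]) -> List[FlowResult]:
    """Run each plain-English flow through the agent and collect verdicts."""
    results = []
    for name, description in flows.items():
        task = (
            f"Act as a tester. Perform this user flow: {description} "
            "Finish with a line starting with PASS or FAIL, followed by a short reason."
        )
        try:
            verdict = run_agent(task).strip()
        except Exception as exc:  # an agent crash counts as a failed flow
            results.append(FlowResult(name, False, f"agent error: {exc}"))
            continue
        # Only the last line is treated as the verdict; earlier lines may be narration.
        last_line = verdict.splitlines()[-1] if verdict else ""
        results.append(FlowResult(name, last_line.upper().startswith("PASS"), last_line))
    return results


def fake_agent(task: str) -> str:
    """Stand-in for a real browser agent, used only to make the sketch runnable."""
    if "cart" in task:
        return "Opened the shop.\nPASS cart shows 1 item"
    return "FAIL login button not found"


report = run_flows(
    {
        "add_to_cart": "Add any product to the cart and open the cart.",
        "login": "Log in with the demo account.",
    },
    fake_agent,
)
for r in report:
    print(f"{'PASS' if r.passed else 'FAIL'}  {r.name}: {r.detail}")
```

In a real setup, `run_agent` would wrap a browser agent such as the ones shown earlier, and the report could be fed into an existing test runner or CI job.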