# Directory Structure
```
├── .editorconfig
├── .gitattributes
├── .gitignore
├── Dockerfile
├── documentation
│   ├── CONFIGURATION.md
│   └── SECURITY.md
├── pyproject.toml
├── README.md
├── renovate.json
├── sample.env.env
├── smithery.yaml
├── src
│   └── mcp_browser_use
│       ├── __init__.py
│       ├── agent
│       │   ├── __init__.py
│       │   ├── custom_agent.py
│       │   ├── custom_massage_manager.py
│       │   ├── custom_prompts.py
│       │   └── custom_views.py
│       ├── browser
│       │   ├── __init__.py
│       │   └── browser_manager.py
│       ├── client.py
│       ├── controller
│       │   ├── __init__.py
│       │   └── custom_controller.py
│       ├── mcp_browser_use.py
│       ├── server.py
│       └── utils
│           ├── __init__.py
│           ├── agent_state.py
│           ├── logging.py
│           └── utils.py
└── tests
    ├── conftest.py
    ├── stubs
    │   ├── browser_use
    │   │   ├── __init__.py
    │   │   ├── agent
    │   │   │   ├── message_manager
    │   │   │   │   ├── service.py
    │   │   │   │   └── views.py
    │   │   │   ├── prompts.py
    │   │   │   ├── service.py
    │   │   │   └── views.py
    │   │   ├── browser
    │   │   │   ├── __init__.py
    │   │   │   ├── browser.py
    │   │   │   ├── context.py
    │   │   │   ├── events.py
    │   │   │   ├── profile.py
    │   │   │   └── views.py
    │   │   ├── controller
    │   │   │   ├── registry
    │   │   │   │   └── views.py
    │   │   │   └── service.py
    │   │   ├── telemetry
    │   │   │   └── views.py
    │   │   └── utils.py
    │   ├── langchain_core
    │   │   ├── language_models
    │   │   │   ├── __init__.py
    │   │   │   └── chat_models.py
    │   │   ├── messages
    │   │   │   └── __init__.py
    │   │   └── prompts
    │   │       └── __init__.py
    │   ├── langchain_openai
    │   │   ├── __init__.py
    │   │   └── chat_models
    │   │       ├── __init__.py
    │   │       └── base.py
    │   └── PIL
    │       └── __init__.py
    ├── test_agent_state.py
    ├── test_browser_manager.py
    ├── test_client_session.py
    ├── test_custom_agent_controller.py
    ├── test_gif_creation.py
    ├── test_logging_configuration.py
    ├── test_summarize_messages.py
    └── test_utils.py
```
# Files
--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
```
# Auto detect text files and perform LF normalization
* text=auto
```
--------------------------------------------------------------------------------
/.editorconfig:
--------------------------------------------------------------------------------
```
# Check http://editorconfig.org for more information
# This is the main config file for this project:
root = true
[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
indent_style = space
indent_size = 2
trim_trailing_whitespace = true
[*.{py,pyi}]
indent_style = space
indent_size = 4
[Makefile]
indent_style = tab
[*.md]
trim_trailing_whitespace = false
```
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# ignore the database
*.db
# ignore vscode settings
.vscode/
# Project Files
/*.json
target/
dbt_packages/
dbt_packages/*
logs/
/secrets/*
#mac pc specific - system configuration files
.DS_Store
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
```markdown
# MCP Browser Use Server
[Smithery](https://smithery.ai/server/@JovaniPink/mcp-browser-use)
> Model Context Protocol (MCP) server that wires [browser-use](https://github.com/browser-use/browser-use) into Claude Desktop and other MCP-compatible clients.
<a href="https://glama.ai/mcp/servers/tjea5rgnbv"><img width="380" height="200" src="https://glama.ai/mcp/servers/tjea5rgnbv/badge" alt="Browser Use Server MCP server" /></a>
## Overview
This repository provides a production-ready wrapper around the `browser-use` automation engine. It exposes a single MCP tool (`run_browser_agent`) that orchestrates a browser session, executes the `browser-use` agent, and returns the final result back to the client. The refactored layout focuses on keeping configuration in one place, improving testability, and keeping `browser-use` upgrades isolated from MCP-specific code.
### Key Capabilities
- **Automated browsing** – Navigate, interact with forms, control tabs, capture screenshots, and read page content through natural-language instructions executed by `browser-use`.
- **Agent lifecycle management** – `CustomAgent` wraps `browser-use`'s base agent to add history export, richer prompts, and consistent error handling across runs.
- **Centralised browser configuration** – `create_browser_session` translates environment variables into a ready-to-use `BrowserSession`, enabling persistent profiles, proxies, and custom Chromium flags without touching the agent logic.
- **FastMCP integration** – `server.py` registers the MCP tool, normalises configuration, and ensures the browser session is always cleaned up.
- **Client helpers** – `client.py` includes async helpers for tests or other Python processes that wish to exercise the MCP server in-process.
### Project Structure
```
.
├── documentation/
│   ├── CONFIGURATION.md        # Detailed configuration reference
│   └── SECURITY.md             # Security considerations for running the server
├── .env.example                # Example environment variables for local development
├── src/mcp_browser_use/
│   ├── agent/                  # Custom agent, prompts, message history, and views
│   ├── browser/                # Browser session factory and persistence helpers
│   ├── controller/             # Custom controller extensions for clipboard actions
│   ├── utils/                  # LLM factory, agent state helpers, encoding utilities
│   ├── client.py               # Async helper for connecting to the FastMCP app
│   └── server.py               # FastMCP app and the `run_browser_agent` tool
└── tests/                      # Unit tests covering server helpers and agent features
```
## Getting Started
### Requirements
- Python 3.11+
- Google Chrome or Chromium (for local automation)
- [`uv`](https://github.com/astral-sh/uv) for dependency management (recommended)
- Optional: Claude Desktop or another MCP-compatible client for integration testing
### Installation
```bash
git clone https://github.com/JovaniPink/mcp-browser-use.git
cd mcp-browser-use
uv sync
```
Copy `sample.env` to `.env` (or export the variables in another way) and update the values for the providers you plan to use.
### Launching the server
```bash
uv run mcp-browser-use
```
The command invokes the console script defined in `pyproject.toml`, starts the FastMCP application, and registers the `run_browser_agent` tool.
#### Using with Claude Desktop
Once the server is running you can register it inside Claude Desktop, for example:
```json
"mcpServers": {
"mcp_server_browser_use": {
"command": "uvx",
"args": ["mcp-browser-use"],
"env": {
"MCP_MODEL_PROVIDER": "anthropic",
"MCP_MODEL_NAME": "claude-3-5-sonnet-20241022"
}
}
}
```
### Debugging
For interactive debugging, use the [MCP Inspector](https://github.com/modelcontextprotocol/inspector):
```bash
npx @modelcontextprotocol/inspector uv --directory /path/to/project run mcp-browser-use
```
The inspector prints a URL that can be opened in the browser to watch tool calls and responses in real time.
## Configuration
A full list of environment variables and their defaults is available in [documentation/CONFIGURATION.md](documentation/CONFIGURATION.md). Highlights include:
- `MCP_MODEL_PROVIDER`, `MCP_MODEL_NAME`, `MCP_TEMPERATURE`, `MCP_MAX_STEPS`, `MCP_MAX_ACTIONS_PER_STEP`, and `MCP_USE_VISION` control the LLM and agent run.
- Provider-specific API keys and endpoints (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `DEEPSEEK_API_KEY`, `GOOGLE_API_KEY`, `AZURE_OPENAI_API_KEY`, etc.).
- Browser runtime flags (`BROWSER_USE_HEADLESS`, `BROWSER_USE_EXTRA_CHROMIUM_ARGS`, `CHROME_PERSISTENT_SESSION`, `BROWSER_USE_PROXY_URL`, ...).
Use `.env` + [`python-dotenv`](https://pypi.org/project/python-dotenv/) or your preferred secrets manager to keep credentials out of source control.
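As a quick sketch (assuming a local `.env` file next to the project), the variables can be loaded before the server starts:
```python
# Minimal sketch: populate os.environ from .env before the server reads it.
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory by default

# Imported after load_dotenv so any import-time configuration sees the variables.
from mcp_browser_use.server import launch_mcp_browser_use_server

launch_mcp_browser_use_server()
```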
## Running Tests
```bash
uv run pytest
```
The tests cover the custom agent behaviour, browser session factory, and other utility helpers.
## Security
Controlling a full browser instance remotely can grant broad access to the host machine. Review [documentation/SECURITY.md](documentation/SECURITY.md) before exposing the server to untrusted environments.
## Contributing
1. Fork the repository
2. Create your feature branch: `git checkout -b my-new-feature`
3. Commit your changes: `git commit -m 'Add some feature'`
4. Push to the branch: `git push origin my-new-feature`
5. Open a pull request
Bug reports and feature suggestions are welcome—please include logs and reproduction steps when applicable.
```
--------------------------------------------------------------------------------
/documentation/SECURITY.md:
--------------------------------------------------------------------------------
```markdown
# Security
> Below is a security review of the Browser-Use + MCP project, based on the codebase and standard security best practices. It is not an exhaustive penetration test but a systematic review of the major scripts and common pitfalls, along with suggestions for mitigating the identified risks.
1. Project Structure & High-Level Summary
The code layout is:
1. Main server code (server.py) that runs an async event loop (loop = asyncio.new_event_loop()) within __main__:
- Runs a FastMCP (Model Context Protocol) server.
- Exposes a tool endpoint to run a single “browser agent.”
2. Custom Agent under the agent directory and Related Classes:
- custom_agent.py: Inherits from a base Agent and implements logic to parse LLM output, execute browser actions, handle vision, and create history GIFs.
- custom_massage_manager.py: Builds and maintains the message history that is sent to the LLM.
- custom_prompts.py: Contains system-level instructions for the LLM to produce a structured JSON output.
- custom_views.py: Data classes (CustomAgentStepInfo, CustomAgentBrain) are used to store the agent’s state and output schema.
3. Custom Browser Components under the browser directory:
- config.py: Holds dataclasses for configuring Chrome (persistent sessions, debugging port).
- custom_browser.py: Subclass of Browser that handles launching or connecting to Chrome over a debugging port. It may disable some security flags or run headless.
- custom_context.py: Subclass of BrowserContext that can reuse an existing context or create new ones, load cookies, start traces, etc.
4. Controllers & Actions:
- custom_controller.py: Registers custom actions (copy/paste from clipboard).
5. Utilities:
- agent_state.py: Tracks a stop_requested event (via asyncio.Event) and optional “last valid state.” Implemented as a singleton (only one agent at a time).
- utils.py: Offers a get_llm_model function to create different LLM clients (OpenAI, Anthropic, Azure, etc.), as well as image encoding and file-tracking utilities.
The project runs a single agent at a time, hooking an LLM up to real browser actions. Let's go through the significant security aspects.
2. Identified Security Risks & Recommendations
Below are the main areas of concern based on the code we’ve seen and typical usage patterns.
2.1 Disabling Browser Security & Remote Debug Port
Where
- custom_browser.py:
- Allows launching Chrome with flags like --disable-web-security.
- Launches Chrome with --remote-debugging-port=9222.
Risks
1. Cross-Origin Attacks: Disabling web security (--disable-web-security, --disable-features=IsolateOrigins) allows malicious pages to read cross-origin data in the same browser instance. If the agent visits untrusted websites, it could inadvertently exfiltrate data from other open tabs or sessions.
2. Debug Port Exposure: A remote debugging port on 9222 (if bound to 0.0.0.0 or otherwise accessible externally) gives anyone who can connect full control of the browser. If not behind a firewall, an attacker can hijack the session.
Recommendations
1. Limit the usage of disable-web-security and related flags. Restrict this to internal/test scenarios or run it inside a hardened container or ephemeral environment.
2. Restrict Access to Port 9222 (see the sketch after this list):
- Bind to 127.0.0.1 only (--remote-debugging-address=127.0.0.1) so external hosts cannot connect.
- Use a firewall or security group to block external access.
- If remote access is required, use SSH tunneling rather than publicly exposing the port.
3. If you must open untrusted pages, create separate browser instances. This means not reusing the same “user data dir” or disabling security for critical tasks.
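As an illustrative sketch (assuming `create_browser_session` forwards `BROWSER_USE_EXTRA_CHROMIUM_ARGS` to Chromium, as the README describes), the DevTools endpoint can be pinned to the loopback interface:
```python
# Sketch: keep the remote debugging endpoint loopback-only when it must be enabled.
# Assumes BROWSER_USE_EXTRA_CHROMIUM_ARGS is a space-separated list of Chromium flags.
import os

hardened_flags = [
    "--remote-debugging-address=127.0.0.1",  # refuse DevTools connections from other hosts
    "--remote-debugging-port=9222",
]
os.environ["BROWSER_USE_EXTRA_CHROMIUM_ARGS"] = " ".join(hardened_flags)
```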
2.2 Global Singleton AgentState
Where
- agent_state.py implements a singleton that shares _stop_requested and last_valid_state across all agent references.
Risks
1. Concurrent Agents: If you (in the future) attempt to run multiple agents, the single AgentState object might cause cross-talk or unpredictable behavior (e.g., one agent’s stop request stops another).
2. Potential Race Conditions: If the code evolves to multi-thread or multi-process, the concurrency might not behave as expected.
Recommendations
1. Ensure Only One Agent: If that’s your design (a single agent at a time), the singleton is acceptable. Document it.
2. Remove the Singleton for multi-agent scenarios. Each agent can have its own state object.
2.3 Clipboard Actions
Where
- custom_controller.py registers actions like “Copy text to clipboard” and “Paste from clipboard.”
Risks
1. System Clipboard: Copy/paste goes through the OS-level clipboard (pyperclip), which can leak sensitive data if other apps or remote sessions can see the same clipboard.
2. Overwrite: The agent can overwrite a user’s clipboard or read from it unexpectedly.
Recommendations
1. Run in a Controlled Environment: It may be okay if you only do local development or a dedicated environment.
2. Use an In-Memory Clipboard: Instead of the actual system clipboard, implement a local memory store for copying and pasting within the agent's session. This prevents overwriting the user's system clipboard (see the sketch after this list).
3. Disable or Restrict these actions if you run in multi-user or production mode.
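A minimal sketch of such an in-memory clipboard, reusing the `registry.action` pattern from `custom_controller.py` (the action names mirror the existing controller; the buffer itself is illustrative):
```python
# Sketch: a session-local clipboard that never touches the OS clipboard.
from browser_use.agent.views import ActionResult
from browser_use.controller.service import Controller


class InMemoryClipboardController(Controller):
    def __init__(self) -> None:
        super().__init__()
        self._buffer = ""  # lives and dies with this controller instance
        self._register_clipboard_actions()

    def _register_clipboard_actions(self) -> None:
        @self.registry.action("Copy text to clipboard")
        def copy_to_clipboard(text: str) -> ActionResult:
            self._buffer = text
            return ActionResult(extracted_content=text)

        @self.registry.action("Paste text from clipboard")
        def paste_from_clipboard() -> ActionResult:
            return ActionResult(extracted_content=self._buffer)
```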
2.4 Logging Sensitive Data
Where
- Various scripts log LLM responses or user tasks.
- utils.py and other files read environment variables for API keys.
Risks
1. API Keys in Logs: If you ever log environment variables, they might contain secrets (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY).
2. Conversation Logs: LLM or browser actions might contain personal info or private data from pages the agent visits.
Recommendations
1. Scrub Sensitive Info: Partially redact environment variables and user data before they are written to logs (see the sketch after this list).
2. Control Log Levels: Keep debug logs for local dev; avoid them in production or store them in a secure location.
3. Never commit or print raw API keys or user credentials.
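One way to do this is a standard `logging.Filter` that masks known secret values before any handler writes them (a sketch; the variable names are the ones this project reads, and the filter is attached to each root handler):
```python
# Sketch: mask known secret values in log records before handlers emit them.
import logging
import os

SECRET_ENV_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "AZURE_OPENAI_API_KEY", "DEEPSEEK_API_KEY")


class RedactSecretsFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for name in SECRET_ENV_VARS:
            secret = os.getenv(name)
            if secret and secret in message:
                message = message.replace(secret, "***REDACTED***")
                record.msg, record.args = message, ()
        return True  # keep the record, just with secrets masked


for handler in logging.getLogger().handlers:
    handler.addFilter(RedactSecretsFilter())
```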
2.5 Environment Variables for API Keys
Where
- utils.py reads OPENAI_API_KEY, ANTHROPIC_API_KEY, AZURE_OPENAI_API_KEY, etc.
Risks
1. Credentials Leak: Other users or processes might read the keys if environment variables are stored insecurely or the machine is multi-tenant.
2. Rotation & Auditing: Keys are harder to rotate and audit if they are embedded in environment variables in multiple places.
Recommendations
1. Use a Secret Manager: For production, store keys in Vault, AWS Secrets Manager, or a similar service, injecting them at runtime with minimal exposure.
2. Lock Down or Mask your environment variables in logs.
2.6 Handling of Cookies & Persisted Sessions
Where
- custom_context.py loads cookies from a file and reuses them if cookies_file is set.
Risks
1. Cookie Theft: Cookies containing session tokens can be used to impersonate or access accounts.
2. Insecure Storage: If cookies_file is not locked down or is in a publicly accessible directory, attackers could read it.
Recommendations
1. Encrypt or Secure the cookie file if it’s sensitive.
2. Use ephemeral sessions if you don’t need persistence (this mitigates the risk of session hijacking).
3. Handle JSON Errors gracefully. The code might crash if the cookie file is corrupted or maliciously edited. Currently, you catch some exceptions, but be sure they are robust (a sketch covering both file permissions and corrupt files follows this list).
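A sketch along these lines (the path and helper are illustrative, not part of the current codebase):
```python
# Sketch: restrict cookie-file permissions and fail soft on corrupt contents.
import json
import os
from pathlib import Path

cookies_file = Path("~/.config/mcp-browser-use/cookies.json").expanduser()


def load_cookies() -> list[dict]:
    if not cookies_file.exists():
        return []
    os.chmod(cookies_file, 0o600)  # owner read/write only
    try:
        return json.loads(cookies_file.read_text())
    except (json.JSONDecodeError, OSError):
        return []  # treat a corrupted or unreadable file as "no cookies"
```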
2.7 LLM Output Execution
Where
- custom_agent.py uses the LLM output to determine subsequent actions in the browser. This effectively lets remote model output control the browser if it is not validated.
Risks
1. Prompt Injection or Malicious LLM Output: If an attacker can manipulate the prompt or the LLM’s instructions, they might cause harmful browsing actions (e.g., navigating to malicious pages, downloading malicious content, or exfiltrating data).
2. Excessive Trust: The agent automatically performs whatever actions the LLM specifies. If the LLM is compromised or intentionally produces malicious JSON, your system might become an attack vector.
Recommendations
1. Policy Layer: Before executing each action, you can add checks to ensure it's within a set of "allowed" domains or "allowed action types" (see the sketch after this list).
2. Safe Browsing: You could block navigation to known malicious or undesired domains.
3. Sandboxes: Run the browser in a locked-down Docker container or VM so the environment is contained even if the LLM instructs to visit a malicious link.
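A sketch of such a domain allowlist check that could run before each action is dispatched (the `go_to_url` key is illustrative of browser-use navigation actions; match it to the action schema you actually use):
```python
# Sketch: refuse navigation actions whose host is not on an explicit allowlist.
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "python.org"}


def is_action_allowed(action: dict) -> bool:
    navigate = action.get("go_to_url")
    if not navigate:
        return True  # non-navigation actions are left to other checks
    host = urlparse(navigate.get("url", "")).hostname or ""
    return any(host == domain or host.endswith("." + domain) for domain in ALLOWED_DOMAINS)
```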
2.8 Untrusted Web Content & Vision
Where
- The agent uses optional “vision-based element detection” or page screenshots.
Risks
1. Malicious Images: If the agent processes images from untrusted sources, ensure it’s safe from typical image library exploits (PIL is relatively safe, but keep it updated).
2. Screenshot capturing: If you store or send screenshots, you risk inadvertently capturing personal data or content.
Recommendations
1. Use the Latest Libraries: Keep PIL (pillow) updated to avoid known vulnerabilities in image parsing.
2. Handle Storage: If you store screenshots, do so in secure, short-lived storage with restricted access.
3. Summary of Key Security Practices
Based on the potential issues above, here’s a short checklist to ensure your system remains secure:
1. Networking & Ports:
- Bind remote debugging to 127.0.0.1 only.
- Use firewalls or SSH tunnels if remote access is necessary.
2. Sandboxing:
- Use Docker or a VM for your automation environment.
- Avoid --disable-web-security in production, or keep it in an isolated environment if you must use it.
3. Logging & Secrets:
- Never log API keys or raw environment variables.
- Redact sensitive info in logs.
- Use a secret manager to store credentials.
4. Clipboard & Persistence:
- Limit usage of system clipboard actions or implement an in-memory approach.
- If session data/cookies are reused, ensure the file and directory permissions are locked down.
5. LLM Output Validation:
- Consider a “policy layer” that checks which actions are allowed before executing them.
- Consider domain safelisting or an interactive approval step in critical scenarios.
6. Error Handling:
- Gracefully handle invalid JSON, cookies, or environment variables.
- Decide if you want to continue or fail fast with an error message.
7. Document your single-agent approach:
- The singleton approach is fine if you never plan multiple concurrent agents.
- Otherwise, remove it or ensure concurrency safety.
4. Verifying Project Structure
From a structural standpoint:
1. Modular & Readable: Your project is decently modular: custom_agent, custom_browser, custom_context, custom_controller, custom_prompts, etc.
2. Dependencies: You rely on playwright.async_api, pyperclip, requests, and the custom browser_use and langchain_* modules. Ensure they are pinned to known-safe versions (e.g., in a requirements.txt) and kept updated.
3. Single vs. Multi Agent: In your README or main docs, clarify whether you run only one agent at a time or whether concurrency is in scope.
4. Deployment: If you distribute or deploy this server, outline the usage of environment variables, the required ports, and the recommended containerization approach.
5. Conclusion
Your codebase is well-organized and functionally robust. The main security concerns revolve around:
- Remote Debugging & Disabling Security in Chrome.
- Clipboard & Cookie usage.
- LLM output leading to potentially dangerous actions if not validated.
- Logging & environment variables containing sensitive data.
You can mitigate most of these risks by containerizing or VM-isolating your environment, restricting your debugging port to localhost, carefully handling credentials and logs, and implementing a minimal policy layer for LLM-driven actions.
The project is in good shape, but you should document these security measures and carefully configure them, especially in environments other than internal development.
Next Steps:
- Implement or strengthen the recommended mitigation steps above.
- Periodically review dependencies for security patches.
- If this is a production-grade service, consider formal penetration testing or a threat model exercise to identify additional risks.
- Keep documentation clear about the single-agent design and environment variables, and recommend using a container or ephemeral environment to prevent lateral movement or data exfiltration.
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/agent/__init__.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/browser/__init__.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/controller/__init__.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/utils/__init__.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/browser/browser.py:
--------------------------------------------------------------------------------
```python
class Browser:
pass
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/controller/registry/views.py:
--------------------------------------------------------------------------------
```python
class ActionModel:
pass
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/agent/prompts.py:
--------------------------------------------------------------------------------
```python
class SystemPrompt:
pass
```
--------------------------------------------------------------------------------
/tests/stubs/langchain_openai/chat_models/base.py:
--------------------------------------------------------------------------------
```python
_convert_message_to_dict = lambda x: {}
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/agent/message_manager/service.py:
--------------------------------------------------------------------------------
```python
class MessageManager:
def __init__(self, *args, **kwargs):
pass
```
--------------------------------------------------------------------------------
/tests/stubs/langchain_core/messages/__init__.py:
--------------------------------------------------------------------------------
```python
class BaseMessage: pass
class HumanMessage: pass
class AIMessage: pass
class SystemMessage: pass
```
--------------------------------------------------------------------------------
/tests/stubs/langchain_openai/__init__.py:
--------------------------------------------------------------------------------
```python
from .chat_models import AzureChatOpenAI, ChatOpenAI
__all__ = ["ChatOpenAI", "AzureChatOpenAI"]
```
--------------------------------------------------------------------------------
/renovate.json:
--------------------------------------------------------------------------------
```json
{
"$schema": "https://docs.renovatebot.com/renovate-schema.json",
"extends": ["config:recommended"]
}
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/agent/service.py:
--------------------------------------------------------------------------------
```python
class Agent:
def __init__(self, *args, **kwargs):
self.history = kwargs.get('history', None)
self.generate_gif = False
```
--------------------------------------------------------------------------------
/tests/stubs/langchain_core/language_models/__init__.py:
--------------------------------------------------------------------------------
```python
class BaseChatModel:
def with_structured_output(self, *args, **kwargs):
return self
async def ainvoke(self, *args, **kwargs):
return {}
```
--------------------------------------------------------------------------------
/tests/stubs/langchain_core/language_models/chat_models.py:
--------------------------------------------------------------------------------
```python
class BaseChatModel:
async def ainvoke(self, *args, **kwargs):
return {}
def with_structured_output(self, *args, **kwargs):
return self
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/telemetry/views.py:
--------------------------------------------------------------------------------
```python
class AgentEndTelemetryEvent:
def __init__(self, *args, **kwargs):
pass
class AgentRunTelemetryEvent:
def __init__(self, *args, **kwargs):
pass
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/browser/events.py:
--------------------------------------------------------------------------------
```python
class SendKeysEvent:
def __init__(self, keys: str):
self.keys = keys
class ScreenshotEvent:
def __init__(self, full_page: bool = False):
self.full_page = full_page
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/utils.py:
--------------------------------------------------------------------------------
```python
def time_execution_async(name):
def decorator(func):
async def wrapper(*args, **kwargs):
return await func(*args, **kwargs)
return wrapper
return decorator
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/mcp_browser_use.py:
--------------------------------------------------------------------------------
```python
"""Public entry-points for backwards compatible imports."""
from __future__ import annotations
from .client import AgentNotRegisteredError, create_client_session
__all__ = ["AgentNotRegisteredError", "create_client_session"]
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/agent/message_manager/views.py:
--------------------------------------------------------------------------------
```python
from dataclasses import dataclass, field
from typing import Any, List
@dataclass
class MessageHistory:
messages: List[Any] = field(default_factory=list)
total_tokens: int = 0
@dataclass
class ManagedMessage:
message: Any
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/browser/__init__.py:
--------------------------------------------------------------------------------
```python
from .. import BrowserSession as Browser # noqa: F401
from .events import SendKeysEvent # noqa: F401
from .profile import BrowserProfile, ProxySettings # noqa: F401
__all__ = ["Browser", "BrowserProfile", "ProxySettings", "SendKeysEvent"]
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/browser/context.py:
--------------------------------------------------------------------------------
```python
class BrowserContextConfig:
def __init__(self, **kwargs):
for key, value in kwargs.items():
setattr(self, key, value)
class BrowserContext:
async def get_state(self, *args, **kwargs):
pass
async def close(self):
pass
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/browser/views.py:
--------------------------------------------------------------------------------
```python
from dataclasses import dataclass
@dataclass
class BrowserStateHistory:
url: str = ""
title: str = ""
tabs: list = None
interacted_element: list = None
screenshot: str | None = None
@dataclass
class BrowserState:
screenshot: str | None = None
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/browser/profile.py:
--------------------------------------------------------------------------------
```python
class ProxySettings:
def __init__(self, **kwargs):
for key, value in kwargs.items():
setattr(self, key, value)
class BrowserProfile:
def __init__(self, **kwargs):
for key, value in kwargs.items():
setattr(self, key, value)
```
--------------------------------------------------------------------------------
/tests/stubs/langchain_core/prompts/__init__.py:
--------------------------------------------------------------------------------
```python
class ChatPromptTemplate:
@staticmethod
def from_messages(msgs):
return ChatPromptTemplate()
def __or__(self, other):
return self
def invoke(self, data):
return ''
class MessagesPlaceholder:
def __init__(self, variable_name=''):
pass
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/__init__.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
"""MCP server for browser-use."""
from mcp_browser_use.mcp_browser_use import ( # noqa: F401
AgentNotRegisteredError,
create_client_session,
)
from mcp_browser_use.server import app, launch_mcp_browser_use_server
__all__ = [
"app",
"launch_mcp_browser_use_server",
"create_client_session",
"AgentNotRegisteredError",
]
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/controller/service.py:
--------------------------------------------------------------------------------
```python
class _Registry:
def get_prompt_description(self):
return ""
def create_action_model(self):
return type("ActionModel", (), {})
def action(self, *_args, **_kwargs):
def decorator(func):
return func
return decorator
class Controller:
def __init__(self):
self.registry = _Registry()
async def multi_act(self, actions, context): # pragma: no cover - stub
return []
```
--------------------------------------------------------------------------------
/tests/stubs/langchain_openai/chat_models/__init__.py:
--------------------------------------------------------------------------------
```python
class Base:
pass
class ChatOpenAI:
def __init__(self, *args, **kwargs):
pass
root_async_client = None
model_name = 'mock'
def with_structured_output(self, *args, **kwargs):
return self
async def ainvoke(self, *args, **kwargs):
return {}
class AzureChatOpenAI(ChatOpenAI):
"""Minimal stub mirroring the OpenAI chat client API."""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
```
--------------------------------------------------------------------------------
/tests/test_agent_state.py:
--------------------------------------------------------------------------------
```python
from mcp_browser_use.utils.agent_state import AgentState
def test_agent_state_stop_flow():
state = AgentState()
assert state.is_stop_requested() is False
state.request_stop()
assert state.is_stop_requested() is True
state.clear_stop()
assert state.is_stop_requested() is False
def test_agent_state_last_valid_state_reset():
state = AgentState()
marker = {"url": "https://example.com"}
state.set_last_valid_state(marker)
assert state.get_last_valid_state() == marker
state.clear_stop()
assert state.get_last_valid_state() is None
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/agent/views.py:
--------------------------------------------------------------------------------
```python
from dataclasses import dataclass, field
from typing import Any, List, Optional
@dataclass
class ActionResult:
extracted_content: Optional[str] = None
error: Optional[str] = None
is_done: bool = False
include_in_memory: bool = False
@dataclass
class AgentHistory:
model_output: Any
state: Any
result: List[ActionResult]
@dataclass
class AgentHistoryList:
history: List[AgentHistory] = field(default_factory=list)
def is_done(self) -> bool:
for h in self.history:
for r in h.result:
if r.is_done:
return True
return False
@dataclass
class AgentStepInfo:
step_number: int = 0
class AgentOutput:
pass
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/utils/logging.py:
--------------------------------------------------------------------------------
```python
"""Centralised logging configuration utilities for the MCP browser agent."""
from __future__ import annotations
import logging
import os
from typing import Optional
_DEFAULT_FORMAT = "%(asctime)s | %(levelname)s | %(name)s | %(message)s"
def _resolve_level(level_name: Optional[str]) -> int:
"""Translate a string level name into a numeric logging level."""
if not level_name:
return logging.INFO
try:
return int(level_name)
except ValueError:
resolved = logging.getLevelName(level_name.upper())
if isinstance(resolved, int):
return resolved
return logging.INFO
def configure_logging() -> None:
"""Configure the root logger once for the application."""
level = _resolve_level(os.getenv("LOG_LEVEL"))
root_logger = logging.getLogger()
if not root_logger.handlers:
logging.basicConfig(level=level, format=_DEFAULT_FORMAT)
else:
root_logger.setLevel(level)
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/utils/agent_state.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
"""
If we plan to scale or have multiple agents, we might remove the singleton pattern or differentiate them by agent ID.
"""
import asyncio
from typing import Any, Optional
class AgentState:
"""
Tracks an asynchronous stop signal and stores the last valid browser state.
request_stop() sets an asyncio.Event, is_stop_requested() checks if it's set,
clear_stop() resets the event and last_valid_state.
"""
def __init__(self) -> None:
self._stop_requested = asyncio.Event()
self._last_valid_state: Optional[Any] = None
def request_stop(self) -> None:
self._stop_requested.set()
def clear_stop(self) -> None:
self._stop_requested.clear()
self._last_valid_state = None
def is_stop_requested(self) -> bool:
return self._stop_requested.is_set()
def set_last_valid_state(self, state: Any) -> None:
self._last_valid_state = state
def get_last_valid_state(self) -> Optional[Any]:
return self._last_valid_state
```
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
```toml
[project]
name = "mcp_browser_use"
version = "0.1.0"
description = "This Python project is a FastAPI server implementing MCP Server protocol Browser automation via browser-use library."
readme = "README.md"
requires-python = ">=3.11"
license = { text = "MIT" }
classifiers = [
"Development Status :: 4 - Beta",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.11",
"Operating System :: OS Independent",
]
dependencies = [
"pydantic>=2.11.9",
"uvicorn>=0.37.0",
"browser-use>=0.7.9",
"fastapi>=0.117.1",
"fastmcp>=2.12.4",
"instructor>=1.11.3",
"langchain>=0.3.27",
"langchain-google-genai>=2.1.1",
"langchain-openai>=0.2.14",
"langchain-anthropic>=0.3.20",
"langchain-ollama>=0.2.2",
"openai>=1.109.1",
"pillow>=11.3.0",
"python-dotenv>=1.1.1",
"pyperclip>=1.11.0",
]
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.hatch.build.targets.wheel]
packages = ["src/mcp_browser_use"]
[project.scripts]
mcp-browser-use = "mcp_browser_use.server:launch_mcp_browser_use_server"
```
--------------------------------------------------------------------------------
/tests/test_logging_configuration.py:
--------------------------------------------------------------------------------
```python
"""Smoke tests around module imports and logging configuration."""
from __future__ import annotations
import importlib
import logging
import sys
from typing import Iterable
import pytest
MODULES_TO_TEST: Iterable[str] = (
"mcp_browser_use.controller.custom_controller",
"mcp_browser_use.utils.utils",
"mcp_browser_use.agent.custom_agent",
"mcp_browser_use.agent.custom_message_manager",
)
@pytest.mark.parametrize("module_name", MODULES_TO_TEST)
def test_module_import_does_not_call_basic_config(module_name: str, monkeypatch) -> None:
"""Ensure importing project modules does not invoke ``logging.basicConfig``."""
# Import once so that shared third-party dependencies are cached.
importlib.import_module(module_name)
sys.modules.pop(module_name, None)
calls: list[tuple[tuple[object, ...], dict[str, object]]] = []
def record_basic_config(*args: object, **kwargs: object) -> None:
calls.append((args, kwargs))
monkeypatch.setattr(logging, "basicConfig", record_basic_config)
importlib.import_module(module_name)
assert calls == [], f"Module {module_name} should not call logging.basicConfig during import"
```
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
```dockerfile
# Generated by https://smithery.ai. See: https://smithery.ai/docs/config#dockerfile
# Use a Python image with uv pre-installed
FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim AS uv
# Install the project into /app
WORKDIR /app
# Enable bytecode compilation
ENV UV_COMPILE_BYTECODE=1
# Copy from the cache instead of linking since it's a mounted volume
ENV UV_LINK_MODE=copy
# Install the project's dependencies using the lockfile and settings
RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,source=uv.lock,target=uv.lock \
--mount=type=bind,source=pyproject.toml,target=pyproject.toml \
uv sync --frozen --no-install-project --no-dev --no-editable
# Then, add the rest of the project source code and install it
# Installing separately from its dependencies allows optimal layer caching
ADD . /app
RUN --mount=type=cache,target=/root/.cache/uv \
uv sync --frozen --no-dev --no-editable
# Runtime image must provide the same Python version used to build the virtual environment above
FROM python:3.12-slim-bookworm
WORKDIR /app
COPY --from=uv /root/.local /root/.local
COPY --from=uv --chown=app:app /app/.venv /app/.venv
# Place executables in the environment at the front of the path
ENV PATH="/app/.venv/bin:$PATH"
# when running the container, add --db-path and a bind mount to the host's db file
ENTRYPOINT ["mcp-browser-use"]
```
--------------------------------------------------------------------------------
/tests/conftest.py:
--------------------------------------------------------------------------------
```python
"""Test fixtures and environment setup for the test suite."""
import importlib
import os
import sys
import types
BASE_DIR = os.path.dirname(__file__)
STUBS_DIR = os.path.join(BASE_DIR, "stubs")
SRC_DIR = os.path.join(os.path.dirname(BASE_DIR), "src")
for path in (STUBS_DIR, SRC_DIR):
if path not in sys.path:
sys.path.insert(0, path)
if "langchain_openai" not in sys.modules:
importlib.import_module("langchain_openai")
if "langchain_anthropic" not in sys.modules:
module = types.ModuleType("langchain_anthropic")
class ChatAnthropic: # type: ignore[too-many-ancestors]
def __init__(self, *args, **kwargs):
pass
module.ChatAnthropic = ChatAnthropic
sys.modules["langchain_anthropic"] = module
if "langchain_google_genai" not in sys.modules:
module = types.ModuleType("langchain_google_genai")
class ChatGoogleGenerativeAI: # type: ignore[too-many-ancestors]
def __init__(self, *args, **kwargs):
pass
module.ChatGoogleGenerativeAI = ChatGoogleGenerativeAI
sys.modules["langchain_google_genai"] = module
if "langchain_ollama" not in sys.modules:
module = types.ModuleType("langchain_ollama")
class ChatOllama: # type: ignore[too-many-ancestors]
def __init__(self, *args, **kwargs):
pass
module.ChatOllama = ChatOllama
sys.modules["langchain_ollama"] = module
```
--------------------------------------------------------------------------------
/tests/test_gif_creation.py:
--------------------------------------------------------------------------------
```python
import sys
import os
import base64
import io
# Add stub package path before importing CustomAgent
BASE_DIR = os.path.dirname(__file__)
sys.path.insert(0, os.path.join(BASE_DIR, "stubs"))
sys.path.insert(0, os.path.join(os.path.dirname(BASE_DIR), "src"))
from PIL import Image
from mcp_browser_use.agent.custom_agent import CustomAgent
from browser_use.agent.views import AgentHistoryList, AgentHistory, ActionResult
from browser_use.browser.views import BrowserStateHistory
class DummyState:
def __init__(self, thought: str):
self.current_state = type("Brain", (), {"thought": thought})()
def create_screenshot() -> str:
img = Image.new("RGB", (100, 100), color="white")
buf = io.BytesIO()
img.save(buf, format="PNG")
return base64.b64encode(buf.getvalue()).decode("utf-8")
def test_create_history_gif(tmp_path):
screenshot = create_screenshot()
hist = AgentHistoryList(
history=[
AgentHistory(
model_output=DummyState("step one"),
state=BrowserStateHistory(screenshot=screenshot),
result=[ActionResult(is_done=False)],
),
AgentHistory(
model_output=DummyState("step two"),
state=BrowserStateHistory(screenshot=screenshot),
result=[ActionResult(is_done=True)],
),
]
)
agent = CustomAgent.__new__(CustomAgent)
agent.history = hist
agent.task = "My Task"
output_gif = tmp_path / "out.gif"
agent.create_history_gif(output_path=str(output_gif))
assert output_gif.exists()
```
--------------------------------------------------------------------------------
/tests/test_browser_manager.py:
--------------------------------------------------------------------------------
```python
"""Tests for browser manager environment configuration helpers."""
from __future__ import annotations
import importlib
import pytest
browser_manager = importlib.import_module(
"mcp_browser_use.browser.browser_manager"
)
@pytest.fixture(autouse=True)
def clear_browser_env(monkeypatch):
"""Ensure browser-related environment variables do not leak between tests."""
for key in (
"BROWSER_USE_CDP_URL",
"CHROME_DEBUGGING_HOST",
"CHROME_DEBUGGING_PORT",
):
monkeypatch.delenv(key, raising=False)
def test_from_env_derives_cdp_url_from_debugging(monkeypatch):
"""When only debugging env vars are set, derive a CDP URL automatically."""
monkeypatch.setenv("CHROME_DEBUGGING_HOST", "debug.example")
monkeypatch.setenv("CHROME_DEBUGGING_PORT", "1337")
config = browser_manager.BrowserEnvironmentConfig.from_env()
assert config.cdp_url == "http://debug.example:1337"
def test_create_browser_session_preserves_computed_cdp_url(monkeypatch):
"""Computed CDP URL is passed to BrowserSession when overrides omit it."""
monkeypatch.setenv("CHROME_DEBUGGING_HOST", "localhost")
monkeypatch.setenv("CHROME_DEBUGGING_PORT", "9000")
captured_kwargs: dict[str, object] = {}
class DummyBrowserSession:
def __init__(self, **kwargs):
captured_kwargs.update(kwargs)
monkeypatch.setattr(browser_manager, "BrowserSession", DummyBrowserSession)
session = browser_manager.create_browser_session()
assert isinstance(session, DummyBrowserSession)
assert captured_kwargs["cdp_url"] == "http://localhost:9000"
```
--------------------------------------------------------------------------------
/tests/stubs/PIL/__init__.py:
--------------------------------------------------------------------------------
```python
class DummyImage:
def __init__(self, width=100, height=100):
self.width = width
self.height = height
self.mode = "RGBA"
@property
def size(self):
return (self.width, self.height)
def convert(self, mode):
self.mode = mode
return self
def resize(self, size, resample=None):
self.width, self.height = size
return self
def save(self, fp, *args, **kwargs):
if hasattr(fp, "write"):
fp.write(b"dummy")
else:
with open(fp, "wb") as f:
f.write(b"dummy")
def alpha_composite(self, other):
pass
def paste(self, img, pos, mask=None):
pass
class Image:
@staticmethod
def open(fp):
return DummyImage()
@staticmethod
def new(mode, size, color=(0, 0, 0, 0)):
return DummyImage(*size)
Resampling = type("Resampling", (), {"LANCZOS": 0})
Image = DummyImage
class ImageDraw:
class Draw:
def __init__(self, img):
pass
def text(self, *args, **kwargs):
pass
def rectangle(self, *args, **kwargs):
pass
def textbbox(self, xy, text, font=None):
# return left, top, right, bottom
return (0, 0, len(text) * 10, 10)
def textlength(self, text, font=None):
return len(text) * 10
ImageDraw = Draw
class ImageFont:
class FreeTypeFont:
pass
@staticmethod
def truetype(font, size):
return ImageFont.FreeTypeFont()
@staticmethod
def load_default():
return ImageFont.FreeTypeFont()
```
--------------------------------------------------------------------------------
/tests/test_custom_agent_controller.py:
--------------------------------------------------------------------------------
```python
import os
import sys
BASE_DIR = os.path.dirname(__file__)
sys.path.insert(0, os.path.join(BASE_DIR, "stubs"))
sys.path.insert(0, os.path.join(os.path.dirname(BASE_DIR), "src"))
import pytest
from langchain_core.language_models.chat_models import BaseChatModel
from unittest.mock import Mock
import mcp_browser_use.agent.custom_agent as custom_agent_module
@pytest.fixture
def custom_agent(monkeypatch):
class DummyMessageManager:
def __init__(self, *args, **kwargs):
pass
monkeypatch.setattr(
custom_agent_module,
"CustomMassageManager",
DummyMessageManager,
)
def fake_agent_init(self, *args, **kwargs):
for key, value in kwargs.items():
setattr(self, key, value)
# Set attributes not passed in kwargs that are needed
self.n_steps = 0
self._last_result = None
self.message_manager = None
self.history = None
self.generate_gif = False
monkeypatch.setattr(custom_agent_module.Agent, "__init__", fake_agent_init)
return custom_agent_module
def test_custom_agent_creates_independent_default_controllers(
custom_agent, monkeypatch
):
controllers = []
class TrackingController(custom_agent.Controller):
def __init__(self):
super().__init__()
controllers.append(self)
monkeypatch.setattr(custom_agent, "Controller", TrackingController)
llm = Mock(spec=BaseChatModel)
agent_one = custom_agent.CustomAgent(task="Task one", llm=llm)
agent_two = custom_agent.CustomAgent(task="Task two", llm=llm)
assert agent_one.controller is not agent_two.controller
assert controllers == [agent_one.controller, agent_two.controller]
def test_custom_agent_uses_supplied_controller(custom_agent):
llm = Mock(spec=BaseChatModel)
provided_controller = custom_agent.Controller()
agent = custom_agent.CustomAgent(
task="Task with supplied controller",
llm=llm,
controller=provided_controller,
)
assert agent.controller is provided_controller
```
--------------------------------------------------------------------------------
/sample.env.env:
--------------------------------------------------------------------------------
```
# ---------------------------
# API Keys (Replace as needed)
# ---------------------------
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
GOOGLE_API_KEY=your_google_api_key_here
AZURE_OPENAI_API_KEY=your_azure_api_key_here
DEEPSEEK_API_KEY=your_deepseek_api_key_here
# ----------------------------------
# Model Provider & Endpoint Settings
# ----------------------------------
# Typical endpoints; change to match your usage.
OPENAI_ENDPOINT=https://api.openai.com/v1
ANTHROPIC_API_ENDPOINT=https://api.anthropic.com
AZURE_OPENAI_ENDPOINT=https://your-azure-openai-endpoint
DEEPSEEK_ENDPOINT=https://api.deepseek.com
# ---------------------------
# Model & Agent Configuration
# ---------------------------
# Choose one provider: "openai", "anthropic", "azure_openai", "deepseek", "gemini", "ollama".
MCP_MODEL_PROVIDER=anthropic
MCP_MODEL_NAME=claude-3-5-sonnet-20241022
MCP_TEMPERATURE=0.3
MCP_MAX_STEPS=30
MCP_MAX_ACTIONS_PER_STEP=5
MCP_USE_VISION=true
MCP_TOOL_CALL_IN_CONTENT=true
# ---------------------------------
# Chrome / Playwright Configuration
# ---------------------------------
# If CHROME_PATH is set, the code will attempt to launch a locally installed Chrome
# with remote debugging on port 9222.
# If left empty, it will launch a standard Chromium instance via Playwright.
CHROME_PATH=/path/to/your/chrome/binary
CHROME_USER_DATA=/path/to/your/chrome-profile
CHROME_DEBUGGING_PORT=9222
CHROME_DEBUGGING_HOST=localhost
CHROME_PERSISTENT_SESSION=false
# You can add extra flags in your code if needed:
# Example: export CHROME_EXTRA_ARGS="--some-chrome-flag"
# --------------
# Other Settings
# --------------
# Adjust HEADLESS or DISABLE_SECURITY if your code checks them.
# By default, you might keep them out or set them in the code itself.
# HEADLESS=false
# DISABLE_SECURITY=false
# -------------
# Example Usage
# -------------
# Load this file with:
# source .env
# or use a library like python-dotenv or uv to manage environment variables.
# Note: In production or multi-user environments, never commit real API keys
# or share them publicly. Instead use a secrets manager or encrypted storage.
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/client.py:
--------------------------------------------------------------------------------
```python
"""Client helpers for interacting with the in-process FastMCP server."""
from __future__ import annotations
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator, Callable, Optional
from fastmcp.client import Client
from .server import app
class AgentNotRegisteredError(RuntimeError):
"""Error raised when attempting to control an agent that is not running."""
@asynccontextmanager
async def create_client_session(
client: Optional[Client] = None,
*,
client_factory: Optional[Callable[[], Client]] = None,
**client_kwargs: Any,
) -> AsyncIterator[Client]:
"""Create an asynchronous context manager for interacting with the server.
Parameters
----------
client:
An existing :class:`fastmcp.client.Client` instance. If provided, the
caller is responsible for its configuration. ``client_kwargs`` must not
be supplied in this case.
client_factory:
Optional callable used to lazily construct a client. This is useful in
testing where a lightweight stub client might be injected. If provided,
the callable is invoked with no arguments and ``client_kwargs`` must not
be supplied.
**client_kwargs:
Additional keyword arguments forwarded to :class:`fastmcp.client.Client`
when neither ``client`` nor ``client_factory`` is provided.
Yields
------
Client
A connected FastMCP client ready for use within the context manager.
"""
if client is not None and client_factory is not None:
raise ValueError("Provide either 'client' or 'client_factory', not both.")
if client is not None and client_kwargs:
raise ValueError(
"'client_kwargs' cannot be used when an explicit client instance is provided."
)
if client_factory is not None and client_kwargs:
raise ValueError("'client_kwargs' cannot be combined with 'client_factory'.")
if client is not None:
session_client = client
elif client_factory is not None:
session_client = client_factory()
else:
session_client = Client(app, **client_kwargs)
async with session_client as connected_client:
yield connected_client
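# Example usage (illustrative sketch): drive the in-process server from a test or
# script. It assumes the connected FastMCP client exposes ``call_tool``; adjust the
# call to the fastmcp version in use.
#
#     import asyncio
#
#     async def main() -> None:
#         async with create_client_session() as client:
#             result = await client.call_tool(
#                 "run_browser_agent", {"task": "Open example.com and report the title"}
#             )
#             print(result)
#
#     asyncio.run(main())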
```
--------------------------------------------------------------------------------
/tests/stubs/browser_use/__init__.py:
--------------------------------------------------------------------------------
```python
class _DummyEvent:
def __await__(self):
async def _noop():
return None
return _noop().__await__()
async def event_result(self, *args, **kwargs): # pragma: no cover - stub method
return None
class _DummyEventBus:
def dispatch(self, event): # noqa: D401 - simple stub
return _DummyEvent()
class BrowserPage:
def __init__(self, **kwargs):
for key, value in kwargs.items():
setattr(self, key, value)
self.event_bus = _DummyEventBus()
async def close(self) -> None: # pragma: no cover - stub method
return None
class Browser:
"""Lightweight stub mirroring the public Browser API used in tests."""
def __init__(self, **kwargs):
for key, value in kwargs.items():
setattr(self, key, value)
self._pages: list[BrowserPage] = []
self._started = False
async def start(self): # pragma: no cover - stub method
self._started = True
return self
async def stop(self): # pragma: no cover - stub method
self._started = False
return None
async def new_page(self, **kwargs):
page = BrowserPage(**kwargs)
self._pages.append(page)
return page
async def close(self): # pragma: no cover - compatibility alias
return await self.stop()
class BrowserSession(Browser): # pragma: no cover - stub class
async def kill(self): # pragma: no cover - stub method
return await self.stop()
class BrowserProfile:  # pragma: no cover - stub class
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)
class ProxySettings: # pragma: no cover - stub class
def __init__(self, **kwargs):
for key, value in kwargs.items():
setattr(self, key, value)
# Alias maintained for compatibility with production package
Browser = BrowserSession
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/controller/custom_controller.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
import logging
import sys
import pyperclip
from browser_use import BrowserSession
from browser_use.agent.views import ActionResult
from browser_use.controller.service import Controller
logger = logging.getLogger(__name__)
class CustomController(Controller):
"""
A custom controller registering two clipboard actions: copy and paste.
"""
def __init__(self):
super().__init__()
self._register_custom_actions()
def _register_custom_actions(self) -> None:
"""Register all custom browser actions for this controller."""
@self.registry.action("Copy text to clipboard")
def copy_to_clipboard(text: str) -> ActionResult:
"""
Copy the given text to the system's clipboard.
Returns an ActionResult with the same text as extracted_content.
"""
try:
pyperclip.copy(text)
# Be cautious about logging the actual text, if sensitive
logger.debug("Copied text to clipboard.")
return ActionResult(extracted_content=text)
except Exception as e:
logger.error(f"Error copying text to clipboard: {e}")
return ActionResult(error=str(e), extracted_content=None)
@self.registry.action("Paste text from clipboard", requires_browser=True)
async def paste_from_clipboard(browser_session: BrowserSession) -> ActionResult:
"""
Paste whatever is currently in the system's clipboard
into the active browser page by using the send_keys tool.
"""
try:
text = pyperclip.paste()
except Exception as e:
logger.error(f"Error reading text from clipboard: {e}")
return ActionResult(error=str(e), extracted_content=None)
try:
modifier = "meta" if sys.platform == "darwin" else "ctrl"
# Use the documented tool via the registry
await self.registry.execute_action(
"send_keys",
{"keys": f"{modifier}+v"},
browser_session=browser_session,
)
logger.debug("Triggered paste shortcut inside the browser session.")
return ActionResult(extracted_content=text)
except Exception as e:
logger.error(f"Error pasting text into the browser session: {e}")
return ActionResult(error=str(e), extracted_content=None)
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/agent/custom_views.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
from dataclasses import dataclass
from typing import List, Type
from browser_use.agent.views import AgentOutput
from browser_use.controller.registry.views import ActionModel
from pydantic import BaseModel, ConfigDict, Field, create_model
@dataclass
class CustomAgentStepInfo:
"""
Holds metadata about a single step of the agent's execution.
:param step_number: Which step number we're currently on.
:param max_steps: Total maximum steps before we stop.
:param task: The primary task assigned to the agent.
:param add_infos: Additional contextual info or instructions.
:param memory: Cumulative memory or context from previous steps.
:param task_progress: Text describing progress toward the task goal.
"""
step_number: int
max_steps: int
task: str
add_infos: str
memory: str
task_progress: str
class CustomAgentBrain(BaseModel):
"""
Represents the agent's 'thinking' or ephemeral state during processing.
:param prev_action_evaluation: String evaluation of the last action performed (success/failure).
:param important_contents: Key points or memory extracted from the environment.
:param completed_contents: Completed portion of the task so far.
:param thought: Agent's internal reasoning or thought process text.
:param summary: Short summary of the agent's current state or progress.
"""
prev_action_evaluation: str
important_contents: str
completed_contents: str
thought: str
summary: str
class CustomAgentOutput(AgentOutput):
"""
Output model for the agent. Extended at runtime with custom actions
by 'type_with_custom_actions'.
"""
model_config = ConfigDict(arbitrary_types_allowed=True)
current_state: CustomAgentBrain
action: List[ActionModel]
@staticmethod
def type_with_custom_actions(
custom_actions: Type[ActionModel],
) -> Type["CustomAgentOutput"]:
"""
Create a new Pydantic model that inherits from CustomAgentOutput
but redefines the 'action' field to be a list of the given
custom action model.
:param custom_actions: The action model type from the controller registry.
:return: A new Pydantic model class based on CustomAgentOutput.
"""
return create_model(
# Could rename to something more specific if needed
"AgentOutput",
__base__=CustomAgentOutput,
action=(List[custom_actions], Field(...)),
__module__=CustomAgentOutput.__module__,
)
```
--------------------------------------------------------------------------------
/smithery.yaml:
--------------------------------------------------------------------------------
```yaml
# Smithery configuration file: https://smithery.ai/docs/config#smitheryyaml
startCommand:
type: stdio
configSchema:
# JSON Schema defining the configuration options for the MCP.
type: object
required:
- openaiApiKey
- anthropicApiKey
- mcpModelProvider
- mcpModelName
properties:
openaiApiKey:
type: string
description: API key for OpenAI services.
anthropicApiKey:
type: string
description: API key for Anthropic services.
googleApiKey:
type: string
description: API key for Google services (optional).
azureOpenaiEndpoint:
type: string
description: Azure OpenAI endpoint (optional).
azureOpenaiApiKey:
type: string
description: Azure OpenAI API key (optional).
chromePath:
type: string
description: Path to Chrome executable (optional).
chromeUserData:
type: string
description: Path to Chrome user data directory (optional).
chromeDebuggingPort:
type: string
default: "9222"
description: Chrome debugging port. Default is 9222.
chromeDebuggingHost:
type: string
default: localhost
description: Chrome debugging host. Default is localhost.
chromePersistentSession:
type: boolean
default: false
description: Keep browser open between tasks.
mcpModelProvider:
type: string
description: Model provider (e.g., anthropic, openai).
mcpModelName:
type: string
description: Model name.
mcpTemperature:
type: number
default: 0.3
description: Model temperature.
mcpMaxSteps:
type: number
default: 30
description: Max steps for model.
mcpUseVision:
type: boolean
default: true
description: Use vision capabilities.
mcpMaxActionsPerStep:
type: number
default: 5
description: Max actions per step.
commandFunction:
# A function that produces the CLI command to start the MCP on stdio.
|-
(config) => ({ command: 'uv', args: ['run', 'mcp-browser-use'], env: { OPENAI_API_KEY: config.openaiApiKey, ANTHROPIC_API_KEY: config.anthropicApiKey, GOOGLE_API_KEY: config.googleApiKey, AZURE_OPENAI_ENDPOINT: config.azureOpenaiEndpoint, AZURE_OPENAI_API_KEY: config.azureOpenaiApiKey, CHROME_PATH: config.chromePath, CHROME_USER_DATA: config.chromeUserData, CHROME_DEBUGGING_PORT: config.chromeDebuggingPort || '9222', CHROME_DEBUGGING_HOST: config.chromeDebuggingHost || 'localhost', CHROME_PERSISTENT_SESSION: config.chromePersistentSession, MCP_MODEL_PROVIDER: config.mcpModelProvider, MCP_MODEL_NAME: config.mcpModelName, MCP_TEMPERATURE: config.mcpTemperature || 0.3, MCP_MAX_STEPS: config.mcpMaxSteps || 30, MCP_USE_VISION: config.mcpUseVision, MCP_MAX_ACTIONS_PER_STEP: config.mcpMaxActionsPerStep || 5 } })
```
--------------------------------------------------------------------------------
/tests/test_summarize_messages.py:
--------------------------------------------------------------------------------
```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
import mcp_browser_use.agent.custom_agent as custom_agent_module
from mcp_browser_use.agent.custom_agent import CustomAgent
from browser_use.agent.message_manager.views import MessageHistory, ManagedMessage
class FakeLLM:
def __init__(self, content: str = "Conversation summary"):
self.calls = []
self._content = content
def invoke(self, input, **kwargs):
self.calls.append(input)
message = AIMessage(content=self._content)
return message
def __call__(self, input, **kwargs):
return self.invoke(input, **kwargs)
class DummyMessageManager:
def __init__(self, extra_messages: int = 6):
self.system_prompt = SystemMessage(content="System instructions")
self.example_tool_call = AIMessage(content="[]")
self.example_tool_call.tool_calls = []
self.reset_calls = 0
self.history = MessageHistory()
self.reset_history()
for idx in range(extra_messages):
human = HumanMessage(content=f"User message {idx}")
self._add_message_with_tokens(human)
def get_messages(self):
return [managed.message for managed in self.history.messages]
def reset_history(self) -> None:
self.reset_calls += 1
self.history = MessageHistory()
self.history.messages = []
if hasattr(self.history, "total_tokens"):
self.history.total_tokens = 0
self._add_message_with_tokens(self.system_prompt)
self._add_message_with_tokens(self.example_tool_call)
def _add_message_with_tokens(self, message):
self.history.messages.append(ManagedMessage(message=message))
if hasattr(self.history, "total_tokens"):
self.history.total_tokens += 1
def test_summarize_messages_preserves_system_prompt(monkeypatch):
class StubChain:
def __init__(self, llm):
self.llm = llm
def invoke(self, data):
return self.llm.invoke(data)
class StubPrompt:
def __or__(self, llm):
return StubChain(llm)
class StubChatPromptTemplate:
@staticmethod
def from_messages(messages):
return StubPrompt()
monkeypatch.setattr(
custom_agent_module,
"ChatPromptTemplate",
StubChatPromptTemplate,
)
agent = CustomAgent.__new__(CustomAgent)
agent.llm = FakeLLM()
agent.message_manager = DummyMessageManager()
assert len(agent.message_manager.get_messages()) > 5
# Ensure the initial reset was performed
assert agent.message_manager.reset_calls == 1
result = agent.summarize_messages()
assert result is True
assert agent.message_manager.reset_calls == 2
history_messages = agent.message_manager.history.messages
assert len(history_messages) == 3
assert [entry.message for entry in history_messages[:2]] == [
agent.message_manager.system_prompt,
agent.message_manager.example_tool_call,
]
assert history_messages[2].message.content == "Conversation summary"
if hasattr(agent.message_manager.history, "total_tokens"):
assert agent.message_manager.history.total_tokens == len(history_messages)
# Ensure the LLM was called with the conversation
assert len(agent.llm.calls) == 1
prompt_value = agent.llm.calls[0]
assert isinstance(prompt_value, dict)
assert "chat_history" in prompt_value
```
--------------------------------------------------------------------------------
/tests/test_client_session.py:
--------------------------------------------------------------------------------
```python
import importlib
import pytest
from mcp_browser_use import client as client_module
from mcp_browser_use.client import AgentNotRegisteredError, create_client_session
@pytest.fixture
def anyio_backend():
return "asyncio"
@pytest.mark.anyio("asyncio")
async def test_create_client_session_uses_supplied_client():
events = []
class DummyClient:
def __init__(self):
self.connected = False
async def __aenter__(self):
events.append("enter")
self.connected = True
return self
async def __aexit__(self, exc_type, exc, tb):
events.append("exit")
self.connected = False
dummy = DummyClient()
async with create_client_session(client=dummy) as session:
assert session is dummy
assert dummy.connected
assert events == ["enter", "exit"]
assert dummy.connected is False
@pytest.mark.anyio("asyncio")
async def test_create_client_session_accepts_factory():
events = []
class DummyClient:
async def __aenter__(self):
events.append("enter")
return self
async def __aexit__(self, exc_type, exc, tb):
events.append("exit")
async with create_client_session(client_factory=DummyClient) as session:
assert isinstance(session, DummyClient)
assert events == ["enter", "exit"]
@pytest.mark.anyio("asyncio")
async def test_create_client_session_rejects_mixed_arguments():
class DummyClient:
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc, tb):
pass
dummy = DummyClient()
with pytest.raises(ValueError):
async with create_client_session(client=dummy, timeout=5):
pass
with pytest.raises(ValueError):
async with create_client_session(client_factory=DummyClient, timeout=5):
pass
with pytest.raises(ValueError):
async with create_client_session(client=dummy, client_factory=DummyClient):
pass
@pytest.mark.anyio("asyncio")
async def test_create_client_session_constructs_default_client(monkeypatch):
created = {}
class DummyClient:
def __init__(self, app, **kwargs):
created["app"] = app
created["kwargs"] = kwargs
async def __aenter__(self):
created["entered"] = True
return self
async def __aexit__(self, exc_type, exc, tb):
created["exited"] = True
monkeypatch.setattr("mcp_browser_use.client.Client", DummyClient)
async with create_client_session(timeout=5) as session:
assert isinstance(session, DummyClient)
assert created["app"] is client_module.app
assert created["kwargs"] == {"timeout": 5}
assert created["entered"] is True
assert created["exited"] is True
@pytest.mark.anyio("asyncio")
async def test_create_client_session_kwargs_with_factory_raise():
class DummyClient:
async def __aenter__(self):
return self
async def __aexit__(self, exc_type, exc, tb):
pass
kwargs = {"client_factory": DummyClient, "timeout": 10}
with pytest.raises(ValueError):
async with create_client_session(**kwargs):
pass
@pytest.mark.parametrize(
"legacy_module",
[
"mcp_browser",
"mcp_browser.use",
"mcp_browser.use.mcp_browser_use",
],
)
def test_legacy_namespace_is_removed(legacy_module):
with pytest.raises(ModuleNotFoundError):
importlib.import_module(legacy_module)
def test_exception_type():
assert issubclass(AgentNotRegisteredError, RuntimeError)
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/agent/custom_massage_manager.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
from __future__ import annotations
import copy
import logging
from typing import List, Optional, Type
from browser_use.agent.message_manager.service import MessageManager
from browser_use.agent.message_manager.views import MessageHistory
from browser_use.agent.prompts import SystemPrompt
from browser_use.agent.views import ActionResult, AgentStepInfo
from browser_use.browser.views import BrowserState
from langchain_core.language_models import BaseChatModel
from langchain_core.messages import HumanMessage, AIMessage
from mcp_browser_use.agent.custom_prompts import CustomAgentMessagePrompt
logger = logging.getLogger(__name__)
class CustomMassageManager(MessageManager):
def __init__(
self,
llm: BaseChatModel,
task: str,
action_descriptions: str,
system_prompt_class: Type[SystemPrompt],
max_input_tokens: int = 128000,
estimated_tokens_per_character: int = 3,
image_tokens: int = 800,
include_attributes: list[str] = [],
max_error_length: int = 400,
max_actions_per_step: int = 10,
tool_call_in_content: bool = False,
):
super().__init__(
llm=llm,
task=task,
action_descriptions=action_descriptions,
system_prompt_class=system_prompt_class,
max_input_tokens=max_input_tokens,
estimated_tokens_per_character=estimated_tokens_per_character,
image_tokens=image_tokens,
include_attributes=include_attributes,
max_error_length=max_error_length,
max_actions_per_step=max_actions_per_step,
tool_call_in_content=tool_call_in_content,
)
# Store template for example tool call so we can rebuild the history when needed
self.tool_call_in_content = tool_call_in_content
self._example_tool_call_template = [
{
"name": "CustomAgentOutput",
"args": {
"current_state": {
"prev_action_evaluation": "Unknown - No previous actions to evaluate.",
"important_contents": "",
"completed_contents": "",
"thought": "Now Google is open. Need to type OpenAI to search.",
"summary": "Type OpenAI to search.",
},
"action": [],
},
"id": "",
"type": "tool_call",
}
]
self.reset_history()
def _create_example_tool_call_message(self) -> AIMessage:
tool_calls = copy.deepcopy(self._example_tool_call_template)
if self.tool_call_in_content:
# OpenAI errors out if tool_calls are left unanswered, so move them into the content instead
return AIMessage(
content=f"{tool_calls}",
tool_calls=[],
)
return AIMessage(
content="",
tool_calls=tool_calls,
)
def reset_history(self) -> None:
"""Reset the message history to the initial seeded state."""
self.history = MessageHistory()
if hasattr(self.history, "total_tokens"):
self.history.total_tokens = 0
self._add_message_with_tokens(self.system_prompt)
self._add_message_with_tokens(self._create_example_tool_call_message())
def add_state_message(
self,
state: BrowserState,
result: Optional[List[ActionResult]] = None,
step_info: Optional[AgentStepInfo] = None,
) -> None:
"""Add browser state as human message"""
# If a result is flagged to keep in memory, add it directly to the history and add the state message without results
if result:
for r in result:
if r.include_in_memory:
if r.extracted_content:
msg = HumanMessage(content=str(r.extracted_content))
self._add_message_with_tokens(msg)
if r.error:
msg = HumanMessage(
content=str(r.error)[-self.max_error_length :]
)
self._add_message_with_tokens(msg)
result = None # if result in history, we dont want to add it again
# otherwise add state message and result to next message (which will not stay in memory)
state_message = CustomAgentMessagePrompt(
state,
result,
include_attributes=self.include_attributes,
max_error_length=self.max_error_length,
step_info=step_info,
).get_user_message()
self._add_message_with_tokens(state_message)
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/server.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
from mcp_browser_use.utils.logging import configure_logging
# It is critical to configure logging before any other modules are imported,
# as they might initialize logging themselves.
configure_logging()
import asyncio
import logging
import os
import sys
import traceback
from typing import Any, Optional
from browser_use import Browser
from fastmcp import FastMCP
from mcp_browser_use.agent.custom_agent import CustomAgent
from mcp_browser_use.controller.custom_controller import CustomController
from mcp_browser_use.browser.browser_manager import create_browser_session
from mcp_browser_use.utils import utils
from mcp_browser_use.utils.agent_state import AgentState
logger = logging.getLogger(__name__)
app = FastMCP("mcp_browser_use")
@app.tool()
async def run_browser_agent(task: str, add_infos: str = "") -> str:
"""
This is the entrypoint for running a browser-based agent.
:param task: The main instruction or goal for the agent.
:param add_infos: Additional information or context for the agent.
:return: The final result string from the agent run.
"""
browser_session: Optional[Browser] = None
agent_state = AgentState()
try:
# Clear any previous agent stop signals
agent_state.clear_stop()
# Read environment variables with defaults and parse carefully
# Fall back to defaults if parsing fails.
model_provider = os.getenv("MCP_MODEL_PROVIDER", "anthropic")
model_name = os.getenv("MCP_MODEL_NAME", "claude-3-5-sonnet-20241022")
def safe_float(env_var: str, default: float) -> float:
"""Safely parse a float from an environment variable."""
try:
return float(os.getenv(env_var, str(default)))
except ValueError:
logger.warning(f"Invalid float for {env_var}, using default={default}")
return default
def safe_int(env_var: str, default: int) -> int:
"""Safely parse an int from an environment variable."""
try:
return int(os.getenv(env_var, str(default)))
except ValueError:
logger.warning(f"Invalid int for {env_var}, using default={default}")
return default
# Get environment variables with defaults
temperature = safe_float("MCP_TEMPERATURE", 0.3)
max_steps = safe_int("MCP_MAX_STEPS", 30)
use_vision = os.getenv("MCP_USE_VISION", "true").lower() == "true"
max_actions_per_step = safe_int("MCP_MAX_ACTIONS_PER_STEP", 5)
tool_call_in_content = (
os.getenv("MCP_TOOL_CALL_IN_CONTENT", "true").lower() == "true"
)
# Prepare LLM
llm = utils.get_llm_model(
provider=model_provider, model_name=model_name, temperature=temperature
)
# Create a fresh browser session for this run
browser_session = create_browser_session()
await browser_session.start()
# Create controller and agent
controller = CustomController()
agent = CustomAgent(
task=task,
add_infos=add_infos,
use_vision=use_vision,
llm=llm,
browser_session=browser_session,
controller=controller,
max_actions_per_step=max_actions_per_step,
tool_call_in_content=tool_call_in_content,
agent_state=agent_state,
)
# Execute the agent task lifecycle
history = await agent.execute_agent_task(max_steps=max_steps)
# Extract final result from the agent's history
final_result = history.final_result()
if not final_result:
final_result = f"No final result. Possibly incomplete. {history}"
return final_result
except Exception as e:
logger.error("run-browser-agent error: %s", str(e))
raise ValueError(f"run-browser-agent error: {e}\n{traceback.format_exc()}")
finally:
# Always ensure cleanup, even if no error.
try:
agent_state.request_stop()
except Exception as stop_error:
logger.warning("Error stopping agent state: %s", stop_error)
if browser_session:
try:
await browser_session.stop()
except Exception as browser_error:
logger.warning(
"Failed to stop browser session gracefully, killing it: %s",
browser_error,
)
if hasattr(browser_session, "kill"):
await browser_session.kill()
def launch_mcp_browser_use_server() -> None:
"""
Entry point for running the FastMCP application.
Handles server start and final resource cleanup.
"""
try:
app.run()
except Exception as e:
logger.error("Error running MCP server: %s\n%s", e, traceback.format_exc())
if __name__ == "__main__":
launch_mcp_browser_use_server()
```
--------------------------------------------------------------------------------
/tests/test_utils.py:
--------------------------------------------------------------------------------
```python
import base64
import importlib
import importlib.util
import os
import sys
import time
import types
import pytest
# Path to utils module
ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
UTILS_PATH = os.path.join(ROOT, "src", "mcp_browser_use", "utils", "utils.py")
# Provide dummy langchain modules if they are not installed
if "langchain_openai" not in sys.modules:
module = types.ModuleType("langchain_openai")
class ChatOpenAI:
def __init__(self, *args, **kwargs):
pass
class AzureChatOpenAI:
def __init__(self, *args, **kwargs):
pass
module.ChatOpenAI = ChatOpenAI
module.AzureChatOpenAI = AzureChatOpenAI
sys.modules["langchain_openai"] = module
if "langchain_anthropic" not in sys.modules:
module = types.ModuleType("langchain_anthropic")
class ChatAnthropic:
def __init__(self, *args, **kwargs):
pass
module.ChatAnthropic = ChatAnthropic
sys.modules["langchain_anthropic"] = module
if "langchain_google_genai" not in sys.modules:
module = types.ModuleType("langchain_google_genai")
class ChatGoogleGenerativeAI:
def __init__(self, *args, **kwargs):
pass
module.ChatGoogleGenerativeAI = ChatGoogleGenerativeAI
sys.modules["langchain_google_genai"] = module
if "langchain_ollama" not in sys.modules:
module = types.ModuleType("langchain_ollama")
class ChatOllama:
def __init__(self, *args, **kwargs):
pass
module.ChatOllama = ChatOllama
sys.modules["langchain_ollama"] = module
if "browser_use" not in sys.modules:
browser_use_module = types.ModuleType("browser_use")
browser_module = types.ModuleType("browser_use.browser")
events_module = types.ModuleType("browser_use.browser.events")
class ScreenshotEvent:
def __init__(self, full_page: bool = False):
self.full_page = full_page
events_module.ScreenshotEvent = ScreenshotEvent
browser_module.events = events_module
browser_use_module.browser = browser_module
sys.modules["browser_use"] = browser_use_module
sys.modules["browser_use.browser"] = browser_module
sys.modules["browser_use.browser.events"] = events_module
# Import utils module directly from file after stubbing dependencies
spec = importlib.util.spec_from_file_location("mcp_browser_use.utils.utils", UTILS_PATH)
utils = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utils)
@pytest.fixture
def anyio_backend():
return "asyncio"
def test_get_llm_model_returns_chatopenai():
model = utils.get_llm_model("openai")
assert isinstance(model, utils.ChatOpenAI)
def test_get_llm_model_unknown_provider_raises():
with pytest.raises(ValueError):
utils.get_llm_model("unknown")
def test_encode_image_handles_empty_path():
assert utils.encode_image(None) is None
assert utils.encode_image("") is None
def test_encode_image_roundtrip(tmp_path):
image_path = tmp_path / "image.bin"
payload = b"test-bytes"
image_path.write_bytes(payload)
encoded = utils.encode_image(str(image_path))
assert encoded == base64.b64encode(payload).decode("utf-8")
def test_encode_image_missing_file(tmp_path):
with pytest.raises(FileNotFoundError):
utils.encode_image(str(tmp_path / "missing.bin"))
def test_get_latest_files_creates_directory(tmp_path):
target = tmp_path / "captures"
result = utils.get_latest_files(str(target), file_types=[".webm", ".zip"])
assert target.exists()
assert result == {".webm": None, ".zip": None}
def test_get_latest_files_skips_recent_files(tmp_path, monkeypatch):
directory = tmp_path / "captures"
directory.mkdir()
recent_path = directory / "recent.webm"
recent_path.write_text("recent")
now = time.time()
os.utime(recent_path, (now, now))
monkeypatch.setattr(utils.time, "time", lambda: now)
result = utils.get_latest_files(str(directory), file_types=[".webm"])
assert result == {".webm": None}
@pytest.mark.anyio("asyncio")
async def test_capture_screenshot_uses_event_bus():
screenshot_payload = base64.b64encode(b"payload").decode("utf-8")
class DummyEvent:
def __init__(self, result):
self._result = result
self.awaited = False
def __await__(self):
async def _wait():
self.awaited = True
return self
return _wait().__await__()
async def event_result(self, raise_if_any=True, raise_if_none=True):
return self._result
class DummyEventBus:
def __init__(self, dispatched_event):
self._event = dispatched_event
self.dispatched = []
def dispatch(self, event):
self.dispatched.append(event)
return self._event
class DummyBrowserSession:
def __init__(self, event_bus):
self.event_bus = event_bus
dummy_event = DummyEvent(screenshot_payload)
event_bus = DummyEventBus(dummy_event)
session = DummyBrowserSession(event_bus)
encoded = await utils.capture_screenshot(session)
assert encoded == screenshot_payload
assert dummy_event.awaited is True
assert len(event_bus.dispatched) == 1
assert isinstance(event_bus.dispatched[0], utils.ScreenshotEvent)
@pytest.mark.anyio("asyncio")
async def test_capture_screenshot_returns_none_on_error():
class DummyErrorEvent:
def __await__(self):
async def _wait():
return self
return _wait().__await__()
async def event_result(self, raise_if_any=True, raise_if_none=True):
raise RuntimeError("boom")
class DummyEventBus:
def dispatch(self, event):
return DummyErrorEvent()
class DummyBrowserSession:
def __init__(self):
self.event_bus = DummyEventBus()
session = DummyBrowserSession()
result = await utils.capture_screenshot(session)
assert result is None
```
--------------------------------------------------------------------------------
/documentation/CONFIGURATION.md:
--------------------------------------------------------------------------------
```markdown
# Configuration Guide
This guide describes every configuration option recognised by the MCP Browser Use server. All settings can be supplied as environment variables (e.g. via a `.env` file loaded with [`python-dotenv`](https://pypi.org/project/python-dotenv/)) or injected by your MCP client.
The sample file at [`sample.env.example`](../sample.env.example) contains a ready-to-copy template with placeholders for secrets.
## How configuration is loaded
1. **Model & Agent settings** are read in [`server.py`](../src/mcp_browser_use/server.py). They control the language model as well as the agent run loop.
2. **Browser runtime settings** are parsed in [`browser/browser_manager.py`](../src/mcp_browser_use/browser/browser_manager.py) which returns a configured `BrowserSession` instance.
3. **Provider specific credentials** are consumed by the LLM factory in [`utils/utils.py`](../src/mcp_browser_use/utils/utils.py).
Browser runtime boolean flags treat any of `1`, `true`, `yes`, `on` (case insensitive) as **true**; any other value is considered **false**. The agent flags `MCP_USE_VISION` and `MCP_TOOL_CALL_IN_CONTENT` are stricter: only the literal string `true` (case insensitive) enables them.
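For reference, the browser runtime flags are evaluated with a simple membership check; the sketch below mirrors the parsing logic in [`browser_manager.py`](../src/mcp_browser_use/browser/browser_manager.py):

    _BOOL_TRUE = {"1", "true", "yes", "on"}

    headless = os.getenv("BROWSER_USE_HEADLESS", "false").lower() in _BOOL_TRUE
    disable_security = os.getenv("BROWSER_USE_DISABLE_SECURITY", "false").lower() in _BOOL_TRUE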
## Core Agent Options
| Variable | Default | Description |
| --- | --- | --- |
| `MCP_MODEL_PROVIDER` | `anthropic` | LLM provider name passed to the LangChain factory. Supported values: `anthropic`, `openai`, `deepseek`, `gemini`, `ollama`, `azure_openai`. |
| `MCP_MODEL_NAME` | `claude-3-5-sonnet-20241022` | Model identifier sent to the provider. Each provider supports its own model list. |
| `MCP_TEMPERATURE` | `0.3` | Sampling temperature for the model. Parsed as float. |
| `MCP_MAX_STEPS` | `30` | Maximum number of reasoning/action steps before aborting the run. Parsed as integer. |
| `MCP_MAX_ACTIONS_PER_STEP` | `5` | Limits how many tool invocations the agent may issue in a single step. Parsed as integer. |
| `MCP_USE_VISION` | `true` | Enables vision features within the agent (element snapshots). |
| `MCP_TOOL_CALL_IN_CONTENT` | `true` | Whether tool call payloads are expected inside the model response content. |
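As a quick illustration (the values are placeholders, not recommendations), the core agent settings can be set before launching the server. The snippet uses Python's `os.environ` purely for readability; the same keys can live in your `.env` file or be injected by your MCP client:

    import os

    os.environ["MCP_MODEL_PROVIDER"] = "openai"
    os.environ["MCP_MODEL_NAME"] = "gpt-4o"
    os.environ["MCP_TEMPERATURE"] = "0.3"
    os.environ["MCP_MAX_STEPS"] = "30"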
## Provider Credentials & Endpoints
The LLM factory reads the following variables when initialising clients. Only set the values for the provider(s) you actively use.
| Variable | Purpose |
| --- | --- |
| `ANTHROPIC_API_KEY` | API key for Anthropic Claude models. |
| `OPENAI_API_KEY` | API key for OpenAI models. |
| `DEEPSEEK_API_KEY` | API key for DeepSeek hosted models. |
| `GOOGLE_API_KEY` | API key for Google Gemini via LangChain Google Generative AI. |
| `AZURE_OPENAI_API_KEY` | API key for Azure OpenAI deployments. |
| `AZURE_OPENAI_ENDPOINT` | Endpoint URL for the Azure OpenAI deployment. |
| `OPENAI_ENDPOINT` | Override the OpenAI base URL (useful for proxies). |
| `DEEPSEEK_ENDPOINT` | Base URL for the DeepSeek-compatible endpoint. |
| `ANTHROPIC_API_ENDPOINT` | Alternative base URL for Anthropic (rarely needed). |
When pointing to self-hosted or otherwise compatible services, you can also override the defaults by passing a provider-specific `base_url` in your own code. See [`utils/utils.py`](../src/mcp_browser_use/utils/utils.py) for the full mapping.
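As an example, the factory can be called directly. The sketch below assumes `OPENAI_API_KEY` is already set in the environment and uses a placeholder proxy URL for the optional `base_url` override:

    from mcp_browser_use.utils import utils

    llm = utils.get_llm_model(
        "openai",
        model_name="gpt-4o",
        temperature=0.3,
        base_url="https://my-proxy.example/v1",  # optional; falls back to OPENAI_ENDPOINT
    )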
## Browser Runtime Options
These options are parsed by [`BrowserEnvironmentConfig.from_env`](../src/mcp_browser_use/browser/browser_manager.py) and control Chromium launch behaviour.
| Variable | Default | Description |
| --- | --- | --- |
| `CHROME_PATH` | _unset_ | Absolute path to a Chrome/Chromium executable. Leave unset to let `browser-use` manage Chromium via Playwright. |
| `CHROME_USER_DATA` | _unset_ | Directory to store user data (profiles, cookies). Required when `CHROME_PERSISTENT_SESSION` is true. |
| `CHROME_PERSISTENT_SESSION` | `false` | Keeps the browser profile between runs by mounting `CHROME_USER_DATA`. |
| `CHROME_DEBUGGING_PORT` | _unset_ | Remote debugging port for attaching to an existing Chrome instance. Must be an integer. |
| `CHROME_DEBUGGING_HOST` | _unset_ | Hostname/IP for remote debugging (e.g. `localhost`). |
| `BROWSER_USE_HEADLESS` | `false` | Launch Chromium in headless mode. |
| `BROWSER_USE_DISABLE_SECURITY` | `false` | Disables web security features (CORS, sandbox). Use with caution. |
| `BROWSER_USE_EXTRA_CHROMIUM_ARGS` | _unset_ | Comma-separated list of additional Chromium command-line flags. |
| `BROWSER_USE_ALLOWED_DOMAINS` | _unset_ | Comma-separated allowlist limiting which domains the agent may open. |
| `BROWSER_USE_PROXY_URL` | _unset_ | HTTP/HTTPS proxy URL. |
| `BROWSER_USE_NO_PROXY` | _unset_ | Hosts to bypass in proxy mode. |
| `BROWSER_USE_PROXY_USERNAME` | _unset_ | Username for proxy authentication. |
| `BROWSER_USE_PROXY_PASSWORD` | _unset_ | Password for proxy authentication. |
| `BROWSER_USE_CDP_URL` | _unset_ | Connect to an existing Chrome DevTools Protocol endpoint instead of launching a new browser. |
### Persistence hints
- When `CHROME_PERSISTENT_SESSION` is true and `CHROME_USER_DATA` is not provided, the server logs a warning and the session falls back to ephemeral storage.
- Remote debugging settings (`CHROME_DEBUGGING_HOST` / `CHROME_DEBUGGING_PORT`) are optional and ignored if invalid values are supplied. The server logs a warning and continues with defaults.
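When `BROWSER_USE_CDP_URL` is unset but a debugging host or port is provided, the manager derives the endpoint itself; roughly (a sketch of the logic in [`browser_manager.py`](../src/mcp_browser_use/browser/browser_manager.py)):

    cdp_url = os.getenv("BROWSER_USE_CDP_URL") or None
    if not cdp_url and (debugging_host or debugging_port):
        cdp_url = f"http://{debugging_host or '127.0.0.1'}:{debugging_port or 9222}"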
## Additional Environment Variables
Some ancillary features inspect the following variables:
| Variable | Purpose |
| --- | --- |
| `WIN_FONT_DIR` | Custom Windows font directory used when generating GIF summaries of browsing sessions. |
## Tips for managing configuration
- Store secrets outside of version control. If an `.env` file is ever shared or leaked, redact the values or rotate the keys immediately.
- Keep provider-specific settings grouped so you can switch model providers quickly when testing.
- Start with the defaults, confirm the agent behaves as expected, then tighten security by restricting `BROWSER_USE_ALLOWED_DOMAINS` and enabling headless mode.
- When experimenting locally, keep `CHROME_PERSISTENT_SESSION=false` to avoid stale cookies interfering with automation runs.
For any options not covered here, consult the upstream [`browser-use` documentation](https://github.com/browser-use/browser-use) which explains additional environment variables recognised by the underlying library.
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/browser/browser_manager.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
"""Utility helpers for configuring and creating :class:`BrowserSession` instances.
This module consolidates the thin wrappers that previously lived in
``custom_browser.py``, ``custom_context.py``, and ``config.py``. The new structure
centralises environment parsing so ``server.py`` can simply request a configured
browser session without re-implementing the translation from environment
variables to ``BrowserSession`` keyword arguments.
"""
from __future__ import annotations
import logging
import os
from dataclasses import dataclass
from typing import Any, Dict, Optional
from browser_use import BrowserSession
from browser_use.browser.profile import ProxySettings
logger = logging.getLogger(__name__)
_BOOL_TRUE = {"1", "true", "yes", "on"}
@dataclass(slots=True)
class BrowserPersistenceConfig:
"""Configuration for browser persistence and remote debugging settings."""
persistent_session: bool = False
user_data_dir: Optional[str] = None
debugging_port: Optional[int] = None
debugging_host: Optional[str] = None
@classmethod
def from_env(cls) -> "BrowserPersistenceConfig":
persistent_session = (
os.getenv("CHROME_PERSISTENT_SESSION", "").lower() in _BOOL_TRUE
)
user_data_dir = os.getenv("CHROME_USER_DATA") or None
debugging_port: Optional[int]
port_value = os.getenv("CHROME_DEBUGGING_PORT")
if port_value:
try:
debugging_port = int(port_value)
except ValueError:
logger.warning(
"Invalid CHROME_DEBUGGING_PORT=%r, ignoring debug port setting.",
port_value,
)
debugging_port = None
else:
debugging_port = None
debugging_host = os.getenv("CHROME_DEBUGGING_HOST") or None
return cls(
persistent_session=persistent_session,
user_data_dir=user_data_dir,
debugging_port=debugging_port,
debugging_host=debugging_host,
)
@dataclass(slots=True)
class BrowserEnvironmentConfig:
"""All runtime settings required for instantiating ``BrowserSession``."""
headless: bool = False
disable_security: bool = False
executable_path: Optional[str] = None
args: Optional[list[str]] = None
allowed_domains: Optional[list[str]] = None
proxy: Optional[ProxySettings] = None
cdp_url: Optional[str] = None
user_data_dir: Optional[str] = None
def to_kwargs(self) -> Dict[str, Any]:
"""Convert to keyword arguments understood by :class:`BrowserSession`."""
kwargs: Dict[str, Any] = {
"headless": self.headless,
"disable_security": self.disable_security,
"executable_path": self.executable_path,
"args": self.args,
"allowed_domains": self.allowed_domains,
"proxy": self.proxy,
"cdp_url": self.cdp_url,
"user_data_dir": self.user_data_dir,
}
# Remove ``None`` values so BrowserSession can rely on its defaults.
return {key: value for key, value in kwargs.items() if value is not None}
@classmethod
def from_env(cls) -> "BrowserEnvironmentConfig":
persistence = BrowserPersistenceConfig.from_env()
headless = os.getenv("BROWSER_USE_HEADLESS", "false").lower() in _BOOL_TRUE
disable_security = (
os.getenv("BROWSER_USE_DISABLE_SECURITY", "false").lower() in _BOOL_TRUE
)
executable_path = os.getenv("CHROME_PATH") or None
extra_args_env = os.getenv("BROWSER_USE_EXTRA_CHROMIUM_ARGS")
args = None
if extra_args_env:
args = [arg.strip() for arg in extra_args_env.split(",") if arg.strip()]
allowed_domains_env = os.getenv("BROWSER_USE_ALLOWED_DOMAINS")
allowed_domains = None
if allowed_domains_env:
allowed_domains = [
domain.strip()
for domain in allowed_domains_env.split(",")
if domain.strip()
]
proxy_url = os.getenv("BROWSER_USE_PROXY_URL")
proxy: Optional[ProxySettings] = None
if proxy_url:
proxy = ProxySettings(
server=proxy_url,
bypass=os.getenv("BROWSER_USE_NO_PROXY"),
username=os.getenv("BROWSER_USE_PROXY_USERNAME"),
password=os.getenv("BROWSER_USE_PROXY_PASSWORD"),
)
cdp_url = os.getenv("BROWSER_USE_CDP_URL") or None
if not cdp_url and (persistence.debugging_host or persistence.debugging_port):
host = persistence.debugging_host or "127.0.0.1"
port = persistence.debugging_port or 9222
cdp_url = f"http://{host}:{port}"
user_data_dir = None
if persistence.persistent_session:
if persistence.user_data_dir:
user_data_dir = persistence.user_data_dir
else:
logger.warning(
"CHROME_PERSISTENT_SESSION requested but CHROME_USER_DATA was not provided."
)
return cls(
headless=headless,
disable_security=disable_security,
executable_path=executable_path,
args=args,
allowed_domains=allowed_domains,
proxy=proxy,
cdp_url=cdp_url,
user_data_dir=user_data_dir,
)
def create_browser_session(
overrides: Optional[Dict[str, Any]] = None,
) -> BrowserSession:
"""Instantiate a :class:`BrowserSession` using environment defaults.
``overrides`` can be supplied to fine-tune the resulting session. A key set to
``None`` removes the corresponding environment-derived value so ``BrowserSession``
falls back to its own default, letting callers override only a subset of values.
"""
config = BrowserEnvironmentConfig.from_env()
kwargs = config.to_kwargs()
if overrides:
for key, value in overrides.items():
if value is not None:
kwargs[key] = value
elif key in kwargs:
# An explicit ``None`` removes the environment-derived value so
# BrowserSession falls back to its internal default.
kwargs.pop(key)
logger.debug(
"Creating BrowserSession with kwargs: %s",
{k: v for k, v in kwargs.items() if k != "proxy"},
)
return BrowserSession(**kwargs)
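# Illustrative usage (a sketch, not part of the module's public API):
#
#     session = create_browser_session({"headless": True})   # force headless mode
#     session = create_browser_session({"headless": None})   # drop the env-derived value
#
# Passing ``None`` for a key removes the environment-derived setting so
# ``BrowserSession`` falls back to its own default.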
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/utils/utils.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
import base64
import logging
import os
import time
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple, Type
from browser_use.browser.events import ScreenshotEvent
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama
from langchain_openai import AzureChatOpenAI, ChatOpenAI
logger = logging.getLogger(__name__)
def _anthropic_params(kwargs: Dict[str, Any]) -> Dict[str, Any]:
return {
"model_name": kwargs.get("model_name", "claude-3-5-sonnet-20240620"),
"temperature": kwargs.get("temperature", 0.0),
"base_url": kwargs.get("base_url") or "https://api.anthropic.com",
"api_key": kwargs.get("api_key") or os.getenv("ANTHROPIC_API_KEY", ""),
}
def _openai_params(kwargs: Dict[str, Any]) -> Dict[str, Any]:
return {
"model": kwargs.get("model_name", "gpt-4o"),
"temperature": kwargs.get("temperature", 0.0),
"base_url": kwargs.get("base_url")
or os.getenv("OPENAI_ENDPOINT", "https://api.openai.com/v1"),
"api_key": kwargs.get("api_key") or os.getenv("OPENAI_API_KEY", ""),
}
def _deepseek_params(kwargs: Dict[str, Any]) -> Dict[str, Any]:
return {
"model": kwargs.get("model_name", "deepseek-chat"),
"temperature": kwargs.get("temperature", 0.0),
"base_url": kwargs.get("base_url") or os.getenv("DEEPSEEK_ENDPOINT", ""),
"api_key": kwargs.get("api_key") or os.getenv("DEEPSEEK_API_KEY", ""),
}
def _gemini_params(kwargs: Dict[str, Any]) -> Dict[str, Any]:
return {
"model": kwargs.get("model_name", "gemini-2.0-flash-exp"),
"temperature": kwargs.get("temperature", 0.0),
"google_api_key": kwargs.get("api_key") or os.getenv("GOOGLE_API_KEY", ""),
}
def _ollama_params(kwargs: Dict[str, Any]) -> Dict[str, Any]:
return {
"model": kwargs.get("model_name", "phi4"),
"temperature": kwargs.get("temperature", 0.0),
"num_ctx": kwargs.get("num_ctx", 128000),
"base_url": kwargs.get("base_url", "http://localhost:11434"),
}
def _azure_openai_params(kwargs: Dict[str, Any]) -> Dict[str, Any]:
return {
"model": kwargs.get("model_name", "gpt-4o"),
"temperature": kwargs.get("temperature", 0.0),
"api_version": kwargs.get("api_version", "2024-05-01-preview"),
"azure_endpoint": kwargs.get("base_url")
or os.getenv("AZURE_OPENAI_ENDPOINT", ""),
"api_key": kwargs.get("api_key") or os.getenv("AZURE_OPENAI_API_KEY", ""),
}
LLM_PROVIDERS: Dict[str, Tuple[Type, Callable[[Dict[str, Any]], Dict[str, Any]]]] = {
"anthropic": (ChatAnthropic, _anthropic_params),
"openai": (ChatOpenAI, _openai_params),
"deepseek": (ChatOpenAI, _deepseek_params),
"gemini": (ChatGoogleGenerativeAI, _gemini_params),
"ollama": (ChatOllama, _ollama_params),
"azure_openai": (AzureChatOpenAI, _azure_openai_params),
}
def get_llm_model(provider: str, **kwargs) -> Any:
"""
Return an initialized language model client based on the given provider name.
:param provider: The name of the LLM provider (e.g., "anthropic", "openai", "azure_openai").
:param kwargs: Additional parameters (model_name, temperature, base_url, api_key, etc.).
:return: An instance of a ChatLLM from the relevant langchain_* library.
:raises ValueError: If the provider is unsupported.
"""
try:
llm_class, params_builder = LLM_PROVIDERS[provider]
except KeyError as error:
raise ValueError(f"Unsupported provider: {provider}") from error
provider_kwargs = params_builder(kwargs)
return llm_class(**provider_kwargs)
# Commonly used model names for quick reference
model_names = {
"anthropic": ["claude-3-5-sonnet-20240620", "claude-3-opus-20240229"],
"openai": ["gpt-4o", "gpt-4", "gpt-3.5-turbo"],
"deepseek": ["deepseek-chat"],
"gemini": [
"gemini-2.0-flash-exp",
"gemini-2.0-flash-thinking-exp",
"gemini-1.5-flash-latest",
"gemini-1.5-flash-8b-latest",
"gemini-2.0-flash-thinking-exp-1219",
],
"ollama": ["deepseek-r1:671b", "qwen2.5:7b", "llama3.3", "phi4"],
"azure_openai": ["gpt-4o", "gpt-4", "gpt-3.5-turbo"],
}
def encode_image(img_path: Optional[str]) -> Optional[str]:
"""
Convert an image at `img_path` into a base64-encoded string.
Returns None if `img_path` is None or empty.
Raises FileNotFoundError if the file doesn't exist.
"""
if not img_path:
return None
try:
with open(img_path, "rb") as image_file:
image_data = base64.b64encode(image_file.read()).decode("utf-8")
return image_data
except FileNotFoundError as error:
logger.error(f"Image not found at path {img_path}: {error}")
raise
except Exception as error:
logger.error(f"Error encoding image at {img_path}: {error}")
raise
def get_latest_files(
directory: str, file_types: List[str] = [".webm", ".zip"]
) -> Dict[str, Optional[str]]:
"""
Find the latest file for each extension in `file_types` under `directory`.
Returns a dict {file_extension: latest_file_path or None}.
:param directory: The directory to search.
:param file_types: List of file extensions (e.g., [".webm", ".zip"]).
:return: dict mapping each extension to the path of the newest file or None if not found.
"""
latest_files: Dict[str, Optional[str]] = {ext: None for ext in file_types}
if not os.path.exists(directory):
logger.debug(f"Directory '{directory}' does not exist. Creating it.")
os.makedirs(directory, exist_ok=True)
return latest_files
for file_type in file_types:
try:
matching_files = list(Path(directory).rglob(f"*{file_type}"))
if matching_files:
# Pick the most recently modified matching file via max()
most_recent_file = max(
matching_files, key=lambda path: path.stat().st_mtime
)
# Check if file is not actively being written
if time.time() - most_recent_file.stat().st_mtime > 1.0:
latest_files[file_type] = str(most_recent_file)
else:
logger.debug(
f"Skipping file {most_recent_file} - possibly still being written."
)
except Exception as error:
logger.error(
f"Error getting latest {file_type} file in '{directory}': {error}"
)
return latest_files
async def capture_screenshot(browser_session) -> Optional[str]:
"""Capture a screenshot of the current page using the browser-use event bus."""
if not hasattr(browser_session, "event_bus"):
logger.error("Browser session does not have an event_bus.")
return None
try:
event = browser_session.event_bus.dispatch(ScreenshotEvent(full_page=False))
await event
result = await event.event_result(raise_if_any=True, raise_if_none=True)
return result
except Exception as error:
logger.error(f"Failed to capture screenshot via event bus: {error}")
return None
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/agent/custom_prompts.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
from typing import List, Optional
from browser_use.agent.prompts import SystemPrompt
from browser_use.agent.views import ActionResult
from browser_use.browser.views import BrowserState
from langchain_core.messages import HumanMessage, SystemMessage
from mcp_browser_use.agent.custom_views import CustomAgentStepInfo
class CustomSystemPrompt(SystemPrompt):
"""
Custom system prompt that extends SystemPrompt to inject additional
formatting rules and instructions for the AI agent.
"""
def important_rules(self) -> str:
"""
Return a detailed multiline string describing how the agent
must format its JSON response, handle multiple actions, forms,
navigation, and the maximum actions per step.
The text includes guidelines for:
- JSON response format
- Action sequences
- Element interaction
- Navigation & error handling
- Task completion
- Visual context usage
- Handling form filling and suggestions
"""
text = r"""
1. RESPONSE FORMAT: You must ALWAYS respond with valid JSON in this exact format:
{
"current_state": {
"prev_action_evaluation": "Success|Failed|Unknown - Analyze the current elements and the image to check if the previous goals/actions are successful like intended by the task. Ignore the action result. The website is the ground truth. Also mention if something unexpected happened like new suggestions in an input field. Shortly state why/why not. Note that the result you output must be consistent with the reasoning you output afterwards. If you consider it to be 'Failed,' you should reflect on this during your thought.",
"important_contents": "Output important contents closely related to user's instruction or task on the current page. If there is, please output the contents. If not, please output empty string ''.",
"completed_contents": "Update the input Task Progress. Completed contents is a general summary of the current contents that have been completed. Just summarize the contents that have been actually completed based on the current page and the history operations. Please list each completed item individually, such as: 1. Input username. 2. Input Password. 3. Click confirm button",
"thought": "Think about the requirements that have been completed in previous operations and the requirements that need to be completed in the next one operation. If the output of prev_action_evaluation is 'Failed', please reflect and output your reflection here. If you think you have entered the wrong page, consider to go back to the previous page in next action.",
"summary": "Please generate a brief natural language description for the operation in next actions based on your Thought."
},
"action": [
{
"action_name": {
// action-specific parameters
}
},
// ... more actions in sequence
]
}
2. ACTIONS: You can specify multiple actions to be executed in sequence.
Common action sequences:
- Form filling: [
{"input_text": {"index": 1, "text": "username"}},
{"input_text": {"index": 2, "text": "password"}},
{"click_element": {"index": 3}}
]
- Navigation and extraction: [
{"open_new_tab": {}},
{"go_to_url": {"url": "https://example.com"}},
{"extract_page_content": {}}
]
3. ELEMENT INTERACTION:
- Only use indexes that exist in the provided element list
- Each element has a unique index number (e.g., "33[:]<button>")
- Elements marked with "_[:]" are non-interactive (for context only)
4. NAVIGATION & ERROR HANDLING:
- If no suitable elements exist, use other functions to complete the task
- If stuck, try alternative approaches
- Handle popups/cookies by accepting or closing them
- Use scroll to find elements you are looking for
5. TASK COMPLETION:
- If you think all the requirements of user's instruction have been completed and no further operation is required, output the done action to terminate the operation process.
- Don't hallucinate actions.
- If the task requires specific information - make sure to include everything in the done function. This is what the user will see.
- If you are running out of steps (current step), think about speeding it up, and ALWAYS use the done action as the last action.
6. VISUAL CONTEXT:
- When an image is provided, use it to understand the page layout
- Bounding boxes with labels correspond to element indexes
- Each bounding box and its label have the same color
- Most often the label is inside the bounding box, on the top right
- Visual context helps verify element locations and relationships
- Sometimes labels overlap, so use the context to verify the correct element
7. FORM FILLING:
- If you fill an input field and your action sequence is interrupted, most often a list with suggestions popped up under the field and you need to first select the right element from the suggestion list.
8. ACTION SEQUENCING:
- Actions are executed in the order they appear in the list
- Each action should logically follow from the previous one
- If the page changes after an action, the sequence is interrupted and you get the new state.
- If content only disappears the sequence continues.
- Only provide the action sequence until you think the page will change.
- Try to be efficient, e.g. fill forms at once, or chain actions where nothing changes on the page like saving, extracting, checkboxes...
- Only use multiple actions if it makes sense.
"""
text += f" - use maximum {self.max_actions_per_step} actions per sequence"
return text
def input_format(self) -> str:
"""
Return a string describing the input structure that the agent can rely on
when constructing its output (Task, Hints, Memory, Task Progress, etc.).
"""
return r"""
INPUT STRUCTURE:
1. Task: The user's instructions you need to complete.
2. Hints(Optional): Some hints to help you complete the user's instructions.
3. Memory: Important contents are recorded during historical operations for use in subsequent operations.
4. Task Progress: Up to the current page, the content you have completed can be understood as the progress of the task.
5. Current URL: The webpage you're currently on
6. Available Tabs: List of open browser tabs
7. Interactive Elements: List in the format:
index[:]<element_type>element_text</element_type>
- index: Numeric identifier for interaction
- element_type: HTML element type (button, input, etc.)
- element_text: Visible text or element description
Example:
33[:]<button>Submit Form</button>
_[:] Non-interactive text
Notes:
- Only elements with numeric indexes are interactive
- _[:] elements provide context but cannot be interacted with
"""
def get_system_message(self) -> SystemMessage:
"""
Build and return a SystemMessage containing all system-level instructions,
rules, and function references for the agent.
"""
time_str = self.current_date.strftime("%Y-%m-%d %H:%M")
AGENT_PROMPT = f"""You are a precise browser automation agent that interacts with websites through structured commands. Your role is to:
1. Analyze the provided webpage elements and structure
2. Plan a sequence of actions to accomplish the given task
3. Respond with valid JSON containing your action sequence and state assessment
Current date and time: {time_str}
{self.input_format()}
{self.important_rules()}
Functions:
{self.default_action_description}
Remember: Your responses must be valid JSON matching the specified format. Each action in the sequence must be valid."""
return SystemMessage(content=AGENT_PROMPT)
class CustomAgentMessagePrompt:
"""
Builds a user-facing prompt (HumanMessage) from the current browser state,
task step info, and any results or errors from previous actions.
"""
def __init__(
self,
state: BrowserState,
result: Optional[List[ActionResult]] = None,
include_attributes: Optional[List[str]] = None,
max_error_length: int = 400,
step_info: Optional[CustomAgentStepInfo] = None,
):
"""
:param state: The current BrowserState, including URL, tabs, elements, etc.
:param result: A list of ActionResults from the previous step(s).
:param include_attributes: A list of HTML attributes to show in element strings.
:param max_error_length: Maximum characters of error output to include.
:param step_info: Holds metadata like the current step number, memory, task details, etc.
"""
self.state = state
self.result = result or []
self.include_attributes = include_attributes or []
self.max_error_length = max_error_length
self.step_info = step_info
def get_user_message(self) -> HumanMessage:
"""
Construct and return a HumanMessage containing:
1. Task and hints from step_info
2. Memory and task progress
3. Current URL and available tabs
4. A string representation of interactive elements
5. Any results or errors from previous actions
6. An inline base64 screenshot if available
:return: A HumanMessage object for the agent to process.
"""
step_info = self.step_info
if not step_info:
# Fallback if no step_info is provided
step_info_text = ""
task = ""
add_infos = ""
memory = ""
task_progress = ""
else:
step_info_text = f"Step {step_info.step_number}/{step_info.max_steps}"
task = step_info.task
add_infos = step_info.add_infos
memory = step_info.memory
task_progress = step_info.task_progress
state_description = f"""
{step_info_text}
1. Task: {task}
2. Hints(Optional):
{add_infos}
3. Memory:
{memory}
4. Task Progress:
{task_progress}
5. Current url: {self.state.url}
6. Available tabs:
{self.state.tabs}
7. Interactive elements:
{self.state.element_tree.clickable_elements_to_string(
include_attributes=self.include_attributes
)}
"""
# Append action results or errors
for i, r in enumerate(self.result):
if r.extracted_content:
state_description += f"\nResult of action {i + 1}/{len(self.result)}: {r.extracted_content}"
if r.error:
truncated_error = r.error[-self.max_error_length :]
state_description += f"\nError of action {i + 1}/{len(self.result)}: ...{truncated_error}"
# If a screenshot is available, embed it as an image URL
if self.state.screenshot:
# Format message for vision model or multi-part message
return HumanMessage(
content=[
{"type": "text", "text": state_description},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{self.state.screenshot}"
},
},
]
)
else:
# Otherwise, just return text
return HumanMessage(content=state_description)
```
--------------------------------------------------------------------------------
/src/mcp_browser_use/agent/custom_agent.py:
--------------------------------------------------------------------------------
```python
# -*- coding: utf-8 -*-
import json
import logging
import traceback
from typing import Any, List, Optional, Type
import base64
import io
import os
from PIL import Image, ImageDraw, ImageFont
from browser_use.agent.prompts import SystemPrompt
from browser_use.agent.service import Agent
from browser_use.agent.views import (
ActionResult,
AgentHistoryList,
AgentOutput,
AgentHistory,
)
from browser_use import BrowserSession
from browser_use.browser.views import BrowserStateHistory
from browser_use.controller.service import Controller
from browser_use.telemetry.views import AgentEndTelemetryEvent, AgentRunTelemetryEvent
from browser_use.utils import time_execution_async
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.messages import BaseMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langchain_openai.chat_models.base import _convert_message_to_dict
from mcp_browser_use.utils.agent_state import AgentState
from mcp_browser_use.agent.custom_massage_manager import CustomMassageManager
from mcp_browser_use.agent.custom_views import CustomAgentOutput, CustomAgentStepInfo
logger = logging.getLogger(__name__)
class CustomAgent(Agent):
"""
An AI-driven Agent that uses a language model to determine browser actions,
interacts with a browser/page handle, and manages conversation history and
state.
"""
def __init__(
self,
task: str,
llm: BaseChatModel,
add_infos: str = "",
browser_session: Optional[BrowserSession] = None,
browser: Optional[BrowserSession] = None,
browser_context: Optional[Any] = None,
controller: Optional[Controller] = None,
use_vision: bool = True,
save_conversation_path: Optional[str] = None,
max_failures: int = 5,
retry_delay: int = 10,
system_prompt_class: Type[SystemPrompt] = SystemPrompt,
max_input_tokens: int = 13000,
validate_output: bool = False,
include_attributes: tuple[str, ...] = (
"title",
"type",
"name",
"role",
"tabindex",
"aria-label",
"placeholder",
"value",
"alt",
"aria-expanded",
),
max_error_length: int = 400,
max_actions_per_step: int = 10,
tool_call_in_content: bool = True,
agent_state: Optional[AgentState] = None,
):
"""
:param task: Main instruction or goal for the agent.
:param llm: The large language model (BaseChatModel) used for reasoning.
:param add_infos: Additional information or context to pass to the agent.
:param browser_session: Optional browser/session instance (legacy name).
:param browser: Preferred browser object for ``browser-use`` >= 0.7.
:param browser_context: Optional active page/context to reuse.
:param controller: Optional controller for handling multi-step actions. A new
controller is created when not provided.
:param use_vision: Whether to use vision-based element detection.
:param save_conversation_path: File path to store conversation logs.
:param max_failures: Max consecutive failures allowed before aborting.
:param retry_delay: Delay between retries (not currently used).
:param system_prompt_class: System prompt class for the agent.
:param max_input_tokens: Token limit for model input.
:param validate_output: Whether to validate final output at each step.
:param include_attributes: HTML attributes to include in vision logic.
:param max_error_length: Max length for error messages.
        :param max_actions_per_step: Limit on the number of actions the agent can perform per step.
:param tool_call_in_content: Whether tool calls are in the raw model content.
:param agent_state: Shared state to detect external stop signals, store last valid state, etc.
"""
controller = controller or Controller()
self.controller = controller
browser_handle = browser or browser_session
init_kwargs: dict[str, Any] = {
"task": task,
"llm": llm,
"controller": controller,
"use_vision": use_vision,
"save_conversation_path": save_conversation_path,
"max_failures": max_failures,
"retry_delay": retry_delay,
"system_prompt_class": system_prompt_class,
"max_input_tokens": max_input_tokens,
"validate_output": validate_output,
"include_attributes": include_attributes,
"max_error_length": max_error_length,
"max_actions_per_step": max_actions_per_step,
"tool_call_in_content": tool_call_in_content,
}
if browser_handle is not None:
init_kwargs["browser"] = browser_handle
if browser_context is not None:
init_kwargs["page"] = browser_context
for _ in range(4):
try:
super().__init__(**init_kwargs)
break
except TypeError as exc: # pragma: no cover - defensive compatibility
message = str(exc)
if (
"unexpected keyword argument 'browser'" in message
and "browser" in init_kwargs
):
browser_value = init_kwargs.pop("browser")
if browser_value is not None:
init_kwargs.setdefault("browser_session", browser_value)
continue
if (
"unexpected keyword argument 'browser_session'" in message
and "browser_session" in init_kwargs
):
browser_value = init_kwargs.pop("browser_session")
if browser_value is not None:
init_kwargs.setdefault("browser", browser_value)
continue
if (
"unexpected keyword argument 'page'" in message
and "page" in init_kwargs
):
init_kwargs.pop("page")
continue
if (
"unexpected keyword argument 'controller'" in message
and "controller" in init_kwargs
):
controller_value = init_kwargs.pop("controller")
init_kwargs.setdefault("tools", controller_value)
continue
if (
"unexpected keyword argument 'tools'" in message
and "tools" in init_kwargs
):
controller_value = init_kwargs.pop("tools")
init_kwargs.setdefault("controller", controller_value)
continue
raise
else: # pragma: no cover - should never happen
raise TypeError("Unable to initialise base Agent with provided arguments")
self.add_infos = add_infos
self.agent_state = agent_state
# Custom message manager
self.message_manager = CustomMassageManager(
llm=self.llm,
task=self.task,
action_descriptions=self.controller.registry.get_prompt_description(),
system_prompt_class=self.system_prompt_class,
max_input_tokens=self.max_input_tokens,
include_attributes=self.include_attributes,
max_error_length=self.max_error_length,
max_actions_per_step=self.max_actions_per_step,
tool_call_in_content=tool_call_in_content,
)
def _setup_action_models(self) -> None:
"""
        Set up dynamic action models from the controller's registry.
This ensures the agent's output schema matches all possible actions.
"""
# Get the dynamic action model from controller's registry
self.ActionModel = self.controller.registry.create_action_model()
# Create output model with the dynamic actions
self.AgentOutput = CustomAgentOutput.type_with_custom_actions(self.ActionModel)
def _log_response(self, response: CustomAgentOutput) -> None:
"""
Log the model's response in a human-friendly way.
Shows success/fail state, memory, thought, summary, etc.
"""
evaluation = response.current_state.prev_action_evaluation or ""
if "Success" in evaluation:
emoji = "✅"
elif "Failed" in evaluation:
emoji = "❌"
else:
emoji = "🤷"
logger.info(f"{emoji} Eval: {evaluation}")
logger.info(f"🧠 New Memory: {response.current_state.important_contents}")
logger.info(f"⏳ Task Progress: {response.current_state.completed_contents}")
logger.info(f"🤔 Thought: {response.current_state.thought}")
logger.info(f"🎯 Summary: {response.current_state.summary}")
for i, action in enumerate(response.action):
logger.info(
f"🛠️ Action {i + 1}/{len(response.action)}: "
f"{action.model_dump_json(exclude_unset=True)}"
)
def update_step_info(
self,
model_output: CustomAgentOutput,
step_info: Optional[CustomAgentStepInfo] = None,
) -> None:
"""
Update the current step with new memory and completed contents.
:param model_output: Parsed output from the LLM.
:param step_info: Step information object, if any.
"""
if step_info is None:
return
step_info.step_number += 1
important_contents = model_output.current_state.important_contents
if (
important_contents
and "None" not in important_contents
and important_contents not in step_info.memory
):
step_info.memory += important_contents + "\n"
completed_contents = model_output.current_state.completed_contents
if completed_contents and "None" not in completed_contents:
step_info.task_progress = completed_contents
@time_execution_async("--get_next_action")
async def get_next_action(self, input_messages: List[BaseMessage]) -> AgentOutput:
"""
Get the next action from the LLM, attempting structured output parsing.
Falls back to manual JSON parsing if structured parse fails.
"""
logger.info("Getting next action from LLM")
logger.debug(f"Input messages: {input_messages}")
try:
if isinstance(self.llm, ChatOpenAI):
# For OpenAI, attempt structured parse with "instructor" first
parsed_output = await self._handle_openai_structured_output(
input_messages
)
else:
logger.info(f"Using non-OpenAI model: {type(self.llm).__name__}")
parsed_output = await self._handle_non_openai_structured_output(
input_messages
)
self._truncate_and_log_actions(parsed_output)
self.n_steps += 1
return parsed_output
except Exception as e:
logger.warning(f"Error getting structured output: {str(e)}")
logger.info("Attempting fallback to manual parsing")
return await self._fallback_parse(input_messages)
async def _handle_openai_structured_output(
self, input_messages: List[BaseMessage]
) -> AgentOutput:
"""
Attempt to get structured output from an OpenAI LLM
using the 'instructor' library. If that fails, fallback
to the default structured output approach.
"""
logger.info("Using OpenAI chat model")
# Usually safe to import here to avoid circular import issues
from instructor import from_openai
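        # `from_openai` patches the underlying async OpenAI client so that
        # `chat.completions.create(..., response_model=...)` returns a validated
        # pydantic object (here self.AgentOutput) rather than a raw completion.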
try:
client = from_openai(self.llm.root_async_client)
logger.debug(f"Using model: {self.llm.model_name}")
messages = [_convert_message_to_dict(msg) for msg in input_messages]
parsed_response = await client.chat.completions.create(
messages=messages,
model=self.llm.model_name,
response_model=self.AgentOutput,
)
logger.debug(f"Raw OpenAI response: {parsed_response}")
return parsed_response
except Exception as e:
# Attempt default structured output if instructor fails
logger.error(f"Error with 'instructor' approach: {str(e)}")
logger.info("Using default structured output approach.")
structured_llm = self.llm.with_structured_output(
self.AgentOutput, include_raw=True
)
response: dict[str, Any] = await structured_llm.ainvoke(input_messages)
logger.debug(f"Raw LLM response (default approach): {response}")
return response["parsed"] # type: ignore
async def _handle_non_openai_structured_output(
self, input_messages: List[BaseMessage]
) -> AgentOutput:
"""
For non-OpenAI models, we directly use the structured LLM approach.
"""
structured_llm = self.llm.with_structured_output(
self.AgentOutput, include_raw=True
)
response: dict[str, Any] = await structured_llm.ainvoke(input_messages)
logger.debug(f"Raw LLM response: {response}")
return response["parsed"] # type: ignore
async def _fallback_parse(self, input_messages: List[BaseMessage]) -> AgentOutput:
"""
Manual JSON parsing fallback if structured parse fails.
Tries to extract JSON from the raw text and parse into AgentOutput.
"""
try:
ret = await self.llm.ainvoke(input_messages)
logger.debug(f"Raw fallback response: {ret}")
content = ret.content
if isinstance(content, list):
# If content is a list, parse from the first element
parsed_json = json.loads(
content[0].replace("```json", "").replace("```", "")
)
else:
# Otherwise parse from the string
parsed_json = json.loads(
content.replace("```json", "").replace("```", "")
)
parsed_output: AgentOutput = self.AgentOutput(**parsed_json)
if parsed_output is None:
raise ValueError("Could not parse fallback response.")
self._truncate_and_log_actions(parsed_output)
self.n_steps += 1
logger.info(
f"Successfully got next action via fallback. Step count: {self.n_steps}"
)
return parsed_output
except Exception as parse_error:
logger.error(f"Fallback parsing failed: {str(parse_error)}")
raise
def _truncate_and_log_actions(self, parsed_output: AgentOutput) -> None:
"""
Enforce the max_actions_per_step limit and log the response.
"""
original_action_count = len(parsed_output.action)
parsed_output.action = parsed_output.action[: self.max_actions_per_step]
if original_action_count > self.max_actions_per_step:
logger.warning(
f"Truncated actions from {original_action_count} to {self.max_actions_per_step}"
)
self._log_response(parsed_output)
def summarize_messages(self) -> bool:
"""
Summarize message history if it exceeds 5 messages.
Returns True if summarization occurred, False otherwise.
"""
stored_messages = self.message_manager.get_messages()
message_count = len(stored_messages)
if message_count <= 5:
logger.debug("Message count <= 5, skipping summarization")
return False
logger.info(f"Summarizing {message_count} messages")
try:
summarization_prompt = ChatPromptTemplate.from_messages(
[
MessagesPlaceholder(variable_name="chat_history"),
(
"user",
"Distill the above chat messages into a single summary message. "
"Include as many specific details as you can.",
),
]
)
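            # LCEL pipe: the prompt (full chat history plus the summarization
            # instruction) feeds the LLM, yielding a single summary message that
            # replaces the entire history below.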
summarization_chain = summarization_prompt | self.llm
summary_message = summarization_chain.invoke(
{"chat_history": stored_messages}
)
logger.debug(f"Generated summary: {summary_message}")
self.message_manager.reset_history()
self.message_manager._add_message_with_tokens(
summary_message
) # Consider creating a public method for this
return True
except Exception as e:
logger.error(f"Error during message summarization: {str(e)}")
logger.debug(f"Full traceback: {traceback.format_exc()}")
return False
@time_execution_async("--execute-agent-step")
async def execute_agent_step(
self, step_info: Optional[CustomAgentStepInfo] = None
) -> None:
"""
        Execute a single step of the task:
        1) Capture the current browser state
        2) Query the LLM for the next action(s)
        3) Execute those action(s)
        4) Update logs/history
"""
logger.info(f"\n📍 Step {self.n_steps}")
logger.info(f"History token count: {self.message_manager.history.total_tokens}")
# Optionally summarize to reduce token usage
# self.summarize_messages()
state = None
model_output = None
result: List[ActionResult] = []
try:
try:
state = await self.browser_context.get_state(use_vision=self.use_vision)
except TypeError:
logger.warning(
"get_state does not support 'use_vision' argument, falling back."
)
state = await self.browser_context.get_state()
self.message_manager.add_state_message(state, self._last_result, step_info)
input_messages = self.message_manager.get_messages()
model_output = await self.get_next_action(input_messages)
self.update_step_info(model_output, step_info)
logger.info(f"🧠 All Memory: {getattr(step_info, 'memory', '')}")
self._save_conversation(input_messages, model_output)
# Remove the last state message from chat history to prevent bloat
self.message_manager._remove_last_state_message()
self.message_manager.add_model_output(model_output)
# Execute the requested actions
result = await self.controller.multi_act(
model_output.action, self.browser_context
)
self._last_result = result
# If the last action indicates "is_done", we can log the extracted content
if len(result) > 0 and result[-1].is_done:
logger.info(f"📄 Result: {result[-1].extracted_content}")
self.consecutive_failures = 0
except Exception as e:
result = self._handle_step_error(e)
self._last_result = result
finally:
if not result:
return
for r in result:
logger.warning(f"🔧 Action result: {r}")
if state:
self._make_history_item(model_output, state, result)
def create_history_gif(
self,
output_path: str = "agent_history.gif",
duration: int = 3000,
show_goals: bool = True,
show_task: bool = True,
show_logo: bool = False,
font_size: int = 40,
title_font_size: int = 56,
goal_font_size: int = 44,
margin: int = 40,
line_spacing: float = 1.5,
) -> None:
"""
Create a GIF from the agent's history using the captured screenshots.
Overlays text for tasks/goals. Optionally includes a logo.
"""
if not self.history.history:
logger.warning("No history to create GIF from")
return
if not self.history.history[0].state.screenshot:
logger.warning(
"No screenshots in the first history item; cannot create GIF"
)
return
images = []
try:
# Attempt to load some preferred fonts
font_options = ["Helvetica", "Arial", "DejaVuSans", "Verdana"]
regular_font, title_font, goal_font = None, None, None
font_loaded = False
for font_name in font_options:
try:
import platform
if platform.system() == "Windows":
# On Windows, we may need absolute font paths
font_name = os.path.join(
os.getenv("WIN_FONT_DIR", "C:\\Windows\\Fonts"),
font_name + ".ttf",
)
regular_font = ImageFont.truetype(font_name, font_size)
title_font = ImageFont.truetype(font_name, title_font_size)
goal_font = ImageFont.truetype(font_name, goal_font_size)
font_loaded = True
break
except OSError:
continue
if not font_loaded:
raise OSError("No preferred fonts found")
except OSError:
# Fallback to default
regular_font = ImageFont.load_default()
title_font = regular_font
goal_font = regular_font
logo = None
if show_logo:
try:
logo = Image.open("./static/browser-use.png")
# Resize logo
logo_height = 150
aspect_ratio = logo.width / logo.height
logo_width = int(logo_height * aspect_ratio)
logo = logo.resize((logo_width, logo_height), Image.Resampling.LANCZOS)
except Exception as e:
logger.warning(f"Could not load logo: {e}")
# If requested, create an initial frame with the entire task
if show_task and self.task:
task_frame = self._create_task_frame(
self.task,
self.history.history[0].state.screenshot,
title_font,
regular_font,
logo,
line_spacing,
)
images.append(task_frame)
# Convert each step’s screenshot
for i, item in enumerate(self.history.history, 1):
if not item.state.screenshot:
continue
img_data = base64.b64decode(item.state.screenshot)
image = Image.open(io.BytesIO(img_data))
if show_goals and item.model_output:
image = self._add_overlay_to_image(
image=image,
step_number=i,
goal_text=item.model_output.current_state.thought,
regular_font=regular_font,
title_font=title_font,
margin=margin,
logo=logo,
line_spacing=line_spacing,
)
images.append(image)
if images:
images[0].save(
output_path,
save_all=True,
append_images=images[1:],
duration=duration,
loop=0,
optimize=False,
)
logger.info(f"Created GIF at {output_path}")
else:
logger.warning("No images found in history to create GIF")
def _create_task_frame(
self,
task_text: str,
screenshot_b64: str,
title_font: ImageFont.FreeTypeFont,
regular_font: ImageFont.FreeTypeFont,
logo: Image.Image | None,
line_spacing: float,
) -> Image.Image:
"""Return an image with the task text overlaid on the screenshot."""
margin = 40
img = Image.open(io.BytesIO(base64.b64decode(screenshot_b64))).convert("RGBA")
overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(overlay)
max_width = img.width - margin * 2
text_lines: list[str] = self._wrap_text_to_lines(
draw, task_text, regular_font, max_width
)
y = margin
title_bbox = draw.textbbox((margin, y), "Task", font=title_font)
title_height = title_bbox[3] - title_bbox[1]
total_height = title_height + int(margin * 0.5)
for t in text_lines:
bbox = draw.textbbox((margin, 0), t, font=regular_font)
total_height += int((bbox[3] - bbox[1]) * line_spacing)
if logo:
total_height = max(total_height, logo.height + margin * 2)
draw.rectangle(
[(0, 0), (img.width, total_height)],
fill=(0, 0, 0, 180),
)
draw.text((margin, y), "Task", font=title_font, fill="white")
y += title_height + int(margin * 0.5)
for t in text_lines:
draw.text((margin, y), t, font=regular_font, fill="white")
bbox = draw.textbbox((margin, y), t, font=regular_font)
y += int((bbox[3] - bbox[1]) * line_spacing)
if logo:
overlay.paste(
logo,
(img.width - logo.width - margin, margin),
logo if logo.mode == "RGBA" else None,
)
img.alpha_composite(overlay)
return img.convert("RGB")
def _wrap_text_to_lines(
self,
draw: ImageDraw.ImageDraw,
text: str,
font: ImageFont.FreeTypeFont,
max_width: int,
) -> list[str]:
"""Split ``text`` into lines that fit within ``max_width`` pixels."""
if not text:
return []
if max_width <= 0:
return [text]
wrapped_lines: list[str] = []
lines = text.splitlines()
if not lines:
lines = [text]
for raw_line in lines:
words = raw_line.split()
if not words:
wrapped_lines.append("")
continue
current_line = words[0]
for word in words[1:]:
candidate = f"{current_line} {word}" if current_line else word
if draw.textlength(candidate, font=font) <= max_width:
current_line = candidate
else:
wrapped_lines.append(current_line)
current_line = word
wrapped_lines.append(current_line)
return wrapped_lines
def _add_overlay_to_image(
self,
image: Image.Image,
step_number: int,
goal_text: str,
regular_font: ImageFont.FreeTypeFont,
title_font: ImageFont.FreeTypeFont,
margin: int,
logo: Image.Image | None,
        line_spacing: float,  # Spacing multiplier applied to each wrapped goal-text line
) -> Image.Image:
"""Overlay the step number and goal text onto a screenshot image."""
image = image.convert("RGBA")
overlay = Image.new("RGBA", image.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(overlay)
step_text = f"Step {step_number}"
max_width = image.width - margin * 2
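        # Greedy word-wrap of the goal text to the banner width (same approach as
        # _wrap_text_to_lines, but without honouring explicit newlines).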
lines: list[str] = []
words = goal_text.split()
line = ""
for word in words:
test = f"{line} {word}".strip()
            # Accept the word unconditionally while the line is still empty so a single
            # overlong word never produces a spurious blank line.
            if draw.textlength(test, font=regular_font) <= max_width or not line:
line = test
else:
lines.append(line)
line = word
if line:
lines.append(line)
y = margin
step_bbox = draw.textbbox((margin, y), step_text, font=title_font)
step_height = step_bbox[3] - step_bbox[1]
total_height = step_height + int(margin * 0.5)
for l in lines:
bbox = draw.textbbox((margin, 0), l, font=regular_font)
            total_height += int((bbox[3] - bbox[1]) * line_spacing)
if logo:
total_height = max(total_height, logo.height + margin * 2)
draw.rectangle(
[(0, 0), (image.width, total_height)],
fill=(0, 0, 0, 180),
)
draw.text((margin, y), step_text, font=title_font, fill="white")
y += step_height + int(margin * 0.5)
for l in lines:
draw.text((margin, y), l, font=regular_font, fill="white")
bbox = draw.textbbox((margin, y), l, font=regular_font)
            y += int((bbox[3] - bbox[1]) * line_spacing)
if logo:
overlay.paste(
logo,
(image.width - logo.width - margin, margin),
logo if logo.mode == "RGBA" else None,
)
image.alpha_composite(overlay)
return image.convert("RGB")
async def execute_agent_task(self, max_steps: int = 100) -> AgentHistoryList:
"""
Execute the entire agent task for up to max_steps or until 'done'.
Checks for external stop signals and logs each step in self.history.
"""
try:
logger.info(f"🚀 Starting task: {self.task}")
self.telemetry.capture(
AgentRunTelemetryEvent(
agent_id=self.agent_id,
task=self.task,
)
)
step_info = CustomAgentStepInfo(
task=self.task,
add_infos=self.add_infos,
step_number=1,
max_steps=max_steps,
memory="",
task_progress="",
)
for step in range(max_steps):
# 1) Check if stop requested externally
if self.agent_state and self.agent_state.is_stop_requested():
logger.info("🛑 Stop requested by user")
self._create_stop_history_item()
break
# 2) Store last valid state
if self.browser_context and self.agent_state:
state = await self.browser_context.get_state(
use_vision=self.use_vision
)
self.agent_state.set_last_valid_state(state)
# 3) Check for too many failures
if self._too_many_failures():
break
# 4) Execute one detailed agent step
await self.execute_agent_step(step_info)
if self.history.is_done():
if self.validate_output and step < max_steps - 1:
# Optionally validate final output
if not await self._validate_output():
continue
logger.info("✅ Task completed successfully")
break
else:
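                # Loop `else`: reached only when max_steps is exhausted without a
                # break, i.e. the task never completed and no stop ended it early.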
logger.info("❌ Failed to complete task within maximum steps")
return self.history
finally:
self.telemetry.capture(
AgentEndTelemetryEvent(
agent_id=self.agent_id,
task=self.task,
success=self.history.is_done(),
steps=len(self.history.history),
)
)
# Close the browser context if we created it here (not injected)
if not self.injected_browser_context and self.browser_context:
await self.browser_context.close()
# Close the browser instance if it wasn't injected
if not self.injected_browser and self.browser:
await self.browser.close()
# Generate a GIF of the agent's run if enabled
if self.generate_gif:
self.create_history_gif()
def _create_stop_history_item(self) -> None:
"""
Create a final 'stop' history item indicating the agent has halted by request.
"""
try:
state = None
if self.agent_state:
last_state = self.agent_state.get_last_valid_state()
if last_state:
state = self._convert_to_browser_state_history(last_state)
else:
state = self._create_empty_state()
else:
state = self._create_empty_state()
stop_history = AgentHistory(
model_output=None,
state=state,
result=[ActionResult(extracted_content=None, error=None, is_done=True)],
)
self.history.history.append(stop_history)
except Exception as e:
logger.error(f"Error creating stop history item: {e}")
state = self._create_empty_state()
stop_history = AgentHistory(
model_output=None,
state=state,
result=[ActionResult(extracted_content=None, error=None, is_done=True)],
)
self.history.history.append(stop_history)
def _convert_to_browser_state_history(
self, browser_state: Any
) -> BrowserStateHistory:
"""
Convert a raw browser_state object into a BrowserStateHistory dataclass.
"""
return BrowserStateHistory(
url=getattr(browser_state, "url", ""),
title=getattr(browser_state, "title", ""),
tabs=getattr(browser_state, "tabs", []),
interacted_element=[None],
screenshot=getattr(browser_state, "screenshot", None),
)
def _create_empty_state(self) -> BrowserStateHistory:
"""
Create a basic empty state for fallback or stop-history usage.
"""
return BrowserStateHistory(
url="", title="", tabs=[], interacted_element=[None], screenshot=None
)
```