# Directory Structure
```
├── .agent
│ ├── README.md
│ └── system
│ ├── mcp_protocol.md
│ └── project_architecture.md
├── .github
│ └── workflows
│ └── python-package.yml
├── .gitignore
├── assets
│ ├── cursor_mcp.png
│ └── sgai_smithery.png
├── CLAUDE.md
├── Dockerfile
├── LICENSE
├── pyproject.toml
├── README.md
├── server.json
├── smithery.yaml
├── src
│ └── scrapegraph_mcp
│ ├── __init__.py
│ └── server.py
└── uv.lock
```
# Files
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
1 | # Build artifacts
2 | build/
3 | dist/
4 | wheels/
5 | *.egg-info/
6 | *.egg
7 |
8 | # Python artifacts
9 | __pycache__/
10 | *.py[cod]
11 | *$py.class
12 | *.so
13 | .Python
14 | .pytest_cache/
15 | .coverage
16 | htmlcov/
17 | .tox/
18 | .nox/
19 | .hypothesis/
20 | .mypy_cache/
21 | .ruff_cache/
22 |
23 | # Virtual environments
24 | .venv/
25 | venv/
26 | ENV/
27 | env/
28 |
29 | # IDE files
30 | .idea/
31 | .vscode/
32 | *.swp
33 | *.swo
34 | .DS_Store
35 |
36 | # Environment variables
37 | .env
38 | .env.local
39 |
40 | # Logs
41 | *.log
42 |
43 | .mcpregistry_github_token
44 | .mcpregistry_registry_token
45 | mcp-publisher
```
--------------------------------------------------------------------------------
/.agent/README.md:
--------------------------------------------------------------------------------
```markdown
1 | # ScrapeGraph MCP Server Documentation
2 |
3 | Welcome to the ScrapeGraph MCP Server documentation hub. This directory contains comprehensive documentation for understanding, developing, and maintaining the ScrapeGraph MCP Server.
4 |
5 | ## 📚 Available Documentation
6 |
7 | ### System Documentation (`system/`)
8 |
9 | #### [Project Architecture](./system/project_architecture.md)
10 | Complete system architecture documentation including:
11 | - **System Overview** - MCP server purpose and capabilities
12 | - **Technology Stack** - Python 3.10+, FastMCP, httpx dependencies
13 | - **Project Structure** - File organization and key files
14 | - **Core Architecture** - MCP design, server architecture, patterns
15 | - **MCP Tools** - All 5 tools (markdownify, smartscraper, searchscraper, smartcrawler_initiate, smartcrawler_fetch_results)
16 | - **API Integration** - ScrapeGraphAI API endpoints and credit system
17 | - **Deployment** - Smithery, Claude Desktop, Cursor, Docker setup
18 | - **Recent Updates** - SmartCrawler integration and latest features
19 |
20 | #### [MCP Protocol](./system/mcp_protocol.md)
21 | Complete Model Context Protocol integration documentation:
22 | - **What is MCP?** - Protocol overview and key concepts
23 | - **MCP in ScrapeGraph** - Architecture and FastMCP usage
24 | - **Communication Protocol** - JSON-RPC over stdio transport
25 | - **Tool Schema** - Schema generation from Python type hints
26 | - **Error Handling** - Graceful error handling patterns
27 | - **Client Integration** - Claude Desktop, Cursor, custom clients
28 | - **Advanced Topics** - Versioning, streaming, authentication, rate limiting
29 | - **Debugging** - MCP Inspector, logs, troubleshooting
30 |
31 | ### Task Documentation (`tasks/`)
32 |
33 | *Future: PRD and implementation plans for specific features*
34 |
35 | ### SOP Documentation (`sop/`)
36 |
37 | *Future: Standard operating procedures (e.g., adding new tools, testing)*
38 |
39 | ---
40 |
41 | ## 🚀 Quick Start
42 |
43 | ### For New Engineers
44 |
45 | 1. **Read First:**
46 | - [Project Architecture - System Overview](./system/project_architecture.md#system-overview)
47 | - [MCP Protocol - What is MCP?](./system/mcp_protocol.md#what-is-mcp)
48 |
49 | 2. **Setup Development Environment:**
50 | - Install Python 3.10+
51 | - Clone repository: `git clone https://github.com/ScrapeGraphAI/scrapegraph-mcp`
52 | - Install dependencies: `pip install -e ".[dev]"`
53 | - Get API key from: [dashboard.scrapegraphai.com](https://dashboard.scrapegraphai.com)
54 |
55 | 3. **Run the Server:**
56 | ```bash
57 | export SGAI_API_KEY=your-api-key
58 | scrapegraph-mcp
59 | ```
60 |
61 | 4. **Test with MCP Inspector:**
62 | ```bash
63 | npx @modelcontextprotocol/inspector scrapegraph-mcp
64 | ```
65 |
66 | 5. **Integrate with Claude Desktop:**
67 | - See: [Project Architecture - Deployment](./system/project_architecture.md#deployment)
68 | - Add config to `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS)
69 |
70 | ---
71 |
72 | ## 🔍 Finding Information
73 |
74 | ### I want to understand...
75 |
76 | **...what MCP is:**
77 | - Read: [MCP Protocol - What is MCP?](./system/mcp_protocol.md#what-is-mcp)
78 | - Read: [Project Architecture - Core Architecture](./system/project_architecture.md#core-architecture)
79 |
80 | **...how to add a new tool:**
81 | - Read: [Project Architecture - Contributing - Adding New Tools](./system/project_architecture.md#adding-new-tools)
82 | - Example: See existing tools in `src/scrapegraph_mcp/server.py`
83 |
84 | **...how tools are defined:**
85 | - Read: [MCP Protocol - Tool Schema](./system/mcp_protocol.md#tool-schema)
86 | - Code: `src/scrapegraph_mcp/server.py` (lines 232-372)
87 |
88 | **...how to debug MCP issues:**
89 | - Read: [MCP Protocol - Debugging MCP](./system/mcp_protocol.md#debugging-mcp)
90 | - Tools: MCP Inspector, Claude Desktop logs
91 |
92 | **...how to deploy:**
93 | - Read: [Project Architecture - Deployment](./system/project_architecture.md#deployment)
94 | - Options: Smithery (automated), Docker, pip install
95 |
96 | **...available tools and their parameters:**
97 | - Read: [Project Architecture - MCP Tools](./system/project_architecture.md#mcp-tools)
98 | - Quick reference: 5 tools (markdownify, smartscraper, searchscraper, smartcrawler_initiate, smartcrawler_fetch_results)
99 |
100 | **...error handling:**
101 | - Read: [MCP Protocol - Error Handling](./system/mcp_protocol.md#error-handling)
102 | - Pattern: Return `{"error": "message"}` instead of raising exceptions
103 |
104 | **...how SmartCrawler works:**
105 | - Read: [Project Architecture - Tool #4 & #5](./system/project_architecture.md#4-smartcrawler_initiate)
106 | - Pattern: Initiate (async) → Poll fetch_results until complete
107 |
108 | ---
109 |
110 | ## 🛠️ Development Workflows
111 |
112 | ### Running Locally
113 |
114 | ```bash
115 | # Install dependencies
116 | pip install -e ".[dev]"
117 |
118 | # Set API key
119 | export SGAI_API_KEY=your-api-key
120 |
121 | # Run server
122 | scrapegraph-mcp
123 | # or
124 | python -m scrapegraph_mcp.server
125 | ```
126 |
127 | ### Testing
128 |
129 | **Manual Testing (MCP Inspector):**
130 | ```bash
131 | npx @modelcontextprotocol/inspector scrapegraph-mcp
132 | ```
133 |
134 | **Manual Testing (stdio):**
135 | ```bash
136 | echo '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"markdownify","arguments":{"website_url":"https://scrapegraphai.com"}},"id":1}' | scrapegraph-mcp
137 | ```
138 |
139 | **Integration Testing (Claude Desktop):**
140 | 1. Configure MCP server in Claude Desktop
141 | 2. Restart Claude
142 | 3. Ask: "Convert https://scrapegraphai.com to markdown"
143 | 4. Verify tool invocation and results
144 |
145 | ### Code Quality
146 |
147 | ```bash
148 | # Linting
149 | ruff check src/
150 |
151 | # Type checking
152 | mypy src/
153 |
154 | # Format checking
155 | ruff format --check src/
156 | ```
157 |
158 | ### Building Docker Image
159 |
160 | ```bash
161 | # Build
162 | docker build -t scrapegraph-mcp .
163 |
164 | # Run
165 | docker run -e SGAI_API_KEY=your-api-key scrapegraph-mcp
166 |
167 | # Test
168 | echo '{"jsonrpc":"2.0","method":"tools/list","id":1}' | docker run -i -e SGAI_API_KEY=your-api-key scrapegraph-mcp
169 | ```
170 |
171 | ---
172 |
173 | ## 📊 MCP Tools Reference
174 |
175 | Quick reference to all MCP tools:
176 |
177 | | Tool | Parameters | Purpose | Credits | Async |
178 | |------|------------|---------|---------|-------|
179 | | `markdownify` | `website_url` | Convert webpage to markdown | 2 | No |
180 | | `smartscraper` | `user_prompt`, `website_url`, `number_of_scrolls?`, `markdown_only?` | AI-powered data extraction | 10+ | No |
181 | | `searchscraper` | `user_prompt`, `num_results?`, `number_of_scrolls?` | AI-powered web search | Variable | No |
182 | | `smartcrawler_initiate` | `url`, `prompt?`, `extraction_mode`, `depth?`, `max_pages?`, `same_domain_only?` | Start multi-page crawl | 100+ | Yes (returns request_id) |
183 | | `smartcrawler_fetch_results` | `request_id` | Get crawl results | N/A | No (polls status) |
184 |
185 | For detailed tool documentation, see [Project Architecture - MCP Tools](./system/project_architecture.md#mcp-tools).
186 |
187 | ---
188 |
189 | ## 🔧 Key Files Reference
190 |
191 | ### Core Files
192 | - `src/scrapegraph_mcp/server.py` - Main server implementation (all code)
193 | - `src/scrapegraph_mcp/__init__.py` - Package initialization
194 |
195 | ### Configuration
196 | - `pyproject.toml` - Project metadata, dependencies, build config
197 | - `Dockerfile` - Docker container definition
198 | - `smithery.yaml` - Smithery deployment config
199 |
200 | ### Documentation
201 | - `README.md` - User-facing documentation
202 | - `.agent/README.md` - This file (developer documentation index)
203 | - `.agent/system/project_architecture.md` - Architecture documentation
204 | - `.agent/system/mcp_protocol.md` - MCP protocol documentation
205 |
206 | ---
207 |
208 | ## 🚨 Troubleshooting
209 |
210 | ### Common Issues
211 |
212 | **Issue: "ScapeGraph client not initialized"**
213 | - **Cause:** Missing `SGAI_API_KEY` environment variable
214 | - **Solution:** Set `export SGAI_API_KEY=your-api-key` or pass via `--config`
215 |
216 | **Issue: "Error 401: Unauthorized"**
217 | - **Cause:** Invalid API key
218 | - **Solution:** Verify API key at [dashboard.scrapegraphai.com](https://dashboard.scrapegraphai.com)
219 |
220 | **Issue: "Error 402: Payment Required"**
221 | - **Cause:** Insufficient credits
222 | - **Solution:** Add credits to your ScrapeGraphAI account
223 |
224 | **Issue: Tools not appearing in Claude Desktop**
225 | - **Cause:** Server not starting or config error
226 | - **Solution:** Check Claude logs at `~/Library/Logs/Claude/` (macOS)
227 |
228 | **Issue: SmartCrawler not returning results**
229 | - **Cause:** Still processing (async operation)
230 | - **Solution:** Keep polling `smartcrawler_fetch_results()` until `status == "completed"`
231 |
232 | **Issue: Python version error**
233 | - **Cause:** Python < 3.10
234 | - **Solution:** Upgrade Python to 3.10+
235 |
236 | For more troubleshooting, see:
237 | - [Project Architecture - Troubleshooting](./system/project_architecture.md#troubleshooting)
238 | - [MCP Protocol - Debugging MCP](./system/mcp_protocol.md#debugging-mcp)
239 |
240 | ---
241 |
242 | ## 🤝 Contributing
243 |
244 | ### Before Making Changes
245 |
246 | 1. **Read relevant documentation** - Understand MCP and the server architecture
247 | 2. **Check existing issues** - Avoid duplicate work
248 | 3. **Test locally** - Use MCP Inspector to verify changes
249 | 4. **Test with clients** - Verify with Claude Desktop or Cursor
250 |
251 | ### Adding a New Tool
252 |
253 | **Step-by-step guide:**
254 |
255 | 1. **Add method to `ScapeGraphClient` class:**
256 | ```python
257 | def new_tool(self, param: str) -> Dict[str, Any]:
258 | """Tool description."""
259 | url = f"{self.BASE_URL}/new-endpoint"
260 | data = {"param": param}
261 | response = self.client.post(url, headers=self.headers, json=data)
262 | if response.status_code != 200:
263 | raise Exception(f"Error {response.status_code}: {response.text}")
264 | return response.json()
265 | ```
266 |
267 | 2. **Add MCP tool decorator:**
268 | ```python
269 | @mcp.tool()
270 | def new_tool(param: str) -> Dict[str, Any]:
271 | """
272 | Tool description for AI assistants.
273 |
274 | Args:
275 | param: Parameter description
276 |
277 | Returns:
278 | Dictionary containing results
279 | """
280 | if scrapegraph_client is None:
281 | return {"error": "ScapeGraph client not initialized. Please provide an API key."}
282 |
283 | try:
284 | return scrapegraph_client.new_tool(param)
285 | except Exception as e:
286 | return {"error": str(e)}
287 | ```
288 |
289 | 3. **Test with MCP Inspector:**
290 | ```bash
291 | npx @modelcontextprotocol/inspector scrapegraph-mcp
292 | ```
293 |
294 | 4. **Update documentation:**
295 | - Add tool to [Project Architecture - MCP Tools](./system/project_architecture.md#mcp-tools)
296 | - Add schema to [MCP Protocol - Tool Schema](./system/mcp_protocol.md#tool-schema)
297 | - Update tool reference table in this README
298 |
299 | 5. **Submit pull request**
300 |
301 | ### Development Process
302 |
303 | 1. **Make changes** - Edit `src/scrapegraph_mcp/server.py`
304 | 2. **Run linting** - `ruff check src/`
305 | 3. **Run type checking** - `mypy src/`
306 | 4. **Test locally** - MCP Inspector + Claude Desktop
307 | 5. **Update docs** - Keep `.agent/` docs in sync
308 | 6. **Commit** - Clear commit message
309 | 7. **Create PR** - Describe changes thoroughly
310 |
311 | ### Code Style
312 |
313 | - **Ruff:** Line length 100, target Python 3.12
314 | - **mypy:** Strict mode, disallow untyped defs
315 | - **Type hints:** Always use type hints for parameters and return values
316 | - **Docstrings:** Google-style docstrings for all public functions
317 | - **Error handling:** Return error dicts, don't raise exceptions in tools
318 |
319 | ---
320 |
321 | ## 📖 External Documentation
322 |
323 | ### MCP Resources
324 | - [Model Context Protocol Specification](https://modelcontextprotocol.io/)
325 | - [MCP Python SDK](https://github.com/modelcontextprotocol/python-sdk)
326 | - [FastMCP Framework](https://github.com/jlowin/fastmcp)
327 | - [MCP Inspector](https://github.com/modelcontextprotocol/inspector)
328 |
329 | ### ScrapeGraphAI Resources
330 | - [ScrapeGraphAI Homepage](https://scrapegraphai.com)
331 | - [ScrapeGraphAI Dashboard](https://dashboard.scrapegraphai.com)
332 | - [ScrapeGraphAI API Documentation](https://api.scrapegraphai.com/docs)
333 |
334 | ### AI Assistant Integration
335 | - [Claude Desktop](https://claude.ai/desktop)
336 | - [Cursor](https://cursor.sh/)
337 | - [Smithery MCP Distribution](https://smithery.ai/)
338 |
339 | ### Development Tools
340 | - [Python httpx](https://www.python-httpx.org/)
341 | - [Ruff Linter](https://docs.astral.sh/ruff/)
342 | - [mypy Type Checker](https://mypy-lang.org/)
343 |
344 | ---
345 |
346 | ## 📝 Documentation Maintenance
347 |
348 | ### When to Update Documentation
349 |
350 | **Update `.agent/system/project_architecture.md` when:**
351 | - Adding new MCP tools
352 | - Changing tool parameters or return types
353 | - Updating deployment methods
354 | - Modifying technology stack
355 |
356 | **Update `.agent/system/mcp_protocol.md` when:**
357 | - Changing MCP protocol implementation
358 | - Adding new communication patterns
359 | - Modifying error handling strategy
360 | - Updating authentication method
361 |
362 | **Update `.agent/README.md` when:**
363 | - Adding new documentation files
364 | - Changing development workflows
365 | - Updating quick start instructions
366 |
367 | ### Documentation Best Practices
368 |
369 | 1. **Keep it current** - Update docs with code changes in the same PR
370 | 2. **Be specific** - Include code snippets, file paths, line numbers
371 | 3. **Include examples** - Show real-world usage patterns
372 | 4. **Link related sections** - Cross-reference between documents
373 | 5. **Test examples** - Verify all code examples work
374 |
375 | ---
376 |
377 | ## 📅 Changelog
378 |
379 | ### October 2025
380 | - ✅ Initial comprehensive documentation created
381 | - ✅ Project architecture fully documented
382 | - ✅ MCP protocol integration documented
383 | - ✅ All 5 MCP tools documented
384 | - ✅ SmartCrawler integration (initiate + fetch_results)
385 | - ✅ Deployment guides (Smithery, Docker, Claude Desktop, Cursor)
386 | - ✅ Recent updates: Enhanced error handling, extraction mode validation
387 |
388 | ---
389 |
390 | ## 🔗 Quick Links
391 |
392 | - [Main README](../README.md) - User-facing documentation
393 | - [Server Implementation](../src/scrapegraph_mcp/server.py) - All code (single file)
394 | - [pyproject.toml](../pyproject.toml) - Project metadata
395 | - [Dockerfile](../Dockerfile) - Docker configuration
396 | - [smithery.yaml](../smithery.yaml) - Smithery config
397 | - [GitHub Repository](https://github.com/ScrapeGraphAI/scrapegraph-mcp)
398 |
399 | ---
400 |
401 | ## 📧 Support
402 |
403 | For questions or issues:
404 | 1. Check this documentation first
405 | 2. Review [Project Architecture](./system/project_architecture.md) and [MCP Protocol](./system/mcp_protocol.md)
406 | 3. Test with [MCP Inspector](https://github.com/modelcontextprotocol/inspector)
407 | 4. Search [GitHub issues](https://github.com/ScrapeGraphAI/scrapegraph-mcp/issues)
408 | 5. Create a new issue with detailed information
409 |
410 | ---
411 |
412 | **Made with ❤️ by [ScrapeGraphAI](https://scrapegraphai.com) Team**
413 |
414 | **Happy Coding! 🚀**
415 |
```
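
The "initiate, then poll" flow for SmartCrawler described above is easy to get subtly wrong, so here is a minimal sketch of the polling loop in Python. It assumes a generic `call_tool(name, arguments)` callable standing in for whatever mechanism your MCP client uses to invoke server tools (for example `ClientSession.call_tool` in the MCP Python SDK); the `request_id` field and the `"completed"` status come from the tool reference above, while the `"failed"` status check and the `poll_interval` default are illustrative assumptions.

```python
import time
from typing import Any, Callable, Dict


def wait_for_crawl(
    call_tool: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    start_url: str,
    prompt: str,
    poll_interval: float = 5.0,
) -> Dict[str, Any]:
    """Start a SmartCrawler job and poll until it finishes."""
    # Kick off the asynchronous crawl; the server returns a request_id immediately.
    job = call_tool("smartcrawler_initiate", {
        "url": start_url,
        "prompt": prompt,
        "extraction_mode": "ai",
    })
    request_id = job["request_id"]

    # Poll smartcrawler_fetch_results until the status field reports completion.
    while True:
        result = call_tool("smartcrawler_fetch_results", {"request_id": request_id})
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":  # assumed failure value; check the API docs
            raise RuntimeError(f"Crawl {request_id} failed: {result}")
        time.sleep(poll_interval)
```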
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
```markdown
1 | # ScrapeGraph MCP Server
2 |
3 | [License: MIT](https://opensource.org/licenses/MIT)
4 | [Python](https://www.python.org/downloads/)
5 | [Smithery](https://smithery.ai/server/@ScrapeGraphAI/scrapegraph-mcp)
6 |
7 |
8 | A production-ready [Model Context Protocol](https://modelcontextprotocol.io/introduction) (MCP) server that provides seamless integration with the [ScrapeGraph AI](https://scrapegraphai.com) API. This server enables language models to leverage advanced AI-powered web scraping capabilities with enterprise-grade reliability.
9 |
10 | ## Table of Contents
11 |
12 | - [Key Features](#key-features)
13 | - [Quick Start](#quick-start)
14 | - [Available Tools](#available-tools)
15 | - [Setup Instructions](#setup-instructions)
16 | - [Local Usage](#local-usage)
17 | - [Google ADK Integration](#google-adk-integration)
18 | - [Example Use Cases](#example-use-cases)
19 | - [Error Handling](#error-handling)
20 | - [Common Issues](#common-issues)
21 | - [Development](#development)
22 | - [Contributing](#contributing)
23 | - [Documentation](#documentation)
24 | - [Technology Stack](#technology-stack)
25 | - [License](#license)
26 |
27 | ## Key Features
28 |
29 | - **8 Powerful Tools**: From simple markdown conversion to complex multi-page crawling and agentic workflows
30 | - **AI-Powered Extraction**: Intelligently extract structured data using natural language prompts
31 | - **Multi-Page Crawling**: SmartCrawler supports asynchronous crawling with configurable depth and page limits
32 | - **Infinite Scroll Support**: Handle dynamic content loading with configurable scroll counts
33 | - **JavaScript Rendering**: Full support for JavaScript-heavy websites
34 | - **Flexible Output Formats**: Get results as markdown, structured JSON, or custom schemas
35 | - **Easy Integration**: Works seamlessly with Claude Desktop, Cursor, and any MCP-compatible client
36 | - **Enterprise-Ready**: Robust error handling, timeout management, and production-tested reliability
37 | - **Simple Deployment**: One-command installation via Smithery or manual setup
38 | - **Comprehensive Documentation**: Detailed developer docs in `.agent/` folder
39 |
40 | ## Quick Start
41 |
42 | ### 1. Get Your API Key
43 |
44 | Sign up and get your API key from the [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)
45 |
46 | ### 2. Install with Smithery (Recommended)
47 |
48 | ```bash
49 | npx -y @smithery/cli install @ScrapeGraphAI/scrapegraph-mcp --client claude
50 | ```
51 |
52 | ### 3. Start Using
53 |
54 | Ask Claude or Cursor:
55 | - "Convert https://scrapegraphai.com to markdown"
56 | - "Extract all product prices from this e-commerce page"
57 | - "Research the latest AI developments and summarize findings"
58 |
59 | That's it! The server is now available to your AI assistant.
60 |
61 | ## Available Tools
62 |
63 | The server provides **8 enterprise-ready tools** for AI-powered web scraping:
64 |
65 | ### Core Scraping Tools
66 |
67 | #### 1. `markdownify`
68 | Transform any webpage into clean, structured markdown format.
69 |
70 | ```python
71 | markdownify(website_url: str)
72 | ```
73 | - **Credits**: 2 per request
74 | - **Use case**: Quick webpage content extraction in markdown
75 |
76 | #### 2. `smartscraper`
77 | Leverage AI to extract structured data from any webpage with support for infinite scrolling.
78 |
79 | ```python
80 | smartscraper(
81 | user_prompt: str,
82 | website_url: str,
83 | number_of_scrolls: int = None,
84 | markdown_only: bool = None
85 | )
86 | ```
87 | - **Credits**: 10+ (base) + variable based on scrolling
88 | - **Use case**: AI-powered data extraction with custom prompts
89 |
90 | #### 3. `searchscraper`
91 | Execute AI-powered web searches with structured, actionable results.
92 |
93 | ```python
94 | searchscraper(
95 | user_prompt: str,
96 | num_results: int = None,
97 | number_of_scrolls: int = None
98 | )
99 | ```
100 | - **Credits**: Variable (3-20 websites × 10 credits)
101 | - **Use case**: Multi-source research and data aggregation
102 |
103 | ### Advanced Scraping Tools
104 |
105 | #### 4. `scrape`
106 | Basic scraping endpoint to fetch page content with optional heavy JavaScript rendering.
107 |
108 | ```python
109 | scrape(website_url: str, render_heavy_js: bool = None)
110 | ```
111 | - **Use case**: Simple page content fetching with JS rendering support
112 |
113 | #### 5. `sitemap`
114 | Extract sitemap URLs and structure for any website.
115 |
116 | ```python
117 | sitemap(website_url: str)
118 | ```
119 | - **Use case**: Website structure analysis and URL discovery
120 |
121 | ### Multi-Page Crawling
122 |
123 | #### 6. `smartcrawler_initiate`
124 | Initiate intelligent multi-page web crawling (asynchronous operation).
125 |
126 | ```python
127 | smartcrawler_initiate(
128 | url: str,
129 | prompt: str = None,
130 | extraction_mode: str = "ai",
131 | depth: int = None,
132 | max_pages: int = None,
133 | same_domain_only: bool = None
134 | )
135 | ```
136 | - **AI Extraction Mode**: 10 credits per page - extracts structured data
137 | - **Markdown Mode**: 2 credits per page - converts to markdown
138 | - **Returns**: `request_id` for polling
139 | - **Use case**: Large-scale website crawling and data extraction
140 |
141 | #### 7. `smartcrawler_fetch_results`
142 | Retrieve results from asynchronous crawling operations.
143 |
144 | ```python
145 | smartcrawler_fetch_results(request_id: str)
146 | ```
147 | - **Returns**: Status and results when crawling is complete
148 | - **Use case**: Poll for crawl completion and retrieve results
149 |
150 | ### Intelligent Agent-Based Scraping
151 |
152 | #### 8. `agentic_scrapper`
153 | Run advanced agentic scraping workflows with customizable steps and structured output schemas.
154 |
155 | ```python
156 | agentic_scrapper(
157 | url: str,
158 | user_prompt: str = None,
159 | output_schema: dict = None,
160 | steps: list = None,
161 | ai_extraction: bool = None,
162 | persistent_session: bool = None,
163 | timeout_seconds: float = None
164 | )
165 | ```
166 | - **Use case**: Complex multi-step workflows with custom schemas and persistent sessions
167 |
168 | ## Setup Instructions
169 |
170 | To utilize this server, you'll need a ScrapeGraph API key. Follow these steps to obtain one:
171 |
172 | 1. Navigate to the [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)
173 | 2. Create an account and generate your API key
174 |
175 | ### Automated Installation via Smithery
176 |
177 | For automated installation of the ScrapeGraph API Integration Server using [Smithery](https://smithery.ai/server/@ScrapeGraphAI/scrapegraph-mcp):
178 |
179 | ```bash
180 | npx -y @smithery/cli install @ScrapeGraphAI/scrapegraph-mcp --client claude
181 | ```
182 |
183 | ### Claude Desktop Configuration
184 |
185 | Update your Claude Desktop configuration file with the following settings:
186 |
187 | (remember to add your API key inside the config)
188 |
189 | ```json
190 | {
191 | "mcpServers": {
192 | "@ScrapeGraphAI-scrapegraph-mcp": {
193 | "command": "npx",
194 | "args": [
195 | "-y",
196 | "@smithery/cli@latest",
197 | "run",
198 | "@ScrapeGraphAI/scrapegraph-mcp",
199 | "--config",
200 | "\"{\\\"scrapegraphApiKey\\\":\\\"YOUR-SGAI-API-KEY\\\"}\""
201 | ]
202 | }
203 | }
204 | }
205 | ```
206 |
207 | The configuration file is located at:
208 | - Windows: `%APPDATA%/Claude/claude_desktop_config.json`
209 | - macOS: `~/Library/Application\ Support/Claude/claude_desktop_config.json`
210 |
211 | ### Cursor Integration
212 |
213 | Add the ScrapeGraphAI MCP server in Cursor's MCP settings:
214 |
215 | ![ScrapeGraphAI MCP server in Cursor settings](assets/cursor_mcp.png)
216 |
217 | ## Local Usage
218 |
219 | To run the MCP server locally for development or testing, follow these steps:
220 |
221 | ### Prerequisites
222 |
223 | - Python 3.13 or higher
224 | - pip or uv package manager
225 | - ScrapeGraph API key
226 |
227 | ### Installation
228 |
229 | 1. **Clone the repository** (if you haven't already):
230 |
231 | ```bash
232 | git clone https://github.com/ScrapeGraphAI/scrapegraph-mcp
233 | cd scrapegraph-mcp
234 | ```
235 |
236 | 2. **Install the package**:
237 |
238 | ```bash
239 | # Using pip
240 | pip install -e .
241 |
242 | # Or using uv (faster)
243 | uv pip install -e .
244 | ```
245 |
246 | 3. **Set your API key**:
247 |
248 | ```bash
249 | # macOS/Linux
250 | export SGAI_API_KEY=your-api-key-here
251 |
252 | # Windows (PowerShell)
253 | $env:SGAI_API_KEY="your-api-key-here"
254 |
255 | # Windows (CMD)
256 | set SGAI_API_KEY=your-api-key-here
257 | ```
258 |
259 | ### Running the Server Locally
260 |
261 | You can run the server directly:
262 |
263 | ```bash
264 | # Using the installed command
265 | scrapegraph-mcp
266 |
267 | # Or using Python module
268 | python -m scrapegraph_mcp.server
269 | ```
270 |
271 | The server will start and communicate via stdio (standard input/output), which is the standard MCP transport method.
272 |
273 | ### Testing with MCP Inspector
274 |
275 | Test your local server using the MCP Inspector tool:
276 |
277 | ```bash
278 | npx @modelcontextprotocol/inspector python -m scrapegraph_mcp.server
279 | ```
280 |
281 | This provides a web interface to test all available tools interactively.
282 |
283 | ### Configuring Claude Desktop for Local Server
284 |
285 | To use your locally running server with Claude Desktop, update your configuration file:
286 |
287 | **macOS/Linux** (`~/Library/Application Support/Claude/claude_desktop_config.json`):
288 |
289 | ```json
290 | {
291 | "mcpServers": {
292 | "scrapegraph-mcp-local": {
293 | "command": "python",
294 | "args": [
295 | "-m",
296 | "scrapegraph_mcp.server"
297 | ],
298 | "env": {
299 | "SGAI_API_KEY": "your-api-key-here"
300 | }
301 | }
302 | }
303 | }
304 | ```
305 |
306 | **Windows** (`%APPDATA%\Claude\claude_desktop_config.json`):
307 |
308 | ```json
309 | {
310 | "mcpServers": {
311 | "scrapegraph-mcp-local": {
312 | "command": "python",
313 | "args": [
314 | "-m",
315 | "scrapegraph_mcp.server"
316 | ],
317 | "env": {
318 | "SGAI_API_KEY": "your-api-key-here"
319 | }
320 | }
321 | }
322 | }
323 | ```
324 |
325 | **Note**: Make sure Python is in your PATH. You can verify by running `python --version` in your terminal.
326 |
327 | ### Configuring Cursor for Local Server
328 |
329 | In Cursor's MCP settings, add a new server with:
330 |
331 | - **Command**: `python`
332 | - **Args**: `["-m", "scrapegraph_mcp.server"]`
333 | - **Environment Variables**: `{"SGAI_API_KEY": "your-api-key-here"}`
334 |
335 | ### Troubleshooting Local Setup
336 |
337 | **Server not starting:**
338 | - Verify Python is installed: `python --version`
339 | - Check that the package is installed: `pip list | grep scrapegraph-mcp`
340 | - Ensure API key is set: `echo $SGAI_API_KEY` (macOS/Linux) or `echo %SGAI_API_KEY%` (Windows)
341 |
342 | **Tools not appearing:**
343 | - Check Claude Desktop logs:
344 | - macOS: `~/Library/Logs/Claude/`
345 | - Windows: `%APPDATA%\Claude\Logs\`
346 | - Verify the server starts without errors when run directly
347 | - Check that the configuration JSON is valid
348 |
349 | **Import errors:**
350 | - Reinstall the package: `pip install -e . --force-reinstall`
351 | - Verify dependencies: `pip install -r requirements.txt` (if available)
352 |
353 | ## Google ADK Integration
354 |
355 | The ScrapeGraph MCP server can be integrated with [Google ADK (Agent Development Kit)](https://github.com/google/adk) to create AI agents with web scraping capabilities.
356 |
357 | ### Prerequisites
358 |
359 | - Python 3.13 or higher
360 | - Google ADK installed
361 | - ScrapeGraph API key
362 |
363 | ### Installation
364 |
365 | 1. **Install Google ADK** (if not already installed):
366 |
367 | ```bash
368 | pip install google-adk
369 | ```
370 |
371 | 2. **Set your API key**:
372 |
373 | ```bash
374 | export SGAI_API_KEY=your-api-key-here
375 | ```
376 |
377 | ### Basic Integration Example
378 |
379 | Create an agent file (e.g., `agent.py`) with the following configuration:
380 |
381 | ```python
382 | import os
383 | from google.adk.agents import LlmAgent
384 | from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
385 | from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
386 | from mcp import StdioServerParameters
387 |
388 | # Path to the scrapegraph-mcp server directory
389 | SCRAPEGRAPH_MCP_PATH = "/path/to/scrapegraph-mcp"
390 |
391 | # Path to the server.py file
392 | SERVER_SCRIPT_PATH = os.path.join(
393 | SCRAPEGRAPH_MCP_PATH,
394 | "src",
395 | "scrapegraph_mcp",
396 | "server.py"
397 | )
398 |
399 | root_agent = LlmAgent(
400 | model='gemini-2.0-flash',
401 | name='scrapegraph_assistant_agent',
402 | instruction='Help the user with web scraping and data extraction using ScrapeGraph AI. '
403 | 'You can convert webpages to markdown, extract structured data using AI, '
404 | 'perform web searches, crawl multiple pages, and automate complex scraping workflows.',
405 | tools=[
406 | MCPToolset(
407 | connection_params=StdioConnectionParams(
408 | server_params=StdioServerParameters(
409 | command='python3',
410 | args=[
411 | SERVER_SCRIPT_PATH,
412 | ],
413 | env={
414 | 'SGAI_API_KEY': os.getenv('SGAI_API_KEY'),
415 | },
416 | ),
417 |                 timeout=300.0,
418 |             ),
419 |             # Optional: Filter which tools from the MCP server are exposed
420 |             # tool_filter=['markdownify', 'smartscraper', 'searchscraper'],
421 |         )
422 | ],
423 | )
424 | ```
425 |
426 | ### Configuration Options
427 |
428 | **Timeout Settings:**
429 | - Default timeout is 5 seconds, which may be too short for web scraping operations
430 | - Recommended: Set `timeout=300.0` (5 minutes), as in the example above
431 | - Adjust based on your use case (crawling operations may need even longer timeouts)
432 |
433 | **Tool Filtering:**
434 | - By default, all 8 tools are exposed to the agent
435 | - Use `tool_filter` to limit which tools are available:
436 | ```python
437 | tool_filter=['markdownify', 'smartscraper', 'searchscraper']
438 | ```
439 |
440 | **API Key Configuration:**
441 | - Set via environment variable: `export SGAI_API_KEY=your-key`
442 | - Or pass directly in `env` dict: `'SGAI_API_KEY': 'your-key-here'`
443 | - Environment variable approach is recommended for security
444 |
445 | ### Usage Example
446 |
447 | Once configured, your agent can use natural language to interact with web scraping tools:
448 |
449 | ```python
450 | # The agent can now handle queries like:
451 | # - "Convert https://example.com to markdown"
452 | # - "Extract all product prices from this e-commerce page"
453 | # - "Search for recent AI research papers and summarize them"
454 | # - "Crawl this documentation site and extract all API endpoints"
455 | ```
456 | For more information about Google ADK, visit the [official documentation](https://github.com/google/adk).
457 |
458 | ## Example Use Cases
459 |
460 | The server enables sophisticated queries across various scraping scenarios:
461 |
462 | ### Single Page Scraping
463 | - **Markdownify**: "Convert the ScrapeGraph documentation page to markdown"
464 | - **SmartScraper**: "Extract all product names, prices, and ratings from this e-commerce page"
465 | - **SmartScraper with scrolling**: "Scrape this infinite scroll page with 5 scrolls and extract all items"
466 | - **Basic Scrape**: "Fetch the HTML content of this JavaScript-heavy page with full rendering"
467 |
468 | ### Search and Research
469 | - **SearchScraper**: "Research and summarize recent developments in AI-powered web scraping"
470 | - **SearchScraper**: "Search for the top 5 articles about machine learning frameworks and extract key insights"
471 | - **SearchScraper**: "Find recent news about GPT-4 and provide a structured summary"
472 |
473 | ### Website Analysis
474 | - **Sitemap**: "Extract the complete sitemap structure from the ScrapeGraph website"
475 | - **Sitemap**: "Discover all URLs on this blog site"
476 |
477 | ### Multi-Page Crawling
478 | - **SmartCrawler (AI mode)**: "Crawl the entire documentation site and extract all API endpoints with descriptions"
479 | - **SmartCrawler (Markdown mode)**: "Convert all pages in the blog to markdown up to 2 levels deep"
480 | - **SmartCrawler**: "Extract all product information from an e-commerce site, maximum 100 pages, same domain only"
481 |
482 | ### Advanced Agentic Scraping
483 | - **Agentic Scraper**: "Navigate through a multi-step authentication form and extract user dashboard data"
484 | - **Agentic Scraper with schema**: "Follow pagination links and compile a dataset with schema: {title, author, date, content}"
485 | - **Agentic Scraper**: "Execute a complex workflow: login, navigate to reports, download data, and extract summary statistics"
486 |
487 | ## Error Handling
488 |
489 | The server implements robust error handling with detailed, actionable error messages for:
490 |
491 | - API authentication issues
492 | - Malformed URL structures
493 | - Network connectivity failures
494 | - Rate limiting and quota management
495 |
496 | ## Common Issues
497 |
498 | ### Windows-Specific Connection
499 |
500 | When running on Windows systems, you may need to use the following command to connect to the MCP server:
501 |
502 | ```bash
503 | C:\Windows\System32\cmd.exe /c npx -y @smithery/cli@latest run @ScrapeGraphAI/scrapegraph-mcp --config "{\"scrapegraphApiKey\":\"YOUR-SGAI-API-KEY\"}"
504 | ```
505 |
506 | This ensures proper execution in the Windows environment.
507 |
508 | ### Other Common Issues
509 |
510 | **"ScrapeGraph client not initialized"**
511 | - **Cause**: Missing API key
512 | - **Solution**: Set `SGAI_API_KEY` environment variable or provide via `--config`
513 |
514 | **"Error 401: Unauthorized"**
515 | - **Cause**: Invalid API key
516 | - **Solution**: Verify your API key at the [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)
517 |
518 | **"Error 402: Payment Required"**
519 | - **Cause**: Insufficient credits
520 | - **Solution**: Add credits to your ScrapeGraph account
521 |
522 | **SmartCrawler not returning results**
523 | - **Cause**: Still processing (asynchronous operation)
524 | - **Solution**: Keep polling `smartcrawler_fetch_results()` until status is "completed"
525 |
526 | **Tools not appearing in Claude Desktop**
527 | - **Cause**: Server not starting or configuration error
528 | - **Solution**: Check Claude logs at `~/Library/Logs/Claude/` (macOS) or `%APPDATA%\Claude\Logs\` (Windows)
529 |
530 | For detailed troubleshooting, see the [.agent documentation](.agent/README.md).
531 |
532 | ## Development
533 |
534 | ### Prerequisites
535 |
536 | - Python 3.13 or higher
537 | - pip or uv package manager
538 | - ScrapeGraph API key
539 |
540 | ### Installation from Source
541 |
542 | ```bash
543 | # Clone the repository
544 | git clone https://github.com/ScrapeGraphAI/scrapegraph-mcp
545 | cd scrapegraph-mcp
546 |
547 | # Install dependencies
548 | pip install -e ".[dev]"
549 |
550 | # Set your API key
551 | export SGAI_API_KEY=your-api-key
552 |
553 | # Run the server
554 | scrapegraph-mcp
555 | # or
556 | python -m scrapegraph_mcp.server
557 | ```
558 |
559 | ### Testing with MCP Inspector
560 |
561 | Test your server locally using the MCP Inspector tool:
562 |
563 | ```bash
564 | npx @modelcontextprotocol/inspector scrapegraph-mcp
565 | ```
566 |
567 | This provides a web interface to test all available tools.
568 |
569 | ### Code Quality
570 |
571 | **Linting:**
572 | ```bash
573 | ruff check src/
574 | ```
575 |
576 | **Type Checking:**
577 | ```bash
578 | mypy src/
579 | ```
580 |
581 | **Format Checking:**
582 | ```bash
583 | ruff format --check src/
584 | ```
585 |
586 | ### Project Structure
587 |
588 | ```
589 | scrapegraph-mcp/
590 | ├── src/
591 | │ └── scrapegraph_mcp/
592 | │ ├── __init__.py # Package initialization
593 | │ └── server.py # Main MCP server (all code in one file)
594 | ├── .agent/ # Developer documentation
595 | │ ├── README.md # Documentation index
596 | │ └── system/ # System architecture docs
597 | ├── assets/ # Images and badges
598 | ├── pyproject.toml # Project metadata & dependencies
599 | ├── smithery.yaml # Smithery deployment config
600 | └── README.md # This file
601 | ```
602 |
603 | ## Contributing
604 |
605 | We welcome contributions! Here's how you can help:
606 |
607 | ### Adding a New Tool
608 |
609 | 1. **Add method to `ScapeGraphClient` class** in [server.py](src/scrapegraph_mcp/server.py):
610 |
611 | ```python
612 | def new_tool(self, param: str) -> Dict[str, Any]:
613 | """Tool description."""
614 | url = f"{self.BASE_URL}/new-endpoint"
615 | data = {"param": param}
616 | response = self.client.post(url, headers=self.headers, json=data)
617 | if response.status_code != 200:
618 | raise Exception(f"Error {response.status_code}: {response.text}")
619 | return response.json()
620 | ```
621 |
622 | 2. **Add MCP tool decorator**:
623 |
624 | ```python
625 | @mcp.tool()
626 | def new_tool(param: str) -> Dict[str, Any]:
627 | """
628 | Tool description for AI assistants.
629 |
630 | Args:
631 | param: Parameter description
632 |
633 | Returns:
634 | Dictionary containing results
635 | """
636 | if scrapegraph_client is None:
637 | return {"error": "ScrapeGraph client not initialized. Please provide an API key."}
638 |
639 | try:
640 | return scrapegraph_client.new_tool(param)
641 | except Exception as e:
642 | return {"error": str(e)}
643 | ```
644 |
645 | 3. **Test with MCP Inspector**:
646 | ```bash
647 | npx @modelcontextprotocol/inspector scrapegraph-mcp
648 | ```
649 |
650 | 4. **Update documentation**:
651 | - Add tool to this README
652 | - Update [.agent documentation](.agent/README.md)
653 |
654 | 5. **Submit a pull request**
655 |
656 | ### Development Workflow
657 |
658 | 1. Fork the repository
659 | 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
660 | 3. Make your changes
661 | 4. Run linting and type checking
662 | 5. Test with MCP Inspector and Claude Desktop
663 | 6. Update documentation
664 | 7. Commit your changes (`git commit -m 'Add amazing feature'`)
665 | 8. Push to the branch (`git push origin feature/amazing-feature`)
666 | 9. Open a Pull Request
667 |
668 | ### Code Style
669 |
670 | - **Line length**: 100 characters
671 | - **Type hints**: Required for all functions
672 | - **Docstrings**: Google-style docstrings
673 | - **Error handling**: Return error dicts, don't raise exceptions in tools
674 | - **Python version**: Target 3.13+
675 |
676 | For detailed development guidelines, see the [.agent documentation](.agent/README.md).
677 |
678 | ## Documentation
679 |
680 | For comprehensive developer documentation, see:
681 |
682 | - **[.agent/README.md](.agent/README.md)** - Complete developer documentation index
683 | - **[.agent/system/project_architecture.md](.agent/system/project_architecture.md)** - System architecture and design
684 | - **[.agent/system/mcp_protocol.md](.agent/system/mcp_protocol.md)** - MCP protocol integration details
685 |
686 | ## Technology Stack
687 |
688 | ### Core Framework
689 | - **Python 3.13+** - Modern Python with type hints
690 | - **FastMCP** - Lightweight MCP server framework
691 | - **httpx 0.24.0+** - Modern async HTTP client
692 |
693 | ### Development Tools
694 | - **Ruff** - Fast Python linter and formatter
695 | - **mypy** - Static type checker
696 | - **Hatchling** - Modern build backend
697 |
698 | ### Deployment
699 | - **Smithery** - Automated MCP server deployment
700 | - **Docker** - Container support (Python slim base image)
701 | - **stdio transport** - Standard MCP communication
702 |
703 | ### API Integration
704 | - **ScrapeGraph AI API** - Enterprise web scraping service
705 | - **Base URL**: `https://api.scrapegraphai.com/v1`
706 | - **Authentication**: API key-based
707 |
708 | ## License
709 |
710 | This project is distributed under the MIT License. For detailed terms and conditions, please refer to the LICENSE file.
711 |
712 | ## Acknowledgments
713 |
714 | Special thanks to [tomekkorbak](https://github.com/tomekkorbak) for his implementation of [oura-mcp-server](https://github.com/tomekkorbak/oura-mcp-server), which served as a starting point for this repository.
715 |
716 | ## Resources
717 |
718 | ### Official Links
719 | - [ScrapeGraph AI Homepage](https://scrapegraphai.com)
720 | - [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com) - Get your API key
721 | - [ScrapeGraph API Documentation](https://api.scrapegraphai.com/docs)
722 | - [GitHub Repository](https://github.com/ScrapeGraphAI/scrapegraph-mcp)
723 |
724 | ### MCP Resources
725 | - [Model Context Protocol](https://modelcontextprotocol.io/) - Official MCP specification
726 | - [FastMCP Framework](https://github.com/jlowin/fastmcp) - Framework used by this server
727 | - [MCP Inspector](https://github.com/modelcontextprotocol/inspector) - Testing tool
728 | - [Smithery](https://smithery.ai/server/@ScrapeGraphAI/scrapegraph-mcp) - MCP server distribution
729 | - mcp-name: io.github.ScrapeGraphAI/scrapegraph-mcp
730 |
731 | ### AI Assistant Integration
732 | - [Claude Desktop](https://claude.ai/desktop) - Desktop app with MCP support
733 | - [Cursor](https://cursor.sh/) - AI-powered code editor
734 |
735 | ### Support
736 | - [GitHub Issues](https://github.com/ScrapeGraphAI/scrapegraph-mcp/issues) - Report bugs or request features
737 | - [Developer Documentation](.agent/README.md) - Comprehensive dev docs
738 |
739 | ---
740 |
741 | Made with ❤️ by [ScrapeGraphAI](https://scrapegraphai.com) Team
742 |
```
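
Beyond Claude Desktop, Cursor, and Google ADK, the server can be driven from any custom MCP client over stdio. The snippet below is a minimal sketch using the official MCP Python SDK (`pip install mcp`), which is not part of this repository: the `scrapegraph-mcp` command, the `SGAI_API_KEY` variable, and the `markdownify` tool with its `website_url` argument come from the README above, while the client-side calls (`stdio_client`, `ClientSession`, `call_tool`) are standard SDK usage and may need adjusting to your SDK version.

```python
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the locally installed server over stdio; pass the parent environment
    # through so PATH and SGAI_API_KEY reach the child process.
    server = StdioServerParameters(
        command="scrapegraph-mcp",
        env=dict(os.environ),
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List the tools the server exposes, then call one of them.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            result = await session.call_tool(
                "markdownify",
                arguments={"website_url": "https://scrapegraphai.com"},
            )
            print(result.content)


if __name__ == "__main__":
    asyncio.run(main())
```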
--------------------------------------------------------------------------------
/CLAUDE.md:
--------------------------------------------------------------------------------
```markdown
1 | # DOCS
2 | We keep all important docs in the .agent folder and keep updating them, structured like below:
3 | .agent
4 | - Tasks: PRD & implementation plan for each feature
5 | - System: Document the current state of the system (project structure, tech stack, integration points, database schema, and core functionalities such as agent architecture, LLM layer, etc.)
6 | - SOP: Best practices for executing certain tasks (e.g. how to add a schema migration, how to add a new page route, etc.)
7 | - README.md: an index of all the documentation we have so people know what & where to look for things
8 | We should always update the .agent docs after we implement a feature, to make sure they fully reflect up-to-date information.
9 | Before you plan any implementation, always read .agent/README.md first to get context.
```
--------------------------------------------------------------------------------
/src/scrapegraph_mcp/__init__.py:
--------------------------------------------------------------------------------
```python
1 | """ScapeGraph MCP Server."""
2 |
3 | from .server import main
4 |
5 | __all__ = ["main"]
6 |
```
--------------------------------------------------------------------------------
/smithery.yaml:
--------------------------------------------------------------------------------
```yaml
1 | # Smithery configuration file: https://smithery.ai/docs/config#smitheryyaml
2 |
3 |
4 | configSchema:
5 | # JSON Schema defining the configuration options for the MCP.
6 | type: "object"
7 | required: ["scrapegraphApiKey"]
8 | properties:
9 | scrapegraphApiKey:
10 | type: "string"
11 | description: "Your Scrapegraph API key"
12 |
13 | runtime: "python"
```
--------------------------------------------------------------------------------
/.github/workflows/python-package.yml:
--------------------------------------------------------------------------------
```yaml
1 | name: Python Package
2 |
3 | on:
4 | push:
5 | branches: [ main ]
6 | pull_request:
7 | branches: [ main ]
8 |
9 | jobs:
10 | lint:
11 | runs-on: ubuntu-latest
12 | steps:
13 | - uses: actions/checkout@v3
14 | - name: Set up Python 3.12
15 | uses: actions/setup-python@v4
16 | with:
17 | python-version: "3.12"
18 | - name: Install dependencies
19 | run: |
20 | python -m pip install --upgrade pip
21 | python -m pip install ruff mypy
22 | pip install -e .
23 | - name: Lint with ruff
24 | run: |
25 | ruff check .
26 | - name: Type check with mypy
27 | run: |
28 | mypy src
```
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
```dockerfile
1 | # Use Python slim image
2 | FROM python:3.11-slim
3 |
4 | # Set working directory
5 | WORKDIR /app
6 |
7 | # Set Python unbuffered mode
8 | ENV PYTHONUNBUFFERED=1
9 |
10 | # Copy pyproject.toml and README.md first for better caching
11 | COPY pyproject.toml README.md ./
12 |
13 | # Copy the source code
14 | COPY src/ ./src/
15 |
16 | # Install the package and its dependencies from pyproject.toml
17 | RUN pip install --no-cache-dir .
18 |
19 | # Create non-root user
20 | RUN useradd -m -u 1000 mcpuser && \
21 | chown -R mcpuser:mcpuser /app
22 |
23 | # Switch to non-root user
24 | USER mcpuser
25 |
26 | # Run the server
27 | CMD ["python", "-m", "scrapegraph_mcp.server"]
28 |
29 |
```
--------------------------------------------------------------------------------
/server.json:
--------------------------------------------------------------------------------
```json
1 | {
2 | "$schema": "https://static.modelcontextprotocol.io/schemas/2025-10-17/server.schema.json",
3 | "name": "io.github.ScrapeGraphAI/scrapegraph-mcp",
4 | "description": "AI-powered web scraping and data extraction capabilities through ScrapeGraph API",
5 | "repository": {
6 | "url": "https://github.com/ScrapeGraphAI/scrapegraph-mcp",
7 | "source": "github"
8 | },
9 | "version": "1.0.0",
10 | "packages": [
11 | {
12 | "registryType": "pypi",
13 | "identifier": "scrapegraph-mcp",
14 | "version": "1.0.0",
15 | "transport": {
16 | "type": "stdio"
17 | },
18 | "environmentVariables": [
19 | {
20 | "description": "Your ScapeGraph API key (optional - can also be set via MCP config)",
21 | "isRequired": false,
22 | "format": "string",
23 | "isSecret": true,
24 | "name": "SGAI_API_KEY"
25 | }
26 | ]
27 | }
28 | ]
29 | }
```
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
```toml
1 | [project]
2 | name = "scrapegraph-mcp"
3 | version = "1.0.1"
4 | description = "MCP server for ScapeGraph API integration"
5 | license = {text = "MIT"}
6 | readme = "README.md"
7 | authors = [
8 | { name = "Marco Perini", email = "[email protected]" }
9 | ]
10 | requires-python = ">=3.10"
11 | dependencies = [
12 | "fastmcp>=2.0.0",
13 | "httpx>=0.24.0",
14 | "uvicorn>=0.27.0",
15 | "pydantic>=2.0.0",
16 | "smithery>=0.4.2",
17 | ]
18 | classifiers = [
19 | "Development Status :: 4 - Beta",
20 | "Intended Audience :: Developers",
21 | "License :: OSI Approved :: MIT License",
22 | "Programming Language :: Python :: 3",
23 | "Programming Language :: Python :: 3.10",
24 | "Topic :: Software Development :: Libraries :: Python Modules",
25 | ]
26 |
27 | [project.optional-dependencies]
28 | dev = [
29 | "ruff>=0.1.0",
30 | "mypy>=1.0.0",
31 | ]
32 |
33 | [project.urls]
34 | "Homepage" = "https://github.com/ScrapeGraphAI/scrapegraph-mcp"
35 | "Bug Tracker" = "https://github.com/ScrapeGraphAI/scrapegraph-mcp/issues"
36 |
37 | [project.scripts]
38 | scrapegraph-mcp = "scrapegraph_mcp.server:main"
39 |
40 | [build-system]
41 | requires = ["hatchling"]
42 | build-backend = "hatchling.build"
43 |
44 | [tool.hatch.build.targets.wheel]
45 | packages = ["src/scrapegraph_mcp"]
46 |
47 | [tool.hatch.build]
48 | only-packages = true
49 |
50 | [tool.ruff]
51 | line-length = 100
52 | target-version = "py312"
53 | select = ["E", "F", "I", "B", "W"]
54 |
55 | [tool.mypy]
56 | python_version = "3.12"
57 | warn_return_any = true
58 | warn_unused_configs = true
59 | disallow_untyped_defs = true
60 | disallow_incomplete_defs = true
61 |
62 | [tool.smithery]
63 | server = "scrapegraph_mcp.server:create_server"
```
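
For orientation, the two entry points declared above (`scrapegraph-mcp = "scrapegraph_mcp.server:main"` and `[tool.smithery] server = "scrapegraph_mcp.server:create_server"`) map onto a FastMCP module roughly as sketched below. This is an illustrative skeleton based on the initialization flow in the architecture docs, not the actual contents of `server.py`: the tool body is a placeholder, the `ScapeGraphClient` wrapper and configuration handling are omitted, and the import uses the standalone `fastmcp` package declared here (the architecture docs reference the equivalent `mcp.server.fastmcp` module).

```python
"""Illustrative skeleton of how the pyproject entry points wire into FastMCP."""

from typing import Any, Dict

from fastmcp import FastMCP

# Single FastMCP instance; tools are registered against it with decorators.
mcp = FastMCP("ScapeGraph API MCP Server")


@mcp.tool()
def markdownify(website_url: str) -> Dict[str, Any]:
    """Convert a webpage into clean markdown (real implementation elided)."""
    return {"error": "not implemented in this sketch"}


def create_server() -> FastMCP:
    # Referenced by [tool.smithery] server = "scrapegraph_mcp.server:create_server"
    return mcp


def main() -> None:
    # Referenced by the scrapegraph-mcp console script; stdio is the MCP transport.
    mcp.run(transport="stdio")


if __name__ == "__main__":
    main()
```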
--------------------------------------------------------------------------------
/.agent/system/project_architecture.md:
--------------------------------------------------------------------------------
```markdown
1 | # ScrapeGraph MCP Server - Project Architecture
2 |
3 | **Last Updated:** October 2025
4 | **Version:** 1.0.0
5 |
6 | ## Table of Contents
7 | - [System Overview](#system-overview)
8 | - [Technology Stack](#technology-stack)
9 | - [Project Structure](#project-structure)
10 | - [Core Architecture](#core-architecture)
11 | - [MCP Tools](#mcp-tools)
12 | - [API Integration](#api-integration)
13 | - [Deployment](#deployment)
14 | - [Recent Updates](#recent-updates)
15 |
16 | ---
17 |
18 | ## System Overview
19 |
20 | The ScrapeGraph MCP Server is a production-ready [Model Context Protocol](https://modelcontextprotocol.io/introduction) (MCP) server that provides seamless integration between AI assistants (like Claude, Cursor, etc.) and the [ScrapeGraphAI API](https://scrapegraphai.com). This server enables language models to leverage advanced AI-powered web scraping capabilities with enterprise-grade reliability.
21 |
22 | **Key Capabilities:**
23 | - **Markdownify** - Convert webpages to clean, structured markdown
24 | - **SmartScraper** - AI-powered structured data extraction from webpages
25 | - **SearchScraper** - AI-powered web searches with structured results
26 | - **SmartCrawler** - Intelligent multi-page web crawling with AI extraction or markdown conversion
27 |
28 | **Purpose:**
29 | - Bridge AI assistants (Claude, Cursor, etc.) with web scraping capabilities
30 | - Enable LLMs to extract structured data from any website
31 | - Provide clean, formatted markdown conversion of web content
32 | - Execute multi-page crawling operations with AI-powered extraction
33 |
34 | ---
35 |
36 | ## Technology Stack
37 |
38 | ### Core Framework
39 | - **Python 3.10+** - Programming language (minimum version)
40 | - **mcp[cli] 1.3.0+** - Model Context Protocol SDK for Python
41 | - **FastMCP** - Lightweight MCP server framework built on top of mcp
42 |
43 | ### HTTP Client
44 | - **httpx 0.24.0+** - Modern async HTTP client for API requests
45 | - **Timeout:** 60 seconds for all API requests
46 |
47 | ### Development Tools
48 | - **Ruff 0.1.0+** - Fast Python linter
49 | - **mypy 1.0.0+** - Static type checker
50 |
51 | ### Build System
52 | - **Hatchling** - Modern Python build backend
53 | - **pyproject.toml** - PEP 621 compliant project metadata
54 |
55 | ### Deployment
56 | - **Docker** - Containerization with a Python slim base image
57 | - **Smithery** - Automated MCP server deployment and distribution
58 | - **stdio transport** - Standard input/output for MCP communication
59 |
60 | ---
61 |
62 | ## Project Structure
63 |
64 | ```
65 | scrapegraph-mcp/
66 | ├── src/
67 | │ └── scrapegraph_mcp/
68 | │ ├── __init__.py # Package initialization
69 | │ └── server.py # Main MCP server implementation
70 | │
71 | ├── assets/
72 | │ ├── sgai_smithery.png # Smithery integration badge
73 | │ └── cursor_mcp.png # Cursor integration screenshot
74 | │
75 | ├── .github/
76 | │ └── workflows/ # CI/CD workflows (python-package.yml)
77 | │
78 | ├── pyproject.toml # Project metadata and dependencies
79 | ├── Dockerfile # Docker container definition
80 | ├── smithery.yaml # Smithery deployment configuration
81 | ├── README.md # User-facing documentation
82 | ├── LICENSE # MIT License
83 | └── .python-version # Python version specification
84 | ```
85 |
86 | ### Key Files
87 |
88 | **`src/scrapegraph_mcp/server.py`**
89 | - Main server implementation
90 | - `ScapeGraphClient` - API client wrapper
91 | - MCP tool definitions (`@mcp.tool()` decorators)
92 | - Server initialization and main entry point
93 |
94 | **`pyproject.toml`**
95 | - Project metadata (name, version, authors)
96 | - Dependencies (fastmcp, httpx, uvicorn, pydantic, smithery)
97 | - Build configuration (hatchling)
98 | - Tool configuration (ruff, mypy)
99 | - Entry point: `scrapegraph-mcp` → `scrapegraph_mcp.server:main`
100 |
101 | **`Dockerfile`**
102 | - Python 3.11 slim base image
103 | - Installs the package and its dependencies from `pyproject.toml`
104 | - Runs as a non-root `mcpuser` user
105 | - Command: `python -m scrapegraph_mcp.server`
106 |
107 | **`smithery.yaml`**
108 | - Smithery deployment configuration
109 | - Config schema (requires `scrapegraphApiKey`)
110 | - Python runtime declaration
111 |
112 | ---
113 |
114 | ## Core Architecture
115 |
116 | ### Model Context Protocol (MCP)
117 |
118 | The server implements the Model Context Protocol, which defines a standard way for AI assistants to interact with external tools and services.
119 |
120 | **MCP Components:**
121 | 1. **Server** - Exposes tools to AI assistants (this project)
122 | 2. **Client** - AI assistant that uses the tools (Claude, Cursor, etc.)
123 | 3. **Transport** - Communication layer (stdio)
124 | 4. **Tools** - Functions that the AI can call
125 |
126 | **Communication Flow:**
127 | ```
128 | AI Assistant (Claude/Cursor)
129 | ↓ (stdio via MCP)
130 | FastMCP Server (this project)
131 | ↓ (HTTPS API calls)
132 | ScrapeGraphAI API (https://api.scrapegraphai.com/v1)
133 | ↓ (web scraping)
134 | Target Websites
135 | ```
136 |
137 | ### Server Architecture
138 |
139 | The server follows a simple, single-file architecture:
140 |
141 | **`ScapeGraphClient` Class:**
142 | - HTTP client wrapper for ScrapeGraphAI API
143 | - Base URL: `https://api.scrapegraphai.com/v1`
144 | - API key authentication via `SGAI-APIKEY` header
145 | - Methods: `markdownify()`, `smartscraper()`, `searchscraper()`, `smartcrawler_initiate()`, `smartcrawler_fetch_results()`
146 |
147 | **FastMCP Server:**
148 | - Created with `FastMCP("ScapeGraph API MCP Server")`
149 | - Exposes tools via `@mcp.tool()` decorators
150 | - Tool functions wrap `ScapeGraphClient` methods
151 | - Error handling with try/except blocks
152 | - Returns dictionaries with results or error messages
153 |
154 | **Initialization Flow:**
155 | 1. Import dependencies (`httpx`, `mcp.server.fastmcp`)
156 | 2. Define `ScapeGraphClient` class
157 | 3. Create `FastMCP` server instance
158 | 4. Initialize `ScapeGraphClient` with API key from env or config
159 | 5. Define MCP tools with `@mcp.tool()` decorators
160 | 6. Start server with `mcp.run(transport="stdio")`
161 |
162 | ### Design Patterns
163 |
164 | **1. Wrapper Pattern**
165 | - `ScapeGraphClient` wraps the ScrapeGraphAI REST API
166 | - Simplifies API interactions with typed methods
167 | - Centralizes authentication and error handling
168 |
169 | **2. Decorator Pattern**
170 | - `@mcp.tool()` decorators expose functions as MCP tools
171 | - Automatic serialization/deserialization
172 | - Type hints → MCP schema generation
173 |
174 | **3. Singleton Pattern**
175 | - Single `scrapegraph_client` instance
176 | - Shared across all tool invocations
177 | - Reused HTTP client connection
178 |
179 | **4. Error Handling Pattern**
180 | - Try/except blocks in all tool functions
181 | - Return error dictionaries instead of raising exceptions
182 | - Ensures graceful degradation for AI assistants
183 |
184 | ---
185 |
186 | ## MCP Tools
187 |
188 | The server exposes 5 tools to AI assistants:
189 |
190 | ### 1. `markdownify(website_url: str)`
191 |
192 | **Purpose:** Convert a webpage into clean, formatted markdown
193 |
194 | **Parameters:**
195 | - `website_url` (str) - URL of the webpage to convert
196 |
197 | **Returns:**
198 | ```json
199 | {
200 | "result": "# Page Title\n\nContent in markdown format..."
201 | }
202 | ```
203 |
204 | **Error Response:**
205 | ```json
206 | {
207 | "error": "Error 404: Not Found"
208 | }
209 | ```
210 |
211 | **Example Usage (from AI):**
212 | ```
213 | "Convert https://scrapegraphai.com to markdown"
214 | → AI calls: markdownify("https://scrapegraphai.com")
215 | ```
216 |
217 | **API Endpoint:** `POST /v1/markdownify`
218 |
219 | **Credits:** 2 credits per request
220 |
221 | ---
222 |
223 | ### 2. `smartscraper(user_prompt: str, website_url: str, number_of_scrolls: int = None, markdown_only: bool = None)`
224 |
225 | **Purpose:** Extract structured data from a webpage using AI
226 |
227 | **Parameters:**
228 | - `user_prompt` (str) - Instructions for what data to extract
229 | - `website_url` (str) - URL of the webpage to scrape
230 | - `number_of_scrolls` (int, optional) - Number of infinite scrolls to perform
231 | - `markdown_only` (bool, optional) - Return only markdown without AI processing
232 |
233 | **Returns:**
234 | ```json
235 | {
236 | "result": {
237 | "extracted_field_1": "value1",
238 | "extracted_field_2": "value2"
239 | }
240 | }
241 | ```
242 |
243 | **Example Usage:**
244 | ```
245 | "Extract all product names and prices from https://example.com/products"
246 | → AI calls: smartscraper(
247 | user_prompt="Extract product names and prices",
248 | website_url="https://example.com/products"
249 | )
250 | ```
251 |
252 | **API Endpoint:** `POST /v1/smartscraper`
253 |
254 | **Credits:** 10 credits (base) + 1 credit per scroll + additional charges
255 |
256 | ---
257 |
258 | ### 3. `searchscraper(user_prompt: str, num_results: int = None, number_of_scrolls: int = None)`
259 |
260 | **Purpose:** Perform AI-powered web searches with structured results
261 |
262 | **Parameters:**
263 | - `user_prompt` (str) - Search query or instructions
264 | - `num_results` (int, optional) - Number of websites to search (default: 3 = 30 credits)
265 | - `number_of_scrolls` (int, optional) - Number of infinite scrolls per website
266 |
267 | **Returns:**
268 | ```json
269 | {
270 | "result": {
271 | "answer": "Aggregated answer from multiple sources",
272 | "sources": [
273 | {"url": "https://source1.com", "data": {...}},
274 | {"url": "https://source2.com", "data": {...}}
275 | ]
276 | }
277 | }
278 | ```
279 |
280 | **Example Usage:**
281 | ```
282 | "Research the latest AI developments in 2025"
283 | → AI calls: searchscraper(
284 | user_prompt="Latest AI developments in 2025",
285 | num_results=5
286 | )
287 | ```
288 |
289 | **API Endpoint:** `POST /v1/searchscraper`
290 |
291 | **Credits:** Variable (3-20 websites × 10 credits per website)
292 |
293 | ---
294 |
295 | ### 4. `smartcrawler_initiate(url: str, prompt: str = None, extraction_mode: str = "ai", depth: int = None, max_pages: int = None, same_domain_only: bool = None)`
296 |
297 | **Purpose:** Initiate intelligent multi-page web crawling (asynchronous)
298 |
299 | **Parameters:**
300 | - `url` (str) - Starting URL to crawl
301 | - `prompt` (str, optional) - AI prompt for data extraction (required for AI mode)
302 | - `extraction_mode` (str) - "ai" for AI extraction (10 credits/page) or "markdown" for markdown conversion (2 credits/page)
303 | - `depth` (int, optional) - Maximum link traversal depth
304 | - `max_pages` (int, optional) - Maximum number of pages to crawl
305 | - `same_domain_only` (bool, optional) - Crawl only within the same domain
306 |
307 | **Returns:**
308 | ```json
309 | {
310 | "request_id": "uuid-here",
311 | "status": "processing"
312 | }
313 | ```
314 |
315 | **Example Usage:**
316 | ```
317 | "Crawl https://docs.python.org and extract all function signatures"
318 | → AI calls: smartcrawler_initiate(
319 | url="https://docs.python.org",
320 | prompt="Extract function signatures and descriptions",
321 | extraction_mode="ai",
322 | max_pages=50,
323 | same_domain_only=True
324 | )
325 | ```
326 |
327 | **API Endpoint:** `POST /v1/crawl`
328 |
329 | **Credits:** 100 credits (base) + 10 credits per page (AI mode) or 2 credits per page (markdown mode)
330 |
331 | **Note:** This is an asynchronous operation. Use `smartcrawler_fetch_results()` to retrieve results.
332 |
333 | ---
334 |
335 | ### 5. `smartcrawler_fetch_results(request_id: str)`
336 |
337 | **Purpose:** Fetch the results of a SmartCrawler operation
338 |
339 | **Parameters:**
340 | - `request_id` (str) - The request ID returned by `smartcrawler_initiate()`
341 |
342 | **Returns (while processing):**
343 | ```json
344 | {
345 | "status": "processing",
346 | "pages_processed": 15,
347 | "total_pages": 50
348 | }
349 | ```
350 |
351 | **Returns (completed):**
352 | ```json
353 | {
354 | "status": "completed",
355 | "results": [
356 | {"url": "https://page1.com", "data": {...}},
357 | {"url": "https://page2.com", "data": {...}}
358 | ],
359 | "pages_processed": 50,
360 | "total_pages": 50
361 | }
362 | ```
363 |
364 | **Example Usage:**
365 | ```
366 | AI: "Check the status of crawl request abc-123"
367 | → AI calls: smartcrawler_fetch_results("abc-123")
368 |
369 | If status is "processing":
370 | → AI: "Still processing, 15/50 pages completed"
371 |
372 | If status is "completed":
373 | → AI: "Crawl complete! Here are the results..."
374 | ```
375 |
376 | **API Endpoint:** `GET /v1/crawl/{request_id}`
377 |
378 | **Polling Strategy:**
379 | - AI assistants should poll this endpoint until `status == "completed"`
380 | - Recommended polling interval: 5-10 seconds
381 | - Maximum wait time: ~30 minutes for large crawls
382 |
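The initiate/poll workflow can also be driven directly from Python against the client class in `server.py`. A hedged sketch (field names such as `request_id`, `status`, and `results` follow the examples above and may differ slightly in the live API):

```python
# Illustrative initiate/poll workflow using the ScapeGraphClient class from server.py.
import os
import time

from scrapegraph_mcp.server import ScapeGraphClient

client = ScapeGraphClient(api_key=os.environ["SGAI_API_KEY"])
try:
    job = client.smartcrawler_initiate(
        url="https://docs.python.org",
        prompt="Extract function signatures and descriptions",
        extraction_mode="ai",
        max_pages=50,
        same_domain_only=True,
    )
    request_id = job["request_id"]

    # Poll roughly every 10 seconds, giving up after ~30 minutes for large crawls.
    deadline = time.time() + 30 * 60
    while time.time() < deadline:
        result = client.smartcrawler_fetch_results(request_id)
        if result.get("status") == "completed":
            print(result.get("results"))
            break
        time.sleep(10)
finally:
    client.close()
```
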
383 | ---
384 |
385 | ## API Integration
386 |
387 | ### ScrapeGraphAI API
388 |
389 | **Base URL:** `https://api.scrapegraphai.com/v1`
390 |
391 | **Authentication:**
392 | - Header: `SGAI-APIKEY: your-api-key`
393 | - Obtain API key from: [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)
394 |
395 | **Endpoints Used:**
396 |
397 | | Endpoint | Method | Tool |
398 | |----------|--------|------|
399 | | `/v1/markdownify` | POST | `markdownify()` |
400 | | `/v1/smartscraper` | POST | `smartscraper()` |
401 | | `/v1/searchscraper` | POST | `searchscraper()` |
402 | | `/v1/crawl` | POST | `smartcrawler_initiate()` |
403 | | `/v1/crawl/{request_id}` | GET | `smartcrawler_fetch_results()` |
404 |
405 | **Request Format:**
406 | ```json
407 | {
408 | "website_url": "https://example.com",
409 | "user_prompt": "Extract product names"
410 | }
411 | ```
412 |
413 | **Response Format:**
414 | ```json
415 | {
416 | "result": {...},
417 | "credits_used": 10
418 | }
419 | ```
420 |
421 | **Error Handling:**
422 | ```python
423 | response = self.client.post(url, headers=self.headers, json=data)
424 |
425 | if response.status_code != 200:
426 | error_msg = f"Error {response.status_code}: {response.text}"
427 | raise Exception(error_msg)
428 |
429 | return response.json()
430 | ```
431 |
432 | **HTTP Client Configuration:**
433 | - Library: `httpx`
434 | - Timeout: 120 seconds (`httpx.Timeout(120.0)`)
435 | - Synchronous client (not async)
436 |
437 | ### Credit System
438 |
439 | The MCP server is a pass-through to the ScrapeGraphAI API, so all credit costs are determined by the API:
440 |
441 | - **Markdownify:** 2 credits
442 | - **SmartScraper:** 10 credits (base) + variable
443 | - **SearchScraper:** Variable (websites × 10 credits)
444 | - **SmartCrawler (AI mode):** 100 + (pages × 10) credits
445 | - **SmartCrawler (Markdown mode):** 100 + (pages × 2) credits
446 |
447 | Credits are deducted from the API key balance on the ScrapeGraphAI platform.
448 |
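Budgeting a crawl from these figures is simple arithmetic (the 100-credit base plus the per-page cost). A rough helper, illustrative only and subject to the pricing published on the ScrapeGraphAI platform:

```python
# Rough credit estimate for a SmartCrawler job, based on the figures listed above.
def estimate_smartcrawler_credits(pages: int, extraction_mode: str = "ai") -> int:
    per_page = 10 if extraction_mode == "ai" else 2
    return 100 + pages * per_page


print(estimate_smartcrawler_credits(50, "ai"))        # 600 credits
print(estimate_smartcrawler_credits(50, "markdown"))  # 200 credits
```
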
449 | ---
450 |
451 | ## Deployment
452 |
453 | ### Installation Methods
454 |
455 | #### 1. Automated Installation via Smithery
456 |
457 | **Smithery** is the recommended deployment method for MCP servers.
458 |
459 | ```bash
460 | npx -y @smithery/cli install @ScrapeGraphAI/scrapegraph-mcp --client claude
461 | ```
462 |
463 | This automatically:
464 | - Installs the MCP server
465 | - Configures the AI client (Claude Desktop)
466 | - Prompts for API key
467 |
468 | #### 2. Manual Claude Desktop Configuration
469 |
470 | **macOS:** `~/Library/Application Support/Claude/claude_desktop_config.json`
471 | **Windows:** `%APPDATA%/Claude/claude_desktop_config.json`
472 |
473 | ```json
474 | {
475 | "mcpServers": {
476 | "@ScrapeGraphAI-scrapegraph-mcp": {
477 | "command": "npx",
478 | "args": [
479 | "-y",
480 | "@smithery/cli@latest",
481 | "run",
482 | "@ScrapeGraphAI/scrapegraph-mcp",
483 | "--config",
484 | "{\"scrapegraphApiKey\":\"YOUR-SGAI-API-KEY\"}"
485 | ]
486 | }
487 | }
488 | }
489 | ```
490 |
491 | **Windows-Specific Command:**
492 | ```bash
493 | C:\Windows\System32\cmd.exe /c npx -y @smithery/cli@latest run @ScrapeGraphAI/scrapegraph-mcp --config "{\"scrapegraphApiKey\":\"YOUR-SGAI-API-KEY\"}"
494 | ```
495 |
496 | #### 3. Cursor Integration
497 |
498 | Add the MCP server in Cursor settings:
499 |
500 | 1. Open Cursor settings
501 | 2. Navigate to MCP section
502 | 3. Add ScrapeGraphAI MCP server
503 | 4. Configure API key
504 |
505 | (See `assets/cursor_mcp.png` for screenshot)
506 |
507 | #### 4. Docker Deployment
508 |
509 | **Build:**
510 | ```bash
511 | docker build -t scrapegraph-mcp .
512 | ```
513 |
514 | **Run:**
515 | ```bash
516 | docker run -e SGAI_API_KEY=your-api-key scrapegraph-mcp
517 | ```
518 |
519 | **Dockerfile:**
520 | - Base: Python 3.12 Alpine
521 | - Build deps: gcc, musl-dev, libffi-dev
522 | - Install via pip: `pip install .`
523 | - Entrypoint: `scrapegraph-mcp`
524 |
525 | #### 5. Python Package Installation
526 |
527 | **From PyPI (once published):**
528 | ```bash
529 | pip install scrapegraph-mcp
530 | ```
531 |
532 | **From Source:**
533 | ```bash
534 | git clone https://github.com/ScrapeGraphAI/scrapegraph-mcp
535 | cd scrapegraph-mcp
536 | pip install .
537 | ```
538 |
539 | **Run:**
540 | ```bash
541 | export SGAI_API_KEY=your-api-key
542 | scrapegraph-mcp
543 | ```
544 |
545 | ### Configuration
546 |
547 | **API Key Sources (in order of precedence):**
548 | 1. `--config` parameter (Smithery): `"{\"scrapegraphApiKey\":\"key\"}"`
549 | 2. Environment variable: `SGAI_API_KEY`
550 | 3. Default: `None` (server fails to initialize)
551 |
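A condensed view of this resolution order, mirroring `get_api_key()` in `server.py` (the `config_key` argument here stands in for the Smithery session-config value):

```python
# Condensed API-key resolution, mirroring get_api_key() in server.py.
import os
from typing import Optional


def resolve_api_key(config_key: Optional[str] = None) -> str:
    api_key = config_key or os.getenv("SGAI_API_KEY")
    if not api_key:
        raise ValueError(
            "ScapeGraph API key is required: set scrapegraph_api_key in the MCP "
            "config or the SGAI_API_KEY environment variable."
        )
    return api_key
```
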
552 | **Server Transport:**
553 | - **stdio** - Standard input/output (default for MCP)
554 | - Communication via JSON-RPC over stdin/stdout
555 |
556 | ### Production Considerations
557 |
558 | **Error Handling:**
559 | - All tool functions return error dictionaries instead of raising exceptions
560 | - Prevents server crashes on API errors
561 | - Graceful degradation for AI assistants
562 |
563 | **Timeout:**
564 | - 120-second timeout for all API requests
565 | - Prevents hanging on slow websites
566 | - Consider increasing for large crawls
567 |
568 | **API Key Security:**
569 | - Never commit API keys to version control
570 | - Use environment variables or config files
571 | - Rotate keys periodically
572 |
573 | **Rate Limiting:**
574 | - Handled by the ScrapeGraphAI API
575 | - MCP server has no built-in rate limiting
576 | - Consider implementing client-side throttling for high-volume use
577 |
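One possible client-side throttle, shown as an illustrative sketch (`server.py` ships nothing like this today):

```python
# Illustrative client-side throttle: allow at most max_calls requests per period seconds.
import time
from collections import deque


class Throttle:
    def __init__(self, max_calls: int = 10, period: float = 60.0) -> None:
        self.max_calls = max_calls
        self.period = period
        self.calls: deque = deque()

    def wait(self) -> None:
        """Block until another request is allowed, then record the call."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the sliding window.
        while self.calls and now - self.calls[0] > self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```

Calling `throttle.wait()` before each `ScapeGraphClient` request keeps bursts under the chosen rate; tune `max_calls` and `period` to your plan.
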
578 | ---
579 |
580 | ## Recent Updates
581 |
582 | ### October 2025
583 |
584 | **SmartCrawler Integration (Latest):**
585 | - Added `smartcrawler_initiate()` tool for multi-page crawling
586 | - Added `smartcrawler_fetch_results()` tool for async result retrieval
587 | - Support for AI extraction mode (10 credits/page) and markdown mode (2 credits/page)
588 | - Configurable depth, max_pages, and same_domain_only parameters
589 | - Enhanced error handling for extraction mode validation
590 |
591 | **Recent Commits:**
592 | - `aebeebd` - Merge PR #5: Update to new features and add SmartCrawler
593 | - `b75053d` - Merge PR #4: Fix SmartCrawler issues
594 | - `54b330d` - Enhance error handling in ScapeGraphClient for extraction modes
595 | - `b3139dc` - Refactor web crawling methods to SmartCrawler terminology
596 | - `94173b0` - Add MseeP.ai security assessment badge
597 | - `53c2d99` - Add MCP server badge
598 |
599 | **Key Features:**
600 | 1. **SmartCrawler Support** - Multi-page crawling with AI or markdown modes
601 | 2. **Enhanced Error Handling** - Validation for extraction modes and prompts
602 | 3. **Async Operation Support** - Initiate/fetch pattern for long-running crawls
603 | 4. **Security Badges** - MseeP.ai security assessment and MCP server badges
604 |
605 | ---
606 |
607 | ## Development
608 |
609 | ### Running Locally
610 |
611 | **Prerequisites:**
612 | - Python 3.10+
613 | - pip or pipx
614 |
615 | **Install Dependencies:**
616 | ```bash
617 | pip install -e ".[dev]"
618 | ```
619 |
620 | **Run Server:**
621 | ```bash
622 | export SGAI_API_KEY=your-api-key
623 | python -m scrapegraph_mcp.server
624 | # or
625 | scrapegraph-mcp
626 | ```
627 |
628 | **Test with MCP Inspector:**
629 | ```bash
630 | npx @modelcontextprotocol/inspector scrapegraph-mcp
631 | ```
632 |
633 | ### Code Quality
634 |
635 | **Linting:**
636 | ```bash
637 | ruff check src/
638 | ```
639 |
640 | **Type Checking:**
641 | ```bash
642 | mypy src/
643 | ```
644 |
645 | **Configuration:**
646 | - **Ruff:** Line length 100, target Python 3.12, rules: E, F, I, B, W
647 | - **mypy:** Python 3.12, strict mode, disallow untyped defs
648 |
649 | ### Project Structure Best Practices
650 |
651 | **Single-File Architecture:**
652 | - All code in `src/scrapegraph_mcp/server.py`
653 | - Simple, easy to understand
654 | - Minimal dependencies
655 | - No complex abstractions
656 |
657 | **When to Refactor:**
658 | - If adding 5+ new tools, consider splitting into modules
659 | - If adding authentication logic, create separate auth module
660 | - If adding caching, create separate cache module
661 |
662 | ---
663 |
664 | ## Testing
665 |
666 | ### Manual Testing
667 |
668 | **Test markdownify:**
669 | ```bash
670 | echo '{"method":"tools/call","params":{"name":"markdownify","arguments":{"website_url":"https://scrapegraphai.com"}}}' | scrapegraph-mcp
671 | ```
672 |
673 | **Test smartscraper:**
674 | ```bash
675 | echo '{"method":"tools/call","params":{"name":"smartscraper","arguments":{"user_prompt":"Extract main features","website_url":"https://scrapegraphai.com"}}}' | scrapegraph-mcp
676 | ```
677 |
678 | **Test searchscraper:**
679 | ```bash
680 | echo '{"method":"tools/call","params":{"name":"searchscraper","arguments":{"user_prompt":"Latest AI news"}}}' | scrapegraph-mcp
681 | ```
682 |
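Note that these raw `echo` pipes are illustrative: a full MCP session also requires the JSON-RPC envelope and `initialize` handshake, so the MCP Inspector is the more reliable protocol-level test. For a quick end-to-end check of the API integration itself, the client class can be exercised directly (assuming the package is installed and `SGAI_API_KEY` is set):

```python
# Quick smoke test that bypasses MCP and exercises the API client directly.
import os

from scrapegraph_mcp.server import ScapeGraphClient

client = ScapeGraphClient(api_key=os.environ["SGAI_API_KEY"])
try:
    markdown = client.markdownify("https://scrapegraphai.com")
    print(str(markdown.get("result", markdown))[:500])
finally:
    client.close()
```
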
683 | ### Integration Testing
684 |
685 | **Claude Desktop:**
686 | 1. Configure MCP server in Claude Desktop
687 | 2. Restart Claude
688 | 3. Ask: "Convert https://scrapegraphai.com to markdown"
689 | 4. Verify tool is called and results returned
690 |
691 | **Cursor:**
692 | 1. Add MCP server in settings
693 | 2. Test with chat prompts
694 | 3. Verify tool integration
695 |
696 | ---
697 |
698 | ## Troubleshooting
699 |
700 | ### Common Issues
701 |
702 | **Issue: "ScapeGraph client not initialized"**
703 | - **Cause:** Missing API key
704 | - **Solution:** Set `SGAI_API_KEY` environment variable or pass via `--config`
705 |
706 | **Issue: "Error 401: Unauthorized"**
707 | - **Cause:** Invalid API key
708 | - **Solution:** Verify API key at [dashboard.scrapegraphai.com](https://dashboard.scrapegraphai.com)
709 |
710 | **Issue: "Error 402: Payment Required"**
711 | - **Cause:** Insufficient credits
712 | - **Solution:** Add credits to your account
713 |
714 | **Issue: "Error 504: Gateway Timeout"**
715 | - **Cause:** Website took too long to scrape
716 | - **Solution:** Retry or use `markdown_only=True` for faster processing
717 |
718 | **Issue: Windows cmd.exe not found**
719 | - **Cause:** Smithery can't find Windows command prompt
720 | - **Solution:** Use full path `C:\Windows\System32\cmd.exe`
721 |
722 | **Issue: SmartCrawler not returning results**
723 | - **Cause:** Still processing (async operation)
724 | - **Solution:** Keep polling `smartcrawler_fetch_results()` until `status == "completed"`
725 |
726 | ---
727 |
728 | ## Contributing
729 |
730 | ### Adding New Tools
731 |
732 | 1. Add method to `ScapeGraphClient` class:
733 | ```python
734 | def new_tool(self, param: str) -> Dict[str, Any]:
735 | """Tool description."""
736 | url = f"{self.BASE_URL}/new-endpoint"
737 | data = {"param": param}
738 | response = self.client.post(url, headers=self.headers, json=data)
739 | if response.status_code != 200:
740 | raise Exception(f"Error {response.status_code}: {response.text}")
741 | return response.json()
742 | ```
743 |
744 | 2. Add MCP tool decorator:
745 | ```python
746 | @mcp.tool()
747 | def new_tool(param: str) -> Dict[str, Any]:
748 | """Tool description for AI."""
749 | if scrapegraph_client is None:
750 | return {"error": "Client not initialized"}
751 | try:
752 | return scrapegraph_client.new_tool(param)
753 | except Exception as e:
754 | return {"error": str(e)}
755 | ```
756 |
757 | 3. Update documentation:
758 | - Add tool to [MCP Tools](#mcp-tools) section
759 | - Update README.md
760 | - Update API integration section
761 |
762 | ### Submitting Changes
763 |
764 | 1. Fork the repository
765 | 2. Create a feature branch
766 | 3. Make changes
767 | 4. Run linting and type checking
768 | 5. Test with Claude Desktop or Cursor
769 | 6. Submit pull request
770 |
771 | ---
772 |
773 | ## License
774 |
775 | This project is distributed under the MIT License. See [LICENSE](../../LICENSE) file for details.
776 |
777 | ---
778 |
779 | ## Acknowledgments
780 |
781 | - **[tomekkorbak](https://github.com/tomekkorbak)** - For [oura-mcp-server](https://github.com/tomekkorbak/oura-mcp-server) implementation inspiration
782 | - **[Model Context Protocol](https://modelcontextprotocol.io/)** - For the MCP specification
783 | - **[Smithery](https://smithery.ai/)** - For MCP server distribution platform
784 | - **[ScrapeGraphAI Team](https://scrapegraphai.com)** - For the API and support
785 |
786 | ---
787 |
788 | **Made with ❤️ by [ScrapeGraphAI](https://scrapegraphai.com) Team**
789 |
```
--------------------------------------------------------------------------------
/src/scrapegraph_mcp/server.py:
--------------------------------------------------------------------------------
```python
1 | #!/usr/bin/env python3
2 | """
3 | MCP server for ScapeGraph API integration.
4 |
5 | This server exposes methods to use ScapeGraph's AI-powered web scraping services:
6 | - markdownify: Convert any webpage into clean, formatted markdown
7 | - smartscraper: Extract structured data from any webpage using AI
8 | - searchscraper: Perform AI-powered web searches with structured results
9 | - smartcrawler_initiate: Initiate intelligent multi-page web crawling with AI extraction or markdown conversion
10 | - smartcrawler_fetch_results: Retrieve results from asynchronous crawling operations
11 | - scrape: Fetch raw page content with optional JavaScript rendering
12 | - sitemap: Extract and discover complete website structure
13 | - agentic_scrapper: Execute complex multi-step web scraping workflows
14 |
15 | ## Parameter Validation and Error Handling
16 |
17 | All tools include comprehensive parameter validation with detailed error messages:
18 |
19 | ### Common Validation Rules:
20 | - URLs must include protocol (http:// or https://)
21 | - Numeric parameters must be within specified ranges
22 | - Mutually exclusive parameters cannot be used together
23 | - Required parameters must be provided
24 | - JSON schemas must be valid JSON format
25 |
26 | ### Error Response Format:
27 | All tools return errors in a consistent format:
28 | ```json
29 | {
30 | "error": "Detailed error message explaining the issue",
31 | "error_type": "ValidationError|HTTPError|TimeoutError|etc.",
32 | "parameter": "parameter_name_if_applicable",
33 | "valid_range": "acceptable_values_if_applicable"
34 | }
35 | ```
36 |
37 | ### Example Validation Errors:
38 | - Invalid URL: "website_url must include protocol (http:// or https://)"
39 | - Range violation: "number_of_scrolls must be between 0 and 50"
40 | - Mutual exclusion: "Cannot specify both website_url and website_html"
41 | - Missing required: "prompt is required when extraction_mode is 'ai'"
42 | - Invalid JSON: "output_schema must be valid JSON format"
43 |
44 | ### Best Practices for Error Handling:
45 | 1. Always check the 'error' field in responses
46 | 2. Use parameter validation before making requests
47 | 3. Implement retry logic for timeout errors
48 | 4. Handle rate limiting gracefully
49 | 5. Validate URLs before passing to tools
50 |
51 | For comprehensive parameter documentation, use the resource:
52 | `scrapegraph://parameters/reference`
53 | """
54 |
55 | import json
56 | import logging
57 | import os
58 | from typing import Any, Dict, Optional, List, Union, Annotated
59 |
60 | import httpx
61 | from fastmcp import Context, FastMCP
62 | from smithery.decorators import smithery
63 | from pydantic import BaseModel, Field, AliasChoices
64 |
65 | # Configure logging
66 | logging.basicConfig(
67 | level=logging.INFO,
68 | format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
69 | )
70 | logger = logging.getLogger(__name__)
71 |
72 |
73 | class ScapeGraphClient:
74 | """Client for interacting with the ScapeGraph API."""
75 |
76 | BASE_URL = "https://api.scrapegraphai.com/v1"
77 |
78 | def __init__(self, api_key: str):
79 | """
80 | Initialize the ScapeGraph API client.
81 |
82 | Args:
83 | api_key: API key for ScapeGraph API
84 | """
85 | self.api_key = api_key
86 | self.headers = {
87 | "SGAI-APIKEY": api_key,
88 | "Content-Type": "application/json"
89 | }
90 | self.client = httpx.Client(timeout=httpx.Timeout(120.0))
91 |
92 |
93 | def markdownify(self, website_url: str) -> Dict[str, Any]:
94 | """
95 | Convert a webpage into clean, formatted markdown.
96 |
97 | Args:
98 | website_url: URL of the webpage to convert
99 |
100 | Returns:
101 | Dictionary containing the markdown result
102 | """
103 | url = f"{self.BASE_URL}/markdownify"
104 | data = {
105 | "website_url": website_url
106 | }
107 |
108 | response = self.client.post(url, headers=self.headers, json=data)
109 |
110 | if response.status_code != 200:
111 | error_msg = f"Error {response.status_code}: {response.text}"
112 | raise Exception(error_msg)
113 |
114 | return response.json()
115 |
116 | def smartscraper(
117 | self,
118 | user_prompt: str,
119 | website_url: str = None,
120 | website_html: str = None,
121 | website_markdown: str = None,
122 | output_schema: Dict[str, Any] = None,
123 | number_of_scrolls: int = None,
124 | total_pages: int = None,
125 | render_heavy_js: bool = None,
126 | stealth: bool = None
127 | ) -> Dict[str, Any]:
128 | """
129 | Extract structured data from a webpage using AI.
130 |
131 | Args:
132 | user_prompt: Instructions for what data to extract
133 | website_url: URL of the webpage to scrape (mutually exclusive with website_html and website_markdown)
134 | website_html: HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB)
135 | website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB)
136 | output_schema: JSON schema defining expected output structure (optional)
137 | number_of_scrolls: Number of infinite scrolls to perform (0-50, default 0)
138 | total_pages: Number of pages to process for pagination (1-100, default 1)
139 | render_heavy_js: Enable heavy JavaScript rendering for dynamic pages (default false)
140 | stealth: Enable stealth mode to avoid bot detection (default false)
141 |
142 | Returns:
143 | Dictionary containing the extracted data
144 | """
145 | url = f"{self.BASE_URL}/smartscraper"
146 | data = {"user_prompt": user_prompt}
147 |
148 | # Add input source (mutually exclusive)
149 | if website_url is not None:
150 | data["website_url"] = website_url
151 | elif website_html is not None:
152 | data["website_html"] = website_html
153 | elif website_markdown is not None:
154 | data["website_markdown"] = website_markdown
155 | else:
156 | raise ValueError("Must provide one of: website_url, website_html, or website_markdown")
157 |
158 | # Add optional parameters
159 | if output_schema is not None:
160 | data["output_schema"] = output_schema
161 | if number_of_scrolls is not None:
162 | data["number_of_scrolls"] = number_of_scrolls
163 | if total_pages is not None:
164 | data["total_pages"] = total_pages
165 | if render_heavy_js is not None:
166 | data["render_heavy_js"] = render_heavy_js
167 | if stealth is not None:
168 | data["stealth"] = stealth
169 |
170 | response = self.client.post(url, headers=self.headers, json=data)
171 |
172 | if response.status_code != 200:
173 | error_msg = f"Error {response.status_code}: {response.text}"
174 | raise Exception(error_msg)
175 |
176 | return response.json()
177 |
178 | def searchscraper(self, user_prompt: str, num_results: int = None, number_of_scrolls: int = None) -> Dict[str, Any]:
179 | """
180 | Perform AI-powered web searches with structured results.
181 |
182 | Args:
183 | user_prompt: Search query or instructions
184 | num_results: Number of websites to search (optional, default: 3 websites = 30 credits)
185 | number_of_scrolls: Number of infinite scrolls to perform on each website (optional)
186 |
187 | Returns:
188 | Dictionary containing search results and reference URLs
189 | """
190 | url = f"{self.BASE_URL}/searchscraper"
191 | data = {
192 | "user_prompt": user_prompt
193 | }
194 |
195 | # Add num_results to the request if provided
196 | if num_results is not None:
197 | data["num_results"] = num_results
198 |
199 | # Add number_of_scrolls to the request if provided
200 | if number_of_scrolls is not None:
201 | data["number_of_scrolls"] = number_of_scrolls
202 |
203 | response = self.client.post(url, headers=self.headers, json=data)
204 |
205 | if response.status_code != 200:
206 | error_msg = f"Error {response.status_code}: {response.text}"
207 | raise Exception(error_msg)
208 |
209 | return response.json()
210 |
211 | def scrape(self, website_url: str, render_heavy_js: Optional[bool] = None) -> Dict[str, Any]:
212 | """
213 | Basic scrape endpoint to fetch page content.
214 |
215 | Args:
216 | website_url: URL to scrape
217 | render_heavy_js: Whether to render heavy JS (optional)
218 |
219 | Returns:
220 | Dictionary containing the scraped result
221 | """
222 | url = f"{self.BASE_URL}/scrape"
223 | payload: Dict[str, Any] = {"website_url": website_url}
224 | if render_heavy_js is not None:
225 | payload["render_heavy_js"] = render_heavy_js
226 |
227 | response = self.client.post(url, headers=self.headers, json=payload)
228 | response.raise_for_status()
229 | return response.json()
230 |
231 | def sitemap(self, website_url: str) -> Dict[str, Any]:
232 | """
233 | Extract sitemap for a given website.
234 |
235 | Args:
236 | website_url: Base website URL
237 |
238 | Returns:
239 | Dictionary containing sitemap URLs/structure
240 | """
241 | url = f"{self.BASE_URL}/sitemap"
242 | payload: Dict[str, Any] = {"website_url": website_url}
243 |
244 | response = self.client.post(url, headers=self.headers, json=payload)
245 | response.raise_for_status()
246 | return response.json()
247 |
248 | def agentic_scrapper(
249 | self,
250 | url: str,
251 | user_prompt: Optional[str] = None,
252 | output_schema: Optional[Dict[str, Any]] = None,
253 | steps: Optional[List[str]] = None,
254 | ai_extraction: Optional[bool] = None,
255 | persistent_session: Optional[bool] = None,
256 | timeout_seconds: Optional[float] = None,
257 | ) -> Dict[str, Any]:
258 | """
259 | Run the Agentic Scraper workflow (no live session/browser interaction).
260 |
261 | Args:
262 | url: Target website URL
263 | user_prompt: Instructions for what to do/extract (optional)
264 | output_schema: Desired structured output schema (optional)
265 | steps: High-level steps/instructions for the agent (optional)
266 | ai_extraction: Whether to enable AI extraction mode (optional)
267 | persistent_session: Whether to keep session alive between steps (optional)
268 | timeout_seconds: Per-request timeout override in seconds (optional)
269 | """
270 | endpoint = f"{self.BASE_URL}/agentic-scrapper"
271 | payload: Dict[str, Any] = {"url": url}
272 | if user_prompt is not None:
273 | payload["user_prompt"] = user_prompt
274 | if output_schema is not None:
275 | payload["output_schema"] = output_schema
276 | if steps is not None:
277 | payload["steps"] = steps
278 | if ai_extraction is not None:
279 | payload["ai_extraction"] = ai_extraction
280 | if persistent_session is not None:
281 | payload["persistent_session"] = persistent_session
282 |
283 | if timeout_seconds is not None:
284 | response = self.client.post(endpoint, headers=self.headers, json=payload, timeout=timeout_seconds)
285 | else:
286 | response = self.client.post(endpoint, headers=self.headers, json=payload)
287 | response.raise_for_status()
288 | return response.json()
289 |
290 | def smartcrawler_initiate(
291 | self,
292 | url: str,
293 | prompt: str = None,
294 | extraction_mode: str = "ai",
295 | depth: int = None,
296 | max_pages: int = None,
297 | same_domain_only: bool = None
298 | ) -> Dict[str, Any]:
299 | """
300 | Initiate a SmartCrawler request for multi-page web crawling.
301 |
302 | SmartCrawler supports two modes:
303 | - AI Extraction Mode (10 credits per page): Extracts structured data based on your prompt
304 | - Markdown Conversion Mode (2 credits per page): Converts pages to clean markdown
305 |
306 |         SmartCrawler processes the request asynchronously and returns a request ID.
307 |         Use smartcrawler_fetch_results to retrieve the results of the request.
308 |         Keep polling smartcrawler_fetch_results until the request is complete,
309 |         i.e. until the status is "completed".
310 |
311 | Args:
312 | url: Starting URL to crawl
313 | prompt: AI prompt for data extraction (required for AI mode)
314 | extraction_mode: "ai" for AI extraction or "markdown" for markdown conversion (default: "ai")
315 | depth: Maximum link traversal depth (optional)
316 | max_pages: Maximum number of pages to crawl (optional)
317 | same_domain_only: Whether to crawl only within the same domain (optional)
318 |
319 | Returns:
320 | Dictionary containing the request ID for async processing
321 | """
322 | endpoint = f"{self.BASE_URL}/crawl"
323 | data = {
324 | "url": url
325 | }
326 |
327 | # Handle extraction mode
328 | if extraction_mode == "markdown":
329 | data["markdown_only"] = True
330 | elif extraction_mode == "ai":
331 | if prompt is None:
332 | raise ValueError("prompt is required when extraction_mode is 'ai'")
333 | data["prompt"] = prompt
334 | else:
335 | raise ValueError(f"Invalid extraction_mode: {extraction_mode}. Must be 'ai' or 'markdown'")
336 | if depth is not None:
337 | data["depth"] = depth
338 | if max_pages is not None:
339 | data["max_pages"] = max_pages
340 | if same_domain_only is not None:
341 | data["same_domain_only"] = same_domain_only
342 |
343 | response = self.client.post(endpoint, headers=self.headers, json=data)
344 |
345 | if response.status_code != 200:
346 | error_msg = f"Error {response.status_code}: {response.text}"
347 | raise Exception(error_msg)
348 |
349 | return response.json()
350 |
351 | def smartcrawler_fetch_results(self, request_id: str) -> Dict[str, Any]:
352 | """
353 | Fetch the results of a SmartCrawler operation.
354 |
355 | Args:
356 | request_id: The request ID returned by smartcrawler_initiate
357 |
358 | Returns:
359 | Dictionary containing the crawled data (structured extraction or markdown)
360 | and metadata about processed pages
361 |
362 | Note:
363 |             Crawling takes time to complete; while the crawl is still running,
364 |             this endpoint returns the current status of the request.
365 |             Keep polling smartcrawler_fetch_results until the request is complete,
366 |             i.e. until the status is "completed", at which point the results
367 |             are included in the response.
368 | """
369 | endpoint = f"{self.BASE_URL}/crawl/{request_id}"
370 |
371 | response = self.client.get(endpoint, headers=self.headers)
372 |
373 | if response.status_code != 200:
374 | error_msg = f"Error {response.status_code}: {response.text}"
375 | raise Exception(error_msg)
376 |
377 | return response.json()
378 |
379 | def close(self) -> None:
380 | """Close the HTTP client."""
381 | self.client.close()
382 |
383 |
384 | # Pydantic configuration schema for Smithery
385 | class ConfigSchema(BaseModel):
386 | scrapegraph_api_key: Optional[str] = Field(
387 | default=None,
388 | description="Your Scrapegraph API key (optional - can also be set via SGAI_API_KEY environment variable)",
389 | # Accept both camelCase (from smithery.yaml) and snake_case (internal) for validation,
390 | # and serialize back to camelCase to match Smithery expectations.
391 | validation_alias=AliasChoices("scrapegraphApiKey", "scrapegraph_api_key"),
392 | serialization_alias="scrapegraphApiKey",
393 | )
394 |
395 |
396 | def get_api_key(ctx: Context) -> str:
397 | """
398 | Get the API key from config or environment variable.
399 |
400 | Args:
401 | ctx: FastMCP context
402 |
403 | Returns:
404 | API key string
405 |
406 | Raises:
407 | ValueError: If no API key is found
408 | """
409 | try:
410 | logger.info(f"Getting API key. Context type: {type(ctx)}")
411 | logger.info(f"Context has session_config: {hasattr(ctx, 'session_config')}")
412 |
413 | # Try to get from config first, but handle cases where session_config might be None
414 | api_key = None
415 | if hasattr(ctx, 'session_config') and ctx.session_config is not None:
416 | logger.info(f"Session config type: {type(ctx.session_config)}")
417 | api_key = getattr(ctx.session_config, 'scrapegraph_api_key', None)
418 | logger.info(f"API key from config: {'***' if api_key else 'None'}")
419 | else:
420 | logger.info("No session_config available or session_config is None")
421 |
422 | # If not in config, try environment variable
423 | if not api_key:
424 | api_key = os.getenv('SGAI_API_KEY')
425 | logger.info(f"API key from env: {'***' if api_key else 'None'}")
426 |
427 | # If still no API key found, raise error
428 | if not api_key:
429 | logger.error("No API key found in config or environment")
430 | raise ValueError(
431 | "ScapeGraph API key is required. Please provide it either:\n"
432 | "1. In the MCP server configuration as 'scrapegraph_api_key'\n"
433 | "2. As an environment variable 'SGAI_API_KEY'"
434 | )
435 |
436 | logger.info("API key successfully retrieved")
437 | return api_key
438 |
439 | except Exception as e:
440 | logger.warning(f"Error getting API key from context: {e}. Falling back to cached method.")
441 | # Fallback to cached method if context handling fails
442 | return get_cached_api_key()
443 |
444 |
445 | # Create MCP server instance
446 | mcp = FastMCP("ScapeGraph API MCP Server")
447 |
448 | # Global API key cache to handle session issues
449 | _api_key_cache: Optional[str] = None
450 |
451 | def get_cached_api_key() -> str:
452 | """Get API key from cache or environment, bypassing session config issues."""
453 | global _api_key_cache
454 |
455 | if _api_key_cache is None:
456 | _api_key_cache = os.getenv('SGAI_API_KEY')
457 | if _api_key_cache:
458 | logger.info("API key loaded from environment variable")
459 | else:
460 | logger.error("No API key found in environment variable SGAI_API_KEY")
461 | raise ValueError(
462 | "ScapeGraph API key is required. Please set the SGAI_API_KEY environment variable."
463 | )
464 |
465 | return _api_key_cache
466 |
467 |
468 | # Add prompts to help users interact with the server
469 | @mcp.prompt()
470 | def web_scraping_guide() -> str:
471 | """
472 | A comprehensive guide to using ScapeGraph's web scraping tools effectively.
473 |
474 | This prompt provides examples and best practices for each tool in the ScapeGraph MCP server.
475 | """
476 | return """# ScapeGraph Web Scraping Guide
477 |
478 | ## Available Tools Overview
479 |
480 | ### 1. **markdownify** - Convert webpages to clean markdown
481 | **Use case**: Get clean, readable content from any webpage
482 | **Example**:
483 | - Input: `https://docs.python.org/3/tutorial/`
484 | - Output: Clean markdown of the Python tutorial
485 |
486 | ### 2. **smartscraper** - AI-powered data extraction
487 | **Use case**: Extract specific structured data using natural language prompts
488 | **Examples**:
489 | - "Extract all product names and prices from this e-commerce page"
490 | - "Get contact information including email, phone, and address"
491 | - "Find all article titles, authors, and publication dates"
492 |
493 | ### 3. **searchscraper** - AI web search with extraction
494 | **Use case**: Search the web and extract structured information
495 | **Examples**:
496 | - "Find the latest AI research papers and their abstracts"
497 | - "Search for Python web scraping tutorials with ratings"
498 | - "Get current cryptocurrency prices and market caps"
499 |
500 | ### 4. **smartcrawler_initiate** - Multi-page intelligent crawling
501 | **Use case**: Crawl multiple pages with AI extraction or markdown conversion
502 | **Modes**:
503 | - AI Mode (10 credits/page): Extract structured data
504 | - Markdown Mode (2 credits/page): Convert to markdown
505 | **Example**: Crawl a documentation site to extract all API endpoints
506 |
507 | ### 5. **smartcrawler_fetch_results** - Get crawling results
508 | **Use case**: Retrieve results from initiated crawling operations
509 | **Note**: Keep polling until status is "completed"
510 |
511 | ### 6. **scrape** - Basic page content fetching
512 | **Use case**: Get raw page content with optional JavaScript rendering
513 | **Example**: Fetch content from dynamic pages that require JS
514 |
515 | ### 7. **sitemap** - Extract website structure
516 | **Use case**: Get all URLs and structure of a website
517 | **Example**: Map out a website's architecture before crawling
518 |
519 | ### 8. **agentic_scrapper** - AI-powered automated scraping
520 | **Use case**: Complex multi-step scraping with AI automation
521 | **Example**: Navigate through forms, click buttons, extract data
522 |
523 | ## Best Practices
524 |
525 | 1. **Start Simple**: Use `markdownify` or `scrape` for basic content
526 | 2. **Be Specific**: Provide detailed prompts for better AI extraction
527 | 3. **Use Crawling Wisely**: Set appropriate limits for `max_pages` and `depth`
528 | 4. **Monitor Credits**: AI extraction uses more credits than markdown conversion
529 | 5. **Handle Async**: Use `smartcrawler_fetch_results` to poll for completion
530 |
531 | ## Common Workflows
532 |
533 | ### Extract Product Information
534 | 1. Use `smartscraper` with prompt: "Extract product name, price, description, and availability"
535 | 2. For multiple pages: Use `smartcrawler_initiate` in AI mode
536 |
537 | ### Research and Analysis
538 | 1. Use `searchscraper` to find relevant pages
539 | 2. Use `smartscraper` on specific pages for detailed extraction
540 |
541 | ### Site Documentation
542 | 1. Use `sitemap` to discover all pages
543 | 2. Use `smartcrawler_initiate` in markdown mode to convert all pages
544 |
545 | ### Complex Navigation
546 | 1. Use `agentic_scrapper` for sites requiring interaction
547 | 2. Provide step-by-step instructions in the `steps` parameter
548 | """
549 |
550 |
551 | @mcp.prompt()
552 | def quick_start_examples() -> str:
553 | """
554 | Quick start examples for common ScapeGraph use cases.
555 |
556 | Ready-to-use examples for immediate productivity.
557 | """
558 | return """# ScapeGraph Quick Start Examples
559 |
560 | ## 🚀 Ready-to-Use Examples
561 |
562 | ### Extract E-commerce Product Data
563 | ```
564 | Tool: smartscraper
565 | URL: https://example-shop.com/products/laptop
566 | Prompt: "Extract product name, price, specifications, customer rating, and availability status"
567 | ```
568 |
569 | ### Convert Documentation to Markdown
570 | ```
571 | Tool: markdownify
572 | URL: https://docs.example.com/api-reference
573 | ```
574 |
575 | ### Research Latest News
576 | ```
577 | Tool: searchscraper
578 | Prompt: "Find latest news about artificial intelligence breakthroughs in 2024"
579 | num_results: 5
580 | ```
581 |
582 | ### Crawl Entire Blog for Articles
583 | ```
584 | Tool: smartcrawler_initiate
585 | URL: https://blog.example.com
586 | Prompt: "Extract article title, author, publication date, and summary"
587 | extraction_mode: "ai"
588 | max_pages: 20
589 | ```
590 |
591 | ### Get Website Structure
592 | ```
593 | Tool: sitemap
594 | URL: https://example.com
595 | ```
596 |
597 | ### Extract Contact Information
598 | ```
599 | Tool: smartscraper
600 | URL: https://company.example.com/contact
601 | Prompt: "Find all contact methods: email addresses, phone numbers, physical address, and social media links"
602 | ```
603 |
604 | ### Automated Form Navigation
605 | ```
606 | Tool: agentic_scrapper
607 | URL: https://example.com/search
608 | user_prompt: "Navigate to the search page, enter 'web scraping tools', and extract the top 5 results"
609 | steps: ["Find search box", "Enter search term", "Submit form", "Extract results"]
610 | ```
611 |
612 | ## 💡 Pro Tips
613 |
614 | 1. **For Dynamic Content**: Use `render_heavy_js: true` with the `scrape` tool
615 | 2. **For Large Sites**: Start with `sitemap` to understand structure
616 | 3. **For Async Operations**: Always poll `smartcrawler_fetch_results` until complete
617 | 4. **For Complex Sites**: Use `agentic_scrapper` with detailed step instructions
618 | 5. **For Cost Efficiency**: Use markdown mode for content conversion, AI mode for data extraction
619 |
620 | ## 🔧 Configuration
621 |
622 | Set your API key via:
623 | - Environment variable: `SGAI_API_KEY=your_key_here`
624 | - MCP configuration: `scrapegraph_api_key: "your_key_here"`
625 |
626 | No extra configuration is required when the `SGAI_API_KEY` environment variable is set!
627 | """
628 |
629 |
630 | # Add resources to expose server capabilities and data
631 | @mcp.resource("scrapegraph://api/status")
632 | def api_status() -> str:
633 | """
634 | Current status and capabilities of the ScapeGraph API server.
635 |
636 | Provides real-time information about available tools, credit usage, and server health.
637 | """
638 | return """# ScapeGraph API Status
639 |
640 | ## Server Information
641 | - **Status**: ✅ Online and Ready
642 | - **Version**: 1.0.0
643 | - **Base URL**: https://api.scrapegraphai.com/v1
644 |
645 | ## Available Tools
646 | 1. **markdownify** - Convert webpages to markdown (2 credits/page)
647 | 2. **smartscraper** - AI data extraction (10 credits/page)
648 | 3. **searchscraper** - AI web search (30 credits for 3 websites)
649 | 4. **smartcrawler** - Multi-page crawling (2-10 credits/page)
650 | 5. **scrape** - Basic page fetching (1 credit/page)
651 | 6. **sitemap** - Website structure extraction (1 credit)
652 | 7. **agentic_scrapper** - AI automation (variable credits)
653 |
654 | ## Credit Costs
655 | - **Markdown Conversion**: 2 credits per page
656 | - **AI Extraction**: 10 credits per page
657 | - **Web Search**: 10 credits per website (default 3 websites)
658 | - **Basic Scraping**: 1 credit per page
659 | - **Sitemap**: 1 credit per request
660 |
661 | ## Configuration
662 | - **API Key**: Required (set via SGAI_API_KEY env var or config)
663 | - **Timeout**: 120 seconds default (configurable)
664 | - **Rate Limits**: Applied per API key
665 |
666 | ## Best Practices
667 | - Use markdown mode for content conversion (cheaper)
668 | - Use AI mode for structured data extraction
669 | - Set appropriate limits for crawling operations
670 | - Monitor credit usage for cost optimization
671 |
673 | """
674 |
675 |
676 | @mcp.resource("scrapegraph://examples/use-cases")
677 | def common_use_cases() -> str:
678 | """
679 | Common use cases and example implementations for ScapeGraph tools.
680 |
681 | Real-world examples with expected inputs and outputs.
682 | """
683 | return """# ScapeGraph Common Use Cases
684 |
685 | ## 🛍️ E-commerce Data Extraction
686 |
687 | ### Product Information Scraping
688 | **Tool**: smartscraper
689 | **Input**: Product page URL + "Extract name, price, description, rating, availability"
690 | **Output**: Structured JSON with product details
691 | **Credits**: 10 per page
692 |
693 | ### Price Monitoring
694 | **Tool**: smartcrawler_initiate (AI mode)
695 | **Input**: Product category page + price extraction prompt
696 | **Output**: Structured price data across multiple products
697 | **Credits**: 10 per page crawled
698 |
699 | ## 📰 Content & Research
700 |
701 | ### News Article Extraction
702 | **Tool**: searchscraper
703 | **Input**: "Latest news about [topic]" + num_results
704 | **Output**: Article titles, summaries, sources, dates
705 | **Credits**: 10 per website searched
706 |
707 | ### Documentation Conversion
708 | **Tool**: smartcrawler_initiate (markdown mode)
709 | **Input**: Documentation site root URL
710 | **Output**: Clean markdown files for all pages
711 | **Credits**: 2 per page converted
712 |
713 | ## 🏢 Business Intelligence
714 |
715 | ### Contact Information Gathering
716 | **Tool**: smartscraper
717 | **Input**: Company website + "Find contact details"
718 | **Output**: Emails, phones, addresses, social media
719 | **Credits**: 10 per page
720 |
721 | ### Competitor Analysis
722 | **Tool**: searchscraper + smartscraper combination
723 | **Input**: Search for competitors + extract key metrics
724 | **Output**: Structured competitive intelligence
725 | **Credits**: Variable based on pages analyzed
726 |
727 | ## 🔍 Research & Analysis
728 |
729 | ### Academic Paper Research
730 | **Tool**: searchscraper
731 | **Input**: Research query + academic site focus
732 | **Output**: Paper titles, abstracts, authors, citations
733 | **Credits**: 10 per source website
734 |
735 | ### Market Research
736 | **Tool**: smartcrawler_initiate
737 | **Input**: Industry website + data extraction prompts
738 | **Output**: Market trends, statistics, insights
739 | **Credits**: 10 per page (AI mode)
740 |
741 | ## 🤖 Automation Workflows
742 |
743 | ### Form-based Data Collection
744 | **Tool**: agentic_scrapper
745 | **Input**: Site URL + navigation steps + extraction goals
746 | **Output**: Data collected through automated interaction
747 | **Credits**: Variable based on complexity
748 |
749 | ### Multi-step Research Process
750 | **Workflow**: sitemap → smartcrawler_initiate → smartscraper
751 | **Input**: Target site + research objectives
752 | **Output**: Comprehensive site analysis and data extraction
753 | **Credits**: Cumulative based on tools used
754 |
755 | ## 💡 Optimization Tips
756 |
757 | 1. **Start with sitemap** to understand site structure
758 | 2. **Use markdown mode** for content archival (cheaper)
759 | 3. **Use AI mode** for structured data extraction
760 | 4. **Batch similar requests** to optimize credit usage
761 | 5. **Set appropriate crawl limits** to control costs
762 | 6. **Use specific prompts** for better AI extraction accuracy
763 |
764 | ## 📊 Expected Response Times
765 |
766 | - **Simple scraping**: 5-15 seconds
767 | - **AI extraction**: 15-45 seconds per page
768 | - **Crawling operations**: 1-5 minutes (async)
769 | - **Search operations**: 30-90 seconds
770 | - **Agentic workflows**: 2-10 minutes
771 |
772 | ## 🚨 Common Pitfalls
773 |
774 | - Not setting crawl limits (unexpected credit usage)
775 | - Vague extraction prompts (poor AI results)
776 | - Not polling async operations (missing results)
777 | - Ignoring rate limits (request failures)
778 | - Not handling JavaScript-heavy sites (incomplete data)
779 | """
780 |
781 |
782 | @mcp.resource("scrapegraph://parameters/reference")
783 | def parameter_reference_guide() -> str:
784 | """
785 | Comprehensive parameter reference guide for all ScapeGraph MCP tools.
786 |
787 | Complete documentation of every parameter with examples, constraints, and best practices.
788 | """
789 | return """# ScapeGraph MCP Parameter Reference Guide
790 |
791 | ## 📋 Complete Parameter Documentation
792 |
793 | This guide provides comprehensive documentation for every parameter across all ScapeGraph MCP tools. Use this as your definitive reference for understanding parameter behavior, constraints, and best practices.
794 |
795 | ---
796 |
797 | ## 🔧 Common Parameters
798 |
799 | ### URL Parameters
800 | **Used in**: markdownify, smartscraper, searchscraper, smartcrawler_initiate, scrape, sitemap, agentic_scrapper
801 |
802 | #### `website_url` / `url`
803 | - **Type**: `str` (required)
804 | - **Format**: Must include protocol (http:// or https://)
805 | - **Examples**:
806 | - ✅ `https://example.com/page`
807 | - ✅ `https://docs.python.org/3/tutorial/`
808 | - ❌ `example.com` (missing protocol)
809 | - ❌ `ftp://example.com` (unsupported protocol)
810 | - **Best Practices**:
811 | - Always include the full URL with protocol
812 | - Ensure the URL is publicly accessible
813 | - Test URLs manually before automation
814 |
815 | ---
816 |
817 | ## 🤖 AI and Extraction Parameters
818 |
819 | ### `user_prompt`
820 | **Used in**: smartscraper, searchscraper, agentic_scrapper
821 |
822 | - **Type**: `str` (required)
823 | - **Purpose**: Natural language instructions for AI extraction
824 | - **Examples**:
825 | - `"Extract product name, price, description, and availability"`
826 | - `"Find contact information: email, phone, address"`
827 | - `"Get article title, author, publication date, summary"`
828 | - **Best Practices**:
829 | - Be specific about desired fields
830 | - Mention data types (numbers, dates, URLs)
831 | - Include context about data location
832 | - Use clear, descriptive language
833 |
834 | ### `output_schema`
835 | **Used in**: smartscraper, agentic_scrapper
836 |
837 | - **Type**: `Optional[Union[str, Dict[str, Any]]]`
838 | - **Purpose**: Define expected output structure
839 | - **Formats**:
840 | - Dictionary: `{'type': 'object', 'properties': {'title': {'type': 'string'}}}`
841 | - JSON string: `'{"type": "object", "properties": {"name": {"type": "string"}}}'`
842 | - **Examples**:
843 | ```json
844 | {
845 | "type": "object",
846 | "properties": {
847 | "products": {
848 | "type": "array",
849 | "items": {
850 | "type": "object",
851 | "properties": {
852 | "name": {"type": "string"},
853 | "price": {"type": "number"},
854 | "available": {"type": "boolean"}
855 | }
856 | }
857 | }
858 | }
859 | }
860 | ```
861 | - **Best Practices**:
862 | - Use for complex, structured extractions
863 | - Define clear data types
864 | - Consider nested structures for complex data
865 |
866 | ---
867 |
868 | ## 🌐 Content Source Parameters
869 |
870 | ### `website_html`
871 | **Used in**: smartscraper
872 |
873 | - **Type**: `Optional[str]`
874 | - **Purpose**: Process local HTML content
875 | - **Constraints**: Maximum 2MB
876 | - **Use Cases**:
877 | - Pre-fetched HTML content
878 | - Generated HTML from other sources
879 | - Offline HTML processing
880 | - **Mutually Exclusive**: Cannot use with `website_url` or `website_markdown`
881 |
882 | ### `website_markdown`
883 | **Used in**: smartscraper
884 |
885 | - **Type**: `Optional[str]`
886 | - **Purpose**: Process local markdown content
887 | - **Constraints**: Maximum 2MB
888 | - **Use Cases**:
889 | - Documentation processing
890 | - README file analysis
891 | - Converted web content
892 | - **Mutually Exclusive**: Cannot use with `website_url` or `website_html`
893 |
894 | ---
895 |
896 | ## 📄 Pagination and Scrolling Parameters
897 |
898 | ### `number_of_scrolls`
899 | **Used in**: smartscraper, searchscraper
900 |
901 | - **Type**: `Optional[int]`
902 | - **Range**: 0-50 scrolls
903 | - **Default**: 0 (no scrolling)
904 | - **Purpose**: Handle dynamically loaded content
905 | - **Examples**:
906 | - `0`: Static content, no scrolling needed
907 | - `3`: Social media feeds, product listings
908 | - `10`: Long articles, extensive catalogs
909 | - **Performance Impact**: +5-10 seconds per scroll
910 | - **Best Practices**:
911 | - Start with 0 and increase if content seems incomplete
912 | - Use sparingly to control processing time
913 | - Consider site loading behavior
914 |
915 | ### `total_pages`
916 | **Used in**: smartscraper
917 |
918 | - **Type**: `Optional[int]`
919 | - **Range**: 1-100 pages
920 | - **Default**: 1 (single page)
921 | - **Purpose**: Process paginated content
922 | - **Cost Impact**: 10 credits × pages
923 | - **Examples**:
924 | - `1`: Single page extraction
925 | - `5`: First 5 pages of results
926 | - `20`: Comprehensive pagination
927 | - **Best Practices**:
928 | - Set reasonable limits to control costs
929 | - Consider total credit usage
930 | - Test with small numbers first
931 |
932 | ---
933 |
934 | ## 🚀 Performance Parameters
935 |
936 | ### `render_heavy_js`
937 | **Used in**: smartscraper, scrape
938 |
939 | - **Type**: `Optional[bool]`
940 | - **Default**: `false`
941 | - **Purpose**: Enable JavaScript rendering for SPAs
942 | - **When to Use `true`**:
943 | - React/Angular/Vue applications
944 | - Dynamic content loading
945 | - AJAX-heavy interfaces
946 | - Content appearing after page load
947 | - **When to Use `false`**:
948 | - Static websites
949 | - Server-side rendered content
950 | - Traditional HTML pages
951 | - When speed is priority
952 | - **Performance Impact**:
953 | - `false`: 2-5 seconds
954 | - `true`: 15-30 seconds
955 | - **Cost**: Same regardless of setting
956 |
957 | ### `stealth`
958 | **Used in**: smartscraper
959 |
960 | - **Type**: `Optional[bool]`
961 | - **Default**: `false`
962 | - **Purpose**: Bypass basic bot detection
963 | - **When to Use**:
964 | - Sites with anti-scraping measures
965 | - E-commerce sites with protection
966 | - Sites requiring "human-like" behavior
967 | - **Limitations**:
968 | - Not 100% guaranteed
969 | - May increase processing time
970 | - Some advanced detection may still work
971 |
972 | ---
973 |
974 | ## 🔄 Crawling Parameters
975 |
976 | ### `prompt`
977 | **Used in**: smartcrawler_initiate
978 |
979 | - **Type**: `Optional[str]`
980 | - **Required**: When `extraction_mode="ai"`
981 | - **Purpose**: AI extraction instructions for all crawled pages
982 | - **Examples**:
983 | - `"Extract API endpoint name, method, parameters"`
984 | - `"Get article title, author, publication date"`
985 | - **Best Practices**:
986 | - Use general terms that apply across page types
987 | - Consider varying page structures
988 | - Be specific about desired fields
989 |
990 | ### `extraction_mode`
991 | **Used in**: smartcrawler_initiate
992 |
993 | - **Type**: `str`
994 | - **Default**: `"ai"`
995 | - **Options**:
996 | - `"ai"`: AI-powered extraction (10 credits/page)
997 | - `"markdown"`: Markdown conversion (2 credits/page)
998 | - **Cost Comparison**:
999 | - AI mode: 50 pages = 500 credits
1000 | - Markdown mode: 50 pages = 100 credits
1001 | - **Use Cases**:
1002 | - AI: Data collection, research, analysis
1003 | - Markdown: Content archival, documentation backup
1004 |
1005 | ### `depth`
1006 | **Used in**: smartcrawler_initiate
1007 |
1008 | - **Type**: `Optional[int]`
1009 | - **Default**: Unlimited
1010 | - **Purpose**: Control link traversal depth
1011 | - **Levels**:
1012 | - `0`: Only starting URL
1013 | - `1`: Starting URL + direct links
1014 | - `2`: Two levels of link following
1015 | - `3+`: Deeper traversal
1016 | - **Considerations**:
1017 | - Higher depth = exponential growth
1018 | - Use with `max_pages` for control
1019 | - Consider site structure
1020 |
1021 | ### `max_pages`
1022 | **Used in**: smartcrawler_initiate
1023 |
1024 | - **Type**: `Optional[int]`
1025 | - **Default**: Unlimited
1026 | - **Purpose**: Limit total pages crawled
1027 | - **Recommended Ranges**:
1028 | - `10-20`: Testing, small sites
1029 | - `50-100`: Medium sites
1030 | - `200-500`: Large sites
1031 | - `1000+`: Enterprise crawling
1032 | - **Cost Calculation**:
1033 | - AI mode: `max_pages × 10` credits
1034 | - Markdown mode: `max_pages × 2` credits
1035 |
1036 | ### `same_domain_only`
1037 | **Used in**: smartcrawler_initiate
1038 |
1039 | - **Type**: `Optional[bool]`
1040 | - **Default**: `true`
1041 | - **Purpose**: Control cross-domain crawling
1042 | - **Options**:
1043 | - `true`: Stay within same domain (recommended)
1044 | - `false`: Allow external domains (use with caution)
1045 | - **Best Practices**:
1046 | - Use `true` for focused crawling
1047 | - Set `max_pages` when using `false`
1048 | - Consider crawling scope carefully
1049 |
1050 | ---
1051 |
1052 | ## 🔄 Search Parameters
1053 |
1054 | ### `num_results`
1055 | **Used in**: searchscraper
1056 |
1057 | - **Type**: `Optional[int]`
1058 | - **Default**: 3 websites
1059 | - **Range**: 1-20 (recommended ≤10)
1060 | - **Cost**: `num_results × 10` credits
1061 | - **Examples**:
1062 | - `1`: Quick lookup (10 credits)
1063 | - `3`: Standard research (30 credits)
1064 | - `5`: Comprehensive (50 credits)
1065 | - `10`: Extensive analysis (100 credits)
1066 |
1067 | ---
1068 |
1069 | ## 🤖 Agentic Automation Parameters
1070 |
1071 | ### `steps`
1072 | **Used in**: agentic_scrapper
1073 |
1074 | - **Type**: `Optional[Union[str, List[str]]]`
1075 | - **Purpose**: Sequential workflow instructions
1076 | - **Formats**:
1077 | - List: `['Click search', 'Enter term', 'Extract results']`
1078 | - JSON string: `'["Step 1", "Step 2", "Step 3"]'`
1079 | - **Best Practices**:
1080 | - Break complex actions into simple steps
1081 | - Be specific about UI elements
1082 | - Include wait/loading steps
1083 | - Order logically
1084 |
1085 | ### `ai_extraction`
1086 | **Used in**: agentic_scrapper
1087 |
1088 | - **Type**: `Optional[bool]`
1089 | - **Default**: `true`
1090 | - **Purpose**: Control extraction intelligence
1091 | - **Options**:
1092 | - `true`: Advanced AI extraction (recommended)
1093 | - `false`: Simpler, faster extraction
1094 | - **Trade-offs**:
1095 | - `true`: Better accuracy, slower processing
1096 | - `false`: Faster execution, less accurate
1097 |
1098 | ### `persistent_session`
1099 | **Used in**: agentic_scrapper
1100 |
1101 | - **Type**: `Optional[bool]`
1102 | - **Default**: `false`
1103 | - **Purpose**: Maintain session state between steps
1104 | - **When to Use `true`**:
1105 | - Login flows
1106 | - Shopping cart processes
1107 | - Form wizards with dependencies
1108 | - **When to Use `false`**:
1109 | - Simple data extraction
1110 | - Independent actions
1111 | - Public content scraping
1112 |
1113 | ### `timeout_seconds`
1114 | **Used in**: agentic_scrapper
1115 |
1116 | - **Type**: `Optional[float]`
1117 | - **Default**: 120.0 (2 minutes)
1118 | - **Recommended Ranges**:
1119 | - `60-120`: Simple workflows (2-5 steps)
1120 | - `180-300`: Medium complexity (5-10 steps)
1121 | - `300-600`: Complex workflows (10+ steps)
1122 | - `600+`: Very complex workflows
1123 | - **Considerations**:
1124 | - Include page load times
1125 | - Factor in network latency
1126 | - Allow for AI processing time
1127 |
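A representative set of arguments for a medium-complexity workflow (a sketch; the site, steps, and values are illustrative assumptions):

```python
# Illustrative agentic_scrapper arguments (site and steps are hypothetical):
agentic_args = {
    "url": "https://example.com/search",
    "user_prompt": "Search for 'laptops' and extract the top 5 results with prices",
    "steps": [
        "Type 'laptops' into the search box",
        "Press Enter and wait for the results to load",
        "Extract name and price for the first 5 results",
    ],
    "ai_extraction": True,
    "persistent_session": False,   # no login or cart state needed
    "timeout_seconds": 180.0,      # medium-complexity range from above
}
```
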
1128 | ---
1129 |
1130 | ## 💰 Credit Cost Summary
1131 |
1132 | | Tool | Base Cost | Additional Costs |
1133 | |------|-----------|------------------|
1134 | | `markdownify` | 2 credits | None |
1135 | | `smartscraper` | 10 credits | +10 per additional page |
1136 | | `searchscraper` | 30 credits (3 sites) | +10 per additional site |
1137 | | `smartcrawler` | 2-10 credits/page | Depends on extraction mode |
1138 | | `scrape` | 1 credit | None |
1139 | | `sitemap` | 1 credit | None |
1140 | | `agentic_scrapper` | Variable | Based on complexity |
1141 |
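For planning purposes, a rough estimator based on the rates above (an illustrative sketch, not an official API):

```python
# Rough credit estimator using the rates from the table above (illustrative only).
def estimate_credits(tool: str, pages: int = 1, extraction_mode: str = "ai") -> int:
    flat_rates = {"markdownify": 2, "smartscraper": 10, "scrape": 1, "sitemap": 1}
    if tool == "smartcrawler":
        return pages * (10 if extraction_mode == "ai" else 2)
    if tool == "searchscraper":
        return pages * 10          # "pages" = number of websites searched
    return flat_rates.get(tool, 0) * pages

print(estimate_credits("smartcrawler", pages=50, extraction_mode="markdown"))  # 100
print(estimate_credits("searchscraper", pages=3))                              # 30
```
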
1142 | ---
1143 |
1144 | ## ⚠️ Common Parameter Mistakes
1145 |
1146 | ### URL Formatting
1147 | - ❌ `example.com` → ✅ `https://example.com`
1148 | - ❌ `ftp://site.com` → ✅ `https://site.com`
1149 |
1150 | ### Mutually Exclusive Parameters
1151 | - ❌ Setting both `website_url` and `website_html`
1152 | - ✅ Choose one input source only
1153 |
1154 | ### Range Violations
1155 | - ❌ `number_of_scrolls: 100` → ✅ `number_of_scrolls: 10`
1156 | - ❌ `total_pages: 1000` → ✅ `total_pages: 100`
1157 |
1158 | ### JSON Schema Errors
1159 | - ❌ Invalid JSON string format
1160 | - ✅ Valid JSON or dictionary format
1161 |
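For example, both of these forms are accepted for `output_schema` (a small illustrative sketch):

```python
# Equivalent schema definitions: a Python dict or a JSON string.
schema_as_dict = {"type": "object", "properties": {"title": {"type": "string"}}}
schema_as_json = '{"type": "object", "properties": {"title": {"type": "string"}}}'
```
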
1162 | ### Timeout Issues
1163 | - ❌ `timeout_seconds: 30` for complex workflows
1164 | - ✅ `timeout_seconds: 300` for complex workflows
1165 |
1166 | ---
1167 |
1168 | ## 🎯 Parameter Selection Guide
1169 |
1170 | ### For Simple Content Extraction
1171 | ```
1172 | Tool: markdownify or smartscraper
1173 | Parameters: website_url, user_prompt (if smartscraper)
1174 | ```
1175 |
1176 | ### For Dynamic Content
1177 | ```
1178 | Tool: smartscraper or scrape
1179 | Parameters: render_heavy_js=true, stealth=true (if needed)
1180 | ```
1181 |
1182 | ### For Multi-Page Content
1183 | ```
1184 | Tool: smartcrawler_initiate
1185 | Parameters: max_pages, depth, extraction_mode
1186 | ```
1187 |
1188 | ### For Research Tasks
1189 | ```
1190 | Tool: searchscraper
1191 | Parameters: num_results, user_prompt
1192 | ```
1193 |
1194 | ### For Complex Automation
1195 | ```
1196 | Tool: agentic_scrapper
1197 | Parameters: steps, persistent_session, timeout_seconds
1198 | ```
1199 |
1200 | ---
1201 |
1202 | ## 📚 Additional Resources
1203 |
1204 | - **Tool Comparison**: Use `scrapegraph://tools/comparison` resource
1205 | - **Use Cases**: Check `scrapegraph://examples/use-cases` resource
1206 | - **API Status**: Monitor `scrapegraph://api/status` resource
1207 | - **Quick Examples**: See prompt `quick_start_examples`
1208 |
1209 | ---
1210 |
1211 | *Last Updated: November 2024*
1212 | *For the most current parameter information, refer to individual tool documentation.*
1213 | """
1214 |
1215 |
1216 | @mcp.resource("scrapegraph://tools/comparison")
1217 | def tool_comparison_guide() -> str:
1218 | """
1219 |     Detailed comparison of ScrapeGraph tools to help choose the right tool for each task.
1220 |
1221 | Decision matrix and feature comparison across all available tools.
1222 | """
1223 | return """# ScapeGraph Tools Comparison Guide
1224 |
1225 | ## 🎯 Quick Decision Matrix
1226 |
1227 | | Need | Recommended Tool | Alternative | Credits |
1228 | |------|------------------|-------------|---------|
1229 | | Convert page to markdown | `markdownify` | `scrape` + manual | 2 |
1230 | | Extract specific data | `smartscraper` | `agentic_scrapper` | 10 |
1231 | | Search web for info | `searchscraper` | Multiple `smartscraper` | 30 |
1232 | | Crawl multiple pages | `smartcrawler_initiate` | Loop `smartscraper` | 2-10/page |
1233 | | Get raw page content | `scrape` | `markdownify` | 1 |
1234 | | Map site structure | `sitemap` | Manual discovery | 1 |
1235 | | Complex automation | `agentic_scrapper` | Custom scripting | Variable |
1236 |
1237 | ## 🔍 Detailed Tool Comparison
1238 |
1239 | ### Content Extraction Tools
1240 |
1241 | #### markdownify vs scrape
1242 | - **markdownify**: Clean, formatted markdown output
1243 | - **scrape**: Raw HTML with optional JS rendering
1244 | - **Use markdownify when**: You need readable content
1245 | - **Use scrape when**: You need full HTML or custom parsing
1246 |
1247 | #### smartscraper vs agentic_scrapper
1248 | - **smartscraper**: Single-page AI extraction
1249 | - **agentic_scrapper**: Multi-step automated workflows
1250 | - **Use smartscraper when**: Simple data extraction from one page
1251 | - **Use agentic_scrapper when**: Complex navigation required
1252 |
1253 | ### Scale & Automation
1254 |
1255 | #### Single Page Tools
1256 | - `markdownify`, `smartscraper`, `scrape`, `sitemap`
1257 | - **Pros**: Fast, predictable costs, simple
1258 | - **Cons**: Manual iteration for multiple pages
1259 |
1260 | #### Multi-Page Tools
1261 | - `smartcrawler_initiate`, `searchscraper`, `agentic_scrapper`
1262 | - **Pros**: Automated scale, comprehensive results
1263 | - **Cons**: Higher costs, longer processing times
1264 |
1265 | ### Cost Optimization
1266 |
1267 | #### Low Cost (1-2 credits)
1268 | - `scrape`: Basic page fetching
1269 | - `markdownify`: Content conversion
1270 | - `sitemap`: Site structure
1271 |
1272 | #### Medium Cost (10 credits)
1273 | - `smartscraper`: AI data extraction
1274 | - `searchscraper`: Per website searched
1275 |
1276 | #### Variable Cost
1277 | - `smartcrawler_initiate`: 2-10 credits per page
1278 | - `agentic_scrapper`: Depends on complexity
1279 |
1280 | ## 🚀 Performance Characteristics
1281 |
1282 | ### Speed (Typical Response Times)
1283 | 1. **scrape**: 2-5 seconds
1284 | 2. **sitemap**: 3-8 seconds
1285 | 3. **markdownify**: 5-15 seconds
1286 | 4. **smartscraper**: 15-45 seconds
1287 | 5. **searchscraper**: 30-90 seconds
1288 | 6. **smartcrawler**: 1-5 minutes (async)
1289 | 7. **agentic_scrapper**: 2-10 minutes
1290 |
1291 | ### Reliability
1292 | - **Highest**: `scrape`, `sitemap`, `markdownify`
1293 | - **High**: `smartscraper`, `searchscraper`
1294 | - **Variable**: `smartcrawler`, `agentic_scrapper` (depends on site complexity)
1295 |
1296 | ## 🎨 Output Format Comparison
1297 |
1298 | ### Structured Data
1299 | - **smartscraper**: JSON with extracted fields
1300 | - **searchscraper**: JSON with search results
1301 | - **agentic_scrapper**: Custom schema support
1302 |
1303 | ### Content Formats
1304 | - **markdownify**: Clean markdown text
1305 | - **scrape**: Raw HTML
1306 | - **sitemap**: URL list/structure
1307 |
1308 | ### Async Operations
1309 | - **smartcrawler_initiate**: Returns request ID
1310 | - **smartcrawler_fetch_results**: Returns final data
1311 | - All others: Immediate response
1312 |
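Below is a hedged sketch of the initiate-then-poll pattern using the module's `ScapeGraphClient` directly (the MCP tools wrap these same calls); the import path, API key placeholder, and 10-second polling interval are illustrative assumptions, and field names follow the documentation above:

```python
import time

# Import path assumed from this repository's package layout.
from scrapegraph_mcp.server import ScapeGraphClient

client = ScapeGraphClient("YOUR_SGAI_API_KEY")   # placeholder key

job = client.smartcrawler_initiate(
    url="https://docs.example.com",
    prompt="Extract page title and summary",
    extraction_mode="ai",
    depth=2,
    max_pages=20,
    same_domain_only=True,
)

# Poll until the crawl reports completion (response fields as documented above).
result = client.smartcrawler_fetch_results(job["request_id"])
while result.get("status") != "completed":
    time.sleep(10)
    result = client.smartcrawler_fetch_results(job["request_id"])
print(result.get("results"))
```
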
1313 | ## 🛠️ Integration Patterns
1314 |
1315 | ### Simple Workflows
1316 | ```
1317 | URL → markdownify → Markdown content
1318 | URL → smartscraper → Structured data
1319 | Query → searchscraper → Research results
1320 | ```
1321 |
1322 | ### Complex Workflows
1323 | ```
1324 | URL → sitemap → smartcrawler_initiate → smartcrawler_fetch_results
1325 | URL → agentic_scrapper (with steps) → Complex extracted data
1326 | Query → searchscraper → smartscraper (on results) → Detailed analysis
1327 | ```
1328 |
1329 | ### Hybrid Approaches
1330 | ```
1331 | URL → scrape (check if JS needed) → smartscraper (extract data)
1332 | URL → sitemap (map structure) → smartcrawler (batch process)
1333 | ```
1334 |
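A hedged sketch of the "scrape first, then extract" hybrid pattern, again using the module's `ScapeGraphClient`; the 2,000-character threshold, the `html_content` field check, and the URL are illustrative assumptions:

```python
# Import path assumed from this repository's package layout.
from scrapegraph_mcp.server import ScapeGraphClient

client = ScapeGraphClient("YOUR_SGAI_API_KEY")   # placeholder key

url = "https://app.example.com/dashboard"
page = client.scrape(website_url=url, render_heavy_js=False)

# Heuristic: a near-empty HTML shell usually means client-side rendering.
needs_js = len(page.get("html_content", "")) < 2000

data = client.smartscraper(
    user_prompt="Extract the dashboard widget names and their values",
    website_url=url,
    website_html=None,
    website_markdown=None,
    output_schema=None,
    number_of_scrolls=None,
    total_pages=None,
    render_heavy_js=needs_js,
    stealth=None,
)
```
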
1335 | ## 📋 Selection Checklist
1336 |
1337 | **Choose markdownify when:**
1338 | - ✅ Need readable content format
1339 | - ✅ Converting documentation/articles
1340 | - ✅ Cost is a primary concern
1341 |
1342 | **Choose smartscraper when:**
1343 | - ✅ Need specific data extracted
1344 | - ✅ Working with single pages
1345 | - ✅ Want AI-powered extraction
1346 |
1347 | **Choose searchscraper when:**
1348 | - ✅ Need to find information across web
1349 | - ✅ Research-oriented tasks
1350 | - ✅ Don't have specific URLs
1351 |
1352 | **Choose smartcrawler when:**
1353 | - ✅ Need to process multiple pages
1354 | - ✅ Can wait for async processing
1355 | - ✅ Want consistent extraction across site
1356 |
1357 | **Choose agentic_scrapper when:**
1358 | - ✅ Site requires complex navigation
1359 | - ✅ Need to interact with forms/buttons
1360 | - ✅ Custom workflow requirements
1361 | """
1362 |
1363 |
1364 | # Add tool for markdownify
1365 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
1366 | def markdownify(website_url: str, ctx: Context) -> Dict[str, Any]:
1367 | """
1368 | Convert a webpage into clean, formatted markdown.
1369 |
1370 | This tool fetches any webpage and converts its content into clean, readable markdown format.
1371 | Useful for extracting content from documentation, articles, and web pages for further processing.
1372 | Costs 2 credits per page. Read-only operation with no side effects.
1373 |
1374 | Args:
1375 | website_url (str): The complete URL of the webpage to convert to markdown format.
1376 | - Must include protocol (http:// or https://)
1377 | - Supports most web content types (HTML, articles, documentation)
1378 | - Works with both static and dynamic content
1379 | - Examples:
1380 | * https://example.com/page
1381 | * https://docs.python.org/3/tutorial/
1382 | * https://github.com/user/repo/README.md
1383 | - Invalid examples:
1384 | * example.com (missing protocol)
1385 | * ftp://example.com (unsupported protocol)
1386 | * localhost:3000 (missing protocol)
1387 |
1388 | Returns:
1389 | Dictionary containing:
1390 | - markdown: The converted markdown content as a string
1391 | - metadata: Additional information about the conversion (title, description, etc.)
1392 | - status: Success/error status of the operation
1393 | - credits_used: Number of credits consumed (always 2 for this operation)
1394 |
1395 | Raises:
1396 | ValueError: If website_url is malformed or missing protocol
1397 | HTTPError: If the webpage cannot be accessed or returns an error
1398 | TimeoutError: If the webpage takes too long to load (>120 seconds)
1399 | """
1400 | try:
1401 | api_key = get_api_key(ctx)
1402 | client = ScapeGraphClient(api_key)
1403 | return client.markdownify(website_url)
1404 | except Exception as e:
1405 | return {"error": str(e)}
1406 |
1407 |
1408 | # Add tool for smartscraper
1409 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
1410 | def smartscraper(
1411 | user_prompt: str,
1412 | ctx: Context,
1413 | website_url: Optional[str] = None,
1414 | website_html: Optional[str] = None,
1415 | website_markdown: Optional[str] = None,
1416 | output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
1417 | default=None,
1418 | description="JSON schema dict or JSON string defining the expected output structure",
1419 | json_schema_extra={
1420 | "oneOf": [
1421 | {"type": "string"},
1422 | {"type": "object"}
1423 | ]
1424 | }
1425 | )]] = None,
1426 | number_of_scrolls: Optional[int] = None,
1427 | total_pages: Optional[int] = None,
1428 | render_heavy_js: Optional[bool] = None,
1429 | stealth: Optional[bool] = None
1430 | ) -> Dict[str, Any]:
1431 | """
1432 | Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.
1433 |
1434 | This tool uses advanced AI to understand your natural language prompt and extract specific
1435 | structured data from web content. Supports three input modes: URL scraping, local HTML processing,
1436 | or local markdown processing. Ideal for extracting product information, contact details,
1437 | article metadata, or any structured content. Costs 10 credits per page. Read-only operation.
1438 |
1439 | Args:
1440 | user_prompt (str): Natural language instructions describing what data to extract.
1441 | - Be specific about the fields you want for better results
1442 | - Use clear, descriptive language about the target data
1443 | - Examples:
1444 | * "Extract product name, price, description, and availability status"
1445 | * "Find all contact methods: email addresses, phone numbers, and social media links"
1446 | * "Get article title, author, publication date, and summary"
1447 | * "Extract all job listings with title, company, location, and salary"
1448 | - Tips for better results:
1449 | * Specify exact field names you want
1450 | * Mention data types (numbers, dates, URLs, etc.)
1451 | * Include context about where data might be located
1452 |
1453 | website_url (Optional[str]): The complete URL of the webpage to scrape.
1454 | - Mutually exclusive with website_html and website_markdown
1455 | - Must include protocol (http:// or https://)
1456 | - Supports dynamic and static content
1457 | - Examples:
1458 | * https://example.com/products/item
1459 | * https://news.site.com/article/123
1460 | * https://company.com/contact
1461 | - Default: None (must provide one of the three input sources)
1462 |
1463 | website_html (Optional[str]): Raw HTML content to process locally.
1464 | - Mutually exclusive with website_url and website_markdown
1465 | - Maximum size: 2MB
1466 | - Useful for processing pre-fetched or generated HTML
1467 | - Use when you already have HTML content from another source
1468 | - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
1469 | - Default: None
1470 |
1471 | website_markdown (Optional[str]): Markdown content to process locally.
1472 | - Mutually exclusive with website_url and website_html
1473 | - Maximum size: 2MB
1474 | - Useful for extracting from markdown documents or converted content
1475 | - Works well with documentation, README files, or converted web content
1476 | - Example: "# Title\n\n## Section\n\nContent here..."
1477 | - Default: None
1478 |
1479 | output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
1480 | - Can be provided as a dictionary or JSON string
1481 | - Helps ensure consistent, structured output format
1482 | - Optional but recommended for complex extractions
1483 | - Examples:
1484 | * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}}
1485 | * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}}'
1486 | * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}}}
1487 | - Default: None (AI will infer structure from prompt)
1488 |
1489 | number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
1490 | - Range: 0-50 scrolls
1491 | - Default: 0 (no scrolling)
1492 | - Useful for dynamically loaded content (lazy loading, infinite scroll)
1493 | - Each scroll waits for content to load before continuing
1494 | - Examples:
1495 | * 0: Static content, no scrolling needed
1496 | * 3: Social media feeds, product listings
1497 | * 10: Long articles, extensive product catalogs
1498 | - Note: Increases processing time proportionally
1499 |
1500 | total_pages (Optional[int]): Number of pages to process for pagination.
1501 | - Range: 1-100 pages
1502 | - Default: 1 (single page only)
1503 | - Automatically follows pagination links when available
1504 | - Useful for multi-page listings, search results, catalogs
1505 | - Examples:
1506 | * 1: Single page extraction
1507 | * 5: First 5 pages of search results
1508 | * 20: Comprehensive catalog scraping
1509 | - Note: Each page counts toward credit usage (10 credits × pages)
1510 |
1511 | render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
1512 | - Default: false
1513 | - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
1514 | - Increases processing time but captures client-side rendered content
1515 | - Use when content is loaded dynamically via JavaScript
1516 | - Examples of when to use:
1517 | * React/Angular/Vue applications
1518 | * Sites with dynamic content loading
1519 | * AJAX-heavy interfaces
1520 | * Content that appears after page load
1521 | - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)
1522 |
1523 | stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
1524 | - Default: false
1525 | - Helps bypass basic anti-scraping measures
1526 | - Uses techniques to appear more like a human browser
1527 | - Useful for sites with bot detection systems
1528 | - Examples of when to use:
1529 | * Sites that block automated requests
1530 | * E-commerce sites with protection
1531 | * Sites that require "human-like" behavior
1532 | - Note: May increase processing time and is not 100% guaranteed
1533 |
1534 | Returns:
1535 | Dictionary containing:
1536 | - extracted_data: The structured data matching your prompt and optional schema
1537 | - metadata: Information about the extraction process
1538 | - credits_used: Number of credits consumed (10 per page processed)
1539 | - processing_time: Time taken for the extraction
1540 | - pages_processed: Number of pages that were analyzed
1541 | - status: Success/error status of the operation
1542 |
1543 | Raises:
1544 | ValueError: If no input source provided or multiple sources provided
1545 | HTTPError: If website_url cannot be accessed
1546 | TimeoutError: If processing exceeds timeout limits
1547 | ValidationError: If output_schema is malformed JSON
1548 | """
1549 | try:
1550 | api_key = get_api_key(ctx)
1551 | client = ScapeGraphClient(api_key)
1552 |
1553 | # Parse output_schema if it's a JSON string
1554 | normalized_schema: Optional[Dict[str, Any]] = None
1555 | if isinstance(output_schema, dict):
1556 | normalized_schema = output_schema
1557 | elif isinstance(output_schema, str):
1558 | try:
1559 | parsed_schema = json.loads(output_schema)
1560 | if isinstance(parsed_schema, dict):
1561 | normalized_schema = parsed_schema
1562 | else:
1563 | return {"error": "output_schema must be a JSON object"}
1564 | except json.JSONDecodeError as e:
1565 | return {"error": f"Invalid JSON for output_schema: {str(e)}"}
1566 |
1567 | return client.smartscraper(
1568 | user_prompt=user_prompt,
1569 | website_url=website_url,
1570 | website_html=website_html,
1571 | website_markdown=website_markdown,
1572 | output_schema=normalized_schema,
1573 | number_of_scrolls=number_of_scrolls,
1574 | total_pages=total_pages,
1575 | render_heavy_js=render_heavy_js,
1576 | stealth=stealth
1577 | )
1578 | except Exception as e:
1579 | return {"error": str(e)}
1580 |
1581 |
1582 | # Add tool for searchscraper
1583 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": False})
1584 | def searchscraper(
1585 | user_prompt: str,
1586 | ctx: Context,
1587 | num_results: Optional[int] = None,
1588 | number_of_scrolls: Optional[int] = None
1589 | ) -> Dict[str, Any]:
1590 | """
1591 | Perform AI-powered web searches with structured data extraction.
1592 |
1593 | This tool searches the web based on your query and uses AI to extract structured information
1594 | from the search results. Ideal for research, competitive analysis, and gathering information
1595 | from multiple sources. Each website searched costs 10 credits (default 3 websites = 30 credits).
1596 | Read-only operation but results may vary over time (non-idempotent).
1597 |
1598 | Args:
1599 | user_prompt (str): Search query or natural language instructions for information to find.
1600 | - Can be a simple search query or detailed extraction instructions
1601 | - The AI will search the web and extract relevant data from found pages
1602 | - Be specific about what information you want extracted
1603 | - Examples:
1604 | * "Find latest AI research papers published in 2024 with author names and abstracts"
1605 | * "Search for Python web scraping tutorials with ratings and difficulty levels"
1606 | * "Get current cryptocurrency prices and market caps for top 10 coins"
1607 | * "Find contact information for tech startups in San Francisco"
1608 | * "Search for job openings for data scientists with salary information"
1609 | - Tips for better results:
1610 | * Include specific fields you want extracted
1611 | * Mention timeframes or filters (e.g., "latest", "2024", "top 10")
1612 | * Specify data types needed (prices, dates, ratings, etc.)
1613 |
1614 | num_results (Optional[int]): Number of websites to search and extract data from.
1615 | - Default: 3 websites (costs 30 credits total)
1616 | - Range: 1-20 websites (recommended to stay under 10 for cost efficiency)
1617 | - Each website costs 10 credits, so total cost = num_results × 10
1618 | - Examples:
1619 | * 1: Quick single-source lookup (10 credits)
1620 | * 3: Standard research (30 credits) - good balance of coverage and cost
1621 | * 5: Comprehensive research (50 credits)
1622 | * 10: Extensive analysis (100 credits)
1623 | - Note: More results provide broader coverage but increase costs and processing time
1624 |
1625 | number_of_scrolls (Optional[int]): Number of infinite scrolls per searched webpage.
1626 | - Default: 0 (no scrolling on search result pages)
1627 | - Range: 0-10 scrolls per page
1628 | - Useful when search results point to pages with dynamic content loading
1629 | - Each scroll waits for content to load before continuing
1630 | - Examples:
1631 | * 0: Static content pages, news articles, documentation
1632 | * 2: Social media pages, product listings with lazy loading
1633 | * 5: Extensive feeds, long-form content with infinite scroll
1634 | - Note: Increases processing time significantly (adds 5-10 seconds per scroll per page)
1635 |
1636 | Returns:
1637 | Dictionary containing:
1638 | - search_results: Array of extracted data from each website found
1639 | - sources: List of URLs that were searched and processed
1640 | - total_websites_processed: Number of websites successfully analyzed
1641 | - credits_used: Total credits consumed (num_results × 10)
1642 | - processing_time: Total time taken for search and extraction
1643 | - search_query_used: The actual search query sent to search engines
1644 | - metadata: Additional information about the search process
1645 |
1646 | Raises:
1647 | ValueError: If user_prompt is empty or num_results is out of range
1648 | HTTPError: If search engines are unavailable or return errors
1649 | TimeoutError: If search or extraction process exceeds timeout limits
1650 | RateLimitError: If too many requests are made in a short time period
1651 |
1652 | Note:
1653 | - Results may vary between calls due to changing web content (non-idempotent)
1654 | - Search engines may return different results over time
1655 | - Some websites may be inaccessible or block automated access
1656 | - Processing time increases with num_results and number_of_scrolls
1657 | - Consider using smartscraper on specific URLs if you know the target sites
1658 | """
1659 | try:
1660 | api_key = get_api_key(ctx)
1661 | client = ScapeGraphClient(api_key)
1662 | return client.searchscraper(user_prompt, num_results, number_of_scrolls)
1663 | except Exception as e:
1664 | return {"error": str(e)}
1665 |
1666 |
1667 | # Add tool for SmartCrawler initiation
1668 | @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
1669 | def smartcrawler_initiate(
1670 | url: str,
1671 | ctx: Context,
1672 | prompt: Optional[str] = None,
1673 | extraction_mode: str = "ai",
1674 | depth: Optional[int] = None,
1675 | max_pages: Optional[int] = None,
1676 | same_domain_only: Optional[bool] = None
1677 | ) -> Dict[str, Any]:
1678 | """
1679 | Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.
1680 |
1681 | This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL.
1682 | Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page)
1683 | for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results.
1684 | Creates a new crawl request (non-idempotent, non-read-only).
1685 |
1686 | SmartCrawler supports two modes:
1687 | - AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
1688 | - Markdown Conversion Mode: Converts each page to clean markdown format
1689 |
1690 | Args:
1691 | url (str): The starting URL to begin crawling from.
1692 | - Must include protocol (http:// or https://)
1693 | - The crawler will discover and process linked pages from this starting point
1694 | - Should be a page with links to other pages you want to crawl
1695 | - Examples:
1696 | * https://docs.example.com (documentation site root)
1697 | * https://blog.company.com (blog homepage)
1698 | * https://example.com/products (product category page)
1699 | * https://news.site.com/category/tech (news section)
1700 | - Best practices:
1701 | * Use homepage or main category pages as starting points
1702 | * Ensure the starting page has links to content you want to crawl
1703 | * Consider site structure when choosing the starting URL
1704 |
1705 | prompt (Optional[str]): AI prompt for data extraction.
1706 | - REQUIRED when extraction_mode is 'ai'
1707 | - Ignored when extraction_mode is 'markdown'
1708 | - Describes what data to extract from each crawled page
1709 | - Applied consistently across all discovered pages
1710 | - Examples:
1711 | * "Extract API endpoint name, method, parameters, and description"
1712 | * "Get article title, author, publication date, and summary"
1713 | * "Find product name, price, description, and availability"
1714 | * "Extract job title, company, location, salary, and requirements"
1715 | - Tips for better results:
1716 | * Be specific about fields you want from each page
1717 | * Consider that different pages may have different content structures
1718 | * Use general terms that apply across multiple page types
1719 |
1720 | extraction_mode (str): Extraction mode for processing crawled pages.
1721 | - Default: "ai"
1722 | - Options:
1723 | * "ai": AI-powered structured data extraction (10 credits per page)
1724 | - Uses the prompt to extract specific data from each page
1725 | - Returns structured JSON data
1726 | - More expensive but provides targeted information
1727 | - Best for: Data collection, research, structured analysis
1728 | * "markdown": Simple markdown conversion (2 credits per page)
1729 | - Converts each page to clean markdown format
1730 | - No AI processing, just content conversion
1731 | - More cost-effective for content archival
1732 | - Best for: Documentation backup, content migration, reading
1733 | - Cost comparison:
1734 | * AI mode: 50 pages = 500 credits
1735 | * Markdown mode: 50 pages = 100 credits
1736 |
1737 | depth (Optional[int]): Maximum depth of link traversal from the starting URL.
1738 | - Default: unlimited (will follow links until max_pages or no more links)
1739 | - Depth levels:
1740 | * 0: Only the starting URL (no link following)
1741 | * 1: Starting URL + pages directly linked from it
1742 | * 2: Starting URL + direct links + links from those pages
1743 | * 3+: Continues following links to specified depth
1744 | - Examples:
1745 | * 1: Crawl blog homepage + all blog posts
1746 | * 2: Crawl docs homepage + category pages + individual doc pages
1747 | * 3: Deep crawling for comprehensive site coverage
1748 | - Considerations:
1749 | * Higher depth can lead to exponential page growth
1750 | * Use with max_pages to control scope and cost
1751 | * Consider site structure when setting depth
1752 |
1753 | max_pages (Optional[int]): Maximum number of pages to crawl in total.
1754 | - Default: unlimited (will crawl until no more links or depth limit)
1755 | - Recommended ranges:
1756 | * 10-20: Testing and small sites
1757 | * 50-100: Medium sites and focused crawling
1758 | * 200-500: Large sites and comprehensive analysis
1759 | * 1000+: Enterprise-level crawling (high cost)
1760 | - Cost implications:
1761 | * AI mode: max_pages × 10 credits
1762 | * Markdown mode: max_pages × 2 credits
1763 | - Examples:
1764 | * 10: Quick site sampling (20-100 credits)
1765 | * 50: Standard documentation crawl (100-500 credits)
1766 | * 200: Comprehensive site analysis (400-2000 credits)
1767 | - Note: Crawler stops when this limit is reached, regardless of remaining links
1768 |
1769 | same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
1770 | - Default: true (recommended for most use cases)
1771 | - Options:
1772 | * true: Only crawl pages within the same domain as starting URL
1773 | - Prevents following external links
1774 | - Keeps crawling focused on the target site
1775 | - Reduces risk of crawling unrelated content
1776 | - Example: Starting at docs.example.com only crawls docs.example.com pages
1777 | * false: Allow crawling external domains
1778 | - Follows links to other domains
1779 | - Can lead to very broad crawling scope
1780 | - May crawl unrelated or unwanted content
1781 | - Use with caution and appropriate max_pages limit
1782 | - Recommendations:
1783 | * Use true for focused site crawling
1784 | * Use false only when you specifically need cross-domain data
1785 | * Always set max_pages when using false to prevent runaway crawling
1786 |
1787 | Returns:
1788 | Dictionary containing:
1789 | - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
1790 | - status: Initial status of the crawl request ("initiated" or "processing")
1791 | - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
1792 | - crawl_parameters: Summary of the crawling configuration
1793 | - estimated_time: Rough estimate of processing time
1794 | - next_steps: Instructions for retrieving results
1795 |
1796 | Raises:
1797 | ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
1798 | HTTPError: If the starting URL cannot be accessed
1799 | RateLimitError: If too many crawl requests are initiated too quickly
1800 |
1801 | Note:
1802 | - This operation is asynchronous and may take several minutes to complete
1803 | - Use smartcrawler_fetch_results with the returned request_id to get results
1804 | - Keep polling smartcrawler_fetch_results until status is "completed"
1805 | - Actual pages crawled may be less than max_pages if fewer links are found
1806 | - Processing time increases with max_pages, depth, and extraction_mode complexity
1807 | """
1808 | try:
1809 | api_key = get_api_key(ctx)
1810 | client = ScapeGraphClient(api_key)
1811 | return client.smartcrawler_initiate(
1812 | url=url,
1813 | prompt=prompt,
1814 | extraction_mode=extraction_mode,
1815 | depth=depth,
1816 | max_pages=max_pages,
1817 | same_domain_only=same_domain_only
1818 | )
1819 | except Exception as e:
1820 | return {"error": str(e)}
1821 |
1822 |
1823 | # Add tool for fetching SmartCrawler results
1824 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
1825 | def smartcrawler_fetch_results(request_id: str, ctx: Context) -> Dict[str, Any]:
1826 | """
1827 | Retrieve the results of an asynchronous SmartCrawler operation.
1828 |
1829 | This tool fetches the results from a previously initiated crawling operation using the request_id.
1830 | The crawl request processes asynchronously in the background. Keep polling this endpoint until
1831 | the status field indicates 'completed'. While processing, you'll receive status updates.
1832 | Read-only operation that safely retrieves results without side effects.
1833 |
1834 | Args:
1835 | request_id: The unique request ID returned by smartcrawler_initiate. Use this to retrieve the crawling results. Keep polling until status is 'completed'. Example: 'req_abc123xyz'
1836 |
1837 | Returns:
1838 | Dictionary containing:
1839 | - status: Current status of the crawl operation ('processing', 'completed', 'failed')
1840 | - results: Crawled data (structured extraction or markdown) when completed
1841 | - metadata: Information about processed pages, URLs visited, and processing statistics
1842 | Keep polling until status is 'completed' to get final results
1843 | """
1844 | try:
1845 | api_key = get_api_key(ctx)
1846 | client = ScapeGraphClient(api_key)
1847 | return client.smartcrawler_fetch_results(request_id)
1848 | except Exception as e:
1849 | return {"error": str(e)}
1850 |
1851 |
1852 | # Add tool for basic scrape
1853 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
1854 | def scrape(
1855 | website_url: str,
1856 | ctx: Context,
1857 | render_heavy_js: Optional[bool] = None
1858 | ) -> Dict[str, Any]:
1859 | """
1860 | Fetch raw page content from any URL with optional JavaScript rendering.
1861 |
1862 | This tool performs basic web scraping to retrieve the raw HTML content of a webpage.
1863 | Optionally enable JavaScript rendering for Single Page Applications (SPAs) and sites with
1864 | heavy client-side rendering. Lower cost than AI extraction (1 credit/page).
1865 | Read-only operation with no side effects.
1866 |
1867 | Args:
1868 | website_url (str): The complete URL of the webpage to scrape.
1869 | - Must include protocol (http:// or https://)
1870 | - Returns raw HTML content of the page
1871 | - Works with both static and dynamic websites
1872 | - Examples:
1873 | * https://example.com/page
1874 | * https://api.example.com/docs
1875 | * https://news.site.com/article/123
1876 | * https://app.example.com/dashboard (may need render_heavy_js=true)
1877 | - Supported protocols: HTTP, HTTPS
1878 | - Invalid examples:
1879 | * example.com (missing protocol)
1880 | * ftp://example.com (unsupported protocol)
1881 |
1882 | render_heavy_js (Optional[bool]): Enable full JavaScript rendering for dynamic content.
1883 | - Default: false (faster, lower cost, works for most static sites)
1884 | - Set to true for sites that require JavaScript execution to display content
1885 | - When to use true:
1886 | * Single Page Applications (React, Angular, Vue.js)
1887 | * Sites with dynamic content loading via AJAX
1888 | * Content that appears only after JavaScript execution
1889 | * Interactive web applications
1890 | * Sites where initial HTML is mostly empty
1891 | - When to use false (default):
1892 | * Static websites and blogs
1893 | * Server-side rendered content
1894 | * Traditional HTML pages
1895 | * News articles and documentation
1896 | * When you need faster processing
1897 | - Performance impact:
1898 | * false: 2-5 seconds processing time
1899 | * true: 15-30 seconds processing time (waits for JS execution)
1900 | - Cost: Same (1 credit) regardless of render_heavy_js setting
1901 |
1902 | Returns:
1903 | Dictionary containing:
1904 | - html_content: The raw HTML content of the page as a string
1905 | - page_title: Extracted page title if available
1906 | - status_code: HTTP response status code (200 for success)
1907 | - final_url: Final URL after any redirects
1908 | - content_length: Size of the HTML content in bytes
1909 | - processing_time: Time taken to fetch and process the page
1910 | - javascript_rendered: Whether JavaScript rendering was used
1911 | - credits_used: Number of credits consumed (always 1)
1912 |
1913 | Raises:
1914 | ValueError: If website_url is malformed or missing protocol
1915 | HTTPError: If the webpage returns an error status (404, 500, etc.)
1916 | TimeoutError: If the page takes too long to load
1917 | ConnectionError: If the website cannot be reached
1918 |
1919 | Use Cases:
1920 | - Getting raw HTML for custom parsing
1921 | - Checking page structure before using other tools
1922 | - Fetching content for offline processing
1923 | - Debugging website content issues
1924 | - Pre-processing before AI extraction
1925 |
1926 | Note:
1927 | - This tool returns raw HTML without any AI processing
1928 | - Use smartscraper for structured data extraction
1929 | - Use markdownify for clean, readable content
1930 | - Consider render_heavy_js=true if initial results seem incomplete
1931 | """
1932 | try:
1933 | api_key = get_api_key(ctx)
1934 | client = ScapeGraphClient(api_key)
1935 | return client.scrape(website_url=website_url, render_heavy_js=render_heavy_js)
1936 | except httpx.HTTPError as http_err:
1937 | return {"error": str(http_err)}
1938 | except ValueError as val_err:
1939 | return {"error": str(val_err)}
1940 |
1941 |
1942 | # Add tool for sitemap extraction
1943 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
1944 | def sitemap(website_url: str, ctx: Context) -> Dict[str, Any]:
1945 | """
1946 | Extract and discover the complete sitemap structure of any website.
1947 |
1948 | This tool automatically discovers all accessible URLs and pages within a website, providing
1949 | a comprehensive map of the site's structure. Useful for understanding site architecture before
1950 | crawling or for discovering all available content. Very cost-effective at 1 credit per request.
1951 | Read-only operation with no side effects.
1952 |
1953 | Args:
1954 | website_url (str): The base URL of the website to extract sitemap from.
1955 | - Must include protocol (http:// or https://)
1956 | - Should be the root domain or main section you want to map
1957 | - The tool will discover all accessible pages from this starting point
1958 | - Examples:
1959 | * https://example.com (discover entire website structure)
1960 | * https://docs.example.com (map documentation site)
1961 | * https://blog.company.com (discover all blog pages)
1962 | * https://shop.example.com (map e-commerce structure)
1963 | - Best practices:
1964 | * Use root domain (https://example.com) for complete site mapping
1965 | * Use subdomain (https://docs.example.com) for focused mapping
1966 | * Ensure the URL is accessible and doesn't require authentication
1967 | - Discovery methods:
1968 | * Checks for robots.txt and sitemap.xml files
1969 | * Crawls navigation links and menus
1970 | * Discovers pages through internal link analysis
1971 | * Identifies common URL patterns and structures
1972 |
1973 | Returns:
1974 | Dictionary containing:
1975 | - discovered_urls: List of all URLs found on the website
1976 | - site_structure: Hierarchical organization of pages and sections
1977 | - url_categories: URLs grouped by type (pages, images, documents, etc.)
1978 | - total_pages: Total number of pages discovered
1979 | - subdomains: List of subdomains found (if any)
1980 | - sitemap_sources: Sources used for discovery (sitemap.xml, robots.txt, crawling)
1981 | - page_types: Breakdown of different content types found
1982 | - depth_analysis: URL organization by depth from root
1983 | - external_links: Links pointing to external domains (if found)
1984 | - processing_time: Time taken to complete the discovery
1985 | - credits_used: Number of credits consumed (always 1)
1986 |
1987 | Raises:
1988 | ValueError: If website_url is malformed or missing protocol
1989 | HTTPError: If the website cannot be accessed or returns errors
1990 | TimeoutError: If the discovery process takes too long
1991 | ConnectionError: If the website cannot be reached
1992 |
1993 | Use Cases:
1994 | - Planning comprehensive crawling operations
1995 | - Understanding website architecture and organization
1996 | - Discovering all available content before targeted scraping
1997 | - SEO analysis and site structure optimization
1998 | - Content inventory and audit preparation
1999 | - Identifying pages for bulk processing operations
2000 |
2001 | Best Practices:
2002 | - Run sitemap before using smartcrawler_initiate for better planning
2003 | - Use results to set appropriate max_pages and depth parameters
2004 | - Check discovered URLs to understand site organization
2005 | - Identify high-value pages for targeted extraction
2006 | - Use for cost estimation before large crawling operations
2007 |
2008 | Note:
2009 | - Very cost-effective at only 1 credit per request
2010 | - Results may vary based on site structure and accessibility
2011 | - Some pages may require authentication and won't be discovered
2012 | - Large sites may have thousands of URLs - consider filtering results
2013 | - Use discovered URLs as input for other scraping tools
2014 | """
2015 | try:
2016 | api_key = get_api_key(ctx)
2017 | client = ScapeGraphClient(api_key)
2018 | return client.sitemap(website_url=website_url)
2019 | except httpx.HTTPError as http_err:
2020 | return {"error": str(http_err)}
2021 | except ValueError as val_err:
2022 | return {"error": str(val_err)}
2023 |
2024 |
2025 | # Add tool for Agentic Scraper (runs remotely; no live browser session is exposed to the client)
2026 | @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
2027 | def agentic_scrapper(
2028 | url: str,
2029 | ctx: Context,
2030 | user_prompt: Optional[str] = None,
2031 | output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
2032 | default=None,
2033 | description="Desired output structure as a JSON schema dict or JSON string",
2034 | json_schema_extra={
2035 | "oneOf": [
2036 | {"type": "string"},
2037 | {"type": "object"}
2038 | ]
2039 | }
2040 | )]] = None,
2041 | steps: Optional[Annotated[Union[str, List[str]], Field(
2042 | default=None,
2043 | description="Step-by-step instructions for the agent as a list of strings or JSON array string",
2044 | json_schema_extra={
2045 | "oneOf": [
2046 | {"type": "string"},
2047 | {"type": "array", "items": {"type": "string"}}
2048 | ]
2049 | }
2050 | )]] = None,
2051 | ai_extraction: Optional[bool] = None,
2052 | persistent_session: Optional[bool] = None,
2053 | timeout_seconds: Optional[float] = None
2054 | ) -> Dict[str, Any]:
2055 | """
2056 | Execute complex multi-step web scraping workflows with AI-powered automation.
2057 |
2058 | This tool runs an intelligent agent that can navigate websites, interact with forms and buttons,
2059 | follow multi-step workflows, and extract structured data. Ideal for complex scraping scenarios
2060 | requiring user interaction simulation, form submissions, or multi-page navigation flows.
2061 | Supports custom output schemas and step-by-step instructions. Variable credit cost based on
2062 | complexity. Can perform actions on the website (non-read-only, non-idempotent).
2063 |
2064 | The agent accepts flexible input formats for steps (list or JSON string) and output_schema
2065 | (dict or JSON string) to accommodate different client implementations.
2066 |
2067 | Args:
2068 | url (str): The target website URL where the agentic scraping workflow should start.
2069 | - Must include protocol (http:// or https://)
2070 | - Should be the starting page for your automation workflow
2071 | - The agent will begin its actions from this URL
2072 | - Examples:
2073 | * https://example.com/search (start at search page)
2074 | * https://shop.example.com/login (begin with login flow)
2075 | * https://app.example.com/dashboard (start at main interface)
2076 | * https://forms.example.com/contact (begin at form page)
2077 | - Considerations:
2078 | * Choose a starting point that makes sense for your workflow
2079 | * Ensure the page is publicly accessible or handle authentication
2080 | * Consider the logical flow of actions from this starting point
2081 |
2082 | user_prompt (Optional[str]): High-level instructions for what the agent should accomplish.
2083 | - Describes the overall goal and desired outcome of the automation
2084 | - Should be clear and specific about what you want to achieve
2085 | - Works in conjunction with the steps parameter for detailed guidance
2086 | - Examples:
2087 | * "Navigate to the search page, search for laptops, and extract the top 5 results with prices"
2088 | * "Fill out the contact form with sample data and submit it"
2089 | * "Login to the dashboard and extract all recent notifications"
2090 | * "Browse the product catalog and collect information about all items"
2091 | * "Navigate through the multi-step checkout process and capture each step"
2092 | - Tips for better results:
2093 | * Be specific about the end goal
2094 | * Mention what data you want extracted
2095 | * Include context about the expected workflow
2096 | * Specify any particular elements or sections to focus on
2097 |
2098 | output_schema (Optional[Union[str, Dict]]): Desired output structure for extracted data.
2099 | - Can be provided as a dictionary or JSON string
2100 | - Defines the format and structure of the final extracted data
2101 | - Helps ensure consistent, predictable output format
2102 | - Examples:
2103 | * Simple object: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}}
2104 | * Array of objects: {'type': 'array', 'items': {'type': 'object', 'properties': {'name': {'type': 'string'}, 'value': {'type': 'string'}}}}
2105 | * Complex nested: {'type': 'object', 'properties': {'products': {'type': 'array', 'items': {...}}, 'total_count': {'type': 'number'}}}
2106 | * As JSON string: '{"type": "object", "properties": {"results": {"type": "array"}}}'
2107 | - Default: None (agent will infer structure from prompt and steps)
2108 |
2109 | steps (Optional[Union[str, List[str]]]): Step-by-step instructions for the agent.
2110 | - Can be provided as a list of strings or JSON array string
2111 | - Provides detailed, sequential instructions for the automation workflow
2112 | - Each step should be a clear, actionable instruction
2113 | - Examples as list:
2114 | * ['Click the search button', 'Enter "laptops" in the search box', 'Press Enter', 'Wait for results to load', 'Extract product information']
2115 | * ['Fill in email field with [email protected]', 'Fill in password field', 'Click login button', 'Navigate to profile page']
2116 | - Examples as JSON string:
2117 | * '["Open navigation menu", "Click on Products", "Select category filters", "Extract all product data"]'
2118 | - Best practices:
2119 | * Break complex actions into simple steps
2120 | * Be specific about UI elements (button text, field names, etc.)
2121 | * Include waiting/loading steps when necessary
2122 | * Specify extraction points clearly
2123 | * Order steps logically for the workflow
2124 |
2125 | ai_extraction (Optional[bool]): Enable AI-powered extraction mode for intelligent data parsing.
2126 | - Default: true (recommended for most use cases)
2127 | - Options:
2128 | * true: Uses advanced AI to intelligently extract and structure data
2129 | - Better at handling complex page layouts
2130 | - Can adapt to different content structures
2131 | - Provides more accurate data extraction
2132 | - Recommended for most scenarios
2133 | * false: Uses simpler extraction methods
2134 | - Faster processing but less intelligent
2135 | - May miss complex or nested data
2136 | - Use when speed is more important than accuracy
2137 | - Performance impact:
2138 | * true: Higher processing time but better results
2139 | * false: Faster execution but potentially less accurate extraction
2140 |
2141 | persistent_session (Optional[bool]): Maintain session state between steps.
2142 | - Default: false (each step starts fresh)
2143 | - Options:
2144 | * true: Keeps cookies, login state, and session data between steps
2145 | - Essential for authenticated workflows
2146 | - Maintains shopping cart contents, user preferences, etc.
2147 | - Required for multi-step processes that depend on previous actions
2148 | - Use for: Login flows, shopping processes, form wizards
2149 | * false: Each step starts with a clean session
2150 | - Faster and simpler for independent actions
2151 | - No state carried between steps
2152 | - Use for: Simple data extraction, public content scraping
2153 | - Examples when to use true:
2154 | * Login → Navigate to protected area → Extract data
2155 | * Add items to cart → Proceed to checkout → Extract order details
2156 | * Multi-step form completion with session dependencies
2157 |
2158 | timeout_seconds (Optional[float]): Maximum time to wait for the entire workflow.
2159 | - Default: 120 seconds (2 minutes)
2160 | - Recommended ranges:
2161 | * 60-120: Simple workflows (2-5 steps)
2162 | * 180-300: Medium complexity (5-10 steps)
2163 | * 300-600: Complex workflows (10+ steps or slow sites)
2164 | * 600+: Very complex or slow-loading workflows
2165 | - Considerations:
2166 | * Include time for page loads, form submissions, and processing
2167 | * Factor in network latency and site response times
2168 | * Allow extra time for AI processing and extraction
2169 | * Balance between thoroughness and efficiency
2170 | - Examples:
2171 | * 60.0: Quick single-page data extraction
2172 | * 180.0: Multi-step form filling and submission
2173 | * 300.0: Complex navigation and comprehensive data extraction
2174 | * 600.0: Extensive workflows with multiple page interactions
2175 |
2176 | Returns:
2177 | Dictionary containing:
2178 | - extracted_data: The structured data matching your prompt and optional schema
2179 | - workflow_log: Detailed log of all actions performed by the agent
2180 | - pages_visited: List of URLs visited during the workflow
2181 | - actions_performed: Summary of interactions (clicks, form fills, navigations)
2182 | - execution_time: Total time taken for the workflow
2183 | - steps_completed: Number of steps successfully executed
2184 | - final_page_url: The URL where the workflow ended
2185 | - session_data: Session information if persistent_session was enabled
2186 | - credits_used: Number of credits consumed (varies by complexity)
2187 | - status: Success/failure status with any error details
2188 |
2189 | Raises:
2190 | ValueError: If URL is malformed or required parameters are missing
2191 | TimeoutError: If the workflow exceeds the specified timeout
2192 | NavigationError: If the agent cannot navigate to required pages
2193 | InteractionError: If the agent cannot interact with specified elements
2194 | ExtractionError: If data extraction fails or returns invalid results
2195 |
2196 | Use Cases:
2197 | - Automated form filling and submission
2198 | - Multi-step checkout processes
2199 | - Login-protected content extraction
2200 | - Interactive search and filtering workflows
2201 | - Complex navigation scenarios requiring user simulation
2202 | - Data collection from dynamic, JavaScript-heavy applications
2203 |
2204 | Best Practices:
2205 | - Start with simple workflows and gradually increase complexity
2206 | - Use specific element identifiers in steps (button text, field labels)
2207 | - Include appropriate wait times for page loads and dynamic content
2208 | - Test with persistent_session=true for authentication-dependent workflows
2209 | - Set realistic timeouts based on workflow complexity
2210 | - Provide clear, sequential steps that build on each other
2211 | - Use output_schema to ensure consistent data structure
2212 |
2213 | Note:
2214 | - This tool can perform actions on websites (non-read-only)
2215 | - Results may vary between runs due to dynamic content (non-idempotent)
2216 | - Credit cost varies based on workflow complexity and execution time
2217 | - Some websites may have anti-automation measures that could affect success
2218 | - Consider using simpler tools (smartscraper, markdownify) for basic extraction needs
2219 | """
2220 | # Normalize inputs to handle flexible formats from different MCP clients
2221 | normalized_steps: Optional[List[str]] = None
2222 | if isinstance(steps, list):
2223 | normalized_steps = steps
2224 | elif isinstance(steps, str):
2225 | parsed_steps: Optional[Any] = None
2226 | try:
2227 | parsed_steps = json.loads(steps)
2228 | except json.JSONDecodeError:
2229 | parsed_steps = None
2230 | if isinstance(parsed_steps, list):
2231 | normalized_steps = parsed_steps
2232 | else:
2233 | normalized_steps = [steps]
2234 |
2235 | normalized_schema: Optional[Dict[str, Any]] = None
2236 | if isinstance(output_schema, dict):
2237 | normalized_schema = output_schema
2238 | elif isinstance(output_schema, str):
2239 | try:
2240 | parsed_schema = json.loads(output_schema)
2241 | if isinstance(parsed_schema, dict):
2242 | normalized_schema = parsed_schema
2243 | else:
2244 | return {"error": "output_schema must be a JSON object"}
2245 | except json.JSONDecodeError as e:
2246 | return {"error": f"Invalid JSON for output_schema: {str(e)}"}
2247 |
2248 | try:
2249 | api_key = get_api_key(ctx)
2250 | client = ScapeGraphClient(api_key)
2251 | return client.agentic_scrapper(
2252 | url=url,
2253 | user_prompt=user_prompt,
2254 | output_schema=normalized_schema,
2255 | steps=normalized_steps,
2256 | ai_extraction=ai_extraction,
2257 | persistent_session=persistent_session,
2258 | timeout_seconds=timeout_seconds,
2259 | )
2260 | except httpx.TimeoutException as timeout_err:
2261 | return {"error": f"Request timed out: {str(timeout_err)}"}
2262 | except httpx.HTTPError as http_err:
2263 | return {"error": str(http_err)}
2264 | except ValueError as val_err:
2265 | return {"error": str(val_err)}
2266 |
2267 |
2268 | # Smithery server creation function
2269 | @smithery.server(config_schema=ConfigSchema)
2270 | def create_server() -> FastMCP:
2271 | """
2272 | Create and return the FastMCP server instance for Smithery deployment.
2273 |
2274 | Returns:
2275 | Configured FastMCP server instance
2276 | """
2277 | return mcp
2278 |
2279 |
2280 | def main() -> None:
2281 | """Run the ScapeGraph MCP server."""
2282 | try:
2283 |         logger.info("Starting ScrapeGraph MCP server!")
2284 |         print("Starting ScrapeGraph MCP server!")
2285 | mcp.run(transport="stdio")
2286 | except Exception as e:
2287 | logger.error(f"Failed to start MCP server: {e}")
2288 | print(f"Error starting server: {e}")
2289 | raise
2290 |
2291 |
2292 | if __name__ == "__main__":
2293 | main()
```