# Directory Structure
```
├── .agent
│ ├── README.md
│ └── system
│ ├── mcp_protocol.md
│ └── project_architecture.md
├── .github
│ └── workflows
│ └── python-package.yml
├── .gitignore
├── assets
│ ├── cursor_mcp.png
│ └── sgai_smithery.png
├── CLAUDE.md
├── Dockerfile
├── LICENSE
├── pyproject.toml
├── README.md
├── server.json
├── smithery.yaml
├── src
│ └── scrapegraph_mcp
│ ├── __init__.py
│ └── server.py
└── uv.lock
```
# Files
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
1 | # Build artifacts
2 | build/
3 | dist/
4 | wheels/
5 | *.egg-info/
6 | *.egg
7 |
8 | # Python artifacts
9 | __pycache__/
10 | *.py[cod]
11 | *$py.class
12 | *.so
13 | .Python
14 | .pytest_cache/
15 | .coverage
16 | htmlcov/
17 | .tox/
18 | .nox/
19 | .hypothesis/
20 | .mypy_cache/
21 | .ruff_cache/
22 |
23 | # Virtual environments
24 | .venv/
25 | venv/
26 | ENV/
27 | env/
28 |
29 | # IDE files
30 | .idea/
31 | .vscode/
32 | *.swp
33 | *.swo
34 | .DS_Store
35 |
36 | # Environment variables
37 | .env
38 | .env.local
39 |
40 | # Logs
41 | *.log
42 |
43 | .mcpregistry_github_token
44 | .mcpregistry_registry_token
45 | mcp-publisher
```
--------------------------------------------------------------------------------
/.agent/README.md:
--------------------------------------------------------------------------------
```markdown
1 | # ScrapeGraph MCP Server Documentation
2 |
3 | Welcome to the ScrapeGraph MCP Server documentation hub. This directory contains comprehensive documentation for understanding, developing, and maintaining the ScrapeGraph MCP Server.
4 |
5 | ## 📚 Available Documentation
6 |
7 | ### System Documentation (`system/`)
8 |
9 | #### [Project Architecture](./system/project_architecture.md)
10 | Complete system architecture documentation including:
11 | - **System Overview** - MCP server purpose and capabilities
12 | - **Technology Stack** - Python 3.10+, FastMCP, httpx dependencies
13 | - **Project Structure** - File organization and key files
14 | - **Core Architecture** - MCP design, server architecture, patterns
15 | - **MCP Tools** - All 5 tools (markdownify, smartscraper, searchscraper, smartcrawler_initiate, smartcrawler_fetch_results)
16 | - **API Integration** - ScrapeGraphAI API endpoints and credit system
17 | - **Deployment** - Smithery, Claude Desktop, Cursor, Docker setup
18 | - **Recent Updates** - SmartCrawler integration and latest features
19 |
20 | #### [MCP Protocol](./system/mcp_protocol.md)
21 | Complete Model Context Protocol integration documentation:
22 | - **What is MCP?** - Protocol overview and key concepts
23 | - **MCP in ScrapeGraph** - Architecture and FastMCP usage
24 | - **Communication Protocol** - JSON-RPC over stdio transport
25 | - **Tool Schema** - Schema generation from Python type hints
26 | - **Error Handling** - Graceful error handling patterns
27 | - **Client Integration** - Claude Desktop, Cursor, custom clients
28 | - **Advanced Topics** - Versioning, streaming, authentication, rate limiting
29 | - **Debugging** - MCP Inspector, logs, troubleshooting
30 |
31 | ### Task Documentation (`tasks/`)
32 |
33 | *Future: PRD and implementation plans for specific features*
34 |
35 | ### SOP Documentation (`sop/`)
36 |
37 | *Future: Standard operating procedures (e.g., adding new tools, testing)*
38 |
39 | ---
40 |
41 | ## 🚀 Quick Start
42 |
43 | ### For New Engineers
44 |
45 | 1. **Read First:**
46 | - [Project Architecture - System Overview](./system/project_architecture.md#system-overview)
47 | - [MCP Protocol - What is MCP?](./system/mcp_protocol.md#what-is-mcp)
48 |
49 | 2. **Setup Development Environment:**
50 | - Install Python 3.10+
51 | - Clone repository: `git clone https://github.com/ScrapeGraphAI/scrapegraph-mcp`
52 | - Install dependencies: `pip install -e ".[dev]"`
53 | - Get API key from: [dashboard.scrapegraphai.com](https://dashboard.scrapegraphai.com)
54 |
55 | 3. **Run the Server:**
56 | ```bash
57 | export SGAI_API_KEY=your-api-key
58 | scrapegraph-mcp
59 | ```
60 |
61 | 4. **Test with MCP Inspector:**
62 | ```bash
63 | npx @modelcontextprotocol/inspector scrapegraph-mcp
64 | ```
65 |
66 | 5. **Integrate with Claude Desktop:**
67 | - See: [Project Architecture - Deployment](./system/project_architecture.md#deployment)
68 | - Add config to `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS)
69 |
70 | ---
71 |
72 | ## 🔍 Finding Information
73 |
74 | ### I want to understand...
75 |
76 | **...what MCP is:**
77 | - Read: [MCP Protocol - What is MCP?](./system/mcp_protocol.md#what-is-mcp)
78 | - Read: [Project Architecture - Core Architecture](./system/project_architecture.md#core-architecture)
79 |
80 | **...how to add a new tool:**
81 | - Read: [Project Architecture - Contributing - Adding New Tools](./system/project_architecture.md#adding-new-tools)
82 | - Example: See existing tools in `src/scrapegraph_mcp/server.py`
83 |
84 | **...how tools are defined:**
85 | - Read: [MCP Protocol - Tool Schema](./system/mcp_protocol.md#tool-schema)
86 | - Code: `src/scrapegraph_mcp/server.py` (lines 232-372)
87 |
88 | **...how to debug MCP issues:**
89 | - Read: [MCP Protocol - Debugging MCP](./system/mcp_protocol.md#debugging-mcp)
90 | - Tools: MCP Inspector, Claude Desktop logs
91 |
92 | **...how to deploy:**
93 | - Read: [Project Architecture - Deployment](./system/project_architecture.md#deployment)
94 | - Options: Smithery (automated), Docker, pip install
95 |
96 | **...available tools and their parameters:**
97 | - Read: [Project Architecture - MCP Tools](./system/project_architecture.md#mcp-tools)
98 | - Quick reference: 5 tools (markdownify, smartscraper, searchscraper, smartcrawler_initiate, smartcrawler_fetch_results)
99 |
100 | **...error handling:**
101 | - Read: [MCP Protocol - Error Handling](./system/mcp_protocol.md#error-handling)
102 | - Pattern: Return `{"error": "message"}` instead of raising exceptions
103 |
104 | **...how SmartCrawler works:**
105 | - Read: [Project Architecture - Tool #4 & #5](./system/project_architecture.md#4-smartcrawler_initiate)
106 | - Pattern: Initiate (async) → Poll fetch_results until complete
107 |
108 | ---
109 |
110 | ## 🛠️ Development Workflows
111 |
112 | ### Running Locally
113 |
114 | ```bash
115 | # Install dependencies
116 | pip install -e ".[dev]"
117 |
118 | # Set API key
119 | export SGAI_API_KEY=your-api-key
120 |
121 | # Run server
122 | scrapegraph-mcp
123 | # or
124 | python -m scrapegraph_mcp.server
125 | ```
126 |
127 | ### Testing
128 |
129 | **Manual Testing (MCP Inspector):**
130 | ```bash
131 | npx @modelcontextprotocol/inspector scrapegraph-mcp
132 | ```
133 |
134 | **Manual Testing (stdio):**
135 | ```bash
136 | echo '{"jsonrpc":"2.0","method":"tools/call","params":{"name":"markdownify","arguments":{"website_url":"https://scrapegraphai.com"}},"id":1}' | scrapegraph-mcp
137 | ```
138 |
139 | **Integration Testing (Claude Desktop):**
140 | 1. Configure MCP server in Claude Desktop
141 | 2. Restart Claude
142 | 3. Ask: "Convert https://scrapegraphai.com to markdown"
143 | 4. Verify tool invocation and results
144 |
145 | ### Code Quality
146 |
147 | ```bash
148 | # Linting
149 | ruff check src/
150 |
151 | # Type checking
152 | mypy src/
153 |
154 | # Format checking
155 | ruff format --check src/
156 | ```
157 |
158 | ### Building Docker Image
159 |
160 | ```bash
161 | # Build
162 | docker build -t scrapegraph-mcp .
163 |
164 | # Run
165 | docker run -e SGAI_API_KEY=your-api-key scrapegraph-mcp
166 |
167 | # Test
168 | echo '{"jsonrpc":"2.0","method":"tools/list","id":1}' | docker run -i -e SGAI_API_KEY=your-api-key scrapegraph-mcp
169 | ```
170 |
171 | ---
172 |
173 | ## 📊 MCP Tools Reference
174 |
175 | Quick reference to all MCP tools:
176 |
177 | | Tool | Parameters | Purpose | Credits | Async |
178 | |------|------------|---------|---------|-------|
179 | | `markdownify` | `website_url` | Convert webpage to markdown | 2 | No |
180 | | `smartscraper` | `user_prompt`, `website_url`, `number_of_scrolls?`, `markdown_only?` | AI-powered data extraction | 10+ | No |
181 | | `searchscraper` | `user_prompt`, `num_results?`, `number_of_scrolls?` | AI-powered web search | Variable | No |
182 | | `smartcrawler_initiate` | `url`, `prompt?`, `extraction_mode`, `depth?`, `max_pages?`, `same_domain_only?` | Start multi-page crawl | 100+ | Yes (returns request_id) |
183 | | `smartcrawler_fetch_results` | `request_id` | Get crawl results | N/A | No (polls status) |
184 |
185 | For detailed tool documentation, see [Project Architecture - MCP Tools](./system/project_architecture.md#mcp-tools).
186 |
187 | ---
188 |
189 | ## 🔧 Key Files Reference
190 |
191 | ### Core Files
192 | - `src/scrapegraph_mcp/server.py` - Main server implementation (all code)
193 | - `src/scrapegraph_mcp/__init__.py` - Package initialization
194 |
195 | ### Configuration
196 | - `pyproject.toml` - Project metadata, dependencies, build config
197 | - `Dockerfile` - Docker container definition
198 | - `smithery.yaml` - Smithery deployment config
199 |
200 | ### Documentation
201 | - `README.md` - User-facing documentation
202 | - `.agent/README.md` - This file (developer documentation index)
203 | - `.agent/system/project_architecture.md` - Architecture documentation
204 | - `.agent/system/mcp_protocol.md` - MCP protocol documentation
205 |
206 | ---
207 |
208 | ## 🚨 Troubleshooting
209 |
210 | ### Common Issues
211 |
212 | **Issue: "ScapeGraph client not initialized"**
213 | - **Cause:** Missing `SGAI_API_KEY` environment variable
214 | - **Solution:** Set `export SGAI_API_KEY=your-api-key` or pass via `--config`
215 |
216 | **Issue: "Error 401: Unauthorized"**
217 | - **Cause:** Invalid API key
218 | - **Solution:** Verify API key at [dashboard.scrapegraphai.com](https://dashboard.scrapegraphai.com)
219 |
220 | **Issue: "Error 402: Payment Required"**
221 | - **Cause:** Insufficient credits
222 | - **Solution:** Add credits to your ScrapeGraphAI account
223 |
224 | **Issue: Tools not appearing in Claude Desktop**
225 | - **Cause:** Server not starting or config error
226 | - **Solution:** Check Claude logs at `~/Library/Logs/Claude/` (macOS)
227 |
228 | **Issue: SmartCrawler not returning results**
229 | - **Cause:** Still processing (async operation)
230 | - **Solution:** Keep polling `smartcrawler_fetch_results()` until `status == "completed"`
231 |
232 | **Issue: Python version error**
233 | - **Cause:** Python < 3.10
234 | - **Solution:** Upgrade Python to 3.10+
235 |
236 | For more troubleshooting, see:
237 | - [Project Architecture - Troubleshooting](./system/project_architecture.md#troubleshooting)
238 | - [MCP Protocol - Debugging MCP](./system/mcp_protocol.md#debugging-mcp)
239 |
240 | ---
241 |
242 | ## 🤝 Contributing
243 |
244 | ### Before Making Changes
245 |
246 | 1. **Read relevant documentation** - Understand MCP and the server architecture
247 | 2. **Check existing issues** - Avoid duplicate work
248 | 3. **Test locally** - Use MCP Inspector to verify changes
249 | 4. **Test with clients** - Verify with Claude Desktop or Cursor
250 |
251 | ### Adding a New Tool
252 |
253 | **Step-by-step guide:**
254 |
255 | 1. **Add method to `ScapeGraphClient` class:**
256 | ```python
257 | def new_tool(self, param: str) -> Dict[str, Any]:
258 | """Tool description."""
259 | url = f"{self.BASE_URL}/new-endpoint"
260 | data = {"param": param}
261 | response = self.client.post(url, headers=self.headers, json=data)
262 | if response.status_code != 200:
263 | raise Exception(f"Error {response.status_code}: {response.text}")
264 | return response.json()
265 | ```
266 |
267 | 2. **Add MCP tool decorator:**
268 | ```python
269 | @mcp.tool()
270 | def new_tool(param: str) -> Dict[str, Any]:
271 | """
272 | Tool description for AI assistants.
273 |
274 | Args:
275 | param: Parameter description
276 |
277 | Returns:
278 | Dictionary containing results
279 | """
280 | if scrapegraph_client is None:
281 | return {"error": "ScapeGraph client not initialized. Please provide an API key."}
282 |
283 | try:
284 | return scrapegraph_client.new_tool(param)
285 | except Exception as e:
286 | return {"error": str(e)}
287 | ```
288 |
289 | 3. **Test with MCP Inspector:**
290 | ```bash
291 | npx @modelcontextprotocol/inspector scrapegraph-mcp
292 | ```
293 |
294 | 4. **Update documentation:**
295 | - Add tool to [Project Architecture - MCP Tools](./system/project_architecture.md#mcp-tools)
296 | - Add schema to [MCP Protocol - Tool Schema](./system/mcp_protocol.md#tool-schema)
297 | - Update tool reference table in this README
298 |
299 | 5. **Submit pull request**
300 |
301 | ### Development Process
302 |
303 | 1. **Make changes** - Edit `src/scrapegraph_mcp/server.py`
304 | 2. **Run linting** - `ruff check src/`
305 | 3. **Run type checking** - `mypy src/`
306 | 4. **Test locally** - MCP Inspector + Claude Desktop
307 | 5. **Update docs** - Keep `.agent/` docs in sync
308 | 6. **Commit** - Clear commit message
309 | 7. **Create PR** - Describe changes thoroughly
310 |
311 | ### Code Style
312 |
313 | - **Ruff:** Line length 100, target Python 3.12
314 | - **mypy:** Strict mode, disallow untyped defs
315 | - **Type hints:** Always use type hints for parameters and return values
316 | - **Docstrings:** Google-style docstrings for all public functions
317 | - **Error handling:** Return error dicts, don't raise exceptions in tools
318 |
319 | ---
320 |
321 | ## 📖 External Documentation
322 |
323 | ### MCP Resources
324 | - [Model Context Protocol Specification](https://modelcontextprotocol.io/)
325 | - [MCP Python SDK](https://github.com/modelcontextprotocol/python-sdk)
326 | - [FastMCP Framework](https://github.com/jlowin/fastmcp)
327 | - [MCP Inspector](https://github.com/modelcontextprotocol/inspector)
328 |
329 | ### ScrapeGraphAI Resources
330 | - [ScrapeGraphAI Homepage](https://scrapegraphai.com)
331 | - [ScrapeGraphAI Dashboard](https://dashboard.scrapegraphai.com)
332 | - [ScrapeGraphAI API Documentation](https://api.scrapegraphai.com/docs)
333 |
334 | ### AI Assistant Integration
335 | - [Claude Desktop](https://claude.ai/desktop)
336 | - [Cursor](https://cursor.sh/)
337 | - [Smithery MCP Distribution](https://smithery.ai/)
338 |
339 | ### Development Tools
340 | - [Python httpx](https://www.python-httpx.org/)
341 | - [Ruff Linter](https://docs.astral.sh/ruff/)
342 | - [mypy Type Checker](https://mypy-lang.org/)
343 |
344 | ---
345 |
346 | ## 📝 Documentation Maintenance
347 |
348 | ### When to Update Documentation
349 |
350 | **Update `.agent/system/project_architecture.md` when:**
351 | - Adding new MCP tools
352 | - Changing tool parameters or return types
353 | - Updating deployment methods
354 | - Modifying technology stack
355 |
356 | **Update `.agent/system/mcp_protocol.md` when:**
357 | - Changing MCP protocol implementation
358 | - Adding new communication patterns
359 | - Modifying error handling strategy
360 | - Updating authentication method
361 |
362 | **Update `.agent/README.md` when:**
363 | - Adding new documentation files
364 | - Changing development workflows
365 | - Updating quick start instructions
366 |
367 | ### Documentation Best Practices
368 |
369 | 1. **Keep it current** - Update docs with code changes in the same PR
370 | 2. **Be specific** - Include code snippets, file paths, line numbers
371 | 3. **Include examples** - Show real-world usage patterns
372 | 4. **Link related sections** - Cross-reference between documents
373 | 5. **Test examples** - Verify all code examples work
374 |
375 | ---
376 |
377 | ## 📅 Changelog
378 |
379 | ### October 2025
380 | - ✅ Initial comprehensive documentation created
381 | - ✅ Project architecture fully documented
382 | - ✅ MCP protocol integration documented
383 | - ✅ All 5 MCP tools documented
384 | - ✅ SmartCrawler integration (initiate + fetch_results)
385 | - ✅ Deployment guides (Smithery, Docker, Claude Desktop, Cursor)
386 | - ✅ Recent updates: Enhanced error handling, extraction mode validation
387 |
388 | ---
389 |
390 | ## 🔗 Quick Links
391 |
392 | - [Main README](../README.md) - User-facing documentation
393 | - [Server Implementation](../src/scrapegraph_mcp/server.py) - All code (single file)
394 | - [pyproject.toml](../pyproject.toml) - Project metadata
395 | - [Dockerfile](../Dockerfile) - Docker configuration
396 | - [smithery.yaml](../smithery.yaml) - Smithery config
397 | - [GitHub Repository](https://github.com/ScrapeGraphAI/scrapegraph-mcp)
398 |
399 | ---
400 |
401 | ## 📧 Support
402 |
403 | For questions or issues:
404 | 1. Check this documentation first
405 | 2. Review [Project Architecture](./system/project_architecture.md) and [MCP Protocol](./system/mcp_protocol.md)
406 | 3. Test with [MCP Inspector](https://github.com/modelcontextprotocol/inspector)
407 | 4. Search [GitHub issues](https://github.com/ScrapeGraphAI/scrapegraph-mcp/issues)
408 | 5. Create a new issue with detailed information
409 |
410 | ---
411 |
412 | **Made with ❤️ by [ScrapeGraphAI](https://scrapegraphai.com) Team**
413 |
414 | **Happy Coding! 🚀**
415 |
```
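
The "initiate, then poll" flow for SmartCrawler described above is easy to get subtly wrong, so here is a minimal sketch of the polling loop in Python. It assumes a generic `call_tool(name, arguments)` callable standing in for whatever mechanism your MCP client uses to invoke server tools (for example `ClientSession.call_tool` in the MCP Python SDK); the `request_id` field and the `"completed"` status come from the tool reference above, while the `"failed"` status check and the `poll_interval` default are illustrative assumptions.

```python
import time
from typing import Any, Callable, Dict


def wait_for_crawl(
    call_tool: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    start_url: str,
    prompt: str,
    poll_interval: float = 5.0,
) -> Dict[str, Any]:
    """Start a SmartCrawler job and poll until it finishes."""
    # Kick off the asynchronous crawl; the server returns a request_id immediately.
    job = call_tool("smartcrawler_initiate", {
        "url": start_url,
        "prompt": prompt,
        "extraction_mode": "ai",
    })
    request_id = job["request_id"]

    # Poll smartcrawler_fetch_results until the status field reports completion.
    while True:
        result = call_tool("smartcrawler_fetch_results", {"request_id": request_id})
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":  # assumed failure value; check the API docs
            raise RuntimeError(f"Crawl {request_id} failed: {result}")
        time.sleep(poll_interval)
```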
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
```markdown
1 | # ScrapeGraph MCP Server
2 |
3 | [License: MIT](https://opensource.org/licenses/MIT)
4 | [Python](https://www.python.org/downloads/)
5 | [Smithery](https://smithery.ai/server/@ScrapeGraphAI/scrapegraph-mcp)
6 |
7 |
8 | A production-ready [Model Context Protocol](https://modelcontextprotocol.io/introduction) (MCP) server that provides seamless integration with the [ScrapeGraph AI](https://scrapegraphai.com) API. This server enables language models to leverage advanced AI-powered web scraping capabilities with enterprise-grade reliability.
9 |
10 | ## Table of Contents
11 |
12 | - [Key Features](#key-features)
13 | - [Quick Start](#quick-start)
14 | - [Available Tools](#available-tools)
15 | - [Setup Instructions](#setup-instructions)
16 | - [Local Usage](#local-usage)
17 | - [Google ADK Integration](#google-adk-integration)
18 | - [Example Use Cases](#example-use-cases)
19 | - [Error Handling](#error-handling)
20 | - [Common Issues](#common-issues)
21 | - [Development](#development)
22 | - [Contributing](#contributing)
23 | - [Documentation](#documentation)
24 | - [Technology Stack](#technology-stack)
25 | - [License](#license)
26 |
27 | ## Key Features
28 |
29 | - **8 Powerful Tools**: From simple markdown conversion to complex multi-page crawling and agentic workflows
30 | - **AI-Powered Extraction**: Intelligently extract structured data using natural language prompts
31 | - **Multi-Page Crawling**: SmartCrawler supports asynchronous crawling with configurable depth and page limits
32 | - **Infinite Scroll Support**: Handle dynamic content loading with configurable scroll counts
33 | - **JavaScript Rendering**: Full support for JavaScript-heavy websites
34 | - **Flexible Output Formats**: Get results as markdown, structured JSON, or custom schemas
35 | - **Easy Integration**: Works seamlessly with Claude Desktop, Cursor, and any MCP-compatible client
36 | - **Enterprise-Ready**: Robust error handling, timeout management, and production-tested reliability
37 | - **Simple Deployment**: One-command installation via Smithery or manual setup
38 | - **Comprehensive Documentation**: Detailed developer docs in `.agent/` folder
39 |
40 | ## Quick Start
41 |
42 | ### 1. Get Your API Key
43 |
44 | Sign up and get your API key from the [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)
45 |
46 | ### 2. Install with Smithery (Recommended)
47 |
48 | ```bash
49 | npx -y @smithery/cli install @ScrapeGraphAI/scrapegraph-mcp --client claude
50 | ```
51 |
52 | ### 3. Start Using
53 |
54 | Ask Claude or Cursor:
55 | - "Convert https://scrapegraphai.com to markdown"
56 | - "Extract all product prices from this e-commerce page"
57 | - "Research the latest AI developments and summarize findings"
58 |
59 | That's it! The server is now available to your AI assistant.
60 |
61 | ## Available Tools
62 |
63 | The server provides **8 enterprise-ready tools** for AI-powered web scraping:
64 |
65 | ### Core Scraping Tools
66 |
67 | #### 1. `markdownify`
68 | Transform any webpage into clean, structured markdown format.
69 |
70 | ```python
71 | markdownify(website_url: str)
72 | ```
73 | - **Credits**: 2 per request
74 | - **Use case**: Quick webpage content extraction in markdown
75 |
76 | #### 2. `smartscraper`
77 | Leverage AI to extract structured data from any webpage with support for infinite scrolling.
78 |
79 | ```python
80 | smartscraper(
81 | user_prompt: str,
82 | website_url: str,
83 | number_of_scrolls: int = None,
84 | markdown_only: bool = None
85 | )
86 | ```
87 | - **Credits**: 10+ (base) + variable based on scrolling
88 | - **Use case**: AI-powered data extraction with custom prompts
89 |
90 | #### 3. `searchscraper`
91 | Execute AI-powered web searches with structured, actionable results.
92 |
93 | ```python
94 | searchscraper(
95 | user_prompt: str,
96 | num_results: int = None,
97 | number_of_scrolls: int = None
98 | )
99 | ```
100 | - **Credits**: Variable (3-20 websites × 10 credits)
101 | - **Use case**: Multi-source research and data aggregation
102 |
103 | ### Advanced Scraping Tools
104 |
105 | #### 4. `scrape`
106 | Basic scraping endpoint to fetch page content with optional heavy JavaScript rendering.
107 |
108 | ```python
109 | scrape(website_url: str, render_heavy_js: bool = None)
110 | ```
111 | - **Use case**: Simple page content fetching with JS rendering support
112 |
113 | #### 5. `sitemap`
114 | Extract sitemap URLs and structure for any website.
115 |
116 | ```python
117 | sitemap(website_url: str)
118 | ```
119 | - **Use case**: Website structure analysis and URL discovery
120 |
121 | ### Multi-Page Crawling
122 |
123 | #### 6. `smartcrawler_initiate`
124 | Initiate intelligent multi-page web crawling (asynchronous operation).
125 |
126 | ```python
127 | smartcrawler_initiate(
128 | url: str,
129 | prompt: str = None,
130 | extraction_mode: str = "ai",
131 | depth: int = None,
132 | max_pages: int = None,
133 | same_domain_only: bool = None
134 | )
135 | ```
136 | - **AI Extraction Mode**: 10 credits per page - extracts structured data
137 | - **Markdown Mode**: 2 credits per page - converts to markdown
138 | - **Returns**: `request_id` for polling
139 | - **Use case**: Large-scale website crawling and data extraction
140 |
141 | #### 7. `smartcrawler_fetch_results`
142 | Retrieve results from asynchronous crawling operations.
143 |
144 | ```python
145 | smartcrawler_fetch_results(request_id: str)
146 | ```
147 | - **Returns**: Status and results when crawling is complete
148 | - **Use case**: Poll for crawl completion and retrieve results
149 |
150 | ### Intelligent Agent-Based Scraping
151 |
152 | #### 8. `agentic_scrapper`
153 | Run advanced agentic scraping workflows with customizable steps and structured output schemas.
154 |
155 | ```python
156 | agentic_scrapper(
157 | url: str,
158 | user_prompt: str = None,
159 | output_schema: dict = None,
160 | steps: list = None,
161 | ai_extraction: bool = None,
162 | persistent_session: bool = None,
163 | timeout_seconds: float = None
164 | )
165 | ```
166 | - **Use case**: Complex multi-step workflows with custom schemas and persistent sessions
167 |
168 | ## Setup Instructions
169 |
170 | To utilize this server, you'll need a ScrapeGraph API key. Follow these steps to obtain one:
171 |
172 | 1. Navigate to the [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)
173 | 2. Create an account and generate your API key
174 |
175 | ### Automated Installation via Smithery
176 |
177 | For automated installation of the ScrapeGraph API Integration Server using [Smithery](https://smithery.ai/server/@ScrapeGraphAI/scrapegraph-mcp):
178 |
179 | ```bash
180 | npx -y @smithery/cli install @ScrapeGraphAI/scrapegraph-mcp --client claude
181 | ```
182 |
183 | ### Claude Desktop Configuration
184 |
185 | Update your Claude Desktop configuration file with the following settings:
186 |
187 | (remember to add your API key inside the config)
188 |
189 | ```json
190 | {
191 | "mcpServers": {
192 | "@ScrapeGraphAI-scrapegraph-mcp": {
193 | "command": "npx",
194 | "args": [
195 | "-y",
196 | "@smithery/cli@latest",
197 | "run",
198 | "@ScrapeGraphAI/scrapegraph-mcp",
199 | "--config",
200 | "\"{\\\"scrapegraphApiKey\\\":\\\"YOUR-SGAI-API-KEY\\\"}\""
201 | ]
202 | }
203 | }
204 | }
205 | ```
206 |
207 | The configuration file is located at:
208 | - Windows: `%APPDATA%/Claude/claude_desktop_config.json`
209 | - macOS: `~/Library/Application\ Support/Claude/claude_desktop_config.json`
210 |
211 | ### Cursor Integration
212 |
213 | Add the ScrapeGraphAI MCP server in Cursor's MCP settings:
214 |
215 | ![ScrapeGraphAI MCP server in Cursor settings](assets/cursor_mcp.png)
216 |
217 | ## Local Usage
218 |
219 | To run the MCP server locally for development or testing, follow these steps:
220 |
221 | ### Prerequisites
222 |
223 | - Python 3.13 or higher
224 | - pip or uv package manager
225 | - ScrapeGraph API key
226 |
227 | ### Installation
228 |
229 | 1. **Clone the repository** (if you haven't already):
230 |
231 | ```bash
232 | git clone https://github.com/ScrapeGraphAI/scrapegraph-mcp
233 | cd scrapegraph-mcp
234 | ```
235 |
236 | 2. **Install the package**:
237 |
238 | ```bash
239 | # Using pip
240 | pip install -e .
241 |
242 | # Or using uv (faster)
243 | uv pip install -e .
244 | ```
245 |
246 | 3. **Set your API key**:
247 |
248 | ```bash
249 | # macOS/Linux
250 | export SGAI_API_KEY=your-api-key-here
251 |
252 | # Windows (PowerShell)
253 | $env:SGAI_API_KEY="your-api-key-here"
254 |
255 | # Windows (CMD)
256 | set SGAI_API_KEY=your-api-key-here
257 | ```
258 |
259 | ### Running the Server Locally
260 |
261 | You can run the server directly:
262 |
263 | ```bash
264 | # Using the installed command
265 | scrapegraph-mcp
266 |
267 | # Or using Python module
268 | python -m scrapegraph_mcp.server
269 | ```
270 |
271 | The server will start and communicate via stdio (standard input/output), which is the standard MCP transport method.
272 |
273 | ### Testing with MCP Inspector
274 |
275 | Test your local server using the MCP Inspector tool:
276 |
277 | ```bash
278 | npx @modelcontextprotocol/inspector python -m scrapegraph_mcp.server
279 | ```
280 |
281 | This provides a web interface to test all available tools interactively.
282 |
283 | ### Configuring Claude Desktop for Local Server
284 |
285 | To use your locally running server with Claude Desktop, update your configuration file:
286 |
287 | **macOS/Linux** (`~/Library/Application Support/Claude/claude_desktop_config.json`):
288 |
289 | ```json
290 | {
291 | "mcpServers": {
292 | "scrapegraph-mcp-local": {
293 | "command": "python",
294 | "args": [
295 | "-m",
296 | "scrapegraph_mcp.server"
297 | ],
298 | "env": {
299 | "SGAI_API_KEY": "your-api-key-here"
300 | }
301 | }
302 | }
303 | }
304 | ```
305 |
306 | **Windows** (`%APPDATA%\Claude\claude_desktop_config.json`):
307 |
308 | ```json
309 | {
310 | "mcpServers": {
311 | "scrapegraph-mcp-local": {
312 | "command": "python",
313 | "args": [
314 | "-m",
315 | "scrapegraph_mcp.server"
316 | ],
317 | "env": {
318 | "SGAI_API_KEY": "your-api-key-here"
319 | }
320 | }
321 | }
322 | }
323 | ```
324 |
325 | **Note**: Make sure Python is in your PATH. You can verify by running `python --version` in your terminal.
326 |
327 | ### Configuring Cursor for Local Server
328 |
329 | In Cursor's MCP settings, add a new server with:
330 |
331 | - **Command**: `python`
332 | - **Args**: `["-m", "scrapegraph_mcp.server"]`
333 | - **Environment Variables**: `{"SGAI_API_KEY": "your-api-key-here"}`
334 |
335 | ### Troubleshooting Local Setup
336 |
337 | **Server not starting:**
338 | - Verify Python is installed: `python --version`
339 | - Check that the package is installed: `pip list | grep scrapegraph-mcp`
340 | - Ensure API key is set: `echo $SGAI_API_KEY` (macOS/Linux) or `echo %SGAI_API_KEY%` (Windows)
341 |
342 | **Tools not appearing:**
343 | - Check Claude Desktop logs:
344 | - macOS: `~/Library/Logs/Claude/`
345 | - Windows: `%APPDATA%\Claude\Logs\`
346 | - Verify the server starts without errors when run directly
347 | - Check that the configuration JSON is valid
348 |
349 | **Import errors:**
350 | - Reinstall the package: `pip install -e . --force-reinstall`
351 | - Verify dependencies: `pip install -r requirements.txt` (if available)
352 |
353 | ## Google ADK Integration
354 |
355 | The ScrapeGraph MCP server can be integrated with [Google ADK (Agent Development Kit)](https://github.com/google/adk) to create AI agents with web scraping capabilities.
356 |
357 | ### Prerequisites
358 |
359 | - Python 3.13 or higher
360 | - Google ADK installed
361 | - ScrapeGraph API key
362 |
363 | ### Installation
364 |
365 | 1. **Install Google ADK** (if not already installed):
366 |
367 | ```bash
368 | pip install google-adk
369 | ```
370 |
371 | 2. **Set your API key**:
372 |
373 | ```bash
374 | export SGAI_API_KEY=your-api-key-here
375 | ```
376 |
377 | ### Basic Integration Example
378 |
379 | Create an agent file (e.g., `agent.py`) with the following configuration:
380 |
381 | ```python
382 | import os
383 | from google.adk.agents import LlmAgent
384 | from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset
385 | from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
386 | from mcp import StdioServerParameters
387 |
388 | # Path to the scrapegraph-mcp server directory
389 | SCRAPEGRAPH_MCP_PATH = "/path/to/scrapegraph-mcp"
390 |
391 | # Path to the server.py file
392 | SERVER_SCRIPT_PATH = os.path.join(
393 | SCRAPEGRAPH_MCP_PATH,
394 | "src",
395 | "scrapegraph_mcp",
396 | "server.py"
397 | )
398 |
399 | root_agent = LlmAgent(
400 | model='gemini-2.0-flash',
401 | name='scrapegraph_assistant_agent',
402 | instruction='Help the user with web scraping and data extraction using ScrapeGraph AI. '
403 | 'You can convert webpages to markdown, extract structured data using AI, '
404 | 'perform web searches, crawl multiple pages, and automate complex scraping workflows.',
405 | tools=[
406 | MCPToolset(
407 | connection_params=StdioConnectionParams(
408 | server_params=StdioServerParameters(
409 | command='python3',
410 | args=[
411 | SERVER_SCRIPT_PATH,
412 | ],
413 | env={
414 | 'SGAI_API_KEY': os.getenv('SGAI_API_KEY'),
415 | },
416 | ),
417 |                 timeout=300.0,
418 |             ),
419 |             # Optional: Filter which tools from the MCP server are exposed
420 |             # tool_filter=['markdownify', 'smartscraper', 'searchscraper'],
421 |         )
422 | ],
423 | )
424 | ```
425 |
426 | ### Configuration Options
427 |
428 | **Timeout Settings:**
429 | - Default timeout is 5 seconds, which may be too short for web scraping operations
430 | - Recommended: Set `timeout=300.0` (5 minutes), as in the example above
431 | - Adjust based on your use case (crawling operations may need even longer timeouts)
432 |
433 | **Tool Filtering:**
434 | - By default, all 8 tools are exposed to the agent
435 | - Use `tool_filter` to limit which tools are available:
436 | ```python
437 | tool_filter=['markdownify', 'smartscraper', 'searchscraper']
438 | ```
439 |
440 | **API Key Configuration:**
441 | - Set via environment variable: `export SGAI_API_KEY=your-key`
442 | - Or pass directly in `env` dict: `'SGAI_API_KEY': 'your-key-here'`
443 | - Environment variable approach is recommended for security
444 |
445 | ### Usage Example
446 |
447 | Once configured, your agent can use natural language to interact with web scraping tools:
448 |
449 | ```python
450 | # The agent can now handle queries like:
451 | # - "Convert https://example.com to markdown"
452 | # - "Extract all product prices from this e-commerce page"
453 | # - "Search for recent AI research papers and summarize them"
454 | # - "Crawl this documentation site and extract all API endpoints"
455 | ```
456 | For more information about Google ADK, visit the [official documentation](https://github.com/google/adk).
457 |
458 | ## Example Use Cases
459 |
460 | The server enables sophisticated queries across various scraping scenarios:
461 |
462 | ### Single Page Scraping
463 | - **Markdownify**: "Convert the ScrapeGraph documentation page to markdown"
464 | - **SmartScraper**: "Extract all product names, prices, and ratings from this e-commerce page"
465 | - **SmartScraper with scrolling**: "Scrape this infinite scroll page with 5 scrolls and extract all items"
466 | - **Basic Scrape**: "Fetch the HTML content of this JavaScript-heavy page with full rendering"
467 |
468 | ### Search and Research
469 | - **SearchScraper**: "Research and summarize recent developments in AI-powered web scraping"
470 | - **SearchScraper**: "Search for the top 5 articles about machine learning frameworks and extract key insights"
471 | - **SearchScraper**: "Find recent news about GPT-4 and provide a structured summary"
472 |
473 | ### Website Analysis
474 | - **Sitemap**: "Extract the complete sitemap structure from the ScrapeGraph website"
475 | - **Sitemap**: "Discover all URLs on this blog site"
476 |
477 | ### Multi-Page Crawling
478 | - **SmartCrawler (AI mode)**: "Crawl the entire documentation site and extract all API endpoints with descriptions"
479 | - **SmartCrawler (Markdown mode)**: "Convert all pages in the blog to markdown up to 2 levels deep"
480 | - **SmartCrawler**: "Extract all product information from an e-commerce site, maximum 100 pages, same domain only"
481 |
482 | ### Advanced Agentic Scraping
483 | - **Agentic Scraper**: "Navigate through a multi-step authentication form and extract user dashboard data"
484 | - **Agentic Scraper with schema**: "Follow pagination links and compile a dataset with schema: {title, author, date, content}"
485 | - **Agentic Scraper**: "Execute a complex workflow: login, navigate to reports, download data, and extract summary statistics"
486 |
487 | ## Error Handling
488 |
489 | The server implements robust error handling with detailed, actionable error messages for:
490 |
491 | - API authentication issues
492 | - Malformed URL structures
493 | - Network connectivity failures
494 | - Rate limiting and quota management
495 |
496 | ## Common Issues
497 |
498 | ### Windows-Specific Connection
499 |
500 | When running on Windows systems, you may need to use the following command to connect to the MCP server:
501 |
502 | ```bash
503 | C:\Windows\System32\cmd.exe /c npx -y @smithery/cli@latest run @ScrapeGraphAI/scrapegraph-mcp --config "{\"scrapegraphApiKey\":\"YOUR-SGAI-API-KEY\"}"
504 | ```
505 |
506 | This ensures proper execution in the Windows environment.
507 |
508 | ### Other Common Issues
509 |
510 | **"ScrapeGraph client not initialized"**
511 | - **Cause**: Missing API key
512 | - **Solution**: Set `SGAI_API_KEY` environment variable or provide via `--config`
513 |
514 | **"Error 401: Unauthorized"**
515 | - **Cause**: Invalid API key
516 | - **Solution**: Verify your API key at the [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)
517 |
518 | **"Error 402: Payment Required"**
519 | - **Cause**: Insufficient credits
520 | - **Solution**: Add credits to your ScrapeGraph account
521 |
522 | **SmartCrawler not returning results**
523 | - **Cause**: Still processing (asynchronous operation)
524 | - **Solution**: Keep polling `smartcrawler_fetch_results()` until status is "completed"
525 |
526 | **Tools not appearing in Claude Desktop**
527 | - **Cause**: Server not starting or configuration error
528 | - **Solution**: Check Claude logs at `~/Library/Logs/Claude/` (macOS) or `%APPDATA%\Claude\Logs\` (Windows)
529 |
530 | For detailed troubleshooting, see the [.agent documentation](.agent/README.md).
531 |
532 | ## Development
533 |
534 | ### Prerequisites
535 |
536 | - Python 3.13 or higher
537 | - pip or uv package manager
538 | - ScrapeGraph API key
539 |
540 | ### Installation from Source
541 |
542 | ```bash
543 | # Clone the repository
544 | git clone https://github.com/ScrapeGraphAI/scrapegraph-mcp
545 | cd scrapegraph-mcp
546 |
547 | # Install dependencies
548 | pip install -e ".[dev]"
549 |
550 | # Set your API key
551 | export SGAI_API_KEY=your-api-key
552 |
553 | # Run the server
554 | scrapegraph-mcp
555 | # or
556 | python -m scrapegraph_mcp.server
557 | ```
558 |
559 | ### Testing with MCP Inspector
560 |
561 | Test your server locally using the MCP Inspector tool:
562 |
563 | ```bash
564 | npx @modelcontextprotocol/inspector scrapegraph-mcp
565 | ```
566 |
567 | This provides a web interface to test all available tools.
568 |
569 | ### Code Quality
570 |
571 | **Linting:**
572 | ```bash
573 | ruff check src/
574 | ```
575 |
576 | **Type Checking:**
577 | ```bash
578 | mypy src/
579 | ```
580 |
581 | **Format Checking:**
582 | ```bash
583 | ruff format --check src/
584 | ```
585 |
586 | ### Project Structure
587 |
588 | ```
589 | scrapegraph-mcp/
590 | ├── src/
591 | │ └── scrapegraph_mcp/
592 | │ ├── __init__.py # Package initialization
593 | │ └── server.py # Main MCP server (all code in one file)
594 | ├── .agent/ # Developer documentation
595 | │ ├── README.md # Documentation index
596 | │ └── system/ # System architecture docs
597 | ├── assets/ # Images and badges
598 | ├── pyproject.toml # Project metadata & dependencies
599 | ├── smithery.yaml # Smithery deployment config
600 | └── README.md # This file
601 | ```
602 |
603 | ## Contributing
604 |
605 | We welcome contributions! Here's how you can help:
606 |
607 | ### Adding a New Tool
608 |
609 | 1. **Add method to `ScapeGraphClient` class** in [server.py](src/scrapegraph_mcp/server.py):
610 |
611 | ```python
612 | def new_tool(self, param: str) -> Dict[str, Any]:
613 | """Tool description."""
614 | url = f"{self.BASE_URL}/new-endpoint"
615 | data = {"param": param}
616 | response = self.client.post(url, headers=self.headers, json=data)
617 | if response.status_code != 200:
618 | raise Exception(f"Error {response.status_code}: {response.text}")
619 | return response.json()
620 | ```
621 |
622 | 2. **Add MCP tool decorator**:
623 |
624 | ```python
625 | @mcp.tool()
626 | def new_tool(param: str) -> Dict[str, Any]:
627 | """
628 | Tool description for AI assistants.
629 |
630 | Args:
631 | param: Parameter description
632 |
633 | Returns:
634 | Dictionary containing results
635 | """
636 | if scrapegraph_client is None:
637 | return {"error": "ScrapeGraph client not initialized. Please provide an API key."}
638 |
639 | try:
640 | return scrapegraph_client.new_tool(param)
641 | except Exception as e:
642 | return {"error": str(e)}
643 | ```
644 |
645 | 3. **Test with MCP Inspector**:
646 | ```bash
647 | npx @modelcontextprotocol/inspector scrapegraph-mcp
648 | ```
649 |
650 | 4. **Update documentation**:
651 | - Add tool to this README
652 | - Update [.agent documentation](.agent/README.md)
653 |
654 | 5. **Submit a pull request**
655 |
656 | ### Development Workflow
657 |
658 | 1. Fork the repository
659 | 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
660 | 3. Make your changes
661 | 4. Run linting and type checking
662 | 5. Test with MCP Inspector and Claude Desktop
663 | 6. Update documentation
664 | 7. Commit your changes (`git commit -m 'Add amazing feature'`)
665 | 8. Push to the branch (`git push origin feature/amazing-feature`)
666 | 9. Open a Pull Request
667 |
668 | ### Code Style
669 |
670 | - **Line length**: 100 characters
671 | - **Type hints**: Required for all functions
672 | - **Docstrings**: Google-style docstrings
673 | - **Error handling**: Return error dicts, don't raise exceptions in tools
674 | - **Python version**: Target 3.13+
675 |
676 | For detailed development guidelines, see the [.agent documentation](.agent/README.md).
677 |
678 | ## Documentation
679 |
680 | For comprehensive developer documentation, see:
681 |
682 | - **[.agent/README.md](.agent/README.md)** - Complete developer documentation index
683 | - **[.agent/system/project_architecture.md](.agent/system/project_architecture.md)** - System architecture and design
684 | - **[.agent/system/mcp_protocol.md](.agent/system/mcp_protocol.md)** - MCP protocol integration details
685 |
686 | ## Technology Stack
687 |
688 | ### Core Framework
689 | - **Python 3.13+** - Modern Python with type hints
690 | - **FastMCP** - Lightweight MCP server framework
691 | - **httpx 0.24.0+** - Modern async HTTP client
692 |
693 | ### Development Tools
694 | - **Ruff** - Fast Python linter and formatter
695 | - **mypy** - Static type checker
696 | - **Hatchling** - Modern build backend
697 |
698 | ### Deployment
699 | - **Smithery** - Automated MCP server deployment
700 | - **Docker** - Container support (Python slim base image)
701 | - **stdio transport** - Standard MCP communication
702 |
703 | ### API Integration
704 | - **ScrapeGraph AI API** - Enterprise web scraping service
705 | - **Base URL**: `https://api.scrapegraphai.com/v1`
706 | - **Authentication**: API key-based
707 |
708 | ## License
709 |
710 | This project is distributed under the MIT License. For detailed terms and conditions, please refer to the LICENSE file.
711 |
712 | ## Acknowledgments
713 |
714 | Special thanks to [tomekkorbak](https://github.com/tomekkorbak) for his implementation of [oura-mcp-server](https://github.com/tomekkorbak/oura-mcp-server), which served as a starting point for this repository.
715 |
716 | ## Resources
717 |
718 | ### Official Links
719 | - [ScrapeGraph AI Homepage](https://scrapegraphai.com)
720 | - [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com) - Get your API key
721 | - [ScrapeGraph API Documentation](https://api.scrapegraphai.com/docs)
722 | - [GitHub Repository](https://github.com/ScrapeGraphAI/scrapegraph-mcp)
723 |
724 | ### MCP Resources
725 | - [Model Context Protocol](https://modelcontextprotocol.io/) - Official MCP specification
726 | - [FastMCP Framework](https://github.com/jlowin/fastmcp) - Framework used by this server
727 | - [MCP Inspector](https://github.com/modelcontextprotocol/inspector) - Testing tool
728 | - [Smithery](https://smithery.ai/server/@ScrapeGraphAI/scrapegraph-mcp) - MCP server distribution
729 | - mcp-name: io.github.ScrapeGraphAI/scrapegraph-mcp
730 |
731 | ### AI Assistant Integration
732 | - [Claude Desktop](https://claude.ai/desktop) - Desktop app with MCP support
733 | - [Cursor](https://cursor.sh/) - AI-powered code editor
734 |
735 | ### Support
736 | - [GitHub Issues](https://github.com/ScrapeGraphAI/scrapegraph-mcp/issues) - Report bugs or request features
737 | - [Developer Documentation](.agent/README.md) - Comprehensive dev docs
738 |
739 | ---
740 |
741 | Made with ❤️ by [ScrapeGraphAI](https://scrapegraphai.com) Team
742 |
```
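
Beyond Claude Desktop, Cursor, and Google ADK, the server can be driven from any custom MCP client over stdio. The snippet below is a minimal sketch using the official MCP Python SDK (`pip install mcp`), which is not part of this repository: the `scrapegraph-mcp` command, the `SGAI_API_KEY` variable, and the `markdownify` tool with its `website_url` argument come from the README above, while the client-side calls (`stdio_client`, `ClientSession`, `call_tool`) are standard SDK usage and may need adjusting to your SDK version.

```python
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the locally installed server over stdio; pass the parent environment
    # through so PATH and SGAI_API_KEY reach the child process.
    server = StdioServerParameters(
        command="scrapegraph-mcp",
        env=dict(os.environ),
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List the tools the server exposes, then call one of them.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            result = await session.call_tool(
                "markdownify",
                arguments={"website_url": "https://scrapegraphai.com"},
            )
            print(result.content)


if __name__ == "__main__":
    asyncio.run(main())
```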
--------------------------------------------------------------------------------
/CLAUDE.md:
--------------------------------------------------------------------------------
```markdown
1 | # DOCS
2 | We keep all important docs in the .agent folder and keep updating them, structured like below:
3 | .agent
4 | - Tasks: PRD & implementation plan for each feature
5 | - System: Document the current state of the system (project structure, tech stack, integration points, database schema, and core functionalities such as agent architecture, LLM layer, etc.)
6 | - SOP: Best practices for executing certain tasks (e.g. how to add a schema migration, how to add a new page route, etc.)
7 | - README.md: an index of all the documentation we have so people know what & where to look for things
8 | We should always update the .agent docs after we implement a feature, to make sure they fully reflect up-to-date information.
9 | Before you plan any implementation, always read .agent/README.md first to get context.
```
--------------------------------------------------------------------------------
/src/scrapegraph_mcp/__init__.py:
--------------------------------------------------------------------------------
```python
1 | """ScapeGraph MCP Server."""
2 |
3 | from .server import main
4 |
5 | __all__ = ["main"]
6 |
```
--------------------------------------------------------------------------------
/smithery.yaml:
--------------------------------------------------------------------------------
```yaml
1 | # Smithery configuration file: https://smithery.ai/docs/config#smitheryyaml
2 |
3 |
4 | configSchema:
5 | # JSON Schema defining the configuration options for the MCP.
6 | type: "object"
7 | required: ["scrapegraphApiKey"]
8 | properties:
9 | scrapegraphApiKey:
10 | type: "string"
11 | description: "Your Scrapegraph API key"
12 |
13 | runtime: "python"
```
--------------------------------------------------------------------------------
/.github/workflows/python-package.yml:
--------------------------------------------------------------------------------
```yaml
1 | name: Python Package
2 |
3 | on:
4 | push:
5 | branches: [ main ]
6 | pull_request:
7 | branches: [ main ]
8 |
9 | jobs:
10 | lint:
11 | runs-on: ubuntu-latest
12 | steps:
13 | - uses: actions/checkout@v3
14 | - name: Set up Python 3.12
15 | uses: actions/setup-python@v4
16 | with:
17 | python-version: "3.12"
18 | - name: Install dependencies
19 | run: |
20 | python -m pip install --upgrade pip
21 | python -m pip install ruff mypy
22 | pip install -e .
23 | - name: Lint with ruff
24 | run: |
25 | ruff check .
26 | - name: Type check with mypy
27 | run: |
28 | mypy src
```
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
```dockerfile
1 | # Use Python slim image
2 | FROM python:3.11-slim
3 |
4 | # Set working directory
5 | WORKDIR /app
6 |
7 | # Set Python unbuffered mode
8 | ENV PYTHONUNBUFFERED=1
9 |
10 | # Copy pyproject.toml and README.md first for better caching
11 | COPY pyproject.toml README.md ./
12 |
13 | # Copy the source code
14 | COPY src/ ./src/
15 |
16 | # Install the package and its dependencies from pyproject.toml
17 | RUN pip install --no-cache-dir .
18 |
19 | # Create non-root user
20 | RUN useradd -m -u 1000 mcpuser && \
21 | chown -R mcpuser:mcpuser /app
22 |
23 | # Switch to non-root user
24 | USER mcpuser
25 |
26 | # Run the server
27 | CMD ["python", "-m", "scrapegraph_mcp.server"]
28 |
29 |
```
--------------------------------------------------------------------------------
/server.json:
--------------------------------------------------------------------------------
```json
1 | {
2 | "$schema": "https://static.modelcontextprotocol.io/schemas/2025-10-17/server.schema.json",
3 | "name": "io.github.ScrapeGraphAI/scrapegraph-mcp",
4 | "description": "AI-powered web scraping and data extraction capabilities through ScrapeGraph API",
5 | "repository": {
6 | "url": "https://github.com/ScrapeGraphAI/scrapegraph-mcp",
7 | "source": "github"
8 | },
9 | "version": "1.0.0",
10 | "packages": [
11 | {
12 | "registryType": "pypi",
13 | "identifier": "scrapegraph-mcp",
14 | "version": "1.0.0",
15 | "transport": {
16 | "type": "stdio"
17 | },
18 | "environmentVariables": [
19 | {
20 | "description": "Your ScapeGraph API key (optional - can also be set via MCP config)",
21 | "isRequired": false,
22 | "format": "string",
23 | "isSecret": true,
24 | "name": "SGAI_API_KEY"
25 | }
26 | ]
27 | }
28 | ]
29 | }
```
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
```toml
1 | [project]
2 | name = "scrapegraph-mcp"
3 | version = "1.0.1"
4 | description = "MCP server for ScapeGraph API integration"
5 | license = {text = "MIT"}
6 | readme = "README.md"
7 | authors = [
8 | { name = "Marco Perini", email = "[email protected]" }
9 | ]
10 | requires-python = ">=3.10"
11 | dependencies = [
12 | "fastmcp>=2.0.0",
13 | "httpx>=0.24.0",
14 | "uvicorn>=0.27.0",
15 | "pydantic>=2.0.0",
16 | "smithery>=0.4.2",
17 | ]
18 | classifiers = [
19 | "Development Status :: 4 - Beta",
20 | "Intended Audience :: Developers",
21 | "License :: OSI Approved :: MIT License",
22 | "Programming Language :: Python :: 3",
23 | "Programming Language :: Python :: 3.10",
24 | "Topic :: Software Development :: Libraries :: Python Modules",
25 | ]
26 |
27 | [project.optional-dependencies]
28 | dev = [
29 | "ruff>=0.1.0",
30 | "mypy>=1.0.0",
31 | ]
32 |
33 | [project.urls]
34 | "Homepage" = "https://github.com/ScrapeGraphAI/scrapegraph-mcp"
35 | "Bug Tracker" = "https://github.com/ScrapeGraphAI/scrapegraph-mcp/issues"
36 |
37 | [project.scripts]
38 | scrapegraph-mcp = "scrapegraph_mcp.server:main"
39 |
40 | [build-system]
41 | requires = ["hatchling"]
42 | build-backend = "hatchling.build"
43 |
44 | [tool.hatch.build.targets.wheel]
45 | packages = ["src/scrapegraph_mcp"]
46 |
47 | [tool.hatch.build]
48 | only-packages = true
49 |
50 | [tool.ruff]
51 | line-length = 100
52 | target-version = "py312"
53 | select = ["E", "F", "I", "B", "W"]
54 |
55 | [tool.mypy]
56 | python_version = "3.12"
57 | warn_return_any = true
58 | warn_unused_configs = true
59 | disallow_untyped_defs = true
60 | disallow_incomplete_defs = true
61 |
62 | [tool.smithery]
63 | server = "scrapegraph_mcp.server:create_server"
```
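
For orientation, the two entry points declared above (`scrapegraph-mcp = "scrapegraph_mcp.server:main"` and `[tool.smithery] server = "scrapegraph_mcp.server:create_server"`) map onto a FastMCP module roughly as sketched below. This is an illustrative skeleton based on the initialization flow in the architecture docs, not the actual contents of `server.py`: the tool body is a placeholder, the `ScapeGraphClient` wrapper and configuration handling are omitted, and the import uses the standalone `fastmcp` package declared here (the architecture docs reference the equivalent `mcp.server.fastmcp` module).

```python
"""Illustrative skeleton of how the pyproject entry points wire into FastMCP."""

from typing import Any, Dict

from fastmcp import FastMCP

# Single FastMCP instance; tools are registered against it with decorators.
mcp = FastMCP("ScapeGraph API MCP Server")


@mcp.tool()
def markdownify(website_url: str) -> Dict[str, Any]:
    """Convert a webpage into clean markdown (real implementation elided)."""
    return {"error": "not implemented in this sketch"}


def create_server() -> FastMCP:
    # Referenced by [tool.smithery] server = "scrapegraph_mcp.server:create_server"
    return mcp


def main() -> None:
    # Referenced by the scrapegraph-mcp console script; stdio is the MCP transport.
    mcp.run(transport="stdio")


if __name__ == "__main__":
    main()
```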
--------------------------------------------------------------------------------
/.agent/system/project_architecture.md:
--------------------------------------------------------------------------------
```markdown
1 | # ScrapeGraph MCP Server - Project Architecture
2 |
3 | **Last Updated:** October 2025
4 | **Version:** 1.0.0
5 |
6 | ## Table of Contents
7 | - [System Overview](#system-overview)
8 | - [Technology Stack](#technology-stack)
9 | - [Project Structure](#project-structure)
10 | - [Core Architecture](#core-architecture)
11 | - [MCP Tools](#mcp-tools)
12 | - [API Integration](#api-integration)
13 | - [Deployment](#deployment)
14 | - [Recent Updates](#recent-updates)
15 |
16 | ---
17 |
18 | ## System Overview
19 |
20 | The ScrapeGraph MCP Server is a production-ready [Model Context Protocol](https://modelcontextprotocol.io/introduction) (MCP) server that provides seamless integration between AI assistants (like Claude, Cursor, etc.) and the [ScrapeGraphAI API](https://scrapegraphai.com). This server enables language models to leverage advanced AI-powered web scraping capabilities with enterprise-grade reliability.
21 |
22 | **Key Capabilities:**
23 | - **Markdownify** - Convert webpages to clean, structured markdown
24 | - **SmartScraper** - AI-powered structured data extraction from webpages
25 | - **SearchScraper** - AI-powered web searches with structured results
26 | - **SmartCrawler** - Intelligent multi-page web crawling with AI extraction or markdown conversion
27 |
28 | **Purpose:**
29 | - Bridge AI assistants (Claude, Cursor, etc.) with web scraping capabilities
30 | - Enable LLMs to extract structured data from any website
31 | - Provide clean, formatted markdown conversion of web content
32 | - Execute multi-page crawling operations with AI-powered extraction
33 |
34 | ---
35 |
36 | ## Technology Stack
37 |
38 | ### Core Framework
39 | - **Python 3.10+** - Programming language (minimum version)
40 | - **mcp[cli] 1.3.0+** - Model Context Protocol SDK for Python
41 | - **FastMCP** - Lightweight MCP server framework built on top of mcp
42 |
43 | ### HTTP Client
44 | - **httpx 0.24.0+** - Modern async HTTP client for API requests
45 | - **Timeout:** 60 seconds for all API requests
46 |
47 | ### Development Tools
48 | - **Ruff 0.1.0+** - Fast Python linter
49 | - **mypy 1.0.0+** - Static type checker
50 |
51 | ### Build System
52 | - **Hatchling** - Modern Python build backend
53 | - **pyproject.toml** - PEP 621 compliant project metadata
54 |
55 | ### Deployment
56 | - **Docker** - Containerization with a Python slim base image
57 | - **Smithery** - Automated MCP server deployment and distribution
58 | - **stdio transport** - Standard input/output for MCP communication
59 |
60 | ---
61 |
62 | ## Project Structure
63 |
64 | ```
65 | scrapegraph-mcp/
66 | ├── src/
67 | │ └── scrapegraph_mcp/
68 | │ ├── __init__.py # Package initialization
69 | │ └── server.py # Main MCP server implementation
70 | │
71 | ├── assets/
72 | │ ├── sgai_smithery.png # Smithery integration badge
73 | │ └── cursor_mcp.png # Cursor integration screenshot
74 | │
75 | ├── .github/
76 | │ └── workflows/ # CI/CD workflows (python-package.yml)
77 | │
78 | ├── pyproject.toml # Project metadata and dependencies
79 | ├── Dockerfile # Docker container definition
80 | ├── smithery.yaml # Smithery deployment configuration
81 | ├── README.md # User-facing documentation
82 | ├── LICENSE # MIT License
83 | └── .python-version # Python version specification
84 | ```
85 |
86 | ### Key Files
87 |
88 | **`src/scrapegraph_mcp/server.py`**
89 | - Main server implementation
90 | - `ScapeGraphClient` - API client wrapper
91 | - MCP tool definitions (`@mcp.tool()` decorators)
92 | - Server initialization and main entry point
93 |
94 | **`pyproject.toml`**
95 | - Project metadata (name, version, authors)
96 | - Dependencies (fastmcp, httpx, uvicorn, pydantic, smithery)
97 | - Build configuration (hatchling)
98 | - Tool configuration (ruff, mypy)
99 | - Entry point: `scrapegraph-mcp` → `scrapegraph_mcp.server:main`
100 |
101 | **`Dockerfile`**
102 | - Python 3.11 slim base image
103 | - Installs the package and its dependencies from `pyproject.toml`
104 | - Runs as a non-root `mcpuser` user
105 | - Command: `python -m scrapegraph_mcp.server`
106 |
107 | **`smithery.yaml`**
108 | - Smithery deployment configuration
109 | - Config schema (requires `scrapegraphApiKey`)
110 | - Python runtime declaration
111 |
112 | ---
113 |
114 | ## Core Architecture
115 |
116 | ### Model Context Protocol (MCP)
117 |
118 | The server implements the Model Context Protocol, which defines a standard way for AI assistants to interact with external tools and services.
119 |
120 | **MCP Components:**
121 | 1. **Server** - Exposes tools to AI assistants (this project)
122 | 2. **Client** - AI assistant that uses the tools (Claude, Cursor, etc.)
123 | 3. **Transport** - Communication layer (stdio)
124 | 4. **Tools** - Functions that the AI can call
125 |
126 | **Communication Flow:**
127 | ```
128 | AI Assistant (Claude/Cursor)
129 | ↓ (stdio via MCP)
130 | FastMCP Server (this project)
131 | ↓ (HTTPS API calls)
132 | ScrapeGraphAI API (https://api.scrapegraphai.com/v1)
133 | ↓ (web scraping)
134 | Target Websites
135 | ```
136 |
137 | ### Server Architecture
138 |
139 | The server follows a simple, single-file architecture:
140 |
141 | **`ScapeGraphClient` Class:**
142 | - HTTP client wrapper for ScrapeGraphAI API
143 | - Base URL: `https://api.scrapegraphai.com/v1`
144 | - API key authentication via `SGAI-APIKEY` header
145 | - Methods: `markdownify()`, `smartscraper()`, `searchscraper()`, `smartcrawler_initiate()`, `smartcrawler_fetch_results()`
146 |
147 | **FastMCP Server:**
148 | - Created with `FastMCP("ScapeGraph API MCP Server")`
149 | - Exposes tools via `@mcp.tool()` decorators
150 | - Tool functions wrap `ScapeGraphClient` methods
151 | - Error handling with try/except blocks
152 | - Returns dictionaries with results or error messages
153 |
154 | **Initialization Flow:**
155 | 1. Import dependencies (`httpx`, `mcp.server.fastmcp`)
156 | 2. Define `ScapeGraphClient` class
157 | 3. Create `FastMCP` server instance
158 | 4. Initialize `ScapeGraphClient` with API key from env or config
159 | 5. Define MCP tools with `@mcp.tool()` decorators
160 | 6. Start server with `mcp.run(transport="stdio")`
161 |
162 | ### Design Patterns
163 |
164 | **1. Wrapper Pattern**
165 | - `ScapeGraphClient` wraps the ScrapeGraphAI REST API
166 | - Simplifies API interactions with typed methods
167 | - Centralizes authentication and error handling
168 |
169 | **2. Decorator Pattern**
170 | - `@mcp.tool()` decorators expose functions as MCP tools
171 | - Automatic serialization/deserialization
172 | - Type hints → MCP schema generation
173 |
174 | **3. Singleton Pattern**
175 | - Single `scrapegraph_client` instance
176 | - Shared across all tool invocations
177 | - Reused HTTP client connection
178 |
179 | **4. Error Handling Pattern**
180 | - Try/except blocks in all tool functions
181 | - Return error dictionaries instead of raising exceptions
182 | - Ensures graceful degradation for AI assistants
183 |
184 | ---
185 |
186 | ## MCP Tools
187 |
188 | The server exposes 5 tools to AI assistants:
189 |
190 | ### 1. `markdownify(website_url: str)`
191 |
192 | **Purpose:** Convert a webpage into clean, formatted markdown
193 |
194 | **Parameters:**
195 | - `website_url` (str) - URL of the webpage to convert
196 |
197 | **Returns:**
198 | ```json
199 | {
200 | "result": "# Page Title\n\nContent in markdown format..."
201 | }
202 | ```
203 |
204 | **Error Response:**
205 | ```json
206 | {
207 | "error": "Error 404: Not Found"
208 | }
209 | ```
210 |
211 | **Example Usage (from AI):**
212 | ```
213 | "Convert https://scrapegraphai.com to markdown"
214 | → AI calls: markdownify("https://scrapegraphai.com")
215 | ```
216 |
217 | **API Endpoint:** `POST /v1/markdownify`
218 |
219 | **Credits:** 2 credits per request
220 |
221 | ---
222 |
223 | ### 2. `smartscraper(user_prompt: str, website_url: str, number_of_scrolls: int = None, markdown_only: bool = None)`
224 |
225 | **Purpose:** Extract structured data from a webpage using AI
226 |
227 | **Parameters:**
228 | - `user_prompt` (str) - Instructions for what data to extract
229 | - `website_url` (str) - URL of the webpage to scrape
230 | - `number_of_scrolls` (int, optional) - Number of infinite scrolls to perform
231 | - `markdown_only` (bool, optional) - Return only markdown without AI processing
232 |
233 | **Returns:**
234 | ```json
235 | {
236 | "result": {
237 | "extracted_field_1": "value1",
238 | "extracted_field_2": "value2"
239 | }
240 | }
241 | ```
242 |
243 | **Example Usage:**
244 | ```
245 | "Extract all product names and prices from https://example.com/products"
246 | → AI calls: smartscraper(
247 | user_prompt="Extract product names and prices",
248 | website_url="https://example.com/products"
249 | )
250 | ```
251 |
252 | **API Endpoint:** `POST /v1/smartscraper`
253 |
254 | **Credits:** 10 credits (base) + 1 credit per scroll + additional charges
255 |
256 | ---
257 |
258 | ### 3. `searchscraper(user_prompt: str, num_results: int = None, number_of_scrolls: int = None)`
259 |
260 | **Purpose:** Perform AI-powered web searches with structured results
261 |
262 | **Parameters:**
263 | - `user_prompt` (str) - Search query or instructions
264 | - `num_results` (int, optional) - Number of websites to search (default: 3 = 30 credits)
265 | - `number_of_scrolls` (int, optional) - Number of infinite scrolls per website
266 |
267 | **Returns:**
268 | ```json
269 | {
270 | "result": {
271 | "answer": "Aggregated answer from multiple sources",
272 | "sources": [
273 | {"url": "https://source1.com", "data": {...}},
274 | {"url": "https://source2.com", "data": {...}}
275 | ]
276 | }
277 | }
278 | ```
279 |
280 | **Example Usage:**
281 | ```
282 | "Research the latest AI developments in 2025"
283 | → AI calls: searchscraper(
284 | user_prompt="Latest AI developments in 2025",
285 | num_results=5
286 | )
287 | ```
288 |
289 | **API Endpoint:** `POST /v1/searchscraper`
290 |
291 | **Credits:** Variable (3-20 websites × 10 credits per website)
292 |
293 | ---
294 |
295 | ### 4. `smartcrawler_initiate(url: str, prompt: str = None, extraction_mode: str = "ai", depth: int = None, max_pages: int = None, same_domain_only: bool = None)`
296 |
297 | **Purpose:** Initiate intelligent multi-page web crawling (asynchronous)
298 |
299 | **Parameters:**
300 | - `url` (str) - Starting URL to crawl
301 | - `prompt` (str, optional) - AI prompt for data extraction (required for AI mode)
302 | - `extraction_mode` (str) - "ai" for AI extraction (10 credits/page) or "markdown" for markdown conversion (2 credits/page)
303 | - `depth` (int, optional) - Maximum link traversal depth
304 | - `max_pages` (int, optional) - Maximum number of pages to crawl
305 | - `same_domain_only` (bool, optional) - Crawl only within the same domain
306 |
307 | **Returns:**
308 | ```json
309 | {
310 | "request_id": "uuid-here",
311 | "status": "processing"
312 | }
313 | ```
314 |
315 | **Example Usage:**
316 | ```
317 | "Crawl https://docs.python.org and extract all function signatures"
318 | → AI calls: smartcrawler_initiate(
319 | url="https://docs.python.org",
320 | prompt="Extract function signatures and descriptions",
321 | extraction_mode="ai",
322 | max_pages=50,
323 | same_domain_only=True
324 | )
325 | ```
326 |
327 | **API Endpoint:** `POST /v1/crawl`
328 |
329 | **Credits:** 100 credits (base) + 10 credits per page (AI mode) or 2 credits per page (markdown mode)
330 |
331 | **Note:** This is an asynchronous operation. Use `smartcrawler_fetch_results()` to retrieve results.
332 |
333 | ---
334 |
335 | ### 5. `smartcrawler_fetch_results(request_id: str)`
336 |
337 | **Purpose:** Fetch the results of a SmartCrawler operation
338 |
339 | **Parameters:**
340 | - `request_id` (str) - The request ID returned by `smartcrawler_initiate()`
341 |
342 | **Returns (while processing):**
343 | ```json
344 | {
345 | "status": "processing",
346 | "pages_processed": 15,
347 | "total_pages": 50
348 | }
349 | ```
350 |
351 | **Returns (completed):**
352 | ```json
353 | {
354 | "status": "completed",
355 | "results": [
356 | {"url": "https://page1.com", "data": {...}},
357 | {"url": "https://page2.com", "data": {...}}
358 | ],
359 | "pages_processed": 50,
360 | "total_pages": 50
361 | }
362 | ```
363 |
364 | **Example Usage:**
365 | ```
366 | AI: "Check the status of crawl request abc-123"
367 | → AI calls: smartcrawler_fetch_results("abc-123")
368 |
369 | If status is "processing":
370 | → AI: "Still processing, 15/50 pages completed"
371 |
372 | If status is "completed":
373 | → AI: "Crawl complete! Here are the results..."
374 | ```
375 |
376 | **API Endpoint:** `GET /v1/crawl/{request_id}`
377 |
378 | **Polling Strategy:**
379 | - AI assistants should poll this endpoint until `status == "completed"`
380 | - Recommended polling interval: 5-10 seconds
381 | - Maximum wait time: ~30 minutes for large crawls
382 |
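The initiate/poll workflow can also be driven directly from Python against the client class in `server.py`. A hedged sketch (field names such as `request_id`, `status`, and `results` follow the examples above and may differ slightly in the live API):

```python
# Illustrative initiate/poll workflow using the ScapeGraphClient class from server.py.
import os
import time

from scrapegraph_mcp.server import ScapeGraphClient

client = ScapeGraphClient(api_key=os.environ["SGAI_API_KEY"])
try:
    job = client.smartcrawler_initiate(
        url="https://docs.python.org",
        prompt="Extract function signatures and descriptions",
        extraction_mode="ai",
        max_pages=50,
        same_domain_only=True,
    )
    request_id = job["request_id"]

    # Poll roughly every 10 seconds, giving up after ~30 minutes for large crawls.
    deadline = time.time() + 30 * 60
    while time.time() < deadline:
        result = client.smartcrawler_fetch_results(request_id)
        if result.get("status") == "completed":
            print(result.get("results"))
            break
        time.sleep(10)
finally:
    client.close()
```
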
383 | ---
384 |
385 | ## API Integration
386 |
387 | ### ScrapeGraphAI API
388 |
389 | **Base URL:** `https://api.scrapegraphai.com/v1`
390 |
391 | **Authentication:**
392 | - Header: `SGAI-APIKEY: your-api-key`
393 | - Obtain API key from: [ScrapeGraph Dashboard](https://dashboard.scrapegraphai.com)
394 |
395 | **Endpoints Used:**
396 |
397 | | Endpoint | Method | Tool |
398 | |----------|--------|------|
399 | | `/v1/markdownify` | POST | `markdownify()` |
400 | | `/v1/smartscraper` | POST | `smartscraper()` |
401 | | `/v1/searchscraper` | POST | `searchscraper()` |
402 | | `/v1/crawl` | POST | `smartcrawler_initiate()` |
403 | | `/v1/crawl/{request_id}` | GET | `smartcrawler_fetch_results()` |
404 |
405 | **Request Format:**
406 | ```json
407 | {
408 | "website_url": "https://example.com",
409 | "user_prompt": "Extract product names"
410 | }
411 | ```
412 |
413 | **Response Format:**
414 | ```json
415 | {
416 | "result": {...},
417 | "credits_used": 10
418 | }
419 | ```
420 |
421 | **Error Handling:**
422 | ```python
423 | response = self.client.post(url, headers=self.headers, json=data)
424 |
425 | if response.status_code != 200:
426 | error_msg = f"Error {response.status_code}: {response.text}"
427 | raise Exception(error_msg)
428 |
429 | return response.json()
430 | ```
431 |
432 | **HTTP Client Configuration:**
433 | - Library: `httpx`
434 | - Timeout: 120 seconds (`httpx.Timeout(120.0)`)
435 | - Synchronous client (not async)
436 |
437 | ### Credit System
438 |
439 | The MCP server is a pass-through to the ScrapeGraphAI API, so all credit costs are determined by the API:
440 |
441 | - **Markdownify:** 2 credits
442 | - **SmartScraper:** 10 credits (base) + variable
443 | - **SearchScraper:** Variable (websites × 10 credits)
444 | - **SmartCrawler (AI mode):** 100 + (pages × 10) credits
445 | - **SmartCrawler (Markdown mode):** 100 + (pages × 2) credits
446 |
447 | Credits are deducted from the API key balance on the ScrapeGraphAI platform.
448 |
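Budgeting a crawl from these figures is simple arithmetic (the 100-credit base plus the per-page cost). A rough helper, illustrative only and subject to the pricing published on the ScrapeGraphAI platform:

```python
# Rough credit estimate for a SmartCrawler job, based on the figures listed above.
def estimate_smartcrawler_credits(pages: int, extraction_mode: str = "ai") -> int:
    per_page = 10 if extraction_mode == "ai" else 2
    return 100 + pages * per_page


print(estimate_smartcrawler_credits(50, "ai"))        # 600 credits
print(estimate_smartcrawler_credits(50, "markdown"))  # 200 credits
```
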
449 | ---
450 |
451 | ## Deployment
452 |
453 | ### Installation Methods
454 |
455 | #### 1. Automated Installation via Smithery
456 |
457 | **Smithery** is the recommended deployment method for MCP servers.
458 |
459 | ```bash
460 | npx -y @smithery/cli install @ScrapeGraphAI/scrapegraph-mcp --client claude
461 | ```
462 |
463 | This automatically:
464 | - Installs the MCP server
465 | - Configures the AI client (Claude Desktop)
466 | - Prompts for API key
467 |
468 | #### 2. Manual Claude Desktop Configuration
469 |
470 | **macOS:** `~/Library/Application Support/Claude/claude_desktop_config.json`
471 | **Windows:** `%APPDATA%/Claude/claude_desktop_config.json`
472 |
473 | ```json
474 | {
475 | "mcpServers": {
476 | "@ScrapeGraphAI-scrapegraph-mcp": {
477 | "command": "npx",
478 | "args": [
479 | "-y",
480 | "@smithery/cli@latest",
481 | "run",
482 | "@ScrapeGraphAI/scrapegraph-mcp",
483 | "--config",
484 | "{\"scrapegraphApiKey\":\"YOUR-SGAI-API-KEY\"}"
485 | ]
486 | }
487 | }
488 | }
489 | ```
490 |
491 | **Windows-Specific Command:**
492 | ```bash
493 | C:\Windows\System32\cmd.exe /c npx -y @smithery/cli@latest run @ScrapeGraphAI/scrapegraph-mcp --config "{\"scrapegraphApiKey\":\"YOUR-SGAI-API-KEY\"}"
494 | ```
495 |
496 | #### 3. Cursor Integration
497 |
498 | Add the MCP server in Cursor settings:
499 |
500 | 1. Open Cursor settings
501 | 2. Navigate to MCP section
502 | 3. Add ScrapeGraphAI MCP server
503 | 4. Configure API key
504 |
505 | (See `assets/cursor_mcp.png` for screenshot)
506 |
507 | #### 4. Docker Deployment
508 |
509 | **Build:**
510 | ```bash
511 | docker build -t scrapegraph-mcp .
512 | ```
513 |
514 | **Run:**
515 | ```bash
516 | docker run -e SGAI_API_KEY=your-api-key scrapegraph-mcp
517 | ```
518 |
519 | **Dockerfile:**
520 | - Base: Python 3.12 Alpine
521 | - Build deps: gcc, musl-dev, libffi-dev
522 | - Install via pip: `pip install .`
523 | - Entrypoint: `scrapegraph-mcp`
524 |
525 | #### 5. Python Package Installation
526 |
527 | **From PyPI (once published):**
528 | ```bash
529 | pip install scrapegraph-mcp
530 | ```
531 |
532 | **From Source:**
533 | ```bash
534 | git clone https://github.com/ScrapeGraphAI/scrapegraph-mcp
535 | cd scrapegraph-mcp
536 | pip install .
537 | ```
538 |
539 | **Run:**
540 | ```bash
541 | export SGAI_API_KEY=your-api-key
542 | scrapegraph-mcp
543 | ```
544 |
545 | ### Configuration
546 |
547 | **API Key Sources (in order of precedence):**
548 | 1. `--config` parameter (Smithery): `"{\"scrapegraphApiKey\":\"key\"}"`
549 | 2. Environment variable: `SGAI_API_KEY`
550 | 3. Default: `None` (server fails to initialize)
551 |
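A condensed view of this resolution order, mirroring `get_api_key()` in `server.py` (the `config_key` argument here stands in for the Smithery session-config value):

```python
# Condensed API-key resolution, mirroring get_api_key() in server.py.
import os
from typing import Optional


def resolve_api_key(config_key: Optional[str] = None) -> str:
    api_key = config_key or os.getenv("SGAI_API_KEY")
    if not api_key:
        raise ValueError(
            "ScapeGraph API key is required: set scrapegraph_api_key in the MCP "
            "config or the SGAI_API_KEY environment variable."
        )
    return api_key
```
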
552 | **Server Transport:**
553 | - **stdio** - Standard input/output (default for MCP)
554 | - Communication via JSON-RPC over stdin/stdout
555 |
556 | ### Production Considerations
557 |
558 | **Error Handling:**
559 | - All tool functions return error dictionaries instead of raising exceptions
560 | - Prevents server crashes on API errors
561 | - Graceful degradation for AI assistants
562 |
563 | **Timeout:**
564 | - 120-second timeout for all API requests
565 | - Prevents hanging on slow websites
566 | - Consider increasing for large crawls
567 |
568 | **API Key Security:**
569 | - Never commit API keys to version control
570 | - Use environment variables or config files
571 | - Rotate keys periodically
572 |
573 | **Rate Limiting:**
574 | - Handled by the ScrapeGraphAI API
575 | - MCP server has no built-in rate limiting
576 | - Consider implementing client-side throttling for high-volume use
577 |
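One possible client-side throttle, shown as an illustrative sketch (`server.py` ships nothing like this today):

```python
# Illustrative client-side throttle: allow at most max_calls requests per period seconds.
import time
from collections import deque


class Throttle:
    def __init__(self, max_calls: int = 10, period: float = 60.0) -> None:
        self.max_calls = max_calls
        self.period = period
        self.calls: deque = deque()

    def wait(self) -> None:
        """Block until another request is allowed, then record the call."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the sliding window.
        while self.calls and now - self.calls[0] > self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```

Calling `throttle.wait()` before each `ScapeGraphClient` request keeps bursts under the chosen rate; tune `max_calls` and `period` to your plan.
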
578 | ---
579 |
580 | ## Recent Updates
581 |
582 | ### October 2025
583 |
584 | **SmartCrawler Integration (Latest):**
585 | - Added `smartcrawler_initiate()` tool for multi-page crawling
586 | - Added `smartcrawler_fetch_results()` tool for async result retrieval
587 | - Support for AI extraction mode (10 credits/page) and markdown mode (2 credits/page)
588 | - Configurable depth, max_pages, and same_domain_only parameters
589 | - Enhanced error handling for extraction mode validation
590 |
591 | **Recent Commits:**
592 | - `aebeebd` - Merge PR #5: Update to new features and add SmartCrawler
593 | - `b75053d` - Merge PR #4: Fix SmartCrawler issues
594 | - `54b330d` - Enhance error handling in ScapeGraphClient for extraction modes
595 | - `b3139dc` - Refactor web crawling methods to SmartCrawler terminology
596 | - `94173b0` - Add MseeP.ai security assessment badge
597 | - `53c2d99` - Add MCP server badge
598 |
599 | **Key Features:**
600 | 1. **SmartCrawler Support** - Multi-page crawling with AI or markdown modes
601 | 2. **Enhanced Error Handling** - Validation for extraction modes and prompts
602 | 3. **Async Operation Support** - Initiate/fetch pattern for long-running crawls
603 | 4. **Security Badges** - MseeP.ai security assessment and MCP server badges
604 |
605 | ---
606 |
607 | ## Development
608 |
609 | ### Running Locally
610 |
611 | **Prerequisites:**
612 | - Python 3.10+
613 | - pip or pipx
614 |
615 | **Install Dependencies:**
616 | ```bash
617 | pip install -e ".[dev]"
618 | ```
619 |
620 | **Run Server:**
621 | ```bash
622 | export SGAI_API_KEY=your-api-key
623 | python -m scrapegraph_mcp.server
624 | # or
625 | scrapegraph-mcp
626 | ```
627 |
628 | **Test with MCP Inspector:**
629 | ```bash
630 | npx @modelcontextprotocol/inspector scrapegraph-mcp
631 | ```
632 |
633 | ### Code Quality
634 |
635 | **Linting:**
636 | ```bash
637 | ruff check src/
638 | ```
639 |
640 | **Type Checking:**
641 | ```bash
642 | mypy src/
643 | ```
644 |
645 | **Configuration:**
646 | - **Ruff:** Line length 100, target Python 3.12, rules: E, F, I, B, W
647 | - **mypy:** Python 3.12, strict mode, disallow untyped defs
648 |
649 | ### Project Structure Best Practices
650 |
651 | **Single-File Architecture:**
652 | - All code in `src/scrapegraph_mcp/server.py`
653 | - Simple, easy to understand
654 | - Minimal dependencies
655 | - No complex abstractions
656 |
657 | **When to Refactor:**
658 | - If adding 5+ new tools, consider splitting into modules
659 | - If adding authentication logic, create separate auth module
660 | - If adding caching, create separate cache module
661 |
662 | ---
663 |
664 | ## Testing
665 |
666 | ### Manual Testing
667 |
668 | **Test markdownify:**
669 | ```bash
670 | echo '{"method":"tools/call","params":{"name":"markdownify","arguments":{"website_url":"https://scrapegraphai.com"}}}' | scrapegraph-mcp
671 | ```
672 |
673 | **Test smartscraper:**
674 | ```bash
675 | echo '{"method":"tools/call","params":{"name":"smartscraper","arguments":{"user_prompt":"Extract main features","website_url":"https://scrapegraphai.com"}}}' | scrapegraph-mcp
676 | ```
677 |
678 | **Test searchscraper:**
679 | ```bash
680 | echo '{"method":"tools/call","params":{"name":"searchscraper","arguments":{"user_prompt":"Latest AI news"}}}' | scrapegraph-mcp
681 | ```
682 |
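Note that these raw `echo` pipes are illustrative: a full MCP session also requires the JSON-RPC envelope and `initialize` handshake, so the MCP Inspector is the more reliable protocol-level test. For a quick end-to-end check of the API integration itself, the client class can be exercised directly (assuming the package is installed and `SGAI_API_KEY` is set):

```python
# Quick smoke test that bypasses MCP and exercises the API client directly.
import os

from scrapegraph_mcp.server import ScapeGraphClient

client = ScapeGraphClient(api_key=os.environ["SGAI_API_KEY"])
try:
    markdown = client.markdownify("https://scrapegraphai.com")
    print(str(markdown.get("result", markdown))[:500])
finally:
    client.close()
```
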
683 | ### Integration Testing
684 |
685 | **Claude Desktop:**
686 | 1. Configure MCP server in Claude Desktop
687 | 2. Restart Claude
688 | 3. Ask: "Convert https://scrapegraphai.com to markdown"
689 | 4. Verify tool is called and results returned
690 |
691 | **Cursor:**
692 | 1. Add MCP server in settings
693 | 2. Test with chat prompts
694 | 3. Verify tool integration
695 |
696 | ---
697 |
698 | ## Troubleshooting
699 |
700 | ### Common Issues
701 |
702 | **Issue: "ScapeGraph client not initialized"**
703 | - **Cause:** Missing API key
704 | - **Solution:** Set `SGAI_API_KEY` environment variable or pass via `--config`
705 |
706 | **Issue: "Error 401: Unauthorized"**
707 | - **Cause:** Invalid API key
708 | - **Solution:** Verify API key at [dashboard.scrapegraphai.com](https://dashboard.scrapegraphai.com)
709 |
710 | **Issue: "Error 402: Payment Required"**
711 | - **Cause:** Insufficient credits
712 | - **Solution:** Add credits to your account
713 |
714 | **Issue: "Error 504: Gateway Timeout"**
715 | - **Cause:** Website took too long to scrape
716 | - **Solution:** Retry or use `markdown_only=True` for faster processing
717 |
718 | **Issue: Windows cmd.exe not found**
719 | - **Cause:** Smithery can't find Windows command prompt
720 | - **Solution:** Use full path `C:\Windows\System32\cmd.exe`
721 |
722 | **Issue: SmartCrawler not returning results**
723 | - **Cause:** Still processing (async operation)
724 | - **Solution:** Keep polling `smartcrawler_fetch_results()` until `status == "completed"`
725 |
726 | ---
727 |
728 | ## Contributing
729 |
730 | ### Adding New Tools
731 |
732 | 1. Add method to `ScapeGraphClient` class:
733 | ```python
734 | def new_tool(self, param: str) -> Dict[str, Any]:
735 | """Tool description."""
736 | url = f"{self.BASE_URL}/new-endpoint"
737 | data = {"param": param}
738 | response = self.client.post(url, headers=self.headers, json=data)
739 | if response.status_code != 200:
740 | raise Exception(f"Error {response.status_code}: {response.text}")
741 | return response.json()
742 | ```
743 |
744 | 2. Add MCP tool decorator:
745 | ```python
746 | @mcp.tool()
747 | def new_tool(param: str) -> Dict[str, Any]:
748 | """Tool description for AI."""
749 | if scrapegraph_client is None:
750 | return {"error": "Client not initialized"}
751 | try:
752 | return scrapegraph_client.new_tool(param)
753 | except Exception as e:
754 | return {"error": str(e)}
755 | ```
756 |
757 | 3. Update documentation:
758 | - Add tool to [MCP Tools](#mcp-tools) section
759 | - Update README.md
760 | - Update API integration section
761 |
762 | ### Submitting Changes
763 |
764 | 1. Fork the repository
765 | 2. Create a feature branch
766 | 3. Make changes
767 | 4. Run linting and type checking
768 | 5. Test with Claude Desktop or Cursor
769 | 6. Submit pull request
770 |
771 | ---
772 |
773 | ## License
774 |
775 | This project is distributed under the MIT License. See [LICENSE](../../LICENSE) file for details.
776 |
777 | ---
778 |
779 | ## Acknowledgments
780 |
781 | - **[tomekkorbak](https://github.com/tomekkorbak)** - For [oura-mcp-server](https://github.com/tomekkorbak/oura-mcp-server) implementation inspiration
782 | - **[Model Context Protocol](https://modelcontextprotocol.io/)** - For the MCP specification
783 | - **[Smithery](https://smithery.ai/)** - For MCP server distribution platform
784 | - **[ScrapeGraphAI Team](https://scrapegraphai.com)** - For the API and support
785 |
786 | ---
787 |
788 | **Made with ❤️ by [ScrapeGraphAI](https://scrapegraphai.com) Team**
789 |
```
--------------------------------------------------------------------------------
/src/scrapegraph_mcp/server.py:
--------------------------------------------------------------------------------
```python
1 | #!/usr/bin/env python3
2 | """
3 | MCP server for ScapeGraph API integration.
4 |
5 | This server exposes methods to use ScapeGraph's AI-powered web scraping services:
6 | - markdownify: Convert any webpage into clean, formatted markdown
7 | - smartscraper: Extract structured data from any webpage using AI
8 | - searchscraper: Perform AI-powered web searches with structured results
9 | - smartcrawler_initiate: Initiate intelligent multi-page web crawling with AI extraction or markdown conversion
10 | - smartcrawler_fetch_results: Retrieve results from asynchronous crawling operations
11 | - scrape: Fetch raw page content with optional JavaScript rendering
12 | - sitemap: Extract and discover complete website structure
13 | - agentic_scrapper: Execute complex multi-step web scraping workflows
14 |
15 | ## Parameter Validation and Error Handling
16 |
17 | All tools include comprehensive parameter validation with detailed error messages:
18 |
19 | ### Common Validation Rules:
20 | - URLs must include protocol (http:// or https://)
21 | - Numeric parameters must be within specified ranges
22 | - Mutually exclusive parameters cannot be used together
23 | - Required parameters must be provided
24 | - JSON schemas must be valid JSON format
25 |
26 | ### Error Response Format:
27 | All tools return errors in a consistent format:
28 | ```json
29 | {
30 | "error": "Detailed error message explaining the issue",
31 | "error_type": "ValidationError|HTTPError|TimeoutError|etc.",
32 | "parameter": "parameter_name_if_applicable",
33 | "valid_range": "acceptable_values_if_applicable"
34 | }
35 | ```
36 |
37 | ### Example Validation Errors:
38 | - Invalid URL: "website_url must include protocol (http:// or https://)"
39 | - Range violation: "number_of_scrolls must be between 0 and 50"
40 | - Mutual exclusion: "Cannot specify both website_url and website_html"
41 | - Missing required: "prompt is required when extraction_mode is 'ai'"
42 | - Invalid JSON: "output_schema must be valid JSON format"
43 |
44 | ### Best Practices for Error Handling:
45 | 1. Always check the 'error' field in responses
46 | 2. Use parameter validation before making requests
47 | 3. Implement retry logic for timeout errors
48 | 4. Handle rate limiting gracefully
49 | 5. Validate URLs before passing to tools
50 |
51 | For comprehensive parameter documentation, use the resource:
52 | `scrapegraph://parameters/reference`
53 | """
54 |
55 | import json
56 | import logging
57 | import os
58 | from typing import Any, Dict, Optional, List, Union, Annotated
59 |
60 | import httpx
61 | from fastmcp import Context, FastMCP
62 | from smithery.decorators import smithery
63 | from pydantic import BaseModel, Field, AliasChoices
64 |
65 | # Configure logging
66 | logging.basicConfig(
67 | level=logging.INFO,
68 | format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
69 | )
70 | logger = logging.getLogger(__name__)
71 |
72 |
73 | class ScapeGraphClient:
74 | """Client for interacting with the ScapeGraph API."""
75 |
76 | BASE_URL = "https://api.scrapegraphai.com/v1"
77 |
78 | def __init__(self, api_key: str):
79 | """
80 | Initialize the ScapeGraph API client.
81 |
82 | Args:
83 | api_key: API key for ScapeGraph API
84 | """
85 | self.api_key = api_key
86 | self.headers = {
87 | "SGAI-APIKEY": api_key,
88 | "Content-Type": "application/json"
89 | }
90 | self.client = httpx.Client(timeout=httpx.Timeout(120.0))
91 |
92 |
93 | def markdownify(self, website_url: str) -> Dict[str, Any]:
94 | """
95 | Convert a webpage into clean, formatted markdown.
96 |
97 | Args:
98 | website_url: URL of the webpage to convert
99 |
100 | Returns:
101 | Dictionary containing the markdown result
102 | """
103 | url = f"{self.BASE_URL}/markdownify"
104 | data = {
105 | "website_url": website_url
106 | }
107 |
108 | response = self.client.post(url, headers=self.headers, json=data)
109 |
110 | if response.status_code != 200:
111 | error_msg = f"Error {response.status_code}: {response.text}"
112 | raise Exception(error_msg)
113 |
114 | return response.json()
115 |
116 | def smartscraper(
117 | self,
118 | user_prompt: str,
119 | website_url: str = None,
120 | website_html: str = None,
121 | website_markdown: str = None,
122 | output_schema: Dict[str, Any] = None,
123 | number_of_scrolls: int = None,
124 | total_pages: int = None,
125 | render_heavy_js: bool = None,
126 | stealth: bool = None
127 | ) -> Dict[str, Any]:
128 | """
129 | Extract structured data from a webpage using AI.
130 |
131 | Args:
132 | user_prompt: Instructions for what data to extract
133 | website_url: URL of the webpage to scrape (mutually exclusive with website_html and website_markdown)
134 | website_html: HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB)
135 | website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB)
136 | output_schema: JSON schema defining expected output structure (optional)
137 | number_of_scrolls: Number of infinite scrolls to perform (0-50, default 0)
138 | total_pages: Number of pages to process for pagination (1-100, default 1)
139 | render_heavy_js: Enable heavy JavaScript rendering for dynamic pages (default false)
140 | stealth: Enable stealth mode to avoid bot detection (default false)
141 |
142 | Returns:
143 | Dictionary containing the extracted data
144 | """
145 | url = f"{self.BASE_URL}/smartscraper"
146 | data = {"user_prompt": user_prompt}
147 |
148 | # Add input source (mutually exclusive)
149 | if website_url is not None:
150 | data["website_url"] = website_url
151 | elif website_html is not None:
152 | data["website_html"] = website_html
153 | elif website_markdown is not None:
154 | data["website_markdown"] = website_markdown
155 | else:
156 | raise ValueError("Must provide one of: website_url, website_html, or website_markdown")
157 |
158 | # Add optional parameters
159 | if output_schema is not None:
160 | data["output_schema"] = output_schema
161 | if number_of_scrolls is not None:
162 | data["number_of_scrolls"] = number_of_scrolls
163 | if total_pages is not None:
164 | data["total_pages"] = total_pages
165 | if render_heavy_js is not None:
166 | data["render_heavy_js"] = render_heavy_js
167 | if stealth is not None:
168 | data["stealth"] = stealth
169 |
170 | response = self.client.post(url, headers=self.headers, json=data)
171 |
172 | if response.status_code != 200:
173 | error_msg = f"Error {response.status_code}: {response.text}"
174 | raise Exception(error_msg)
175 |
176 | return response.json()
177 |
178 | def searchscraper(self, user_prompt: str, num_results: int = None, number_of_scrolls: int = None) -> Dict[str, Any]:
179 | """
180 | Perform AI-powered web searches with structured results.
181 |
182 | Args:
183 | user_prompt: Search query or instructions
184 | num_results: Number of websites to search (optional, default: 3 websites = 30 credits)
185 | number_of_scrolls: Number of infinite scrolls to perform on each website (optional)
186 |
187 | Returns:
188 | Dictionary containing search results and reference URLs
189 | """
190 | url = f"{self.BASE_URL}/searchscraper"
191 | data = {
192 | "user_prompt": user_prompt
193 | }
194 |
195 | # Add num_results to the request if provided
196 | if num_results is not None:
197 | data["num_results"] = num_results
198 |
199 | # Add number_of_scrolls to the request if provided
200 | if number_of_scrolls is not None:
201 | data["number_of_scrolls"] = number_of_scrolls
202 |
203 | response = self.client.post(url, headers=self.headers, json=data)
204 |
205 | if response.status_code != 200:
206 | error_msg = f"Error {response.status_code}: {response.text}"
207 | raise Exception(error_msg)
208 |
209 | return response.json()
210 |
211 | def scrape(self, website_url: str, render_heavy_js: Optional[bool] = None) -> Dict[str, Any]:
212 | """
213 | Basic scrape endpoint to fetch page content.
214 |
215 | Args:
216 | website_url: URL to scrape
217 | render_heavy_js: Whether to render heavy JS (optional)
218 |
219 | Returns:
220 | Dictionary containing the scraped result
221 | """
222 | url = f"{self.BASE_URL}/scrape"
223 | payload: Dict[str, Any] = {"website_url": website_url}
224 | if render_heavy_js is not None:
225 | payload["render_heavy_js"] = render_heavy_js
226 |
227 | response = self.client.post(url, headers=self.headers, json=payload)
228 | response.raise_for_status()
229 | return response.json()
230 |
231 | def sitemap(self, website_url: str) -> Dict[str, Any]:
232 | """
233 | Extract sitemap for a given website.
234 |
235 | Args:
236 | website_url: Base website URL
237 |
238 | Returns:
239 | Dictionary containing sitemap URLs/structure
240 | """
241 | url = f"{self.BASE_URL}/sitemap"
242 | payload: Dict[str, Any] = {"website_url": website_url}
243 |
244 | response = self.client.post(url, headers=self.headers, json=payload)
245 | response.raise_for_status()
246 | return response.json()
247 |
248 | def agentic_scrapper(
249 | self,
250 | url: str,
251 | user_prompt: Optional[str] = None,
252 | output_schema: Optional[Dict[str, Any]] = None,
253 | steps: Optional[List[str]] = None,
254 | ai_extraction: Optional[bool] = None,
255 | persistent_session: Optional[bool] = None,
256 | timeout_seconds: Optional[float] = None,
257 | ) -> Dict[str, Any]:
258 | """
259 | Run the Agentic Scraper workflow (no live session/browser interaction).
260 |
261 | Args:
262 | url: Target website URL
263 | user_prompt: Instructions for what to do/extract (optional)
264 | output_schema: Desired structured output schema (optional)
265 | steps: High-level steps/instructions for the agent (optional)
266 | ai_extraction: Whether to enable AI extraction mode (optional)
267 | persistent_session: Whether to keep session alive between steps (optional)
268 | timeout_seconds: Per-request timeout override in seconds (optional)
269 | """
270 | endpoint = f"{self.BASE_URL}/agentic-scrapper"
271 | payload: Dict[str, Any] = {"url": url}
272 | if user_prompt is not None:
273 | payload["user_prompt"] = user_prompt
274 | if output_schema is not None:
275 | payload["output_schema"] = output_schema
276 | if steps is not None:
277 | payload["steps"] = steps
278 | if ai_extraction is not None:
279 | payload["ai_extraction"] = ai_extraction
280 | if persistent_session is not None:
281 | payload["persistent_session"] = persistent_session
282 |
283 | if timeout_seconds is not None:
284 | response = self.client.post(endpoint, headers=self.headers, json=payload, timeout=timeout_seconds)
285 | else:
286 | response = self.client.post(endpoint, headers=self.headers, json=payload)
287 | response.raise_for_status()
288 | return response.json()
289 |
290 | def smartcrawler_initiate(
291 | self,
292 | url: str,
293 | prompt: str = None,
294 | extraction_mode: str = "ai",
295 | depth: int = None,
296 | max_pages: int = None,
297 | same_domain_only: bool = None
298 | ) -> Dict[str, Any]:
299 | """
300 | Initiate a SmartCrawler request for multi-page web crawling.
301 |
302 | SmartCrawler supports two modes:
303 | - AI Extraction Mode (10 credits per page): Extracts structured data based on your prompt
304 | - Markdown Conversion Mode (2 credits per page): Converts pages to clean markdown
305 |
306 |         SmartCrawler processes the request asynchronously and returns a request ID.
307 |         Use smartcrawler_fetch_results to retrieve the results of the request.
308 |         Keep polling smartcrawler_fetch_results until the request is complete,
309 |         i.e. until the status is "completed".
310 |
311 | Args:
312 | url: Starting URL to crawl
313 | prompt: AI prompt for data extraction (required for AI mode)
314 | extraction_mode: "ai" for AI extraction or "markdown" for markdown conversion (default: "ai")
315 | depth: Maximum link traversal depth (optional)
316 | max_pages: Maximum number of pages to crawl (optional)
317 | same_domain_only: Whether to crawl only within the same domain (optional)
318 |
319 | Returns:
320 | Dictionary containing the request ID for async processing
321 | """
322 | endpoint = f"{self.BASE_URL}/crawl"
323 | data = {
324 | "url": url
325 | }
326 |
327 | # Handle extraction mode
328 | if extraction_mode == "markdown":
329 | data["markdown_only"] = True
330 | elif extraction_mode == "ai":
331 | if prompt is None:
332 | raise ValueError("prompt is required when extraction_mode is 'ai'")
333 | data["prompt"] = prompt
334 | else:
335 | raise ValueError(f"Invalid extraction_mode: {extraction_mode}. Must be 'ai' or 'markdown'")
336 | if depth is not None:
337 | data["depth"] = depth
338 | if max_pages is not None:
339 | data["max_pages"] = max_pages
340 | if same_domain_only is not None:
341 | data["same_domain_only"] = same_domain_only
342 |
343 | response = self.client.post(endpoint, headers=self.headers, json=data)
344 |
345 | if response.status_code != 200:
346 | error_msg = f"Error {response.status_code}: {response.text}"
347 | raise Exception(error_msg)
348 |
349 | return response.json()
350 |
351 | def smartcrawler_fetch_results(self, request_id: str) -> Dict[str, Any]:
352 | """
353 | Fetch the results of a SmartCrawler operation.
354 |
355 | Args:
356 | request_id: The request ID returned by smartcrawler_initiate
357 |
358 | Returns:
359 | Dictionary containing the crawled data (structured extraction or markdown)
360 | and metadata about processed pages
361 |
362 | Note:
363 |             Crawling takes time to complete; while the crawl is still running,
364 |             this endpoint returns the current status of the request.
365 |             Keep polling smartcrawler_fetch_results until the request is complete,
366 |             i.e. until the status is "completed", at which point the results
367 |             are included in the response.
368 | """
369 | endpoint = f"{self.BASE_URL}/crawl/{request_id}"
370 |
371 | response = self.client.get(endpoint, headers=self.headers)
372 |
373 | if response.status_code != 200:
374 | error_msg = f"Error {response.status_code}: {response.text}"
375 | raise Exception(error_msg)
376 |
377 | return response.json()
378 |
379 | def close(self) -> None:
380 | """Close the HTTP client."""
381 | self.client.close()
382 |
383 |
384 | # Pydantic configuration schema for Smithery
385 | class ConfigSchema(BaseModel):
386 | scrapegraph_api_key: Optional[str] = Field(
387 | default=None,
388 | description="Your Scrapegraph API key (optional - can also be set via SGAI_API_KEY environment variable)",
389 | # Accept both camelCase (from smithery.yaml) and snake_case (internal) for validation,
390 | # and serialize back to camelCase to match Smithery expectations.
391 | validation_alias=AliasChoices("scrapegraphApiKey", "scrapegraph_api_key"),
392 | serialization_alias="scrapegraphApiKey",
393 | )
394 |
395 |
396 | def get_api_key(ctx: Context) -> str:
397 | """
398 | Get the API key from config or environment variable.
399 |
400 | Args:
401 | ctx: FastMCP context
402 |
403 | Returns:
404 | API key string
405 |
406 | Raises:
407 | ValueError: If no API key is found
408 | """
409 | try:
410 | logger.info(f"Getting API key. Context type: {type(ctx)}")
411 | logger.info(f"Context has session_config: {hasattr(ctx, 'session_config')}")
412 |
413 | # Try to get from config first, but handle cases where session_config might be None
414 | api_key = None
415 | if hasattr(ctx, 'session_config') and ctx.session_config is not None:
416 | logger.info(f"Session config type: {type(ctx.session_config)}")
417 | api_key = getattr(ctx.session_config, 'scrapegraph_api_key', None)
418 | logger.info(f"API key from config: {'***' if api_key else 'None'}")
419 | else:
420 | logger.info("No session_config available or session_config is None")
421 |
422 | # If not in config, try environment variable
423 | if not api_key:
424 | api_key = os.getenv('SGAI_API_KEY')
425 | logger.info(f"API key from env: {'***' if api_key else 'None'}")
426 |
427 | # If still no API key found, raise error
428 | if not api_key:
429 | logger.error("No API key found in config or environment")
430 | raise ValueError(
431 | "ScapeGraph API key is required. Please provide it either:\n"
432 | "1. In the MCP server configuration as 'scrapegraph_api_key'\n"
433 | "2. As an environment variable 'SGAI_API_KEY'"
434 | )
435 |
436 | logger.info("API key successfully retrieved")
437 | return api_key
438 |
439 | except Exception as e:
440 | logger.warning(f"Error getting API key from context: {e}. Falling back to cached method.")
441 | # Fallback to cached method if context handling fails
442 | return get_cached_api_key()
443 |
444 |
445 | # Create MCP server instance
446 | mcp = FastMCP("ScapeGraph API MCP Server")
447 |
448 | # Global API key cache to handle session issues
449 | _api_key_cache: Optional[str] = None
450 |
451 | def get_cached_api_key() -> str:
452 | """Get API key from cache or environment, bypassing session config issues."""
453 | global _api_key_cache
454 |
455 | if _api_key_cache is None:
456 | _api_key_cache = os.getenv('SGAI_API_KEY')
457 | if _api_key_cache:
458 | logger.info("API key loaded from environment variable")
459 | else:
460 | logger.error("No API key found in environment variable SGAI_API_KEY")
461 | raise ValueError(
462 | "ScapeGraph API key is required. Please set the SGAI_API_KEY environment variable."
463 | )
464 |
465 | return _api_key_cache
466 |
467 |
468 | # Add prompts to help users interact with the server
469 | @mcp.prompt()
470 | def web_scraping_guide() -> str:
471 | """
472 | A comprehensive guide to using ScapeGraph's web scraping tools effectively.
473 |
474 | This prompt provides examples and best practices for each tool in the ScapeGraph MCP server.
475 | """
476 | return """# ScapeGraph Web Scraping Guide
477 |
478 | ## Available Tools Overview
479 |
480 | ### 1. **markdownify** - Convert webpages to clean markdown
481 | **Use case**: Get clean, readable content from any webpage
482 | **Example**:
483 | - Input: `https://docs.python.org/3/tutorial/`
484 | - Output: Clean markdown of the Python tutorial
485 |
486 | ### 2. **smartscraper** - AI-powered data extraction
487 | **Use case**: Extract specific structured data using natural language prompts
488 | **Examples**:
489 | - "Extract all product names and prices from this e-commerce page"
490 | - "Get contact information including email, phone, and address"
491 | - "Find all article titles, authors, and publication dates"
492 |
493 | ### 3. **searchscraper** - AI web search with extraction
494 | **Use case**: Search the web and extract structured information
495 | **Examples**:
496 | - "Find the latest AI research papers and their abstracts"
497 | - "Search for Python web scraping tutorials with ratings"
498 | - "Get current cryptocurrency prices and market caps"
499 |
500 | ### 4. **smartcrawler_initiate** - Multi-page intelligent crawling
501 | **Use case**: Crawl multiple pages with AI extraction or markdown conversion
502 | **Modes**:
503 | - AI Mode (10 credits/page): Extract structured data
504 | - Markdown Mode (2 credits/page): Convert to markdown
505 | **Example**: Crawl a documentation site to extract all API endpoints
506 |
507 | ### 5. **smartcrawler_fetch_results** - Get crawling results
508 | **Use case**: Retrieve results from initiated crawling operations
509 | **Note**: Keep polling until status is "completed"
510 |
511 | ### 6. **scrape** - Basic page content fetching
512 | **Use case**: Get raw page content with optional JavaScript rendering
513 | **Example**: Fetch content from dynamic pages that require JS
514 |
515 | ### 7. **sitemap** - Extract website structure
516 | **Use case**: Get all URLs and structure of a website
517 | **Example**: Map out a website's architecture before crawling
518 |
519 | ### 8. **agentic_scrapper** - AI-powered automated scraping
520 | **Use case**: Complex multi-step scraping with AI automation
521 | **Example**: Navigate through forms, click buttons, extract data
522 |
523 | ## Best Practices
524 |
525 | 1. **Start Simple**: Use `markdownify` or `scrape` for basic content
526 | 2. **Be Specific**: Provide detailed prompts for better AI extraction
527 | 3. **Use Crawling Wisely**: Set appropriate limits for `max_pages` and `depth`
528 | 4. **Monitor Credits**: AI extraction uses more credits than markdown conversion
529 | 5. **Handle Async**: Use `smartcrawler_fetch_results` to poll for completion
530 |
531 | ## Common Workflows
532 |
533 | ### Extract Product Information
534 | 1. Use `smartscraper` with prompt: "Extract product name, price, description, and availability"
535 | 2. For multiple pages: Use `smartcrawler_initiate` in AI mode
536 |
537 | ### Research and Analysis
538 | 1. Use `searchscraper` to find relevant pages
539 | 2. Use `smartscraper` on specific pages for detailed extraction
540 |
541 | ### Site Documentation
542 | 1. Use `sitemap` to discover all pages
543 | 2. Use `smartcrawler_initiate` in markdown mode to convert all pages
544 |
545 | ### Complex Navigation
546 | 1. Use `agentic_scrapper` for sites requiring interaction
547 | 2. Provide step-by-step instructions in the `steps` parameter
548 | """
549 |
550 |
551 | @mcp.prompt()
552 | def quick_start_examples() -> str:
553 | """
554 | Quick start examples for common ScapeGraph use cases.
555 |
556 | Ready-to-use examples for immediate productivity.
557 | """
558 | return """# ScapeGraph Quick Start Examples
559 |
560 | ## 🚀 Ready-to-Use Examples
561 |
562 | ### Extract E-commerce Product Data
563 | ```
564 | Tool: smartscraper
565 | URL: https://example-shop.com/products/laptop
566 | Prompt: "Extract product name, price, specifications, customer rating, and availability status"
567 | ```
568 |
569 | ### Convert Documentation to Markdown
570 | ```
571 | Tool: markdownify
572 | URL: https://docs.example.com/api-reference
573 | ```
574 |
575 | ### Research Latest News
576 | ```
577 | Tool: searchscraper
578 | Prompt: "Find latest news about artificial intelligence breakthroughs in 2024"
579 | num_results: 5
580 | ```
581 |
582 | ### Crawl Entire Blog for Articles
583 | ```
584 | Tool: smartcrawler_initiate
585 | URL: https://blog.example.com
586 | Prompt: "Extract article title, author, publication date, and summary"
587 | extraction_mode: "ai"
588 | max_pages: 20
589 | ```
590 |
591 | ### Get Website Structure
592 | ```
593 | Tool: sitemap
594 | URL: https://example.com
595 | ```
596 |
597 | ### Extract Contact Information
598 | ```
599 | Tool: smartscraper
600 | URL: https://company.example.com/contact
601 | Prompt: "Find all contact methods: email addresses, phone numbers, physical address, and social media links"
602 | ```
603 |
604 | ### Automated Form Navigation
605 | ```
606 | Tool: agentic_scrapper
607 | URL: https://example.com/search
608 | user_prompt: "Navigate to the search page, enter 'web scraping tools', and extract the top 5 results"
609 | steps: ["Find search box", "Enter search term", "Submit form", "Extract results"]
610 | ```
611 |
612 | ## 💡 Pro Tips
613 |
614 | 1. **For Dynamic Content**: Use `render_heavy_js: true` with the `scrape` tool
615 | 2. **For Large Sites**: Start with `sitemap` to understand structure
616 | 3. **For Async Operations**: Always poll `smartcrawler_fetch_results` until complete
617 | 4. **For Complex Sites**: Use `agentic_scrapper` with detailed step instructions
618 | 5. **For Cost Efficiency**: Use markdown mode for content conversion, AI mode for data extraction
619 |
620 | ## 🔧 Configuration
621 |
622 | Set your API key via:
623 | - Environment variable: `SGAI_API_KEY=your_key_here`
624 | - MCP configuration: `scrapegraph_api_key: "your_key_here"`
625 |
626 | No extra configuration is required when the `SGAI_API_KEY` environment variable is set!
627 | """
628 |
629 |
630 | # Add resources to expose server capabilities and data
631 | @mcp.resource("scrapegraph://api/status")
632 | def api_status() -> str:
633 | """
634 | Current status and capabilities of the ScapeGraph API server.
635 |
636 | Provides real-time information about available tools, credit usage, and server health.
637 | """
638 | return """# ScapeGraph API Status
639 |
640 | ## Server Information
641 | - **Status**: ✅ Online and Ready
642 | - **Version**: 1.0.0
643 | - **Base URL**: https://api.scrapegraphai.com/v1
644 |
645 | ## Available Tools
646 | 1. **markdownify** - Convert webpages to markdown (2 credits/page)
647 | 2. **smartscraper** - AI data extraction (10 credits/page)
648 | 3. **searchscraper** - AI web search (30 credits for 3 websites)
649 | 4. **smartcrawler** - Multi-page crawling (2-10 credits/page)
650 | 5. **scrape** - Basic page fetching (1 credit/page)
651 | 6. **sitemap** - Website structure extraction (1 credit)
652 | 7. **agentic_scrapper** - AI automation (variable credits)
653 |
654 | ## Credit Costs
655 | - **Markdown Conversion**: 2 credits per page
656 | - **AI Extraction**: 10 credits per page
657 | - **Web Search**: 10 credits per website (default 3 websites)
658 | - **Basic Scraping**: 1 credit per page
659 | - **Sitemap**: 1 credit per request
660 |
661 | ## Configuration
662 | - **API Key**: Required (set via SGAI_API_KEY env var or config)
663 | - **Timeout**: 120 seconds default (configurable)
664 | - **Rate Limits**: Applied per API key
665 |
666 | ## Best Practices
667 | - Use markdown mode for content conversion (cheaper)
668 | - Use AI mode for structured data extraction
669 | - Set appropriate limits for crawling operations
670 | - Monitor credit usage for cost optimization
671 |
673 | """
674 |
675 |
676 | @mcp.resource("scrapegraph://examples/use-cases")
677 | def common_use_cases() -> str:
678 | """
679 | Common use cases and example implementations for ScapeGraph tools.
680 |
681 | Real-world examples with expected inputs and outputs.
682 | """
683 | return """# ScapeGraph Common Use Cases
684 |
685 | ## 🛍️ E-commerce Data Extraction
686 |
687 | ### Product Information Scraping
688 | **Tool**: smartscraper
689 | **Input**: Product page URL + "Extract name, price, description, rating, availability"
690 | **Output**: Structured JSON with product details
691 | **Credits**: 10 per page
692 |
693 | ### Price Monitoring
694 | **Tool**: smartcrawler_initiate (AI mode)
695 | **Input**: Product category page + price extraction prompt
696 | **Output**: Structured price data across multiple products
697 | **Credits**: 10 per page crawled
698 |
699 | ## 📰 Content & Research
700 |
701 | ### News Article Extraction
702 | **Tool**: searchscraper
703 | **Input**: "Latest news about [topic]" + num_results
704 | **Output**: Article titles, summaries, sources, dates
705 | **Credits**: 10 per website searched
706 |
707 | ### Documentation Conversion
708 | **Tool**: smartcrawler_initiate (markdown mode)
709 | **Input**: Documentation site root URL
710 | **Output**: Clean markdown files for all pages
711 | **Credits**: 2 per page converted
712 |
713 | ## 🏢 Business Intelligence
714 |
715 | ### Contact Information Gathering
716 | **Tool**: smartscraper
717 | **Input**: Company website + "Find contact details"
718 | **Output**: Emails, phones, addresses, social media
719 | **Credits**: 10 per page
720 |
721 | ### Competitor Analysis
722 | **Tool**: searchscraper + smartscraper combination
723 | **Input**: Search for competitors + extract key metrics
724 | **Output**: Structured competitive intelligence
725 | **Credits**: Variable based on pages analyzed
726 |
727 | ## 🔍 Research & Analysis
728 |
729 | ### Academic Paper Research
730 | **Tool**: searchscraper
731 | **Input**: Research query + academic site focus
732 | **Output**: Paper titles, abstracts, authors, citations
733 | **Credits**: 10 per source website
734 |
735 | ### Market Research
736 | **Tool**: smartcrawler_initiate
737 | **Input**: Industry website + data extraction prompts
738 | **Output**: Market trends, statistics, insights
739 | **Credits**: 10 per page (AI mode)
740 |
741 | ## 🤖 Automation Workflows
742 |
743 | ### Form-based Data Collection
744 | **Tool**: agentic_scrapper
745 | **Input**: Site URL + navigation steps + extraction goals
746 | **Output**: Data collected through automated interaction
747 | **Credits**: Variable based on complexity
748 |
749 | ### Multi-step Research Process
750 | **Workflow**: sitemap → smartcrawler_initiate → smartscraper
751 | **Input**: Target site + research objectives
752 | **Output**: Comprehensive site analysis and data extraction
753 | **Credits**: Cumulative based on tools used
754 |
755 | ## 💡 Optimization Tips
756 |
757 | 1. **Start with sitemap** to understand site structure
758 | 2. **Use markdown mode** for content archival (cheaper)
759 | 3. **Use AI mode** for structured data extraction
760 | 4. **Batch similar requests** to optimize credit usage
761 | 5. **Set appropriate crawl limits** to control costs
762 | 6. **Use specific prompts** for better AI extraction accuracy
763 |
764 | ## 📊 Expected Response Times
765 |
766 | - **Simple scraping**: 5-15 seconds
767 | - **AI extraction**: 15-45 seconds per page
768 | - **Crawling operations**: 1-5 minutes (async)
769 | - **Search operations**: 30-90 seconds
770 | - **Agentic workflows**: 2-10 minutes
771 |
772 | ## 🚨 Common Pitfalls
773 |
774 | - Not setting crawl limits (unexpected credit usage)
775 | - Vague extraction prompts (poor AI results)
776 | - Not polling async operations (missing results)
777 | - Ignoring rate limits (request failures)
778 | - Not handling JavaScript-heavy sites (incomplete data)
779 | """
780 |
781 |
782 | @mcp.resource("scrapegraph://parameters/reference")
783 | def parameter_reference_guide() -> str:
784 | """
785 | Comprehensive parameter reference guide for all ScapeGraph MCP tools.
786 |
787 | Complete documentation of every parameter with examples, constraints, and best practices.
788 | """
789 | return """# ScapeGraph MCP Parameter Reference Guide
790 |
791 | ## 📋 Complete Parameter Documentation
792 |
793 | This guide provides comprehensive documentation for every parameter across all ScapeGraph MCP tools. Use this as your definitive reference for understanding parameter behavior, constraints, and best practices.
794 |
795 | ---
796 |
797 | ## 🔧 Common Parameters
798 |
799 | ### URL Parameters
800 | **Used in**: markdownify, smartscraper, searchscraper, smartcrawler_initiate, scrape, sitemap, agentic_scrapper
801 |
802 | #### `website_url` / `url`
803 | - **Type**: `str` (required)
804 | - **Format**: Must include protocol (http:// or https://)
805 | - **Examples**:
806 | - ✅ `https://example.com/page`
807 | - ✅ `https://docs.python.org/3/tutorial/`
808 | - ❌ `example.com` (missing protocol)
809 | - ❌ `ftp://example.com` (unsupported protocol)
810 | - **Best Practices**:
811 | - Always include the full URL with protocol
812 | - Ensure the URL is publicly accessible
813 | - Test URLs manually before automation
814 |
815 | ---
816 |
817 | ## 🤖 AI and Extraction Parameters
818 |
819 | ### `user_prompt`
820 | **Used in**: smartscraper, searchscraper, agentic_scrapper
821 |
822 | - **Type**: `str` (required)
823 | - **Purpose**: Natural language instructions for AI extraction
824 | - **Examples**:
825 | - `"Extract product name, price, description, and availability"`
826 | - `"Find contact information: email, phone, address"`
827 | - `"Get article title, author, publication date, summary"`
828 | - **Best Practices**:
829 | - Be specific about desired fields
830 | - Mention data types (numbers, dates, URLs)
831 | - Include context about data location
832 | - Use clear, descriptive language
833 |
834 | ### `output_schema`
835 | **Used in**: smartscraper, agentic_scrapper
836 |
837 | - **Type**: `Optional[Union[str, Dict[str, Any]]]`
838 | - **Purpose**: Define expected output structure
839 | - **Formats**:
840 | - Dictionary: `{'type': 'object', 'properties': {'title': {'type': 'string'}}}`
841 | - JSON string: `'{"type": "object", "properties": {"name": {"type": "string"}}}'`
842 | - **Examples**:
843 | ```json
844 | {
845 | "type": "object",
846 | "properties": {
847 | "products": {
848 | "type": "array",
849 | "items": {
850 | "type": "object",
851 | "properties": {
852 | "name": {"type": "string"},
853 | "price": {"type": "number"},
854 | "available": {"type": "boolean"}
855 | }
856 | }
857 | }
858 | }
859 | }
860 | ```
861 | - **Best Practices**:
862 | - Use for complex, structured extractions
863 | - Define clear data types
864 | - Consider nested structures for complex data
865 |
866 | ---
867 |
868 | ## 🌐 Content Source Parameters
869 |
870 | ### `website_html`
871 | **Used in**: smartscraper
872 |
873 | - **Type**: `Optional[str]`
874 | - **Purpose**: Process local HTML content
875 | - **Constraints**: Maximum 2MB
876 | - **Use Cases**:
877 | - Pre-fetched HTML content
878 | - Generated HTML from other sources
879 | - Offline HTML processing
880 | - **Mutually Exclusive**: Cannot use with `website_url` or `website_markdown`
881 |
882 | ### `website_markdown`
883 | **Used in**: smartscraper
884 |
885 | - **Type**: `Optional[str]`
886 | - **Purpose**: Process local markdown content
887 | - **Constraints**: Maximum 2MB
888 | - **Use Cases**:
889 | - Documentation processing
890 | - README file analysis
891 | - Converted web content
892 | - **Mutually Exclusive**: Cannot use with `website_url` or `website_html`
893 |
894 | ---
895 |
896 | ## 📄 Pagination and Scrolling Parameters
897 |
898 | ### `number_of_scrolls`
899 | **Used in**: smartscraper, searchscraper
900 |
901 | - **Type**: `Optional[int]`
902 | - **Range**: 0-50 scrolls
903 | - **Default**: 0 (no scrolling)
904 | - **Purpose**: Handle dynamically loaded content
905 | - **Examples**:
906 | - `0`: Static content, no scrolling needed
907 | - `3`: Social media feeds, product listings
908 | - `10`: Long articles, extensive catalogs
909 | - **Performance Impact**: +5-10 seconds per scroll
910 | - **Best Practices**:
911 | - Start with 0 and increase if content seems incomplete
912 | - Use sparingly to control processing time
913 | - Consider site loading behavior
914 |
915 | ### `total_pages`
916 | **Used in**: smartscraper
917 |
918 | - **Type**: `Optional[int]`
919 | - **Range**: 1-100 pages
920 | - **Default**: 1 (single page)
921 | - **Purpose**: Process paginated content
922 | - **Cost Impact**: 10 credits × pages
923 | - **Examples**:
924 | - `1`: Single page extraction
925 | - `5`: First 5 pages of results
926 | - `20`: Comprehensive pagination
927 | - **Best Practices**:
928 | - Set reasonable limits to control costs
929 | - Consider total credit usage
930 | - Test with small numbers first
931 |
932 | ---
933 |
934 | ## 🚀 Performance Parameters
935 |
936 | ### `render_heavy_js`
937 | **Used in**: smartscraper, scrape
938 |
939 | - **Type**: `Optional[bool]`
940 | - **Default**: `false`
941 | - **Purpose**: Enable JavaScript rendering for SPAs
942 | - **When to Use `true`**:
943 | - React/Angular/Vue applications
944 | - Dynamic content loading
945 | - AJAX-heavy interfaces
946 | - Content appearing after page load
947 | - **When to Use `false`**:
948 | - Static websites
949 | - Server-side rendered content
950 | - Traditional HTML pages
951 | - When speed is priority
952 | - **Performance Impact**:
953 | - `false`: 2-5 seconds
954 | - `true`: 15-30 seconds
955 | - **Cost**: Same regardless of setting
956 |
957 | ### `stealth`
958 | **Used in**: smartscraper
959 |
960 | - **Type**: `Optional[bool]`
961 | - **Default**: `false`
962 | - **Purpose**: Bypass basic bot detection
963 | - **When to Use**:
964 | - Sites with anti-scraping measures
965 | - E-commerce sites with protection
966 | - Sites requiring "human-like" behavior
967 | - **Limitations**:
968 | - Not 100% guaranteed
969 | - May increase processing time
970 | - Some advanced detection may still work
971 |
972 | ---
973 |
974 | ## 🔄 Crawling Parameters
975 |
976 | ### `prompt`
977 | **Used in**: smartcrawler_initiate
978 |
979 | - **Type**: `Optional[str]`
980 | - **Required**: When `extraction_mode="ai"`
981 | - **Purpose**: AI extraction instructions for all crawled pages
982 | - **Examples**:
983 | - `"Extract API endpoint name, method, parameters"`
984 | - `"Get article title, author, publication date"`
985 | - **Best Practices**:
986 | - Use general terms that apply across page types
987 | - Consider varying page structures
988 | - Be specific about desired fields
989 |
990 | ### `extraction_mode`
991 | **Used in**: smartcrawler_initiate
992 |
993 | - **Type**: `str`
994 | - **Default**: `"ai"`
995 | - **Options**:
996 | - `"ai"`: AI-powered extraction (10 credits/page)
997 | - `"markdown"`: Markdown conversion (2 credits/page)
998 | - **Cost Comparison**:
999 | - AI mode: 50 pages = 500 credits
1000 | - Markdown mode: 50 pages = 100 credits
1001 | - **Use Cases**:
1002 | - AI: Data collection, research, analysis
1003 | - Markdown: Content archival, documentation backup
1004 |
1005 | ### `depth`
1006 | **Used in**: smartcrawler_initiate
1007 |
1008 | - **Type**: `Optional[int]`
1009 | - **Default**: Unlimited
1010 | - **Purpose**: Control link traversal depth
1011 | - **Levels**:
1012 | - `0`: Only starting URL
1013 | - `1`: Starting URL + direct links
1014 | - `2`: Two levels of link following
1015 | - `3+`: Deeper traversal
1016 | - **Considerations**:
1017 | - Higher depth = exponential growth
1018 | - Use with `max_pages` for control
1019 | - Consider site structure
1020 |
1021 | ### `max_pages`
1022 | **Used in**: smartcrawler_initiate
1023 |
1024 | - **Type**: `Optional[int]`
1025 | - **Default**: Unlimited
1026 | - **Purpose**: Limit total pages crawled
1027 | - **Recommended Ranges**:
1028 | - `10-20`: Testing, small sites
1029 | - `50-100`: Medium sites
1030 | - `200-500`: Large sites
1031 | - `1000+`: Enterprise crawling
1032 | - **Cost Calculation**:
1033 | - AI mode: `max_pages × 10` credits
1034 | - Markdown mode: `max_pages × 2` credits
1035 |
1036 | ### `same_domain_only`
1037 | **Used in**: smartcrawler_initiate
1038 |
1039 | - **Type**: `Optional[bool]`
1040 | - **Default**: `true`
1041 | - **Purpose**: Control cross-domain crawling
1042 | - **Options**:
1043 | - `true`: Stay within same domain (recommended)
1044 | - `false`: Allow external domains (use with caution)
1045 | - **Best Practices**:
1046 | - Use `true` for focused crawling
1047 | - Set `max_pages` when using `false`
1048 | - Consider crawling scope carefully
1049 |
1050 | ---
1051 |
1052 | ## 🔄 Search Parameters
1053 |
1054 | ### `num_results`
1055 | **Used in**: searchscraper
1056 |
1057 | - **Type**: `Optional[int]`
1058 | - **Default**: 3 websites
1059 | - **Range**: 1-20 (recommended ≤10)
1060 | - **Cost**: `num_results × 10` credits
1061 | - **Examples**:
1062 | - `1`: Quick lookup (10 credits)
1063 | - `3`: Standard research (30 credits)
1064 | - `5`: Comprehensive (50 credits)
1065 | - `10`: Extensive analysis (100 credits)
1066 |
1067 | ---
1068 |
1069 | ## 🤖 Agentic Automation Parameters
1070 |
1071 | ### `steps`
1072 | **Used in**: agentic_scrapper
1073 |
1074 | - **Type**: `Optional[Union[str, List[str]]]`
1075 | - **Purpose**: Sequential workflow instructions
1076 | - **Formats**:
1077 | - List: `['Click search', 'Enter term', 'Extract results']`
1078 | - JSON string: `'["Step 1", "Step 2", "Step 3"]'`
1079 | - **Best Practices**:
1080 | - Break complex actions into simple steps
1081 | - Be specific about UI elements
1082 | - Include wait/loading steps
1083 | - Order logically
1084 |
1085 | ### `ai_extraction`
1086 | **Used in**: agentic_scrapper
1087 |
1088 | - **Type**: `Optional[bool]`
1089 | - **Default**: `true`
1090 | - **Purpose**: Control extraction intelligence
1091 | - **Options**:
1092 | - `true`: Advanced AI extraction (recommended)
1093 | - `false`: Simpler, faster extraction
1094 | - **Trade-offs**:
1095 | - `true`: Better accuracy, slower processing
1096 | - `false`: Faster execution, less accurate
1097 |
1098 | ### `persistent_session`
1099 | **Used in**: agentic_scrapper
1100 |
1101 | - **Type**: `Optional[bool]`
1102 | - **Default**: `false`
1103 | - **Purpose**: Maintain session state between steps
1104 | - **When to Use `true`**:
1105 | - Login flows
1106 | - Shopping cart processes
1107 | - Form wizards with dependencies
1108 | - **When to Use `false`**:
1109 | - Simple data extraction
1110 | - Independent actions
1111 | - Public content scraping
1112 |
1113 | ### `timeout_seconds`
1114 | **Used in**: agentic_scrapper
1115 |
1116 | - **Type**: `Optional[float]`
1117 | - **Default**: 120.0 (2 minutes)
1118 | - **Recommended Ranges**:
1119 | - `60-120`: Simple workflows (2-5 steps)
1120 | - `180-300`: Medium complexity (5-10 steps)
1121 | - `300-600`: Complex workflows (10+ steps)
1122 | - `600+`: Very complex workflows
1123 | - **Considerations**:
1124 | - Include page load times
1125 | - Factor in network latency
1126 | - Allow for AI processing time
1127 |
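A representative set of arguments for a medium-complexity workflow (a sketch; the site, steps, and values are illustrative assumptions):

```python
# Illustrative agentic_scrapper arguments (site and steps are hypothetical):
agentic_args = {
    "url": "https://example.com/search",
    "user_prompt": "Search for 'laptops' and extract the top 5 results with prices",
    "steps": [
        "Type 'laptops' into the search box",
        "Press Enter and wait for the results to load",
        "Extract name and price for the first 5 results",
    ],
    "ai_extraction": True,
    "persistent_session": False,   # no login or cart state needed
    "timeout_seconds": 180.0,      # medium-complexity range from above
}
```
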
1128 | ---
1129 |
1130 | ## 💰 Credit Cost Summary
1131 |
1132 | | Tool | Base Cost | Additional Costs |
1133 | |------|-----------|------------------|
1134 | | `markdownify` | 2 credits | None |
1135 | | `smartscraper` | 10 credits | +10 per additional page |
1136 | | `searchscraper` | 30 credits (3 sites) | +10 per additional site |
1137 | | `smartcrawler` | 2-10 credits/page | Depends on extraction mode |
1138 | | `scrape` | 1 credit | None |
1139 | | `sitemap` | 1 credit | None |
1140 | | `agentic_scrapper` | Variable | Based on complexity |
1141 |
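For planning purposes, a rough estimator based on the rates above (an illustrative sketch, not an official API):

```python
# Rough credit estimator using the rates from the table above (illustrative only).
def estimate_credits(tool: str, pages: int = 1, extraction_mode: str = "ai") -> int:
    flat_rates = {"markdownify": 2, "smartscraper": 10, "scrape": 1, "sitemap": 1}
    if tool == "smartcrawler":
        return pages * (10 if extraction_mode == "ai" else 2)
    if tool == "searchscraper":
        return pages * 10          # "pages" = number of websites searched
    return flat_rates.get(tool, 0) * pages

print(estimate_credits("smartcrawler", pages=50, extraction_mode="markdown"))  # 100
print(estimate_credits("searchscraper", pages=3))                              # 30
```
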
1142 | ---
1143 |
1144 | ## ⚠️ Common Parameter Mistakes
1145 |
1146 | ### URL Formatting
1147 | - ❌ `example.com` → ✅ `https://example.com`
1148 | - ❌ `ftp://site.com` → ✅ `https://site.com`
1149 |
1150 | ### Mutually Exclusive Parameters
1151 | - ❌ Setting both `website_url` and `website_html`
1152 | - ✅ Choose one input source only
1153 |
1154 | ### Range Violations
1155 | - ❌ `number_of_scrolls: 100` → ✅ `number_of_scrolls: 10`
1156 | - ❌ `total_pages: 1000` → ✅ `total_pages: 100`
1157 |
1158 | ### JSON Schema Errors
1159 | - ❌ Invalid JSON string format
1160 | - ✅ Valid JSON or dictionary format
1161 |
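For example, both of these forms are accepted for `output_schema` (a small illustrative sketch):

```python
# Equivalent schema definitions: a Python dict or a JSON string.
schema_as_dict = {"type": "object", "properties": {"title": {"type": "string"}}}
schema_as_json = '{"type": "object", "properties": {"title": {"type": "string"}}}'
```
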
1162 | ### Timeout Issues
1163 | - ❌ `timeout_seconds: 30` for complex workflows
1164 | - ✅ `timeout_seconds: 300` for complex workflows
1165 |
1166 | ---
1167 |
1168 | ## 🎯 Parameter Selection Guide
1169 |
1170 | ### For Simple Content Extraction
1171 | ```
1172 | Tool: markdownify or smartscraper
1173 | Parameters: website_url, user_prompt (if smartscraper)
1174 | ```
1175 |
1176 | ### For Dynamic Content
1177 | ```
1178 | Tool: smartscraper or scrape
1179 | Parameters: render_heavy_js=true, stealth=true (if needed)
1180 | ```
1181 |
1182 | ### For Multi-Page Content
1183 | ```
1184 | Tool: smartcrawler_initiate
1185 | Parameters: max_pages, depth, extraction_mode
1186 | ```
1187 |
1188 | ### For Research Tasks
1189 | ```
1190 | Tool: searchscraper
1191 | Parameters: num_results, user_prompt
1192 | ```
1193 |
1194 | ### For Complex Automation
1195 | ```
1196 | Tool: agentic_scrapper
1197 | Parameters: steps, persistent_session, timeout_seconds
1198 | ```
1199 |
1200 | ---
1201 |
1202 | ## 📚 Additional Resources
1203 |
1204 | - **Tool Comparison**: Use `scrapegraph://tools/comparison` resource
1205 | - **Use Cases**: Check `scrapegraph://examples/use-cases` resource
1206 | - **API Status**: Monitor `scrapegraph://api/status` resource
1207 | - **Quick Examples**: See prompt `quick_start_examples`
1208 |
1209 | ---
1210 |
1211 | *Last Updated: November 2024*
1212 | *For the most current parameter information, refer to individual tool documentation.*
1213 | """
1214 |
1215 |
1216 | @mcp.resource("scrapegraph://tools/comparison")
1217 | def tool_comparison_guide() -> str:
1218 | """
1219 |     Detailed comparison of ScrapeGraph tools to help choose the right tool for each task.
1220 |
1221 | Decision matrix and feature comparison across all available tools.
1222 | """
1223 | return """# ScapeGraph Tools Comparison Guide
1224 |
1225 | ## 🎯 Quick Decision Matrix
1226 |
1227 | | Need | Recommended Tool | Alternative | Credits |
1228 | |------|------------------|-------------|---------|
1229 | | Convert page to markdown | `markdownify` | `scrape` + manual | 2 |
1230 | | Extract specific data | `smartscraper` | `agentic_scrapper` | 10 |
1231 | | Search web for info | `searchscraper` | Multiple `smartscraper` | 30 |
1232 | | Crawl multiple pages | `smartcrawler_initiate` | Loop `smartscraper` | 2-10/page |
1233 | | Get raw page content | `scrape` | `markdownify` | 1 |
1234 | | Map site structure | `sitemap` | Manual discovery | 1 |
1235 | | Complex automation | `agentic_scrapper` | Custom scripting | Variable |
1236 |
1237 | ## 🔍 Detailed Tool Comparison
1238 |
1239 | ### Content Extraction Tools
1240 |
1241 | #### markdownify vs scrape
1242 | - **markdownify**: Clean, formatted markdown output
1243 | - **scrape**: Raw HTML with optional JS rendering
1244 | - **Use markdownify when**: You need readable content
1245 | - **Use scrape when**: You need full HTML or custom parsing
1246 |
1247 | #### smartscraper vs agentic_scrapper
1248 | - **smartscraper**: Single-page AI extraction
1249 | - **agentic_scrapper**: Multi-step automated workflows
1250 | - **Use smartscraper when**: Simple data extraction from one page
1251 | - **Use agentic_scrapper when**: Complex navigation required
1252 |
1253 | ### Scale & Automation
1254 |
1255 | #### Single Page Tools
1256 | - `markdownify`, `smartscraper`, `scrape`, `sitemap`
1257 | - **Pros**: Fast, predictable costs, simple
1258 | - **Cons**: Manual iteration for multiple pages
1259 |
1260 | #### Multi-Page Tools
1261 | - `smartcrawler_initiate`, `searchscraper`, `agentic_scrapper`
1262 | - **Pros**: Automated scale, comprehensive results
1263 | - **Cons**: Higher costs, longer processing times
1264 |
1265 | ### Cost Optimization
1266 |
1267 | #### Low Cost (1-2 credits)
1268 | - `scrape`: Basic page fetching
1269 | - `markdownify`: Content conversion
1270 | - `sitemap`: Site structure
1271 |
1272 | #### Medium Cost (10 credits)
1273 | - `smartscraper`: AI data extraction
1274 | - `searchscraper`: Per website searched
1275 |
1276 | #### Variable Cost
1277 | - `smartcrawler_initiate`: 2-10 credits per page
1278 | - `agentic_scrapper`: Depends on complexity
1279 |
1280 | ## 🚀 Performance Characteristics
1281 |
1282 | ### Speed (Typical Response Times)
1283 | 1. **scrape**: 2-5 seconds
1284 | 2. **sitemap**: 3-8 seconds
1285 | 3. **markdownify**: 5-15 seconds
1286 | 4. **smartscraper**: 15-45 seconds
1287 | 5. **searchscraper**: 30-90 seconds
1288 | 6. **smartcrawler**: 1-5 minutes (async)
1289 | 7. **agentic_scrapper**: 2-10 minutes
1290 |
1291 | ### Reliability
1292 | - **Highest**: `scrape`, `sitemap`, `markdownify`
1293 | - **High**: `smartscraper`, `searchscraper`
1294 | - **Variable**: `smartcrawler`, `agentic_scrapper` (depends on site complexity)
1295 |
1296 | ## 🎨 Output Format Comparison
1297 |
1298 | ### Structured Data
1299 | - **smartscraper**: JSON with extracted fields
1300 | - **searchscraper**: JSON with search results
1301 | - **agentic_scrapper**: Custom schema support
1302 |
1303 | ### Content Formats
1304 | - **markdownify**: Clean markdown text
1305 | - **scrape**: Raw HTML
1306 | - **sitemap**: URL list/structure
1307 |
1308 | ### Async Operations
1309 | - **smartcrawler_initiate**: Returns request ID
1310 | - **smartcrawler_fetch_results**: Returns final data
1311 | - All others: Immediate response
1312 |
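Below is a hedged sketch of the initiate-then-poll pattern using the module's `ScapeGraphClient` directly (the MCP tools wrap these same calls); the import path, API key placeholder, and 10-second polling interval are illustrative assumptions, and field names follow the documentation above:

```python
import time

# Import path assumed from this repository's package layout.
from scrapegraph_mcp.server import ScapeGraphClient

client = ScapeGraphClient("YOUR_SGAI_API_KEY")   # placeholder key

job = client.smartcrawler_initiate(
    url="https://docs.example.com",
    prompt="Extract page title and summary",
    extraction_mode="ai",
    depth=2,
    max_pages=20,
    same_domain_only=True,
)

# Poll until the crawl reports completion (response fields as documented above).
result = client.smartcrawler_fetch_results(job["request_id"])
while result.get("status") != "completed":
    time.sleep(10)
    result = client.smartcrawler_fetch_results(job["request_id"])
print(result.get("results"))
```
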
1313 | ## 🛠️ Integration Patterns
1314 |
1315 | ### Simple Workflows
1316 | ```
1317 | URL → markdownify → Markdown content
1318 | URL → smartscraper → Structured data
1319 | Query → searchscraper → Research results
1320 | ```
1321 |
1322 | ### Complex Workflows
1323 | ```
1324 | URL → sitemap → smartcrawler_initiate → smartcrawler_fetch_results
1325 | URL → agentic_scrapper (with steps) → Complex extracted data
1326 | Query → searchscraper → smartscraper (on results) → Detailed analysis
1327 | ```
1328 |
1329 | ### Hybrid Approaches
1330 | ```
1331 | URL → scrape (check if JS needed) → smartscraper (extract data)
1332 | URL → sitemap (map structure) → smartcrawler (batch process)
1333 | ```
1334 |
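A hedged sketch of the "scrape first, then extract" hybrid pattern, again using the module's `ScapeGraphClient`; the 2,000-character threshold, the `html_content` field check, and the URL are illustrative assumptions:

```python
# Import path assumed from this repository's package layout.
from scrapegraph_mcp.server import ScapeGraphClient

client = ScapeGraphClient("YOUR_SGAI_API_KEY")   # placeholder key

url = "https://app.example.com/dashboard"
page = client.scrape(website_url=url, render_heavy_js=False)

# Heuristic: a near-empty HTML shell usually means client-side rendering.
needs_js = len(page.get("html_content", "")) < 2000

data = client.smartscraper(
    user_prompt="Extract the dashboard widget names and their values",
    website_url=url,
    website_html=None,
    website_markdown=None,
    output_schema=None,
    number_of_scrolls=None,
    total_pages=None,
    render_heavy_js=needs_js,
    stealth=None,
)
```
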
1335 | ## 📋 Selection Checklist
1336 |
1337 | **Choose markdownify when:**
1338 | - ✅ Need readable content format
1339 | - ✅ Converting documentation/articles
1340 | - ✅ Cost is a primary concern
1341 |
1342 | **Choose smartscraper when:**
1343 | - ✅ Need specific data extracted
1344 | - ✅ Working with single pages
1345 | - ✅ Want AI-powered extraction
1346 |
1347 | **Choose searchscraper when:**
1348 | - ✅ Need to find information across web
1349 | - ✅ Research-oriented tasks
1350 | - ✅ Don't have specific URLs
1351 |
1352 | **Choose smartcrawler when:**
1353 | - ✅ Need to process multiple pages
1354 | - ✅ Can wait for async processing
1355 | - ✅ Want consistent extraction across site
1356 |
1357 | **Choose agentic_scrapper when:**
1358 | - ✅ Site requires complex navigation
1359 | - ✅ Need to interact with forms/buttons
1360 | - ✅ Custom workflow requirements
1361 | """
1362 |
1363 |
1364 | # Add tool for markdownify
1365 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
1366 | def markdownify(website_url: str, ctx: Context) -> Dict[str, Any]:
1367 | """
1368 | Convert a webpage into clean, formatted markdown.
1369 |
1370 | This tool fetches any webpage and converts its content into clean, readable markdown format.
1371 | Useful for extracting content from documentation, articles, and web pages for further processing.
1372 | Costs 2 credits per page. Read-only operation with no side effects.
1373 |
1374 | Args:
1375 | website_url (str): The complete URL of the webpage to convert to markdown format.
1376 | - Must include protocol (http:// or https://)
1377 | - Supports most web content types (HTML, articles, documentation)
1378 | - Works with both static and dynamic content
1379 | - Examples:
1380 | * https://example.com/page
1381 | * https://docs.python.org/3/tutorial/
1382 | * https://github.com/user/repo/README.md
1383 | - Invalid examples:
1384 | * example.com (missing protocol)
1385 | * ftp://example.com (unsupported protocol)
1386 | * localhost:3000 (missing protocol)
1387 |
1388 | Returns:
1389 | Dictionary containing:
1390 | - markdown: The converted markdown content as a string
1391 | - metadata: Additional information about the conversion (title, description, etc.)
1392 | - status: Success/error status of the operation
1393 | - credits_used: Number of credits consumed (always 2 for this operation)
1394 |
1395 | Raises:
1396 | ValueError: If website_url is malformed or missing protocol
1397 | HTTPError: If the webpage cannot be accessed or returns an error
1398 | TimeoutError: If the webpage takes too long to load (>120 seconds)
1399 | """
1400 | try:
1401 | api_key = get_api_key(ctx)
1402 | client = ScapeGraphClient(api_key)
1403 | return client.markdownify(website_url)
1404 | except Exception as e:
1405 | return {"error": str(e)}
1406 |
1407 |
1408 | # Add tool for smartscraper
1409 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
1410 | def smartscraper(
1411 | user_prompt: str,
1412 | ctx: Context,
1413 | website_url: Optional[str] = None,
1414 | website_html: Optional[str] = None,
1415 | website_markdown: Optional[str] = None,
1416 | output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
1417 | default=None,
1418 | description="JSON schema dict or JSON string defining the expected output structure",
1419 | json_schema_extra={
1420 | "oneOf": [
1421 | {"type": "string"},
1422 | {"type": "object"}
1423 | ]
1424 | }
1425 | )]] = None,
1426 | number_of_scrolls: Optional[int] = None,
1427 | total_pages: Optional[int] = None,
1428 | render_heavy_js: Optional[bool] = None,
1429 | stealth: Optional[bool] = None
1430 | ) -> Dict[str, Any]:
1431 | """
1432 | Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.
1433 |
1434 | This tool uses advanced AI to understand your natural language prompt and extract specific
1435 | structured data from web content. Supports three input modes: URL scraping, local HTML processing,
1436 | or local markdown processing. Ideal for extracting product information, contact details,
1437 | article metadata, or any structured content. Costs 10 credits per page. Read-only operation.
1438 |
1439 | Args:
1440 | user_prompt (str): Natural language instructions describing what data to extract.
1441 | - Be specific about the fields you want for better results
1442 | - Use clear, descriptive language about the target data
1443 | - Examples:
1444 | * "Extract product name, price, description, and availability status"
1445 | * "Find all contact methods: email addresses, phone numbers, and social media links"
1446 | * "Get article title, author, publication date, and summary"
1447 | * "Extract all job listings with title, company, location, and salary"
1448 | - Tips for better results:
1449 | * Specify exact field names you want
1450 | * Mention data types (numbers, dates, URLs, etc.)
1451 | * Include context about where data might be located
1452 |
1453 | website_url (Optional[str]): The complete URL of the webpage to scrape.
1454 | - Mutually exclusive with website_html and website_markdown
1455 | - Must include protocol (http:// or https://)
1456 | - Supports dynamic and static content
1457 | - Examples:
1458 | * https://example.com/products/item
1459 | * https://news.site.com/article/123
1460 | * https://company.com/contact
1461 | - Default: None (must provide one of the three input sources)
1462 |
1463 | website_html (Optional[str]): Raw HTML content to process locally.
1464 | - Mutually exclusive with website_url and website_markdown
1465 | - Maximum size: 2MB
1466 | - Useful for processing pre-fetched or generated HTML
1467 | - Use when you already have HTML content from another source
1468 | - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
1469 | - Default: None
1470 |
1471 | website_markdown (Optional[str]): Markdown content to process locally.
1472 | - Mutually exclusive with website_url and website_html
1473 | - Maximum size: 2MB
1474 | - Useful for extracting from markdown documents or converted content
1475 | - Works well with documentation, README files, or converted web content
1476 | - Example: "# Title\n\n## Section\n\nContent here..."
1477 | - Default: None
1478 |
1479 | output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
1480 | - Can be provided as a dictionary or JSON string
1481 | - Helps ensure consistent, structured output format
1482 | - Optional but recommended for complex extractions
1483 | - Examples:
1484 | * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}}
1485 | * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}}'
1486 | * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}}}
1487 | - Default: None (AI will infer structure from prompt)
1488 |
1489 | number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
1490 | - Range: 0-50 scrolls
1491 | - Default: 0 (no scrolling)
1492 | - Useful for dynamically loaded content (lazy loading, infinite scroll)
1493 | - Each scroll waits for content to load before continuing
1494 | - Examples:
1495 | * 0: Static content, no scrolling needed
1496 | * 3: Social media feeds, product listings
1497 | * 10: Long articles, extensive product catalogs
1498 | - Note: Increases processing time proportionally
1499 |
1500 | total_pages (Optional[int]): Number of pages to process for pagination.
1501 | - Range: 1-100 pages
1502 | - Default: 1 (single page only)
1503 | - Automatically follows pagination links when available
1504 | - Useful for multi-page listings, search results, catalogs
1505 | - Examples:
1506 | * 1: Single page extraction
1507 | * 5: First 5 pages of search results
1508 | * 20: Comprehensive catalog scraping
1509 | - Note: Each page counts toward credit usage (10 credits × pages)
1510 |
1511 | render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
1512 | - Default: false
1513 | - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
1514 | - Increases processing time but captures client-side rendered content
1515 | - Use when content is loaded dynamically via JavaScript
1516 | - Examples of when to use:
1517 | * React/Angular/Vue applications
1518 | * Sites with dynamic content loading
1519 | * AJAX-heavy interfaces
1520 | * Content that appears after page load
1521 | - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)
1522 |
1523 | stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
1524 | - Default: false
1525 | - Helps bypass basic anti-scraping measures
1526 | - Uses techniques to appear more like a human browser
1527 | - Useful for sites with bot detection systems
1528 | - Examples of when to use:
1529 | * Sites that block automated requests
1530 | * E-commerce sites with protection
1531 | * Sites that require "human-like" behavior
1532 | - Note: May increase processing time and is not 100% guaranteed
1533 |
1534 | Returns:
1535 | Dictionary containing:
1536 | - extracted_data: The structured data matching your prompt and optional schema
1537 | - metadata: Information about the extraction process
1538 | - credits_used: Number of credits consumed (10 per page processed)
1539 | - processing_time: Time taken for the extraction
1540 | - pages_processed: Number of pages that were analyzed
1541 | - status: Success/error status of the operation
1542 |
1543 | Raises:
1544 | ValueError: If no input source provided or multiple sources provided
1545 | HTTPError: If website_url cannot be accessed
1546 | TimeoutError: If processing exceeds timeout limits
1547 | ValidationError: If output_schema is malformed JSON
1548 | """
1549 | try:
1550 | api_key = get_api_key(ctx)
1551 | client = ScapeGraphClient(api_key)
1552 |
1553 | # Parse output_schema if it's a JSON string
1554 | normalized_schema: Optional[Dict[str, Any]] = None
1555 | if isinstance(output_schema, dict):
1556 | normalized_schema = output_schema
1557 | elif isinstance(output_schema, str):
1558 | try:
1559 | parsed_schema = json.loads(output_schema)
1560 | if isinstance(parsed_schema, dict):
1561 | normalized_schema = parsed_schema
1562 | else:
1563 | return {"error": "output_schema must be a JSON object"}
1564 | except json.JSONDecodeError as e:
1565 | return {"error": f"Invalid JSON for output_schema: {str(e)}"}
1566 |
1567 | return client.smartscraper(
1568 | user_prompt=user_prompt,
1569 | website_url=website_url,
1570 | website_html=website_html,
1571 | website_markdown=website_markdown,
1572 | output_schema=normalized_schema,
1573 | number_of_scrolls=number_of_scrolls,
1574 | total_pages=total_pages,
1575 | render_heavy_js=render_heavy_js,
1576 | stealth=stealth
1577 | )
1578 | except Exception as e:
1579 | return {"error": str(e)}
1580 |
1581 |
1582 | # Add tool for searchscraper
1583 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": False})
1584 | def searchscraper(
1585 | user_prompt: str,
1586 | ctx: Context,
1587 | num_results: Optional[int] = None,
1588 | number_of_scrolls: Optional[int] = None
1589 | ) -> Dict[str, Any]:
1590 | """
1591 | Perform AI-powered web searches with structured data extraction.
1592 |
1593 | This tool searches the web based on your query and uses AI to extract structured information
1594 | from the search results. Ideal for research, competitive analysis, and gathering information
1595 | from multiple sources. Each website searched costs 10 credits (default 3 websites = 30 credits).
1596 | Read-only operation but results may vary over time (non-idempotent).
1597 |
1598 | Args:
1599 | user_prompt (str): Search query or natural language instructions for information to find.
1600 | - Can be a simple search query or detailed extraction instructions
1601 | - The AI will search the web and extract relevant data from found pages
1602 | - Be specific about what information you want extracted
1603 | - Examples:
1604 | * "Find latest AI research papers published in 2024 with author names and abstracts"
1605 | * "Search for Python web scraping tutorials with ratings and difficulty levels"
1606 | * "Get current cryptocurrency prices and market caps for top 10 coins"
1607 | * "Find contact information for tech startups in San Francisco"
1608 | * "Search for job openings for data scientists with salary information"
1609 | - Tips for better results:
1610 | * Include specific fields you want extracted
1611 | * Mention timeframes or filters (e.g., "latest", "2024", "top 10")
1612 | * Specify data types needed (prices, dates, ratings, etc.)
1613 |
1614 | num_results (Optional[int]): Number of websites to search and extract data from.
1615 | - Default: 3 websites (costs 30 credits total)
1616 | - Range: 1-20 websites (recommended to stay under 10 for cost efficiency)
1617 | - Each website costs 10 credits, so total cost = num_results × 10
1618 | - Examples:
1619 | * 1: Quick single-source lookup (10 credits)
1620 | * 3: Standard research (30 credits) - good balance of coverage and cost
1621 | * 5: Comprehensive research (50 credits)
1622 | * 10: Extensive analysis (100 credits)
1623 | - Note: More results provide broader coverage but increase costs and processing time
1624 |
1625 | number_of_scrolls (Optional[int]): Number of infinite scrolls per searched webpage.
1626 | - Default: 0 (no scrolling on search result pages)
1627 | - Range: 0-10 scrolls per page
1628 | - Useful when search results point to pages with dynamic content loading
1629 | - Each scroll waits for content to load before continuing
1630 | - Examples:
1631 | * 0: Static content pages, news articles, documentation
1632 | * 2: Social media pages, product listings with lazy loading
1633 | * 5: Extensive feeds, long-form content with infinite scroll
1634 | - Note: Increases processing time significantly (adds 5-10 seconds per scroll per page)
1635 |
1636 | Returns:
1637 | Dictionary containing:
1638 | - search_results: Array of extracted data from each website found
1639 | - sources: List of URLs that were searched and processed
1640 | - total_websites_processed: Number of websites successfully analyzed
1641 | - credits_used: Total credits consumed (num_results × 10)
1642 | - processing_time: Total time taken for search and extraction
1643 | - search_query_used: The actual search query sent to search engines
1644 | - metadata: Additional information about the search process
1645 |
1646 | Raises:
1647 | ValueError: If user_prompt is empty or num_results is out of range
1648 | HTTPError: If search engines are unavailable or return errors
1649 | TimeoutError: If search or extraction process exceeds timeout limits
1650 | RateLimitError: If too many requests are made in a short time period
1651 |
1652 | Note:
1653 | - Results may vary between calls due to changing web content (non-idempotent)
1654 | - Search engines may return different results over time
1655 | - Some websites may be inaccessible or block automated access
1656 | - Processing time increases with num_results and number_of_scrolls
1657 | - Consider using smartscraper on specific URLs if you know the target sites
1658 | """
1659 | try:
1660 | api_key = get_api_key(ctx)
1661 | client = ScapeGraphClient(api_key)
1662 | return client.searchscraper(user_prompt, num_results, number_of_scrolls)
1663 | except Exception as e:
1664 | return {"error": str(e)}
1665 |
1666 |
1667 | # Add tool for SmartCrawler initiation
1668 | @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
1669 | def smartcrawler_initiate(
1670 | url: str,
1671 | ctx: Context,
1672 | prompt: Optional[str] = None,
1673 | extraction_mode: str = "ai",
1674 | depth: Optional[int] = None,
1675 | max_pages: Optional[int] = None,
1676 | same_domain_only: Optional[bool] = None
1677 | ) -> Dict[str, Any]:
1678 | """
1679 | Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.
1680 |
1681 | This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL.
1682 | Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page)
1683 | for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results.
1684 | Creates a new crawl request (non-idempotent, non-read-only).
1685 |
1686 | SmartCrawler supports two modes:
1687 | - AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
1688 | - Markdown Conversion Mode: Converts each page to clean markdown format
1689 |
1690 | Args:
1691 | url (str): The starting URL to begin crawling from.
1692 | - Must include protocol (http:// or https://)
1693 | - The crawler will discover and process linked pages from this starting point
1694 | - Should be a page with links to other pages you want to crawl
1695 | - Examples:
1696 | * https://docs.example.com (documentation site root)
1697 | * https://blog.company.com (blog homepage)
1698 | * https://example.com/products (product category page)
1699 | * https://news.site.com/category/tech (news section)
1700 | - Best practices:
1701 | * Use homepage or main category pages as starting points
1702 | * Ensure the starting page has links to content you want to crawl
1703 | * Consider site structure when choosing the starting URL
1704 |
1705 | prompt (Optional[str]): AI prompt for data extraction.
1706 | - REQUIRED when extraction_mode is 'ai'
1707 | - Ignored when extraction_mode is 'markdown'
1708 | - Describes what data to extract from each crawled page
1709 | - Applied consistently across all discovered pages
1710 | - Examples:
1711 | * "Extract API endpoint name, method, parameters, and description"
1712 | * "Get article title, author, publication date, and summary"
1713 | * "Find product name, price, description, and availability"
1714 | * "Extract job title, company, location, salary, and requirements"
1715 | - Tips for better results:
1716 | * Be specific about fields you want from each page
1717 | * Consider that different pages may have different content structures
1718 | * Use general terms that apply across multiple page types
1719 |
1720 | extraction_mode (str): Extraction mode for processing crawled pages.
1721 | - Default: "ai"
1722 | - Options:
1723 | * "ai": AI-powered structured data extraction (10 credits per page)
1724 | - Uses the prompt to extract specific data from each page
1725 | - Returns structured JSON data
1726 | - More expensive but provides targeted information
1727 | - Best for: Data collection, research, structured analysis
1728 | * "markdown": Simple markdown conversion (2 credits per page)
1729 | - Converts each page to clean markdown format
1730 | - No AI processing, just content conversion
1731 | - More cost-effective for content archival
1732 | - Best for: Documentation backup, content migration, reading
1733 | - Cost comparison:
1734 | * AI mode: 50 pages = 500 credits
1735 | * Markdown mode: 50 pages = 100 credits
1736 |
1737 | depth (Optional[int]): Maximum depth of link traversal from the starting URL.
1738 | - Default: unlimited (will follow links until max_pages or no more links)
1739 | - Depth levels:
1740 | * 0: Only the starting URL (no link following)
1741 | * 1: Starting URL + pages directly linked from it
1742 | * 2: Starting URL + direct links + links from those pages
1743 | * 3+: Continues following links to specified depth
1744 | - Examples:
1745 | * 1: Crawl blog homepage + all blog posts
1746 | * 2: Crawl docs homepage + category pages + individual doc pages
1747 | * 3: Deep crawling for comprehensive site coverage
1748 | - Considerations:
1749 | * Higher depth can lead to exponential page growth
1750 | * Use with max_pages to control scope and cost
1751 | * Consider site structure when setting depth
1752 |
1753 | max_pages (Optional[int]): Maximum number of pages to crawl in total.
1754 | - Default: unlimited (will crawl until no more links or depth limit)
1755 | - Recommended ranges:
1756 | * 10-20: Testing and small sites
1757 | * 50-100: Medium sites and focused crawling
1758 | * 200-500: Large sites and comprehensive analysis
1759 | * 1000+: Enterprise-level crawling (high cost)
1760 | - Cost implications:
1761 | * AI mode: max_pages × 10 credits
1762 | * Markdown mode: max_pages × 2 credits
1763 | - Examples:
1764 | * 10: Quick site sampling (20-100 credits)
1765 | * 50: Standard documentation crawl (100-500 credits)
1766 | * 200: Comprehensive site analysis (400-2000 credits)
1767 | - Note: Crawler stops when this limit is reached, regardless of remaining links
1768 |
1769 | same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
1770 | - Default: true (recommended for most use cases)
1771 | - Options:
1772 | * true: Only crawl pages within the same domain as starting URL
1773 | - Prevents following external links
1774 | - Keeps crawling focused on the target site
1775 | - Reduces risk of crawling unrelated content
1776 | - Example: Starting at docs.example.com only crawls docs.example.com pages
1777 | * false: Allow crawling external domains
1778 | - Follows links to other domains
1779 | - Can lead to very broad crawling scope
1780 | - May crawl unrelated or unwanted content
1781 | - Use with caution and appropriate max_pages limit
1782 | - Recommendations:
1783 | * Use true for focused site crawling
1784 | * Use false only when you specifically need cross-domain data
1785 | * Always set max_pages when using false to prevent runaway crawling
1786 |
1787 | Returns:
1788 | Dictionary containing:
1789 | - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
1790 | - status: Initial status of the crawl request ("initiated" or "processing")
1791 | - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
1792 | - crawl_parameters: Summary of the crawling configuration
1793 | - estimated_time: Rough estimate of processing time
1794 | - next_steps: Instructions for retrieving results
1795 |
1796 | Raises:
1797 | ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
1798 | HTTPError: If the starting URL cannot be accessed
1799 | RateLimitError: If too many crawl requests are initiated too quickly
1800 |
1801 | Note:
1802 | - This operation is asynchronous and may take several minutes to complete
1803 | - Use smartcrawler_fetch_results with the returned request_id to get results
1804 | - Keep polling smartcrawler_fetch_results until status is "completed"
1805 | - Actual pages crawled may be less than max_pages if fewer links are found
1806 | - Processing time increases with max_pages, depth, and extraction_mode complexity
1807 | """
1808 | try:
1809 | api_key = get_api_key(ctx)
1810 | client = ScapeGraphClient(api_key)
1811 | return client.smartcrawler_initiate(
1812 | url=url,
1813 | prompt=prompt,
1814 | extraction_mode=extraction_mode,
1815 | depth=depth,
1816 | max_pages=max_pages,
1817 | same_domain_only=same_domain_only
1818 | )
1819 | except Exception as e:
1820 | return {"error": str(e)}
1821 |
1822 |
1823 | # Add tool for fetching SmartCrawler results
1824 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
1825 | def smartcrawler_fetch_results(request_id: str, ctx: Context) -> Dict[str, Any]:
1826 | """
1827 | Retrieve the results of an asynchronous SmartCrawler operation.
1828 |
1829 | This tool fetches the results from a previously initiated crawling operation using the request_id.
1830 | The crawl request processes asynchronously in the background. Keep polling this endpoint until
1831 | the status field indicates 'completed'. While processing, you'll receive status updates.
1832 | Read-only operation that safely retrieves results without side effects.
1833 |
1834 | Args:
1835 | request_id: The unique request ID returned by smartcrawler_initiate. Use this to retrieve the crawling results. Keep polling until status is 'completed'. Example: 'req_abc123xyz'
1836 |
1837 | Returns:
1838 | Dictionary containing:
1839 | - status: Current status of the crawl operation ('processing', 'completed', 'failed')
1840 | - results: Crawled data (structured extraction or markdown) when completed
1841 | - metadata: Information about processed pages, URLs visited, and processing statistics
1842 | Keep polling until status is 'completed' to get final results
1843 | """
1844 | try:
1845 | api_key = get_api_key(ctx)
1846 | client = ScapeGraphClient(api_key)
1847 | return client.smartcrawler_fetch_results(request_id)
1848 | except Exception as e:
1849 | return {"error": str(e)}
1850 |
1851 |
1852 | # Add tool for basic scrape
1853 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
1854 | def scrape(
1855 | website_url: str,
1856 | ctx: Context,
1857 | render_heavy_js: Optional[bool] = None
1858 | ) -> Dict[str, Any]:
1859 | """
1860 | Fetch raw page content from any URL with optional JavaScript rendering.
1861 |
1862 | This tool performs basic web scraping to retrieve the raw HTML content of a webpage.
1863 | Optionally enable JavaScript rendering for Single Page Applications (SPAs) and sites with
1864 | heavy client-side rendering. Lower cost than AI extraction (1 credit/page).
1865 | Read-only operation with no side effects.
1866 |
1867 | Args:
1868 | website_url (str): The complete URL of the webpage to scrape.
1869 | - Must include protocol (http:// or https://)
1870 | - Returns raw HTML content of the page
1871 | - Works with both static and dynamic websites
1872 | - Examples:
1873 | * https://example.com/page
1874 | * https://api.example.com/docs
1875 | * https://news.site.com/article/123
1876 | * https://app.example.com/dashboard (may need render_heavy_js=true)
1877 | - Supported protocols: HTTP, HTTPS
1878 | - Invalid examples:
1879 | * example.com (missing protocol)
1880 | * ftp://example.com (unsupported protocol)
1881 |
1882 | render_heavy_js (Optional[bool]): Enable full JavaScript rendering for dynamic content.
1883 | - Default: false (faster, lower cost, works for most static sites)
1884 | - Set to true for sites that require JavaScript execution to display content
1885 | - When to use true:
1886 | * Single Page Applications (React, Angular, Vue.js)
1887 | * Sites with dynamic content loading via AJAX
1888 | * Content that appears only after JavaScript execution
1889 | * Interactive web applications
1890 | * Sites where initial HTML is mostly empty
1891 | - When to use false (default):
1892 | * Static websites and blogs
1893 | * Server-side rendered content
1894 | * Traditional HTML pages
1895 | * News articles and documentation
1896 | * When you need faster processing
1897 | - Performance impact:
1898 | * false: 2-5 seconds processing time
1899 | * true: 15-30 seconds processing time (waits for JS execution)
1900 | - Cost: Same (1 credit) regardless of render_heavy_js setting
1901 |
1902 | Returns:
1903 | Dictionary containing:
1904 | - html_content: The raw HTML content of the page as a string
1905 | - page_title: Extracted page title if available
1906 | - status_code: HTTP response status code (200 for success)
1907 | - final_url: Final URL after any redirects
1908 | - content_length: Size of the HTML content in bytes
1909 | - processing_time: Time taken to fetch and process the page
1910 | - javascript_rendered: Whether JavaScript rendering was used
1911 | - credits_used: Number of credits consumed (always 1)
1912 |
1913 | Raises:
1914 | ValueError: If website_url is malformed or missing protocol
1915 | HTTPError: If the webpage returns an error status (404, 500, etc.)
1916 | TimeoutError: If the page takes too long to load
1917 | ConnectionError: If the website cannot be reached
1918 |
1919 | Use Cases:
1920 | - Getting raw HTML for custom parsing
1921 | - Checking page structure before using other tools
1922 | - Fetching content for offline processing
1923 | - Debugging website content issues
1924 | - Pre-processing before AI extraction
1925 |
1926 | Note:
1927 | - This tool returns raw HTML without any AI processing
1928 | - Use smartscraper for structured data extraction
1929 | - Use markdownify for clean, readable content
1930 | - Consider render_heavy_js=true if initial results seem incomplete
1931 | """
1932 | try:
1933 | api_key = get_api_key(ctx)
1934 | client = ScapeGraphClient(api_key)
1935 | return client.scrape(website_url=website_url, render_heavy_js=render_heavy_js)
1936 | except httpx.HTTPError as http_err:
1937 | return {"error": str(http_err)}
1938 | except ValueError as val_err:
1939 | return {"error": str(val_err)}
1940 |
1941 |
1942 | # Add tool for sitemap extraction
1943 | @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
1944 | def sitemap(website_url: str, ctx: Context) -> Dict[str, Any]:
1945 | """
1946 | Extract and discover the complete sitemap structure of any website.
1947 |
1948 | This tool automatically discovers all accessible URLs and pages within a website, providing
1949 | a comprehensive map of the site's structure. Useful for understanding site architecture before
1950 | crawling or for discovering all available content. Very cost-effective at 1 credit per request.
1951 | Read-only operation with no side effects.
1952 |
1953 | Args:
1954 | website_url (str): The base URL of the website to extract sitemap from.
1955 | - Must include protocol (http:// or https://)
1956 | - Should be the root domain or main section you want to map
1957 | - The tool will discover all accessible pages from this starting point
1958 | - Examples:
1959 | * https://example.com (discover entire website structure)
1960 | * https://docs.example.com (map documentation site)
1961 | * https://blog.company.com (discover all blog pages)
1962 | * https://shop.example.com (map e-commerce structure)
1963 | - Best practices:
1964 | * Use root domain (https://example.com) for complete site mapping
1965 | * Use subdomain (https://docs.example.com) for focused mapping
1966 | * Ensure the URL is accessible and doesn't require authentication
1967 | - Discovery methods:
1968 | * Checks for robots.txt and sitemap.xml files
1969 | * Crawls navigation links and menus
1970 | * Discovers pages through internal link analysis
1971 | * Identifies common URL patterns and structures
1972 |
1973 | Returns:
1974 | Dictionary containing:
1975 | - discovered_urls: List of all URLs found on the website
1976 | - site_structure: Hierarchical organization of pages and sections
1977 | - url_categories: URLs grouped by type (pages, images, documents, etc.)
1978 | - total_pages: Total number of pages discovered
1979 | - subdomains: List of subdomains found (if any)
1980 | - sitemap_sources: Sources used for discovery (sitemap.xml, robots.txt, crawling)
1981 | - page_types: Breakdown of different content types found
1982 | - depth_analysis: URL organization by depth from root
1983 | - external_links: Links pointing to external domains (if found)
1984 | - processing_time: Time taken to complete the discovery
1985 | - credits_used: Number of credits consumed (always 1)
1986 |
1987 | Raises:
1988 | ValueError: If website_url is malformed or missing protocol
1989 | HTTPError: If the website cannot be accessed or returns errors
1990 | TimeoutError: If the discovery process takes too long
1991 | ConnectionError: If the website cannot be reached
1992 |
1993 | Use Cases:
1994 | - Planning comprehensive crawling operations
1995 | - Understanding website architecture and organization
1996 | - Discovering all available content before targeted scraping
1997 | - SEO analysis and site structure optimization
1998 | - Content inventory and audit preparation
1999 | - Identifying pages for bulk processing operations
2000 |
2001 | Best Practices:
2002 | - Run sitemap before using smartcrawler_initiate for better planning
2003 | - Use results to set appropriate max_pages and depth parameters
2004 | - Check discovered URLs to understand site organization
2005 | - Identify high-value pages for targeted extraction
2006 | - Use for cost estimation before large crawling operations
2007 |
2008 | Note:
2009 | - Very cost-effective at only 1 credit per request
2010 | - Results may vary based on site structure and accessibility
2011 | - Some pages may require authentication and won't be discovered
2012 | - Large sites may have thousands of URLs - consider filtering results
2013 | - Use discovered URLs as input for other scraping tools
2014 | """
2015 | try:
2016 | api_key = get_api_key(ctx)
2017 | client = ScapeGraphClient(api_key)
2018 | return client.sitemap(website_url=website_url)
2019 | except httpx.HTTPError as http_err:
2020 | return {"error": str(http_err)}
2021 | except ValueError as val_err:
2022 | return {"error": str(val_err)}
2023 |
2024 |
2025 | # Add tool for Agentic Scraper (runs remotely; no live browser session is exposed to the client)
2026 | @mcp.tool(annotations={"readOnlyHint": False, "destructiveHint": False, "idempotentHint": False})
2027 | def agentic_scrapper(
2028 | url: str,
2029 | ctx: Context,
2030 | user_prompt: Optional[str] = None,
2031 | output_schema: Optional[Annotated[Union[str, Dict[str, Any]], Field(
2032 | default=None,
2033 | description="Desired output structure as a JSON schema dict or JSON string",
2034 | json_schema_extra={
2035 | "oneOf": [
2036 | {"type": "string"},
2037 | {"type": "object"}
2038 | ]
2039 | }
2040 | )]] = None,
2041 | steps: Optional[Annotated[Union[str, List[str]], Field(
2042 | default=None,
2043 | description="Step-by-step instructions for the agent as a list of strings or JSON array string",
2044 | json_schema_extra={
2045 | "oneOf": [
2046 | {"type": "string"},
2047 | {"type": "array", "items": {"type": "string"}}
2048 | ]
2049 | }
2050 | )]] = None,
2051 | ai_extraction: Optional[bool] = None,
2052 | persistent_session: Optional[bool] = None,
2053 | timeout_seconds: Optional[float] = None
2054 | ) -> Dict[str, Any]:
2055 | """
2056 | Execute complex multi-step web scraping workflows with AI-powered automation.
2057 |
2058 | This tool runs an intelligent agent that can navigate websites, interact with forms and buttons,
2059 | follow multi-step workflows, and extract structured data. Ideal for complex scraping scenarios
2060 | requiring user interaction simulation, form submissions, or multi-page navigation flows.
2061 | Supports custom output schemas and step-by-step instructions. Variable credit cost based on
2062 | complexity. Can perform actions on the website (non-read-only, non-idempotent).
2063 |
2064 | The agent accepts flexible input formats for steps (list or JSON string) and output_schema
2065 | (dict or JSON string) to accommodate different client implementations.
2066 |
2067 | Args:
2068 | url (str): The target website URL where the agentic scraping workflow should start.
2069 | - Must include protocol (http:// or https://)
2070 | - Should be the starting page for your automation workflow
2071 | - The agent will begin its actions from this URL
2072 | - Examples:
2073 | * https://example.com/search (start at search page)
2074 | * https://shop.example.com/login (begin with login flow)
2075 | * https://app.example.com/dashboard (start at main interface)
2076 | * https://forms.example.com/contact (begin at form page)
2077 | - Considerations:
2078 | * Choose a starting point that makes sense for your workflow
2079 | * Ensure the page is publicly accessible or handle authentication
2080 | * Consider the logical flow of actions from this starting point
2081 |
2082 | user_prompt (Optional[str]): High-level instructions for what the agent should accomplish.
2083 | - Describes the overall goal and desired outcome of the automation
2084 | - Should be clear and specific about what you want to achieve
2085 | - Works in conjunction with the steps parameter for detailed guidance
2086 | - Examples:
2087 | * "Navigate to the search page, search for laptops, and extract the top 5 results with prices"
2088 | * "Fill out the contact form with sample data and submit it"
2089 | * "Login to the dashboard and extract all recent notifications"
2090 | * "Browse the product catalog and collect information about all items"
2091 | * "Navigate through the multi-step checkout process and capture each step"
2092 | - Tips for better results:
2093 | * Be specific about the end goal
2094 | * Mention what data you want extracted
2095 | * Include context about the expected workflow
2096 | * Specify any particular elements or sections to focus on
2097 |
2098 | output_schema (Optional[Union[str, Dict]]): Desired output structure for extracted data.
2099 | - Can be provided as a dictionary or JSON string
2100 | - Defines the format and structure of the final extracted data
2101 | - Helps ensure consistent, predictable output format
2102 | - Examples:
2103 | * Simple object: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}}
2104 | * Array of objects: {'type': 'array', 'items': {'type': 'object', 'properties': {'name': {'type': 'string'}, 'value': {'type': 'string'}}}}
2105 | * Complex nested: {'type': 'object', 'properties': {'products': {'type': 'array', 'items': {...}}, 'total_count': {'type': 'number'}}}
2106 | * As JSON string: '{"type": "object", "properties": {"results": {"type": "array"}}}'
2107 | - Default: None (agent will infer structure from prompt and steps)
2108 |
2109 | steps (Optional[Union[str, List[str]]]): Step-by-step instructions for the agent.
2110 | - Can be provided as a list of strings or JSON array string
2111 | - Provides detailed, sequential instructions for the automation workflow
2112 | - Each step should be a clear, actionable instruction
2113 | - Examples as list:
2114 | * ['Click the search button', 'Enter "laptops" in the search box', 'Press Enter', 'Wait for results to load', 'Extract product information']
2115 | * ['Fill in email field with [email protected]', 'Fill in password field', 'Click login button', 'Navigate to profile page']
2116 | - Examples as JSON string:
2117 | * '["Open navigation menu", "Click on Products", "Select category filters", "Extract all product data"]'
2118 | - Best practices:
2119 | * Break complex actions into simple steps
2120 | * Be specific about UI elements (button text, field names, etc.)
2121 | * Include waiting/loading steps when necessary
2122 | * Specify extraction points clearly
2123 | * Order steps logically for the workflow
2124 |
2125 | ai_extraction (Optional[bool]): Enable AI-powered extraction mode for intelligent data parsing.
2126 | - Default: true (recommended for most use cases)
2127 | - Options:
2128 | * true: Uses advanced AI to intelligently extract and structure data
2129 | - Better at handling complex page layouts
2130 | - Can adapt to different content structures
2131 | - Provides more accurate data extraction
2132 | - Recommended for most scenarios
2133 | * false: Uses simpler extraction methods
2134 | - Faster processing but less intelligent
2135 | - May miss complex or nested data
2136 | - Use when speed is more important than accuracy
2137 | - Performance impact:
2138 | * true: Higher processing time but better results
2139 | * false: Faster execution but potentially less accurate extraction
2140 |
2141 | persistent_session (Optional[bool]): Maintain session state between steps.
2142 | - Default: false (each step starts fresh)
2143 | - Options:
2144 | * true: Keeps cookies, login state, and session data between steps
2145 | - Essential for authenticated workflows
2146 | - Maintains shopping cart contents, user preferences, etc.
2147 | - Required for multi-step processes that depend on previous actions
2148 | - Use for: Login flows, shopping processes, form wizards
2149 | * false: Each step starts with a clean session
2150 | - Faster and simpler for independent actions
2151 | - No state carried between steps
2152 | - Use for: Simple data extraction, public content scraping
2153 | - Examples when to use true:
2154 | * Login → Navigate to protected area → Extract data
2155 | * Add items to cart → Proceed to checkout → Extract order details
2156 | * Multi-step form completion with session dependencies
2157 |
2158 | timeout_seconds (Optional[float]): Maximum time to wait for the entire workflow.
2159 | - Default: 120 seconds (2 minutes)
2160 | - Recommended ranges:
2161 | * 60-120: Simple workflows (2-5 steps)
2162 | * 180-300: Medium complexity (5-10 steps)
2163 | * 300-600: Complex workflows (10+ steps or slow sites)
2164 | * 600+: Very complex or slow-loading workflows
2165 | - Considerations:
2166 | * Include time for page loads, form submissions, and processing
2167 | * Factor in network latency and site response times
2168 | * Allow extra time for AI processing and extraction
2169 | * Balance between thoroughness and efficiency
2170 | - Examples:
2171 | * 60.0: Quick single-page data extraction
2172 | * 180.0: Multi-step form filling and submission
2173 | * 300.0: Complex navigation and comprehensive data extraction
2174 | * 600.0: Extensive workflows with multiple page interactions
2175 |
2176 | Returns:
2177 | Dictionary containing:
2178 | - extracted_data: The structured data matching your prompt and optional schema
2179 | - workflow_log: Detailed log of all actions performed by the agent
2180 | - pages_visited: List of URLs visited during the workflow
2181 | - actions_performed: Summary of interactions (clicks, form fills, navigations)
2182 | - execution_time: Total time taken for the workflow
2183 | - steps_completed: Number of steps successfully executed
2184 | - final_page_url: The URL where the workflow ended
2185 | - session_data: Session information if persistent_session was enabled
2186 | - credits_used: Number of credits consumed (varies by complexity)
2187 | - status: Success/failure status with any error details
2188 |
2189 | Raises:
2190 | ValueError: If URL is malformed or required parameters are missing
2191 | TimeoutError: If the workflow exceeds the specified timeout
2192 | NavigationError: If the agent cannot navigate to required pages
2193 | InteractionError: If the agent cannot interact with specified elements
2194 | ExtractionError: If data extraction fails or returns invalid results
2195 |
2196 | Use Cases:
2197 | - Automated form filling and submission
2198 | - Multi-step checkout processes
2199 | - Login-protected content extraction
2200 | - Interactive search and filtering workflows
2201 | - Complex navigation scenarios requiring user simulation
2202 | - Data collection from dynamic, JavaScript-heavy applications
2203 |
2204 | Best Practices:
2205 | - Start with simple workflows and gradually increase complexity
2206 | - Use specific element identifiers in steps (button text, field labels)
2207 | - Include appropriate wait times for page loads and dynamic content
2208 | - Test with persistent_session=true for authentication-dependent workflows
2209 | - Set realistic timeouts based on workflow complexity
2210 | - Provide clear, sequential steps that build on each other
2211 | - Use output_schema to ensure consistent data structure
2212 |
2213 | Note:
2214 | - This tool can perform actions on websites (non-read-only)
2215 | - Results may vary between runs due to dynamic content (non-idempotent)
2216 | - Credit cost varies based on workflow complexity and execution time
2217 | - Some websites may have anti-automation measures that could affect success
2218 | - Consider using simpler tools (smartscraper, markdownify) for basic extraction needs
2219 | """
2220 | # Normalize inputs to handle flexible formats from different MCP clients
2221 | normalized_steps: Optional[List[str]] = None
2222 | if isinstance(steps, list):
2223 | normalized_steps = steps
2224 | elif isinstance(steps, str):
2225 | parsed_steps: Optional[Any] = None
2226 | try:
2227 | parsed_steps = json.loads(steps)
2228 | except json.JSONDecodeError:
2229 | parsed_steps = None
2230 | if isinstance(parsed_steps, list):
2231 | normalized_steps = parsed_steps
2232 | else:
2233 | normalized_steps = [steps]
2234 |
2235 | normalized_schema: Optional[Dict[str, Any]] = None
2236 | if isinstance(output_schema, dict):
2237 | normalized_schema = output_schema
2238 | elif isinstance(output_schema, str):
2239 | try:
2240 | parsed_schema = json.loads(output_schema)
2241 | if isinstance(parsed_schema, dict):
2242 | normalized_schema = parsed_schema
2243 | else:
2244 | return {"error": "output_schema must be a JSON object"}
2245 | except json.JSONDecodeError as e:
2246 | return {"error": f"Invalid JSON for output_schema: {str(e)}"}
2247 |
2248 | try:
2249 | api_key = get_api_key(ctx)
2250 | client = ScapeGraphClient(api_key)
2251 | return client.agentic_scrapper(
2252 | url=url,
2253 | user_prompt=user_prompt,
2254 | output_schema=normalized_schema,
2255 | steps=normalized_steps,
2256 | ai_extraction=ai_extraction,
2257 | persistent_session=persistent_session,
2258 | timeout_seconds=timeout_seconds,
2259 | )
2260 | except httpx.TimeoutException as timeout_err:
2261 | return {"error": f"Request timed out: {str(timeout_err)}"}
2262 | except httpx.HTTPError as http_err:
2263 | return {"error": str(http_err)}
2264 | except ValueError as val_err:
2265 | return {"error": str(val_err)}
2266 |
2267 |
2268 | # Smithery server creation function
2269 | @smithery.server(config_schema=ConfigSchema)
2270 | def create_server() -> FastMCP:
2271 | """
2272 | Create and return the FastMCP server instance for Smithery deployment.
2273 |
2274 | Returns:
2275 | Configured FastMCP server instance
2276 | """
2277 | return mcp
2278 |
2279 |
2280 | def main() -> None:
2281 | """Run the ScapeGraph MCP server."""
2282 | try:
2283 |         logger.info("Starting ScrapeGraph MCP server!")
2284 |         print("Starting ScrapeGraph MCP server!")
2285 | mcp.run(transport="stdio")
2286 | except Exception as e:
2287 | logger.error(f"Failed to start MCP server: {e}")
2288 | print(f"Error starting server: {e}")
2289 | raise
2290 |
2291 |
2292 | if __name__ == "__main__":
2293 | main()
```