# Directory Structure ``` ├── .env.example ├── .eslintignore ├── .eslintrc.json ├── .github │ └── workflows │ └── ci.yml ├── .gitignore ├── .prettierrc ├── Dockerfile ├── jest.config.js ├── jest.setup.js ├── openapi.yml ├── package-lock.json ├── package.json ├── README.md ├── smithery.yaml └── src ├── index.js └── index.test.js ``` # Files -------------------------------------------------------------------------------- /.eslintignore: -------------------------------------------------------------------------------- ``` 1 | **/*.test.ts 2 | **/*.test.js 3 | node_modules 4 | dist 5 | jest.setup.ts 6 | jest.config.js ``` -------------------------------------------------------------------------------- /.prettierrc: -------------------------------------------------------------------------------- ``` 1 | { 2 | "semi": true, 3 | "singleQuote": true, 4 | "tabWidth": 2, 5 | "trailingComma": "es5", 6 | "printWidth": 80, 7 | "endOfLine": "auto" 8 | } ``` -------------------------------------------------------------------------------- /.env.example: -------------------------------------------------------------------------------- ``` 1 | # Required: Your WebScraping.AI API key 2 | WEBSCRAPING_AI_API_KEY=your_api_key_here 3 | 4 | # Optional: Maximum number of concurrent requests (default: 5) 5 | WEBSCRAPING_AI_CONCURRENCY_LIMIT=5 6 | ``` -------------------------------------------------------------------------------- /.eslintrc.json: -------------------------------------------------------------------------------- ```json 1 | { 2 | "env": { 3 | "es2021": true, 4 | "node": true, 5 | "jest": true 6 | }, 7 | "extends": [ 8 | "eslint:recommended", 9 | "prettier" 10 | ], 11 | "parserOptions": { 12 | "ecmaVersion": "latest", 13 | "sourceType": "module" 14 | }, 15 | "rules": { 16 | "no-unused-vars": "warn", 17 | "no-console": "off" 18 | } 19 | } ``` -------------------------------------------------------------------------------- /.gitignore: 
-------------------------------------------------------------------------------- ``` 1 | # Dependency directories 2 | node_modules/ 3 | 4 | # Environment variables 5 | .env 6 | .env.local 7 | .env.development.local 8 | .env.test.local 9 | .env.production.local 10 | 11 | # Logs 12 | logs 13 | *.log 14 | npm-debug.log* 15 | yarn-debug.log* 16 | yarn-error.log* 17 | 18 | # Coverage directory used by tools like istanbul 19 | coverage/ 20 | 21 | # Editor directories and files 22 | .idea/ 23 | .vscode/ 24 | *.swp 25 | *.swo 26 | 27 | # OS specific 28 | .DS_Store 29 | Thumbs.db 30 | ``` -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- ```markdown 1 | # WebScraping.AI MCP Server 2 | 3 | A Model Context Protocol (MCP) server implementation that integrates with [WebScraping.AI](https://webscraping.ai) for web data extraction capabilities. 4 | 5 | ## Features 6 | 7 | - Question answering about web page content 8 | - Structured data extraction from web pages 9 | - HTML content retrieval with JavaScript rendering 10 | - Plain text extraction from web pages 11 | - CSS selector-based content extraction 12 | - Multiple proxy types (datacenter, residential) with country selection 13 | - JavaScript rendering using headless Chrome/Chromium 14 | - Concurrent request management with rate limiting 15 | - Custom JavaScript execution on target pages 16 | - Device emulation (desktop, mobile, tablet) 17 | - Account usage monitoring 18 | 19 | ## Installation 20 | 21 | ### Running with npx 22 | 23 | ```bash 24 | env WEBSCRAPING_AI_API_KEY=your_api_key npx -y webscraping-ai-mcp 25 | ``` 26 | 27 | ### Manual Installation 28 | 29 | ```bash 30 | # Clone the repository 31 | git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git 32 | cd webscraping-ai-mcp-server 33 | 34 | # Install dependencies 35 | npm install 36 | 37 | # Run 38 | npm start 39 | ``` 40 | 
41 | ### Configuring in Cursor 42 | Note: Requires Cursor version 0.45.6+ 43 | 44 | The WebScraping.AI MCP server can be configured in two ways in Cursor: 45 | 46 | 1. **Project-specific Configuration** (recommended for team projects): 47 | Create a `.cursor/mcp.json` file in your project directory: 48 | ```json 49 | { 50 | "servers": { 51 | "webscraping-ai": { 52 | "type": "command", 53 | "command": "npx -y webscraping-ai-mcp", 54 | "env": { 55 | "WEBSCRAPING_AI_API_KEY": "your-api-key", 56 | "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "5" 57 | } 58 | } 59 | } 60 | } 61 | ``` 62 | 63 | 2. **Global Configuration** (for personal use across all projects): 64 | Create a `~/.cursor/mcp.json` file in your home directory with the same configuration format as above. 65 | 66 | > If you are using Windows and are running into issues, try using `cmd /c "set WEBSCRAPING_AI_API_KEY=your-api-key && npx -y webscraping-ai-mcp"` as the command. 67 | 68 | This configuration will make the WebScraping.AI tools available to Cursor's AI agent automatically when relevant for web scraping tasks. 
69 | 70 | ### Running on Claude Desktop 71 | 72 | Add this to your `claude_desktop_config.json`: 73 | 74 | ```json 75 | { 76 | "mcpServers": { 77 | "mcp-server-webscraping-ai": { 78 | "command": "npx", 79 | "args": ["-y", "webscraping-ai-mcp"], 80 | "env": { 81 | "WEBSCRAPING_AI_API_KEY": "YOUR_API_KEY_HERE", 82 | "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "5" 83 | } 84 | } 85 | } 86 | } 87 | ``` 88 | 89 | ## Configuration 90 | 91 | ### Environment Variables 92 | 93 | #### Required 94 | 95 | - `WEBSCRAPING_AI_API_KEY`: Your WebScraping.AI API key 96 | - Required for all operations 97 | - Get your API key from [WebScraping.AI](https://webscraping.ai) 98 | 99 | #### Optional Configuration 100 | - `WEBSCRAPING_AI_CONCURRENCY_LIMIT`: Maximum number of concurrent requests (default: `5`) 101 | - `WEBSCRAPING_AI_DEFAULT_PROXY_TYPE`: Type of proxy to use (default: `residential`) 102 | - `WEBSCRAPING_AI_DEFAULT_JS_RENDERING`: Enable/disable JavaScript rendering (default: `true`) 103 | - `WEBSCRAPING_AI_DEFAULT_TIMEOUT`: Maximum web page retrieval time in ms (default: `15000`, max: `30000`) 104 | - `WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT`: Maximum JavaScript rendering time in ms (default: `2000`) 105 | 106 | ### Configuration Examples 107 | 108 | For standard usage: 109 | ```bash 110 | # Required 111 | export WEBSCRAPING_AI_API_KEY=your-api-key 112 | 113 | # Optional - customize behavior (default values) 114 | export WEBSCRAPING_AI_CONCURRENCY_LIMIT=5 115 | export WEBSCRAPING_AI_DEFAULT_PROXY_TYPE=residential # datacenter or residential 116 | export WEBSCRAPING_AI_DEFAULT_JS_RENDERING=true 117 | export WEBSCRAPING_AI_DEFAULT_TIMEOUT=15000 118 | export WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT=2000 119 | ``` 120 | 121 | ## Available Tools 122 | 123 | ### 1. Question Tool (`webscraping_ai_question`) 124 | 125 | Ask questions about web page content. 
126 | 127 | ```json 128 | { 129 | "name": "webscraping_ai_question", 130 | "arguments": { 131 | "url": "https://example.com", 132 | "question": "What is the main topic of this page?", 133 | "timeout": 30000, 134 | "js": true, 135 | "js_timeout": 2000, 136 | "wait_for": ".content-loaded", 137 | "proxy": "datacenter", 138 | "country": "us" 139 | } 140 | } 141 | ``` 142 | 143 | Example response: 144 | 145 | ```json 146 | { 147 | "content": [ 148 | { 149 | "type": "text", 150 | "text": "The main topic of this page is examples and documentation for HTML and web standards." 151 | } 152 | ], 153 | "isError": false 154 | } 155 | ``` 156 | 157 | ### 2. Fields Tool (`webscraping_ai_fields`) 158 | 159 | Extract structured data from web pages based on instructions. 160 | 161 | ```json 162 | { 163 | "name": "webscraping_ai_fields", 164 | "arguments": { 165 | "url": "https://example.com/product", 166 | "fields": { 167 | "title": "Extract the product title", 168 | "price": "Extract the product price", 169 | "description": "Extract the product description" 170 | }, 171 | "js": true, 172 | "timeout": 30000 173 | } 174 | } 175 | ``` 176 | 177 | Example response: 178 | 179 | ```json 180 | { 181 | "content": [ 182 | { 183 | "type": "text", 184 | "text": { 185 | "title": "Example Product", 186 | "price": "$99.99", 187 | "description": "This is an example product description." 188 | } 189 | } 190 | ], 191 | "isError": false 192 | } 193 | ``` 194 | 195 | ### 3. HTML Tool (`webscraping_ai_html`) 196 | 197 | Get the full HTML of a web page with JavaScript rendering. 
198 | 199 | ```json 200 | { 201 | "name": "webscraping_ai_html", 202 | "arguments": { 203 | "url": "https://example.com", 204 | "js": true, 205 | "timeout": 30000, 206 | "wait_for": "#content-loaded" 207 | } 208 | } 209 | ``` 210 | 211 | Example response: 212 | 213 | ```json 214 | { 215 | "content": [ 216 | { 217 | "type": "text", 218 | "text": "<html>...[full HTML content]...</html>" 219 | } 220 | ], 221 | "isError": false 222 | } 223 | ``` 224 | 225 | ### 4. Text Tool (`webscraping_ai_text`) 226 | 227 | Extract the visible text content from a web page. 228 | 229 | ```json 230 | { 231 | "name": "webscraping_ai_text", 232 | "arguments": { 233 | "url": "https://example.com", 234 | "js": true, 235 | "timeout": 30000 236 | } 237 | } 238 | ``` 239 | 240 | Example response: 241 | 242 | ```json 243 | { 244 | "content": [ 245 | { 246 | "type": "text", 247 | "text": "Example Domain\nThis domain is for use in illustrative examples in documents..." 248 | } 249 | ], 250 | "isError": false 251 | } 252 | ``` 253 | 254 | ### 5. Selected Tool (`webscraping_ai_selected`) 255 | 256 | Extract content from a specific element using a CSS selector. 257 | 258 | ```json 259 | { 260 | "name": "webscraping_ai_selected", 261 | "arguments": { 262 | "url": "https://example.com", 263 | "selector": "div.main-content", 264 | "js": true, 265 | "timeout": 30000 266 | } 267 | } 268 | ``` 269 | 270 | Example response: 271 | 272 | ```json 273 | { 274 | "content": [ 275 | { 276 | "type": "text", 277 | "text": "<div class=\"main-content\">This is the main content of the page.</div>" 278 | } 279 | ], 280 | "isError": false 281 | } 282 | ``` 283 | 284 | ### 6. Selected Multiple Tool (`webscraping_ai_selected_multiple`) 285 | 286 | Extract content from multiple elements using CSS selectors. 
287 | 288 | ```json 289 | { 290 | "name": "webscraping_ai_selected_multiple", 291 | "arguments": { 292 | "url": "https://example.com", 293 | "selectors": ["div.header", "div.product-list", "div.footer"], 294 | "js": true, 295 | "timeout": 30000 296 | } 297 | } 298 | ``` 299 | 300 | Example response: 301 | 302 | ```json 303 | { 304 | "content": [ 305 | { 306 | "type": "text", 307 | "text": [ 308 | "<div class=\"header\">Header content</div>", 309 | "<div class=\"product-list\">Product list content</div>", 310 | "<div class=\"footer\">Footer content</div>" 311 | ] 312 | } 313 | ], 314 | "isError": false 315 | } 316 | ``` 317 | 318 | ### 7. Account Tool (`webscraping_ai_account`) 319 | 320 | Get information about your WebScraping.AI account. 321 | 322 | ```json 323 | { 324 | "name": "webscraping_ai_account", 325 | "arguments": {} 326 | } 327 | ``` 328 | 329 | Example response: 330 | 331 | ```json 332 | { 333 | "content": [ 334 | { 335 | "type": "text", 336 | "text": { 337 | "requests": 5000, 338 | "remaining": 4500, 339 | "limit": 10000, 340 | "resets_at": "2023-12-31T23:59:59Z" 341 | } 342 | } 343 | ], 344 | "isError": false 345 | } 346 | ``` 347 | 348 | ## Common Options for All Tools 349 | 350 | The following options can be used with all scraping tools: 351 | 352 | - `timeout`: Maximum web page retrieval time in ms (15000 by default, maximum is 30000) 353 | - `js`: Execute on-page JavaScript using a headless browser (true by default) 354 | - `js_timeout`: Maximum JavaScript rendering time in ms (2000 by default) 355 | - `wait_for`: CSS selector to wait for before returning the page content 356 | - `proxy`: Type of proxy, datacenter or residential (residential by default) 357 | - `country`: Country of the proxy to use (US by default). Supported countries: us, gb, de, it, fr, ca, es, ru, jp, kr, in 358 | - `custom_proxy`: Your own proxy URL in "http://user:password@host:port" format 359 | - `device`: Type of device emulation. 
Supported values: desktop, mobile, tablet 360 | - `error_on_404`: Return error on 404 HTTP status on the target page (false by default) 361 | - `error_on_redirect`: Return error on redirect on the target page (false by default) 362 | - `js_script`: Custom JavaScript code to execute on the target page 363 | 364 | ## Error Handling 365 | 366 | The server provides robust error handling: 367 | 368 | - Automatic retries for transient errors 369 | - Rate limit handling with backoff 370 | - Detailed error messages 371 | - Network resilience 372 | 373 | Example error response: 374 | 375 | ```json 376 | { 377 | "content": [ 378 | { 379 | "type": "text", 380 | "text": "API Error: 429 Too Many Requests" 381 | } 382 | ], 383 | "isError": true 384 | } 385 | ``` 386 | 387 | ## Integration with LLMs 388 | 389 | This server implements the [Model Context Protocol](https://modelcontextprotocol.io), making it compatible with any MCP-enabled LLM platform. You can configure your LLM to use these tools for web scraping tasks. 
390 | 391 | ### Example: Configuring Claude with MCP 392 | 393 | ```javascript 394 | import Anthropic from '@anthropic-ai/sdk'; 395 | import { Client } from '@modelcontextprotocol/sdk/client/index.js'; 396 | import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js'; 397 | 398 | const claude = new Anthropic({ 399 | apiKey: process.env.ANTHROPIC_API_KEY 400 | }); 401 | 402 | const transport = new StdioClientTransport({ 403 | command: 'npx', 404 | args: ['-y', 'webscraping-ai-mcp'], 405 | env: { 406 | WEBSCRAPING_AI_API_KEY: 'your-api-key' 407 | } 408 | }); 409 | 410 | const client = new Client({ 411 | name: 'claude-client', 412 | version: '1.0.0' 413 | }); 414 | 415 | await client.connect(transport); 416 | 417 | // List the WebScraping.AI tools and pass them to Claude (ANTHROPIC_MODEL is an assumed env var holding a model ID) 418 | const { tools } = await client.listTools(); 419 | const response = await claude.messages.create({ 420 | model: process.env.ANTHROPIC_MODEL, max_tokens: 1024, messages: [{ role: 'user', content: 'What is the main topic of example.com?' }], 421 | tools: tools.map(({ name, description, inputSchema }) => ({ name, description, input_schema: inputSchema })) 422 | }); 423 | ``` 424 | 425 | ## Development 426 | 427 | ```bash 428 | # Clone the repository 429 | git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git 430 | cd webscraping-ai-mcp-server 431 | 432 | # Install dependencies 433 | npm install 434 | 435 | # Run tests 436 | npm test 437 | 438 | # Add your .env file 439 | cp .env.example .env 440 | 441 | # Start the inspector 442 | npx @modelcontextprotocol/inspector node src/index.js 443 | ``` 444 | 445 | ### Contributing 446 | 447 | 1. Fork the repository 448 | 2. Create your feature branch 449 | 3. Run tests: `npm test` 450 | 4. 
Submit a pull request 451 | 452 | ## License 453 | 454 | MIT License - see LICENSE file for details 455 | ``` -------------------------------------------------------------------------------- /jest.config.js: -------------------------------------------------------------------------------- ```javascript 1 | /** @type {import('jest').Config} */ 2 | export default { 3 | testEnvironment: 'node', 4 | transform: {}, 5 | moduleNameMapper: { 6 | '^(\\.{1,2}/.*)\\.js$': '$1', 7 | }, 8 | setupFilesAfterEnv: ['./jest.setup.js'], 9 | testMatch: ['**/*.test.js'], 10 | }; ``` -------------------------------------------------------------------------------- /jest.setup.js: -------------------------------------------------------------------------------- ```javascript 1 | import { jest } from '@jest/globals'; 2 | 3 | // Mock console methods to suppress output during tests 4 | global.console = { 5 | ...console, 6 | log: jest.fn(), 7 | debug: jest.fn(), 8 | info: jest.fn(), 9 | warn: jest.fn(), 10 | error: jest.fn(), 11 | }; 12 | 13 | // Add any additional global test setup here ``` -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- ```dockerfile 1 | FROM node:18-alpine 2 | 3 | WORKDIR /app 4 | 5 | # Copy package.json and package-lock.json 6 | COPY package*.json ./ 7 | 8 | # Install only production dependencies 9 | RUN npm ci --omit=dev 10 | 11 | # Copy source files 12 | COPY . . 
13 | 14 | # Set environment variables 15 | ENV NODE_ENV=production 16 | 17 | # Command to run the application 18 | ENTRYPOINT ["node", "src/index.js"] 19 | 20 | # Set default arguments 21 | CMD [] 22 | 23 | # Document that the service uses stdin/stdout for communication 24 | LABEL org.opencontainers.image.description="WebScraping.AI MCP Server - Model Context Protocol server for WebScraping.AI API" 25 | LABEL org.opencontainers.image.source="https://github.com/webscraping-ai/webscraping-ai-mcp-server" 26 | LABEL org.opencontainers.image.licenses="MIT" 27 | ``` -------------------------------------------------------------------------------- /.github/workflows/ci.yml: -------------------------------------------------------------------------------- ```yaml 1 | name: CI 2 | 3 | on: 4 | push: 5 | branches: [master] 6 | pull_request: 7 | branches: [master] 8 | 9 | jobs: 10 | test: 11 | runs-on: ubuntu-latest 12 | 13 | strategy: 14 | matrix: 15 | node-version: [18.x, 20.x] 16 | 17 | steps: 18 | - uses: actions/checkout@v3 19 | 20 | - name: Use Node.js ${{ matrix.node-version }} 21 | uses: actions/setup-node@v3 22 | with: 23 | node-version: ${{ matrix.node-version }} 24 | cache: 'npm' 25 | 26 | - name: Install dependencies 27 | run: npm ci 28 | 29 | - name: Lint 30 | run: npm run lint 31 | 32 | - name: Test 33 | run: npm test 34 | env: 35 | WEBSCRAPING_AI_API_KEY: ${{ secrets.WEBSCRAPING_AI_API_KEY || 'test-api-key' }} 36 | ``` -------------------------------------------------------------------------------- /smithery.yaml: -------------------------------------------------------------------------------- ```yaml 1 | # Smithery configuration file: https://smithery.ai/docs/config#smitheryyaml 2 | 3 | startCommand: 4 | type: stdio 5 | configSchema: 6 | # JSON Schema defining the configuration options for the MCP. 7 | type: object 8 | required: 9 | - webscrapingAiApiKey 10 | properties: 11 | webscrapingAiApiKey: 12 | type: string 13 | description: Your WebScraping.AI API key. 
Required for API usage. 14 | webscrapingAiApiUrl: 15 | type: string 16 | description: Custom API endpoint. Default is https://api.webscraping.ai. 17 | webscrapingAiConcurrencyLimit: 18 | type: integer 19 | description: Maximum concurrent requests allowed (default 5). 20 | commandFunction: 21 | # A function that produces the CLI command to start the MCP on stdio. 22 | |- 23 | (config) => ({ 24 | command: 'node', 25 | args: ['src/index.js'], 26 | env: { 27 | WEBSCRAPING_AI_API_KEY: config.webscrapingAiApiKey, 28 | WEBSCRAPING_AI_CONCURRENCY_LIMIT: String(config.webscrapingAiConcurrencyLimit || 5) 29 | } 30 | }) 31 | ``` -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- ```json 1 | { 2 | "name": "webscraping-ai-mcp", 3 | "version": "1.0.2", 4 | "description": "Model Context Protocol server for WebScraping.AI API. Provides LLM-powered web scraping tools with Chromium JavaScript rendering, rotating proxies, and HTML parsing.", 5 | "type": "module", 6 | "bin": { 7 | "webscraping-ai-mcp": "src/index.js" 8 | }, 9 | "files": [ 10 | "src" 11 | ], 12 | "scripts": { 13 | "test": "node --experimental-vm-modules node_modules/jest/bin/jest.js", 14 | "start": "node src/index.js", 15 | "lint": "eslint src/**/*.js", 16 | "lint:fix": "eslint src/**/*.js --fix", 17 | "format": "prettier --write ." 
18 | }, 19 | "license": "MIT", 20 | "dependencies": { 21 | "@modelcontextprotocol/sdk": "^1.4.1", 22 | "axios": "^1.6.7", 23 | "dotenv": "^16.4.7", 24 | "p-queue": "^8.0.1" 25 | }, 26 | "devDependencies": { 27 | "@jest/globals": "^29.7.0", 28 | "eslint": "^8.56.0", 29 | "eslint-config-prettier": "^9.1.0", 30 | "jest": "^29.7.0", 31 | "jest-mock-extended": "^4.0.0-beta1", 32 | "prettier": "^3.1.1" 33 | }, 34 | "engines": { 35 | "node": ">=18.0.0" 36 | }, 37 | "keywords": [ 38 | "mcp", 39 | "webscraping", 40 | "web-scraping", 41 | "crawler", 42 | "content-extraction", 43 | "llm" 44 | ], 45 | "main": "src/index.js", 46 | "repository": { 47 | "type": "git", 48 | "url": "git+https://github.com/webscraping-ai/webscraping-ai-mcp-server.git" 49 | }, 50 | "author": "WebScraping.AI", 51 | "bugs": { 52 | "url": "https://github.com/webscraping-ai/webscraping-ai-mcp-server/issues" 53 | }, 54 | "homepage": "https://github.com/webscraping-ai/webscraping-ai-mcp-server#readme" 55 | } 56 | ``` -------------------------------------------------------------------------------- /src/index.js: -------------------------------------------------------------------------------- ```javascript 1 | #!/usr/bin/env node 2 | 3 | import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'; 4 | import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'; 5 | import { z } from 'zod'; 6 | import axios from 'axios'; 7 | import dotenv from 'dotenv'; 8 | import PQueue from 'p-queue'; 9 | 10 | dotenv.config(); 11 | 12 | // Environment variables 13 | const WEBSCRAPING_AI_API_KEY = process.env.WEBSCRAPING_AI_API_KEY || ''; 14 | const WEBSCRAPING_AI_API_URL = 'https://api.webscraping.ai'; 15 | const CONCURRENCY_LIMIT = Number(process.env.WEBSCRAPING_AI_CONCURRENCY_LIMIT || 5); 16 | const DEFAULT_PROXY_TYPE = process.env.WEBSCRAPING_AI_DEFAULT_PROXY_TYPE || 'residential'; 17 | const DEFAULT_JS_RENDERING = process.env.WEBSCRAPING_AI_DEFAULT_JS_RENDERING !== 'false'; 18 | const 
DEFAULT_TIMEOUT = Number(process.env.WEBSCRAPING_AI_DEFAULT_TIMEOUT || 15000); 19 | const DEFAULT_JS_TIMEOUT = Number(process.env.WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT || 2000); 20 | 21 | // Validate required environment variables 22 | if (!WEBSCRAPING_AI_API_KEY) { 23 | console.error('WEBSCRAPING_AI_API_KEY environment variable is required'); 24 | process.exit(1); 25 | } 26 | 27 | class WebScrapingAIClient { 28 | constructor(options = {}) { 29 | const apiKey = options.apiKey || WEBSCRAPING_AI_API_KEY; 30 | const baseUrl = options.baseUrl || WEBSCRAPING_AI_API_URL; 31 | const timeout = options.timeout || 60000; 32 | const concurrency = options.concurrency || CONCURRENCY_LIMIT; 33 | 34 | if (!apiKey) { 35 | throw new Error('WebScraping.AI API key is required'); 36 | } 37 | 38 | this.client = axios.create({ 39 | baseURL: baseUrl, 40 | timeout: timeout, 41 | headers: { 42 | 'Content-Type': 'application/json', 43 | 'Accept': 'application/json', 44 | } 45 | }); 46 | 47 | this.queue = new PQueue({ concurrency }); 48 | this.apiKey = apiKey; 49 | } 50 | 51 | async request(endpoint, params) { 52 | try { 53 | return await this.queue.add(async () => { 54 | const response = await this.client.get(endpoint, { 55 | params: { 56 | ...params, 57 | api_key: this.apiKey, 58 | from_mcp_server: true 59 | } 60 | }); 61 | return response.data; 62 | }); 63 | } catch (error) { 64 | const errorResponse = { 65 | message: 'API Error', 66 | status_code: error.response?.status, 67 | status_message: error.response?.statusText, 68 | body: error.response?.data 69 | }; 70 | throw new Error(JSON.stringify(errorResponse)); 71 | } 72 | } 73 | 74 | async question(url, question, options = {}) { 75 | return this.request('/ai/question', { 76 | url, 77 | question, 78 | ...options 79 | }); 80 | } 81 | 82 | async fields(url, fields, options = {}) { 83 | return this.request('/ai/fields', { 84 | url, 85 | fields: JSON.stringify(fields), 86 | ...options 87 | }); 88 | } 89 | 90 | async html(url, options = {}) { 91 | 
return this.request('/html', { 92 | url, 93 | ...options 94 | }); 95 | } 96 | 97 | async text(url, options = {}) { 98 | return this.request('/text', { 99 | url, 100 | ...options 101 | }); 102 | } 103 | 104 | async selected(url, selector, options = {}) { 105 | return this.request('/selected', { 106 | url, 107 | selector, 108 | ...options 109 | }); 110 | } 111 | 112 | async selectedMultiple(url, selectors, options = {}) { 113 | return this.request('/selected-multiple', { 114 | url, 115 | selectors, 116 | ...options 117 | }); 118 | } 119 | 120 | async account() { 121 | return this.request('/account', {}); 122 | } 123 | } 124 | 125 | // Create WebScrapingAI client 126 | const client = new WebScrapingAIClient(); 127 | 128 | // Create MCP server 129 | const server = new McpServer({ 130 | name: 'WebScraping.AI MCP Server', 131 | version: '1.0.2' 132 | }); 133 | 134 | // Common options schema for all tools 135 | const commonOptionsSchema = { 136 | timeout: z.number().optional().default(DEFAULT_TIMEOUT).describe(`Maximum web page retrieval time in ms (${DEFAULT_TIMEOUT} by default, maximum is 30000).`), 137 | js: z.boolean().optional().default(DEFAULT_JS_RENDERING).describe(`Execute on-page JavaScript using a headless browser (${DEFAULT_JS_RENDERING} by default).`), 138 | js_timeout: z.number().optional().default(DEFAULT_JS_TIMEOUT).describe(`Maximum JavaScript rendering time in ms (${DEFAULT_JS_TIMEOUT} by default).`), 139 | wait_for: z.string().optional().describe('CSS selector to wait for before returning the page content.'), 140 | proxy: z.enum(['datacenter', 'residential']).optional().default(DEFAULT_PROXY_TYPE).describe(`Type of proxy, datacenter or residential (${DEFAULT_PROXY_TYPE} by default).`), 141 | country: z.enum(['us', 'gb', 'de', 'it', 'fr', 'ca', 'es', 'ru', 'jp', 'kr', 'in']).optional().describe('Country of the proxy to use (US by default).'), 142 | custom_proxy: z.string().optional().describe('Your own proxy URL in "http://user:password@host:port" 
format.'), 143 | device: z.enum(['desktop', 'mobile', 'tablet']).optional().describe('Type of device emulation.'), 144 | error_on_404: z.boolean().optional().describe('Return error on 404 HTTP status on the target page (false by default).'), 145 | error_on_redirect: z.boolean().optional().describe('Return error on redirect on the target page (false by default).'), 146 | js_script: z.string().optional().describe('Custom JavaScript code to execute on the target page.') 147 | }; 148 | 149 | // Define and register tools 150 | server.tool( 151 | 'webscraping_ai_question', 152 | { 153 | url: z.string().describe('URL of the target page.'), 154 | question: z.string().describe('Question or instructions to ask the LLM model about the target page.'), 155 | ...commonOptionsSchema 156 | }, 157 | async ({ url, question, ...options }) => { 158 | try { 159 | const result = await client.question(url, question, options); 160 | return { 161 | content: [{ type: 'text', text: result }] 162 | }; 163 | } catch (error) { 164 | return { 165 | content: [{ type: 'text', text: error.message }], 166 | isError: true 167 | }; 168 | } 169 | } 170 | ); 171 | 172 | server.tool( 173 | 'webscraping_ai_fields', 174 | { 175 | url: z.string().describe('URL of the target page.'), 176 | fields: z.record(z.string()).describe('Dictionary of field names with instructions for extraction.'), 177 | ...commonOptionsSchema 178 | }, 179 | async ({ url, fields, ...options }) => { 180 | try { 181 | const result = await client.fields(url, fields, options); 182 | return { 183 | content: [{ type: 'text', text: JSON.stringify(result, null, 2) }] 184 | }; 185 | } catch (error) { 186 | return { 187 | content: [{ type: 'text', text: error.message }], 188 | isError: true 189 | }; 190 | } 191 | } 192 | ); 193 | 194 | server.tool( 195 | 'webscraping_ai_html', 196 | { 197 | url: z.string().describe('URL of the target page.'), 198 | return_script_result: z.boolean().optional().describe('Return result of the custom JavaScript 
code execution.'), 199 | format: z.enum(['json', 'text']).optional().describe('Response format (json or text).'), 200 | ...commonOptionsSchema 201 | }, 202 | async ({ url, return_script_result, format, ...options }) => { 203 | try { 204 | const result = await client.html(url, { ...options, return_script_result }); 205 | if (format === 'json') { 206 | return { 207 | content: [{ type: 'text', text: JSON.stringify({ html: result }) }] 208 | }; 209 | } 210 | return { 211 | content: [{ type: 'text', text: result }] 212 | }; 213 | } catch (error) { 214 | const errorObj = JSON.parse(error.message); 215 | return { 216 | content: [{ type: 'text', text: JSON.stringify(errorObj) }], 217 | isError: true 218 | }; 219 | } 220 | } 221 | ); 222 | 223 | server.tool( 224 | 'webscraping_ai_text', 225 | { 226 | url: z.string().describe('URL of the target page.'), 227 | text_format: z.enum(['plain', 'xml', 'json']).optional().default('json').describe('Format of the text response.'), 228 | return_links: z.boolean().optional().describe('Return links from the page body text.'), 229 | ...commonOptionsSchema 230 | }, 231 | async ({ url, text_format, return_links, ...options }) => { 232 | try { 233 | const result = await client.text(url, { 234 | ...options, 235 | text_format, 236 | return_links 237 | }); 238 | return { 239 | content: [{ type: 'text', text: typeof result === 'object' ? 
JSON.stringify(result) : result }] 240 | }; 241 | } catch (error) { 242 | const errorObj = JSON.parse(error.message); 243 | return { 244 | content: [{ type: 'text', text: JSON.stringify(errorObj) }], 245 | isError: true 246 | }; 247 | } 248 | } 249 | ); 250 | 251 | server.tool( 252 | 'webscraping_ai_selected', 253 | { 254 | url: z.string().describe('URL of the target page.'), 255 | selector: z.string().describe('CSS selector to extract content for.'), 256 | format: z.enum(['json', 'text']).optional().default('json').describe('Response format (json or text).'), 257 | ...commonOptionsSchema 258 | }, 259 | async ({ url, selector, format, ...options }) => { 260 | try { 261 | const result = await client.selected(url, selector, options); 262 | if (format === 'json') { 263 | return { 264 | content: [{ type: 'text', text: JSON.stringify({ html: result }) }] 265 | }; 266 | } 267 | return { 268 | content: [{ type: 'text', text: result }] 269 | }; 270 | } catch (error) { 271 | const errorObj = JSON.parse(error.message); 272 | return { 273 | content: [{ type: 'text', text: JSON.stringify(errorObj) }], 274 | isError: true 275 | }; 276 | } 277 | } 278 | ); 279 | 280 | server.tool( 281 | 'webscraping_ai_selected_multiple', 282 | { 283 | url: z.string().describe('URL of the target page.'), 284 | selectors: z.array(z.string()).describe('Array of CSS selectors to extract content for.'), 285 | ...commonOptionsSchema 286 | }, 287 | async ({ url, selectors, ...options }) => { 288 | try { 289 | const result = await client.selectedMultiple(url, selectors, options); 290 | return { 291 | content: [{ type: 'text', text: JSON.stringify(result, null, 2) }] 292 | }; 293 | } catch (error) { 294 | return { 295 | content: [{ type: 'text', text: error.message }], 296 | isError: true 297 | }; 298 | } 299 | } 300 | ); 301 | 302 | server.tool( 303 | 'webscraping_ai_account', 304 | {}, 305 | async () => { 306 | try { 307 | const result = await client.account(); 308 | return { 309 | content: [{ type: 
'text', text: JSON.stringify(result, null, 2) }] 310 | }; 311 | } catch (error) { 312 | return { 313 | content: [{ type: 'text', text: error.message }], 314 | isError: true 315 | }; 316 | } 317 | } 318 | ); 319 | 320 | const transport = new StdioServerTransport(); 321 | server.connect(transport).then(() => { 322 | }).catch(err => { 323 | console.error('Failed to connect to transport:', err); 324 | process.exit(1); 325 | }); 326 | ``` -------------------------------------------------------------------------------- /src/index.test.js: -------------------------------------------------------------------------------- ```javascript 1 | import { 2 | describe, 3 | expect, 4 | jest, 5 | test, 6 | beforeEach, 7 | afterEach, 8 | } from '@jest/globals'; 9 | import { Client } from "@modelcontextprotocol/sdk/client/index.js"; 10 | import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js"; 11 | 12 | // Create mock WebScrapingAIClient 13 | class MockWebScrapingAIClient { 14 | constructor() { 15 | this.question = jest.fn().mockResolvedValue('This is the answer to your question.'); 16 | this.fields = jest.fn().mockResolvedValue({ field1: 'value1', field2: 'value2' }); 17 | this.html = jest.fn().mockResolvedValue('<html><body>Test HTML Content</body></html>'); 18 | this.text = jest.fn().mockResolvedValue('Test text content'); 19 | this.selected = jest.fn().mockResolvedValue('<div>Selected Element</div>'); 20 | this.selectedMultiple = jest.fn().mockResolvedValue(['<div>Element 1</div>', '<div>Element 2</div>']); 21 | this.account = jest.fn().mockResolvedValue({ requests: 100, remaining: 900, limit: 1000 }); 22 | } 23 | } 24 | 25 | // Test interfaces 26 | class RequestContext { 27 | constructor(toolName, args) { 28 | this.params = { 29 | name: toolName, 30 | arguments: args 31 | }; 32 | } 33 | } 34 | 35 | describe('WebScraping.AI MCP Server Tests', () => { 36 | let mockClient; 37 | let requestHandler; 38 | 39 | beforeEach(() => { 40 | jest.clearAllMocks(); 41 | 
mockClient = new MockWebScrapingAIClient(); 42 | 43 | // Create request handler function 44 | requestHandler = async (request) => { 45 | const { name: toolName, arguments: args } = request.params; 46 | if (!args && toolName !== 'webscraping_ai_account') { 47 | throw new Error('No arguments provided'); 48 | } 49 | return handleRequest(toolName, args || {}, mockClient); 50 | }; 51 | }); 52 | 53 | afterEach(() => { 54 | jest.clearAllMocks(); 55 | }); 56 | 57 | // Test question functionality 58 | test('should handle question request', async () => { 59 | const url = 'https://example.com'; 60 | const question = 'What is on this page?'; 61 | 62 | const response = await requestHandler( 63 | new RequestContext('webscraping_ai_question', { url, question }) 64 | ); 65 | 66 | expect(response).toEqual({ 67 | content: [{ type: 'text', text: 'This is the answer to your question.' }], 68 | isError: false 69 | }); 70 | expect(mockClient.question).toHaveBeenCalledWith(url, question, {}); 71 | }); 72 | 73 | // Test fields functionality 74 | test('should handle fields request', async () => { 75 | const url = 'https://example.com'; 76 | const fields = { 77 | title: 'Extract the title', 78 | price: 'Extract the price' 79 | }; 80 | 81 | const response = await requestHandler( 82 | new RequestContext('webscraping_ai_fields', { url, fields }) 83 | ); 84 | 85 | expect(response).toEqual({ 86 | content: [{ type: 'text', text: JSON.stringify({ field1: 'value1', field2: 'value2' }, null, 2) }], 87 | isError: false 88 | }); 89 | expect(mockClient.fields).toHaveBeenCalledWith(url, fields, {}); 90 | }); 91 | 92 | // Test html functionality 93 | test('should handle html request', async () => { 94 | const url = 'https://example.com'; 95 | 96 | const response = await requestHandler( 97 | new RequestContext('webscraping_ai_html', { url }) 98 | ); 99 | 100 | expect(response).toEqual({ 101 | content: [{ type: 'text', text: '<html><body>Test HTML Content</body></html>' }], 102 | isError: false 103 | }); 
104 | expect(mockClient.html).toHaveBeenCalledWith(url, {}); 105 | }); 106 | 107 | // Test text functionality 108 | test('should handle text request', async () => { 109 | const url = 'https://example.com'; 110 | 111 | const response = await requestHandler( 112 | new RequestContext('webscraping_ai_text', { url }) 113 | ); 114 | 115 | expect(response).toEqual({ 116 | content: [{ type: 'text', text: 'Test text content' }], 117 | isError: false 118 | }); 119 | expect(mockClient.text).toHaveBeenCalledWith(url, {}); 120 | }); 121 | 122 | // Test selected functionality 123 | test('should handle selected request', async () => { 124 | const url = 'https://example.com'; 125 | const selector = '.main-content'; 126 | 127 | const response = await requestHandler( 128 | new RequestContext('webscraping_ai_selected', { url, selector }) 129 | ); 130 | 131 | expect(response).toEqual({ 132 | content: [{ type: 'text', text: '<div>Selected Element</div>' }], 133 | isError: false 134 | }); 135 | expect(mockClient.selected).toHaveBeenCalledWith(url, selector, {}); 136 | }); 137 | 138 | // Test selected_multiple functionality 139 | test('should handle selected_multiple request', async () => { 140 | const url = 'https://example.com'; 141 | const selectors = ['.item1', '.item2']; 142 | 143 | const response = await requestHandler( 144 | new RequestContext('webscraping_ai_selected_multiple', { url, selectors }) 145 | ); 146 | 147 | expect(response).toEqual({ 148 | content: [{ type: 'text', text: JSON.stringify(['<div>Element 1</div>', '<div>Element 2</div>'], null, 2) }], 149 | isError: false 150 | }); 151 | expect(mockClient.selectedMultiple).toHaveBeenCalledWith(url, selectors, {}); 152 | }); 153 | 154 | // Test account functionality 155 | test('should handle account request', async () => { 156 | const response = await requestHandler( 157 | new RequestContext('webscraping_ai_account', {}) 158 | ); 159 | 160 | expect(response).toEqual({ 161 | content: [{ type: 'text', text: JSON.stringify({ 
requests: 100, remaining: 900, limit: 1000 }, null, 2) }], 162 | isError: false 163 | }); 164 | expect(mockClient.account).toHaveBeenCalled(); 165 | }); 166 | 167 | // Test error handling 168 | test('should handle API errors', async () => { 169 | const url = 'https://example.com'; 170 | mockClient.question.mockRejectedValueOnce(new Error('API Error')); 171 | 172 | const response = await requestHandler( 173 | new RequestContext('webscraping_ai_question', { url, question: 'What is on this page?' }) 174 | ); 175 | 176 | expect(response.isError).toBe(true); 177 | expect(response.content[0].text).toContain('API Error'); 178 | }); 179 | 180 | // Test unknown tool 181 | test('should handle unknown tool request', async () => { 182 | const response = await requestHandler( 183 | new RequestContext('unknown_tool', { some: 'args' }) 184 | ); 185 | 186 | expect(response.isError).toBe(true); 187 | expect(response.content[0].text).toContain('Unknown tool'); 188 | }); 189 | 190 | // Test MCP Client Connection 191 | xtest('should connect to MCP server and list tools', async () => { 192 | const transport = new StdioClientTransport({ 193 | command: "node", 194 | args: ["src/index.js"] 195 | }); 196 | 197 | const client = new Client({ 198 | name: "webscraping-ai-test-client", 199 | version: "1.0.0" 200 | }); 201 | 202 | await client.connect(transport); 203 | const response = await client.listTools(); 204 | 205 | expect(response.tools).toEqual(expect.arrayContaining([ 206 | expect.objectContaining({ 207 | name: 'webscraping_ai_question', 208 | inputSchema: expect.any(Object) 209 | }), 210 | expect.objectContaining({ 211 | name: 'webscraping_ai_fields', 212 | inputSchema: expect.any(Object) 213 | }), 214 | expect.objectContaining({ 215 | name: 'webscraping_ai_html', 216 | inputSchema: expect.any(Object) 217 | }), 218 | expect.objectContaining({ 219 | name: 'webscraping_ai_text', 220 | inputSchema: expect.any(Object) 221 | }), 222 | expect.objectContaining({ 223 | name: 
'webscraping_ai_selected', 224 | inputSchema: expect.any(Object) 225 | }), 226 | expect.objectContaining({ 227 | name: 'webscraping_ai_selected_multiple', 228 | inputSchema: expect.any(Object) 229 | }), 230 | expect.objectContaining({ 231 | name: 'webscraping_ai_account', 232 | inputSchema: expect.any(Object) 233 | }) 234 | ])); 235 | 236 | await client.close(); 237 | }); 238 | }); 239 | 240 | // Helper function to simulate request handling 241 | async function handleRequest(name, args, client) { 242 | try { 243 | const options = { ...args }; 244 | 245 | // Remove required parameters from options for each tool type 246 | switch (name) { 247 | case 'webscraping_ai_question': { 248 | const { url, question, ...rest } = options; 249 | if (!url || !question) { 250 | throw new Error('URL and question are required'); 251 | } 252 | 253 | const result = await client.question(url, question, rest); 254 | return { 255 | content: [{ type: 'text', text: result }], 256 | isError: false 257 | }; 258 | } 259 | 260 | case 'webscraping_ai_fields': { 261 | const { url, fields, ...rest } = options; 262 | if (!url || !fields) { 263 | throw new Error('URL and fields are required'); 264 | } 265 | 266 | const result = await client.fields(url, fields, rest); 267 | return { 268 | content: [{ type: 'text', text: JSON.stringify(result, null, 2) }], 269 | isError: false 270 | }; 271 | } 272 | 273 | case 'webscraping_ai_html': { 274 | const { url, ...rest } = options; 275 | if (!url) { 276 | throw new Error('URL is required'); 277 | } 278 | 279 | const result = await client.html(url, rest); 280 | return { 281 | content: [{ type: 'text', text: result }], 282 | isError: false 283 | }; 284 | } 285 | 286 | case 'webscraping_ai_text': { 287 | const { url, ...rest } = options; 288 | if (!url) { 289 | throw new Error('URL is required'); 290 | } 291 | 292 | const result = await client.text(url, rest); 293 | return { 294 | content: [{ type: 'text', text: result }], 295 | isError: false 296 | }; 297 | } 
298 | 299 | case 'webscraping_ai_selected': { 300 | const { url, selector, ...rest } = options; 301 | if (!url || !selector) { 302 | throw new Error('URL and selector are required'); 303 | } 304 | 305 | const result = await client.selected(url, selector, rest); 306 | return { 307 | content: [{ type: 'text', text: result }], 308 | isError: false 309 | }; 310 | } 311 | 312 | case 'webscraping_ai_selected_multiple': { 313 | const { url, selectors, ...rest } = options; 314 | if (!url || !selectors) { 315 | throw new Error('URL and selectors are required'); 316 | } 317 | 318 | const result = await client.selectedMultiple(url, selectors, rest); 319 | return { 320 | content: [{ type: 'text', text: JSON.stringify(result, null, 2) }], 321 | isError: false 322 | }; 323 | } 324 | 325 | case 'webscraping_ai_account': { 326 | const result = await client.account(); 327 | return { 328 | content: [{ type: 'text', text: JSON.stringify(result, null, 2) }], 329 | isError: false 330 | }; 331 | } 332 | 333 | default: 334 | throw new Error(`Unknown tool: ${name}`); 335 | } 336 | } catch (error) { 337 | return { 338 | content: [{ type: 'text', text: error.message }], 339 | isError: true 340 | }; 341 | } 342 | } 343 | ``` -------------------------------------------------------------------------------- /openapi.yml: -------------------------------------------------------------------------------- ```yaml 1 | openapi: 3.1.0 2 | info: 3 | title: WebScraping.AI 4 | contact: 5 | name: WebScraping.AI Support 6 | url: https://webscraping.ai 7 | email: [email protected] 8 | version: 3.2.0 9 | description: WebScraping.AI scraping API provides LLM-powered tools with Chromium JavaScript rendering, rotating proxies, and built-in HTML parsing. 
10 | tags: 11 | - name: AI 12 | description: Analyze web pages using LLMs 13 | - name: HTML 14 | description: Get full HTML content of pages using proxies and Chromium JS rendering 15 | - name: Text 16 | description: Get visible text of pages using proxies and Chromium JS rendering 17 | - name: Selected HTML 18 | description: Get HTML content of selected page areas (like price, search results, page title, etc.) 19 | - name: Account 20 | description: Information about your account API credits quota 21 | paths: 22 | /ai/question: 23 | get: 24 | summary: Get an answer to a question about a given web page 25 | description: Returns the answer in plain text. Proxies and Chromium JavaScript rendering are used for page retrieval and processing, then the answer is extracted using an LLM model. 26 | operationId: getQuestion 27 | tags: [ "AI" ] 28 | parameters: 29 | - $ref: '#/components/parameters/url' 30 | - $ref: '#/components/parameters/question' 31 | - $ref: '#/components/parameters/headers' 32 | - $ref: '#/components/parameters/timeout' 33 | - $ref: '#/components/parameters/js' 34 | - $ref: '#/components/parameters/js_timeout' 35 | - $ref: '#/components/parameters/wait_for' 36 | - $ref: '#/components/parameters/proxy' 37 | - $ref: '#/components/parameters/country' 38 | - $ref: '#/components/parameters/custom_proxy' 39 | - $ref: '#/components/parameters/device' 40 | - $ref: '#/components/parameters/error_on_404' 41 | - $ref: '#/components/parameters/error_on_redirect' 42 | - $ref: '#/components/parameters/js_script' 43 | - $ref: '#/components/parameters/format' 44 | responses: 45 | 400: 46 | $ref: '#/components/responses/400' 47 | 402: 48 | $ref: '#/components/responses/402' 49 | 403: 50 | $ref: '#/components/responses/403' 51 | 429: 52 | $ref: '#/components/responses/429' 53 | 500: 54 | $ref: '#/components/responses/500' 55 | 504: 56 | $ref: '#/components/responses/504' 57 | 200: 58 | description: Success 59 | content: 60 | text/html: 61 | schema: 62 | type: string 63 | 
example: "Some answer" 64 | /ai/fields: 65 | get: 66 | summary: Extract structured data fields from a web page 67 | description: Returns structured data fields extracted from the webpage using an LLM model. Proxies and Chromium JavaScript rendering are used for page retrieval and processing. 68 | operationId: getFields 69 | tags: [ "AI" ] 70 | parameters: 71 | - $ref: '#/components/parameters/url' 72 | - in: query 73 | name: fields 74 | description: Object describing fields to extract from the page and their descriptions 75 | required: true 76 | example: {"title":"Main product title","price":"Current product price","description":"Full product description"} 77 | schema: 78 | type: object 79 | additionalProperties: 80 | type: string 81 | style: deepObject 82 | explode: true 83 | - $ref: '#/components/parameters/headers' 84 | - $ref: '#/components/parameters/timeout' 85 | - $ref: '#/components/parameters/js' 86 | - $ref: '#/components/parameters/js_timeout' 87 | - $ref: '#/components/parameters/wait_for' 88 | - $ref: '#/components/parameters/proxy' 89 | - $ref: '#/components/parameters/country' 90 | - $ref: '#/components/parameters/custom_proxy' 91 | - $ref: '#/components/parameters/device' 92 | - $ref: '#/components/parameters/error_on_404' 93 | - $ref: '#/components/parameters/error_on_redirect' 94 | - $ref: '#/components/parameters/js_script' 95 | responses: 96 | 400: 97 | $ref: '#/components/responses/400' 98 | 402: 99 | $ref: '#/components/responses/402' 100 | 403: 101 | $ref: '#/components/responses/403' 102 | 429: 103 | $ref: '#/components/responses/429' 104 | 500: 105 | $ref: '#/components/responses/500' 106 | 504: 107 | $ref: '#/components/responses/504' 108 | 200: 109 | description: Success 110 | content: 111 | application/json: 112 | schema: 113 | type: object 114 | additionalProperties: 115 | type: string 116 | example: 117 | title: "Example Product" 118 | price: "$99.99" 119 | description: "This is a sample product description" 120 | /html: 121 | get: 122 
| summary: Page HTML by URL 123 | description: Returns the full HTML content of a webpage specified by the URL. The response is in plain text. Proxies and Chromium JavaScript rendering are used for page retrieval and processing. 124 | operationId: getHTML 125 | tags: ["HTML"] 126 | parameters: 127 | - $ref: '#/components/parameters/url' 128 | - $ref: '#/components/parameters/headers' 129 | - $ref: '#/components/parameters/timeout' 130 | - $ref: '#/components/parameters/js' 131 | - $ref: '#/components/parameters/js_timeout' 132 | - $ref: '#/components/parameters/wait_for' 133 | - $ref: '#/components/parameters/proxy' 134 | - $ref: '#/components/parameters/country' 135 | - $ref: '#/components/parameters/custom_proxy' 136 | - $ref: '#/components/parameters/device' 137 | - $ref: '#/components/parameters/error_on_404' 138 | - $ref: '#/components/parameters/error_on_redirect' 139 | - $ref: '#/components/parameters/js_script' 140 | - $ref: '#/components/parameters/return_script_result' 141 | - $ref: '#/components/parameters/format' 142 | responses: 143 | 400: 144 | $ref: '#/components/responses/400' 145 | 402: 146 | $ref: '#/components/responses/402' 147 | 403: 148 | $ref: '#/components/responses/403' 149 | 429: 150 | $ref: '#/components/responses/429' 151 | 500: 152 | $ref: '#/components/responses/500' 153 | 504: 154 | $ref: '#/components/responses/504' 155 | 200: 156 | description: Success 157 | content: 158 | text/html: 159 | schema: 160 | type: string 161 | example: "<html><head>\n <title>Example Domain</title>\n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n</body></html>" 162 | /text: 163 | get: 164 | summary: Page text by URL 165 | description: Returns the visible text content of a webpage specified by the URL. Can be used to feed data to LLM models. The response can be in plain text, JSON, or XML format based on the text_format parameter. Proxies and Chromium JavaScript rendering are used for page retrieval and processing. Returns JSON on error. 
166 | operationId: getText 167 | tags: [ "Text" ] 168 | parameters: 169 | - $ref: '#/components/parameters/text_format' 170 | - $ref: '#/components/parameters/return_links' 171 | - $ref: '#/components/parameters/url' 172 | - $ref: '#/components/parameters/headers' 173 | - $ref: '#/components/parameters/timeout' 174 | - $ref: '#/components/parameters/js' 175 | - $ref: '#/components/parameters/js_timeout' 176 | - $ref: '#/components/parameters/wait_for' 177 | - $ref: '#/components/parameters/proxy' 178 | - $ref: '#/components/parameters/country' 179 | - $ref: '#/components/parameters/custom_proxy' 180 | - $ref: '#/components/parameters/device' 181 | - $ref: '#/components/parameters/error_on_404' 182 | - $ref: '#/components/parameters/error_on_redirect' 183 | - $ref: '#/components/parameters/js_script' 184 | responses: 185 | 400: 186 | $ref: '#/components/responses/400' 187 | 402: 188 | $ref: '#/components/responses/402' 189 | 403: 190 | $ref: '#/components/responses/403' 191 | 429: 192 | $ref: '#/components/responses/429' 193 | 500: 194 | $ref: '#/components/responses/500' 195 | 504: 196 | $ref: '#/components/responses/504' 197 | 200: 198 | description: Success 199 | content: 200 | text/html: 201 | schema: 202 | type: string 203 | example: "Some content" 204 | text/xml: 205 | schema: 206 | type: string 207 | example: "<title>Some title</title>\n<description>Some description</description>\n<content>Some content</content>" 208 | application/json: 209 | schema: 210 | type: string 211 | example: '{"title":"Some title","description":"Some description","content":"Some content"}' 212 | /selected: 213 | get: 214 | summary: HTML of a selected page area by URL and CSS selector 215 | description: Returns HTML of a selected page area by URL and CSS selector. Useful if you don't want to do the HTML parsing on your side. 
216 | operationId: getSelected 217 | tags: ["Selected HTML"] 218 | parameters: 219 | - in: query 220 | name: selector 221 | description: CSS selector (null by default, returns whole page HTML) 222 | example: "h1" 223 | schema: 224 | type: string 225 | - $ref: '#/components/parameters/url' 226 | - $ref: '#/components/parameters/headers' 227 | - $ref: '#/components/parameters/timeout' 228 | - $ref: '#/components/parameters/js' 229 | - $ref: '#/components/parameters/js_timeout' 230 | - $ref: '#/components/parameters/wait_for' 231 | - $ref: '#/components/parameters/proxy' 232 | - $ref: '#/components/parameters/country' 233 | - $ref: '#/components/parameters/custom_proxy' 234 | - $ref: '#/components/parameters/device' 235 | - $ref: '#/components/parameters/error_on_404' 236 | - $ref: '#/components/parameters/error_on_redirect' 237 | - $ref: '#/components/parameters/js_script' 238 | - $ref: '#/components/parameters/format' 239 | responses: 240 | 400: 241 | $ref: '#/components/responses/400' 242 | 402: 243 | $ref: '#/components/responses/402' 244 | 403: 245 | $ref: '#/components/responses/403' 246 | 429: 247 | $ref: '#/components/responses/429' 248 | 500: 249 | $ref: '#/components/responses/500' 250 | 504: 251 | $ref: '#/components/responses/504' 252 | 200: 253 | description: Success 254 | content: 255 | text/html: 256 | schema: 257 | type: string 258 | example: "<a href=\"https://www.iana.org/domains/example\">More information...</a>" 259 | /selected-multiple: 260 | get: 261 | summary: HTML of multiple page areas by URL and CSS selectors 262 | description: Returns HTML of multiple page areas by URL and CSS selectors. Useful if you don't want to do the HTML parsing on your side. 
263 | operationId: getSelectedMultiple 264 | tags: ["Selected HTML"] 265 | parameters: 266 | - in: query 267 | name: selectors 268 | description: Multiple CSS selectors (null by default, returns whole page HTML) 269 | example: ["h1"] 270 | schema: 271 | type: array 272 | items: 273 | type: string 274 | style: form 275 | explode: true 276 | - $ref: '#/components/parameters/url' 277 | - $ref: '#/components/parameters/headers' 278 | - $ref: '#/components/parameters/timeout' 279 | - $ref: '#/components/parameters/js' 280 | - $ref: '#/components/parameters/js_timeout' 281 | - $ref: '#/components/parameters/wait_for' 282 | - $ref: '#/components/parameters/proxy' 283 | - $ref: '#/components/parameters/country' 284 | - $ref: '#/components/parameters/custom_proxy' 285 | - $ref: '#/components/parameters/device' 286 | - $ref: '#/components/parameters/error_on_404' 287 | - $ref: '#/components/parameters/error_on_redirect' 288 | - $ref: '#/components/parameters/js_script' 289 | responses: 290 | 400: 291 | $ref: '#/components/responses/400' 292 | 402: 293 | $ref: '#/components/responses/402' 294 | 403: 295 | $ref: '#/components/responses/403' 296 | 429: 297 | $ref: '#/components/responses/429' 298 | 500: 299 | $ref: '#/components/responses/500' 300 | 504: 301 | $ref: '#/components/responses/504' 302 | 200: 303 | description: Success 304 | content: 305 | application/json: 306 | schema: 307 | $ref: "#/components/schemas/SelectedAreas" 308 | example: "[\"<a href='/test'>some link</a>\", \"Hello\"]" 309 | /account: 310 | get: 311 | summary: Information about your account calls quota 312 | description: Returns information about your account, including the remaining API credits quota, the next billing cycle start time, and the remaining concurrent requests. The response is in JSON format. 
313 | operationId: account 314 | tags: [ "Account" ] 315 | responses: 316 | 403: 317 | $ref: '#/components/responses/403' 318 | 200: 319 | description: Success 320 | content: 321 | application/json: 322 | schema: 323 | $ref: "#/components/schemas/Account" 324 | example: 325 | remaining_api_calls: 200000 326 | resets_at: 1617073667 327 | remaining_concurrency: 100 328 | security: 329 | - api_key: [] 330 | servers: 331 | - url: https://api.webscraping.ai 332 | components: 333 | securitySchemes: 334 | api_key: 335 | type: apiKey 336 | name: api_key 337 | in: query 338 | responses: 339 | 400: 340 | description: Parameters validation error 341 | content: 342 | application/json: 343 | schema: 344 | $ref: "#/components/schemas/Error" 345 | example: 346 | { 347 | "message": "Invalid CSS selector" 348 | } 349 | 350 | 402: 351 | description: Billing issue; you have probably run out of credits 352 | content: 353 | application/json: 354 | schema: 355 | $ref: "#/components/schemas/Error" 356 | example: 357 | { 358 | message: "Some error" 359 | } 360 | 403: 361 | description: Wrong API key 362 | content: 363 | application/json: 364 | schema: 365 | $ref: "#/components/schemas/Error" 366 | example: 367 | { 368 | message: "Some error" 369 | } 370 | 429: 371 | description: Too many concurrent requests 372 | content: 373 | application/json: 374 | schema: 375 | $ref: "#/components/schemas/Error" 376 | example: 377 | { 378 | message: "Some error" 379 | } 380 | 500: 381 | description: Non-2xx and non-404 HTTP status code on the target page or unexpected error, try again or contact [email protected] 382 | content: 383 | application/json: 384 | schema: 385 | $ref: "#/components/schemas/Error" 386 | example: 387 | { 388 | "message": "Unexpected HTTP code on the target page", 389 | "status_code": 500, 390 | "status_message": "Some website error", 391 | } 392 | 504: 393 | description: Timeout error, try increasing timeout parameter value 394 | content: 395 | application/json: 396 | schema: 397
| $ref: "#/components/schemas/Error" 398 | example: 399 | { 400 | message: "Some error" 401 | } 402 | parameters: 403 | ## Shared everywhere 404 | url: 405 | in: query 406 | name: url 407 | description: URL of the target page. 408 | required: true 409 | example: "https://example.com" 410 | schema: 411 | type: string 412 | postUrl: 413 | in: query 414 | name: url 415 | description: URL of the target page. 416 | required: true 417 | example: "https://httpbin.org/post" 418 | schema: 419 | type: string 420 | headers: 421 | in: query 422 | name: headers 423 | description: "HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"})." 424 | example: '{"Cookie":"session=some_id"}' 425 | schema: 426 | type: object 427 | additionalProperties: 428 | type: string 429 | style: deepObject 430 | explode: true 431 | timeout: 432 | in: query 433 | name: timeout 434 | description: Maximum web page retrieval time in ms. Increase it in case of timeout errors (10000 by default, maximum is 30000). 435 | example: 10000 436 | schema: 437 | type: integer 438 | default: 10000 439 | minimum: 1 440 | maximum: 30000 441 | js: 442 | in: query 443 | name: js 444 | description: Execute on-page JavaScript using a headless browser (true by default). 445 | example: true 446 | schema: 447 | type: boolean 448 | default: true 449 | js_timeout: 450 | in: query 451 | name: js_timeout 452 | description: Maximum JavaScript rendering time in ms. Increase it if you see a loading indicator instead of data on the target page. 453 | example: 2000 454 | schema: 455 | type: integer 456 | default: 2000 457 | minimum: 1 458 | maximum: 20000 459 | wait_for: 460 | in: query 461 | name: wait_for 462 | description: CSS selector to wait for before returning the page content. Useful for pages with dynamic content loading.
Overrides js_timeout. 463 | schema: 464 | type: string 465 | proxy: 466 | in: query 467 | name: proxy 468 | description: Type of proxy, use residential proxies if your site restricts traffic from datacenters (datacenter by default). Note that residential proxy requests are more expensive than datacenter, see the pricing page for details. 469 | example: "datacenter" 470 | schema: 471 | type: string 472 | default: "datacenter" 473 | enum: [ "datacenter", "residential" ] 474 | country: 475 | in: query 476 | name: country 477 | description: Country of the proxy to use (US by default). 478 | example: "us" 479 | schema: 480 | type: string 481 | default: "us" 482 | enum: [ "us", "gb", "de", "it", "fr", "ca", "es", "ru", "jp", "kr", "in" ] 483 | custom_proxy: 484 | in: query 485 | name: custom_proxy 486 | description: Your own proxy URL to use instead of our built-in proxy pool in "http://user:password@host:port" format (<a target="_blank" href="https://webscraping.ai/proxies/smartproxy">Smartproxy</a> for example). 487 | example: 488 | schema: 489 | type: string 490 | device: 491 | in: query 492 | name: device 493 | description: Type of device emulation. 494 | example: "desktop" 495 | schema: 496 | type: string 497 | default: "desktop" 498 | enum: [ "desktop", "mobile", "tablet" ] 499 | error_on_404: 500 | in: query 501 | name: error_on_404 502 | description: Return error on 404 HTTP status on the target page (false by default). 503 | example: false 504 | schema: 505 | type: boolean 506 | default: false 507 | error_on_redirect: 508 | in: query 509 | name: error_on_redirect 510 | description: Return error on redirect on the target page (false by default). 511 | example: false 512 | schema: 513 | type: boolean 514 | default: false 515 | js_script: 516 | in: query 517 | name: js_script 518 | description: Custom JavaScript code to execute on the target page. 
519 | example: "document.querySelector('button').click();" 520 | schema: 521 | type: string 522 | return_script_result: 523 | in: query 524 | name: return_script_result 525 | description: Return result of the custom JavaScript code (js_script parameter) execution on the target page (false by default, page HTML will be returned). 526 | example: false 527 | schema: 528 | type: boolean 529 | default: false 530 | text_format: 531 | in: query 532 | name: text_format 533 | description: Format of the text response (plain by default). "plain" will return only the page body text. "json" and "xml" will return a json/xml with "title", "description" and "content" keys. 534 | example: "plain" 535 | schema: 536 | type: string 537 | default: "plain" 538 | enum: [ "plain", "xml", "json" ] 539 | return_links: 540 | in: query 541 | name: return_links 542 | description: "[Works only with text_format=json] Return links from the page body text (false by default). Useful for building web crawlers." 543 | example: false 544 | schema: 545 | type: boolean 546 | default: false 547 | question: 548 | in: query 549 | name: question 550 | description: Question or instructions to ask the LLM model about the target page. 551 | example: "What is the summary of this page content?" 552 | schema: 553 | type: string 554 | format: 555 | in: query 556 | name: format 557 | description: Format of the response (text by default). "json" will return a JSON object with the response, "text" will return a plain text/HTML response. 
558 | example: "json" 559 | schema: 560 | type: string 561 | default: "json" 562 | enum: [ "json", "text" ] 563 | 564 | requestBodies: 565 | Body: 566 | description: Request body to pass to the target page 567 | content: 568 | application/json: 569 | schema: 570 | type: object 571 | additionalProperties: true 572 | application/x-www-form-urlencoded: 573 | schema: 574 | type: object 575 | additionalProperties: true 576 | application/xml: 577 | schema: 578 | type: object 579 | additionalProperties: true 580 | text/plain: 581 | schema: 582 | type: string 583 | schemas: 584 | Error: 585 | title: Generic error 586 | type: object 587 | properties: 588 | message: 589 | type: string 590 | description: Error description 591 | status_code: 592 | type: integer 593 | description: Target page response HTTP status code (403, 500, etc) 594 | status_message: 595 | type: string 596 | description: Target page response HTTP status message 597 | body: 598 | type: string 599 | description: Target page response body 600 | SelectedAreas: 601 | title: HTML for selected page areas 602 | type: array 603 | description: Array of elements matched by selectors 604 | items: 605 | type: string 606 | Account: 607 | title: Account limits info 608 | type: object 609 | properties: 610 | email: 611 | type: string 612 | description: Your account email 613 | remaining_api_calls: 614 | type: integer 615 | description: Remaining API credits quota 616 | resets_at: 617 | type: integer 618 | description: Next billing cycle start time (UNIX timestamp) 619 | remaining_concurrency: 620 | type: integer 621 | description: Remaining concurrent requests 622 | ```
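
The query-parameter conventions declared in `openapi.yml` above (the `api_key` security scheme in the query string, `style: form` with `explode: true` for the `selectors` array, and `style: deepObject` for `headers` and `fields`) can be sketched as a small URL builder. This is an illustrative sketch only: `buildRequestUrl` and `BASE_URL` are hypothetical names that do not appear in `src/index.js`, and the serialization follows the spec as written rather than any confirmed behavior of the live API.

```javascript
// Hypothetical helper, not part of src/index.js.
// Base URL taken from the `servers` entry in openapi.yml.
const BASE_URL = 'https://api.webscraping.ai';

/**
 * Build a GET URL for one of the paths in openapi.yml.
 * @param {string} path   e.g. '/selected-multiple'
 * @param {string} apiKey sent as the `api_key` query parameter
 * @param {object} params endpoint parameters (url, selectors, headers, ...)
 */
function buildRequestUrl(path, apiKey, params = {}) {
  const url = new URL(path, BASE_URL);
  url.searchParams.set('api_key', apiKey);

  for (const [name, value] of Object.entries(params)) {
    // Skip unset options so the API defaults apply.
    if (value === undefined || value === null) continue;

    if (Array.isArray(value)) {
      // style: form, explode: true -> one query entry per array item.
      for (const item of value) {
        url.searchParams.append(name, String(item));
      }
    } else if (typeof value === 'object') {
      // style: deepObject -> name[key]=value for each property.
      for (const [key, inner] of Object.entries(value)) {
        url.searchParams.append(`${name}[${key}]`, String(inner));
      }
    } else {
      // Strings, numbers, and booleans are sent as plain values.
      url.searchParams.set(name, String(value));
    }
  }
  return url.toString();
}

// Usage: the resulting string could be passed to fetch() or an HTTP client.
const exampleUrl = buildRequestUrl('/selected-multiple', 'your_api_key_here', {
  url: 'https://example.com',
  selectors: ['h1', '.price'],
  headers: { Cookie: 'session=some_id' },
  timeout: 15000,
});
console.log(exampleUrl);
```

Omitting unset options keeps requests short and lets the documented defaults (for example `timeout` 10000 and `proxy` "datacenter") apply on the server side. The `headers` description also permits a JSON-encoded object, so a client could use `JSON.stringify` for that parameter instead of the bracketed form.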