webscraping-ai/webscraping-ai-mcp-server # codebase.md

# Directory Structure

```
├── .env.example
├── .eslintignore
├── .eslintrc.json
├── .github
│   └── workflows
│       └── ci.yml
├── .gitignore
├── .prettierrc
├── Dockerfile
├── jest.config.js
├── jest.setup.js
├── openapi.yml
├── package-lock.json
├── package.json
├── README.md
├── smithery.yaml
└── src
    ├── index.js
    └── index.test.js
```

# Files

--------------------------------------------------------------------------------
/.eslintignore:
--------------------------------------------------------------------------------

```
1 | **/*.test.ts
2 | **/*.test.js
3 | node_modules
4 | dist
5 | jest.setup.js
6 | jest.config.js
```

--------------------------------------------------------------------------------
/.prettierrc:
--------------------------------------------------------------------------------

```
1 | {
2 |   "semi": true,
3 |   "singleQuote": true,
4 |   "tabWidth": 2,
5 |   "trailingComma": "es5",
6 |   "printWidth": 80,
7 |   "endOfLine": "auto"
8 | } 
```

--------------------------------------------------------------------------------
/.env.example:
--------------------------------------------------------------------------------

```
1 | # Required: Your WebScraping.AI API key
2 | WEBSCRAPING_AI_API_KEY=your_api_key_here
3 | 
4 | # Optional: Maximum number of concurrent requests (default: 5)
5 | WEBSCRAPING_AI_CONCURRENCY_LIMIT=5
6 | 
```
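
As a rough illustration of how these two variables might be consumed, the sketch below follows the `Number(...)` conversion with a fallback of 5 that `src/index.js` uses. The `readConfig` helper, its return shape, and the extra integer check are assumptions introduced here, not part of the repository.

```javascript
// Hypothetical helper: reads the settings from an env-like object.
// Defaults mirror .env.example; the name and return shape are assumptions.
function readConfig(env = process.env) {
  const apiKey = env.WEBSCRAPING_AI_API_KEY || '';
  if (!apiKey) {
    throw new Error('WEBSCRAPING_AI_API_KEY environment variable is required');
  }

  // Environment values are always strings, so convert explicitly.
  const concurrency = Number(env.WEBSCRAPING_AI_CONCURRENCY_LIMIT || 5);

  // Extra validation (not in the original source): reject NaN, fractions, and values below 1.
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new Error('WEBSCRAPING_AI_CONCURRENCY_LIMIT must be a positive integer');
  }

  return { apiKey, concurrency };
}

console.log(readConfig({ WEBSCRAPING_AI_API_KEY: 'your_api_key_here' }));
```

Passing the environment in as an argument keeps the helper easy to exercise in tests without mutating `process.env`.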

--------------------------------------------------------------------------------
/.eslintrc.json:
--------------------------------------------------------------------------------

```json
 1 | {
 2 |   "env": {
 3 |     "es2021": true,
 4 |     "node": true,
 5 |     "jest": true
 6 |   },
 7 |   "extends": [
 8 |     "eslint:recommended",
 9 |     "prettier"
10 |   ],
11 |   "parserOptions": {
12 |     "ecmaVersion": "latest",
13 |     "sourceType": "module"
14 |   },
15 |   "rules": {
16 |     "no-unused-vars": "warn",
17 |     "no-console": "off"
18 |   }
19 | } 
```

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

```
 1 | # Dependency directories
 2 | node_modules/
 3 | 
 4 | # Environment variables
 5 | .env
 6 | .env.local
 7 | .env.development.local
 8 | .env.test.local
 9 | .env.production.local
10 | 
11 | # Logs
12 | logs
13 | *.log
14 | npm-debug.log*
15 | yarn-debug.log*
16 | yarn-error.log*
17 | 
18 | # Coverage directory used by tools like istanbul
19 | coverage/
20 | 
21 | # Editor directories and files
22 | .idea/
23 | .vscode/
24 | *.swp
25 | *.swo
26 | 
27 | # OS specific
28 | .DS_Store
29 | Thumbs.db
30 | 
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

```markdown
  1 | # WebScraping.AI MCP Server
  2 | 
  3 | A Model Context Protocol (MCP) server implementation that integrates with [WebScraping.AI](https://webscraping.ai) for web data extraction capabilities.
  4 | 
  5 | ## Features
  6 | 
  7 | - Question answering about web page content
  8 | - Structured data extraction from web pages
  9 | - HTML content retrieval with JavaScript rendering
 10 | - Plain text extraction from web pages
 11 | - CSS selector-based content extraction
 12 | - Multiple proxy types (datacenter, residential) with country selection
 13 | - JavaScript rendering using headless Chrome/Chromium
 14 | - Concurrent request management with rate limiting
 15 | - Custom JavaScript execution on target pages
 16 | - Device emulation (desktop, mobile, tablet)
 17 | - Account usage monitoring
 18 | 
 19 | ## Installation
 20 | 
 21 | ### Running with npx
 22 | 
 23 | ```bash
 24 | env WEBSCRAPING_AI_API_KEY=your_api_key npx -y webscraping-ai-mcp
 25 | ```
 26 | 
 27 | ### Manual Installation
 28 | 
 29 | ```bash
 30 | # Clone the repository
 31 | git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git
 32 | cd webscraping-ai-mcp-server
 33 | 
 34 | # Install dependencies
 35 | npm install
 36 | 
 37 | # Run
 38 | npm start
 39 | ```
 40 | 
 41 | ### Configuring in Cursor
 42 | Note: Requires Cursor version 0.45.6+
 43 | 
 44 | The WebScraping.AI MCP server can be configured in two ways in Cursor:
 45 | 
 46 | 1. **Project-specific Configuration** (recommended for team projects):
 47 |    Create a `.cursor/mcp.json` file in your project directory:
 48 |    ```json
 49 |    {
 50 |      "mcpServers": {
 51 |        "webscraping-ai": {
 52 |          "command": "npx",
 53 |          "args": ["-y", "webscraping-ai-mcp"],
 54 |          "env": {
 55 |            "WEBSCRAPING_AI_API_KEY": "your-api-key",
 56 |            "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "5"
 57 |          }
 58 |        }
 59 |      }
 60 |    }
 61 |    ```
 62 | 
 63 | 2. **Global Configuration** (for personal use across all projects):
 64 |    Create a `~/.cursor/mcp.json` file in your home directory with the same configuration format as above.
 65 | 
 66 | > If you are using Windows and are running into issues, try using `cmd /c "set WEBSCRAPING_AI_API_KEY=your-api-key && npx -y webscraping-ai-mcp"` as the command.
 67 | 
 68 | This configuration will make the WebScraping.AI tools available to Cursor's AI agent automatically when relevant for web scraping tasks.
 69 | 
 70 | ### Running on Claude Desktop
 71 | 
 72 | Add this to your `claude_desktop_config.json`:
 73 | 
 74 | ```json
 75 | {
 76 |   "mcpServers": {
 77 |     "mcp-server-webscraping-ai": {
 78 |       "command": "npx",
 79 |       "args": ["-y", "webscraping-ai-mcp"],
 80 |       "env": {
 81 |         "WEBSCRAPING_AI_API_KEY": "YOUR_API_KEY_HERE",
 82 |         "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "5"
 83 |       }
 84 |     }
 85 |   }
 86 | }
 87 | ```
 88 | 
 89 | ## Configuration
 90 | 
 91 | ### Environment Variables
 92 | 
 93 | #### Required
 94 | 
 95 | - `WEBSCRAPING_AI_API_KEY`: Your WebScraping.AI API key
 96 |   - Required for all operations
 97 |   - Get your API key from [WebScraping.AI](https://webscraping.ai)
 98 | 
 99 | #### Optional Configuration
100 | - `WEBSCRAPING_AI_CONCURRENCY_LIMIT`: Maximum number of concurrent requests (default: `5`)
101 | - `WEBSCRAPING_AI_DEFAULT_PROXY_TYPE`: Type of proxy to use (default: `residential`)
102 | - `WEBSCRAPING_AI_DEFAULT_JS_RENDERING`: Enable/disable JavaScript rendering (default: `true`)
103 | - `WEBSCRAPING_AI_DEFAULT_TIMEOUT`: Maximum web page retrieval time in ms (default: `15000`, max: `30000`)
104 | - `WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT`: Maximum JavaScript rendering time in ms (default: `2000`)
105 | 
106 | ### Configuration Examples
107 | 
108 | For standard usage:
109 | ```bash
110 | # Required
111 | export WEBSCRAPING_AI_API_KEY=your-api-key
112 | 
113 | # Optional - customize behavior (default values)
114 | export WEBSCRAPING_AI_CONCURRENCY_LIMIT=5
115 | export WEBSCRAPING_AI_DEFAULT_PROXY_TYPE=residential # datacenter or residential
116 | export WEBSCRAPING_AI_DEFAULT_JS_RENDERING=true
117 | export WEBSCRAPING_AI_DEFAULT_TIMEOUT=15000
118 | export WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT=2000
119 | ```
120 | 
121 | ## Available Tools
122 | 
123 | ### 1. Question Tool (`webscraping_ai_question`)
124 | 
125 | Ask questions about web page content.
126 | 
127 | ```json
128 | {
129 |   "name": "webscraping_ai_question",
130 |   "arguments": {
131 |     "url": "https://example.com",
132 |     "question": "What is the main topic of this page?",
133 |     "timeout": 30000,
134 |     "js": true,
135 |     "js_timeout": 2000,
136 |     "wait_for": ".content-loaded",
137 |     "proxy": "datacenter",
138 |     "country": "us"
139 |   }
140 | }
141 | ```
142 | 
143 | Example response:
144 | 
145 | ```json
146 | {
147 |   "content": [
148 |     {
149 |       "type": "text",
150 |       "text": "The main topic of this page is examples and documentation for HTML and web standards."
151 |     }
152 |   ],
153 |   "isError": false
154 | }
155 | ```
156 | 
157 | ### 2. Fields Tool (`webscraping_ai_fields`)
158 | 
159 | Extract structured data from web pages based on instructions.
160 | 
161 | ```json
162 | {
163 |   "name": "webscraping_ai_fields",
164 |   "arguments": {
165 |     "url": "https://example.com/product",
166 |     "fields": {
167 |       "title": "Extract the product title",
168 |       "price": "Extract the product price",
169 |       "description": "Extract the product description"
170 |     },
171 |     "js": true,
172 |     "timeout": 30000
173 |   }
174 | }
175 | ```
176 | 
177 | Example response:
178 | 
179 | ```json
180 | {
181 |   "content": [
182 |     {
183 |       "type": "text",
184 |       "text": {
185 |         "title": "Example Product",
186 |         "price": "$99.99",
187 |         "description": "This is an example product description."
188 |       }
189 |     }
190 |   ],
191 |   "isError": false
192 | }
193 | ```
194 | 
195 | ### 3. HTML Tool (`webscraping_ai_html`)
196 | 
197 | Get the full HTML of a web page with JavaScript rendering.
198 | 
199 | ```json
200 | {
201 |   "name": "webscraping_ai_html",
202 |   "arguments": {
203 |     "url": "https://example.com",
204 |     "js": true,
205 |     "timeout": 30000,
206 |     "wait_for": "#content-loaded"
207 |   }
208 | }
209 | ```
210 | 
211 | Example response:
212 | 
213 | ```json
214 | {
215 |   "content": [
216 |     {
217 |       "type": "text",
218 |       "text": "<html>...[full HTML content]...</html>"
219 |     }
220 |   ],
221 |   "isError": false
222 | }
223 | ```
224 | 
225 | ### 4. Text Tool (`webscraping_ai_text`)
226 | 
227 | Extract the visible text content from a web page.
228 | 
229 | ```json
230 | {
231 |   "name": "webscraping_ai_text",
232 |   "arguments": {
233 |     "url": "https://example.com",
234 |     "js": true,
235 |     "timeout": 30000
236 |   }
237 | }
238 | ```
239 | 
240 | Example response:
241 | 
242 | ```json
243 | {
244 |   "content": [
245 |     {
246 |       "type": "text",
247 |       "text": "Example Domain\nThis domain is for use in illustrative examples in documents..."
248 |     }
249 |   ],
250 |   "isError": false
251 | }
252 | ```
253 | 
254 | ### 5. Selected Tool (`webscraping_ai_selected`)
255 | 
256 | Extract content from a specific element using a CSS selector.
257 | 
258 | ```json
259 | {
260 |   "name": "webscraping_ai_selected",
261 |   "arguments": {
262 |     "url": "https://example.com",
263 |     "selector": "div.main-content",
264 |     "js": true,
265 |     "timeout": 30000
266 |   }
267 | }
268 | ```
269 | 
270 | Example response:
271 | 
272 | ```json
273 | {
274 |   "content": [
275 |     {
276 |       "type": "text",
277 |       "text": "<div class=\"main-content\">This is the main content of the page.</div>"
278 |     }
279 |   ],
280 |   "isError": false
281 | }
282 | ```
283 | 
284 | ### 6. Selected Multiple Tool (`webscraping_ai_selected_multiple`)
285 | 
286 | Extract content from multiple elements using CSS selectors.
287 | 
288 | ```json
289 | {
290 |   "name": "webscraping_ai_selected_multiple",
291 |   "arguments": {
292 |     "url": "https://example.com",
293 |     "selectors": ["div.header", "div.product-list", "div.footer"],
294 |     "js": true,
295 |     "timeout": 30000
296 |   }
297 | }
298 | ```
299 | 
300 | Example response:
301 | 
302 | ```json
303 | {
304 |   "content": [
305 |     {
306 |       "type": "text",
307 |       "text": [
308 |         "<div class=\"header\">Header content</div>",
309 |         "<div class=\"product-list\">Product list content</div>",
310 |         "<div class=\"footer\">Footer content</div>"
311 |       ]
312 |     }
313 |   ],
314 |   "isError": false
315 | }
316 | ```
317 | 
318 | ### 7. Account Tool (`webscraping_ai_account`)
319 | 
320 | Get information about your WebScraping.AI account.
321 | 
322 | ```json
323 | {
324 |   "name": "webscraping_ai_account",
325 |   "arguments": {}
326 | }
327 | ```
328 | 
329 | Example response:
330 | 
331 | ```json
332 | {
333 |   "content": [
334 |     {
335 |       "type": "text",
336 |       "text": {
337 |         "requests": 5000,
338 |         "remaining": 4500,
339 |         "limit": 10000,
340 |         "resets_at": "2023-12-31T23:59:59Z"
341 |       }
342 |     }
343 |   ],
344 |   "isError": false
345 | }
346 | ```
347 | 
348 | ## Common Options for All Tools
349 | 
350 | The following options can be used with all scraping tools:
351 | 
352 | - `timeout`: Maximum web page retrieval time in ms (15000 by default, maximum is 30000)
353 | - `js`: Execute on-page JavaScript using a headless browser (true by default)
354 | - `js_timeout`: Maximum JavaScript rendering time in ms (2000 by default)
355 | - `wait_for`: CSS selector to wait for before returning the page content
356 | - `proxy`: Type of proxy, datacenter or residential (residential by default)
357 | - `country`: Country of the proxy to use (US by default). Supported countries: us, gb, de, it, fr, ca, es, ru, jp, kr, in
358 | - `custom_proxy`: Your own proxy URL in "http://user:password@host:port" format
359 | - `device`: Type of device emulation. Supported values: desktop, mobile, tablet
360 | - `error_on_404`: Return error on 404 HTTP status on the target page (false by default)
361 | - `error_on_redirect`: Return error on redirect on the target page (false by default)
362 | - `js_script`: Custom JavaScript code to execute on the target page
363 | 
364 | ## Error Handling
365 | 
366 | The server provides robust error handling:
367 | 
368 | - Automatic retries for transient errors
369 | - Rate limit handling with backoff
370 | - Detailed error messages
371 | - Network resilience
372 | 
373 | Example error response:
374 | 
375 | ```json
376 | {
377 |   "content": [
378 |     {
379 |       "type": "text",
380 |       "text": "{\"message\":\"API Error\",\"status_code\":429,\"status_message\":\"Too Many Requests\"}"
381 |     }
382 |   ],
383 |   "isError": true
384 | }
385 | ```
386 | 
387 | ## Integration with LLMs
388 | 
389 | This server implements the [Model Context Protocol](https://modelcontextprotocol.io), making it compatible with any MCP-enabled LLM platform. You can configure your LLM to use these tools for web scraping tasks.
390 | 
391 | ### Example: Configuring Claude with MCP
392 | 
393 | ```javascript
394 | import Anthropic from '@anthropic-ai/sdk';
395 | import { Client } from '@modelcontextprotocol/sdk/client/index.js';
396 | import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
397 | 
398 | const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
399 | 
400 | const transport = new StdioClientTransport({
401 |   command: 'npx',
402 |   args: ['-y', 'webscraping-ai-mcp'],
403 |   env: { ...process.env, WEBSCRAPING_AI_API_KEY: 'your-api-key' }
404 | });
405 | 
406 | const client = new Client(
407 |   { name: 'claude-client', version: '1.0.0' },
408 |   { capabilities: {} }
409 | );
410 | 
411 | await client.connect(transport);
412 | 
413 | // Convert the MCP tool definitions to the Anthropic Messages API tool format
414 | const { tools } = await client.listTools();
415 | const response = await anthropic.messages.create({
416 |   model: 'claude-3-5-sonnet-latest', // replace with a model ID available to your account
417 |   max_tokens: 1024,
418 |   tools: tools.map(({ name, description, inputSchema }) => ({
419 |     name, description, input_schema: inputSchema
420 |   })),
421 |   messages: [{ role: 'user', content: 'What is the main topic of example.com?' }]
422 | });
423 | ```
424 | 
425 | ## Development
426 | 
427 | ```bash
428 | # Clone the repository
429 | git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git
430 | cd webscraping-ai-mcp-server
431 | 
432 | # Install dependencies
433 | npm install
434 | 
435 | # Run tests
436 | npm test
437 | 
438 | # Add your .env file
439 | cp .env.example .env
440 | 
441 | # Start the inspector
442 | npx @modelcontextprotocol/inspector node src/index.js
443 | ```
444 | 
445 | ### Contributing
446 | 
447 | 1. Fork the repository
448 | 2. Create your feature branch
449 | 3. Run tests: `npm test`
450 | 4. Submit a pull request
451 | 
452 | ## License
453 | 
454 | MIT License - see LICENSE file for details 
455 | 
```
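
MCP text content carries a string in `text`, and the handlers in `src/index.js` serialize structured results with `JSON.stringify` before returning them. A consumer therefore usually has to decode that string itself. The sketch below shows one way to do so; `parseToolResult` is a hypothetical helper, and the sample payload is made up for illustration.

```javascript
// Hypothetical helper: unwraps an MCP tool result of the shape
// { content: [{ type: 'text', text: '...' }], isError?: boolean }.
function parseToolResult(result) {
  const text = (result.content || [])
    .filter((item) => item.type === 'text')
    .map((item) => item.text)
    .join('\n');

  // Structured results (fields, account, errors) arrive as JSON strings;
  // plain answers and raw HTML do not, so fall back to the original text.
  let value = text;
  try {
    value = JSON.parse(text);
  } catch {
    // Not JSON: keep the string as-is.
  }

  if (result.isError) {
    const error = new Error('Tool call failed');
    error.details = value;
    throw error;
  }
  return value;
}

// Illustrative payload, not real API output.
const fieldsResult = {
  content: [{ type: 'text', text: '{"title":"Example Product","price":"$99.99"}' }],
};
console.log(parseToolResult(fieldsResult).title); // prints "Example Product"
```

One caveat of this approach: a plain-text answer that happens to be valid JSON (for example `42`) is also decoded, so callers that need the raw string should read `content[0].text` directly.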

--------------------------------------------------------------------------------
/jest.config.js:
--------------------------------------------------------------------------------

```javascript
 1 | /** @type {import('jest').Config} */
 2 | export default {
 3 |   testEnvironment: 'node',
 4 |   transform: {},
 5 |   moduleNameMapper: {
 6 |     '^(\\.{1,2}/.*)\\.js$': '$1',
 7 |   },
 8 |   setupFilesAfterEnv: ['./jest.setup.js'],
 9 |   testMatch: ['**/*.test.js'],
10 | };
```
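
The `moduleNameMapper` entry strips a trailing `.js` from relative import specifiers so that Jest can resolve them without the extension. The sketch below applies the same pattern with plain string replacement to show which specifiers are affected. `mapSpecifier` and the sample file names are hypothetical, and Jest's real resolver does considerably more than this.

```javascript
// Same pattern as in jest.config.js; the doubled backslashes there are string escapes.
const pattern = new RegExp('^(\\.{1,2}/.*)\\.js$');

// Hypothetical stand-in for the mapping step: '$1' is the first capture group.
function mapSpecifier(specifier) {
  return specifier.replace(pattern, '$1');
}

// Relative '.js' specifiers are rewritten; bare package names and other extensions are not.
for (const specifier of ['./client.js', '../lib/util.js', 'axios', './data.json']) {
  console.log(specifier, '->', mapSpecifier(specifier));
}
```

Because the pattern is anchored on a leading `./` or `../`, package imports such as `@modelcontextprotocol/sdk/server/mcp.js` pass through unchanged.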

--------------------------------------------------------------------------------
/jest.setup.js:
--------------------------------------------------------------------------------

```javascript
 1 | import { jest } from '@jest/globals';
 2 | 
 3 | // Mock console methods to suppress output during tests
 4 | global.console = {
 5 |   ...console,
 6 |   log: jest.fn(),
 7 |   debug: jest.fn(),
 8 |   info: jest.fn(),
 9 |   warn: jest.fn(),
10 |   error: jest.fn(),
11 | };
12 | 
13 | // Add any additional global test setup here 
```

--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------

```dockerfile
 1 | FROM node:18-alpine
 2 | 
 3 | WORKDIR /app
 4 | 
 5 | # Copy package.json and package-lock.json
 6 | COPY package*.json ./
 7 | 
 8 | # Install only production dependencies
 9 | RUN npm ci --omit=dev
10 | 
11 | # Copy source files
12 | COPY . .
13 | 
14 | # Set environment variables
15 | ENV NODE_ENV=production
16 | 
17 | # Command to run the application
18 | ENTRYPOINT ["node", "src/index.js"]
19 | 
20 | # Set default arguments
21 | CMD []
22 | 
23 | # Document that the service uses stdin/stdout for communication
24 | LABEL org.opencontainers.image.description="WebScraping.AI MCP Server - Model Context Protocol server for WebScraping.AI API"
25 | LABEL org.opencontainers.image.source="https://github.com/webscraping-ai/webscraping-ai-mcp-server"
26 | LABEL org.opencontainers.image.licenses="MIT"
27 | 
```

--------------------------------------------------------------------------------
/.github/workflows/ci.yml:
--------------------------------------------------------------------------------

```yaml
 1 | name: CI
 2 | 
 3 | on:
 4 |   push:
 5 |     branches: [master]
 6 |   pull_request:
 7 |     branches: [master]
 8 | 
 9 | jobs:
10 |   test:
11 |     runs-on: ubuntu-latest
12 | 
13 |     strategy:
14 |       matrix:
15 |         node-version: [18.x, 20.x]
16 | 
17 |     steps:
18 |       - uses: actions/checkout@v3
19 |       
20 |       - name: Use Node.js ${{ matrix.node-version }}
21 |         uses: actions/setup-node@v3
22 |         with:
23 |           node-version: ${{ matrix.node-version }}
24 |           cache: 'npm'
25 |           
26 |       - name: Install dependencies
27 |         run: npm ci
28 |         
29 |       - name: Lint
30 |         run: npm run lint
31 |         
32 |       - name: Test
33 |         run: npm test
34 |         env:
35 |           WEBSCRAPING_AI_API_KEY: ${{ secrets.WEBSCRAPING_AI_API_KEY || 'test-api-key' }} 
36 | 
```

--------------------------------------------------------------------------------
/smithery.yaml:
--------------------------------------------------------------------------------

```yaml
 1 | # Smithery configuration file: https://smithery.ai/docs/config#smitheryyaml
 2 | 
 3 | startCommand:
 4 |   type: stdio
 5 |   configSchema:
 6 |     # JSON Schema defining the configuration options for the MCP.
 7 |     type: object
 8 |     required:
 9 |       - webscrapingAiApiKey
10 |     properties:
11 |       webscrapingAiApiKey:
12 |         type: string
13 |         description: Your WebScraping.AI API key. Required for API usage.
14 |       webscrapingAiApiUrl:
15 |         type: string
16 |         description: Custom API endpoint. Default is https://api.webscraping.ai.
17 |       webscrapingAiConcurrencyLimit:
18 |         type: integer
19 |         description: Maximum concurrent requests allowed (default 5).
20 |   commandFunction:
21 |     # A function that produces the CLI command to start the MCP on stdio.
22 |     |-
23 |     (config) => ({ 
24 |       command: 'node', 
25 |       args: ['src/index.js'], 
26 |       env: { 
27 |         WEBSCRAPING_AI_API_KEY: config.webscrapingAiApiKey,
28 |         WEBSCRAPING_AI_CONCURRENCY_LIMIT: String(config.webscrapingAiConcurrencyLimit || 5)
29 |       } 
30 |     }) 
31 | 
```
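
To make the `commandFunction` above concrete, the sketch below evaluates the same arrow function against sample configuration objects. The sample values are made up. Note that `webscrapingAiApiUrl` is declared in the schema but is not forwarded by the function as written.

```javascript
// The function body is copied from smithery.yaml; only the sample inputs are invented.
const commandFunction = (config) => ({
  command: 'node',
  args: ['src/index.js'],
  env: {
    WEBSCRAPING_AI_API_KEY: config.webscrapingAiApiKey,
    WEBSCRAPING_AI_CONCURRENCY_LIMIT: String(config.webscrapingAiConcurrencyLimit || 5)
  }
});

// Without a limit, the `|| 5` fallback applies and is stringified for the environment.
console.log(commandFunction({ webscrapingAiApiKey: 'sample-key' }));

// An explicit limit is passed through as a string.
console.log(commandFunction({ webscrapingAiApiKey: 'sample-key', webscrapingAiConcurrencyLimit: 10 }));
```

The `String(...)` conversion matters because environment variables are always strings, even though the schema declares the limit as an integer.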

--------------------------------------------------------------------------------
/package.json:
--------------------------------------------------------------------------------

```json
 1 | {
 2 |   "name": "webscraping-ai-mcp",
 3 |   "version": "1.0.2",
 4 |   "description": "Model Context Protocol server for WebScraping.AI API. Provides LLM-powered web scraping tools with Chromium JavaScript rendering, rotating proxies, and HTML parsing.",
 5 |   "type": "module",
 6 |   "bin": {
 7 |     "webscraping-ai-mcp": "src/index.js"
 8 |   },
 9 |   "files": [
10 |     "src"
11 |   ],
12 |   "scripts": {
13 |     "test": "node --experimental-vm-modules node_modules/jest/bin/jest.js",
14 |     "start": "node src/index.js",
15 |     "lint": "eslint src/**/*.js",
16 |     "lint:fix": "eslint src/**/*.js --fix",
17 |     "format": "prettier --write ."
18 |   },
19 |   "license": "MIT",
20 |   "dependencies": {
21 |     "@modelcontextprotocol/sdk": "^1.4.1",
22 |     "axios": "^1.6.7",
23 |     "dotenv": "^16.4.7",
24 |     "p-queue": "^8.0.1"
25 |   },
26 |   "devDependencies": {
27 |     "@jest/globals": "^29.7.0",
28 |     "eslint": "^8.56.0",
29 |     "eslint-config-prettier": "^9.1.0",
30 |     "jest": "^29.7.0",
31 |     "jest-mock-extended": "^4.0.0-beta1",
32 |     "prettier": "^3.1.1"
33 |   },
34 |   "engines": {
35 |     "node": ">=18.0.0"
36 |   },
37 |   "keywords": [
38 |     "mcp",
39 |     "webscraping",
40 |     "web-scraping",
41 |     "crawler",
42 |     "content-extraction",
43 |     "llm"
44 |   ],
45 |   "main": "src/index.js",
46 |   "repository": {
47 |     "type": "git",
48 |     "url": "git+https://github.com/webscraping-ai/webscraping-ai-mcp-server.git"
49 |   },
50 |   "author": "WebScraping.AI",
51 |   "bugs": {
52 |     "url": "https://github.com/webscraping-ai/webscraping-ai-mcp-server/issues"
53 |   },
54 |   "homepage": "https://github.com/webscraping-ai/webscraping-ai-mcp-server#readme"
55 | }
56 | 
```

--------------------------------------------------------------------------------
/src/index.js:
--------------------------------------------------------------------------------

```javascript
  1 | #!/usr/bin/env node
  2 | 
  3 | import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
  4 | import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
  5 | import { z } from 'zod';
  6 | import axios from 'axios';
  7 | import dotenv from 'dotenv';
  8 | import PQueue from 'p-queue';
  9 | 
 10 | dotenv.config();
 11 | 
 12 | // Environment variables
 13 | const WEBSCRAPING_AI_API_KEY = process.env.WEBSCRAPING_AI_API_KEY || '';
 14 | const WEBSCRAPING_AI_API_URL = 'https://api.webscraping.ai';
 15 | const CONCURRENCY_LIMIT = Number(process.env.WEBSCRAPING_AI_CONCURRENCY_LIMIT || 5);
 16 | const DEFAULT_PROXY_TYPE = process.env.WEBSCRAPING_AI_DEFAULT_PROXY_TYPE || 'residential';
 17 | const DEFAULT_JS_RENDERING = process.env.WEBSCRAPING_AI_DEFAULT_JS_RENDERING !== 'false';
 18 | const DEFAULT_TIMEOUT = Number(process.env.WEBSCRAPING_AI_DEFAULT_TIMEOUT || 15000);
 19 | const DEFAULT_JS_TIMEOUT = Number(process.env.WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT || 2000);
 20 | 
 21 | // Validate required environment variables
 22 | if (!WEBSCRAPING_AI_API_KEY) {
 23 |   console.error('WEBSCRAPING_AI_API_KEY environment variable is required');
 24 |   process.exit(1);
 25 | }
 26 | 
 27 | class WebScrapingAIClient {
 28 |   constructor(options = {}) {
 29 |     const apiKey = options.apiKey || WEBSCRAPING_AI_API_KEY;
 30 |     const baseUrl = options.baseUrl || WEBSCRAPING_AI_API_URL;
 31 |     const timeout = options.timeout || 60000;
 32 |     const concurrency = options.concurrency || CONCURRENCY_LIMIT;
 33 | 
 34 |     if (!apiKey) {
 35 |       throw new Error('WebScraping.AI API key is required');
 36 |     }
 37 | 
 38 |     this.client = axios.create({
 39 |       baseURL: baseUrl,
 40 |       timeout: timeout,
 41 |       headers: {
 42 |         'Content-Type': 'application/json',
 43 |         'Accept': 'application/json',
 44 |       }
 45 |     });
 46 | 
 47 |     this.queue = new PQueue({ concurrency });
 48 |     this.apiKey = apiKey;
 49 |   }
 50 | 
 51 |   async request(endpoint, params) {
 52 |     try {
 53 |       return await this.queue.add(async () => {
 54 |         const response = await this.client.get(endpoint, { 
 55 |           params: {
 56 |             ...params,
 57 |             api_key: this.apiKey,
 58 |             from_mcp_server: true
 59 |           }
 60 |         });
 61 |         return response.data;
 62 |       });
 63 |     } catch (error) {
 64 |       const errorResponse = {
 65 |         message: 'API Error',
 66 |         status_code: error.response?.status,
 67 |         status_message: error.response?.statusText,
 68 |         body: error.response?.data
 69 |       };
 70 |       throw new Error(JSON.stringify(errorResponse));
 71 |     }
 72 |   }
 73 | 
 74 |   async question(url, question, options = {}) {
 75 |     return this.request('/ai/question', {
 76 |       url,
 77 |       question,
 78 |       ...options
 79 |     });
 80 |   }
 81 | 
 82 |   async fields(url, fields, options = {}) {
 83 |     return this.request('/ai/fields', {
 84 |       url,
 85 |       fields: JSON.stringify(fields),
 86 |       ...options
 87 |     });
 88 |   }
 89 | 
 90 |   async html(url, options = {}) {
 91 |     return this.request('/html', {
 92 |       url,
 93 |       ...options
 94 |     });
 95 |   }
 96 | 
 97 |   async text(url, options = {}) {
 98 |     return this.request('/text', {
 99 |       url,
100 |       ...options
101 |     });
102 |   }
103 | 
104 |   async selected(url, selector, options = {}) {
105 |     return this.request('/selected', {
106 |       url,
107 |       selector,
108 |       ...options
109 |     });
110 |   }
111 | 
112 |   async selectedMultiple(url, selectors, options = {}) {
113 |     return this.request('/selected-multiple', {
114 |       url,
115 |       selectors,
116 |       ...options
117 |     });
118 |   }
119 | 
120 |   async account() {
121 |     return this.request('/account', {});
122 |   }
123 | }
124 | 
125 | // Create WebScrapingAI client
126 | const client = new WebScrapingAIClient();
127 | 
128 | // Create MCP server
129 | const server = new McpServer({
130 |   name: 'WebScraping.AI MCP Server',
131 |   version: '1.0.2'
132 | });
133 | 
134 | // Common options schema for all tools
135 | const commonOptionsSchema = {
136 |   timeout: z.number().optional().default(DEFAULT_TIMEOUT).describe(`Maximum web page retrieval time in ms (${DEFAULT_TIMEOUT} by default, maximum is 30000).`),
137 |   js: z.boolean().optional().default(DEFAULT_JS_RENDERING).describe(`Execute on-page JavaScript using a headless browser (${DEFAULT_JS_RENDERING} by default).`),
138 |   js_timeout: z.number().optional().default(DEFAULT_JS_TIMEOUT).describe(`Maximum JavaScript rendering time in ms (${DEFAULT_JS_TIMEOUT} by default).`),
139 |   wait_for: z.string().optional().describe('CSS selector to wait for before returning the page content.'),
140 |   proxy: z.enum(['datacenter', 'residential']).optional().default(DEFAULT_PROXY_TYPE).describe(`Type of proxy, datacenter or residential (${DEFAULT_PROXY_TYPE} by default).`),
141 |   country: z.enum(['us', 'gb', 'de', 'it', 'fr', 'ca', 'es', 'ru', 'jp', 'kr', 'in']).optional().describe('Country of the proxy to use (US by default).'),
142 |   custom_proxy: z.string().optional().describe('Your own proxy URL in "http://user:password@host:port" format.'),
143 |   device: z.enum(['desktop', 'mobile', 'tablet']).optional().describe('Type of device emulation.'),
144 |   error_on_404: z.boolean().optional().describe('Return error on 404 HTTP status on the target page (false by default).'),
145 |   error_on_redirect: z.boolean().optional().describe('Return error on redirect on the target page (false by default).'),
146 |   js_script: z.string().optional().describe('Custom JavaScript code to execute on the target page.')
147 | };
148 | 
149 | // Define and register tools
150 | server.tool(
151 |   'webscraping_ai_question',
152 |   {
153 |     url: z.string().describe('URL of the target page.'),
154 |     question: z.string().describe('Question or instructions to ask the LLM model about the target page.'),
155 |     ...commonOptionsSchema
156 |   },
157 |   async ({ url, question, ...options }) => {
158 |     try {
159 |       const result = await client.question(url, question, options);
160 |       return {
161 |         content: [{ type: 'text', text: result }]
162 |       };
163 |     } catch (error) {
164 |       return {
165 |         content: [{ type: 'text', text: error.message }],
166 |         isError: true
167 |       };
168 |     }
169 |   }
170 | );
171 | 
172 | server.tool(
173 |   'webscraping_ai_fields',
174 |   {
175 |     url: z.string().describe('URL of the target page.'),
176 |     fields: z.record(z.string()).describe('Dictionary of field names with instructions for extraction.'),
177 |     ...commonOptionsSchema
178 |   },
179 |   async ({ url, fields, ...options }) => {
180 |     try {
181 |       const result = await client.fields(url, fields, options);
182 |       return {
183 |         content: [{ type: 'text', text: JSON.stringify(result, null, 2) }]
184 |       };
185 |     } catch (error) {
186 |       return {
187 |         content: [{ type: 'text', text: error.message }],
188 |         isError: true
189 |       };
190 |     }
191 |   }
192 | );
193 | 
194 | server.tool(
195 |   'webscraping_ai_html',
196 |   {
197 |     url: z.string().describe('URL of the target page.'),
198 |     return_script_result: z.boolean().optional().describe('Return result of the custom JavaScript code execution.'),
199 |     format: z.enum(['json', 'text']).optional().describe('Response format (json or text).'),
200 |     ...commonOptionsSchema
201 |   },
202 |   async ({ url, return_script_result, format, ...options }) => {
203 |     try {
204 |       const result = await client.html(url, { ...options, return_script_result });
205 |       if (format === 'json') {
206 |         return {
207 |           content: [{ type: 'text', text: JSON.stringify({ html: result }) }]
208 |         };
209 |       }
210 |       return {
211 |         content: [{ type: 'text', text: result }]
212 |       };
213 |     } catch (error) {
214 |       const errorObj = JSON.parse(error.message);
215 |       return {
216 |         content: [{ type: 'text', text: JSON.stringify(errorObj) }],
217 |         isError: true
218 |       };
219 |     }
220 |   }
221 | );
222 | 
223 | server.tool(
224 |   'webscraping_ai_text',
225 |   {
226 |     url: z.string().describe('URL of the target page.'),
227 |     text_format: z.enum(['plain', 'xml', 'json']).optional().default('json').describe('Format of the text response.'),
228 |     return_links: z.boolean().optional().describe('Return links from the page body text.'),
229 |     ...commonOptionsSchema
230 |   },
231 |   async ({ url, text_format, return_links, ...options }) => {
232 |     try {
233 |       const result = await client.text(url, { 
234 |         ...options, 
235 |         text_format, 
236 |         return_links 
237 |       });
238 |       return {
239 |         content: [{ type: 'text', text: typeof result === 'object' ? JSON.stringify(result) : result }]
240 |       };
241 |     } catch (error) {
242 |       const errorObj = JSON.parse(error.message);
243 |       return {
244 |         content: [{ type: 'text', text: JSON.stringify(errorObj) }],
245 |         isError: true
246 |       };
247 |     }
248 |   }
249 | );
250 | 
251 | server.tool(
252 |   'webscraping_ai_selected',
253 |   {
254 |     url: z.string().describe('URL of the target page.'),
255 |     selector: z.string().describe('CSS selector to extract content for.'),
256 |     format: z.enum(['json', 'text']).optional().default('json').describe('Response format (json or text).'),
257 |     ...commonOptionsSchema
258 |   },
259 |   async ({ url, selector, format, ...options }) => {
260 |     try {
261 |       const result = await client.selected(url, selector, options);
262 |       if (format === 'json') {
263 |         return {
264 |           content: [{ type: 'text', text: JSON.stringify({ html: result }) }]
265 |         };
266 |       }
267 |       return {
268 |         content: [{ type: 'text', text: result }]
269 |       };
270 |     } catch (error) {
271 |       const errorObj = JSON.parse(error.message);
272 |       return {
273 |         content: [{ type: 'text', text: JSON.stringify(errorObj) }],
274 |         isError: true
275 |       };
276 |     }
277 |   }
278 | );
279 | 
280 | server.tool(
281 |   'webscraping_ai_selected_multiple',
282 |   {
283 |     url: z.string().describe('URL of the target page.'),
284 |     selectors: z.array(z.string()).describe('Array of CSS selectors to extract content for.'),
285 |     ...commonOptionsSchema
286 |   },
287 |   async ({ url, selectors, ...options }) => {
288 |     try {
289 |       const result = await client.selectedMultiple(url, selectors, options);
290 |       return {
291 |         content: [{ type: 'text', text: JSON.stringify(result, null, 2) }]
292 |       };
293 |     } catch (error) {
294 |       return {
295 |         content: [{ type: 'text', text: error.message }],
296 |         isError: true
297 |       };
298 |     }
299 |   }
300 | );
301 | 
302 | server.tool(
303 |   'webscraping_ai_account',
304 |   {},
305 |   async () => {
306 |     try {
307 |       const result = await client.account();
308 |       return {
309 |         content: [{ type: 'text', text: JSON.stringify(result, null, 2) }]
310 |       };
311 |     } catch (error) {
312 |       return {
313 |         content: [{ type: 'text', text: error.message }],
314 |         isError: true
315 |       };
316 |     }
317 |   }
318 | );
319 | 
320 | const transport = new StdioServerTransport();
321 | server.connect(transport).then(() => {
322 | }).catch(err => {
323 |   console.error('Failed to connect to transport:', err);
324 |   process.exit(1);
325 | });
326 | 
```
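
Note on the error paths in `/src/index.js`: the `webscraping_ai_html`, `webscraping_ai_text`, and `webscraping_ai_selected` handlers call `JSON.parse(error.message)` inside `catch`, while `webscraping_ai_selected_multiple` and `webscraping_ai_account` return `error.message` unchanged. If the client ever throws an error whose message is not JSON (for example a network failure), the parse itself would throw from inside the `catch`. The following is a minimal sketch of one way to unify both paths, not the repository's code. The helper name `toErrorResult` and the `{ message }` fallback shape are assumptions made for illustration.

```javascript
// Hypothetical helper (not part of src/index.js): builds an MCP-style error
// result from a caught error without assuming error.message is valid JSON.
function toErrorResult(error) {
  const message = error instanceof Error ? error.message : String(error);
  let payload;
  try {
    // Assumption: the client may encode the API error body as a JSON string.
    payload = JSON.parse(message);
  } catch {
    // Fallback for non-JSON messages (network errors, timeouts, and so on).
    payload = { message };
  }
  return {
    content: [{ type: 'text', text: JSON.stringify(payload) }],
    isError: true,
  };
}

// Usage inside a tool handler's catch block:
//   } catch (error) {
//     return toErrorResult(error);
//   }
const jsonCase = toErrorResult(new Error('{"message":"Invalid CSS selector"}'));
const plainCase = toErrorResult(new Error('socket hang up'));
console.log(jsonCase.content[0].text);
console.log(plainCase.content[0].text);
```

With a helper like this, every handler could share one `catch` body, and the shape of the error text would no longer depend on which tool failed.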

--------------------------------------------------------------------------------
/src/index.test.js:
--------------------------------------------------------------------------------

```javascript
  1 | import {
  2 |   describe,
  3 |   expect,
  4 |   jest,
  5 |   test,
  6 |   beforeEach,
  7 |   afterEach,
  8 | } from '@jest/globals';
  9 | import { Client } from "@modelcontextprotocol/sdk/client/index.js";
 10 | import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
 11 | 
 12 | // Create mock WebScrapingAIClient
 13 | class MockWebScrapingAIClient {
 14 |   constructor() {
 15 |     this.question = jest.fn().mockResolvedValue('This is the answer to your question.');
 16 |     this.fields = jest.fn().mockResolvedValue({ field1: 'value1', field2: 'value2' });
 17 |     this.html = jest.fn().mockResolvedValue('<html><body>Test HTML Content</body></html>');
 18 |     this.text = jest.fn().mockResolvedValue('Test text content');
 19 |     this.selected = jest.fn().mockResolvedValue('<div>Selected Element</div>');
 20 |     this.selectedMultiple = jest.fn().mockResolvedValue(['<div>Element 1</div>', '<div>Element 2</div>']);
 21 |     this.account = jest.fn().mockResolvedValue({ requests: 100, remaining: 900, limit: 1000 });
 22 |   }
 23 | }
 24 | 
 25 | // Test interfaces
 26 | class RequestContext {
 27 |   constructor(toolName, args) {
 28 |     this.params = {
 29 |       name: toolName,
 30 |       arguments: args
 31 |     };
 32 |   }
 33 | }
 34 | 
 35 | describe('WebScraping.AI MCP Server Tests', () => {
 36 |   let mockClient;
 37 |   let requestHandler;
 38 | 
 39 |   beforeEach(() => {
 40 |     jest.clearAllMocks();
 41 |     mockClient = new MockWebScrapingAIClient();
 42 | 
 43 |     // Create request handler function
 44 |     requestHandler = async (request) => {
 45 |       const { name: toolName, arguments: args } = request.params;
 46 |       if (!args && toolName !== 'webscraping_ai_account') {
 47 |         throw new Error('No arguments provided');
 48 |       }
 49 |       return handleRequest(toolName, args || {}, mockClient);
 50 |     };
 51 |   });
 52 | 
 53 |   afterEach(() => {
 54 |     jest.clearAllMocks();
 55 |   });
 56 | 
 57 |   // Test question functionality
 58 |   test('should handle question request', async () => {
 59 |     const url = 'https://example.com';
 60 |     const question = 'What is on this page?';
 61 | 
 62 |     const response = await requestHandler(
 63 |       new RequestContext('webscraping_ai_question', { url, question })
 64 |     );
 65 | 
 66 |     expect(response).toEqual({
 67 |       content: [{ type: 'text', text: 'This is the answer to your question.' }],
 68 |       isError: false
 69 |     });
 70 |     expect(mockClient.question).toHaveBeenCalledWith(url, question, {});
 71 |   });
 72 | 
 73 |   // Test fields functionality
 74 |   test('should handle fields request', async () => {
 75 |     const url = 'https://example.com';
 76 |     const fields = { 
 77 |       title: 'Extract the title', 
 78 |       price: 'Extract the price' 
 79 |     };
 80 | 
 81 |     const response = await requestHandler(
 82 |       new RequestContext('webscraping_ai_fields', { url, fields })
 83 |     );
 84 | 
 85 |     expect(response).toEqual({
 86 |       content: [{ type: 'text', text: JSON.stringify({ field1: 'value1', field2: 'value2' }, null, 2) }],
 87 |       isError: false
 88 |     });
 89 |     expect(mockClient.fields).toHaveBeenCalledWith(url, fields, {});
 90 |   });
 91 | 
 92 |   // Test html functionality
 93 |   test('should handle html request', async () => {
 94 |     const url = 'https://example.com';
 95 | 
 96 |     const response = await requestHandler(
 97 |       new RequestContext('webscraping_ai_html', { url })
 98 |     );
 99 | 
100 |     expect(response).toEqual({
101 |       content: [{ type: 'text', text: '<html><body>Test HTML Content</body></html>' }],
102 |       isError: false
103 |     });
104 |     expect(mockClient.html).toHaveBeenCalledWith(url, {});
105 |   });
106 | 
107 |   // Test text functionality
108 |   test('should handle text request', async () => {
109 |     const url = 'https://example.com';
110 | 
111 |     const response = await requestHandler(
112 |       new RequestContext('webscraping_ai_text', { url })
113 |     );
114 | 
115 |     expect(response).toEqual({
116 |       content: [{ type: 'text', text: 'Test text content' }],
117 |       isError: false
118 |     });
119 |     expect(mockClient.text).toHaveBeenCalledWith(url, {});
120 |   });
121 | 
122 |   // Test selected functionality
123 |   test('should handle selected request', async () => {
124 |     const url = 'https://example.com';
125 |     const selector = '.main-content';
126 | 
127 |     const response = await requestHandler(
128 |       new RequestContext('webscraping_ai_selected', { url, selector })
129 |     );
130 | 
131 |     expect(response).toEqual({
132 |       content: [{ type: 'text', text: '<div>Selected Element</div>' }],
133 |       isError: false
134 |     });
135 |     expect(mockClient.selected).toHaveBeenCalledWith(url, selector, {});
136 |   });
137 | 
138 |   // Test selected_multiple functionality
139 |   test('should handle selected_multiple request', async () => {
140 |     const url = 'https://example.com';
141 |     const selectors = ['.item1', '.item2'];
142 | 
143 |     const response = await requestHandler(
144 |       new RequestContext('webscraping_ai_selected_multiple', { url, selectors })
145 |     );
146 | 
147 |     expect(response).toEqual({
148 |       content: [{ type: 'text', text: JSON.stringify(['<div>Element 1</div>', '<div>Element 2</div>'], null, 2) }],
149 |       isError: false
150 |     });
151 |     expect(mockClient.selectedMultiple).toHaveBeenCalledWith(url, selectors, {});
152 |   });
153 | 
154 |   // Test account functionality
155 |   test('should handle account request', async () => {
156 |     const response = await requestHandler(
157 |       new RequestContext('webscraping_ai_account', {})
158 |     );
159 | 
160 |     expect(response).toEqual({
161 |       content: [{ type: 'text', text: JSON.stringify({ requests: 100, remaining: 900, limit: 1000 }, null, 2) }],
162 |       isError: false
163 |     });
164 |     expect(mockClient.account).toHaveBeenCalled();
165 |   });
166 | 
167 |   // Test error handling
168 |   test('should handle API errors', async () => {
169 |     const url = 'https://example.com';
170 |     mockClient.question.mockRejectedValueOnce(new Error('API Error'));
171 | 
172 |     const response = await requestHandler(
173 |       new RequestContext('webscraping_ai_question', { url, question: 'What is on this page?' })
174 |     );
175 | 
176 |     expect(response.isError).toBe(true);
177 |     expect(response.content[0].text).toContain('API Error');
178 |   });
179 | 
180 |   // Test unknown tool
181 |   test('should handle unknown tool request', async () => {
182 |     const response = await requestHandler(
183 |       new RequestContext('unknown_tool', { some: 'args' })
184 |     );
185 | 
186 |     expect(response.isError).toBe(true);
187 |     expect(response.content[0].text).toContain('Unknown tool');
188 |   });
189 | 
190 |   // Test MCP Client Connection
191 |   xtest('should connect to MCP server and list tools', async () => {
192 |     const transport = new StdioClientTransport({
193 |       command: "node",
194 |       args: ["src/index.js"]
195 |     });
196 | 
197 |     const client = new Client({
198 |       name: "webscraping-ai-test-client",
199 |       version: "1.0.0"
200 |     });
201 | 
202 |     await client.connect(transport);
203 |     const response = await client.listTools();
204 |     
205 |     expect(response.tools).toEqual(expect.arrayContaining([
206 |       expect.objectContaining({
207 |         name: 'webscraping_ai_question',
208 |         inputSchema: expect.any(Object)
209 |       }),
210 |       expect.objectContaining({
211 |         name: 'webscraping_ai_fields',
212 |         inputSchema: expect.any(Object)
213 |       }),
214 |       expect.objectContaining({
215 |         name: 'webscraping_ai_html',
216 |         inputSchema: expect.any(Object)
217 |       }),
218 |       expect.objectContaining({
219 |         name: 'webscraping_ai_text',
220 |         inputSchema: expect.any(Object)
221 |       }),
222 |       expect.objectContaining({
223 |         name: 'webscraping_ai_selected',
224 |         inputSchema: expect.any(Object)
225 |       }),
226 |       expect.objectContaining({
227 |         name: 'webscraping_ai_selected_multiple',
228 |         inputSchema: expect.any(Object)
229 |       }),
230 |       expect.objectContaining({
231 |         name: 'webscraping_ai_account',
232 |         inputSchema: expect.any(Object)
233 |       })
234 |     ]));
235 | 
236 |     await client.close();
237 |   });
238 | });
239 | 
240 | // Helper function to simulate request handling
241 | async function handleRequest(name, args, client) {
242 |   try {
243 |     const options = { ...args };
244 |     
245 |     // Remove required parameters from options for each tool type
246 |     switch (name) {
247 |       case 'webscraping_ai_question': {
248 |         const { url, question, ...rest } = options;
249 |         if (!url || !question) {
250 |           throw new Error('URL and question are required');
251 |         }
252 |         
253 |         const result = await client.question(url, question, rest);
254 |         return {
255 |           content: [{ type: 'text', text: result }],
256 |           isError: false
257 |         };
258 |       }
259 | 
260 |       case 'webscraping_ai_fields': {
261 |         const { url, fields, ...rest } = options;
262 |         if (!url || !fields) {
263 |           throw new Error('URL and fields are required');
264 |         }
265 |         
266 |         const result = await client.fields(url, fields, rest);
267 |         return {
268 |           content: [{ type: 'text', text: JSON.stringify(result, null, 2) }],
269 |           isError: false
270 |         };
271 |       }
272 | 
273 |       case 'webscraping_ai_html': {
274 |         const { url, ...rest } = options;
275 |         if (!url) {
276 |           throw new Error('URL is required');
277 |         }
278 |         
279 |         const result = await client.html(url, rest);
280 |         return {
281 |           content: [{ type: 'text', text: result }],
282 |           isError: false
283 |         };
284 |       }
285 | 
286 |       case 'webscraping_ai_text': {
287 |         const { url, ...rest } = options;
288 |         if (!url) {
289 |           throw new Error('URL is required');
290 |         }
291 |         
292 |         const result = await client.text(url, rest);
293 |         return {
294 |           content: [{ type: 'text', text: result }],
295 |           isError: false
296 |         };
297 |       }
298 | 
299 |       case 'webscraping_ai_selected': {
300 |         const { url, selector, ...rest } = options;
301 |         if (!url || !selector) {
302 |           throw new Error('URL and selector are required');
303 |         }
304 |         
305 |         const result = await client.selected(url, selector, rest);
306 |         return {
307 |           content: [{ type: 'text', text: result }],
308 |           isError: false
309 |         };
310 |       }
311 | 
312 |       case 'webscraping_ai_selected_multiple': {
313 |         const { url, selectors, ...rest } = options;
314 |         if (!url || !selectors) {
315 |           throw new Error('URL and selectors are required');
316 |         }
317 |         
318 |         const result = await client.selectedMultiple(url, selectors, rest);
319 |         return {
320 |           content: [{ type: 'text', text: JSON.stringify(result, null, 2) }],
321 |           isError: false
322 |         };
323 |       }
324 | 
325 |       case 'webscraping_ai_account': {
326 |         const result = await client.account();
327 |         return {
328 |           content: [{ type: 'text', text: JSON.stringify(result, null, 2) }],
329 |           isError: false
330 |         };
331 |       }
332 | 
333 |       default:
334 |         throw new Error(`Unknown tool: ${name}`);
335 |     }
336 |   } catch (error) {
337 |     return {
338 |       content: [{ type: 'text', text: error.message }],
339 |       isError: true
340 |     };
341 |   }
342 | } 
343 | 
```

--------------------------------------------------------------------------------
/openapi.yml:
--------------------------------------------------------------------------------

```yaml
  1 | openapi: 3.1.0
  2 | info:
  3 |   title: WebScraping.AI
  4 |   contact:
  5 |     name: WebScraping.AI Support
  6 |     url: https://webscraping.ai
  7 |     email: [email protected]
  8 |   version: 3.2.0
  9 |   description: WebScraping.AI scraping API provides LLM-powered tools with Chromium JavaScript rendering, rotating proxies, and built-in HTML parsing.
 10 | tags:
 11 |   - name: AI
 12 |     description: Analyze web pages using LLMs
 13 |   - name: HTML
 14 |     description: Get full HTML content of pages using proxies and Chromium JS rendering
 15 |   - name: Text
 16 |     description: Get visible text of pages using proxies and Chromium JS rendering
 17 |   - name: Selected HTML
 18 |     description: Get HTML content of selected page areas (like price, search results, page title, etc.)
 19 |   - name: Account
 20 |     description: Information about your account API credits quota
 21 | paths:
 22 |   /ai/question:
 23 |     get:
 24 |       summary: Get an answer to a question about a given web page
 25 |       description: Returns the answer in plain text. Proxies and Chromium JavaScript rendering are used for page retrieval and processing, then the answer is extracted using an LLM model.
 26 |       operationId: getQuestion
 27 |       tags: [ "AI" ]
 28 |       parameters:
 29 |         - $ref: '#/components/parameters/url'
 30 |         - $ref: '#/components/parameters/question'
 31 |         - $ref: '#/components/parameters/headers'
 32 |         - $ref: '#/components/parameters/timeout'
 33 |         - $ref: '#/components/parameters/js'
 34 |         - $ref: '#/components/parameters/js_timeout'
 35 |         - $ref: '#/components/parameters/wait_for'
 36 |         - $ref: '#/components/parameters/proxy'
 37 |         - $ref: '#/components/parameters/country'
 38 |         - $ref: '#/components/parameters/custom_proxy'
 39 |         - $ref: '#/components/parameters/device'
 40 |         - $ref: '#/components/parameters/error_on_404'
 41 |         - $ref: '#/components/parameters/error_on_redirect'
 42 |         - $ref: '#/components/parameters/js_script'
 43 |         - $ref: '#/components/parameters/format'
 44 |       responses:
 45 |         400:
 46 |           $ref: '#/components/responses/400'
 47 |         402:
 48 |           $ref: '#/components/responses/402'
 49 |         403:
 50 |           $ref: '#/components/responses/403'
 51 |         429:
 52 |           $ref: '#/components/responses/429'
 53 |         500:
 54 |           $ref: '#/components/responses/500'
 55 |         504:
 56 |           $ref: '#/components/responses/504'
 57 |         200:
 58 |           description: Success
 59 |           content:
 60 |             text/html:
 61 |               schema:
 62 |                 type: string
 63 |               example: "Some answer"
 64 |   /ai/fields:
 65 |     get:
 66 |       summary: Extract structured data fields from a web page
 67 |       description: Returns structured data fields extracted from the webpage using an LLM model. Proxies and Chromium JavaScript rendering are used for page retrieval and processing.
 68 |       operationId: getFields
 69 |       tags: [ "AI" ]
 70 |       parameters:
 71 |         - $ref: '#/components/parameters/url'
 72 |         - in: query
 73 |           name: fields
 74 |           description: Object describing fields to extract from the page and their descriptions
 75 |           required: true
 76 |           example: {"title":"Main product title","price":"Current product price","description":"Full product description"}
 77 |           schema:
 78 |             type: object
 79 |             additionalProperties:
 80 |               type: string
 81 |           style: deepObject
 82 |           explode: true
 83 |         - $ref: '#/components/parameters/headers'
 84 |         - $ref: '#/components/parameters/timeout'
 85 |         - $ref: '#/components/parameters/js'
 86 |         - $ref: '#/components/parameters/js_timeout'
 87 |         - $ref: '#/components/parameters/wait_for'
 88 |         - $ref: '#/components/parameters/proxy'
 89 |         - $ref: '#/components/parameters/country'
 90 |         - $ref: '#/components/parameters/custom_proxy'
 91 |         - $ref: '#/components/parameters/device'
 92 |         - $ref: '#/components/parameters/error_on_404'
 93 |         - $ref: '#/components/parameters/error_on_redirect'
 94 |         - $ref: '#/components/parameters/js_script'
 95 |       responses:
 96 |         400:
 97 |           $ref: '#/components/responses/400'
 98 |         402:
 99 |           $ref: '#/components/responses/402'
100 |         403:
101 |           $ref: '#/components/responses/403'
102 |         429:
103 |           $ref: '#/components/responses/429'
104 |         500:
105 |           $ref: '#/components/responses/500'
106 |         504:
107 |           $ref: '#/components/responses/504'
108 |         200:
109 |           description: Success
110 |           content:
111 |             application/json:
112 |               schema:
113 |                 type: object
114 |                 additionalProperties:
115 |                   type: string
116 |               example:
117 |                 title: "Example Product"
118 |                 price: "$99.99"
119 |                 description: "This is a sample product description"
120 |   /html:
121 |     get:
122 |       summary: Page HTML by URL
123 |       description: Returns the full HTML content of a webpage specified by the URL. The response is in plain text. Proxies and Chromium JavaScript rendering are used for page retrieval and processing.
124 |       operationId: getHTML
125 |       tags: ["HTML"]
126 |       parameters:
127 |         - $ref: '#/components/parameters/url'
128 |         - $ref: '#/components/parameters/headers'
129 |         - $ref: '#/components/parameters/timeout'
130 |         - $ref: '#/components/parameters/js'
131 |         - $ref: '#/components/parameters/js_timeout'
132 |         - $ref: '#/components/parameters/wait_for'
133 |         - $ref: '#/components/parameters/proxy'
134 |         - $ref: '#/components/parameters/country'
135 |         - $ref: '#/components/parameters/custom_proxy'
136 |         - $ref: '#/components/parameters/device'
137 |         - $ref: '#/components/parameters/error_on_404'
138 |         - $ref: '#/components/parameters/error_on_redirect'
139 |         - $ref: '#/components/parameters/js_script'
140 |         - $ref: '#/components/parameters/return_script_result'
141 |         - $ref: '#/components/parameters/format'
142 |       responses:
143 |         400:
144 |           $ref: '#/components/responses/400'
145 |         402:
146 |           $ref: '#/components/responses/402'
147 |         403:
148 |           $ref: '#/components/responses/403'
149 |         429:
150 |           $ref: '#/components/responses/429'
151 |         500:
152 |           $ref: '#/components/responses/500'
153 |         504:
154 |           $ref: '#/components/responses/504'
155 |         200:
156 |           description: Success
157 |           content:
158 |             text/html:
159 |               schema:
160 |                 type: string
161 |               example: "<html><head>\n    <title>Example Domain</title>\n</head>\n\n<body>\n<div>\n    <h1>Example Domain</h1>\n</body></html>"
162 |   /text:
163 |     get:
164 |       summary: Page text by URL
165 |       description: Returns the visible text content of a webpage specified by the URL. Can be used to feed data to LLM models. The response can be in plain text, JSON, or XML format based on the text_format parameter. Proxies and Chromium JavaScript rendering are used for page retrieval and processing. Returns JSON on error.
166 |       operationId: getText
167 |       tags: [ "Text" ]
168 |       parameters:
169 |         - $ref: '#/components/parameters/text_format'
170 |         - $ref: '#/components/parameters/return_links'
171 |         - $ref: '#/components/parameters/url'
172 |         - $ref: '#/components/parameters/headers'
173 |         - $ref: '#/components/parameters/timeout'
174 |         - $ref: '#/components/parameters/js'
175 |         - $ref: '#/components/parameters/js_timeout'
176 |         - $ref: '#/components/parameters/wait_for'
177 |         - $ref: '#/components/parameters/proxy'
178 |         - $ref: '#/components/parameters/country'
179 |         - $ref: '#/components/parameters/custom_proxy'
180 |         - $ref: '#/components/parameters/device'
181 |         - $ref: '#/components/parameters/error_on_404'
182 |         - $ref: '#/components/parameters/error_on_redirect'
183 |         - $ref: '#/components/parameters/js_script'
184 |       responses:
185 |         400:
186 |           $ref: '#/components/responses/400'
187 |         402:
188 |           $ref: '#/components/responses/402'
189 |         403:
190 |           $ref: '#/components/responses/403'
191 |         429:
192 |           $ref: '#/components/responses/429'
193 |         500:
194 |           $ref: '#/components/responses/500'
195 |         504:
196 |           $ref: '#/components/responses/504'
197 |         200:
198 |           description: Success
199 |           content:
200 |             text/html:
201 |               schema:
202 |                 type: string
203 |               example: "Some content"
204 |             text/xml:
205 |               schema:
206 |                 type: string
207 |               example: "<title>Some title</title>\n<description>Some description</description>\n<content>Some content</content>"
208 |             application/json:
209 |               schema:
210 |                 type: string
211 |               example: '{"title":"Some title","description":"Some description","content":"Some content"}'
212 |   /selected:
213 |     get:
214 |       summary: HTML of a selected page area by URL and CSS selector
215 |       description: Returns HTML of a selected page area by URL and CSS selector. Useful if you don't want to do the HTML parsing on your side.
216 |       operationId: getSelected
217 |       tags: ["Selected HTML"]
218 |       parameters:
219 |         - in: query
220 |           name: selector
221 |           description: CSS selector (null by default, returns whole page HTML)
222 |           example: "h1"
223 |           schema:
224 |             type: string
225 |         - $ref: '#/components/parameters/url'
226 |         - $ref: '#/components/parameters/headers'
227 |         - $ref: '#/components/parameters/timeout'
228 |         - $ref: '#/components/parameters/js'
229 |         - $ref: '#/components/parameters/js_timeout'
230 |         - $ref: '#/components/parameters/wait_for'
231 |         - $ref: '#/components/parameters/proxy'
232 |         - $ref: '#/components/parameters/country'
233 |         - $ref: '#/components/parameters/custom_proxy'
234 |         - $ref: '#/components/parameters/device'
235 |         - $ref: '#/components/parameters/error_on_404'
236 |         - $ref: '#/components/parameters/error_on_redirect'
237 |         - $ref: '#/components/parameters/js_script'
238 |         - $ref: '#/components/parameters/format'
239 |       responses:
240 |         400:
241 |           $ref: '#/components/responses/400'
242 |         402:
243 |           $ref: '#/components/responses/402'
244 |         403:
245 |           $ref: '#/components/responses/403'
246 |         429:
247 |           $ref: '#/components/responses/429'
248 |         500:
249 |           $ref: '#/components/responses/500'
250 |         504:
251 |           $ref: '#/components/responses/504'
252 |         200:
253 |           description: Success
254 |           content:
255 |             text/html:
256 |               schema:
257 |                 type: string
258 |               example: "<a href=\"https://www.iana.org/domains/example\">More information...</a>"
259 |   /selected-multiple:
260 |     get:
261 |       summary: HTML of multiple page areas by URL and CSS selectors
262 |       description: Returns HTML of multiple page areas by URL and CSS selectors. Useful if you don't want to do the HTML parsing on your side.
263 |       operationId: getSelectedMultiple
264 |       tags: ["Selected HTML"]
265 |       parameters:
266 |         - in: query
267 |           name: selectors
268 |           description: Multiple CSS selectors (null by default, returns whole page HTML)
269 |           example: ["h1"]
270 |           schema:
271 |             type: array
272 |             items:
273 |               type: string
274 |           style: form
275 |           explode: true
276 |         - $ref: '#/components/parameters/url'
277 |         - $ref: '#/components/parameters/headers'
278 |         - $ref: '#/components/parameters/timeout'
279 |         - $ref: '#/components/parameters/js'
280 |         - $ref: '#/components/parameters/js_timeout'
281 |         - $ref: '#/components/parameters/wait_for'
282 |         - $ref: '#/components/parameters/proxy'
283 |         - $ref: '#/components/parameters/country'
284 |         - $ref: '#/components/parameters/custom_proxy'
285 |         - $ref: '#/components/parameters/device'
286 |         - $ref: '#/components/parameters/error_on_404'
287 |         - $ref: '#/components/parameters/error_on_redirect'
288 |         - $ref: '#/components/parameters/js_script'
289 |       responses:
290 |         400:
291 |           $ref: '#/components/responses/400'
292 |         402:
293 |           $ref: '#/components/responses/402'
294 |         403:
295 |           $ref: '#/components/responses/403'
296 |         429:
297 |           $ref: '#/components/responses/429'
298 |         500:
299 |           $ref: '#/components/responses/500'
300 |         504:
301 |           $ref: '#/components/responses/504'
302 |         200:
303 |           description: Success
304 |           content:
305 |             application/json:
306 |               schema:
307 |                 $ref: "#/components/schemas/SelectedAreas"
308 |               example: "[\"<a href='/test'>some link</a>\", \"Hello\"]"
309 |   /account:
310 |     get:
311 |       summary: Information about your account calls quota
312 |       description: Returns information about your account, including the remaining API credits quota, the next billing cycle start time, and the remaining concurrent requests. The response is in JSON format.
313 |       operationId: account
314 |       tags: [ "Account" ]
315 |       responses:
316 |         403:
317 |           $ref: '#/components/responses/403'
318 |         200:
319 |           description: Success
320 |           content:
321 |             application/json:
322 |               schema:
323 |                 $ref: "#/components/schemas/Account"
324 |               example:
325 |                 remaining_api_calls: 200000
326 |                 resets_at: 1617073667
327 |                 remaining_concurrency: 100
328 | security:
329 |   - api_key: []
330 | servers:
331 |   - url: https://api.webscraping.ai
332 | components:
333 |   securitySchemes:
334 |     api_key:
335 |       type: apiKey
336 |       name: api_key
337 |       in: query
338 |   responses:
339 |     400:
340 |       description: Parameters validation error
341 |       content:
342 |         application/json:
343 |           schema:
344 |             $ref: "#/components/schemas/Error"
345 |           example:
346 |             {
347 |               "message": "Invalid CSS selector"
348 |             }
349 | 
350 |     402:
351 |       description: Billing issue, probably you've run out of credits
352 |       content:
353 |         application/json:
354 |           schema:
355 |             $ref: "#/components/schemas/Error"
356 |           example:
357 |             {
358 |               message: "Some error"
359 |             }
360 |     403:
361 |       description: Wrong API key
362 |       content:
363 |         application/json:
364 |           schema:
365 |             $ref: "#/components/schemas/Error"
366 |           example:
367 |             {
368 |               message: "Some error"
369 |             }
370 |     429:
371 |       description: Too many concurrent requests
372 |       content:
373 |         application/json:
374 |           schema:
375 |             $ref: "#/components/schemas/Error"
376 |           example:
377 |             {
378 |               message: "Some error"
379 |             }
380 |     500:
381 |       description: Non-2xx and non-404 HTTP status code on the target page or unexpected error, try again or contact [email protected]
382 |       content:
383 |         application/json:
384 |           schema:
385 |             $ref: "#/components/schemas/Error"
386 |           example:
387 |             {
388 |               "message": "Unexpected HTTP code on the target page",
389 |               "status_code": 500,
390 |               "status_message": "Some website error",
391 |             }
392 |     504:
393 |       description: Timeout error; try increasing the timeout parameter value
394 |       content:
395 |         application/json:
396 |           schema:
397 |             $ref: "#/components/schemas/Error"
398 |           example:
399 |             {
400 |               message: "Some error"
401 |             }
402 |   parameters:
403 |     ## Shared everywhere
404 |     url:
405 |       in: query
406 |       name: url
407 |       description: URL of the target page.
408 |       required: true
409 |       example: "https://example.com"
410 |       schema:
411 |         type: string
412 |     postUrl:
413 |       in: query
414 |       name: url
415 |       description: URL of the target page.
416 |       required: true
417 |       example: "https://httpbin.org/post"
418 |       schema:
419 |         type: string
420 |     headers:
421 |       in: query
422 |       name: headers
423 |       description: "HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"})."
424 |       example: '{"Cookie":"session=some_id"}'
425 |       schema:
426 |         type: object
427 |         additionalProperties:
428 |           type: string
429 |       style: deepObject
430 |       explode: true
431 |     timeout:
432 |       in: query
433 |       name: timeout
434 |       description: Maximum web page retrieval time in ms. Increase it in case of timeout errors (10000 by default, maximum is 30000).
435 |       example: 10000
436 |       schema:
437 |         type: integer
438 |         default: 10000
439 |         minimum: 1
440 |         maximum: 30000
441 |     js:
442 |       in: query
443 |       name: js
444 |       description: Execute on-page JavaScript using a headless browser (true by default).
445 |       example: true
446 |       schema:
447 |         type: boolean
448 |         default: true
449 |     js_timeout:
450 |       in: query
451 |       name: js_timeout
452 |       description: Maximum JavaScript rendering time in ms. Increase it if you see a loading indicator instead of data on the target page.
453 |       example: 2000
454 |       schema:
455 |         type: integer
456 |         default: 2000
457 |         minimum: 1
458 |         maximum: 20000
459 |     wait_for:
460 |       in: query
461 |       name: wait_for
462 |       description: CSS selector to wait for before returning the page content. Useful for pages with dynamic content loading. Overrides js_timeout.
463 |       schema:
464 |         type: string
465 |     proxy:
466 |       in: query
467 |       name: proxy
468 |       description: Type of proxy. Use residential proxies if the target site restricts traffic from datacenters (datacenter by default). Residential proxy requests cost more than datacenter requests; see the pricing page for details.
469 |       example: "datacenter"
470 |       schema:
471 |         type: string
472 |         default: "datacenter"
473 |         enum: [ "datacenter", "residential" ]
474 |     country:
475 |       in: query
476 |       name: country
477 |       description: Country of the proxy to use (US by default).
478 |       example: "us"
479 |       schema:
480 |         type: string
481 |         default: "us"
482 |         enum: [ "us", "gb", "de", "it", "fr", "ca", "es", "ru", "jp", "kr", "in" ]
483 |     custom_proxy:
484 |       in: query
485 |       name: custom_proxy
486 |       description: Your own proxy URL to use instead of our built-in proxy pool in "http://user:password@host:port" format (<a target="_blank" href="https://webscraping.ai/proxies/smartproxy">Smartproxy</a> for example).
487 |       example:
488 |       schema:
489 |         type: string
490 |     device:
491 |       in: query
492 |       name: device
493 |       description: Type of device emulation.
494 |       example: "desktop"
495 |       schema:
496 |         type: string
497 |         default: "desktop"
498 |         enum: [ "desktop", "mobile", "tablet" ]
499 |     error_on_404:
500 |       in: query
501 |       name: error_on_404
502 |       description: Return error on 404 HTTP status on the target page (false by default).
503 |       example: false
504 |       schema:
505 |         type: boolean
506 |         default: false
507 |     error_on_redirect:
508 |       in: query
509 |       name: error_on_redirect
510 |       description: Return error on redirect on the target page (false by default).
511 |       example: false
512 |       schema:
513 |         type: boolean
514 |         default: false
515 |     js_script:
516 |       in: query
517 |       name: js_script
518 |       description: Custom JavaScript code to execute on the target page.
519 |       example: "document.querySelector('button').click();"
520 |       schema:
521 |         type: string
522 |     return_script_result:
523 |       in: query
524 |       name: return_script_result
525 |       description: Return result of the custom JavaScript code (js_script parameter) execution on the target page (false by default, page HTML will be returned).
526 |       example: false
527 |       schema:
528 |         type: boolean
529 |         default: false
530 |     text_format:
531 |       in: query
532 |       name: text_format
533 |       description: Format of the text response (plain by default). "plain" will return only the page body text. "json" and "xml" will return a json/xml with "title", "description" and "content" keys.
534 |       example: "plain"
535 |       schema:
536 |         type: string
537 |         default: "plain"
538 |         enum: [ "plain", "xml", "json" ]
539 |     return_links:
540 |       in: query
541 |       name: return_links
542 |       description: "[Works only with text_format=json] Return links from the page body text (false by default). Useful for building web crawlers."
543 |       example: false
544 |       schema:
545 |         type: boolean
546 |         default: false
547 |     question:
548 |       in: query
549 |       name: question
550 |       description: Question or instructions for the LLM about the target page.
551 |       example: "What is the summary of this page content?"
552 |       schema:
553 |         type: string
554 |     format:
555 |       in: query
556 |       name: format
557 |       description: Format of the response (text by default). "json" will return a JSON object with the response, "text" will return a plain text/HTML response.
558 |       example: "json"
559 |       schema:
560 |         type: string
561 |         default: "json"
562 |         enum: [ "json", "text" ]
563 | 
564 |   requestBodies:
565 |     Body:
566 |       description: Request body to pass to the target page
567 |       content:
568 |         application/json:
569 |           schema:
570 |             type: object
571 |             additionalProperties: true
572 |         application/x-www-form-urlencoded:
573 |           schema:
574 |             type: object
575 |             additionalProperties: true
576 |         application/xml:
577 |           schema:
578 |             type: object
579 |             additionalProperties: true
580 |         text/plain:
581 |           schema:
582 |             type: string
583 |   schemas:
584 |     Error:
585 |       title: Generic error
586 |       type: object
587 |       properties:
588 |         message:
589 |           type: string
590 |           description: Error description
591 |         status_code:
592 |           type: integer
593 |           description: Target page response HTTP status code (403, 500, etc)
594 |         status_message:
595 |           type: string
596 |           description: Target page response HTTP status message
597 |         body:
598 |           type: string
599 |           description: Target page response body
600 |     SelectedAreas:
601 |       title: HTML for selected page areas
602 |       type: array
603 |       description: Array of elements matched by selectors
604 |       items:
605 |         type: string
606 |     Account:
607 |       title: Account limits info
608 |       type: object
609 |       properties:
610 |         email:
611 |           type: string
612 |           description: Your account email
613 |         remaining_api_calls:
614 |           type: integer
615 |           description: Remaining API credits quota
616 |         resets_at:
617 |           type: integer
618 |           description: Next billing cycle start time (UNIX timestamp)
619 |         remaining_concurrency:
620 |           type: integer
621 |           description: Remaining concurrent requests
622 | 
```
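
The spec above puts the API key in the `api_key` query parameter and lets `headers` be sent as a JSON-encoded object. A minimal sketch of how a client might assemble such a request URL follows. `buildRequestUrl` is a hypothetical helper, not part of this repository, and the `/html` path in the usage note is an assumption, since endpoint paths are not shown in this part of the spec.

```javascript
// Hypothetical helper (not from the repository): builds a request URL from
// the query parameters described in components.parameters above.
const BASE_URL = 'https://api.webscraping.ai'; // from the `servers` entry

function buildRequestUrl(path, apiKey, params = {}) {
  const url = new URL(path, BASE_URL);

  // securitySchemes.api_key: type apiKey, in query, name api_key
  url.searchParams.set('api_key', apiKey);

  for (const [name, value] of Object.entries(params)) {
    // Skip unset parameters so the documented server-side defaults apply.
    if (value === undefined || value === null) continue;

    // Object values such as `headers` use the JSON-encoded form that the
    // parameter description allows; everything else is sent as a string.
    const encoded =
      typeof value === 'object' ? JSON.stringify(value) : String(value);
    url.searchParams.set(name, encoded);
  }

  return url.toString();
}
```

For example, `buildRequestUrl('/html', key, { url: 'https://example.com', js: false, headers: { Cookie: 'session=some_id' } })` would yield a URL whose query string carries `api_key`, `url`, `js=false` and a JSON-encoded `headers` value. `URLSearchParams` handles the percent-encoding.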