# Directory Structure ``` ├── .env.example ├── .eslintignore ├── .eslintrc.json ├── .github │ └── workflows │ └── ci.yml ├── .gitignore ├── .prettierrc ├── Dockerfile ├── jest.config.js ├── jest.setup.js ├── openapi.yml ├── package-lock.json ├── package.json ├── README.md ├── smithery.yaml └── src ├── index.js └── index.test.js ``` # Files -------------------------------------------------------------------------------- /.eslintignore: -------------------------------------------------------------------------------- ``` **/*.test.ts **/*.test.js node_modules dist jest.setup.ts jest.config.js ``` -------------------------------------------------------------------------------- /.prettierrc: -------------------------------------------------------------------------------- ``` { "semi": true, "singleQuote": true, "tabWidth": 2, "trailingComma": "es5", "printWidth": 80, "endOfLine": "auto" } ``` -------------------------------------------------------------------------------- /.env.example: -------------------------------------------------------------------------------- ``` # Required: Your WebScraping.AI API key WEBSCRAPING_AI_API_KEY=your_api_key_here # Optional: Maximum number of concurrent requests (default: 5) WEBSCRAPING_AI_CONCURRENCY_LIMIT=5 ``` -------------------------------------------------------------------------------- /.eslintrc.json: -------------------------------------------------------------------------------- ```json { "env": { "es2021": true, "node": true, "jest": true }, "extends": [ "eslint:recommended", "prettier" ], "parserOptions": { "ecmaVersion": "latest", "sourceType": "module" }, "rules": { "no-unused-vars": "warn", "no-console": "off" } } ``` -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- ``` # Dependency directories node_modules/ # Environment variables .env .env.local .env.development.local .env.test.local .env.production.local # Logs logs *.log npm-debug.log* yarn-debug.log* yarn-error.log* # Coverage directory used by tools like istanbul coverage/ # Editor directories and files .idea/ .vscode/ *.swp *.swo # OS specific .DS_Store Thumbs.db ``` -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- ```markdown # WebScraping.AI MCP Server A Model Context Protocol (MCP) server implementation that integrates with [WebScraping.AI](https://webscraping.ai) for web data extraction capabilities. ## Features - Question answering about web page content - Structured data extraction from web pages - HTML content retrieval with JavaScript rendering - Plain text extraction from web pages - CSS selector-based content extraction - Multiple proxy types (datacenter, residential) with country selection - JavaScript rendering using headless Chrome/Chromium - Concurrent request management with rate limiting - Custom JavaScript execution on target pages - Device emulation (desktop, mobile, tablet) - Account usage monitoring ## Installation ### Running with npx ```bash env WEBSCRAPING_AI_API_KEY=your_api_key npx -y webscraping-ai-mcp ``` ### Manual Installation ```bash # Clone the repository git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git cd webscraping-ai-mcp-server # Install dependencies npm install # Run npm start ``` ### Configuring in Cursor Note: Requires Cursor version 0.45.6+ The WebScraping.AI MCP server can be configured in two ways in Cursor: 1. **Project-specific Configuration** (recommended for team projects): Create a `.cursor/mcp.json` file in your project directory: ```json { "servers": { "webscraping-ai": { "type": "command", "command": "npx -y webscraping-ai-mcp", "env": { "WEBSCRAPING_AI_API_KEY": "your-api-key", "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "5" } } } } ``` 2. **Global Configuration** (for personal use across all projects): Create a `~/.cursor/mcp.json` file in your home directory with the same configuration format as above. > If you are using Windows and are running into issues, try using `cmd /c "set WEBSCRAPING_AI_API_KEY=your-api-key && npx -y webscraping-ai-mcp"` as the command. This configuration will make the WebScraping.AI tools available to Cursor's AI agent automatically when relevant for web scraping tasks. ### Running on Claude Desktop Add this to your `claude_desktop_config.json`: ```json { "mcpServers": { "mcp-server-webscraping-ai": { "command": "npx", "args": ["-y", "webscraping-ai-mcp"], "env": { "WEBSCRAPING_AI_API_KEY": "YOUR_API_KEY_HERE", "WEBSCRAPING_AI_CONCURRENCY_LIMIT": "5" } } } } ``` ## Configuration ### Environment Variables #### Required - `WEBSCRAPING_AI_API_KEY`: Your WebScraping.AI API key - Required for all operations - Get your API key from [WebScraping.AI](https://webscraping.ai) #### Optional Configuration - `WEBSCRAPING_AI_CONCURRENCY_LIMIT`: Maximum number of concurrent requests (default: `5`) - `WEBSCRAPING_AI_DEFAULT_PROXY_TYPE`: Type of proxy to use (default: `residential`) - `WEBSCRAPING_AI_DEFAULT_JS_RENDERING`: Enable/disable JavaScript rendering (default: `true`) - `WEBSCRAPING_AI_DEFAULT_TIMEOUT`: Maximum web page retrieval time in ms (default: `15000`, max: `30000`) - `WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT`: Maximum JavaScript rendering time in ms (default: `2000`) ### Configuration Examples For standard usage: ```bash # Required export WEBSCRAPING_AI_API_KEY=your-api-key # Optional - customize behavior (default values) export WEBSCRAPING_AI_CONCURRENCY_LIMIT=5 export WEBSCRAPING_AI_DEFAULT_PROXY_TYPE=residential # datacenter or residential export WEBSCRAPING_AI_DEFAULT_JS_RENDERING=true export WEBSCRAPING_AI_DEFAULT_TIMEOUT=15000 export WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT=2000 ``` ## Available Tools ### 1. Question Tool (`webscraping_ai_question`) Ask questions about web page content. ```json { "name": "webscraping_ai_question", "arguments": { "url": "https://example.com", "question": "What is the main topic of this page?", "timeout": 30000, "js": true, "js_timeout": 2000, "wait_for": ".content-loaded", "proxy": "datacenter", "country": "us" } } ``` Example response: ```json { "content": [ { "type": "text", "text": "The main topic of this page is examples and documentation for HTML and web standards." } ], "isError": false } ``` ### 2. Fields Tool (`webscraping_ai_fields`) Extract structured data from web pages based on instructions. ```json { "name": "webscraping_ai_fields", "arguments": { "url": "https://example.com/product", "fields": { "title": "Extract the product title", "price": "Extract the product price", "description": "Extract the product description" }, "js": true, "timeout": 30000 } } ``` Example response: ```json { "content": [ { "type": "text", "text": { "title": "Example Product", "price": "$99.99", "description": "This is an example product description." } } ], "isError": false } ``` ### 3. HTML Tool (`webscraping_ai_html`) Get the full HTML of a web page with JavaScript rendering. ```json { "name": "webscraping_ai_html", "arguments": { "url": "https://example.com", "js": true, "timeout": 30000, "wait_for": "#content-loaded" } } ``` Example response: ```json { "content": [ { "type": "text", "text": "<html>...[full HTML content]...</html>" } ], "isError": false } ``` ### 4. Text Tool (`webscraping_ai_text`) Extract the visible text content from a web page. ```json { "name": "webscraping_ai_text", "arguments": { "url": "https://example.com", "js": true, "timeout": 30000 } } ``` Example response: ```json { "content": [ { "type": "text", "text": "Example Domain\nThis domain is for use in illustrative examples in documents..." } ], "isError": false } ``` ### 5. Selected Tool (`webscraping_ai_selected`) Extract content from a specific element using a CSS selector. ```json { "name": "webscraping_ai_selected", "arguments": { "url": "https://example.com", "selector": "div.main-content", "js": true, "timeout": 30000 } } ``` Example response: ```json { "content": [ { "type": "text", "text": "<div class=\"main-content\">This is the main content of the page.</div>" } ], "isError": false } ``` ### 6. Selected Multiple Tool (`webscraping_ai_selected_multiple`) Extract content from multiple elements using CSS selectors. ```json { "name": "webscraping_ai_selected_multiple", "arguments": { "url": "https://example.com", "selectors": ["div.header", "div.product-list", "div.footer"], "js": true, "timeout": 30000 } } ``` Example response: ```json { "content": [ { "type": "text", "text": [ "<div class=\"header\">Header content</div>", "<div class=\"product-list\">Product list content</div>", "<div class=\"footer\">Footer content</div>" ] } ], "isError": false } ``` ### 7. Account Tool (`webscraping_ai_account`) Get information about your WebScraping.AI account. ```json { "name": "webscraping_ai_account", "arguments": {} } ``` Example response: ```json { "content": [ { "type": "text", "text": { "requests": 5000, "remaining": 4500, "limit": 10000, "resets_at": "2023-12-31T23:59:59Z" } } ], "isError": false } ``` ## Common Options for All Tools The following options can be used with all scraping tools: - `timeout`: Maximum web page retrieval time in ms (15000 by default, maximum is 30000) - `js`: Execute on-page JavaScript using a headless browser (true by default) - `js_timeout`: Maximum JavaScript rendering time in ms (2000 by default) - `wait_for`: CSS selector to wait for before returning the page content - `proxy`: Type of proxy, datacenter or residential (residential by default) - `country`: Country of the proxy to use (US by default). Supported countries: us, gb, de, it, fr, ca, es, ru, jp, kr, in - `custom_proxy`: Your own proxy URL in "http://user:password@host:port" format - `device`: Type of device emulation. Supported values: desktop, mobile, tablet - `error_on_404`: Return error on 404 HTTP status on the target page (false by default) - `error_on_redirect`: Return error on redirect on the target page (false by default) - `js_script`: Custom JavaScript code to execute on the target page ## Error Handling The server provides robust error handling: - Automatic retries for transient errors - Rate limit handling with backoff - Detailed error messages - Network resilience Example error response: ```json { "content": [ { "type": "text", "text": "API Error: 429 Too Many Requests" } ], "isError": true } ``` ## Integration with LLMs This server implements the [Model Context Protocol](https://github.com/facebookresearch/modelcontextprotocol), making it compatible with any MCP-enabled LLM platforms. You can configure your LLM to use these tools for web scraping tasks. ### Example: Configuring Claude with MCP ```javascript const { Claude } = require('@anthropic-ai/sdk'); const { Client } = require('@modelcontextprotocol/sdk/client/index.js'); const { StdioClientTransport } = require('@modelcontextprotocol/sdk/client/stdio.js'); const claude = new Claude({ apiKey: process.env.ANTHROPIC_API_KEY }); const transport = new StdioClientTransport({ command: 'npx', args: ['-y', 'webscraping-ai-mcp'], env: { WEBSCRAPING_AI_API_KEY: 'your-api-key' } }); const client = new Client({ name: 'claude-client', version: '1.0.0' }); await client.connect(transport); // Now you can use Claude with WebScraping.AI tools const tools = await client.listTools(); const response = await claude.complete({ prompt: 'What is the main topic of example.com?', tools: tools }); ``` ## Development ```bash # Clone the repository git clone https://github.com/webscraping-ai/webscraping-ai-mcp-server.git cd webscraping-ai-mcp-server # Install dependencies npm install # Run tests npm test # Add your .env file cp .env.example .env # Start the inspector npx @modelcontextprotocol/inspector node src/index.js ``` ### Contributing 1. Fork the repository 2. Create your feature branch 3. Run tests: `npm test` 4. Submit a pull request ## License MIT License - see LICENSE file for details ``` -------------------------------------------------------------------------------- /jest.config.js: -------------------------------------------------------------------------------- ```javascript /** @type {import('ts-jest').JestConfigWithTsJest} */ export default { testEnvironment: 'node', transform: {}, moduleNameMapper: { '^(\\.{1,2}/.*)\\.js$': '$1', }, setupFilesAfterEnv: ['./jest.setup.js'], testMatch: ['**/*.test.js'], }; ``` -------------------------------------------------------------------------------- /jest.setup.js: -------------------------------------------------------------------------------- ```javascript import { jest } from '@jest/globals'; // Mock console methods to suppress output during tests global.console = { ...console, log: jest.fn(), debug: jest.fn(), info: jest.fn(), warn: jest.fn(), error: jest.fn(), }; // Add any additional global test setup here ``` -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- ```dockerfile FROM node:18-alpine WORKDIR /app # Copy package.json and package-lock.json COPY package*.json ./ # Install only production dependencies RUN npm ci --only=production # Copy source files COPY . . # Set environment variables ENV NODE_ENV=production # Command to run the application ENTRYPOINT ["node", "src/index.js"] # Set default arguments CMD [] # Document that the service uses stdin/stdout for communication LABEL org.opencontainers.image.description="WebScraping.AI MCP Server - Model Context Protocol server for WebScraping.AI API" LABEL org.opencontainers.image.source="https://github.com/webscraping-ai/webscraping-ai-mcp-server" LABEL org.opencontainers.image.licenses="MIT" ``` -------------------------------------------------------------------------------- /.github/workflows/ci.yml: -------------------------------------------------------------------------------- ```yaml name: CI on: push: branches: [master] pull_request: branches: [master] jobs: test: runs-on: ubuntu-latest strategy: matrix: node-version: [18.x, 20.x] steps: - uses: actions/checkout@v3 - name: Use Node.js ${{ matrix.node-version }} uses: actions/setup-node@v3 with: node-version: ${{ matrix.node-version }} cache: 'npm' - name: Install dependencies run: npm ci - name: Lint run: npm run lint - name: Test run: npm test env: WEBSCRAPING_AI_API_KEY: ${{ secrets.WEBSCRAPING_AI_API_KEY || 'test-api-key' }} ``` -------------------------------------------------------------------------------- /smithery.yaml: -------------------------------------------------------------------------------- ```yaml # Smithery configuration file: https://smithery.ai/docs/config#smitheryyaml startCommand: type: stdio configSchema: # JSON Schema defining the configuration options for the MCP. type: object required: - webscrapingAiApiKey properties: webscrapingAiApiKey: type: string description: Your WebScraping.AI API key. Required for API usage. webscrapingAiApiUrl: type: string description: Custom API endpoint. Default is https://api.webscraping.ai. webscrapingAiConcurrencyLimit: type: integer description: Maximum concurrent requests allowed (default 5). commandFunction: # A function that produces the CLI command to start the MCP on stdio. |- (config) => ({ command: 'node', args: ['src/index.js'], env: { WEBSCRAPING_AI_API_KEY: config.webscrapingAiApiKey, WEBSCRAPING_AI_CONCURRENCY_LIMIT: String(config.webscrapingAiConcurrencyLimit || 5) } }) ``` -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- ```json { "name": "webscraping-ai-mcp", "version": "1.0.2", "description": "Model Context Protocol server for WebScraping.AI API. Provides LLM-powered web scraping tools with Chromium JavaScript rendering, rotating proxies, and HTML parsing.", "type": "module", "bin": { "webscraping-ai-mcp": "src/index.js" }, "files": [ "src" ], "scripts": { "test": "node --experimental-vm-modules node_modules/jest/bin/jest.js", "start": "node src/index.js", "lint": "eslint src/**/*.js", "lint:fix": "eslint src/**/*.js --fix", "format": "prettier --write ." }, "license": "MIT", "dependencies": { "@modelcontextprotocol/sdk": "^1.4.1", "axios": "^1.6.7", "dotenv": "^16.4.7", "p-queue": "^8.0.1" }, "devDependencies": { "@jest/globals": "^29.7.0", "eslint": "^8.56.0", "eslint-config-prettier": "^9.1.0", "jest": "^29.7.0", "jest-mock-extended": "^4.0.0-beta1", "prettier": "^3.1.1" }, "engines": { "node": ">=18.0.0" }, "keywords": [ "mcp", "webscraping", "web-scraping", "crawler", "content-extraction", "llm" ], "main": "src/index.js", "repository": { "type": "git", "url": "git+https://github.com/webscraping-ai/webscraping-ai-mcp-server.git" }, "author": "WebScraping.AI", "bugs": { "url": "https://github.com/webscraping-ai/webscraping-ai-mcp-server/issues" }, "homepage": "https://github.com/webscraping-ai/webscraping-ai-mcp-server#readme" } ``` -------------------------------------------------------------------------------- /src/index.js: -------------------------------------------------------------------------------- ```javascript #!/usr/bin/env node import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'; import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'; import { z } from 'zod'; import axios from 'axios'; import dotenv from 'dotenv'; import PQueue from 'p-queue'; dotenv.config(); // Environment variables const WEBSCRAPING_AI_API_KEY = process.env.WEBSCRAPING_AI_API_KEY || ''; const WEBSCRAPING_AI_API_URL = 'https://api.webscraping.ai'; const CONCURRENCY_LIMIT = Number(process.env.WEBSCRAPING_AI_CONCURRENCY_LIMIT || 5); const DEFAULT_PROXY_TYPE = process.env.WEBSCRAPING_AI_DEFAULT_PROXY_TYPE || 'residential'; const DEFAULT_JS_RENDERING = process.env.WEBSCRAPING_AI_DEFAULT_JS_RENDERING !== 'false'; const DEFAULT_TIMEOUT = Number(process.env.WEBSCRAPING_AI_DEFAULT_TIMEOUT || 15000); const DEFAULT_JS_TIMEOUT = Number(process.env.WEBSCRAPING_AI_DEFAULT_JS_TIMEOUT || 2000); // Validate required environment variables if (!WEBSCRAPING_AI_API_KEY) { console.error('WEBSCRAPING_AI_API_KEY environment variable is required'); process.exit(1); } class WebScrapingAIClient { constructor(options = {}) { const apiKey = options.apiKey || WEBSCRAPING_AI_API_KEY; const baseUrl = options.baseUrl || WEBSCRAPING_AI_API_URL; const timeout = options.timeout || 60000; const concurrency = options.concurrency || CONCURRENCY_LIMIT; if (!apiKey) { throw new Error('WebScraping.AI API key is required'); } this.client = axios.create({ baseURL: baseUrl, timeout: timeout, headers: { 'Content-Type': 'application/json', 'Accept': 'application/json', } }); this.queue = new PQueue({ concurrency }); this.apiKey = apiKey; } async request(endpoint, params) { try { return await this.queue.add(async () => { const response = await this.client.get(endpoint, { params: { ...params, api_key: this.apiKey, from_mcp_server: true } }); return response.data; }); } catch (error) { const errorResponse = { message: 'API Error', status_code: error.response?.status, status_message: error.response?.statusText, body: error.response?.data }; throw new Error(JSON.stringify(errorResponse)); } } async question(url, question, options = {}) { return this.request('/ai/question', { url, question, ...options }); } async fields(url, fields, options = {}) { return this.request('/ai/fields', { url, fields: JSON.stringify(fields), ...options }); } async html(url, options = {}) { return this.request('/html', { url, ...options }); } async text(url, options = {}) { return this.request('/text', { url, ...options }); } async selected(url, selector, options = {}) { return this.request('/selected', { url, selector, ...options }); } async selectedMultiple(url, selectors, options = {}) { return this.request('/selected-multiple', { url, selectors, ...options }); } async account() { return this.request('/account', {}); } } // Create WebScrapingAI client const client = new WebScrapingAIClient(); // Create MCP server const server = new McpServer({ name: 'WebScraping.AI MCP Server', version: '1.0.2' }); // Common options schema for all tools const commonOptionsSchema = { timeout: z.number().optional().default(DEFAULT_TIMEOUT).describe(`Maximum web page retrieval time in ms (${DEFAULT_TIMEOUT} by default, maximum is 30000).`), js: z.boolean().optional().default(DEFAULT_JS_RENDERING).describe(`Execute on-page JavaScript using a headless browser (${DEFAULT_JS_RENDERING} by default).`), js_timeout: z.number().optional().default(DEFAULT_JS_TIMEOUT).describe(`Maximum JavaScript rendering time in ms (${DEFAULT_JS_TIMEOUT} by default).`), wait_for: z.string().optional().describe('CSS selector to wait for before returning the page content.'), proxy: z.enum(['datacenter', 'residential']).optional().default(DEFAULT_PROXY_TYPE).describe(`Type of proxy, datacenter or residential (${DEFAULT_PROXY_TYPE} by default).`), country: z.enum(['us', 'gb', 'de', 'it', 'fr', 'ca', 'es', 'ru', 'jp', 'kr', 'in']).optional().describe('Country of the proxy to use (US by default).'), custom_proxy: z.string().optional().describe('Your own proxy URL in "http://user:password@host:port" format.'), device: z.enum(['desktop', 'mobile', 'tablet']).optional().describe('Type of device emulation.'), error_on_404: z.boolean().optional().describe('Return error on 404 HTTP status on the target page (false by default).'), error_on_redirect: z.boolean().optional().describe('Return error on redirect on the target page (false by default).'), js_script: z.string().optional().describe('Custom JavaScript code to execute on the target page.') }; // Define and register tools server.tool( 'webscraping_ai_question', { url: z.string().describe('URL of the target page.'), question: z.string().describe('Question or instructions to ask the LLM model about the target page.'), ...commonOptionsSchema }, async ({ url, question, ...options }) => { try { const result = await client.question(url, question, options); return { content: [{ type: 'text', text: result }] }; } catch (error) { return { content: [{ type: 'text', text: error.message }], isError: true }; } } ); server.tool( 'webscraping_ai_fields', { url: z.string().describe('URL of the target page.'), fields: z.record(z.string()).describe('Dictionary of field names with instructions for extraction.'), ...commonOptionsSchema }, async ({ url, fields, ...options }) => { try { const result = await client.fields(url, fields, options); return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }] }; } catch (error) { return { content: [{ type: 'text', text: error.message }], isError: true }; } } ); server.tool( 'webscraping_ai_html', { url: z.string().describe('URL of the target page.'), return_script_result: z.boolean().optional().describe('Return result of the custom JavaScript code execution.'), format: z.enum(['json', 'text']).optional().describe('Response format (json or text).'), ...commonOptionsSchema }, async ({ url, return_script_result, format, ...options }) => { try { const result = await client.html(url, { ...options, return_script_result }); if (format === 'json') { return { content: [{ type: 'text', text: JSON.stringify({ html: result }) }] }; } return { content: [{ type: 'text', text: result }] }; } catch (error) { const errorObj = JSON.parse(error.message); return { content: [{ type: 'text', text: JSON.stringify(errorObj) }], isError: true }; } } ); server.tool( 'webscraping_ai_text', { url: z.string().describe('URL of the target page.'), text_format: z.enum(['plain', 'xml', 'json']).optional().default('json').describe('Format of the text response.'), return_links: z.boolean().optional().describe('Return links from the page body text.'), ...commonOptionsSchema }, async ({ url, text_format, return_links, ...options }) => { try { const result = await client.text(url, { ...options, text_format, return_links }); return { content: [{ type: 'text', text: typeof result === 'object' ? JSON.stringify(result) : result }] }; } catch (error) { const errorObj = JSON.parse(error.message); return { content: [{ type: 'text', text: JSON.stringify(errorObj) }], isError: true }; } } ); server.tool( 'webscraping_ai_selected', { url: z.string().describe('URL of the target page.'), selector: z.string().describe('CSS selector to extract content for.'), format: z.enum(['json', 'text']).optional().default('json').describe('Response format (json or text).'), ...commonOptionsSchema }, async ({ url, selector, format, ...options }) => { try { const result = await client.selected(url, selector, options); if (format === 'json') { return { content: [{ type: 'text', text: JSON.stringify({ html: result }) }] }; } return { content: [{ type: 'text', text: result }] }; } catch (error) { const errorObj = JSON.parse(error.message); return { content: [{ type: 'text', text: JSON.stringify(errorObj) }], isError: true }; } } ); server.tool( 'webscraping_ai_selected_multiple', { url: z.string().describe('URL of the target page.'), selectors: z.array(z.string()).describe('Array of CSS selectors to extract content for.'), ...commonOptionsSchema }, async ({ url, selectors, ...options }) => { try { const result = await client.selectedMultiple(url, selectors, options); return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }] }; } catch (error) { return { content: [{ type: 'text', text: error.message }], isError: true }; } } ); server.tool( 'webscraping_ai_account', {}, async () => { try { const result = await client.account(); return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }] }; } catch (error) { return { content: [{ type: 'text', text: error.message }], isError: true }; } } ); const transport = new StdioServerTransport(); server.connect(transport).then(() => { }).catch(err => { console.error('Failed to connect to transport:', err); process.exit(1); }); ``` -------------------------------------------------------------------------------- /src/index.test.js: -------------------------------------------------------------------------------- ```javascript import { describe, expect, jest, test, beforeEach, afterEach, } from '@jest/globals'; import { Client } from "@modelcontextprotocol/sdk/client/index.js"; import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js"; // Create mock WebScrapingAIClient class MockWebScrapingAIClient { constructor() { this.question = jest.fn().mockResolvedValue('This is the answer to your question.'); this.fields = jest.fn().mockResolvedValue({ field1: 'value1', field2: 'value2' }); this.html = jest.fn().mockResolvedValue('<html><body>Test HTML Content</body></html>'); this.text = jest.fn().mockResolvedValue('Test text content'); this.selected = jest.fn().mockResolvedValue('<div>Selected Element</div>'); this.selectedMultiple = jest.fn().mockResolvedValue(['<div>Element 1</div>', '<div>Element 2</div>']); this.account = jest.fn().mockResolvedValue({ requests: 100, remaining: 900, limit: 1000 }); } } // Test interfaces class RequestContext { constructor(toolName, args) { this.params = { name: toolName, arguments: args }; } } describe('WebScraping.AI MCP Server Tests', () => { let mockClient; let requestHandler; beforeEach(() => { jest.clearAllMocks(); mockClient = new MockWebScrapingAIClient(); // Create request handler function requestHandler = async (request) => { const { name: toolName, arguments: args } = request.params; if (!args && toolName !== 'webscraping_ai_account') { throw new Error('No arguments provided'); } return handleRequest(toolName, args || {}, mockClient); }; }); afterEach(() => { jest.clearAllMocks(); }); // Test question functionality test('should handle question request', async () => { const url = 'https://example.com'; const question = 'What is on this page?'; const response = await requestHandler( new RequestContext('webscraping_ai_question', { url, question }) ); expect(response).toEqual({ content: [{ type: 'text', text: 'This is the answer to your question.' }], isError: false }); expect(mockClient.question).toHaveBeenCalledWith(url, question, {}); }); // Test fields functionality test('should handle fields request', async () => { const url = 'https://example.com'; const fields = { title: 'Extract the title', price: 'Extract the price' }; const response = await requestHandler( new RequestContext('webscraping_ai_fields', { url, fields }) ); expect(response).toEqual({ content: [{ type: 'text', text: JSON.stringify({ field1: 'value1', field2: 'value2' }, null, 2) }], isError: false }); expect(mockClient.fields).toHaveBeenCalledWith(url, fields, {}); }); // Test html functionality test('should handle html request', async () => { const url = 'https://example.com'; const response = await requestHandler( new RequestContext('webscraping_ai_html', { url }) ); expect(response).toEqual({ content: [{ type: 'text', text: '<html><body>Test HTML Content</body></html>' }], isError: false }); expect(mockClient.html).toHaveBeenCalledWith(url, {}); }); // Test text functionality test('should handle text request', async () => { const url = 'https://example.com'; const response = await requestHandler( new RequestContext('webscraping_ai_text', { url }) ); expect(response).toEqual({ content: [{ type: 'text', text: 'Test text content' }], isError: false }); expect(mockClient.text).toHaveBeenCalledWith(url, {}); }); // Test selected functionality test('should handle selected request', async () => { const url = 'https://example.com'; const selector = '.main-content'; const response = await requestHandler( new RequestContext('webscraping_ai_selected', { url, selector }) ); expect(response).toEqual({ content: [{ type: 'text', text: '<div>Selected Element</div>' }], isError: false }); expect(mockClient.selected).toHaveBeenCalledWith(url, selector, {}); }); // Test selected_multiple functionality test('should handle selected_multiple request', async () => { const url = 'https://example.com'; const selectors = ['.item1', '.item2']; const response = await requestHandler( new RequestContext('webscraping_ai_selected_multiple', { url, selectors }) ); expect(response).toEqual({ content: [{ type: 'text', text: JSON.stringify(['<div>Element 1</div>', '<div>Element 2</div>'], null, 2) }], isError: false }); expect(mockClient.selectedMultiple).toHaveBeenCalledWith(url, selectors, {}); }); // Test account functionality test('should handle account request', async () => { const response = await requestHandler( new RequestContext('webscraping_ai_account', {}) ); expect(response).toEqual({ content: [{ type: 'text', text: JSON.stringify({ requests: 100, remaining: 900, limit: 1000 }, null, 2) }], isError: false }); expect(mockClient.account).toHaveBeenCalled(); }); // Test error handling test('should handle API errors', async () => { const url = 'https://example.com'; mockClient.question.mockRejectedValueOnce(new Error('API Error')); const response = await requestHandler( new RequestContext('webscraping_ai_question', { url, question: 'What is on this page?' }) ); expect(response.isError).toBe(true); expect(response.content[0].text).toContain('API Error'); }); // Test unknown tool test('should handle unknown tool request', async () => { const response = await requestHandler( new RequestContext('unknown_tool', { some: 'args' }) ); expect(response.isError).toBe(true); expect(response.content[0].text).toContain('Unknown tool'); }); // Test MCP Client Connection xtest('should connect to MCP server and list tools', async () => { const transport = new StdioClientTransport({ command: "node", args: ["src/index.js"] }); const client = new Client({ name: "webscraping-ai-test-client", version: "1.0.0" }); await client.connect(transport); const response = await client.listTools(); expect(response.tools).toEqual(expect.arrayContaining([ expect.objectContaining({ name: 'webscraping_ai_question', inputSchema: expect.any(Object) }), expect.objectContaining({ name: 'webscraping_ai_fields', inputSchema: expect.any(Object) }), expect.objectContaining({ name: 'webscraping_ai_html', inputSchema: expect.any(Object) }), expect.objectContaining({ name: 'webscraping_ai_text', inputSchema: expect.any(Object) }), expect.objectContaining({ name: 'webscraping_ai_selected', inputSchema: expect.any(Object) }), expect.objectContaining({ name: 'webscraping_ai_selected_multiple', inputSchema: expect.any(Object) }), expect.objectContaining({ name: 'webscraping_ai_account', inputSchema: expect.any(Object) }) ])); await client.close(); }); }); // Helper function to simulate request handling async function handleRequest(name, args, client) { try { const options = { ...args }; // Remove required parameters from options for each tool type switch (name) { case 'webscraping_ai_question': { const { url, question, ...rest } = options; if (!url || !question) { throw new Error('URL and question are required'); } const result = await client.question(url, question, rest); return { content: [{ type: 'text', text: result }], isError: false }; } case 'webscraping_ai_fields': { const { url, fields, ...rest } = options; if (!url || !fields) { throw new Error('URL and fields are required'); } const result = await client.fields(url, fields, rest); return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }], isError: false }; } case 'webscraping_ai_html': { const { url, ...rest } = options; if (!url) { throw new Error('URL is required'); } const result = await client.html(url, rest); return { content: [{ type: 'text', text: result }], isError: false }; } case 'webscraping_ai_text': { const { url, ...rest } = options; if (!url) { throw new Error('URL is required'); } const result = await client.text(url, rest); return { content: [{ type: 'text', text: result }], isError: false }; } case 'webscraping_ai_selected': { const { url, selector, ...rest } = options; if (!url || !selector) { throw new Error('URL and selector are required'); } const result = await client.selected(url, selector, rest); return { content: [{ type: 'text', text: result }], isError: false }; } case 'webscraping_ai_selected_multiple': { const { url, selectors, ...rest } = options; if (!url || !selectors) { throw new Error('URL and selectors are required'); } const result = await client.selectedMultiple(url, selectors, rest); return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }], isError: false }; } case 'webscraping_ai_account': { const result = await client.account(); return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }], isError: false }; } default: throw new Error(`Unknown tool: ${name}`); } } catch (error) { return { content: [{ type: 'text', text: error.message }], isError: true }; } } ``` -------------------------------------------------------------------------------- /openapi.yml: -------------------------------------------------------------------------------- ```yaml openapi: 3.1.0 info: title: WebScraping.AI contact: name: WebScraping.AI Support url: https://webscraping.ai email: [email protected] version: 3.2.0 description: WebScraping.AI scraping API provides LLM-powered tools with Chromium JavaScript rendering, rotating proxies, and built-in HTML parsing. tags: - name: AI description: Analyze web pages using LLMs - name: HTML description: Get full HTML content of pages using proxies and Chromium JS rendering - name: Text description: Get visible text of pages using proxies and Chromium JS rendering - name: Selected HTML description: Get HTML content of selected page areas (like price, search results, page title, etc.) - name: Account description: Information about your account API credits quota paths: /ai/question: get: summary: Get an answer to a question about a given web page description: Returns the answer in plain text. Proxies and Chromium JavaScript rendering are used for page retrieval and processing, then the answer is extracted using an LLM model. operationId: getQuestion tags: [ "AI" ] parameters: - $ref: '#/components/parameters/url' - $ref: '#/components/parameters/question' - $ref: '#/components/parameters/headers' - $ref: '#/components/parameters/timeout' - $ref: '#/components/parameters/js' - $ref: '#/components/parameters/js_timeout' - $ref: '#/components/parameters/wait_for' - $ref: '#/components/parameters/proxy' - $ref: '#/components/parameters/country' - $ref: '#/components/parameters/custom_proxy' - $ref: '#/components/parameters/device' - $ref: '#/components/parameters/error_on_404' - $ref: '#/components/parameters/error_on_redirect' - $ref: '#/components/parameters/js_script' - $ref: '#/components/parameters/format' responses: 400: $ref: '#/components/responses/400' 402: $ref: '#/components/responses/402' 403: $ref: '#/components/responses/403' 429: $ref: '#/components/responses/429' 500: $ref: '#/components/responses/500' 504: $ref: '#/components/responses/504' 200: description: Success content: text/html: schema: type: string example: "Some answer" /ai/fields: get: summary: Extract structured data fields from a web page description: Returns structured data fields extracted from the webpage using an LLM model. Proxies and Chromium JavaScript rendering are used for page retrieval and processing. operationId: getFields tags: [ "AI" ] parameters: - $ref: '#/components/parameters/url' - in: query name: fields description: Object describing fields to extract from the page and their descriptions required: true example: {"title":"Main product title","price":"Current product price","description":"Full product description"} schema: type: object additionalProperties: type: string style: deepObject explode: true - $ref: '#/components/parameters/headers' - $ref: '#/components/parameters/timeout' - $ref: '#/components/parameters/js' - $ref: '#/components/parameters/js_timeout' - $ref: '#/components/parameters/wait_for' - $ref: '#/components/parameters/proxy' - $ref: '#/components/parameters/country' - $ref: '#/components/parameters/custom_proxy' - $ref: '#/components/parameters/device' - $ref: '#/components/parameters/error_on_404' - $ref: '#/components/parameters/error_on_redirect' - $ref: '#/components/parameters/js_script' responses: 400: $ref: '#/components/responses/400' 402: $ref: '#/components/responses/402' 403: $ref: '#/components/responses/403' 429: $ref: '#/components/responses/429' 500: $ref: '#/components/responses/500' 504: $ref: '#/components/responses/504' 200: description: Success content: application/json: schema: type: object additionalProperties: type: string example: title: "Example Product" price: "$99.99" description: "This is a sample product description" /html: get: summary: Page HTML by URL description: Returns the full HTML content of a webpage specified by the URL. The response is in plain text. Proxies and Chromium JavaScript rendering are used for page retrieval and processing. operationId: getHTML tags: ["HTML"] parameters: - $ref: '#/components/parameters/url' - $ref: '#/components/parameters/headers' - $ref: '#/components/parameters/timeout' - $ref: '#/components/parameters/js' - $ref: '#/components/parameters/js_timeout' - $ref: '#/components/parameters/wait_for' - $ref: '#/components/parameters/proxy' - $ref: '#/components/parameters/country' - $ref: '#/components/parameters/custom_proxy' - $ref: '#/components/parameters/device' - $ref: '#/components/parameters/error_on_404' - $ref: '#/components/parameters/error_on_redirect' - $ref: '#/components/parameters/js_script' - $ref: '#/components/parameters/return_script_result' - $ref: '#/components/parameters/format' responses: 400: $ref: '#/components/responses/400' 402: $ref: '#/components/responses/402' 403: $ref: '#/components/responses/403' 429: $ref: '#/components/responses/429' 500: $ref: '#/components/responses/500' 504: $ref: '#/components/responses/504' 200: description: Success content: text/html: schema: type: string example: "<html><head>\n <title>Example Domain</title>\n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n</body></html>" /text: get: summary: Page text by URL description: Returns the visible text content of a webpage specified by the URL. Can be used to feed data to LLM models. The response can be in plain text, JSON, or XML format based on the text_format parameter. Proxies and Chromium JavaScript rendering are used for page retrieval and processing. Returns JSON on error. operationId: getText tags: [ "Text" ] parameters: - $ref: '#/components/parameters/text_format' - $ref: '#/components/parameters/return_links' - $ref: '#/components/parameters/url' - $ref: '#/components/parameters/headers' - $ref: '#/components/parameters/timeout' - $ref: '#/components/parameters/js' - $ref: '#/components/parameters/js_timeout' - $ref: '#/components/parameters/wait_for' - $ref: '#/components/parameters/proxy' - $ref: '#/components/parameters/country' - $ref: '#/components/parameters/custom_proxy' - $ref: '#/components/parameters/device' - $ref: '#/components/parameters/error_on_404' - $ref: '#/components/parameters/error_on_redirect' - $ref: '#/components/parameters/js_script' responses: 400: $ref: '#/components/responses/400' 402: $ref: '#/components/responses/402' 403: $ref: '#/components/responses/403' 429: $ref: '#/components/responses/429' 500: $ref: '#/components/responses/500' 504: $ref: '#/components/responses/504' 200: description: Success content: text/html: schema: type: string example: "Some content" text/xml: schema: type: string example: "<title>Some title</title>\n<description>Some description</description>\n<content>Some content</content>" application/json: schema: type: string example: '{"title":"Some title","description":"Some description","content":"Some content"}' /selected: get: summary: HTML of a selected page area by URL and CSS selector description: Returns HTML of a selected page area by URL and CSS selector. Useful if you don't want to do the HTML parsing on your side. operationId: getSelected tags: ["Selected HTML"] parameters: - in: query name: selector description: CSS selector (null by default, returns whole page HTML) example: "h1" schema: type: string - $ref: '#/components/parameters/url' - $ref: '#/components/parameters/headers' - $ref: '#/components/parameters/timeout' - $ref: '#/components/parameters/js' - $ref: '#/components/parameters/js_timeout' - $ref: '#/components/parameters/wait_for' - $ref: '#/components/parameters/proxy' - $ref: '#/components/parameters/country' - $ref: '#/components/parameters/custom_proxy' - $ref: '#/components/parameters/device' - $ref: '#/components/parameters/error_on_404' - $ref: '#/components/parameters/error_on_redirect' - $ref: '#/components/parameters/js_script' - $ref: '#/components/parameters/format' responses: 400: $ref: '#/components/responses/400' 402: $ref: '#/components/responses/402' 403: $ref: '#/components/responses/403' 429: $ref: '#/components/responses/429' 500: $ref: '#/components/responses/500' 504: $ref: '#/components/responses/504' 200: description: Success content: text/html: schema: type: string example: "<a href=\"https://www.iana.org/domains/example\">More information...</a>" /selected-multiple: get: summary: HTML of multiple page areas by URL and CSS selectors description: Returns HTML of multiple page areas by URL and CSS selectors. Useful if you don't want to do the HTML parsing on your side. operationId: getSelectedMultiple tags: ["Selected HTML"] parameters: - in: query name: selectors description: Multiple CSS selectors (null by default, returns whole page HTML) example: ["h1"] schema: type: array items: type: string style: form explode: true - $ref: '#/components/parameters/url' - $ref: '#/components/parameters/headers' - $ref: '#/components/parameters/timeout' - $ref: '#/components/parameters/js' - $ref: '#/components/parameters/js_timeout' - $ref: '#/components/parameters/wait_for' - $ref: '#/components/parameters/proxy' - $ref: '#/components/parameters/country' - $ref: '#/components/parameters/custom_proxy' - $ref: '#/components/parameters/device' - $ref: '#/components/parameters/error_on_404' - $ref: '#/components/parameters/error_on_redirect' - $ref: '#/components/parameters/js_script' responses: 400: $ref: '#/components/responses/400' 402: $ref: '#/components/responses/402' 403: $ref: '#/components/responses/403' 429: $ref: '#/components/responses/429' 500: $ref: '#/components/responses/500' 504: $ref: '#/components/responses/504' 200: description: Success content: application/json: schema: $ref: "#/components/schemas/SelectedAreas" example: "[\"<a href='/test'>some link</a>\", \"Hello\"]" /account: get: summary: Information about your account calls quota description: Returns information about your account, including the remaining API credits quota, the next billing cycle start time, and the remaining concurrent requests. The response is in JSON format. operationId: account tags: [ "Account" ] responses: 403: $ref: '#/components/responses/403' 200: description: Success content: application/json: schema: $ref: "#/components/schemas/Account" example: remaining_api_calls: 200000 resets_at: 1617073667 remaining_concurrency: 100 security: - api_key: [] servers: - url: https://api.webscraping.ai components: securitySchemes: api_key: type: apiKey name: api_key in: query responses: 400: description: Parameters validation error content: application/json: schema: $ref: "#/components/schemas/Error" example: { "message": "Invalid CSS selector" } 402: description: Billing issue, probably you've ran out of credits content: application/json: schema: $ref: "#/components/schemas/Error" example: { message: "Some error" } 403: description: Wrong API key content: application/json: schema: $ref: "#/components/schemas/Error" example: { message: "Some error" } 429: description: Too many concurrent requests content: application/json: schema: $ref: "#/components/schemas/Error" example: { message: "Some error" } 500: description: Non-2xx and non-404 HTTP status code on the target page or unexpected error, try again or contact [email protected] content: application/json: schema: $ref: "#/components/schemas/Error" example: { "message": "Unexpected HTTP code on the target page", "status_code": 500, "status_message": "Some website error", } 504: description: Timeout error, try increasing timeout parameter value content: application/json: schema: $ref: "#/components/schemas/Error" example: { message: "Some error" } parameters: ## Shared everywhere url: in: query name: url description: URL of the target page. required: true example: "https://example.com" schema: type: string postUrl: in: query name: url description: URL of the target page. required: true example: "https://httpbin.org/post" schema: type: string headers: in: query name: headers description: "HTTP headers to pass to the target page. Can be specified either via a nested query parameter (...&headers[One]=value1&headers=[Another]=value2) or as a JSON encoded object (...&headers={\"One\": \"value1\", \"Another\": \"value2\"})." example: '{"Cookie":"session=some_id"}' schema: type: object additionalProperties: type: string style: deepObject explode: true timeout: in: query name: timeout description: Maximum web page retrieval time in ms. Increase it in case of timeout errors (10000 by default, maximum is 30000). example: 10000 schema: type: integer default: 10000 minimum: 1 maximum: 30000 js: in: query name: js description: Execute on-page JavaScript using a headless browser (true by default). example: true schema: type: boolean default: true js_timeout: in: query name: js_timeout description: Maximum JavaScript rendering time in ms. Increase it in case if you see a loading indicator instead of data on the target page. example: 2000 schema: type: integer default: 2000 minimum: 1 maximum: 20000 wait_for: in: query name: wait_for description: CSS selector to wait for before returning the page content. Useful for pages with dynamic content loading. Overrides js_timeout. schema: type: string proxy: in: query name: proxy description: Type of proxy, use residential proxies if your site restricts traffic from datacenters (datacenter by default). Note that residential proxy requests are more expensive than datacenter, see the pricing page for details. example: "datacenter" schema: type: string default: "datacenter" enum: [ "datacenter", "residential" ] country: in: query name: country description: Country of the proxy to use (US by default). example: "us" schema: type: string default: "us" enum: [ "us", "gb", "de", "it", "fr", "ca", "es", "ru", "jp", "kr", "in" ] custom_proxy: in: query name: custom_proxy description: Your own proxy URL to use instead of our built-in proxy pool in "http://user:password@host:port" format (<a target="_blank" href="https://webscraping.ai/proxies/smartproxy">Smartproxy</a> for example). example: schema: type: string device: in: query name: device description: Type of device emulation. example: "desktop" schema: type: string default: "desktop" enum: [ "desktop", "mobile", "tablet" ] error_on_404: in: query name: error_on_404 description: Return error on 404 HTTP status on the target page (false by default). example: false schema: type: boolean default: false error_on_redirect: in: query name: error_on_redirect description: Return error on redirect on the target page (false by default). example: false schema: type: boolean default: false js_script: in: query name: js_script description: Custom JavaScript code to execute on the target page. example: "document.querySelector('button').click();" schema: type: string return_script_result: in: query name: return_script_result description: Return result of the custom JavaScript code (js_script parameter) execution on the target page (false by default, page HTML will be returned). example: false schema: type: boolean default: false text_format: in: query name: text_format description: Format of the text response (plain by default). "plain" will return only the page body text. "json" and "xml" will return a json/xml with "title", "description" and "content" keys. example: "plain" schema: type: string default: "plain" enum: [ "plain", "xml", "json" ] return_links: in: query name: return_links description: "[Works only with text_format=json] Return links from the page body text (false by default). Useful for building web crawlers." example: false schema: type: boolean default: false question: in: query name: question description: Question or instructions to ask the LLM model about the target page. example: "What is the summary of this page content?" schema: type: string format: in: query name: format description: Format of the response (text by default). "json" will return a JSON object with the response, "text" will return a plain text/HTML response. example: "json" schema: type: string default: "json" enum: [ "json", "text" ] requestBodies: Body: description: Request body to pass to the target page content: application/json: schema: type: object additionalProperties: true application/x-www-form-urlencoded: schema: type: object additionalProperties: true application/xml: schema: type: object additionalProperties: true text/plain: schema: type: string schemas: Error: title: Generic error type: object properties: message: type: string description: Error description status_code: type: integer description: Target page response HTTP status code (403, 500, etc) status_message: type: string description: Target page response HTTP status message body: type: string description: Target page response body SelectedAreas: title: HTML for selected page areas type: array description: Array of elements matched by selectors items: type: string Account: title: Account limits info type: object properties: email: type: string description: Your account email remaining_api_calls: type: integer description: Remaining API credits quota resets_at: type: integer description: Next billing cycle start time (UNIX timestamp) remaining_concurrency: type: integer description: Remaining concurrent requests ```