This is page 1 of 4. Use http://codebase.md/omgwtfwow/mcp-crawl4ai-ts?lines=true&page={x} to view the full context. # Directory Structure ``` ├── .env.example ├── .github │ ├── CI.md │ ├── copilot-instructions.md │ └── workflows │ └── ci.yml ├── .gitignore ├── .prettierignore ├── .prettierrc.json ├── CHANGELOG.md ├── eslint.config.mjs ├── jest.config.cjs ├── jest.setup.cjs ├── LICENSE ├── package-lock.json ├── package.json ├── README.md ├── src │ ├── __tests__ │ │ ├── crawl.test.ts │ │ ├── crawl4ai-service.network.test.ts │ │ ├── crawl4ai-service.test.ts │ │ ├── handlers │ │ │ ├── crawl-handlers.test.ts │ │ │ ├── parameter-combinations.test.ts │ │ │ ├── screenshot-saving.test.ts │ │ │ ├── session-handlers.test.ts │ │ │ └── utility-handlers.test.ts │ │ ├── index.cli.test.ts │ │ ├── index.npx.test.ts │ │ ├── index.server.test.ts │ │ ├── index.test.ts │ │ ├── integration │ │ │ ├── batch-crawl.integration.test.ts │ │ │ ├── capture-screenshot.integration.test.ts │ │ │ ├── crawl-advanced.integration.test.ts │ │ │ ├── crawl-handlers.integration.test.ts │ │ │ ├── crawl-recursive.integration.test.ts │ │ │ ├── crawl.integration.test.ts │ │ │ ├── execute-js.integration.test.ts │ │ │ ├── extract-links.integration.test.ts │ │ │ ├── extract-with-llm.integration.test.ts │ │ │ ├── generate-pdf.integration.test.ts │ │ │ ├── get-html.integration.test.ts │ │ │ ├── get-markdown.integration.test.ts │ │ │ ├── parse-sitemap.integration.test.ts │ │ │ ├── session-management.integration.test.ts │ │ │ ├── smart-crawl.integration.test.ts │ │ │ └── test-utils.ts │ │ ├── request-handler.test.ts │ │ ├── schemas │ │ │ └── validation-edge-cases.test.ts │ │ ├── types │ │ │ └── mocks.ts │ │ └── utils │ │ └── javascript-validation.test.ts │ ├── crawl4ai-service.ts │ ├── handlers │ │ ├── base-handler.ts │ │ ├── content-handlers.ts │ │ ├── crawl-handlers.ts │ │ ├── session-handlers.ts │ │ └── utility-handlers.ts │ ├── index.ts │ ├── schemas │ │ ├── helpers.ts │ │ └── validation-schemas.ts │ ├── server.ts │ └── types.ts ├── tsconfig.build.json └── tsconfig.json ``` # Files -------------------------------------------------------------------------------- /.prettierignore: -------------------------------------------------------------------------------- ``` 1 | dist 2 | node_modules 3 | *.md 4 | *.json 5 | .env 6 | .env.* 7 | coverage 8 | .nyc_output ``` -------------------------------------------------------------------------------- /.prettierrc.json: -------------------------------------------------------------------------------- ```json 1 | { 2 | "semi": true, 3 | "trailingComma": "all", 4 | "singleQuote": true, 5 | "printWidth": 120, 6 | "tabWidth": 2, 7 | "useTabs": false, 8 | "bracketSpacing": true, 9 | "arrowParens": "always", 10 | "endOfLine": "lf" 11 | } ``` -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- ``` 1 | # Dependencies 2 | node_modules/ 3 | npm-debug.log* 4 | yarn-debug.log* 5 | yarn-error.log* 6 | 7 | # Build output 8 | dist/ 9 | build/ 10 | *.js 11 | *.js.map 12 | *.d.ts 13 | *.d.ts.map 14 | 15 | # Environment 16 | .env 17 | .env.local 18 | .env.*.local 19 | 20 | # IDE 21 | .vscode/ 22 | .idea/ 23 | *.swp 24 | *.swo 25 | *~ 26 | 27 | # OS 28 | .DS_Store 29 | Thumbs.db 30 | 31 | # Logs 32 | logs/ 33 | *.log 34 | 35 | # Testing 36 | coverage/ 37 | .nyc_output/ 38 | src/__tests__/mock-responses.json 39 | 40 | # Temporary files 41 | tmp/ 42 | temp/ 43 | 44 | add-to-claude.sh ``` 
--------------------------------------------------------------------------------
/.env.example:
--------------------------------------------------------------------------------

```
1 | # Required: URL of your Crawl4AI server
2 | CRAWL4AI_BASE_URL=http://localhost:11235
3 | 
4 | # Optional: API key for authentication (if your server requires it)
5 | CRAWL4AI_API_KEY=
6 | 
7 | # Optional: Custom server name and version
8 | SERVER_NAME=crawl4ai-mcp
9 | SERVER_VERSION=0.7.4
10 | 
11 | # Optional: For LLM extraction tests
12 | LLM_PROVIDER=openai/gpt-4o-mini
13 | LLM_API_TOKEN=your-llm-api-key
14 | LLM_BASE_URL=https://api.openai.com/v1 # If using custom endpoint
15 | 
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

```markdown
1 | # MCP Server for Crawl4AI
2 | 
3 | > **Note:** Tested with Crawl4AI version 0.7.4
4 | 
5 | [npm package](https://www.npmjs.com/package/mcp-crawl4ai-ts)
6 | [MIT license](https://opensource.org/licenses/MIT)
7 | [Node.js](https://nodejs.org/)
8 | [Coverage report](https://omgwtfwow.github.io/mcp-crawl4ai-ts/coverage/)
9 | 
10 | TypeScript implementation of an MCP server for Crawl4AI. Provides tools for web crawling, content extraction, and browser automation.
11 | 
12 | ## Table of Contents
13 | 
14 | - [Prerequisites](#prerequisites)
15 | - [Quick Start](#quick-start)
16 | - [Configuration](#configuration)
17 | - [Client-Specific Instructions](#client-specific-instructions)
18 | - [Available Tools](#available-tools)
19 |   - [1. get_markdown](#1-get_markdown---extract-content-as-markdown-with-filtering)
20 |   - [2. capture_screenshot](#2-capture_screenshot---capture-webpage-screenshot)
21 |   - [3. generate_pdf](#3-generate_pdf---convert-webpage-to-pdf)
22 |   - [4. execute_js](#4-execute_js---execute-javascript-and-get-return-values)
23 |   - [5. batch_crawl](#5-batch_crawl---crawl-multiple-urls-concurrently)
24 |   - [6. smart_crawl](#6-smart_crawl---auto-detect-and-handle-different-content-types)
25 |   - [7. get_html](#7-get_html---get-sanitized-html-for-analysis)
26 |   - [8. extract_links](#8-extract_links---extract-and-categorize-page-links)
27 |   - [9. crawl_recursive](#9-crawl_recursive---deep-crawl-website-following-links)
28 |   - [10. parse_sitemap](#10-parse_sitemap---extract-urls-from-xml-sitemaps)
29 |   - [11. crawl](#11-crawl---advanced-web-crawling-with-full-configuration)
30 |   - [12. manage_session](#12-manage_session---unified-session-management)
31 |   - [13. extract_with_llm](#13-extract_with_llm---extract-structured-data-using-ai)
32 | - [Advanced Configuration](#advanced-configuration)
33 | - [Development](#development)
34 | - [License](#license)
35 | 
36 | ## Prerequisites
37 | 
38 | - Node.js 18+ and npm
39 | - A running Crawl4AI server
40 | 
41 | ## Quick Start
42 | 
43 | ### 1. Start the Crawl4AI server (for example, local Docker)
44 | 
45 | ```bash
46 | docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.4
47 | ```
48 | 
49 | ### 2. Add to your MCP client
50 | 
51 | This MCP server works with any MCP-compatible client (Claude Desktop, Claude Code, Cursor, LM Studio, etc.).
52 | 53 | #### Using npx (Recommended) 54 | 55 | ```json 56 | { 57 | "mcpServers": { 58 | "crawl4ai": { 59 | "command": "npx", 60 | "args": ["mcp-crawl4ai-ts"], 61 | "env": { 62 | "CRAWL4AI_BASE_URL": "http://localhost:11235" 63 | } 64 | } 65 | } 66 | } 67 | ``` 68 | 69 | #### Using local installation 70 | 71 | ```json 72 | { 73 | "mcpServers": { 74 | "crawl4ai": { 75 | "command": "node", 76 | "args": ["/path/to/mcp-crawl4ai-ts/dist/index.js"], 77 | "env": { 78 | "CRAWL4AI_BASE_URL": "http://localhost:11235" 79 | } 80 | } 81 | } 82 | } 83 | ``` 84 | 85 | #### With all optional variables 86 | 87 | ```json 88 | { 89 | "mcpServers": { 90 | "crawl4ai": { 91 | "command": "npx", 92 | "args": ["mcp-crawl4ai-ts"], 93 | "env": { 94 | "CRAWL4AI_BASE_URL": "http://localhost:11235", 95 | "CRAWL4AI_API_KEY": "your-api-key", 96 | "SERVER_NAME": "custom-name", 97 | "SERVER_VERSION": "1.0.0" 98 | } 99 | } 100 | } 101 | } 102 | ``` 103 | 104 | ## Configuration 105 | 106 | ### Environment Variables 107 | 108 | ```env 109 | # Required 110 | CRAWL4AI_BASE_URL=http://localhost:11235 111 | 112 | # Optional - Server Configuration 113 | CRAWL4AI_API_KEY= # If your server requires auth 114 | SERVER_NAME=crawl4ai-mcp # Custom name for the MCP server 115 | SERVER_VERSION=1.0.0 # Custom version 116 | ``` 117 | 118 | ## Client-Specific Instructions 119 | 120 | ### Claude Desktop 121 | 122 | Add to `~/Library/Application Support/Claude/claude_desktop_config.json` 123 | 124 | ### Claude Code 125 | 126 | ```bash 127 | claude mcp add crawl4ai -e CRAWL4AI_BASE_URL=http://localhost:11235 -- npx mcp-crawl4ai-ts 128 | ``` 129 | 130 | ### Other MCP Clients 131 | 132 | Consult your client's documentation for MCP server configuration. The key details: 133 | 134 | - **Command**: `npx mcp-crawl4ai-ts` or `node /path/to/dist/index.js` 135 | - **Required env**: `CRAWL4AI_BASE_URL` 136 | - **Optional env**: `CRAWL4AI_API_KEY`, `SERVER_NAME`, `SERVER_VERSION` 137 | 138 | ## Available Tools 139 | 140 | ### 1. `get_markdown` - Extract content as markdown with filtering 141 | 142 | ```typescript 143 | { 144 | url: string, // Required: URL to extract markdown from 145 | filter?: 'raw'|'fit'|'bm25'|'llm', // Filter type (default: 'fit') 146 | query?: string, // Query for bm25/llm filters 147 | cache?: string // Cache-bust parameter (default: '0') 148 | } 149 | ``` 150 | 151 | Extracts content as markdown with various filtering options. Use 'bm25' or 'llm' filters with a query for specific content extraction. 152 | 153 | ### 2. `capture_screenshot` - Capture webpage screenshot 154 | 155 | ```typescript 156 | { 157 | url: string, // Required: URL to capture 158 | screenshot_wait_for?: number // Seconds to wait before screenshot (default: 2) 159 | } 160 | ``` 161 | 162 | Returns base64-encoded PNG. Note: This is stateless - for screenshots after JS execution, use `crawl` with `screenshot: true`. 163 | 164 | ### 3. `generate_pdf` - Convert webpage to PDF 165 | 166 | ```typescript 167 | { 168 | url: string // Required: URL to convert to PDF 169 | } 170 | ``` 171 | 172 | Returns base64-encoded PDF. Stateless tool - for PDFs after JS execution, use `crawl` with `pdf: true`. 173 | 174 | ### 4. `execute_js` - Execute JavaScript and get return values 175 | 176 | ```typescript 177 | { 178 | url: string, // Required: URL to load 179 | scripts: string | string[] // Required: JavaScript to execute 180 | } 181 | ``` 182 | 183 | Executes JavaScript and returns results. Each script can use 'return' to get values back. 
Stateless - for persistent JS execution use `crawl` with `js_code`. 184 | 185 | ### 5. `batch_crawl` - Crawl multiple URLs concurrently 186 | 187 | ```typescript 188 | { 189 | urls: string[], // Required: List of URLs to crawl 190 | max_concurrent?: number, // Parallel request limit (default: 5) 191 | remove_images?: boolean, // Remove images from output (default: false) 192 | bypass_cache?: boolean, // Bypass cache for all URLs (default: false) 193 | configs?: Array<{ // Optional: Per-URL configurations (v3.0.0+) 194 | url: string, 195 | [key: string]: any // Any crawl parameters for this specific URL 196 | }> 197 | } 198 | ``` 199 | 200 | Efficiently crawls multiple URLs in parallel. Each URL gets a fresh browser instance. With `configs` array, you can specify different parameters for each URL. 201 | 202 | ### 6. `smart_crawl` - Auto-detect and handle different content types 203 | 204 | ```typescript 205 | { 206 | url: string, // Required: URL to crawl 207 | max_depth?: number, // Maximum depth for recursive crawling (default: 2) 208 | follow_links?: boolean, // Follow links in content (default: true) 209 | bypass_cache?: boolean // Bypass cache (default: false) 210 | } 211 | ``` 212 | 213 | Intelligently detects content type (HTML/sitemap/RSS) and processes accordingly. 214 | 215 | ### 7. `get_html` - Get sanitized HTML for analysis 216 | 217 | ```typescript 218 | { 219 | url: string // Required: URL to extract HTML from 220 | } 221 | ``` 222 | 223 | Returns preprocessed HTML optimized for structure analysis. Use for building schemas or analyzing patterns. 224 | 225 | ### 8. `extract_links` - Extract and categorize page links 226 | 227 | ```typescript 228 | { 229 | url: string, // Required: URL to extract links from 230 | categorize?: boolean // Group by type (default: true) 231 | } 232 | ``` 233 | 234 | Extracts all links and groups them by type: internal, external, social media, documents, images. 235 | 236 | ### 9. `crawl_recursive` - Deep crawl website following links 237 | 238 | ```typescript 239 | { 240 | url: string, // Required: Starting URL 241 | max_depth?: number, // Maximum depth to crawl (default: 3) 242 | max_pages?: number, // Maximum pages to crawl (default: 50) 243 | include_pattern?: string, // Regex pattern for URLs to include 244 | exclude_pattern?: string // Regex pattern for URLs to exclude 245 | } 246 | ``` 247 | 248 | Crawls a website following internal links up to specified depth. Returns content from all discovered pages. 249 | 250 | ### 10. `parse_sitemap` - Extract URLs from XML sitemaps 251 | 252 | ```typescript 253 | { 254 | url: string, // Required: Sitemap URL (e.g., /sitemap.xml) 255 | filter_pattern?: string // Optional: Regex pattern to filter URLs 256 | } 257 | ``` 258 | 259 | Extracts all URLs from XML sitemaps. Supports regex filtering for specific URL patterns. 260 | 261 | ### 11. 
`crawl` - Advanced web crawling with full configuration 262 | 263 | ```typescript 264 | { 265 | url: string, // URL to crawl 266 | // Browser Configuration 267 | browser_type?: 'chromium'|'firefox'|'webkit'|'undetected', // Browser engine (undetected = stealth mode) 268 | viewport_width?: number, // Browser width (default: 1080) 269 | viewport_height?: number, // Browser height (default: 600) 270 | user_agent?: string, // Custom user agent 271 | proxy_server?: string | { // Proxy URL (string or object format) 272 | server: string, 273 | username?: string, 274 | password?: string 275 | }, 276 | proxy_username?: string, // Proxy auth (if using string format) 277 | proxy_password?: string, // Proxy password (if using string format) 278 | cookies?: Array<{name, value, domain}>, // Pre-set cookies 279 | headers?: Record<string,string>, // Custom headers 280 | 281 | // Crawler Configuration 282 | word_count_threshold?: number, // Min words per block (default: 200) 283 | excluded_tags?: string[], // HTML tags to exclude 284 | remove_overlay_elements?: boolean, // Remove popups/modals 285 | js_code?: string | string[], // JavaScript to execute 286 | wait_for?: string, // Wait condition (selector or JS) 287 | wait_for_timeout?: number, // Wait timeout (default: 30000) 288 | delay_before_scroll?: number, // Pre-scroll delay 289 | scroll_delay?: number, // Between-scroll delay 290 | process_iframes?: boolean, // Include iframe content 291 | exclude_external_links?: boolean, // Remove external links 292 | screenshot?: boolean, // Capture screenshot 293 | pdf?: boolean, // Generate PDF 294 | session_id?: string, // Reuse browser session (only works with crawl tool) 295 | cache_mode?: 'ENABLED'|'BYPASS'|'DISABLED', // Cache control 296 | 297 | // New in v3.0.0 (Crawl4AI 0.7.3/0.7.4) 298 | css_selector?: string, // CSS selector to filter content 299 | delay_before_return_html?: number, // Delay in seconds before returning HTML 300 | include_links?: boolean, // Include extracted links in response 301 | resolve_absolute_urls?: boolean, // Convert relative URLs to absolute 302 | 303 | // LLM Extraction (REST API only supports 'llm' type) 304 | extraction_type?: 'llm', // Only 'llm' extraction is supported via REST API 305 | extraction_schema?: object, // Schema for structured extraction 306 | extraction_instruction?: string, // Natural language extraction prompt 307 | extraction_strategy?: { // Advanced extraction configuration 308 | provider?: string, 309 | api_key?: string, 310 | model?: string, 311 | [key: string]: any 312 | }, 313 | table_extraction_strategy?: { // Table extraction configuration 314 | enable_chunking?: boolean, 315 | thresholds?: object, 316 | [key: string]: any 317 | }, 318 | markdown_generator_options?: { // Markdown generation options 319 | include_links?: boolean, 320 | preserve_formatting?: boolean, 321 | [key: string]: any 322 | }, 323 | 324 | timeout?: number, // Overall timeout (default: 60000) 325 | verbose?: boolean // Detailed logging 326 | } 327 | ``` 328 | 329 | ### 12. `manage_session` - Unified session management 330 | 331 | ```typescript 332 | { 333 | action: 'create' | 'clear' | 'list', // Required: Action to perform 334 | session_id?: string, // For 'create' and 'clear' actions 335 | initial_url?: string, // For 'create' action: URL to load 336 | browser_type?: 'chromium' | 'firefox' | 'webkit' | 'undetected' // For 'create' action 337 | } 338 | ``` 339 | 340 | Unified tool for managing browser sessions. 
Supports three actions:
341 | - **create**: Start a persistent browser session
342 | - **clear**: Remove a session from local tracking
343 | - **list**: Show all active sessions
344 | 
345 | Examples:
346 | ```typescript
347 | // Create a new session
348 | { action: 'create', session_id: 'my-session', initial_url: 'https://example.com' }
349 | 
350 | // Clear a session
351 | { action: 'clear', session_id: 'my-session' }
352 | 
353 | // List all sessions
354 | { action: 'list' }
355 | ```
356 | 
357 | ### 13. `extract_with_llm` - Extract structured data using AI
358 | 
359 | ```typescript
360 | {
361 |   url: string,  // URL to extract data from
362 |   query: string // Natural language extraction instructions
363 | }
364 | ```
365 | 
366 | Uses AI to extract structured data from webpages. Returns results immediately without any polling or job management. This is the recommended way to extract specific information since CSS/XPath extraction is not supported via the REST API.
367 | 
368 | ## Advanced Configuration
369 | 
370 | For detailed information about all available configuration options, extraction strategies, and advanced features, please refer to the official Crawl4AI documentation:
371 | 
372 | - [Crawl4AI Documentation](https://docs.crawl4ai.com/)
373 | - [Crawl4AI GitHub Repository](https://github.com/unclecode/crawl4ai)
374 | 
375 | ## Changelog
376 | 
377 | See [CHANGELOG.md](CHANGELOG.md) for detailed version history and recent updates.
378 | 
379 | ## Development
380 | 
381 | ### Setup
382 | 
383 | ```bash
384 | # 1. Start the Crawl4AI server
385 | docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
386 | 
387 | # 2. Install MCP server
388 | git clone https://github.com/omgwtfwow/mcp-crawl4ai-ts.git
389 | cd mcp-crawl4ai-ts
390 | npm install
391 | cp .env.example .env
392 | 
393 | # 3. Development commands
394 | npm run dev # Development mode
395 | npm test # Run tests
396 | npm run lint # Check code quality
397 | npm run build # Production build
398 | 
399 | # 4. Add to your MCP client (See "Using local installation")
400 | ```
401 | 
402 | ### Running Integration Tests
403 | 
404 | Integration tests require a running Crawl4AI server. Configure your environment:
405 | 
406 | ```bash
407 | # Required for integration tests
408 | export CRAWL4AI_BASE_URL=http://localhost:11235
409 | export CRAWL4AI_API_KEY=your-api-key # If authentication is required
410 | 
411 | # Optional: For LLM extraction tests
412 | export LLM_PROVIDER=openai/gpt-4o-mini
413 | export LLM_API_TOKEN=your-llm-api-key
414 | export LLM_BASE_URL=https://api.openai.com/v1 # If using custom endpoint
415 | 
416 | # Run integration tests (ALWAYS use the npm script; don't call `jest` directly)
417 | npm run test:integration
418 | 
419 | # Run a single integration test file
420 | npm run test:integration -- src/__tests__/integration/extract-links.integration.test.ts
421 | 
422 | # IMPORTANT: Do NOT run `npx jest` directly for integration tests. The npm script injects `NODE_OPTIONS=--experimental-vm-modules` which is required for ESM + ts-jest. Running Jest directly will produce `SyntaxError: Cannot use import statement outside a module` and hang.
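# Illustrative extra (not from the original docs): any standard Jest CLI flag can
# also be passed through after `--`, e.g. to stop on the first failure with verbose output:
npm run test:integration -- --bail --verbose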
423 | ```
424 | 
425 | Integration tests cover:
426 | 
427 | - Dynamic content and JavaScript execution
428 | - Session management and cookies
429 | - Content extraction (LLM-based only)
430 | - Media handling (screenshots, PDFs)
431 | - Performance and caching
432 | - Content filtering
433 | - Bot detection avoidance
434 | - Error handling
435 | 
436 | ### Integration Test Checklist
437 | 1. Docker container healthy:
438 | ```bash
439 | docker ps --filter name=crawl4ai --format '{{.Names}} {{.Status}}'
440 | curl -sf http://localhost:11235/health || echo "Health check failed"
441 | ```
442 | 2. Env vars loaded (either exported or in `.env`): `CRAWL4AI_BASE_URL` (required), optional: `CRAWL4AI_API_KEY`, `LLM_PROVIDER`, `LLM_API_TOKEN`, `LLM_BASE_URL`.
443 | 3. Use `npm run test:integration` (never raw `jest`).
444 | 4. To target one file, add it after `--` (see example above).
445 | 5. Expect a total runtime of ~2–3 minutes; a longer run or an immediate hang usually means a missing `NODE_OPTIONS` flag or the wrong Jest version.
446 | 
447 | ### Troubleshooting
448 | | Symptom | Likely Cause | Fix |
449 | |---------|--------------|-----|
450 | | `SyntaxError: Cannot use import statement outside a module` | Ran `jest` directly without script flags | Re-run with `npm run test:integration` |
451 | | Hangs on first test (`RUNS ...`) | Missing experimental VM modules flag | Use the npm script / ensure `NODE_OPTIONS=--experimental-vm-modules` |
452 | | Network timeouts | Crawl4AI container not healthy / DNS blocked | Restart the container: `docker restart <name>` |
453 | | LLM tests skipped | Missing `LLM_PROVIDER` or `LLM_API_TOKEN` | Export the required LLM vars |
454 | | New Jest major upgrade breaks tests | Version mismatch with `ts-jest` | Keep Jest 29.x unless `ts-jest` is upgraded accordingly |
455 | 
456 | ### Version Compatibility Note
457 | Current stack: `jest@29.7.0` + `ts-jest@29.4.0` + ESM (`"type": "module"`). Updating Jest to 30+ requires upgrading `ts-jest` and revisiting `jest.config.cjs`. Keep versions aligned to avoid parse errors.
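A quick way to confirm the installed pair before debugging parse errors (a minimal check, assuming npm is the package manager in use):

```bash
# Print the resolved jest and ts-jest versions; both should report 29.x
npm ls jest ts-jest
```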
458 | 459 | ## License 460 | 461 | MIT ``` -------------------------------------------------------------------------------- /tsconfig.build.json: -------------------------------------------------------------------------------- ```json 1 | { 2 | "extends": "./tsconfig.json", 3 | "exclude": [ 4 | "node_modules", 5 | "dist", 6 | "src/**/*.test.ts", 7 | "src/__tests__/**/*" 8 | ] 9 | } ``` -------------------------------------------------------------------------------- /tsconfig.json: -------------------------------------------------------------------------------- ```json 1 | { 2 | "compilerOptions": { 3 | "target": "ES2022", 4 | "module": "NodeNext", 5 | "moduleResolution": "NodeNext", 6 | "lib": ["ES2022"], 7 | "outDir": "./dist", 8 | "rootDir": "./src", 9 | "strict": true, 10 | "esModuleInterop": true, 11 | "skipLibCheck": true, 12 | "forceConsistentCasingInFileNames": true, 13 | "resolveJsonModule": true, 14 | "declaration": true, 15 | "declarationMap": true, 16 | "sourceMap": true, 17 | "isolatedModules": true 18 | }, 19 | "include": ["src/**/*"], 20 | "exclude": ["node_modules", "dist"] 21 | } ``` -------------------------------------------------------------------------------- /jest.setup.cjs: -------------------------------------------------------------------------------- ``` 1 | // Load dotenv for integration tests 2 | const dotenv = require('dotenv'); 3 | const path = require('path'); 4 | 5 | // The npm script sets an env var to identify integration tests 6 | const isIntegrationTest = process.env.JEST_TEST_TYPE === 'integration'; 7 | 8 | if (isIntegrationTest) { 9 | // For integration tests, load from .env file 10 | dotenv.config({ path: path.resolve(__dirname, '.env') }); 11 | 12 | // For integration tests, we MUST have proper environment variables 13 | // No fallback to localhost - tests should fail if not configured 14 | } else { 15 | // For unit tests, always use localhost 16 | process.env.CRAWL4AI_BASE_URL = 'http://localhost:11235'; 17 | process.env.CRAWL4AI_API_KEY = 'test-api-key'; 18 | } ``` -------------------------------------------------------------------------------- /jest.config.cjs: -------------------------------------------------------------------------------- ``` 1 | /** @type {import('jest').Config} */ 2 | module.exports = { 3 | preset: 'ts-jest/presets/default-esm', 4 | testEnvironment: 'node', 5 | roots: ['<rootDir>/src'], 6 | testMatch: ['**/__tests__/**/*.test.ts'], 7 | setupFiles: ['<rootDir>/jest.setup.cjs'], 8 | collectCoverageFrom: [ 9 | 'src/**/*.ts', 10 | '!src/**/__tests__/**', 11 | '!src/**/*.test.ts', 12 | '!src/**/types/**', 13 | ], 14 | coverageDirectory: 'coverage', 15 | coverageReporters: ['text', 'lcov', 'html', 'json'], 16 | moduleNameMapper: { 17 | '^(\\.{1,2}/.*)\\.js$': '$1', 18 | }, 19 | transform: { 20 | '^.+\\.tsx?$': [ 21 | 'ts-jest', 22 | { 23 | useESM: true, 24 | }, 25 | ], 26 | }, 27 | extensionsToTreatAsEsm: ['.ts'], 28 | clearMocks: true, 29 | // Limit parallelization for integration tests to avoid overwhelming the server 30 | ...(process.env.NODE_ENV === 'test' && process.argv.some(arg => arg.includes('integration')) ? 
{ maxWorkers: 1 } : {}),
31 | };
```

--------------------------------------------------------------------------------
/src/__tests__/types/mocks.ts:
--------------------------------------------------------------------------------

```typescript
1 | /* eslint-env jest */
2 | import type { AxiosResponse } from 'axios';
3 | 
4 | /**
5 |  * Mock axios instance for testing HTTP client behavior
6 |  */
7 | export interface MockAxiosInstance {
8 |   post: jest.Mock<Promise<AxiosResponse>>;
9 |   get: jest.Mock<Promise<AxiosResponse>>;
10 |   head: jest.Mock<Promise<AxiosResponse>>;
11 |   put?: jest.Mock<Promise<AxiosResponse>>;
12 |   delete?: jest.Mock<Promise<AxiosResponse>>;
13 |   patch?: jest.Mock<Promise<AxiosResponse>>;
14 | }
15 | 
16 | /**
17 |  * Mock function type that returns a promise with content array
18 |  */
19 | type MockFunction = jest.Mock<Promise<{ content: TestContent }>>;
20 | 
21 | /**
22 |  * Mock server interface for MCP server testing
23 |  */
24 | export interface MockMCPServer {
25 |   listTools: MockFunction;
26 |   callTool: MockFunction;
27 |   listResources?: MockFunction;
28 |   readResource?: MockFunction;
29 |   listPrompts?: MockFunction;
30 |   getPrompt?: MockFunction;
31 | }
32 | 
33 | /**
34 |  * Type for test content arrays used in MCP responses
35 |  */
36 | export type TestContent = Array<{
37 |   type: string;
38 |   text?: string;
39 |   resource?: {
40 |     uri: string;
41 |     mimeType: string;
42 |     blob?: string;
43 |   };
44 | }>;
45 | 
46 | /**
47 |  * Generic test response type
48 |  */
49 | export interface TestResponse<T = unknown> {
50 |   content: TestContent;
51 |   data?: T;
52 |   error?: string;
53 | }
54 | 
```

--------------------------------------------------------------------------------
/src/index.ts:
--------------------------------------------------------------------------------

```typescript
1 | #!/usr/bin/env node
2 | 
3 | import { Crawl4AIServer } from './server.js';
4 | 
5 | // Try to load dotenv only in development
6 | // In production (via npx), env vars come from the MCP client
7 | try {
8 |   // Only try to load dotenv if CRAWL4AI_BASE_URL is not set
9 |   if (!process.env.CRAWL4AI_BASE_URL) {
10 |     const dotenv = await import('dotenv');
11 |     dotenv.config();
12 |   }
13 | } catch {
14 |   // dotenv is not available in production, which is expected
15 | }
16 | 
17 | const CRAWL4AI_BASE_URL = process.env.CRAWL4AI_BASE_URL;
18 | const CRAWL4AI_API_KEY = process.env.CRAWL4AI_API_KEY || '';
19 | const SERVER_NAME = process.env.SERVER_NAME || 'crawl4ai-mcp';
20 | const SERVER_VERSION = process.env.SERVER_VERSION || '1.0.0';
21 | 
22 | if (!CRAWL4AI_BASE_URL) {
23 |   console.error('Error: CRAWL4AI_BASE_URL environment variable is required');
24 |   console.error('Please set it to your Crawl4AI server URL (e.g., http://localhost:11235)');
25 |   process.exit(1);
26 | }
27 | 
28 | // Always start the server when this script is executed
29 | // This script is meant to be run as an MCP server
30 | const server = new Crawl4AIServer(CRAWL4AI_BASE_URL, CRAWL4AI_API_KEY, SERVER_NAME, SERVER_VERSION);
31 | server.start().catch((err) => {
32 |   console.error('Server failed to start:', err);
33 |   process.exit(1);
34 | });
35 | 
```

--------------------------------------------------------------------------------
/.github/CI.md:
--------------------------------------------------------------------------------

```markdown
1 | # GitHub Actions CI/CD
2 | 
3 | This project uses GitHub Actions for continuous integration.
4 | 
5 | ## Workflows
6 | 
7 | ### CI (`ci.yml`)
8 | Runs on every push to main and on pull requests:
9 | - Linting (ESLint)
10 | - Code formatting check (Prettier)
11 | - Build (TypeScript compilation)
12 | - Unit tests (with nock mocks)
13 | - Test coverage report
14 | 
15 | Tests run on Node.js 18.x, 20.x, and 22.x.
16 | 
17 | ## Mock Maintenance
18 | 
19 | The unit tests use [nock](https://github.com/nock/nock) for HTTP mocking. This provides:
20 | - Fast test execution (~1 second)
21 | - Predictable test results
22 | - No external dependencies during CI
23 | 
24 | **How to update mocks:**
25 | 
26 | Option 1 - Generate mock code from real API:
27 | ```bash
28 | # This will call the real API and generate nock mock code
29 | CRAWL4AI_API_KEY=your-key npm run generate-mocks
30 | ```
31 | 
32 | Option 2 - View API responses as JSON:
33 | ```bash
34 | # This will save responses to mock-responses.json for inspection
35 | CRAWL4AI_API_KEY=your-key npm run view-mocks
36 | ```
37 | 
38 | Option 3 - Manual update:
39 | 1. Run integration tests to see current API behavior: `npm run test:integration`
40 | 2. Update the mock responses in `src/__tests__/crawl4ai-service.test.ts`
41 | 3. Ensure unit tests pass: `npm run test:unit`
42 | 
43 | The mocks are intentionally simple and focus on testing our code's behavior, not the API's exact responses.
44 | 
45 | ## Running Tests Locally
46 | 
47 | ```bash
48 | # Run all tests
49 | npm test
50 | 
51 | # Run only unit tests (fast, with mocks)
52 | npm run test:unit
53 | 
54 | # Run only integration tests (slow, real API)
55 | npm run test:integration
56 | 
57 | # Run with coverage
58 | npm run test:coverage
59 | ```
```

--------------------------------------------------------------------------------
/src/handlers/base-handler.ts:
--------------------------------------------------------------------------------

```typescript
1 | import { Crawl4AIService } from '../crawl4ai-service.js';
2 | import { AxiosInstance } from 'axios';
3 | 
4 | // Error handling types
5 | export interface ErrorWithResponse {
6 |   response?: {
7 |     data?:
8 |       | {
9 |           detail?: string;
10 |         }
11 |       | string
12 |       | unknown;
13 |   };
14 |   message?: string;
15 | }
16 | 
17 | export interface SessionInfo {
18 |   id: string;
19 |   created_at: Date;
20 |   last_used: Date;
21 |   initial_url?: string;
22 |   metadata?: Record<string, unknown>;
23 | }
24 | 
25 | export abstract class BaseHandler {
26 |   protected service: Crawl4AIService;
27 |   protected axiosClient: AxiosInstance;
28 |   protected sessions: Map<string, SessionInfo>;
29 | 
30 |   constructor(service: Crawl4AIService, axiosClient: AxiosInstance, sessions: Map<string, SessionInfo>) {
31 |     this.service = service;
32 |     this.axiosClient = axiosClient;
33 |     this.sessions = sessions;
34 |   }
35 | 
36 |   protected formatError(error: unknown, operation: string): Error {
37 |     const errorWithResponse = error as ErrorWithResponse;
38 |     let errorMessage = '';
39 | 
40 |     const data = errorWithResponse.response?.data;
41 |     if (typeof data === 'object' && data && 'detail' in data) {
42 |       errorMessage = (data as { detail: string }).detail;
43 |     } else if (data) {
44 |       // If data is an object, stringify it
45 |       errorMessage = typeof data === 'object' ?
JSON.stringify(data) : String(data);
46 |     } else if (error instanceof Error) {
47 |       errorMessage = error.message;
48 |     } else {
49 |       errorMessage = String(error);
50 |     }
51 | 
52 |     return new Error(`Failed to ${operation}: ${errorMessage}`);
53 |   }
54 | }
55 | 
```

--------------------------------------------------------------------------------
/eslint.config.mjs:
--------------------------------------------------------------------------------

```
1 | import eslint from '@eslint/js';
2 | import tseslint from '@typescript-eslint/eslint-plugin';
3 | import tsparser from '@typescript-eslint/parser';
4 | import prettier from 'eslint-config-prettier';
5 | import prettierPlugin from 'eslint-plugin-prettier';
6 | 
7 | export default [
8 |   eslint.configs.recommended,
9 |   prettier,
10 |   {
11 |     files: ['src/**/*.ts'],
12 |     languageOptions: {
13 |       parser: tsparser,
14 |       parserOptions: {
15 |         project: './tsconfig.json',
16 |         ecmaVersion: 'latest',
17 |         sourceType: 'module',
18 |       },
19 |       globals: {
20 |         console: 'readonly',
21 |         process: 'readonly',
22 |         Buffer: 'readonly',
23 |         __dirname: 'readonly',
24 |         __filename: 'readonly',
25 |         setTimeout: 'readonly',
26 |         clearTimeout: 'readonly',
27 |         setInterval: 'readonly',
28 |         clearInterval: 'readonly',
29 |         URL: 'readonly',
30 |       },
31 |     },
32 |     plugins: {
33 |       '@typescript-eslint': tseslint,
34 |       prettier: prettierPlugin,
35 |     },
36 |     rules: {
37 |       ...tseslint.configs.recommended.rules,
38 |       '@typescript-eslint/explicit-function-return-type': 'off',
39 |       '@typescript-eslint/explicit-module-boundary-types': 'off',
40 |       '@typescript-eslint/no-explicit-any': 'warn',
41 |       '@typescript-eslint/no-unused-vars': [
42 |         'error',
43 |         {
44 |           argsIgnorePattern: '^_',
45 |           varsIgnorePattern: '^_',
46 |         },
47 |       ],
48 |       '@typescript-eslint/no-misused-promises': [
49 |         'error',
50 |         {
51 |           checksVoidReturn: false,
52 |         },
53 |       ],
54 |       'prettier/prettier': 'error',
55 |     },
56 |   },
57 |   {
58 |     files: ['src/**/*.test.ts', 'src/**/*.integration.test.ts', 'src/**/test-utils.ts', 'src/__tests__/types/*.ts'],
59 |     languageOptions: {
60 |       globals: {
61 |         describe: 'readonly',
62 |         it: 'readonly',
63 |         expect: 'readonly',
64 |         beforeEach: 'readonly',
65 |         afterEach: 'readonly',
66 |         beforeAll: 'readonly',
67 |         afterAll: 'readonly',
68 |         jest: 'readonly',
69 |       },
70 |     },
71 |   },
72 |   {
73 |     ignores: ['dist/**', 'node_modules/**', '*.js', '*.mjs', '*.cjs', 'coverage/**'],
74 |   },
75 | ];
```

--------------------------------------------------------------------------------
/src/schemas/helpers.ts:
--------------------------------------------------------------------------------

```typescript
1 | import { z } from 'zod';
2 | 
3 | // Helper to validate JavaScript code
4 | export const validateJavaScriptCode = (code: string): boolean => {
5 |   // Check for common HTML entities that shouldn't be in JS
6 |   if (/&quot;|&amp;|&lt;|&gt;|&#\d+;|&\w+;/.test(code)) {
7 |     return false;
8 |   }
9 | 
10 |   // Basic check to ensure it's not HTML
11 |   if (/<(!DOCTYPE|html|body|head|script|style)\b/i.test(code)) {
12 |     return false;
13 |   }
14 | 
15 |   // Check for literal \n, \t, \r outside of strings (common LLM mistake)
16 |   // This is tricky - we'll check if the code has these patterns in a way that suggests
17 |   // they're meant to be actual newlines/tabs rather than escape sequences in strings
18 |   // Look for patterns like: ;\n or }\n or )\n which suggest literal newlines
19 |   if (/[;})]\s*\\n|\\n\s*[{(/]/.test(code)) {
20 |     return false;
21 |   }
22 | 
23 |   // Check for obvious cases of literal \n between statements
24 | 
if (/[;})]\s*\\n\s*\w/.test(code)) { 25 | return false; 26 | } 27 | 28 | return true; 29 | }; 30 | 31 | // Helper to create schema that rejects session_id 32 | export const createStatelessSchema = <T extends z.ZodObject<z.ZodRawShape>>(schema: T, toolName: string) => { 33 | // Tool-specific guidance for common scenarios 34 | const toolGuidance: Record<string, string> = { 35 | capture_screenshot: 'To capture screenshots with sessions, use crawl(session_id, screenshot: true)', 36 | generate_pdf: 'To generate PDFs with sessions, use crawl(session_id, pdf: true)', 37 | execute_js: 'To run JavaScript with sessions, use crawl(session_id, js_code: [...])', 38 | get_html: 'To get HTML with sessions, use crawl(session_id)', 39 | extract_with_llm: 'To extract data with sessions, first use crawl(session_id) then extract from the response', 40 | }; 41 | 42 | const message = `${toolName} does not support session_id. This tool is stateless - each call creates a new browser. ${ 43 | toolGuidance[toolName] || 'For persistent operations, use crawl with session_id.' 44 | }`; 45 | 46 | return z 47 | .object({ 48 | session_id: z.never({ message }).optional(), 49 | }) 50 | .passthrough() 51 | .and(schema) 52 | .transform((data) => { 53 | const { session_id, ...rest } = data; 54 | if (session_id !== undefined) { 55 | throw new Error(message); 56 | } 57 | return rest as z.infer<T>; 58 | }); 59 | }; 60 | ``` -------------------------------------------------------------------------------- /.github/workflows/ci.yml: -------------------------------------------------------------------------------- ```yaml 1 | name: CI 2 | 3 | on: 4 | push: 5 | branches: [ main ] 6 | pull_request: 7 | branches: [ main ] 8 | 9 | permissions: 10 | contents: write 11 | pages: write 12 | id-token: write 13 | 14 | jobs: 15 | test: 16 | runs-on: ubuntu-latest 17 | 18 | strategy: 19 | matrix: 20 | node-version: [18.x, 20.x, 22.x] 21 | 22 | steps: 23 | - uses: actions/checkout@v4 24 | 25 | - name: Use Node.js ${{ matrix.node-version }} 26 | uses: actions/setup-node@v4 27 | with: 28 | node-version: ${{ matrix.node-version }} 29 | cache: 'npm' 30 | 31 | - name: Install dependencies 32 | run: npm ci 33 | 34 | - name: Run linter 35 | run: npm run lint 36 | 37 | - name: Check formatting 38 | run: npm run format:check 39 | 40 | - name: Build 41 | run: npm run build 42 | 43 | - name: Run unit tests 44 | run: npm run test:unit 45 | 46 | - name: Generate coverage report 47 | if: matrix.node-version == '18.x' 48 | run: npm run test:coverage -- --testPathIgnorePatterns=integration --testPathIgnorePatterns=examples 49 | 50 | - name: Upload coverage reports 51 | if: matrix.node-version == '18.x' 52 | uses: actions/upload-artifact@v4 53 | with: 54 | name: coverage-report 55 | path: coverage/ 56 | 57 | - name: Update coverage gist 58 | if: matrix.node-version == '18.x' 59 | env: 60 | GIST_SECRET: ${{ secrets.GIST_SECRET }} 61 | run: | 62 | # Extract coverage percentage from lcov.info 63 | COVERAGE=$(awk -F: '/^SF:/{files++} /^LF:/{lines+=$2} /^LH:/{hits+=$2} END {printf "%.0f", (hits/lines)*100}' coverage/lcov.info) 64 | 65 | # Determine color based on coverage 66 | if [ $COVERAGE -ge 90 ]; then COLOR="brightgreen" 67 | elif [ $COVERAGE -ge 70 ]; then COLOR="green" 68 | elif [ $COVERAGE -ge 50 ]; then COLOR="yellow" 69 | elif [ $COVERAGE -ge 30 ]; then COLOR="orange" 70 | else COLOR="red"; fi 71 | 72 | # Update gist 73 | echo "{\"schemaVersion\":1,\"label\":\"coverage\",\"message\":\"${COVERAGE}%\",\"color\":\"${COLOR}\"}" > coverage.json 74 | gh auth login 
--with-token <<< "$GIST_SECRET" 75 | gh gist edit e2abffb0deb25afa2bf9185f440dae81 coverage.json 76 | 77 | - name: Deploy coverage to GitHub Pages 78 | if: matrix.node-version == '18.x' && github.ref == 'refs/heads/main' 79 | uses: peaceiris/actions-gh-pages@v4 80 | with: 81 | github_token: ${{ secrets.GITHUB_TOKEN }} 82 | publish_dir: ./coverage/lcov-report 83 | destination_dir: coverage ``` -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- ```json 1 | { 2 | "name": "mcp-crawl4ai-ts", 3 | "version": "3.0.2", 4 | "description": "TypeScript MCP server for Crawl4AI - web crawling and content extraction", 5 | "main": "dist/index.js", 6 | "bin": { 7 | "mcp-crawl4ai-ts": "dist/index.js" 8 | }, 9 | "type": "module", 10 | "engines": { 11 | "node": ">=18.0.0" 12 | }, 13 | "scripts": { 14 | "build": "tsc -p tsconfig.build.json", 15 | "start": "node dist/index.js", 16 | "dev": "tsx src/index.ts", 17 | "test": "NODE_OPTIONS=--experimental-vm-modules jest", 18 | "test:watch": "NODE_OPTIONS=--experimental-vm-modules jest --watch", 19 | "test:coverage": "NODE_OPTIONS=--experimental-vm-modules jest --coverage", 20 | "test:unit": "NODE_OPTIONS=--experimental-vm-modules jest --testPathIgnorePatterns=integration --testPathIgnorePatterns=examples", 21 | "test:integration": "JEST_TEST_TYPE=integration NODE_OPTIONS=--experimental-vm-modules jest src/__tests__/integration", 22 | "test:ci": "NODE_OPTIONS=--experimental-vm-modules jest --coverage --maxWorkers=2", 23 | "lint": "eslint src --ext .ts", 24 | "lint:fix": "eslint src --ext .ts --fix", 25 | "format": "prettier --write \"src/**/*.ts\"", 26 | "format:check": "prettier --check \"src/**/*.ts\"", 27 | "check": "npm run lint && npm run format:check && npm run build" 28 | }, 29 | "keywords": [ 30 | "mcp", 31 | "crawl4ai", 32 | "web-scraping", 33 | "markdown", 34 | "pdf", 35 | "screenshot" 36 | ], 37 | "author": "Juan González Cano", 38 | "license": "MIT", 39 | "repository": { 40 | "type": "git", 41 | "url": "git+https://github.com/omgwtfwow/mcp-crawl4ai-ts.git" 42 | }, 43 | "bugs": { 44 | "url": "https://github.com/omgwtfwow/mcp-crawl4ai-ts/issues" 45 | }, 46 | "homepage": "https://github.com/omgwtfwow/mcp-crawl4ai-ts#readme", 47 | "files": [ 48 | "dist/**/*", 49 | "README.md", 50 | "LICENSE" 51 | ], 52 | "dependencies": { 53 | "@modelcontextprotocol/sdk": "^1.0.4", 54 | "axios": "^1.7.9", 55 | "dotenv": "^16.4.7", 56 | "zod": "^3.25.76" 57 | }, 58 | "devDependencies": { 59 | "@eslint/js": "^9.32.0", 60 | "@jest/globals": "^29.7.0", 61 | "@types/jest": "^29.5.12", 62 | "@types/nock": "^10.0.3", 63 | "@types/node": "^22.10.6", 64 | "@typescript-eslint/eslint-plugin": "^8.38.0", 65 | "@typescript-eslint/parser": "^8.38.0", 66 | "diff": "^8.0.2", 67 | "eslint": "^9.32.0", 68 | "eslint-config-prettier": "^10.1.8", 69 | "eslint-plugin-prettier": "^5.5.3", 70 | "jest": "^29.7.0", 71 | "nock": "^14.0.8", 72 | "prettier": "^3.6.2", 73 | "ts-jest": "^29.4.0", 74 | "tsx": "^4.19.2", 75 | "typescript": "^5.7.3" 76 | } 77 | } 78 | ``` -------------------------------------------------------------------------------- /src/__tests__/handlers/session-handlers.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { jest } from '@jest/globals'; 3 | import { AxiosError } from 'axios'; 4 | import type { SessionHandlers as SessionHandlersType } from 
'../../handlers/session-handlers.js'; 5 | 6 | // Mock axios before importing SessionHandlers 7 | const mockPost = jest.fn(); 8 | const mockAxiosClient = { 9 | post: mockPost, 10 | }; 11 | 12 | // Mock the service 13 | const mockService = {} as unknown; 14 | 15 | // Import after setting up mocks 16 | const { SessionHandlers } = await import('../../handlers/session-handlers.js'); 17 | 18 | describe('SessionHandlers', () => { 19 | let handler: SessionHandlersType; 20 | let sessions: Map<string, unknown>; 21 | 22 | beforeEach(() => { 23 | jest.clearAllMocks(); 24 | sessions = new Map(); 25 | handler = new SessionHandlers(mockService, mockAxiosClient as unknown, sessions); 26 | }); 27 | 28 | describe('createSession', () => { 29 | it('should handle initial crawl failure gracefully', async () => { 30 | // Mock failed crawl 31 | mockPost.mockRejectedValue( 32 | new AxiosError('Request failed with status code 500', 'ERR_BAD_RESPONSE', undefined, undefined, { 33 | status: 500, 34 | statusText: 'Internal Server Error', 35 | data: 'Internal Server Error', 36 | headers: {}, 37 | config: {} as unknown, 38 | } as unknown), 39 | ); 40 | 41 | const options = { 42 | initial_url: 'https://this-domain-definitely-does-not-exist-12345.com', 43 | browser_type: 'chromium' as const, 44 | }; 45 | 46 | // Create session with initial_url that will fail 47 | const result = await handler.createSession(options); 48 | 49 | // Session should still be created 50 | expect(result.content[0].type).toBe('text'); 51 | expect(result.content[0].text).toContain('Session created successfully'); 52 | expect(result.content[0].text).toContain( 53 | 'Pre-warmed with: https://this-domain-definitely-does-not-exist-12345.com', 54 | ); 55 | expect(result.session_id).toBeDefined(); 56 | expect(result.browser_type).toBe('chromium'); 57 | 58 | // Verify crawl was attempted 59 | expect(mockPost).toHaveBeenCalledWith( 60 | '/crawl', 61 | { 62 | urls: ['https://this-domain-definitely-does-not-exist-12345.com'], 63 | browser_config: { 64 | headless: true, 65 | browser_type: 'chromium', 66 | }, 67 | crawler_config: { 68 | session_id: expect.stringMatching(/^session-/), 69 | cache_mode: 'BYPASS', 70 | }, 71 | }, 72 | { 73 | timeout: 30000, 74 | }, 75 | ); 76 | 77 | // Verify session was stored locally 78 | expect(sessions.size).toBe(1); 79 | const session = sessions.get(result.session_id); 80 | expect(session).toBeDefined(); 81 | expect(session.initial_url).toBe('https://this-domain-definitely-does-not-exist-12345.com'); 82 | }); 83 | 84 | it('should not attempt crawl when no initial_url provided', async () => { 85 | const result = await handler.createSession({}); 86 | 87 | // Session should be created without crawl 88 | expect(result.content[0].text).toContain('Session created successfully'); 89 | expect(result.content[0].text).toContain('Ready for use'); 90 | expect(result.content[0].text).not.toContain('Pre-warmed'); 91 | 92 | // Verify no crawl was attempted 93 | expect(mockPost).not.toHaveBeenCalled(); 94 | }); 95 | }); 96 | }); 97 | ``` -------------------------------------------------------------------------------- /src/__tests__/integration/generate-pdf.integration.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js'; 3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js'; 4 | 5 | interface ToolResult { 6 | content: Array<{ 7 | type: string; 8 | text?: string; 9 
| resource?: { 10 | uri: string; 11 | mimeType?: string; 12 | blob?: string; 13 | }; 14 | }>; 15 | } 16 | 17 | describe('generate_pdf Integration Tests', () => { 18 | let client: Client; 19 | 20 | beforeAll(async () => { 21 | client = await createTestClient(); 22 | }, TEST_TIMEOUTS.medium); 23 | 24 | afterAll(async () => { 25 | if (client) { 26 | await cleanupTestClient(client); 27 | } 28 | }); 29 | 30 | describe('PDF generation', () => { 31 | it( 32 | 'should generate PDF from URL', 33 | async () => { 34 | const result = await client.callTool({ 35 | name: 'generate_pdf', 36 | arguments: { 37 | url: 'https://httpbin.org/html', 38 | }, 39 | }); 40 | 41 | expect(result).toBeDefined(); 42 | const content = (result as ToolResult).content; 43 | expect(content).toHaveLength(2); 44 | 45 | // First item should be the PDF as embedded resource 46 | expect(content[0].type).toBe('resource'); 47 | expect(content[0].resource).toBeDefined(); 48 | expect(content[0].resource?.mimeType).toBe('application/pdf'); 49 | expect(content[0].resource?.blob).toBeTruthy(); 50 | expect(content[0].resource?.blob?.length).toBeGreaterThan(1000); // Should be a substantial base64 string 51 | expect(content[0].resource?.uri).toContain('data:application/pdf'); 52 | 53 | // Second item should be text description 54 | expect(content[1].type).toBe('text'); 55 | expect(content[1].text).toContain('PDF generated for: https://httpbin.org/html'); 56 | }, 57 | TEST_TIMEOUTS.long, 58 | ); 59 | 60 | it( 61 | 'should reject session_id parameter', 62 | async () => { 63 | const result = await client.callTool({ 64 | name: 'generate_pdf', 65 | arguments: { 66 | url: 'https://httpbin.org/html', 67 | session_id: 'test-session', 68 | }, 69 | }); 70 | 71 | const content = (result as ToolResult).content; 72 | expect(content).toHaveLength(1); 73 | expect(content[0].type).toBe('text'); 74 | expect(content[0].text).toContain('session_id'); 75 | expect(content[0].text).toContain('does not support'); 76 | expect(content[0].text).toContain('stateless'); 77 | }, 78 | TEST_TIMEOUTS.short, 79 | ); 80 | 81 | it( 82 | 'should handle invalid URLs gracefully', 83 | async () => { 84 | const result = await client.callTool({ 85 | name: 'generate_pdf', 86 | arguments: { 87 | url: 'not-a-valid-url', 88 | }, 89 | }); 90 | 91 | const content = (result as ToolResult).content; 92 | expect(content).toHaveLength(1); 93 | expect(content[0].type).toBe('text'); 94 | expect(content[0].text).toContain('Error'); 95 | expect(content[0].text?.toLowerCase()).toContain('invalid'); 96 | }, 97 | TEST_TIMEOUTS.short, 98 | ); 99 | 100 | it( 101 | 'should handle non-existent domains', 102 | async () => { 103 | const result = await client.callTool({ 104 | name: 'generate_pdf', 105 | arguments: { 106 | url: 'https://this-domain-definitely-does-not-exist-123456789.com', 107 | }, 108 | }); 109 | 110 | const content = (result as ToolResult).content; 111 | expect(content).toHaveLength(1); 112 | expect(content[0].type).toBe('text'); 113 | expect(content[0].text).toContain('Error'); 114 | }, 115 | TEST_TIMEOUTS.short, 116 | ); 117 | }); 118 | }); 119 | ``` -------------------------------------------------------------------------------- /src/__tests__/integration/session-management.integration.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js'; 2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js'; 3 | 4 | interface ToolResult { 5 | 
content: Array<{ 6 | type: string; 7 | text?: string; 8 | }>; 9 | session_id?: string; 10 | browser_type?: string; 11 | initial_url?: string; 12 | created_at?: string; 13 | } 14 | 15 | describe('Session Management Integration Tests', () => { 16 | let client: Client; 17 | const createdSessions: string[] = []; 18 | 19 | beforeAll(async () => { 20 | client = await createTestClient(); 21 | }, TEST_TIMEOUTS.medium); 22 | 23 | afterEach(async () => { 24 | // Clean up any sessions created during tests 25 | for (const sessionId of createdSessions) { 26 | try { 27 | await client.callTool({ 28 | name: 'manage_session', 29 | arguments: { action: 'clear', session_id: sessionId }, 30 | }); 31 | } catch (e) { 32 | // Ignore errors during cleanup 33 | console.debug('Cleanup error:', e); 34 | } 35 | } 36 | createdSessions.length = 0; 37 | }); 38 | 39 | afterAll(async () => { 40 | if (client) { 41 | await cleanupTestClient(client); 42 | } 43 | }); 44 | 45 | describe('manage_session', () => { 46 | it( 47 | 'should create session with auto-generated ID using manage_session', 48 | async () => { 49 | const result = await client.callTool({ 50 | name: 'manage_session', 51 | arguments: { action: 'create' }, 52 | }); 53 | 54 | expect(result).toBeDefined(); 55 | const typedResult = result as ToolResult; 56 | expect(typedResult.content).toBeDefined(); 57 | expect(Array.isArray(typedResult.content)).toBe(true); 58 | 59 | const textContent = typedResult.content.find((c) => c.type === 'text'); 60 | expect(textContent).toBeDefined(); 61 | expect(textContent?.text).toContain('Session created successfully'); 62 | 63 | // Check returned parameters 64 | expect(typedResult.session_id).toBeDefined(); 65 | expect(typedResult.session_id).toMatch(/^session-/); 66 | expect(typedResult.browser_type).toBe('chromium'); 67 | 68 | // Track for cleanup 69 | createdSessions.push(typedResult.session_id!); 70 | }, 71 | TEST_TIMEOUTS.short, 72 | ); 73 | 74 | it( 75 | 'should clear session using manage_session', 76 | async () => { 77 | // First create a session 78 | const createResult = await client.callTool({ 79 | name: 'manage_session', 80 | arguments: { 81 | action: 'create', 82 | session_id: 'test-to-clear', 83 | }, 84 | }); 85 | 86 | const typedCreateResult = createResult as ToolResult; 87 | createdSessions.push(typedCreateResult.session_id!); 88 | 89 | // Then clear it 90 | const clearResult = await client.callTool({ 91 | name: 'manage_session', 92 | arguments: { 93 | action: 'clear', 94 | session_id: 'test-to-clear', 95 | }, 96 | }); 97 | 98 | const typedClearResult = clearResult as ToolResult; 99 | expect(typedClearResult.content[0].text).toContain('Session cleared successfully'); 100 | }, 101 | TEST_TIMEOUTS.short, 102 | ); 103 | 104 | it( 105 | 'should list sessions using manage_session', 106 | async () => { 107 | // Create a session first 108 | const createResult = await client.callTool({ 109 | name: 'manage_session', 110 | arguments: { 111 | action: 'create', 112 | session_id: 'test-list-session', 113 | }, 114 | }); 115 | 116 | const typedCreateResult = createResult as ToolResult; 117 | createdSessions.push(typedCreateResult.session_id!); 118 | 119 | // List sessions 120 | const listResult = await client.callTool({ 121 | name: 'manage_session', 122 | arguments: { action: 'list' }, 123 | }); 124 | 125 | const typedListResult = listResult as ToolResult; 126 | expect(typedListResult.content[0].text).toContain('Active sessions'); 127 | expect(typedListResult.content[0].text).toContain('test-list-session'); 128 | }, 129 | 
TEST_TIMEOUTS.short, 130 | ); 131 | }); 132 | }); 133 | ``` -------------------------------------------------------------------------------- /src/__tests__/integration/get-html.integration.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js'; 3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js'; 4 | 5 | interface ToolResult { 6 | content: Array<{ 7 | type: string; 8 | text?: string; 9 | }>; 10 | } 11 | 12 | describe('get_html Integration Tests', () => { 13 | let client: Client; 14 | 15 | beforeAll(async () => { 16 | client = await createTestClient(); 17 | }, TEST_TIMEOUTS.medium); 18 | 19 | afterAll(async () => { 20 | if (client) { 21 | await cleanupTestClient(client); 22 | } 23 | }); 24 | 25 | describe('HTML extraction', () => { 26 | it( 27 | 'should extract HTML from URL', 28 | async () => { 29 | const result = await client.callTool({ 30 | name: 'get_html', 31 | arguments: { 32 | url: 'https://httpbin.org/html', 33 | }, 34 | }); 35 | 36 | expect(result).toBeDefined(); 37 | const content = (result as ToolResult).content; 38 | expect(content).toHaveLength(1); 39 | expect(content[0].type).toBe('text'); 40 | 41 | // Should contain processed HTML 42 | const html = content[0].text || ''; 43 | expect(html).toBeTruthy(); 44 | // The HTML endpoint returns sanitized/processed HTML 45 | // It might be truncated with "..." 46 | expect(html.length).toBeGreaterThan(0); 47 | }, 48 | TEST_TIMEOUTS.medium, 49 | ); 50 | 51 | it( 52 | 'should reject session_id parameter', 53 | async () => { 54 | const result = await client.callTool({ 55 | name: 'get_html', 56 | arguments: { 57 | url: 'https://example.com', 58 | session_id: 'test-session', 59 | }, 60 | }); 61 | 62 | const content = (result as ToolResult).content; 63 | expect(content).toHaveLength(1); 64 | expect(content[0].type).toBe('text'); 65 | expect(content[0].text).toContain('session_id'); 66 | expect(content[0].text).toContain('does not support'); 67 | expect(content[0].text).toContain('stateless'); 68 | }, 69 | TEST_TIMEOUTS.short, 70 | ); 71 | 72 | it( 73 | 'should handle invalid URLs gracefully', 74 | async () => { 75 | const result = await client.callTool({ 76 | name: 'get_html', 77 | arguments: { 78 | url: 'not-a-valid-url', 79 | }, 80 | }); 81 | 82 | const content = (result as ToolResult).content; 83 | expect(content).toHaveLength(1); 84 | expect(content[0].type).toBe('text'); 85 | expect(content[0].text).toContain('Error'); 86 | expect(content[0].text?.toLowerCase()).toContain('invalid'); 87 | }, 88 | TEST_TIMEOUTS.short, 89 | ); 90 | 91 | it( 92 | 'should handle non-existent domains', 93 | async () => { 94 | const result = await client.callTool({ 95 | name: 'get_html', 96 | arguments: { 97 | url: 'https://this-domain-definitely-does-not-exist-123456789.com', 98 | }, 99 | }); 100 | 101 | const content = (result as ToolResult).content; 102 | expect(content).toHaveLength(1); 103 | expect(content[0].type).toBe('text'); 104 | 105 | // According to spec, returns success: true with empty HTML for invalid URLs 106 | const html = content[0].text || ''; 107 | // Could be empty or contain an error message 108 | expect(typeof html).toBe('string'); 109 | }, 110 | TEST_TIMEOUTS.short, 111 | ); 112 | 113 | it( 114 | 'should ignore extra parameters', 115 | async () => { 116 | const result = await client.callTool({ 117 | name: 'get_html', 118 | arguments: { 119 | url: 
'https://example.com', 120 | wait_for: '.some-selector', // Should be ignored 121 | bypass_cache: true, // Should be ignored 122 | }, 123 | }); 124 | 125 | const content = (result as ToolResult).content; 126 | expect(content).toHaveLength(1); 127 | expect(content[0].type).toBe('text'); 128 | 129 | // Should still work, ignoring extra params 130 | const html = content[0].text || ''; 131 | expect(html.length).toBeGreaterThan(0); 132 | }, 133 | TEST_TIMEOUTS.long, 134 | ); 135 | }); 136 | }); 137 | ``` -------------------------------------------------------------------------------- /src/__tests__/integration/capture-screenshot.integration.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js'; 3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js'; 4 | 5 | interface ToolResult { 6 | content: Array<{ 7 | type: string; 8 | text?: string; 9 | data?: string; 10 | mimeType?: string; 11 | }>; 12 | } 13 | 14 | describe('capture_screenshot Integration Tests', () => { 15 | let client: Client; 16 | 17 | beforeAll(async () => { 18 | client = await createTestClient(); 19 | }, TEST_TIMEOUTS.medium); 20 | 21 | afterAll(async () => { 22 | if (client) { 23 | await cleanupTestClient(client); 24 | } 25 | }); 26 | 27 | describe('Screenshot capture', () => { 28 | it( 29 | 'should capture screenshot with default wait time', 30 | async () => { 31 | const result = await client.callTool({ 32 | name: 'capture_screenshot', 33 | arguments: { 34 | url: 'https://httpbin.org/html', 35 | }, 36 | }); 37 | 38 | expect(result).toBeDefined(); 39 | const content = (result as ToolResult).content; 40 | expect(content).toHaveLength(2); 41 | 42 | // First item should be the image 43 | expect(content[0].type).toBe('image'); 44 | expect(content[0].mimeType).toBe('image/png'); 45 | expect(content[0].data).toBeTruthy(); 46 | expect(content[0].data?.length).toBeGreaterThan(1000); // Should be a substantial base64 string 47 | 48 | // Second item should be text description 49 | expect(content[1].type).toBe('text'); 50 | expect(content[1].text).toContain('Screenshot captured for: https://httpbin.org/html'); 51 | }, 52 | TEST_TIMEOUTS.short, 53 | ); 54 | 55 | it( 56 | 'should capture screenshot with custom wait time', 57 | async () => { 58 | const result = await client.callTool({ 59 | name: 'capture_screenshot', 60 | arguments: { 61 | url: 'https://httpbin.org/html', 62 | screenshot_wait_for: 0.5, // Reduced from 3 seconds 63 | }, 64 | }); 65 | 66 | expect(result).toBeDefined(); 67 | const content = (result as ToolResult).content; 68 | expect(content).toHaveLength(2); 69 | 70 | // First item should be the image 71 | expect(content[0].type).toBe('image'); 72 | expect(content[0].mimeType).toBe('image/png'); 73 | expect(content[0].data).toBeTruthy(); 74 | 75 | // Second item should be text description 76 | expect(content[1].type).toBe('text'); 77 | expect(content[1].text).toContain('Screenshot captured for: https://httpbin.org/html'); 78 | }, 79 | TEST_TIMEOUTS.medium, 80 | ); 81 | 82 | it( 83 | 'should reject session_id parameter', 84 | async () => { 85 | const result = await client.callTool({ 86 | name: 'capture_screenshot', 87 | arguments: { 88 | url: 'https://example.com', 89 | session_id: 'test-session', 90 | }, 91 | }); 92 | 93 | const content = (result as ToolResult).content; 94 | expect(content).toHaveLength(1); 95 | expect(content[0].type).toBe('text'); 96 | 
expect(content[0].text).toContain('session_id'); 97 | expect(content[0].text).toContain('does not support'); 98 | expect(content[0].text).toContain('stateless'); 99 | }, 100 | TEST_TIMEOUTS.short, 101 | ); 102 | 103 | it( 104 | 'should handle invalid URLs gracefully', 105 | async () => { 106 | const result = await client.callTool({ 107 | name: 'capture_screenshot', 108 | arguments: { 109 | url: 'not-a-valid-url', 110 | }, 111 | }); 112 | 113 | const content = (result as ToolResult).content; 114 | expect(content).toHaveLength(1); 115 | expect(content[0].type).toBe('text'); 116 | expect(content[0].text).toContain('Error'); 117 | expect(content[0].text?.toLowerCase()).toContain('invalid'); 118 | }, 119 | TEST_TIMEOUTS.short, 120 | ); 121 | 122 | it( 123 | 'should handle non-existent domains', 124 | async () => { 125 | const result = await client.callTool({ 126 | name: 'capture_screenshot', 127 | arguments: { 128 | url: 'https://this-domain-definitely-does-not-exist-123456789.com', 129 | }, 130 | }); 131 | 132 | const content = (result as ToolResult).content; 133 | expect(content).toHaveLength(1); 134 | expect(content[0].type).toBe('text'); 135 | expect(content[0].text).toContain('Error'); 136 | }, 137 | TEST_TIMEOUTS.short, 138 | ); 139 | }); 140 | }); 141 | ``` -------------------------------------------------------------------------------- /src/__tests__/integration/test-utils.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js'; 3 | import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js'; 4 | import dotenv from 'dotenv'; 5 | 6 | // Load environment variables 7 | dotenv.config(); 8 | 9 | export interface IntegrationTestConfig { 10 | baseUrl: string; 11 | apiKey: string; 12 | llmProvider?: string; 13 | llmApiToken?: string; 14 | llmBaseUrl?: string; 15 | } 16 | 17 | export function getTestConfig(): IntegrationTestConfig { 18 | const config: IntegrationTestConfig = { 19 | baseUrl: process.env.CRAWL4AI_BASE_URL || '', 20 | apiKey: process.env.CRAWL4AI_API_KEY || '', 21 | llmProvider: process.env.LLM_PROVIDER, 22 | llmApiToken: process.env.LLM_API_TOKEN, 23 | llmBaseUrl: process.env.LLM_BASE_URL, 24 | }; 25 | 26 | if (!config.baseUrl) { 27 | throw new Error( 28 | 'CRAWL4AI_BASE_URL is required for integration tests. 
Please set it in .env file or environment variable.', 29 | ); 30 | } 31 | 32 | return config; 33 | } 34 | 35 | export function hasLLMConfig(): boolean { 36 | const config = getTestConfig(); 37 | return !!(config.llmProvider && config.llmApiToken); 38 | } 39 | 40 | export async function createTestClient(): Promise<Client> { 41 | const transport = new StdioClientTransport({ 42 | command: 'tsx', 43 | args: ['src/index.ts'], 44 | env: { 45 | ...process.env, 46 | NODE_ENV: 'test', 47 | }, 48 | cwd: process.cwd(), // Ensure the child process runs in the correct directory 49 | }); 50 | 51 | const client = new Client( 52 | { 53 | name: 'integration-test-client', 54 | version: '1.0.0', 55 | }, 56 | { 57 | capabilities: {}, 58 | }, 59 | ); 60 | 61 | await client.connect(transport); 62 | return client; 63 | } 64 | 65 | export async function cleanupTestClient(client: Client): Promise<void> { 66 | await client.close(); 67 | } 68 | 69 | // Test data generators 70 | export function generateSessionId(): string { 71 | return `test-session-${Date.now()}-${Math.random().toString(36).substring(2, 9)}`; 72 | } 73 | 74 | export function generateTestUrl(type: 'simple' | 'dynamic' | 'infinite-scroll' | 'auth' = 'simple'): string { 75 | const urls = { 76 | simple: 'https://example.com', 77 | dynamic: 'https://github.com', 78 | 'infinite-scroll': 'https://twitter.com', 79 | auth: 'https://github.com/login', 80 | }; 81 | return urls[type]; 82 | } 83 | 84 | // Test result types 85 | export interface TestContentItem { 86 | type: string; 87 | text?: string; 88 | data?: string; 89 | mimeType?: string; 90 | } 91 | 92 | export interface TestResult { 93 | content: TestContentItem[]; 94 | } 95 | 96 | export interface ToolResult { 97 | content: TestContentItem[]; 98 | isError?: boolean; 99 | } 100 | 101 | // Assertion helpers 102 | export async function expectSuccessfulCrawl(result: unknown): Promise<void> { 103 | expect(result).toBeDefined(); 104 | 105 | // Type guard to check if result has content property 106 | const typedResult = result as { content?: unknown }; 107 | expect(typedResult.content).toBeDefined(); 108 | expect(typedResult.content).toBeInstanceOf(Array); 109 | 110 | const contentArray = typedResult.content as TestContentItem[]; 111 | expect(contentArray.length).toBeGreaterThan(0); 112 | 113 | const textContent = contentArray.find((c) => c.type === 'text'); 114 | expect(textContent).toBeDefined(); 115 | expect(textContent?.text).toBeTruthy(); 116 | } 117 | 118 | export async function expectScreenshot(result: unknown): Promise<void> { 119 | const typedResult = result as { content?: TestContentItem[] }; 120 | expect(typedResult.content).toBeDefined(); 121 | 122 | const imageContent = typedResult.content?.find((c) => c.type === 'image'); 123 | expect(imageContent).toBeDefined(); 124 | expect(imageContent?.data).toBeTruthy(); 125 | expect(imageContent?.mimeType).toBe('image/png'); 126 | } 127 | 128 | export async function expectExtractedData(result: unknown, expectedKeys: string[]): Promise<void> { 129 | const typedResult = result as { content?: TestContentItem[] }; 130 | expect(typedResult.content).toBeDefined(); 131 | 132 | const textContent = typedResult.content?.find((c) => c.type === 'text'); 133 | expect(textContent).toBeDefined(); 134 | 135 | // Check if extracted data contains expected keys 136 | for (const key of expectedKeys) { 137 | expect(textContent?.text).toContain(key); 138 | } 139 | } 140 | 141 | // Delay helper for tests 142 | export function delay(ms: number): Promise<void> { 143 | return new 
Promise((resolve) => setTimeout(resolve, ms));
144 | }
145 | 
146 | // Rate limiter for integration tests
147 | let lastRequestTime = 0;
148 | export async function rateLimit(minDelayMs: number = 500): Promise<void> {
149 |   const now = Date.now();
150 |   const timeSinceLastRequest = now - lastRequestTime;
151 | 
152 |   if (timeSinceLastRequest < minDelayMs) {
153 |     await delay(minDelayMs - timeSinceLastRequest);
154 |   }
155 | 
156 |   lastRequestTime = Date.now();
157 | }
158 | 
159 | // Skip test if condition is not met
160 | export function skipIf(condition: boolean, message: string) {
161 |   if (condition) {
162 |     console.log(`⚠️ Skipping test: ${message}`);
163 |     return true;
164 |   }
165 |   return false;
166 | }
167 | 
168 | // Test timeout helper
169 | export const TEST_TIMEOUTS = {
170 |   short: 30000, // 30 seconds
171 |   medium: 60000, // 1 minute
172 |   long: 120000, // 2 minutes
173 |   extraLong: 180000, // 3 minutes
174 | };
175 | 
```
--------------------------------------------------------------------------------
/.github/copilot-instructions.md:
--------------------------------------------------------------------------------
```markdown
1 | # Copilot Instructions: `mcp-crawl4ai-ts`
2 | 
3 | Concise, project-specific guidance for AI coding agents. Optimize for correctness, safety, and existing test expectations.
4 | 
5 | ## Architecture & Flow
6 | - Entrypoint `src/index.ts`: loads dotenv only if `CRAWL4AI_BASE_URL` unset; fails fast if missing. Passes env + version into `Crawl4AIServer`.
7 | - `src/server.ts`: registers MCP tools, keeps a `Map<string, SessionInfo>` for persistent browser sessions, and uses `validateAndExecute` (Zod parse + invariant error message format). Do NOT alter error text pattern: `Invalid parameters for <tool>: ...` (tests & LLM reliability depend on it).
8 | - Service layer `src/crawl4ai-service.ts`: pure HTTP wrapper around Crawl4AI endpoints; centralizes axios timeout & error translation (preserve wording like `Request timed out`, `Request failed with status <code>:` — tests rely on these substrings).
9 | - Handlers (`src/handlers/*.ts`): orchestration & response shaping (text content arrays). No direct business logic inside server class beyond wiring.
10 | - Validation schemas (`src/schemas/validation-schemas.ts` + helpers): all tool inputs defined here. Use `createStatelessSchema` for stateless tools; session/persistent tools have discriminated unions.
11 | 
12 | ## Tool Model
13 | - Stateless tools (e.g. `get_markdown`, `capture_screenshot`, `execute_js`) spin up a fresh browser each call.
14 | - Session-based operations use `manage_session` (create/list/clear) + `crawl` for persistent state, allowing chained JS + screenshot/pdf in ONE call. Never try to chain separate stateless calls to reflect JS mutations.
15 | - Output always returned as base64/text blocks; do not add file system side-effects unless explicitly using a save path param already supported (screenshots: optional local save dir).
16 | 
17 | ## JS & Input Validation Nuances
18 | - JS code schema rejects: HTML entities (&quot;), literal `\n` tokens outside strings, embedded HTML tags. Reuse `JsCodeSchema`—do not duplicate logic.
19 | - For `get_markdown`: if filter is `bm25` or `llm`, `query` becomes required (enforced via `.refine`). Keep this logic centralized.
20 | 
21 | ## Sessions
22 | - `SessionInfo` tracks `created_at` & `last_used`. Update `last_used` whenever a session-based action runs. Don't leak sessions: `clear` must delete map entry (see the sketch below).
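- A minimal bookkeeping sketch (illustrative; mirrors the shape used in `src/handlers/session-handlers.ts`, not the verbatim type):

      interface SessionInfo {
        id: string;
        created_at: Date;
        last_used: Date;
        initial_url?: string;
        metadata?: { browser_type?: string };
      }

      // Update last_used on every session-based action.
      function touchSession(sessions: Map<string, SessionInfo>, sessionId: string): void {
        const s = sessions.get(sessionId);
        if (s) s.last_used = new Date();
      }

      // Clear must delete the map entry so sessions never leak.
      function clearSession(sessions: Map<string, SessionInfo>, sessionId: string): boolean {
        return sessions.delete(sessionId);
      }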
23 | 24 | ## Error Handling Pattern 25 | - Handlers wrap service calls; on failure use `this.formatError(error, '<operation>')` (see `BaseHandler`). Preserve format: `Failed to <operation>: <detail>`. 26 | - Zod validation errors: keep exact join pattern of `path: message` segments. 27 | 28 | ## Adding / Modifying a Tool (Checklist) 29 | 1. Define or extend schema in `validation-schemas.ts` (prefer composing existing small schemas; wrap with `createStatelessSchema` if ephemeral). 30 | 2. Add service method if it maps to a new Crawl4AI endpoint (pure HTTP + validation of URL / JS content; reuse existing validators). 31 | 3. Implement handler method (assemble request body, post-process response to `content: [{ type: 'text', text }]`). 32 | 4. Register in `setupHandlers()` list (tool description should mirror README style & clarify stateless vs session). 33 | 5. Write tests: unit (schema + handler success/failure), integration (happy path with mocked or real endpoint). Place under matching folder in `src/__tests__/`. 34 | 6. Update README tool table if user-facing, and CHANGELOG + version bump. 35 | 36 | ## Commands & Workflows 37 | - Install: `npm install` 38 | - Build: `npm run build` (tsconfig.build.json) 39 | - Dev (watch): `npm run dev` 40 | - Tests: `npm run test` | unit only: `npm run test:unit` | integration: `npm run test:integration` | coverage: `npm run test:coverage` 41 | - Lint/Format: `npm run lint`, `npm run lint:fix`, `npm run format:check` 42 | - Pre-flight composite: `npm run check` 43 | 44 | ### Testing Invariants 45 | - NEVER invoke `jest` directly for integration tests; rely on `npm run test:integration` (injects `NODE_OPTIONS=--experimental-vm-modules` + `JEST_TEST_TYPE=integration`). 46 | - Unit tests auto-set `CRAWL4AI_BASE_URL` in `jest.setup.cjs`; integration tests require real env vars (`CRAWL4AI_BASE_URL`, optional `CRAWL4AI_API_KEY`, LLM vars) via `.env` or exported. 47 | - To run a single integration file: `npm run test:integration -- path/to/file.test.ts`. 48 | - Jest pinned at 29.x with `ts-jest@29`; do not upgrade one without the other. 49 | - Symptom mapping: import syntax error or hang at first test => you bypassed the npm script. 50 | 51 | ## Conventions & Invariants 52 | - No `any`; prefer `unknown` + narrowing. 53 | - Keep responses minimal & textual; do not introduce new top-level fields in tool results without updating all tests. 54 | - Timeout remains 120s in axios clients—changing requires test updates. 55 | - Commit style: conventional commits; no emojis, AI signoffs, or verbose bodies. 56 | 57 | ## References 58 | - README (tools & examples), CLAUDE.md (contrib rules), CHANGELOG (release notes), coverage report for quality gates. 59 | 60 | If something is ambiguous, inspect existing handlers first and mirror the closest established pattern before inventing a new one. 
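For reference, the two invariant error formats side by side (example values only, not real output):

    // Zod failure surfaced by validateAndExecute:
    //   "Invalid parameters for get_markdown: url: Invalid url"
    // Handler failure wrapped by BaseHandler.formatError:
    //   "Failed to get markdown: Request failed with status 500: Internal Server Error"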
61 | ``` -------------------------------------------------------------------------------- /src/__tests__/integration/extract-with-llm.integration.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js'; 3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js'; 4 | 5 | interface ToolResult { 6 | content: Array<{ 7 | type: string; 8 | text?: string; 9 | }>; 10 | } 11 | 12 | describe('extract_with_llm Integration Tests', () => { 13 | let client: Client; 14 | 15 | beforeAll(async () => { 16 | client = await createTestClient(); 17 | }, TEST_TIMEOUTS.medium); 18 | 19 | afterAll(async () => { 20 | if (client) { 21 | await cleanupTestClient(client); 22 | } 23 | }); 24 | 25 | describe('LLM extraction', () => { 26 | it( 27 | 'should extract information about a webpage', 28 | async () => { 29 | const result = await client.callTool({ 30 | name: 'extract_with_llm', 31 | arguments: { 32 | url: 'https://httpbin.org/html', 33 | query: 'What is the main topic of this page?', 34 | }, 35 | }); 36 | 37 | expect(result).toBeTruthy(); 38 | const typedResult = result as ToolResult; 39 | expect(typedResult.content).toBeDefined(); 40 | expect(typedResult.content.length).toBeGreaterThan(0); 41 | 42 | const textContent = (result as ToolResult).content.find((c) => c.type === 'text'); 43 | expect(textContent?.text).toBeTruthy(); 44 | // Should return a meaningful response (LLM responses are non-deterministic) 45 | expect(textContent?.text?.length || 0).toBeGreaterThan(10); 46 | }, 47 | TEST_TIMEOUTS.long, 48 | ); 49 | 50 | it( 51 | 'should answer specific questions about content', 52 | async () => { 53 | const result = await client.callTool({ 54 | name: 'extract_with_llm', 55 | arguments: { 56 | url: 'https://httpbin.org/json', 57 | query: 'What is the slideshow title?', 58 | }, 59 | }); 60 | 61 | expect(result).toBeTruthy(); 62 | expect(result.content).toBeDefined(); 63 | 64 | const textContent = (result as ToolResult).content.find((c) => c.type === 'text'); 65 | expect(textContent?.text).toBeTruthy(); 66 | // Should provide an answer about the content 67 | expect(textContent?.text?.length || 0).toBeGreaterThan(5); 68 | }, 69 | TEST_TIMEOUTS.long, 70 | ); 71 | 72 | it( 73 | 'should handle complex queries', 74 | async () => { 75 | const result = await client.callTool({ 76 | name: 'extract_with_llm', 77 | arguments: { 78 | url: 'https://httpbin.org/html', 79 | query: 'List all the links found on this page', 80 | }, 81 | }); 82 | 83 | expect(result).toBeTruthy(); 84 | const textContent = (result as ToolResult).content.find((c) => c.type === 'text'); 85 | expect(textContent?.text).toBeTruthy(); 86 | // Should provide a response about links (content may vary) 87 | expect(textContent?.text?.length || 0).toBeGreaterThan(10); 88 | }, 89 | TEST_TIMEOUTS.long, 90 | ); 91 | }); 92 | 93 | describe('Error handling', () => { 94 | it( 95 | 'should handle server without API key configured', 96 | async () => { 97 | // Note: This test may pass if the server has OPENAI_API_KEY configured 98 | // It's here to document the expected behavior 99 | const result = await client.callTool({ 100 | name: 'extract_with_llm', 101 | arguments: { 102 | url: 'https://httpbin.org/status/200', 103 | query: 'What is on this page?', 104 | }, 105 | }); 106 | 107 | const typedResult = result as ToolResult; 108 | // If it succeeds, we have API key configured 109 | if 
(typedResult.content && typedResult.content.length > 0) { 110 | expect(result).toBeTruthy(); 111 | } 112 | // If it fails, we should get a proper error message 113 | else if (typedResult.content[0]?.text?.includes('LLM provider')) { 114 | expect(typedResult.content[0].text).toContain('LLM provider'); 115 | } 116 | }, 117 | TEST_TIMEOUTS.medium, 118 | ); 119 | 120 | it( 121 | 'should handle invalid URLs', 122 | async () => { 123 | const result = await client.callTool({ 124 | name: 'extract_with_llm', 125 | arguments: { 126 | url: 'not-a-url', 127 | query: 'What is this?', 128 | }, 129 | }); 130 | 131 | expect(result).toBeDefined(); 132 | const content = (result as ToolResult).content; 133 | const textContent = content.find((c) => c.type === 'text'); 134 | expect(textContent).toBeDefined(); 135 | expect(textContent?.text).toContain('Error'); 136 | expect(textContent?.text?.toLowerCase()).toContain('invalid'); 137 | }, 138 | TEST_TIMEOUTS.short, 139 | ); 140 | 141 | it( 142 | 'should handle empty query gracefully', 143 | async () => { 144 | const result = await client.callTool({ 145 | name: 'extract_with_llm', 146 | arguments: { 147 | url: 'https://example.com', 148 | query: '', 149 | }, 150 | }); 151 | 152 | expect(result).toBeDefined(); 153 | const content = (result as ToolResult).content; 154 | const textContent = content.find((c) => c.type === 'text'); 155 | expect(textContent).toBeDefined(); 156 | expect(textContent?.text).toContain('Error'); 157 | }, 158 | TEST_TIMEOUTS.short, 159 | ); 160 | }); 161 | }); 162 | ``` -------------------------------------------------------------------------------- /src/__tests__/integration/extract-links.integration.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js'; 3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js'; 4 | 5 | interface ToolResult { 6 | content: Array<{ 7 | type: string; 8 | text?: string; 9 | }>; 10 | } 11 | 12 | describe('extract_links Integration Tests', () => { 13 | let client: Client; 14 | 15 | beforeAll(async () => { 16 | client = await createTestClient(); 17 | }, TEST_TIMEOUTS.medium); 18 | 19 | afterAll(async () => { 20 | if (client) { 21 | await cleanupTestClient(client); 22 | } 23 | }); 24 | 25 | describe('Basic functionality', () => { 26 | it( 27 | 'should extract links with categorization (default)', 28 | async () => { 29 | const result = await client.callTool({ 30 | name: 'extract_links', 31 | arguments: { 32 | url: 'https://webscraper.io/test-sites', 33 | }, 34 | }); 35 | 36 | expect(result).toBeDefined(); 37 | const content = (result as ToolResult).content; 38 | expect(content).toBeDefined(); 39 | expect(Array.isArray(content)).toBe(true); 40 | expect(content.length).toBeGreaterThan(0); 41 | 42 | const textContent = content.find((c) => c.type === 'text'); 43 | expect(textContent).toBeDefined(); 44 | expect(textContent?.text).toContain('Link analysis for https://webscraper.io/test-sites'); 45 | // Should show categorized output 46 | expect(textContent?.text).toMatch(/internal \(\d+\)/); 47 | expect(textContent?.text).toMatch(/external \(\d+\)/); 48 | }, 49 | TEST_TIMEOUTS.medium, 50 | ); 51 | 52 | it( 53 | 'should extract links without categorization', 54 | async () => { 55 | const result = await client.callTool({ 56 | name: 'extract_links', 57 | arguments: { 58 | url: 'https://webscraper.io/test-sites', 59 | categorize: false, 60 | }, 61 | 
}); 62 | 63 | expect(result).toBeDefined(); 64 | const content = (result as ToolResult).content; 65 | expect(content).toBeDefined(); 66 | expect(Array.isArray(content)).toBe(true); 67 | expect(content.length).toBeGreaterThan(0); 68 | 69 | const textContent = content.find((c) => c.type === 'text'); 70 | expect(textContent).toBeDefined(); 71 | expect(textContent?.text).toContain('All links from https://webscraper.io/test-sites'); 72 | // Should NOT show categorized output 73 | expect(textContent?.text).not.toMatch(/internal \(\d+\)/); 74 | expect(textContent?.text).not.toMatch(/external \(\d+\)/); 75 | }, 76 | TEST_TIMEOUTS.medium, 77 | ); 78 | 79 | it( 80 | 'should handle sites with no links', 81 | async () => { 82 | // Test with a simple status page 83 | const result = await client.callTool({ 84 | name: 'extract_links', 85 | arguments: { 86 | url: 'https://httpstat.us/200', 87 | }, 88 | }); 89 | 90 | expect(result).toBeDefined(); 91 | const content = (result as ToolResult).content; 92 | expect(content).toBeDefined(); 93 | const textContent = content.find((c) => c.type === 'text'); 94 | expect(textContent).toBeDefined(); 95 | }, 96 | TEST_TIMEOUTS.medium, 97 | ); 98 | 99 | it( 100 | 'should detect JSON endpoints', 101 | async () => { 102 | const result = await client.callTool({ 103 | name: 'extract_links', 104 | arguments: { 105 | url: 'https://httpbin.org/json', 106 | }, 107 | }); 108 | 109 | expect(result).toBeDefined(); 110 | const content = (result as ToolResult).content; 111 | expect(content).toBeDefined(); 112 | const textContent = content.find((c) => c.type === 'text'); 113 | expect(textContent).toBeDefined(); 114 | // Should show link analysis (even if empty) 115 | expect(textContent?.text).toContain('Link analysis for https://httpbin.org/json'); 116 | }, 117 | TEST_TIMEOUTS.medium, 118 | ); 119 | }); 120 | 121 | describe('Error handling', () => { 122 | it( 123 | 'should handle invalid URLs', 124 | async () => { 125 | const result = await client.callTool({ 126 | name: 'extract_links', 127 | arguments: { 128 | url: 'not-a-url', 129 | }, 130 | }); 131 | 132 | expect(result).toBeDefined(); 133 | const content = (result as ToolResult).content; 134 | expect(content).toBeDefined(); 135 | const textContent = content.find((c) => c.type === 'text'); 136 | expect(textContent).toBeDefined(); 137 | expect(textContent?.text).toContain('Error'); 138 | expect(textContent?.text?.toLowerCase()).toContain('invalid'); 139 | }, 140 | TEST_TIMEOUTS.short, 141 | ); 142 | 143 | it( 144 | 'should handle non-existent domains', 145 | async () => { 146 | const result = await client.callTool({ 147 | name: 'extract_links', 148 | arguments: { 149 | url: 'https://this-domain-definitely-does-not-exist-12345.com', 150 | }, 151 | }); 152 | 153 | expect(result).toBeDefined(); 154 | const content = (result as ToolResult).content; 155 | expect(content).toBeDefined(); 156 | const textContent = content.find((c) => c.type === 'text'); 157 | expect(textContent).toBeDefined(); 158 | expect(textContent?.text).toContain('Error'); 159 | // Could be various error messages: connection error, DNS error, etc. 
160 | expect(textContent?.text?.toLowerCase()).toMatch(/error|failed/); 161 | }, 162 | TEST_TIMEOUTS.medium, 163 | ); 164 | }); 165 | }); 166 | ``` -------------------------------------------------------------------------------- /src/__tests__/integration/smart-crawl.integration.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js'; 3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js'; 4 | 5 | interface ToolResult { 6 | content: Array<{ 7 | type: string; 8 | text?: string; 9 | }>; 10 | } 11 | 12 | describe('smart_crawl Integration Tests', () => { 13 | let client: Client; 14 | 15 | beforeAll(async () => { 16 | client = await createTestClient(); 17 | }, TEST_TIMEOUTS.medium); 18 | 19 | afterAll(async () => { 20 | if (client) { 21 | await cleanupTestClient(client); 22 | } 23 | }); 24 | 25 | describe('Smart crawling', () => { 26 | it( 27 | 'should auto-detect HTML content', 28 | async () => { 29 | const result = await client.callTool({ 30 | name: 'smart_crawl', 31 | arguments: { 32 | url: 'https://httpbin.org/html', 33 | }, 34 | }); 35 | 36 | expect(result).toBeDefined(); 37 | const content = (result as ToolResult).content; 38 | expect(content.length).toBeGreaterThanOrEqual(1); 39 | expect(content[0].type).toBe('text'); 40 | 41 | const text = content[0].text || ''; 42 | expect(text).toContain('Smart crawl detected content type:'); 43 | expect(text).toContain('html'); 44 | }, 45 | TEST_TIMEOUTS.medium, 46 | ); 47 | 48 | it( 49 | 'should handle sitemap URLs', 50 | async () => { 51 | const result = await client.callTool({ 52 | name: 'smart_crawl', 53 | arguments: { 54 | url: 'https://httpbingo.org/xml', 55 | max_depth: 1, 56 | }, 57 | }); 58 | 59 | const content = (result as ToolResult).content; 60 | expect(content.length).toBeGreaterThanOrEqual(1); 61 | expect(content[0].type).toBe('text'); 62 | 63 | const text = content[0].text || ''; 64 | expect(text).toContain('Smart crawl detected content type:'); 65 | expect(text.toLowerCase()).toMatch(/xml|sitemap/); 66 | }, 67 | TEST_TIMEOUTS.medium, 68 | ); 69 | 70 | it( 71 | 'should handle follow_links parameter', 72 | async () => { 73 | const result = await client.callTool({ 74 | name: 'smart_crawl', 75 | arguments: { 76 | url: 'https://httpbingo.org/xml', 77 | follow_links: true, 78 | max_depth: 1, 79 | }, 80 | }); 81 | 82 | const content = (result as ToolResult).content; 83 | expect(content.length).toBeGreaterThanOrEqual(1); 84 | expect(content[0].type).toBe('text'); 85 | 86 | const text = content[0].text || ''; 87 | expect(text).toContain('Smart crawl detected content type:'); 88 | }, 89 | TEST_TIMEOUTS.long, 90 | ); 91 | 92 | it( 93 | 'should detect JSON content', 94 | async () => { 95 | const result = await client.callTool({ 96 | name: 'smart_crawl', 97 | arguments: { 98 | url: 'https://httpbin.org/json', 99 | }, 100 | }); 101 | 102 | const content = (result as ToolResult).content; 103 | expect(content.length).toBeGreaterThanOrEqual(1); 104 | expect(content[0].type).toBe('text'); 105 | 106 | const text = content[0].text || ''; 107 | expect(text).toContain('Smart crawl detected content type:'); 108 | }, 109 | TEST_TIMEOUTS.medium, 110 | ); 111 | 112 | it( 113 | 'should bypass cache when requested', 114 | async () => { 115 | const result = await client.callTool({ 116 | name: 'smart_crawl', 117 | arguments: { 118 | url: 'https://httpbin.org/html', 119 | bypass_cache: 
true, 120 | }, 121 | }); 122 | 123 | const content = (result as ToolResult).content; 124 | expect(content.length).toBeGreaterThanOrEqual(1); 125 | expect(content[0].type).toBe('text'); 126 | 127 | const text = content[0].text || ''; 128 | expect(text).toContain('Smart crawl detected content type:'); 129 | }, 130 | TEST_TIMEOUTS.medium, 131 | ); 132 | 133 | it( 134 | 'should handle invalid URLs gracefully', 135 | async () => { 136 | const result = await client.callTool({ 137 | name: 'smart_crawl', 138 | arguments: { 139 | url: 'not-a-valid-url', 140 | }, 141 | }); 142 | 143 | const content = (result as ToolResult).content; 144 | expect(content.length).toBeGreaterThanOrEqual(1); 145 | expect(content[0].text).toContain('Error'); 146 | }, 147 | TEST_TIMEOUTS.short, 148 | ); 149 | 150 | it( 151 | 'should handle non-existent domains', 152 | async () => { 153 | const result = await client.callTool({ 154 | name: 'smart_crawl', 155 | arguments: { 156 | url: 'https://this-domain-definitely-does-not-exist-123456789.com', 157 | }, 158 | }); 159 | 160 | const content = (result as ToolResult).content; 161 | expect(content.length).toBeGreaterThanOrEqual(1); 162 | expect(content[0].type).toBe('text'); 163 | 164 | const text = content[0].text || ''; 165 | // Non-existent domains cause 500 errors 166 | expect(text).toContain('Error'); 167 | }, 168 | TEST_TIMEOUTS.short, 169 | ); 170 | 171 | it( 172 | 'should reject session_id parameter', 173 | async () => { 174 | const result = await client.callTool({ 175 | name: 'smart_crawl', 176 | arguments: { 177 | url: 'https://httpbin.org/html', 178 | session_id: 'test-session', 179 | }, 180 | }); 181 | 182 | const content = (result as ToolResult).content; 183 | expect(content.length).toBeGreaterThanOrEqual(1); 184 | expect(content[0].type).toBe('text'); 185 | expect(content[0].text).toContain('session_id'); 186 | expect(content[0].text).toContain('does not support'); 187 | expect(content[0].text).toContain('stateless'); 188 | }, 189 | TEST_TIMEOUTS.short, 190 | ); 191 | }); 192 | }); 193 | ``` -------------------------------------------------------------------------------- /src/handlers/session-handlers.ts: -------------------------------------------------------------------------------- ```typescript 1 | import { BaseHandler } from './base-handler.js'; 2 | 3 | export class SessionHandlers extends BaseHandler { 4 | async manageSession(options: { 5 | action: 'create' | 'clear' | 'list'; 6 | session_id?: string; 7 | initial_url?: string; 8 | browser_type?: string; 9 | }) { 10 | switch (options.action) { 11 | case 'create': 12 | return this.createSession({ 13 | session_id: options.session_id, 14 | initial_url: options.initial_url, 15 | browser_type: options.browser_type, 16 | }); 17 | case 'clear': 18 | if (!options.session_id) { 19 | throw new Error('session_id is required for clear action'); 20 | } 21 | return this.clearSession({ session_id: options.session_id }); 22 | case 'list': 23 | return this.listSessions(); 24 | default: 25 | // This should never happen due to TypeScript types, but handle it for runtime safety 26 | throw new Error(`Invalid action: ${(options as { action: string }).action}`); 27 | } 28 | } 29 | 30 | private async createSession(options: { session_id?: string; initial_url?: string; browser_type?: string }) { 31 | try { 32 | // Generate session ID if not provided 33 | const sessionId = options.session_id || `session-${Date.now()}-${Math.random().toString(36).substring(2, 11)}`; 34 | 35 | // Store session info locally 36 | 
this.sessions.set(sessionId, { 37 | id: sessionId, 38 | created_at: new Date(), 39 | last_used: new Date(), 40 | initial_url: options.initial_url, 41 | metadata: { 42 | browser_type: options.browser_type || 'chromium', 43 | }, 44 | }); 45 | 46 | // If initial_url provided, make first crawl to establish session 47 | if (options.initial_url) { 48 | try { 49 | await this.axiosClient.post( 50 | '/crawl', 51 | { 52 | urls: [options.initial_url], 53 | browser_config: { 54 | headless: true, 55 | browser_type: options.browser_type || 'chromium', 56 | }, 57 | crawler_config: { 58 | session_id: sessionId, 59 | cache_mode: 'BYPASS', 60 | }, 61 | }, 62 | { 63 | timeout: 30000, // 30 second timeout for initial crawl 64 | }, 65 | ); 66 | 67 | // Update last_used 68 | const session = this.sessions.get(sessionId); 69 | if (session) { 70 | session.last_used = new Date(); 71 | } 72 | } catch (error) { 73 | // Session created but initial crawl failed - still return success 74 | console.error(`Initial crawl failed for session ${sessionId}:`, error); 75 | } 76 | } 77 | 78 | return { 79 | content: [ 80 | { 81 | type: 'text', 82 | text: `Session created successfully:\nSession ID: ${sessionId}\nBrowser: ${options.browser_type || 'chromium'}\n${options.initial_url ? `Pre-warmed with: ${options.initial_url}` : 'Ready for use'}\n\nUse this session_id with the crawl tool to maintain state across requests.`, 83 | }, 84 | ], 85 | // Include all session parameters for easier programmatic access 86 | session_id: sessionId, 87 | browser_type: options.browser_type || 'chromium', 88 | initial_url: options.initial_url, 89 | created_at: this.sessions.get(sessionId)?.created_at.toISOString(), 90 | }; 91 | } catch (error) { 92 | throw this.formatError(error, 'create session'); 93 | } 94 | } 95 | 96 | private async clearSession(options: { session_id: string }) { 97 | try { 98 | // Remove from local store 99 | const deleted = this.sessions.delete(options.session_id); 100 | 101 | // Note: The actual browser session in Crawl4AI will be cleaned up 102 | // automatically after inactivity or when the server restarts 103 | 104 | return { 105 | content: [ 106 | { 107 | type: 'text', 108 | text: deleted 109 | ? 
`Session cleared successfully: ${options.session_id}` 110 | : `Session not found: ${options.session_id}`, 111 | }, 112 | ], 113 | }; 114 | } catch (error) { 115 | throw this.formatError(error, 'clear session'); 116 | } 117 | } 118 | 119 | private async listSessions() { 120 | try { 121 | // Return locally stored sessions 122 | const sessions = Array.from(this.sessions.entries()).map(([id, info]) => { 123 | const ageMinutes = Math.floor((Date.now() - info.created_at.getTime()) / 60000); 124 | const lastUsedMinutes = Math.floor((Date.now() - info.last_used.getTime()) / 60000); 125 | 126 | return { 127 | session_id: id, 128 | created_at: info.created_at.toISOString(), 129 | last_used: info.last_used.toISOString(), 130 | age_minutes: ageMinutes, 131 | last_used_minutes_ago: lastUsedMinutes, 132 | initial_url: info.initial_url, 133 | browser_type: info.metadata?.browser_type || 'chromium', 134 | }; 135 | }); 136 | 137 | if (sessions.length === 0) { 138 | return { 139 | content: [ 140 | { 141 | type: 'text', 142 | text: 'No active sessions found.', 143 | }, 144 | ], 145 | }; 146 | } 147 | 148 | const sessionList = sessions 149 | .map( 150 | (session) => 151 | `- ${session.session_id} (${session.browser_type}, created ${session.age_minutes}m ago, last used ${session.last_used_minutes_ago}m ago)`, 152 | ) 153 | .join('\n'); 154 | 155 | return { 156 | content: [ 157 | { 158 | type: 'text', 159 | text: `Active sessions (${sessions.length}):\n${sessionList}`, 160 | }, 161 | ], 162 | }; 163 | } catch (error) { 164 | throw this.formatError(error, 'list sessions'); 165 | } 166 | } 167 | } 168 | ``` -------------------------------------------------------------------------------- /src/__tests__/utils/javascript-validation.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { describe, it, expect } from '@jest/globals'; 3 | import { validateJavaScriptCode } from '../../schemas/helpers.js'; 4 | 5 | describe('JavaScript Code Validation', () => { 6 | describe('Valid JavaScript', () => { 7 | it('should accept simple JavaScript code', () => { 8 | expect(validateJavaScriptCode('console.log("Hello world")')).toBe(true); 9 | expect(validateJavaScriptCode('return document.title')).toBe(true); 10 | expect(validateJavaScriptCode('const x = 5; return x * 2;')).toBe(true); 11 | }); 12 | 13 | it('should accept JavaScript with real newlines', () => { 14 | expect(validateJavaScriptCode('console.log("Hello");\nconsole.log("World");')).toBe(true); 15 | expect(validateJavaScriptCode('function test() {\n return true;\n}')).toBe(true); 16 | }); 17 | 18 | it('should accept JavaScript with escape sequences in strings', () => { 19 | expect(validateJavaScriptCode('console.log("Line 1\\nLine 2")')).toBe(true); 20 | expect(validateJavaScriptCode('const msg = "Tab\\there\\tand\\tthere"')).toBe(true); 21 | expect(validateJavaScriptCode('return "Quote: \\"Hello\\""')).toBe(true); 22 | }); 23 | 24 | it('should accept complex JavaScript patterns', () => { 25 | const complexCode = ` 26 | const elements = document.querySelectorAll('.item'); 27 | elements.forEach((el, i) => { 28 | el.textContent = \`Item \${i + 1}\`; 29 | }); 30 | return elements.length; 31 | `; 32 | expect(validateJavaScriptCode(complexCode)).toBe(true); 33 | }); 34 | 35 | it('should accept JavaScript with regex patterns', () => { 36 | expect(validateJavaScriptCode('return /test\\d+/.test(str)')).toBe(true); 37 | expect(validateJavaScriptCode('const pattern = 
/\\w+@\\w+\\.\\w+/')).toBe(true);
38 |     });
39 |   });
40 | 
41 |   describe('Invalid JavaScript - HTML Entities', () => {
42 |     it('should reject code with HTML entities', () => {
43 |       expect(validateJavaScriptCode('console.log(&quot;Hello&quot;)')).toBe(false);
44 |       expect(validateJavaScriptCode('const x = &amp;&amp; true')).toBe(false);
45 |       expect(validateJavaScriptCode('if (x &lt; 5) return')).toBe(false);
46 |       expect(validateJavaScriptCode('if (x &gt; 5) return')).toBe(false);
47 |     });
48 | 
49 |     it('should reject code with numeric HTML entities', () => {
50 |       expect(validateJavaScriptCode('const char = &#65;')).toBe(false);
51 |       // Note: hex entities like &#x41; are not caught by the current regex
52 |     });
53 | 
54 |     it('should reject code with named HTML entities', () => {
55 |       expect(validateJavaScriptCode('const copy = &copy;')).toBe(false);
56 |       expect(validateJavaScriptCode('const nbsp = &nbsp;')).toBe(false);
57 |     });
58 |   });
59 | 
60 |   describe('Invalid JavaScript - HTML Tags', () => {
61 |     it('should reject HTML markup', () => {
62 |       expect(validateJavaScriptCode('<!DOCTYPE html>')).toBe(false);
63 |       expect(validateJavaScriptCode('<html><body>test</body></html>')).toBe(false);
64 |       expect(validateJavaScriptCode('<script>alert("test")</script>')).toBe(false);
65 |       expect(validateJavaScriptCode('<style>body { color: red; }</style>')).toBe(false);
66 |     });
67 | 
68 |     it('should reject mixed HTML and JavaScript', () => {
69 |       expect(validateJavaScriptCode('<head>\nconst x = 5;\n</head>')).toBe(false);
70 |       expect(validateJavaScriptCode('console.log("test");\n<body>')).toBe(false);
71 |     });
72 |   });
73 | 
74 |   describe('Invalid JavaScript - Literal Escape Sequences', () => {
75 |     it('should reject literal \\n outside of strings', () => {
76 |       expect(validateJavaScriptCode('console.log("Hello");\\nconsole.log("World");')).toBe(false);
77 |       expect(validateJavaScriptCode('const x = 5;\\nreturn x;')).toBe(false);
78 |       expect(validateJavaScriptCode('if (true) {\\n return;\\n}')).toBe(false);
79 |     });
80 | 
81 |     it('should reject literal \\n in various positions', () => {
82 |       expect(validateJavaScriptCode('}\\nfunction')).toBe(false);
83 |       expect(validateJavaScriptCode(');\\nconst')).toBe(false);
84 |       expect(validateJavaScriptCode('\\n{')).toBe(false);
85 |       expect(validateJavaScriptCode('\\n(')).toBe(false);
86 |     });
87 | 
88 |     it('should reject literal \\n between statements', () => {
89 |       expect(validateJavaScriptCode('const x = 5;\\nconst y = 10;')).toBe(false);
90 |       expect(validateJavaScriptCode('doSomething();\\ndoAnother();')).toBe(false);
91 |     });
92 |   });
93 | 
94 |   describe('Edge Cases', () => {
95 |     it('should handle empty strings', () => {
96 |       expect(validateJavaScriptCode('')).toBe(true);
97 |     });
98 | 
99 |     it('should handle whitespace-only strings', () => {
100 |       expect(validateJavaScriptCode(' ')).toBe(true);
101 |       expect(validateJavaScriptCode('\n\n\n')).toBe(true);
102 |       expect(validateJavaScriptCode('\t\t')).toBe(true);
103 |     });
104 | 
105 |     it('should handle single-line comments', () => {
106 |       expect(validateJavaScriptCode('// This is a comment')).toBe(true);
107 |       expect(validateJavaScriptCode('return 5; // Comment here')).toBe(true);
108 |     });
109 | 
110 |     it('should handle multi-line comments', () => {
111 |       expect(validateJavaScriptCode('/* Multi\nline\ncomment */')).toBe(true);
112 |       expect(validateJavaScriptCode('/* Comment */ return 5;')).toBe(true);
113 |     });
114 | 
115 |     it('should reject HTML tags even in what looks like strings', () => {
116 |       // The current validation is quite strict and rejects HTML tags even
if they appear to be in strings 117 | // This is by design to prevent malformed JavaScript that contains actual HTML 118 | expect(validateJavaScriptCode('const html = "<div>Hello</div>"')).toBe(true); // <div> is ok 119 | expect(validateJavaScriptCode("return '<style>body{}</style>'")).toBe(false); // <style> is rejected 120 | }); 121 | }); 122 | }); 123 | ``` -------------------------------------------------------------------------------- /src/__tests__/handlers/utility-handlers.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { jest } from '@jest/globals'; 3 | import type { UtilityHandlers } from '../../handlers/utility-handlers.js'; 4 | import type { Crawl4AIService } from '../../crawl4ai-service.js'; 5 | 6 | // Mock the service 7 | const mockCrawl = jest.fn(); 8 | const mockService = { 9 | crawl: mockCrawl, 10 | } as unknown as Crawl4AIService; 11 | 12 | // Mock axios client 13 | const mockPost = jest.fn(); 14 | const mockAxiosClient = { 15 | post: mockPost, 16 | } as unknown; 17 | 18 | // Import after setting up mocks 19 | const { UtilityHandlers: UtilityHandlersClass } = await import('../../handlers/utility-handlers.js'); 20 | 21 | describe('UtilityHandlers', () => { 22 | let handler: UtilityHandlers; 23 | let sessions: Map<string, unknown>; 24 | 25 | beforeEach(() => { 26 | jest.clearAllMocks(); 27 | sessions = new Map(); 28 | handler = new UtilityHandlersClass(mockService, mockAxiosClient, sessions); 29 | }); 30 | 31 | describe('extractLinks', () => { 32 | it('should manually extract links from markdown when API returns empty links', async () => { 33 | // Mock crawl response with empty links but markdown containing href attributes 34 | mockPost.mockResolvedValue({ 35 | data: { 36 | results: [ 37 | { 38 | success: true, 39 | links: { 40 | internal: [], 41 | external: [], 42 | }, 43 | markdown: { 44 | raw_markdown: ` 45 | # Test Page 46 | 47 | Here are some links: 48 | <a href="https://example.com/page1">Internal Link</a> 49 | <a href="https://external.com/page">External Link</a> 50 | <a href="/relative/path">Relative Link</a> 51 | <a href='https://example.com/page2'>Another Internal</a> 52 | `, 53 | }, 54 | }, 55 | ], 56 | }, 57 | }); 58 | 59 | const result = await handler.extractLinks({ 60 | url: 'https://example.com', 61 | categorize: true, 62 | }); 63 | 64 | // Should have manually extracted and categorized links 65 | expect(result.content[0].type).toBe('text'); 66 | expect(result.content[0].text).toContain('Link analysis for https://example.com'); 67 | expect(result.content[0].text).toContain('internal (3)'); 68 | expect(result.content[0].text).toContain('https://example.com/page1'); 69 | expect(result.content[0].text).toContain('https://example.com/page2'); 70 | expect(result.content[0].text).toContain('https://example.com/relative/path'); 71 | expect(result.content[0].text).toContain('external (1)'); 72 | expect(result.content[0].text).toContain('https://external.com/page'); 73 | }); 74 | 75 | it('should handle manual extraction without categorization', async () => { 76 | // Mock crawl response with empty links 77 | mockPost.mockResolvedValue({ 78 | data: { 79 | results: [ 80 | { 81 | success: true, 82 | links: { 83 | internal: [], 84 | external: [], 85 | }, 86 | markdown: { 87 | raw_markdown: `<a href="https://example.com/page1">Link 1</a> 88 | <a href="https://external.com/page">Link 2</a>`, 89 | }, 90 | }, 91 | ], 92 | }, 93 | }); 94 | 95 | const result = await handler.extractLinks({ 
96 | url: 'https://example.com', 97 | categorize: false, 98 | }); 99 | 100 | // Should show all links without categorization 101 | expect(result.content[0].text).toContain('All links from https://example.com'); 102 | expect(result.content[0].text).toContain('https://example.com/page1'); 103 | expect(result.content[0].text).toContain('https://external.com/page'); 104 | expect(result.content[0].text).not.toContain('Internal links:'); 105 | }); 106 | 107 | it('should handle malformed URLs during manual extraction', async () => { 108 | // Mock crawl response with a malformed URL in href 109 | mockPost.mockResolvedValue({ 110 | data: { 111 | results: [ 112 | { 113 | success: true, 114 | links: { 115 | internal: [], 116 | external: [], 117 | }, 118 | markdown: { 119 | raw_markdown: `<a href="javascript:void(0)">JS Link</a> 120 | <a href="https://example.com/valid">Valid Link</a> 121 | <a href="not-a-url">Invalid URL</a>`, 122 | }, 123 | }, 124 | ], 125 | }, 126 | }); 127 | 128 | const result = await handler.extractLinks({ 129 | url: 'https://example.com', 130 | categorize: true, 131 | }); 132 | 133 | // Should handle invalid URLs gracefully 134 | expect(result.content[0].text).toContain('https://example.com/valid'); 135 | // Invalid URLs should be treated as relative links 136 | expect(result.content[0].text).toContain('not-a-url'); 137 | expect(result.content[0].text).toContain('javascript:void(0)'); 138 | }); 139 | 140 | it('should return empty results when no links found', async () => { 141 | // Mock crawl response with no links 142 | mockPost.mockResolvedValue({ 143 | data: { 144 | results: [ 145 | { 146 | success: true, 147 | links: { 148 | internal: [], 149 | external: [], 150 | }, 151 | markdown: { 152 | raw_markdown: 'Just plain text without any links', 153 | }, 154 | }, 155 | ], 156 | }, 157 | }); 158 | 159 | const result = await handler.extractLinks({ 160 | url: 'https://example.com', 161 | categorize: true, 162 | }); 163 | 164 | // Should show empty categories 165 | expect(result.content[0].text).toContain('Link analysis for https://example.com'); 166 | expect(result.content[0].text).toContain('internal (0)'); 167 | expect(result.content[0].text).toContain('external (0)'); 168 | }); 169 | }); 170 | }); 171 | ``` -------------------------------------------------------------------------------- /src/__tests__/index.cli.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | // import { jest } from '@jest/globals'; 2 | import { spawn } from 'child_process'; 3 | import * as path from 'path'; 4 | import * as url from 'url'; 5 | 6 | const __dirname = url.fileURLToPath(new URL('.', import.meta.url)); 7 | 8 | describe('CLI Entry Point', () => { 9 | const cliPath = path.join(__dirname, '..', '..', 'src', 'index.ts'); 10 | 11 | // Helper to run CLI with given env vars 12 | const runCLI = ( 13 | env: Record<string, string> = {}, 14 | ): Promise<{ code: number | null; stdout: string; stderr: string }> => { 15 | return new Promise((resolve) => { 16 | const child = spawn('tsx', [cliPath], { 17 | env: { ...process.env, ...env }, 18 | stdio: 'pipe', 19 | }); 20 | 21 | let stdout = ''; 22 | let stderr = ''; 23 | 24 | child.stdout.on('data', (data) => { 25 | stdout += data.toString(); 26 | }); 27 | 28 | child.stderr.on('data', (data) => { 29 | stderr += data.toString(); 30 | }); 31 | 32 | child.on('close', (code) => { 33 | resolve({ code, stdout, stderr }); 34 | }); 35 | 36 | // Kill after 2 seconds to prevent hanging 37 | setTimeout(() => { 38 | 
child.kill(); 39 | }, 2000); 40 | }); 41 | }; 42 | 43 | describe('Environment Variable Validation', () => { 44 | it('should exit with code 1 when CRAWL4AI_BASE_URL is missing', async () => { 45 | const { code, stderr } = await runCLI({ 46 | CRAWL4AI_BASE_URL: '', 47 | }); 48 | 49 | expect(code).toBe(1); 50 | expect(stderr).toContain('Error: CRAWL4AI_BASE_URL environment variable is required'); 51 | expect(stderr).toContain('Please set it to your Crawl4AI server URL'); 52 | }); 53 | 54 | it('should start successfully with valid CRAWL4AI_BASE_URL', async () => { 55 | const { code, stderr } = await runCLI({ 56 | CRAWL4AI_BASE_URL: 'http://localhost:11235', 57 | CRAWL4AI_API_KEY: 'test-key', 58 | }); 59 | 60 | // Process should be killed by timeout, not exit with error 61 | expect(code).not.toBe(1); 62 | // MCP servers output to stderr 63 | expect(stderr).toContain('crawl4ai-mcp'); 64 | }); 65 | 66 | it('should use default values for optional env vars', async () => { 67 | const { stderr } = await runCLI({ 68 | CRAWL4AI_BASE_URL: 'http://localhost:11235', 69 | // No API_KEY, SERVER_NAME, or SERVER_VERSION 70 | }); 71 | 72 | expect(stderr).toContain('crawl4ai-mcp'); // default server name 73 | expect(stderr).toContain('1.0.0'); // default version 74 | }); 75 | 76 | it('should use custom SERVER_NAME and SERVER_VERSION when provided', async () => { 77 | const { stderr } = await runCLI({ 78 | CRAWL4AI_BASE_URL: 'http://localhost:11235', 79 | SERVER_NAME: 'custom-server', 80 | SERVER_VERSION: '2.0.0', 81 | }); 82 | 83 | expect(stderr).toContain('custom-server'); 84 | expect(stderr).toContain('2.0.0'); 85 | }); 86 | }); 87 | 88 | describe('Signal Handling', () => { 89 | it('should handle SIGTERM gracefully', async () => { 90 | const child = spawn('tsx', [cliPath], { 91 | env: { 92 | ...process.env, 93 | CRAWL4AI_BASE_URL: 'http://localhost:11235', 94 | }, 95 | stdio: 'pipe', 96 | }); 97 | 98 | // Wait for startup 99 | await new Promise((resolve) => setTimeout(resolve, 500)); 100 | 101 | // Send SIGTERM 102 | child.kill('SIGTERM'); 103 | 104 | const code = await new Promise<number | null>((resolve, reject) => { 105 | const timeout = setTimeout(() => { 106 | child.kill('SIGKILL'); 107 | reject(new Error('Process did not exit in time')); 108 | }, 5000); 109 | 110 | child.on('close', (exitCode) => { 111 | clearTimeout(timeout); 112 | resolve(exitCode); 113 | }); 114 | }); 115 | 116 | // Should exit with signal code 117 | expect(code).toBe(143); // 128 + 15 (SIGTERM) 118 | 119 | // Ensure cleanup 120 | child.kill(); 121 | }, 10000); 122 | 123 | it('should handle SIGINT gracefully', async () => { 124 | const child = spawn('tsx', [cliPath], { 125 | env: { 126 | ...process.env, 127 | CRAWL4AI_BASE_URL: 'http://localhost:11235', 128 | }, 129 | stdio: 'pipe', 130 | }); 131 | 132 | // Wait for startup 133 | await new Promise((resolve) => setTimeout(resolve, 500)); 134 | 135 | // Send SIGINT (Ctrl+C) 136 | child.kill('SIGINT'); 137 | 138 | const code = await new Promise<number | null>((resolve, reject) => { 139 | const timeout = setTimeout(() => { 140 | child.kill('SIGKILL'); 141 | reject(new Error('Process did not exit in time')); 142 | }, 5000); 143 | 144 | child.on('close', (exitCode) => { 145 | clearTimeout(timeout); 146 | resolve(exitCode); 147 | }); 148 | }); 149 | 150 | // Should exit with signal code 151 | expect(code).toBe(130); // 128 + 2 (SIGINT) 152 | 153 | // Ensure cleanup 154 | child.kill(); 155 | }, 10000); 156 | }); 157 | 158 | describe('Error Handling', () => { 159 | it('should handle server 
startup errors', async () => { 160 | // This will be tricky to test without mocking, but we can at least 161 | // verify the process starts and attempts to connect 162 | const { code, stdout, stderr } = await runCLI({ 163 | CRAWL4AI_BASE_URL: 'http://invalid-host-that-does-not-exist:99999', 164 | }); 165 | 166 | // Should not exit with code 1 (that's for missing env vars) 167 | expect(code).not.toBe(1); 168 | // But might log connection errors 169 | const output = stdout + stderr; 170 | expect(output).toBeTruthy(); 171 | }); 172 | }); 173 | 174 | describe('dotenv Loading', () => { 175 | it('should load .env file if present', async () => { 176 | // Create a temporary .env file 177 | const fs = await import('fs/promises'); 178 | const envPath = path.join(__dirname, '..', '..', '.env.test'); 179 | 180 | await fs.writeFile(envPath, 'TEST_ENV_VAR=loaded_from_file\n'); 181 | 182 | try { 183 | const { stderr } = await runCLI({ 184 | CRAWL4AI_BASE_URL: 'http://localhost:11235', 185 | NODE_ENV: 'test', 186 | DOTENV_CONFIG_PATH: envPath, 187 | }); 188 | 189 | // Verify the server starts (dotenv loaded successfully) 190 | expect(stderr).toContain('crawl4ai-mcp'); 191 | } finally { 192 | // Clean up 193 | await fs.unlink(envPath).catch(() => {}); 194 | } 195 | }); 196 | }); 197 | }); 198 | ``` -------------------------------------------------------------------------------- /src/__tests__/integration/crawl-recursive.integration.test.ts: -------------------------------------------------------------------------------- ```typescript 1 | /* eslint-env jest */ 2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js'; 3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js'; 4 | 5 | interface ToolResult { 6 | content: Array<{ 7 | type: string; 8 | text?: string; 9 | }>; 10 | } 11 | 12 | describe('crawl_recursive Integration Tests', () => { 13 | let client: Client; 14 | 15 | beforeAll(async () => { 16 | client = await createTestClient(); 17 | }, TEST_TIMEOUTS.medium); 18 | 19 | afterAll(async () => { 20 | if (client) { 21 | await cleanupTestClient(client); 22 | } 23 | }); 24 | 25 | describe('Basic functionality', () => { 26 | it( 27 | 'should crawl a site recursively with default settings', 28 | async () => { 29 | const result = await client.callTool({ 30 | name: 'crawl_recursive', 31 | arguments: { 32 | url: 'https://httpbin.org/links/5/0', 33 | }, 34 | }); 35 | 36 | expect(result).toBeDefined(); 37 | const content = (result as ToolResult).content; 38 | expect(content).toBeDefined(); 39 | expect(Array.isArray(content)).toBe(true); 40 | expect(content.length).toBeGreaterThan(0); 41 | 42 | const textContent = content.find((c) => c.type === 'text'); 43 | expect(textContent).toBeDefined(); 44 | expect(textContent?.text).toContain('Recursive crawl completed'); 45 | expect(textContent?.text).toContain('Pages crawled:'); 46 | expect(textContent?.text).toContain('Max depth reached:'); 47 | expect(textContent?.text).toContain('Only internal links'); 48 | // Should have found multiple pages since httpbin.org/links/5/0 has internal links 49 | expect(textContent?.text).toMatch(/Pages crawled: [2-9]|[1-9][0-9]/); 50 | }, 51 | TEST_TIMEOUTS.long, 52 | ); 53 | 54 | it( 55 | 'should respect max_depth parameter', 56 | async () => { 57 | const result = await client.callTool({ 58 | name: 'crawl_recursive', 59 | arguments: { 60 | url: 'https://httpbin.org/links/10/0', 61 | max_depth: 1, 62 | max_pages: 5, 63 | }, 64 | }); 65 | 66 | expect(result).toBeDefined(); 67 | const content = 
(result as ToolResult).content;
68 |         const textContent = content.find((c) => c.type === 'text');
69 |         expect(textContent).toBeDefined();
70 |         expect(textContent?.text).toContain('Max depth reached: ');
71 |         expect(textContent?.text).toMatch(/Max depth reached: [0-1] \(limit: 1\)/);
72 |         // With max_depth=1, should find some pages but not go too deep
73 |         expect(textContent?.text).toMatch(/Pages crawled: [1-5]/);
74 |       },
75 |       TEST_TIMEOUTS.long,
76 |     );
77 | 
78 |     it(
79 |       'should apply include pattern filter',
80 |       async () => {
81 |         const result = await client.callTool({
82 |           name: 'crawl_recursive',
83 |           arguments: {
84 |             url: 'https://httpbin.org/links/10/0',
85 |             max_depth: 1,
86 |             max_pages: 5,
87 |             include_pattern: '.*/links/[0-9]+/[0-4]$', // Only include links ending with 0-4
88 |           },
89 |         });
90 | 
91 |         expect(result).toBeDefined();
92 |         const content = (result as ToolResult).content;
93 |         const textContent = content.find((c) => c.type === 'text');
94 |         expect(textContent).toBeDefined();
95 | 
96 |         // Check that we have some results
97 |         expect(textContent?.text).toContain('Pages crawled:');
98 | 
99 |         // If we crawled pages, they should match our pattern
100 |         if (textContent?.text && textContent.text.includes('Pages found:')) {
101 |           const pagesSection = textContent.text.split('Pages found:')[1];
102 |           if (pagesSection && pagesSection.trim()) {
103 |             // All URLs should end with /0, /1, /2, /3, or /4
104 |             expect(pagesSection).toMatch(/\/[0-4]\b/);
105 |             // Should NOT have URLs ending with /5, /6, /7, /8, /9
106 |             expect(pagesSection).not.toMatch(/\/[5-9]\b/);
107 |           }
108 |         }
109 |       },
110 |       TEST_TIMEOUTS.long,
111 |     );
112 | 
113 |     it(
114 |       'should apply exclude pattern filter',
115 |       async () => {
116 |         const result = await client.callTool({
117 |           name: 'crawl_recursive',
118 |           arguments: {
119 |             url: 'https://example.com',
120 |             max_depth: 2,
121 |             max_pages: 10,
122 |             exclude_pattern: '.*\\.(pdf|zip|exe)$',
123 |           },
124 |         });
125 | 
126 |         expect(result).toBeDefined();
127 |         const content = (result as ToolResult).content;
128 |         const textContent = content.find((c) => c.type === 'text');
129 |         expect(textContent).toBeDefined();
130 | 
131 |         // Should not have crawled any PDF, ZIP, or EXE files
132 |         expect(textContent?.text).not.toMatch(/\.(pdf|zip|exe)/i);
133 |       },
134 |       TEST_TIMEOUTS.long,
135 |     );
136 |   });
137 | 
138 |   describe('Error handling', () => {
139 |     it(
140 |       'should handle invalid URLs',
141 |       async () => {
142 |         const result = await client.callTool({
143 |           name: 'crawl_recursive',
144 |           arguments: {
145 |             url: 'not-a-url',
146 |           },
147 |         });
148 | 
149 |         expect(result).toBeDefined();
150 |         const content = (result as ToolResult).content;
151 |         expect(content).toBeDefined();
152 |         const textContent = content.find((c) => c.type === 'text');
153 |         expect(textContent).toBeDefined();
154 |         expect(textContent?.text).toContain('Error');
155 |         expect(textContent?.text?.toLowerCase()).toContain('invalid');
156 |       },
157 |       TEST_TIMEOUTS.short,
158 |     );
159 | 
160 |     it(
161 |       'should handle sites with internal links',
162 |       async () => {
163 |         const result = await client.callTool({
164 |           name: 'crawl_recursive',
165 |           arguments: {
166 |             url: 'https://httpbin.org/links/5/0',
167 |             max_depth: 2,
168 |             max_pages: 10,
169 |           },
170 |         });
171 | 
172 |         expect(result).toBeDefined();
173 |         const content = (result as ToolResult).content;
174 |         const textContent = content.find((c) => c.type === 'text');
175 |         expect(textContent).toBeDefined();
176 |         expect(textContent?.text).toContain('Pages crawled:');
177 |         // Should crawl multiple pages since httpbin.org/links/5/0 has 5 internal links
178 |         expect(textContent?.text).toMatch(/Pages crawled: ([2-9]|1[0-9])/);
179 |         expect(textContent?.text).toContain('Internal links found:');
180 |       },
181 |       TEST_TIMEOUTS.medium,
182 |     );
183 |   });
184 | });
185 | 
```

--------------------------------------------------------------------------------
/src/handlers/content-handlers.ts:
--------------------------------------------------------------------------------

```typescript
1 | import { BaseHandler } from './base-handler.js';
2 | import {
3 |   MarkdownEndpointOptions,
4 |   MarkdownEndpointResponse,
5 |   ScreenshotEndpointOptions,
6 |   ScreenshotEndpointResponse,
7 |   PDFEndpointOptions,
8 |   PDFEndpointResponse,
9 |   HTMLEndpointOptions,
10 |   HTMLEndpointResponse,
11 |   FilterType,
12 | } from '../types.js';
13 | import * as fs from 'fs/promises';
14 | import * as path from 'path';
15 | import * as os from 'os';
16 | 
17 | export class ContentHandlers extends BaseHandler {
18 |   async getMarkdown(
19 |     options: Omit<MarkdownEndpointOptions, 'f' | 'q' | 'c'> & { filter?: string; query?: string; cache?: string },
20 |   ) {
21 |     try {
22 |       // Map from schema property names to API parameter names
23 |       const result: MarkdownEndpointResponse = await this.service.getMarkdown({
24 |         url: options.url,
25 |         f: options.filter as FilterType | undefined, // Schema provides 'filter', API expects 'f'
26 |         q: options.query, // Schema provides 'query', API expects 'q'
27 |         c: options.cache, // Schema provides 'cache', API expects 'c'
28 |       });
29 | 
30 |       // Format the response
31 |       let formattedText = `URL: ${result.url}\nFilter: ${result.filter}`;
32 | 
33 |       if (result.query) {
34 |         formattedText += `\nQuery: ${result.query}`;
35 |       }
36 | 
37 |       formattedText += `\nCache: ${result.cache}\n\nMarkdown:\n${result.markdown || 'No content found.'}`;
38 | 
39 |       return {
40 |         content: [
41 |           {
42 |             type: 'text',
43 |             text: formattedText,
44 |           },
45 |         ],
46 |       };
47 |     } catch (error) {
48 |       throw this.formatError(error, 'get markdown');
49 |     }
50 |   }
51 | 
52 |   async captureScreenshot(options: ScreenshotEndpointOptions) {
53 |     try {
54 |       const result: ScreenshotEndpointResponse = await this.service.captureScreenshot(options);
55 | 
56 |       // Response has { success: true, screenshot: "base64string" }
57 |       if (!result.success || !result.screenshot) {
58 |         throw new Error('Screenshot capture failed - no screenshot data in response');
59 |       }
60 | 
61 |       let savedFilePath: string | undefined;
62 | 
63 |       // Save to local directory if requested
64 |       if (options.save_to_directory) {
65 |         try {
66 |           // Resolve home directory path
67 |           let resolvedPath = options.save_to_directory;
68 |           if (resolvedPath.startsWith('~')) {
69 |             const homedir = os.homedir();
70 |             resolvedPath = path.join(homedir, resolvedPath.slice(1));
71 |           }
72 | 
73 |           // Check if user provided a file path instead of directory
74 |           if (resolvedPath.endsWith('.png') || resolvedPath.endsWith('.jpg')) {
75 |             console.warn(
76 |               `Warning: save_to_directory should be a directory path, not a file path. Using parent directory.`,
77 |             );
78 |             resolvedPath = path.dirname(resolvedPath);
79 |           }
80 | 
81 |           // Ensure directory exists
82 |           await fs.mkdir(resolvedPath, { recursive: true });
83 | 
84 |           // Generate filename from URL and timestamp
85 |           const url = new URL(options.url);
86 |           const hostname = url.hostname.replace(/[^a-z0-9]/gi, '-');
87 |           const timestamp = new Date().toISOString().replace(/[:.]/g, '-').slice(0, -5);
88 |           const filename = `${hostname}-${timestamp}.png`;
89 | 
90 |           savedFilePath = path.join(resolvedPath, filename);
91 | 
92 |           // Convert base64 to buffer and save
93 |           const buffer = Buffer.from(result.screenshot, 'base64');
94 |           await fs.writeFile(savedFilePath, buffer);
95 |         } catch (saveError) {
96 |           // Log error but don't fail the operation
97 |           console.error('Failed to save screenshot locally:', saveError);
98 |         }
99 |       }
100 | 
101 |       const textContent = savedFilePath
102 |         ? `Screenshot captured for: ${options.url}\nSaved to: ${savedFilePath}`
103 |         : `Screenshot captured for: ${options.url}`;
104 | 
105 |       // If saved locally and screenshot is large (>800KB), don't return the base64 data
106 |       const screenshotSize = Buffer.from(result.screenshot, 'base64').length;
107 |       const shouldReturnImage = !savedFilePath || screenshotSize < 800 * 1024; // 800KB threshold
108 | 
109 |       const content = [];
110 | 
111 |       if (shouldReturnImage) {
112 |         content.push({
113 |           type: 'image',
114 |           data: result.screenshot,
115 |           mimeType: 'image/png',
116 |         });
117 |       }
118 | 
119 |       content.push({
120 |         type: 'text',
121 |         text: shouldReturnImage
122 |           ? textContent
123 |           : `${textContent}\n\nNote: Screenshot data not returned due to size (${Math.round(screenshotSize / 1024)}KB). View the saved file instead.`,
124 |       });
125 | 
126 |       return { content };
127 |     } catch (error) {
128 |       throw this.formatError(error, 'capture screenshot');
129 |     }
130 |   }
131 | 
132 |   async generatePDF(options: PDFEndpointOptions) {
133 |     try {
134 |       const result: PDFEndpointResponse = await this.service.generatePDF(options);
135 | 
136 |       // Response has { success: true, pdf: "base64string" }
137 |       if (!result.success || !result.pdf) {
138 |         throw new Error('PDF generation failed - no PDF data in response');
139 |       }
140 | 
141 |       return {
142 |         content: [
143 |           {
144 |             type: 'resource',
145 |             resource: {
146 |               uri: `data:application/pdf;name=${encodeURIComponent(new URL(String(options.url)).hostname)}.pdf;base64,${result.pdf}`,
147 |               mimeType: 'application/pdf',
148 |               blob: result.pdf,
149 |             },
150 |           },
151 |           {
152 |             type: 'text',
153 |             text: `PDF generated for: ${options.url}`,
154 |           },
155 |         ],
156 |       };
157 |     } catch (error) {
158 |       throw this.formatError(error, 'generate PDF');
159 |     }
160 |   }
161 | 
162 |   async getHTML(options: HTMLEndpointOptions) {
163 |     try {
164 |       const result: HTMLEndpointResponse = await this.service.getHTML(options);
165 | 
166 |       // Response has { html: string, url: string, success: true }
167 |       return {
168 |         content: [
169 |           {
170 |             type: 'text',
171 |             text: result.html || '',
172 |           },
173 |         ],
174 |       };
175 |     } catch (error) {
176 |       throw this.formatError(error, 'get HTML');
177 |     }
178 |   }
179 | 
180 |   async extractWithLLM(options: { url: string; query: string }) {
181 |     try {
182 |       const result = await this.service.extractWithLLM(options);
183 | 
184 |       return {
185 |         content: [
186 |           {
187 |             type: 'text',
188 |             text: result.answer,
189 |           },
190 |         ],
191 |       };
192 |     } catch (error) {
193 |       throw this.formatError(error, 'extract with LLM');
194 |     }
195 |   }
196 | }
197 | 
```
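
Note: `getMarkdown` above renames the MCP-facing arguments before calling the Crawl4AI markdown endpoint. The sketch below is illustrative only and not part of the repository; the `f`/`q`/`c` parameter names come from the inline comments in `content-handlers.ts`, while `GetMarkdownArgs` and `toMarkdownEndpointParams` are hypothetical names for this example.

```typescript
// Illustrative sketch (not repo code): the schema-to-API mapping that
// ContentHandlers.getMarkdown performs before hitting the Crawl4AI API.
type GetMarkdownArgs = { url: string; filter?: string; query?: string; cache?: string };

function toMarkdownEndpointParams(args: GetMarkdownArgs) {
  return {
    url: args.url,
    f: args.filter, // 'filter' in the MCP schema -> 'f' for the API (fit | raw | bm25 | llm)
    q: args.query, // 'query' in the MCP schema -> 'q' for the API (used by bm25/llm filters)
    c: args.cache, // 'cache' in the MCP schema -> 'c' for the API
  };
}
```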
--------------------------------------------------------------------------------
/src/__tests__/integration/get-markdown.integration.test.ts:
--------------------------------------------------------------------------------

```typescript
1 | /* eslint-env jest */
2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js';
3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js';
4 | 
5 | interface ToolResult {
6 |   content: Array<{
7 |     type: string;
8 |     text?: string;
9 |   }>;
10 | }
11 | 
12 | describe('get_markdown Integration Tests', () => {
13 |   let client: Client;
14 | 
15 |   beforeAll(async () => {
16 |     client = await createTestClient();
17 |   }, TEST_TIMEOUTS.medium);
18 | 
19 |   afterAll(async () => {
20 |     if (client) {
21 |       await cleanupTestClient(client);
22 |     }
23 |   });
24 | 
25 |   describe('Markdown extraction', () => {
26 |     it(
27 |       'should extract markdown with default fit filter',
28 |       async () => {
29 |         const result = await client.callTool({
30 |           name: 'get_markdown',
31 |           arguments: {
32 |             url: 'https://httpbin.org/html',
33 |           },
34 |         });
35 | 
36 |         expect(result).toBeDefined();
37 |         const content = (result as ToolResult).content;
38 |         expect(content).toHaveLength(1);
39 |         expect(content[0].type).toBe('text');
40 | 
41 |         const text = content[0].text || '';
42 |         expect(text).toContain('URL: https://httpbin.org/html');
43 |         expect(text).toContain('Filter: fit');
44 |         expect(text).toContain('Markdown:');
45 |       },
46 |       TEST_TIMEOUTS.medium,
47 |     );
48 | 
49 |     it(
50 |       'should extract markdown with raw filter',
51 |       async () => {
52 |         const result = await client.callTool({
53 |           name: 'get_markdown',
54 |           arguments: {
55 |             url: 'https://httpbin.org/html',
56 |             filter: 'raw',
57 |           },
58 |         });
59 | 
60 |         const content = (result as ToolResult).content;
61 |         expect(content).toHaveLength(1);
62 |         expect(content[0].type).toBe('text');
63 | 
64 |         const text = content[0].text || '';
65 |         expect(text).toContain('Filter: raw');
66 |       },
67 |       TEST_TIMEOUTS.medium,
68 |     );
69 | 
70 |     it(
71 |       'should extract markdown with bm25 filter and query',
72 |       async () => {
73 |         const result = await client.callTool({
74 |           name: 'get_markdown',
75 |           arguments: {
76 |             url: 'https://httpbin.org/html',
77 |             filter: 'bm25',
78 |             query: 'Herman Melville',
79 |           },
80 |         });
81 | 
82 |         const content = (result as ToolResult).content;
83 |         expect(content).toHaveLength(1);
84 |         expect(content[0].type).toBe('text');
85 | 
86 |         const text = content[0].text || '';
87 |         expect(text).toContain('Filter: bm25');
88 |         expect(text).toContain('Query: Herman Melville');
89 |       },
90 |       TEST_TIMEOUTS.medium,
91 |     );
92 | 
93 |     it(
94 |       'should extract markdown with llm filter and query',
95 |       async () => {
96 |         const result = await client.callTool({
97 |           name: 'get_markdown',
98 |           arguments: {
99 |             url: 'https://httpbin.org/html',
100 |             filter: 'llm',
101 |             query: 'What is this page about?',
102 |           },
103 |         });
104 | 
105 |         const content = (result as ToolResult).content;
106 |         expect(content).toHaveLength(1);
107 |         expect(content[0].type).toBe('text');
108 | 
109 |         const text = content[0].text || '';
110 |         expect(text).toContain('Filter: llm');
111 |         expect(text).toContain('Query: What is this page about?');
112 |       },
113 |       TEST_TIMEOUTS.medium,
114 |     );
115 | 
116 |     it(
117 |       'should use cache parameter',
118 |       async () => {
119 |         const result = await client.callTool({
120 |           name: 'get_markdown',
121 |           arguments: {
122 |             url: 'https://httpbin.org/html',
123 |             cache: '1',
124 |           },
125 |         });
126 | 
127 |         const content = (result as ToolResult).content;
128 |         expect(content).toHaveLength(1);
129 |         expect(content[0].type).toBe('text');
130 | 
131 |         const text = content[0].text || '';
132 |         expect(text).toContain('Cache: 1');
133 |       },
134 |       TEST_TIMEOUTS.medium,
135 |     );
136 | 
137 |     it(
138 |       'should reject session_id parameter',
139 |       async () => {
140 |         const result = await client.callTool({
141 |           name: 'get_markdown',
142 |           arguments: {
143 |             url: 'https://httpbin.org/html',
144 |             session_id: 'test-session',
145 |           },
146 |         });
147 | 
148 |         const content = (result as ToolResult).content;
149 |         expect(content).toHaveLength(1);
150 |         expect(content[0].type).toBe('text');
151 |         expect(content[0].text).toContain('session_id');
152 |         expect(content[0].text).toContain('does not support');
153 |         expect(content[0].text).toContain('stateless');
154 |       },
155 |       TEST_TIMEOUTS.short,
156 |     );
157 | 
158 |     it(
159 |       'should handle invalid URLs gracefully',
160 |       async () => {
161 |         const result = await client.callTool({
162 |           name: 'get_markdown',
163 |           arguments: {
164 |             url: 'not-a-valid-url',
165 |           },
166 |         });
167 | 
168 |         const content = (result as ToolResult).content;
169 |         expect(content).toHaveLength(1);
170 |         expect(content[0].type).toBe('text');
171 |         expect(content[0].text).toContain('Error');
172 |         expect(content[0].text?.toLowerCase()).toContain('invalid');
173 |       },
174 |       TEST_TIMEOUTS.short,
175 |     );
176 | 
177 |     it(
178 |       'should handle non-existent domains',
179 |       async () => {
180 |         const result = await client.callTool({
181 |           name: 'get_markdown',
182 |           arguments: {
183 |             url: 'https://this-domain-definitely-does-not-exist-123456789.com',
184 |           },
185 |         });
186 | 
187 |         const content = (result as ToolResult).content;
188 |         expect(content).toHaveLength(1);
189 |         expect(content[0].type).toBe('text');
190 | 
191 |         // According to the pattern from other tests, might return success with empty content
192 |         const text = content[0].text || '';
193 |         expect(typeof text).toBe('string');
194 |       },
195 |       TEST_TIMEOUTS.short,
196 |     );
197 | 
198 |     it(
199 |       'should ignore extra parameters',
200 |       async () => {
201 |         const result = await client.callTool({
202 |           name: 'get_markdown',
203 |           arguments: {
204 |             url: 'https://httpbin.org/html',
205 |             filter: 'fit',
206 |             // These should be ignored
207 |             remove_images: true,
208 |             bypass_cache: true,
209 |             screenshot: true,
210 |           },
211 |         });
212 | 
213 |         const content = (result as ToolResult).content;
214 |         expect(content).toHaveLength(1);
215 |         expect(content[0].type).toBe('text');
216 | 
217 |         // Should still work, ignoring extra params
218 |         const text = content[0].text || '';
219 |         expect(text).toContain('Filter: fit');
220 |       },
221 |       TEST_TIMEOUTS.medium,
222 |     );
223 |   });
224 | });
225 | 
```

--------------------------------------------------------------------------------
/src/__tests__/integration/execute-js.integration.test.ts:
--------------------------------------------------------------------------------

```typescript
1 | /* eslint-env jest */
2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js';
3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js';
4 | 
5 | interface ToolResult {
6 |   content: Array<{
7 |     type: string;
8 |     text?: string;
9 |   }>;
10 | }
11 | 
12 | describe('execute_js Integration Tests', () => {
13 |   let client: Client;
14 | 
15 |   beforeAll(async () => {
16 |     client = await createTestClient();
17 |   }, TEST_TIMEOUTS.medium);
18 | 
19 |   afterAll(async () => {
20 |     if (client) {
21 |       await cleanupTestClient(client);
22 |     }
23 |   });
24 | 
25 |   describe('JavaScript execution', () => {
26 |     it(
27 |       'should execute JavaScript and return results',
28 |       async () => {
29 |         const result = await client.callTool({
30 |           name: 'execute_js',
31 |           arguments: {
32 |             url: 'https://httpbin.org/html',
33 |             scripts: ['return document.title', 'return document.querySelectorAll("h1").length'],
34 |           },
35 |         });
36 | 
37 |         expect(result).toBeDefined();
38 |         const content = (result as ToolResult).content;
39 |         expect(content).toHaveLength(1);
40 |         expect(content[0].type).toBe('text');
41 | 
42 |         // Should contain JavaScript execution results
43 |         expect(content[0].text).toContain('JavaScript executed on: https://httpbin.org/html');
44 |         expect(content[0].text).toContain('Results:');
45 |         expect(content[0].text).toContain('Script: return document.title');
46 |         expect(content[0].text).toMatch(/Returned: .*/); // Title may be empty or no return value
47 |         expect(content[0].text).toContain('Script: return document.querySelectorAll("h1").length');
48 |         expect(content[0].text).toContain('Returned: 1'); // Should have 1 h1 element
49 |       },
50 |       TEST_TIMEOUTS.medium,
51 |     );
52 | 
53 |     it(
54 |       'should execute single script as string',
55 |       async () => {
56 |         console.log('Starting execute_js test...');
57 |         const result = await client.callTool({
58 |           name: 'execute_js',
59 |           arguments: {
60 |             url: 'https://httpbin.org/html',
61 |             scripts: 'return window.location.href',
62 |           },
63 |         });
64 |         console.log('Got result:', result);
65 | 
66 |         expect(result).toBeDefined();
67 |         const content = (result as ToolResult).content;
68 |         expect(content).toHaveLength(1);
69 | 
70 |         expect(content[0].text).toContain('JavaScript executed on: https://httpbin.org/html');
71 |         expect(content[0].text).toContain('Script: return window.location.href');
72 |         expect(content[0].text).toContain('Returned: "https://httpbin.org/html');
73 |       },
74 |       TEST_TIMEOUTS.long, // Increase timeout to 120s
75 |     );
76 | 
77 |     it(
78 |       'should reject session_id parameter',
79 |       async () => {
80 |         const result = await client.callTool({
81 |           name: 'execute_js',
82 |           arguments: {
83 |             url: 'https://httpbin.org/html',
84 |             scripts: 'return true',
85 |             session_id: 'test-session',
86 |           },
87 |         });
88 | 
89 |         const content = (result as ToolResult).content;
90 |         expect(content).toHaveLength(1);
91 |         expect(content[0].type).toBe('text');
92 |         expect(content[0].text).toContain('session_id');
93 |         expect(content[0].text).toContain('does not support');
94 |         expect(content[0].text).toContain('stateless');
95 |       },
96 |       TEST_TIMEOUTS.short,
97 |     );
98 | 
99 |     it(
100 |       'should reject invalid JavaScript with HTML entities',
101 |       async () => {
102 |         const result = await client.callTool({
103 |           name: 'execute_js',
104 |           arguments: {
105 |             url: 'https://httpbin.org/html',
106 |             scripts: 'return &quot;test&quot;',
107 |           },
108 |         });
109 | 
110 |         const content = (result as ToolResult).content;
111 |         expect(content).toHaveLength(1);
112 |         expect(content[0].text).toContain('Error');
113 |         expect(content[0].text).toContain('Invalid JavaScript');
114 |         expect(content[0].text).toContain('HTML entities');
115 |       },
116 |       TEST_TIMEOUTS.short,
117 |     );
118 | 
119 |     it(
120 |       'should accept JavaScript with newlines in strings',
121 |       async () => {
122 |         const result = await client.callTool({
123 |           name: 'execute_js',
124 |           arguments: {
125 |             url: 'https://httpbin.org/html',
126 |             scripts: 'const text = "line1\\nline2"; return text',
127 |           },
128 |         });
129 | 
130 |         const content = (result as ToolResult).content;
131 |         expect(content).toHaveLength(1);
132 |         expect(content[0].text).toContain('JavaScript executed on: https://httpbin.org/html');
133 |         expect(content[0].text).toContain('Returned: "line1\\nline2"');
134 |       },
135 |       TEST_TIMEOUTS.medium, // Increase from short to medium
136 |     );
137 | 
138 |     it(
139 |       'should handle JavaScript execution errors',
140 |       async () => {
141 |         const result = await client.callTool({
142 |           name: 'execute_js',
143 |           arguments: {
144 |             url: 'https://httpbin.org/html',
145 |             scripts: [
146 |               'return "This works"',
147 |               'throw new Error("This is a test error")',
148 |               'nonExistentVariable.someMethod()',
149 |             ],
150 |           },
151 |         });
152 | 
153 |         const content = (result as ToolResult).content;
154 |         expect(content).toHaveLength(1);
155 |         expect(content[0].text).toContain('JavaScript executed on: https://httpbin.org/html');
156 | 
157 |         // First script should succeed
158 |         expect(content[0].text).toContain('Script: return "This works"');
159 |         expect(content[0].text).toContain('Returned: "This works"');
160 | 
161 |         // Second script should show error
162 |         expect(content[0].text).toContain('Script: throw new Error("This is a test error")');
163 |         expect(content[0].text).toContain('Returned: Error: Error: This is a test error');
164 | 
165 |         // Third script should show reference error
166 |         expect(content[0].text).toContain('Script: nonExistentVariable.someMethod()');
167 |         expect(content[0].text).toContain('Returned: Error: ReferenceError: nonExistentVariable is not defined');
168 |       },
169 |       TEST_TIMEOUTS.medium,
170 |     );
171 | 
172 |     it(
173 |       'should handle invalid URLs gracefully',
174 |       async () => {
175 |         const result = await client.callTool({
176 |           name: 'execute_js',
177 |           arguments: {
178 |             url: 'not-a-valid-url',
179 |             scripts: 'return true',
180 |           },
181 |         });
182 | 
183 |         const content = (result as ToolResult).content;
184 |         expect(content).toHaveLength(1);
185 |         expect(content[0].text).toContain('Error');
186 |         expect(content[0].text?.toLowerCase()).toContain('invalid');
187 |       },
188 |       TEST_TIMEOUTS.short,
189 |     );
190 |   });
191 | });
192 | 
```

--------------------------------------------------------------------------------
/src/__tests__/integration/batch-crawl.integration.test.ts:
--------------------------------------------------------------------------------

```typescript
1 | /* eslint-env jest */
2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js';
3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js';
4 | 
5 | interface ToolResult {
6 |   content: Array<{
7 |     type: string;
8 |     text?: string;
9 |   }>;
10 | }
11 | 
12 | describe('batch_crawl Integration Tests', () => {
13 |   let client: Client;
14 | 
15 |   beforeAll(async () => {
16 |     client = await createTestClient();
17 |   }, TEST_TIMEOUTS.medium);
18 | 
19 |   afterAll(async () => {
20 |     if (client) {
21 |       await cleanupTestClient(client);
22 |     }
23 |   });
24 | 
25 |   describe('Batch crawling', () => {
26 |     it(
27 |       'should crawl multiple URLs',
28 |       async () => {
29 |         const result = await client.callTool({
30 |           name: 'batch_crawl',
31 |           arguments: {
32 |             urls: ['https://httpbingo.org/html', 'https://httpbingo.org/json'],
33 |           },
34 |         });
35 | 
36 |         expect(result).toBeDefined();
37 |         const content = (result as ToolResult).content;
38 |         expect(content).toHaveLength(1);
39 |         expect(content[0].type).toBe('text');
40 | 
41 |         const text = content[0].text || '';
42 |         expect(text).toContain('Batch crawl completed');
43 |         expect(text).toContain('Processed 2 URLs');
44 |         expect(text).toContain('https://httpbingo.org/html: Success');
45 |         expect(text).toContain('https://httpbingo.org/json: Success');
46 |       },
47 |       TEST_TIMEOUTS.medium,
48 |     );
49 | 
50 |     it(
51 |       'should handle max_concurrent parameter',
52 |       async () => {
53 |         const result = await client.callTool({
54 |           name: 'batch_crawl',
55 |           arguments: {
56 |             urls: ['https://httpbingo.org/html', 'https://httpbingo.org/xml', 'https://httpbingo.org/json'],
57 |             max_concurrent: 1,
58 |           },
59 |         });
60 | 
61 |         const content = (result as ToolResult).content;
62 |         expect(content).toHaveLength(1);
63 |         expect(content[0].type).toBe('text');
64 | 
65 |         const text = content[0].text || '';
66 |         expect(text).toContain('Processed 3 URLs');
67 |         expect(text).toContain(': Success');
68 |       },
69 |       TEST_TIMEOUTS.long,
70 |     );
71 | 
72 |     it(
73 |       'should remove images when requested',
74 |       async () => {
75 |         const result = await client.callTool({
76 |           name: 'batch_crawl',
77 |           arguments: {
78 |             urls: ['https://httpbingo.org/html'],
79 |             remove_images: true,
80 |           },
81 |         });
82 | 
83 |         const content = (result as ToolResult).content;
84 |         expect(content).toHaveLength(1);
85 |         expect(content[0].type).toBe('text');
86 | 
87 |         const text = content[0].text || '';
88 |         expect(text).toContain('Batch crawl completed');
89 |         expect(text).toContain('https://httpbingo.org/html: Success');
90 |       },
91 |       TEST_TIMEOUTS.medium,
92 |     );
93 | 
94 |     it(
95 |       'should bypass cache when requested',
96 |       async () => {
97 |         const result = await client.callTool({
98 |           name: 'batch_crawl',
99 |           arguments: {
100 |             urls: ['https://httpbingo.org/html'],
101 |             bypass_cache: true,
102 |           },
103 |         });
104 | 
105 |         const content = (result as ToolResult).content;
106 |         expect(content).toHaveLength(1);
107 |         expect(content[0].type).toBe('text');
108 | 
109 |         const text = content[0].text || '';
110 |         expect(text).toContain('Batch crawl completed');
111 |         expect(text).toContain('https://httpbingo.org/html: Success');
112 |       },
113 |       TEST_TIMEOUTS.medium,
114 |     );
115 | 
116 |     it(
117 |       'should handle mixed content types',
118 |       async () => {
119 |         const result = await client.callTool({
120 |           name: 'batch_crawl',
121 |           arguments: {
122 |             urls: ['https://httpbin.org/html', 'https://httpbin.org/json', 'https://httpbin.org/xml'],
123 |           },
124 |         });
125 | 
126 |         const content = (result as ToolResult).content;
127 |         expect(content).toHaveLength(1);
128 |         expect(content[0].type).toBe('text');
129 | 
130 |         const text = content[0].text || '';
131 |         expect(text).toContain('Processed 3 URLs');
132 |         expect(text).toContain('https://httpbin.org/html: Success');
133 |         expect(text).toContain('https://httpbin.org/json: Success');
134 |         expect(text).toContain('https://httpbin.org/xml: Success');
135 |       },
136 |       TEST_TIMEOUTS.medium,
137 |     );
138 | 
139 |     it(
140 |       'should handle empty URL list',
141 |       async () => {
142 |         const result = await client.callTool({
143 |           name: 'batch_crawl',
144 |           arguments: {
145 |             urls: [],
146 |           },
147 |         });
148 | 
149 |         const content = (result as ToolResult).content;
150 |         expect(content).toHaveLength(1);
151 |         expect(content[0].text).toContain('Error');
152 |         // Just check that it's an error about invalid parameters
153 |         expect(content[0].text?.toLowerCase()).toMatch(/error|invalid|failed/);
154 |       },
155 |       TEST_TIMEOUTS.short,
156 |     );
157 | 
158 |     it(
159 |       'should reject session_id parameter',
160 |       async () => {
161 |         const result = await client.callTool({
162 |           name: 'batch_crawl',
163 |           arguments: {
164 |             urls: ['https://httpbingo.org/html'],
165 |             session_id: 'test-session',
166 |           },
167 |         });
168 | 
169 |         const content = (result as ToolResult).content;
170 |         expect(content).toHaveLength(1);
171 |         expect(content[0].type).toBe('text');
172 |         expect(content[0].text).toContain('session_id');
173 |         expect(content[0].text).toContain('does not support');
174 |         expect(content[0].text).toContain('stateless');
175 |       },
176 |       TEST_TIMEOUTS.short,
177 |     );
178 | 
179 |     it(
180 |       'should handle per-URL configs array',
181 |       async () => {
182 |         const result = await client.callTool({
183 |           name: 'batch_crawl',
184 |           arguments: {
185 |             urls: ['https://httpbingo.org/html', 'https://httpbingo.org/json'],
186 |             configs: [
187 |               {
188 |                 url: 'https://httpbingo.org/html',
189 |                 browser_config: { browser_type: 'chromium' },
190 |                 crawler_config: { word_count_threshold: 10 },
191 |               },
192 |               {
193 |                 url: 'https://httpbingo.org/json',
194 |                 browser_config: { browser_type: 'firefox' },
195 |                 crawler_config: { word_count_threshold: 20 },
196 |               },
197 |             ],
198 |             max_concurrent: 2,
199 |           },
200 |         });
201 | 
202 |         const content = (result as ToolResult).content;
203 |         expect(content).toHaveLength(1);
204 |         expect(content[0].type).toBe('text');
205 | 
206 |         const text = content[0].text || '';
207 |         expect(text).toContain('Batch crawl completed');
208 |         expect(text).toContain('Processed 2 URLs');
209 |         // Both should succeed regardless of different configs
210 |         expect(text).toContain('https://httpbingo.org/html: Success');
211 |         expect(text).toContain('https://httpbingo.org/json: Success');
212 |       },
213 |       TEST_TIMEOUTS.medium,
214 |     );
215 |   });
216 | });
217 | 
```

--------------------------------------------------------------------------------
/src/__tests__/integration/parse-sitemap.integration.test.ts:
--------------------------------------------------------------------------------

```typescript
1 | /* eslint-env jest */
2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js';
3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js';
4 | 
5 | interface ToolResult {
6 |   content: Array<{
7 |     type: string;
8 |     text?: string;
9 |   }>;
10 | }
11 | 
12 | describe('parse_sitemap Integration Tests', () => {
13 |   let client: Client;
14 | 
15 |   beforeAll(async () => {
16 |     client = await createTestClient();
17 |   }, TEST_TIMEOUTS.medium);
18 | 
19 |   afterAll(async () => {
20 |     if (client) {
21 |       await cleanupTestClient(client);
22 |     }
23 |   });
24 | 
25 |   describe('Basic functionality', () => {
26 |     it(
27 |       'should parse nodejs.org sitemap successfully',
28 |       async () => {
29 |         const result = await client.callTool({
30 |           name: 'parse_sitemap',
31 |           arguments: {
32 |             url: 'https://nodejs.org/sitemap.xml',
33 |           },
34 |         });
35 | 
36 |         expect(result).toBeDefined();
37 |         const content = (result as ToolResult).content;
38 |         expect(content).toBeDefined();
39 |         expect(Array.isArray(content)).toBe(true);
40 |         expect(content.length).toBeGreaterThan(0);
41 | 
42 |         const textContent = content.find((c) => c.type === 'text');
43 |         expect(textContent).toBeDefined();
44 |         expect(textContent?.text).toContain('Sitemap parsed successfully');
45 |         expect(textContent?.text).toContain('Total URLs found:');
46 |         expect(textContent?.text).toContain('https://nodejs.org');
47 | 
48 |         // Should find many URLs in the nodejs sitemap
49 |         expect(textContent?.text).toMatch(/Total URLs found: [1-9][0-9]+/);
50 |       },
51 |       TEST_TIMEOUTS.medium,
52 |     );
53 | 
54 |     it(
55 |       'should filter URLs with regex pattern',
56 |       async () => {
57 |         const result = await client.callTool({
58 |           name: 'parse_sitemap',
59 |           arguments: {
60 |             url: 'https://nodejs.org/sitemap.xml',
61 |             filter_pattern: '.*/learn/.*', // Only URLs containing /learn/
62 |           },
63 |         });
64 | 
65 |         expect(result).toBeDefined();
66 |         const content = (result as ToolResult).content;
67 |         const textContent = content.find((c) => c.type === 'text');
68 |         expect(textContent).toBeDefined();
69 | 
70 |         // Check that filtering worked
71 |         expect(textContent?.text).toContain('Filtered URLs:');
72 | 
73 |         // All URLs in the result should contain /learn/
74 |         const urlsSection = textContent?.text?.split('URLs:\n')[1];
75 |         if (urlsSection) {
76 |           const urls = urlsSection.split('\n').filter((url) => url.trim());
77 |           urls.forEach((url) => {
78 |             if (url && !url.includes('... and')) {
79 |               expect(url).toContain('/learn/');
80 |             }
81 |           });
82 |         }
83 |       },
84 |       TEST_TIMEOUTS.medium,
85 |     );
86 | 
87 |     it(
88 |       'should handle empty sitemaps',
89 |       async () => {
90 |         // Using a URL that returns valid XML but not a sitemap
91 |         const result = await client.callTool({
92 |           name: 'parse_sitemap',
93 |           arguments: {
94 |             url: 'https://www.w3schools.com/xml/note.xml',
95 |           },
96 |         });
97 | 
98 |         expect(result).toBeDefined();
99 |         const content = (result as ToolResult).content;
100 |         const textContent = content.find((c) => c.type === 'text');
101 |         expect(textContent).toBeDefined();
102 |         expect(textContent?.text).toContain('Total URLs found: 0');
103 |       },
104 |       TEST_TIMEOUTS.medium,
105 |     );
106 | 
107 |     it(
108 |       'should handle large sitemaps with truncation',
109 |       async () => {
110 |         const result = await client.callTool({
111 |           name: 'parse_sitemap',
112 |           arguments: {
113 |             url: 'https://nodejs.org/sitemap.xml',
114 |             filter_pattern: '.*', // Match all to test truncation
115 |           },
116 |         });
117 | 
118 |         expect(result).toBeDefined();
119 |         const content = (result as ToolResult).content;
120 |         const textContent = content.find((c) => c.type === 'text');
121 |         expect(textContent).toBeDefined();
122 | 
123 |         // Should show max 100 URLs and indicate there are more
124 |         if (textContent?.text && textContent.text.includes('... and')) {
125 |           expect(textContent.text).toMatch(/\.\.\. and \d+ more/);
126 |         }
127 |       },
128 |       TEST_TIMEOUTS.medium,
129 |     );
130 |   });
131 | 
132 |   describe('Error handling', () => {
133 |     it(
134 |       'should handle invalid URLs',
135 |       async () => {
136 |         const result = await client.callTool({
137 |           name: 'parse_sitemap',
138 |           arguments: {
139 |             url: 'not-a-url',
140 |           },
141 |         });
142 | 
143 |         expect(result).toBeDefined();
144 |         const content = (result as ToolResult).content;
145 |         expect(content).toBeDefined();
146 |         const textContent = content.find((c) => c.type === 'text');
147 |         expect(textContent).toBeDefined();
148 |         expect(textContent?.text).toContain('Error');
149 |         expect(textContent?.text?.toLowerCase()).toContain('invalid');
150 |       },
151 |       TEST_TIMEOUTS.short,
152 |     );
153 | 
154 |     it(
155 |       'should handle non-existent URLs',
156 |       async () => {
157 |         const result = await client.callTool({
158 |           name: 'parse_sitemap',
159 |           arguments: {
160 |             url: 'https://this-domain-definitely-does-not-exist-12345.com/sitemap.xml',
161 |           },
162 |         });
163 | 
164 |         expect(result).toBeDefined();
165 |         const content = (result as ToolResult).content;
166 |         const textContent = content.find((c) => c.type === 'text');
167 |         expect(textContent).toBeDefined();
168 |         expect(textContent?.text).toContain('Error');
169 |       },
170 |       TEST_TIMEOUTS.medium,
171 |     );
172 | 
173 |     it(
174 |       'should handle non-XML content',
175 |       async () => {
176 |         const result = await client.callTool({
177 |           name: 'parse_sitemap',
178 |           arguments: {
179 |             url: 'https://example.com', // HTML page, not XML
180 |           },
181 |         });
182 | 
183 |         expect(result).toBeDefined();
184 |         const content = (result as ToolResult).content;
185 |         const textContent = content.find((c) => c.type === 'text');
186 |         expect(textContent).toBeDefined();
187 |         // Should still parse but likely find 0 URLs since it's not a sitemap
188 |         expect(textContent?.text).toContain('Total URLs found:');
189 |       },
190 |       TEST_TIMEOUTS.medium,
191 |     );
192 | 
193 |     it(
194 |       'should handle invalid regex patterns',
195 |       async () => {
196 |         const result = await client.callTool({
197 |           name: 'parse_sitemap',
198 |           arguments: {
199 |             url: 'https://nodejs.org/sitemap.xml',
200 |             filter_pattern: '[invalid(regex', // Invalid regex
201 |           },
202 |         });
203 | 
204 |         expect(result).toBeDefined();
205 |         const content = (result as ToolResult).content;
206 |         const textContent = content.find((c) => c.type === 'text');
207 |         expect(textContent).toBeDefined();
208 |         expect(textContent?.text).toContain('Error');
209 |         expect(textContent?.text?.toLowerCase()).toMatch(/failed|error|invalid/);
210 |       },
211 |       TEST_TIMEOUTS.medium,
212 |     );
213 |   });
214 | });
215 | 
```

--------------------------------------------------------------------------------
/src/__tests__/integration/crawl-handlers.integration.test.ts:
--------------------------------------------------------------------------------

```typescript
1 | /* eslint-env jest */
2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js';
3 | import { createTestClient, cleanupTestClient, TEST_TIMEOUTS } from './test-utils.js';
4 | 
5 | interface ToolResult {
6 |   content: Array<{
7 |     type: string;
8 |     text?: string;
9 |   }>;
10 | }
11 | 
12 | describe('Crawl Handlers Integration Tests', () => {
13 |   let client: Client;
14 | 
15 |   beforeAll(async () => {
16 |     client = await createTestClient();
17 |   }, TEST_TIMEOUTS.medium);
18 | 
19 |   afterAll(async () => {
20 |     if (client) {
21 |       await cleanupTestClient(client);
22 |     }
23 |   });
24 | 
25 |   describe('batch_crawl error handling', () => {
26 |     it(
27 |       'should handle batch crawl with invalid URLs',
28 |       async () => {
29 |         const result = await client.callTool({
30 |           name: 'batch_crawl',
31 |           arguments: {
32 |             urls: ['not-a-valid-url', 'https://this-domain-does-not-exist-12345.com'],
33 |             max_concurrent: 2,
34 |           },
35 |         });
36 | 
37 |         expect(result).toBeDefined();
38 |         const content = (result as ToolResult).content;
39 |         expect(content[0].type).toBe('text');
40 |         // Zod validation will catch the invalid URL format
41 |         expect(content[0].text).toContain('Invalid parameters');
42 |       },
43 |       TEST_TIMEOUTS.medium,
44 |     );
45 |   });
46 | 
47 |   describe('smart_crawl edge cases', () => {
48 |     it(
49 |       'should detect XML content type for XML URLs',
50 |       async () => {
51 |         const result = await client.callTool({
52 |           name: 'smart_crawl',
53 |           arguments: {
54 |             url: 'https://httpbin.org/xml',
55 |             bypass_cache: true,
56 |           },
57 |         });
58 | 
59 |         expect(result).toBeDefined();
60 |         const content = (result as ToolResult).content;
61 |         expect(content[0].text).toContain('Smart crawl detected content type:');
62 |         // Should detect as XML based on content-type header
63 |         expect(content[0].text?.toLowerCase()).toMatch(/xml|json/); // httpbin.org/xml actually returns JSON
64 |       },
65 |       TEST_TIMEOUTS.medium,
66 |     );
67 | 
68 |     it(
69 |       'should handle follow_links with sitemap URLs',
70 |       async () => {
71 |         // Note: Most sites don't have accessible sitemaps, so this tests the logic
72 |         const result = await client.callTool({
73 |           name: 'smart_crawl',
74 |           arguments: {
75 |             url: 'https://example.com/sitemap.xml',
76 |             follow_links: true,
77 |             max_depth: 2,
78 |             bypass_cache: true,
79 |           },
80 |         });
81 | 
82 |         expect(result).toBeDefined();
83 |         const content = (result as ToolResult).content;
84 |         expect(content[0].text).toContain('Smart crawl detected content type:');
85 |       },
86 |       TEST_TIMEOUTS.long, // Increase timeout for sitemap processing
87 |     );
88 |   });
89 | 
90 |   describe('crawl_recursive edge cases', () => {
91 |     it(
92 |       'should respect max_depth limit of 0',
93 |       async () => {
94 |         const result = await client.callTool({
95 |           name: 'crawl_recursive',
96 |           arguments: {
97 |             url: 'https://httpbin.org/links/5/0',
98 |             max_depth: 0, // Should only crawl the initial page
99 |           },
100 |         });
101 | 
102 |         expect(result).toBeDefined();
103 |         const content = (result as ToolResult).content;
104 |         // The test might show 0 pages if the URL fails, or 1 page if it succeeds
105 |         expect(content[0].text).toMatch(/Pages crawled: [01]/);
106 |         // If pages were crawled, check for max depth message
107 |         if (content[0].text?.includes('Pages crawled: 1')) {
108 |           expect(content[0].text).toContain('Max depth reached: 0');
109 |         }
110 |       },
111 |       TEST_TIMEOUTS.medium,
112 |     );
113 | 
114 |     it(
115 |       'should handle sites with no internal links',
116 |       async () => {
117 |         const result = await client.callTool({
118 |           name: 'crawl_recursive',
119 |           arguments: {
120 |             url: 'https://httpbin.org/json', // JSON endpoint has no links
121 |             max_depth: 2,
122 |           },
123 |         });
124 | 
125 |         expect(result).toBeDefined();
126 |         const content = (result as ToolResult).content;
127 |         expect(content[0].text).toContain('Pages crawled: 1');
128 |         expect(content[0].text).toContain('Internal links found: 0');
129 |       },
130 |       TEST_TIMEOUTS.medium,
131 |     );
132 |   });
133 | 
134 |   describe('parse_sitemap error handling', () => {
135 |     it(
136 |       'should handle non-existent sitemap URLs',
137 |       async () => {
138 |         const result = await client.callTool({
139 |           name: 'parse_sitemap',
140 |           arguments: {
141 |             url: 'https://this-domain-does-not-exist-12345.com/sitemap.xml',
142 |           },
143 |         });
144 | 
145 |         expect(result).toBeDefined();
146 |         const content = (result as ToolResult).content;
147 |         expect(content[0].text).toContain('Error');
148 |         expect(content[0].text?.toLowerCase()).toMatch(/failed|error|not found/);
149 |       },
150 |       TEST_TIMEOUTS.medium,
151 |     );
152 |   });
153 | 
154 |   describe('crawl method edge cases', () => {
155 |     it(
156 |       'should handle crawl with all image and filtering parameters',
157 |       async () => {
158 |         const result = await client.callTool({
159 |           name: 'crawl',
160 |           arguments: {
161 |             url: 'https://example.com',
162 |             word_count_threshold: 50,
163 |             image_description_min_word_threshold: 10,
164 |             image_score_threshold: 0.5,
165 |             exclude_social_media_links: true,
166 |             cache_mode: 'BYPASS',
167 |           },
168 |         });
169 | 
170 |         expect(result).toBeDefined();
171 |         const content = (result as ToolResult).content;
172 |         expect(content[0].type).toBe('text');
173 |         // Should successfully crawl with these parameters
174 |         expect(content[0].text).not.toContain('Error');
175 |       },
176 |       TEST_TIMEOUTS.medium,
177 |     );
178 | 
179 |     it(
180 |       'should handle js_code as null with validation error',
181 |       async () => {
182 |         const result = await client.callTool({
183 |           name: 'crawl',
184 |           arguments: {
185 |             url: 'https://example.com',
186 |             js_code: null as unknown as string, // Intentionally pass null
187 |           },
188 |         });
189 | 
190 |         expect(result).toBeDefined();
191 |         const content = (result as ToolResult).content;
192 |         expect(content[0].text).toContain('Invalid parameters for crawl');
193 |         expect(content[0].text).toContain('js_code');
194 |       },
195 |       TEST_TIMEOUTS.short,
196 |     );
197 | 
198 |     it(
199 |       'should work with session_id parameter using manage_session',
200 |       async () => {
201 |         // First create a session using manage_session
202 |         const sessionResult = await client.callTool({
203 |           name: 'manage_session',
204 |           arguments: {
205 |             action: 'create',
206 |             session_id: 'test-crawl-session-new',
207 |           },
208 |         });
209 | 
210 |         expect(sessionResult).toBeDefined();
211 | 
212 |         // Then use it for crawling
213 |         const crawlResult = await client.callTool({
214 |           name: 'crawl',
215 |           arguments: {
216 |             url: 'https://example.com',
217 |             session_id: 'test-crawl-session-new',
218 |           },
219 |         });
220 | 
221 |         expect(crawlResult).toBeDefined();
222 |         const content = (crawlResult as ToolResult).content;
223 |         expect(content[0].type).toBe('text');
224 |         expect(content[0].text).not.toContain('Error');
225 | 
226 |         // Clean up using manage_session
227 |         await client.callTool({
228 |           name: 'manage_session',
229 |           arguments: {
230 |             action: 'clear',
231 |             session_id: 'test-crawl-session-new',
232 |           },
233 |         });
234 |       },
235 |       TEST_TIMEOUTS.medium,
236 |     );
237 |   });
238 | });
239 | 
```

--------------------------------------------------------------------------------
/src/__tests__/integration/crawl-advanced.integration.test.ts:
--------------------------------------------------------------------------------

```typescript
1 | /* eslint-env jest */
2 | import { Client } from '@modelcontextprotocol/sdk/client/index.js';
3 | import { createTestClient, cleanupTestClient, expectSuccessfulCrawl, TEST_TIMEOUTS } from './test-utils.js';
4 | 
5 | interface ToolResult {
6 |   content: Array<{
7 |     type: string;
8 |     text?: string;
9 |     data?: string;
10 |     mimeType?: string;
11 |   }>;
12 | }
13 | 
14 | describe('crawl Advanced Features Integration Tests', () => {
15 |   let client: Client;
16 | 
17 |   beforeAll(async () => {
18 |     client = await createTestClient();
19 |   }, TEST_TIMEOUTS.medium);
20 | 
21 |   afterAll(async () => {
22 |     if (client) {
23 |       await cleanupTestClient(client);
24 |     }
25 |   });
26 | 
27 |   describe('Media and Content Extraction', () => {
28 |     it(
29 |       'should extract images with scoring',
30 |       async () => {
31 |         const result = await client.callTool({
32 |           name: 'crawl',
33 |           arguments: {
34 |             url: 'https://example.com',
35 |             image_score_threshold: 3,
36 |             exclude_external_images: false,
37 |             cache_mode: 'BYPASS',
38 |           },
39 |         });
40 | 
41 |         await expectSuccessfulCrawl(result);
42 |         const textContent = (result as ToolResult).content.find((c) => c.type === 'text');
43 |         expect(textContent?.text).toBeTruthy();
44 |         // Should have extracted content
45 |         expect(textContent?.text).toContain('Example Domain');
46 |       },
47 |       TEST_TIMEOUTS.medium,
48 |     );
49 | 
50 |     it(
51 |       'should capture MHTML',
52 |       async () => {
53 |         const result = await client.callTool({
54 |           name: 'crawl',
55 |           arguments: {
56 |             url: 'https://example.com',
57 |             capture_mhtml: true,
58 |             cache_mode: 'BYPASS',
59 |           },
60 |         });
61 | 
62 |         await expectSuccessfulCrawl(result);
63 |         const textContent = (result as ToolResult).content.find((c) => c.type === 'text');
64 |         expect(textContent?.text).toBeTruthy();
65 |         // MHTML should be captured but not in the text output
66 |         expect(textContent?.text).toContain('Example Domain');
67 |       },
68 |       TEST_TIMEOUTS.long,
69 |     );
70 | 
71 |     it(
72 |       'should extract tables from Wikipedia',
73 |       async () => {
74 |         const result = await client.callTool({
75 |           name: 'crawl',
76 |           arguments: {
77 |             url: 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)',
78 |             word_count_threshold: 10,
79 |             cache_mode: 'BYPASS',
80 |           },
81 |         });
82 | 
83 |         await expectSuccessfulCrawl(result);
84 |         const textContent = (result as ToolResult).content.find((c) => c.type === 'text');
85 |         expect(textContent?.text).toBeTruthy();
86 |         // Should contain country data
87 |         expect(textContent?.text).toMatch(/China|India|United States/);
88 |       },
89 |       TEST_TIMEOUTS.long,
90 |     );
91 |   });
92 | 
93 |   describe('Link and Content Filtering', () => {
94 |     it(
95 |       'should exclude social media links',
96 |       async () => {
97 |         const result = await client.callTool({
98 |           name: 'crawl',
99 |           arguments: {
100 |             url: 'https://www.bbc.com/news',
101 |             exclude_social_media_links: true,
102 |             exclude_domains: ['twitter.com', 'facebook.com', 'instagram.com'],
103 |             cache_mode: 'BYPASS',
104 |             word_count_threshold: 50,
105 |           },
106 |         });
107 | 
108 |         await expectSuccessfulCrawl(result);
109 |         const textContent = (result as ToolResult).content.find((c) => c.type === 'text');
110 |         expect(textContent?.text).toBeTruthy();
111 |         // Should have news content but no social media references in extracted links
112 |         expect(textContent?.text).toContain('BBC');
113 |       },
114 |       TEST_TIMEOUTS.long,
115 |     );
116 | 
117 |     it(
118 |       'should remove excluded selectors',
119 |       async () => {
120 |         const result = await client.callTool({
121 |           name: 'crawl',
122 |           arguments: {
123 |             url: 'https://httpbin.org/html',
124 |             excluded_selector: 'div:first-child',
125 |             cache_mode: 'BYPASS',
126 |           },
127 |         });
128 | 
129 |         await expectSuccessfulCrawl(result);
130 |       },
131 |       TEST_TIMEOUTS.medium,
132 |     );
133 |   });
134 | 
135 |   describe('Page Navigation Options', () => {
136 |     it(
137 |       'should wait for images to load',
138 |       async () => {
139 |         const result = await client.callTool({
140 |           name: 'crawl',
141 |           arguments: {
142 |             url: 'https://httpbin.org/image/png',
143 |             wait_for_images: true,
144 |             wait_until: 'load',
145 |             page_timeout: 30000,
146 |             cache_mode: 'BYPASS',
147 |           },
148 |         });
149 | 
150 |         await expectSuccessfulCrawl(result);
151 |       },
152 |       TEST_TIMEOUTS.medium,
153 |     );
154 | 
155 |     it(
156 |       'should scan full page',
157 |       async () => {
158 |         const result = await client.callTool({
159 |           name: 'crawl',
160 |           arguments: {
161 |             url: 'https://httpbin.org/html',
162 |             scan_full_page: true,
163 |             delay_before_scroll: 0.5,
164 |             scroll_delay: 0.2,
165 |             cache_mode: 'BYPASS',
166 |           },
167 |         });
168 | 
169 |         await expectSuccessfulCrawl(result);
170 |       },
171 |       TEST_TIMEOUTS.medium,
172 |     );
173 |   });
174 | 
175 |   describe('Stealth and Bot Detection', () => {
176 |     it(
177 |       'should use magic mode',
178 |       async () => {
179 |         const result = await client.callTool({
180 |           name: 'crawl',
181 |           arguments: {
182 |             url: 'https://httpbin.org/headers',
183 |             magic: true,
184 |             simulate_user: true,
185 |             override_navigator: true,
186 |             cache_mode: 'BYPASS',
187 |           },
188 |         });
189 | 
190 |         await expectSuccessfulCrawl(result);
191 |       },
192 |       TEST_TIMEOUTS.long,
193 |     );
194 |   });
195 | 
196 |   describe('Extraction Strategies (0.7.3/0.7.4)', () => {
197 |     it(
198 |       'should accept extraction_strategy parameter',
199 |       async () => {
200 |         const result = await client.callTool({
201 |           name: 'crawl',
202 |           arguments: {
203 |             url: 'https://httpbin.org/html',
204 |             extraction_strategy: {
205 |               type: 'custom',
206 |               provider: 'openai',
207 |               api_key: 'test-key',
208 |               model: 'gpt-4',
209 |             },
210 |             cache_mode: 'BYPASS',
211 |           },
212 |         });
213 | 
214 |         // The parameter should be accepted even if not fully processed
215 |         await expectSuccessfulCrawl(result);
216 |       },
217 |       TEST_TIMEOUTS.short,
218 |     );
219 | 
220 |     it(
221 |       'should accept table_extraction_strategy parameter',
222 |       async () => {
223 |         const result = await client.callTool({
224 |           name: 'crawl',
225 |           arguments: {
226 |             url: 'https://httpbin.org/html',
227 |             table_extraction_strategy: {
228 |               enable_chunking: true,
229 |               thresholds: {
230 |                 min_rows: 5,
231 |                 max_columns: 20,
232 |               },
233 |             },
234 |             cache_mode: 'BYPASS',
235 |           },
236 |         });
237 | 
238 |         await expectSuccessfulCrawl(result);
239 |       },
240 |       TEST_TIMEOUTS.short,
241 |     );
242 | 
243 |     it(
244 |       'should accept markdown_generator_options parameter',
245 |       async () => {
246 |         const result = await client.callTool({
247 |           name: 'crawl',
248 |           arguments: {
249 |             url: 'https://httpbin.org/html',
250 |             markdown_generator_options: {
251 |               include_links: true,
252 |               preserve_formatting: true,
253 |             },
254 |             cache_mode: 'BYPASS',
255 |           },
256 |         });
257 | 
258 |         await expectSuccessfulCrawl(result);
259 |       },
260 |       TEST_TIMEOUTS.short,
261 |     );
262 |   });
263 | 
264 |   describe('Virtual Scroll', () => {
265 |     it(
266 |       'should handle virtual scroll configuration',
267 |       async () => {
268 |         const result = await client.callTool({
269 |           name: 'crawl',
270 |           arguments: {
271 |             url: 'https://httpbin.org/html',
272 |             virtual_scroll_config: {
273 |               container_selector: 'body',
274 |               scroll_count: 3,
275 |               scroll_by: 'container_height',
276 |               wait_after_scroll: 0.5,
277 |             },
278 |             cache_mode: 'BYPASS',
279 |           },
280 |         });
281 | 
282 |         await expectSuccessfulCrawl(result);
283 |       },
284 |       TEST_TIMEOUTS.medium,
285 |     );
286 |   });
287 | });
288 | 
```
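
The advanced-crawl tests above lean on the `expectSuccessfulCrawl` helper imported from `test-utils.ts`, which is not shown on this page. The following is a minimal sketch of what such a helper plausibly asserts, based on the `ToolResult` shape and the error conventions used throughout these tests; the real implementation may differ.

```typescript
// Hypothetical sketch of an expectSuccessfulCrawl-style helper; the actual
// implementation lives in src/__tests__/integration/test-utils.ts.
import { expect } from '@jest/globals';

interface ToolResult {
  content: Array<{ type: string; text?: string }>;
}

export async function expectSuccessfulCrawlSketch(result: unknown): Promise<void> {
  const content = (result as ToolResult).content;
  expect(content).toBeDefined();
  const textContent = content.find((c) => c.type === 'text');
  expect(textContent).toBeDefined();
  // Handler errors surface as text content rather than thrown exceptions.
  expect(textContent?.text).not.toContain('Error');
}
```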