# felores/docs_scraper_mcp

This is page 1 of 2. Use http://codebase.md/felores/docs_scraper_mcp?lines=true&page={x} to view the full context.

# Directory Structure

```
├── .cursor
│   └── rules
│       ├── implementation-plan.mdc
│       └── mcp-development-protocol.mdc
├── .gitignore
├── .venv
│   ├── Include
│   │   └── site
│   │       └── python3.12
│   │           └── greenlet
│   │               └── greenlet.h
│   ├── pyvenv.cfg
│   └── Scripts
│       ├── activate
│       ├── activate.bat
│       ├── Activate.ps1
│       ├── cchardetect
│       ├── crawl4ai-doctor.exe
│       ├── crawl4ai-download-models.exe
│       ├── crawl4ai-migrate.exe
│       ├── crawl4ai-setup.exe
│       ├── crwl.exe
│       ├── deactivate.bat
│       ├── distro.exe
│       ├── docs-scraper.exe
│       ├── dotenv.exe
│       ├── f2py.exe
│       ├── httpx.exe
│       ├── huggingface-cli.exe
│       ├── jsonschema.exe
│       ├── litellm.exe
│       ├── markdown-it.exe
│       ├── mcp.exe
│       ├── nltk.exe
│       ├── normalizer.exe
│       ├── numpy-config.exe
│       ├── openai.exe
│       ├── pip.exe
│       ├── pip3.12.exe
│       ├── pip3.exe
│       ├── playwright.exe
│       ├── py.test.exe
│       ├── pygmentize.exe
│       ├── pytest.exe
│       ├── python.exe
│       ├── pythonw.exe
│       ├── tqdm.exe
│       ├── typer.exe
│       └── uvicorn.exe
├── input_files
│   └── .gitkeep
├── LICENSE
├── pyproject.toml
├── README.md
├── requirements.txt
├── scraped_docs
│   └── .gitkeep
├── src
│   └── docs_scraper
│       ├── __init__.py
│       ├── cli.py
│       ├── crawlers
│       │   ├── __init__.py
│       │   ├── menu_crawler.py
│       │   ├── multi_url_crawler.py
│       │   ├── single_url_crawler.py
│       │   └── sitemap_crawler.py
│       ├── server.py
│       └── utils
│           ├── __init__.py
│           ├── html_parser.py
│           └── request_handler.py
└── tests
    ├── conftest.py
    ├── test_crawlers
    │   ├── test_menu_crawler.py
    │   ├── test_multi_url_crawler.py
    │   ├── test_single_url_crawler.py
    │   └── test_sitemap_crawler.py
    └── test_utils
        ├── test_html_parser.py
        └── test_request_handler.py
```

# Files

--------------------------------------------------------------------------------
/input_files/.gitkeep:
--------------------------------------------------------------------------------

```
1 | 
```

--------------------------------------------------------------------------------
/scraped_docs/.gitkeep:
--------------------------------------------------------------------------------

```
1 |  
```

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

```
 1 | # Python
 2 | __pycache__/
 3 | *.py[cod]
 4 | *$py.class
 5 | *.so
 6 | .Python
 7 | env/
 8 | build/
 9 | develop-eggs/
10 | dist/
11 | downloads/
12 | eggs/
13 | .eggs/
14 | lib/
15 | lib64/
16 | parts/
17 | sdist/
18 | var/
19 | wheels/
20 | *.egg-info/
21 | .installed.cfg
22 | *.egg
23 | 
24 | # Virtual Environment
25 | venv/
26 | ENV/
27 | .env
28 | 
29 | # IDE
30 | .idea/
31 | .vscode/
32 | *.swp
33 | *.swo
34 | .DS_Store
35 | 
36 | # Scraped Docs - ignore contents but keep directory
37 | scraped_docs/*
38 | !scraped_docs/.gitkeep
39 | 
40 | # Input Files - ignore contents but keep directory
41 | input_files/*
42 | !input_files/.gitkeep 
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

```markdown
  1 | # Crawl4AI Documentation Scraper
  2 | 
  3 | Keep your dependency documentation lean, current, and AI-ready. This toolkit helps you extract clean, focused documentation from any framework or library website, perfect for both human readers and LLM consumption.
  4 | 
  5 | ## Why This Tool?
  6 | 
  7 | In today's fast-paced development environment, you need:
  8 | - 📚 Quick access to dependency documentation without the bloat
  9 | - 🤖 Documentation in a format that's ready for RAG systems and LLMs
 10 | - 🎯 Focused content without navigation elements, ads, or irrelevant sections
 11 | - ⚡ Fast, efficient way to keep documentation up-to-date
 12 | - 🧹 Clean Markdown output for easy integration with documentation tools
 13 | 
 14 | Traditional web scraping often gives you everything - including navigation menus, footers, ads, and other noise. This toolkit is specifically designed to extract only what matters: the actual documentation content.
 15 | 
 16 | ### Key Benefits
 17 | 
 18 | 1. **Clean Documentation Output**
 19 |    - Markdown format for content-focused documentation
 20 |    - JSON format for structured menu data
 21 |    - Perfect for documentation sites, wikis, and knowledge bases
 22 |    - Ideal format for LLM training and RAG systems
 23 | 
 24 | 2. **Smart Content Extraction**
 25 |    - Automatically identifies main content areas
 26 |    - Strips away navigation, ads, and irrelevant sections
 27 |    - Preserves code blocks and technical formatting
 28 |    - Maintains proper Markdown structure
 29 | 
 30 | 3. **Flexible Crawling Strategies**
 31 |    - Single page for quick reference docs
 32 |    - Multi-page for comprehensive library documentation
 33 |    - Sitemap-based for complete framework coverage
 34 |    - Menu-based for structured documentation hierarchies
 35 | 
 36 | 4. **LLM and RAG Ready**
 37 |    - Clean Markdown text suitable for embeddings
 38 |    - Preserved code blocks for technical accuracy
 39 |    - Structured menu data in JSON format
 40 |    - Consistent formatting for reliable processing
 41 | 
 42 | A comprehensive Python toolkit for scraping documentation websites using different crawling strategies. Built using the Crawl4AI library for efficient web crawling.
 43 | 
 44 | [![Powered by Crawl4AI](https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square)](https://github.com/unclecode/crawl4ai)
 45 | 
 46 | ## Features
 47 | 
 48 | ### Core Features
 49 | - 🚀 Multiple crawling strategies
 50 | - 📑 Automatic nested menu expansion
 51 | - 🔄 Handles dynamic content and lazy-loaded elements
 52 | - 🎯 Configurable selectors
 53 | - 📝 Clean Markdown output for documentation
 54 | - 📊 JSON output for menu structure
 55 | - 🎨 Colorful terminal feedback
 56 | - 🔍 Smart URL processing
 57 | - ⚡ Asynchronous execution
 58 | 
 59 | ### Available Crawlers
 60 | 1. **Single URL Crawler** (`single_url_crawler.py`)
 61 |    - Extracts content from a single documentation page
 62 |    - Outputs clean Markdown format
 63 |    - Perfect for targeted content extraction
 64 |    - Configurable content selectors
 65 | 
 66 | 2. **Multi URL Crawler** (`multi_url_crawler.py`)
 67 |    - Processes multiple URLs in parallel
 68 |    - Generates individual Markdown files per page
 69 |    - Efficient batch processing
 70 |    - Shared browser session for better performance
 71 | 
 72 | 3. **Sitemap Crawler** (`sitemap_crawler.py`)
 73 |    - Automatically discovers and crawls sitemap.xml
 74 |    - Creates Markdown files for each page
 75 |    - Supports recursive sitemap parsing
 76 |    - Handles gzipped sitemaps
 77 | 
 78 | 4. **Menu Crawler** (`menu_crawler.py`)
 79 |    - Extracts all menu links from documentation
 80 |    - Outputs structured JSON format
 81 |    - Handles nested and dynamic menus
 82 |    - Smart menu expansion
 83 | 
 84 | ## Requirements
 85 | 
 86 | - Python 3.7+
 87 | - Virtual Environment (recommended)
 88 | 
 89 | ## Installation
 90 | 
 91 | 1. Clone the repository:
 92 | ```bash
 93 | git clone https://github.com/felores/crawl4ai_docs_scraper.git
 94 | cd crawl4ai_docs_scraper
 95 | ```
 96 | 
 97 | 2. Create and activate a virtual environment:
 98 | ```bash
 99 | python -m venv venv
100 | source venv/bin/activate  # On Windows: venv\Scripts\activate
101 | ```
102 | 
103 | 3. Install dependencies:
104 | ```bash
105 | pip install -r requirements.txt
106 | ```
107 | 
108 | ## Usage
109 | 
110 | ### 1. Single URL Crawler
111 | 
112 | ```bash
113 | python single_url_crawler.py https://docs.example.com/page
114 | ```
115 | 
116 | Arguments:
117 | - URL: Target documentation URL (required, first argument)
118 | 
119 | Note: Use quotes only if your URL contains special characters or spaces.
120 | 
121 | Output format (Markdown):
122 | ```markdown
123 | # Page Title
124 | 
125 | ## Section 1
126 | Content with preserved formatting, including:
127 | - Lists
128 | - Links
129 | - Tables
130 | 
131 | ### Code Examples
132 | ```python
133 | def example():
134 |     return "Code blocks are preserved"
135 | ```
136 | 
137 | ### 2. Multi URL Crawler
138 | 
139 | ```bash
140 | # Using a text file with URLs
141 | python multi_url_crawler.py urls.txt
142 | 
143 | # Using JSON output from menu crawler
144 | python multi_url_crawler.py menu_links.json
145 | 
146 | # Using custom output prefix
147 | python multi_url_crawler.py menu_links.json --output-prefix custom_name
148 | ```
149 | 
150 | Arguments:
151 | - URLs file: Path to file containing URLs (required, first argument)
152 |   - Can be .txt with one URL per line
153 |   - Or .json from menu crawler output
154 | - `--output-prefix`: Custom prefix for output markdown file (optional)
155 | 
156 | Note: Use quotes only if your file path contains spaces.
157 | 
158 | Output filename format:
159 | - Without `--output-prefix`: `domain_path_docs_content_timestamp.md` (e.g., `cloudflare_agents_docs_content_20240323_223656.md`)
160 | - With `--output-prefix`: `custom_prefix_docs_content_timestamp.md` (e.g., `custom_name_docs_content_20240323_223656.md`)
161 | 
162 | The crawler accepts two types of input files:
163 | 1. Text file with one URL per line:
164 | ```text
165 | https://docs.example.com/page1
166 | https://docs.example.com/page2
167 | https://docs.example.com/page3
168 | ```
169 | 
170 | 2. JSON file (compatible with menu crawler output):
171 | ```json
172 | {
173 |     "menu_links": [
174 |         "https://docs.example.com/page1",
175 |         "https://docs.example.com/page2"
176 |     ]
177 | }
178 | ```
179 | 
180 | ### 3. Sitemap Crawler
181 | 
182 | ```bash
183 | python sitemap_crawler.py https://docs.example.com/sitemap.xml
184 | ```
185 | 
186 | Options:
187 | - `--max-depth`: Maximum sitemap recursion depth (optional)
188 | - `--patterns`: URL patterns to include (optional)
189 | 
190 | ### 4. Menu Crawler
191 | 
192 | ```bash
193 | python menu_crawler.py https://docs.example.com
194 | ```
195 | 
196 | Options:
197 | - `--selectors`: Custom menu selectors (optional)
198 | 
199 | The menu crawler now saves its output to the `input_files` directory, making it ready for immediate use with the multi-url crawler. The output JSON has this format:
200 | ```json
201 | {
202 |     "start_url": "https://docs.example.com/",
203 |     "total_links_found": 42,
204 |     "menu_links": [
205 |         "https://docs.example.com/page1",
206 |         "https://docs.example.com/page2"
207 |     ]
208 | }
209 | ```
210 | 
211 | After running the menu crawler, you'll get a command to run the multi-url crawler with the generated file.
212 | 
213 | ## Directory Structure
214 | 
215 | ```
216 | crawl4ai_docs_scraper/
217 | ├── input_files/           # Input files for URL processing
218 | │   ├── urls.txt          # Text file with URLs
219 | │   └── menu_links.json   # JSON output from menu crawler
220 | ├── scraped_docs/         # Output directory for markdown files
221 | │   └── docs_timestamp.md # Generated documentation
222 | ├── multi_url_crawler.py
223 | ├── menu_crawler.py
224 | └── requirements.txt
225 | ```
226 | 
227 | ## Error Handling
228 | 
229 | All crawlers include comprehensive error handling with colored terminal output:
230 | - 🟢 Green: Success messages
231 | - 🔵 Cyan: Processing status
232 | - 🟡 Yellow: Warnings
233 | - 🔴 Red: Error messages
234 | 
235 | ## Contributing
236 | 
237 | Contributions are welcome! Please feel free to submit a Pull Request.
238 | 
239 | ## License
240 | 
241 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
242 | 
243 | ## Attribution
244 | 
245 | This project uses [Crawl4AI](https://github.com/unclecode/crawl4ai) for web data extraction.
246 | 
247 | ## Acknowledgments
248 | 
249 | - Built with [Crawl4AI](https://github.com/unclecode/crawl4ai)
250 | - Uses [termcolor](https://pypi.org/project/termcolor/) for colorful terminal output
```
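
For reference, the two multi-URL input formats described in the README above can be loaded with a few lines of Python. This is a minimal, illustrative sketch (the helper name and file path are hypothetical, not part of the repository):

```python
# Hypothetical helper: load URLs from either input format the README documents,
# a .txt file with one URL per line or a .json file with a "menu_links" array.
import json
from pathlib import Path
from typing import List

def load_urls(path: str) -> List[str]:
    """Return the list of URLs contained in a .txt or menu-crawler .json file."""
    file = Path(path)
    if file.suffix == ".json":
        data = json.loads(file.read_text(encoding="utf-8"))
        return list(data.get("menu_links", []))
    # Plain text: one URL per line, blank lines ignored
    return [line.strip() for line in file.read_text(encoding="utf-8").splitlines() if line.strip()]

# Example call (path is illustrative):
print(load_urls("input_files/menu_links.json"))
```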

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------

```
1 | crawl4ai
2 | aiohttp
3 | termcolor
4 | playwright
```

--------------------------------------------------------------------------------
/src/docs_scraper/utils/__init__.py:
--------------------------------------------------------------------------------

```python
 1 | """
 2 | Utility modules for web crawling and HTML parsing.
 3 | """
 4 | from .request_handler import RequestHandler
 5 | from .html_parser import HTMLParser
 6 | 
 7 | __all__ = [
 8 |     'RequestHandler',
 9 |     'HTMLParser'
10 | ] 
```

--------------------------------------------------------------------------------
/src/docs_scraper/__init__.py:
--------------------------------------------------------------------------------

```python
1 | """
2 | Documentation scraper MCP server package.
3 | """
4 | # Import subpackages but not modules to avoid circular imports
5 | from . import crawlers
6 | from . import utils
7 | 
8 | # Expose important items at package level
9 | __all__ = ['crawlers', 'utils'] 
```

--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/__init__.py:
--------------------------------------------------------------------------------

```python
 1 | """
 2 | Web crawler implementations for documentation scraping.
 3 | """
 4 | from .single_url_crawler import SingleURLCrawler
 5 | from .multi_url_crawler import MultiURLCrawler
 6 | from .sitemap_crawler import SitemapCrawler
 7 | from .menu_crawler import MenuCrawler
 8 | 
 9 | __all__ = [
10 |     'SingleURLCrawler',
11 |     'MultiURLCrawler',
12 |     'SitemapCrawler',
13 |     'MenuCrawler'
14 | ] 
```
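
The crawler classes exported above are wired together with the utilities from `docs_scraper.utils`. Below is a sketch of that wiring, inferred from the test suite later on this page; `single_url_crawler.py` itself is not shown here, so the constructor arguments and the shape of the result dict are assumptions. Note that `HTMLParser` takes a required `base_url` per `utils/html_parser.py`, even though the tests instantiate it without arguments:

```python
# Illustrative wiring only; constructor and result shape inferred from tests/test_crawlers.
import asyncio
import aiohttp
from docs_scraper.crawlers import SingleURLCrawler
from docs_scraper.utils import HTMLParser, RequestHandler

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        request_handler = RequestHandler(session=session)
        html_parser = HTMLParser(base_url="https://docs.example.com")
        crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
        result = await crawler.crawl("https://docs.example.com/page")
        if result["success"]:
            print(result["metadata"].get("title"), "-", len(result["links"]), "links")
        else:
            print("Crawl failed:", result["error"])

asyncio.run(main())
```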

--------------------------------------------------------------------------------
/.venv/Scripts/deactivate.bat:
--------------------------------------------------------------------------------

```
 1 | @echo off
 2 | 
 3 | if defined _OLD_VIRTUAL_PROMPT (
 4 |     set "PROMPT=%_OLD_VIRTUAL_PROMPT%"
 5 | )
 6 | set _OLD_VIRTUAL_PROMPT=
 7 | 
 8 | if defined _OLD_VIRTUAL_PYTHONHOME (
 9 |     set "PYTHONHOME=%_OLD_VIRTUAL_PYTHONHOME%"
10 |     set _OLD_VIRTUAL_PYTHONHOME=
11 | )
12 | 
13 | if defined _OLD_VIRTUAL_PATH (
14 |     set "PATH=%_OLD_VIRTUAL_PATH%"
15 | )
16 | 
17 | set _OLD_VIRTUAL_PATH=
18 | 
19 | set VIRTUAL_ENV=
20 | set VIRTUAL_ENV_PROMPT=
21 | 
22 | :END
23 | 
```

--------------------------------------------------------------------------------
/src/docs_scraper/cli.py:
--------------------------------------------------------------------------------

```python
 1 | """
 2 | Command line interface for the docs_scraper package.
 3 | """
 4 | import logging
 5 | 
 6 | def main():
 7 |     """Entry point for the package when run from the command line."""
 8 |     from docs_scraper.server import main as server_main
 9 |     
10 |     # Configure logging
11 |     logging.basicConfig(
12 |         level=logging.INFO,
13 |         format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
14 |     )
15 |     
16 |     # Run the server
17 |     server_main()
18 | 
19 | if __name__ == "__main__":
20 |     main() 
```

--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------

```toml
 1 | [build-system]
 2 | requires = ["setuptools>=61.0"]
 3 | build-backend = "setuptools.build_meta"
 4 | 
 5 | [project]
 6 | name = "docs_scraper"
 7 | version = "0.1.0"
 8 | authors = [
 9 |     { name = "Your Name", email = "[email protected]" }
10 | ]
11 | description = "A documentation scraping tool"
12 | requires-python = ">=3.7"
13 | dependencies = [
14 |     "beautifulsoup4",
15 |     "requests",
16 |     "aiohttp",
17 |     "lxml",
18 |     "termcolor",
19 |     "crawl4ai"
20 | ]
21 | classifiers = [
22 |     "Programming Language :: Python :: 3",
23 |     "Operating System :: OS Independent",
24 | ]
25 | 
26 | [project.optional-dependencies]
27 | test = [
28 |     "pytest",
29 |     "pytest-asyncio",
30 |     "aioresponses"
31 | ]
32 | 
33 | [project.scripts]
34 | docs-scraper = "docs_scraper.cli:main"
35 | 
36 | [tool.setuptools.packages.find]
37 | where = ["src"]
38 | include = ["docs_scraper*"]
39 | namespaces = false
40 | 
41 | [tool.hatch.build]
42 | packages = ["src/docs_scraper"] 
```

--------------------------------------------------------------------------------
/.venv/Scripts/activate.bat:
--------------------------------------------------------------------------------

```
 1 | @echo off
 2 | 
 3 | rem This file is UTF-8 encoded, so we need to update the current code page while executing it
 4 | for /f "tokens=2 delims=:." %%a in ('"%SystemRoot%\System32\chcp.com"') do (
 5 |     set _OLD_CODEPAGE=%%a
 6 | )
 7 | if defined _OLD_CODEPAGE (
 8 |     "%SystemRoot%\System32\chcp.com" 65001 > nul
 9 | )
10 | 
11 | set "VIRTUAL_ENV=D:\AI-DEV\mcp\docs_scraper_mcp\.venv"
12 | 
13 | if not defined PROMPT set PROMPT=$P$G
14 | 
15 | if defined _OLD_VIRTUAL_PROMPT set PROMPT=%_OLD_VIRTUAL_PROMPT%
16 | if defined _OLD_VIRTUAL_PYTHONHOME set PYTHONHOME=%_OLD_VIRTUAL_PYTHONHOME%
17 | 
18 | set _OLD_VIRTUAL_PROMPT=%PROMPT%
19 | set PROMPT=(.venv) %PROMPT%
20 | 
21 | if defined PYTHONHOME set _OLD_VIRTUAL_PYTHONHOME=%PYTHONHOME%
22 | set PYTHONHOME=
23 | 
24 | if defined _OLD_VIRTUAL_PATH set PATH=%_OLD_VIRTUAL_PATH%
25 | if not defined _OLD_VIRTUAL_PATH set _OLD_VIRTUAL_PATH=%PATH%
26 | 
27 | set "PATH=%VIRTUAL_ENV%\Scripts;%PATH%"
28 | set "VIRTUAL_ENV_PROMPT=(.venv) "
29 | 
30 | :END
31 | if defined _OLD_CODEPAGE (
32 |     "%SystemRoot%\System32\chcp.com" %_OLD_CODEPAGE% > nul
33 |     set _OLD_CODEPAGE=
34 | )
35 | 
```

--------------------------------------------------------------------------------
/tests/conftest.py:
--------------------------------------------------------------------------------

```python
  1 | """
  2 | Test configuration and fixtures for the docs_scraper package.
  3 | """
  4 | import os
  5 | import pytest
  6 | import aiohttp
  7 | from typing import AsyncGenerator, Dict, Any
  8 | from aioresponses import aioresponses
  9 | from bs4 import BeautifulSoup
 10 | 
 11 | @pytest.fixture
 12 | def mock_aiohttp() -> aioresponses:
 13 |     """Fixture for mocking aiohttp requests."""
 14 |     with aioresponses() as m:
 15 |         yield m
 16 | 
 17 | @pytest.fixture
 18 | def sample_html() -> str:
 19 |     """Sample HTML content for testing."""
 20 |     return """
 21 |     <!DOCTYPE html>
 22 |     <html>
 23 |     <head>
 24 |         <title>Test Page</title>
 25 |         <meta name="description" content="Test description">
 26 |     </head>
 27 |     <body>
 28 |         <nav class="menu">
 29 |             <ul>
 30 |                 <li><a href="/page1">Page 1</a></li>
 31 |                 <li>
 32 |                     <a href="/section1">Section 1</a>
 33 |                     <ul>
 34 |                         <li><a href="/section1/page1">Section 1.1</a></li>
 35 |                         <li><a href="/section1/page2">Section 1.2</a></li>
 36 |                     </ul>
 37 |                 </li>
 38 |             </ul>
 39 |         </nav>
 40 |         <main>
 41 |             <h1>Welcome</h1>
 42 |             <p>Test content</p>
 43 |             <a href="/test1">Link 1</a>
 44 |             <a href="/test2">Link 2</a>
 45 |         </main>
 46 |     </body>
 47 |     </html>
 48 |     """
 49 | 
 50 | @pytest.fixture
 51 | def sample_sitemap() -> str:
 52 |     """Sample sitemap.xml content for testing."""
 53 |     return """<?xml version="1.0" encoding="UTF-8"?>
 54 |     <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 55 |         <url>
 56 |             <loc>https://example.com/</loc>
 57 |             <lastmod>2024-03-24</lastmod>
 58 |         </url>
 59 |         <url>
 60 |             <loc>https://example.com/page1</loc>
 61 |             <lastmod>2024-03-24</lastmod>
 62 |         </url>
 63 |         <url>
 64 |             <loc>https://example.com/page2</loc>
 65 |             <lastmod>2024-03-24</lastmod>
 66 |         </url>
 67 |     </urlset>
 68 |     """
 69 | 
 70 | @pytest.fixture
 71 | def mock_website(mock_aiohttp, sample_html, sample_sitemap) -> None:
 72 |     """Set up a mock website with various pages and a sitemap."""
 73 |     base_url = "https://example.com"
 74 |     pages = {
 75 |         "/": sample_html,
 76 |         "/page1": sample_html.replace("Test Page", "Page 1"),
 77 |         "/page2": sample_html.replace("Test Page", "Page 2"),
 78 |         "/section1": sample_html.replace("Test Page", "Section 1"),
 79 |         "/section1/page1": sample_html.replace("Test Page", "Section 1.1"),
 80 |         "/section1/page2": sample_html.replace("Test Page", "Section 1.2"),
 81 |         "/robots.txt": "User-agent: *\nAllow: /",
 82 |         "/sitemap.xml": sample_sitemap
 83 |     }
 84 |     
 85 |     for path, content in pages.items():
 86 |         mock_aiohttp.get(f"{base_url}{path}", status=200, body=content)
 87 | 
 88 | @pytest.fixture
 89 | async def aiohttp_session() -> AsyncGenerator[aiohttp.ClientSession, None]:
 90 |     """Create an aiohttp ClientSession for testing."""
 91 |     async with aiohttp.ClientSession() as session:
 92 |         yield session
 93 | 
 94 | @pytest.fixture
 95 | def test_urls() -> Dict[str, Any]:
 96 |     """Test URLs and related data for testing."""
 97 |     base_url = "https://example.com"
 98 |     return {
 99 |         "base_url": base_url,
100 |         "valid_urls": [
101 |             f"{base_url}/",
102 |             f"{base_url}/page1",
103 |             f"{base_url}/page2"
104 |         ],
105 |         "invalid_urls": [
106 |             "not_a_url",
107 |             "ftp://example.com",
108 |             "https://nonexistent.example.com"
109 |         ],
110 |         "menu_selector": "nav.menu",
111 |         "sitemap_url": f"{base_url}/sitemap.xml"
112 |     } 
```
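
The fixtures above build on `aioresponses`, which patches `aiohttp` so that registered URLs are answered with canned responses instead of real network calls. A standalone illustration of that pattern (URL and body are examples):

```python
# Standalone example of the mocking pattern used by the mock_aiohttp/mock_website fixtures.
import asyncio
import aiohttp
from aioresponses import aioresponses

async def main() -> None:
    with aioresponses() as mocked:
        # Any aiohttp GET to this URL now returns the registered canned response.
        mocked.get("https://example.com/", status=200, body="<html><title>Test Page</title></html>")
        async with aiohttp.ClientSession() as session:
            async with session.get("https://example.com/") as response:
                print(response.status, await response.text())

asyncio.run(main())
```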

--------------------------------------------------------------------------------
/tests/test_crawlers/test_single_url_crawler.py:
--------------------------------------------------------------------------------

```python
  1 | """
  2 | Tests for the SingleURLCrawler class.
  3 | """
  4 | import pytest
  5 | from docs_scraper.crawlers import SingleURLCrawler
  6 | from docs_scraper.utils import RequestHandler, HTMLParser
  7 | 
  8 | @pytest.mark.asyncio
  9 | async def test_single_url_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
 10 |     """Test successful crawling of a single URL."""
 11 |     url = test_urls["valid_urls"][0]
 12 |     request_handler = RequestHandler(session=aiohttp_session)
 13 |     html_parser = HTMLParser()
 14 |     crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
 15 |     
 16 |     result = await crawler.crawl(url)
 17 |     
 18 |     assert result["success"] is True
 19 |     assert result["url"] == url
 20 |     assert "content" in result
 21 |     assert "title" in result["metadata"]
 22 |     assert "description" in result["metadata"]
 23 |     assert len(result["links"]) > 0
 24 |     assert result["status_code"] == 200
 25 |     assert result["error"] is None
 26 | 
 27 | @pytest.mark.asyncio
 28 | async def test_single_url_crawler_invalid_url(mock_website, test_urls, aiohttp_session):
 29 |     """Test crawling with an invalid URL."""
 30 |     url = test_urls["invalid_urls"][0]
 31 |     request_handler = RequestHandler(session=aiohttp_session)
 32 |     html_parser = HTMLParser()
 33 |     crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
 34 |     
 35 |     result = await crawler.crawl(url)
 36 |     
 37 |     assert result["success"] is False
 38 |     assert result["url"] == url
 39 |     assert result["content"] is None
 40 |     assert result["metadata"] == {}
 41 |     assert result["links"] == []
 42 |     assert result["error"] is not None
 43 | 
 44 | @pytest.mark.asyncio
 45 | async def test_single_url_crawler_nonexistent_url(mock_website, test_urls, aiohttp_session):
 46 |     """Test crawling a URL that doesn't exist."""
 47 |     url = test_urls["invalid_urls"][2]
 48 |     request_handler = RequestHandler(session=aiohttp_session)
 49 |     html_parser = HTMLParser()
 50 |     crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
 51 |     
 52 |     result = await crawler.crawl(url)
 53 |     
 54 |     assert result["success"] is False
 55 |     assert result["url"] == url
 56 |     assert result["content"] is None
 57 |     assert result["metadata"] == {}
 58 |     assert result["links"] == []
 59 |     assert result["error"] is not None
 60 | 
 61 | @pytest.mark.asyncio
 62 | async def test_single_url_crawler_metadata_extraction(mock_website, test_urls, aiohttp_session):
 63 |     """Test extraction of metadata from a crawled page."""
 64 |     url = test_urls["valid_urls"][0]
 65 |     request_handler = RequestHandler(session=aiohttp_session)
 66 |     html_parser = HTMLParser()
 67 |     crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
 68 |     
 69 |     result = await crawler.crawl(url)
 70 |     
 71 |     assert result["success"] is True
 72 |     assert result["metadata"]["title"] == "Test Page"
 73 |     assert result["metadata"]["description"] == "Test description"
 74 | 
 75 | @pytest.mark.asyncio
 76 | async def test_single_url_crawler_link_extraction(mock_website, test_urls, aiohttp_session):
 77 |     """Test extraction of links from a crawled page."""
 78 |     url = test_urls["valid_urls"][0]
 79 |     request_handler = RequestHandler(session=aiohttp_session)
 80 |     html_parser = HTMLParser()
 81 |     crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
 82 |     
 83 |     result = await crawler.crawl(url)
 84 |     
 85 |     assert result["success"] is True
 86 |     assert len(result["links"]) >= 6  # Number of links in sample HTML
 87 |     assert "/page1" in result["links"]
 88 |     assert "/section1" in result["links"]
 89 |     assert "/test1" in result["links"]
 90 |     assert "/test2" in result["links"]
 91 | 
 92 | @pytest.mark.asyncio
 93 | async def test_single_url_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
 94 |     """Test rate limiting functionality."""
 95 |     url = test_urls["valid_urls"][0]
 96 |     request_handler = RequestHandler(session=aiohttp_session, rate_limit=1)  # 1 request per second
 97 |     html_parser = HTMLParser()
 98 |     crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
 99 |     
100 |     import time
101 |     start_time = time.time()
102 |     
103 |     # Make multiple requests
104 |     for _ in range(3):
105 |         result = await crawler.crawl(url)
106 |         assert result["success"] is True
107 |     
108 |     end_time = time.time()
109 |     elapsed_time = end_time - start_time
110 |     
111 |     # Should take at least 2 seconds due to rate limiting
112 |     assert elapsed_time >= 2.0 
```

--------------------------------------------------------------------------------
/tests/test_crawlers/test_multi_url_crawler.py:
--------------------------------------------------------------------------------

```python
  1 | """
  2 | Tests for the MultiURLCrawler class.
  3 | """
  4 | import pytest
  5 | from docs_scraper.crawlers import MultiURLCrawler
  6 | from docs_scraper.utils import RequestHandler, HTMLParser
  7 | 
  8 | @pytest.mark.asyncio
  9 | async def test_multi_url_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
 10 |     """Test successful crawling of multiple URLs."""
 11 |     urls = test_urls["valid_urls"]
 12 |     request_handler = RequestHandler(session=aiohttp_session)
 13 |     html_parser = HTMLParser()
 14 |     crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
 15 |     
 16 |     results = await crawler.crawl(urls)
 17 |     
 18 |     assert len(results) == len(urls)
 19 |     for result, url in zip(results, urls):
 20 |         assert result["success"] is True
 21 |         assert result["url"] == url
 22 |         assert "content" in result
 23 |         assert "title" in result["metadata"]
 24 |         assert "description" in result["metadata"]
 25 |         assert len(result["links"]) > 0
 26 |         assert result["status_code"] == 200
 27 |         assert result["error"] is None
 28 | 
 29 | @pytest.mark.asyncio
 30 | async def test_multi_url_crawler_mixed_urls(mock_website, test_urls, aiohttp_session):
 31 |     """Test crawling a mix of valid and invalid URLs."""
 32 |     urls = test_urls["valid_urls"][:1] + test_urls["invalid_urls"][:1]
 33 |     request_handler = RequestHandler(session=aiohttp_session)
 34 |     html_parser = HTMLParser()
 35 |     crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
 36 |     
 37 |     results = await crawler.crawl(urls)
 38 |     
 39 |     assert len(results) == len(urls)
 40 |     # Valid URL
 41 |     assert results[0]["success"] is True
 42 |     assert results[0]["url"] == urls[0]
 43 |     assert "content" in results[0]
 44 |     # Invalid URL
 45 |     assert results[1]["success"] is False
 46 |     assert results[1]["url"] == urls[1]
 47 |     assert results[1]["content"] is None
 48 | 
 49 | @pytest.mark.asyncio
 50 | async def test_multi_url_crawler_concurrent_limit(mock_website, test_urls, aiohttp_session):
 51 |     """Test concurrent request limiting."""
 52 |     urls = test_urls["valid_urls"] * 2  # Duplicate URLs to have more requests
 53 |     request_handler = RequestHandler(session=aiohttp_session)
 54 |     html_parser = HTMLParser()
 55 |     crawler = MultiURLCrawler(
 56 |         request_handler=request_handler,
 57 |         html_parser=html_parser,
 58 |         concurrent_limit=2
 59 |     )
 60 |     
 61 |     import time
 62 |     start_time = time.time()
 63 |     
 64 |     results = await crawler.crawl(urls)
 65 |     
 66 |     end_time = time.time()
 67 |     elapsed_time = end_time - start_time
 68 |     
 69 |     assert len(results) == len(urls)
 70 |     # With concurrent_limit=2, processing 6 URLs should take at least 3 time units
 71 |     assert elapsed_time >= (len(urls) / 2) * 0.1  # Assuming each request takes ~0.1s
 72 | 
 73 | @pytest.mark.asyncio
 74 | async def test_multi_url_crawler_empty_urls(mock_website, aiohttp_session):
 75 |     """Test crawling with empty URL list."""
 76 |     request_handler = RequestHandler(session=aiohttp_session)
 77 |     html_parser = HTMLParser()
 78 |     crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
 79 |     
 80 |     results = await crawler.crawl([])
 81 |     
 82 |     assert len(results) == 0
 83 | 
 84 | @pytest.mark.asyncio
 85 | async def test_multi_url_crawler_duplicate_urls(mock_website, test_urls, aiohttp_session):
 86 |     """Test crawling with duplicate URLs."""
 87 |     url = test_urls["valid_urls"][0]
 88 |     urls = [url, url, url]  # Same URL multiple times
 89 |     request_handler = RequestHandler(session=aiohttp_session)
 90 |     html_parser = HTMLParser()
 91 |     crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
 92 |     
 93 |     results = await crawler.crawl(urls)
 94 |     
 95 |     assert len(results) == len(urls)
 96 |     for result in results:
 97 |         assert result["success"] is True
 98 |         assert result["url"] == url
 99 |         assert result["metadata"]["title"] == "Test Page"
100 | 
101 | @pytest.mark.asyncio
102 | async def test_multi_url_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
103 |     """Test rate limiting with multiple URLs."""
104 |     urls = test_urls["valid_urls"]
105 |     request_handler = RequestHandler(session=aiohttp_session, rate_limit=1)  # 1 request per second
106 |     html_parser = HTMLParser()
107 |     crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
108 |     
109 |     import time
110 |     start_time = time.time()
111 |     
112 |     results = await crawler.crawl(urls)
113 |     
114 |     end_time = time.time()
115 |     elapsed_time = end_time - start_time
116 |     
117 |     assert len(results) == len(urls)
118 |     # Should take at least (len(urls) - 1) seconds due to rate limiting
119 |     assert elapsed_time >= len(urls) - 1 
```

--------------------------------------------------------------------------------
/.venv/Include/site/python3.12/greenlet/greenlet.h:
--------------------------------------------------------------------------------

```
  1 | /* -*- indent-tabs-mode: nil; tab-width: 4; -*- */
  2 | 
  3 | /* Greenlet object interface */
  4 | 
  5 | #ifndef Py_GREENLETOBJECT_H
  6 | #define Py_GREENLETOBJECT_H
  7 | 
  8 | 
  9 | #include <Python.h>
 10 | 
 11 | #ifdef __cplusplus
 12 | extern "C" {
 13 | #endif
 14 | 
 15 | /* This is deprecated and undocumented. It does not change. */
 16 | #define GREENLET_VERSION "1.0.0"
 17 | 
 18 | #ifndef GREENLET_MODULE
 19 | #define implementation_ptr_t void*
 20 | #endif
 21 | 
 22 | typedef struct _greenlet {
 23 |     PyObject_HEAD
 24 |     PyObject* weakreflist;
 25 |     PyObject* dict;
 26 |     implementation_ptr_t pimpl;
 27 | } PyGreenlet;
 28 | 
 29 | #define PyGreenlet_Check(op) (op && PyObject_TypeCheck(op, &PyGreenlet_Type))
 30 | 
 31 | 
 32 | /* C API functions */
 33 | 
 34 | /* Total number of symbols that are exported */
 35 | #define PyGreenlet_API_pointers 12
 36 | 
 37 | #define PyGreenlet_Type_NUM 0
 38 | #define PyExc_GreenletError_NUM 1
 39 | #define PyExc_GreenletExit_NUM 2
 40 | 
 41 | #define PyGreenlet_New_NUM 3
 42 | #define PyGreenlet_GetCurrent_NUM 4
 43 | #define PyGreenlet_Throw_NUM 5
 44 | #define PyGreenlet_Switch_NUM 6
 45 | #define PyGreenlet_SetParent_NUM 7
 46 | 
 47 | #define PyGreenlet_MAIN_NUM 8
 48 | #define PyGreenlet_STARTED_NUM 9
 49 | #define PyGreenlet_ACTIVE_NUM 10
 50 | #define PyGreenlet_GET_PARENT_NUM 11
 51 | 
 52 | #ifndef GREENLET_MODULE
 53 | /* This section is used by modules that uses the greenlet C API */
 54 | static void** _PyGreenlet_API = NULL;
 55 | 
 56 | #    define PyGreenlet_Type \
 57 |         (*(PyTypeObject*)_PyGreenlet_API[PyGreenlet_Type_NUM])
 58 | 
 59 | #    define PyExc_GreenletError \
 60 |         ((PyObject*)_PyGreenlet_API[PyExc_GreenletError_NUM])
 61 | 
 62 | #    define PyExc_GreenletExit \
 63 |         ((PyObject*)_PyGreenlet_API[PyExc_GreenletExit_NUM])
 64 | 
 65 | /*
 66 |  * PyGreenlet_New(PyObject *args)
 67 |  *
 68 |  * greenlet.greenlet(run, parent=None)
 69 |  */
 70 | #    define PyGreenlet_New                                        \
 71 |         (*(PyGreenlet * (*)(PyObject * run, PyGreenlet * parent)) \
 72 |              _PyGreenlet_API[PyGreenlet_New_NUM])
 73 | 
 74 | /*
 75 |  * PyGreenlet_GetCurrent(void)
 76 |  *
 77 |  * greenlet.getcurrent()
 78 |  */
 79 | #    define PyGreenlet_GetCurrent \
 80 |         (*(PyGreenlet * (*)(void)) _PyGreenlet_API[PyGreenlet_GetCurrent_NUM])
 81 | 
 82 | /*
 83 |  * PyGreenlet_Throw(
 84 |  *         PyGreenlet *greenlet,
 85 |  *         PyObject *typ,
 86 |  *         PyObject *val,
 87 |  *         PyObject *tb)
 88 |  *
 89 |  * g.throw(...)
 90 |  */
 91 | #    define PyGreenlet_Throw                 \
 92 |         (*(PyObject * (*)(PyGreenlet * self, \
 93 |                           PyObject * typ,    \
 94 |                           PyObject * val,    \
 95 |                           PyObject * tb))    \
 96 |              _PyGreenlet_API[PyGreenlet_Throw_NUM])
 97 | 
 98 | /*
 99 |  * PyGreenlet_Switch(PyGreenlet *greenlet, PyObject *args)
100 |  *
101 |  * g.switch(*args, **kwargs)
102 |  */
103 | #    define PyGreenlet_Switch                                              \
104 |         (*(PyObject *                                                      \
105 |            (*)(PyGreenlet * greenlet, PyObject * args, PyObject * kwargs)) \
106 |              _PyGreenlet_API[PyGreenlet_Switch_NUM])
107 | 
108 | /*
109 |  * PyGreenlet_SetParent(PyObject *greenlet, PyObject *new_parent)
110 |  *
111 |  * g.parent = new_parent
112 |  */
113 | #    define PyGreenlet_SetParent                                 \
114 |         (*(int (*)(PyGreenlet * greenlet, PyGreenlet * nparent)) \
115 |              _PyGreenlet_API[PyGreenlet_SetParent_NUM])
116 | 
117 | /*
118 |  * PyGreenlet_GetParent(PyObject* greenlet)
119 |  *
120 |  * return greenlet.parent;
121 |  *
122 |  * This could return NULL even if there is no exception active.
123 |  * If it does not return NULL, you are responsible for decrementing the
124 |  * reference count.
125 |  */
126 | #     define PyGreenlet_GetParent                                    \
127 |     (*(PyGreenlet* (*)(PyGreenlet*))                                 \
128 |      _PyGreenlet_API[PyGreenlet_GET_PARENT_NUM])
129 | 
130 | /*
131 |  * deprecated, undocumented alias.
132 |  */
133 | #     define PyGreenlet_GET_PARENT PyGreenlet_GetParent
134 | 
135 | #     define PyGreenlet_MAIN                                         \
136 |     (*(int (*)(PyGreenlet*))                                         \
137 |      _PyGreenlet_API[PyGreenlet_MAIN_NUM])
138 | 
139 | #     define PyGreenlet_STARTED                                      \
140 |     (*(int (*)(PyGreenlet*))                                         \
141 |      _PyGreenlet_API[PyGreenlet_STARTED_NUM])
142 | 
143 | #     define PyGreenlet_ACTIVE                                       \
144 |     (*(int (*)(PyGreenlet*))                                         \
145 |      _PyGreenlet_API[PyGreenlet_ACTIVE_NUM])
146 | 
147 | 
148 | 
149 | 
150 | /* Macro that imports greenlet and initializes C API */
151 | /* NOTE: This has actually moved to ``greenlet._greenlet._C_API``, but we
152 |    keep the older definition to be sure older code that might have a copy of
153 |    the header still works. */
154 | #    define PyGreenlet_Import()                                               \
155 |         {                                                                     \
156 |             _PyGreenlet_API = (void**)PyCapsule_Import("greenlet._C_API", 0); \
157 |         }
158 | 
159 | #endif /* GREENLET_MODULE */
160 | 
161 | #ifdef __cplusplus
162 | }
163 | #endif
164 | #endif /* !Py_GREENLETOBJECT_H */
165 | 
```

--------------------------------------------------------------------------------
/src/docs_scraper/utils/html_parser.py:
--------------------------------------------------------------------------------

```python
  1 | """
  2 | HTML parser module for extracting content and links from HTML documents.
  3 | """
  4 | from typing import List, Dict, Any, Optional
  5 | from bs4 import BeautifulSoup
  6 | from urllib.parse import urljoin, urlparse
  7 | 
  8 | class HTMLParser:
  9 |     def __init__(self, base_url: str):
 10 |         """
 11 |         Initialize the HTML parser.
 12 |         
 13 |         Args:
 14 |             base_url: Base URL for resolving relative links
 15 |         """
 16 |         self.base_url = base_url
 17 | 
 18 |     def parse_content(self, html: str) -> Dict[str, Any]:
 19 |         """
 20 |         Parse HTML content and extract useful information.
 21 |         
 22 |         Args:
 23 |             html: Raw HTML content
 24 |             
 25 |         Returns:
 26 |             Dict containing:
 27 |                 - title: Page title
 28 |                 - description: Meta description
 29 |                 - text_content: Main text content
 30 |                 - links: List of links found
 31 |                 - headers: List of headers found
 32 |         """
 33 |         soup = BeautifulSoup(html, 'lxml')
 34 |         
 35 |         # Extract title
 36 |         title = soup.title.string if soup.title else None
 37 |         
 38 |         # Extract meta description
 39 |         meta_desc = None
 40 |         meta_tag = soup.find('meta', attrs={'name': 'description'})
 41 |         if meta_tag:
 42 |             meta_desc = meta_tag.get('content')
 43 |         
 44 |         # Extract main content (remove script, style, etc.)
 45 |         for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
 46 |             tag.decompose()
 47 |         
 48 |         # Get text content
 49 |         text_content = ' '.join(soup.stripped_strings)
 50 |         
 51 |         # Extract headers
 52 |         headers = []
 53 |         for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
 54 |             headers.append({
 55 |                 'level': int(tag.name[1]),
 56 |                 'text': tag.get_text(strip=True)
 57 |             })
 58 |         
 59 |         # Extract links
 60 |         links = self._extract_links(soup)
 61 |         
 62 |         return {
 63 |             'title': title,
 64 |             'description': meta_desc,
 65 |             'text_content': text_content,
 66 |             'links': links,
 67 |             'headers': headers
 68 |         }
 69 | 
 70 |     def parse_menu(self, html: str, menu_selector: str) -> List[Dict[str, Any]]:
 71 |         """
 72 |         Parse navigation menu from HTML using a CSS selector.
 73 |         
 74 |         Args:
 75 |             html: Raw HTML content
 76 |             menu_selector: CSS selector for the menu element
 77 |             
 78 |         Returns:
 79 |             List of menu items with their structure
 80 |         """
 81 |         soup = BeautifulSoup(html, 'lxml')
 82 |         menu = soup.select_one(menu_selector)
 83 |         
 84 |         if not menu:
 85 |             return []
 86 |             
 87 |         return self._extract_menu_items(menu)
 88 | 
 89 |     def _extract_links(self, soup: BeautifulSoup) -> List[Dict[str, str]]:
 90 |         """Extract and normalize all links from the document."""
 91 |         links = []
 92 |         for a in soup.find_all('a', href=True):
 93 |             href = a['href']
 94 |             text = a.get_text(strip=True)
 95 |             
 96 |             # Skip empty or javascript links
 97 |             if not href or href.startswith(('javascript:', '#')):
 98 |                 continue
 99 |                 
100 |             # Resolve relative URLs
101 |             absolute_url = urljoin(self.base_url, href)
102 |             
103 |             # Only include links to the same domain
104 |             if urlparse(absolute_url).netloc == urlparse(self.base_url).netloc:
105 |                 links.append({
106 |                     'url': absolute_url,
107 |                     'text': text
108 |                 })
109 |                 
110 |         return links
111 | 
112 |     def _extract_menu_items(self, element: BeautifulSoup) -> List[Dict[str, Any]]:
113 |         """Recursively extract menu structure."""
114 |         items = []
115 |         
116 |         for item in element.find_all(['li', 'a'], recursive=False):
117 |             if item.name == 'a':
118 |                 # Single link item
119 |                 href = item.get('href')
120 |                 if href and not href.startswith(('javascript:', '#')):
121 |                     items.append({
122 |                         'type': 'link',
123 |                         'url': urljoin(self.base_url, href),
124 |                         'text': item.get_text(strip=True)
125 |                     })
126 |             else:
127 |                 # Potentially nested menu item
128 |                 link = item.find('a')
129 |                 if link and link.get('href'):
130 |                     menu_item = {
131 |                         'type': 'menu',
132 |                         'text': link.get_text(strip=True),
133 |                         'url': urljoin(self.base_url, link['href']),
134 |                         'children': []
135 |                     }
136 |                     
137 |                     # Look for nested lists
138 |                     nested = item.find(['ul', 'ol'])
139 |                     if nested:
140 |                         menu_item['children'] = self._extract_menu_items(nested)
141 |                         
142 |                     items.append(menu_item)
143 |                     
144 |         return items 
```
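
A quick usage sketch for `HTMLParser.parse_content`, based directly on the implementation above (the inline HTML is just an example):

```python
from docs_scraper.utils import HTMLParser

html = """
<html>
  <head><title>Example</title><meta name="description" content="Demo page"></head>
  <body><main><h1>Welcome</h1><p>Hello</p><a href="/docs">Docs</a></main></body>
</html>
"""

parser = HTMLParser(base_url="https://example.com")
result = parser.parse_content(html)
print(result["title"])        # "Example"
print(result["description"])  # "Demo page"
print(result["headers"])      # [{'level': 1, 'text': 'Welcome'}]
print(result["links"])        # [{'url': 'https://example.com/docs', 'text': 'Docs'}]
```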

--------------------------------------------------------------------------------
/tests/test_utils/test_request_handler.py:
--------------------------------------------------------------------------------

```python
  1 | """
  2 | Tests for the RequestHandler class.
  3 | """
  4 | import asyncio
  5 | import pytest
  6 | import aiohttp
  7 | import time
  8 | from docs_scraper.utils import RequestHandler
  9 | 
 10 | @pytest.mark.asyncio
 11 | async def test_request_handler_successful_get(mock_website, test_urls, aiohttp_session):
 12 |     """Test successful GET request."""
 13 |     url = test_urls["valid_urls"][0]
 14 |     handler = RequestHandler(session=aiohttp_session)
 15 |     
 16 |     response = await handler.get(url)
 17 |     
 18 |     assert response.status == 200
 19 |     assert "<!DOCTYPE html>" in await response.text()
 20 | 
 21 | @pytest.mark.asyncio
 22 | async def test_request_handler_invalid_url(mock_website, test_urls, aiohttp_session):
 23 |     """Test handling of invalid URL."""
 24 |     url = test_urls["invalid_urls"][0]
 25 |     handler = RequestHandler(session=aiohttp_session)
 26 |     
 27 |     with pytest.raises(aiohttp.ClientError):
 28 |         await handler.get(url)
 29 | 
 30 | @pytest.mark.asyncio
 31 | async def test_request_handler_nonexistent_url(mock_website, test_urls, aiohttp_session):
 32 |     """Test handling of nonexistent URL."""
 33 |     url = test_urls["invalid_urls"][2]
 34 |     handler = RequestHandler(session=aiohttp_session)
 35 |     
 36 |     with pytest.raises(aiohttp.ClientError):
 37 |         await handler.get(url)
 38 | 
 39 | @pytest.mark.asyncio
 40 | async def test_request_handler_rate_limiting(mock_website, test_urls, aiohttp_session):
 41 |     """Test rate limiting functionality."""
 42 |     url = test_urls["valid_urls"][0]
 43 |     rate_limit = 2  # 2 requests per second
 44 |     handler = RequestHandler(session=aiohttp_session, rate_limit=rate_limit)
 45 |     
 46 |     start_time = time.time()
 47 |     
 48 |     # Make multiple requests
 49 |     for _ in range(3):
 50 |         response = await handler.get(url)
 51 |         assert response.status == 200
 52 |     
 53 |     end_time = time.time()
 54 |     elapsed_time = end_time - start_time
 55 |     
 56 |     # Should take at least 1 second due to rate limiting
 57 |     assert elapsed_time >= 1.0
 58 | 
 59 | @pytest.mark.asyncio
 60 | async def test_request_handler_custom_headers(mock_website, test_urls, aiohttp_session):
 61 |     """Test custom headers in requests."""
 62 |     url = test_urls["valid_urls"][0]
 63 |     custom_headers = {
 64 |         "User-Agent": "Custom Bot 1.0",
 65 |         "Accept-Language": "en-US,en;q=0.9"
 66 |     }
 67 |     handler = RequestHandler(session=aiohttp_session, headers=custom_headers)
 68 |     
 69 |     response = await handler.get(url)
 70 |     
 71 |     assert response.status == 200
 72 |     # Headers should be merged with default headers
 73 |     assert handler.headers["User-Agent"] == "Custom Bot 1.0"
 74 |     assert handler.headers["Accept-Language"] == "en-US,en;q=0.9"
 75 | 
 76 | @pytest.mark.asyncio
 77 | async def test_request_handler_timeout(mock_website, test_urls, aiohttp_session):
 78 |     """Test request timeout handling."""
 79 |     url = test_urls["valid_urls"][0]
 80 |     handler = RequestHandler(session=aiohttp_session, timeout=0.001)  # Very short timeout
 81 |     
 82 |     # Mock a delayed response
 83 |     mock_website.get(url, status=200, body="Delayed response", delay=0.1)
 84 |     
 85 |     with pytest.raises(asyncio.TimeoutError):
 86 |         await handler.get(url)
 87 | 
 88 | @pytest.mark.asyncio
 89 | async def test_request_handler_retry(mock_website, test_urls, aiohttp_session):
 90 |     """Test request retry functionality."""
 91 |     url = test_urls["valid_urls"][0]
 92 |     handler = RequestHandler(session=aiohttp_session, max_retries=3)
 93 |     
 94 |     # Mock temporary failures followed by success
 95 |     mock_website.get(url, status=500)  # First attempt fails
 96 |     mock_website.get(url, status=500)  # Second attempt fails
 97 |     mock_website.get(url, status=200, body="Success")  # Third attempt succeeds
 98 |     
 99 |     response = await handler.get(url)
100 |     
101 |     assert response.status == 200
102 |     assert await response.text() == "Success"
103 | 
104 | @pytest.mark.asyncio
105 | async def test_request_handler_max_retries_exceeded(mock_website, test_urls, aiohttp_session):
106 |     """Test behavior when max retries are exceeded."""
107 |     url = test_urls["valid_urls"][0]
108 |     handler = RequestHandler(session=aiohttp_session, max_retries=2)
109 |     
110 |     # Mock consistent failures
111 |     mock_website.get(url, status=500)
112 |     mock_website.get(url, status=500)
113 |     mock_website.get(url, status=500)
114 |     
115 |     with pytest.raises(aiohttp.ClientError):
116 |         await handler.get(url)
117 | 
118 | @pytest.mark.asyncio
119 | async def test_request_handler_session_management(mock_website, test_urls):
120 |     """Test session management."""
121 |     url = test_urls["valid_urls"][0]
122 |     
123 |     # Test with context manager
124 |     async with aiohttp.ClientSession() as session:
125 |         handler = RequestHandler(session=session)
126 |         response = await handler.get(url)
127 |         assert response.status == 200
128 |     
129 |     # Test with closed session
130 |     with pytest.raises(aiohttp.ClientError):
131 |         await handler.get(url)
132 | 
133 | @pytest.mark.asyncio
134 | async def test_request_handler_concurrent_requests(mock_website, test_urls, aiohttp_session):
135 |     """Test handling of concurrent requests."""
136 |     urls = test_urls["valid_urls"]
137 |     handler = RequestHandler(session=aiohttp_session)
138 |     
139 |     # Make concurrent requests
140 |     tasks = [handler.get(url) for url in urls]
141 |     responses = await asyncio.gather(*tasks)
142 |     
143 |     assert all(response.status == 200 for response in responses) 
```

--------------------------------------------------------------------------------
/tests/test_crawlers/test_menu_crawler.py:
--------------------------------------------------------------------------------

```python
  1 | """
  2 | Tests for the MenuCrawler class.
  3 | """
  4 | import pytest
  5 | from docs_scraper.crawlers import MenuCrawler
  6 | from docs_scraper.utils import RequestHandler, HTMLParser
  7 | 
  8 | @pytest.mark.asyncio
  9 | async def test_menu_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
 10 |     """Test successful crawling of menu links."""
 11 |     url = test_urls["valid_urls"][0]
 12 |     menu_selector = test_urls["menu_selector"]
 13 |     request_handler = RequestHandler(session=aiohttp_session)
 14 |     html_parser = HTMLParser()
 15 |     crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
 16 |     
 17 |     results = await crawler.crawl(url, menu_selector)
 18 |     
 19 |     assert len(results) >= 4  # Number of menu links in sample HTML
 20 |     for result in results:
 21 |         assert result["success"] is True
 22 |         assert result["url"].startswith("https://example.com")
 23 |         assert "content" in result
 24 |         assert "title" in result["metadata"]
 25 |         assert "description" in result["metadata"]
 26 |         assert len(result["links"]) > 0
 27 |         assert result["status_code"] == 200
 28 |         assert result["error"] is None
 29 | 
 30 | @pytest.mark.asyncio
 31 | async def test_menu_crawler_invalid_url(mock_website, test_urls, aiohttp_session):
 32 |     """Test crawling with an invalid URL."""
 33 |     url = test_urls["invalid_urls"][0]
 34 |     menu_selector = test_urls["menu_selector"]
 35 |     request_handler = RequestHandler(session=aiohttp_session)
 36 |     html_parser = HTMLParser()
 37 |     crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
 38 |     
 39 |     results = await crawler.crawl(url, menu_selector)
 40 |     
 41 |     assert len(results) == 1
 42 |     assert results[0]["success"] is False
 43 |     assert results[0]["url"] == url
 44 |     assert results[0]["error"] is not None
 45 | 
 46 | @pytest.mark.asyncio
 47 | async def test_menu_crawler_invalid_selector(mock_website, test_urls, aiohttp_session):
 48 |     """Test crawling with an invalid CSS selector."""
 49 |     url = test_urls["valid_urls"][0]
 50 |     invalid_selector = "#nonexistent-menu"
 51 |     request_handler = RequestHandler(session=aiohttp_session)
 52 |     html_parser = HTMLParser()
 53 |     crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
 54 |     
 55 |     results = await crawler.crawl(url, invalid_selector)
 56 |     
 57 |     assert len(results) == 1
 58 |     assert results[0]["success"] is False
 59 |     assert results[0]["url"] == url
 60 |     assert "No menu links found" in results[0]["error"]
 61 | 
 62 | @pytest.mark.asyncio
 63 | async def test_menu_crawler_nested_menu(mock_website, test_urls, aiohttp_session):
 64 |     """Test crawling nested menu structure."""
 65 |     url = test_urls["valid_urls"][0]
 66 |     menu_selector = test_urls["menu_selector"]
 67 |     request_handler = RequestHandler(session=aiohttp_session)
 68 |     html_parser = HTMLParser()
 69 |     crawler = MenuCrawler(
 70 |         request_handler=request_handler,
 71 |         html_parser=html_parser,
 72 |         max_depth=2  # Crawl up to 2 levels deep
 73 |     )
 74 |     
 75 |     results = await crawler.crawl(url, menu_selector)
 76 |     
 77 |     # Check if nested menu items were crawled
 78 |     urls = {result["url"] for result in results}
 79 |     assert "https://example.com/section1" in urls
 80 |     assert "https://example.com/section1/page1" in urls
 81 |     assert "https://example.com/section1/page2" in urls
 82 | 
 83 | @pytest.mark.asyncio
 84 | async def test_menu_crawler_concurrent_limit(mock_website, test_urls, aiohttp_session):
 85 |     """Test concurrent request limiting for menu crawling."""
 86 |     url = test_urls["valid_urls"][0]
 87 |     menu_selector = test_urls["menu_selector"]
 88 |     request_handler = RequestHandler(session=aiohttp_session)
 89 |     html_parser = HTMLParser()
 90 |     crawler = MenuCrawler(
 91 |         request_handler=request_handler,
 92 |         html_parser=html_parser,
 93 |         concurrent_limit=1  # Process one URL at a time
 94 |     )
 95 |     
 96 |     import time
 97 |     start_time = time.time()
 98 |     
 99 |     results = await crawler.crawl(url, menu_selector)
100 |     
101 |     end_time = time.time()
102 |     elapsed_time = end_time - start_time
103 |     
104 |     assert len(results) >= 4
105 |     # With concurrent_limit=1, processing should take at least 0.4 seconds
106 |     assert elapsed_time >= 0.4
107 | 
108 | @pytest.mark.asyncio
109 | async def test_menu_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
110 |     """Test rate limiting for menu crawling."""
111 |     url = test_urls["valid_urls"][0]
112 |     menu_selector = test_urls["menu_selector"]
113 |     request_handler = RequestHandler(session=aiohttp_session, rate_limit=1)  # 1 request per second
114 |     html_parser = HTMLParser()
115 |     crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
116 |     
117 |     import time
118 |     start_time = time.time()
119 |     
120 |     results = await crawler.crawl(url, menu_selector)
121 |     
122 |     end_time = time.time()
123 |     elapsed_time = end_time - start_time
124 |     
125 |     assert len(results) >= 4
126 |     # Should take at least 3 seconds due to rate limiting
127 |     assert elapsed_time >= 3.0
128 | 
129 | @pytest.mark.asyncio
130 | async def test_menu_crawler_max_depth(mock_website, test_urls, aiohttp_session):
131 |     """Test max depth limitation for menu crawling."""
132 |     url = test_urls["valid_urls"][0]
133 |     menu_selector = test_urls["menu_selector"]
134 |     request_handler = RequestHandler(session=aiohttp_session)
135 |     html_parser = HTMLParser()
136 |     crawler = MenuCrawler(
137 |         request_handler=request_handler,
138 |         html_parser=html_parser,
139 |         max_depth=1  # Only crawl top-level menu items
140 |     )
141 |     
142 |     results = await crawler.crawl(url, menu_selector)
143 |     
144 |     # Should only include top-level menu items
145 |     urls = {result["url"] for result in results}
146 |     assert "https://example.com/section1" in urls
147 |     assert "https://example.com/page1" in urls
148 |     assert "https://example.com/section1/page1" not in urls  # Nested item should not be included 
```
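
A minimal sketch of the `MenuCrawler` surface as these tests drive it: a constructor taking `request_handler`, `html_parser`, and optional `max_depth`/`concurrent_limit`, plus an async `crawl(url, menu_selector)` returning per-URL result dicts. The URL and CSS selector below are placeholders:

```python
import asyncio

from docs_scraper.crawlers import MenuCrawler
from docs_scraper.utils import RequestHandler, HTMLParser

async def crawl_docs_menu() -> None:
    # RequestHandler creates and closes its own aiohttp session when used
    # as an async context manager (see request_handler.py below).
    async with RequestHandler(rate_limit=1.0) as request_handler:
        crawler = MenuCrawler(
            request_handler=request_handler,
            html_parser=HTMLParser(),
            max_depth=2,          # follow nested menu items two levels deep
            concurrent_limit=5,   # cap simultaneous page fetches
        )
        results = await crawler.crawl("https://example.com", "nav.menu")
        for result in results:
            print(result["url"], result["success"], result.get("error"))

if __name__ == "__main__":
    asyncio.run(crawl_docs_menu())
```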

--------------------------------------------------------------------------------
/src/docs_scraper/utils/request_handler.py:
--------------------------------------------------------------------------------

```python
  1 | """
  2 | Request handler module for managing HTTP requests with rate limiting and error handling.
  3 | """
  4 | import asyncio
  5 | import logging
  6 | from typing import Optional, Dict, Any
  7 | import aiohttp
  8 | from urllib.robotparser import RobotFileParser
  9 | from urllib.parse import urljoin
 10 | 
 11 | logger = logging.getLogger(__name__)
 12 | 
 13 | class RequestHandler:
 14 |     def __init__(
 15 |         self,
 16 |         rate_limit: float = 1.0,
 17 |         concurrent_limit: int = 5,
 18 |         user_agent: str = "DocsScraperBot/1.0",
 19 |         timeout: int = 30,
 20 |         session: Optional[aiohttp.ClientSession] = None
 21 |     ):
 22 |         """
 23 |         Initialize the request handler.
 24 |         
 25 |         Args:
 26 |             rate_limit: Minimum time between requests to the same domain (in seconds)
 27 |             concurrent_limit: Maximum number of concurrent requests
 28 |             user_agent: User agent string to use for requests
 29 |             timeout: Request timeout in seconds
 30 |             session: Optional aiohttp.ClientSession to use. If not provided, one will be created.
 31 |         """
 32 |         self.rate_limit = rate_limit
 33 |         self.concurrent_limit = concurrent_limit
 34 |         self.user_agent = user_agent
 35 |         self.timeout = timeout
 36 |         self._provided_session = session
 37 |         
 38 |         self._domain_locks: Dict[str, asyncio.Lock] = {}
 39 |         self._domain_last_request: Dict[str, float] = {}
 40 |         self._semaphore = asyncio.Semaphore(concurrent_limit)
 41 |         self._session: Optional[aiohttp.ClientSession] = None
 42 |         self._robot_parsers: Dict[str, RobotFileParser] = {}
 43 | 
 44 |     async def __aenter__(self):
 45 |         """Set up the aiohttp session."""
 46 |         if self._provided_session:
 47 |             self._session = self._provided_session
 48 |         else:
 49 |             self._session = aiohttp.ClientSession(
 50 |                 headers={"User-Agent": self.user_agent},
 51 |                 timeout=aiohttp.ClientTimeout(total=self.timeout)
 52 |             )
 53 |         return self
 54 | 
 55 |     async def __aexit__(self, exc_type, exc_val, exc_tb):
 56 |         """Clean up the aiohttp session."""
 57 |         if self._session and not self._provided_session:
 58 |             await self._session.close()
 59 | 
 60 |     async def _check_robots_txt(self, url: str) -> bool:
 61 |         """
 62 |         Check if the URL is allowed by robots.txt.
 63 |         
 64 |         Args:
 65 |             url: URL to check
 66 |             
 67 |         Returns:
 68 |             bool: True if allowed, False if disallowed
 69 |         """
 70 |         from urllib.parse import urlparse
 71 |         parsed = urlparse(url)
 72 |         domain = f"{parsed.scheme}://{parsed.netloc}"
 73 |         
 74 |         if domain not in self._robot_parsers:
 75 |             parser = RobotFileParser()
 76 |             parser.set_url(urljoin(domain, "/robots.txt"))
 77 |             try:
 78 |                 async with self._session.get(parser.url) as response:
 79 |                     content = await response.text()
 80 |                     parser.parse(content.splitlines())
 81 |             except Exception as e:
 82 |                 logger.warning(f"Failed to fetch robots.txt for {domain}: {e}")
 83 |                 return True
 84 |             self._robot_parsers[domain] = parser
 85 |             
 86 |         return self._robot_parsers[domain].can_fetch(self.user_agent, url)
 87 | 
 88 |     async def get(self, url: str, **kwargs) -> Dict[str, Any]:
 89 |         """
 90 |         Make a GET request with rate limiting and error handling.
 91 |         
 92 |         Args:
 93 |             url: URL to request
 94 |             **kwargs: Additional arguments to pass to aiohttp.ClientSession.get()
 95 |             
 96 |         Returns:
 97 |             Dict containing:
 98 |                 - success: bool indicating if request was successful
 99 |                 - status: HTTP status code if available
100 |                 - content: Response content if successful
101 |                 - error: Error message if unsuccessful
102 |         """
103 |         from urllib.parse import urlparse
104 |         parsed = urlparse(url)
105 |         domain = parsed.netloc
106 | 
107 |         # Get or create domain lock
108 |         if domain not in self._domain_locks:
109 |             self._domain_locks[domain] = asyncio.Lock()
110 | 
111 |         # Check robots.txt
112 |         if not await self._check_robots_txt(url):
113 |             return {
114 |                 "success": False,
115 |                 "status": None,
116 |                 "error": "URL disallowed by robots.txt",
117 |                 "content": None
118 |             }
119 | 
120 |         try:
121 |             async with self._semaphore:  # Limit concurrent requests
122 |                 async with self._domain_locks[domain]:  # Lock per domain
123 |                     # Rate limiting
124 |                     if domain in self._domain_last_request:
125 |                         elapsed = asyncio.get_event_loop().time() - self._domain_last_request[domain]
126 |                         if elapsed < self.rate_limit:
127 |                             await asyncio.sleep(self.rate_limit - elapsed)
128 |                     
129 |                     self._domain_last_request[domain] = asyncio.get_event_loop().time()
130 |                     
131 |                     # Make request
132 |                     async with self._session.get(url, **kwargs) as response:
133 |                         content = await response.text()
134 |                         return {
135 |                             "success": response.status < 400,
136 |                             "status": response.status,
137 |                             "content": content,
138 |                             "error": None if response.status < 400 else f"HTTP {response.status}"
139 |                         }
140 | 
141 |         except asyncio.TimeoutError:
142 |             return {
143 |                 "success": False,
144 |                 "status": None,
145 |                 "error": "Request timed out",
146 |                 "content": None
147 |             }
148 |         except Exception as e:
149 |             return {
150 |                 "success": False,
151 |                 "status": None,
152 |                 "error": str(e),
153 |                 "content": None
154 |             } 
```
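
`RequestHandler` combines a global concurrency semaphore, per-domain locks with rate limiting, and a robots.txt check before every GET. A minimal usage sketch (the URL is a placeholder):

```python
import asyncio

from docs_scraper.utils import RequestHandler

async def fetch(url: str) -> None:
    # Without an injected session, RequestHandler creates and later closes
    # its own aiohttp.ClientSession with the configured user agent and timeout.
    async with RequestHandler(rate_limit=1.0, concurrent_limit=5) as handler:
        result = await handler.get(url)
        if result["success"]:
            print(result["status"], f"{len(result['content'])} characters")
        else:
            print("Request failed:", result["error"])

if __name__ == "__main__":
    asyncio.run(fetch("https://example.com/docs"))
```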

--------------------------------------------------------------------------------
/tests/test_crawlers/test_sitemap_crawler.py:
--------------------------------------------------------------------------------

```python
  1 | """
  2 | Tests for the SitemapCrawler class.
  3 | """
  4 | import pytest
  5 | from docs_scraper.crawlers import SitemapCrawler
  6 | from docs_scraper.utils import RequestHandler, HTMLParser
  7 | 
  8 | @pytest.mark.asyncio
  9 | async def test_sitemap_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
 10 |     """Test successful crawling of a sitemap."""
 11 |     sitemap_url = test_urls["sitemap_url"]
 12 |     request_handler = RequestHandler(session=aiohttp_session)
 13 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
 14 |     crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
 15 |     
 16 |     results = await crawler.crawl(sitemap_url)
 17 |     
 18 |     assert len(results) == 3  # Number of URLs in sample sitemap
 19 |     for result in results:
 20 |         assert result["success"] is True
 21 |         assert result["url"].startswith("https://example.com")
 22 |         assert "content" in result
 23 |         assert "title" in result["metadata"]
 24 |         assert "description" in result["metadata"]
 25 |         assert len(result["links"]) > 0
 26 |         assert result["status_code"] == 200
 27 |         assert result["error"] is None
 28 | 
 29 | @pytest.mark.asyncio
 30 | async def test_sitemap_crawler_invalid_sitemap_url(mock_website, aiohttp_session):
 31 |     """Test crawling with an invalid sitemap URL."""
 32 |     sitemap_url = "https://nonexistent.example.com/sitemap.xml"
 33 |     request_handler = RequestHandler(session=aiohttp_session)
 34 |     html_parser = HTMLParser(base_url="https://nonexistent.example.com")
 35 |     crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
 36 |     
 37 |     results = await crawler.crawl(sitemap_url)
 38 |     
 39 |     assert len(results) == 1
 40 |     assert results[0]["success"] is False
 41 |     assert results[0]["url"] == sitemap_url
 42 |     assert results[0]["error"] is not None
 43 | 
 44 | @pytest.mark.asyncio
 45 | async def test_sitemap_crawler_invalid_xml(mock_website, aiohttp_session):
 46 |     """Test crawling with invalid XML content."""
 47 |     sitemap_url = "https://example.com/invalid-sitemap.xml"
 48 |     mock_website.get(sitemap_url, status=200, body="<invalid>xml</invalid>")
 49 |     
 50 |     request_handler = RequestHandler(session=aiohttp_session)
 51 |     html_parser = HTMLParser(base_url="https://example.com")
 52 |     crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
 53 |     
 54 |     results = await crawler.crawl(sitemap_url)
 55 |     
 56 |     assert len(results) == 1
 57 |     assert results[0]["success"] is False
 58 |     assert results[0]["url"] == sitemap_url
 59 |     assert "Invalid sitemap format" in results[0]["error"]
 60 | 
 61 | @pytest.mark.asyncio
 62 | async def test_sitemap_crawler_concurrent_limit(mock_website, test_urls, aiohttp_session):
 63 |     """Test concurrent request limiting for sitemap crawling."""
 64 |     sitemap_url = test_urls["sitemap_url"]
 65 |     request_handler = RequestHandler(session=aiohttp_session, concurrent_limit=1)  # Process one URL at a time
 66 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
 67 |     crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
 68 |     
 69 |     import time
 70 |     start_time = time.time()
 71 |     
 72 |     results = await crawler.crawl(sitemap_url)
 73 |     
 74 |     end_time = time.time()
 75 |     elapsed_time = end_time - start_time
 76 |     
 77 |     assert len(results) == 3
 78 |     # With concurrent_limit=1, processing should take at least 0.3 seconds
 79 |     assert elapsed_time >= 0.3
 80 | 
 81 | @pytest.mark.asyncio
 82 | async def test_sitemap_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
 83 |     """Test rate limiting for sitemap crawling."""
 84 |     sitemap_url = test_urls["sitemap_url"]
 85 |     request_handler = RequestHandler(session=aiohttp_session, rate_limit=1)  # 1 request per second
 86 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
 87 |     crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
 88 |     
 89 |     import time
 90 |     start_time = time.time()
 91 |     
 92 |     results = await crawler.crawl(sitemap_url)
 93 |     
 94 |     end_time = time.time()
 95 |     elapsed_time = end_time - start_time
 96 |     
 97 |     assert len(results) == 3
 98 |     # Rate limiting (1 request/second across the sitemap fetch and the 3 pages) should take at least 2 seconds
 99 |     assert elapsed_time >= 2.0
100 | 
101 | @pytest.mark.asyncio
102 | async def test_sitemap_crawler_nested_sitemaps(mock_website, test_urls, aiohttp_session):
103 |     """Test crawling nested sitemaps."""
104 |     # Create a sitemap index
105 |     sitemap_index = """<?xml version="1.0" encoding="UTF-8"?>
106 |     <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
107 |         <sitemap>
108 |             <loc>https://example.com/sitemap1.xml</loc>
109 |         </sitemap>
110 |         <sitemap>
111 |             <loc>https://example.com/sitemap2.xml</loc>
112 |         </sitemap>
113 |     </sitemapindex>
114 |     """
115 |     
116 |     # Create sub-sitemaps
117 |     sitemap1 = """<?xml version="1.0" encoding="UTF-8"?>
118 |     <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
119 |         <url>
120 |             <loc>https://example.com/page1</loc>
121 |         </url>
122 |     </urlset>
123 |     """
124 |     
125 |     sitemap2 = """<?xml version="1.0" encoding="UTF-8"?>
126 |     <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
127 |         <url>
128 |             <loc>https://example.com/page2</loc>
129 |         </url>
130 |     </urlset>
131 |     """
132 |     
133 |     mock_website.get("https://example.com/sitemap-index.xml", status=200, body=sitemap_index)
134 |     mock_website.get("https://example.com/sitemap1.xml", status=200, body=sitemap1)
135 |     mock_website.get("https://example.com/sitemap2.xml", status=200, body=sitemap2)
136 |     
137 |     request_handler = RequestHandler(session=aiohttp_session)
138 |     html_parser = HTMLParser(base_url="https://example.com")
139 |     crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
140 |     
141 |     results = await crawler.crawl("https://example.com/sitemap-index.xml")
142 |     
143 |     assert len(results) == 2  # Two pages from two sub-sitemaps
144 |     urls = {result["url"] for result in results}
145 |     assert "https://example.com/page1" in urls
146 |     assert "https://example.com/page2" in urls 
```
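
These tests cover both flat `<urlset>` sitemaps and nested `<sitemapindex>` documents. A minimal driver mirroring the call pattern used above (the URLs are placeholders):

```python
import asyncio

from docs_scraper.crawlers import SitemapCrawler
from docs_scraper.utils import RequestHandler, HTMLParser

async def crawl_from_sitemap(sitemap_url: str, base_url: str) -> None:
    async with RequestHandler(rate_limit=1.0, concurrent_limit=5) as handler:
        crawler = SitemapCrawler(
            request_handler=handler,
            html_parser=HTMLParser(base_url=base_url),
        )
        results = await crawler.crawl(sitemap_url)
        succeeded = sum(1 for r in results if r["success"])
        print(f"Crawled {succeeded}/{len(results)} pages from {sitemap_url}")

if __name__ == "__main__":
    asyncio.run(crawl_from_sitemap("https://example.com/sitemap.xml",
                                   "https://example.com"))
```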

--------------------------------------------------------------------------------
/tests/test_utils/test_html_parser.py:
--------------------------------------------------------------------------------

```python
  1 | """
  2 | Tests for the HTMLParser class.
  3 | """
  4 | import pytest
  5 | from bs4 import BeautifulSoup
  6 | from docs_scraper.utils import HTMLParser
  7 | 
  8 | @pytest.fixture
  9 | def html_parser():
 10 |     """Fixture for HTMLParser instance."""
 11 |     return HTMLParser()
 12 | 
 13 | @pytest.fixture
 14 | def sample_html():
 15 |     """Sample HTML content for testing."""
 16 |     return """
 17 |     <!DOCTYPE html>
 18 |     <html>
 19 |     <head>
 20 |         <title>Test Page</title>
 21 |         <meta name="description" content="Test description">
 22 |         <meta name="keywords" content="test, keywords">
 23 |         <meta property="og:title" content="OG Title">
 24 |         <meta property="og:description" content="OG Description">
 25 |     </head>
 26 |     <body>
 27 |         <nav class="menu">
 28 |             <ul>
 29 |                 <li><a href="/page1">Page 1</a></li>
 30 |                 <li>
 31 |                     <a href="/section1">Section 1</a>
 32 |                     <ul>
 33 |                         <li><a href="/section1/page1">Section 1.1</a></li>
 34 |                         <li><a href="/section1/page2">Section 1.2</a></li>
 35 |                     </ul>
 36 |                 </li>
 37 |             </ul>
 38 |         </nav>
 39 |         <main>
 40 |             <h1>Welcome</h1>
 41 |             <p>Test content with a <a href="/test1">link</a> and another <a href="/test2">link</a>.</p>
 42 |             <div class="content">
 43 |                 <p>More content</p>
 44 |                 <a href="mailto:[email protected]">Email</a>
 45 |                 <a href="tel:+1234567890">Phone</a>
 46 |                 <a href="javascript:void(0)">JavaScript</a>
 47 |                 <a href="#section">Hash</a>
 48 |                 <a href="ftp://example.com">FTP</a>
 49 |             </div>
 50 |         </main>
 51 |     </body>
 52 |     </html>
 53 |     """
 54 | 
 55 | def test_parse_html(html_parser, sample_html):
 56 |     """Test HTML parsing."""
 57 |     soup = html_parser.parse_html(sample_html)
 58 |     assert isinstance(soup, BeautifulSoup)
 59 |     assert soup.title.string == "Test Page"
 60 | 
 61 | def test_extract_metadata(html_parser, sample_html):
 62 |     """Test metadata extraction."""
 63 |     soup = html_parser.parse_html(sample_html)
 64 |     metadata = html_parser.extract_metadata(soup)
 65 |     
 66 |     assert metadata["title"] == "Test Page"
 67 |     assert metadata["description"] == "Test description"
 68 |     assert metadata["keywords"] == "test, keywords"
 69 |     assert metadata["og:title"] == "OG Title"
 70 |     assert metadata["og:description"] == "OG Description"
 71 | 
 72 | def test_extract_links(html_parser, sample_html):
 73 |     """Test link extraction."""
 74 |     soup = html_parser.parse_html(sample_html)
 75 |     links = html_parser.extract_links(soup)
 76 |     
 77 |     # Should include regular page links (relative paths in this sample)
 78 |     assert "/page1" in links
 79 |     assert "/section1" in links
 80 |     assert "/section1/page1" in links
 81 |     assert "/section1/page2" in links
 82 |     assert "/test1" in links
 83 |     assert "/test2" in links
 84 |     
 85 |     # Should not include invalid or special links
 86 |     assert "mailto:[email protected]" not in links
 87 |     assert "tel:+1234567890" not in links
 88 |     assert "javascript:void(0)" not in links
 89 |     assert "#section" not in links
 90 |     assert "ftp://example.com" not in links
 91 | 
 92 | def test_extract_menu_links(html_parser, sample_html):
 93 |     """Test menu link extraction."""
 94 |     soup = html_parser.parse_html(sample_html)
 95 |     menu_links = html_parser.extract_menu_links(soup, "nav.menu")
 96 |     
 97 |     assert len(menu_links) == 4
 98 |     assert "/page1" in menu_links
 99 |     assert "/section1" in menu_links
100 |     assert "/section1/page1" in menu_links
101 |     assert "/section1/page2" in menu_links
102 | 
103 | def test_extract_menu_links_invalid_selector(html_parser, sample_html):
104 |     """Test menu link extraction with invalid selector."""
105 |     soup = html_parser.parse_html(sample_html)
106 |     menu_links = html_parser.extract_menu_links(soup, "#nonexistent")
107 |     
108 |     assert len(menu_links) == 0
109 | 
110 | def test_extract_text_content(html_parser, sample_html):
111 |     """Test text content extraction."""
112 |     soup = html_parser.parse_html(sample_html)
113 |     content = html_parser.extract_text_content(soup)
114 |     
115 |     assert "Welcome" in content
116 |     assert "Test content" in content
117 |     assert "More content" in content
118 |     # Should not include navigation text
119 |     assert "Section 1.1" not in content
120 | 
121 | def test_clean_html(html_parser):
122 |     """Test HTML cleaning."""
123 |     dirty_html = """
124 |     <html>
125 |     <body>
126 |         <script>alert('test');</script>
127 |         <style>body { color: red; }</style>
128 |         <p>Test content</p>
129 |         <!-- Comment -->
130 |         <iframe src="test.html"></iframe>
131 |     </body>
132 |     </html>
133 |     """
134 |     
135 |     clean_html = html_parser.clean_html(dirty_html)
136 |     soup = html_parser.parse_html(clean_html)
137 |     
138 |     assert len(soup.find_all("script")) == 0
139 |     assert len(soup.find_all("style")) == 0
140 |     assert len(soup.find_all("iframe")) == 0
141 |     assert "Test content" in soup.get_text()
142 | 
143 | def test_normalize_url(html_parser):
144 |     """Test URL normalization."""
145 |     base_url = "https://example.com/docs"
146 |     test_cases = [
147 |         ("/test", "https://example.com/test"),
148 |         ("test", "https://example.com/docs/test"),
149 |         ("../test", "https://example.com/test"),
150 |         ("https://other.com/test", "https://other.com/test"),
151 |         ("//other.com/test", "https://other.com/test"),
152 |     ]
153 |     
154 |     for input_url, expected_url in test_cases:
155 |         assert html_parser.normalize_url(input_url, base_url) == expected_url
156 | 
157 | def test_is_valid_link(html_parser):
158 |     """Test link validation."""
159 |     valid_links = [
160 |         "https://example.com",
161 |         "http://example.com",
162 |         "/absolute/path",
163 |         "relative/path",
164 |         "../parent/path",
165 |         "./current/path"
166 |     ]
167 |     
168 |     invalid_links = [
169 |         "mailto:[email protected]",
170 |         "tel:+1234567890",
171 |         "javascript:void(0)",
172 |         "#hash",
173 |         "ftp://example.com",
174 |         ""
175 |     ]
176 |     
177 |     for link in valid_links:
178 |         assert html_parser.is_valid_link(link) is True
179 |     
180 |     for link in invalid_links:
181 |         assert html_parser.is_valid_link(link) is False
182 | 
183 | def test_extract_structured_data(html_parser):
184 |     """Test structured data extraction."""
185 |     html = """
186 |     <html>
187 |     <head>
188 |         <script type="application/ld+json">
189 |         {
190 |             "@context": "https://schema.org",
191 |             "@type": "Article",
192 |             "headline": "Test Article",
193 |             "author": {
194 |                 "@type": "Person",
195 |                 "name": "John Doe"
196 |             }
197 |         }
198 |         </script>
199 |     </head>
200 |     <body>
201 |         <p>Test content</p>
202 |     </body>
203 |     </html>
204 |     """
205 |     
206 |     soup = html_parser.parse_html(html)
207 |     structured_data = html_parser.extract_structured_data(soup)
208 |     
209 |     assert len(structured_data) == 1
210 |     assert structured_data[0]["@type"] == "Article"
211 |     assert structured_data[0]["headline"] == "Test Article"
212 |     assert structured_data[0]["author"]["name"] == "John Doe" 
```
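
Taken together, these tests document the `HTMLParser` surface: `parse_html`, `extract_metadata`, `extract_links`, `extract_menu_links`, `extract_text_content`, `clean_html`, `normalize_url`, `is_valid_link`, and `extract_structured_data`. A short sketch of the typical clean-parse-extract flow (the HTML snippet is a placeholder):

```python
from docs_scraper.utils import HTMLParser

html = """
<html>
  <head><title>Example</title><meta name="description" content="Demo page"></head>
  <body>
    <nav class="menu"><a href="/guide">Guide</a></nav>
    <main><h1>Example</h1><p>See the <a href="/guide">guide</a>.</p></main>
  </body>
</html>
"""

parser = HTMLParser(base_url="https://example.com")  # base_url is optional (see the fixture above)
soup = parser.parse_html(parser.clean_html(html))    # strip scripts/styles/iframes, then parse
metadata = parser.extract_metadata(soup)             # includes "title" and "description" for this snippet
links = [parser.normalize_url(link, "https://example.com")
         for link in parser.extract_links(soup)
         if parser.is_valid_link(link)]
print(metadata["title"], links)
```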

--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/single_url_crawler.py:
--------------------------------------------------------------------------------

```python
  1 | import os
  2 | import sys
  3 | import asyncio
  4 | import re
  5 | import argparse
  6 | from datetime import datetime
  7 | from termcolor import colored
  8 | from crawl4ai import *
  9 | from ..utils import RequestHandler, HTMLParser
 10 | from typing import Dict, Any, Optional
 11 | 
 12 | class SingleURLCrawler:
 13 |     """A crawler that processes a single URL."""
 14 |     
 15 |     def __init__(self, request_handler: RequestHandler, html_parser: HTMLParser):
 16 |         """
 17 |         Initialize the crawler.
 18 |         
 19 |         Args:
 20 |             request_handler: Handler for making HTTP requests
 21 |             html_parser: Parser for processing HTML content
 22 |         """
 23 |         self.request_handler = request_handler
 24 |         self.html_parser = html_parser
 25 |     
 26 |     async def crawl(self, url: str) -> Dict[str, Any]:
 27 |         """
 28 |         Crawl a single URL and extract its content.
 29 |         
 30 |         Args:
 31 |             url: The URL to crawl
 32 |             
 33 |         Returns:
 34 |             Dict containing:
 35 |                 - success: Whether the crawl was successful
 36 |                 - url: The URL that was crawled
 37 |                 - content: The extracted content (if successful)
 38 |                 - metadata: Additional metadata about the page
 39 |                 - links: Links found on the page
 40 |                 - status_code: HTTP status code
 41 |                 - error: Error message (if unsuccessful)
 42 |         """
 43 |         try:
 44 |             response = await self.request_handler.get(url)
 45 |             if not response["success"]:
 46 |                 return {
 47 |                     "success": False,
 48 |                     "url": url,
 49 |                     "content": None,
 50 |                     "metadata": {},
 51 |                     "links": [],
 52 |                     "status_code": response.get("status"),
 53 |                     "error": response.get("error", "Unknown error")
 54 |                 }
 55 |             
 56 |             html_content = response["content"]
 57 |             parsed_content = self.html_parser.parse_content(html_content)
 58 |             
 59 |             return {
 60 |                 "success": True,
 61 |                 "url": url,
 62 |                 "content": parsed_content["text_content"],
 63 |                 "metadata": {
 64 |                     "title": parsed_content["title"],
 65 |                     "description": parsed_content["description"]
 66 |                 },
 67 |                 "links": parsed_content["links"],
 68 |                 "status_code": response["status"],
 69 |                 "error": None
 70 |             }
 71 |             
 72 |         except Exception as e:
 73 |             return {
 74 |                 "success": False,
 75 |                 "url": url,
 76 |                 "content": None,
 77 |                 "metadata": {},
 78 |                 "links": [],
 79 |                 "status_code": None,
 80 |                 "error": str(e)
 81 |             }
 82 | 
 83 | def get_filename_prefix(url: str) -> str:
 84 |     """
 85 |     Generate a filename prefix from a URL including path components.
 86 |     Examples:
 87 |     - https://docs.literalai.com/page -> literalai_docs_page
 88 |     - https://literalai.com/docs/page -> literalai_docs_page
 89 |     - https://api.example.com/path/to/page -> example_api_path_to_page
 90 |     
 91 |     Args:
 92 |         url (str): The URL to process
 93 |         
 94 |     Returns:
 95 |         str: Generated filename prefix
 96 |     """
 97 |     # Remove protocol and split URL parts
 98 |     clean_url = url.split('://')[1]
 99 |     url_parts = clean_url.split('/')
100 |     
101 |     # Get domain parts
102 |     domain_parts = url_parts[0].split('.')
103 |     
104 |     # Extract main domain name (ignoring TLD)
105 |     main_domain = domain_parts[-2]
106 |     
107 |     # Start building the prefix with domain
108 |     prefix_parts = [main_domain]
109 |     
110 |     # Add subdomain if exists
111 |     if len(domain_parts) > 2:
112 |         subdomain = domain_parts[0]
113 |         if subdomain != main_domain:
114 |             prefix_parts.append(subdomain)
115 |     
116 |     # Add all path segments
117 |     if len(url_parts) > 1:
118 |         path_segments = [segment for segment in url_parts[1:] if segment]
119 |         for segment in path_segments:
120 |             # Clean up segment (remove special characters, convert to lowercase)
121 |             clean_segment = re.sub(r'[^a-zA-Z0-9]', '', segment.lower())
122 |             if clean_segment and clean_segment != main_domain:
123 |                 prefix_parts.append(clean_segment)
124 |     
125 |     # Join all parts with underscore
126 |     return '_'.join(prefix_parts)
127 | 
128 | def process_markdown_content(content: str, url: str) -> str:
129 |     """Process markdown content to start from first H1 and add URL as H2"""
130 |     # Find the first H1 tag
131 |     h1_match = re.search(r'^# .+$', content, re.MULTILINE)
132 |     if not h1_match:
133 |         # If no H1 found, return original content with URL as H1
134 |         return f"# No Title Found\n\n## Source\n{url}\n\n{content}"
135 |         
136 |     # Get the content starting from the first H1
137 |     content_from_h1 = content[h1_match.start():]
138 |     
139 |     # Remove "Was this page helpful?" section and everything after it
140 |     helpful_patterns = [
141 |         r'^#+\s*Was this page helpful\?.*$',  # Matches any heading level with this text
142 |         r'^Was this page helpful\?.*$',       # Matches the text without heading
143 |         r'^#+\s*Was this helpful\?.*$',       # Matches any heading level with shorter text
144 |         r'^Was this helpful\?.*$'             # Matches shorter text without heading
145 |     ]
146 |     
147 |     for pattern in helpful_patterns:
148 |         parts = re.split(pattern, content_from_h1, flags=re.MULTILINE | re.IGNORECASE)
149 |         if len(parts) > 1:
150 |             content_from_h1 = parts[0].strip()
151 |             break
152 |     
153 |     # Insert URL as H2 after the H1
154 |     lines = content_from_h1.split('\n')
155 |     h1_line = lines[0]
156 |     rest_of_content = '\n'.join(lines[1:]).strip()
157 |     
158 |     return f"{h1_line}\n\n## Source\n{url}\n\n{rest_of_content}"
159 | 
160 | def save_markdown_content(content: str, url: str) -> str:
161 |     """Save markdown content to a file"""
162 |     try:
163 |         # Generate filename prefix from URL
164 |         filename_prefix = get_filename_prefix(url)
165 |         
166 |         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
167 |         filename = f"{filename_prefix}_{timestamp}.md"
168 |         filepath = os.path.join("scraped_docs", filename)
169 |         
170 |         # Create scraped_docs directory if it doesn't exist
171 |         os.makedirs("scraped_docs", exist_ok=True)
172 |         
173 |         processed_content = process_markdown_content(content, url)
174 |         
175 |         with open(filepath, "w", encoding="utf-8") as f:
176 |             f.write(processed_content)
177 |         
178 |         print(colored(f"\n✓ Markdown content saved to: {filepath}", "green"))
179 |         return filepath
180 |     except Exception as e:
181 |         print(colored(f"\n✗ Error saving markdown content: {str(e)}", "red"))
182 |         return None
183 | 
184 | async def main():
185 |     # Set up argument parser
186 |     parser = argparse.ArgumentParser(description='Crawl a single URL and generate markdown documentation')
187 |     parser.add_argument('url', type=str, help='Target documentation URL to crawl')
188 |     args = parser.parse_args()
189 | 
190 |     try:
191 |         print(colored("\n=== Starting Single URL Crawl ===", "cyan"))
192 |         print(colored(f"\nCrawling URL: {args.url}", "yellow"))
193 |         
194 |         browser_config = BrowserConfig(headless=True, verbose=True)
195 |         async with AsyncWebCrawler(config=browser_config) as crawler:
196 |             crawler_config = CrawlerRunConfig(
197 |                 cache_mode=CacheMode.BYPASS,
198 |                 markdown_generator=DefaultMarkdownGenerator(
199 |                     content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
200 |                 )
201 |             )
202 |             
203 |             result = await crawler.arun(
204 |                 url=args.url,
205 |                 config=crawler_config
206 |             )
207 |             
208 |             if result.success:
209 |                 print(colored("\n✓ Successfully crawled URL", "green"))
210 |                 print(colored(f"Content length: {len(result.markdown.raw_markdown)} characters", "cyan"))
211 |                 save_markdown_content(result.markdown.raw_markdown, args.url)
212 |             else:
213 |                 print(colored(f"\n✗ Failed to crawl URL: {result.error_message}", "red"))
214 |                 
215 |     except Exception as e:
216 |         print(colored(f"\n✗ Error during crawl: {str(e)}", "red"))
217 |         sys.exit(1)
218 | 
219 | if __name__ == "__main__":
220 |     asyncio.run(main())
```
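
Beyond the `SingleURLCrawler` class, this module doubles as a crawl4ai-backed CLI that trims the generated markdown to start at the first H1, inserts the source URL under an `## Source` heading, drops the "Was this page helpful?" boilerplate, and writes the result under `scraped_docs/`. A small sketch of the two pure helpers (the input text and URL are illustrative):

```python
from docs_scraper.crawlers.single_url_crawler import (
    get_filename_prefix,
    process_markdown_content,
)

raw = (
    "Cookie banner text\n\n"
    "# Getting Started\n"
    "Install the package.\n\n"
    "Was this page helpful?\nYes / No\n"
)
url = "https://docs.example.com/getting-started"

print(get_filename_prefix(url))             # -> example_docs_gettingstarted
print(process_markdown_content(raw, url))   # starts at "# Getting Started", adds "## Source",
                                            # and drops the feedback prompt
```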

--------------------------------------------------------------------------------
/src/docs_scraper/server.py:
--------------------------------------------------------------------------------

```python
  1 | """
  2 | MCP server implementation for web crawling and documentation scraping.
  3 | """
  4 | import asyncio
  5 | import logging
  6 | from typing import List, Dict, Any, Optional
  7 | from pydantic import BaseModel, Field, HttpUrl
  8 | from mcp.server.fastmcp import FastMCP
  9 | 
 10 | # Import the crawlers with relative imports
 11 | # This helps prevent circular import issues
 12 | from .crawlers.single_url_crawler import SingleURLCrawler
 13 | from .crawlers.multi_url_crawler import MultiURLCrawler
 14 | from .crawlers.sitemap_crawler import SitemapCrawler
 15 | from .crawlers.menu_crawler import MenuCrawler
 16 | 
 17 | # Import utility classes
 18 | from .utils import RequestHandler, HTMLParser
 19 | 
 20 | # Configure logging
 21 | logging.basicConfig(
 22 |     level=logging.INFO,
 23 |     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
 24 | )
 25 | logger = logging.getLogger(__name__)
 26 | 
 27 | # Create MCP server
 28 | mcp = FastMCP(
 29 |     name="DocsScraperMCP",
 30 |     version="0.1.0"
 31 | )
 32 | 
 33 | # Input validation models
 34 | class SingleUrlInput(BaseModel):
 35 |     url: HttpUrl = Field(..., description="Target URL to crawl")
 36 |     depth: int = Field(0, ge=0, description="How many levels deep to follow links")
 37 |     exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
 38 |     rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests (seconds)")
 39 | 
 40 | class MultiUrlInput(BaseModel):
 41 |     urls: List[HttpUrl] = Field(..., min_items=1, description="List of URLs to crawl")
 42 |     concurrent_limit: int = Field(5, gt=0, description="Maximum number of concurrent requests")
 43 |     exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
 44 |     rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests to the same domain (seconds)")
 45 | 
 46 | class SitemapInput(BaseModel):
 47 |     base_url: HttpUrl = Field(..., description="Base URL of the website")
 48 |     sitemap_url: Optional[HttpUrl] = Field(None, description="Optional explicit sitemap URL")
 49 |     concurrent_limit: int = Field(5, gt=0, description="Maximum number of concurrent requests")
 50 |     exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
 51 |     rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests (seconds)")
 52 | 
 53 | class MenuInput(BaseModel):
 54 |     base_url: HttpUrl = Field(..., description="Base URL of the website")
 55 |     menu_selector: str = Field(..., min_length=1, description="CSS selector for the navigation menu element")
 56 |     concurrent_limit: int = Field(5, gt=0, description="Maximum number of concurrent requests")
 57 |     exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
 58 |     rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests (seconds)")
 59 | 
 60 | @mcp.tool()
 61 | async def single_url_crawler(
 62 |     url: str,
 63 |     depth: int = 0,
 64 |     exclusion_patterns: Optional[List[str]] = None,
 65 |     rate_limit: float = 1.0
 66 | ) -> Dict[str, Any]:
 67 |     """
 68 |     Crawl a single URL and optionally follow links up to a specified depth.
 69 |     
 70 |     Args:
 71 |         url: Target URL to crawl
 72 |         depth: How many levels deep to follow links (0 means only the target URL)
 73 |         exclusion_patterns: List of regex patterns for URLs to exclude
 74 |         rate_limit: Minimum time between requests (seconds)
 75 |         
 76 |     Returns:
 77 |         Dict containing crawled content and statistics
 78 |     """
 79 |     try:
 80 |         # Validate input
 81 |         input_data = SingleUrlInput(
 82 |             url=url,
 83 |             depth=depth,
 84 |             exclusion_patterns=exclusion_patterns,
 85 |             rate_limit=rate_limit
 86 |         )
 87 |         
 88 |         # Create required utility instances
 89 |         request_handler = RequestHandler(rate_limit=input_data.rate_limit)
 90 |         html_parser = HTMLParser(base_url=str(input_data.url))
 91 |         
 92 |         # Create the crawler with the proper parameters
 93 |         crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
 94 |         
 95 |         # Use request_handler as a context manager to ensure proper session initialization
 96 |         async with request_handler:
 97 |             # Call the crawl method with the URL
 98 |             return await crawler.crawl(str(input_data.url))
 99 |         
100 |     except Exception as e:
101 |         logger.error(f"Single URL crawler failed: {str(e)}")
102 |         return {
103 |             "success": False,
104 |             "error": str(e),
105 |             "content": None,
106 |             "stats": {
107 |                 "urls_crawled": 0,
108 |                 "urls_failed": 1,
109 |                 "max_depth_reached": 0
110 |             }
111 |         }
112 | 
113 | @mcp.tool()
114 | async def multi_url_crawler(
115 |     urls: List[str],
116 |     concurrent_limit: int = 5,
117 |     exclusion_patterns: Optional[List[str]] = None,
118 |     rate_limit: float = 1.0
119 | ) -> Dict[str, Any]:
120 |     """
121 |     Crawl multiple URLs in parallel with rate limiting.
122 |     
123 |     Args:
124 |         urls: List of URLs to crawl
125 |         concurrent_limit: Maximum number of concurrent requests
126 |         exclusion_patterns: List of regex patterns for URLs to exclude
127 |         rate_limit: Minimum time between requests to the same domain (seconds)
128 |         
129 |     Returns:
130 |         Dict containing results for each URL and overall statistics
131 |     """
132 |     try:
133 |         # Validate input
134 |         input_data = MultiUrlInput(
135 |             urls=urls,
136 |             concurrent_limit=concurrent_limit,
137 |             exclusion_patterns=exclusion_patterns,
138 |             rate_limit=rate_limit
139 |         )
140 |         
141 |         # Create the crawler with the proper parameters
142 |         crawler = MultiURLCrawler(verbose=True)
143 |         
144 |         # Call the crawl method with the URLs
145 |         url_list = [str(url) for url in input_data.urls]
146 |         results = await crawler.crawl(url_list)
147 |         
148 |         # Return a standardized response format
149 |         return {
150 |             "success": True,
151 |             "results": results,
152 |             "stats": {
153 |                 "urls_crawled": len(results),
154 |                 "urls_succeeded": sum(1 for r in results if r["success"]),
155 |                 "urls_failed": sum(1 for r in results if not r["success"])
156 |             }
157 |         }
158 |         
159 |     except Exception as e:
160 |         logger.error(f"Multi URL crawler failed: {str(e)}")
161 |         return {
162 |             "success": False,
163 |             "error": str(e),
164 |             "content": None,
165 |             "stats": {
166 |                 "urls_crawled": 0,
167 |                 "urls_failed": len(urls),
168 |                 "concurrent_requests_max": 0
169 |             }
170 |         }
171 | 
172 | @mcp.tool()
173 | async def sitemap_crawler(
174 |     base_url: str,
175 |     sitemap_url: Optional[str] = None,
176 |     concurrent_limit: int = 5,
177 |     exclusion_patterns: Optional[List[str]] = None,
178 |     rate_limit: float = 1.0
179 | ) -> Dict[str, Any]:
180 |     """
181 |     Crawl a website using its sitemap.xml.
182 |     
183 |     Args:
184 |         base_url: Base URL of the website
185 |         sitemap_url: Optional explicit sitemap URL (if different from base_url/sitemap.xml)
186 |         concurrent_limit: Maximum number of concurrent requests
187 |         exclusion_patterns: List of regex patterns for URLs to exclude
188 |         rate_limit: Minimum time between requests (seconds)
189 |         
190 |     Returns:
191 |         Dict containing crawled pages and statistics
192 |     """
193 |     try:
194 |         # Validate input
195 |         input_data = SitemapInput(
196 |             base_url=base_url,
197 |             sitemap_url=sitemap_url,
198 |             concurrent_limit=concurrent_limit,
199 |             exclusion_patterns=exclusion_patterns,
200 |             rate_limit=rate_limit
201 |         )
202 |         
203 |         # Create required utility instances
204 |         request_handler = RequestHandler(
205 |             rate_limit=input_data.rate_limit,
206 |             concurrent_limit=input_data.concurrent_limit
207 |         )
208 |         html_parser = HTMLParser(base_url=str(input_data.base_url))
209 |         
210 |         # Create the crawler with the proper parameters
211 |         crawler = SitemapCrawler(
212 |             request_handler=request_handler,
213 |             html_parser=html_parser,
214 |             verbose=True
215 |         )
216 |         
217 |         # Determine the sitemap URL to use
218 |         sitemap_url_to_use = str(input_data.sitemap_url) if input_data.sitemap_url else f"{str(input_data.base_url).rstrip('/')}/sitemap.xml"
219 |         
220 |         # Call the crawl method with the sitemap URL
221 |         results = await crawler.crawl(sitemap_url_to_use)
222 |         
223 |         return {
224 |             "success": True,
225 |             "content": results,
226 |             "stats": {
227 |                 "urls_crawled": len(results),
228 |                 "urls_succeeded": sum(1 for r in results if r["success"]),
229 |                 "urls_failed": sum(1 for r in results if not r["success"]),
230 |                 "sitemap_found": len(results) > 0
231 |             }
232 |         }
233 |         
234 |     except Exception as e:
235 |         logger.error(f"Sitemap crawler failed: {str(e)}")
236 |         return {
237 |             "success": False,
238 |             "error": str(e),
239 |             "content": None,
240 |             "stats": {
241 |                 "urls_crawled": 0,
242 |                 "urls_failed": 1,
243 |                 "sitemap_found": False
244 |             }
245 |         }
246 | 
247 | @mcp.tool()
248 | async def menu_crawler(
249 |     base_url: str,
250 |     menu_selector: str,
251 |     concurrent_limit: int = 5,
252 |     exclusion_patterns: Optional[List[str]] = None,
253 |     rate_limit: float = 1.0
254 | ) -> Dict[str, Any]:
255 |     """
256 |     Crawl a website by following its navigation menu structure.
257 |     
258 |     Args:
259 |         base_url: Base URL of the website
260 |         menu_selector: CSS selector for the navigation menu element
261 |         concurrent_limit: Maximum number of concurrent requests
262 |         exclusion_patterns: List of regex patterns for URLs to exclude
263 |         rate_limit: Minimum time between requests (seconds)
264 |         
265 |     Returns:
266 |         Dict containing menu structure and crawled content
267 |     """
268 |     try:
269 |         # Validate input
270 |         input_data = MenuInput(
271 |             base_url=base_url,
272 |             menu_selector=menu_selector,
273 |             concurrent_limit=concurrent_limit,
274 |             exclusion_patterns=exclusion_patterns,
275 |             rate_limit=rate_limit
276 |         )
277 |         
278 |         # Create the crawler with the proper parameters
279 |         crawler = MenuCrawler(start_url=str(input_data.base_url))
280 |         
281 |         # Call the crawl method
282 |         results = await crawler.crawl()
283 |         
284 |         return {
285 |             "success": True,
286 |             "content": results,
287 |             "stats": {
288 |                 "urls_crawled": len(results.get("menu_links", [])),
289 |                 "urls_failed": 0,
290 |                 "menu_items_found": len(results.get("menu_structure", {}).get("items", []))
291 |             }
292 |         }
293 |         
294 |     except Exception as e:
295 |         logger.error(f"Menu crawler failed: {str(e)}")
296 |         return {
297 |             "success": False,
298 |             "error": str(e),
299 |             "content": None,
300 |             "stats": {
301 |                 "urls_crawled": 0,
302 |                 "urls_failed": 1,
303 |                 "menu_items_found": 0
304 |             }
305 |         }
306 | 
307 | def main():
308 |     """Main entry point for the MCP server."""
309 |     try:
310 |         logger.info("Starting DocsScraperMCP server...")
311 |         mcp.run()  # FastMCP's blocking entry point
312 |     except Exception as e:
313 |         logger.error(f"Server failed: {str(e)}")
314 |         raise
315 |     finally:
316 |         logger.info("DocsScraperMCP server stopped.")
317 | 
318 | if __name__ == "__main__":
319 |     main() 
```
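
Each tool registered here is an ordinary async function, so it can be smoke-tested without an MCP client, assuming FastMCP's `@mcp.tool()` decorator leaves the coroutine directly callable. A minimal sketch (the URL is a placeholder):

```python
import asyncio

from docs_scraper.server import single_url_crawler

async def smoke_test() -> None:
    result = await single_url_crawler(url="https://example.com/docs", rate_limit=1.0)
    if result["success"]:
        print(result["status_code"], result["metadata"].get("title"))
    else:
        print("Crawl failed:", result["error"])

if __name__ == "__main__":
    asyncio.run(smoke_test())
```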

--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/multi_url_crawler.py:
--------------------------------------------------------------------------------

```python
  1 | import os
  2 | import sys
  3 | import asyncio
  4 | import re
  5 | import json
  6 | import argparse
  7 | from typing import List, Optional
  8 | from datetime import datetime
  9 | from termcolor import colored
 10 | from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 11 | from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
 12 | from crawl4ai.content_filter_strategy import PruningContentFilter
 13 | from urllib.parse import urlparse
 14 | 
 15 | def load_urls_from_file(file_path: str) -> List[str]:
 16 |     """Load URLs from either a text file or JSON file"""
 17 |     try:
 18 |         # Create input_files directory if it doesn't exist
 19 |         input_dir = "input_files"
 20 |         os.makedirs(input_dir, exist_ok=True)
 21 |         
 22 |         # Check if file exists in current directory or input_files directory
 23 |         if os.path.exists(file_path):
 24 |             actual_path = file_path
 25 |         elif os.path.exists(os.path.join(input_dir, file_path)):
 26 |             actual_path = os.path.join(input_dir, file_path)
 27 |         else:
 28 |             print(colored(f"Error: File {file_path} not found", "red"))
 29 |             print(colored(f"Please place your URL files in either:", "yellow"))
 30 |             print(colored(f"1. The root directory ({os.getcwd()})", "yellow"))
 31 |             print(colored(f"2. The input_files directory ({os.path.join(os.getcwd(), input_dir)})", "yellow"))
 32 |             sys.exit(1)
 33 |             
 34 |         file_ext = os.path.splitext(actual_path)[1].lower()
 35 |         
 36 |         if file_ext == '.json':
 37 |             print(colored(f"Loading URLs from JSON file: {actual_path}", "cyan"))
 38 |             with open(actual_path, 'r', encoding='utf-8') as f:
 39 |                 try:
 40 |                     data = json.load(f)
 41 |                     # Handle menu crawler output format
 42 |                     if isinstance(data, dict) and 'menu_links' in data:
 43 |                         urls = data['menu_links']
 44 |                     elif isinstance(data, dict) and 'urls' in data:
 45 |                         urls = data['urls']
 46 |                     elif isinstance(data, list):
 47 |                         urls = data
 48 |                     else:
 49 |                         print(colored("Error: Invalid JSON format. Expected 'menu_links' or 'urls' key, or list of URLs", "red"))
 50 |                         sys.exit(1)
 51 |                     print(colored(f"Successfully loaded {len(urls)} URLs from JSON file", "green"))
 52 |                     return urls
 53 |                 except json.JSONDecodeError as e:
 54 |                     print(colored(f"Error: Invalid JSON file - {str(e)}", "red"))
 55 |                     sys.exit(1)
 56 |         else:
 57 |             print(colored(f"Loading URLs from text file: {actual_path}", "cyan"))
 58 |             with open(actual_path, 'r', encoding='utf-8') as f:
 59 |                 urls = [line.strip() for line in f if line.strip()]
 60 |                 print(colored(f"Successfully loaded {len(urls)} URLs from text file", "green"))
 61 |                 return urls
 62 |                 
 63 |     except Exception as e:
 64 |         print(colored(f"Error loading URLs from file: {str(e)}", "red"))
 65 |         sys.exit(1)
 66 | 
 67 | class MultiURLCrawler:
 68 |     def __init__(self, verbose: bool = True):
 69 |         self.browser_config = BrowserConfig(
 70 |             headless=True,
 71 |             verbose=True,
 72 |             viewport_width=800,
 73 |             viewport_height=600
 74 |         )
 75 |         
 76 |         self.crawler_config = CrawlerRunConfig(
 77 |             cache_mode=CacheMode.BYPASS,
 78 |             markdown_generator=DefaultMarkdownGenerator(
 79 |                 content_filter=PruningContentFilter(
 80 |                     threshold=0.48,
 81 |                     threshold_type="fixed",
 82 |                     min_word_threshold=0
 83 |                 )
 84 |             ),
 85 |         )
 86 |         
 87 |         self.verbose = verbose
 88 |         
 89 |     def process_markdown_content(self, content: str, url: str) -> str:
 90 |         """Process markdown content to start from first H1 and add URL as H2"""
 91 |         # Find the first H1 tag
 92 |         h1_match = re.search(r'^# .+$', content, re.MULTILINE)
 93 |         if not h1_match:
 94 |             # If no H1 found, return original content with URL as H1
 95 |             return f"# No Title Found\n\n## Source\n{url}\n\n{content}"
 96 |             
 97 |         # Get the content starting from the first H1
 98 |         content_from_h1 = content[h1_match.start():]
 99 |         
100 |         # Remove "Was this page helpful?" section and everything after it
101 |         helpful_patterns = [
102 |             r'^#+\s*Was this page helpful\?.*$',  # Matches any heading level with this text
103 |             r'^Was this page helpful\?.*$',       # Matches the text without heading
104 |             r'^#+\s*Was this helpful\?.*$',       # Matches any heading level with shorter text
105 |             r'^Was this helpful\?.*$'             # Matches shorter text without heading
106 |         ]
107 |         
108 |         for pattern in helpful_patterns:
109 |             parts = re.split(pattern, content_from_h1, flags=re.MULTILINE | re.IGNORECASE)
110 |             if len(parts) > 1:
111 |                 content_from_h1 = parts[0].strip()
112 |                 break
113 |         
114 |         # Insert URL as H2 after the H1
115 |         lines = content_from_h1.split('\n')
116 |         h1_line = lines[0]
117 |         rest_of_content = '\n'.join(lines[1:])
118 |         
119 |         return f"{h1_line}\n\n## Source\n{url}\n\n{rest_of_content}"
120 |         
121 |     def get_filename_prefix(self, url: str) -> str:
122 |         """
123 |         Generate a filename prefix from a URL including path components.
124 |         Examples:
125 |         - https://docs.literalai.com/page -> literalai_docs_page
126 |         - https://literalai.com/docs/page -> literalai_docs_page
127 |         - https://api.example.com/path/to/page -> example_api_path_to_page
128 |         """
129 |         try:
130 |             # Parse the URL
131 |             parsed = urlparse(url)
132 |             
133 |             # Split hostname and reverse it (e.g., 'docs.example.com' -> ['com', 'example', 'docs'])
134 |             hostname_parts = parsed.hostname.split('.')
135 |             hostname_parts.reverse()
136 |             
137 |             # Remove common TLDs and 'www'
138 |             hostname_parts = [p for p in hostname_parts if p not in ('com', 'org', 'net', 'www')]
139 |             
140 |             # Get path components, removing empty strings
141 |             path_parts = [p for p in parsed.path.split('/') if p]
142 |             
143 |             # Combine hostname and path parts
144 |             all_parts = hostname_parts + path_parts
145 |             
146 |             # Clean up parts: lowercase, remove special chars, limit length
147 |             cleaned_parts = []
148 |             for part in all_parts:
149 |                 # Convert to lowercase and remove special characters
150 |                 cleaned = re.sub(r'[^a-zA-Z0-9]+', '_', part.lower())
151 |                 # Remove leading/trailing underscores
152 |                 cleaned = cleaned.strip('_')
153 |                 # Only add non-empty parts
154 |                 if cleaned:
155 |                     cleaned_parts.append(cleaned)
156 |             
157 |             # Join parts with underscores
158 |             return '_'.join(cleaned_parts)
159 |         
160 |         except Exception as e:
161 |             print(colored(f"Error generating filename prefix: {str(e)}", "red"))
162 |             return "default"
163 | 
164 |     def save_markdown_content(self, results: List[dict], filename_prefix: str = None):
165 |         """Save all markdown content to a single file"""
166 |         try:
167 |             # Use the first successful URL to generate the filename prefix if none provided
168 |             if not filename_prefix and results:
169 |                 # Find first successful result
170 |                 first_url = next((result["url"] for result in results if result["success"]), None)
171 |                 if first_url:
172 |                     filename_prefix = self.get_filename_prefix(first_url)
173 |                 else:
174 |                     filename_prefix = "docs"  # Fallback if no successful results
175 |             
176 |             timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
177 |             filename = f"{filename_prefix}_{timestamp}.md"
178 |             filepath = os.path.join("scraped_docs", filename)
179 |             
180 |             # Create scraped_docs directory if it doesn't exist
181 |             os.makedirs("scraped_docs", exist_ok=True)
182 |             
183 |             with open(filepath, "w", encoding="utf-8") as f:
184 |                 for result in results:
185 |                     if result["success"]:
186 |                         processed_content = self.process_markdown_content(
187 |                             result["markdown_content"],
188 |                             result["url"]
189 |                         )
190 |                         f.write(processed_content)
191 |                         f.write("\n\n---\n\n")
192 |             
193 |             if self.verbose:
194 |                 print(colored(f"\nMarkdown content saved to: {filepath}", "green"))
195 |             return filepath
196 |             
197 |         except Exception as e:
198 |             print(colored(f"\nError saving markdown content: {str(e)}", "red"))
199 |             return None
200 | 
201 |     async def crawl(self, urls: List[str]) -> List[dict]:
202 |         """
203 |         Crawl multiple URLs sequentially using session reuse for optimal performance
204 |         """
205 |         total_urls = len(urls)
206 |         if self.verbose:
207 |             print("\n=== Starting Crawl ===")
208 |             print(f"Total URLs to crawl: {total_urls}")
209 | 
210 |         results = []
211 |         async with AsyncWebCrawler(config=self.browser_config) as crawler:
212 |             session_id = "crawl_session"  # Reuse the same session for all URLs
213 |             for idx, url in enumerate(urls, 1):
214 |                 try:
215 |                     if self.verbose:
216 |                         progress = (idx / total_urls) * 100
217 |                         print(f"\nProgress: {idx}/{total_urls} ({progress:.1f}%)")
218 |                         print(f"Crawling: {url}")
219 |                     
220 |                     result = await crawler.arun(
221 |                         url=url,
222 |                         config=self.crawler_config,
223 |                         session_id=session_id,
224 |                     )
225 |                     
226 |                     results.append({
227 |                         "url": url,
228 |                         "success": result.success,
229 |                         "content_length": len(result.markdown.raw_markdown) if result.success else 0,
230 |                         "markdown_content": result.markdown.raw_markdown if result.success else "",
231 |                         "error": result.error_message if not result.success else None
232 |                     })
233 |                     
234 |                     if self.verbose and result.success:
235 |                         print(f"✓ Successfully crawled URL {idx}/{total_urls}")
236 |                         print(f"Content length: {len(result.markdown.raw_markdown)} characters")
237 |                 except Exception as e:
238 |                     results.append({
239 |                         "url": url,
240 |                         "success": False,
241 |                         "content_length": 0,
242 |                         "markdown_content": "",
243 |                         "error": str(e)
244 |                     })
245 |                     if self.verbose:
246 |                         print(f"✗ Error crawling URL {idx}/{total_urls}: {str(e)}")
247 | 
248 |         if self.verbose:
249 |             successful = sum(1 for r in results if r["success"])
250 |             print(f"\n=== Crawl Complete ===")
251 |             print(f"Successfully crawled: {successful}/{total_urls} URLs")
252 | 
253 |         return results
254 | 
255 | async def main():
256 |     parser = argparse.ArgumentParser(description='Crawl multiple URLs and generate markdown documentation')
257 |     parser.add_argument('urls_file', type=str, help='Path to file containing URLs (either .txt or .json)')
258 |     parser.add_argument('--output-prefix', type=str, help='Prefix for output markdown file (optional)')
259 |     args = parser.parse_args()
260 | 
261 |     try:
262 |         # Load URLs from file
263 |         urls = load_urls_from_file(args.urls_file)
264 |         
265 |         if not urls:
266 |             print(colored("Error: No URLs found in the input file", "red"))
267 |             sys.exit(1)
268 |             
269 |         print(colored(f"Found {len(urls)} URLs to crawl", "green"))
270 |         
271 |         # Initialize and run crawler
272 |         crawler = MultiURLCrawler(verbose=True)
273 |         results = await crawler.crawl(urls)
274 |         
275 |         # Save results to markdown file - only pass output_prefix if explicitly set
276 |         crawler.save_markdown_content(results, args.output_prefix if args.output_prefix else None)
277 |         
278 |     except Exception as e:
279 |         print(colored(f"Error during crawling: {str(e)}", "red"))
280 |         sys.exit(1)
281 | 
282 | if __name__ == "__main__":
283 |     asyncio.run(main()) 
```
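
A minimal programmatic usage sketch for `MultiURLCrawler` (illustrative, not part of the repository): the `docs.example.com` URLs are placeholders, and the direct module import is assumed to resolve once the package is installed.

```python
import asyncio

from docs_scraper.crawlers.multi_url_crawler import MultiURLCrawler


async def run():
    # Placeholder URLs; substitute real documentation pages.
    urls = [
        "https://docs.example.com/getting-started",
        "https://docs.example.com/api/reference",
    ]
    crawler = MultiURLCrawler(verbose=True)
    results = await crawler.crawl(urls)
    # Writes scraped_docs/<prefix>_<timestamp>.md; the prefix is derived
    # from the first successfully crawled URL when not passed explicitly.
    crawler.save_markdown_content(results)


if __name__ == "__main__":
    asyncio.run(run())
```

This mirrors what `main()` above does from the command line, except there the URL list is loaded from a `.txt` or `.json` file via `load_urls_from_file`.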

--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/sitemap_crawler.py:
--------------------------------------------------------------------------------

```python
  1 | import os
  2 | import sys
  3 | import asyncio
  4 | import re
  5 | import xml.etree.ElementTree as ET
  6 | import argparse
  7 | from typing import List, Optional, Dict
  8 | from datetime import datetime
  9 | from termcolor import colored
 10 | from ..utils import RequestHandler, HTMLParser
 11 | 
 12 | class SitemapCrawler:
 13 |     def __init__(self, request_handler: Optional[RequestHandler] = None, html_parser: Optional[HTMLParser] = None, verbose: bool = True):
 14 |         """
 15 |         Initialize the sitemap crawler.
 16 |         
 17 |         Args:
 18 |             request_handler: Optional RequestHandler instance. If not provided, one will be created.
 19 |             html_parser: Optional HTMLParser instance. If not provided, one will be created.
 20 |             verbose: Whether to print progress messages
 21 |         """
 22 |         self.verbose = verbose
 23 |         self.request_handler = request_handler or RequestHandler(
 24 |             rate_limit=1.0,
 25 |             concurrent_limit=5,
 26 |             user_agent="DocsScraperBot/1.0",
 27 |             timeout=30
 28 |         )
 29 |         self._html_parser = html_parser
 30 | 
 31 |     async def fetch_sitemap(self, sitemap_url: str) -> List[str]:
 32 |         """
 33 |         Fetch and parse an XML sitemap to extract URLs.
 34 |         
 35 |         Args:
 36 |             sitemap_url (str): The URL of the XML sitemap
 37 |             
 38 |         Returns:
 39 |             List[str]: List of URLs found in the sitemap
 40 |         """
 41 |         if self.verbose:
 42 |             print(f"\nFetching sitemap from: {sitemap_url}")
 43 |             
 44 |         async with self.request_handler as handler:
 45 |             try:
 46 |                 response = await handler.get(sitemap_url)
 47 |                 if not response["success"]:
 48 |                     raise Exception(f"Failed to fetch sitemap: {response['error']}")
 49 |                 
 50 |                 content = response["content"]
 51 |                 
 52 |                 # Parse XML content
 53 |                 root = ET.fromstring(content)
 54 |                 
 55 |                 # Handle both standard sitemaps and sitemap indexes
 56 |                 urls = []
 57 |                 
 58 |                 # Remove XML namespace for easier parsing
 59 |                 namespace = root.tag.split('}')[0] + '}' if '}' in root.tag else ''
 60 |                 
 61 |                 if root.tag == f"{namespace}sitemapindex":
 62 |                     # This is a sitemap index file
 63 |                     if self.verbose:
 64 |                         print("Found sitemap index, processing nested sitemaps...")
 65 |                     
 66 |                     for sitemap in root.findall(f".//{namespace}sitemap"):
 67 |                         loc = sitemap.find(f"{namespace}loc")
 68 |                         if loc is not None and loc.text:
 69 |                             nested_urls = await self.fetch_sitemap(loc.text)
 70 |                             urls.extend(nested_urls)
 71 |                 else:
 72 |                     # This is a standard sitemap
 73 |                     for url in root.findall(f".//{namespace}url"):
 74 |                         loc = url.find(f"{namespace}loc")
 75 |                         if loc is not None and loc.text:
 76 |                             urls.append(loc.text)
 77 |                 
 78 |                 if self.verbose:
 79 |                     print(f"Found {len(urls)} URLs in sitemap")
 80 |                 return urls
 81 |                 
 82 |             except Exception as e:
 83 |                 print(f"Error fetching sitemap: {str(e)}")
 84 |                 return []
 85 | 
 86 |     def process_markdown_content(self, content: str, url: str) -> str:
 87 |         """Process markdown content to start from first H1 and add URL as H2"""
 88 |         # Find the first H1 tag
 89 |         h1_match = re.search(r'^# .+$', content, re.MULTILINE)
 90 |         if not h1_match:
 91 |             # If no H1 found, return original content with URL as H1
 92 |             return f"# No Title Found\n\n## Source\n{url}\n\n{content}"
 93 |             
 94 |         # Get the content starting from the first H1
 95 |         content_from_h1 = content[h1_match.start():]
 96 |         
 97 |         # Remove "Was this page helpful?" section and everything after it
 98 |         helpful_patterns = [
 99 |             r'^#+\s*Was this page helpful\?.*$',  # Matches any heading level with this text
100 |             r'^Was this page helpful\?.*$',       # Matches the text without heading
101 |             r'^#+\s*Was this helpful\?.*$',       # Matches any heading level with shorter text
102 |             r'^Was this helpful\?.*$'             # Matches shorter text without heading
103 |         ]
104 |         
105 |         for pattern in helpful_patterns:
106 |             parts = re.split(pattern, content_from_h1, flags=re.MULTILINE | re.IGNORECASE)
107 |             if len(parts) > 1:
108 |                 content_from_h1 = parts[0].strip()
109 |                 break
110 |         
111 |         # Insert URL as H2 after the H1
112 |         lines = content_from_h1.split('\n')
113 |         h1_line = lines[0]
114 |         rest_of_content = '\n'.join(lines[1:]).strip()
115 |         
116 |         return f"{h1_line}\n\n## Source\n{url}\n\n{rest_of_content}"
117 | 
118 |     def save_markdown_content(self, results: List[dict], filename_prefix: str = "docs"):
119 |         """Save all markdown content to a single file"""
120 |         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
121 |         filename = f"{filename_prefix}_{timestamp}.md"
122 |         filepath = os.path.join("scraped_docs", filename)
123 |         
124 |         # Create scraped_docs directory if it doesn't exist
125 |         os.makedirs("scraped_docs", exist_ok=True)
126 |         
127 |         with open(filepath, "w", encoding="utf-8") as f:
128 |             for result in results:
129 |                 if result["success"]:
130 |                     processed_content = self.process_markdown_content(
131 |                         result["content"],
132 |                         result["url"]
133 |                     )
134 |                     f.write(processed_content)
135 |                     f.write("\n\n---\n\n")
136 |         
137 |         if self.verbose:
138 |             print(f"\nMarkdown content saved to: {filepath}")
139 |         return filepath
140 | 
141 |     async def crawl(self, sitemap_url: str, urls: Optional[List[str]] = None) -> List[dict]:
142 |         """
143 |         Crawl a sitemap URL and all URLs it contains.
144 |         
145 |         Args:
146 |             sitemap_url: URL of the sitemap to crawl
147 |             urls: Optional pre-filtered list of URLs; if provided, the sitemap is not fetched again
148 |         Returns:
149 |             List of dictionaries containing crawl results
150 |         """
151 |         if self.verbose:
152 |             print("\n=== Starting Crawl ===")
153 |         
154 |         # Fetch URLs from the sitemap unless a pre-filtered list was provided
155 |         urls = urls if urls is not None else await self.fetch_sitemap(sitemap_url)
156 |         
157 |         if self.verbose:
158 |             print(f"Total URLs to crawl: {len(urls)}")
159 | 
160 |         results = []
161 |         async with self.request_handler as handler:
162 |             for idx, url in enumerate(urls, 1):
163 |                 try:
164 |                     if self.verbose:
165 |                         progress = (idx / len(urls)) * 100
166 |                         print(f"\nProgress: {idx}/{len(urls)} ({progress:.1f}%)")
167 |                         print(f"Crawling: {url}")
168 |                     
169 |                     response = await handler.get(url)
170 |                     html_parser = self._html_parser or HTMLParser(url)
171 |                     
172 |                     if response["success"]:
173 |                         parsed_content = html_parser.parse_content(response["content"])
174 |                         results.append({
175 |                             "url": url,
176 |                             "success": True,
177 |                             "content": parsed_content["text_content"],
178 |                             "metadata": {
179 |                                 "title": parsed_content["title"],
180 |                                 "description": parsed_content["description"]
181 |                             },
182 |                             "links": parsed_content["links"],
183 |                             "status_code": response["status"],
184 |                             "error": None
185 |                         })
186 |                         
187 |                         if self.verbose:
188 |                             print(f"✓ Successfully crawled URL {idx}/{len(urls)}")
189 |                             print(f"Content length: {len(parsed_content['text_content'])} characters")
190 |                     else:
191 |                         results.append({
192 |                             "url": url,
193 |                             "success": False,
194 |                             "content": "",
195 |                             "metadata": {"title": None, "description": None},
196 |                             "links": [],
197 |                             "status_code": response.get("status"),
198 |                             "error": response["error"]
199 |                         })
200 |                         if self.verbose:
201 |                             print(f"✗ Error crawling URL {idx}/{len(urls)}: {response['error']}")
202 |                             
203 |                 except Exception as e:
204 |                     results.append({
205 |                         "url": url,
206 |                         "success": False,
207 |                         "content": "",
208 |                         "metadata": {"title": None, "description": None},
209 |                         "links": [],
210 |                         "status_code": None,
211 |                         "error": str(e)
212 |                     })
213 |                     if self.verbose:
214 |                         print(f"✗ Error crawling URL {idx}/{len(urls)}: {str(e)}")
215 | 
216 |         if self.verbose:
217 |             successful = sum(1 for r in results if r["success"])
218 |             print(f"\n=== Crawl Complete ===")
219 |             print(f"Successfully crawled: {successful}/{len(urls)} URLs")
220 | 
221 |         return results
222 | 
223 |     def get_filename_prefix(self, url: str) -> str:
224 |         """
225 |         Generate a filename prefix from a sitemap URL.
226 |         Examples:
227 |         - https://docs.literalai.com/sitemap.xml -> literalai_docs
228 |         - https://literalai.com/docs/sitemap.xml -> literalai_docs
229 |         - https://api.example.com/sitemap.xml -> example_api
230 |         
231 |         Args:
232 |             url (str): The sitemap URL
233 |             
234 |         Returns:
235 |             str: Generated filename prefix
236 |         """
237 |         # Remove protocol and split URL parts
238 |         clean_url = url.split('://')[1]
239 |         url_parts = clean_url.split('/')
240 |         
241 |         # Get domain parts
242 |         domain_parts = url_parts[0].split('.')
243 |         
244 |         # Extract main domain name (ignoring TLD)
245 |         main_domain = domain_parts[-2]
246 |         
247 |         # Determine the qualifier (subdomain or path segment)
248 |         qualifier = None
249 |         
250 |         # First check subdomain
251 |         if len(domain_parts) > 2:
252 |             qualifier = domain_parts[0]
253 |         # Then check path
254 |         elif len(url_parts) > 2:
255 |             # Get the first meaningful path segment
256 |             for segment in url_parts[1:]:
257 |                 if segment and segment != 'sitemap.xml':
258 |                     qualifier = segment
259 |                     break
260 |         
261 |         # Build the prefix
262 |         if qualifier:
263 |             # Clean up qualifier (remove special characters, convert to lowercase)
264 |             qualifier = re.sub(r'[^a-zA-Z0-9]', '', qualifier.lower())
265 |             # Don't duplicate parts if they're the same
266 |             if qualifier != main_domain:
267 |                 return f"{main_domain}_{qualifier}"
268 |         
269 |         return main_domain
270 | 
271 | async def main():
272 |     # Set up argument parser
273 |     parser = argparse.ArgumentParser(description='Crawl a sitemap and generate markdown documentation')
274 |     parser.add_argument('sitemap_url', type=str, help='URL of the sitemap (e.g., https://docs.example.com/sitemap.xml)')
275 |     parser.add_argument('--max-depth', type=int, default=10, help='Maximum sitemap recursion depth')
276 |     parser.add_argument('--patterns', type=str, nargs='+', help='URL patterns to include (e.g., "/docs/*" "/guide/*")')
277 |     args = parser.parse_args()
278 | 
279 |     try:
280 |         print(colored(f"\nFetching sitemap: {args.sitemap_url}", "cyan"))
281 |         
282 |         # Initialize crawler
283 |         crawler = SitemapCrawler(verbose=True)
284 |         
285 |         # Fetch URLs from sitemap
286 |         urls = await crawler.fetch_sitemap(args.sitemap_url)
287 |         
288 |         if not urls:
289 |             print(colored("No URLs found in sitemap", "red"))
290 |             sys.exit(1)
291 |             
292 |         # Filter URLs by pattern if specified
293 |         if args.patterns:
294 |             print(colored("\nFiltering URLs by patterns:", "cyan"))
295 |             for pattern in args.patterns:
296 |                 print(colored(f"  {pattern}", "yellow"))
297 |             
298 |             filtered_urls = []
299 |             for url in urls:
300 |                 if any(pattern.replace('*', '') in url for pattern in args.patterns):
301 |                     filtered_urls.append(url)
302 |             
303 |             print(colored(f"\nFound {len(filtered_urls)} URLs matching patterns", "green"))
304 |             urls = filtered_urls
305 |         
306 |         # Crawl the URLs (pass the possibly filtered list so the sitemap is not re-fetched)
307 |         results = await crawler.crawl(args.sitemap_url, urls=urls)
308 |         
309 |         # Save results to markdown file with dynamic name
310 |         filename_prefix = crawler.get_filename_prefix(args.sitemap_url)
311 |         crawler.save_markdown_content(results, filename_prefix)
312 |         
313 |     except Exception as e:
314 |         print(colored(f"Error during crawling: {str(e)}", "red"))
315 |         sys.exit(1)
316 | 
317 | if __name__ == "__main__":
318 |     asyncio.run(main()) 
```
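
A minimal usage sketch for `SitemapCrawler` (illustrative, not part of the repository): it uses the optional `urls` argument of `crawl()` shown above to avoid fetching the sitemap twice; the sitemap URL and the `"/docs/"` substring filter are placeholders.

```python
import asyncio

from docs_scraper.crawlers.sitemap_crawler import SitemapCrawler


async def run():
    sitemap_url = "https://docs.example.com/sitemap.xml"  # placeholder sitemap
    crawler = SitemapCrawler(verbose=True)

    # Fetch the URL list once, filter it, then crawl only the matches.
    urls = await crawler.fetch_sitemap(sitemap_url)
    doc_urls = [u for u in urls if "/docs/" in u]

    results = await crawler.crawl(sitemap_url, urls=doc_urls)
    crawler.save_markdown_content(results, crawler.get_filename_prefix(sitemap_url))


if __name__ == "__main__":
    asyncio.run(run())
```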

--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/menu_crawler.py:
--------------------------------------------------------------------------------

```python
  1 | #!/usr/bin/env python3
  2 | 
  3 | import asyncio
  4 | from typing import List, Set
  5 | from termcolor import colored
  6 | from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
  7 | from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
  8 | from urllib.parse import urljoin, urlparse
  9 | import json
 10 | import os
 11 | import sys
 12 | import argparse
 13 | from datetime import datetime
 14 | import re
 15 | 
 16 | # Constants
 17 | BASE_URL = "https://developers.cloudflare.com/agents/"
 18 | INPUT_DIR = "input_files"  # Directory for menu-link JSON output (consumed as input by multi_url_crawler)
 19 | MENU_SELECTORS = [
 20 |     # Traditional documentation selectors
 21 |     "nav a",                                  # General navigation links
 22 |     "[role='navigation'] a",                  # Role-based navigation
 23 |     ".sidebar a",                             # Common sidebar class
 24 |     "[class*='nav'] a",                       # Classes containing 'nav'
 25 |     "[class*='menu'] a",                      # Classes containing 'menu'
 26 |     "aside a",                                # Side navigation
 27 |     ".toc a",                                 # Table of contents
 28 |     
 29 |     # Modern framework selectors (Mintlify, Docusaurus, etc)
 30 |     "[class*='sidebar'] [role='navigation'] [class*='group'] a",  # Navigation groups
 31 |     "[class*='sidebar'] [role='navigation'] [class*='item'] a",   # Navigation items
 32 |     "[class*='sidebar'] [role='navigation'] [class*='link'] a",   # Direct links
 33 |     "[class*='sidebar'] [role='navigation'] div[class*='text']",  # Text items
 34 |     "[class*='sidebar'] [role='navigation'] [class*='nav-item']", # Nav items
 35 |     
 36 |     # Additional common patterns
 37 |     "[class*='docs-'] a",                     # Documentation-specific links
 38 |     "[class*='navigation'] a",                # Navigation containers
 39 |     "[class*='toc'] a",                       # Table of contents variations
 40 |     ".docNavigation a",                       # Documentation navigation
 41 |     "[class*='menu-item'] a",                 # Menu items
 42 |     
 43 |     # Client-side rendered navigation
 44 |     "[class*='sidebar'] a[href]",             # Any link in sidebar
 45 |     "[class*='sidebar'] [role='link']",       # ARIA role links
 46 |     "[class*='sidebar'] [role='menuitem']",   # Menu items
 47 |     "[class*='sidebar'] [role='treeitem']",   # Tree navigation items
 48 |     "[class*='sidebar'] [onclick]",           # Elements with click handlers
 49 |     "[class*='sidebar'] [class*='link']",     # Elements with link classes
 50 |     "a[href^='/']",                           # Root-relative links
 51 |     "a[href^='./']",                          # Relative links
 52 |     "a[href^='../']"                          # Parent-relative links
 53 | ]
 54 | 
 55 | # JavaScript to expand nested menus
 56 | EXPAND_MENUS_JS = """
 57 | (async () => {
 58 |     // Wait for client-side rendering to complete
 59 |     await new Promise(r => setTimeout(r, 2000));
 60 |     
 61 |     // Function to expand all menu items
 62 |     async function expandAllMenus() {
 63 |         // Combined selectors for expandable menu items
 64 |         const expandableSelectors = [
 65 |             // Previous selectors...
 66 |             // Additional selectors for client-side rendered menus
 67 |             '[class*="sidebar"] button',
 68 |             '[class*="sidebar"] [role="button"]',
 69 |             '[class*="sidebar"] [aria-controls]',
 70 |             '[class*="sidebar"] [aria-expanded]',
 71 |             '[class*="sidebar"] [data-state]',
 72 |             '[class*="sidebar"] [class*="expand"]',
 73 |             '[class*="sidebar"] [class*="toggle"]',
 74 |             '[class*="sidebar"] [class*="collapse"]'
 75 |         ];
 76 |         
 77 |         let expanded = 0;
 78 |         let lastExpanded = -1;
 79 |         let attempts = 0;
 80 |         const maxAttempts = 10;  // Increased attempts for client-side rendering
 81 |         
 82 |         while (expanded !== lastExpanded && attempts < maxAttempts) {
 83 |             lastExpanded = expanded;
 84 |             attempts++;
 85 |             
 86 |             for (const selector of expandableSelectors) {
 87 |                 const elements = document.querySelectorAll(selector);
 88 |                 for (const el of elements) {
 89 |                     try {
 90 |                         // Click the element
 91 |                         el.click();
 92 |                         
 93 |                         // Try multiple expansion methods
 94 |                         el.setAttribute('aria-expanded', 'true');
 95 |                         el.setAttribute('data-state', 'open');
 96 |                         el.classList.add('expanded', 'show', 'active');
 97 |                         el.classList.remove('collapsed', 'closed');
 98 |                         
 99 |                         // Handle parent groups - multiple patterns
100 |                         ['[class*="group"]', '[class*="parent"]', '[class*="submenu"]'].forEach(parentSelector => {
101 |                             let parent = el.closest(parentSelector);
102 |                             if (parent) {
103 |                                 parent.setAttribute('data-state', 'open');
104 |                                 parent.setAttribute('aria-expanded', 'true');
105 |                                 parent.classList.add('expanded', 'show', 'active');
106 |                             }
107 |                         });
108 |                         
109 |                         expanded++;
110 |                         await new Promise(r => setTimeout(r, 200));  // Increased delay between clicks
111 |                     } catch (e) {
112 |                         continue;
113 |                     }
114 |                 }
115 |             }
116 |             
117 |             // Wait longer between attempts for client-side rendering
118 |             await new Promise(r => setTimeout(r, 500));
119 |         }
120 |         
121 |         // After expansion, try to convert text items to links if needed
122 |         const textSelectors = [
123 |             '[class*="sidebar"] [role="navigation"] [class*="text"]',
124 |             '[class*="menu-item"]',
125 |             '[class*="nav-item"]',
126 |             '[class*="sidebar"] [role="menuitem"]',
127 |             '[class*="sidebar"] [role="treeitem"]'
128 |         ];
129 |         
130 |         textSelectors.forEach(selector => {
131 |             const textItems = document.querySelectorAll(selector);
132 |             textItems.forEach(item => {
133 |                 if (!item.querySelector('a') && item.textContent && item.textContent.trim()) {
134 |                     const text = item.textContent.trim();
135 |                     // Only create link if it doesn't already exist
136 |                     if (!Array.from(item.children).some(child => child.tagName === 'A')) {
137 |                         const link = document.createElement('a');
138 |                         link.href = '#' + text.toLowerCase().replace(/[^a-z0-9]+/g, '-');
139 |                         link.textContent = text;
140 |                         item.appendChild(link);
141 |                     }
142 |                 }
143 |             });
144 |         });
145 |         
146 |         return expanded;
147 |     }
148 |     
149 |     const expandedCount = await expandAllMenus();
150 |     // Final wait to ensure all client-side updates are complete
151 |     await new Promise(r => setTimeout(r, 1000));
152 |     return expandedCount;
153 | })();
154 | """
155 | 
156 | def get_filename_prefix(url: str) -> str:
157 |     """
158 |     Generate a filename prefix from a URL including path components.
159 |     Examples:
160 |     - https://docs.literalai.com/page -> literalai_docs_page
161 |     - https://literalai.com/docs/page -> literalai_docs_page
162 |     - https://api.example.com/path/to/page -> example_api_path_to_page
163 |     
164 |     Args:
165 |         url (str): The URL to process
166 |         
167 |     Returns:
168 |         str: A filename-safe string derived from the URL
169 |     """
170 |     try:
171 |         # Parse the URL
172 |         parsed = urlparse(url)
173 |         
174 |         # Split hostname and reverse it (e.g., 'docs.example.com' -> ['com', 'example', 'docs'])
175 |         hostname_parts = parsed.hostname.split('.')
176 |         hostname_parts.reverse()
177 |         
178 |         # Remove common TLDs and 'www'
179 |         hostname_parts = [p for p in hostname_parts if p not in ('com', 'org', 'net', 'www')]
180 |         
181 |         # Get path components, removing empty strings
182 |         path_parts = [p for p in parsed.path.split('/') if p]
183 |         
184 |         # Combine hostname and path parts
185 |         all_parts = hostname_parts + path_parts
186 |         
187 |         # Clean up parts: lowercase, remove special chars, limit length
188 |         cleaned_parts = []
189 |         for part in all_parts:
190 |             # Convert to lowercase and remove special characters
191 |             cleaned = re.sub(r'[^a-zA-Z0-9]+', '_', part.lower())
192 |             # Remove leading/trailing underscores
193 |             cleaned = cleaned.strip('_')
194 |             # Only add non-empty parts
195 |             if cleaned:
196 |                 cleaned_parts.append(cleaned)
197 |         
198 |         # Join parts with underscores
199 |         return '_'.join(cleaned_parts)
200 |     
201 |     except Exception as e:
202 |         print(colored(f"Error generating filename prefix: {str(e)}", "red"))
203 |         return "default"
204 | 
205 | class MenuCrawler:
206 |     def __init__(self, start_url: str):
207 |         self.start_url = start_url
208 |         
209 |         # Configure browser settings
210 |         self.browser_config = BrowserConfig(
211 |             headless=True,
212 |             viewport_width=1920,
213 |             viewport_height=1080,
214 |             java_script_enabled=True  # Ensure JavaScript is enabled
215 |         )
216 |         
217 |         # Create extraction strategy for menu links
218 |         extraction_schema = {
219 |             "name": "MenuLinks",
220 |             "baseSelector": ", ".join(MENU_SELECTORS),
221 |             "fields": [
222 |                 {
223 |                     "name": "href",
224 |                     "type": "attribute",
225 |                     "attribute": "href"
226 |                 },
227 |                 {
228 |                     "name": "text",
229 |                     "type": "text"
230 |                 },
231 |                 {
232 |                     "name": "onclick",
233 |                     "type": "attribute",
234 |                     "attribute": "onclick"
235 |                 },
236 |                 {
237 |                     "name": "role",
238 |                     "type": "attribute",
239 |                     "attribute": "role"
240 |                 }
241 |             ]
242 |         }
243 |         extraction_strategy = JsonCssExtractionStrategy(extraction_schema)
244 |         
245 |         # Configure crawler settings with proper wait conditions
246 |         self.crawler_config = CrawlerRunConfig(
247 |             extraction_strategy=extraction_strategy,
248 |             cache_mode=CacheMode.BYPASS,  # Don't use cache for fresh results
249 |             verbose=True,  # Enable detailed logging
250 |             wait_for_images=True,  # Ensure lazy-loaded content is captured
251 |             js_code=[
252 |                 # Initial wait for client-side rendering
253 |                 "await new Promise(r => setTimeout(r, 2000));",
254 |                 EXPAND_MENUS_JS
255 |             ],  # Add JavaScript to expand nested menus
256 |             wait_for="""js:() => {
257 |                 // Wait for sidebar and its content to be present
258 |                 const sidebar = document.querySelector('[class*="sidebar"]');
259 |                 if (!sidebar) return false;
260 |                 
261 |                 // Check if we have navigation items
262 |                 const hasNavItems = sidebar.querySelectorAll('a').length > 0;
263 |                 if (hasNavItems) return true;
264 |                 
265 |                 // If no nav items yet, check for loading indicators
266 |                 const isLoading = document.querySelector('[class*="loading"]') !== null;
267 |                 return !isLoading;  // Return true if not loading anymore
268 |             }""",
269 |             session_id="menu_crawler",  # Use a session to maintain state
270 |             js_only=False  # We want full page load first
271 |         )
272 |         
273 |         # Create output directory if it doesn't exist
274 |         if not os.path.exists(INPUT_DIR):
275 |             os.makedirs(INPUT_DIR)
276 |             print(colored(f"Created output directory: {INPUT_DIR}", "green"))
277 | 
278 |     async def extract_all_menu_links(self) -> List[str]:
279 |         """Extract all menu links from the main page, including nested menus."""
280 |         try:
281 |             print(colored(f"Crawling main page: {self.start_url}", "cyan"))
282 |             print(colored("Expanding all nested menus...", "yellow"))
283 |             
284 |             async with AsyncWebCrawler(config=self.browser_config) as crawler:
285 |                 # Get page content using crawl4ai
286 |                 result = await crawler.arun(
287 |                     url=self.start_url,
288 |                     config=self.crawler_config
289 |                 )
290 | 
291 |                 if not result or not result.success:
292 |                     print(colored(f"Failed to get page data", "red"))
293 |                     if result and result.error_message:
294 |                         print(colored(f"Error: {result.error_message}", "red"))
295 |                     return []
296 | 
297 |                 links = set()
298 |                 
299 |                 # Parse the base domain from start_url
300 |                 base_domain = urlparse(self.start_url).netloc
301 |                 
302 |                 # Add the base URL first (without trailing slash for consistency)
303 |                 base_url = self.start_url.rstrip('/')
304 |                 links.add(base_url)
305 |                 print(colored(f"Added base URL: {base_url}", "green"))
306 |                 
307 |                 # Extract links from the result
308 |                 if hasattr(result, 'extracted_content') and result.extracted_content:
309 |                     try:
310 |                         menu_links = json.loads(result.extracted_content)
311 |                         for link in menu_links:
312 |                             href = link.get('href', '')
313 |                             text = link.get('text', '').strip()
314 |                             
315 |                             # Skip empty hrefs
316 |                             if not href:
317 |                                 continue
318 |                                 
319 |                             # Convert relative URLs to absolute
320 |                             absolute_url = urljoin(self.start_url, href)
321 |                             parsed_url = urlparse(absolute_url)
322 |                             
323 |                             # Accept internal links (same domain) that aren't anchors
324 |                             if (parsed_url.netloc == base_domain and 
325 |                                 not href.startswith('#') and 
326 |                                 '#' not in absolute_url):
327 |                                 
328 |                                 # Remove any trailing slashes for consistency
329 |                                 absolute_url = absolute_url.rstrip('/')
330 |                                 
331 |                                 links.add(absolute_url)
332 |                                 print(colored(f"Found link: {text} -> {absolute_url}", "green"))
333 |                             else:
334 |                                 print(colored(f"Skipping external or anchor link: {text} -> {href}", "yellow"))
335 |                                 
336 |                     except json.JSONDecodeError as e:
337 |                         print(colored(f"Error parsing extracted content: {str(e)}", "red"))
338 |                 
339 |                 print(colored(f"\nFound {len(links)} unique menu links", "green"))
340 |                 return sorted(list(links))
341 | 
342 |         except Exception as e:
343 |             print(colored(f"Error extracting menu links: {str(e)}", "red"))
344 |             return []
345 | 
346 |     def save_results(self, results: dict) -> str:
347 |         """Save crawling results to a JSON file in the input_files directory."""
348 |         try:
349 |             # Create input_files directory if it doesn't exist
350 |             os.makedirs(INPUT_DIR, exist_ok=True)
351 |             
352 |             # Generate filename using the same pattern
353 |             filename_prefix = get_filename_prefix(self.start_url)
354 |             timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
355 |             filename = f"{filename_prefix}_menu_links_{timestamp}.json"
356 |             filepath = os.path.join(INPUT_DIR, filename)
357 |             
358 |             with open(filepath, "w", encoding="utf-8") as f:
359 |                 json.dump(results, f, indent=2)
360 |             
361 |             print(colored(f"\n✓ Menu links saved to: {filepath}", "green"))
362 |             print(colored("\nTo crawl these URLs with multi_url_crawler.py, run:", "cyan"))
363 |             print(colored(f"python multi_url_crawler.py --urls {filename}", "yellow"))
364 |             return filepath
365 |             
366 |         except Exception as e:
367 |             print(colored(f"\n✗ Error saving menu links: {str(e)}", "red"))
368 |             return None
369 | 
370 |     async def crawl(self):
371 |         """Main crawling method."""
372 |         try:
373 |             # Extract all menu links from the main page
374 |             menu_links = await self.extract_all_menu_links()
375 | 
376 |             # Save results
377 |             results = {
378 |                 "start_url": self.start_url,
379 |                 "total_links_found": len(menu_links),
380 |                 "menu_links": menu_links
381 |             }
382 | 
383 |             self.save_results(results)
384 | 
385 |             print(colored(f"\nCrawling completed!", "green"))
386 |             print(colored(f"Total unique menu links found: {len(menu_links)}", "green"))
387 | 
388 |         except Exception as e:
389 |             print(colored(f"Error during crawling: {str(e)}", "red"))
390 | 
391 | async def main():
392 |     # Set up argument parser
393 |     parser = argparse.ArgumentParser(description='Extract menu links from a documentation website')
394 |     parser.add_argument('url', type=str, help='Documentation site URL to crawl')
395 |     parser.add_argument('--selectors', type=str, nargs='+', help='Custom menu selectors (optional)')
396 |     args = parser.parse_args()
397 | 
398 |     try:
399 |         # Update menu selectors if custom ones provided
400 |         if args.selectors:
401 |             print(colored("Using custom menu selectors:", "cyan"))
402 |             for selector in args.selectors:
403 |                 print(colored(f"  {selector}", "yellow"))
404 |             global MENU_SELECTORS
405 |             MENU_SELECTORS = args.selectors
406 | 
407 |         crawler = MenuCrawler(args.url)
408 |         await crawler.crawl()
409 |     except Exception as e:
410 |         print(colored(f"Error in main: {str(e)}", "red"))
411 |         sys.exit(1)
412 | 
413 | if __name__ == "__main__":
414 |     print(colored("Starting documentation menu crawler...", "cyan"))
415 |     asyncio.run(main()) 
```
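
A minimal usage sketch for `MenuCrawler` (illustrative, not part of the repository): the documentation URL is a placeholder. The crawler expands the site's navigation, collects internal menu links, and writes a `<prefix>_menu_links_<timestamp>.json` file into `input_files/`, which can then be fed to `multi_url_crawler.py` using the command it prints.

```python
import asyncio

from docs_scraper.crawlers.menu_crawler import MenuCrawler


async def run():
    # Placeholder documentation site; pass custom CSS selectors on the CLI
    # via --selectors if the defaults miss the site's navigation.
    crawler = MenuCrawler("https://docs.example.com/")
    await crawler.crawl()


if __name__ == "__main__":
    asyncio.run(run())
```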