This is page 1 of 2. Use http://codebase.md/felores/docs_scraper_mcp?lines=true&page={x} to view the full context.
# Directory Structure
```
├── .cursor
│ └── rules
│ ├── implementation-plan.mdc
│ └── mcp-development-protocol.mdc
├── .gitignore
├── .venv
│ ├── Include
│ │ └── site
│ │ └── python3.12
│ │ └── greenlet
│ │ └── greenlet.h
│ ├── pyvenv.cfg
│ └── Scripts
│ ├── activate
│ ├── activate.bat
│ ├── Activate.ps1
│ ├── cchardetect
│ ├── crawl4ai-doctor.exe
│ ├── crawl4ai-download-models.exe
│ ├── crawl4ai-migrate.exe
│ ├── crawl4ai-setup.exe
│ ├── crwl.exe
│ ├── deactivate.bat
│ ├── distro.exe
│ ├── docs-scraper.exe
│ ├── dotenv.exe
│ ├── f2py.exe
│ ├── httpx.exe
│ ├── huggingface-cli.exe
│ ├── jsonschema.exe
│ ├── litellm.exe
│ ├── markdown-it.exe
│ ├── mcp.exe
│ ├── nltk.exe
│ ├── normalizer.exe
│ ├── numpy-config.exe
│ ├── openai.exe
│ ├── pip.exe
│ ├── pip3.12.exe
│ ├── pip3.exe
│ ├── playwright.exe
│ ├── py.test.exe
│ ├── pygmentize.exe
│ ├── pytest.exe
│ ├── python.exe
│ ├── pythonw.exe
│ ├── tqdm.exe
│ ├── typer.exe
│ └── uvicorn.exe
├── input_files
│ └── .gitkeep
├── LICENSE
├── pyproject.toml
├── README.md
├── requirements.txt
├── scraped_docs
│ └── .gitkeep
├── src
│ └── docs_scraper
│ ├── __init__.py
│ ├── cli.py
│ ├── crawlers
│ │ ├── __init__.py
│ │ ├── menu_crawler.py
│ │ ├── multi_url_crawler.py
│ │ ├── single_url_crawler.py
│ │ └── sitemap_crawler.py
│ ├── server.py
│ └── utils
│ ├── __init__.py
│ ├── html_parser.py
│ └── request_handler.py
└── tests
├── conftest.py
├── test_crawlers
│ ├── test_menu_crawler.py
│ ├── test_multi_url_crawler.py
│ ├── test_single_url_crawler.py
│ └── test_sitemap_crawler.py
└── test_utils
├── test_html_parser.py
└── test_request_handler.py
```
# Files
--------------------------------------------------------------------------------
/input_files/.gitkeep:
--------------------------------------------------------------------------------
```
1 |
```
--------------------------------------------------------------------------------
/scraped_docs/.gitkeep:
--------------------------------------------------------------------------------
```
1 |
```
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
1 | # Python
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 | *.so
6 | .Python
7 | env/
8 | build/
9 | develop-eggs/
10 | dist/
11 | downloads/
12 | eggs/
13 | .eggs/
14 | lib/
15 | lib64/
16 | parts/
17 | sdist/
18 | var/
19 | wheels/
20 | *.egg-info/
21 | .installed.cfg
22 | *.egg
23 |
24 | # Virtual Environment
25 | venv/
26 | ENV/
27 | .env
28 |
29 | # IDE
30 | .idea/
31 | .vscode/
32 | *.swp
33 | *.swo
34 | .DS_Store
35 |
36 | # Scraped Docs - ignore contents but keep directory
37 | scraped_docs/*
38 | !scraped_docs/.gitkeep
39 |
40 | # Input Files - ignore contents but keep directory
41 | input_files/*
42 | !input_files/.gitkeep
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
```markdown
1 | # Crawl4AI Documentation Scraper
2 |
3 | Keep your dependency documentation lean, current, and AI-ready. This toolkit helps you extract clean, focused documentation from any framework or library website, perfect for both human readers and LLM consumption.
4 |
5 | ## Why This Tool?
6 |
7 | In today's fast-paced development environment, you need:
8 | - 📚 Quick access to dependency documentation without the bloat
9 | - 🤖 Documentation in a format that's ready for RAG systems and LLMs
10 | - 🎯 Focused content without navigation elements, ads, or irrelevant sections
11 | - ⚡ Fast, efficient way to keep documentation up-to-date
12 | - 🧹 Clean Markdown output for easy integration with documentation tools
13 |
14 | Traditional web scraping often gives you everything - including navigation menus, footers, ads, and other noise. This toolkit is specifically designed to extract only what matters: the actual documentation content.
15 |
16 | ### Key Benefits
17 |
18 | 1. **Clean Documentation Output**
19 | - Markdown format for content-focused documentation
20 | - JSON format for structured menu data
21 | - Perfect for documentation sites, wikis, and knowledge bases
22 | - Ideal format for LLM training and RAG systems
23 |
24 | 2. **Smart Content Extraction**
25 | - Automatically identifies main content areas
26 | - Strips away navigation, ads, and irrelevant sections
27 | - Preserves code blocks and technical formatting
28 | - Maintains proper Markdown structure
29 |
30 | 3. **Flexible Crawling Strategies**
31 | - Single page for quick reference docs
32 | - Multi-page for comprehensive library documentation
33 | - Sitemap-based for complete framework coverage
34 | - Menu-based for structured documentation hierarchies
35 |
36 | 4. **LLM and RAG Ready**
37 | - Clean Markdown text suitable for embeddings
38 | - Preserved code blocks for technical accuracy
39 | - Structured menu data in JSON format
40 | - Consistent formatting for reliable processing
41 |
42 | A comprehensive Python toolkit for scraping documentation websites using different crawling strategies. Built using the Crawl4AI library for efficient web crawling.
43 |
44 | [Crawl4AI](https://github.com/unclecode/crawl4ai)
45 |
46 | ## Features
47 |
48 | ### Core Features
49 | - 🚀 Multiple crawling strategies
50 | - 📑 Automatic nested menu expansion
51 | - 🔄 Handles dynamic content and lazy-loaded elements
52 | - 🎯 Configurable selectors
53 | - 📝 Clean Markdown output for documentation
54 | - 📊 JSON output for menu structure
55 | - 🎨 Colorful terminal feedback
56 | - 🔍 Smart URL processing
57 | - ⚡ Asynchronous execution
58 |
59 | ### Available Crawlers
60 | 1. **Single URL Crawler** (`single_url_crawler.py`)
61 | - Extracts content from a single documentation page
62 | - Outputs clean Markdown format
63 | - Perfect for targeted content extraction
64 | - Configurable content selectors
65 |
66 | 2. **Multi URL Crawler** (`multi_url_crawler.py`)
67 | - Processes multiple URLs in parallel
68 | - Generates individual Markdown files per page
69 | - Efficient batch processing
70 | - Shared browser session for better performance
71 |
72 | 3. **Sitemap Crawler** (`sitemap_crawler.py`)
73 | - Automatically discovers and crawls sitemap.xml
74 | - Creates Markdown files for each page
75 | - Supports recursive sitemap parsing
76 | - Handles gzipped sitemaps
77 |
78 | 4. **Menu Crawler** (`menu_crawler.py`)
79 | - Extracts all menu links from documentation
80 | - Outputs structured JSON format
81 | - Handles nested and dynamic menus
82 | - Smart menu expansion
83 |
84 | ## Requirements
85 |
86 | - Python 3.7+
87 | - Virtual Environment (recommended)
88 |
89 | ## Installation
90 |
91 | 1. Clone the repository:
92 | ```bash
93 | git clone https://github.com/felores/crawl4ai_docs_scraper.git
94 | cd crawl4ai_docs_scraper
95 | ```
96 |
97 | 2. Create and activate a virtual environment:
98 | ```bash
99 | python -m venv venv
100 | source venv/bin/activate # On Windows: venv\Scripts\activate
101 | ```
102 |
103 | 3. Install dependencies:
104 | ```bash
105 | pip install -r requirements.txt
106 | ```
107 |
108 | ## Usage
109 |
110 | ### 1. Single URL Crawler
111 |
112 | ```bash
113 | python single_url_crawler.py https://docs.example.com/page
114 | ```
115 |
116 | Arguments:
117 | - URL: Target documentation URL (required, first argument)
118 |
119 | Note: Use quotes only if your URL contains special characters or spaces.
120 |
121 | Output format (Markdown):
122 | ```markdown
123 | # Page Title
124 |
125 | ## Section 1
126 | Content with preserved formatting, including:
127 | - Lists
128 | - Links
129 | - Tables
130 |
131 | ### Code Examples
132 | ```python
133 | def example():
134 | return "Code blocks are preserved"
135 | ```
136 |
137 | ### 2. Multi URL Crawler
138 |
139 | ```bash
140 | # Using a text file with URLs
141 | python multi_url_crawler.py urls.txt
142 |
143 | # Using JSON output from menu crawler
144 | python multi_url_crawler.py menu_links.json
145 |
146 | # Using custom output prefix
147 | python multi_url_crawler.py menu_links.json --output-prefix custom_name
148 | ```
149 |
150 | Arguments:
151 | - URLs file: Path to file containing URLs (required, first argument)
152 | - Can be .txt with one URL per line
153 | - Or .json from menu crawler output
154 | - `--output-prefix`: Custom prefix for output markdown file (optional)
155 |
156 | Note: Use quotes only if your file path contains spaces.
157 |
158 | Output filename format:
159 | - Without `--output-prefix`: `domain_path_docs_content_timestamp.md` (e.g., `cloudflare_agents_docs_content_20240323_223656.md`)
160 | - With `--output-prefix`: `custom_prefix_docs_content_timestamp.md` (e.g., `custom_name_docs_content_20240323_223656.md`)
161 |
162 | The crawler accepts two types of input files:
163 | 1. Text file with one URL per line:
164 | ```text
165 | https://docs.example.com/page1
166 | https://docs.example.com/page2
167 | https://docs.example.com/page3
168 | ```
169 |
170 | 2. JSON file (compatible with menu crawler output):
171 | ```json
172 | {
173 | "menu_links": [
174 | "https://docs.example.com/page1",
175 | "https://docs.example.com/page2"
176 | ]
177 | }
178 | ```
179 |
180 | ### 3. Sitemap Crawler
181 |
182 | ```bash
183 | python sitemap_crawler.py https://docs.example.com/sitemap.xml
184 | ```
185 |
186 | Options:
187 | - `--max-depth`: Maximum sitemap recursion depth (optional)
188 | - `--patterns`: URL patterns to include (optional)
189 |
190 | ### 4. Menu Crawler
191 |
192 | ```bash
193 | python menu_crawler.py https://docs.example.com
194 | ```
195 |
196 | Options:
197 | - `--selectors`: Custom menu selectors (optional)
198 |
199 | The menu crawler now saves its output to the `input_files` directory, making it ready for immediate use with the multi-url crawler. The output JSON has this format:
200 | ```json
201 | {
202 | "start_url": "https://docs.example.com/",
203 | "total_links_found": 42,
204 | "menu_links": [
205 | "https://docs.example.com/page1",
206 | "https://docs.example.com/page2"
207 | ]
208 | }
209 | ```
210 |
211 | After the menu crawler finishes, it prints a ready-to-run multi-url crawler command that uses the generated file.
212 |
213 | ## Directory Structure
214 |
215 | ```
216 | crawl4ai_docs_scraper/
217 | ├── input_files/ # Input files for URL processing
218 | │ ├── urls.txt # Text file with URLs
219 | │ └── menu_links.json # JSON output from menu crawler
220 | ├── scraped_docs/ # Output directory for markdown files
221 | │ └── docs_timestamp.md # Generated documentation
222 | ├── multi_url_crawler.py
223 | ├── menu_crawler.py
224 | └── requirements.txt
225 | ```
226 |
227 | ## Error Handling
228 |
229 | All crawlers include comprehensive error handling with colored terminal output:
230 | - 🟢 Green: Success messages
231 | - 🔵 Cyan: Processing status
232 | - 🟡 Yellow: Warnings
233 | - 🔴 Red: Error messages
234 |
235 | ## Contributing
236 |
237 | Contributions are welcome! Please feel free to submit a Pull Request.
238 |
239 | ## License
240 |
241 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
242 |
243 | ## Attribution
244 |
245 | This project uses [Crawl4AI](https://github.com/unclecode/crawl4ai) for web data extraction.
246 |
247 | ## Acknowledgments
248 |
249 | - Built with [Crawl4AI](https://github.com/unclecode/crawl4ai)
250 | - Uses [termcolor](https://pypi.org/project/termcolor/) for colorful terminal output
```
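
The menu-crawler JSON documented above is the only contract between the menu crawler and the multi URL crawler, so it can also be consumed directly from code. A minimal sketch, assuming the `input_files/menu_links.json` location described in the README (illustrative only, not part of the repository):

```python
# Illustrative only: read a menu-crawler output file in the format documented
# above and print the URLs it contains. The path below is an assumption based
# on the README, not a file shipped with the repository.
import json
from pathlib import Path
from typing import List


def load_menu_links(path: str = "input_files/menu_links.json") -> List[str]:
    """Return the URLs stored under the "menu_links" key."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data.get("menu_links", [])


if __name__ == "__main__":
    for url in load_menu_links():
        print(url)
```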
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
```
1 | crawl4ai
2 | aiohttp
3 | termcolor
4 | playwright
```
--------------------------------------------------------------------------------
/src/docs_scraper/utils/__init__.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Utility modules for web crawling and HTML parsing.
3 | """
4 | from .request_handler import RequestHandler
5 | from .html_parser import HTMLParser
6 |
7 | __all__ = [
8 | 'RequestHandler',
9 | 'HTMLParser'
10 | ]
```
--------------------------------------------------------------------------------
/src/docs_scraper/__init__.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Documentation scraper MCP server package.
3 | """
4 | # Import subpackages but not modules to avoid circular imports
5 | from . import crawlers
6 | from . import utils
7 |
8 | # Expose important items at package level
9 | __all__ = ['crawlers', 'utils']
```
--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/__init__.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Web crawler implementations for documentation scraping.
3 | """
4 | from .single_url_crawler import SingleURLCrawler
5 | from .multi_url_crawler import MultiURLCrawler
6 | from .sitemap_crawler import SitemapCrawler
7 | from .menu_crawler import MenuCrawler
8 |
9 | __all__ = [
10 | 'SingleURLCrawler',
11 | 'MultiURLCrawler',
12 | 'SitemapCrawler',
13 | 'MenuCrawler'
14 | ]
```
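
The package simply re-exports the four crawler classes; the test suite wires each of them to a `RequestHandler` and an `HTMLParser`. A minimal sketch of that wiring, assuming the `request_handler=`/`html_parser=` keyword arguments and the async `crawl()` method used in `tests/test_crawlers/` (the crawler modules themselves are not shown on this page):

```python
# Sketch only: mirrors how the tests construct a crawler. The constructor
# keywords and the crawl() coroutine are assumptions taken from the test
# suite, not from the crawler sources.
import asyncio

import aiohttp

from docs_scraper.crawlers import SingleURLCrawler
from docs_scraper.utils import HTMLParser, RequestHandler


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        crawler = SingleURLCrawler(
            request_handler=RequestHandler(session=session),
            html_parser=HTMLParser(base_url="https://docs.example.com"),
        )
        result = await crawler.crawl("https://docs.example.com/page")
        print(result["success"], result.get("error"))


asyncio.run(main())
```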
--------------------------------------------------------------------------------
/.venv/Scripts/deactivate.bat:
--------------------------------------------------------------------------------
```
1 | @echo off
2 |
3 | if defined _OLD_VIRTUAL_PROMPT (
4 | set "PROMPT=%_OLD_VIRTUAL_PROMPT%"
5 | )
6 | set _OLD_VIRTUAL_PROMPT=
7 |
8 | if defined _OLD_VIRTUAL_PYTHONHOME (
9 | set "PYTHONHOME=%_OLD_VIRTUAL_PYTHONHOME%"
10 | set _OLD_VIRTUAL_PYTHONHOME=
11 | )
12 |
13 | if defined _OLD_VIRTUAL_PATH (
14 | set "PATH=%_OLD_VIRTUAL_PATH%"
15 | )
16 |
17 | set _OLD_VIRTUAL_PATH=
18 |
19 | set VIRTUAL_ENV=
20 | set VIRTUAL_ENV_PROMPT=
21 |
22 | :END
23 |
```
--------------------------------------------------------------------------------
/src/docs_scraper/cli.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Command line interface for the docs_scraper package.
3 | """
4 | import logging
5 |
6 | def main():
7 | """Entry point for the package when run from the command line."""
8 | from docs_scraper.server import main as server_main
9 |
10 | # Configure logging
11 | logging.basicConfig(
12 | level=logging.INFO,
13 | format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
14 | )
15 |
16 | # Run the server
17 | server_main()
18 |
19 | if __name__ == "__main__":
20 | main()
```
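
`pyproject.toml` (below) maps the `docs-scraper` console script to `docs_scraper.cli:main`, so the same entry point can also be called programmatically. A tiny sketch; `docs_scraper.server` itself appears on page 2 of this dump:

```python
# Programmatic equivalent of the "docs-scraper" console script declared in
# pyproject.toml: main() configures logging, then starts docs_scraper.server.
from docs_scraper.cli import main

if __name__ == "__main__":
    main()
```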
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
```toml
1 | [build-system]
2 | requires = ["setuptools>=61.0"]
3 | build-backend = "setuptools.build_meta"
4 |
5 | [project]
6 | name = "docs_scraper"
7 | version = "0.1.0"
8 | authors = [
9 | { name = "Your Name", email = "[email protected]" }
10 | ]
11 | description = "A documentation scraping tool"
12 | requires-python = ">=3.7"
13 | dependencies = [
14 | "beautifulsoup4",
15 | "requests",
16 | "aiohttp",
17 | "lxml",
18 | "termcolor",
19 | "crawl4ai"
20 | ]
21 | classifiers = [
22 | "Programming Language :: Python :: 3",
23 | "Operating System :: OS Independent",
24 | ]
25 |
26 | [project.optional-dependencies]
27 | test = [
28 | "pytest",
29 | "pytest-asyncio",
30 | "aioresponses"
31 | ]
32 |
33 | [project.scripts]
34 | docs-scraper = "docs_scraper.cli:main"
35 |
36 | [tool.setuptools.packages.find]
37 | where = ["src"]
38 | include = ["docs_scraper*"]
39 | namespaces = false
40 |
41 | [tool.hatch.build]
42 | packages = ["src/docs_scraper"]
```
--------------------------------------------------------------------------------
/.venv/Scripts/activate.bat:
--------------------------------------------------------------------------------
```
1 | @echo off
2 |
3 | rem This file is UTF-8 encoded, so we need to update the current code page while executing it
4 | for /f "tokens=2 delims=:." %%a in ('"%SystemRoot%\System32\chcp.com"') do (
5 | set _OLD_CODEPAGE=%%a
6 | )
7 | if defined _OLD_CODEPAGE (
8 | "%SystemRoot%\System32\chcp.com" 65001 > nul
9 | )
10 |
11 | set "VIRTUAL_ENV=D:\AI-DEV\mcp\docs_scraper_mcp\.venv"
12 |
13 | if not defined PROMPT set PROMPT=$P$G
14 |
15 | if defined _OLD_VIRTUAL_PROMPT set PROMPT=%_OLD_VIRTUAL_PROMPT%
16 | if defined _OLD_VIRTUAL_PYTHONHOME set PYTHONHOME=%_OLD_VIRTUAL_PYTHONHOME%
17 |
18 | set _OLD_VIRTUAL_PROMPT=%PROMPT%
19 | set PROMPT=(.venv) %PROMPT%
20 |
21 | if defined PYTHONHOME set _OLD_VIRTUAL_PYTHONHOME=%PYTHONHOME%
22 | set PYTHONHOME=
23 |
24 | if defined _OLD_VIRTUAL_PATH set PATH=%_OLD_VIRTUAL_PATH%
25 | if not defined _OLD_VIRTUAL_PATH set _OLD_VIRTUAL_PATH=%PATH%
26 |
27 | set "PATH=%VIRTUAL_ENV%\Scripts;%PATH%"
28 | set "VIRTUAL_ENV_PROMPT=(.venv) "
29 |
30 | :END
31 | if defined _OLD_CODEPAGE (
32 | "%SystemRoot%\System32\chcp.com" %_OLD_CODEPAGE% > nul
33 | set _OLD_CODEPAGE=
34 | )
35 |
```
--------------------------------------------------------------------------------
/tests/conftest.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Test configuration and fixtures for the docs_scraper package.
3 | """
4 | import os
5 | import pytest
6 | import aiohttp
7 | from typing import AsyncGenerator, Dict, Any
8 | from aioresponses import aioresponses
9 | from bs4 import BeautifulSoup
10 |
11 | @pytest.fixture
12 | def mock_aiohttp() -> aioresponses:
13 | """Fixture for mocking aiohttp requests."""
14 | with aioresponses() as m:
15 | yield m
16 |
17 | @pytest.fixture
18 | def sample_html() -> str:
19 | """Sample HTML content for testing."""
20 | return """
21 | <!DOCTYPE html>
22 | <html>
23 | <head>
24 | <title>Test Page</title>
25 | <meta name="description" content="Test description">
26 | </head>
27 | <body>
28 | <nav class="menu">
29 | <ul>
30 | <li><a href="/page1">Page 1</a></li>
31 | <li>
32 | <a href="/section1">Section 1</a>
33 | <ul>
34 | <li><a href="/section1/page1">Section 1.1</a></li>
35 | <li><a href="/section1/page2">Section 1.2</a></li>
36 | </ul>
37 | </li>
38 | </ul>
39 | </nav>
40 | <main>
41 | <h1>Welcome</h1>
42 | <p>Test content</p>
43 | <a href="/test1">Link 1</a>
44 | <a href="/test2">Link 2</a>
45 | </main>
46 | </body>
47 | </html>
48 | """
49 |
50 | @pytest.fixture
51 | def sample_sitemap() -> str:
52 | """Sample sitemap.xml content for testing."""
53 | return """<?xml version="1.0" encoding="UTF-8"?>
54 | <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
55 | <url>
56 | <loc>https://example.com/</loc>
57 | <lastmod>2024-03-24</lastmod>
58 | </url>
59 | <url>
60 | <loc>https://example.com/page1</loc>
61 | <lastmod>2024-03-24</lastmod>
62 | </url>
63 | <url>
64 | <loc>https://example.com/page2</loc>
65 | <lastmod>2024-03-24</lastmod>
66 | </url>
67 | </urlset>
68 | """
69 |
70 | @pytest.fixture
71 | def mock_website(mock_aiohttp, sample_html, sample_sitemap) -> aioresponses:
72 | """Set up a mock website with various pages and a sitemap."""
73 | base_url = "https://example.com"
74 | pages = {
75 | "/": sample_html,
76 | "/page1": sample_html.replace("Test Page", "Page 1"),
77 | "/page2": sample_html.replace("Test Page", "Page 2"),
78 | "/section1": sample_html.replace("Test Page", "Section 1"),
79 | "/section1/page1": sample_html.replace("Test Page", "Section 1.1"),
80 | "/section1/page2": sample_html.replace("Test Page", "Section 1.2"),
81 | "/robots.txt": "User-agent: *\nAllow: /",
82 | "/sitemap.xml": sample_sitemap
83 | }
84 |
85 | for path, content in pages.items():
86 | mock_aiohttp.get(f"{base_url}{path}", status=200, body=content)
87 |     return mock_aiohttp
88 | @pytest.fixture
89 | async def aiohttp_session() -> AsyncGenerator[aiohttp.ClientSession, None]:
90 | """Create an aiohttp ClientSession for testing."""
91 | async with aiohttp.ClientSession() as session:
92 | yield session
93 |
94 | @pytest.fixture
95 | def test_urls() -> Dict[str, Any]:
96 | """Test URLs and related data for testing."""
97 | base_url = "https://example.com"
98 | return {
99 | "base_url": base_url,
100 | "valid_urls": [
101 | f"{base_url}/",
102 | f"{base_url}/page1",
103 | f"{base_url}/page2"
104 | ],
105 | "invalid_urls": [
106 | "not_a_url",
107 | "ftp://example.com",
108 | "https://nonexistent.example.com"
109 | ],
110 | "menu_selector": "nav.menu",
111 | "sitemap_url": f"{base_url}/sitemap.xml"
112 | }
```
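
The fixtures above combine `aioresponses` request mocking with shared HTML and sitemap samples, so any test that accepts `mock_website` sees a fully mocked `https://example.com` site. A short sketch of a test written against these fixtures alone (it touches no crawler code, so everything it uses is defined in this file or in `aiohttp`):

```python
# Sketch of an additional test built only on the conftest fixtures above.
import pytest


@pytest.mark.asyncio
async def test_mock_website_serves_sample_html(mock_website, test_urls, aiohttp_session):
    """The mocked base URL should return the sample HTML registered in conftest."""
    async with aiohttp_session.get(test_urls["base_url"] + "/") as response:
        body = await response.text()
    assert response.status == 200
    assert "<title>Test Page</title>" in body
```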
--------------------------------------------------------------------------------
/tests/test_crawlers/test_single_url_crawler.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Tests for the SingleURLCrawler class.
3 | """
4 | import pytest
5 | from docs_scraper.crawlers import SingleURLCrawler
6 | from docs_scraper.utils import RequestHandler, HTMLParser
7 |
8 | @pytest.mark.asyncio
9 | async def test_single_url_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
10 | """Test successful crawling of a single URL."""
11 | url = test_urls["valid_urls"][0]
12 | request_handler = RequestHandler(session=aiohttp_session)
13 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
14 | crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
15 |
16 | result = await crawler.crawl(url)
17 |
18 | assert result["success"] is True
19 | assert result["url"] == url
20 | assert "content" in result
21 | assert "title" in result["metadata"]
22 | assert "description" in result["metadata"]
23 | assert len(result["links"]) > 0
24 | assert result["status_code"] == 200
25 | assert result["error"] is None
26 |
27 | @pytest.mark.asyncio
28 | async def test_single_url_crawler_invalid_url(mock_website, test_urls, aiohttp_session):
29 | """Test crawling with an invalid URL."""
30 | url = test_urls["invalid_urls"][0]
31 | request_handler = RequestHandler(session=aiohttp_session)
32 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
33 | crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
34 |
35 | result = await crawler.crawl(url)
36 |
37 | assert result["success"] is False
38 | assert result["url"] == url
39 | assert result["content"] is None
40 | assert result["metadata"] == {}
41 | assert result["links"] == []
42 | assert result["error"] is not None
43 |
44 | @pytest.mark.asyncio
45 | async def test_single_url_crawler_nonexistent_url(mock_website, test_urls, aiohttp_session):
46 | """Test crawling a URL that doesn't exist."""
47 | url = test_urls["invalid_urls"][2]
48 | request_handler = RequestHandler(session=aiohttp_session)
49 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
50 | crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
51 |
52 | result = await crawler.crawl(url)
53 |
54 | assert result["success"] is False
55 | assert result["url"] == url
56 | assert result["content"] is None
57 | assert result["metadata"] == {}
58 | assert result["links"] == []
59 | assert result["error"] is not None
60 |
61 | @pytest.mark.asyncio
62 | async def test_single_url_crawler_metadata_extraction(mock_website, test_urls, aiohttp_session):
63 | """Test extraction of metadata from a crawled page."""
64 | url = test_urls["valid_urls"][0]
65 | request_handler = RequestHandler(session=aiohttp_session)
66 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
67 | crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
68 |
69 | result = await crawler.crawl(url)
70 |
71 | assert result["success"] is True
72 | assert result["metadata"]["title"] == "Test Page"
73 | assert result["metadata"]["description"] == "Test description"
74 |
75 | @pytest.mark.asyncio
76 | async def test_single_url_crawler_link_extraction(mock_website, test_urls, aiohttp_session):
77 | """Test extraction of links from a crawled page."""
78 | url = test_urls["valid_urls"][0]
79 | request_handler = RequestHandler(session=aiohttp_session)
80 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
81 | crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
82 |
83 | result = await crawler.crawl(url)
84 |
85 | assert result["success"] is True
86 | assert len(result["links"]) >= 6 # Number of links in sample HTML
87 | assert "/page1" in result["links"]
88 | assert "/section1" in result["links"]
89 | assert "/test1" in result["links"]
90 | assert "/test2" in result["links"]
91 |
92 | @pytest.mark.asyncio
93 | async def test_single_url_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
94 | """Test rate limiting functionality."""
95 | url = test_urls["valid_urls"][0]
96 | request_handler = RequestHandler(session=aiohttp_session, rate_limit=1) # 1 request per second
97 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
98 | crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
99 |
100 | import time
101 | start_time = time.time()
102 |
103 | # Make multiple requests
104 | for _ in range(3):
105 | result = await crawler.crawl(url)
106 | assert result["success"] is True
107 |
108 | end_time = time.time()
109 | elapsed_time = end_time - start_time
110 |
111 | # Should take at least 2 seconds due to rate limiting
112 | assert elapsed_time >= 2.0
```
--------------------------------------------------------------------------------
/tests/test_crawlers/test_multi_url_crawler.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Tests for the MultiURLCrawler class.
3 | """
4 | import pytest
5 | from docs_scraper.crawlers import MultiURLCrawler
6 | from docs_scraper.utils import RequestHandler, HTMLParser
7 |
8 | @pytest.mark.asyncio
9 | async def test_multi_url_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
10 | """Test successful crawling of multiple URLs."""
11 | urls = test_urls["valid_urls"]
12 | request_handler = RequestHandler(session=aiohttp_session)
13 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
14 | crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
15 |
16 | results = await crawler.crawl(urls)
17 |
18 | assert len(results) == len(urls)
19 | for result, url in zip(results, urls):
20 | assert result["success"] is True
21 | assert result["url"] == url
22 | assert "content" in result
23 | assert "title" in result["metadata"]
24 | assert "description" in result["metadata"]
25 | assert len(result["links"]) > 0
26 | assert result["status_code"] == 200
27 | assert result["error"] is None
28 |
29 | @pytest.mark.asyncio
30 | async def test_multi_url_crawler_mixed_urls(mock_website, test_urls, aiohttp_session):
31 | """Test crawling a mix of valid and invalid URLs."""
32 | urls = test_urls["valid_urls"][:1] + test_urls["invalid_urls"][:1]
33 | request_handler = RequestHandler(session=aiohttp_session)
34 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
35 | crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
36 |
37 | results = await crawler.crawl(urls)
38 |
39 | assert len(results) == len(urls)
40 | # Valid URL
41 | assert results[0]["success"] is True
42 | assert results[0]["url"] == urls[0]
43 | assert "content" in results[0]
44 | # Invalid URL
45 | assert results[1]["success"] is False
46 | assert results[1]["url"] == urls[1]
47 | assert results[1]["content"] is None
48 |
49 | @pytest.mark.asyncio
50 | async def test_multi_url_crawler_concurrent_limit(mock_website, test_urls, aiohttp_session):
51 | """Test concurrent request limiting."""
52 | urls = test_urls["valid_urls"] * 2 # Duplicate URLs to have more requests
53 | request_handler = RequestHandler(session=aiohttp_session)
54 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
55 | crawler = MultiURLCrawler(
56 | request_handler=request_handler,
57 | html_parser=html_parser,
58 | concurrent_limit=2
59 | )
60 |
61 | import time
62 | start_time = time.time()
63 |
64 | results = await crawler.crawl(urls)
65 |
66 | end_time = time.time()
67 | elapsed_time = end_time - start_time
68 |
69 | assert len(results) == len(urls)
70 | # With concurrent_limit=2, processing 6 URLs should take at least 3 time units
71 | assert elapsed_time >= (len(urls) / 2) * 0.1 # Assuming each request takes ~0.1s
72 |
73 | @pytest.mark.asyncio
74 | async def test_multi_url_crawler_empty_urls(mock_website, aiohttp_session):
75 | """Test crawling with empty URL list."""
76 | request_handler = RequestHandler(session=aiohttp_session)
77 |     html_parser = HTMLParser(base_url="https://example.com")
78 | crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
79 |
80 | results = await crawler.crawl([])
81 |
82 | assert len(results) == 0
83 |
84 | @pytest.mark.asyncio
85 | async def test_multi_url_crawler_duplicate_urls(mock_website, test_urls, aiohttp_session):
86 | """Test crawling with duplicate URLs."""
87 | url = test_urls["valid_urls"][0]
88 | urls = [url, url, url] # Same URL multiple times
89 | request_handler = RequestHandler(session=aiohttp_session)
90 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
91 | crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
92 |
93 | results = await crawler.crawl(urls)
94 |
95 | assert len(results) == len(urls)
96 | for result in results:
97 | assert result["success"] is True
98 | assert result["url"] == url
99 | assert result["metadata"]["title"] == "Test Page"
100 |
101 | @pytest.mark.asyncio
102 | async def test_multi_url_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
103 | """Test rate limiting with multiple URLs."""
104 | urls = test_urls["valid_urls"]
105 | request_handler = RequestHandler(session=aiohttp_session, rate_limit=1) # 1 request per second
106 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
107 | crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
108 |
109 | import time
110 | start_time = time.time()
111 |
112 | results = await crawler.crawl(urls)
113 |
114 | end_time = time.time()
115 | elapsed_time = end_time - start_time
116 |
117 | assert len(results) == len(urls)
118 | # Should take at least (len(urls) - 1) seconds due to rate limiting
119 | assert elapsed_time >= len(urls) - 1
```
--------------------------------------------------------------------------------
/.venv/Include/site/python3.12/greenlet/greenlet.h:
--------------------------------------------------------------------------------
```
1 | /* -*- indent-tabs-mode: nil; tab-width: 4; -*- */
2 |
3 | /* Greenlet object interface */
4 |
5 | #ifndef Py_GREENLETOBJECT_H
6 | #define Py_GREENLETOBJECT_H
7 |
8 |
9 | #include <Python.h>
10 |
11 | #ifdef __cplusplus
12 | extern "C" {
13 | #endif
14 |
15 | /* This is deprecated and undocumented. It does not change. */
16 | #define GREENLET_VERSION "1.0.0"
17 |
18 | #ifndef GREENLET_MODULE
19 | #define implementation_ptr_t void*
20 | #endif
21 |
22 | typedef struct _greenlet {
23 | PyObject_HEAD
24 | PyObject* weakreflist;
25 | PyObject* dict;
26 | implementation_ptr_t pimpl;
27 | } PyGreenlet;
28 |
29 | #define PyGreenlet_Check(op) (op && PyObject_TypeCheck(op, &PyGreenlet_Type))
30 |
31 |
32 | /* C API functions */
33 |
34 | /* Total number of symbols that are exported */
35 | #define PyGreenlet_API_pointers 12
36 |
37 | #define PyGreenlet_Type_NUM 0
38 | #define PyExc_GreenletError_NUM 1
39 | #define PyExc_GreenletExit_NUM 2
40 |
41 | #define PyGreenlet_New_NUM 3
42 | #define PyGreenlet_GetCurrent_NUM 4
43 | #define PyGreenlet_Throw_NUM 5
44 | #define PyGreenlet_Switch_NUM 6
45 | #define PyGreenlet_SetParent_NUM 7
46 |
47 | #define PyGreenlet_MAIN_NUM 8
48 | #define PyGreenlet_STARTED_NUM 9
49 | #define PyGreenlet_ACTIVE_NUM 10
50 | #define PyGreenlet_GET_PARENT_NUM 11
51 |
52 | #ifndef GREENLET_MODULE
53 | /* This section is used by modules that uses the greenlet C API */
54 | static void** _PyGreenlet_API = NULL;
55 |
56 | # define PyGreenlet_Type \
57 | (*(PyTypeObject*)_PyGreenlet_API[PyGreenlet_Type_NUM])
58 |
59 | # define PyExc_GreenletError \
60 | ((PyObject*)_PyGreenlet_API[PyExc_GreenletError_NUM])
61 |
62 | # define PyExc_GreenletExit \
63 | ((PyObject*)_PyGreenlet_API[PyExc_GreenletExit_NUM])
64 |
65 | /*
66 | * PyGreenlet_New(PyObject *args)
67 | *
68 | * greenlet.greenlet(run, parent=None)
69 | */
70 | # define PyGreenlet_New \
71 | (*(PyGreenlet * (*)(PyObject * run, PyGreenlet * parent)) \
72 | _PyGreenlet_API[PyGreenlet_New_NUM])
73 |
74 | /*
75 | * PyGreenlet_GetCurrent(void)
76 | *
77 | * greenlet.getcurrent()
78 | */
79 | # define PyGreenlet_GetCurrent \
80 | (*(PyGreenlet * (*)(void)) _PyGreenlet_API[PyGreenlet_GetCurrent_NUM])
81 |
82 | /*
83 | * PyGreenlet_Throw(
84 | * PyGreenlet *greenlet,
85 | * PyObject *typ,
86 | * PyObject *val,
87 | * PyObject *tb)
88 | *
89 | * g.throw(...)
90 | */
91 | # define PyGreenlet_Throw \
92 | (*(PyObject * (*)(PyGreenlet * self, \
93 | PyObject * typ, \
94 | PyObject * val, \
95 | PyObject * tb)) \
96 | _PyGreenlet_API[PyGreenlet_Throw_NUM])
97 |
98 | /*
99 | * PyGreenlet_Switch(PyGreenlet *greenlet, PyObject *args)
100 | *
101 | * g.switch(*args, **kwargs)
102 | */
103 | # define PyGreenlet_Switch \
104 | (*(PyObject * \
105 | (*)(PyGreenlet * greenlet, PyObject * args, PyObject * kwargs)) \
106 | _PyGreenlet_API[PyGreenlet_Switch_NUM])
107 |
108 | /*
109 | * PyGreenlet_SetParent(PyObject *greenlet, PyObject *new_parent)
110 | *
111 | * g.parent = new_parent
112 | */
113 | # define PyGreenlet_SetParent \
114 | (*(int (*)(PyGreenlet * greenlet, PyGreenlet * nparent)) \
115 | _PyGreenlet_API[PyGreenlet_SetParent_NUM])
116 |
117 | /*
118 | * PyGreenlet_GetParent(PyObject* greenlet)
119 | *
120 | * return greenlet.parent;
121 | *
122 | * This could return NULL even if there is no exception active.
123 | * If it does not return NULL, you are responsible for decrementing the
124 | * reference count.
125 | */
126 | # define PyGreenlet_GetParent \
127 | (*(PyGreenlet* (*)(PyGreenlet*)) \
128 | _PyGreenlet_API[PyGreenlet_GET_PARENT_NUM])
129 |
130 | /*
131 | * deprecated, undocumented alias.
132 | */
133 | # define PyGreenlet_GET_PARENT PyGreenlet_GetParent
134 |
135 | # define PyGreenlet_MAIN \
136 | (*(int (*)(PyGreenlet*)) \
137 | _PyGreenlet_API[PyGreenlet_MAIN_NUM])
138 |
139 | # define PyGreenlet_STARTED \
140 | (*(int (*)(PyGreenlet*)) \
141 | _PyGreenlet_API[PyGreenlet_STARTED_NUM])
142 |
143 | # define PyGreenlet_ACTIVE \
144 | (*(int (*)(PyGreenlet*)) \
145 | _PyGreenlet_API[PyGreenlet_ACTIVE_NUM])
146 |
147 |
148 |
149 |
150 | /* Macro that imports greenlet and initializes C API */
151 | /* NOTE: This has actually moved to ``greenlet._greenlet._C_API``, but we
152 | keep the older definition to be sure older code that might have a copy of
153 | the header still works. */
154 | # define PyGreenlet_Import() \
155 | { \
156 | _PyGreenlet_API = (void**)PyCapsule_Import("greenlet._C_API", 0); \
157 | }
158 |
159 | #endif /* GREENLET_MODULE */
160 |
161 | #ifdef __cplusplus
162 | }
163 | #endif
164 | #endif /* !Py_GREENLETOBJECT_H */
165 |
```
--------------------------------------------------------------------------------
/src/docs_scraper/utils/html_parser.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | HTML parser module for extracting content and links from HTML documents.
3 | """
4 | from typing import List, Dict, Any, Optional
5 | from bs4 import BeautifulSoup
6 | from urllib.parse import urljoin, urlparse
7 |
8 | class HTMLParser:
9 | def __init__(self, base_url: str):
10 | """
11 | Initialize the HTML parser.
12 |
13 | Args:
14 | base_url: Base URL for resolving relative links
15 | """
16 | self.base_url = base_url
17 |
18 | def parse_content(self, html: str) -> Dict[str, Any]:
19 | """
20 | Parse HTML content and extract useful information.
21 |
22 | Args:
23 | html: Raw HTML content
24 |
25 | Returns:
26 | Dict containing:
27 | - title: Page title
28 | - description: Meta description
29 | - text_content: Main text content
30 | - links: List of links found
31 | - headers: List of headers found
32 | """
33 | soup = BeautifulSoup(html, 'lxml')
34 |
35 | # Extract title
36 | title = soup.title.string if soup.title else None
37 |
38 | # Extract meta description
39 | meta_desc = None
40 | meta_tag = soup.find('meta', attrs={'name': 'description'})
41 | if meta_tag:
42 | meta_desc = meta_tag.get('content')
43 |
44 | # Extract main content (remove script, style, etc.)
45 | for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
46 | tag.decompose()
47 |
48 | # Get text content
49 | text_content = ' '.join(soup.stripped_strings)
50 |
51 | # Extract headers
52 | headers = []
53 | for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
54 | headers.append({
55 | 'level': int(tag.name[1]),
56 | 'text': tag.get_text(strip=True)
57 | })
58 |
59 | # Extract links
60 | links = self._extract_links(soup)
61 |
62 | return {
63 | 'title': title,
64 | 'description': meta_desc,
65 | 'text_content': text_content,
66 | 'links': links,
67 | 'headers': headers
68 | }
69 |
70 | def parse_menu(self, html: str, menu_selector: str) -> List[Dict[str, Any]]:
71 | """
72 | Parse navigation menu from HTML using a CSS selector.
73 |
74 | Args:
75 | html: Raw HTML content
76 | menu_selector: CSS selector for the menu element
77 |
78 | Returns:
79 | List of menu items with their structure
80 | """
81 | soup = BeautifulSoup(html, 'lxml')
82 | menu = soup.select_one(menu_selector)
83 |
84 | if not menu:
85 | return []
86 |
87 | return self._extract_menu_items(menu)
88 |
89 | def _extract_links(self, soup: BeautifulSoup) -> List[Dict[str, str]]:
90 | """Extract and normalize all links from the document."""
91 | links = []
92 | for a in soup.find_all('a', href=True):
93 | href = a['href']
94 | text = a.get_text(strip=True)
95 |
96 | # Skip empty or javascript links
97 | if not href or href.startswith(('javascript:', '#')):
98 | continue
99 |
100 | # Resolve relative URLs
101 | absolute_url = urljoin(self.base_url, href)
102 |
103 | # Only include links to the same domain
104 | if urlparse(absolute_url).netloc == urlparse(self.base_url).netloc:
105 | links.append({
106 | 'url': absolute_url,
107 | 'text': text
108 | })
109 |
110 | return links
111 |
112 | def _extract_menu_items(self, element: BeautifulSoup) -> List[Dict[str, Any]]:
113 | """Recursively extract menu structure."""
114 | items = []
115 |
116 | for item in element.find_all(['li', 'a'], recursive=False):
117 | if item.name == 'a':
118 | # Single link item
119 | href = item.get('href')
120 | if href and not href.startswith(('javascript:', '#')):
121 | items.append({
122 | 'type': 'link',
123 | 'url': urljoin(self.base_url, href),
124 | 'text': item.get_text(strip=True)
125 | })
126 | else:
127 | # Potentially nested menu item
128 | link = item.find('a')
129 | if link and link.get('href'):
130 | menu_item = {
131 | 'type': 'menu',
132 | 'text': link.get_text(strip=True),
133 | 'url': urljoin(self.base_url, link['href']),
134 | 'children': []
135 | }
136 |
137 | # Look for nested lists
138 | nested = item.find(['ul', 'ol'])
139 | if nested:
140 | menu_item['children'] = self._extract_menu_items(nested)
141 |
142 | items.append(menu_item)
143 |
144 | return items
```
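
`parse_content()` strips `script`, `style`, `nav`, `footer`, and `header` tags before collecting text, headers, and same-domain links, so navigation chrome never reaches the output. A small illustration with an inline HTML string (the markup is invented for the example; everything else follows directly from the class above):

```python
# Illustrative use of HTMLParser.parse_content; the HTML below is made up.
from docs_scraper.utils import HTMLParser

html = """
<html>
  <head><title>Demo</title><meta name="description" content="A demo page"></head>
  <body>
    <nav><a href="/ignored">Nav link</a></nav>
    <main><h1>Demo</h1><p>Body text</p><a href="/guide">Guide</a></main>
  </body>
</html>
"""

parser = HTMLParser(base_url="https://docs.example.com")
result = parser.parse_content(html)
print(result["title"])        # "Demo"
print(result["description"])  # "A demo page"
print(result["headers"])      # [{'level': 1, 'text': 'Demo'}]
print(result["links"])        # the <nav> link is dropped; /guide resolves to the base domain
```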
--------------------------------------------------------------------------------
/tests/test_utils/test_request_handler.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Tests for the RequestHandler class.
3 | """
4 | import asyncio
5 | import pytest
6 | import aiohttp
7 | import time
8 | from docs_scraper.utils import RequestHandler
9 |
10 | @pytest.mark.asyncio
11 | async def test_request_handler_successful_get(mock_website, test_urls, aiohttp_session):
12 | """Test successful GET request."""
13 | url = test_urls["valid_urls"][0]
14 | handler = RequestHandler(session=aiohttp_session)
15 |
16 | response = await handler.get(url)
17 |
18 | assert response.status == 200
19 | assert "<!DOCTYPE html>" in await response.text()
20 |
21 | @pytest.mark.asyncio
22 | async def test_request_handler_invalid_url(mock_website, test_urls, aiohttp_session):
23 | """Test handling of invalid URL."""
24 | url = test_urls["invalid_urls"][0]
25 | handler = RequestHandler(session=aiohttp_session)
26 |
27 | with pytest.raises(aiohttp.ClientError):
28 | await handler.get(url)
29 |
30 | @pytest.mark.asyncio
31 | async def test_request_handler_nonexistent_url(mock_website, test_urls, aiohttp_session):
32 | """Test handling of nonexistent URL."""
33 | url = test_urls["invalid_urls"][2]
34 | handler = RequestHandler(session=aiohttp_session)
35 |
36 | with pytest.raises(aiohttp.ClientError):
37 | await handler.get(url)
38 |
39 | @pytest.mark.asyncio
40 | async def test_request_handler_rate_limiting(mock_website, test_urls, aiohttp_session):
41 | """Test rate limiting functionality."""
42 | url = test_urls["valid_urls"][0]
43 | rate_limit = 2 # 2 requests per second
44 | handler = RequestHandler(session=aiohttp_session, rate_limit=rate_limit)
45 |
46 | start_time = time.time()
47 |
48 | # Make multiple requests
49 | for _ in range(3):
50 | response = await handler.get(url)
51 | assert response.status == 200
52 |
53 | end_time = time.time()
54 | elapsed_time = end_time - start_time
55 |
56 | # Should take at least 1 second due to rate limiting
57 | assert elapsed_time >= 1.0
58 |
59 | @pytest.mark.asyncio
60 | async def test_request_handler_custom_headers(mock_website, test_urls, aiohttp_session):
61 | """Test custom headers in requests."""
62 | url = test_urls["valid_urls"][0]
63 | custom_headers = {
64 | "User-Agent": "Custom Bot 1.0",
65 | "Accept-Language": "en-US,en;q=0.9"
66 | }
67 | handler = RequestHandler(session=aiohttp_session, headers=custom_headers)
68 |
69 | response = await handler.get(url)
70 |
71 | assert response.status == 200
72 | # Headers should be merged with default headers
73 | assert handler.headers["User-Agent"] == "Custom Bot 1.0"
74 | assert handler.headers["Accept-Language"] == "en-US,en;q=0.9"
75 |
76 | @pytest.mark.asyncio
77 | async def test_request_handler_timeout(mock_website, test_urls, aiohttp_session):
78 | """Test request timeout handling."""
79 | url = test_urls["valid_urls"][0]
80 | handler = RequestHandler(session=aiohttp_session, timeout=0.001) # Very short timeout
81 |
82 | # Mock a delayed response
83 | mock_website.get(url, status=200, body="Delayed response", delay=0.1)
84 |
85 |     with pytest.raises(asyncio.TimeoutError):
86 | await handler.get(url)
87 |
88 | @pytest.mark.asyncio
89 | async def test_request_handler_retry(mock_website, test_urls, aiohttp_session):
90 | """Test request retry functionality."""
91 | url = test_urls["valid_urls"][0]
92 | handler = RequestHandler(session=aiohttp_session, max_retries=3)
93 |
94 | # Mock temporary failures followed by success
95 | mock_website.get(url, status=500) # First attempt fails
96 | mock_website.get(url, status=500) # Second attempt fails
97 | mock_website.get(url, status=200, body="Success") # Third attempt succeeds
98 |
99 | response = await handler.get(url)
100 |
101 | assert response.status == 200
102 | assert await response.text() == "Success"
103 |
104 | @pytest.mark.asyncio
105 | async def test_request_handler_max_retries_exceeded(mock_website, test_urls, aiohttp_session):
106 | """Test behavior when max retries are exceeded."""
107 | url = test_urls["valid_urls"][0]
108 | handler = RequestHandler(session=aiohttp_session, max_retries=2)
109 |
110 | # Mock consistent failures
111 | mock_website.get(url, status=500)
112 | mock_website.get(url, status=500)
113 | mock_website.get(url, status=500)
114 |
115 | with pytest.raises(aiohttp.ClientError):
116 | await handler.get(url)
117 |
118 | @pytest.mark.asyncio
119 | async def test_request_handler_session_management(mock_website, test_urls):
120 | """Test session management."""
121 | url = test_urls["valid_urls"][0]
122 |
123 | # Test with context manager
124 | async with aiohttp.ClientSession() as session:
125 | handler = RequestHandler(session=session)
126 | response = await handler.get(url)
127 | assert response.status == 200
128 |
129 | # Test with closed session
130 | with pytest.raises(aiohttp.ClientError):
131 | await handler.get(url)
132 |
133 | @pytest.mark.asyncio
134 | async def test_request_handler_concurrent_requests(mock_website, test_urls, aiohttp_session):
135 | """Test handling of concurrent requests."""
136 | urls = test_urls["valid_urls"]
137 | handler = RequestHandler(session=aiohttp_session)
138 |
139 | # Make concurrent requests
140 | tasks = [handler.get(url) for url in urls]
141 | responses = await asyncio.gather(*tasks)
142 |
143 | assert all(response.status == 200 for response in responses)
```
--------------------------------------------------------------------------------
/tests/test_crawlers/test_menu_crawler.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Tests for the MenuCrawler class.
3 | """
4 | import pytest
5 | from docs_scraper.crawlers import MenuCrawler
6 | from docs_scraper.utils import RequestHandler, HTMLParser
7 |
8 | @pytest.mark.asyncio
9 | async def test_menu_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
10 | """Test successful crawling of menu links."""
11 | url = test_urls["valid_urls"][0]
12 | menu_selector = test_urls["menu_selector"]
13 | request_handler = RequestHandler(session=aiohttp_session)
14 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
15 | crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
16 |
17 | results = await crawler.crawl(url, menu_selector)
18 |
19 | assert len(results) >= 4 # Number of menu links in sample HTML
20 | for result in results:
21 | assert result["success"] is True
22 | assert result["url"].startswith("https://example.com")
23 | assert "content" in result
24 | assert "title" in result["metadata"]
25 | assert "description" in result["metadata"]
26 | assert len(result["links"]) > 0
27 | assert result["status_code"] == 200
28 | assert result["error"] is None
29 |
30 | @pytest.mark.asyncio
31 | async def test_menu_crawler_invalid_url(mock_website, test_urls, aiohttp_session):
32 | """Test crawling with an invalid URL."""
33 | url = test_urls["invalid_urls"][0]
34 | menu_selector = test_urls["menu_selector"]
35 | request_handler = RequestHandler(session=aiohttp_session)
36 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
37 | crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
38 |
39 | results = await crawler.crawl(url, menu_selector)
40 |
41 | assert len(results) == 1
42 | assert results[0]["success"] is False
43 | assert results[0]["url"] == url
44 | assert results[0]["error"] is not None
45 |
46 | @pytest.mark.asyncio
47 | async def test_menu_crawler_invalid_selector(mock_website, test_urls, aiohttp_session):
48 | """Test crawling with an invalid CSS selector."""
49 | url = test_urls["valid_urls"][0]
50 | invalid_selector = "#nonexistent-menu"
51 | request_handler = RequestHandler(session=aiohttp_session)
52 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
53 | crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
54 |
55 | results = await crawler.crawl(url, invalid_selector)
56 |
57 | assert len(results) == 1
58 | assert results[0]["success"] is False
59 | assert results[0]["url"] == url
60 | assert "No menu links found" in results[0]["error"]
61 |
62 | @pytest.mark.asyncio
63 | async def test_menu_crawler_nested_menu(mock_website, test_urls, aiohttp_session):
64 | """Test crawling nested menu structure."""
65 | url = test_urls["valid_urls"][0]
66 | menu_selector = test_urls["menu_selector"]
67 | request_handler = RequestHandler(session=aiohttp_session)
68 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
69 | crawler = MenuCrawler(
70 | request_handler=request_handler,
71 | html_parser=html_parser,
72 | max_depth=2 # Crawl up to 2 levels deep
73 | )
74 |
75 | results = await crawler.crawl(url, menu_selector)
76 |
77 | # Check if nested menu items were crawled
78 | urls = {result["url"] for result in results}
79 | assert "https://example.com/section1" in urls
80 | assert "https://example.com/section1/page1" in urls
81 | assert "https://example.com/section1/page2" in urls
82 |
83 | @pytest.mark.asyncio
84 | async def test_menu_crawler_concurrent_limit(mock_website, test_urls, aiohttp_session):
85 | """Test concurrent request limiting for menu crawling."""
86 | url = test_urls["valid_urls"][0]
87 | menu_selector = test_urls["menu_selector"]
88 | request_handler = RequestHandler(session=aiohttp_session)
89 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
90 | crawler = MenuCrawler(
91 | request_handler=request_handler,
92 | html_parser=html_parser,
93 | concurrent_limit=1 # Process one URL at a time
94 | )
95 |
96 | import time
97 | start_time = time.time()
98 |
99 | results = await crawler.crawl(url, menu_selector)
100 |
101 | end_time = time.time()
102 | elapsed_time = end_time - start_time
103 |
104 | assert len(results) >= 4
105 | # With concurrent_limit=1, processing should take at least 0.4 seconds
106 | assert elapsed_time >= 0.4
107 |
108 | @pytest.mark.asyncio
109 | async def test_menu_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
110 | """Test rate limiting for menu crawling."""
111 | url = test_urls["valid_urls"][0]
112 | menu_selector = test_urls["menu_selector"]
113 | request_handler = RequestHandler(session=aiohttp_session, rate_limit=1) # 1 request per second
114 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
115 | crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
116 |
117 | import time
118 | start_time = time.time()
119 |
120 | results = await crawler.crawl(url, menu_selector)
121 |
122 | end_time = time.time()
123 | elapsed_time = end_time - start_time
124 |
125 | assert len(results) >= 4
126 | # Should take at least 3 seconds due to rate limiting
127 | assert elapsed_time >= 3.0
128 |
129 | @pytest.mark.asyncio
130 | async def test_menu_crawler_max_depth(mock_website, test_urls, aiohttp_session):
131 | """Test max depth limitation for menu crawling."""
132 | url = test_urls["valid_urls"][0]
133 | menu_selector = test_urls["menu_selector"]
134 | request_handler = RequestHandler(session=aiohttp_session)
135 |     html_parser = HTMLParser(base_url=test_urls["base_url"])
136 | crawler = MenuCrawler(
137 | request_handler=request_handler,
138 | html_parser=html_parser,
139 | max_depth=1 # Only crawl top-level menu items
140 | )
141 |
142 | results = await crawler.crawl(url, menu_selector)
143 |
144 | # Should only include top-level menu items
145 | urls = {result["url"] for result in results}
146 | assert "https://example.com/section1" in urls
147 | assert "https://example.com/page1" in urls
148 | assert "https://example.com/section1/page1" not in urls # Nested item should not be included
```
--------------------------------------------------------------------------------
/src/docs_scraper/utils/request_handler.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Request handler module for managing HTTP requests with rate limiting and error handling.
3 | """
4 | import asyncio
5 | import logging
6 | from typing import Optional, Dict, Any
7 | import aiohttp
8 | from urllib.robotparser import RobotFileParser
9 | from urllib.parse import urljoin
10 |
11 | logger = logging.getLogger(__name__)
12 |
13 | class RequestHandler:
14 | def __init__(
15 | self,
16 | rate_limit: float = 1.0,
17 | concurrent_limit: int = 5,
18 | user_agent: str = "DocsScraperBot/1.0",
19 | timeout: int = 30,
20 | session: Optional[aiohttp.ClientSession] = None
21 | ):
22 | """
23 | Initialize the request handler.
24 |
25 | Args:
26 | rate_limit: Minimum time between requests to the same domain (in seconds)
27 | concurrent_limit: Maximum number of concurrent requests
28 | user_agent: User agent string to use for requests
29 | timeout: Request timeout in seconds
30 | session: Optional aiohttp.ClientSession to use. If not provided, one will be created.
31 | """
32 | self.rate_limit = rate_limit
33 | self.concurrent_limit = concurrent_limit
34 | self.user_agent = user_agent
35 | self.timeout = timeout
36 | self._provided_session = session
37 |
38 | self._domain_locks: Dict[str, asyncio.Lock] = {}
39 | self._domain_last_request: Dict[str, float] = {}
40 | self._semaphore = asyncio.Semaphore(concurrent_limit)
41 |         self._session: Optional[aiohttp.ClientSession] = session
42 | self._robot_parsers: Dict[str, RobotFileParser] = {}
43 |
44 | async def __aenter__(self):
45 | """Set up the aiohttp session."""
46 | if self._provided_session:
47 | self._session = self._provided_session
48 | else:
49 | self._session = aiohttp.ClientSession(
50 | headers={"User-Agent": self.user_agent},
51 | timeout=aiohttp.ClientTimeout(total=self.timeout)
52 | )
53 | return self
54 |
55 | async def __aexit__(self, exc_type, exc_val, exc_tb):
56 | """Clean up the aiohttp session."""
57 | if self._session and not self._provided_session:
58 | await self._session.close()
59 |
60 | async def _check_robots_txt(self, url: str) -> bool:
61 | """
62 | Check if the URL is allowed by robots.txt.
63 |
64 | Args:
65 | url: URL to check
66 |
67 | Returns:
68 | bool: True if allowed, False if disallowed
69 | """
70 | from urllib.parse import urlparse
71 | parsed = urlparse(url)
72 | domain = f"{parsed.scheme}://{parsed.netloc}"
73 |
74 | if domain not in self._robot_parsers:
75 | parser = RobotFileParser()
76 | parser.set_url(urljoin(domain, "/robots.txt"))
77 | try:
78 | async with self._session.get(parser.url) as response:
79 | content = await response.text()
80 | parser.parse(content.splitlines())
81 | except Exception as e:
82 | logger.warning(f"Failed to fetch robots.txt for {domain}: {e}")
83 | return True
84 | self._robot_parsers[domain] = parser
85 |
86 | return self._robot_parsers[domain].can_fetch(self.user_agent, url)
87 |
88 | async def get(self, url: str, **kwargs) -> Dict[str, Any]:
89 | """
90 | Make a GET request with rate limiting and error handling.
91 |
92 | Args:
93 | url: URL to request
94 | **kwargs: Additional arguments to pass to aiohttp.ClientSession.get()
95 |
96 | Returns:
97 | Dict containing:
98 | - success: bool indicating if request was successful
99 | - status: HTTP status code if available
100 | - content: Response content if successful
101 | - error: Error message if unsuccessful
102 | """
103 | from urllib.parse import urlparse
104 | parsed = urlparse(url)
105 | domain = parsed.netloc
106 |
107 | # Get or create domain lock
108 | if domain not in self._domain_locks:
109 | self._domain_locks[domain] = asyncio.Lock()
110 |
111 | # Check robots.txt
112 | if not await self._check_robots_txt(url):
113 | return {
114 | "success": False,
115 | "status": None,
116 | "error": "URL disallowed by robots.txt",
117 | "content": None
118 | }
119 |
120 | try:
121 | async with self._semaphore: # Limit concurrent requests
122 | async with self._domain_locks[domain]: # Lock per domain
123 | # Rate limiting
124 | if domain in self._domain_last_request:
125 | elapsed = asyncio.get_event_loop().time() - self._domain_last_request[domain]
126 | if elapsed < self.rate_limit:
127 | await asyncio.sleep(self.rate_limit - elapsed)
128 |
129 | self._domain_last_request[domain] = asyncio.get_event_loop().time()
130 |
131 | # Make request
132 | async with self._session.get(url, **kwargs) as response:
133 | content = await response.text()
134 | return {
135 | "success": response.status < 400,
136 | "status": response.status,
137 | "content": content,
138 | "error": None if response.status < 400 else f"HTTP {response.status}"
139 | }
140 |
141 | except asyncio.TimeoutError:
142 | return {
143 | "success": False,
144 | "status": None,
145 | "error": "Request timed out",
146 | "content": None
147 | }
148 | except Exception as e:
149 | return {
150 | "success": False,
151 | "status": None,
152 | "error": str(e),
153 | "content": None
154 | }
```
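
`RequestHandler.get()` reports robots.txt denials, HTTP errors, and timeouts through the returned dict (`success`, `status`, `content`, `error`) instead of raising, while per-domain locks enforce the rate limit and a semaphore caps concurrency. A minimal sketch of the context-manager usage implied by `__aenter__`/`__aexit__` (the URL is a placeholder):

```python
# Minimal sketch: let RequestHandler create and close its own session via the
# async context manager, then inspect the result dict returned by get().
import asyncio

from docs_scraper.utils import RequestHandler


async def fetch(url: str) -> None:
    async with RequestHandler(rate_limit=1.0, concurrent_limit=2) as handler:
        result = await handler.get(url)  # returns a dict; HTTP errors do not raise
        if result["success"]:
            print(result["status"], len(result["content"]), "characters of HTML")
        else:
            print("failed:", result["error"])


asyncio.run(fetch("https://docs.example.com/"))
```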
--------------------------------------------------------------------------------
/tests/test_crawlers/test_sitemap_crawler.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Tests for the SitemapCrawler class.
3 | """
4 | import pytest
5 | from docs_scraper.crawlers import SitemapCrawler
6 | from docs_scraper.utils import RequestHandler, HTMLParser
7 |
8 | @pytest.mark.asyncio
9 | async def test_sitemap_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
10 | """Test successful crawling of a sitemap."""
11 | sitemap_url = test_urls["sitemap_url"]
12 | request_handler = RequestHandler(session=aiohttp_session)
13 | html_parser = HTMLParser(base_url=test_urls["base_url"])
14 | crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
15 |
16 | results = await crawler.crawl(sitemap_url)
17 |
18 | assert len(results) == 3 # Number of URLs in sample sitemap
19 | for result in results:
20 | assert result["success"] is True
21 | assert result["url"].startswith("https://example.com")
22 | assert "content" in result
23 | assert "title" in result["metadata"]
24 | assert "description" in result["metadata"]
25 | assert len(result["links"]) > 0
26 | assert result["status_code"] == 200
27 | assert result["error"] is None
28 |
29 | @pytest.mark.asyncio
30 | async def test_sitemap_crawler_invalid_sitemap_url(mock_website, aiohttp_session):
31 | """Test crawling with an invalid sitemap URL."""
32 | sitemap_url = "https://nonexistent.example.com/sitemap.xml"
33 | request_handler = RequestHandler(session=aiohttp_session)
34 | html_parser = HTMLParser(base_url="https://nonexistent.example.com")
35 | crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
36 |
37 | results = await crawler.crawl(sitemap_url)
38 |
39 | assert len(results) == 1
40 | assert results[0]["success"] is False
41 | assert results[0]["url"] == sitemap_url
42 | assert results[0]["error"] is not None
43 |
44 | @pytest.mark.asyncio
45 | async def test_sitemap_crawler_invalid_xml(mock_website, aiohttp_session):
46 | """Test crawling with invalid XML content."""
47 | sitemap_url = "https://example.com/invalid-sitemap.xml"
48 | mock_website.get(sitemap_url, status=200, body="<invalid>xml</invalid>")
49 |
50 | request_handler = RequestHandler(session=aiohttp_session)
51 | html_parser = HTMLParser(base_url="https://example.com")
52 | crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
53 |
54 | results = await crawler.crawl(sitemap_url)
55 |
56 | assert len(results) == 1
57 | assert results[0]["success"] is False
58 | assert results[0]["url"] == sitemap_url
59 | assert "Invalid sitemap format" in results[0]["error"]
60 |
61 | @pytest.mark.asyncio
62 | async def test_sitemap_crawler_concurrent_limit(mock_website, test_urls, aiohttp_session):
63 | """Test concurrent request limiting for sitemap crawling."""
64 | sitemap_url = test_urls["sitemap_url"]
65 |     request_handler = RequestHandler(session=aiohttp_session, concurrent_limit=1)
66 | html_parser = HTMLParser(base_url=test_urls["base_url"])
67 | crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
68 |
69 | import time
70 | start_time = time.time()
71 |
72 | results = await crawler.crawl(sitemap_url)
73 |
74 | end_time = time.time()
75 | elapsed_time = end_time - start_time
76 |
77 | assert len(results) == 3
78 | # With concurrent_limit=1, processing should take at least 0.3 seconds
79 | assert elapsed_time >= 0.3
80 |
81 | @pytest.mark.asyncio
82 | async def test_sitemap_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
83 | """Test rate limiting for sitemap crawling."""
84 | sitemap_url = test_urls["sitemap_url"]
85 | request_handler = RequestHandler(session=aiohttp_session, rate_limit=1) # 1 request per second
86 | html_parser = HTMLParser(base_url=test_urls["base_url"])
87 | crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
88 |
89 | import time
90 | start_time = time.time()
91 |
92 | results = await crawler.crawl(sitemap_url)
93 |
94 | end_time = time.time()
95 | elapsed_time = end_time - start_time
96 |
97 | assert len(results) == 3
98 |     # With a 1 request/second rate limit, the sitemap request plus the page requests should take at least 2 seconds
99 | assert elapsed_time >= 2.0
100 |
101 | @pytest.mark.asyncio
102 | async def test_sitemap_crawler_nested_sitemaps(mock_website, test_urls, aiohttp_session):
103 | """Test crawling nested sitemaps."""
104 | # Create a sitemap index
105 | sitemap_index = """<?xml version="1.0" encoding="UTF-8"?>
106 | <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
107 | <sitemap>
108 | <loc>https://example.com/sitemap1.xml</loc>
109 | </sitemap>
110 | <sitemap>
111 | <loc>https://example.com/sitemap2.xml</loc>
112 | </sitemap>
113 | </sitemapindex>
114 | """
115 |
116 | # Create sub-sitemaps
117 | sitemap1 = """<?xml version="1.0" encoding="UTF-8"?>
118 | <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
119 | <url>
120 | <loc>https://example.com/page1</loc>
121 | </url>
122 | </urlset>
123 | """
124 |
125 | sitemap2 = """<?xml version="1.0" encoding="UTF-8"?>
126 | <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
127 | <url>
128 | <loc>https://example.com/page2</loc>
129 | </url>
130 | </urlset>
131 | """
132 |
133 | mock_website.get("https://example.com/sitemap-index.xml", status=200, body=sitemap_index)
134 | mock_website.get("https://example.com/sitemap1.xml", status=200, body=sitemap1)
135 | mock_website.get("https://example.com/sitemap2.xml", status=200, body=sitemap2)
136 |
137 | request_handler = RequestHandler(session=aiohttp_session)
138 | html_parser = HTMLParser(base_url="https://example.com")
139 | crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
140 |
141 | results = await crawler.crawl("https://example.com/sitemap-index.xml")
142 |
143 | assert len(results) == 2 # Two pages from two sub-sitemaps
144 | urls = {result["url"] for result in results}
145 | assert "https://example.com/page1" in urls
146 | assert "https://example.com/page2" in urls
```
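The fixtures referenced above (`mock_website`, `test_urls`, `aiohttp_session`) are defined in `tests/conftest.py`, which is not reproduced here. The sketch below shows one plausible way to provide them with `aioresponses` and `aiohttp`; the fixture names match the tests, but the bodies and the choice of mocking library are assumptions, not the repository's actual conftest.

```python
import aiohttp
import pytest
import pytest_asyncio
from aioresponses import aioresponses


@pytest.fixture
def test_urls():
    # URLs referenced by the tests; the exact values are illustrative.
    return {
        "base_url": "https://example.com",
        "sitemap_url": "https://example.com/sitemap.xml",
    }


@pytest.fixture
def mock_website():
    # aioresponses intercepts aiohttp requests, and tests register
    # canned responses via mock_website.get(url, status=..., body=...).
    with aioresponses() as mocked:
        yield mocked


@pytest_asyncio.fixture
async def aiohttp_session():
    async with aiohttp.ClientSession() as session:
        yield session
```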
--------------------------------------------------------------------------------
/tests/test_utils/test_html_parser.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Tests for the HTMLParser class.
3 | """
4 | import pytest
5 | from bs4 import BeautifulSoup
6 | from docs_scraper.utils import HTMLParser
7 |
8 | @pytest.fixture
9 | def html_parser():
10 | """Fixture for HTMLParser instance."""
11 | return HTMLParser()
12 |
13 | @pytest.fixture
14 | def sample_html():
15 | """Sample HTML content for testing."""
16 | return """
17 | <!DOCTYPE html>
18 | <html>
19 | <head>
20 | <title>Test Page</title>
21 | <meta name="description" content="Test description">
22 | <meta name="keywords" content="test, keywords">
23 | <meta property="og:title" content="OG Title">
24 | <meta property="og:description" content="OG Description">
25 | </head>
26 | <body>
27 | <nav class="menu">
28 | <ul>
29 | <li><a href="/page1">Page 1</a></li>
30 | <li>
31 | <a href="/section1">Section 1</a>
32 | <ul>
33 | <li><a href="/section1/page1">Section 1.1</a></li>
34 | <li><a href="/section1/page2">Section 1.2</a></li>
35 | </ul>
36 | </li>
37 | </ul>
38 | </nav>
39 | <main>
40 | <h1>Welcome</h1>
41 | <p>Test content with a <a href="/test1">link</a> and another <a href="/test2">link</a>.</p>
42 | <div class="content">
43 | <p>More content</p>
44 | <a href="mailto:[email protected]">Email</a>
45 | <a href="tel:+1234567890">Phone</a>
46 | <a href="javascript:void(0)">JavaScript</a>
47 | <a href="#section">Hash</a>
48 | <a href="ftp://example.com">FTP</a>
49 | </div>
50 | </main>
51 | </body>
52 | </html>
53 | """
54 |
55 | def test_parse_html(html_parser, sample_html):
56 | """Test HTML parsing."""
57 | soup = html_parser.parse_html(sample_html)
58 | assert isinstance(soup, BeautifulSoup)
59 | assert soup.title.string == "Test Page"
60 |
61 | def test_extract_metadata(html_parser, sample_html):
62 | """Test metadata extraction."""
63 | soup = html_parser.parse_html(sample_html)
64 | metadata = html_parser.extract_metadata(soup)
65 |
66 | assert metadata["title"] == "Test Page"
67 | assert metadata["description"] == "Test description"
68 | assert metadata["keywords"] == "test, keywords"
69 | assert metadata["og:title"] == "OG Title"
70 | assert metadata["og:description"] == "OG Description"
71 |
72 | def test_extract_links(html_parser, sample_html):
73 | """Test link extraction."""
74 | soup = html_parser.parse_html(sample_html)
75 | links = html_parser.extract_links(soup)
76 |
77 |     # Should include regular page links (relative paths and absolute HTTP(S) URLs)
78 | assert "/page1" in links
79 | assert "/section1" in links
80 | assert "/section1/page1" in links
81 | assert "/section1/page2" in links
82 | assert "/test1" in links
83 | assert "/test2" in links
84 |
85 | # Should not include invalid or special links
86 | assert "mailto:[email protected]" not in links
87 | assert "tel:+1234567890" not in links
88 | assert "javascript:void(0)" not in links
89 | assert "#section" not in links
90 | assert "ftp://example.com" not in links
91 |
92 | def test_extract_menu_links(html_parser, sample_html):
93 | """Test menu link extraction."""
94 | soup = html_parser.parse_html(sample_html)
95 | menu_links = html_parser.extract_menu_links(soup, "nav.menu")
96 |
97 | assert len(menu_links) == 4
98 | assert "/page1" in menu_links
99 | assert "/section1" in menu_links
100 | assert "/section1/page1" in menu_links
101 | assert "/section1/page2" in menu_links
102 |
103 | def test_extract_menu_links_invalid_selector(html_parser, sample_html):
104 | """Test menu link extraction with invalid selector."""
105 | soup = html_parser.parse_html(sample_html)
106 | menu_links = html_parser.extract_menu_links(soup, "#nonexistent")
107 |
108 | assert len(menu_links) == 0
109 |
110 | def test_extract_text_content(html_parser, sample_html):
111 | """Test text content extraction."""
112 | soup = html_parser.parse_html(sample_html)
113 | content = html_parser.extract_text_content(soup)
114 |
115 | assert "Welcome" in content
116 | assert "Test content" in content
117 | assert "More content" in content
118 | # Should not include navigation text
119 | assert "Section 1.1" not in content
120 |
121 | def test_clean_html(html_parser):
122 | """Test HTML cleaning."""
123 | dirty_html = """
124 | <html>
125 | <body>
126 | <script>alert('test');</script>
127 | <style>body { color: red; }</style>
128 | <p>Test content</p>
129 | <!-- Comment -->
130 | <iframe src="test.html"></iframe>
131 | </body>
132 | </html>
133 | """
134 |
135 | clean_html = html_parser.clean_html(dirty_html)
136 | soup = html_parser.parse_html(clean_html)
137 |
138 | assert len(soup.find_all("script")) == 0
139 | assert len(soup.find_all("style")) == 0
140 | assert len(soup.find_all("iframe")) == 0
141 | assert "Test content" in soup.get_text()
142 |
143 | def test_normalize_url(html_parser):
144 | """Test URL normalization."""
145 | base_url = "https://example.com/docs"
146 | test_cases = [
147 | ("/test", "https://example.com/test"),
148 | ("test", "https://example.com/docs/test"),
149 | ("../test", "https://example.com/test"),
150 | ("https://other.com/test", "https://other.com/test"),
151 | ("//other.com/test", "https://other.com/test"),
152 | ]
153 |
154 | for input_url, expected_url in test_cases:
155 | assert html_parser.normalize_url(input_url, base_url) == expected_url
156 |
157 | def test_is_valid_link(html_parser):
158 | """Test link validation."""
159 | valid_links = [
160 | "https://example.com",
161 | "http://example.com",
162 | "/absolute/path",
163 | "relative/path",
164 | "../parent/path",
165 | "./current/path"
166 | ]
167 |
168 | invalid_links = [
169 | "mailto:[email protected]",
170 | "tel:+1234567890",
171 | "javascript:void(0)",
172 | "#hash",
173 | "ftp://example.com",
174 | ""
175 | ]
176 |
177 | for link in valid_links:
178 | assert html_parser.is_valid_link(link) is True
179 |
180 | for link in invalid_links:
181 | assert html_parser.is_valid_link(link) is False
182 |
183 | def test_extract_structured_data(html_parser):
184 | """Test structured data extraction."""
185 | html = """
186 | <html>
187 | <head>
188 | <script type="application/ld+json">
189 | {
190 | "@context": "https://schema.org",
191 | "@type": "Article",
192 | "headline": "Test Article",
193 | "author": {
194 | "@type": "Person",
195 | "name": "John Doe"
196 | }
197 | }
198 | </script>
199 | </head>
200 | <body>
201 | <p>Test content</p>
202 | </body>
203 | </html>
204 | """
205 |
206 | soup = html_parser.parse_html(html)
207 | structured_data = html_parser.extract_structured_data(soup)
208 |
209 | assert len(structured_data) == 1
210 | assert structured_data[0]["@type"] == "Article"
211 | assert structured_data[0]["headline"] == "Test Article"
212 | assert structured_data[0]["author"]["name"] == "John Doe"
```
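Taken together, these tests document the `HTMLParser` API. A minimal sketch of the same calls outside pytest (the constructor's `base_url` argument follows how the crawlers instantiate the parser):

```python
from docs_scraper.utils import HTMLParser

html = """
<html>
  <head><title>Example</title><meta name="description" content="Demo page"></head>
  <body>
    <nav class="menu"><a href="/guide">Guide</a></nav>
    <main><h1>Example</h1><p>See the <a href="/guide">guide</a>.</p></main>
  </body>
</html>
"""

parser = HTMLParser(base_url="https://example.com")
soup = parser.parse_html(html)

metadata = parser.extract_metadata(soup)              # {"title": "Example", "description": "Demo page", ...}
links = parser.extract_links(soup)                    # ["/guide"]; mailto:, tel:, javascript: and #hash links are dropped
menu_links = parser.extract_menu_links(soup, "nav.menu")
text = parser.extract_text_content(soup)              # page text without the navigation

print(metadata["title"], links, menu_links, len(text))
```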
--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/single_url_crawler.py:
--------------------------------------------------------------------------------
```python
1 | import os
2 | import sys
3 | import asyncio
4 | import re
5 | import argparse
6 | from datetime import datetime
7 | from termcolor import colored
8 | from crawl4ai import *
9 | from ..utils import RequestHandler, HTMLParser
10 | from typing import Dict, Any, Optional
11 |
12 | class SingleURLCrawler:
13 | """A crawler that processes a single URL."""
14 |
15 | def __init__(self, request_handler: RequestHandler, html_parser: HTMLParser):
16 | """
17 | Initialize the crawler.
18 |
19 | Args:
20 | request_handler: Handler for making HTTP requests
21 | html_parser: Parser for processing HTML content
22 | """
23 | self.request_handler = request_handler
24 | self.html_parser = html_parser
25 |
26 | async def crawl(self, url: str) -> Dict[str, Any]:
27 | """
28 | Crawl a single URL and extract its content.
29 |
30 | Args:
31 | url: The URL to crawl
32 |
33 | Returns:
34 | Dict containing:
35 | - success: Whether the crawl was successful
36 | - url: The URL that was crawled
37 | - content: The extracted content (if successful)
38 | - metadata: Additional metadata about the page
39 | - links: Links found on the page
40 | - status_code: HTTP status code
41 | - error: Error message (if unsuccessful)
42 | """
43 | try:
44 | response = await self.request_handler.get(url)
45 | if not response["success"]:
46 | return {
47 | "success": False,
48 | "url": url,
49 | "content": None,
50 | "metadata": {},
51 | "links": [],
52 | "status_code": response.get("status"),
53 | "error": response.get("error", "Unknown error")
54 | }
55 |
56 | html_content = response["content"]
57 | parsed_content = self.html_parser.parse_content(html_content)
58 |
59 | return {
60 | "success": True,
61 | "url": url,
62 | "content": parsed_content["text_content"],
63 | "metadata": {
64 | "title": parsed_content["title"],
65 | "description": parsed_content["description"]
66 | },
67 | "links": parsed_content["links"],
68 | "status_code": response["status"],
69 | "error": None
70 | }
71 |
72 | except Exception as e:
73 | return {
74 | "success": False,
75 | "url": url,
76 | "content": None,
77 | "metadata": {},
78 | "links": [],
79 | "status_code": None,
80 | "error": str(e)
81 | }
82 |
83 | def get_filename_prefix(url: str) -> str:
84 | """
85 | Generate a filename prefix from a URL including path components.
86 | Examples:
87 | - https://docs.literalai.com/page -> literalai_docs_page
88 | - https://literalai.com/docs/page -> literalai_docs_page
89 | - https://api.example.com/path/to/page -> example_api_path_to_page
90 |
91 | Args:
92 | url (str): The URL to process
93 |
94 | Returns:
95 | str: Generated filename prefix
96 | """
97 | # Remove protocol and split URL parts
98 | clean_url = url.split('://')[1]
99 | url_parts = clean_url.split('/')
100 |
101 | # Get domain parts
102 | domain_parts = url_parts[0].split('.')
103 |
104 | # Extract main domain name (ignoring TLD)
105 | main_domain = domain_parts[-2]
106 |
107 | # Start building the prefix with domain
108 | prefix_parts = [main_domain]
109 |
110 | # Add subdomain if exists
111 | if len(domain_parts) > 2:
112 | subdomain = domain_parts[0]
113 | if subdomain != main_domain:
114 | prefix_parts.append(subdomain)
115 |
116 | # Add all path segments
117 | if len(url_parts) > 1:
118 | path_segments = [segment for segment in url_parts[1:] if segment]
119 | for segment in path_segments:
120 | # Clean up segment (remove special characters, convert to lowercase)
121 | clean_segment = re.sub(r'[^a-zA-Z0-9]', '', segment.lower())
122 | if clean_segment and clean_segment != main_domain:
123 | prefix_parts.append(clean_segment)
124 |
125 | # Join all parts with underscore
126 | return '_'.join(prefix_parts)
127 |
128 | def process_markdown_content(content: str, url: str) -> str:
129 | """Process markdown content to start from first H1 and add URL as H2"""
130 | # Find the first H1 tag
131 | h1_match = re.search(r'^# .+$', content, re.MULTILINE)
132 | if not h1_match:
133 | # If no H1 found, return original content with URL as H1
134 | return f"# No Title Found\n\n## Source\n{url}\n\n{content}"
135 |
136 | # Get the content starting from the first H1
137 | content_from_h1 = content[h1_match.start():]
138 |
139 | # Remove "Was this page helpful?" section and everything after it
140 | helpful_patterns = [
141 | r'^#+\s*Was this page helpful\?.*$', # Matches any heading level with this text
142 | r'^Was this page helpful\?.*$', # Matches the text without heading
143 | r'^#+\s*Was this helpful\?.*$', # Matches any heading level with shorter text
144 | r'^Was this helpful\?.*$' # Matches shorter text without heading
145 | ]
146 |
147 | for pattern in helpful_patterns:
148 | parts = re.split(pattern, content_from_h1, flags=re.MULTILINE | re.IGNORECASE)
149 | if len(parts) > 1:
150 | content_from_h1 = parts[0].strip()
151 | break
152 |
153 | # Insert URL as H2 after the H1
154 | lines = content_from_h1.split('\n')
155 | h1_line = lines[0]
156 | rest_of_content = '\n'.join(lines[1:]).strip()
157 |
158 | return f"{h1_line}\n\n## Source\n{url}\n\n{rest_of_content}"
159 |
160 | def save_markdown_content(content: str, url: str) -> str:
161 | """Save markdown content to a file"""
162 | try:
163 | # Generate filename prefix from URL
164 | filename_prefix = get_filename_prefix(url)
165 |
166 | timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
167 | filename = f"{filename_prefix}_{timestamp}.md"
168 | filepath = os.path.join("scraped_docs", filename)
169 |
170 | # Create scraped_docs directory if it doesn't exist
171 | os.makedirs("scraped_docs", exist_ok=True)
172 |
173 | processed_content = process_markdown_content(content, url)
174 |
175 | with open(filepath, "w", encoding="utf-8") as f:
176 | f.write(processed_content)
177 |
178 | print(colored(f"\n✓ Markdown content saved to: {filepath}", "green"))
179 | return filepath
180 | except Exception as e:
181 | print(colored(f"\n✗ Error saving markdown content: {str(e)}", "red"))
182 | return None
183 |
184 | async def main():
185 | # Set up argument parser
186 | parser = argparse.ArgumentParser(description='Crawl a single URL and generate markdown documentation')
187 | parser.add_argument('url', type=str, help='Target documentation URL to crawl')
188 | args = parser.parse_args()
189 |
190 | try:
191 | print(colored("\n=== Starting Single URL Crawl ===", "cyan"))
192 | print(colored(f"\nCrawling URL: {args.url}", "yellow"))
193 |
194 | browser_config = BrowserConfig(headless=True, verbose=True)
195 | async with AsyncWebCrawler(config=browser_config) as crawler:
196 | crawler_config = CrawlerRunConfig(
197 | cache_mode=CacheMode.BYPASS,
198 | markdown_generator=DefaultMarkdownGenerator(
199 | content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
200 | )
201 | )
202 |
203 | result = await crawler.arun(
204 | url=args.url,
205 | config=crawler_config
206 | )
207 |
208 | if result.success:
209 | print(colored("\n✓ Successfully crawled URL", "green"))
210 | print(colored(f"Content length: {len(result.markdown.raw_markdown)} characters", "cyan"))
211 | save_markdown_content(result.markdown.raw_markdown, args.url)
212 | else:
213 | print(colored(f"\n✗ Failed to crawl URL: {result.error_message}", "red"))
214 |
215 | except Exception as e:
216 | print(colored(f"\n✗ Error during crawl: {str(e)}", "red"))
217 | sys.exit(1)
218 |
219 | if __name__ == "__main__":
220 | asyncio.run(main())
```
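A minimal programmatic sketch of `SingleURLCrawler`, wired up the same way `server.py` (below) does it; the URL is a placeholder:

```python
import asyncio

from docs_scraper.crawlers.single_url_crawler import SingleURLCrawler
from docs_scraper.utils import HTMLParser, RequestHandler


async def crawl_single(url: str) -> dict:
    request_handler = RequestHandler(rate_limit=1.0)
    html_parser = HTMLParser(base_url=url)
    crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)

    # The request handler owns the HTTP session, so enter it as a
    # context manager for the duration of the crawl.
    async with request_handler:
        return await crawler.crawl(url)


if __name__ == "__main__":
    result = asyncio.run(crawl_single("https://example.com/docs"))
    print(result["success"], result["status_code"], result["metadata"].get("title"))
```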
--------------------------------------------------------------------------------
/src/docs_scraper/server.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | MCP server implementation for web crawling and documentation scraping.
3 | """
4 | import asyncio
5 | import logging
6 | from typing import List, Dict, Any, Optional
7 | from pydantic import BaseModel, Field, HttpUrl
8 | from mcp.server.fastmcp import FastMCP
9 |
10 | # Import the crawlers with relative imports
11 | # This helps prevent circular import issues
12 | from .crawlers.single_url_crawler import SingleURLCrawler
13 | from .crawlers.multi_url_crawler import MultiURLCrawler
14 | from .crawlers.sitemap_crawler import SitemapCrawler
15 | from .crawlers.menu_crawler import MenuCrawler
16 |
17 | # Import utility classes
18 | from .utils import RequestHandler, HTMLParser
19 |
20 | # Configure logging
21 | logging.basicConfig(
22 | level=logging.INFO,
23 | format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
24 | )
25 | logger = logging.getLogger(__name__)
26 |
27 | # Create MCP server
28 | mcp = FastMCP(
29 | name="DocsScraperMCP",
30 | version="0.1.0"
31 | )
32 |
33 | # Input validation models
34 | class SingleUrlInput(BaseModel):
35 | url: HttpUrl = Field(..., description="Target URL to crawl")
36 | depth: int = Field(0, ge=0, description="How many levels deep to follow links")
37 | exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
38 | rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests (seconds)")
39 |
40 | class MultiUrlInput(BaseModel):
41 | urls: List[HttpUrl] = Field(..., min_items=1, description="List of URLs to crawl")
42 | concurrent_limit: int = Field(5, gt=0, description="Maximum number of concurrent requests")
43 | exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
44 | rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests to the same domain (seconds)")
45 |
46 | class SitemapInput(BaseModel):
47 | base_url: HttpUrl = Field(..., description="Base URL of the website")
48 | sitemap_url: Optional[HttpUrl] = Field(None, description="Optional explicit sitemap URL")
49 | concurrent_limit: int = Field(5, gt=0, description="Maximum number of concurrent requests")
50 | exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
51 | rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests (seconds)")
52 |
53 | class MenuInput(BaseModel):
54 | base_url: HttpUrl = Field(..., description="Base URL of the website")
55 | menu_selector: str = Field(..., min_length=1, description="CSS selector for the navigation menu element")
56 | concurrent_limit: int = Field(5, gt=0, description="Maximum number of concurrent requests")
57 | exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
58 | rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests (seconds)")
59 |
60 | @mcp.tool()
61 | async def single_url_crawler(
62 | url: str,
63 | depth: int = 0,
64 | exclusion_patterns: Optional[List[str]] = None,
65 | rate_limit: float = 1.0
66 | ) -> Dict[str, Any]:
67 | """
68 | Crawl a single URL and optionally follow links up to a specified depth.
69 |
70 | Args:
71 | url: Target URL to crawl
72 | depth: How many levels deep to follow links (0 means only the target URL)
73 | exclusion_patterns: List of regex patterns for URLs to exclude
74 | rate_limit: Minimum time between requests (seconds)
75 |
76 | Returns:
77 | Dict containing crawled content and statistics
78 | """
79 | try:
80 | # Validate input
81 | input_data = SingleUrlInput(
82 | url=url,
83 | depth=depth,
84 | exclusion_patterns=exclusion_patterns,
85 | rate_limit=rate_limit
86 | )
87 |
88 | # Create required utility instances
89 | request_handler = RequestHandler(rate_limit=input_data.rate_limit)
90 | html_parser = HTMLParser(base_url=str(input_data.url))
91 |
92 | # Create the crawler with the proper parameters
93 | crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
94 |
95 | # Use request_handler as a context manager to ensure proper session initialization
96 | async with request_handler:
97 | # Call the crawl method with the URL
98 | return await crawler.crawl(str(input_data.url))
99 |
100 | except Exception as e:
101 | logger.error(f"Single URL crawler failed: {str(e)}")
102 | return {
103 | "success": False,
104 | "error": str(e),
105 | "content": None,
106 | "stats": {
107 | "urls_crawled": 0,
108 | "urls_failed": 1,
109 | "max_depth_reached": 0
110 | }
111 | }
112 |
113 | @mcp.tool()
114 | async def multi_url_crawler(
115 | urls: List[str],
116 | concurrent_limit: int = 5,
117 | exclusion_patterns: Optional[List[str]] = None,
118 | rate_limit: float = 1.0
119 | ) -> Dict[str, Any]:
120 | """
121 | Crawl multiple URLs in parallel with rate limiting.
122 |
123 | Args:
124 | urls: List of URLs to crawl
125 | concurrent_limit: Maximum number of concurrent requests
126 | exclusion_patterns: List of regex patterns for URLs to exclude
127 | rate_limit: Minimum time between requests to the same domain (seconds)
128 |
129 | Returns:
130 | Dict containing results for each URL and overall statistics
131 | """
132 | try:
133 | # Validate input
134 | input_data = MultiUrlInput(
135 | urls=urls,
136 | concurrent_limit=concurrent_limit,
137 | exclusion_patterns=exclusion_patterns,
138 | rate_limit=rate_limit
139 | )
140 |
141 | # Create the crawler with the proper parameters
142 | crawler = MultiURLCrawler(verbose=True)
143 |
144 | # Call the crawl method with the URLs
145 | url_list = [str(url) for url in input_data.urls]
146 | results = await crawler.crawl(url_list)
147 |
148 | # Return a standardized response format
149 | return {
150 | "success": True,
151 | "results": results,
152 | "stats": {
153 | "urls_crawled": len(results),
154 | "urls_succeeded": sum(1 for r in results if r["success"]),
155 | "urls_failed": sum(1 for r in results if not r["success"])
156 | }
157 | }
158 |
159 | except Exception as e:
160 | logger.error(f"Multi URL crawler failed: {str(e)}")
161 | return {
162 | "success": False,
163 | "error": str(e),
164 | "content": None,
165 | "stats": {
166 | "urls_crawled": 0,
167 | "urls_failed": len(urls),
168 | "concurrent_requests_max": 0
169 | }
170 | }
171 |
172 | @mcp.tool()
173 | async def sitemap_crawler(
174 | base_url: str,
175 | sitemap_url: Optional[str] = None,
176 | concurrent_limit: int = 5,
177 | exclusion_patterns: Optional[List[str]] = None,
178 | rate_limit: float = 1.0
179 | ) -> Dict[str, Any]:
180 | """
181 | Crawl a website using its sitemap.xml.
182 |
183 | Args:
184 | base_url: Base URL of the website
185 | sitemap_url: Optional explicit sitemap URL (if different from base_url/sitemap.xml)
186 | concurrent_limit: Maximum number of concurrent requests
187 | exclusion_patterns: List of regex patterns for URLs to exclude
188 | rate_limit: Minimum time between requests (seconds)
189 |
190 | Returns:
191 | Dict containing crawled pages and statistics
192 | """
193 | try:
194 | # Validate input
195 | input_data = SitemapInput(
196 | base_url=base_url,
197 | sitemap_url=sitemap_url,
198 | concurrent_limit=concurrent_limit,
199 | exclusion_patterns=exclusion_patterns,
200 | rate_limit=rate_limit
201 | )
202 |
203 | # Create required utility instances
204 | request_handler = RequestHandler(
205 | rate_limit=input_data.rate_limit,
206 | concurrent_limit=input_data.concurrent_limit
207 | )
208 | html_parser = HTMLParser(base_url=str(input_data.base_url))
209 |
210 | # Create the crawler with the proper parameters
211 | crawler = SitemapCrawler(
212 | request_handler=request_handler,
213 | html_parser=html_parser,
214 | verbose=True
215 | )
216 |
217 | # Determine the sitemap URL to use
218 | sitemap_url_to_use = str(input_data.sitemap_url) if input_data.sitemap_url else f"{str(input_data.base_url).rstrip('/')}/sitemap.xml"
219 |
220 | # Call the crawl method with the sitemap URL
221 | results = await crawler.crawl(sitemap_url_to_use)
222 |
223 | return {
224 | "success": True,
225 | "content": results,
226 | "stats": {
227 | "urls_crawled": len(results),
228 | "urls_succeeded": sum(1 for r in results if r["success"]),
229 | "urls_failed": sum(1 for r in results if not r["success"]),
230 | "sitemap_found": len(results) > 0
231 | }
232 | }
233 |
234 | except Exception as e:
235 | logger.error(f"Sitemap crawler failed: {str(e)}")
236 | return {
237 | "success": False,
238 | "error": str(e),
239 | "content": None,
240 | "stats": {
241 | "urls_crawled": 0,
242 | "urls_failed": 1,
243 | "sitemap_found": False
244 | }
245 | }
246 |
247 | @mcp.tool()
248 | async def menu_crawler(
249 | base_url: str,
250 | menu_selector: str,
251 | concurrent_limit: int = 5,
252 | exclusion_patterns: Optional[List[str]] = None,
253 | rate_limit: float = 1.0
254 | ) -> Dict[str, Any]:
255 | """
256 | Crawl a website by following its navigation menu structure.
257 |
258 | Args:
259 | base_url: Base URL of the website
260 | menu_selector: CSS selector for the navigation menu element
261 | concurrent_limit: Maximum number of concurrent requests
262 | exclusion_patterns: List of regex patterns for URLs to exclude
263 | rate_limit: Minimum time between requests (seconds)
264 |
265 | Returns:
266 | Dict containing menu structure and crawled content
267 | """
268 | try:
269 | # Validate input
270 | input_data = MenuInput(
271 | base_url=base_url,
272 | menu_selector=menu_selector,
273 | concurrent_limit=concurrent_limit,
274 | exclusion_patterns=exclusion_patterns,
275 | rate_limit=rate_limit
276 | )
277 |
278 | # Create the crawler with the proper parameters
279 | crawler = MenuCrawler(start_url=str(input_data.base_url))
280 |
281 | # Call the crawl method
282 | results = await crawler.crawl()
283 |
284 | return {
285 | "success": True,
286 | "content": results,
287 | "stats": {
288 | "urls_crawled": len(results.get("menu_links", [])),
289 | "urls_failed": 0,
290 | "menu_items_found": len(results.get("menu_structure", {}).get("items", []))
291 | }
292 | }
293 |
294 | except Exception as e:
295 | logger.error(f"Menu crawler failed: {str(e)}")
296 | return {
297 | "success": False,
298 | "error": str(e),
299 | "content": None,
300 | "stats": {
301 | "urls_crawled": 0,
302 | "urls_failed": 1,
303 | "menu_items_found": 0
304 | }
305 | }
306 |
307 | def main():
308 | """Main entry point for the MCP server."""
309 | try:
310 | logger.info("Starting DocsScraperMCP server...")
311 | mcp.run() # Using run() method instead of start()
312 | except Exception as e:
313 | logger.error(f"Server failed: {str(e)}")
314 | raise
315 | finally:
316 | logger.info("DocsScraperMCP server stopped.")
317 |
318 | if __name__ == "__main__":
319 | main()
```
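Each MCP tool above is an ordinary async function, so (assuming the `@mcp.tool()` decorator returns the original coroutine, as FastMCP does) the tools can also be called directly without an MCP client. A small sketch; the example.com URLs are placeholders:

```python
import asyncio

from docs_scraper.server import single_url_crawler, sitemap_crawler


async def demo() -> None:
    # Crawl a single page.
    page = await single_url_crawler("https://example.com/docs")
    print(page["success"], page.get("metadata", {}).get("title"))

    # Crawl every page listed in the site's sitemap.xml.
    site = await sitemap_crawler(base_url="https://example.com")
    print(site["success"], site["stats"])


if __name__ == "__main__":
    asyncio.run(demo())
```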
--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/multi_url_crawler.py:
--------------------------------------------------------------------------------
```python
1 | import os
2 | import sys
3 | import asyncio
4 | import re
5 | import json
6 | import argparse
7 | from typing import List, Optional
8 | from datetime import datetime
9 | from termcolor import colored
10 | from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
11 | from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
12 | from crawl4ai.content_filter_strategy import PruningContentFilter
13 | from urllib.parse import urlparse
14 |
15 | def load_urls_from_file(file_path: str) -> List[str]:
16 | """Load URLs from either a text file or JSON file"""
17 | try:
18 | # Create input_files directory if it doesn't exist
19 | input_dir = "input_files"
20 | os.makedirs(input_dir, exist_ok=True)
21 |
22 | # Check if file exists in current directory or input_files directory
23 | if os.path.exists(file_path):
24 | actual_path = file_path
25 | elif os.path.exists(os.path.join(input_dir, file_path)):
26 | actual_path = os.path.join(input_dir, file_path)
27 | else:
28 | print(colored(f"Error: File {file_path} not found", "red"))
29 | print(colored(f"Please place your URL files in either:", "yellow"))
30 | print(colored(f"1. The root directory ({os.getcwd()})", "yellow"))
31 | print(colored(f"2. The input_files directory ({os.path.join(os.getcwd(), input_dir)})", "yellow"))
32 | sys.exit(1)
33 |
34 | file_ext = os.path.splitext(actual_path)[1].lower()
35 |
36 | if file_ext == '.json':
37 | print(colored(f"Loading URLs from JSON file: {actual_path}", "cyan"))
38 | with open(actual_path, 'r', encoding='utf-8') as f:
39 | try:
40 | data = json.load(f)
41 | # Handle menu crawler output format
42 | if isinstance(data, dict) and 'menu_links' in data:
43 | urls = data['menu_links']
44 | elif isinstance(data, dict) and 'urls' in data:
45 | urls = data['urls']
46 | elif isinstance(data, list):
47 | urls = data
48 | else:
49 | print(colored("Error: Invalid JSON format. Expected 'menu_links' or 'urls' key, or list of URLs", "red"))
50 | sys.exit(1)
51 | print(colored(f"Successfully loaded {len(urls)} URLs from JSON file", "green"))
52 | return urls
53 | except json.JSONDecodeError as e:
54 | print(colored(f"Error: Invalid JSON file - {str(e)}", "red"))
55 | sys.exit(1)
56 | else:
57 | print(colored(f"Loading URLs from text file: {actual_path}", "cyan"))
58 | with open(actual_path, 'r', encoding='utf-8') as f:
59 | urls = [line.strip() for line in f if line.strip()]
60 | print(colored(f"Successfully loaded {len(urls)} URLs from text file", "green"))
61 | return urls
62 |
63 | except Exception as e:
64 | print(colored(f"Error loading URLs from file: {str(e)}", "red"))
65 | sys.exit(1)
66 |
67 | class MultiURLCrawler:
68 | def __init__(self, verbose: bool = True):
69 | self.browser_config = BrowserConfig(
70 | headless=True,
71 | verbose=True,
72 | viewport_width=800,
73 | viewport_height=600
74 | )
75 |
76 | self.crawler_config = CrawlerRunConfig(
77 | cache_mode=CacheMode.BYPASS,
78 | markdown_generator=DefaultMarkdownGenerator(
79 | content_filter=PruningContentFilter(
80 | threshold=0.48,
81 | threshold_type="fixed",
82 | min_word_threshold=0
83 | )
84 | ),
85 | )
86 |
87 | self.verbose = verbose
88 |
89 | def process_markdown_content(self, content: str, url: str) -> str:
90 | """Process markdown content to start from first H1 and add URL as H2"""
91 | # Find the first H1 tag
92 | h1_match = re.search(r'^# .+$', content, re.MULTILINE)
93 | if not h1_match:
94 | # If no H1 found, return original content with URL as H1
95 | return f"# No Title Found\n\n## Source\n{url}\n\n{content}"
96 |
97 | # Get the content starting from the first H1
98 | content_from_h1 = content[h1_match.start():]
99 |
100 | # Remove "Was this page helpful?" section and everything after it
101 | helpful_patterns = [
102 | r'^#+\s*Was this page helpful\?.*$', # Matches any heading level with this text
103 | r'^Was this page helpful\?.*$', # Matches the text without heading
104 | r'^#+\s*Was this helpful\?.*$', # Matches any heading level with shorter text
105 | r'^Was this helpful\?.*$' # Matches shorter text without heading
106 | ]
107 |
108 | for pattern in helpful_patterns:
109 | parts = re.split(pattern, content_from_h1, flags=re.MULTILINE | re.IGNORECASE)
110 | if len(parts) > 1:
111 | content_from_h1 = parts[0].strip()
112 | break
113 |
114 | # Insert URL as H2 after the H1
115 | lines = content_from_h1.split('\n')
116 | h1_line = lines[0]
117 | rest_of_content = '\n'.join(lines[1:])
118 |
119 | return f"{h1_line}\n\n## Source\n{url}\n\n{rest_of_content}"
120 |
121 | def get_filename_prefix(self, url: str) -> str:
122 | """
123 | Generate a filename prefix from a URL including path components.
124 | Examples:
125 | - https://docs.literalai.com/page -> literalai_docs_page
126 | - https://literalai.com/docs/page -> literalai_docs_page
127 | - https://api.example.com/path/to/page -> example_api_path_to_page
128 | """
129 | try:
130 | # Parse the URL
131 | parsed = urlparse(url)
132 |
133 | # Split hostname and reverse it (e.g., 'docs.example.com' -> ['com', 'example', 'docs'])
134 | hostname_parts = parsed.hostname.split('.')
135 | hostname_parts.reverse()
136 |
137 | # Remove common TLDs and 'www'
138 | hostname_parts = [p for p in hostname_parts if p not in ('com', 'org', 'net', 'www')]
139 |
140 | # Get path components, removing empty strings
141 | path_parts = [p for p in parsed.path.split('/') if p]
142 |
143 | # Combine hostname and path parts
144 | all_parts = hostname_parts + path_parts
145 |
146 | # Clean up parts: lowercase, remove special chars, limit length
147 | cleaned_parts = []
148 | for part in all_parts:
149 | # Convert to lowercase and remove special characters
150 | cleaned = re.sub(r'[^a-zA-Z0-9]+', '_', part.lower())
151 | # Remove leading/trailing underscores
152 | cleaned = cleaned.strip('_')
153 | # Only add non-empty parts
154 | if cleaned:
155 | cleaned_parts.append(cleaned)
156 |
157 | # Join parts with underscores
158 | return '_'.join(cleaned_parts)
159 |
160 | except Exception as e:
161 | print(colored(f"Error generating filename prefix: {str(e)}", "red"))
162 | return "default"
163 |
164 | def save_markdown_content(self, results: List[dict], filename_prefix: str = None):
165 | """Save all markdown content to a single file"""
166 | try:
167 | # Use the first successful URL to generate the filename prefix if none provided
168 | if not filename_prefix and results:
169 | # Find first successful result
170 | first_url = next((result["url"] for result in results if result["success"]), None)
171 | if first_url:
172 | filename_prefix = self.get_filename_prefix(first_url)
173 | else:
174 | filename_prefix = "docs" # Fallback if no successful results
175 |
176 | timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
177 | filename = f"{filename_prefix}_{timestamp}.md"
178 | filepath = os.path.join("scraped_docs", filename)
179 |
180 | # Create scraped_docs directory if it doesn't exist
181 | os.makedirs("scraped_docs", exist_ok=True)
182 |
183 | with open(filepath, "w", encoding="utf-8") as f:
184 | for result in results:
185 | if result["success"]:
186 | processed_content = self.process_markdown_content(
187 | result["markdown_content"],
188 | result["url"]
189 | )
190 | f.write(processed_content)
191 | f.write("\n\n---\n\n")
192 |
193 | if self.verbose:
194 | print(colored(f"\nMarkdown content saved to: {filepath}", "green"))
195 | return filepath
196 |
197 | except Exception as e:
198 | print(colored(f"\nError saving markdown content: {str(e)}", "red"))
199 | return None
200 |
201 | async def crawl(self, urls: List[str]) -> List[dict]:
202 | """
203 | Crawl multiple URLs sequentially using session reuse for optimal performance
204 | """
205 | if self.verbose:
206 | print("\n=== Starting Crawl ===")
207 | total_urls = len(urls)
208 | print(f"Total URLs to crawl: {total_urls}")
209 |
210 | results = []
211 | async with AsyncWebCrawler(config=self.browser_config) as crawler:
212 | session_id = "crawl_session" # Reuse the same session for all URLs
213 | for idx, url in enumerate(urls, 1):
214 | try:
215 | if self.verbose:
216 | progress = (idx / total_urls) * 100
217 | print(f"\nProgress: {idx}/{total_urls} ({progress:.1f}%)")
218 | print(f"Crawling: {url}")
219 |
220 | result = await crawler.arun(
221 | url=url,
222 | config=self.crawler_config,
223 | session_id=session_id,
224 | )
225 |
226 | results.append({
227 | "url": url,
228 | "success": result.success,
229 | "content_length": len(result.markdown.raw_markdown) if result.success else 0,
230 | "markdown_content": result.markdown.raw_markdown if result.success else "",
231 | "error": result.error_message if not result.success else None
232 | })
233 |
234 | if self.verbose and result.success:
235 | print(f"✓ Successfully crawled URL {idx}/{total_urls}")
236 | print(f"Content length: {len(result.markdown.raw_markdown)} characters")
237 | except Exception as e:
238 | results.append({
239 | "url": url,
240 | "success": False,
241 | "content_length": 0,
242 | "markdown_content": "",
243 | "error": str(e)
244 | })
245 | if self.verbose:
246 | print(f"✗ Error crawling URL {idx}/{total_urls}: {str(e)}")
247 |
248 | if self.verbose:
249 | successful = sum(1 for r in results if r["success"])
250 | print(f"\n=== Crawl Complete ===")
251 | print(f"Successfully crawled: {successful}/{total_urls} URLs")
252 |
253 | return results
254 |
255 | async def main():
256 | parser = argparse.ArgumentParser(description='Crawl multiple URLs and generate markdown documentation')
257 | parser.add_argument('urls_file', type=str, help='Path to file containing URLs (either .txt or .json)')
258 | parser.add_argument('--output-prefix', type=str, help='Prefix for output markdown file (optional)')
259 | args = parser.parse_args()
260 |
261 | try:
262 | # Load URLs from file
263 | urls = load_urls_from_file(args.urls_file)
264 |
265 | if not urls:
266 | print(colored("Error: No URLs found in the input file", "red"))
267 | sys.exit(1)
268 |
269 | print(colored(f"Found {len(urls)} URLs to crawl", "green"))
270 |
271 | # Initialize and run crawler
272 | crawler = MultiURLCrawler(verbose=True)
273 | results = await crawler.crawl(urls)
274 |
275 | # Save results to markdown file - only pass output_prefix if explicitly set
276 | crawler.save_markdown_content(results, args.output_prefix if args.output_prefix else None)
277 |
278 | except Exception as e:
279 | print(colored(f"Error during crawling: {str(e)}", "red"))
280 | sys.exit(1)
281 |
282 | if __name__ == "__main__":
283 | asyncio.run(main())
```
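A short sketch of using `MultiURLCrawler` programmatically rather than through the CLI above; the URLs are placeholders:

```python
import asyncio

from docs_scraper.crawlers.multi_url_crawler import MultiURLCrawler


async def crawl_docs() -> None:
    urls = [
        "https://example.com/docs/intro",
        "https://example.com/docs/api",
    ]

    crawler = MultiURLCrawler(verbose=True)
    results = await crawler.crawl(urls)

    # Writes one timestamped markdown file under scraped_docs/; when no
    # prefix is given it is derived from the first successful URL.
    crawler.save_markdown_content(results)

    for result in results:
        status = "ok" if result["success"] else f"failed: {result['error']}"
        print(result["url"], status)


if __name__ == "__main__":
    asyncio.run(crawl_docs())
```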
--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/sitemap_crawler.py:
--------------------------------------------------------------------------------
```python
1 | import os
2 | import sys
3 | import asyncio
4 | import re
5 | import xml.etree.ElementTree as ET
6 | import argparse
7 | from typing import List, Optional, Dict
8 | from datetime import datetime
9 | from termcolor import colored
10 | from ..utils import RequestHandler, HTMLParser
11 |
12 | class SitemapCrawler:
13 | def __init__(self, request_handler: Optional[RequestHandler] = None, html_parser: Optional[HTMLParser] = None, verbose: bool = True):
14 | """
15 | Initialize the sitemap crawler.
16 |
17 | Args:
18 | request_handler: Optional RequestHandler instance. If not provided, one will be created.
19 | html_parser: Optional HTMLParser instance. If not provided, one will be created.
20 | verbose: Whether to print progress messages
21 | """
22 | self.verbose = verbose
23 | self.request_handler = request_handler or RequestHandler(
24 | rate_limit=1.0,
25 | concurrent_limit=5,
26 | user_agent="DocsScraperBot/1.0",
27 | timeout=30
28 | )
29 | self._html_parser = html_parser
30 |
31 | async def fetch_sitemap(self, sitemap_url: str) -> List[str]:
32 | """
33 | Fetch and parse an XML sitemap to extract URLs.
34 |
35 | Args:
36 | sitemap_url (str): The URL of the XML sitemap
37 |
38 | Returns:
39 | List[str]: List of URLs found in the sitemap
40 | """
41 | if self.verbose:
42 | print(f"\nFetching sitemap from: {sitemap_url}")
43 |
44 | async with self.request_handler as handler:
45 | try:
46 | response = await handler.get(sitemap_url)
47 | if not response["success"]:
48 | raise Exception(f"Failed to fetch sitemap: {response['error']}")
49 |
50 | content = response["content"]
51 |
52 | # Parse XML content
53 | root = ET.fromstring(content)
54 |
55 | # Handle both standard sitemaps and sitemap indexes
56 | urls = []
57 |
58 | # Remove XML namespace for easier parsing
59 | namespace = root.tag.split('}')[0] + '}' if '}' in root.tag else ''
60 |
61 | if root.tag == f"{namespace}sitemapindex":
62 | # This is a sitemap index file
63 | if self.verbose:
64 | print("Found sitemap index, processing nested sitemaps...")
65 |
66 | for sitemap in root.findall(f".//{namespace}sitemap"):
67 | loc = sitemap.find(f"{namespace}loc")
68 | if loc is not None and loc.text:
69 | nested_urls = await self.fetch_sitemap(loc.text)
70 | urls.extend(nested_urls)
71 | else:
72 | # This is a standard sitemap
73 | for url in root.findall(f".//{namespace}url"):
74 | loc = url.find(f"{namespace}loc")
75 | if loc is not None and loc.text:
76 | urls.append(loc.text)
77 |
78 | if self.verbose:
79 | print(f"Found {len(urls)} URLs in sitemap")
80 | return urls
81 |
82 | except Exception as e:
83 | print(f"Error fetching sitemap: {str(e)}")
84 | return []
85 |
86 | def process_markdown_content(self, content: str, url: str) -> str:
87 | """Process markdown content to start from first H1 and add URL as H2"""
88 | # Find the first H1 tag
89 | h1_match = re.search(r'^# .+$', content, re.MULTILINE)
90 | if not h1_match:
91 | # If no H1 found, return original content with URL as H1
92 | return f"# No Title Found\n\n## Source\n{url}\n\n{content}"
93 |
94 | # Get the content starting from the first H1
95 | content_from_h1 = content[h1_match.start():]
96 |
97 | # Remove "Was this page helpful?" section and everything after it
98 | helpful_patterns = [
99 | r'^#+\s*Was this page helpful\?.*$', # Matches any heading level with this text
100 | r'^Was this page helpful\?.*$', # Matches the text without heading
101 | r'^#+\s*Was this helpful\?.*$', # Matches any heading level with shorter text
102 | r'^Was this helpful\?.*$' # Matches shorter text without heading
103 | ]
104 |
105 | for pattern in helpful_patterns:
106 | parts = re.split(pattern, content_from_h1, flags=re.MULTILINE | re.IGNORECASE)
107 | if len(parts) > 1:
108 | content_from_h1 = parts[0].strip()
109 | break
110 |
111 | # Insert URL as H2 after the H1
112 | lines = content_from_h1.split('\n')
113 | h1_line = lines[0]
114 | rest_of_content = '\n'.join(lines[1:]).strip()
115 |
116 | return f"{h1_line}\n\n## Source\n{url}\n\n{rest_of_content}"
117 |
118 | def save_markdown_content(self, results: List[dict], filename_prefix: str = "vercel_ai_docs"):
119 | """Save all markdown content to a single file"""
120 | timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
121 | filename = f"{filename_prefix}_{timestamp}.md"
122 | filepath = os.path.join("scraped_docs", filename)
123 |
124 | # Create scraped_docs directory if it doesn't exist
125 | os.makedirs("scraped_docs", exist_ok=True)
126 |
127 | with open(filepath, "w", encoding="utf-8") as f:
128 | for result in results:
129 | if result["success"]:
130 | processed_content = self.process_markdown_content(
131 | result["content"],
132 | result["url"]
133 | )
134 | f.write(processed_content)
135 | f.write("\n\n---\n\n")
136 |
137 | if self.verbose:
138 | print(f"\nMarkdown content saved to: {filepath}")
139 | return filepath
140 |
141 | async def crawl(self, sitemap_url: str) -> List[dict]:
142 | """
143 | Crawl a sitemap URL and all URLs it contains.
144 |
145 | Args:
146 | sitemap_url: URL of the sitemap to crawl
147 |
148 | Returns:
149 | List of dictionaries containing crawl results
150 | """
151 | if self.verbose:
152 | print("\n=== Starting Crawl ===")
153 |
154 | # First fetch all URLs from the sitemap
155 | urls = await self.fetch_sitemap(sitemap_url)
156 |
157 | if self.verbose:
158 | print(f"Total URLs to crawl: {len(urls)}")
159 |
160 | results = []
161 | async with self.request_handler as handler:
162 | for idx, url in enumerate(urls, 1):
163 | try:
164 | if self.verbose:
165 | progress = (idx / len(urls)) * 100
166 | print(f"\nProgress: {idx}/{len(urls)} ({progress:.1f}%)")
167 | print(f"Crawling: {url}")
168 |
169 | response = await handler.get(url)
170 | html_parser = self._html_parser or HTMLParser(url)
171 |
172 | if response["success"]:
173 | parsed_content = html_parser.parse_content(response["content"])
174 | results.append({
175 | "url": url,
176 | "success": True,
177 | "content": parsed_content["text_content"],
178 | "metadata": {
179 | "title": parsed_content["title"],
180 | "description": parsed_content["description"]
181 | },
182 | "links": parsed_content["links"],
183 | "status_code": response["status"],
184 | "error": None
185 | })
186 |
187 | if self.verbose:
188 | print(f"✓ Successfully crawled URL {idx}/{len(urls)}")
189 | print(f"Content length: {len(parsed_content['text_content'])} characters")
190 | else:
191 | results.append({
192 | "url": url,
193 | "success": False,
194 | "content": "",
195 | "metadata": {"title": None, "description": None},
196 | "links": [],
197 | "status_code": response.get("status"),
198 | "error": response["error"]
199 | })
200 | if self.verbose:
201 | print(f"✗ Error crawling URL {idx}/{len(urls)}: {response['error']}")
202 |
203 | except Exception as e:
204 | results.append({
205 | "url": url,
206 | "success": False,
207 | "content": "",
208 | "metadata": {"title": None, "description": None},
209 | "links": [],
210 | "status_code": None,
211 | "error": str(e)
212 | })
213 | if self.verbose:
214 | print(f"✗ Error crawling URL {idx}/{len(urls)}: {str(e)}")
215 |
216 | if self.verbose:
217 | successful = sum(1 for r in results if r["success"])
218 | print(f"\n=== Crawl Complete ===")
219 | print(f"Successfully crawled: {successful}/{len(urls)} URLs")
220 |
221 | return results
222 |
223 | def get_filename_prefix(self, url: str) -> str:
224 | """
225 | Generate a filename prefix from a sitemap URL.
226 | Examples:
227 | - https://docs.literalai.com/sitemap.xml -> literalai_documentation
228 | - https://literalai.com/docs/sitemap.xml -> literalai_docs
229 | - https://api.example.com/sitemap.xml -> example_api
230 |
231 | Args:
232 | url (str): The sitemap URL
233 |
234 | Returns:
235 | str: Generated filename prefix
236 | """
237 | # Remove protocol and split URL parts
238 | clean_url = url.split('://')[1]
239 | url_parts = clean_url.split('/')
240 |
241 | # Get domain parts
242 | domain_parts = url_parts[0].split('.')
243 |
244 | # Extract main domain name (ignoring TLD)
245 | main_domain = domain_parts[-2]
246 |
247 | # Determine the qualifier (subdomain or path segment)
248 | qualifier = None
249 |
250 | # First check subdomain
251 | if len(domain_parts) > 2:
252 | qualifier = domain_parts[0]
253 | # Then check path
254 | elif len(url_parts) > 2:
255 | # Get the first meaningful path segment
256 | for segment in url_parts[1:]:
257 | if segment and segment != 'sitemap.xml':
258 | qualifier = segment
259 | break
260 |
261 | # Build the prefix
262 | if qualifier:
263 | # Clean up qualifier (remove special characters, convert to lowercase)
264 | qualifier = re.sub(r'[^a-zA-Z0-9]', '', qualifier.lower())
265 | # Don't duplicate parts if they're the same
266 | if qualifier != main_domain:
267 | return f"{main_domain}_{qualifier}"
268 |
269 | return main_domain
270 |
271 | async def main():
272 | # Set up argument parser
273 | parser = argparse.ArgumentParser(description='Crawl a sitemap and generate markdown documentation')
274 | parser.add_argument('sitemap_url', type=str, help='URL of the sitemap (e.g., https://docs.example.com/sitemap.xml)')
275 | parser.add_argument('--max-depth', type=int, default=10, help='Maximum sitemap recursion depth')
276 | parser.add_argument('--patterns', type=str, nargs='+', help='URL patterns to include (e.g., "/docs/*" "/guide/*")')
277 | args = parser.parse_args()
278 |
279 | try:
280 | print(colored(f"\nFetching sitemap: {args.sitemap_url}", "cyan"))
281 |
282 | # Initialize crawler
283 | crawler = SitemapCrawler(verbose=True)
284 |
285 | # Fetch URLs from sitemap
286 | urls = await crawler.fetch_sitemap(args.sitemap_url)
287 |
288 | if not urls:
289 | print(colored("No URLs found in sitemap", "red"))
290 | sys.exit(1)
291 |
292 | # Filter URLs by pattern if specified
293 | if args.patterns:
294 | print(colored("\nFiltering URLs by patterns:", "cyan"))
295 | for pattern in args.patterns:
296 | print(colored(f" {pattern}", "yellow"))
297 |
298 | filtered_urls = []
299 | for url in urls:
300 | if any(pattern.replace('*', '') in url for pattern in args.patterns):
301 | filtered_urls.append(url)
302 |
303 | print(colored(f"\nFound {len(filtered_urls)} URLs matching patterns", "green"))
304 | urls = filtered_urls
305 |
306 | # Crawl the URLs
307 | results = await crawler.crawl(args.sitemap_url)
308 |
309 | # Save results to markdown file with dynamic name
310 | filename_prefix = crawler.get_filename_prefix(args.sitemap_url)
311 | crawler.save_markdown_content(results, filename_prefix)
312 |
313 | except Exception as e:
314 | print(colored(f"Error during crawling: {str(e)}", "red"))
315 | sys.exit(1)
316 |
317 | if __name__ == "__main__":
318 | asyncio.run(main())
```
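A minimal sketch of the flow implemented above: resolve the sitemap (including nested indexes) to a URL list, crawl every page, and save the combined markdown. The sitemap URL is a placeholder:

```python
import asyncio

from docs_scraper.crawlers import SitemapCrawler


async def crawl_site(sitemap_url: str) -> None:
    crawler = SitemapCrawler(verbose=True)

    # fetch_sitemap() recurses into sitemap index files and returns the
    # flat list of page URLs.
    urls = await crawler.fetch_sitemap(sitemap_url)
    print(f"{len(urls)} URLs listed in the sitemap")

    # crawl() fetches the sitemap again and then crawls each page,
    # returning one result dict per URL.
    results = await crawler.crawl(sitemap_url)
    crawler.save_markdown_content(results, crawler.get_filename_prefix(sitemap_url))


if __name__ == "__main__":
    asyncio.run(crawl_site("https://docs.example.com/sitemap.xml"))
```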
--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/menu_crawler.py:
--------------------------------------------------------------------------------
```python
1 | #!/usr/bin/env python3
2 |
3 | import asyncio
4 | from typing import List, Set
5 | from termcolor import colored
6 | from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
7 | from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
8 | from urllib.parse import urljoin, urlparse
9 | import json
10 | import os
11 | import sys
12 | import argparse
13 | from datetime import datetime
14 | import re
15 |
16 | # Constants
17 | BASE_URL = "https://developers.cloudflare.com/agents/"
18 | INPUT_DIR = "input_files" # Changed from OUTPUT_DIR
19 | MENU_SELECTORS = [
20 | # Traditional documentation selectors
21 | "nav a", # General navigation links
22 | "[role='navigation'] a", # Role-based navigation
23 | ".sidebar a", # Common sidebar class
24 | "[class*='nav'] a", # Classes containing 'nav'
25 | "[class*='menu'] a", # Classes containing 'menu'
26 | "aside a", # Side navigation
27 | ".toc a", # Table of contents
28 |
29 | # Modern framework selectors (Mintlify, Docusaurus, etc)
30 | "[class*='sidebar'] [role='navigation'] [class*='group'] a", # Navigation groups
31 | "[class*='sidebar'] [role='navigation'] [class*='item'] a", # Navigation items
32 | "[class*='sidebar'] [role='navigation'] [class*='link'] a", # Direct links
33 | "[class*='sidebar'] [role='navigation'] div[class*='text']", # Text items
34 | "[class*='sidebar'] [role='navigation'] [class*='nav-item']", # Nav items
35 |
36 | # Additional common patterns
37 | "[class*='docs-'] a", # Documentation-specific links
38 | "[class*='navigation'] a", # Navigation containers
39 | "[class*='toc'] a", # Table of contents variations
40 | ".docNavigation a", # Documentation navigation
41 | "[class*='menu-item'] a", # Menu items
42 |
43 | # Client-side rendered navigation
44 | "[class*='sidebar'] a[href]", # Any link in sidebar
45 | "[class*='sidebar'] [role='link']", # ARIA role links
46 | "[class*='sidebar'] [role='menuitem']", # Menu items
47 | "[class*='sidebar'] [role='treeitem']", # Tree navigation items
48 | "[class*='sidebar'] [onclick]", # Elements with click handlers
49 | "[class*='sidebar'] [class*='link']", # Elements with link classes
50 | "a[href^='/']", # Root-relative links
51 | "a[href^='./']", # Relative links
52 | "a[href^='../']" # Parent-relative links
53 | ]
54 |
55 | # JavaScript to expand nested menus
56 | EXPAND_MENUS_JS = """
57 | (async () => {
58 | // Wait for client-side rendering to complete
59 | await new Promise(r => setTimeout(r, 2000));
60 |
61 | // Function to expand all menu items
62 | async function expandAllMenus() {
63 | // Combined selectors for expandable menu items
64 | const expandableSelectors = [
65 | // Previous selectors...
66 | // Additional selectors for client-side rendered menus
67 | '[class*="sidebar"] button',
68 | '[class*="sidebar"] [role="button"]',
69 | '[class*="sidebar"] [aria-controls]',
70 | '[class*="sidebar"] [aria-expanded]',
71 | '[class*="sidebar"] [data-state]',
72 | '[class*="sidebar"] [class*="expand"]',
73 | '[class*="sidebar"] [class*="toggle"]',
74 | '[class*="sidebar"] [class*="collapse"]'
75 | ];
76 |
77 | let expanded = 0;
78 | let lastExpanded = -1;
79 | let attempts = 0;
80 | const maxAttempts = 10; // Increased attempts for client-side rendering
81 |
82 | while (expanded !== lastExpanded && attempts < maxAttempts) {
83 | lastExpanded = expanded;
84 | attempts++;
85 |
86 | for (const selector of expandableSelectors) {
87 | const elements = document.querySelectorAll(selector);
88 | for (const el of elements) {
89 | try {
90 | // Click the element
91 | el.click();
92 |
93 | // Try multiple expansion methods
94 | el.setAttribute('aria-expanded', 'true');
95 | el.setAttribute('data-state', 'open');
96 | el.classList.add('expanded', 'show', 'active');
97 | el.classList.remove('collapsed', 'closed');
98 |
99 | // Handle parent groups - multiple patterns
100 | ['[class*="group"]', '[class*="parent"]', '[class*="submenu"]'].forEach(parentSelector => {
101 | let parent = el.closest(parentSelector);
102 | if (parent) {
103 | parent.setAttribute('data-state', 'open');
104 | parent.setAttribute('aria-expanded', 'true');
105 | parent.classList.add('expanded', 'show', 'active');
106 | }
107 | });
108 |
109 | expanded++;
110 | await new Promise(r => setTimeout(r, 200)); // Increased delay between clicks
111 | } catch (e) {
112 | continue;
113 | }
114 | }
115 | }
116 |
117 | // Wait longer between attempts for client-side rendering
118 | await new Promise(r => setTimeout(r, 500));
119 | }
120 |
121 | // After expansion, try to convert text items to links if needed
122 | const textSelectors = [
123 | '[class*="sidebar"] [role="navigation"] [class*="text"]',
124 | '[class*="menu-item"]',
125 | '[class*="nav-item"]',
126 | '[class*="sidebar"] [role="menuitem"]',
127 | '[class*="sidebar"] [role="treeitem"]'
128 | ];
129 |
130 | textSelectors.forEach(selector => {
131 | const textItems = document.querySelectorAll(selector);
132 | textItems.forEach(item => {
133 | if (!item.querySelector('a') && item.textContent && item.textContent.trim()) {
134 | const text = item.textContent.trim();
135 | // Only create link if it doesn't already exist
136 | if (!Array.from(item.children).some(child => child.tagName === 'A')) {
137 | const link = document.createElement('a');
138 | link.href = '#' + text.toLowerCase().replace(/[^a-z0-9]+/g, '-');
139 | link.textContent = text;
140 | item.appendChild(link);
141 | }
142 | }
143 | });
144 | });
145 |
146 | return expanded;
147 | }
148 |
149 | const expandedCount = await expandAllMenus();
150 | // Final wait to ensure all client-side updates are complete
151 | await new Promise(r => setTimeout(r, 1000));
152 | return expandedCount;
153 | })();
154 | """
155 |
156 | def get_filename_prefix(url: str) -> str:
157 | """
158 | Generate a filename prefix from a URL including path components.
159 | Examples:
160 | - https://docs.literalai.com/page -> literalai_docs_page
161 | - https://literalai.com/docs/page -> literalai_docs_page
162 | - https://api.example.com/path/to/page -> example_api_path_to_page
163 |
164 | Args:
165 | url (str): The URL to process
166 |
167 | Returns:
168 | str: A filename-safe string derived from the URL
169 | """
170 | try:
171 | # Parse the URL
172 | parsed = urlparse(url)
173 |
174 | # Split hostname and reverse it (e.g., 'docs.example.com' -> ['com', 'example', 'docs'])
175 | hostname_parts = parsed.hostname.split('.')
176 | hostname_parts.reverse()
177 |
178 | # Remove common TLDs and 'www'
179 | hostname_parts = [p for p in hostname_parts if p not in ('com', 'org', 'net', 'www')]
180 |
181 | # Get path components, removing empty strings
182 | path_parts = [p for p in parsed.path.split('/') if p]
183 |
184 | # Combine hostname and path parts
185 | all_parts = hostname_parts + path_parts
186 |
187 | # Clean up parts: lowercase, remove special chars, limit length
188 | cleaned_parts = []
189 | for part in all_parts:
190 | # Convert to lowercase and remove special characters
191 | cleaned = re.sub(r'[^a-zA-Z0-9]+', '_', part.lower())
192 | # Remove leading/trailing underscores
193 | cleaned = cleaned.strip('_')
194 | # Only add non-empty parts
195 | if cleaned:
196 | cleaned_parts.append(cleaned)
197 |
198 | # Join parts with underscores
199 | return '_'.join(cleaned_parts)
200 |
201 | except Exception as e:
202 | print(colored(f"Error generating filename prefix: {str(e)}", "red"))
203 | return "default"
204 |
205 | class MenuCrawler:
206 | def __init__(self, start_url: str):
207 | self.start_url = start_url
208 |
209 | # Configure browser settings
210 | self.browser_config = BrowserConfig(
211 | headless=True,
212 | viewport_width=1920,
213 | viewport_height=1080,
214 | java_script_enabled=True # Ensure JavaScript is enabled
215 | )
216 |
217 | # Create extraction strategy for menu links
218 | extraction_schema = {
219 | "name": "MenuLinks",
220 | "baseSelector": ", ".join(MENU_SELECTORS),
221 | "fields": [
222 | {
223 | "name": "href",
224 | "type": "attribute",
225 | "attribute": "href"
226 | },
227 | {
228 | "name": "text",
229 | "type": "text"
230 | },
231 | {
232 | "name": "onclick",
233 | "type": "attribute",
234 | "attribute": "onclick"
235 | },
236 | {
237 | "name": "role",
238 | "type": "attribute",
239 | "attribute": "role"
240 | }
241 | ]
242 | }
243 | extraction_strategy = JsonCssExtractionStrategy(extraction_schema)
244 |
245 | # Configure crawler settings with proper wait conditions
246 | self.crawler_config = CrawlerRunConfig(
247 | extraction_strategy=extraction_strategy,
248 | cache_mode=CacheMode.BYPASS, # Don't use cache for fresh results
249 | verbose=True, # Enable detailed logging
250 | wait_for_images=True, # Ensure lazy-loaded content is captured
251 | js_code=[
252 | # Initial wait for client-side rendering
253 | "await new Promise(r => setTimeout(r, 2000));",
254 | EXPAND_MENUS_JS
255 | ], # Add JavaScript to expand nested menus
256 | wait_for="""js:() => {
257 | // Wait for sidebar and its content to be present
258 | const sidebar = document.querySelector('[class*="sidebar"]');
259 | if (!sidebar) return false;
260 |
261 | // Check if we have navigation items
262 | const hasNavItems = sidebar.querySelectorAll('a').length > 0;
263 | if (hasNavItems) return true;
264 |
265 | // If no nav items yet, check for loading indicators
266 | const isLoading = document.querySelector('[class*="loading"]') !== null;
267 | return !isLoading; // Return true if not loading anymore
268 | }""",
269 | session_id="menu_crawler", # Use a session to maintain state
270 | js_only=False # We want full page load first
271 | )
272 |
273 | # Create output directory if it doesn't exist
274 | if not os.path.exists(INPUT_DIR):
275 | os.makedirs(INPUT_DIR)
276 | print(colored(f"Created output directory: {INPUT_DIR}", "green"))
277 |
278 | async def extract_all_menu_links(self) -> List[str]:
279 | """Extract all menu links from the main page, including nested menus."""
280 | try:
281 | print(colored(f"Crawling main page: {self.start_url}", "cyan"))
282 | print(colored("Expanding all nested menus...", "yellow"))
283 |
284 | async with AsyncWebCrawler(config=self.browser_config) as crawler:
285 | # Get page content using crawl4ai
286 | result = await crawler.arun(
287 | url=self.start_url,
288 | config=self.crawler_config
289 | )
290 |
291 | if not result or not result.success:
292 |                     print(colored("Failed to get page data", "red"))
293 | if result and result.error_message:
294 | print(colored(f"Error: {result.error_message}", "red"))
295 | return []
296 |
297 | links = set()
298 |
299 | # Parse the base domain from start_url
300 | base_domain = urlparse(self.start_url).netloc
301 |
302 | # Add the base URL first (without trailing slash for consistency)
303 | base_url = self.start_url.rstrip('/')
304 | links.add(base_url)
305 | print(colored(f"Added base URL: {base_url}", "green"))
306 |
307 | # Extract links from the result
308 | if hasattr(result, 'extracted_content') and result.extracted_content:
309 | try:
310 | menu_links = json.loads(result.extracted_content)
311 | for link in menu_links:
312 | href = link.get('href', '')
313 | text = link.get('text', '').strip()
314 |
315 | # Skip empty hrefs
316 | if not href:
317 | continue
318 |
319 | # Convert relative URLs to absolute
320 | absolute_url = urljoin(self.start_url, href)
321 | parsed_url = urlparse(absolute_url)
322 |
323 | # Accept internal links (same domain) that aren't anchors
324 | if (parsed_url.netloc == base_domain and
325 | not href.startswith('#') and
326 | '#' not in absolute_url):
327 |
328 | # Remove any trailing slashes for consistency
329 | absolute_url = absolute_url.rstrip('/')
330 |
331 | links.add(absolute_url)
332 | print(colored(f"Found link: {text} -> {absolute_url}", "green"))
333 | else:
334 | print(colored(f"Skipping external or anchor link: {text} -> {href}", "yellow"))
335 |
336 | except json.JSONDecodeError as e:
337 | print(colored(f"Error parsing extracted content: {str(e)}", "red"))
338 |
339 | print(colored(f"\nFound {len(links)} unique menu links", "green"))
340 | return sorted(list(links))
341 |
342 | except Exception as e:
343 | print(colored(f"Error extracting menu links: {str(e)}", "red"))
344 | return []
345 |
346 | def save_results(self, results: dict) -> str:
347 | """Save crawling results to a JSON file in the input_files directory."""
348 | try:
349 | # Create input_files directory if it doesn't exist
350 | os.makedirs(INPUT_DIR, exist_ok=True)
351 |
352 | # Generate filename using the same pattern
353 | filename_prefix = get_filename_prefix(self.start_url)
354 | timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
355 | filename = f"{filename_prefix}_menu_links_{timestamp}.json"
356 | filepath = os.path.join(INPUT_DIR, filename)
357 |
358 | with open(filepath, "w", encoding="utf-8") as f:
359 | json.dump(results, f, indent=2)
360 |
361 | print(colored(f"\n✓ Menu links saved to: {filepath}", "green"))
362 | print(colored("\nTo crawl these URLs with multi_url_crawler.py, run:", "cyan"))
363 | print(colored(f"python multi_url_crawler.py --urls {filename}", "yellow"))
364 | return filepath
365 |
366 | except Exception as e:
367 | print(colored(f"\n✗ Error saving menu links: {str(e)}", "red"))
368 | return None
369 |
370 | async def crawl(self):
371 | """Main crawling method."""
372 | try:
373 | # Extract all menu links from the main page
374 | menu_links = await self.extract_all_menu_links()
375 |
376 | # Save results
377 | results = {
378 | "start_url": self.start_url,
379 | "total_links_found": len(menu_links),
380 | "menu_links": menu_links
381 | }
382 |
383 | self.save_results(results)
384 |
385 |             print(colored("\nCrawling completed!", "green"))
386 | print(colored(f"Total unique menu links found: {len(menu_links)}", "green"))
387 |
388 | except Exception as e:
389 | print(colored(f"Error during crawling: {str(e)}", "red"))
390 |
391 | async def main():
392 | # Set up argument parser
393 | parser = argparse.ArgumentParser(description='Extract menu links from a documentation website')
394 | parser.add_argument('url', type=str, help='Documentation site URL to crawl')
395 | parser.add_argument('--selectors', type=str, nargs='+', help='Custom menu selectors (optional)')
396 | args = parser.parse_args()
397 |
398 | try:
399 | # Update menu selectors if custom ones provided
400 | if args.selectors:
401 | print(colored("Using custom menu selectors:", "cyan"))
402 | for selector in args.selectors:
403 | print(colored(f" {selector}", "yellow"))
404 | global MENU_SELECTORS
405 | MENU_SELECTORS = args.selectors
406 |
407 | crawler = MenuCrawler(args.url)
408 | await crawler.crawl()
409 | except Exception as e:
410 | print(colored(f"Error in main: {str(e)}", "red"))
411 | sys.exit(1)
412 |
413 | if __name__ == "__main__":
414 | print(colored("Starting documentation menu crawler...", "cyan"))
415 | asyncio.run(main())
```
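
A minimal usage sketch for the crawler above, assuming this file is the `menu_crawler` module under `src/docs_scraper/crawlers` (the import path and the docs URL below are illustrative placeholders, not values taken from the repo): `MenuCrawler` can be driven programmatically instead of through the `argparse` CLI in `main()`.

```python
# Sketch only: programmatic use of MenuCrawler, mirroring what main() does.
# Assumptions: the module is importable as docs_scraper.crawlers.menu_crawler,
# and https://docs.example.com stands in for a real documentation site.
import asyncio

from docs_scraper.crawlers.menu_crawler import MenuCrawler


async def run() -> None:
    crawler = MenuCrawler("https://docs.example.com")
    # crawl() expands the sidebar menus, extracts the internal links, and writes
    # <prefix>_menu_links_<timestamp>.json into the input_files/ directory.
    await crawler.crawl()


if __name__ == "__main__":
    asyncio.run(run())
```

The resulting JSON file can then be fed to the multi-URL crawler, e.g. `python multi_url_crawler.py --urls <file>.json`, as printed by `save_results()`.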