# Directory Structure
```
├── .cursor
│   └── rules
│       ├── implementation-plan.mdc
│       └── mcp-development-protocol.mdc
├── .gitignore
├── .venv
│   ├── Include
│   │   └── site
│   │       └── python3.12
│   │           └── greenlet
│   │               └── greenlet.h
│   ├── pyvenv.cfg
│   └── Scripts
│       ├── activate
│       ├── activate.bat
│       ├── Activate.ps1
│       ├── cchardetect
│       ├── crawl4ai-doctor.exe
│       ├── crawl4ai-download-models.exe
│       ├── crawl4ai-migrate.exe
│       ├── crawl4ai-setup.exe
│       ├── crwl.exe
│       ├── deactivate.bat
│       ├── distro.exe
│       ├── docs-scraper.exe
│       ├── dotenv.exe
│       ├── f2py.exe
│       ├── httpx.exe
│       ├── huggingface-cli.exe
│       ├── jsonschema.exe
│       ├── litellm.exe
│       ├── markdown-it.exe
│       ├── mcp.exe
│       ├── nltk.exe
│       ├── normalizer.exe
│       ├── numpy-config.exe
│       ├── openai.exe
│       ├── pip.exe
│       ├── pip3.12.exe
│       ├── pip3.exe
│       ├── playwright.exe
│       ├── py.test.exe
│       ├── pygmentize.exe
│       ├── pytest.exe
│       ├── python.exe
│       ├── pythonw.exe
│       ├── tqdm.exe
│       ├── typer.exe
│       └── uvicorn.exe
├── input_files
│   └── .gitkeep
├── LICENSE
├── pyproject.toml
├── README.md
├── requirements.txt
├── scraped_docs
│   └── .gitkeep
├── src
│   └── docs_scraper
│       ├── __init__.py
│       ├── cli.py
│       ├── crawlers
│       │   ├── __init__.py
│       │   ├── menu_crawler.py
│       │   ├── multi_url_crawler.py
│       │   ├── single_url_crawler.py
│       │   └── sitemap_crawler.py
│       ├── server.py
│       └── utils
│           ├── __init__.py
│           ├── html_parser.py
│           └── request_handler.py
└── tests
    ├── conftest.py
    ├── test_crawlers
    │   ├── test_menu_crawler.py
    │   ├── test_multi_url_crawler.py
    │   ├── test_single_url_crawler.py
    │   └── test_sitemap_crawler.py
    └── test_utils
        ├── test_html_parser.py
        └── test_request_handler.py
```
# Files
--------------------------------------------------------------------------------
/input_files/.gitkeep:
--------------------------------------------------------------------------------
```
```
--------------------------------------------------------------------------------
/scraped_docs/.gitkeep:
--------------------------------------------------------------------------------
```
```
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Virtual Environment
venv/
ENV/
.env
# IDE
.idea/
.vscode/
*.swp
*.swo
.DS_Store
# Scraped Docs - ignore contents but keep directory
scraped_docs/*
!scraped_docs/.gitkeep
# Input Files - ignore contents but keep directory
input_files/*
!input_files/.gitkeep
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
```markdown
# Crawl4AI Documentation Scraper
Keep your dependency documentation lean, current, and AI-ready. This toolkit helps you extract clean, focused documentation from any framework or library website, perfect for both human readers and LLM consumption.
## Why This Tool?
In today's fast-paced development environment, you need:
- 📚 Quick access to dependency documentation without the bloat
- 🤖 Documentation in a format that's ready for RAG systems and LLMs
- 🎯 Focused content without navigation elements, ads, or irrelevant sections
- ⚡ Fast, efficient way to keep documentation up-to-date
- 🧹 Clean Markdown output for easy integration with documentation tools
Traditional web scraping often gives you everything - including navigation menus, footers, ads, and other noise. This toolkit is specifically designed to extract only what matters: the actual documentation content.
### Key Benefits
1. **Clean Documentation Output**
- Markdown format for content-focused documentation
- JSON format for structured menu data
- Perfect for documentation sites, wikis, and knowledge bases
- Ideal format for LLM training and RAG systems
2. **Smart Content Extraction**
- Automatically identifies main content areas
- Strips away navigation, ads, and irrelevant sections
- Preserves code blocks and technical formatting
- Maintains proper Markdown structure
3. **Flexible Crawling Strategies**
- Single page for quick reference docs
- Multi-page for comprehensive library documentation
- Sitemap-based for complete framework coverage
- Menu-based for structured documentation hierarchies
4. **LLM and RAG Ready**
- Clean Markdown text suitable for embeddings
- Preserved code blocks for technical accuracy
- Structured menu data in JSON format
- Consistent formatting for reliable processing
A comprehensive Python toolkit for scraping documentation websites using different crawling strategies. Built using the Crawl4AI library for efficient web crawling.
[Crawl4AI](https://github.com/unclecode/crawl4ai)
## Features
### Core Features
- 🚀 Multiple crawling strategies
- 📑 Automatic nested menu expansion
- 🔄 Handles dynamic content and lazy-loaded elements
- 🎯 Configurable selectors
- 📝 Clean Markdown output for documentation
- 📊 JSON output for menu structure
- 🎨 Colorful terminal feedback
- 🔍 Smart URL processing
- ⚡ Asynchronous execution
### Available Crawlers
1. **Single URL Crawler** (`single_url_crawler.py`)
- Extracts content from a single documentation page
- Outputs clean Markdown format
- Perfect for targeted content extraction
- Configurable content selectors
2. **Multi URL Crawler** (`multi_url_crawler.py`)
- Processes multiple URLs in parallel
- Generates individual Markdown files per page
- Efficient batch processing
- Shared browser session for better performance
3. **Sitemap Crawler** (`sitemap_crawler.py`)
- Automatically discovers and crawls sitemap.xml
- Creates Markdown files for each page
- Supports recursive sitemap parsing
- Handles gzipped sitemaps
4. **Menu Crawler** (`menu_crawler.py`)
- Extracts all menu links from documentation
- Outputs structured JSON format
- Handles nested and dynamic menus
- Smart menu expansion
## Requirements
- Python 3.7+
- Virtual Environment (recommended)
## Installation
1. Clone the repository:
```bash
git clone https://github.com/felores/crawl4ai_docs_scraper.git
cd crawl4ai_docs_scraper
```
2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
## Usage
### 1. Single URL Crawler
```bash
python single_url_crawler.py https://docs.example.com/page
```
Arguments:
- URL: Target documentation URL (required, first argument)
Note: Use quotes only if your URL contains special characters or spaces.
Output format (Markdown):
```markdown
# Page Title
## Section 1
Content with preserved formatting, including:
- Lists
- Links
- Tables
### Code Examples
```python
def example():
return "Code blocks are preserved"
```
### 2. Multi URL Crawler
```bash
# Using a text file with URLs
python multi_url_crawler.py urls.txt
# Using JSON output from menu crawler
python multi_url_crawler.py menu_links.json
# Using custom output prefix
python multi_url_crawler.py menu_links.json --output-prefix custom_name
```
Arguments:
- URLs file: Path to file containing URLs (required, first argument)
- Can be .txt with one URL per line
- Or .json from menu crawler output
- `--output-prefix`: Custom prefix for output markdown file (optional)
Note: Use quotes only if your file path contains spaces.
Output filename format:
- Without `--output-prefix`: `domain_path_docs_content_timestamp.md` (e.g., `cloudflare_agents_docs_content_20240323_223656.md`)
- With `--output-prefix`: `custom_prefix_docs_content_timestamp.md` (e.g., `custom_name_docs_content_20240323_223656.md`)
The crawler accepts two types of input files:
1. Text file with one URL per line:
```text
https://docs.example.com/page1
https://docs.example.com/page2
https://docs.example.com/page3
```
2. JSON file (compatible with menu crawler output):
```json
{
"menu_links": [
"https://docs.example.com/page1",
"https://docs.example.com/page2"
]
}
```
### 3. Sitemap Crawler
```bash
python sitemap_crawler.py https://docs.example.com/sitemap.xml
```
Options:
- `--max-depth`: Maximum sitemap recursion depth (optional)
- `--patterns`: URL patterns to include (optional)
### 4. Menu Crawler
```bash
python menu_crawler.py https://docs.example.com
```
Options:
- `--selectors`: Custom menu selectors (optional)
The menu crawler now saves its output to the `input_files` directory, making it ready for immediate use with the multi-url crawler. The output JSON has this format:
```json
{
"start_url": "https://docs.example.com/",
"total_links_found": 42,
"menu_links": [
"https://docs.example.com/page1",
"https://docs.example.com/page2"
]
}
```
After running the menu crawler, you'll get a command to run the multi-url crawler with the generated file.
## Directory Structure
```
crawl4ai_docs_scraper/
├── input_files/              # Input files for URL processing
│   ├── urls.txt              # Text file with URLs
│   └── menu_links.json       # JSON output from menu crawler
├── scraped_docs/             # Output directory for markdown files
│   └── docs_timestamp.md     # Generated documentation
├── multi_url_crawler.py
├── menu_crawler.py
└── requirements.txt
```
## Error Handling
All crawlers include comprehensive error handling with colored terminal output:
- 🟢 Green: Success messages
- 🔵 Cyan: Processing status
- 🟡 Yellow: Warnings
- 🔴 Red: Error messages
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Attribution
This project uses [Crawl4AI](https://github.com/unclecode/crawl4ai) for web data extraction.
## Acknowledgments
- Built with [Crawl4AI](https://github.com/unclecode/crawl4ai)
- Uses [termcolor](https://pypi.org/project/termcolor/) for colorful terminal output
```
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
```
crawl4ai
aiohttp
termcolor
playwright
```
--------------------------------------------------------------------------------
/src/docs_scraper/utils/__init__.py:
--------------------------------------------------------------------------------
```python
"""
Utility modules for web crawling and HTML parsing.
"""
from .request_handler import RequestHandler
from .html_parser import HTMLParser
__all__ = [
'RequestHandler',
'HTMLParser'
]
```
--------------------------------------------------------------------------------
/src/docs_scraper/__init__.py:
--------------------------------------------------------------------------------
```python
"""
Documentation scraper MCP server package.
"""
# Import subpackages but not modules to avoid circular imports
from . import crawlers
from . import utils
# Expose important items at package level
__all__ = ['crawlers', 'utils']
```
--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/__init__.py:
--------------------------------------------------------------------------------
```python
"""
Web crawler implementations for documentation scraping.
"""
from .single_url_crawler import SingleURLCrawler
from .multi_url_crawler import MultiURLCrawler
from .sitemap_crawler import SitemapCrawler
from .menu_crawler import MenuCrawler
__all__ = [
'SingleURLCrawler',
'MultiURLCrawler',
'SitemapCrawler',
'MenuCrawler'
]
```
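The crawler classes above are meant to be composed with the helpers in `docs_scraper.utils`. A minimal usage sketch, assuming the package and its dependencies are installed; the URL is a placeholder, and the constructor and `crawl()` signatures follow the module sources shown later in this listing.
```python
# A usage sketch only: the URL is hypothetical, and the wiring mirrors the
# constructors defined in request_handler.py, html_parser.py and
# single_url_crawler.py.
import asyncio

from docs_scraper.crawlers import SingleURLCrawler
from docs_scraper.utils import HTMLParser, RequestHandler


async def main() -> None:
    url = "https://docs.example.com/page"
    async with RequestHandler(rate_limit=1.0, concurrent_limit=5) as handler:
        crawler = SingleURLCrawler(
            request_handler=handler,
            html_parser=HTMLParser(base_url=url),
        )
        result = await crawler.crawl(url)
    if result["success"]:
        print(result["metadata"]["title"])
    else:
        print(f"Crawl failed: {result['error']}")


if __name__ == "__main__":
    asyncio.run(main())
```
The `async with` block matters: `RequestHandler` only creates (and later closes) its `aiohttp` session inside `__aenter__`/`__aexit__`.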
--------------------------------------------------------------------------------
/.venv/Scripts/deactivate.bat:
--------------------------------------------------------------------------------
```
@echo off
if defined _OLD_VIRTUAL_PROMPT (
set "PROMPT=%_OLD_VIRTUAL_PROMPT%"
)
set _OLD_VIRTUAL_PROMPT=
if defined _OLD_VIRTUAL_PYTHONHOME (
set "PYTHONHOME=%_OLD_VIRTUAL_PYTHONHOME%"
set _OLD_VIRTUAL_PYTHONHOME=
)
if defined _OLD_VIRTUAL_PATH (
set "PATH=%_OLD_VIRTUAL_PATH%"
)
set _OLD_VIRTUAL_PATH=
set VIRTUAL_ENV=
set VIRTUAL_ENV_PROMPT=
:END
```
--------------------------------------------------------------------------------
/src/docs_scraper/cli.py:
--------------------------------------------------------------------------------
```python
"""
Command line interface for the docs_scraper package.
"""
import logging
def main():
"""Entry point for the package when run from the command line."""
from docs_scraper.server import main as server_main
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
# Run the server
server_main()
if __name__ == "__main__":
main()
```
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
```toml
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
[project]
name = "docs_scraper"
version = "0.1.0"
authors = [
{ name = "Your Name", email = "[email protected]" }
]
description = "A documentation scraping tool"
requires-python = ">=3.7"
dependencies = [
"beautifulsoup4",
"requests",
"aiohttp",
"lxml",
"termcolor",
"crawl4ai"
]
classifiers = [
"Programming Language :: Python :: 3",
"Operating System :: OS Independent",
]
[project.optional-dependencies]
test = [
"pytest",
"pytest-asyncio",
"aioresponses"
]
[project.scripts]
docs-scraper = "docs_scraper.cli:main"
[tool.setuptools.packages.find]
where = ["src"]
include = ["docs_scraper*"]
namespaces = false
[tool.hatch.build]
packages = ["src/docs_scraper"]
```
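The `[project.scripts]` table maps the `docs-scraper` command to `docs_scraper.cli:main`. A minimal sketch of the programmatic equivalent, assuming an editable install such as `pip install -e ".[test]"`.
```python
# A sketch of the programmatic equivalent of the `docs-scraper` console script
# declared under [project.scripts]; assumes the package is installed.
from docs_scraper.cli import main

if __name__ == "__main__":
    main()  # configures logging, then hands off to docs_scraper.server.main()
```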
--------------------------------------------------------------------------------
/.venv/Scripts/activate.bat:
--------------------------------------------------------------------------------
```
@echo off
rem This file is UTF-8 encoded, so we need to update the current code page while executing it
for /f "tokens=2 delims=:." %%a in ('"%SystemRoot%\System32\chcp.com"') do (
set _OLD_CODEPAGE=%%a
)
if defined _OLD_CODEPAGE (
"%SystemRoot%\System32\chcp.com" 65001 > nul
)
set "VIRTUAL_ENV=D:\AI-DEV\mcp\docs_scraper_mcp\.venv"
if not defined PROMPT set PROMPT=$P$G
if defined _OLD_VIRTUAL_PROMPT set PROMPT=%_OLD_VIRTUAL_PROMPT%
if defined _OLD_VIRTUAL_PYTHONHOME set PYTHONHOME=%_OLD_VIRTUAL_PYTHONHOME%
set _OLD_VIRTUAL_PROMPT=%PROMPT%
set PROMPT=(.venv) %PROMPT%
if defined PYTHONHOME set _OLD_VIRTUAL_PYTHONHOME=%PYTHONHOME%
set PYTHONHOME=
if defined _OLD_VIRTUAL_PATH set PATH=%_OLD_VIRTUAL_PATH%
if not defined _OLD_VIRTUAL_PATH set _OLD_VIRTUAL_PATH=%PATH%
set "PATH=%VIRTUAL_ENV%\Scripts;%PATH%"
set "VIRTUAL_ENV_PROMPT=(.venv) "
:END
if defined _OLD_CODEPAGE (
"%SystemRoot%\System32\chcp.com" %_OLD_CODEPAGE% > nul
set _OLD_CODEPAGE=
)
```
--------------------------------------------------------------------------------
/tests/conftest.py:
--------------------------------------------------------------------------------
```python
"""
Test configuration and fixtures for the docs_scraper package.
"""
import os
import pytest
import aiohttp
from typing import AsyncGenerator, Dict, Any
from aioresponses import aioresponses
from bs4 import BeautifulSoup
@pytest.fixture
def mock_aiohttp() -> aioresponses:
"""Fixture for mocking aiohttp requests."""
with aioresponses() as m:
yield m
@pytest.fixture
def sample_html() -> str:
"""Sample HTML content for testing."""
return """
<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
<meta name="description" content="Test description">
</head>
<body>
<nav class="menu">
<ul>
<li><a href="/page1">Page 1</a></li>
<li>
<a href="/section1">Section 1</a>
<ul>
<li><a href="/section1/page1">Section 1.1</a></li>
<li><a href="/section1/page2">Section 1.2</a></li>
</ul>
</li>
</ul>
</nav>
<main>
<h1>Welcome</h1>
<p>Test content</p>
<a href="/test1">Link 1</a>
<a href="/test2">Link 2</a>
</main>
</body>
</html>
"""
@pytest.fixture
def sample_sitemap() -> str:
"""Sample sitemap.xml content for testing."""
return """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2024-03-24</lastmod>
</url>
<url>
<loc>https://example.com/page1</loc>
<lastmod>2024-03-24</lastmod>
</url>
<url>
<loc>https://example.com/page2</loc>
<lastmod>2024-03-24</lastmod>
</url>
</urlset>
"""
@pytest.fixture
def mock_website(mock_aiohttp, sample_html, sample_sitemap) -> None:
"""Set up a mock website with various pages and a sitemap."""
base_url = "https://example.com"
pages = {
"/": sample_html,
"/page1": sample_html.replace("Test Page", "Page 1"),
"/page2": sample_html.replace("Test Page", "Page 2"),
"/section1": sample_html.replace("Test Page", "Section 1"),
"/section1/page1": sample_html.replace("Test Page", "Section 1.1"),
"/section1/page2": sample_html.replace("Test Page", "Section 1.2"),
"/robots.txt": "User-agent: *\nAllow: /",
"/sitemap.xml": sample_sitemap
}
for path, content in pages.items():
mock_aiohttp.get(f"{base_url}{path}", status=200, body=content)
@pytest.fixture
async def aiohttp_session() -> AsyncGenerator[aiohttp.ClientSession, None]:
"""Create an aiohttp ClientSession for testing."""
async with aiohttp.ClientSession() as session:
yield session
@pytest.fixture
def test_urls() -> Dict[str, Any]:
"""Test URLs and related data for testing."""
base_url = "https://example.com"
return {
"base_url": base_url,
"valid_urls": [
f"{base_url}/",
f"{base_url}/page1",
f"{base_url}/page2"
],
"invalid_urls": [
"not_a_url",
"ftp://example.com",
"https://nonexistent.example.com"
],
"menu_selector": "nav.menu",
"sitemap_url": f"{base_url}/sitemap.xml"
}
```
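For orientation, a minimal sketch of a test that consumes these fixtures; the test name is illustrative, and it assumes `pytest-asyncio` and `aioresponses` from the `test` extra are installed.
```python
# Illustrative test only: mock_website registers the sample pages with
# aioresponses, so requests made through aiohttp_session never hit the network.
import pytest


@pytest.mark.asyncio
async def test_homepage_served_from_mock(mock_website, test_urls, aiohttp_session):
    async with aiohttp_session.get(f"{test_urls['base_url']}/") as response:
        assert response.status == 200
        assert "Test Page" in await response.text()
```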
--------------------------------------------------------------------------------
/tests/test_crawlers/test_single_url_crawler.py:
--------------------------------------------------------------------------------
```python
"""
Tests for the SingleURLCrawler class.
"""
import pytest
from docs_scraper.crawlers import SingleURLCrawler
from docs_scraper.utils import RequestHandler, HTMLParser
@pytest.mark.asyncio
async def test_single_url_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
"""Test successful crawling of a single URL."""
url = test_urls["valid_urls"][0]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
result = await crawler.crawl(url)
assert result["success"] is True
assert result["url"] == url
assert "content" in result
assert "title" in result["metadata"]
assert "description" in result["metadata"]
assert len(result["links"]) > 0
assert result["status_code"] == 200
assert result["error"] is None
@pytest.mark.asyncio
async def test_single_url_crawler_invalid_url(mock_website, test_urls, aiohttp_session):
"""Test crawling with an invalid URL."""
url = test_urls["invalid_urls"][0]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
result = await crawler.crawl(url)
assert result["success"] is False
assert result["url"] == url
assert result["content"] is None
assert result["metadata"] == {}
assert result["links"] == []
assert result["error"] is not None
@pytest.mark.asyncio
async def test_single_url_crawler_nonexistent_url(mock_website, test_urls, aiohttp_session):
"""Test crawling a URL that doesn't exist."""
url = test_urls["invalid_urls"][2]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
result = await crawler.crawl(url)
assert result["success"] is False
assert result["url"] == url
assert result["content"] is None
assert result["metadata"] == {}
assert result["links"] == []
assert result["error"] is not None
@pytest.mark.asyncio
async def test_single_url_crawler_metadata_extraction(mock_website, test_urls, aiohttp_session):
"""Test extraction of metadata from a crawled page."""
url = test_urls["valid_urls"][0]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
result = await crawler.crawl(url)
assert result["success"] is True
assert result["metadata"]["title"] == "Test Page"
assert result["metadata"]["description"] == "Test description"
@pytest.mark.asyncio
async def test_single_url_crawler_link_extraction(mock_website, test_urls, aiohttp_session):
"""Test extraction of links from a crawled page."""
url = test_urls["valid_urls"][0]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
result = await crawler.crawl(url)
assert result["success"] is True
assert len(result["links"]) >= 6 # Number of links in sample HTML
assert "/page1" in result["links"]
assert "/section1" in result["links"]
assert "/test1" in result["links"]
assert "/test2" in result["links"]
@pytest.mark.asyncio
async def test_single_url_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
"""Test rate limiting functionality."""
url = test_urls["valid_urls"][0]
request_handler = RequestHandler(session=aiohttp_session, rate_limit=1) # 1 request per second
html_parser = HTMLParser()
crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
import time
start_time = time.time()
# Make multiple requests
for _ in range(3):
result = await crawler.crawl(url)
assert result["success"] is True
end_time = time.time()
elapsed_time = end_time - start_time
# Should take at least 2 seconds due to rate limiting
assert elapsed_time >= 2.0
```
--------------------------------------------------------------------------------
/tests/test_crawlers/test_multi_url_crawler.py:
--------------------------------------------------------------------------------
```python
"""
Tests for the MultiURLCrawler class.
"""
import pytest
from docs_scraper.crawlers import MultiURLCrawler
from docs_scraper.utils import RequestHandler, HTMLParser
@pytest.mark.asyncio
async def test_multi_url_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
"""Test successful crawling of multiple URLs."""
urls = test_urls["valid_urls"]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl(urls)
assert len(results) == len(urls)
for result, url in zip(results, urls):
assert result["success"] is True
assert result["url"] == url
assert "content" in result
assert "title" in result["metadata"]
assert "description" in result["metadata"]
assert len(result["links"]) > 0
assert result["status_code"] == 200
assert result["error"] is None
@pytest.mark.asyncio
async def test_multi_url_crawler_mixed_urls(mock_website, test_urls, aiohttp_session):
"""Test crawling a mix of valid and invalid URLs."""
urls = test_urls["valid_urls"][:1] + test_urls["invalid_urls"][:1]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl(urls)
assert len(results) == len(urls)
# Valid URL
assert results[0]["success"] is True
assert results[0]["url"] == urls[0]
assert "content" in results[0]
# Invalid URL
assert results[1]["success"] is False
assert results[1]["url"] == urls[1]
assert results[1]["content"] is None
@pytest.mark.asyncio
async def test_multi_url_crawler_concurrent_limit(mock_website, test_urls, aiohttp_session):
"""Test concurrent request limiting."""
urls = test_urls["valid_urls"] * 2 # Duplicate URLs to have more requests
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MultiURLCrawler(
request_handler=request_handler,
html_parser=html_parser,
concurrent_limit=2
)
import time
start_time = time.time()
results = await crawler.crawl(urls)
end_time = time.time()
elapsed_time = end_time - start_time
assert len(results) == len(urls)
# With concurrent_limit=2, processing 6 URLs should take at least 3 time units
assert elapsed_time >= (len(urls) / 2) * 0.1 # Assuming each request takes ~0.1s
@pytest.mark.asyncio
async def test_multi_url_crawler_empty_urls(mock_website, aiohttp_session):
"""Test crawling with empty URL list."""
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl([])
assert len(results) == 0
@pytest.mark.asyncio
async def test_multi_url_crawler_duplicate_urls(mock_website, test_urls, aiohttp_session):
"""Test crawling with duplicate URLs."""
url = test_urls["valid_urls"][0]
urls = [url, url, url] # Same URL multiple times
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl(urls)
assert len(results) == len(urls)
for result in results:
assert result["success"] is True
assert result["url"] == url
assert result["metadata"]["title"] == "Test Page"
@pytest.mark.asyncio
async def test_multi_url_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
"""Test rate limiting with multiple URLs."""
urls = test_urls["valid_urls"]
request_handler = RequestHandler(session=aiohttp_session, rate_limit=1) # 1 request per second
html_parser = HTMLParser()
crawler = MultiURLCrawler(request_handler=request_handler, html_parser=html_parser)
import time
start_time = time.time()
results = await crawler.crawl(urls)
end_time = time.time()
elapsed_time = end_time - start_time
assert len(results) == len(urls)
# Should take at least (len(urls) - 1) seconds due to rate limiting
assert elapsed_time >= len(urls) - 1
```
--------------------------------------------------------------------------------
/.venv/Include/site/python3.12/greenlet/greenlet.h:
--------------------------------------------------------------------------------
```
/* -*- indent-tabs-mode: nil; tab-width: 4; -*- */
/* Greenlet object interface */
#ifndef Py_GREENLETOBJECT_H
#define Py_GREENLETOBJECT_H
#include <Python.h>
#ifdef __cplusplus
extern "C" {
#endif
/* This is deprecated and undocumented. It does not change. */
#define GREENLET_VERSION "1.0.0"
#ifndef GREENLET_MODULE
#define implementation_ptr_t void*
#endif
typedef struct _greenlet {
PyObject_HEAD
PyObject* weakreflist;
PyObject* dict;
implementation_ptr_t pimpl;
} PyGreenlet;
#define PyGreenlet_Check(op) (op && PyObject_TypeCheck(op, &PyGreenlet_Type))
/* C API functions */
/* Total number of symbols that are exported */
#define PyGreenlet_API_pointers 12
#define PyGreenlet_Type_NUM 0
#define PyExc_GreenletError_NUM 1
#define PyExc_GreenletExit_NUM 2
#define PyGreenlet_New_NUM 3
#define PyGreenlet_GetCurrent_NUM 4
#define PyGreenlet_Throw_NUM 5
#define PyGreenlet_Switch_NUM 6
#define PyGreenlet_SetParent_NUM 7
#define PyGreenlet_MAIN_NUM 8
#define PyGreenlet_STARTED_NUM 9
#define PyGreenlet_ACTIVE_NUM 10
#define PyGreenlet_GET_PARENT_NUM 11
#ifndef GREENLET_MODULE
/* This section is used by modules that uses the greenlet C API */
static void** _PyGreenlet_API = NULL;
# define PyGreenlet_Type \
(*(PyTypeObject*)_PyGreenlet_API[PyGreenlet_Type_NUM])
# define PyExc_GreenletError \
((PyObject*)_PyGreenlet_API[PyExc_GreenletError_NUM])
# define PyExc_GreenletExit \
((PyObject*)_PyGreenlet_API[PyExc_GreenletExit_NUM])
/*
* PyGreenlet_New(PyObject *args)
*
* greenlet.greenlet(run, parent=None)
*/
# define PyGreenlet_New \
(*(PyGreenlet * (*)(PyObject * run, PyGreenlet * parent)) \
_PyGreenlet_API[PyGreenlet_New_NUM])
/*
* PyGreenlet_GetCurrent(void)
*
* greenlet.getcurrent()
*/
# define PyGreenlet_GetCurrent \
(*(PyGreenlet * (*)(void)) _PyGreenlet_API[PyGreenlet_GetCurrent_NUM])
/*
* PyGreenlet_Throw(
* PyGreenlet *greenlet,
* PyObject *typ,
* PyObject *val,
* PyObject *tb)
*
* g.throw(...)
*/
# define PyGreenlet_Throw \
(*(PyObject * (*)(PyGreenlet * self, \
PyObject * typ, \
PyObject * val, \
PyObject * tb)) \
_PyGreenlet_API[PyGreenlet_Throw_NUM])
/*
* PyGreenlet_Switch(PyGreenlet *greenlet, PyObject *args)
*
* g.switch(*args, **kwargs)
*/
# define PyGreenlet_Switch \
(*(PyObject * \
(*)(PyGreenlet * greenlet, PyObject * args, PyObject * kwargs)) \
_PyGreenlet_API[PyGreenlet_Switch_NUM])
/*
* PyGreenlet_SetParent(PyObject *greenlet, PyObject *new_parent)
*
* g.parent = new_parent
*/
# define PyGreenlet_SetParent \
(*(int (*)(PyGreenlet * greenlet, PyGreenlet * nparent)) \
_PyGreenlet_API[PyGreenlet_SetParent_NUM])
/*
* PyGreenlet_GetParent(PyObject* greenlet)
*
* return greenlet.parent;
*
* This could return NULL even if there is no exception active.
* If it does not return NULL, you are responsible for decrementing the
* reference count.
*/
# define PyGreenlet_GetParent \
(*(PyGreenlet* (*)(PyGreenlet*)) \
_PyGreenlet_API[PyGreenlet_GET_PARENT_NUM])
/*
* deprecated, undocumented alias.
*/
# define PyGreenlet_GET_PARENT PyGreenlet_GetParent
# define PyGreenlet_MAIN \
(*(int (*)(PyGreenlet*)) \
_PyGreenlet_API[PyGreenlet_MAIN_NUM])
# define PyGreenlet_STARTED \
(*(int (*)(PyGreenlet*)) \
_PyGreenlet_API[PyGreenlet_STARTED_NUM])
# define PyGreenlet_ACTIVE \
(*(int (*)(PyGreenlet*)) \
_PyGreenlet_API[PyGreenlet_ACTIVE_NUM])
/* Macro that imports greenlet and initializes C API */
/* NOTE: This has actually moved to ``greenlet._greenlet._C_API``, but we
keep the older definition to be sure older code that might have a copy of
the header still works. */
# define PyGreenlet_Import() \
{ \
_PyGreenlet_API = (void**)PyCapsule_Import("greenlet._C_API", 0); \
}
#endif /* GREENLET_MODULE */
#ifdef __cplusplus
}
#endif
#endif /* !Py_GREENLETOBJECT_H */
```
--------------------------------------------------------------------------------
/src/docs_scraper/utils/html_parser.py:
--------------------------------------------------------------------------------
```python
"""
HTML parser module for extracting content and links from HTML documents.
"""
from typing import List, Dict, Any, Optional
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
class HTMLParser:
def __init__(self, base_url: str):
"""
Initialize the HTML parser.
Args:
base_url: Base URL for resolving relative links
"""
self.base_url = base_url
def parse_content(self, html: str) -> Dict[str, Any]:
"""
Parse HTML content and extract useful information.
Args:
html: Raw HTML content
Returns:
Dict containing:
- title: Page title
- description: Meta description
- text_content: Main text content
- links: List of links found
- headers: List of headers found
"""
soup = BeautifulSoup(html, 'lxml')
# Extract title
title = soup.title.string if soup.title else None
# Extract meta description
meta_desc = None
meta_tag = soup.find('meta', attrs={'name': 'description'})
if meta_tag:
meta_desc = meta_tag.get('content')
# Extract main content (remove script, style, etc.)
for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
tag.decompose()
# Get text content
text_content = ' '.join(soup.stripped_strings)
# Extract headers
headers = []
for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
headers.append({
'level': int(tag.name[1]),
'text': tag.get_text(strip=True)
})
# Extract links
links = self._extract_links(soup)
return {
'title': title,
'description': meta_desc,
'text_content': text_content,
'links': links,
'headers': headers
}
def parse_menu(self, html: str, menu_selector: str) -> List[Dict[str, Any]]:
"""
Parse navigation menu from HTML using a CSS selector.
Args:
html: Raw HTML content
menu_selector: CSS selector for the menu element
Returns:
List of menu items with their structure
"""
soup = BeautifulSoup(html, 'lxml')
menu = soup.select_one(menu_selector)
if not menu:
return []
return self._extract_menu_items(menu)
def _extract_links(self, soup: BeautifulSoup) -> List[Dict[str, str]]:
"""Extract and normalize all links from the document."""
links = []
for a in soup.find_all('a', href=True):
href = a['href']
text = a.get_text(strip=True)
# Skip empty or javascript links
if not href or href.startswith(('javascript:', '#')):
continue
# Resolve relative URLs
absolute_url = urljoin(self.base_url, href)
# Only include links to the same domain
if urlparse(absolute_url).netloc == urlparse(self.base_url).netloc:
links.append({
'url': absolute_url,
'text': text
})
return links
def _extract_menu_items(self, element: BeautifulSoup) -> List[Dict[str, Any]]:
"""Recursively extract menu structure."""
items = []
for item in element.find_all(['li', 'a'], recursive=False):
if item.name == 'a':
# Single link item
href = item.get('href')
if href and not href.startswith(('javascript:', '#')):
items.append({
'type': 'link',
'url': urljoin(self.base_url, href),
'text': item.get_text(strip=True)
})
else:
# Potentially nested menu item
link = item.find('a')
if link and link.get('href'):
menu_item = {
'type': 'menu',
'text': link.get_text(strip=True),
'url': urljoin(self.base_url, link['href']),
'children': []
}
# Look for nested lists
nested = item.find(['ul', 'ol'])
if nested:
menu_item['children'] = self._extract_menu_items(nested)
items.append(menu_item)
return items
```
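A minimal usage sketch for `HTMLParser` as defined above, on a small hypothetical document. It assumes `lxml` (a declared dependency) is installed; the `nav.menu ul` selector is chosen because `_extract_menu_items` walks the direct `li` children of the selected element.
```python
# Illustrative only: a tiny hypothetical page run through the parser above.
from docs_scraper.utils import HTMLParser

html = """
<html>
  <head><title>Example</title></head>
  <body>
    <nav class="menu"><ul><li><a href="/guide">Guide</a></li></ul></nav>
    <main><h1>Example</h1><p>Hello <a href="/more">more</a>.</p></main>
  </body>
</html>
"""

parser = HTMLParser(base_url="https://docs.example.com")
page = parser.parse_content(html)
print(page["title"])       # "Example"
print(page["headers"])     # [{'level': 1, 'text': 'Example'}]
print(page["links"])       # same-domain links resolved against base_url
print(parser.parse_menu(html, "nav.menu ul"))  # [{'type': 'menu', 'text': 'Guide', ...}]
```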
--------------------------------------------------------------------------------
/tests/test_utils/test_request_handler.py:
--------------------------------------------------------------------------------
```python
"""
Tests for the RequestHandler class.
"""
import asyncio
import pytest
import aiohttp
import time
from docs_scraper.utils import RequestHandler
@pytest.mark.asyncio
async def test_request_handler_successful_get(mock_website, test_urls, aiohttp_session):
"""Test successful GET request."""
url = test_urls["valid_urls"][0]
handler = RequestHandler(session=aiohttp_session)
response = await handler.get(url)
assert response.status == 200
assert "<!DOCTYPE html>" in await response.text()
@pytest.mark.asyncio
async def test_request_handler_invalid_url(mock_website, test_urls, aiohttp_session):
"""Test handling of invalid URL."""
url = test_urls["invalid_urls"][0]
handler = RequestHandler(session=aiohttp_session)
with pytest.raises(aiohttp.ClientError):
await handler.get(url)
@pytest.mark.asyncio
async def test_request_handler_nonexistent_url(mock_website, test_urls, aiohttp_session):
"""Test handling of nonexistent URL."""
url = test_urls["invalid_urls"][2]
handler = RequestHandler(session=aiohttp_session)
with pytest.raises(aiohttp.ClientError):
await handler.get(url)
@pytest.mark.asyncio
async def test_request_handler_rate_limiting(mock_website, test_urls, aiohttp_session):
"""Test rate limiting functionality."""
url = test_urls["valid_urls"][0]
rate_limit = 2 # 2 requests per second
handler = RequestHandler(session=aiohttp_session, rate_limit=rate_limit)
start_time = time.time()
# Make multiple requests
for _ in range(3):
response = await handler.get(url)
assert response.status == 200
end_time = time.time()
elapsed_time = end_time - start_time
# Should take at least 1 second due to rate limiting
assert elapsed_time >= 1.0
@pytest.mark.asyncio
async def test_request_handler_custom_headers(mock_website, test_urls, aiohttp_session):
"""Test custom headers in requests."""
url = test_urls["valid_urls"][0]
custom_headers = {
"User-Agent": "Custom Bot 1.0",
"Accept-Language": "en-US,en;q=0.9"
}
handler = RequestHandler(session=aiohttp_session, headers=custom_headers)
response = await handler.get(url)
assert response.status == 200
# Headers should be merged with default headers
assert handler.headers["User-Agent"] == "Custom Bot 1.0"
assert handler.headers["Accept-Language"] == "en-US,en;q=0.9"
@pytest.mark.asyncio
async def test_request_handler_timeout(mock_website, test_urls, aiohttp_session):
"""Test request timeout handling."""
url = test_urls["valid_urls"][0]
handler = RequestHandler(session=aiohttp_session, timeout=0.001) # Very short timeout
# Mock a delayed response
mock_website.get(url, status=200, body="Delayed response", delay=0.1)
with pytest.raises(asyncio.TimeoutError):
await handler.get(url)
@pytest.mark.asyncio
async def test_request_handler_retry(mock_website, test_urls, aiohttp_session):
"""Test request retry functionality."""
url = test_urls["valid_urls"][0]
handler = RequestHandler(session=aiohttp_session, max_retries=3)
# Mock temporary failures followed by success
mock_website.get(url, status=500) # First attempt fails
mock_website.get(url, status=500) # Second attempt fails
mock_website.get(url, status=200, body="Success") # Third attempt succeeds
response = await handler.get(url)
assert response.status == 200
assert await response.text() == "Success"
@pytest.mark.asyncio
async def test_request_handler_max_retries_exceeded(mock_website, test_urls, aiohttp_session):
"""Test behavior when max retries are exceeded."""
url = test_urls["valid_urls"][0]
handler = RequestHandler(session=aiohttp_session, max_retries=2)
# Mock consistent failures
mock_website.get(url, status=500)
mock_website.get(url, status=500)
mock_website.get(url, status=500)
with pytest.raises(aiohttp.ClientError):
await handler.get(url)
@pytest.mark.asyncio
async def test_request_handler_session_management(mock_website, test_urls):
"""Test session management."""
url = test_urls["valid_urls"][0]
# Test with context manager
async with aiohttp.ClientSession() as session:
handler = RequestHandler(session=session)
response = await handler.get(url)
assert response.status == 200
# Test with closed session
with pytest.raises(aiohttp.ClientError):
await handler.get(url)
@pytest.mark.asyncio
async def test_request_handler_concurrent_requests(mock_website, test_urls, aiohttp_session):
"""Test handling of concurrent requests."""
urls = test_urls["valid_urls"]
handler = RequestHandler(session=aiohttp_session)
# Make concurrent requests
tasks = [handler.get(url) for url in urls]
responses = await asyncio.gather(*tasks)
assert all(response.status == 200 for response in responses)
```
--------------------------------------------------------------------------------
/tests/test_crawlers/test_menu_crawler.py:
--------------------------------------------------------------------------------
```python
"""
Tests for the MenuCrawler class.
"""
import pytest
from docs_scraper.crawlers import MenuCrawler
from docs_scraper.utils import RequestHandler, HTMLParser
@pytest.mark.asyncio
async def test_menu_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
"""Test successful crawling of menu links."""
url = test_urls["valid_urls"][0]
menu_selector = test_urls["menu_selector"]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl(url, menu_selector)
assert len(results) >= 4 # Number of menu links in sample HTML
for result in results:
assert result["success"] is True
assert result["url"].startswith("https://example.com")
assert "content" in result
assert "title" in result["metadata"]
assert "description" in result["metadata"]
assert len(result["links"]) > 0
assert result["status_code"] == 200
assert result["error"] is None
@pytest.mark.asyncio
async def test_menu_crawler_invalid_url(mock_website, test_urls, aiohttp_session):
"""Test crawling with an invalid URL."""
url = test_urls["invalid_urls"][0]
menu_selector = test_urls["menu_selector"]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl(url, menu_selector)
assert len(results) == 1
assert results[0]["success"] is False
assert results[0]["url"] == url
assert results[0]["error"] is not None
@pytest.mark.asyncio
async def test_menu_crawler_invalid_selector(mock_website, test_urls, aiohttp_session):
"""Test crawling with an invalid CSS selector."""
url = test_urls["valid_urls"][0]
invalid_selector = "#nonexistent-menu"
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl(url, invalid_selector)
assert len(results) == 1
assert results[0]["success"] is False
assert results[0]["url"] == url
assert "No menu links found" in results[0]["error"]
@pytest.mark.asyncio
async def test_menu_crawler_nested_menu(mock_website, test_urls, aiohttp_session):
"""Test crawling nested menu structure."""
url = test_urls["valid_urls"][0]
menu_selector = test_urls["menu_selector"]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MenuCrawler(
request_handler=request_handler,
html_parser=html_parser,
max_depth=2 # Crawl up to 2 levels deep
)
results = await crawler.crawl(url, menu_selector)
# Check if nested menu items were crawled
urls = {result["url"] for result in results}
assert "https://example.com/section1" in urls
assert "https://example.com/section1/page1" in urls
assert "https://example.com/section1/page2" in urls
@pytest.mark.asyncio
async def test_menu_crawler_concurrent_limit(mock_website, test_urls, aiohttp_session):
"""Test concurrent request limiting for menu crawling."""
url = test_urls["valid_urls"][0]
menu_selector = test_urls["menu_selector"]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MenuCrawler(
request_handler=request_handler,
html_parser=html_parser,
concurrent_limit=1 # Process one URL at a time
)
import time
start_time = time.time()
results = await crawler.crawl(url, menu_selector)
end_time = time.time()
elapsed_time = end_time - start_time
assert len(results) >= 4
# With concurrent_limit=1, processing should take at least 0.4 seconds
assert elapsed_time >= 0.4
@pytest.mark.asyncio
async def test_menu_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
"""Test rate limiting for menu crawling."""
url = test_urls["valid_urls"][0]
menu_selector = test_urls["menu_selector"]
request_handler = RequestHandler(session=aiohttp_session, rate_limit=1) # 1 request per second
html_parser = HTMLParser()
crawler = MenuCrawler(request_handler=request_handler, html_parser=html_parser)
import time
start_time = time.time()
results = await crawler.crawl(url, menu_selector)
end_time = time.time()
elapsed_time = end_time - start_time
assert len(results) >= 4
# Should take at least 3 seconds due to rate limiting
assert elapsed_time >= 3.0
@pytest.mark.asyncio
async def test_menu_crawler_max_depth(mock_website, test_urls, aiohttp_session):
"""Test max depth limitation for menu crawling."""
url = test_urls["valid_urls"][0]
menu_selector = test_urls["menu_selector"]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser()
crawler = MenuCrawler(
request_handler=request_handler,
html_parser=html_parser,
max_depth=1 # Only crawl top-level menu items
)
results = await crawler.crawl(url, menu_selector)
# Should only include top-level menu items
urls = {result["url"] for result in results}
assert "https://example.com/section1" in urls
assert "https://example.com/page1" in urls
assert "https://example.com/section1/page1" not in urls # Nested item should not be included
```
--------------------------------------------------------------------------------
/src/docs_scraper/utils/request_handler.py:
--------------------------------------------------------------------------------
```python
"""
Request handler module for managing HTTP requests with rate limiting and error handling.
"""
import asyncio
import logging
from typing import Optional, Dict, Any
import aiohttp
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin
logger = logging.getLogger(__name__)
class RequestHandler:
def __init__(
self,
rate_limit: float = 1.0,
concurrent_limit: int = 5,
user_agent: str = "DocsScraperBot/1.0",
timeout: int = 30,
session: Optional[aiohttp.ClientSession] = None
):
"""
Initialize the request handler.
Args:
rate_limit: Minimum time between requests to the same domain (in seconds)
concurrent_limit: Maximum number of concurrent requests
user_agent: User agent string to use for requests
timeout: Request timeout in seconds
session: Optional aiohttp.ClientSession to use. If not provided, one will be created.
"""
self.rate_limit = rate_limit
self.concurrent_limit = concurrent_limit
self.user_agent = user_agent
self.timeout = timeout
self._provided_session = session
self._domain_locks: Dict[str, asyncio.Lock] = {}
self._domain_last_request: Dict[str, float] = {}
self._semaphore = asyncio.Semaphore(concurrent_limit)
self._session: Optional[aiohttp.ClientSession] = None
self._robot_parsers: Dict[str, RobotFileParser] = {}
async def __aenter__(self):
"""Set up the aiohttp session."""
if self._provided_session:
self._session = self._provided_session
else:
self._session = aiohttp.ClientSession(
headers={"User-Agent": self.user_agent},
timeout=aiohttp.ClientTimeout(total=self.timeout)
)
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
"""Clean up the aiohttp session."""
if self._session and not self._provided_session:
await self._session.close()
async def _check_robots_txt(self, url: str) -> bool:
"""
Check if the URL is allowed by robots.txt.
Args:
url: URL to check
Returns:
bool: True if allowed, False if disallowed
"""
from urllib.parse import urlparse
parsed = urlparse(url)
domain = f"{parsed.scheme}://{parsed.netloc}"
if domain not in self._robot_parsers:
parser = RobotFileParser()
parser.set_url(urljoin(domain, "/robots.txt"))
try:
async with self._session.get(parser.url) as response:
content = await response.text()
parser.parse(content.splitlines())
except Exception as e:
logger.warning(f"Failed to fetch robots.txt for {domain}: {e}")
return True
self._robot_parsers[domain] = parser
return self._robot_parsers[domain].can_fetch(self.user_agent, url)
async def get(self, url: str, **kwargs) -> Dict[str, Any]:
"""
Make a GET request with rate limiting and error handling.
Args:
url: URL to request
**kwargs: Additional arguments to pass to aiohttp.ClientSession.get()
Returns:
Dict containing:
- success: bool indicating if request was successful
- status: HTTP status code if available
- content: Response content if successful
- error: Error message if unsuccessful
"""
from urllib.parse import urlparse
parsed = urlparse(url)
domain = parsed.netloc
# Get or create domain lock
if domain not in self._domain_locks:
self._domain_locks[domain] = asyncio.Lock()
# Check robots.txt
if not await self._check_robots_txt(url):
return {
"success": False,
"status": None,
"error": "URL disallowed by robots.txt",
"content": None
}
try:
async with self._semaphore: # Limit concurrent requests
async with self._domain_locks[domain]: # Lock per domain
# Rate limiting
if domain in self._domain_last_request:
elapsed = asyncio.get_event_loop().time() - self._domain_last_request[domain]
if elapsed < self.rate_limit:
await asyncio.sleep(self.rate_limit - elapsed)
self._domain_last_request[domain] = asyncio.get_event_loop().time()
# Make request
async with self._session.get(url, **kwargs) as response:
content = await response.text()
return {
"success": response.status < 400,
"status": response.status,
"content": content,
"error": None if response.status < 400 else f"HTTP {response.status}"
}
except asyncio.TimeoutError:
return {
"success": False,
"status": None,
"error": "Request timed out",
"content": None
}
except Exception as e:
return {
"success": False,
"status": None,
"error": str(e),
"content": None
}
```
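A minimal sketch of using `RequestHandler` on its own; the URLs are placeholders. It follows the async-context-manager pattern implemented by `__aenter__`/`__aexit__`, so the handler owns and closes its `aiohttp` session.
```python
# Illustrative only: placeholder URLs, fetched through the handler's own
# session, rate-limited per domain and capped by concurrent_limit.
import asyncio

from docs_scraper.utils import RequestHandler


async def fetch_all() -> None:
    urls = [
        "https://docs.example.com/",
        "https://docs.example.com/api",
    ]
    async with RequestHandler(rate_limit=1.0, concurrent_limit=2) as handler:
        results = await asyncio.gather(*(handler.get(url) for url in urls))
    for url, result in zip(urls, results):
        if result["success"]:
            print(url, result["status"], len(result["content"]))
        else:
            print(url, "failed:", result["error"])


if __name__ == "__main__":
    asyncio.run(fetch_all())
```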
--------------------------------------------------------------------------------
/tests/test_crawlers/test_sitemap_crawler.py:
--------------------------------------------------------------------------------
```python
"""
Tests for the SitemapCrawler class.
"""
import pytest
from docs_scraper.crawlers import SitemapCrawler
from docs_scraper.utils import RequestHandler, HTMLParser
@pytest.mark.asyncio
async def test_sitemap_crawler_successful_crawl(mock_website, test_urls, aiohttp_session):
"""Test successful crawling of a sitemap."""
sitemap_url = test_urls["sitemap_url"]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser(base_url=test_urls["base_url"])
crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl(sitemap_url)
assert len(results) == 3 # Number of URLs in sample sitemap
for result in results:
assert result["success"] is True
assert result["url"].startswith("https://example.com")
assert "content" in result
assert "title" in result["metadata"]
assert "description" in result["metadata"]
assert len(result["links"]) > 0
assert result["status_code"] == 200
assert result["error"] is None
@pytest.mark.asyncio
async def test_sitemap_crawler_invalid_sitemap_url(mock_website, aiohttp_session):
"""Test crawling with an invalid sitemap URL."""
sitemap_url = "https://nonexistent.example.com/sitemap.xml"
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser(base_url="https://nonexistent.example.com")
crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl(sitemap_url)
assert len(results) == 1
assert results[0]["success"] is False
assert results[0]["url"] == sitemap_url
assert results[0]["error"] is not None
@pytest.mark.asyncio
async def test_sitemap_crawler_invalid_xml(mock_website, aiohttp_session):
"""Test crawling with invalid XML content."""
sitemap_url = "https://example.com/invalid-sitemap.xml"
mock_website.get(sitemap_url, status=200, body="<invalid>xml</invalid>")
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser(base_url="https://example.com")
crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl(sitemap_url)
assert len(results) == 1
assert results[0]["success"] is False
assert results[0]["url"] == sitemap_url
assert "Invalid sitemap format" in results[0]["error"]
@pytest.mark.asyncio
async def test_sitemap_crawler_concurrent_limit(mock_website, test_urls, aiohttp_session):
"""Test concurrent request limiting for sitemap crawling."""
sitemap_url = test_urls["sitemap_url"]
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser(base_url=test_urls["base_url"])
crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
import time
start_time = time.time()
results = await crawler.crawl(sitemap_url)
end_time = time.time()
elapsed_time = end_time - start_time
assert len(results) == 3
# With concurrent_limit=1, processing should take at least 0.3 seconds
assert elapsed_time >= 0.3
@pytest.mark.asyncio
async def test_sitemap_crawler_rate_limiting(mock_website, test_urls, aiohttp_session):
"""Test rate limiting for sitemap crawling."""
sitemap_url = test_urls["sitemap_url"]
request_handler = RequestHandler(session=aiohttp_session, rate_limit=1) # 1 request per second
html_parser = HTMLParser(base_url=test_urls["base_url"])
crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
import time
start_time = time.time()
results = await crawler.crawl(sitemap_url)
end_time = time.time()
elapsed_time = end_time - start_time
assert len(results) == 3
# Rate limiting (sitemap fetch plus three pages) should take roughly 3 seconds; assert a conservative lower bound
assert elapsed_time >= 2.0
@pytest.mark.asyncio
async def test_sitemap_crawler_nested_sitemaps(mock_website, test_urls, aiohttp_session):
"""Test crawling nested sitemaps."""
# Create a sitemap index
sitemap_index = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap1.xml</loc>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap2.xml</loc>
</sitemap>
</sitemapindex>
"""
# Create sub-sitemaps
sitemap1 = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/page1</loc>
</url>
</urlset>
"""
sitemap2 = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/page2</loc>
</url>
</urlset>
"""
mock_website.get("https://example.com/sitemap-index.xml", status=200, body=sitemap_index)
mock_website.get("https://example.com/sitemap1.xml", status=200, body=sitemap1)
mock_website.get("https://example.com/sitemap2.xml", status=200, body=sitemap2)
request_handler = RequestHandler(session=aiohttp_session)
html_parser = HTMLParser(base_url="https://example.com")
crawler = SitemapCrawler(request_handler=request_handler, html_parser=html_parser)
results = await crawler.crawl("https://example.com/sitemap-index.xml")
assert len(results) == 2 # Two pages from two sub-sitemaps
urls = {result["url"] for result in results}
assert "https://example.com/page1" in urls
assert "https://example.com/page2" in urls
```
--------------------------------------------------------------------------------
/tests/test_utils/test_html_parser.py:
--------------------------------------------------------------------------------
```python
"""
Tests for the HTMLParser class.
"""
import pytest
from bs4 import BeautifulSoup
from docs_scraper.utils import HTMLParser
@pytest.fixture
def html_parser():
"""Fixture for HTMLParser instance."""
return HTMLParser()
@pytest.fixture
def sample_html():
"""Sample HTML content for testing."""
return """
<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
<meta name="description" content="Test description">
<meta name="keywords" content="test, keywords">
<meta property="og:title" content="OG Title">
<meta property="og:description" content="OG Description">
</head>
<body>
<nav class="menu">
<ul>
<li><a href="/page1">Page 1</a></li>
<li>
<a href="/section1">Section 1</a>
<ul>
<li><a href="/section1/page1">Section 1.1</a></li>
<li><a href="/section1/page2">Section 1.2</a></li>
</ul>
</li>
</ul>
</nav>
<main>
<h1>Welcome</h1>
<p>Test content with a <a href="/test1">link</a> and another <a href="/test2">link</a>.</p>
<div class="content">
<p>More content</p>
<a href="mailto:[email protected]">Email</a>
<a href="tel:+1234567890">Phone</a>
<a href="javascript:void(0)">JavaScript</a>
<a href="#section">Hash</a>
<a href="ftp://example.com">FTP</a>
</div>
</main>
</body>
</html>
"""
def test_parse_html(html_parser, sample_html):
"""Test HTML parsing."""
soup = html_parser.parse_html(sample_html)
assert isinstance(soup, BeautifulSoup)
assert soup.title.string == "Test Page"
def test_extract_metadata(html_parser, sample_html):
"""Test metadata extraction."""
soup = html_parser.parse_html(sample_html)
metadata = html_parser.extract_metadata(soup)
assert metadata["title"] == "Test Page"
assert metadata["description"] == "Test description"
assert metadata["keywords"] == "test, keywords"
assert metadata["og:title"] == "OG Title"
assert metadata["og:description"] == "OG Description"
def test_extract_links(html_parser, sample_html):
"""Test link extraction."""
soup = html_parser.parse_html(sample_html)
links = html_parser.extract_links(soup)
# Should only include valid HTTP(S) links
assert "/page1" in links
assert "/section1" in links
assert "/section1/page1" in links
assert "/section1/page2" in links
assert "/test1" in links
assert "/test2" in links
# Should not include invalid or special links
assert "mailto:[email protected]" not in links
assert "tel:+1234567890" not in links
assert "javascript:void(0)" not in links
assert "#section" not in links
assert "ftp://example.com" not in links
def test_extract_menu_links(html_parser, sample_html):
"""Test menu link extraction."""
soup = html_parser.parse_html(sample_html)
menu_links = html_parser.extract_menu_links(soup, "nav.menu")
assert len(menu_links) == 4
assert "/page1" in menu_links
assert "/section1" in menu_links
assert "/section1/page1" in menu_links
assert "/section1/page2" in menu_links
def test_extract_menu_links_invalid_selector(html_parser, sample_html):
"""Test menu link extraction with invalid selector."""
soup = html_parser.parse_html(sample_html)
menu_links = html_parser.extract_menu_links(soup, "#nonexistent")
assert len(menu_links) == 0
def test_extract_text_content(html_parser, sample_html):
"""Test text content extraction."""
soup = html_parser.parse_html(sample_html)
content = html_parser.extract_text_content(soup)
assert "Welcome" in content
assert "Test content" in content
assert "More content" in content
# Should not include navigation text
assert "Section 1.1" not in content
def test_clean_html(html_parser):
"""Test HTML cleaning."""
dirty_html = """
<html>
<body>
<script>alert('test');</script>
<style>body { color: red; }</style>
<p>Test content</p>
<!-- Comment -->
<iframe src="test.html"></iframe>
</body>
</html>
"""
clean_html = html_parser.clean_html(dirty_html)
soup = html_parser.parse_html(clean_html)
assert len(soup.find_all("script")) == 0
assert len(soup.find_all("style")) == 0
assert len(soup.find_all("iframe")) == 0
assert "Test content" in soup.get_text()
def test_normalize_url(html_parser):
"""Test URL normalization."""
base_url = "https://example.com/docs"
test_cases = [
("/test", "https://example.com/test"),
("test", "https://example.com/docs/test"),
("../test", "https://example.com/test"),
("https://other.com/test", "https://other.com/test"),
("//other.com/test", "https://other.com/test"),
]
for input_url, expected_url in test_cases:
assert html_parser.normalize_url(input_url, base_url) == expected_url
def test_is_valid_link(html_parser):
"""Test link validation."""
valid_links = [
"https://example.com",
"http://example.com",
"/absolute/path",
"relative/path",
"../parent/path",
"./current/path"
]
invalid_links = [
"mailto:[email protected]",
"tel:+1234567890",
"javascript:void(0)",
"#hash",
"ftp://example.com",
""
]
for link in valid_links:
assert html_parser.is_valid_link(link) is True
for link in invalid_links:
assert html_parser.is_valid_link(link) is False
def test_extract_structured_data(html_parser):
"""Test structured data extraction."""
html = """
<html>
<head>
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Test Article",
"author": {
"@type": "Person",
"name": "John Doe"
}
}
</script>
</head>
<body>
<p>Test content</p>
</body>
</html>
"""
soup = html_parser.parse_html(html)
structured_data = html_parser.extract_structured_data(soup)
assert len(structured_data) == 1
assert structured_data[0]["@type"] == "Article"
assert structured_data[0]["headline"] == "Test Article"
assert structured_data[0]["author"]["name"] == "John Doe"
```
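These tests double as documentation for the `HTMLParser` surface: `parse_html`, `clean_html`, `extract_metadata`, `extract_links`, `normalize_url`, and `is_valid_link`. A minimal sketch of combining those same calls outside pytest, assuming only the interface exercised above and a hypothetical HTML snippet:

```python
from docs_scraper.utils import HTMLParser

# Hypothetical page; any HTML string is handled the same way.
page = "<html><head><title>Demo</title></head><body><a href='/a'>A</a></body></html>"

parser = HTMLParser()
soup = parser.parse_html(parser.clean_html(page))  # strip scripts/styles/iframes, then parse
metadata = parser.extract_metadata(soup)           # includes the <title> text under "title"
links = [
    parser.normalize_url(href, "https://example.com/docs")
    for href in parser.extract_links(soup)
    if parser.is_valid_link(href)
]
print(metadata["title"], links)
```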
--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/single_url_crawler.py:
--------------------------------------------------------------------------------
```python
import os
import sys
import asyncio
import re
import argparse
from datetime import datetime
from termcolor import colored
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
from ..utils import RequestHandler, HTMLParser
from typing import Dict, Any, Optional
class SingleURLCrawler:
"""A crawler that processes a single URL."""
def __init__(self, request_handler: RequestHandler, html_parser: HTMLParser):
"""
Initialize the crawler.
Args:
request_handler: Handler for making HTTP requests
html_parser: Parser for processing HTML content
"""
self.request_handler = request_handler
self.html_parser = html_parser
async def crawl(self, url: str) -> Dict[str, Any]:
"""
Crawl a single URL and extract its content.
Args:
url: The URL to crawl
Returns:
Dict containing:
- success: Whether the crawl was successful
- url: The URL that was crawled
- content: The extracted content (if successful)
- metadata: Additional metadata about the page
- links: Links found on the page
- status_code: HTTP status code
- error: Error message (if unsuccessful)
"""
try:
response = await self.request_handler.get(url)
if not response["success"]:
return {
"success": False,
"url": url,
"content": None,
"metadata": {},
"links": [],
"status_code": response.get("status"),
"error": response.get("error", "Unknown error")
}
html_content = response["content"]
parsed_content = self.html_parser.parse_content(html_content)
return {
"success": True,
"url": url,
"content": parsed_content["text_content"],
"metadata": {
"title": parsed_content["title"],
"description": parsed_content["description"]
},
"links": parsed_content["links"],
"status_code": response["status"],
"error": None
}
except Exception as e:
return {
"success": False,
"url": url,
"content": None,
"metadata": {},
"links": [],
"status_code": None,
"error": str(e)
}
def get_filename_prefix(url: str) -> str:
"""
Generate a filename prefix from a URL including path components.
Examples:
- https://docs.literalai.com/page -> literalai_docs_page
- https://literalai.com/docs/page -> literalai_docs_page
- https://api.example.com/path/to/page -> example_api_path_to_page
Args:
url (str): The URL to process
Returns:
str: Generated filename prefix
"""
# Remove protocol and split URL parts
clean_url = url.split('://')[1]
url_parts = clean_url.split('/')
# Get domain parts
domain_parts = url_parts[0].split('.')
# Extract main domain name (ignoring TLD)
main_domain = domain_parts[-2]
# Start building the prefix with domain
prefix_parts = [main_domain]
# Add subdomain if exists
if len(domain_parts) > 2:
subdomain = domain_parts[0]
if subdomain != main_domain:
prefix_parts.append(subdomain)
# Add all path segments
if len(url_parts) > 1:
path_segments = [segment for segment in url_parts[1:] if segment]
for segment in path_segments:
# Clean up segment (remove special characters, convert to lowercase)
clean_segment = re.sub(r'[^a-zA-Z0-9]', '', segment.lower())
if clean_segment and clean_segment != main_domain:
prefix_parts.append(clean_segment)
# Join all parts with underscore
return '_'.join(prefix_parts)
def process_markdown_content(content: str, url: str) -> str:
"""Process markdown content to start from first H1 and add URL as H2"""
# Find the first H1 tag
h1_match = re.search(r'^# .+$', content, re.MULTILINE)
if not h1_match:
# If no H1 found, return original content with URL as H1
return f"# No Title Found\n\n## Source\n{url}\n\n{content}"
# Get the content starting from the first H1
content_from_h1 = content[h1_match.start():]
# Remove "Was this page helpful?" section and everything after it
helpful_patterns = [
r'^#+\s*Was this page helpful\?.*$', # Matches any heading level with this text
r'^Was this page helpful\?.*$', # Matches the text without heading
r'^#+\s*Was this helpful\?.*$', # Matches any heading level with shorter text
r'^Was this helpful\?.*$' # Matches shorter text without heading
]
for pattern in helpful_patterns:
parts = re.split(pattern, content_from_h1, flags=re.MULTILINE | re.IGNORECASE)
if len(parts) > 1:
content_from_h1 = parts[0].strip()
break
# Insert URL as H2 after the H1
lines = content_from_h1.split('\n')
h1_line = lines[0]
rest_of_content = '\n'.join(lines[1:]).strip()
return f"{h1_line}\n\n## Source\n{url}\n\n{rest_of_content}"
def save_markdown_content(content: str, url: str) -> str:
"""Save markdown content to a file"""
try:
# Generate filename prefix from URL
filename_prefix = get_filename_prefix(url)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"{filename_prefix}_{timestamp}.md"
filepath = os.path.join("scraped_docs", filename)
# Create scraped_docs directory if it doesn't exist
os.makedirs("scraped_docs", exist_ok=True)
processed_content = process_markdown_content(content, url)
with open(filepath, "w", encoding="utf-8") as f:
f.write(processed_content)
print(colored(f"\n✓ Markdown content saved to: {filepath}", "green"))
return filepath
except Exception as e:
print(colored(f"\n✗ Error saving markdown content: {str(e)}", "red"))
return None
async def main():
# Set up argument parser
parser = argparse.ArgumentParser(description='Crawl a single URL and generate markdown documentation')
parser.add_argument('url', type=str, help='Target documentation URL to crawl')
args = parser.parse_args()
try:
print(colored("\n=== Starting Single URL Crawl ===", "cyan"))
print(colored(f"\nCrawling URL: {args.url}", "yellow"))
browser_config = BrowserConfig(headless=True, verbose=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
)
)
result = await crawler.arun(
url=args.url,
config=crawler_config
)
if result.success:
print(colored("\n✓ Successfully crawled URL", "green"))
print(colored(f"Content length: {len(result.markdown.raw_markdown)} characters", "cyan"))
save_markdown_content(result.markdown.raw_markdown, args.url)
else:
print(colored(f"\n✗ Failed to crawl URL: {result.error_message}", "red"))
except Exception as e:
print(colored(f"\n✗ Error during crawl: {str(e)}", "red"))
sys.exit(1)
if __name__ == "__main__":
asyncio.run(main())
```
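For programmatic use (the pattern `server.py` follows), `SingleURLCrawler` is constructed with a `RequestHandler` and an `HTMLParser`, the handler is entered as an async context manager, and `crawl()` returns the result dict described in its docstring. A short sketch, assuming the constructor arguments used elsewhere in this repo and a placeholder URL:

```python
import asyncio

from docs_scraper.crawlers.single_url_crawler import SingleURLCrawler
from docs_scraper.utils import HTMLParser, RequestHandler

async def demo() -> None:
    url = "https://docs.example.com/page"  # placeholder URL
    request_handler = RequestHandler(rate_limit=1.0)
    html_parser = HTMLParser(base_url=url)
    crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
    # The handler owns the HTTP session, so enter it before crawling.
    async with request_handler:
        result = await crawler.crawl(url)
    if result["success"]:
        print(result["metadata"]["title"], len(result["content"] or ""))
    else:
        print("crawl failed:", result["error"])

if __name__ == "__main__":
    asyncio.run(demo())
```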
--------------------------------------------------------------------------------
/src/docs_scraper/server.py:
--------------------------------------------------------------------------------
```python
"""
MCP server implementation for web crawling and documentation scraping.
"""
import asyncio
import logging
from typing import List, Dict, Any, Optional
from pydantic import BaseModel, Field, HttpUrl
from mcp.server.fastmcp import FastMCP
# Import the crawlers with relative imports
# This helps prevent circular import issues
from .crawlers.single_url_crawler import SingleURLCrawler
from .crawlers.multi_url_crawler import MultiURLCrawler
from .crawlers.sitemap_crawler import SitemapCrawler
from .crawlers.menu_crawler import MenuCrawler
# Import utility classes
from .utils import RequestHandler, HTMLParser
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Create MCP server
mcp = FastMCP(
name="DocsScraperMCP",
version="0.1.0"
)
# Input validation models
class SingleUrlInput(BaseModel):
url: HttpUrl = Field(..., description="Target URL to crawl")
depth: int = Field(0, ge=0, description="How many levels deep to follow links")
exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests (seconds)")
class MultiUrlInput(BaseModel):
urls: List[HttpUrl] = Field(..., min_items=1, description="List of URLs to crawl")
concurrent_limit: int = Field(5, gt=0, description="Maximum number of concurrent requests")
exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests to the same domain (seconds)")
class SitemapInput(BaseModel):
base_url: HttpUrl = Field(..., description="Base URL of the website")
sitemap_url: Optional[HttpUrl] = Field(None, description="Optional explicit sitemap URL")
concurrent_limit: int = Field(5, gt=0, description="Maximum number of concurrent requests")
exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests (seconds)")
class MenuInput(BaseModel):
base_url: HttpUrl = Field(..., description="Base URL of the website")
menu_selector: str = Field(..., min_length=1, description="CSS selector for the navigation menu element")
concurrent_limit: int = Field(5, gt=0, description="Maximum number of concurrent requests")
exclusion_patterns: Optional[List[str]] = Field(None, description="List of regex patterns for URLs to exclude")
rate_limit: float = Field(1.0, gt=0, description="Minimum time between requests (seconds)")
@mcp.tool()
async def single_url_crawler(
url: str,
depth: int = 0,
exclusion_patterns: Optional[List[str]] = None,
rate_limit: float = 1.0
) -> Dict[str, Any]:
"""
Crawl a single URL and optionally follow links up to a specified depth.
Args:
url: Target URL to crawl
depth: How many levels deep to follow links (0 means only the target URL)
exclusion_patterns: List of regex patterns for URLs to exclude
rate_limit: Minimum time between requests (seconds)
Returns:
Dict containing crawled content and statistics
"""
try:
# Validate input
input_data = SingleUrlInput(
url=url,
depth=depth,
exclusion_patterns=exclusion_patterns,
rate_limit=rate_limit
)
# Create required utility instances
request_handler = RequestHandler(rate_limit=input_data.rate_limit)
html_parser = HTMLParser(base_url=str(input_data.url))
# Create the crawler with the proper parameters
crawler = SingleURLCrawler(request_handler=request_handler, html_parser=html_parser)
# Use request_handler as a context manager to ensure proper session initialization
async with request_handler:
# Call the crawl method with the URL
return await crawler.crawl(str(input_data.url))
except Exception as e:
logger.error(f"Single URL crawler failed: {str(e)}")
return {
"success": False,
"error": str(e),
"content": None,
"stats": {
"urls_crawled": 0,
"urls_failed": 1,
"max_depth_reached": 0
}
}
@mcp.tool()
async def multi_url_crawler(
urls: List[str],
concurrent_limit: int = 5,
exclusion_patterns: Optional[List[str]] = None,
rate_limit: float = 1.0
) -> Dict[str, Any]:
"""
Crawl multiple URLs in parallel with rate limiting.
Args:
urls: List of URLs to crawl
concurrent_limit: Maximum number of concurrent requests
exclusion_patterns: List of regex patterns for URLs to exclude
rate_limit: Minimum time between requests to the same domain (seconds)
Returns:
Dict containing results for each URL and overall statistics
"""
try:
# Validate input
input_data = MultiUrlInput(
urls=urls,
concurrent_limit=concurrent_limit,
exclusion_patterns=exclusion_patterns,
rate_limit=rate_limit
)
# Create the crawler with the proper parameters
crawler = MultiURLCrawler(verbose=True)
# Call the crawl method with the URLs
url_list = [str(url) for url in input_data.urls]
results = await crawler.crawl(url_list)
# Return a standardized response format
return {
"success": True,
"results": results,
"stats": {
"urls_crawled": len(results),
"urls_succeeded": sum(1 for r in results if r["success"]),
"urls_failed": sum(1 for r in results if not r["success"])
}
}
except Exception as e:
logger.error(f"Multi URL crawler failed: {str(e)}")
return {
"success": False,
"error": str(e),
"content": None,
"stats": {
"urls_crawled": 0,
"urls_failed": len(urls),
"concurrent_requests_max": 0
}
}
@mcp.tool()
async def sitemap_crawler(
base_url: str,
sitemap_url: Optional[str] = None,
concurrent_limit: int = 5,
exclusion_patterns: Optional[List[str]] = None,
rate_limit: float = 1.0
) -> Dict[str, Any]:
"""
Crawl a website using its sitemap.xml.
Args:
base_url: Base URL of the website
sitemap_url: Optional explicit sitemap URL (if different from base_url/sitemap.xml)
concurrent_limit: Maximum number of concurrent requests
exclusion_patterns: List of regex patterns for URLs to exclude
rate_limit: Minimum time between requests (seconds)
Returns:
Dict containing crawled pages and statistics
"""
try:
# Validate input
input_data = SitemapInput(
base_url=base_url,
sitemap_url=sitemap_url,
concurrent_limit=concurrent_limit,
exclusion_patterns=exclusion_patterns,
rate_limit=rate_limit
)
# Create required utility instances
request_handler = RequestHandler(
rate_limit=input_data.rate_limit,
concurrent_limit=input_data.concurrent_limit
)
html_parser = HTMLParser(base_url=str(input_data.base_url))
# Create the crawler with the proper parameters
crawler = SitemapCrawler(
request_handler=request_handler,
html_parser=html_parser,
verbose=True
)
# Determine the sitemap URL to use
sitemap_url_to_use = str(input_data.sitemap_url) if input_data.sitemap_url else f"{str(input_data.base_url).rstrip('/')}/sitemap.xml"
# Call the crawl method with the sitemap URL
results = await crawler.crawl(sitemap_url_to_use)
return {
"success": True,
"content": results,
"stats": {
"urls_crawled": len(results),
"urls_succeeded": sum(1 for r in results if r["success"]),
"urls_failed": sum(1 for r in results if not r["success"]),
"sitemap_found": len(results) > 0
}
}
except Exception as e:
logger.error(f"Sitemap crawler failed: {str(e)}")
return {
"success": False,
"error": str(e),
"content": None,
"stats": {
"urls_crawled": 0,
"urls_failed": 1,
"sitemap_found": False
}
}
@mcp.tool()
async def menu_crawler(
base_url: str,
menu_selector: str,
concurrent_limit: int = 5,
exclusion_patterns: Optional[List[str]] = None,
rate_limit: float = 1.0
) -> Dict[str, Any]:
"""
Crawl a website by following its navigation menu structure.
Args:
base_url: Base URL of the website
menu_selector: CSS selector for the navigation menu element
concurrent_limit: Maximum number of concurrent requests
exclusion_patterns: List of regex patterns for URLs to exclude
rate_limit: Minimum time between requests (seconds)
Returns:
Dict containing menu structure and crawled content
"""
try:
# Validate input
input_data = MenuInput(
base_url=base_url,
menu_selector=menu_selector,
concurrent_limit=concurrent_limit,
exclusion_patterns=exclusion_patterns,
rate_limit=rate_limit
)
# Create the crawler with the proper parameters
crawler = MenuCrawler(start_url=str(input_data.base_url))
# Call the crawl method
results = await crawler.crawl()
return {
"success": True,
"content": results,
"stats": {
"urls_crawled": len(results.get("menu_links", [])),
"urls_failed": 0,
"menu_items_found": len(results.get("menu_structure", {}).get("items", []))
}
}
except Exception as e:
logger.error(f"Menu crawler failed: {str(e)}")
return {
"success": False,
"error": str(e),
"content": None,
"stats": {
"urls_crawled": 0,
"urls_failed": 1,
"menu_items_found": 0
}
}
def main():
"""Main entry point for the MCP server."""
try:
logger.info("Starting DocsScraperMCP server...")
mcp.run() # Using run() method instead of start()
except Exception as e:
logger.error(f"Server failed: {str(e)}")
raise
finally:
logger.info("DocsScraperMCP server stopped.")
if __name__ == "__main__":
main()
```
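Because the `@mcp.tool()` functions above are plain async functions, they can be smoke-tested without an MCP client, assuming (as in current FastMCP releases) that the decorator hands back the original coroutine. A sketch with a placeholder URL:

```python
import asyncio

from docs_scraper.server import single_url_crawler

async def smoke_test() -> None:
    # Placeholder URL; any reachable documentation page will do.
    result = await single_url_crawler("https://docs.example.com/", depth=0, rate_limit=1.0)
    print("success:", result["success"])
    if not result["success"]:
        print("error:", result["error"])

if __name__ == "__main__":
    asyncio.run(smoke_test())
```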
--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/multi_url_crawler.py:
--------------------------------------------------------------------------------
```python
import os
import sys
import asyncio
import re
import json
import argparse
from typing import List, Optional
from datetime import datetime
from termcolor import colored
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
from urllib.parse import urlparse
def load_urls_from_file(file_path: str) -> List[str]:
"""Load URLs from either a text file or JSON file"""
try:
# Create input_files directory if it doesn't exist
input_dir = "input_files"
os.makedirs(input_dir, exist_ok=True)
# Check if file exists in current directory or input_files directory
if os.path.exists(file_path):
actual_path = file_path
elif os.path.exists(os.path.join(input_dir, file_path)):
actual_path = os.path.join(input_dir, file_path)
else:
print(colored(f"Error: File {file_path} not found", "red"))
print(colored(f"Please place your URL files in either:", "yellow"))
print(colored(f"1. The root directory ({os.getcwd()})", "yellow"))
print(colored(f"2. The input_files directory ({os.path.join(os.getcwd(), input_dir)})", "yellow"))
sys.exit(1)
file_ext = os.path.splitext(actual_path)[1].lower()
if file_ext == '.json':
print(colored(f"Loading URLs from JSON file: {actual_path}", "cyan"))
with open(actual_path, 'r', encoding='utf-8') as f:
try:
data = json.load(f)
# Handle menu crawler output format
if isinstance(data, dict) and 'menu_links' in data:
urls = data['menu_links']
elif isinstance(data, dict) and 'urls' in data:
urls = data['urls']
elif isinstance(data, list):
urls = data
else:
print(colored("Error: Invalid JSON format. Expected 'menu_links' or 'urls' key, or list of URLs", "red"))
sys.exit(1)
print(colored(f"Successfully loaded {len(urls)} URLs from JSON file", "green"))
return urls
except json.JSONDecodeError as e:
print(colored(f"Error: Invalid JSON file - {str(e)}", "red"))
sys.exit(1)
else:
print(colored(f"Loading URLs from text file: {actual_path}", "cyan"))
with open(actual_path, 'r', encoding='utf-8') as f:
urls = [line.strip() for line in f if line.strip()]
print(colored(f"Successfully loaded {len(urls)} URLs from text file", "green"))
return urls
except Exception as e:
print(colored(f"Error loading URLs from file: {str(e)}", "red"))
sys.exit(1)
class MultiURLCrawler:
def __init__(self, verbose: bool = True):
self.browser_config = BrowserConfig(
headless=True,
verbose=True,
viewport_width=800,
viewport_height=600
)
self.crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48,
threshold_type="fixed",
min_word_threshold=0
)
),
)
self.verbose = verbose
def process_markdown_content(self, content: str, url: str) -> str:
"""Process markdown content to start from first H1 and add URL as H2"""
# Find the first H1 tag
h1_match = re.search(r'^# .+$', content, re.MULTILINE)
if not h1_match:
# If no H1 found, return original content with URL as H1
return f"# No Title Found\n\n## Source\n{url}\n\n{content}"
# Get the content starting from the first H1
content_from_h1 = content[h1_match.start():]
# Remove "Was this page helpful?" section and everything after it
helpful_patterns = [
r'^#+\s*Was this page helpful\?.*$', # Matches any heading level with this text
r'^Was this page helpful\?.*$', # Matches the text without heading
r'^#+\s*Was this helpful\?.*$', # Matches any heading level with shorter text
r'^Was this helpful\?.*$' # Matches shorter text without heading
]
for pattern in helpful_patterns:
parts = re.split(pattern, content_from_h1, flags=re.MULTILINE | re.IGNORECASE)
if len(parts) > 1:
content_from_h1 = parts[0].strip()
break
# Insert URL as H2 after the H1
lines = content_from_h1.split('\n')
h1_line = lines[0]
rest_of_content = '\n'.join(lines[1:])
return f"{h1_line}\n\n## Source\n{url}\n\n{rest_of_content}"
def get_filename_prefix(self, url: str) -> str:
"""
Generate a filename prefix from a URL including path components.
Examples:
- https://docs.literalai.com/page -> literalai_docs_page
- https://literalai.com/docs/page -> literalai_docs_page
- https://api.example.com/path/to/page -> example_api_path_to_page
"""
try:
# Parse the URL
parsed = urlparse(url)
# Split hostname and reverse it (e.g., 'docs.example.com' -> ['com', 'example', 'docs'])
hostname_parts = parsed.hostname.split('.')
hostname_parts.reverse()
# Remove common TLDs and 'www'
hostname_parts = [p for p in hostname_parts if p not in ('com', 'org', 'net', 'www')]
# Get path components, removing empty strings
path_parts = [p for p in parsed.path.split('/') if p]
# Combine hostname and path parts
all_parts = hostname_parts + path_parts
# Clean up parts: lowercase, remove special chars, limit length
cleaned_parts = []
for part in all_parts:
# Convert to lowercase and remove special characters
cleaned = re.sub(r'[^a-zA-Z0-9]+', '_', part.lower())
# Remove leading/trailing underscores
cleaned = cleaned.strip('_')
# Only add non-empty parts
if cleaned:
cleaned_parts.append(cleaned)
# Join parts with underscores
return '_'.join(cleaned_parts)
except Exception as e:
print(colored(f"Error generating filename prefix: {str(e)}", "red"))
return "default"
def save_markdown_content(self, results: List[dict], filename_prefix: str = None):
"""Save all markdown content to a single file"""
try:
# Use the first successful URL to generate the filename prefix if none provided
if not filename_prefix and results:
# Find first successful result
first_url = next((result["url"] for result in results if result["success"]), None)
if first_url:
filename_prefix = self.get_filename_prefix(first_url)
else:
filename_prefix = "docs" # Fallback if no successful results
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"{filename_prefix}_{timestamp}.md"
filepath = os.path.join("scraped_docs", filename)
# Create scraped_docs directory if it doesn't exist
os.makedirs("scraped_docs", exist_ok=True)
with open(filepath, "w", encoding="utf-8") as f:
for result in results:
if result["success"]:
processed_content = self.process_markdown_content(
result["markdown_content"],
result["url"]
)
f.write(processed_content)
f.write("\n\n---\n\n")
if self.verbose:
print(colored(f"\nMarkdown content saved to: {filepath}", "green"))
return filepath
except Exception as e:
print(colored(f"\nError saving markdown content: {str(e)}", "red"))
return None
async def crawl(self, urls: List[str]) -> List[dict]:
"""
Crawl multiple URLs sequentially using session reuse for optimal performance
"""
if self.verbose:
print("\n=== Starting Crawl ===")
total_urls = len(urls)
print(f"Total URLs to crawl: {total_urls}")
results = []
async with AsyncWebCrawler(config=self.browser_config) as crawler:
session_id = "crawl_session" # Reuse the same session for all URLs
for idx, url in enumerate(urls, 1):
try:
if self.verbose:
progress = (idx / total_urls) * 100
print(f"\nProgress: {idx}/{total_urls} ({progress:.1f}%)")
print(f"Crawling: {url}")
result = await crawler.arun(
url=url,
config=self.crawler_config,
session_id=session_id,
)
results.append({
"url": url,
"success": result.success,
"content_length": len(result.markdown.raw_markdown) if result.success else 0,
"markdown_content": result.markdown.raw_markdown if result.success else "",
"error": result.error_message if not result.success else None
})
if self.verbose and result.success:
print(f"✓ Successfully crawled URL {idx}/{total_urls}")
print(f"Content length: {len(result.markdown.raw_markdown)} characters")
except Exception as e:
results.append({
"url": url,
"success": False,
"content_length": 0,
"markdown_content": "",
"error": str(e)
})
if self.verbose:
print(f"✗ Error crawling URL {idx}/{total_urls}: {str(e)}")
if self.verbose:
successful = sum(1 for r in results if r["success"])
print(f"\n=== Crawl Complete ===")
print(f"Successfully crawled: {successful}/{total_urls} URLs")
return results
async def main():
parser = argparse.ArgumentParser(description='Crawl multiple URLs and generate markdown documentation')
parser.add_argument('urls_file', type=str, help='Path to file containing URLs (either .txt or .json)')
parser.add_argument('--output-prefix', type=str, help='Prefix for output markdown file (optional)')
args = parser.parse_args()
try:
# Load URLs from file
urls = load_urls_from_file(args.urls_file)
if not urls:
print(colored("Error: No URLs found in the input file", "red"))
sys.exit(1)
print(colored(f"Found {len(urls)} URLs to crawl", "green"))
# Initialize and run crawler
crawler = MultiURLCrawler(verbose=True)
results = await crawler.crawl(urls)
# Save results to markdown file - only pass output_prefix if explicitly set
crawler.save_markdown_content(results, args.output_prefix if args.output_prefix else None)
except Exception as e:
print(colored(f"Error during crawling: {str(e)}", "red"))
sys.exit(1)
if __name__ == "__main__":
asyncio.run(main())
```
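Besides the CLI entry point, `MultiURLCrawler` can be driven from other code: `crawl()` takes a plain list of URLs and `save_markdown_content()` writes one combined Markdown file into `scraped_docs/`. A brief sketch with placeholder URLs:

```python
import asyncio

from docs_scraper.crawlers.multi_url_crawler import MultiURLCrawler

async def demo() -> None:
    urls = [
        "https://docs.example.com/getting-started",  # placeholder URLs
        "https://docs.example.com/api",
    ]
    crawler = MultiURLCrawler(verbose=True)
    results = await crawler.crawl(urls)
    # Filename prefix is derived from the first successful URL when none is given.
    crawler.save_markdown_content(results)
    print(sum(1 for r in results if r["success"]), "of", len(results), "pages crawled")

if __name__ == "__main__":
    asyncio.run(demo())
```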
--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/sitemap_crawler.py:
--------------------------------------------------------------------------------
```python
import os
import sys
import asyncio
import re
import xml.etree.ElementTree as ET
import argparse
from typing import List, Optional, Dict
from datetime import datetime
from termcolor import colored
from ..utils import RequestHandler, HTMLParser
class SitemapCrawler:
def __init__(self, request_handler: Optional[RequestHandler] = None, html_parser: Optional[HTMLParser] = None, verbose: bool = True):
"""
Initialize the sitemap crawler.
Args:
request_handler: Optional RequestHandler instance. If not provided, one will be created.
html_parser: Optional HTMLParser instance. If not provided, one will be created.
verbose: Whether to print progress messages
"""
self.verbose = verbose
self.request_handler = request_handler or RequestHandler(
rate_limit=1.0,
concurrent_limit=5,
user_agent="DocsScraperBot/1.0",
timeout=30
)
self._html_parser = html_parser
async def fetch_sitemap(self, sitemap_url: str) -> List[str]:
"""
Fetch and parse an XML sitemap to extract URLs.
Args:
sitemap_url (str): The URL of the XML sitemap
Returns:
List[str]: List of URLs found in the sitemap
"""
if self.verbose:
print(f"\nFetching sitemap from: {sitemap_url}")
async with self.request_handler as handler:
try:
response = await handler.get(sitemap_url)
if not response["success"]:
raise Exception(f"Failed to fetch sitemap: {response['error']}")
content = response["content"]
# Parse XML content
root = ET.fromstring(content)
# Handle both standard sitemaps and sitemap indexes
urls = []
# Remove XML namespace for easier parsing
namespace = root.tag.split('}')[0] + '}' if '}' in root.tag else ''
if root.tag == f"{namespace}sitemapindex":
# This is a sitemap index file
if self.verbose:
print("Found sitemap index, processing nested sitemaps...")
for sitemap in root.findall(f".//{namespace}sitemap"):
loc = sitemap.find(f"{namespace}loc")
if loc is not None and loc.text:
nested_urls = await self.fetch_sitemap(loc.text)
urls.extend(nested_urls)
else:
# This is a standard sitemap
for url in root.findall(f".//{namespace}url"):
loc = url.find(f"{namespace}loc")
if loc is not None and loc.text:
urls.append(loc.text)
if self.verbose:
print(f"Found {len(urls)} URLs in sitemap")
return urls
except Exception as e:
print(f"Error fetching sitemap: {str(e)}")
return []
def process_markdown_content(self, content: str, url: str) -> str:
"""Process markdown content to start from first H1 and add URL as H2"""
# Find the first H1 tag
h1_match = re.search(r'^# .+$', content, re.MULTILINE)
if not h1_match:
# If no H1 found, return original content with URL as H1
return f"# No Title Found\n\n## Source\n{url}\n\n{content}"
# Get the content starting from the first H1
content_from_h1 = content[h1_match.start():]
# Remove "Was this page helpful?" section and everything after it
helpful_patterns = [
r'^#+\s*Was this page helpful\?.*$', # Matches any heading level with this text
r'^Was this page helpful\?.*$', # Matches the text without heading
r'^#+\s*Was this helpful\?.*$', # Matches any heading level with shorter text
r'^Was this helpful\?.*$' # Matches shorter text without heading
]
for pattern in helpful_patterns:
parts = re.split(pattern, content_from_h1, flags=re.MULTILINE | re.IGNORECASE)
if len(parts) > 1:
content_from_h1 = parts[0].strip()
break
# Insert URL as H2 after the H1
lines = content_from_h1.split('\n')
h1_line = lines[0]
rest_of_content = '\n'.join(lines[1:]).strip()
return f"{h1_line}\n\n## Source\n{url}\n\n{rest_of_content}"
def save_markdown_content(self, results: List[dict], filename_prefix: str = "vercel_ai_docs"):
"""Save all markdown content to a single file"""
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"{filename_prefix}_{timestamp}.md"
filepath = os.path.join("scraped_docs", filename)
# Create scraped_docs directory if it doesn't exist
os.makedirs("scraped_docs", exist_ok=True)
with open(filepath, "w", encoding="utf-8") as f:
for result in results:
if result["success"]:
processed_content = self.process_markdown_content(
result["content"],
result["url"]
)
f.write(processed_content)
f.write("\n\n---\n\n")
if self.verbose:
print(f"\nMarkdown content saved to: {filepath}")
return filepath
    async def crawl(self, sitemap_url: str, urls: Optional[List[str]] = None) -> List[dict]:
        """
        Crawl a sitemap URL and all URLs it contains.
        Args:
            sitemap_url: URL of the sitemap to crawl
            urls: Optional pre-fetched (and possibly filtered) list of URLs to crawl;
                if omitted, the sitemap is fetched and every URL it lists is used
        Returns:
            List of dictionaries containing crawl results
        """
        if self.verbose:
            print("\n=== Starting Crawl ===")
        # Fetch all URLs from the sitemap unless a pre-filtered list was supplied
        if urls is None:
            urls = await self.fetch_sitemap(sitemap_url)
if self.verbose:
print(f"Total URLs to crawl: {len(urls)}")
results = []
async with self.request_handler as handler:
for idx, url in enumerate(urls, 1):
try:
if self.verbose:
progress = (idx / len(urls)) * 100
print(f"\nProgress: {idx}/{len(urls)} ({progress:.1f}%)")
print(f"Crawling: {url}")
response = await handler.get(url)
html_parser = self._html_parser or HTMLParser(url)
if response["success"]:
parsed_content = html_parser.parse_content(response["content"])
results.append({
"url": url,
"success": True,
"content": parsed_content["text_content"],
"metadata": {
"title": parsed_content["title"],
"description": parsed_content["description"]
},
"links": parsed_content["links"],
"status_code": response["status"],
"error": None
})
if self.verbose:
print(f"✓ Successfully crawled URL {idx}/{len(urls)}")
print(f"Content length: {len(parsed_content['text_content'])} characters")
else:
results.append({
"url": url,
"success": False,
"content": "",
"metadata": {"title": None, "description": None},
"links": [],
"status_code": response.get("status"),
"error": response["error"]
})
if self.verbose:
print(f"✗ Error crawling URL {idx}/{len(urls)}: {response['error']}")
except Exception as e:
results.append({
"url": url,
"success": False,
"content": "",
"metadata": {"title": None, "description": None},
"links": [],
"status_code": None,
"error": str(e)
})
if self.verbose:
print(f"✗ Error crawling URL {idx}/{len(urls)}: {str(e)}")
if self.verbose:
successful = sum(1 for r in results if r["success"])
print(f"\n=== Crawl Complete ===")
print(f"Successfully crawled: {successful}/{len(urls)} URLs")
return results
def get_filename_prefix(self, url: str) -> str:
"""
Generate a filename prefix from a sitemap URL.
Examples:
        - https://docs.literalai.com/sitemap.xml -> literalai_docs
- https://literalai.com/docs/sitemap.xml -> literalai_docs
- https://api.example.com/sitemap.xml -> example_api
Args:
url (str): The sitemap URL
Returns:
str: Generated filename prefix
"""
# Remove protocol and split URL parts
clean_url = url.split('://')[1]
url_parts = clean_url.split('/')
# Get domain parts
domain_parts = url_parts[0].split('.')
# Extract main domain name (ignoring TLD)
main_domain = domain_parts[-2]
# Determine the qualifier (subdomain or path segment)
qualifier = None
# First check subdomain
if len(domain_parts) > 2:
qualifier = domain_parts[0]
# Then check path
elif len(url_parts) > 2:
# Get the first meaningful path segment
for segment in url_parts[1:]:
if segment and segment != 'sitemap.xml':
qualifier = segment
break
# Build the prefix
if qualifier:
# Clean up qualifier (remove special characters, convert to lowercase)
qualifier = re.sub(r'[^a-zA-Z0-9]', '', qualifier.lower())
# Don't duplicate parts if they're the same
if qualifier != main_domain:
return f"{main_domain}_{qualifier}"
return main_domain
async def main():
# Set up argument parser
parser = argparse.ArgumentParser(description='Crawl a sitemap and generate markdown documentation')
parser.add_argument('sitemap_url', type=str, help='URL of the sitemap (e.g., https://docs.example.com/sitemap.xml)')
parser.add_argument('--max-depth', type=int, default=10, help='Maximum sitemap recursion depth')
parser.add_argument('--patterns', type=str, nargs='+', help='URL patterns to include (e.g., "/docs/*" "/guide/*")')
args = parser.parse_args()
try:
print(colored(f"\nFetching sitemap: {args.sitemap_url}", "cyan"))
# Initialize crawler
crawler = SitemapCrawler(verbose=True)
# Fetch URLs from sitemap
urls = await crawler.fetch_sitemap(args.sitemap_url)
if not urls:
print(colored("No URLs found in sitemap", "red"))
sys.exit(1)
# Filter URLs by pattern if specified
if args.patterns:
print(colored("\nFiltering URLs by patterns:", "cyan"))
for pattern in args.patterns:
print(colored(f" {pattern}", "yellow"))
filtered_urls = []
for url in urls:
if any(pattern.replace('*', '') in url for pattern in args.patterns):
filtered_urls.append(url)
print(colored(f"\nFound {len(filtered_urls)} URLs matching patterns", "green"))
urls = filtered_urls
        # Crawl the (possibly pattern-filtered) URLs
        results = await crawler.crawl(args.sitemap_url, urls=urls)
# Save results to markdown file with dynamic name
filename_prefix = crawler.get_filename_prefix(args.sitemap_url)
crawler.save_markdown_content(results, filename_prefix)
except Exception as e:
print(colored(f"Error during crawling: {str(e)}", "red"))
sys.exit(1)
if __name__ == "__main__":
asyncio.run(main())
```
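`SitemapCrawler` follows the same pattern programmatically: `fetch_sitemap()` resolves nested sitemap indexes, `crawl()` fetches and parses each page, and `save_markdown_content()` writes the combined output. A short sketch, assuming a placeholder sitemap URL:

```python
import asyncio

from docs_scraper.crawlers.sitemap_crawler import SitemapCrawler

async def demo() -> None:
    sitemap_url = "https://docs.example.com/sitemap.xml"  # placeholder
    crawler = SitemapCrawler(verbose=True)
    results = await crawler.crawl(sitemap_url)
    crawler.save_markdown_content(results, crawler.get_filename_prefix(sitemap_url))

if __name__ == "__main__":
    asyncio.run(demo())
```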
--------------------------------------------------------------------------------
/src/docs_scraper/crawlers/menu_crawler.py:
--------------------------------------------------------------------------------
```python
#!/usr/bin/env python3
import asyncio
from typing import List, Set
from termcolor import colored
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from urllib.parse import urljoin, urlparse
import json
import os
import sys
import argparse
from datetime import datetime
import re
# Constants
BASE_URL = "https://developers.cloudflare.com/agents/"
INPUT_DIR = "input_files" # Changed from OUTPUT_DIR
MENU_SELECTORS = [
# Traditional documentation selectors
"nav a", # General navigation links
"[role='navigation'] a", # Role-based navigation
".sidebar a", # Common sidebar class
"[class*='nav'] a", # Classes containing 'nav'
"[class*='menu'] a", # Classes containing 'menu'
"aside a", # Side navigation
".toc a", # Table of contents
# Modern framework selectors (Mintlify, Docusaurus, etc)
"[class*='sidebar'] [role='navigation'] [class*='group'] a", # Navigation groups
"[class*='sidebar'] [role='navigation'] [class*='item'] a", # Navigation items
"[class*='sidebar'] [role='navigation'] [class*='link'] a", # Direct links
"[class*='sidebar'] [role='navigation'] div[class*='text']", # Text items
"[class*='sidebar'] [role='navigation'] [class*='nav-item']", # Nav items
# Additional common patterns
"[class*='docs-'] a", # Documentation-specific links
"[class*='navigation'] a", # Navigation containers
"[class*='toc'] a", # Table of contents variations
".docNavigation a", # Documentation navigation
"[class*='menu-item'] a", # Menu items
# Client-side rendered navigation
"[class*='sidebar'] a[href]", # Any link in sidebar
"[class*='sidebar'] [role='link']", # ARIA role links
"[class*='sidebar'] [role='menuitem']", # Menu items
"[class*='sidebar'] [role='treeitem']", # Tree navigation items
"[class*='sidebar'] [onclick]", # Elements with click handlers
"[class*='sidebar'] [class*='link']", # Elements with link classes
"a[href^='/']", # Root-relative links
"a[href^='./']", # Relative links
"a[href^='../']" # Parent-relative links
]
# JavaScript to expand nested menus
EXPAND_MENUS_JS = """
(async () => {
// Wait for client-side rendering to complete
await new Promise(r => setTimeout(r, 2000));
// Function to expand all menu items
async function expandAllMenus() {
// Combined selectors for expandable menu items
const expandableSelectors = [
// Previous selectors...
// Additional selectors for client-side rendered menus
'[class*="sidebar"] button',
'[class*="sidebar"] [role="button"]',
'[class*="sidebar"] [aria-controls]',
'[class*="sidebar"] [aria-expanded]',
'[class*="sidebar"] [data-state]',
'[class*="sidebar"] [class*="expand"]',
'[class*="sidebar"] [class*="toggle"]',
'[class*="sidebar"] [class*="collapse"]'
];
let expanded = 0;
let lastExpanded = -1;
let attempts = 0;
const maxAttempts = 10; // Increased attempts for client-side rendering
while (expanded !== lastExpanded && attempts < maxAttempts) {
lastExpanded = expanded;
attempts++;
for (const selector of expandableSelectors) {
const elements = document.querySelectorAll(selector);
for (const el of elements) {
try {
// Click the element
el.click();
// Try multiple expansion methods
el.setAttribute('aria-expanded', 'true');
el.setAttribute('data-state', 'open');
el.classList.add('expanded', 'show', 'active');
el.classList.remove('collapsed', 'closed');
// Handle parent groups - multiple patterns
['[class*="group"]', '[class*="parent"]', '[class*="submenu"]'].forEach(parentSelector => {
let parent = el.closest(parentSelector);
if (parent) {
parent.setAttribute('data-state', 'open');
parent.setAttribute('aria-expanded', 'true');
parent.classList.add('expanded', 'show', 'active');
}
});
expanded++;
await new Promise(r => setTimeout(r, 200)); // Increased delay between clicks
} catch (e) {
continue;
}
}
}
// Wait longer between attempts for client-side rendering
await new Promise(r => setTimeout(r, 500));
}
// After expansion, try to convert text items to links if needed
const textSelectors = [
'[class*="sidebar"] [role="navigation"] [class*="text"]',
'[class*="menu-item"]',
'[class*="nav-item"]',
'[class*="sidebar"] [role="menuitem"]',
'[class*="sidebar"] [role="treeitem"]'
];
textSelectors.forEach(selector => {
const textItems = document.querySelectorAll(selector);
textItems.forEach(item => {
if (!item.querySelector('a') && item.textContent && item.textContent.trim()) {
const text = item.textContent.trim();
// Only create link if it doesn't already exist
if (!Array.from(item.children).some(child => child.tagName === 'A')) {
const link = document.createElement('a');
link.href = '#' + text.toLowerCase().replace(/[^a-z0-9]+/g, '-');
link.textContent = text;
item.appendChild(link);
}
}
});
});
return expanded;
}
const expandedCount = await expandAllMenus();
// Final wait to ensure all client-side updates are complete
await new Promise(r => setTimeout(r, 1000));
return expandedCount;
})();
"""
def get_filename_prefix(url: str) -> str:
"""
Generate a filename prefix from a URL including path components.
Examples:
- https://docs.literalai.com/page -> literalai_docs_page
- https://literalai.com/docs/page -> literalai_docs_page
- https://api.example.com/path/to/page -> example_api_path_to_page
Args:
url (str): The URL to process
Returns:
str: A filename-safe string derived from the URL
"""
try:
# Parse the URL
parsed = urlparse(url)
# Split hostname and reverse it (e.g., 'docs.example.com' -> ['com', 'example', 'docs'])
hostname_parts = parsed.hostname.split('.')
hostname_parts.reverse()
# Remove common TLDs and 'www'
hostname_parts = [p for p in hostname_parts if p not in ('com', 'org', 'net', 'www')]
# Get path components, removing empty strings
path_parts = [p for p in parsed.path.split('/') if p]
# Combine hostname and path parts
all_parts = hostname_parts + path_parts
# Clean up parts: lowercase, remove special chars, limit length
cleaned_parts = []
for part in all_parts:
# Convert to lowercase and remove special characters
cleaned = re.sub(r'[^a-zA-Z0-9]+', '_', part.lower())
# Remove leading/trailing underscores
cleaned = cleaned.strip('_')
# Only add non-empty parts
if cleaned:
cleaned_parts.append(cleaned)
# Join parts with underscores
return '_'.join(cleaned_parts)
except Exception as e:
print(colored(f"Error generating filename prefix: {str(e)}", "red"))
return "default"
class MenuCrawler:
def __init__(self, start_url: str):
self.start_url = start_url
# Configure browser settings
self.browser_config = BrowserConfig(
headless=True,
viewport_width=1920,
viewport_height=1080,
java_script_enabled=True # Ensure JavaScript is enabled
)
# Create extraction strategy for menu links
extraction_schema = {
"name": "MenuLinks",
"baseSelector": ", ".join(MENU_SELECTORS),
"fields": [
{
"name": "href",
"type": "attribute",
"attribute": "href"
},
{
"name": "text",
"type": "text"
},
{
"name": "onclick",
"type": "attribute",
"attribute": "onclick"
},
{
"name": "role",
"type": "attribute",
"attribute": "role"
}
]
}
extraction_strategy = JsonCssExtractionStrategy(extraction_schema)
# Configure crawler settings with proper wait conditions
self.crawler_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
cache_mode=CacheMode.BYPASS, # Don't use cache for fresh results
verbose=True, # Enable detailed logging
wait_for_images=True, # Ensure lazy-loaded content is captured
js_code=[
# Initial wait for client-side rendering
"await new Promise(r => setTimeout(r, 2000));",
EXPAND_MENUS_JS
], # Add JavaScript to expand nested menus
wait_for="""js:() => {
// Wait for sidebar and its content to be present
const sidebar = document.querySelector('[class*="sidebar"]');
if (!sidebar) return false;
// Check if we have navigation items
const hasNavItems = sidebar.querySelectorAll('a').length > 0;
if (hasNavItems) return true;
// If no nav items yet, check for loading indicators
const isLoading = document.querySelector('[class*="loading"]') !== null;
return !isLoading; // Return true if not loading anymore
}""",
session_id="menu_crawler", # Use a session to maintain state
js_only=False # We want full page load first
)
# Create output directory if it doesn't exist
if not os.path.exists(INPUT_DIR):
os.makedirs(INPUT_DIR)
print(colored(f"Created output directory: {INPUT_DIR}", "green"))
async def extract_all_menu_links(self) -> List[str]:
"""Extract all menu links from the main page, including nested menus."""
try:
print(colored(f"Crawling main page: {self.start_url}", "cyan"))
print(colored("Expanding all nested menus...", "yellow"))
async with AsyncWebCrawler(config=self.browser_config) as crawler:
# Get page content using crawl4ai
result = await crawler.arun(
url=self.start_url,
config=self.crawler_config
)
if not result or not result.success:
print(colored(f"Failed to get page data", "red"))
if result and result.error_message:
print(colored(f"Error: {result.error_message}", "red"))
return []
links = set()
# Parse the base domain from start_url
base_domain = urlparse(self.start_url).netloc
# Add the base URL first (without trailing slash for consistency)
base_url = self.start_url.rstrip('/')
links.add(base_url)
print(colored(f"Added base URL: {base_url}", "green"))
# Extract links from the result
if hasattr(result, 'extracted_content') and result.extracted_content:
try:
menu_links = json.loads(result.extracted_content)
for link in menu_links:
href = link.get('href', '')
text = link.get('text', '').strip()
# Skip empty hrefs
if not href:
continue
# Convert relative URLs to absolute
absolute_url = urljoin(self.start_url, href)
parsed_url = urlparse(absolute_url)
# Accept internal links (same domain) that aren't anchors
if (parsed_url.netloc == base_domain and
not href.startswith('#') and
'#' not in absolute_url):
# Remove any trailing slashes for consistency
absolute_url = absolute_url.rstrip('/')
links.add(absolute_url)
print(colored(f"Found link: {text} -> {absolute_url}", "green"))
else:
print(colored(f"Skipping external or anchor link: {text} -> {href}", "yellow"))
except json.JSONDecodeError as e:
print(colored(f"Error parsing extracted content: {str(e)}", "red"))
print(colored(f"\nFound {len(links)} unique menu links", "green"))
return sorted(list(links))
except Exception as e:
print(colored(f"Error extracting menu links: {str(e)}", "red"))
return []
def save_results(self, results: dict) -> str:
"""Save crawling results to a JSON file in the input_files directory."""
try:
# Create input_files directory if it doesn't exist
os.makedirs(INPUT_DIR, exist_ok=True)
# Generate filename using the same pattern
filename_prefix = get_filename_prefix(self.start_url)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"{filename_prefix}_menu_links_{timestamp}.json"
filepath = os.path.join(INPUT_DIR, filename)
with open(filepath, "w", encoding="utf-8") as f:
json.dump(results, f, indent=2)
print(colored(f"\n✓ Menu links saved to: {filepath}", "green"))
print(colored("\nTo crawl these URLs with multi_url_crawler.py, run:", "cyan"))
print(colored(f"python multi_url_crawler.py --urls {filename}", "yellow"))
return filepath
except Exception as e:
print(colored(f"\n✗ Error saving menu links: {str(e)}", "red"))
return None
async def crawl(self):
"""Main crawling method."""
try:
# Extract all menu links from the main page
menu_links = await self.extract_all_menu_links()
# Save results
results = {
"start_url": self.start_url,
"total_links_found": len(menu_links),
"menu_links": menu_links
}
self.save_results(results)
print(colored(f"\nCrawling completed!", "green"))
print(colored(f"Total unique menu links found: {len(menu_links)}", "green"))
except Exception as e:
print(colored(f"Error during crawling: {str(e)}", "red"))
async def main():
# Set up argument parser
parser = argparse.ArgumentParser(description='Extract menu links from a documentation website')
parser.add_argument('url', type=str, help='Documentation site URL to crawl')
parser.add_argument('--selectors', type=str, nargs='+', help='Custom menu selectors (optional)')
args = parser.parse_args()
try:
# Update menu selectors if custom ones provided
if args.selectors:
print(colored("Using custom menu selectors:", "cyan"))
for selector in args.selectors:
print(colored(f" {selector}", "yellow"))
global MENU_SELECTORS
MENU_SELECTORS = args.selectors
crawler = MenuCrawler(args.url)
await crawler.crawl()
except Exception as e:
print(colored(f"Error in main: {str(e)}", "red"))
sys.exit(1)
if __name__ == "__main__":
print(colored("Starting documentation menu crawler...", "cyan"))
asyncio.run(main())
```
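`MenuCrawler` is typically the first step of the pipeline: it expands the navigation menu, collects internal links, and saves them as JSON under `input_files/` in a format `multi_url_crawler.py` accepts directly. A minimal sketch, assuming a placeholder documentation URL:

```python
import asyncio

from docs_scraper.crawlers.menu_crawler import MenuCrawler

async def demo() -> None:
    # Placeholder docs site; discovered links are written to input_files/.
    crawler = MenuCrawler("https://docs.example.com/")
    await crawler.crawl()

if __name__ == "__main__":
    asyncio.run(demo())
```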
--------------------------------------------------------------------------------
/.venv/Scripts/Activate.ps1:
--------------------------------------------------------------------------------
```
<#
.Synopsis
Activate a Python virtual environment for the current PowerShell session.
.Description
Pushes the python executable for a virtual environment to the front of the
$Env:PATH environment variable and sets the prompt to signify that you are
in a Python virtual environment. Makes use of the command line switches as
well as the `pyvenv.cfg` file values present in the virtual environment.
.Parameter VenvDir
Path to the directory that contains the virtual environment to activate. The
default value for this is the parent of the directory that the Activate.ps1
script is located within.
.Parameter Prompt
The prompt prefix to display when this virtual environment is activated. By
default, this prompt is the name of the virtual environment folder (VenvDir)
surrounded by parentheses and followed by a single space (ie. '(.venv) ').
.Example
Activate.ps1
Activates the Python virtual environment that contains the Activate.ps1 script.
.Example
Activate.ps1 -Verbose
Activates the Python virtual environment that contains the Activate.ps1 script,
and shows extra information about the activation as it executes.
.Example
Activate.ps1 -VenvDir C:\Users\MyUser\Common\.venv
Activates the Python virtual environment located in the specified location.
.Example
Activate.ps1 -Prompt "MyPython"
Activates the Python virtual environment that contains the Activate.ps1 script,
and prefixes the current prompt with the specified string (surrounded in
parentheses) while the virtual environment is active.
.Notes
On Windows, it may be required to enable this Activate.ps1 script by setting the
execution policy for the user. You can do this by issuing the following PowerShell
command:
PS C:\> Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
For more information on Execution Policies:
https://go.microsoft.com/fwlink/?LinkID=135170
#>
Param(
[Parameter(Mandatory = $false)]
[String]
$VenvDir,
[Parameter(Mandatory = $false)]
[String]
$Prompt
)
<# Function declarations --------------------------------------------------- #>
<#
.Synopsis
Remove all shell session elements added by the Activate script, including the
addition of the virtual environment's Python executable from the beginning of
the PATH variable.
.Parameter NonDestructive
If present, do not remove this function from the global namespace for the
session.
#>
function global:deactivate ([switch]$NonDestructive) {
# Revert to original values
# The prior prompt:
if (Test-Path -Path Function:_OLD_VIRTUAL_PROMPT) {
Copy-Item -Path Function:_OLD_VIRTUAL_PROMPT -Destination Function:prompt
Remove-Item -Path Function:_OLD_VIRTUAL_PROMPT
}
# The prior PYTHONHOME:
if (Test-Path -Path Env:_OLD_VIRTUAL_PYTHONHOME) {
Copy-Item -Path Env:_OLD_VIRTUAL_PYTHONHOME -Destination Env:PYTHONHOME
Remove-Item -Path Env:_OLD_VIRTUAL_PYTHONHOME
}
# The prior PATH:
if (Test-Path -Path Env:_OLD_VIRTUAL_PATH) {
Copy-Item -Path Env:_OLD_VIRTUAL_PATH -Destination Env:PATH
Remove-Item -Path Env:_OLD_VIRTUAL_PATH
}
# Just remove the VIRTUAL_ENV altogether:
if (Test-Path -Path Env:VIRTUAL_ENV) {
Remove-Item -Path env:VIRTUAL_ENV
}
# Just remove VIRTUAL_ENV_PROMPT altogether.
if (Test-Path -Path Env:VIRTUAL_ENV_PROMPT) {
Remove-Item -Path env:VIRTUAL_ENV_PROMPT
}
# Just remove the _PYTHON_VENV_PROMPT_PREFIX altogether:
if (Get-Variable -Name "_PYTHON_VENV_PROMPT_PREFIX" -ErrorAction SilentlyContinue) {
Remove-Variable -Name _PYTHON_VENV_PROMPT_PREFIX -Scope Global -Force
}
# Leave deactivate function in the global namespace if requested:
if (-not $NonDestructive) {
Remove-Item -Path function:deactivate
}
}
<#
.Description
Get-PyVenvConfig parses the values from the pyvenv.cfg file located in the
given folder, and returns them in a map.
For each line in the pyvenv.cfg file, if that line can be parsed into exactly
two strings separated by `=` (with any amount of whitespace surrounding the =)
then it is considered a `key = value` line. The left hand string is the key,
the right hand is the value.
If the value starts with a `'` or a `"` then the first and last character is
stripped from the value before being captured.
.Parameter ConfigDir
Path to the directory that contains the `pyvenv.cfg` file.
#>
function Get-PyVenvConfig(
[String]
$ConfigDir
) {
Write-Verbose "Given ConfigDir=$ConfigDir, obtain values in pyvenv.cfg"
# Ensure the file exists, and issue a warning if it doesn't (but still allow the function to continue).
$pyvenvConfigPath = Join-Path -Resolve -Path $ConfigDir -ChildPath 'pyvenv.cfg' -ErrorAction Continue
# An empty map will be returned if no config file is found.
$pyvenvConfig = @{ }
if ($pyvenvConfigPath) {
Write-Verbose "File exists, parse `key = value` lines"
$pyvenvConfigContent = Get-Content -Path $pyvenvConfigPath
$pyvenvConfigContent | ForEach-Object {
$keyval = $PSItem -split "\s*=\s*", 2
if ($keyval[0] -and $keyval[1]) {
$val = $keyval[1]
# Remove extraneous quotations around a string value.
if ("'""".Contains($val.Substring(0, 1))) {
$val = $val.Substring(1, $val.Length - 2)
}
$pyvenvConfig[$keyval[0]] = $val
Write-Verbose "Adding Key: '$($keyval[0])'='$val'"
}
}
}
return $pyvenvConfig
}
<# Begin Activate script --------------------------------------------------- #>
# Determine the containing directory of this script
$VenvExecPath = Split-Path -Parent $MyInvocation.MyCommand.Definition
$VenvExecDir = Get-Item -Path $VenvExecPath
Write-Verbose "Activation script is located in path: '$VenvExecPath'"
Write-Verbose "VenvExecDir Fullname: '$($VenvExecDir.FullName)"
Write-Verbose "VenvExecDir Name: '$($VenvExecDir.Name)"
# Set values required in priority: CmdLine, ConfigFile, Default
# First, get the location of the virtual environment, it might not be
# VenvExecDir if specified on the command line.
if ($VenvDir) {
Write-Verbose "VenvDir given as parameter, using '$VenvDir' to determine values"
}
else {
Write-Verbose "VenvDir not given as a parameter, using parent directory name as VenvDir."
$VenvDir = $VenvExecDir.Parent.FullName.TrimEnd("\\/")
Write-Verbose "VenvDir=$VenvDir"
}
# Next, read the `pyvenv.cfg` file to determine any required value such
# as `prompt`.
$pyvenvCfg = Get-PyVenvConfig -ConfigDir $VenvDir
# Next, set the prompt from the command line, or the config file, or
# just use the name of the virtual environment folder.
if ($Prompt) {
Write-Verbose "Prompt specified as argument, using '$Prompt'"
}
else {
Write-Verbose "Prompt not specified as argument to script, checking pyvenv.cfg value"
if ($pyvenvCfg -and $pyvenvCfg['prompt']) {
Write-Verbose " Setting based on value in pyvenv.cfg='$($pyvenvCfg['prompt'])'"
$Prompt = $pyvenvCfg['prompt'];
}
else {
Write-Verbose " Setting prompt based on parent's directory's name. (Is the directory name passed to venv module when creating the virtual environment)"
Write-Verbose " Got leaf-name of $VenvDir='$(Split-Path -Path $venvDir -Leaf)'"
$Prompt = Split-Path -Path $venvDir -Leaf
}
}
Write-Verbose "Prompt = '$Prompt'"
Write-Verbose "VenvDir='$VenvDir'"
# Deactivate any currently active virtual environment, but leave the
# deactivate function in place.
deactivate -nondestructive
# Now set the environment variable VIRTUAL_ENV, used by many tools to determine
# that there is an activated venv.
$env:VIRTUAL_ENV = $VenvDir
if (-not $Env:VIRTUAL_ENV_DISABLE_PROMPT) {
Write-Verbose "Setting prompt to '$Prompt'"
# Set the prompt to include the env name
# Make sure _OLD_VIRTUAL_PROMPT is global
function global:_OLD_VIRTUAL_PROMPT { "" }
Copy-Item -Path function:prompt -Destination function:_OLD_VIRTUAL_PROMPT
New-Variable -Name _PYTHON_VENV_PROMPT_PREFIX -Description "Python virtual environment prompt prefix" -Scope Global -Option ReadOnly -Visibility Public -Value $Prompt
function global:prompt {
Write-Host -NoNewline -ForegroundColor Green "($_PYTHON_VENV_PROMPT_PREFIX) "
_OLD_VIRTUAL_PROMPT
}
$env:VIRTUAL_ENV_PROMPT = $Prompt
}
# Clear PYTHONHOME
if (Test-Path -Path Env:PYTHONHOME) {
Copy-Item -Path Env:PYTHONHOME -Destination Env:_OLD_VIRTUAL_PYTHONHOME
Remove-Item -Path Env:PYTHONHOME
}
# Add the venv to the PATH
Copy-Item -Path Env:PATH -Destination Env:_OLD_VIRTUAL_PATH
$Env:PATH = "$VenvExecDir$([System.IO.Path]::PathSeparator)$Env:PATH"
# SIG # Begin signature block
# MII3ewYJKoZIhvcNAQcCoII3bDCCN2gCAQExDzANBglghkgBZQMEAgEFADB5Bgor
# BgEEAYI3AgEEoGswaTA0BgorBgEEAYI3AgEeMCYCAwEAAAQQH8w7YFlLCE63JNLG
# KX7zUQIBAAIBAAIBAAIBAAIBADAxMA0GCWCGSAFlAwQCAQUABCBnL745ElCYk8vk
# dBtMuQhLeWJ3ZGfzKW4DHCYzAn+QB6CCG9IwggXMMIIDtKADAgECAhBUmNLR1FsZ
# lUgTecgRwIeZMA0GCSqGSIb3DQEBDAUAMHcxCzAJBgNVBAYTAlVTMR4wHAYDVQQK
# ExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xSDBGBgNVBAMTP01pY3Jvc29mdCBJZGVu
# dGl0eSBWZXJpZmljYXRpb24gUm9vdCBDZXJ0aWZpY2F0ZSBBdXRob3JpdHkgMjAy
# MDAeFw0yMDA0MTYxODM2MTZaFw00NTA0MTYxODQ0NDBaMHcxCzAJBgNVBAYTAlVT
# MR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xSDBGBgNVBAMTP01pY3Jv
# c29mdCBJZGVudGl0eSBWZXJpZmljYXRpb24gUm9vdCBDZXJ0aWZpY2F0ZSBBdXRo
# b3JpdHkgMjAyMDCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBALORKgeD
# Bmf9np3gx8C3pOZCBH8Ppttf+9Va10Wg+3cL8IDzpm1aTXlT2KCGhFdFIMeiVPvH
# or+Kx24186IVxC9O40qFlkkN/76Z2BT2vCcH7kKbK/ULkgbk/WkTZaiRcvKYhOuD
# PQ7k13ESSCHLDe32R0m3m/nJxxe2hE//uKya13NnSYXjhr03QNAlhtTetcJtYmrV
# qXi8LW9J+eVsFBT9FMfTZRY33stuvF4pjf1imxUs1gXmuYkyM6Nix9fWUmcIxC70
# ViueC4fM7Ke0pqrrBc0ZV6U6CwQnHJFnni1iLS8evtrAIMsEGcoz+4m+mOJyoHI1
# vnnhnINv5G0Xb5DzPQCGdTiO0OBJmrvb0/gwytVXiGhNctO/bX9x2P29Da6SZEi3
# W295JrXNm5UhhNHvDzI9e1eM80UHTHzgXhgONXaLbZ7LNnSrBfjgc10yVpRnlyUK
# xjU9lJfnwUSLgP3B+PR0GeUw9gb7IVc+BhyLaxWGJ0l7gpPKWeh1R+g/OPTHU3mg
# trTiXFHvvV84wRPmeAyVWi7FQFkozA8kwOy6CXcjmTimthzax7ogttc32H83rwjj
# O3HbbnMbfZlysOSGM1l0tRYAe1BtxoYT2v3EOYI9JACaYNq6lMAFUSw0rFCZE4e7
# swWAsk0wAly4JoNdtGNz764jlU9gKL431VulAgMBAAGjVDBSMA4GA1UdDwEB/wQE
# AwIBhjAPBgNVHRMBAf8EBTADAQH/MB0GA1UdDgQWBBTIftJqhSobyhmYBAcnz1AQ
# T2ioojAQBgkrBgEEAYI3FQEEAwIBADANBgkqhkiG9w0BAQwFAAOCAgEAr2rd5hnn
# LZRDGU7L6VCVZKUDkQKL4jaAOxWiUsIWGbZqWl10QzD0m/9gdAmxIR6QFm3FJI9c
# Zohj9E/MffISTEAQiwGf2qnIrvKVG8+dBetJPnSgaFvlVixlHIJ+U9pW2UYXeZJF
# xBA2CFIpF8svpvJ+1Gkkih6PsHMNzBxKq7Kq7aeRYwFkIqgyuH4yKLNncy2RtNwx
# AQv3Rwqm8ddK7VZgxCwIo3tAsLx0J1KH1r6I3TeKiW5niB31yV2g/rarOoDXGpc8
# FzYiQR6sTdWD5jw4vU8w6VSp07YEwzJ2YbuwGMUrGLPAgNW3lbBeUU0i/OxYqujY
# lLSlLu2S3ucYfCFX3VVj979tzR/SpncocMfiWzpbCNJbTsgAlrPhgzavhgplXHT2
# 6ux6anSg8Evu75SjrFDyh+3XOjCDyft9V77l4/hByuVkrrOj7FjshZrM77nq81YY
# uVxzmq/FdxeDWds3GhhyVKVB0rYjdaNDmuV3fJZ5t0GNv+zcgKCf0Xd1WF81E+Al
# GmcLfc4l+gcK5GEh2NQc5QfGNpn0ltDGFf5Ozdeui53bFv0ExpK91IjmqaOqu/dk
# ODtfzAzQNb50GQOmxapMomE2gj4d8yu8l13bS3g7LfU772Aj6PXsCyM2la+YZr9T
# 03u4aUoqlmZpxJTG9F9urJh4iIAGXKKy7aIwggb+MIIE5qADAgECAhMzAAE6wJsA
# snAfNTDUAAAAATrAMA0GCSqGSIb3DQEBDAUAMFoxCzAJBgNVBAYTAlVTMR4wHAYD
# VQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xKzApBgNVBAMTIk1pY3Jvc29mdCBJ
# RCBWZXJpZmllZCBDUyBFT0MgQ0EgMDIwHhcNMjQxMjAzMDk0MTAzWhcNMjQxMjA2
# MDk0MTAzWjB8MQswCQYDVQQGEwJVUzEPMA0GA1UECBMGT3JlZ29uMRIwEAYDVQQH
# EwlCZWF2ZXJ0b24xIzAhBgNVBAoTGlB5dGhvbiBTb2Z0d2FyZSBGb3VuZGF0aW9u
# MSMwIQYDVQQDExpQeXRob24gU29mdHdhcmUgRm91bmRhdGlvbjCCAaIwDQYJKoZI
# hvcNAQEBBQADggGPADCCAYoCggGBAOb6O8bNcUcKFE6vSWXiMf3/EALRFfo4QQDq
# m/XTiKG7GeHjRzVGTGhooTagMmKc4m52JNEPsQxCwXYCcTz9bRt/I+CYfHcqIAKL
# Ga46L/7a7T6qFF0tcAX15uMED8xwApWO8GP4wYxGS/a0GD3OGpl2qsZmw0lKFm+f
# rD8ip4F4z6KrnmsVlC0qMdCndPH+2NAstO7k+ArRhqDvsu5Jt39O2M7LuaX6loEp
# 43CWmkr7i+cLlkNbRx5hBrN/8MJQDhnprS2a6y+3PGLNDDUvN9Yb9UMZFwu5xcjM
# AP8VqdQYmgdz7PZDlIncayelUgxG1Zdx6CWqngNOKpc4lkTKzxf3Q8m+awDXCw7L
# 18jWlFL1Jki2khRmU4HhPzKf2WcgQCHtUPekmvnNUeNFT3TZsxXyhXuDTmB+BOV7
# TpUO594HbMIhdktkdZ2mA/mcKPN4NnYJy55pUmwQIOuqPY1dlR+nv2ko466xxd8x
# pUO3R/pAse5WXaNx6fy9jj0ChNuNmQIDAQABo4ICGTCCAhUwDAYDVR0TAQH/BAIw
# ADAOBgNVHQ8BAf8EBAMCB4AwPAYDVR0lBDUwMwYKKwYBBAGCN2EBAAYIKwYBBQUH
# AwMGGysGAQQBgjdhgqKNuwqmkohkgZH0oEWCk/3hbzAdBgNVHQ4EFgQUJrRvpIJJ
# B/SCh4T20kb8yCbcj7MwHwYDVR0jBBgwFoAUZZ9RzoVofy+KRYiq3acxux4NAF4w
# ZwYDVR0fBGAwXjBcoFqgWIZWaHR0cDovL3d3dy5taWNyb3NvZnQuY29tL3BraW9w
# cy9jcmwvTWljcm9zb2Z0JTIwSUQlMjBWZXJpZmllZCUyMENTJTIwRU9DJTIwQ0El
# MjAwMi5jcmwwgaUGCCsGAQUFBwEBBIGYMIGVMGQGCCsGAQUFBzAChlhodHRwOi8v
# d3d3Lm1pY3Jvc29mdC5jb20vcGtpb3BzL2NlcnRzL01pY3Jvc29mdCUyMElEJTIw
# VmVyaWZpZWQlMjBDUyUyMEVPQyUyMENBJTIwMDIuY3J0MC0GCCsGAQUFBzABhiFo
# dHRwOi8vb25lb2NzcC5taWNyb3NvZnQuY29tL29jc3AwZgYDVR0gBF8wXTBRBgwr
# BgEEAYI3TIN9AQEwQTA/BggrBgEFBQcCARYzaHR0cDovL3d3dy5taWNyb3NvZnQu
# Y29tL3BraW9wcy9Eb2NzL1JlcG9zaXRvcnkuaHRtMAgGBmeBDAEEATANBgkqhkiG
# 9w0BAQwFAAOCAgEAvsdWH8gSPoZt5yblw1MghytJmjs6zeME3/iAYXPzcwwXfB08
# RLMesfF0svS4P1GEdP+CcIIqFl7/48ECI4eregPZjypMWhqPQQWuT8+gMDiWRtYj
# KGEhu+faY+U8Bqv/OrRRS6MHNAoJAGH3t/oxeLXTaeW1URtswm+gvEx+K9KFpHP9
# j1mF0wh4wuVvUJAGYc4KpcorLk7P9vQTLCDmFZi08HMcRYQURHEpskWiSS4czpfv
# ImpRPD5RbIqkQnK7M5wX8X+cI/0hCUb0TjIuxBo92FU726ivQ2vqpzuuRo2Uw1wR
# hzerJI2Bilj66tBzPNn02EZt978Ju/f6+N7b0tFD0kSz1DwTYAR3eK9CwRxIXCGf
# 2rrHz/T9ATSlF71xw2g0R5HBPaJKaySj7PXIU5Nq9sEVCuUSJmNmiZRdBkF76LKH
# mg3cCM5QbgpKfnc46Hi6I7D9q8F1XF0IJwRgP/fwDquWnkTHOoWC9nA0+eLedlAh
# oKYQGEKzAdNLdXdWyUNTjnUKhEXLQgPyDPP9ZLHK+jjQwc6ptwPoX3uuSHtrNmnu
# EIFibcHCp6JVIIu5B94xKS99hGXhaxbuaOMyCvTEVmoLMxAWpF1lVyUUTgj9Lr4W
# /qdqUE+7a1QTDLLswxY8djZCJejn0pDLoKoNfjaeIaRKXXrRs2BND+qKDaowggda
# MIIFQqADAgECAhMzAAAABft6XDITYd9dAAAAAAAFMA0GCSqGSIb3DQEBDAUAMGMx
# CzAJBgNVBAYTAlVTMR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xNDAy
# BgNVBAMTK01pY3Jvc29mdCBJRCBWZXJpZmllZCBDb2RlIFNpZ25pbmcgUENBIDIw
# MjEwHhcNMjEwNDEzMTczMTUzWhcNMjYwNDEzMTczMTUzWjBaMQswCQYDVQQGEwJV
# UzEeMBwGA1UEChMVTWljcm9zb2Z0IENvcnBvcmF0aW9uMSswKQYDVQQDEyJNaWNy
# b3NvZnQgSUQgVmVyaWZpZWQgQ1MgRU9DIENBIDAyMIICIjANBgkqhkiG9w0BAQEF
# AAOCAg8AMIICCgKCAgEA0hqZfD8ykKTA6CDbWvshmBpDoBf7Lv132RVuSqVwQO3a
# ALLkuRnnTIoRmMGo0fIMQrtwR6UHB06xdqOkAfqB6exubXTHu44+duHUCdE4ngjE
# LBQyluMuSOnHaEdveIbt31OhMEX/4nQkph4+Ah0eR4H2sTRrVKmKrlOoQlhia73Q
# g2dHoitcX1uT1vW3Knpt9Mt76H7ZHbLNspMZLkWBabKMl6BdaWZXYpPGdS+qY80g
# DaNCvFq0d10UMu7xHesIqXpTDT3Q3AeOxSylSTc/74P3og9j3OuemEFauFzL55t1
# MvpadEhQmD8uFMxFv/iZOjwvcdY1zhanVLLyplz13/NzSoU3QjhPdqAGhRIwh/YD
# zo3jCdVJgWQRrW83P3qWFFkxNiME2iO4IuYgj7RwseGwv7I9cxOyaHihKMdT9Neo
# SjpSNzVnKKGcYMtOdMtKFqoV7Cim2m84GmIYZTBorR/Po9iwlasTYKFpGZqdWKyY
# nJO2FV8oMmWkIK1iagLLgEt6ZaR0rk/1jUYssyTiRqWr84Qs3XL/V5KUBEtUEQfQ
# /4RtnI09uFFUIGJZV9mD/xOUksWodGrCQSem6Hy261xMJAHqTqMuDKgwi8xk/mfl
# r7yhXPL73SOULmu1Aqu4I7Gpe6QwNW2TtQBxM3vtSTmdPW6rK5y0gED51RjsyK0C
# AwEAAaOCAg4wggIKMA4GA1UdDwEB/wQEAwIBhjAQBgkrBgEEAYI3FQEEAwIBADAd
# BgNVHQ4EFgQUZZ9RzoVofy+KRYiq3acxux4NAF4wVAYDVR0gBE0wSzBJBgRVHSAA
# MEEwPwYIKwYBBQUHAgEWM2h0dHA6Ly93d3cubWljcm9zb2Z0LmNvbS9wa2lvcHMv
# RG9jcy9SZXBvc2l0b3J5Lmh0bTAZBgkrBgEEAYI3FAIEDB4KAFMAdQBiAEMAQTAS
# BgNVHRMBAf8ECDAGAQH/AgEAMB8GA1UdIwQYMBaAFNlBKbAPD2Ns72nX9c0pnqRI
# ajDmMHAGA1UdHwRpMGcwZaBjoGGGX2h0dHA6Ly93d3cubWljcm9zb2Z0LmNvbS9w
# a2lvcHMvY3JsL01pY3Jvc29mdCUyMElEJTIwVmVyaWZpZWQlMjBDb2RlJTIwU2ln
# bmluZyUyMFBDQSUyMDIwMjEuY3JsMIGuBggrBgEFBQcBAQSBoTCBnjBtBggrBgEF
# BQcwAoZhaHR0cDovL3d3dy5taWNyb3NvZnQuY29tL3BraW9wcy9jZXJ0cy9NaWNy
# b3NvZnQlMjBJRCUyMFZlcmlmaWVkJTIwQ29kZSUyMFNpZ25pbmclMjBQQ0ElMjAy
# MDIxLmNydDAtBggrBgEFBQcwAYYhaHR0cDovL29uZW9jc3AubWljcm9zb2Z0LmNv
# bS9vY3NwMA0GCSqGSIb3DQEBDAUAA4ICAQBFSWDUd08X4g5HzvVfrB1SiV8pk6XP
# HT9jPkCmvU/uvBzmZRAjYk2gKYR3pXoStRJaJ/lhjC5Dq/2R7P1YRZHCDYyK0zvS
# RMdE6YQtgGjmsdhzD0nCS6hVVcgfmNQscPJ1WHxbvG5EQgYQ0ZED1FN0MOPQzWe1
# zbH5Va0dSxtnodBVRjnyDYEm7sNEcvJHTG3eXzAyd00E5KDCsEl4z5O0mvXqwaH2
# PS0200E6P4WqLwgs/NmUu5+Aa8Lw/2En2VkIW7Pkir4Un1jG6+tj/ehuqgFyUPPC
# h6kbnvk48bisi/zPjAVkj7qErr7fSYICCzJ4s4YUNVVHgdoFn2xbW7ZfBT3QA9zf
# hq9u4ExXbrVD5rxXSTFEUg2gzQq9JHxsdHyMfcCKLFQOXODSzcYeLpCd+r6GcoDB
# ToyPdKccjC6mAq6+/hiMDnpvKUIHpyYEzWUeattyKXtMf+QrJeQ+ny5jBL+xqdOO
# PEz3dg7qn8/oprUrUbGLBv9fWm18fWXdAv1PCtLL/acMLtHoyeSVMKQYqDHb3Qm0
# uQ+NQ0YE4kUxSQa+W/cCzYAI32uN0nb9M4Mr1pj4bJZidNkM4JyYqezohILxYkgH
# bboJQISrQWrm5RYdyhKBpptJ9JJn0Z63LjdnzlOUxjlsAbQir2Wmz/OJE703BbHm
# QZRwzPx1vu7S5zCCB54wggWGoAMCAQICEzMAAAAHh6M0o3uljhwAAAAAAAcwDQYJ
# KoZIhvcNAQEMBQAwdzELMAkGA1UEBhMCVVMxHjAcBgNVBAoTFU1pY3Jvc29mdCBD
# b3Jwb3JhdGlvbjFIMEYGA1UEAxM/TWljcm9zb2Z0IElkZW50aXR5IFZlcmlmaWNh
# dGlvbiBSb290IENlcnRpZmljYXRlIEF1dGhvcml0eSAyMDIwMB4XDTIxMDQwMTIw
# MDUyMFoXDTM2MDQwMTIwMTUyMFowYzELMAkGA1UEBhMCVVMxHjAcBgNVBAoTFU1p
# Y3Jvc29mdCBDb3Jwb3JhdGlvbjE0MDIGA1UEAxMrTWljcm9zb2Z0IElEIFZlcmlm
# aWVkIENvZGUgU2lnbmluZyBQQ0EgMjAyMTCCAiIwDQYJKoZIhvcNAQEBBQADggIP
# ADCCAgoCggIBALLwwK8ZiCji3VR6TElsaQhVCbRS/3pK+MHrJSj3Zxd3KU3rlfL3
# qrZilYKJNqztA9OQacr1AwoNcHbKBLbsQAhBnIB34zxf52bDpIO3NJlfIaTE/xrw
# eLoQ71lzCHkD7A4As1Bs076Iu+mA6cQzsYYH/Cbl1icwQ6C65rU4V9NQhNUwgrx9
# rGQ//h890Q8JdjLLw0nV+ayQ2Fbkd242o9kH82RZsH3HEyqjAB5a8+Ae2nPIPc8s
# ZU6ZE7iRrRZywRmrKDp5+TcmJX9MRff241UaOBs4NmHOyke8oU1TYrkxh+YeHgfW
# o5tTgkoSMoayqoDpHOLJs+qG8Tvh8SnifW2Jj3+ii11TS8/FGngEaNAWrbyfNrC6
# 9oKpRQXY9bGH6jn9NEJv9weFxhTwyvx9OJLXmRGbAUXN1U9nf4lXezky6Uh/cgjk
# Vd6CGUAf0K+Jw+GE/5VpIVbcNr9rNE50Sbmy/4RTCEGvOq3GhjITbCa4crCzTTHg
# YYjHs1NbOc6brH+eKpWLtr+bGecy9CrwQyx7S/BfYJ+ozst7+yZtG2wR461uckFu
# 0t+gCwLdN0A6cFtSRtR8bvxVFyWwTtgMMFRuBa3vmUOTnfKLsLefRaQcVTgRnzeL
# zdpt32cdYKp+dhr2ogc+qM6K4CBI5/j4VFyC4QFeUP2YAidLtvpXRRo3AgMBAAGj
# ggI1MIICMTAOBgNVHQ8BAf8EBAMCAYYwEAYJKwYBBAGCNxUBBAMCAQAwHQYDVR0O
# BBYEFNlBKbAPD2Ns72nX9c0pnqRIajDmMFQGA1UdIARNMEswSQYEVR0gADBBMD8G
# CCsGAQUFBwIBFjNodHRwOi8vd3d3Lm1pY3Jvc29mdC5jb20vcGtpb3BzL0RvY3Mv
# UmVwb3NpdG9yeS5odG0wGQYJKwYBBAGCNxQCBAweCgBTAHUAYgBDAEEwDwYDVR0T
# AQH/BAUwAwEB/zAfBgNVHSMEGDAWgBTIftJqhSobyhmYBAcnz1AQT2ioojCBhAYD
# VR0fBH0wezB5oHegdYZzaHR0cDovL3d3dy5taWNyb3NvZnQuY29tL3BraW9wcy9j
# cmwvTWljcm9zb2Z0JTIwSWRlbnRpdHklMjBWZXJpZmljYXRpb24lMjBSb290JTIw
# Q2VydGlmaWNhdGUlMjBBdXRob3JpdHklMjAyMDIwLmNybDCBwwYIKwYBBQUHAQEE
# gbYwgbMwgYEGCCsGAQUFBzAChnVodHRwOi8vd3d3Lm1pY3Jvc29mdC5jb20vcGtp
# b3BzL2NlcnRzL01pY3Jvc29mdCUyMElkZW50aXR5JTIwVmVyaWZpY2F0aW9uJTIw
# Um9vdCUyMENlcnRpZmljYXRlJTIwQXV0aG9yaXR5JTIwMjAyMC5jcnQwLQYIKwYB
# BQUHMAGGIWh0dHA6Ly9vbmVvY3NwLm1pY3Jvc29mdC5jb20vb2NzcDANBgkqhkiG
# 9w0BAQwFAAOCAgEAfyUqnv7Uq+rdZgrbVyNMul5skONbhls5fccPlmIbzi+OwVdP
# Q4H55v7VOInnmezQEeW4LqK0wja+fBznANbXLB0KrdMCbHQpbLvG6UA/Xv2pfpVI
# E1CRFfNF4XKO8XYEa3oW8oVH+KZHgIQRIwAbyFKQ9iyj4aOWeAzwk+f9E5StNp5T
# 8FG7/VEURIVWArbAzPt9ThVN3w1fAZkF7+YU9kbq1bCR2YD+MtunSQ1Rft6XG7b4
# e0ejRA7mB2IoX5hNh3UEauY0byxNRG+fT2MCEhQl9g2i2fs6VOG19CNep7SquKaB
# jhWmirYyANb0RJSLWjinMLXNOAga10n8i9jqeprzSMU5ODmrMCJE12xS/NWShg/t
# uLjAsKP6SzYZ+1Ry358ZTFcx0FS/mx2vSoU8s8HRvy+rnXqyUJ9HBqS0DErVLjQw
# K8VtsBdekBmdTbQVoCgPCqr+PDPB3xajYnzevs7eidBsM71PINK2BoE2UfMwxCCX
# 3mccFgx6UsQeRSdVVVNSyALQe6PT12418xon2iDGE81OGCreLzDcMAZnrUAx4XQL
# Uz6ZTl65yPUiOh3k7Yww94lDf+8oG2oZmDh5O1Qe38E+M3vhKwmzIeoB1dVLlz4i
# 3IpaDcR+iuGjH2TdaC1ZOmBXiCRKJLj4DT2uhJ04ji+tHD6n58vhavFIrmcxghr/
# MIIa+wIBATBxMFoxCzAJBgNVBAYTAlVTMR4wHAYDVQQKExVNaWNyb3NvZnQgQ29y
# cG9yYXRpb24xKzApBgNVBAMTIk1pY3Jvc29mdCBJRCBWZXJpZmllZCBDUyBFT0Mg
# Q0EgMDICEzMAATrAmwCycB81MNQAAAABOsAwDQYJYIZIAWUDBAIBBQCggcgwGQYJ
# KoZIhvcNAQkDMQwGCisGAQQBgjcCAQQwHAYKKwYBBAGCNwIBCzEOMAwGCisGAQQB
# gjcCARUwLwYJKoZIhvcNAQkEMSIEIGcBno/ti9PCrR9sXrajsTvlHQvGxbk63JiI
# URJByQuGMFwGCisGAQQBgjcCAQwxTjBMoEaARABCAHUAaQBsAHQAOgAgAFIAZQBs
# AGUAYQBzAGUAXwB2ADMALgAxADIALgA4AF8AMgAwADIANAAxADIAMAAzAC4AMAAx
# oQKAADANBgkqhkiG9w0BAQEFAASCAYA12ir0FCUD2IvWiy7MqsqAciOsLhTQmvF0
# /jSGXaCIrGUzlySVKbuQ47XFYcT5xIoz7ChRS/OvrKCJ0eTjFHn7osTLM4BrKZFi
# G0CwUrFom77qPC4XQQY238IGPMiJZCza+PbrXun7tGNqJSH4uCybQMnB1XH3W9qy
# o3Mn89gnba36QesJ8wl5no+HvIS0LnylhzvDcqdO3yI/EC22XJ/f/XENtDIRI+nC
# nwFuo22Ez2ElzjpFtn9kLyA0/Z8Q6SEcboSbpu6daFBgFe6Ztfrs2ga8C4BRgale
# NJJjL9EqnuHQmMC6TctFqw6bMdQi1OhJsKOaPrp3jWY1np+1jI7X+jcrJ0ZYrTpN
# iH5tIgbYarDF6D2HA8Tka5vfKtG7Q2KBOWH9PqJIAjwHCr0XaIQa1NbkAYFmsVMH
# 4MJluljUeMx3Gh9wfCx1R9485sNAFp+Triawf20YtgbTXPw+e0gzjE8vGDBht6i7
# +8Yrqbq00uH3Y+JtjWAis+oi8ZrMPDGhghgUMIIYEAYKKwYBBAGCNwMDATGCGAAw
# ghf8BgkqhkiG9w0BBwKgghftMIIX6QIBAzEPMA0GCWCGSAFlAwQCAQUAMIIBYgYL
# KoZIhvcNAQkQAQSgggFRBIIBTTCCAUkCAQEGCisGAQQBhFkKAwEwMTANBglghkgB
# ZQMEAgEFAAQgZFXp7g3d1kespGffVXL6HjYsxYd3bAR3mRUetFkYstkCBmdEaWXk
# sRgTMjAyNDEyMDMyMDE0MzIuODM2WjAEgAIB9KCB4aSB3jCB2zELMAkGA1UEBhMC
# VVMxEzARBgNVBAgTCldhc2hpbmd0b24xEDAOBgNVBAcTB1JlZG1vbmQxHjAcBgNV
# BAoTFU1pY3Jvc29mdCBDb3Jwb3JhdGlvbjElMCMGA1UECxMcTWljcm9zb2Z0IEFt
# ZXJpY2EgT3BlcmF0aW9uczEnMCUGA1UECxMeblNoaWVsZCBUU1MgRVNOOjc4MDAt
# MDVFMC1EOTQ3MTUwMwYDVQQDEyxNaWNyb3NvZnQgUHVibGljIFJTQSBUaW1lIFN0
# YW1waW5nIEF1dGhvcml0eaCCDyEwggeCMIIFaqADAgECAhMzAAAABeXPD/9mLsmH
# AAAAAAAFMA0GCSqGSIb3DQEBDAUAMHcxCzAJBgNVBAYTAlVTMR4wHAYDVQQKExVN
# aWNyb3NvZnQgQ29ycG9yYXRpb24xSDBGBgNVBAMTP01pY3Jvc29mdCBJZGVudGl0
# eSBWZXJpZmljYXRpb24gUm9vdCBDZXJ0aWZpY2F0ZSBBdXRob3JpdHkgMjAyMDAe
# Fw0yMDExMTkyMDMyMzFaFw0zNTExMTkyMDQyMzFaMGExCzAJBgNVBAYTAlVTMR4w
# HAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xMjAwBgNVBAMTKU1pY3Jvc29m
# dCBQdWJsaWMgUlNBIFRpbWVzdGFtcGluZyBDQSAyMDIwMIICIjANBgkqhkiG9w0B
# AQEFAAOCAg8AMIICCgKCAgEAnnznUmP94MWfBX1jtQYioxwe1+eXM9ETBb1lRkd3
# kcFdcG9/sqtDlwxKoVIcaqDb+omFio5DHC4RBcbyQHjXCwMk/l3TOYtgoBjxnG/e
# ViS4sOx8y4gSq8Zg49REAf5huXhIkQRKe3Qxs8Sgp02KHAznEa/Ssah8nWo5hJM1
# xznkRsFPu6rfDHeZeG1Wa1wISvlkpOQooTULFm809Z0ZYlQ8Lp7i5F9YciFlyAKw
# n6yjN/kR4fkquUWfGmMopNq/B8U/pdoZkZZQbxNlqJOiBGgCWpx69uKqKhTPVi3g
# VErnc/qi+dR8A2MiAz0kN0nh7SqINGbmw5OIRC0EsZ31WF3Uxp3GgZwetEKxLms7
# 3KG/Z+MkeuaVDQQheangOEMGJ4pQZH55ngI0Tdy1bi69INBV5Kn2HVJo9XxRYR/J
# PGAaM6xGl57Ei95HUw9NV/uC3yFjrhc087qLJQawSC3xzY/EXzsT4I7sDbxOmM2r
# l4uKK6eEpurRduOQ2hTkmG1hSuWYBunFGNv21Kt4N20AKmbeuSnGnsBCd2cjRKG7
# 9+TX+sTehawOoxfeOO/jR7wo3liwkGdzPJYHgnJ54UxbckF914AqHOiEV7xTnD1a
# 69w/UTxwjEugpIPMIIE67SFZ2PMo27xjlLAHWW3l1CEAFjLNHd3EQ79PUr8FUXet
# Xr0CAwEAAaOCAhswggIXMA4GA1UdDwEB/wQEAwIBhjAQBgkrBgEEAYI3FQEEAwIB
# ADAdBgNVHQ4EFgQUa2koOjUvSGNAz3vYr0npPtk92yEwVAYDVR0gBE0wSzBJBgRV
# HSAAMEEwPwYIKwYBBQUHAgEWM2h0dHA6Ly93d3cubWljcm9zb2Z0LmNvbS9wa2lv
# cHMvRG9jcy9SZXBvc2l0b3J5Lmh0bTATBgNVHSUEDDAKBggrBgEFBQcDCDAZBgkr
# BgEEAYI3FAIEDB4KAFMAdQBiAEMAQTAPBgNVHRMBAf8EBTADAQH/MB8GA1UdIwQY
# MBaAFMh+0mqFKhvKGZgEByfPUBBPaKiiMIGEBgNVHR8EfTB7MHmgd6B1hnNodHRw
# Oi8vd3d3Lm1pY3Jvc29mdC5jb20vcGtpb3BzL2NybC9NaWNyb3NvZnQlMjBJZGVu
# dGl0eSUyMFZlcmlmaWNhdGlvbiUyMFJvb3QlMjBDZXJ0aWZpY2F0ZSUyMEF1dGhv
# cml0eSUyMDIwMjAuY3JsMIGUBggrBgEFBQcBAQSBhzCBhDCBgQYIKwYBBQUHMAKG
# dWh0dHA6Ly93d3cubWljcm9zb2Z0LmNvbS9wa2lvcHMvY2VydHMvTWljcm9zb2Z0
# JTIwSWRlbnRpdHklMjBWZXJpZmljYXRpb24lMjBSb290JTIwQ2VydGlmaWNhdGUl
# MjBBdXRob3JpdHklMjAyMDIwLmNydDANBgkqhkiG9w0BAQwFAAOCAgEAX4h2x35t
# tVoVdedMeGj6TuHYRJklFaW4sTQ5r+k77iB79cSLNe+GzRjv4pVjJviceW6AF6yc
# WoEYR0LYhaa0ozJLU5Yi+LCmcrdovkl53DNt4EXs87KDogYb9eGEndSpZ5ZM74LN
# vVzY0/nPISHz0Xva71QjD4h+8z2XMOZzY7YQ0Psw+etyNZ1CesufU211rLslLKsO
# 8F2aBs2cIo1k+aHOhrw9xw6JCWONNboZ497mwYW5EfN0W3zL5s3ad4Xtm7yFM7Uj
# rhc0aqy3xL7D5FR2J7x9cLWMq7eb0oYioXhqV2tgFqbKHeDick+P8tHYIFovIP7Y
# G4ZkJWag1H91KlELGWi3SLv10o4KGag42pswjybTi4toQcC/irAodDW8HNtX+cbz
# 0sMptFJK+KObAnDFHEsukxD+7jFfEV9Hh/+CSxKRsmnuiovCWIOb+H7DRon9Tlxy
# diFhvu88o0w35JkNbJxTk4MhF/KgaXn0GxdH8elEa2Imq45gaa8D+mTm8LWVydt4
# ytxYP/bqjN49D9NZ81coE6aQWm88TwIf4R4YZbOpMKN0CyejaPNN41LGXHeCUMYm
# Bx3PkP8ADHD1J2Cr/6tjuOOCztfp+o9Nc+ZoIAkpUcA/X2gSMkgHAPUvIdtoSAHE
# UKiBhI6JQivRepyvWcl+JYbYbBh7pmgAXVswggeXMIIFf6ADAgECAhMzAAAAO4pp
# Wb4UBWRxAAAAAAA7MA0GCSqGSIb3DQEBDAUAMGExCzAJBgNVBAYTAlVTMR4wHAYD
# VQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xMjAwBgNVBAMTKU1pY3Jvc29mdCBQ
# dWJsaWMgUlNBIFRpbWVzdGFtcGluZyBDQSAyMDIwMB4XDTI0MDIxNTIwMzYxMloX
# DTI1MDIxNTIwMzYxMlowgdsxCzAJBgNVBAYTAlVTMRMwEQYDVQQIEwpXYXNoaW5n
# dG9uMRAwDgYDVQQHEwdSZWRtb25kMR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9y
# YXRpb24xJTAjBgNVBAsTHE1pY3Jvc29mdCBBbWVyaWNhIE9wZXJhdGlvbnMxJzAl
# BgNVBAsTHm5TaGllbGQgVFNTIEVTTjo3ODAwLTA1RTAtRDk0NzE1MDMGA1UEAxMs
# TWljcm9zb2Z0IFB1YmxpYyBSU0EgVGltZSBTdGFtcGluZyBBdXRob3JpdHkwggIi
# MA0GCSqGSIb3DQEBAQUAA4ICDwAwggIKAoICAQCoN2tZV70ADgeArSKowvN7sD1W
# j9d2dKDzNsSpQZSD3kwUftP9qC4o/eDvvzx/AzPtJpkW5JpDqYKGIk3NSyyWFlY1
# 2loL6mhkRO8K3lLLgZ9wAr68z+1W0NLs0Bd48QUtLfckAiToekndsqKFP28jZOKB
# U43nW2SiLEL1Wo2JUHFW5Crw16Bkms3b8U9etQKcErNDgTbUnxFbc73Dr47el6pp
# sy6ZMFK7aWzryjKZZfJwS1EmgT2CTQ4XY9qj2Fd9y3gSWNlP+XrGyCiPQ3oQ5cdr
# 9Ms59najNa0WxHbR7B8DPIxXRDxCmdQxHw3HL9N8SC017cvwA4hEuBMfix2gC7xi
# DyM+pTkl28BZ1ANnBznEMZs9rbHtKQpyz2bsNO0RYRP+xrIZtWduvwCWEB6k2H5U
# HSYErMUTm2T4VOQeGsjPRFco+t/5spFqPBsUr/774i4Z+fAfD91D1DFgiK5CVZgg
# k1StKFVDfQSKU5YRXI/TaM4bVocAW3S9rVgpQXCcWI/WJEBxYZn6SJ5dE45VlCwy
# C7HEZvCOrtM02rELlCcXbGdICL3FltPh9A2ZsDw0HA6/7NXF3mhyZ37yQ3sprS/M
# glb5ddY3/KL7nyCfehVuQDjFD2S/h7FCkM1tFFOJnHrn+UHaBsWS/LjyKdBLSK26
# D/C6RPbM6m5MqeJQIwIDAQABo4IByzCCAccwHQYDVR0OBBYEFBmDkSnO3Ykx3QWs
# 933wkNmnHPEVMB8GA1UdIwQYMBaAFGtpKDo1L0hjQM972K9J6T7ZPdshMGwGA1Ud
# HwRlMGMwYaBfoF2GW2h0dHA6Ly93d3cubWljcm9zb2Z0LmNvbS9wa2lvcHMvY3Js
# L01pY3Jvc29mdCUyMFB1YmxpYyUyMFJTQSUyMFRpbWVzdGFtcGluZyUyMENBJTIw
# MjAyMC5jcmwweQYIKwYBBQUHAQEEbTBrMGkGCCsGAQUFBzAChl1odHRwOi8vd3d3
# Lm1pY3Jvc29mdC5jb20vcGtpb3BzL2NlcnRzL01pY3Jvc29mdCUyMFB1YmxpYyUy
# MFJTQSUyMFRpbWVzdGFtcGluZyUyMENBJTIwMjAyMC5jcnQwDAYDVR0TAQH/BAIw
# ADAWBgNVHSUBAf8EDDAKBggrBgEFBQcDCDAOBgNVHQ8BAf8EBAMCB4AwZgYDVR0g
# BF8wXTBRBgwrBgEEAYI3TIN9AQEwQTA/BggrBgEFBQcCARYzaHR0cDovL3d3dy5t
# aWNyb3NvZnQuY29tL3BraW9wcy9Eb2NzL1JlcG9zaXRvcnkuaHRtMAgGBmeBDAEE
# AjANBgkqhkiG9w0BAQwFAAOCAgEAKrAu6dFJYu6BKhLMdAxnEMzeKzJOMunyOeCM
# X9VC/meVFudURy3RKZMYUq0YFqQ0BsmfufswGszwfSnqaq116/fomiYokxBDQU/r
# 2u8sXod6NfSaD8/xx/pAFSU28YFYJh46+wdlR30wgf+8uJJMlpZ90fGiZ2crTw0K
# ZJWWSg53MlXTalBP7ZepnoVp9NmcRD9CDw+3IdkjzH1yCnfjbWp0HfBJdv7WJVlc
# nRM45MYqUX1x+5LCeeDnBw2pTj3cDKPNNtNhb8BHRcTJSH84tjVRTtpCtc1XZE5u
# +u0g1tCzLSm7AmR+SZjoClyzinuQuqk/8kx6YRow7Y8wBiZjP5LfriRreaDGpm97
# efzhkwVKcsZsKnw007GhPRQWz52fSgMsRzg6rWx6MRBv3c+kBcefgLVVEI3gguge
# j9NwDXUnmH+DC6ir5NTQ3ZVLhwA2Fjbn+rctcXeozP5g/CS9Qx4C8RpkvyZGvBEB
# DyNFdU9r2HyMvFP/NaUCI0xC7oLde5FONeRFI01itSXk1N7R80JUW7jqRKvy7Ueq
# g6T6PwWfAd/R+vh7oQXhLH98dPJMODz3cdCtw5MeAnfcfUDEE8b6mzJK5iLJbnKY
# IQ+o9T/AcS0A1yCiClaBZBTociaFT5JStvCe7CDzvUWVBY375ezQ+l6M3tTzy63G
# pBDohSMxggdGMIIHQgIBATB4MGExCzAJBgNVBAYTAlVTMR4wHAYDVQQKExVNaWNy
# b3NvZnQgQ29ycG9yYXRpb24xMjAwBgNVBAMTKU1pY3Jvc29mdCBQdWJsaWMgUlNB
# IFRpbWVzdGFtcGluZyBDQSAyMDIwAhMzAAAAO4ppWb4UBWRxAAAAAAA7MA0GCWCG
# SAFlAwQCAQUAoIIEnzARBgsqhkiG9w0BCRACDzECBQAwGgYJKoZIhvcNAQkDMQ0G
# CyqGSIb3DQEJEAEEMBwGCSqGSIb3DQEJBTEPFw0yNDEyMDMyMDE0MzJaMC8GCSqG
# SIb3DQEJBDEiBCBNv25RRNMkG3WBRCPMirWBBK5bDR5n2jrt3NxatbZRRjCBuQYL
# KoZIhvcNAQkQAi8xgakwgaYwgaMwgaAEIJPbJzLEniYkzwpcwDrQSswJJ/yvXnr9
# 1KPiO2/Blq7cMHwwZaRjMGExCzAJBgNVBAYTAlVTMR4wHAYDVQQKExVNaWNyb3Nv
# ZnQgQ29ycG9yYXRpb24xMjAwBgNVBAMTKU1pY3Jvc29mdCBQdWJsaWMgUlNBIFRp
# bWVzdGFtcGluZyBDQSAyMDIwAhMzAAAAO4ppWb4UBWRxAAAAAAA7MIIDYQYLKoZI
# hvcNAQkQAhIxggNQMIIDTKGCA0gwggNEMIICLAIBATCCAQmhgeGkgd4wgdsxCzAJ
# BgNVBAYTAlVTMRMwEQYDVQQIEwpXYXNoaW5ndG9uMRAwDgYDVQQHEwdSZWRtb25k
# MR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xJTAjBgNVBAsTHE1pY3Jv
# c29mdCBBbWVyaWNhIE9wZXJhdGlvbnMxJzAlBgNVBAsTHm5TaGllbGQgVFNTIEVT
# Tjo3ODAwLTA1RTAtRDk0NzE1MDMGA1UEAxMsTWljcm9zb2Z0IFB1YmxpYyBSU0Eg
# VGltZSBTdGFtcGluZyBBdXRob3JpdHmiIwoBATAHBgUrDgMCGgMVAClO3IQMdSPc
# u/l/uN44JqTQiG/3oGcwZaRjMGExCzAJBgNVBAYTAlVTMR4wHAYDVQQKExVNaWNy
# b3NvZnQgQ29ycG9yYXRpb24xMjAwBgNVBAMTKU1pY3Jvc29mdCBQdWJsaWMgUlNB
# IFRpbWVzdGFtcGluZyBDQSAyMDIwMA0GCSqGSIb3DQEBCwUAAgUA6vlz6jAiGA8y
# MDI0MTIwMzEyMTEyMloYDzIwMjQxMjA0MTIxMTIyWjB3MD0GCisGAQQBhFkKBAEx
# LzAtMAoCBQDq+XPqAgEAMAoCAQACAiVbAgH/MAcCAQACAhKYMAoCBQDq+sVqAgEA
# MDYGCisGAQQBhFkKBAIxKDAmMAwGCisGAQQBhFkKAwKgCjAIAgEAAgMHoSChCjAI
# AgEAAgMBhqAwDQYJKoZIhvcNAQELBQADggEBALcfsg+u49h09/2fVfz+1Oa0FPvl
# IbwTVUeWlo8UR47kMKA9YMmJ6keO8V0AZFgm2gpw7makh+I1YuB94yqAdmD927Y+
# W/mDiAT8+42WeSa73vAM0vg0ec6fwS03d+R312Cz0qgZ6MfzqmHOQqSpmxX2hMrF
# fGwmZPaAgUwBovI0Hlv5L5Y4hUOF1nP0OOdkJYpqUJNv7Zf/i7cwJ2cUNHdrakld
# W/2arqSCAC9sduzylh2SYCkvrOgAy/AkAHlazB3iaX1KptwZj8yaK6IfXE9nlMy5
# lMpyDoqyU2nBQaqxcJ2tFAnTqT5LNrDan1XtSTXokbRfZ52wUeNv02NIy7cwDQYJ
# KoZIhvcNAQEBBQAEggIAKSKc+LD6QvnR4RvbNgzgxW9+v7RnkF3ymYnzljaCs++W
# zZVRZxTGF1+ploDbpD9MvJiLgOqO5B5ROh2hfs3GtPBsC8vWek4glRhzja9HCTZT
# WBjsqT7zCj/cemZXBr91bImFKawdxy1PL+YeWYQieQ6f+nMRZzL9HT4qqXTuwrHt
# GbG7yUJvau7AT2jynHPZhQTO8vkHHyMbs2CXxO5D/SyDtqpszYJyCAl+V/jgaZZl
# KGN4Vyfh4fLCaTOGSpoAp1qQBICuHcT+3ZdZpUFbr9hyqBCgAGshqNGwEcjrPGc/
# Sisb0AsTjGAJL9bprQ+94InpaRNdKFj2PwNZ3A02+Re8rWAdcH2vQ/NW7fDSD0aE
# MO26CnK7J+XG1HwVjsqivX/4XS5R1MVIMswcbSUiaouTAN33h5m/vsEMFFG46Qv2
# wJYHr/5QzzqrRsMkuDMqECUBrfJXrIQhwiG8paMMv//91XHniJj8vfB8a1EWQLA3
# +xqOcI9WGdLKzMSmalAOu3LE9PaVdeESNA8pp7EzmbgkzgkJ/uL/+c8ZQvxpXx6K
# DA4vpw2utiUATCPB1J8R+TXkLiSoIZUNNPzF3OndLm91Teziwgd9X2sWrfaBJh5r
# 70oKaRn7atZnxen4l5eUcas0WvsgM0Pb55UfhlPZeX1YC1bX2jpoYUblRMc6bCY=
# SIG # End signature block
```