# mcp_server_webcrawl
This is page 5 of 35. Use http://codebase.md/pragmar/mcp_server_webcrawl?lines=true&page={x} to view the full context.

# Directory Structure

```
├── .gitignore
├── CONTRIBUTING.md
├── docs
│   ├── _images
│   │   ├── interactive.document.webp
│   │   ├── interactive.search.webp
│   │   └── mcpswc.svg
│   ├── _modules
│   │   ├── index.html
│   │   ├── mcp_server_webcrawl
│   │   │   ├── crawlers
│   │   │   │   ├── archivebox
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── base
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── api.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   ├── indexed.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── httrack
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── interrobot
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── katana
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── siteone
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── warc
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   └── wget
│   │   │   │       ├── adapter.html
│   │   │   │       ├── crawler.html
│   │   │   │       └── tests.html
│   │   │   ├── crawlers.html
│   │   │   ├── extras
│   │   │   │   ├── markdown.html
│   │   │   │   ├── regex.html
│   │   │   │   ├── snippets.html
│   │   │   │   ├── thumbnails.html
│   │   │   │   └── xpath.html
│   │   │   ├── interactive
│   │   │   │   ├── highlights.html
│   │   │   │   ├── search.html
│   │   │   │   ├── session.html
│   │   │   │   └── ui.html
│   │   │   ├── main.html
│   │   │   ├── models
│   │   │   │   ├── resources.html
│   │   │   │   └── sites.html
│   │   │   ├── templates
│   │   │   │   └── tests.html
│   │   │   ├── utils
│   │   │   │   ├── blobs.html
│   │   │   │   ├── cli.html
│   │   │   │   ├── logger.html
│   │   │   │   ├── querycache.html
│   │   │   │   ├── server.html
│   │   │   │   └── tools.html
│   │   │   └── utils.html
│   │   └── re.html
│   ├── _sources
│   │   ├── guides
│   │   │   ├── archivebox.rst.txt
│   │   │   ├── httrack.rst.txt
│   │   │   ├── interrobot.rst.txt
│   │   │   ├── katana.rst.txt
│   │   │   ├── siteone.rst.txt
│   │   │   ├── warc.rst.txt
│   │   │   └── wget.rst.txt
│   │   ├── guides.rst.txt
│   │   ├── index.rst.txt
│   │   ├── installation.rst.txt
│   │   ├── interactive.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.archivebox.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.base.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.httrack.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.interrobot.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.katana.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.siteone.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.warc.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.wget.rst.txt
│   │   ├── mcp_server_webcrawl.extras.rst.txt
│   │   ├── mcp_server_webcrawl.interactive.rst.txt
│   │   ├── mcp_server_webcrawl.models.rst.txt
│   │   ├── mcp_server_webcrawl.rst.txt
│   │   ├── mcp_server_webcrawl.templates.rst.txt
│   │   ├── mcp_server_webcrawl.utils.rst.txt
│   │   ├── modules.rst.txt
│   │   ├── prompts.rst.txt
│   │   └── usage.rst.txt
│   ├── _static
│   │   ├── _sphinx_javascript_frameworks_compat.js
│   │   ├── basic.css
│   │   ├── css
│   │   │   ├── badge_only.css
│   │   │   ├── fonts
│   │   │   │   ├── fontawesome-webfont.eot
│   │   │   │   ├── fontawesome-webfont.svg
│   │   │   │   ├── fontawesome-webfont.ttf
│   │   │   │   ├── fontawesome-webfont.woff
│   │   │   │   ├── fontawesome-webfont.woff2
│   │   │   │   ├── lato-bold-italic.woff
│   │   │   │   ├── lato-bold-italic.woff2
│   │   │   │   ├── lato-bold.woff
│   │   │   │   ├── lato-bold.woff2
│   │   │   │   ├── lato-normal-italic.woff
│   │   │   │   ├── lato-normal-italic.woff2
│   │   │   │   ├── lato-normal.woff
│   │   │   │   ├── lato-normal.woff2
│   │   │   │   ├── Roboto-Slab-Bold.woff
│   │   │   │   ├── Roboto-Slab-Bold.woff2
│   │   │   │   ├── Roboto-Slab-Regular.woff
│   │   │   │   └── Roboto-Slab-Regular.woff2
│   │   │   └── theme.css
│   │   ├── doctools.js
│   │   ├── documentation_options.js
│   │   ├── file.png
│   │   ├── fonts
│   │   │   ├── Lato
│   │   │   │   ├── lato-bold.eot
│   │   │   │   ├── lato-bold.ttf
│   │   │   │   ├── lato-bold.woff
│   │   │   │   ├── lato-bold.woff2
│   │   │   │   ├── lato-bolditalic.eot
│   │   │   │   ├── lato-bolditalic.ttf
│   │   │   │   ├── lato-bolditalic.woff
│   │   │   │   ├── lato-bolditalic.woff2
│   │   │   │   ├── lato-italic.eot
│   │   │   │   ├── lato-italic.ttf
│   │   │   │   ├── lato-italic.woff
│   │   │   │   ├── lato-italic.woff2
│   │   │   │   ├── lato-regular.eot
│   │   │   │   ├── lato-regular.ttf
│   │   │   │   ├── lato-regular.woff
│   │   │   │   └── lato-regular.woff2
│   │   │   └── RobotoSlab
│   │   │       ├── roboto-slab-v7-bold.eot
│   │   │       ├── roboto-slab-v7-bold.ttf
│   │   │       ├── roboto-slab-v7-bold.woff
│   │   │       ├── roboto-slab-v7-bold.woff2
│   │   │       ├── roboto-slab-v7-regular.eot
│   │   │       ├── roboto-slab-v7-regular.ttf
│   │   │       ├── roboto-slab-v7-regular.woff
│   │   │       └── roboto-slab-v7-regular.woff2
│   │   ├── images
│   │   │   ├── interactive.document.png
│   │   │   ├── interactive.document.webp
│   │   │   ├── interactive.search.png
│   │   │   ├── interactive.search.webp
│   │   │   └── mcpswc.svg
│   │   ├── jquery.js
│   │   ├── js
│   │   │   ├── badge_only.js
│   │   │   ├── theme.js
│   │   │   └── versions.js
│   │   ├── language_data.js
│   │   ├── minus.png
│   │   ├── plus.png
│   │   ├── pygments.css
│   │   ├── searchtools.js
│   │   └── sphinx_highlight.js
│   ├── .buildinfo
│   ├── .nojekyll
│   ├── genindex.html
│   ├── guides
│   │   ├── archivebox.html
│   │   ├── httrack.html
│   │   ├── interrobot.html
│   │   ├── katana.html
│   │   ├── siteone.html
│   │   ├── warc.html
│   │   └── wget.html
│   ├── guides.html
│   ├── index.html
│   ├── installation.html
│   ├── interactive.html
│   ├── mcp_server_webcrawl.crawlers.archivebox.html
│   ├── mcp_server_webcrawl.crawlers.base.html
│   ├── mcp_server_webcrawl.crawlers.html
│   ├── mcp_server_webcrawl.crawlers.httrack.html
│   ├── mcp_server_webcrawl.crawlers.interrobot.html
│   ├── mcp_server_webcrawl.crawlers.katana.html
│   ├── mcp_server_webcrawl.crawlers.siteone.html
│   ├── mcp_server_webcrawl.crawlers.warc.html
│   ├── mcp_server_webcrawl.crawlers.wget.html
│   ├── mcp_server_webcrawl.extras.html
│   ├── mcp_server_webcrawl.html
│   ├── mcp_server_webcrawl.interactive.html
│   ├── mcp_server_webcrawl.models.html
│   ├── mcp_server_webcrawl.templates.html
│   ├── mcp_server_webcrawl.utils.html
│   ├── modules.html
│   ├── objects.inv
│   ├── prompts.html
│   ├── py-modindex.html
│   ├── search.html
│   ├── searchindex.js
│   └── usage.html
├── LICENSE
├── MANIFEST.in
├── prompts
│   ├── audit404.md
│   ├── auditfiles.md
│   ├── auditperf.md
│   ├── auditseo.md
│   ├── gopher.md
│   ├── README.md
│   └── testsearch.md
├── pyproject.toml
├── README.md
├── setup.py
├── sphinx
│   ├── _static
│   │   └── images
│   │       ├── interactive.document.png
│   │       ├── interactive.document.webp
│   │       ├── interactive.search.png
│   │       ├── interactive.search.webp
│   │       └── mcpswc.svg
│   ├── _templates
│   │   └── layout.html
│   ├── conf.py
│   ├── guides
│   │   ├── archivebox.rst
│   │   ├── httrack.rst
│   │   ├── interrobot.rst
│   │   ├── katana.rst
│   │   ├── siteone.rst
│   │   ├── warc.rst
│   │   └── wget.rst
│   ├── guides.rst
│   ├── index.rst
│   ├── installation.rst
│   ├── interactive.rst
│   ├── make.bat
│   ├── Makefile
│   ├── mcp_server_webcrawl.crawlers.archivebox.rst
│   ├── mcp_server_webcrawl.crawlers.base.rst
│   ├── mcp_server_webcrawl.crawlers.httrack.rst
│   ├── mcp_server_webcrawl.crawlers.interrobot.rst
│   ├── mcp_server_webcrawl.crawlers.katana.rst
│   ├── mcp_server_webcrawl.crawlers.rst
│   ├── mcp_server_webcrawl.crawlers.siteone.rst
│   ├── mcp_server_webcrawl.crawlers.warc.rst
│   ├── mcp_server_webcrawl.crawlers.wget.rst
│   ├── mcp_server_webcrawl.extras.rst
│   ├── mcp_server_webcrawl.interactive.rst
│   ├── mcp_server_webcrawl.models.rst
│   ├── mcp_server_webcrawl.rst
│   ├── mcp_server_webcrawl.templates.rst
│   ├── mcp_server_webcrawl.utils.rst
│   ├── modules.rst
│   ├── prompts.rst
│   ├── readme.txt
│   └── usage.rst
└── src
    └── mcp_server_webcrawl
        ├── __init__.py
        ├── crawlers
        │   ├── __init__.py
        │   ├── archivebox
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   ├── base
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── api.py
        │   │   ├── crawler.py
        │   │   ├── indexed.py
        │   │   └── tests.py
        │   ├── httrack
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   ├── interrobot
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   ├── katana
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   ├── siteone
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   ├── warc
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   └── wget
        │       ├── __init__.py
        │       ├── adapter.py
        │       ├── crawler.py
        │       └── tests.py
        ├── extras
        │   ├── __init__.py
        │   ├── markdown.py
        │   ├── regex.py
        │   ├── snippets.py
        │   ├── thumbnails.py
        │   └── xpath.py
        ├── interactive
        │   ├── __init__.py
        │   ├── highlights.py
        │   ├── search.py
        │   ├── session.py
        │   ├── ui.py
        │   └── views
        │       ├── base.py
        │       ├── document.py
        │       ├── help.py
        │       ├── requirements.py
        │       ├── results.py
        │       └── searchform.py
        ├── main.py
        ├── models
        │   ├── __init__.py
        │   ├── base.py
        │   ├── resources.py
        │   └── sites.py
        ├── settings.py
        ├── templates
        │   ├── __init__.py
        │   ├── markdown.xslt
        │   ├── tests_core.html
        │   └── tests.py
        └── utils
            ├── __init__.py
            ├── cli.py
            ├── logger.py
            ├── parser.py
            ├── parsetab.py
            ├── search.py
            ├── server.py
            ├── tests.py
            └── tools.py
```

# Files

--------------------------------------------------------------------------------
/docs/mcp_server_webcrawl.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="./">
  5 | <head>
  6 |   <meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
  7 | 
  8 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  9 |   <title>mcp_server_webcrawl package &mdash; mcp-server-webcrawl  documentation</title>
 10 |       <link rel="stylesheet" type="text/css" href="_static/pygments.css?v=80d5e7a1" />
 11 |       <link rel="stylesheet" type="text/css" href="_static/css/theme.css?v=e59714d7" />
 12 | 
 13 |   
 14 |       <script src="_static/jquery.js?v=5d32c60e"></script>
 15 |       <script src="_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 16 |       <script src="_static/documentation_options.js?v=5929fcd5"></script>
 17 |       <script src="_static/doctools.js?v=888ff710"></script>
 18 |       <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 19 |     <script src="_static/js/theme.js"></script>
 20 |     <link rel="index" title="Index" href="genindex.html" />
 21 |     <link rel="search" title="Search" href="search.html" />
 22 |     <link rel="next" title="mcp_server_webcrawl.crawlers package" href="mcp_server_webcrawl.crawlers.html" />
 23 |     <link rel="prev" title="mcp_server_webcrawl" href="modules.html" /> 
 24 | </head>
 25 | 
 26 | <body class="wy-body-for-nav"> 
 27 |   <div class="wy-grid-for-nav">
 28 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 29 |       <div class="wy-side-scroll">
 30 |         <div class="wy-side-nav-search" >
 31 | 
 32 |           
 33 |           
 34 |           <a href="index.html" class="icon icon-home">
 35 |             mcp-server-webcrawl
 36 |           </a>
 37 | <div role="search">
 38 |   <form id="rtd-search-form" class="wy-form" action="search.html" method="get">
 39 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 40 |     <input type="hidden" name="check_keywords" value="yes" />
 41 |     <input type="hidden" name="area" value="default" />
 42 |   </form>
 43 | </div>
 44 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 45 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 46 | <ul class="current">
 47 | <li class="toctree-l1"><a class="reference internal" href="installation.html">Installation</a></li>
 48 | <li class="toctree-l1"><a class="reference internal" href="guides.html">Setup Guides</a></li>
 49 | <li class="toctree-l1"><a class="reference internal" href="usage.html">Usage</a></li>
 50 | <li class="toctree-l1"><a class="reference internal" href="prompts.html">Prompt Routines</a></li>
 51 | <li class="toctree-l1"><a class="reference internal" href="interactive.html">Interactive Mode</a></li>
 52 | <li class="toctree-l1 current"><a class="reference internal" href="modules.html">mcp_server_webcrawl</a><ul class="current">
 53 | <li class="toctree-l2 current"><a class="current reference internal" href="#">mcp_server_webcrawl package</a></li>
 54 | </ul>
 55 | </li>
 56 | </ul>
 57 | 
 58 |         </div>
 59 |       </div>
 60 |     </nav>
 61 | 
 62 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 63 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 64 |           <a href="index.html">mcp-server-webcrawl</a>
 65 |       </nav>
 66 | 
 67 |       <div class="wy-nav-content">
 68 |         <div class="rst-content">
 69 |           <div role="navigation" aria-label="Page navigation">
 70 |   <ul class="wy-breadcrumbs">
 71 |       <li><a href="index.html" class="icon icon-home" aria-label="Home"></a></li>
 72 |           <li class="breadcrumb-item"><a href="modules.html">mcp_server_webcrawl</a></li>
 73 |       <li class="breadcrumb-item active">mcp_server_webcrawl package</li>
 74 |       <li class="wy-breadcrumbs-aside">
 75 |             <a href="_sources/mcp_server_webcrawl.rst.txt" rel="nofollow"> View page source</a>
 76 |       </li>
 77 |   </ul>
 78 |   <hr/>
 79 | </div>
 80 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 81 |            <div itemprop="articleBody">
 82 |              
 83 |   <section id="mcp-server-webcrawl-package">
 84 | <h1>mcp_server_webcrawl package<a class="headerlink" href="#mcp-server-webcrawl-package" title="Link to this heading"></a></h1>
 85 | <section id="subpackages">
 86 | <h2>Subpackages<a class="headerlink" href="#subpackages" title="Link to this heading"></a></h2>
 87 | <div class="toctree-wrapper compound">
 88 | <ul>
 89 | <li class="toctree-l1"><a class="reference internal" href="mcp_server_webcrawl.crawlers.html">mcp_server_webcrawl.crawlers package</a><ul>
 90 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.crawlers.html#subpackages">Subpackages</a></li>
 91 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.crawlers.html#module-mcp_server_webcrawl.crawlers">Module contents</a></li>
 92 | </ul>
 93 | </li>
 94 | <li class="toctree-l1"><a class="reference internal" href="mcp_server_webcrawl.extras.html">mcp_server_webcrawl.extras package</a><ul>
 95 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.extras.html#submodules">Submodules</a></li>
 96 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.extras.html#module-mcp_server_webcrawl.extras.markdown">mcp_server_webcrawl.extras.markdown module</a></li>
 97 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.extras.html#module-mcp_server_webcrawl.extras.regex">mcp_server_webcrawl.extras.regex module</a></li>
 98 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.extras.html#module-mcp_server_webcrawl.extras.snippets">mcp_server_webcrawl.extras.snippets module</a></li>
 99 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.extras.html#module-mcp_server_webcrawl.extras.thumbnails">mcp_server_webcrawl.extras.thumbnails module</a></li>
100 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.extras.html#module-mcp_server_webcrawl.extras.xpath">mcp_server_webcrawl.extras.xpath module</a></li>
101 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.extras.html#module-mcp_server_webcrawl.extras">Module contents</a></li>
102 | </ul>
103 | </li>
104 | <li class="toctree-l1"><a class="reference internal" href="mcp_server_webcrawl.interactive.html">mcp_server_webcrawl.interactive package</a><ul>
105 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.interactive.html#submodules">Submodules</a></li>
106 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.interactive.html#module-mcp_server_webcrawl.interactive.highlights">mcp_server_webcrawl.interactive.highlights module</a></li>
107 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.interactive.html#module-mcp_server_webcrawl.interactive.search">mcp_server_webcrawl.interactive.search module</a></li>
108 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.interactive.html#module-mcp_server_webcrawl.interactive.session">mcp_server_webcrawl.interactive.session module</a></li>
109 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.interactive.html#module-mcp_server_webcrawl.interactive.ui">mcp_server_webcrawl.interactive.ui module</a></li>
110 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.interactive.html#module-mcp_server_webcrawl.interactive">Module contents</a></li>
111 | </ul>
112 | </li>
113 | <li class="toctree-l1"><a class="reference internal" href="mcp_server_webcrawl.models.html">mcp_server_webcrawl.models package</a><ul>
114 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.models.html#submodules">Submodules</a></li>
115 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.models.html#module-mcp_server_webcrawl.models.resources">mcp_server_webcrawl.models.resources module</a></li>
116 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.models.html#module-mcp_server_webcrawl.models.sites">mcp_server_webcrawl.models.sites module</a></li>
117 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.models.html#module-mcp_server_webcrawl.models">Module contents</a></li>
118 | </ul>
119 | </li>
120 | <li class="toctree-l1"><a class="reference internal" href="mcp_server_webcrawl.templates.html">mcp_server_webcrawl.templates package</a><ul>
121 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.templates.html#submodules">Submodules</a></li>
122 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.templates.html#module-mcp_server_webcrawl.templates.tests">mcp_server_webcrawl.templates.tests module</a></li>
123 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.templates.html#module-mcp_server_webcrawl.templates">Module contents</a></li>
124 | </ul>
125 | </li>
126 | <li class="toctree-l1"><a class="reference internal" href="mcp_server_webcrawl.utils.html">mcp_server_webcrawl.utils package</a><ul>
127 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.utils.html#submodules">Submodules</a></li>
128 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.utils.html#module-mcp_server_webcrawl.utils.cli">mcp_server_webcrawl.utils.cli module</a></li>
129 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.utils.html#module-mcp_server_webcrawl.utils.logger">mcp_server_webcrawl.utils.logger module</a></li>
130 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.utils.html#module-mcp_server_webcrawl.utils.server">mcp_server_webcrawl.utils.server module</a></li>
131 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.utils.html#module-mcp_server_webcrawl.utils.tools">mcp_server_webcrawl.utils.tools module</a></li>
132 | <li class="toctree-l2"><a class="reference internal" href="mcp_server_webcrawl.utils.html#module-mcp_server_webcrawl.utils">Module contents</a></li>
133 | </ul>
134 | </li>
135 | </ul>
136 | </div>
137 | </section>
138 | <section id="submodules">
139 | <h2>Submodules<a class="headerlink" href="#submodules" title="Link to this heading"></a></h2>
140 | </section>
141 | <section id="mcp-server-webcrawl-main-module">
142 | <h2>mcp_server_webcrawl.main module<a class="headerlink" href="#mcp-server-webcrawl-main-module" title="Link to this heading"></a></h2>
143 | <dl class="py function">
144 | <dt class="sig sig-object py">
145 | <em class="property"><span class="k"><span class="pre">async</span></span><span class="w"> </span></em><span class="sig-name descname"><span class="pre">main</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">crawler</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">datasrc</span></span></em><span class="sig-paren">)</span><a class="reference internal" href="_modules/mcp_server_webcrawl/main.html#main"><span class="viewcode-link"><span class="pre">[source]</span></span></a></dt>
146 | <dd><dl class="field-list simple">
147 | <dt class="field-odd">Parameters<span class="colon">:</span></dt>
148 | <dd class="field-odd"><ul class="simple">
149 | <li><p><strong>crawler</strong> (<a class="reference internal" href="mcp_server_webcrawl.crawlers.base.html#mcp_server_webcrawl.crawlers.base.crawler.BaseCrawler" title="mcp_server_webcrawl.crawlers.base.crawler.BaseCrawler"><em>BaseCrawler</em></a>) – </p></li>
150 | <li><p><strong>datasrc</strong> (<a class="reference external" href="https://docs.python.org/3/library/pathlib.html#pathlib.Path" title="(in Python v3.14)"><em>Path</em></a>) – </p></li>
151 | </ul>
152 | </dd>
153 | </dl>
154 | </dd></dl>
155 | 
156 | </section>
157 | <section id="module-mcp_server_webcrawl.settings">
158 | <span id="mcp-server-webcrawl-settings-module"></span><h2>mcp_server_webcrawl.settings module<a class="headerlink" href="#module-mcp_server_webcrawl.settings" title="Link to this heading"></a></h2>
159 | </section>
160 | <section id="module-mcp_server_webcrawl.settings_local">
161 | <span id="mcp-server-webcrawl-settings-local-module"></span><h2>mcp_server_webcrawl.settings_local module<a class="headerlink" href="#module-mcp_server_webcrawl.settings_local" title="Link to this heading"></a></h2>
162 | </section>
163 | <section id="module-mcp_server_webcrawl">
164 | <span id="module-contents"></span><h2>Module contents<a class="headerlink" href="#module-mcp_server_webcrawl" title="Link to this heading"></a></h2>
165 | </section>
166 | </section>
167 | 
168 | 
169 |            </div>
170 |           </div>
171 |           <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
172 |         <a href="modules.html" class="btn btn-neutral float-left" title="mcp_server_webcrawl" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
173 |         <a href="mcp_server_webcrawl.crawlers.html" class="btn btn-neutral float-right" title="mcp_server_webcrawl.crawlers package" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
174 |     </div>
175 | 
176 |   <hr/>
177 | 
178 |   <div role="contentinfo">
179 |     <p>&#169; Copyright 2025, pragmar.</p>
180 |   </div>
181 | 
182 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
183 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
184 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
185 |    
186 | 
187 | </footer>
188 |         </div>
189 |       </div>
190 |     </section>
191 |   </div>
192 |   <script>
193 |       jQuery(function () {
194 |           SphinxRtdTheme.Navigation.enable(true);
195 |       });
196 |   </script> 
197 | 
198 | </body>
199 | </html>
```
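
The page above boils down to one public entry point: `mcp_server_webcrawl.main.main(crawler, datasrc)`, an async function taking a `BaseCrawler` instance and a `pathlib.Path` to the data source. A minimal sketch of wiring it up from a script might look like the following; the `WgetCrawler` class name and its `WgetCrawler(datasrc)` constructor are assumptions inferred from the package layout, not confirmed by this page.

```python
# Hypothetical sketch of calling the documented entry point main(crawler, datasrc).
# WgetCrawler and its constructor signature are assumptions; only the module paths
# and the main() parameters come from the rendered documentation above.
import asyncio
from pathlib import Path

from mcp_server_webcrawl.main import main
from mcp_server_webcrawl.crawlers.wget.crawler import WgetCrawler

async def run(datasrc: Path) -> None:
    crawler = WgetCrawler(datasrc)   # assumed: concrete crawler built from the data source
    await main(crawler, datasrc)     # documented signature: main(crawler, datasrc)

if __name__ == "__main__":
    asyncio.run(run(Path("/path/to/wget/archives/")))
```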

--------------------------------------------------------------------------------
/src/mcp_server_webcrawl/crawlers/base/indexed.py:
--------------------------------------------------------------------------------

```python
  1 | import sqlite3
  2 | import traceback
  3 | 
  4 | from datetime import datetime
  5 | from contextlib import closing, contextmanager
  6 | from pathlib import Path
  7 | from typing import Callable
  8 | from mcp.types import Tool
  9 | 
 10 | from mcp_server_webcrawl.crawlers.base.adapter import (
 11 |     BaseManager,
 12 |     IndexState,
 13 |     IndexStatus,
 14 |     SitesGroup,
 15 |     SitesStat,
 16 |     INDEXED_MANAGER_CACHE_MAX,
 17 |     INDEXED_RESOURCE_DEFAULT_PROTOCOL,
 18 |     INDEXED_IGNORE_DIRECTORIES,
 19 | )
 20 | from mcp_server_webcrawl.crawlers.base.crawler import BaseCrawler
 21 | from mcp_server_webcrawl.models.resources import (
 22 |     ResourceResult,
 23 |     ResourceResultType,
 24 |     RESOURCES_DEFAULT_FIELD_MAPPING,
 25 | )
 26 | from mcp_server_webcrawl.models.sites import (
 27 |     SiteResult,
 28 |     SiteType,
 29 |     SITES_FIELDS_BASE,
 30 |     SITES_FIELDS_DEFAULT,
 31 | )
 32 | from mcp_server_webcrawl.utils import to_isoformat_zulu
 33 | from mcp_server_webcrawl.utils.logger import get_logger
 34 | from mcp_server_webcrawl.utils.tools import get_crawler_tools
 35 | 
 36 | logger = get_logger()
 37 | 
 38 | class IndexedManager(BaseManager):
 39 | 
 40 |     def __init__(self):
 41 |         super().__init__()
 42 |         self._db_cache: dict[frozenset, tuple[sqlite3.Connection, IndexState]] = {}
 43 |         self._build_locks: dict[frozenset, tuple[datetime, str]] = {}
 44 | 
 45 |     def get_connection(self, group: SitesGroup) -> tuple[sqlite3.Connection | None, IndexState]:
 46 |         """
 47 |         Get database connection for sites in the group, creating if needed.
 48 | 
 49 |         Args:
 50 |             group: group of sites to connect to
 51 | 
 52 |         Returns:
 53 |             Tuple of (SQLite connection to in-memory database with data loaded or None if building,
 54 |                      IndexState associated with this database)
 55 |         """
 56 |         if group.cache_key in self._build_locks:
 57 |             build_time, status = self._build_locks[group.cache_key]
 58 |             get_logger().info(f"Database for {group} is currently {status} (started at {build_time})")
 59 |             return None, IndexState()  # Return empty IndexState for building databases
 60 | 
 61 |         if len(self._db_cache) >= INDEXED_MANAGER_CACHE_MAX:
 62 |             logger.warning(f"Cache limit reached ({INDEXED_MANAGER_CACHE_MAX}), clearing all cached databases")
 63 |             self._db_cache.clear()
 64 | 
 65 |         is_cached: bool = group.cache_key in self._db_cache
 66 |         self._stats.append(SitesStat(group, is_cached))
 67 | 
 68 |         if not is_cached:
 69 |             index_state = IndexState()
 70 |             index_state.set_status(IndexStatus.INDEXING)
 71 |             with self._building_lock(group):
 72 |                 connection: sqlite3.Connection = sqlite3.connect(":memory:", check_same_thread=False)
 73 |                 self._setup_database(connection)
 74 |                 for site_id, site_path in group.get_sites().items():
 75 |                     self._load_site_data(connection, Path(site_path), site_id, index_state=index_state)
 76 |                     if index_state.is_timeout():
 77 |                         index_state.set_status(IndexStatus.PARTIAL)
 78 |                         break
 79 |                 if index_state is not None and index_state.status == IndexStatus.INDEXING:
 80 |                     index_state.set_status(IndexStatus.COMPLETE)
 81 |                 self._db_cache[group.cache_key] = (connection, index_state)
 82 | 
 83 |         # returns cached or newly created connection with IndexState
 84 |         connection, index_state = self._db_cache[group.cache_key]
 85 |         return connection, index_state
 86 | 
 87 |     def get_sites_for_directories(
 88 |         self,
 89 |         datasrc: Path,
 90 |         ids: list[int] | None = None,
 91 |         fields: list[str] | None = None
 92 |     ) -> list[SiteResult]:
 93 |         """
 94 |         List site directories in the datasrc directory as sites.
 95 | 
 96 |         Args:
 97 |             datasrc: path to the directory containing site subdirectories
 98 |             ids: optional list of site IDs to filter by
 99 |             fields: optional list of fields to include in the response
100 | 
101 |         Returns:
102 |             List of SiteResult objects, one for each site directory
103 | 
104 |         Notes:
105 |             Returns an empty list if the datasrc directory doesn't exist.
106 |         """
107 |         assert datasrc is not None, f"datasrc not provided ({datasrc})"
108 | 
109 |         if not datasrc.exists():
110 |             logger.error(f"Directory not found ({datasrc})")
111 |             return []
112 | 
113 |         # determine which fields to include
114 |         select_fields: set[str] = set(SITES_FIELDS_BASE)
115 |         if fields:
116 |             valid_fields: set[str] = set(SITES_FIELDS_DEFAULT)
117 |             select_fields.update(f for f in fields if f in valid_fields)
118 |         else:
119 |             select_fields.update(SITES_FIELDS_DEFAULT)
120 | 
121 |         results: list[SiteResult] = []
122 | 
123 |         # get all directories that contain HTTP text files
124 |         site_directories = [d for d in datasrc.iterdir() if d.is_dir() and
125 |             not d.name.startswith(".") and not d.name in INDEXED_IGNORE_DIRECTORIES]
126 | 
127 |         # map directory IDs to paths for filtering
128 |         site_directories_map: dict[int, Path] = {BaseManager.string_to_id(d.name): d for d in site_directories}
129 | 
130 |         if ids:
131 |             site_directories_map = {id_val: path for id_val, path in site_directories_map.items() if id_val in ids}
132 | 
133 |         # process each directory
134 |         for site_id, site_directory in sorted(site_directories_map.items()):
135 |             site_directory_stat = site_directory.stat()
136 |             created_time: datetime = datetime.fromtimestamp(site_directory_stat.st_ctime)
137 |             modified_time: datetime = datetime.fromtimestamp(site_directory_stat.st_mtime)
138 | 
139 |             # check for robots.txt
140 |             robots_content = None
141 |             robots_files = list(site_directory.glob("*robots.txt*"))
142 |             if robots_files:
143 |                 try:
144 |                     with open(robots_files[0], "r", encoding="utf-8", errors="replace") as f:
145 |                         # for robots.txt files in our format, extract only the content part
146 |                         content = f.read()
147 |                         parts = content.split("\n\n", 2)
148 |                         if len(parts) >= 3:
149 |                             response_parts = parts[2].split("\n\n", 1)
150 |                             if len(response_parts) > 1:
151 |                                 robots_content = response_parts[1]
152 |                             else:
153 |                                 robots_content = response_parts[0]
154 |                         else:
155 |                             robots_content = content
156 |                 except Exception as ex:
157 |                     logger.error(f"Error reading robots.txt ({robots_files[0]}): {ex}")
158 | 
159 |             site = SiteResult(
160 |                 path=site_directory,
161 |                 id=site_id,
162 |                 name=site_directory.name,  # NEW: directory name
163 |                 type=SiteType.CRAWLED_URL,  # NEW: always single-site crawls
164 |                 urls=[f"{INDEXED_RESOURCE_DEFAULT_PROTOCOL}{site_directory.name}/"],  # CHANGED: now a list
165 |                 created=created_time if "created" in select_fields else None,
166 |                 modified=modified_time if "modified" in select_fields else None,
167 |                 robots=robots_content,
168 |                 metadata=None,
169 |             )
170 | 
171 |             results.append(site)
172 |         return results
173 | 
174 |     @contextmanager
175 |     def _building_lock(self, group: SitesGroup):
176 |         """
177 |         Context manager for database building operations.
178 |         Sets a lock during database building and releases it when done.
179 | 
180 |         Args:
181 |             group: SitesGroup to set the build lock for
182 |         """
183 |         try:
184 |             self._build_locks[group.cache_key] = (datetime.now(), "building")
185 |             yield
186 |         except Exception as ex:
187 |             self._build_locks[group.cache_key] = (self._build_locks[group.cache_key][0], f"failed: {ex}")
188 |             raise # re-raise
189 |         finally:
190 |             # clean up the lock
191 |             self._build_locks.pop(group.cache_key, None)
192 | 
193 |     def _setup_database(self, connection: sqlite3.Connection) -> None:
194 |         """
195 |         Create the database schema for storing resource data.
196 | 
197 |         Args:
198 |             connection: SQLite connection to set up
199 |         """
200 |         # store project/site (also) in fulltext; fts doesn't support >= <=,
201 |         # and pure fts search is much faster, so only introduce Resources
202 |         # table sql clauses when a field is explicitly queried (Status,
203 |         # Size, or Time)
204 |         with closing(connection.cursor()) as cursor:
205 |             connection.execute("PRAGMA encoding = \"UTF-8\"")
206 |             connection.execute("PRAGMA synchronous = OFF")
207 |             connection.execute("PRAGMA journal_mode = MEMORY")
208 |             cursor.execute("""
209 |             CREATE TABLE Resources (
210 |                 Id INTEGER PRIMARY KEY,
211 |                 Project INTEGER NOT NULL,
212 |                 Created TEXT,
213 |                 Modified TEXT,
214 |                 Status INTEGER NOT NULL,
215 |                 Size INTEGER NOT NULL,
216 |                 Time INTEGER NOT NULL
217 |             )""")
218 |             cursor.execute("""
219 |             CREATE VIRTUAL TABLE ResourcesFullText USING fts5(
220 |                 Id,
221 |                 Project,
222 |                 Url,
223 |                 Type,
224 |                 Headers,
225 |                 Content,
226 |                 tokenize="unicode61 remove_diacritics 0 tokenchars '-_'"
227 |             )""")
228 | 
229 |     def _execute_batch_insert(self, connection: sqlite3.Connection, cursor: sqlite3.Cursor,
230 |         batch_records: list[ResourceResult]) -> None:
231 |         """
232 |         Execute batch insert of records with transaction handling.
233 |         Inserts data into both ResourcesFullText and Resources tables.
234 | 
235 |         Args:
236 |             connection: SQLite connection
237 |             cursor: SQLite cursor
238 |             batch_records: list of ResourceResult objects ready for insertion
239 |         """
240 |         if not batch_records:
241 |             return
242 | 
243 |         resources_base_records = []
244 |         resources_fts_records = []
245 |         for resource in batch_records:
246 |             resources_base_records.append((
247 |                 resource.id,
248 |                 resource.site,
249 |                 to_isoformat_zulu(resource.created) if resource.created else None,
250 |                 to_isoformat_zulu(resource.modified) if resource.modified else None,
251 |                 resource.status,
252 |                 resource.size if resource.size is not None else 0,
253 |                 resource.time if resource.time is not None else 0,
254 |             ))
255 |             resources_fts_records.append((
256 |                 resource.id,
257 |                 resource.site,
258 |                 resource.url,
259 |                 resource.type.value if resource.type else ResourceResultType.UNDEFINED.value,
260 |                 resource.headers,
261 |                 resource.content,
262 |             ))
263 | 
264 |         try:
265 |             connection.execute("BEGIN TRANSACTION")
266 |             cursor.executemany("""
267 |                 INSERT INTO Resources (
268 |                     Id, Project, Created, Modified, Status, Size, Time
269 |                 ) VALUES (?, ?, ?, ?, ?, ?, ?)
270 |             """, resources_base_records)
271 |             cursor.executemany("""
272 |                 INSERT INTO ResourcesFullText (
273 |                     Id, Project, Url, Type, Headers, Content
274 |                 ) VALUES (?, ?, ?, ?, ?, ?)
275 |             """, resources_fts_records)
276 |             connection.execute("COMMIT")
277 | 
278 |         except Exception as ex:
279 |             connection.execute("ROLLBACK")
280 |             logger.error(f"Error during batch insert: {ex}\n{traceback.format_exc()}")
281 | 
282 | class IndexedCrawler(BaseCrawler):
283 |     """
284 |     A crawler implementation for data sources that load into an in-memory sqlite.
285 |     Shares commonality between specialized crawlers.
286 |     """
287 | 
288 |     def __init__(
289 |         self,
290 |         datasrc: Path,
291 |         get_sites_func: Callable,
292 |         get_resources_func: Callable,
293 |         resource_field_mapping: dict[str, str] = RESOURCES_DEFAULT_FIELD_MAPPING
294 |     ) -> None:
295 |         """
296 |         Initialize the IndexedCrawler with a data source path and required adapter functions.
297 | 
298 |         Args:
299 |             datasrc: path to the data source
300 |             get_sites_func: function to retrieve sites from the data source
301 |             get_resources_func: function to retrieve resources from the data source
302 |             resource_field_mapping: mapping of resource field names to display names
303 |         """
304 | 
305 |         assert datasrc.is_dir(), f"{self.__class__.__name__} datasrc must be a directory"
306 |         super().__init__(datasrc, get_sites_func, get_resources_func, resource_field_mapping=resource_field_mapping)
307 | 
308 |     async def mcp_list_tools(self) -> list[Tool]:
309 |         """
310 |         List available tools for this crawler.
311 | 
312 |         Returns:
313 |             List of Tool objects
314 |         """
315 |         if self._adapter_get_sites is None:
316 |             logger.error(f"_adapter_get_sites not set (function required)")
317 |             return []
318 | 
319 |         all_sites = self._adapter_get_sites(self._datasrc)
320 |         default_tools: list[Tool] = get_crawler_tools(sites=all_sites)
321 |         assert len(default_tools) == 2, "expected exactly 2 Tools: sites and resources"
322 | 
323 |         default_sites_tool, default_resources_tool = default_tools
324 |         all_sites_display = ", ".join([f"{s.name} (site: {s.id})" for s in all_sites])
325 |         drt_props = default_resources_tool.inputSchema["properties"]
326 |         drt_props["sites"]["description"] = ("Optional "
327 |             "list of project IDs to filter search results to a specific site. In 95% "
328 |             "of scenarios, you'd filter to only one site, but multi-site filtering is offered for "
329 |             f"advanced search scenarios. Available sites include {all_sites_display}.")
330 | 
331 |         return [default_sites_tool, default_resources_tool]
332 | 
333 | 
```
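
To make the two-table layout in `_setup_database` concrete, here is a self-contained sketch that builds the same schema in a throwaway in-memory database and runs a full-text query joined back to `Resources`, mirroring the split the comment describes (pure FTS for text, `Resources` SQL clauses only when Status, Size, or Time is needed). The sample row and the query are illustrative, not taken from the library; only the schema mirrors the file above.

```python
# Standalone illustration of the Resources / ResourcesFullText split used above.
# The sample data and query are hypothetical; only the schema mirrors indexed.py.
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("""
CREATE TABLE Resources (
    Id INTEGER PRIMARY KEY,
    Project INTEGER NOT NULL,
    Created TEXT,
    Modified TEXT,
    Status INTEGER NOT NULL,
    Size INTEGER NOT NULL,
    Time INTEGER NOT NULL
)""")
cursor.execute("""
CREATE VIRTUAL TABLE ResourcesFullText USING fts5(
    Id,
    Project,
    Url,
    Type,
    Headers,
    Content,
    tokenize="unicode61 remove_diacritics 0 tokenchars '-_'"
)""")

# one fake resource inserted into both tables, as _execute_batch_insert does
cursor.execute(
    "INSERT INTO Resources VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, 42, None, None, 200, 1024, 15),
)
cursor.execute(
    "INSERT INTO ResourcesFullText VALUES (?, ?, ?, ?, ?, ?)",
    (1, 42, "https://example.com/", "html", "Content-Type: text/html", "hello crawler"),
)

# pure FTS match; join back to Resources only because Status and Size are requested
rows = cursor.execute("""
    SELECT ResourcesFullText.Url, Resources.Status, Resources.Size
    FROM ResourcesFullText
    JOIN Resources ON Resources.Id = ResourcesFullText.Id
    WHERE ResourcesFullText MATCH ?
""", ("crawler",)).fetchall()
print(rows)  # [('https://example.com/', 200, 1024)]
```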

--------------------------------------------------------------------------------
/docs/guides/siteone.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="../">
  5 | <head>
  6 |   <meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
  7 | 
  8 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  9 |   <title>SiteOne MCP Setup Guide &mdash; mcp-server-webcrawl  documentation</title>
 10 |       <link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
 11 |       <link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=e59714d7" />
 12 | 
 13 |   
 14 |       <script src="../_static/jquery.js?v=5d32c60e"></script>
 15 |       <script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 16 |       <script src="../_static/documentation_options.js?v=5929fcd5"></script>
 17 |       <script src="../_static/doctools.js?v=888ff710"></script>
 18 |       <script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
 19 |     <script src="../_static/js/theme.js"></script>
 20 |     <link rel="index" title="Index" href="../genindex.html" />
 21 |     <link rel="search" title="Search" href="../search.html" />
 22 |     <link rel="next" title="WARC MCP Setup Guide" href="warc.html" />
 23 |     <link rel="prev" title="Katana MCP Setup Guide" href="katana.html" /> 
 24 | </head>
 25 | 
 26 | <body class="wy-body-for-nav"> 
 27 |   <div class="wy-grid-for-nav">
 28 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 29 |       <div class="wy-side-scroll">
 30 |         <div class="wy-side-nav-search" >
 31 | 
 32 |           
 33 |           
 34 |           <a href="../index.html" class="icon icon-home">
 35 |             mcp-server-webcrawl
 36 |           </a>
 37 | <div role="search">
 38 |   <form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
 39 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 40 |     <input type="hidden" name="check_keywords" value="yes" />
 41 |     <input type="hidden" name="area" value="default" />
 42 |   </form>
 43 | </div>
 44 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 45 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 46 | <ul class="current">
 47 | <li class="toctree-l1"><a class="reference internal" href="../installation.html">Installation</a></li>
 48 | <li class="toctree-l1 current"><a class="reference internal" href="../guides.html">Setup Guides</a><ul class="current">
 49 | <li class="toctree-l2"><a class="reference internal" href="archivebox.html">ArchiveBox MCP Setup Guide</a></li>
 50 | <li class="toctree-l2"><a class="reference internal" href="httrack.html">HTTrack MCP Setup Guide</a></li>
 51 | <li class="toctree-l2"><a class="reference internal" href="interrobot.html">InterroBot MCP Setup Guide</a></li>
 52 | <li class="toctree-l2"><a class="reference internal" href="katana.html">Katana MCP Setup Guide</a></li>
 53 | <li class="toctree-l2 current"><a class="current reference internal" href="#">SiteOne MCP Setup Guide</a></li>
 54 | <li class="toctree-l2"><a class="reference internal" href="warc.html">WARC MCP Setup Guide</a></li>
 55 | <li class="toctree-l2"><a class="reference internal" href="wget.html">wget MCP Setup Guide</a></li>
 56 | </ul>
 57 | </li>
 58 | <li class="toctree-l1"><a class="reference internal" href="../usage.html">Usage</a></li>
 59 | <li class="toctree-l1"><a class="reference internal" href="../prompts.html">Prompt Routines</a></li>
 60 | <li class="toctree-l1"><a class="reference internal" href="../modules.html">mcp_server_webcrawl</a></li>
 61 | </ul>
 62 | 
 63 |         </div>
 64 |       </div>
 65 |     </nav>
 66 | 
 67 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 68 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 69 |           <a href="../index.html">mcp-server-webcrawl</a>
 70 |       </nav>
 71 | 
 72 |       <div class="wy-nav-content">
 73 |         <div class="rst-content">
 74 |           <div role="navigation" aria-label="Page navigation">
 75 |   <ul class="wy-breadcrumbs">
 76 |       <li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
 77 |           <li class="breadcrumb-item"><a href="../guides.html">Setup Guides</a></li>
 78 |       <li class="breadcrumb-item active">SiteOne MCP Setup Guide</li>
 79 |       <li class="wy-breadcrumbs-aside">
 80 |             <a href="../_sources/guides/siteone.rst.txt" rel="nofollow"> View page source</a>
 81 |       </li>
 82 |   </ul>
 83 |   <hr/>
 84 | </div>
 85 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 86 |            <div itemprop="articleBody">
 87 |              
 88 |   <section id="siteone-mcp-setup-guide">
 89 | <h1>SiteOne MCP Setup Guide<a class="headerlink" href="#siteone-mcp-setup-guide" title="Link to this heading"></a></h1>
 90 | <p>Instructions for setting up <a class="reference external" href="https://pragmar.com/mcp-server-webcrawl/">mcp-server-webcrawl</a> with SiteOne crawler.
 91 | This allows your LLM (e.g. Claude Desktop) to search content and metadata from websites you’ve crawled using SiteOne.</p>
 92 | <iframe width="560" height="315" src="https://www.youtube.com/embed/JOGRYbo6WwI" frameborder="0" allowfullscreen></iframe><p>Follow along with the video, or the step-action guide below.</p>
 93 | <section id="requirements">
 94 | <h2>Requirements<a class="headerlink" href="#requirements" title="Link to this heading"></a></h2>
 95 | <p>Before you begin, ensure you have:</p>
 96 | <ul class="simple">
 97 | <li><p><a class="reference external" href="https://claude.ai/download">Claude Desktop</a> installed</p></li>
 98 | <li><p><a class="reference external" href="https://python.org">Python</a> 3.10 or later installed</p></li>
 99 | <li><p><a class="reference external" href="https://crawler.siteone.io">SiteOne Crawler</a> installed</p></li>
100 | <li><p>Basic familiarity with command line interfaces</p></li>
101 | </ul>
102 | </section>
103 | <section id="what-is-siteone">
104 | <h2>What is SiteOne?<a class="headerlink" href="#what-is-siteone" title="Link to this heading"></a></h2>
105 | <p>SiteOne is a GUI crawler that offers:</p>
106 | <ul class="simple">
107 | <li><p>User-friendly desktop interface for setting up and managing crawls</p></li>
108 | <li><p>Offline website generation capabilities</p></li>
109 | <li><p>Comprehensive crawl reporting</p></li>
110 | <li><p>Intuitive controls for non-technical users</p></li>
111 | </ul>
112 | </section>
113 | <section id="installation-steps">
114 | <h2>Installation Steps<a class="headerlink" href="#installation-steps" title="Link to this heading"></a></h2>
115 | <section id="install-mcp-server-webcrawl">
116 | <h3>1. Install mcp-server-webcrawl<a class="headerlink" href="#install-mcp-server-webcrawl" title="Link to this heading"></a></h3>
117 | <p>Open your terminal or command line and install the package:</p>
118 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">pip</span> <span class="n">install</span> <span class="n">mcp</span><span class="o">-</span><span class="n">server</span><span class="o">-</span><span class="n">webcrawl</span>
119 | </pre></div>
120 | </div>
121 | <p>Verify installation was successful:</p>
122 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">mcp</span><span class="o">-</span><span class="n">server</span><span class="o">-</span><span class="n">webcrawl</span> <span class="o">--</span><span class="n">help</span>
123 | </pre></div>
124 | </div>
125 | </section>
126 | <section id="create-crawls-with-siteone">
127 | <h3>2. Create Crawls with SiteOne<a class="headerlink" href="#create-crawls-with-siteone" title="Link to this heading"></a></h3>
128 | <ol class="arabic simple">
129 | <li><p>Open SiteOne Crawler application</p></li>
130 | <li><p>Enter a URL to crawl (e.g., example.com)</p></li>
131 | <li><p><strong>Important</strong>: Check the “Generate offline website” option (this is required for MCP integration)</p></li>
132 | <li><p>Click the start button to begin crawling</p></li>
133 | <li><p>Repeat for additional sites as needed (e.g., pragmar.com)</p></li>
134 | <li><p>Note the directory where SiteOne is storing the generated offline content (this is shown in the application)</p></li>
135 | </ol>
136 | </section>
137 | <section id="configure-claude-desktop">
138 | <h3>3. Configure Claude Desktop<a class="headerlink" href="#configure-claude-desktop" title="Link to this heading"></a></h3>
139 | <ol class="arabic simple">
140 | <li><p>Open Claude Desktop</p></li>
141 | <li><p>Go to <strong>File → Settings → Developer → Edit Config</strong></p></li>
142 | <li><p>Add the following configuration (modify paths as needed):</p></li>
143 | </ol>
144 | <div class="highlight-json notranslate"><div class="highlight"><pre><span></span><span class="p">{</span>
145 | <span class="w">  </span><span class="nt">&quot;mcpServers&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
146 | <span class="w">    </span><span class="nt">&quot;webcrawl&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
147 | <span class="w">      </span><span class="nt">&quot;command&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;/path/to/mcp-server-webcrawl&quot;</span><span class="p">,</span>
148 | <span class="w">      </span><span class="nt">&quot;args&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;--crawler&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;siteone&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;--datasrc&quot;</span><span class="p">,</span>
149 | <span class="w">        </span><span class="s2">&quot;/path/to/siteone/archives/&quot;</span><span class="p">]</span>
150 | <span class="w">    </span><span class="p">}</span>
151 | <span class="w">  </span><span class="p">}</span>
152 | <span class="p">}</span>
153 | </pre></div>
154 | </div>
155 | <div class="admonition note">
156 | <p class="admonition-title">Note</p>
157 | <ul class="simple">
158 | <li><p>On Windows, use <code class="docutils literal notranslate"><span class="pre">&quot;mcp-server-webcrawl&quot;</span></code> as the command</p></li>
159 | <li><p>On macOS, use the absolute path (output of <code class="docutils literal notranslate"><span class="pre">which</span> <span class="pre">mcp-server-webcrawl</span></code>)</p></li>
160 | <li><p>Change <code class="docutils literal notranslate"><span class="pre">/path/to/siteone/archives/</span></code> to the actual path where SiteOne stores offline website content</p></li>
161 | </ul>
162 | </div>
163 | <ol class="arabic simple" start="4">
164 | <li><p>Save the file and <strong>completely exit</strong> Claude Desktop (not just close the window)</p></li>
165 | <li><p>Restart Claude Desktop</p></li>
166 | </ol>
167 | </section>
168 | <section id="verify-and-use">
169 | <h3>4. Verify and Use<a class="headerlink" href="#verify-and-use" title="Link to this heading"></a></h3>
170 | <ol class="arabic">
171 | <li><p>In Claude Desktop, you should now see MCP tools available under Search and Tools</p></li>
172 | <li><p>Ask Claude to list your crawled sites:</p>
173 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Can you list the crawled sites available?
174 | </pre></div>
175 | </div>
176 | </li>
177 | <li><p>Try searching content from your crawls:</p>
178 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Can you find information about [topic] on [crawled site]?
179 | </pre></div>
180 | </div>
181 | </li>
182 | <li><p>Explore specific topics on your crawled sites:</p>
183 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">I</span><span class="s1">&#39;m interested in [keyword] in [crawled domain]. Can you tell me about it?</span>
184 | </pre></div>
185 | </div>
186 | </li>
187 | </ol>
188 | </section>
189 | </section>
190 | <section id="troubleshooting">
191 | <h2>Troubleshooting<a class="headerlink" href="#troubleshooting" title="Link to this heading"></a></h2>
192 | <ul class="simple">
193 | <li><p>If Claude doesn’t show MCP tools after restart, verify your configuration file is correctly formatted</p></li>
194 | <li><p>Ensure Python and mcp-server-webcrawl are properly installed</p></li>
195 | <li><p>Check that your SiteOne archives path in the configuration is correct</p></li>
196 | <li><p>Make sure the “Generate offline website” option was checked when creating crawls</p></li>
197 | <li><p>Verify that each crawl completed successfully and files were saved to the expected location</p></li>
198 | <li><p>Remember that the first time you use a function, Claude will ask for permission</p></li>
199 | </ul>
200 | <p>For more details, including API documentation and other crawler options, visit the <a class="reference external" href="https://github.com/pragmar/mcp-server-webcrawl">mcp-server-webcrawl documentation</a>.</p>
201 | </section>
202 | </section>
203 | 
204 | 
205 |            </div>
206 |           </div>
207 |           <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
208 |         <a href="katana.html" class="btn btn-neutral float-left" title="Katana MCP Setup Guide" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
209 |         <a href="warc.html" class="btn btn-neutral float-right" title="WARC MCP Setup Guide" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
210 |     </div>
211 | 
212 |   <hr/>
213 | 
214 |   <div role="contentinfo">
215 |     <p>&#169; Copyright 2025, pragmar.</p>
216 |   </div>
217 | 
218 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
219 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
220 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
221 |    
222 | 
223 | </footer>
224 |         </div>
225 |       </div>
226 |     </section>
227 |   </div>
228 |   <script>
229 |       jQuery(function () {
230 |           SphinxRtdTheme.Navigation.enable(true);
231 |       });
232 |   </script> 
233 | 
234 | </body>
235 | </html>
```
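
The JSON block in the guide above is the step most people trip over, so here is a small helper that merges the same `mcpServers` entry into Claude Desktop's config file. The config file locations and the merge behavior are assumptions about a typical Claude Desktop install; the `--crawler siteone --datasrc ...` arguments come straight from the guide.

```python
# Hypothetical helper that merges the guide's "webcrawl" entry into
# claude_desktop_config.json. Paths below are typical defaults, not verified here.
import json
import os
import sys
from pathlib import Path

def claude_config_path() -> Path:
    # assumed default locations for Claude Desktop's config on Windows and macOS
    if sys.platform == "win32":
        return Path(os.environ["APPDATA"]) / "Claude" / "claude_desktop_config.json"
    return Path.home() / "Library" / "Application Support" / "Claude" / "claude_desktop_config.json"

def add_siteone_server(command: str, datasrc: str) -> None:
    path = claude_config_path()
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config.setdefault("mcpServers", {})["webcrawl"] = {
        # "mcp-server-webcrawl" on Windows, absolute path (`which mcp-server-webcrawl`) on macOS
        "command": command,
        "args": ["--crawler", "siteone", "--datasrc", datasrc],
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# example: add_siteone_server("/path/to/mcp-server-webcrawl", "/path/to/siteone/archives/")
```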

--------------------------------------------------------------------------------
/docs/guides/warc.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="../">
  5 | <head>
  6 |   <meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
  7 | 
  8 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  9 |   <title>WARC MCP Setup Guide &mdash; mcp-server-webcrawl  documentation</title>
 10 |       <link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
 11 |       <link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=e59714d7" />
 12 | 
 13 |   
 14 |       <script src="../_static/jquery.js?v=5d32c60e"></script>
 15 |       <script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 16 |       <script src="../_static/documentation_options.js?v=5929fcd5"></script>
 17 |       <script src="../_static/doctools.js?v=888ff710"></script>
 18 |       <script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
 19 |     <script src="../_static/js/theme.js"></script>
 20 |     <link rel="index" title="Index" href="../genindex.html" />
 21 |     <link rel="search" title="Search" href="../search.html" />
 22 |     <link rel="next" title="wget MCP Setup Guide" href="wget.html" />
 23 |     <link rel="prev" title="SiteOne MCP Setup Guide" href="siteone.html" /> 
 24 | </head>
 25 | 
 26 | <body class="wy-body-for-nav"> 
 27 |   <div class="wy-grid-for-nav">
 28 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 29 |       <div class="wy-side-scroll">
 30 |         <div class="wy-side-nav-search" >
 31 | 
 32 |           
 33 |           
 34 |           <a href="../index.html" class="icon icon-home">
 35 |             mcp-server-webcrawl
 36 |           </a>
 37 | <div role="search">
 38 |   <form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
 39 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 40 |     <input type="hidden" name="check_keywords" value="yes" />
 41 |     <input type="hidden" name="area" value="default" />
 42 |   </form>
 43 | </div>
 44 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 45 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 46 | <ul class="current">
 47 | <li class="toctree-l1"><a class="reference internal" href="../installation.html">Installation</a></li>
 48 | <li class="toctree-l1 current"><a class="reference internal" href="../guides.html">Setup Guides</a><ul class="current">
 49 | <li class="toctree-l2"><a class="reference internal" href="archivebox.html">ArchiveBox MCP Setup Guide</a></li>
 50 | <li class="toctree-l2"><a class="reference internal" href="httrack.html">HTTrack MCP Setup Guide</a></li>
 51 | <li class="toctree-l2"><a class="reference internal" href="interrobot.html">InterroBot MCP Setup Guide</a></li>
 52 | <li class="toctree-l2"><a class="reference internal" href="katana.html">Katana MCP Setup Guide</a></li>
 53 | <li class="toctree-l2"><a class="reference internal" href="siteone.html">SiteOne MCP Setup Guide</a></li>
 54 | <li class="toctree-l2 current"><a class="current reference internal" href="#">WARC MCP Setup Guide</a></li>
 55 | <li class="toctree-l2"><a class="reference internal" href="wget.html">wget MCP Setup Guide</a></li>
 56 | </ul>
 57 | </li>
 58 | <li class="toctree-l1"><a class="reference internal" href="../usage.html">Usage</a></li>
 59 | <li class="toctree-l1"><a class="reference internal" href="../prompts.html">Prompt Routines</a></li>
 60 | <li class="toctree-l1"><a class="reference internal" href="../modules.html">mcp_server_webcrawl</a></li>
 61 | </ul>
 62 | 
 63 |         </div>
 64 |       </div>
 65 |     </nav>
 66 | 
 67 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 68 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 69 |           <a href="../index.html">mcp-server-webcrawl</a>
 70 |       </nav>
 71 | 
 72 |       <div class="wy-nav-content">
 73 |         <div class="rst-content">
 74 |           <div role="navigation" aria-label="Page navigation">
 75 |   <ul class="wy-breadcrumbs">
 76 |       <li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
 77 |           <li class="breadcrumb-item"><a href="../guides.html">Setup Guides</a></li>
 78 |       <li class="breadcrumb-item active">WARC MCP Setup Guide</li>
 79 |       <li class="wy-breadcrumbs-aside">
 80 |             <a href="../_sources/guides/warc.rst.txt" rel="nofollow"> View page source</a>
 81 |       </li>
 82 |   </ul>
 83 |   <hr/>
 84 | </div>
 85 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 86 |            <div itemprop="articleBody">
 87 |              
 88 |   <section id="warc-mcp-setup-guide">
 89 | <h1>WARC MCP Setup Guide<a class="headerlink" href="#warc-mcp-setup-guide" title="Link to this heading"></a></h1>
 90 | <p>Instructions for setting up <a class="reference external" href="https://pragmar.com/mcp-server-webcrawl/">mcp-server-webcrawl</a> with
 91 | <a class="reference external" href="https://en.wikipedia.org/wiki/WARC_(file_format)">WARC</a> files to allow your LLM (e.g.
 92 | Claude Desktop) to search content and metadata from websites you’ve archived in WARC format.</p>
 93 | <iframe width="560" height="315" src="https://www.youtube.com/embed/fx-4WZu-UT8" frameborder="0" allowfullscreen></iframe><p>Follow along with the video, or the step-action guide below.</p>
 94 | <section id="requirements">
 95 | <h2>Requirements<a class="headerlink" href="#requirements" title="Link to this heading"></a></h2>
 96 | <p>Before you begin, ensure you have:</p>
 97 | <ul class="simple">
 98 | <li><p><a class="reference external" href="https://claude.ai/download">Claude Desktop</a> installed</p></li>
 99 | <li><p><a class="reference external" href="https://python.org">Python</a> 3.10 or later installed</p></li>
100 | <li><p>Basic familiarity with command line interfaces</p></li>
101 | <li><p>wget installed (macOS users can install via Homebrew, Windows users need WSL/Ubuntu)</p></li>
102 | </ul>
103 | </section>
104 | <section id="what-are-warc-files">
105 | <h2>What are WARC Files?<a class="headerlink" href="#what-are-warc-files" title="Link to this heading"></a></h2>
106 | <p>WARC files are single-file archives that store complete crawl data including:</p>
107 | <ul class="simple">
108 | <li><p>HTTP status codes</p></li>
109 | <li><p>HTTP headers</p></li>
110 | <li><p>Response content</p></li>
111 | </ul>
112 | <p>Compared to wget running in mirror mode:</p>
113 | <ul class="simple">
114 | <li><p><strong>WARC</strong>: More comprehensive (preserves status codes and headers) but slower crawling</p></li>
115 | <li><p><strong>wget mirror</strong>: Faster crawling but doesn’t preserve status codes or headers</p></li>
116 | </ul>
117 | </section>
118 | <section id="installation-steps">
119 | <h2>Installation Steps<a class="headerlink" href="#installation-steps" title="Link to this heading"></a></h2>
120 | <section id="install-mcp-server-webcrawl">
121 | <h3>1. Install mcp-server-webcrawl<a class="headerlink" href="#install-mcp-server-webcrawl" title="Link to this heading"></a></h3>
122 | <p>Open your terminal or command line and install the package:</p>
123 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">pip</span> <span class="n">install</span> <span class="n">mcp</span><span class="o">-</span><span class="n">server</span><span class="o">-</span><span class="n">webcrawl</span>
124 | </pre></div>
125 | </div>
126 | <p>Verify installation was successful:</p>
127 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">mcp</span><span class="o">-</span><span class="n">server</span><span class="o">-</span><span class="n">webcrawl</span> <span class="o">--</span><span class="n">help</span>
128 | </pre></div>
129 | </div>
130 | </section>
131 | <section id="configure-claude-desktop">
132 | <h3>2. Configure Claude Desktop<a class="headerlink" href="#configure-claude-desktop" title="Link to this heading"></a></h3>
133 | <ol class="arabic simple">
134 | <li><p>Open Claude Desktop</p></li>
135 | <li><p>Go to <strong>File → Settings → Developer → Edit Config</strong></p></li>
136 | <li><p>Add the following configuration (modify paths as needed):</p></li>
137 | </ol>
138 | <div class="highlight-json notranslate"><div class="highlight"><pre><span></span><span class="p">{</span>
139 | <span class="w">  </span><span class="nt">&quot;mcpServers&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
140 | <span class="w">    </span><span class="nt">&quot;webcrawl&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
141 | <span class="w">      </span><span class="nt">&quot;command&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;/path/to/mcp-server-webcrawl&quot;</span><span class="p">,</span>
142 | <span class="w">      </span><span class="nt">&quot;args&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;--crawler&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;warc&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;--datasrc&quot;</span><span class="p">,</span>
143 | <span class="w">        </span><span class="s2">&quot;/path/to/warc/archives/&quot;</span><span class="p">]</span>
144 | <span class="w">    </span><span class="p">}</span>
145 | <span class="w">  </span><span class="p">}</span>
146 | <span class="p">}</span>
147 | </pre></div>
148 | </div>
149 | <div class="admonition note">
150 | <p class="admonition-title">Note</p>
151 | <ul class="simple">
152 | <li><p>On Windows, use <code class="docutils literal notranslate"><span class="pre">&quot;mcp-server-webcrawl&quot;</span></code> as the command</p></li>
153 | <li><p>On macOS, use the absolute path (output of <code class="docutils literal notranslate"><span class="pre">which</span> <span class="pre">mcp-server-webcrawl</span></code>)</p></li>
154 | <li><p>Change <code class="docutils literal notranslate"><span class="pre">/path/to/warc/archives/</span></code> to your actual directory path where WARC files are stored</p></li>
155 | </ul>
156 | </div>
157 | <ol class="arabic simple" start="4">
158 | <li><p>Save the file and <strong>completely exit</strong> Claude Desktop (not just close the window)</p></li>
159 | <li><p>Restart Claude Desktop</p></li>
160 | </ol>
161 | </section>
162 | <section id="create-warc-files-with-wget">
163 | <h3>3. Create WARC Files with Wget<a class="headerlink" href="#create-warc-files-with-wget" title="Link to this heading"></a></h3>
164 | <ol class="arabic simple">
165 | <li><p>Open Terminal (macOS) or Ubuntu/WSL (Windows)</p></li>
166 | <li><p>Navigate to your target directory for storing WARC files</p></li>
167 | <li><p>Run wget with WARC options:</p></li>
168 | </ol>
169 | <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># basic WARC capture</span>
170 | wget<span class="w"> </span>--warc-file<span class="o">=</span>example<span class="w"> </span>--recursive<span class="w"> </span>https://example.com
171 | 
172 | <span class="c1"># more comprehensive capture with page requirements (CSS, images, etc.)</span>
173 | wget<span class="w"> </span>--warc-file<span class="o">=</span>example<span class="w"> </span>--recursive<span class="w"> </span>--page-requisites<span class="w"> </span>https://example.com
174 | </pre></div>
175 | </div>
176 | <p>Your WARC files will be created with a .warc.gz extension in your current directory.</p>
177 | </section>
178 | <section id="verify-and-use">
179 | <h3>4. Verify and Use<a class="headerlink" href="#verify-and-use" title="Link to this heading"></a></h3>
180 | <ol class="arabic">
181 | <li><p>In Claude Desktop, you should now see MCP tools available under Search and Tools</p></li>
182 | <li><p>Ask Claude to list your crawled sites:</p>
183 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Can you list the crawled sites available?
184 | </pre></div>
185 | </div>
186 | </li>
187 | <li><p>Try searching content from your crawls:</p>
188 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Can you find information about [topic] on [crawled site]?
189 | </pre></div>
190 | </div>
191 | </li>
192 | </ol>
193 | </section>
194 | </section>
195 | <section id="troubleshooting">
196 | <h2>Troubleshooting<a class="headerlink" href="#troubleshooting" title="Link to this heading"></a></h2>
197 | <ul class="simple">
198 | <li><p>If Claude doesn’t show MCP tools after restart, verify your configuration file is correctly formatted</p></li>
199 | <li><p>Ensure Python and mcp-server-webcrawl are properly installed</p></li>
200 | <li><p>Check that your WARC directory path in the configuration is correct</p></li>
201 | <li><p>Make sure your WARC files have the correct extension (typically .warc.gz)</p></li>
202 | <li><p>Remember that the first time you use each function, Claude will ask for permission</p></li>
203 | <li><p>For large WARC files, initial indexing may take some time</p></li>
204 | </ul>
205 | <p>For more details, including API documentation and other crawler options, visit the <a class="reference external" href="https://github.com/pragmar/mcp-server-webcrawl">mcp-server-webcrawl documentation</a>.</p>
206 | </section>
207 | </section>
208 | 
209 | 
210 |            </div>
211 |           </div>
212 |           <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
213 |         <a href="siteone.html" class="btn btn-neutral float-left" title="SiteOne MCP Setup Guide" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
214 |         <a href="wget.html" class="btn btn-neutral float-right" title="wget MCP Setup Guide" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
215 |     </div>
216 | 
217 |   <hr/>
218 | 
219 |   <div role="contentinfo">
220 |     <p>&#169; Copyright 2025, pragmar.</p>
221 |   </div>
222 | 
223 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
224 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
225 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
226 |    
227 | 
228 | </footer>
229 |         </div>
230 |       </div>
231 |     </section>
232 |   </div>
233 |   <script>
234 |       jQuery(function () {
235 |           SphinxRtdTheme.Navigation.enable(true);
236 |       });
237 |   </script> 
238 | 
239 | </body>
240 | </html>
```
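To confirm a capture preserved the status codes and headers the guide highlights as WARC's advantage, the archive can be inspected with the third-party warcio library (`pip install warcio`). This is an optional sanity-check sketch, not something mcp-server-webcrawl requires; the filename is the placeholder from the wget examples above.

```python
from warcio.archiveiterator import ArchiveIterator  # third-party: pip install warcio

# placeholder filename; wget names the archive after --warc-file, e.g. example.warc.gz
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            content_type = record.http_headers.get_header("Content-Type")
            print(status, content_type, url)
```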

--------------------------------------------------------------------------------
/docs/guides/interrobot.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="../">
  5 | <head>
  6 |   <meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
  7 | 
  8 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  9 |   <title>InterroBot MCP Setup Guide &mdash; mcp-server-webcrawl  documentation</title>
 10 |       <link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
 11 |       <link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=e59714d7" />
 12 | 
 13 |   
 14 |       <script src="../_static/jquery.js?v=5d32c60e"></script>
 15 |       <script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 16 |       <script src="../_static/documentation_options.js?v=5929fcd5"></script>
 17 |       <script src="../_static/doctools.js?v=888ff710"></script>
 18 |       <script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
 19 |     <script src="../_static/js/theme.js"></script>
 20 |     <link rel="index" title="Index" href="../genindex.html" />
 21 |     <link rel="search" title="Search" href="../search.html" />
 22 |     <link rel="next" title="Katana MCP Setup Guide" href="katana.html" />
 23 |     <link rel="prev" title="HTTrack MCP Setup Guide" href="httrack.html" /> 
 24 | </head>
 25 | 
 26 | <body class="wy-body-for-nav"> 
 27 |   <div class="wy-grid-for-nav">
 28 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 29 |       <div class="wy-side-scroll">
 30 |         <div class="wy-side-nav-search" >
 31 | 
 32 |           
 33 |           
 34 |           <a href="../index.html" class="icon icon-home">
 35 |             mcp-server-webcrawl
 36 |           </a>
 37 | <div role="search">
 38 |   <form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
 39 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 40 |     <input type="hidden" name="check_keywords" value="yes" />
 41 |     <input type="hidden" name="area" value="default" />
 42 |   </form>
 43 | </div>
 44 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 45 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 46 | <ul class="current">
 47 | <li class="toctree-l1"><a class="reference internal" href="../installation.html">Installation</a></li>
 48 | <li class="toctree-l1 current"><a class="reference internal" href="../guides.html">Setup Guides</a><ul class="current">
 49 | <li class="toctree-l2"><a class="reference internal" href="archivebox.html">ArchiveBox MCP Setup Guide</a></li>
 50 | <li class="toctree-l2"><a class="reference internal" href="httrack.html">HTTrack MCP Setup Guide</a></li>
 51 | <li class="toctree-l2 current"><a class="current reference internal" href="#">InterroBot MCP Setup Guide</a></li>
 52 | <li class="toctree-l2"><a class="reference internal" href="katana.html">Katana MCP Setup Guide</a></li>
 53 | <li class="toctree-l2"><a class="reference internal" href="siteone.html">SiteOne MCP Setup Guide</a></li>
 54 | <li class="toctree-l2"><a class="reference internal" href="warc.html">WARC MCP Setup Guide</a></li>
 55 | <li class="toctree-l2"><a class="reference internal" href="wget.html">wget MCP Setup Guide</a></li>
 56 | </ul>
 57 | </li>
 58 | <li class="toctree-l1"><a class="reference internal" href="../usage.html">Usage</a></li>
 59 | <li class="toctree-l1"><a class="reference internal" href="../prompts.html">Prompt Routines</a></li>
 60 | <li class="toctree-l1"><a class="reference internal" href="../modules.html">mcp_server_webcrawl</a></li>
 61 | </ul>
 62 | 
 63 |         </div>
 64 |       </div>
 65 |     </nav>
 66 | 
 67 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 68 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 69 |           <a href="../index.html">mcp-server-webcrawl</a>
 70 |       </nav>
 71 | 
 72 |       <div class="wy-nav-content">
 73 |         <div class="rst-content">
 74 |           <div role="navigation" aria-label="Page navigation">
 75 |   <ul class="wy-breadcrumbs">
 76 |       <li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
 77 |           <li class="breadcrumb-item"><a href="../guides.html">Setup Guides</a></li>
 78 |       <li class="breadcrumb-item active">InterroBot MCP Setup Guide</li>
 79 |       <li class="wy-breadcrumbs-aside">
 80 |             <a href="../_sources/guides/interrobot.rst.txt" rel="nofollow"> View page source</a>
 81 |       </li>
 82 |   </ul>
 83 |   <hr/>
 84 | </div>
 85 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 86 |            <div itemprop="articleBody">
 87 |              
 88 |   <section id="interrobot-mcp-setup-guide">
 89 | <h1>InterroBot MCP Setup Guide<a class="headerlink" href="#interrobot-mcp-setup-guide" title="Link to this heading"></a></h1>
 90 | <p>Instructions for setting up <a class="reference external" href="https://pragmar.com/mcp-server-webcrawl/">mcp-server-webcrawl</a> with InterroBot.
 91 | This allows your LLM (e.g. Claude Desktop) to search content and metadata from websites you’ve crawled with InterroBot.</p>
 92 | <iframe width="560" height="315" src="https://www.youtube.com/embed/55y8oKWXJLs" frameborder="0" allowfullscreen></iframe><p>Follow along with the video, or the step-action guide below.</p>
 93 | <section id="requirements">
 94 | <h2>Requirements<a class="headerlink" href="#requirements" title="Link to this heading"></a></h2>
 95 | <p>Before you begin, ensure you have:</p>
 96 | <ul class="simple">
 97 | <li><p><a class="reference external" href="https://claude.ai/download">Claude Desktop</a> installed</p></li>
 98 | <li><p><a class="reference external" href="https://python.org">Python</a> 3.10 or later installed</p></li>
 99 | <li><p><a class="reference external" href="https://interro.bot">InterroBot</a> installed</p></li>
100 | <li><p>Basic familiarity with command line interfaces</p></li>
101 | </ul>
102 | </section>
103 | <section id="what-is-interrobot">
104 | <h2>What is InterroBot?<a class="headerlink" href="#what-is-interrobot" title="Link to this heading"></a></h2>
105 | <p>InterroBot is a commercial web crawler and analyzer that works seamlessly with mcp-server-webcrawl, providing several advantages:</p>
106 | <ul class="simple">
107 | <li><p>User-friendly graphical interface for managing crawls</p></li>
108 | <li><p>Comprehensive data collection including page content and metadata</p></li>
109 | <li><p>Natively indexed, with no first-search index build lag</p></li>
110 | <li><p>Cross-platform (Windows, macOS, Android)</p></li>
111 | </ul>
112 | </section>
113 | <section id="installation-steps">
114 | <h2>Installation Steps<a class="headerlink" href="#installation-steps" title="Link to this heading"></a></h2>
115 | <section id="install-mcp-server-webcrawl">
116 | <h3>1. Install mcp-server-webcrawl<a class="headerlink" href="#install-mcp-server-webcrawl" title="Link to this heading"></a></h3>
117 | <p>Open your terminal or command line and install the package:</p>
118 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">pip</span> <span class="n">install</span> <span class="n">mcp</span><span class="o">-</span><span class="n">server</span><span class="o">-</span><span class="n">webcrawl</span>
119 | </pre></div>
120 | </div>
121 | <p>Verify installation was successful:</p>
122 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">mcp</span><span class="o">-</span><span class="n">server</span><span class="o">-</span><span class="n">webcrawl</span> <span class="o">--</span><span class="n">help</span>
123 | </pre></div>
124 | </div>
125 | </section>
126 | <section id="create-crawls-with-interrobot">
127 | <h3>2. Create Crawls with InterroBot<a class="headerlink" href="#create-crawls-with-interrobot" title="Link to this heading"></a></h3>
128 | <ol class="arabic simple">
129 | <li><p>Open InterroBot</p></li>
130 | <li><p>For a new project, you’ll see an empty project screen</p></li>
131 | <li><p>Add websites to crawl by entering URLs (e.g., example.com, pragmar.com)</p></li>
132 | <li><p>Wait for the crawling to complete (typically takes a few seconds to minutes depending on site size)</p></li>
133 | <li><p>Note the location of your InterroBot database file, which will be needed for configuration. It is available in InterroBot options, under the Advanced section:
134 | - On Windows: Typically in <code class="docutils literal notranslate"><span class="pre">[homedir]/Documents/InterroBot/interrobot.v2.db</span></code>
135 | - On macOS: Path can be found in InterroBot settings page</p></li>
136 | </ol>
137 | </section>
138 | <section id="configure-claude-desktop">
139 | <h3>3. Configure Claude Desktop<a class="headerlink" href="#configure-claude-desktop" title="Link to this heading"></a></h3>
140 | <ol class="arabic simple">
141 | <li><p>Open Claude Desktop</p></li>
142 | <li><p>Go to <strong>File → Settings → Developer → Edit Config</strong></p></li>
143 | <li><p>Add the following configuration (modify paths as needed):</p></li>
144 | </ol>
145 | <div class="highlight-json notranslate"><div class="highlight"><pre><span></span><span class="p">{</span>
146 | <span class="w">  </span><span class="nt">&quot;mcpServers&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
147 | <span class="w">    </span><span class="nt">&quot;webcrawl&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
148 | <span class="w">      </span><span class="nt">&quot;command&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;/path/to/mcp-server-webcrawl&quot;</span><span class="p">,</span>
149 | <span class="w">      </span><span class="nt">&quot;args&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;--crawler&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;interrobot&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;--datasrc&quot;</span><span class="p">,</span>
150 | <span class="w">        </span><span class="s2">&quot;[homedir]/Documents/InterroBot/interrobot.v2.db&quot;</span><span class="p">]</span>
151 | <span class="w">    </span><span class="p">}</span>
152 | <span class="w">  </span><span class="p">}</span>
153 | <span class="p">}</span>
154 | </pre></div>
155 | </div>
156 | <div class="admonition note">
157 | <p class="admonition-title">Note</p>
158 | <ul class="simple">
159 | <li><p>On Windows, use <code class="docutils literal notranslate"><span class="pre">&quot;mcp-server-webcrawl&quot;</span></code> as the command</p></li>
160 | <li><p>On macOS, use the absolute path (output of <code class="docutils literal notranslate"><span class="pre">which</span> <span class="pre">mcp-server-webcrawl</span></code>)</p></li>
161 | <li><p>Replace <code class="docutils literal notranslate"><span class="pre">[homedir]/Documents/InterroBot/interrobot.v2.db</span></code> with the actual path to your InterroBot database file, available in InterroBot options</p></li>
162 | </ul>
163 | </div>
164 | <ol class="arabic simple" start="4">
165 | <li><p>Save the file and <strong>completely exit</strong> Claude Desktop (not just close the window)</p></li>
166 | <li><p>Restart Claude Desktop</p></li>
167 | </ol>
168 | </section>
169 | <section id="verify-and-use">
170 | <h3>4. Verify and Use<a class="headerlink" href="#verify-and-use" title="Link to this heading"></a></h3>
171 | <ol class="arabic">
172 | <li><p>In Claude Desktop, you should now see MCP tools available under Search and Tools</p></li>
173 | <li><p>Ask Claude to list your crawled sites:</p>
174 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Can you list the crawled sites available?
175 | </pre></div>
176 | </div>
177 | </li>
178 | <li><p>Try searching content from your crawls:</p>
179 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Can you find information about [topic] on [crawled site]?
180 | </pre></div>
181 | </div>
182 | </li>
183 | <li><p>Explore specific capabilities, such as generating site reports:</p>
184 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Can you give me a file type summary for [crawled site]? Which types of files are there, page count, etc.
185 | </pre></div>
186 | </div>
187 | </li>
188 | </ol>
189 | </section>
190 | </section>
191 | <section id="troubleshooting">
192 | <h2>Troubleshooting<a class="headerlink" href="#troubleshooting" title="Link to this heading"></a></h2>
193 | <ul class="simple">
194 | <li><p>If Claude doesn’t show MCP tools after restart, verify your configuration file is correctly formatted</p></li>
195 | <li><p>Ensure Python and mcp-server-webcrawl are properly installed</p></li>
196 | <li><p>Check that your InterroBot database path in the configuration is correct</p></li>
197 | <li><p>Make sure InterroBot has successfully completed crawling the websites</p></li>
198 | <li><p>Remember that the first time you use a function, Claude will ask for permission</p></li>
199 | <li><p>For large websites with many pages, search queries might take longer to process initially</p></li>
200 | </ul>
201 | <p>For more details, including API documentation and other crawler options, visit the <a class="reference external" href="https://github.com/pragmar/mcp-server-webcrawl">mcp-server-webcrawl documentation</a>.</p>
202 | </section>
203 | </section>
204 | 
205 | 
206 |            </div>
207 |           </div>
208 |           <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
209 |         <a href="httrack.html" class="btn btn-neutral float-left" title="HTTrack MCP Setup Guide" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
210 |         <a href="katana.html" class="btn btn-neutral float-right" title="Katana MCP Setup Guide" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
211 |     </div>
212 | 
213 |   <hr/>
214 | 
215 |   <div role="contentinfo">
216 |     <p>&#169; Copyright 2025, pragmar.</p>
217 |   </div>
218 | 
219 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
220 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
221 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
222 |    
223 | 
224 | </footer>
225 |         </div>
226 |       </div>
227 |     </section>
228 |   </div>
229 |   <script>
230 |       jQuery(function () {
231 |           SphinxRtdTheme.Navigation.enable(true);
232 |       });
233 |   </script> 
234 | 
235 | </body>
236 | </html>
```
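If the InterroBot troubleshooting list above points at the database path, a read-only SQLite check verifies the file is where the configuration says it is and opens as a valid database. This is a sketch only, assuming the example Windows-style path from the guide, and it makes no assumptions about the database schema.

```python
import sqlite3
from pathlib import Path

# placeholder: substitute the path shown in InterroBot options (Advanced section)
db_path = Path.home() / "Documents" / "InterroBot" / "interrobot.v2.db"

# open read-only so the check cannot interfere with InterroBot itself
connection = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
try:
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    print(f"{db_path} opened, {len(tables)} tables present")
finally:
    connection.close()
```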

--------------------------------------------------------------------------------
/docs/_sources/usage.rst.txt:
--------------------------------------------------------------------------------

```
  1 | .. raw:: html
  2 | 
  3 |    <style>
  4 |    .wy-table-responsive table td, .wy-table-responsive table th {
  5 |      white-space: normal !important;
  6 |    }
  7 |    table th.head {
  8 |     font-size: 80%;
  9 |     word-break: break-word;
 10 |     }
 11 |     table td, table th {
 12 |         text-align: center;
 13 |         vertical-align: middle;
 14 |     }
 15 |     table td:first-of-type, table th:first-of-type {
 16 |         text-align: left;
 17 |         white-space: nowrap !important;
 18 |     }
 19 |    </style>
 20 | 
 21 | Usage
 22 | =====
 23 | 
 24 | Once installed, **mcp-server-webcrawl** provides search and retrieval over your website crawl
 25 | data, with advanced filtering to pull exactly what you need. Use it to help manage your website,
 26 | as an on-demand resource database (marketing, SEO, etc.), or anything else.
 27 | 
 28 | The truth of the matter is, you don't need to know the API behind the MCP server, as it is
 29 | designed to be consumed by the LLM. It is, however, useful to understand for advanced use cases.
 30 | 
 31 | Available Tools
 32 | ---------------
 33 | 
 34 | The API is *supposed* to stay out of your way, and to a large degree
 35 | it can be navigated autonomously by your MCP client. Sometimes
 36 | you may need to nudge the LLM toward the correct field or search strategy. The
 37 | following is the current API interface for your reference.
 38 | 
 39 | webcrawl_sites
 40 | ~~~~~~~~~~~~~~
 41 | 
 42 | This tool retrieves a list of sites (project websites or crawl directories).
 43 | 
 44 | .. list-table::
 45 |    :header-rows: 1
 46 |    :widths: 15 15 15 55
 47 | 
 48 |    * - Parameter
 49 |      - Type
 50 |      - Required
 51 |      - Description
 52 |    * - ids
 53 |      - array<int>
 54 |      - No
 55 |      - List of project IDs to retrieve. Leave empty for all projects.
 56 |    * - fields
 57 |      - array<string>
 58 |      - No
 59 |      - List of additional fields to include beyond defaults (id, type, name, urls). Empty list means default fields only. Options include created (ISO 8601), modified (ISO 8601).
 60 | 
 61 | webcrawl_search
 62 | ~~~~~~~~~~~~~~~
 63 | 
 64 | This tool searches for resources (webpages, CSS, images, etc.) across projects and retrieves specified fields.
 65 | 
 66 | .. list-table::
 67 |    :header-rows: 1
 68 |    :widths: 15 15 15 55
 69 | 
 70 |    * - Parameter
 71 |      - Type
 72 |      - Required
 73 |      - Description
 74 |    * - sites
 75 |      - array<int>
 76 |      - No
 77 |      - Optional list of project IDs to filter search results to specific sites. In most scenarios, you'd filter to only one site.
 78 |    * - query
 79 |      - string
 80 |      - No
 81 |      - Fulltext search query string. Leave empty to return all resources when filtering on other fields for better precision. Supports fulltext and boolean operators (AND, OR, NOT), quoted phrases, and suffix wildcards, but not prefix wildcards. See below for complete boolean and field search capabilities.
 82 |    * - fields
 83 |      - array<string>
 84 |      - No
 85 |      - List of additional fields to include beyond defaults (modified, created). Empty list means default fields only. The content field can lead to large results and should be used with LIMIT.
 86 |    * - sort
 87 |      - string
 88 |      - No
 89 |      - Sort order for results. Prefixed with + for ascending, - for descending. ? is a special option for random sort, useful in statistical sampling. Options include: +id, -id, +url, -url, +status, -status, ?.
 90 |    * - limit
 91 |      - integer
 92 |      - No
 93 |      - Maximum number of results to return. Default is 20, max is 100.
 94 |    * - offset
 95 |      - integer
 96 |      - No
 97 |      - Number of results to skip for pagination. Default is 0.
 98 |    * - extras
 99 |      - array<string>
100 |      - No
101 |      - Array of extra features to include in results. Options include markdown, snippets, thumbnails, regex, and xpath. (see extras table)
102 |    * - extrasRegex
103 |      - array<string>
104 |      - No
105 |      - Array of regular expression patterns to extract content. One or more regex patterns can be requested. Only used when 'regex' is included in the extras array.
106 |    * - extrasXpath
107 |      - array<string>
108 |      - No
109 |      - Array of XPath expressions to extract specific content from HTML resources. One or more XPath selectors can be requested. Only used when 'xpath' is included in the extras array.
110 | 
111 | 
112 | Crawler Features Support
113 | ~~~~~~~~~~~~~~~~~~~~~~~~
114 | 
115 | API support, by parameter, across crawler type.
116 | 
117 | .. list-table::
118 |    :header-rows: 1
119 |    :widths: 13 12 12 13 12 12 13 13
120 |    :class: featuresgrid
121 | 
122 |    * - Parameter
123 |      - Archive­Box
124 |      - HT­Track
125 |      - Interro­Bot
126 |      - Ka­tana
127 |      - Site­One
128 |      - WARC
129 |      - wget
130 |    * - Sites/ids
131 |      - ✔
132 |      - ✔
133 |      - ✔
134 |      - ✔
135 |      - ✔
136 |      - ✔
137 |      - ✔
138 |    * - Sites/fields
139 |      - ✔
140 |      - ✔
141 |      - ✔
142 |      - ✔
143 |      - ✔
144 |      - ✔
145 |      - ✔
146 |    * - Search/ids
147 |      - ✔
148 |      - ✔
149 |      - ✔
150 |      - ✔
151 |      - ✔
152 |      - ✔
153 |      - ✔
154 |    * - Search/sites
155 |      - ✔
156 |      - ✔
157 |      - ✔
158 |      - ✔
159 |      - ✔
160 |      - ✔
161 |      - ✔
162 |    * - Search/query
163 |      - ✔
164 |      - ✔
165 |      - ✔
166 |      - ✔
167 |      - ①
168 |      - ✔
169 |      - ①
170 |    * - Search/fields
171 |      - ✔
172 |      - ✔
173 |      - ✔
174 |      - ✔
175 |      - ②
176 |      - ✔
177 |      - ②
178 |    * - Search/sort
179 |      - ✔
180 |      - ✔
181 |      - ✔
182 |      - ✔
183 |      - ✔
184 |      - ✔
185 |      - ✔
186 |    * - Search/limit
187 |      - ✔
188 |      - ✔
189 |      - ✔
190 |      - ✔
191 |      - ✔
192 |      - ✔
193 |      - ✔
194 |    * - Search/offset
195 |      - ✔
196 |      - ✔
197 |      - ✔
198 |      - ✔
199 |      - ✔
200 |      - ✔
201 |      - ✔
202 |    * - Search/extras
203 |      - ✔
204 |      - ✔
205 |      - ✔
206 |      - ✔
207 |      - ✔
208 |      - ✔
209 |      - ✔
210 | 
211 | Crawler Field Support
212 | ~~~~~~~~~~~~~~~~~~~~~
213 | 
214 | API support, by field, across crawler type.
215 | 
216 | .. list-table::
217 |    :header-rows: 1
218 |    :widths: 13 12 12 13 12 12 13 13
219 |    :class: featuresgrid
220 | 
221 |    * - Parameter
222 |      - Archive­Box
223 |      - HT­Track
224 |      - Interro­Bot
225 |      - Ka­tana
226 |      - Site­One
227 |      - WARC
228 |      - wget
229 |    * - site.id
230 |      - ✔
231 |      - ✔
232 |      - ✔
233 |      - ✔
234 |      - ✔
235 |      - ✔
236 |      - ✔
237 |    * - site.name
238 |      - ✔
239 |      - ✔
240 |      - ✔
241 |      - ✔
242 |      - ✔
243 |      - ✔
244 |      - ✔
245 |    * - site.type
246 |      - ✔
247 |      - ✔
248 |      - ✔
249 |      - ✔
250 |      - ✔
251 |      - ✔
252 |      - ✔
253 |    * - site.urls
254 |      - ✔
255 |      - ✔
256 |      - ✔
257 |      - ✔
258 |      - ✔
259 |      - ✔
260 |      - ✔
261 |    * - resource.id
262 |      - ✔
263 |      - ✔
264 |      - ✔
265 |      - ✔
266 |      - ✔
267 |      - ✔
268 |      - ✔
269 |    * - resource.url
270 |      - ✔
271 |      - ✔
272 |      - ✔
273 |      - ✔
274 |      - ✔
275 |      - ✔
276 |      - ✔
277 |    * - resource.type
278 |      - ✔
279 |      - ✔
280 |      - ✔
281 |      - ✔
282 |      - ✔
283 |      - ✔
284 |      - ✔
285 |    * - resource.status
286 |      - ✔
287 |      - ✔
288 |      - ✔
289 |      - ✔
290 |      - ✔
291 |      - ✔
292 |      - ③
293 |    * - resource.size
294 |      - ✔
295 |      - ✔
296 |      - ✔
297 |      - ✔
298 |      - ✔
299 |      - ✔
300 |      - ✔
301 |    * - resource.headers
302 |      - ✔
303 |      - ✔
304 |      - ✔
305 |      - ✔
306 |      -
307 |      - ✔
308 |      -
309 |    * - resource.content
310 |      - ✔
311 |      - ✔
312 |      - ✔
313 |      - ✔
314 |      - ✔
315 |      - ✔
316 |      - ✔
317 | 
318 | ①②③ wget (--mirror) does not index HTTP status beyond 200 OK (HTTP errors are not saved to disk).
319 | The wget and SiteOne crawler implementations do not support field-searchable HTTP headers. When used in
320 | WARC mode (as opposed to simple mirror), wget is capable of collecting HTTP headers
321 | and status.
322 | 
323 | Crawlers all have strengths and weaknesses; judge them on how well they
324 | fit your needs, and don't be overly concerned about headers field support. They all
325 | support fulltext boolean search across the crawl data.
326 | 
327 | Boolean Search Syntax
328 | ~~~~~~~~~~~~~~~~~~~~~
329 | 
330 | The query engine supports field-specific (``field: value``) searches and complex boolean
331 | expressions. Fulltext is supported as a combination of the url, content, and headers fields.
332 | 
333 | While the API interface is designed to be consumed by the LLM directly, it can be helpful
334 | to familiarize yourself with the search syntax. Searches generated by the LLM are
335 | inspectable, but generally collapsed in the UI. If you need to see the query, expand
336 | the MCP collapsible.
337 | 
338 | .. list-table::
339 |    :header-rows: 1
340 |    :widths: 30 70
341 | 
342 |    * - Query Example
343 |      - Description
344 |    * - privacy
345 |      - fulltext single keyword match
346 |    * - "privacy policy"
347 |      - fulltext match exact phrase
348 |    * - boundar*
349 |      - fulltext wildcard matches results starting with *boundar* (boundary, boundaries)
350 |    * - id: 12345
351 |      - id field matches a specific resource by ID
352 |    * - url: example.com/*
353 |      - url field matches results with URL containing example.com/
354 |    * - type: html
355 |      - type field matches for HTML pages only
356 |    * - status: 200
357 |      - status field matches specific HTTP status codes (equal to 200)
358 |    * - status: >=400
359 |      - status field matches specific HTTP status code (greater than or equal to 400)
360 |    * - content: h1
361 |      - content field matches content (HTTP response body, often, but not always HTML)
362 |    * - headers: text/xml
363 |      - headers field matches HTTP response headers
364 |    * - privacy AND policy
365 |      - fulltext matches both
366 |    * - privacy OR policy
367 |      - fulltext matches either
368 |    * - policy NOT privacy
369 |      - fulltext matches results containing policy but not privacy
370 |    * - (login OR signin) AND form
371 |      - fulltext matches login or signin, combined with form
372 |    * - type: html AND status: 200
373 |      - field search matches only HTML pages with HTTP success
374 | 
375 | Field Search Definitions
376 | ~~~~~~~~~~~~~~~~~~~~~~~~
377 | 
378 | Field search provides search precision, allowing you to specify which columns of the search index to filter.
379 | Rather than searching the entire content, you can restrict your query to specific attributes like URLs,
380 | headers, or content body. This approach improves efficiency when looking for
381 | specific attributes or patterns within crawl data.
382 | 
383 | .. list-table::
384 |    :header-rows: 1
385 |    :widths: 30 70
386 | 
387 |    * - Field
388 |      - Description
389 |    * - id
390 |      - resource database ID
391 |    * - url
392 |      - resource URL
393 |    * - type
394 |      - enumerated list of types (see types table)
395 |    * - size
396 |      - resource size in bytes
397 |    * - status
398 |      - HTTP response codes
399 |    * - headers
400 |      - HTTP response headers
401 |    * - content
402 |      - HTTP body—HTML, CSS, JS, and more
403 | 
404 | Field Content
405 | ~~~~~~~~~~~~~
406 | 
407 | A subset of fields can be independently requested with results, while core fields are always on. Use of headers and content can consume tokens quickly. Use judiciously, or use extras to crunch more results into the context window. Fields are a top level argument, independent of any field searching taking place in the query.
408 | 
409 | .. list-table::
410 |    :header-rows: 1
411 |    :widths: 30 70
412 | 
413 |    * - Field
414 |      - Description
415 |    * - id
416 |      - always available
417 |    * - url
418 |      - always available
419 |    * - type
420 |      - always available
421 |    * - status
422 |      - always available
423 |    * - created
424 |      - on request
425 |    * - modified
426 |      - on request
427 |    * - size
428 |      - on request
429 |    * - headers
430 |      - on request
431 |    * - content
432 |      - on request
433 | 
434 | Content Types
435 | ~~~~~~~~~~~~~
436 | 
437 | Crawls contain a multitude of resource types beyond HTML pages. The ``type:`` field search
438 | allows filtering by broad content type groups, particularly useful when filtering images without complex extension queries.
439 | For example, you might search for ``type: html NOT content: login``
440 | to find pages without "login," or ``type: img`` to analyze image resources. The table below lists all
441 | supported content types in the search system.
442 | 
443 | .. list-table::
444 |    :header-rows: 1
445 |    :widths: 30 70
446 | 
447 |    * - Type
448 |      - Description
449 |    * - html
450 |      - webpages
451 |    * - iframe
452 |      - iframes
453 |    * - img
454 |      - web images
455 |    * - audio
456 |      - web audio files
457 |    * - video
458 |      - web video files
459 |    * - font
460 |      - web font files
461 |    * - style
462 |      - CSS stylesheets
463 |    * - script
464 |      - JavaScript files
465 |    * - rss
466 |      - RSS syndication feeds
467 |    * - text
468 |      - plain text content
469 |    * - pdf
470 |      - PDF files
471 |    * - doc
472 |      - MS Word documents
473 |    * - other
474 |      - uncategorized
475 | 
476 | Extras
477 | ~~~~~~
478 | 
479 | The ``extras`` parameter provides additional processing options, transforming result data (markdown, snippets), or connecting the LLM to external data (thumbnails). These options can be combined as needed to achieve the desired result format.
480 | 
481 | .. list-table::
482 |    :header-rows: 1
483 |    :widths: 20 80
484 | 
485 |    * - Extra
486 |      - Description
487 |    * - thumbnails
488 |      - Generates base64 encoded images to be viewed and analyzed by AI models. Enables image description, content analysis, and visual understanding while keeping token output minimal. Works with images, which can be filtered using ``type: img`` in queries. SVG is not supported.
489 |    * - markdown
490 |      - Provides the HTML content field as concise Markdown, reducing token usage and improving readability for LLMs. Works with HTML, which can be filtered using ``type: html`` in queries.
491 |    * - snippets
492 |      - Matches fulltext queries to contextual keyword usage within the content. When used without requesting the content field (or markdown extra), it can provide an efficient means of refining a search without pulling down the complete page contents. Also great for rendering old school hit-highlighted results as a list, like Google search in 1999. Works with HTML, CSS, JS, or any text-based, crawled file.
493 |    * - regex
494 |      - Extracts regular expression matches from crawled files such as HTML, CSS, JavaScript, etc. Not as precise a tool as XPath for HTML, but supports any text file as a data source. One or more regex patterns can be requested, using the ``extrasRegex`` argument.
495 |    * - xpath
496 |      - Extracts XPath selector data, used in scraping HTML content. Use XPath's text() selector for text-only, element selectors return outerHTML. Only supported with ``type: html``, other types will be ignored. One or more XPath selectors (//h1, count(//h1), etc.) can be requested, using the ``extrasXpath`` argument.
497 | 
498 | Extras provide a means of producing token-efficient HTTP content responses. Markdown produces roughly 1/3 the bytes of the source HTML, snippets are generally 500 or so bytes per result, and XPath can be as specific or broad as you choose. The more focused your requests, the more results you can fit into your LLM session.
499 | 
500 | The idea, of course, is that the LLM takes care of this for you. If you notice your LLM developing an affinity for the "content" field (full HTML), a nudge in chat to budget tokens using the extras feature should be all that is needed.
```
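To make the usage reference above concrete, here is an illustrative pair of tool calls expressed as Python dicts (an MCP client sends the equivalent JSON). The site ID and query terms are placeholders; the parameter names, query syntax, sort options, and extras come from the tables above.

```python
# webcrawl_sites: list all crawled sites, requesting the optional timestamp fields
sites_arguments = {
    "fields": ["created", "modified"],
}

# webcrawl_search: boolean + field syntax, scoped to one site, with token-efficient extras
search_arguments = {
    "sites": [1],                              # placeholder site ID from webcrawl_sites
    "query": 'type: html AND status: 200 AND "privacy policy"',
    "fields": ["size"],                        # request size in addition to default fields
    "sort": "+url",
    "limit": 10,
    "extras": ["snippets", "markdown"],        # cheaper than requesting raw content
}
```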

--------------------------------------------------------------------------------
/src/mcp_server_webcrawl/crawlers/httrack/adapter.py:
--------------------------------------------------------------------------------

```python
  1 | import os
  2 | import re
  3 | import sqlite3
  4 | import traceback
  5 | 
  6 | from contextlib import closing
  7 | from datetime import datetime, timezone
  8 | from pathlib import Path
  9 | 
 10 | from mcp_server_webcrawl.crawlers.base.adapter import (
 11 |     BaseManager,
 12 |     IndexState,
 13 |     IndexStatus,
 14 |     SitesGroup,
 15 |     INDEXED_BATCH_SIZE,
 16 |     INDEXED_RESOURCE_DEFAULT_PROTOCOL,
 17 |     INDEXED_TYPE_MAPPING
 18 | )
 19 | from mcp_server_webcrawl.crawlers.base.indexed import IndexedManager
 20 | from mcp_server_webcrawl.models.resources import (
 21 |     ResourceResult,
 22 |     ResourceResultType,
 23 |     RESOURCES_LIMIT_DEFAULT,
 24 | )
 25 | from mcp_server_webcrawl.models.sites import (
 26 |     SiteResult,
 27 |     SiteType,
 28 | )
 29 | from mcp_server_webcrawl.utils.logger import get_logger
 30 | 
 31 | HTTRACK_REGEX_LAUNCH_URL = re.compile(r"launched on .+ at (https?://[^\s]+)")
 32 | HTTRACK_REGEX_REDIRECT = re.compile(r"File has moved from (https?://[^\s]+) to (.+)")
 33 | HTTRACK_REGEX_ERROR = re.compile(r'"([^"]+)" \((\d+)\) at link (https?://[^\s]+)')
 34 | HTTRACK_REGEX_DOMAIN = re.compile(r'^[a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?)*$')
 35 | HTTRACK_REGEX_INDEX_HTML = re.compile(r"/index\.html($|\?)")
 36 | 
 37 | logger = get_logger()
 38 | 
 39 | class HtTrackManager(IndexedManager):
 40 |     """
 41 |     Manages HTTrack project data in in-memory SQLite databases.
 42 |     """
 43 | 
 44 |     def __init__(self) -> None:
 45 |         """
 46 |         Initialize the HTTrack manager with empty cache and statistics.
 47 |         """
 48 |         super().__init__()
 49 | 
 50 |     def _load_site_data(self, connection: sqlite3.Connection, project_directory: Path,
 51 |                        site_id: int, index_state: IndexState = None) -> None:
 52 |         """
 53 |         Load an HTTrack project directory into the database.
 54 | 
 55 |         Args:
 56 |             connection: SQLite connection
 57 |             project_directory: path to the HTTrack project directory
 58 |             site_id: ID for the site
 59 |             index_state: IndexState object for tracking progress
 60 |         """
 61 |         if not project_directory.exists() or not project_directory.is_dir():
 62 |             logger.error(f"Directory not found or not a directory: {project_directory}")
 63 |             return
 64 | 
 65 |         if index_state is not None:
 66 |             index_state.set_status(IndexStatus.INDEXING)
 67 | 
 68 |         # metadata from hts-log.txt
 69 |         project_metadata = self._get_project_metadata(project_directory)
 70 | 
 71 |         # domain directories discovery
 72 |         domain_directories = self._get_content_directories(project_directory)
 73 | 
 74 |         if not domain_directories:
 75 |             logger.warning(f"No domain directories found in HTTrack project: {project_directory}")
 76 |             return
 77 | 
 78 |         httrack_skip_files_lower = ["hts-log.txt", "index.html"]
 79 |         with closing(connection.cursor()) as cursor:
 80 |             for domain_directory in domain_directories:
 81 |                 base_url = self._get_base_url(domain_directory, project_metadata)
 82 |                 file_paths = []
 83 |                 for root, _, files in os.walk(domain_directory):
 84 |                     for filename in files:
 85 |                         file_path = Path(root) / filename
 86 | 
 87 |                         if filename.lower() in httrack_skip_files_lower and file_path.parent == project_directory:
 88 |                             continue
 89 |                         file_paths.append(file_path)
 90 | 
 91 |                 # batch process
 92 |                 for i in range(0, len(file_paths), INDEXED_BATCH_SIZE):
 93 |                     if index_state is not None and index_state.is_timeout():
 94 |                         index_state.set_status(IndexStatus.PARTIAL)
 95 |                         return
 96 | 
 97 |                     batch_file_paths = file_paths[i:i+INDEXED_BATCH_SIZE]
 98 |                     batch_file_contents = BaseManager.read_files(batch_file_paths)
 99 |                     batch_insert_resource_results = []
100 | 
101 |                     for file_path in batch_file_paths:
102 |                         content = batch_file_contents.get(file_path)
103 |                         try:
104 |                             result = self._create_resource(
105 |                                 file_path, site_id, domain_directory, base_url,
106 |                                 project_metadata, content
107 |                             )
108 |                             if result:
109 |                                 batch_insert_resource_results.append(result)
110 |                                 if index_state is not None:
111 |                                     index_state.increment_processed()
112 |                         except Exception as ex:
113 |                             logger.error(f"Error processing file {file_path}: {ex}")
114 | 
115 |                     self._execute_batch_insert(connection, cursor, batch_insert_resource_results)
116 | 
117 |             if index_state is not None and index_state.status == IndexStatus.INDEXING:
118 |                 index_state.set_status(IndexStatus.COMPLETE)
119 | 
120 |     def _create_resource(self, file_path: Path, site_id: int, domain_directory: Path,
121 |                              base_url: str, project_metadata: dict, content: str = None) -> ResourceResult | None:
122 |         """
123 |         Create ResourceResult for an HTTrack file.
124 | 
125 |         Args:
126 |             file_path: path to the file
127 |             site_id: ID for the site
128 |             domain_directory: path to the domain directory
129 |             base_url: reconstructed base URL for the domain
130 |             project_metadata: extracted project metadata
131 |             content: optional pre-loaded file content
132 | 
133 |         Returns:
134 |             ResourceResult object ready for insertion, or None if processing fails
135 |         """
136 |         try:
137 |             relative_path: Path = file_path.relative_to(domain_directory)
138 |             url = base_url + str(relative_path).replace(os.sep, "/")
139 | 
140 |             # Handle homepage index.html like wget does
141 |             url = HTTRACK_REGEX_INDEX_HTML.sub(r"/\1", url)
142 | 
143 |             # Determine resource type from file extension
144 |             extension = file_path.suffix.lower()
145 |             resource_type = INDEXED_TYPE_MAPPING.get(extension, ResourceResultType.OTHER)
146 | 
147 |             # Get file metadata
148 |             if file_path.is_file():
149 |                 file_stat = file_path.stat()
150 |                 file_size = file_stat.st_size
151 |                 file_created = datetime.fromtimestamp(file_stat.st_ctime, tz=timezone.utc)
152 |                 file_modified = datetime.fromtimestamp(file_stat.st_mtime, tz=timezone.utc)
153 |             else:
154 |                 file_created = None
155 |                 file_modified = None
156 |                 file_size = 0
157 | 
158 |             status_code = 200  # Default for files that exist
159 |             errors = project_metadata.get("errors", {})
160 |             redirects = project_metadata.get("redirects", {})
161 | 
162 |             if url in errors:
163 |                 status_code = errors[url]
164 |             elif url in redirects:
165 |                 status_code = 302  # Assume redirect
166 | 
167 |             # pre-loaded content if available
168 |             file_content = content
169 |             if file_content is None:
170 |                 file_content = BaseManager.read_file_contents(file_path, resource_type)
171 | 
172 |             return ResourceResult(
173 |                 id=BaseManager.string_to_id(url),
174 |                 site=site_id,
175 |                 created=file_created,
176 |                 modified=file_modified,
177 |                 url=url,
178 |                 type=resource_type,
179 |                 status=status_code,
180 |                 headers=BaseManager.get_basic_headers(file_size, resource_type, file_path),
181 |                 content=file_content,
182 |                 size=file_size,
183 |                 time=0  # data unavailable (HTTrack)
184 |             )
185 | 
186 |         except Exception as ex:
187 |             logger.error(f"Error creating resource for file {file_path}: {ex}\n{traceback.format_exc()}")
188 |             return None
189 | 
190 |     def _get_project_metadata(self, project_directory: Path) -> dict[str, str]:
191 |         """
192 |         Get metadata from HTTrack hts-log.txt file.
193 | 
194 |         Args:
195 |             project_directory: path to the HTTrack project directory
196 | 
197 |         Returns:
198 |             Dictionary containing extracted metadata (launch_url, redirects, errors)
199 |         """
200 |         metadata: dict = {}
201 |         hts_log_path: Path = project_directory / "hts-log.txt"
202 | 
203 |         if not hts_log_path.exists():
204 |             logger.warning(f"No hts-log.txt found in {project_directory}")
205 |             return metadata
206 | 
207 |         # log parsing is fragile territory; if in doubt, follow the latest official HTTrack format
208 |         try:
209 |             with open(hts_log_path, "r", encoding="utf-8", errors="replace") as f:
210 |                 content = f.read()
211 | 
212 |                 # extract primary network domain (http) from first line
213 |                 launch_match = HTTRACK_REGEX_LAUNCH_URL.search(content)
214 |                 if launch_match:
215 |                     metadata["launch_url"] = launch_match.group(1)
216 | 
217 |                 redirects = {}
218 |                 errors = {}
219 | 
220 |                 for line in content.split("\n"):
221 |                     line = line.strip()
222 | 
223 |                     # redirects - file has moved from X to Y
224 |                     redirect_match = HTTRACK_REGEX_REDIRECT.search(line)
225 |                     if redirect_match:
226 |                         redirects[redirect_match.group(1)] = redirect_match.group(2)
227 | 
228 |                     # errors - Not Found (404) at link X
229 |                     error_match = HTTRACK_REGEX_ERROR.search(line)
230 |                     if error_match:
231 |                         error_text, status_code, url = error_match.groups()
232 |                         errors[url] = int(status_code)
233 | 
234 |                 metadata["redirects"] = redirects
235 |                 metadata["errors"] = errors
236 | 
237 |         except (FileNotFoundError, PermissionError, UnicodeDecodeError) as ex:
238 |             logger.warning(f"Could not read hts-log.txt from {project_directory}: {ex}")
239 |         except Exception as ex:
240 |             logger.error(f"Error parsing hts-log.txt from {project_directory}: {ex}")
241 | 
242 |         return metadata
243 | 
244 |     def _get_content_directories(self, project_directory: Path) -> list[Path]:
245 |         """
246 |         Get domain directories within an HTTrack project.
247 | 
248 |         Args:
249 |             project_directory: path to the HTTrack project directory
250 | 
251 |         Returns:
252 |             List of domain directory paths
253 |         """
254 |         content_directories: list[Path] = []
255 | 
256 |         for item in project_directory.iterdir():
257 |             if (item.is_dir() and
258 |                 not item.name.startswith(".") and
259 |                 item.name not in ["hts-cache", "hts-tmp"] and
260 |                 not item.name.startswith("hts-")):
261 | 
262 |                 # if directory contains web content (has HTML, CSS, JS, or image files)
263 |                 has_web_content = any(
264 |                     file_path.suffix.lower() in [".html", ".htm", ".css", ".js", ".png", ".jpg", ".gif"]
265 |                     for file_path in item.rglob("*") if file_path.is_file()
266 |                 )
267 | 
268 |                 if has_web_content:
269 |                     content_directories.append(item)
270 | 
271 |         return content_directories
272 | 
273 |     def _get_base_url(self, domain_directory: Path, project_metadata: dict) -> str:
274 |         """
275 |         Get the base URL for a domain directory.
276 | 
277 |         Args:
278 |             domain_directory: path to the domain directory
279 |             project_metadata: extracted project metadata
280 | 
281 |         Returns:
282 |             Reconstructed base URL
283 |         """
284 |         # use launch URL if it matches the domain directory
285 |         if "launch_url" in project_metadata:
286 |             launch_url = project_metadata["launch_url"]
287 |             try:
288 |                 from urllib.parse import urlparse
289 |                 parsed = urlparse(launch_url)
290 |                 if parsed.netloc.replace("www.", "") == domain_directory.name.replace("www.", ""):
291 |                     return f"{parsed.scheme}://{parsed.netloc}/"
292 |             except Exception:
293 |                 pass
294 | 
295 |         # if domain_directory name looks like a domain
296 |         if HTTRACK_REGEX_DOMAIN.match(domain_directory.name):
297 |             return f"{INDEXED_RESOURCE_DEFAULT_PROTOCOL}{domain_directory.name}/"
298 | 
299 |         # fallback
300 |         project_name = domain_directory.parent.name
301 |         logger.warning(f"Could not determine domain for {domain_directory}, using fallback: {project_name}")
302 |         return f"{INDEXED_RESOURCE_DEFAULT_PROTOCOL}{project_name}.local/{domain_directory.name}/"
303 | 
304 | manager: HtTrackManager = HtTrackManager()
305 | 
306 | def get_sites(
307 |     datasrc: Path,
308 |     ids: list[int] | None = None,
309 |     fields: list[str] | None = None
310 | ) -> list[SiteResult]:
311 |     """
312 |     List HTTrack project directories as sites.
313 | 
314 |     Args:
315 |         datasrc: path to the directory containing HTTrack projects
316 |         ids: optional list of site IDs to filter by
317 |         fields: optional list of fields to include in the response
318 | 
319 |     Returns:
320 |         List of SiteResult objects, one for each HTTrack project
321 |     """
322 |     return manager.get_sites_for_directories(datasrc, ids, fields)
323 | 
324 | def get_resources(
325 |     datasrc: Path,
326 |     sites: list[int] | None = None,
327 |     query: str = "",
328 |     fields: list[str] | None = None,
329 |     sort: str | None = None,
330 |     limit: int = RESOURCES_LIMIT_DEFAULT,
331 |     offset: int = 0,
332 | ) -> tuple[list[ResourceResult], int, IndexState]:
333 |     """
334 |     Get resources from HTTrack project directories using in-memory SQLite.
335 | 
336 |     Args:
337 |         datasrc: path to the directory containing HTTrack projects
338 |         sites: optional list of site IDs to filter by
339 |         query: search query string
340 |         fields: optional list of fields to include in response
341 |         sort: sort order for results
342 |         limit: maximum number of results to return
343 |         offset: number of results to skip for pagination
344 | 
345 |     Returns:
346 |         Tuple of (list of ResourceResult objects, total count, IndexState)
347 |     """
348 |     sites_results: list[SiteResult] = get_sites(datasrc=datasrc, ids=sites)
349 |     assert sites_results, "At least one site is required to search"
350 |     site_paths = [site.path for site in sites_results]
351 |     sites_group = SitesGroup(datasrc, sites, site_paths)
352 |     return manager.get_resources_for_sites_group(sites_group, query, fields, sort, limit, offset)
353 | 
```
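
For orientation, here is a minimal usage sketch of the module-level `get_sites` and `get_resources` helpers defined above. The import path, the filesystem path, and the `SiteResult.id` field are assumptions for illustration, not part of the listing.

```python
# Sketch only: module path, datasrc path, and SiteResult.id are assumptions.
from pathlib import Path

from mcp_server_webcrawl.crawlers.httrack.adapter import get_resources, get_sites

datasrc = Path("/path/to/httrack/projects")  # directory containing HTTrack projects

# list mirrored HTTrack projects as sites
sites = get_sites(datasrc)
for site in sites:
    print(site.id, site.path)

# search HTML resources within the first site (query syntax per the prompt file below)
if sites:
    results, total, index_state = get_resources(
        datasrc,
        sites=[sites[0].id],
        query="type: html",
        limit=10,
    )
    print(f"{total} matching resources, {len(results)} returned")
```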

--------------------------------------------------------------------------------
/docs/guides/httrack.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="../">
  5 | <head>
  6 |   <meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
  7 | 
  8 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  9 |   <title>HTTrack MCP Setup Guide &mdash; mcp-server-webcrawl  documentation</title>
 10 |       <link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=80d5e7a1" />
 11 |       <link rel="stylesheet" type="text/css" href="../_static/css/theme.css?v=e59714d7" />
 12 | 
 13 |   
 14 |       <script src="../_static/jquery.js?v=5d32c60e"></script>
 15 |       <script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 16 |       <script src="../_static/documentation_options.js?v=5929fcd5"></script>
 17 |       <script src="../_static/doctools.js?v=888ff710"></script>
 18 |       <script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
 19 |     <script src="../_static/js/theme.js"></script>
 20 |     <link rel="index" title="Index" href="../genindex.html" />
 21 |     <link rel="search" title="Search" href="../search.html" />
 22 |     <link rel="next" title="InterroBot MCP Setup Guide" href="interrobot.html" />
 23 |     <link rel="prev" title="ArchiveBox MCP Setup Guide" href="archivebox.html" /> 
 24 | </head>
 25 | 
 26 | <body class="wy-body-for-nav"> 
 27 |   <div class="wy-grid-for-nav">
 28 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 29 |       <div class="wy-side-scroll">
 30 |         <div class="wy-side-nav-search" >
 31 | 
 32 |           
 33 |           
 34 |           <a href="../index.html" class="icon icon-home">
 35 |             mcp-server-webcrawl
 36 |           </a>
 37 | <div role="search">
 38 |   <form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
 39 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 40 |     <input type="hidden" name="check_keywords" value="yes" />
 41 |     <input type="hidden" name="area" value="default" />
 42 |   </form>
 43 | </div>
 44 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 45 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 46 | <ul class="current">
 47 | <li class="toctree-l1"><a class="reference internal" href="../installation.html">Installation</a></li>
 48 | <li class="toctree-l1 current"><a class="reference internal" href="../guides.html">Setup Guides</a><ul class="current">
 49 | <li class="toctree-l2"><a class="reference internal" href="archivebox.html">ArchiveBox MCP Setup Guide</a></li>
 50 | <li class="toctree-l2 current"><a class="current reference internal" href="#">HTTrack MCP Setup Guide</a></li>
 51 | <li class="toctree-l2"><a class="reference internal" href="interrobot.html">InterroBot MCP Setup Guide</a></li>
 52 | <li class="toctree-l2"><a class="reference internal" href="katana.html">Katana MCP Setup Guide</a></li>
 53 | <li class="toctree-l2"><a class="reference internal" href="siteone.html">SiteOne MCP Setup Guide</a></li>
 54 | <li class="toctree-l2"><a class="reference internal" href="warc.html">WARC MCP Setup Guide</a></li>
 55 | <li class="toctree-l2"><a class="reference internal" href="wget.html">wget MCP Setup Guide</a></li>
 56 | </ul>
 57 | </li>
 58 | <li class="toctree-l1"><a class="reference internal" href="../usage.html">Usage</a></li>
 59 | <li class="toctree-l1"><a class="reference internal" href="../prompts.html">Prompt Routines</a></li>
 60 | <li class="toctree-l1"><a class="reference internal" href="../modules.html">mcp_server_webcrawl</a></li>
 61 | </ul>
 62 | 
 63 |         </div>
 64 |       </div>
 65 |     </nav>
 66 | 
 67 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 68 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 69 |           <a href="../index.html">mcp-server-webcrawl</a>
 70 |       </nav>
 71 | 
 72 |       <div class="wy-nav-content">
 73 |         <div class="rst-content">
 74 |           <div role="navigation" aria-label="Page navigation">
 75 |   <ul class="wy-breadcrumbs">
 76 |       <li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
 77 |           <li class="breadcrumb-item"><a href="../guides.html">Setup Guides</a></li>
 78 |       <li class="breadcrumb-item active">HTTrack MCP Setup Guide</li>
 79 |       <li class="wy-breadcrumbs-aside">
 80 |             <a href="../_sources/guides/httrack.rst.txt" rel="nofollow"> View page source</a>
 81 |       </li>
 82 |   </ul>
 83 |   <hr/>
 84 | </div>
 85 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 86 |            <div itemprop="articleBody">
 87 |              
 88 |   <section id="httrack-mcp-setup-guide">
 89 | <h1>HTTrack MCP Setup Guide<a class="headerlink" href="#httrack-mcp-setup-guide" title="Link to this heading"></a></h1>
 90 | <p>Instructions for setting up <a class="reference external" href="https://pragmar.com/mcp-server-webcrawl/">mcp-server-webcrawl</a> with <a class="reference external" href="https://www.httrack.com/">HTTrack Website Copier</a>.
 91 | This allows your LLM (e.g. Claude Desktop) to search content and metadata from websites you’ve mirrored using HTTrack.</p>
 92 | <iframe width="560" height="315" src="https://www.youtube.com/embed/HAVfvmrZjRk" frameborder="0" allowfullscreen></iframe><p>Follow along with the video, or the step-action guide below.</p>
 93 | <section id="requirements">
 94 | <h2>Requirements<a class="headerlink" href="#requirements" title="Link to this heading"></a></h2>
 95 | <p>Before you begin, ensure you have:</p>
 96 | <ul class="simple">
 97 | <li><p><a class="reference external" href="https://claude.ai/download">Claude Desktop</a> installed</p></li>
 98 | <li><p><a class="reference external" href="https://python.org">Python</a> 3.10 or later installed</p></li>
 99 | <li><p><a class="reference external" href="https://www.httrack.com/">HTTrack Website Copier</a> installed</p></li>
100 | <li><p>Basic familiarity with command line interfaces</p></li>
101 | </ul>
102 | </section>
103 | <section id="what-is-httrack">
104 | <h2>What is HTTrack?<a class="headerlink" href="#what-is-httrack" title="Link to this heading"></a></h2>
105 | <p>HTTrack is a well-established open source website mirror tool that offers:</p>
106 | <ul class="simple">
107 | <li><p>Complete website mirroring with organized project directories</p></li>
108 | <li><p>User-friendly wizard-style interface for setup</p></li>
109 | <li><p>Comprehensive content capture including HTML, CSS, images, and other assets</p></li>
110 | <li><p>Ability to manage multiple site mirrors efficiently</p></li>
111 | <li><p>Cross-platform support (Windows, macOS, Linux)</p></li>
112 | </ul>
113 | </section>
114 | <section id="installation-steps">
115 | <h2>Installation Steps<a class="headerlink" href="#installation-steps" title="Link to this heading"></a></h2>
116 | <section id="install-mcp-server-webcrawl">
117 | <h3>1. Install mcp-server-webcrawl<a class="headerlink" href="#install-mcp-server-webcrawl" title="Link to this heading"></a></h3>
118 | <p>Open your terminal or command line and install the package:</p>
119 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">pip</span> <span class="n">install</span> <span class="n">mcp</span><span class="o">-</span><span class="n">server</span><span class="o">-</span><span class="n">webcrawl</span>
120 | </pre></div>
121 | </div>
122 | <p>Verify installation was successful:</p>
123 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">mcp</span><span class="o">-</span><span class="n">server</span><span class="o">-</span><span class="n">webcrawl</span> <span class="o">--</span><span class="n">help</span>
124 | </pre></div>
125 | </div>
126 | </section>
127 | <section id="create-website-mirrors-with-httrack">
128 | <h3>2. Create Website Mirrors with HTTrack<a class="headerlink" href="#create-website-mirrors-with-httrack" title="Link to this heading"></a></h3>
129 | <ol class="arabic simple">
130 | <li><p>Open HTTrack Website Copier application</p></li>
131 | <li><p>Create a new project (e.g., “example”) and specify where to save it</p></li>
132 | <li><p>Add the URL you want to mirror (e.g., <a class="reference external" href="https://example.com">https://example.com</a>)</p></li>
133 | <li><p>Use the wizard interface to configure your crawling options</p></li>
134 | <li><p>Start the mirroring process and wait for completion</p></li>
135 | <li><p>Repeat for additional sites as needed (e.g., create another project for pragmar.com)</p></li>
136 | </ol>
137 | <p>HTTrack will create organized project directories under your specified location (typically “My Web Sites” on Windows or “websites” on macOS/Linux). Each project contains the complete website mirror with all HTML files, images, CSS, and other assets properly organized.</p>
138 | </section>
139 | <section id="configure-claude-desktop">
140 | <h3>3. Configure Claude Desktop<a class="headerlink" href="#configure-claude-desktop" title="Link to this heading"></a></h3>
141 | <ol class="arabic simple">
142 | <li><p>Open Claude Desktop</p></li>
143 | <li><p>Go to <strong>File → Settings → Developer → Edit Config</strong></p></li>
144 | <li><p>Add the following configuration (modify paths as needed):</p></li>
145 | </ol>
146 | <div class="highlight-json notranslate"><div class="highlight"><pre><span></span><span class="p">{</span>
147 | <span class="w">  </span><span class="nt">&quot;mcpServers&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
148 | <span class="w">    </span><span class="nt">&quot;webcrawl&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
149 | <span class="w">      </span><span class="nt">&quot;command&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;/path/to/mcp-server-webcrawl&quot;</span><span class="p">,</span>
150 | <span class="w">      </span><span class="nt">&quot;args&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;--crawler&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;httrack&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;--datasrc&quot;</span><span class="p">,</span>
151 | <span class="w">        </span><span class="s2">&quot;/path/to/httrack/projects/&quot;</span><span class="p">]</span>
152 | <span class="w">    </span><span class="p">}</span>
153 | <span class="w">  </span><span class="p">}</span>
154 | <span class="p">}</span>
155 | </pre></div>
156 | </div>
157 | <div class="admonition note">
158 | <p class="admonition-title">Note</p>
159 | <ul class="simple">
160 | <li><p>On macOS/Linux, use the absolute path (output of <code class="docutils literal notranslate"><span class="pre">which</span> <span class="pre">mcp-server-webcrawl</span></code>), and the default path is typically <code class="docutils literal notranslate"><span class="pre">&quot;~/websites&quot;</span></code></p></li>
161 | <li><p>The datasrc path should point to your HTTrack project directory containing all your mirrored sites</p></li>
162 | </ul>
163 | </div>
164 | <ol class="arabic simple" start="4">
165 | <li><p>Save the file and <strong>completely exit</strong> Claude Desktop (not just close the window)</p></li>
166 | <li><p>Restart Claude Desktop</p></li>
167 | </ol>
168 | </section>
169 | <section id="verify-and-use">
170 | <h3>4. Verify and Use<a class="headerlink" href="#verify-and-use" title="Link to this heading"></a></h3>
171 | <ol class="arabic">
172 | <li><p>In Claude Desktop, you should now see MCP tools available under Search and Tools</p></li>
173 | <li><p>Ask Claude to list your crawled sites:</p>
174 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Can you list the crawled sites available?
175 | </pre></div>
176 | </div>
177 | </li>
178 | <li><p>Try searching content from your crawls:</p>
179 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Can you find information about [topic] on [crawled site]?
180 | </pre></div>
181 | </div>
182 | </li>
183 | <li><p>Conduct content audits and SEO analysis:</p>
184 | <div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Can you analyze the content structure and SEO elements for [crawled site]?
185 | </pre></div>
186 | </div>
187 | </li>
188 | </ol>
189 | </section>
190 | </section>
191 | <section id="troubleshooting">
192 | <h2>Troubleshooting<a class="headerlink" href="#troubleshooting" title="Link to this heading"></a></h2>
193 | <ul class="simple">
194 | <li><p>If Claude doesn’t show MCP tools after restart, verify your configuration file is correctly formatted</p></li>
195 | <li><p>Ensure Python and mcp-server-webcrawl are properly installed</p></li>
196 | <li><p>Check that your HTTrack project directory path in the configuration is correct</p></li>
197 | <li><p>Make sure HTTrack has successfully completed mirroring the websites and created the project directories</p></li>
198 | <li><p>Remember that the first time you use a function, Claude will ask for permission</p></li>
199 | <li><p>For large websites, initial indexing may take some time during the first search</p></li>
200 | </ul>
201 | <p>HTTrack’s project structure makes it easy to manage multiple site mirrors, and when combined with mcp-server-webcrawl, it supports content analysis, SEO audits, and searchable archives.</p>
202 | <p>For more details, including API documentation and other crawler options, visit the <a class="reference external" href="https://github.com/pragmar/mcp-server-webcrawl">mcp-server-webcrawl documentation</a>.</p>
203 | </section>
204 | </section>
205 | 
206 | 
207 |            </div>
208 |           </div>
209 |           <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
210 |         <a href="archivebox.html" class="btn btn-neutral float-left" title="ArchiveBox MCP Setup Guide" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
211 |         <a href="interrobot.html" class="btn btn-neutral float-right" title="InterroBot MCP Setup Guide" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
212 |     </div>
213 | 
214 |   <hr/>
215 | 
216 |   <div role="contentinfo">
217 |     <p>&#169; Copyright 2025, pragmar.</p>
218 |   </div>
219 | 
220 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
221 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
222 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
223 |    
224 | 
225 | </footer>
226 |         </div>
227 |       </div>
228 |     </section>
229 |   </div>
230 |   <script>
231 |       jQuery(function () {
232 |           SphinxRtdTheme.Navigation.enable(true);
233 |       });
234 |   </script> 
235 | 
236 | </body>
237 | </html>
```
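
Before restarting Claude Desktop, it can be worth a quick sanity check that the `--datasrc` directory is laid out the way the HTTrack adapter shown earlier expects: one subdirectory per project, each with an `hts-log.txt` and one or more domain directories. The sketch below is a convenience under that assumption; the path is a placeholder.

```python
# Sanity-check an HTTrack datasrc layout (path is a placeholder).
from pathlib import Path

datasrc = Path("/path/to/httrack/projects")

for project in sorted(p for p in datasrc.iterdir() if p.is_dir()):
    has_log = (project / "hts-log.txt").exists()
    domain_dirs = [
        d.name for d in project.iterdir()
        if d.is_dir() and not d.name.startswith((".", "hts-"))
    ]
    print(f"{project.name}: hts-log.txt={'found' if has_log else 'MISSING'}, "
          f"domain dirs={domain_dirs or 'none'}")
```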

--------------------------------------------------------------------------------
/docs/_modules/mcp_server_webcrawl/extras/xpath.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="../../../">
  5 | <head>
  6 |   <meta charset="utf-8" />
  7 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  8 |   <title>mcp_server_webcrawl.extras.xpath &mdash; mcp-server-webcrawl  documentation</title>
  9 |       <link rel="stylesheet" type="text/css" href="../../../_static/pygments.css?v=80d5e7a1" />
 10 |       <link rel="stylesheet" type="text/css" href="../../../_static/css/theme.css?v=e59714d7" />
 11 | 
 12 |   
 13 |       <script src="../../../_static/jquery.js?v=5d32c60e"></script>
 14 |       <script src="../../../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 15 |       <script src="../../../_static/documentation_options.js?v=5929fcd5"></script>
 16 |       <script src="../../../_static/doctools.js?v=888ff710"></script>
 17 |       <script src="../../../_static/sphinx_highlight.js?v=dc90522c"></script>
 18 |     <script src="../../../_static/js/theme.js"></script>
 19 |     <link rel="index" title="Index" href="../../../genindex.html" />
 20 |     <link rel="search" title="Search" href="../../../search.html" /> 
 21 | </head>
 22 | 
 23 | <body class="wy-body-for-nav"> 
 24 |   <div class="wy-grid-for-nav">
 25 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 26 |       <div class="wy-side-scroll">
 27 |         <div class="wy-side-nav-search" >
 28 | 
 29 |           
 30 |           
 31 |           <a href="../../../index.html" class="icon icon-home">
 32 |             mcp-server-webcrawl
 33 |           </a>
 34 | <div role="search">
 35 |   <form id="rtd-search-form" class="wy-form" action="../../../search.html" method="get">
 36 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 37 |     <input type="hidden" name="check_keywords" value="yes" />
 38 |     <input type="hidden" name="area" value="default" />
 39 |   </form>
 40 | </div>
 41 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 42 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 43 | <ul>
 44 | <li class="toctree-l1"><a class="reference internal" href="../../../installation.html">Installation</a></li>
 45 | <li class="toctree-l1"><a class="reference internal" href="../../../guides.html">Setup Guides</a></li>
 46 | <li class="toctree-l1"><a class="reference internal" href="../../../usage.html">Usage</a></li>
 47 | <li class="toctree-l1"><a class="reference internal" href="../../../prompts.html">Prompt Routines</a></li>
 48 | <li class="toctree-l1"><a class="reference internal" href="../../../modules.html">mcp_server_webcrawl</a></li>
 49 | </ul>
 50 | 
 51 |         </div>
 52 |       </div>
 53 |     </nav>
 54 | 
 55 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 56 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 57 |           <a href="../../../index.html">mcp-server-webcrawl</a>
 58 |       </nav>
 59 | 
 60 |       <div class="wy-nav-content">
 61 |         <div class="rst-content">
 62 |           <div role="navigation" aria-label="Page navigation">
 63 |   <ul class="wy-breadcrumbs">
 64 |       <li><a href="../../../index.html" class="icon icon-home" aria-label="Home"></a></li>
 65 |           <li class="breadcrumb-item"><a href="../../index.html">Module code</a></li>
 66 |       <li class="breadcrumb-item active">mcp_server_webcrawl.extras.xpath</li>
 67 |       <li class="wy-breadcrumbs-aside">
 68 |       </li>
 69 |   </ul>
 70 |   <hr/>
 71 | </div>
 72 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 73 |            <div itemprop="articleBody">
 74 |              
 75 |   <h1>Source code for mcp_server_webcrawl.extras.xpath</h1><div class="highlight"><pre>
 76 | <span></span><span class="kn">import</span> <span class="nn">lxml.html</span>
 77 | 
 78 | <span class="kn">from</span> <span class="nn">lxml</span> <span class="kn">import</span> <span class="n">etree</span>
 79 | <span class="kn">from</span> <span class="nn">lxml.etree</span> <span class="kn">import</span> <span class="n">ParserError</span><span class="p">,</span> <span class="n">XPathEvalError</span><span class="p">,</span> <span class="n">XPathSyntaxError</span>
 80 | <span class="kn">from</span> <span class="nn">logging</span> <span class="kn">import</span> <span class="n">Logger</span>
 81 | 
 82 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.utils.logger</span> <span class="kn">import</span> <span class="n">get_logger</span>
 83 | 
 84 | <span class="n">logger</span><span class="p">:</span> <span class="n">Logger</span> <span class="o">=</span> <span class="n">get_logger</span><span class="p">()</span>
 85 | 
 86 | <div class="viewcode-block" id="get_xpath">
 87 | <a class="viewcode-back" href="../../../mcp_server_webcrawl.extras.html#mcp_server_webcrawl.extras.xpath.get_xpath">[docs]</a>
 88 | <span class="k">def</span> <span class="nf">get_xpath</span><span class="p">(</span><span class="n">content</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">xpaths</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span> <span class="o">|</span> <span class="nb">int</span> <span class="o">|</span> <span class="nb">float</span><span class="p">]]:</span>
 89 | <span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
 90 | <span class="sd">    Takes content and gets xpath hits</span>
 91 | 
 92 | <span class="sd">    Arguments:</span>
 93 | <span class="sd">        content: The HTML source</span>
 94 | <span class="sd">        xpaths: The xpath selectors</span>
 95 | 
 96 | <span class="sd">    Returns:</span>
 97 | <span class="sd">        A list of dicts, with selector and value</span>
 98 | <span class="sd">    &quot;&quot;&quot;</span>
 99 | 
100 |     <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">content</span><span class="p">,</span> <span class="nb">str</span><span class="p">):</span>
101 |         <span class="k">return</span> <span class="p">[]</span>
102 | 
103 |     <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">xpaths</span><span class="p">,</span> <span class="nb">list</span><span class="p">)</span> <span class="ow">or</span> <span class="ow">not</span> <span class="nb">all</span><span class="p">(</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">item</span><span class="p">,</span> <span class="nb">str</span><span class="p">)</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">xpaths</span><span class="p">):</span>
104 |         <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">&quot;xpaths must be a list of strings&quot;</span><span class="p">)</span>
105 | 
106 |     <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
107 | 
108 |     <span class="k">if</span> <span class="n">content</span> <span class="o">==</span> <span class="s2">&quot;&quot;</span><span class="p">:</span>
109 |         <span class="k">return</span> <span class="n">results</span>
110 | 
111 |     <span class="k">try</span><span class="p">:</span>
112 |         <span class="n">doc</span><span class="p">:</span> <span class="n">lxml</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">HtmlElement</span> <span class="o">=</span> <span class="n">lxml</span><span class="o">.</span><span class="n">html</span><span class="o">.</span><span class="n">fromstring</span><span class="p">(</span><span class="n">content</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s2">&quot;utf-8&quot;</span><span class="p">))</span>
113 |     <span class="k">except</span> <span class="n">ParserError</span><span class="p">:</span>
114 |         <span class="k">return</span> <span class="n">results</span>
115 | 
116 |     <span class="k">for</span> <span class="n">xpath</span> <span class="ow">in</span> <span class="n">xpaths</span><span class="p">:</span>
117 |         <span class="k">try</span><span class="p">:</span>
118 |             <span class="n">selector_result</span> <span class="o">=</span> <span class="n">doc</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="n">xpath</span><span class="p">)</span>
119 |         <span class="k">except</span> <span class="p">(</span><span class="n">XPathEvalError</span><span class="p">,</span> <span class="n">XPathSyntaxError</span><span class="p">)</span> <span class="k">as</span> <span class="n">ex</span><span class="p">:</span>
120 |             <span class="n">logger</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Invalid xpath &#39;</span><span class="si">{</span><span class="n">xpath</span><span class="si">}</span><span class="s2">&#39;: </span><span class="si">{</span><span class="n">ex</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
121 |             <span class="k">continue</span>
122 | 
123 |         <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">selector_result</span><span class="p">,</span> <span class="p">(</span><span class="nb">list</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">)):</span>
124 |             <span class="c1"># normal xpath query returns a list</span>
125 |             <span class="k">for</span> <span class="n">result</span> <span class="ow">in</span> <span class="n">selector_result</span><span class="p">:</span>
126 |                 <span class="c1"># a new dict for each result</span>
127 |                 <span class="n">xpath_hit</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span> <span class="o">|</span> <span class="nb">int</span> <span class="o">|</span> <span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&quot;selector&quot;</span><span class="p">:</span> <span class="n">xpath</span><span class="p">}</span>
128 |                 <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="s2">&quot;tag&quot;</span><span class="p">):</span>
129 |                     <span class="n">html_string</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">etree</span><span class="o">.</span><span class="n">tostring</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s2">&quot;unicode&quot;</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s2">&quot;html&quot;</span><span class="p">)</span>
130 |                     <span class="n">xpath_hit</span><span class="p">[</span><span class="s2">&quot;value&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">html_string</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
131 |                 <span class="k">else</span><span class="p">:</span>
132 |                     <span class="n">xpath_hit</span><span class="p">[</span><span class="s2">&quot;value&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">result</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
133 |                 <span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">xpath_hit</span><span class="p">)</span>
134 |         <span class="k">else</span><span class="p">:</span>
135 |             <span class="c1"># single value case (count(//h1), sum(), etc.) is also valid xpath</span>
136 |             <span class="n">xpath_hit</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span> <span class="o">|</span> <span class="nb">int</span> <span class="o">|</span> <span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="s2">&quot;selector&quot;</span><span class="p">:</span> <span class="n">xpath</span><span class="p">}</span>
137 |             <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">selector_result</span><span class="p">,</span> <span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="nb">float</span><span class="p">)):</span>
138 |                 <span class="n">xpath_hit</span><span class="p">[</span><span class="s2">&quot;value&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">selector_result</span>
139 |             <span class="k">else</span><span class="p">:</span>
140 |                 <span class="n">xpath_hit</span><span class="p">[</span><span class="s2">&quot;value&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">selector_result</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
141 |             <span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">xpath_hit</span><span class="p">)</span>
142 | 
143 |     <span class="k">return</span> <span class="n">results</span></div>
144 | 
145 | </pre></div>
146 | 
147 |            </div>
148 |           </div>
149 |           <footer>
150 | 
151 |   <hr/>
152 | 
153 |   <div role="contentinfo">
154 |     <p>&#169; Copyright 2025, pragmar.</p>
155 |   </div>
156 | 
157 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
158 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
159 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
160 |    
161 | 
162 | </footer>
163 |         </div>
164 |       </div>
165 |     </section>
166 |   </div>
167 |   <script>
168 |       jQuery(function () {
169 |           SphinxRtdTheme.Navigation.enable(true);
170 |       });
171 |   </script> 
172 | 
173 | </body>
174 | </html>
```
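
For context on the function rendered above, a short usage sketch; the import path is assumed from the module name shown in the page.

```python
# Usage sketch for get_xpath (import path assumed from the module name above).
from mcp_server_webcrawl.extras.xpath import get_xpath

html = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"

# element selections come back as serialized HTML strings; scalar expressions
# such as count() come back as numbers, each paired with its selector
hits = get_xpath(html, ["//h1", "count(//p)"])
for hit in hits:
    print(hit["selector"], "->", hit["value"])
# expected output (per the source above):
#   //h1 -> <h1>Title</h1>
#   count(//p) -> 2.0
```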

--------------------------------------------------------------------------------
/prompts/testsearch.md:
--------------------------------------------------------------------------------

```markdown
  1 | # mcp-server-webcrawl Boolean Search Self-Test Instructions
  2 | 
  3 | ## Query Sequence
  4 | 
  5 | ### 1. Identify Target Domain & Homepage
  6 | 
  7 | **FIRST:** Get available sites and let user choose:
  8 | ```
  9 | webcrawl_sites() - get all available domains
 10 | ```
 11 | 
 12 | **THEN:** Find homepage with sorted URL approach:
 13 | ```
 14 | query: type: html AND url: [target_site_domain]
 15 | limit: 1
 16 | sites: [target_site_id]
 17 | fields: []
 18 | extras: ["markdown"]
 19 | sort: +url
 20 | ```
 21 | 
 22 | **Extract exact domain** from homepage URL for filtering (e.g., `example.com`)
 23 | 
 24 | ### 2. Extract Boolean Test Terms from Homepage
 25 | 
 26 | **Scan homepage markdown to identify:**
 27 | - **High-frequency keywords:** Terms appearing multiple times (good for large result sets)
 28 | - **Unique/distinctive terms:** Terms likely appearing on fewer pages (good for small result sets)
 29 | - **Exact phrases:** Multi-word phrases in quotes (good for phrase matching tests)
 30 | - **Technical terms:** Domain-specific vocabulary that should appear consistently
 31 | 
 32 | **Select test term strategy:**
 33 | - **Term A (Common):** High-frequency keyword likely appearing on 10+ pages
 34 | - **Term B (Specific):** Lower-frequency keyword likely appearing on 3-8 pages
 35 | - **Phrase C:** Exact phrase in quotes for phrase matching validation
 36 | - **Term D (Rare):** Very specific term likely appearing on 1-3 pages
 37 | 
 38 | *When selecting test terms, avoid combinations where a term is a subset of a phrase, as this will lead to suspicious counts during AND operation testing. Also avoid terms in global navigation and header/footer text that appears on every page; otherwise the result counts will all look the same.*
 39 | 
 40 | ### 3. Establish Baseline Counts
 41 | 
 42 | **Test each term individually to establish baseline sets:**
 43 | 
 44 | ```
 45 | query: [term_a]
 46 | limit: 1
 47 | sites: [target_site_id]
 48 | fields: []
 49 | extras: ["markdown"]
 50 | sort: +url
 51 | ```
 52 | 
 53 | ```
 54 | query: [term_b]
 55 | limit: 1
 56 | sites: [target_site_id]
 57 | fields: []
 58 | extras: ["markdown"]
 59 | sort: +url
 60 | ```
 61 | 
 62 | ```
 63 | query: "[phrase_c]"
 64 | limit: 1
 65 | sites: [target_site_id]
 66 | fields: []
 67 | extras: ["markdown"]
 68 | sort: +url
 69 | ```
 70 | 
 71 | ```
 72 | query: [term_d]
 73 | limit: 1
 74 | sites: [target_site_id]
 75 | fields: []
 76 | extras: ["markdown"]
 77 | sort: +url
 78 | ```
 79 | 
 80 | **Record baseline totals and document which pages contain each term for mathematical validation.**
 81 | 
 82 | If an expected keyword is missing from the markdown, or an unanticipated keyword appears in it,
 83 | verify against the complete picture (HTTP headers and content) of the same document. The default fulltext search
 84 | MATCHes URL, headers, and content, so it can appear to produce false positives relative to the markdown alone. Generally,
 85 | the markdown strategy works fine and saves tokens and time.
 86 | 
 87 | ```
 88 | query: id: [document_id]
 89 | limit: 1
 90 | sites: [target_site_id]
 91 | fields: ["content", "headers"]
 92 | extras: ["markdown"]
 93 | sort: +url
 94 | ```
 95 | 
 96 | ### 4. Boolean Logic Validation Tests
 97 | 
 98 | **Execute tests in this specific order for mathematical verification:**
 99 | 
100 | #### 4.1: AND Operations (Intersection Tests)
101 | ```
102 | query: [term_a] AND [term_b]
103 | limit: 1
104 | sites: [target_site_id]
105 | fields: []
106 | extras: ["markdown"]
107 | sort: +url
108 | ```
109 | 
110 | ```
111 | query: [term_a] AND "[phrase_c]"
112 | limit: 1
113 | sites: [target_site_id]
114 | fields: []
115 | extras: ["markdown"]
116 | sort: +url
117 | ```
118 | 
119 | ```
120 | query: [term_b] AND [term_d]
121 | limit: 1
122 | sites: [target_site_id]
123 | fields: []
124 | extras: ["markdown"]
125 | sort: +url
126 | ```
127 | 
128 | **Validation:** AND results must be ≤ smallest individual term count. Verify content contains both terms.
129 | 
130 | #### 4.2: OR Operations (Union Tests)
131 | ```
132 | query: [term_a] OR [term_b]
133 | limit: 1
134 | sites: [target_site_id]
135 | fields: []
136 | extras: ["markdown"]
137 | sort: +url
138 | ```
139 | 
140 | ```
141 | query: [term_b] OR [term_d]
142 | limit: 1
143 | sites: [target_site_id]
144 | fields: []
145 | extras: ["markdown"]
146 | sort: +url
147 | ```
148 | 
149 | ```
150 | query: "[phrase_c]" OR [term_d]
151 | limit: 1
152 | sites: [target_site_id]
153 | sort: +url
154 | ```
155 | 
156 | **Validation:** OR results must be ≥ largest individual term count. Verify content contains at least one term.
157 | 
158 | #### 4.3: NOT Operations (Difference Tests)
159 | ```
160 | query: [term_a] NOT [term_b]
161 | limit: 1
162 | sites: [target_site_id]
163 | fields: []
164 | extras: ["markdown"]
165 | sort: +url
166 | ```
167 | 
168 | ```
169 | query: [term_b] NOT [term_a]
170 | limit: 1
171 | sites: [target_site_id]
172 | fields: []
173 | extras: ["markdown"]
174 | sort: +url
175 | ```
176 | 
177 | ```
178 | query: [term_a] NOT "[phrase_c]"
179 | limit: 1
180 | sites: [target_site_id]
181 | fields: []
182 | extras: ["markdown"]
183 | sort: +url
184 | ```
185 | 
186 | **Validation:** NOT results = (Term1 count) - (Term1 AND Term2 count). Verify content contains first term but not second.
187 | 
188 | #### 4.4: Field/NOT Operations (Dynamic Field Exclusion Tests)
189 | 
190 | **Test field/NOT syntax using established terms across different field types:**
191 | 
192 | 
193 | ```
194 | query: type: html NOT content: [term_b]
195 | limit: 1
196 | sites: [target_site_id]
197 | fields: []
198 | extras: ["markdown"]
199 | sort: +url
200 | ```
201 | 
202 | ```
203 | query: [term_a] NOT type: img
204 | limit: 1
205 | sites: [target_site_id]
206 | fields: []
207 | extras: ["markdown"]
208 | sort: +url
209 | ```
210 | 
211 | ```
212 | query: status: 200 NOT type: script
213 | limit: 1
214 | sites: [target_site_id]
215 | fields: []
216 | extras: ["markdown"]
217 | sort: +url
218 | ```
219 | 
220 | **Validation:** Field/NOT must exclude based on field-specific content, not fulltext matching. Critical parser test distinguishing `content` (fulltext) vs `content: value` (field search).
221 | 
222 | **Mathematical Check:**
223 | - A NOT B = A - (A AND B)
224 | - Verify field-specific exclusions follow set theory rules
225 | - Sample results to confirm field syntax working, not fulltext fallback
226 | 
227 | 
228 | ### 5. Complex Boolean Expression Tests
229 | 
230 | **Test operator precedence and grouping:**
231 | 
232 | ```
233 | query: [term_a] OR [term_b] AND [term_d]
234 | limit: 1
235 | sites: [target_site_id]
236 | fields: []
237 | extras: ["markdown"]
238 | sort: +url
239 | ```
240 | 
241 | ```
242 | query: ([term_a] OR [term_b]) AND [term_d]
243 | limit: 1
244 | sites: [target_site_id]
245 | fields: []
246 | extras: ["markdown"]
247 | sort: +url
248 | ```
249 | 
250 | ```
251 | query: [term_a] AND ([term_b] OR [term_d])
252 | limit: 1
253 | sites: [target_site_id]
254 | fields: []
255 | extras: ["markdown"]
256 | sort: +url
257 | ```
258 | 
259 | **Validation:** Verify operator precedence follows standard search engine convention
260 | (AND binds before OR) and that parentheses override it correctly.
261 | 
262 | Before declaring precedence failure, verify the mathematics:
263 | 
264 | For query: [term_a] OR [term_b] AND [term_d]
265 | Expected parsing: [term_a] OR ([term_b] AND [term_d])
266 | 
267 | 1. Calculate: [term_b] AND [term_d] = X results
268 | 2. Calculate: [term_a] OR X should ≈ [term_a] baseline (if [term_a] >> X)
269 | 3. If actual results ≈ [term_a] baseline, precedence is CORRECT
270 | 4. Only flag as error if mathematics don't match expected precedence
271 | 
272 | ### 6. Content Verification Sampling
273 | 
274 | **For critical tests, verify content accuracy by sampling full HTTP results:**
275 | 
276 | Content can be large, so keep the limit low and sample one document at a time:
277 | ```
278 | query: id: [document_id]
279 | fields: ["content", "headers"]
280 | sites: [target_site_id]
281 | limit: 1
282 | ```
283 | 
284 | **Check 2-3 results from each boolean operation to ensure:**
285 | - AND results actually contain both terms
286 | - OR results contain at least one term
287 | - NOT results contain first term but exclude second term
288 | 
289 | ### 7. Mathematical Consistency Validation
290 | 
291 | **For each test combination, verify set theory compliance:**
292 | 
293 | | Operation | Formula | Expected Result |
294 | |-----------|---------|----------------|
295 | | A AND B | Intersection | ≤ min(A, B) |
296 | | A OR B | Union | ≥ max(A, B), ≤ A + B |
297 | | A NOT B | Difference | A - (A AND B) |
298 | | NOT (A AND B) | De Morgan's Law | (NOT A) OR (NOT B) |
299 | | NOT (A OR B) | De Morgan's Law | (NOT A) AND (NOT B) |
300 | 
301 | ### 8. Offer Advanced Analysis or Tool Research
302 | 
303 | After completing the main boolean audit, offer the user two additional options:
304 | - **Detailed Analysis:** More comprehensive investigation of search performance, edge cases, or complex query patterns
305 | - **Tool Research:** Research and recommend specific tools for search optimization, query debugging, or search analytics
306 | 
307 | ## Boolean Test Methodology
308 | 
309 | ### Term Selection Strategy
310 | 
311 | #### High-Value Test Terms
312 | - **Common terms (10+ pages):** Good for testing large set operations and performance
313 | - **Specific terms (3-8 pages):** Ideal for precise mathematical validation
314 | - **Rare terms (1-3 pages):** Perfect for edge case testing and NOT operations
315 | - **Exact phrases:** Critical for phrase matching and quote handling validation
316 | - **Avoid these terms:** Keywords that appear in the URL, and words associated with common HTTP headers (application/etc.).
317 | 
318 | #### Mathematical Rigor Requirements
319 | - **Intersection tests:** Verify A AND B ≤ min(A, B)
320 | - **Union tests:** Verify max(A, B) ≤ A OR B ≤ A + B
321 | - **Difference tests:** Verify A NOT B = A - (A AND B)
322 | - **Content validation:** Sample results to confirm logical operators work on actual content
323 | 
324 | ### Test Execution Order
325 | 
326 | #### Phase 1: Baseline Establishment
327 | 1. Extract test terms from homepage content analysis
328 | 2. Execute individual term searches to establish baseline counts
329 | 3. Document which pages contain which terms for cross-reference
330 | 
331 | #### Phase 2: Core Boolean Logic
332 | 1. Test AND operations (intersection logic)
333 | 2. Test OR operations (union logic)
334 | 3. Test NOT operations (difference logic)
335 | 4. Verify mathematical relationships for each operation
336 | 
337 | #### Phase 3: Complex Expression Validation
338 | 1. Test operator precedence without parentheses
339 | 2. Test explicit parentheses grouping
340 | 3. Test nested boolean expressions
341 | 4. Verify complex query parsing accuracy
342 | 
343 | #### Phase 4: Content Verification
344 | 1. Sample results from each boolean operation type
345 | 2. Verify content actually matches boolean logic expectations
346 | 3. Test edge cases and boundary conditions
347 | 4. Confirm search index accuracy
348 | 
349 | ## Common Boolean Logic Issues
350 | 
351 | ### High Priority Issues
352 | 1. **Incorrect AND logic:** Results contain only one term instead of both
353 | 2. **Broken NOT logic:** Results include excluded terms or miss included terms
354 | 3. **Mathematical inconsistency:** Set operations don't follow mathematical rules
355 | 4. **Phrase matching failures:** Quoted phrases not treated as exact matches
356 | 5. **Operator precedence errors:** Complex queries parsed incorrectly
357 | 
358 | ### Medium Priority Issues
359 | 1. **Performance degradation:** Complex boolean queries significantly slower
360 | 2. **Case sensitivity problems:** Inconsistent handling of term capitalization
361 | 3. **Partial word matching:** "test" matching "testing" when exact match expected
362 | 4. **Whitespace handling:** Extra spaces breaking phrase matches
363 | 5. **Special character issues:** Boolean operators in content causing conflicts
364 | 
365 | ### Low Priority Issues
366 | 1. **Optimization opportunities:** Redundant query patterns that could be simplified
367 | 2. **Result ordering consistency:** Same logical query returning different sort orders
368 | 3. **Marginal performance improvements:** Small optimizations for complex queries
369 | 
370 | ## Reporting Template
371 | 
372 | ### 📊 Boolean Search Logic Summary
373 | 
374 | | Test Category | Tests Executed | Passed | Failed | Critical Issues |
375 | |---------------|----------------|--------|--------|-----------------|
376 | | **Baseline Terms** | 4 | X | Y | Missing/incorrect baselines |
377 | | **AND Operations** | 3 | X | Y | Intersection failures |
378 | | **OR Operations** | 3 | X | Y | Union calculation errors |
379 | | **NOT Operations** | 3 | X | Y | Difference logic broken |
380 | | **Complex Expressions** | 3 | X | Y | Precedence/grouping issues |
381 | | **Content Verification** | 3 | X | Y | Logic vs content mismatch |
382 | 
383 | ### 🔍 Test Term Analysis
384 | 
385 | | Term | Type | Baseline Count | Pages Sampled | Content Accuracy |
386 | |------|------|---------------|---------------|------------------|
387 | | [term_a] | Common | X pages | Y pages | ✅ Accurate |
388 | | [term_b] | Specific | X pages | Y pages | ✅ Accurate |
389 | | "[phrase_c]" | Exact Phrase | X pages | Y pages | ⚠️ Partial matches |
390 | | [term_d] | Rare | X pages | Y pages | ❌ Missing content |
391 | 
392 | ### ⚡ Boolean Logic Validation Matrix
393 | 
394 | | Operation | Query | Expected | Actual | Mathematical Check | Content Check | Status |
395 | |-----------|-------|----------|--------|-------------------|---------------|--------|
396 | | AND | [term_a] AND [term_b] | ≤ min(X,Y) | Z | ✅ Valid | ✅ Accurate | Pass |
397 | | OR | [term_a] OR [term_b] | ≥ max(X,Y) | Z | ✅ Valid | ✅ Accurate | Pass |
398 | | NOT | [term_a] NOT [term_b] | X - (A∩B) | Z | ❌ Invalid | ⚠️ Partial | Fail |
399 | 
400 | ### 🧮 Mathematical Consistency Analysis
401 | 
402 | **Set Theory Validation:**
403 | - **Intersection (AND):** All results ≤ smallest baseline ✅
404 | - **Union (OR):** All results ≥ largest baseline ✅
405 | - **Difference (NOT):** Calculations match A - (A∩B) formula ❌
406 | - **Complex expressions:** Parentheses and precedence working ⚠️
407 | 
408 | **Critical Formula Checks:**
409 | ```
410 | Term A: X pages
411 | Term B: Y pages
412 | A AND B: Z pages (Expected: ≤ min(X,Y)) [✅/❌]
413 | A OR B: W pages (Expected: ≥ max(X,Y), ≤ X+Y) [✅/❌]
414 | A NOT B: V pages (Expected: X - Z) [✅/❌]
415 | ```
416 | 
417 | ### 📋 Content Verification Results
418 | 
419 | | Boolean Type | Sample Size | Content Accuracy | Common Issues |
420 | |--------------|-------------|------------------|---------------|
421 | | **AND Results** | 3 pages | 100% | None detected |
422 | | **OR Results** | 3 pages | 67% | Missing term in 1 result |
423 | | **NOT Results** | 3 pages | 33% | Excluded term found in 2 results |
424 | 
425 | ### 🎯 Priority Fix Recommendations
426 | 
427 | | Priority | Issue | Impact | Fix Complexity |
428 | |----------|-------|--------|----------------|
429 | | **🚨 Critical** | NOT logic returns incorrect results | Search reliability | High - Core logic |
430 | | **🔴 High** | AND missing term in results | User trust | Medium - Index update |
431 | | **🟡 Medium** | Phrase matching inconsistent | Search precision | Low - Config change |
432 | | **🟢 Low** | Performance optimization | User experience | Low - Query tuning |
433 | 
434 | ## Methodology
435 | 
436 | You will review this search system from the perspective of an accomplished but patient web developer. You know that boolean logic is where serious users separate the tools from the toys. You recognize that pure nested Boolean can't always be mapped one-to-one onto SQLite FTS5, whose MATCH rules allow one MATCH per column. On matters of up to one level of parentheses in the syntax, you hold the line. Boolean truth must prevail.
437 | 
438 | Your analysis will highlight both mathematical accuracy and practical usability. When boolean logic fails, you'll present the issues constructively, focusing on what failed and on potential solutions. Don't force solutions when the cause is not clear. The goal is to test Boolean support rigorously, and to highlight discrepancies when found.
439 | 
440 | Where you have tabular data, you aren't afraid to arrange it in an aesthetically pleasing manner. You will prefer tables above unordered lists. Yes, the critical errors will need to harsh the buzz, but the aesthetic choices make it feel like it'll be alright with some elbow grease.
```
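
The set-theory checks in the prompt above are easy to script when tallying counts by hand. A minimal sketch, assuming you have already recorded the individual and combined totals (the numbers below are placeholders):

```python
# Validate recorded boolean search counts against the consistency table above.
def validate_boolean_counts(a: int, b: int, a_and_b: int, a_or_b: int, a_not_b: int) -> dict[str, bool]:
    """Return pass/fail for each invariant from the mathematical consistency table."""
    return {
        "A AND B <= min(A, B)": a_and_b <= min(a, b),
        "A OR B >= max(A, B)": a_or_b >= max(a, b),
        "A OR B <= A + B": a_or_b <= a + b,
        "A NOT B == A - (A AND B)": a_not_b == a - a_and_b,
    }

# placeholder baseline and combined counts
for name, ok in validate_boolean_counts(a=42, b=17, a_and_b=9, a_or_b=50, a_not_b=33).items():
    print(f"{'PASS' if ok else 'FAIL'}: {name}")
```
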
Page 5/35