This is page 2 of 35. Use http://codebase.md/pragmar/mcp_server_webcrawl?lines=true&page={x} to view the full context.

# Directory Structure

```
├── .gitignore
├── CONTRIBUTING.md
├── docs
│   ├── _images
│   │   ├── interactive.document.webp
│   │   ├── interactive.search.webp
│   │   └── mcpswc.svg
│   ├── _modules
│   │   ├── index.html
│   │   ├── mcp_server_webcrawl
│   │   │   ├── crawlers
│   │   │   │   ├── archivebox
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── base
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── api.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   ├── indexed.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── httrack
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── interrobot
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── katana
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── siteone
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   ├── warc
│   │   │   │   │   ├── adapter.html
│   │   │   │   │   ├── crawler.html
│   │   │   │   │   └── tests.html
│   │   │   │   └── wget
│   │   │   │       ├── adapter.html
│   │   │   │       ├── crawler.html
│   │   │   │       └── tests.html
│   │   │   ├── crawlers.html
│   │   │   ├── extras
│   │   │   │   ├── markdown.html
│   │   │   │   ├── regex.html
│   │   │   │   ├── snippets.html
│   │   │   │   ├── thumbnails.html
│   │   │   │   └── xpath.html
│   │   │   ├── interactive
│   │   │   │   ├── highlights.html
│   │   │   │   ├── search.html
│   │   │   │   ├── session.html
│   │   │   │   └── ui.html
│   │   │   ├── main.html
│   │   │   ├── models
│   │   │   │   ├── resources.html
│   │   │   │   └── sites.html
│   │   │   ├── templates
│   │   │   │   └── tests.html
│   │   │   ├── utils
│   │   │   │   ├── blobs.html
│   │   │   │   ├── cli.html
│   │   │   │   ├── logger.html
│   │   │   │   ├── querycache.html
│   │   │   │   ├── server.html
│   │   │   │   └── tools.html
│   │   │   └── utils.html
│   │   └── re.html
│   ├── _sources
│   │   ├── guides
│   │   │   ├── archivebox.rst.txt
│   │   │   ├── httrack.rst.txt
│   │   │   ├── interrobot.rst.txt
│   │   │   ├── katana.rst.txt
│   │   │   ├── siteone.rst.txt
│   │   │   ├── warc.rst.txt
│   │   │   └── wget.rst.txt
│   │   ├── guides.rst.txt
│   │   ├── index.rst.txt
│   │   ├── installation.rst.txt
│   │   ├── interactive.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.archivebox.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.base.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.httrack.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.interrobot.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.katana.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.siteone.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.warc.rst.txt
│   │   ├── mcp_server_webcrawl.crawlers.wget.rst.txt
│   │   ├── mcp_server_webcrawl.extras.rst.txt
│   │   ├── mcp_server_webcrawl.interactive.rst.txt
│   │   ├── mcp_server_webcrawl.models.rst.txt
│   │   ├── mcp_server_webcrawl.rst.txt
│   │   ├── mcp_server_webcrawl.templates.rst.txt
│   │   ├── mcp_server_webcrawl.utils.rst.txt
│   │   ├── modules.rst.txt
│   │   ├── prompts.rst.txt
│   │   └── usage.rst.txt
│   ├── _static
│   │   ├── _sphinx_javascript_frameworks_compat.js
│   │   ├── basic.css
│   │   ├── css
│   │   │   ├── badge_only.css
│   │   │   ├── fonts
│   │   │   │   ├── fontawesome-webfont.eot
│   │   │   │   ├── fontawesome-webfont.svg
│   │   │   │   ├── fontawesome-webfont.ttf
│   │   │   │   ├── fontawesome-webfont.woff
│   │   │   │   ├── fontawesome-webfont.woff2
│   │   │   │   ├── lato-bold-italic.woff
│   │   │   │   ├── lato-bold-italic.woff2
│   │   │   │   ├── lato-bold.woff
│   │   │   │   ├── lato-bold.woff2
│   │   │   │   ├── lato-normal-italic.woff
│   │   │   │   ├── lato-normal-italic.woff2
│   │   │   │   ├── lato-normal.woff
│   │   │   │   ├── lato-normal.woff2
│   │   │   │   ├── Roboto-Slab-Bold.woff
│   │   │   │   ├── Roboto-Slab-Bold.woff2
│   │   │   │   ├── Roboto-Slab-Regular.woff
│   │   │   │   └── Roboto-Slab-Regular.woff2
│   │   │   └── theme.css
│   │   ├── doctools.js
│   │   ├── documentation_options.js
│   │   ├── file.png
│   │   ├── fonts
│   │   │   ├── Lato
│   │   │   │   ├── lato-bold.eot
│   │   │   │   ├── lato-bold.ttf
│   │   │   │   ├── lato-bold.woff
│   │   │   │   ├── lato-bold.woff2
│   │   │   │   ├── lato-bolditalic.eot
│   │   │   │   ├── lato-bolditalic.ttf
│   │   │   │   ├── lato-bolditalic.woff
│   │   │   │   ├── lato-bolditalic.woff2
│   │   │   │   ├── lato-italic.eot
│   │   │   │   ├── lato-italic.ttf
│   │   │   │   ├── lato-italic.woff
│   │   │   │   ├── lato-italic.woff2
│   │   │   │   ├── lato-regular.eot
│   │   │   │   ├── lato-regular.ttf
│   │   │   │   ├── lato-regular.woff
│   │   │   │   └── lato-regular.woff2
│   │   │   └── RobotoSlab
│   │   │       ├── roboto-slab-v7-bold.eot
│   │   │       ├── roboto-slab-v7-bold.ttf
│   │   │       ├── roboto-slab-v7-bold.woff
│   │   │       ├── roboto-slab-v7-bold.woff2
│   │   │       ├── roboto-slab-v7-regular.eot
│   │   │       ├── roboto-slab-v7-regular.ttf
│   │   │       ├── roboto-slab-v7-regular.woff
│   │   │       └── roboto-slab-v7-regular.woff2
│   │   ├── images
│   │   │   ├── interactive.document.png
│   │   │   ├── interactive.document.webp
│   │   │   ├── interactive.search.png
│   │   │   ├── interactive.search.webp
│   │   │   └── mcpswc.svg
│   │   ├── jquery.js
│   │   ├── js
│   │   │   ├── badge_only.js
│   │   │   ├── theme.js
│   │   │   └── versions.js
│   │   ├── language_data.js
│   │   ├── minus.png
│   │   ├── plus.png
│   │   ├── pygments.css
│   │   ├── searchtools.js
│   │   └── sphinx_highlight.js
│   ├── .buildinfo
│   ├── .nojekyll
│   ├── genindex.html
│   ├── guides
│   │   ├── archivebox.html
│   │   ├── httrack.html
│   │   ├── interrobot.html
│   │   ├── katana.html
│   │   ├── siteone.html
│   │   ├── warc.html
│   │   └── wget.html
│   ├── guides.html
│   ├── index.html
│   ├── installation.html
│   ├── interactive.html
│   ├── mcp_server_webcrawl.crawlers.archivebox.html
│   ├── mcp_server_webcrawl.crawlers.base.html
│   ├── mcp_server_webcrawl.crawlers.html
│   ├── mcp_server_webcrawl.crawlers.httrack.html
│   ├── mcp_server_webcrawl.crawlers.interrobot.html
│   ├── mcp_server_webcrawl.crawlers.katana.html
│   ├── mcp_server_webcrawl.crawlers.siteone.html
│   ├── mcp_server_webcrawl.crawlers.warc.html
│   ├── mcp_server_webcrawl.crawlers.wget.html
│   ├── mcp_server_webcrawl.extras.html
│   ├── mcp_server_webcrawl.html
│   ├── mcp_server_webcrawl.interactive.html
│   ├── mcp_server_webcrawl.models.html
│   ├── mcp_server_webcrawl.templates.html
│   ├── mcp_server_webcrawl.utils.html
│   ├── modules.html
│   ├── objects.inv
│   ├── prompts.html
│   ├── py-modindex.html
│   ├── search.html
│   ├── searchindex.js
│   └── usage.html
├── LICENSE
├── MANIFEST.in
├── prompts
│   ├── audit404.md
│   ├── auditfiles.md
│   ├── auditperf.md
│   ├── auditseo.md
│   ├── gopher.md
│   ├── README.md
│   └── testsearch.md
├── pyproject.toml
├── README.md
├── setup.py
├── sphinx
│   ├── _static
│   │   └── images
│   │       ├── interactive.document.png
│   │       ├── interactive.document.webp
│   │       ├── interactive.search.png
│   │       ├── interactive.search.webp
│   │       └── mcpswc.svg
│   ├── _templates
│   │   └── layout.html
│   ├── conf.py
│   ├── guides
│   │   ├── archivebox.rst
│   │   ├── httrack.rst
│   │   ├── interrobot.rst
│   │   ├── katana.rst
│   │   ├── siteone.rst
│   │   ├── warc.rst
│   │   └── wget.rst
│   ├── guides.rst
│   ├── index.rst
│   ├── installation.rst
│   ├── interactive.rst
│   ├── make.bat
│   ├── Makefile
│   ├── mcp_server_webcrawl.crawlers.archivebox.rst
│   ├── mcp_server_webcrawl.crawlers.base.rst
│   ├── mcp_server_webcrawl.crawlers.httrack.rst
│   ├── mcp_server_webcrawl.crawlers.interrobot.rst
│   ├── mcp_server_webcrawl.crawlers.katana.rst
│   ├── mcp_server_webcrawl.crawlers.rst
│   ├── mcp_server_webcrawl.crawlers.siteone.rst
│   ├── mcp_server_webcrawl.crawlers.warc.rst
│   ├── mcp_server_webcrawl.crawlers.wget.rst
│   ├── mcp_server_webcrawl.extras.rst
│   ├── mcp_server_webcrawl.interactive.rst
│   ├── mcp_server_webcrawl.models.rst
│   ├── mcp_server_webcrawl.rst
│   ├── mcp_server_webcrawl.templates.rst
│   ├── mcp_server_webcrawl.utils.rst
│   ├── modules.rst
│   ├── prompts.rst
│   ├── readme.txt
│   └── usage.rst
└── src
    └── mcp_server_webcrawl
        ├── __init__.py
        ├── crawlers
        │   ├── __init__.py
        │   ├── archivebox
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   ├── base
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── api.py
        │   │   ├── crawler.py
        │   │   ├── indexed.py
        │   │   └── tests.py
        │   ├── httrack
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   ├── interrobot
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   ├── katana
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   ├── siteone
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   ├── warc
        │   │   ├── __init__.py
        │   │   ├── adapter.py
        │   │   ├── crawler.py
        │   │   └── tests.py
        │   └── wget
        │       ├── __init__.py
        │       ├── adapter.py
        │       ├── crawler.py
        │       └── tests.py
        ├── extras
        │   ├── __init__.py
        │   ├── markdown.py
        │   ├── regex.py
        │   ├── snippets.py
        │   ├── thumbnails.py
        │   └── xpath.py
        ├── interactive
        │   ├── __init__.py
        │   ├── highlights.py
        │   ├── search.py
        │   ├── session.py
        │   ├── ui.py
        │   └── views
        │       ├── base.py
        │       ├── document.py
        │       ├── help.py
        │       ├── requirements.py
        │       ├── results.py
        │       └── searchform.py
        ├── main.py
        ├── models
        │   ├── __init__.py
        │   ├── base.py
        │   ├── resources.py
        │   └── sites.py
        ├── settings.py
        ├── templates
        │   ├── __init__.py
        │   ├── markdown.xslt
        │   ├── tests_core.html
        │   └── tests.py
        └── utils
            ├── __init__.py
            ├── cli.py
            ├── logger.py
            ├── parser.py
            ├── parsetab.py
            ├── search.py
            ├── server.py
            ├── tests.py
            └── tools.py
```

# Files

--------------------------------------------------------------------------------
/docs/_static/language_data.js:
--------------------------------------------------------------------------------

```javascript
/*
 * language_data.js
 * ~~~~~~~~~~~~~~~~
 *
 * This script contains the language-specific data used by searchtools.js,
 * namely the list of stopwords, stemmer, scorer and splitter.
 *
 * :copyright: Copyright 2007-2023 by the Sphinx team, see AUTHORS.
 * :license: BSD, see LICENSE for details.
 *
 */

var stopwords = ["a", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "near", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"];


/* Non-minified version is copied as a separate JS file, is available */

/**
 * Porter Stemmer
 */
var Stemmer = function() {

  var step2list = {
    ational: 'ate',
    tional: 'tion',
    enci: 'ence',
    anci: 'ance',
    izer: 'ize',
    bli: 'ble',
    alli: 'al',
    entli: 'ent',
    eli: 'e',
    ousli: 'ous',
    ization: 'ize',
    ation: 'ate',
    ator: 'ate',
    alism: 'al',
    iveness: 'ive',
    fulness: 'ful',
    ousness: 'ous',
    aliti: 'al',
    iviti: 'ive',
    biliti: 'ble',
    logi: 'log'
  };

  var step3list = {
    icate: 'ic',
    ative: '',
    alize: 'al',
    iciti: 'ic',
    ical: 'ic',
    ful: '',
    ness: ''
  };

  var c = "[^aeiou]";          // consonant
  var v = "[aeiouy]";          // vowel
  var C = c + "[^aeiouy]*";    // consonant sequence
  var V = v + "[aeiou]*";      // vowel sequence

  var mgr0 = "^(" + C + ")?" + V + C;                      // [C]VC... is m>0
  var meq1 = "^(" + C + ")?" + V + C + "(" + V + ")?$";    // [C]VC[V] is m=1
  var mgr1 = "^(" + C + ")?" + V + C + V + C;              // [C]VCVC... is m>1
  var s_v   = "^(" + C + ")?" + v;                         // vowel in stem

  this.stemWord = function (w) {
    var stem;
    var suffix;
    var firstch;
    var origword = w;

    if (w.length < 3)
      return w;

    var re;
    var re2;
    var re3;
    var re4;

    firstch = w.substr(0,1);
    if (firstch == "y")
      w = firstch.toUpperCase() + w.substr(1);

    // Step 1a
    re = /^(.+?)(ss|i)es$/;
    re2 = /^(.+?)([^s])s$/;

    if (re.test(w))
      w = w.replace(re,"$1$2");
    else if (re2.test(w))
      w = w.replace(re2,"$1$2");

    // Step 1b
    re = /^(.+?)eed$/;
    re2 = /^(.+?)(ed|ing)$/;
    if (re.test(w)) {
      var fp = re.exec(w);
      re = new RegExp(mgr0);
      if (re.test(fp[1])) {
        re = /.$/;
        w = w.replace(re,"");
      }
    }
    else if (re2.test(w)) {
      var fp = re2.exec(w);
      stem = fp[1];
      re2 = new RegExp(s_v);
      if (re2.test(stem)) {
        w = stem;
        re2 = /(at|bl|iz)$/;
        re3 = new RegExp("([^aeiouylsz])\\1$");
        re4 = new RegExp("^" + C + v + "[^aeiouwxy]$");
        if (re2.test(w))
          w = w + "e";
        else if (re3.test(w)) {
          re = /.$/;
          w = w.replace(re,"");
        }
        else if (re4.test(w))
          w = w + "e";
      }
    }

    // Step 1c
    re = /^(.+?)y$/;
    if (re.test(w)) {
      var fp = re.exec(w);
      stem = fp[1];
      re = new RegExp(s_v);
      if (re.test(stem))
        w = stem + "i";
    }

    // Step 2
    re = /^(.+?)(ational|tional|enci|anci|izer|bli|alli|entli|eli|ousli|ization|ation|ator|alism|iveness|fulness|ousness|aliti|iviti|biliti|logi)$/;
    if (re.test(w)) {
      var fp = re.exec(w);
      stem = fp[1];
      suffix = fp[2];
      re = new RegExp(mgr0);
      if (re.test(stem))
        w = stem + step2list[suffix];
    }

    // Step 3
    re = /^(.+?)(icate|ative|alize|iciti|ical|ful|ness)$/;
    if (re.test(w)) {
      var fp = re.exec(w);
      stem = fp[1];
      suffix = fp[2];
      re = new RegExp(mgr0);
      if (re.test(stem))
        w = stem + step3list[suffix];
    }

    // Step 4
    re = /^(.+?)(al|ance|ence|er|ic|able|ible|ant|ement|ment|ent|ou|ism|ate|iti|ous|ive|ize)$/;
    re2 = /^(.+?)(s|t)(ion)$/;
    if (re.test(w)) {
      var fp = re.exec(w);
      stem = fp[1];
      re = new RegExp(mgr1);
      if (re.test(stem))
        w = stem;
    }
    else if (re2.test(w)) {
      var fp = re2.exec(w);
      stem = fp[1] + fp[2];
      re2 = new RegExp(mgr1);
      if (re2.test(stem))
        w = stem;
    }

    // Step 5
    re = /^(.+?)e$/;
    if (re.test(w)) {
      var fp = re.exec(w);
      stem = fp[1];
      re = new RegExp(mgr1);
      re2 = new RegExp(meq1);
      re3 = new RegExp("^" + C + v + "[^aeiouwxy]$");
      if (re.test(stem) || (re2.test(stem) && !(re3.test(stem))))
        w = stem;
    }
    re = /ll$/;
    re2 = new RegExp(mgr1);
    if (re.test(w) && re2.test(w)) {
      re = /.$/;
      w = w.replace(re,"");
    }

    // and turn initial Y back to y
    if (firstch == "y")
      w = firstch.toLowerCase() + w.substr(1);
    return w;
  }
}
```
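The `mgr0` and `mgr1` patterns in the stemmer above decide whether a suffix may be removed, based on the stem's "measure" (the number of vowel-consonant sequences). As a rough illustration, the sketch below ports those two regular expressions to Python, the project's own language. The helper name `measure_class` is hypothetical and appears nowhere in the repository; the character classes are copied from the JavaScript source as-is.

```python
import re

# Character classes copied from the JavaScript stemmer above.
c = "[^aeiou]"          # consonant
v = "[aeiouy]"          # vowel
C = c + "[^aeiouy]*"    # consonant sequence
V = v + "[aeiou]*"      # vowel sequence

# [C]VC... is m>0, and [C]VCVC... is m>1, as in the source comments.
mgr0 = re.compile("^(" + C + ")?" + V + C)
mgr1 = re.compile("^(" + C + ")?" + V + C + V + C)


def measure_class(stem):
    """Hypothetical helper: classify a stem's measure as m=0, m=1, or m>1."""
    if mgr1.search(stem):
        return "m>1"
    if mgr0.search(stem):
        return "m=1"
    return "m=0"


for word in ("tree", "trouble", "troubles"):
    print(word, measure_class(word))
```

In the stemmer, steps 2 and 3 only rewrite a suffix when the remaining stem satisfies `mgr0`, while step 4 requires `mgr1`, so short stems such as "tree" are left alone.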

--------------------------------------------------------------------------------
/src/mcp_server_webcrawl/crawlers/siteone/tests.py:
--------------------------------------------------------------------------------

```python
from mcp_server_webcrawl.crawlers.siteone.crawler import SiteOneCrawler
from mcp_server_webcrawl.crawlers.base.tests import BaseCrawlerTests
from mcp_server_webcrawl.crawlers import get_fixture_directory
from mcp_server_webcrawl.crawlers.siteone.adapter import SiteOneManager
from mcp_server_webcrawl.utils.logger import get_logger

logger = get_logger()

# calculate using same hash function as adapter
EXAMPLE_SITE_ID = SiteOneManager.string_to_id("example.com")
PRAGMAR_SITE_ID = SiteOneManager.string_to_id("pragmar.com")

class SiteOneTests(BaseCrawlerTests):
    """
    Test suite for the SiteOne crawler implementation.
    Uses all wrapped test methods from BaseCrawlerTests plus SiteOne-specific features.
    """

    def setUp(self):
        """
        Set up the test environment with fixture data.
        """
        super().setUp()
        self._datasrc = get_fixture_directory() / "siteone"

    def test_siteone_pulse(self):
        """
        Test basic crawler initialization.
        """
        crawler = SiteOneCrawler(self._datasrc)
        self.assertIsNotNone(crawler)
        self.assertTrue(self._datasrc.is_dir())

    def test_siteone_sites(self):
        """
        Test site retrieval API functionality.
        """
        crawler = SiteOneCrawler(self._datasrc)
        self.run_pragmar_site_tests(crawler, PRAGMAR_SITE_ID)

    def test_siteone_search(self):
        """
        Test boolean search functionality.
        """
        crawler = SiteOneCrawler(self._datasrc)
        self.run_pragmar_search_tests(crawler, PRAGMAR_SITE_ID)

    def test_siteone_resources(self):
        """
        Test resource retrieval API functionality with various parameters.
        """
        crawler = SiteOneCrawler(self._datasrc)
        self.run_sites_resources_tests(crawler, PRAGMAR_SITE_ID, EXAMPLE_SITE_ID)

    def test_siteone_images(self):
        """
        Test SiteOne-specific image handling and thumbnails.
        """
        crawler = SiteOneCrawler(self._datasrc)
        self.run_pragmar_image_tests(crawler, PRAGMAR_SITE_ID)

    def test_siteone_sorts(self):
        """
        Test random sort functionality using the '?' sort parameter.
        """
        crawler = SiteOneCrawler(self._datasrc)
        self.run_pragmar_sort_tests(crawler, PRAGMAR_SITE_ID)

    def test_siteone_content_parsing(self):
        """
        Test content type detection and parsing.
        """
        crawler = SiteOneCrawler(self._datasrc)
        self.run_pragmar_content_tests(crawler, PRAGMAR_SITE_ID, False)

    def test_siteone_advanced_features(self):
        """
        Test SiteOne-specific advanced features not covered by base tests.
        """
        crawler = SiteOneCrawler(self._datasrc)

        # numeric status operators (SiteOne-specific feature)
        status_resources_gt = crawler.get_resources_api(
            sites=[PRAGMAR_SITE_ID],
            query="status: >400",
        )
        self.assertTrue(status_resources_gt.total > 0, "Numeric status operator should return results")
        for resource in status_resources_gt._results:
            self.assertGreater(resource.status, 400)

        # redirect status codes
        status_resources_redirect = crawler.get_resources_api(
            sites=[PRAGMAR_SITE_ID],
            query="status: 301"
        )
        self.assertTrue(status_resources_redirect.total > 0, "301 status filtering should return results")
        for resource in status_resources_redirect._results:
            self.assertEqual(resource.status, 301)

        # 404 with size validation
        status_resources_not_found = crawler.get_resources_api(
            sites=[PRAGMAR_SITE_ID],
            query="status: 404",
            fields=["size"]
        )
        self.assertTrue(status_resources_not_found.total > 0, "404 status filtering should return results")
        for resource in status_resources_not_found._results:
            self.assertEqual(resource.status, 404)

        not_found_result = status_resources_not_found._results[0].to_dict()
        self.assertIn("size", not_found_result)
        self.assertGreater(not_found_result["size"], 0, "404 responses should still have size > 0")

        # custom field selection without a query
        custom_fields = ["content", "headers", "time"]
        field_resources = crawler.get_resources_api(
            sites=[PRAGMAR_SITE_ID],
            fields=custom_fields
        )
        self.assertTrue(field_resources.total > 0)

        # Test the SiteOne-specific forcefield dict method
        resource_dict = field_resources._results[0].to_forcefield_dict(custom_fields)
        for field in custom_fields:
            self.assertIn(field, resource_dict, f"Field '{field}' should be in forcefield response")

    def test_report(self):
        """
        Run test report, save to data directory.
        """
        crawler = SiteOneCrawler(self._datasrc)
        logger.info(self.run_pragmar_report(crawler, PRAGMAR_SITE_ID, "SiteOne"))
```
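The advanced-features test above exercises queries such as `status: >400` and `status: 301`. The sketch below shows one way such a filter could be evaluated in isolation. It is not the project's query parser: `status_predicate` and its regular expression are hypothetical, and the `>=`, `<`, and `<=` operators are assumptions, since only `>` and bare equality appear in the tests.

```python
import operator
import re

# Hypothetical grammar: "status:" followed by an optional comparison and a 3-digit code.
_STATUS_QUERY = re.compile(r"^status:\s*(>=|<=|>|<)?\s*(\d{3})$")

# A missing operator means equality; operators other than ">" are assumptions.
_OPERATORS = {
    None: operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def status_predicate(query):
    """Return a function that tests an HTTP status code against the query."""
    match = _STATUS_QUERY.match(query.strip())
    if match is None:
        raise ValueError(f"unsupported status query: {query!r}")
    compare = _OPERATORS[match.group(1)]
    threshold = int(match.group(2))
    return lambda status: compare(status, threshold)


statuses = [200, 301, 404, 500]
over_400 = [s for s in statuses if status_predicate("status: >400")(s)]
redirects = [s for s in statuses if status_predicate("status: 301")(s)]
print(over_400)   # [404, 500]
print(redirects)  # [301]
```

The assertions in `test_siteone_advanced_features` check the same property from the outside: every resource returned for `status: >400` must have a status greater than 400, and every resource returned for `status: 301` must have a status of exactly 301.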

--------------------------------------------------------------------------------
/docs/_static/pygments.css:
--------------------------------------------------------------------------------

```css
pre { line-height: 125%; }
td.linenos .normal { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
span.linenos { color: inherit; background-color: transparent; padding-left: 5px; padding-right: 5px; }
td.linenos .special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
span.linenos.special { color: #000000; background-color: #ffffc0; padding-left: 5px; padding-right: 5px; }
.highlight .hll { background-color: #ffffcc }
.highlight { background: #f8f8f8; }
.highlight .c { color: #3D7B7B; font-style: italic } /* Comment */
.highlight .err { border: 1px solid #FF0000 } /* Error */
.highlight .k { color: #008000; font-weight: bold } /* Keyword */
.highlight .o { color: #666666 } /* Operator */
.highlight .ch { color: #3D7B7B; font-style: italic } /* Comment.Hashbang */
.highlight .cm { color: #3D7B7B; font-style: italic } /* Comment.Multiline */
.highlight .cp { color: #9C6500 } /* Comment.Preproc */
.highlight .cpf { color: #3D7B7B; font-style: italic } /* Comment.PreprocFile */
.highlight .c1 { color: #3D7B7B; font-style: italic } /* Comment.Single */
.highlight .cs { color: #3D7B7B; font-style: italic } /* Comment.Special */
.highlight .gd { color: #A00000 } /* Generic.Deleted */
.highlight .ge { font-style: italic } /* Generic.Emph */
.highlight .ges { font-weight: bold; font-style: italic } /* Generic.EmphStrong */
.highlight .gr { color: #E40000 } /* Generic.Error */
.highlight .gh { color: #000080; font-weight: bold } /* Generic.Heading */
.highlight .gi { color: #008400 } /* Generic.Inserted */
.highlight .go { color: #717171 } /* Generic.Output */
.highlight .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
.highlight .gs { font-weight: bold } /* Generic.Strong */
.highlight .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
.highlight .gt { color: #0044DD } /* Generic.Traceback */
.highlight .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
.highlight .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
.highlight .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
.highlight .kp { color: #008000 } /* Keyword.Pseudo */
.highlight .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
.highlight .kt { color: #B00040 } /* Keyword.Type */
.highlight .m { color: #666666 } /* Literal.Number */
.highlight .s { color: #BA2121 } /* Literal.String */
.highlight .na { color: #687822 } /* Name.Attribute */
.highlight .nb { color: #008000 } /* Name.Builtin */
.highlight .nc { color: #0000FF; font-weight: bold } /* Name.Class */
.highlight .no { color: #880000 } /* Name.Constant */
.highlight .nd { color: #AA22FF } /* Name.Decorator */
.highlight .ni { color: #717171; font-weight: bold } /* Name.Entity */
.highlight .ne { color: #CB3F38; font-weight: bold } /* Name.Exception */
.highlight .nf { color: #0000FF } /* Name.Function */
.highlight .nl { color: #767600 } /* Name.Label */
.highlight .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
.highlight .nt { color: #008000; font-weight: bold } /* Name.Tag */
.highlight .nv { color: #19177C } /* Name.Variable */
.highlight .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
.highlight .w { color: #bbbbbb } /* Text.Whitespace */
.highlight .mb { color: #666666 } /* Literal.Number.Bin */
.highlight .mf { color: #666666 } /* Literal.Number.Float */
.highlight .mh { color: #666666 } /* Literal.Number.Hex */
.highlight .mi { color: #666666 } /* Literal.Number.Integer */
.highlight .mo { color: #666666 } /* Literal.Number.Oct */
.highlight .sa { color: #BA2121 } /* Literal.String.Affix */
.highlight .sb { color: #BA2121 } /* Literal.String.Backtick */
.highlight .sc { color: #BA2121 } /* Literal.String.Char */
.highlight .dl { color: #BA2121 } /* Literal.String.Delimiter */
.highlight .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
.highlight .s2 { color: #BA2121 } /* Literal.String.Double */
.highlight .se { color: #AA5D1F; font-weight: bold } /* Literal.String.Escape */
.highlight .sh { color: #BA2121 } /* Literal.String.Heredoc */
.highlight .si { color: #A45A77; font-weight: bold } /* Literal.String.Interpol */
.highlight .sx { color: #008000 } /* Literal.String.Other */
.highlight .sr { color: #A45A77 } /* Literal.String.Regex */
.highlight .s1 { color: #BA2121 } /* Literal.String.Single */
.highlight .ss { color: #19177C } /* Literal.String.Symbol */
.highlight .bp { color: #008000 } /* Name.Builtin.Pseudo */
.highlight .fm { color: #0000FF } /* Name.Function.Magic */
.highlight .vc { color: #19177C } /* Name.Variable.Class */
.highlight .vg { color: #19177C } /* Name.Variable.Global */
.highlight .vi { color: #19177C } /* Name.Variable.Instance */
.highlight .vm { color: #19177C } /* Name.Variable.Magic */
.highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
```

--------------------------------------------------------------------------------
/docs/_static/js/theme.js:
--------------------------------------------------------------------------------

```javascript
1 | !function(n){var e={};function t(i){if(e[i])return e[i].exports;var o=e[i]={i:i,l:!1,exports:{}};return n[i].call(o.exports,o,o.exports,t),o.l=!0,o.exports}t.m=n,t.c=e,t.d=function(n,e,i){t.o(n,e)||Object.defineProperty(n,e,{enumerable:!0,get:i})},t.r=function(n){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(n,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(n,"__esModule",{value:!0})},t.t=function(n,e){if(1&e&&(n=t(n)),8&e)return n;if(4&e&&"object"==typeof n&&n&&n.__esModule)return n;var i=Object.create(null);if(t.r(i),Object.defineProperty(i,"default",{enumerable:!0,value:n}),2&e&&"string"!=typeof n)for(var o in n)t.d(i,o,function(e){return n[e]}.bind(null,o));return i},t.n=function(n){var e=n&&n.__esModule?function(){return n.default}:function(){return n};return t.d(e,"a",e),e},t.o=function(n,e){return Object.prototype.hasOwnProperty.call(n,e)},t.p="",t(t.s=0)}([function(n,e,t){t(1),n.exports=t(3)},function(n,e,t){(function(){var e="undefined"!=typeof window?window.jQuery:t(2);n.exports.ThemeNav={navBar:null,win:null,winScroll:!1,winResize:!1,linkScroll:!1,winPosition:0,winHeight:null,docHeight:null,isRunning:!1,enable:function(n){var t=this;void 0===n&&(n=!0),t.isRunning||(t.isRunning=!0,e((function(e){t.init(e),t.reset(),t.win.on("hashchange",t.reset),n&&t.win.on("scroll",(function(){t.linkScroll||t.winScroll||(t.winScroll=!0,requestAnimationFrame((function(){t.onScroll()})))})),t.win.on("resize",(function(){t.winResize||(t.winResize=!0,requestAnimationFrame((function(){t.onResize()})))})),t.onResize()})))},enableSticky:function(){this.enable(!0)},init:function(n){n(document);var e=this;this.navBar=n("div.wy-side-scroll:first"),this.win=n(window),n(document).on("click","[data-toggle='wy-nav-top']",(function(){n("[data-toggle='wy-nav-shift']").toggleClass("shift"),n("[data-toggle='rst-versions']").toggleClass("shift")})).on("click",".wy-menu-vertical .current ul li a",(function(){var 
t=n(this);n("[data-toggle='wy-nav-shift']").removeClass("shift"),n("[data-toggle='rst-versions']").toggleClass("shift"),e.toggleCurrent(t),e.hashChange()})).on("click","[data-toggle='rst-current-version']",(function(){n("[data-toggle='rst-versions']").toggleClass("shift-up")})),n("table.docutils:not(.field-list,.footnote,.citation)").wrap("<div class='wy-table-responsive'></div>"),n("table.docutils.footnote").wrap("<div class='wy-table-responsive footnote'></div>"),n("table.docutils.citation").wrap("<div class='wy-table-responsive citation'></div>"),n(".wy-menu-vertical ul").not(".simple").siblings("a").each((function(){var t=n(this);expand=n('<button class="toctree-expand" title="Open/close menu"></button>'),expand.on("click",(function(n){return e.toggleCurrent(t),n.stopPropagation(),!1})),t.prepend(expand)}))},reset:function(){var n=encodeURI(window.location.hash)||"#";try{var e=$(".wy-menu-vertical"),t=e.find('[href="'+n+'"]');if(0===t.length){var i=$('.document [id="'+n.substring(1)+'"]').closest("div.section");0===(t=e.find('[href="#'+i.attr("id")+'"]')).length&&(t=e.find('[href="#"]'))}if(t.length>0){$(".wy-menu-vertical .current").removeClass("current").attr("aria-expanded","false"),t.addClass("current").attr("aria-expanded","true"),t.closest("li.toctree-l1").parent().addClass("current").attr("aria-expanded","true");for(let n=1;n<=10;n++)t.closest("li.toctree-l"+n).addClass("current").attr("aria-expanded","true");t[0].scrollIntoView()}}catch(n){console.log("Error expanding nav for anchor",n)}},onScroll:function(){this.winScroll=!1;var n=this.win.scrollTop(),e=n+this.winHeight,t=this.navBar.scrollTop()+(n-this.winPosition);n<0||e>this.docHeight||(this.navBar.scrollTop(t),this.winPosition=n)},onResize:function(){this.winResize=!1,this.winHeight=this.win.height(),this.docHeight=$(document).height()},hashChange:function(){this.linkScroll=!0,this.win.one("hashchange",(function(){this.linkScroll=!1}))},toggleCurrent:function(n){var 
e=n.closest("li");e.siblings("li.current").removeClass("current").attr("aria-expanded","false"),e.siblings().find("li.current").removeClass("current").attr("aria-expanded","false");var t=e.find("> ul li");t.length&&(t.removeClass("current").attr("aria-expanded","false"),e.toggleClass("current").attr("aria-expanded",(function(n,e){return"true"==e?"false":"true"})))}},"undefined"!=typeof window&&(window.SphinxRtdTheme={Navigation:n.exports.ThemeNav,StickyNav:n.exports.ThemeNav}),function(){for(var n=0,e=["ms","moz","webkit","o"],t=0;t<e.length&&!window.requestAnimationFrame;++t)window.requestAnimationFrame=window[e[t]+"RequestAnimationFrame"],window.cancelAnimationFrame=window[e[t]+"CancelAnimationFrame"]||window[e[t]+"CancelRequestAnimationFrame"];window.requestAnimationFrame||(window.requestAnimationFrame=function(e,t){var i=(new Date).getTime(),o=Math.max(0,16-(i-n)),r=window.setTimeout((function(){e(i+o)}),o);return n=i+o,r}),window.cancelAnimationFrame||(window.cancelAnimationFrame=function(n){clearTimeout(n)})}()}).call(window)},function(n,e){n.exports=jQuery},function(n,e,t){}]);
```

--------------------------------------------------------------------------------
/docs/_static/sphinx_highlight.js:
--------------------------------------------------------------------------------

```javascript
  1 | /* Highlighting utilities for Sphinx HTML documentation. */
  2 | "use strict";
  3 | 
  4 | const SPHINX_HIGHLIGHT_ENABLED = true
  5 | 
  6 | /**
  7 |  * highlight a given string on a node by wrapping it in
  8 |  * span elements with the given class name.
  9 |  */
 10 | const _highlight = (node, addItems, text, className) => {
 11 |   if (node.nodeType === Node.TEXT_NODE) {
 12 |     const val = node.nodeValue;
 13 |     const parent = node.parentNode;
 14 |     const pos = val.toLowerCase().indexOf(text);
 15 |     if (
 16 |       pos >= 0 &&
 17 |       !parent.classList.contains(className) &&
 18 |       !parent.classList.contains("nohighlight")
 19 |     ) {
 20 |       let span;
 21 | 
 22 |       const closestNode = parent.closest("body, svg, foreignObject");
 23 |       const isInSVG = closestNode && closestNode.matches("svg");
 24 |       if (isInSVG) {
 25 |         span = document.createElementNS("http://www.w3.org/2000/svg", "tspan");
 26 |       } else {
 27 |         span = document.createElement("span");
 28 |         span.classList.add(className);
 29 |       }
 30 | 
 31 |       span.appendChild(document.createTextNode(val.substr(pos, text.length)));
 32 |       const rest = document.createTextNode(val.substr(pos + text.length));
 33 |       parent.insertBefore(
 34 |         span,
 35 |         parent.insertBefore(
 36 |           rest,
 37 |           node.nextSibling
 38 |         )
 39 |       );
 40 |       node.nodeValue = val.substr(0, pos);
 41 |       /* There may be more occurrences of search term in this node. So call this
 42 |        * function recursively on the remaining fragment.
 43 |        */
 44 |       _highlight(rest, addItems, text, className);
 45 | 
 46 |       if (isInSVG) {
 47 |         const rect = document.createElementNS(
 48 |           "http://www.w3.org/2000/svg",
 49 |           "rect"
 50 |         );
 51 |         const bbox = parent.getBBox();
 52 |         rect.x.baseVal.value = bbox.x;
 53 |         rect.y.baseVal.value = bbox.y;
 54 |         rect.width.baseVal.value = bbox.width;
 55 |         rect.height.baseVal.value = bbox.height;
 56 |         rect.setAttribute("class", className);
 57 |         addItems.push({ parent: parent, target: rect });
 58 |       }
 59 |     }
 60 |   } else if (node.matches && !node.matches("button, select, textarea")) {
 61 |     node.childNodes.forEach((el) => _highlight(el, addItems, text, className));
 62 |   }
 63 | };
 64 | const _highlightText = (thisNode, text, className) => {
 65 |   let addItems = [];
 66 |   _highlight(thisNode, addItems, text, className);
 67 |   addItems.forEach((obj) =>
 68 |     obj.parent.insertAdjacentElement("beforebegin", obj.target)
 69 |   );
 70 | };
 71 | 
 72 | /**
 73 |  * Small JavaScript module for the documentation.
 74 |  */
 75 | const SphinxHighlight = {
 76 | 
 77 |   /**
 78 |    * highlight the search words provided in localstorage in the text
 79 |    */
 80 |   highlightSearchWords: () => {
 81 |     if (!SPHINX_HIGHLIGHT_ENABLED) return;  // bail if no highlight
 82 | 
 83 |     // get and clear terms from localstorage
 84 |     const url = new URL(window.location);
 85 |     const highlight =
 86 |         localStorage.getItem("sphinx_highlight_terms")
 87 |         || url.searchParams.get("highlight")
 88 |         || "";
 89 |     localStorage.removeItem("sphinx_highlight_terms")
 90 |     url.searchParams.delete("highlight");
 91 |     window.history.replaceState({}, "", url);
 92 | 
 93 |     // get individual terms from highlight string
 94 |     const terms = highlight.toLowerCase().split(/\s+/).filter(x => x);
 95 |     if (terms.length === 0) return; // nothing to do
 96 | 
 97 |     // There should never be more than one element matching "div.body"
 98 |     const divBody = document.querySelectorAll("div.body");
 99 |     const body = divBody.length ? divBody[0] : document.querySelector("body");
100 |     window.setTimeout(() => {
101 |       terms.forEach((term) => _highlightText(body, term, "highlighted"));
102 |     }, 10);
103 | 
104 |     const searchBox = document.getElementById("searchbox");
105 |     if (searchBox === null) return;
106 |     searchBox.appendChild(
107 |       document
108 |         .createRange()
109 |         .createContextualFragment(
110 |           '<p class="highlight-link">' +
111 |             '<a href="javascript:SphinxHighlight.hideSearchWords()">' +
112 |             _("Hide Search Matches") +
113 |             "</a></p>"
114 |         )
115 |     );
116 |   },
117 | 
118 |   /**
119 |    * helper function to hide the search marks again
120 |    */
121 |   hideSearchWords: () => {
122 |     document
123 |       .querySelectorAll("#searchbox .highlight-link")
124 |       .forEach((el) => el.remove());
125 |     document
126 |       .querySelectorAll("span.highlighted")
127 |       .forEach((el) => el.classList.remove("highlighted"));
128 |     localStorage.removeItem("sphinx_highlight_terms")
129 |   },
130 | 
131 |   initEscapeListener: () => {
132 |     // only install a listener if it is really needed
133 |     if (!DOCUMENTATION_OPTIONS.ENABLE_SEARCH_SHORTCUTS) return;
134 | 
135 |     document.addEventListener("keydown", (event) => {
136 |       // bail for input elements
137 |       if (BLACKLISTED_KEY_CONTROL_ELEMENTS.has(document.activeElement.tagName)) return;
138 |       // bail with special keys
139 |       if (event.shiftKey || event.altKey || event.ctrlKey || event.metaKey) return;
140 |       if (DOCUMENTATION_OPTIONS.ENABLE_SEARCH_SHORTCUTS && (event.key === "Escape")) {
141 |         SphinxHighlight.hideSearchWords();
142 |         event.preventDefault();
143 |       }
144 |     });
145 |   },
146 | };
147 | 
148 | _ready(() => {
149 |   /* Do not call highlightSearchWords() when we are on the search page.
150 |    * It will highlight words from the *previous* search query.
151 |    */
152 |   if (typeof Search === "undefined") SphinxHighlight.highlightSearchWords();
153 |   SphinxHighlight.initEscapeListener();
154 | });
155 | 
```
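
The splitting step inside `_highlight` can be illustrated without a DOM. The Python sketch below is not part of Sphinx: `split_highlights` is a hypothetical name, and a loop stands in for the recursive call on the remaining text node. It assumes lowercasing does not change the string length, which holds for typical ASCII text.

```python
def split_highlights(value: str, term: str) -> list[tuple[str, bool]]:
    """Split value into (segment, is_match) pairs, matching term case-insensitively."""
    segments: list[tuple[str, bool]] = []
    lowered = value.lower()
    needle = term.lower()
    start = 0
    # An empty needle would match everywhere, so skip the search entirely.
    while needle:
        pos = lowered.find(needle, start)
        if pos < 0:
            break
        if pos > start:
            # Text before the match stays unhighlighted.
            segments.append((value[start:pos], False))
        # Keep the original casing of the matched text.
        segments.append((value[pos:pos + len(needle)], True))
        start = pos + len(needle)
    if start < len(value):
        # Whatever follows the last match stays unhighlighted.
        segments.append((value[start:], False))
    return segments


parts = split_highlights("Search the search box", "search")
```

Each pair flagged `True` corresponds to a fragment the JavaScript wraps in a `span` (or `tspan` inside SVG); the other pairs stay as plain text.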

--------------------------------------------------------------------------------
/src/mcp_server_webcrawl/crawlers/base/api.py:
--------------------------------------------------------------------------------

```python
  1 | import json
  2 | from datetime import datetime, timezone
  3 | from time import time
  4 | from typing import Any, Final
  5 | 
  6 | from mcp_server_webcrawl.crawlers.base.adapter import IndexState
  7 | from mcp_server_webcrawl.models.base import METADATA_VALUE_TYPE
  8 | from mcp_server_webcrawl.models.resources import ResourceResult, ResourceResultType
  9 | from mcp_server_webcrawl.models.sites import SiteResult
 10 | from mcp_server_webcrawl.utils import to_isoformat_zulu
 11 | from mcp_server_webcrawl.utils.logger import get_logger
 12 | 
 13 | logger = get_logger()
 14 | 
 15 | OVERRIDE_ERROR_MESSAGE: Final[str] = "BaseCrawler subclasses must implement \
 16 | the following methods: handle_list_tools, handle_call_tool"
 17 | 
 18 | class BaseJsonApiEncoder(json.JSONEncoder):
 19 |     """
 20 |     Custom JSON encoder for BaseJsonApi objects and ResourceResultType enums.
 21 |     """
 22 | 
 23 |     def default(self, obj) -> Any:
 24 |         """
 25 |         Override default encoder to handle custom types.
 26 | 
 27 |         Args:
 28 |             obj: Object to encode
 29 | 
 30 |         Returns:
 31 |             JSON serializable representation of the object
 32 |         """
 33 | 
 34 |         if isinstance(obj, BaseJsonApi):
 35 |             return obj.__dict__
 36 |         elif isinstance(obj, ResourceResultType):
 37 |             return obj.value
 38 |         elif isinstance(obj, datetime):
 39 |             return to_isoformat_zulu(obj)
 40 |         return super().default(obj)
 41 | 
 42 | class BaseJsonApi:
 43 |     """
 44 |     Base class for JSON API responses.
 45 | 
 46 |     Provides a standardized structure for API responses including metadata,
 47 |     results, and error handling.
 48 |     """
 49 | 
 50 |     def __init__(self, method: str, args: dict[str, Any], index_state: IndexState | None = None):
 51 |         """
 52 |         Construct with the arguments of creation (aoc); these are echoed back in
 53 |         the JSON response. The object collapses into JSON on json.dumps, which
 54 |         works because everything it holds implements to_dict.
 55 | 
 56 |         Args:
 57 |             method: API method name
 58 |             args: Dictionary of API arguments
 59 |             index_state: indexing, complete, remote, etc.
 60 |         """
 61 | 
 62 |         from mcp_server_webcrawl import __version__, __name__
 63 |         self._start_time = time()
 64 |         self.method = method
 65 |         self.args = args
 66 |         self.meta_generator = f"{__name__} ({__version__})"
 67 |         self.meta_generated = to_isoformat_zulu(datetime.now(timezone.utc))
 68 |         self.meta_index = index_state.to_dict() if index_state is not None else None
 69 |         self._results: list[SiteResult | ResourceResult] = []
 70 |         self._results_total: int = 0
 71 |         self._results_offset: int = 0
 72 |         self._results_limit: int = 0
 73 |         self._errors: list[str] = []
 74 | 
 75 |     @property
 76 |     def total(self) -> int:
 77 |         """
 78 |         Returns the total number of results.
 79 | 
 80 |         Returns:
 81 |             Integer count of total results
 82 |         """
 83 | 
 84 |         return self._results_total
 85 | 
 86 |     def get_results(self) -> list[SiteResult | ResourceResult]:
 87 |         """
 88 |         Returns list of results.
 89 | 
 90 |         Returns:
 91 |             Results of type SiteResult or ResourceResult
 92 |         """
 93 | 
 94 |         return self._results.copy()
 95 | 
 96 |     def set_results(self, results: list[SiteResult | ResourceResult], total: int, offset: int, limit: int) -> None:
 97 |         """
 98 |         Set the results of the API response.
 99 | 
100 |         Args:
101 |             results: List of result objects
102 |             total: Total number of results (including those beyond limit)
103 |             offset: Starting position in the full result set
104 |             limit: Maximum number of results to include
105 |         """
106 | 
107 |         self._results = results
108 |         self._results_total = total
109 |         self._results_offset = offset
110 |         self._results_limit = limit
111 | 
112 |     def append_error(self, message: str) -> None:
113 |         """
114 |         Add an error to the JSON response, visible to the endpoint LLM.
115 | 
116 |         Args:
117 |             message: Error message to add
118 |         """
119 | 
120 |         self._errors.append(message)
121 | 
122 |     def to_dict(self) -> dict[str, METADATA_VALUE_TYPE]:
123 |         """
124 |         Convert the object to a JSON-serializable dictionary.
125 | 
126 |         Returns:
127 |             Dictionary representation of the API response
128 |         """
129 | 
130 |         response: dict[str, Any] = {
131 |             "__meta__": {
132 |                 "generator": f"{self.meta_generator}",
133 |                 "generated": f"{self.meta_generated}",
134 |                 "request": {
135 |                     "method": f"{self.method}",
136 |                     "arguments": self.args,
137 |                     "time": time() - self._start_time,
138 |                 },
139 |                 "results": {
140 |                     "total": self._results_total,
141 |                     "offset": self._results_offset,
142 |                     "limit": self._results_limit,
143 |                 },
144 |             },
145 |             "results": [r.to_forcefield_dict(self.args["fields"]) if hasattr(r, "to_forcefield_dict") else r for r in self._results]
146 |         }
147 | 
148 |         if self.meta_index is not None:
149 |             response["__meta__"]["index"] = self.meta_index
150 | 
151 |         if self._errors:
152 |             response["__meta__"]["errors"] = self._errors
153 | 
154 |         return response
155 | 
156 |     def to_json(self) -> str:
157 |         """
158 |         Serialize this object to a JSON string.
159 | 
160 |         Returns:
161 |             JSON string representation of the API response
162 |         """
163 | 
164 |         return json.dumps(self.to_dict(), indent=1, cls=BaseJsonApiEncoder)
165 | 
```
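
The response envelope built by `to_dict` and serialized by `to_json` can be sketched in isolation. The block below is a minimal illustration, not the package's implementation: `SketchResultType`, `SketchEncoder`, and `build_envelope` are hypothetical names, the `"resources"` method string is a placeholder, and the timestamp format only approximates what `to_isoformat_zulu` might produce.

```python
import json
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Optional


class SketchResultType(Enum):
    """Hypothetical stand-in for ResourceResultType."""
    PAGE = "html"


class SketchEncoder(json.JSONEncoder):
    """Teach json.dumps about types it cannot serialize on its own."""

    def default(self, obj: Any) -> Any:
        if isinstance(obj, Enum):
            return obj.value
        if isinstance(obj, datetime):
            # Assumption: a UTC timestamp with a trailing Z.
            return obj.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        return super().default(obj)


def build_envelope(method: str, args: dict[str, Any], results: list[Any],
                   total: int, offset: int, limit: int,
                   errors: Optional[list[str]] = None) -> dict[str, Any]:
    """Echo the request under __meta__ and place the results beside it."""
    envelope: dict[str, Any] = {
        "__meta__": {
            "request": {"method": method, "arguments": args},
            "results": {"total": total, "offset": offset, "limit": limit},
        },
        "results": results,
    }
    if errors:
        # Errors are attached only when at least one was recorded.
        envelope["__meta__"]["errors"] = errors
    return envelope


envelope = build_envelope(
    method="resources",  # placeholder method name
    args={"query": "type: html", "limit": 1},
    results=[{
        "id": 1,
        "type": SketchResultType.PAGE,
        "modified": datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    }],
    total=12,
    offset=0,
    limit=1,
)
payload = json.dumps(envelope, indent=1, cls=SketchEncoder)
decoded = json.loads(payload)
```

Passing the encoder through `cls=` keeps the result objects free of serialization logic; `json.dumps` calls `default` only for values it cannot handle natively.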

--------------------------------------------------------------------------------
/src/mcp_server_webcrawl/crawlers/interrobot/tests.py:
--------------------------------------------------------------------------------

```python
  1 | import asyncio
  2 | from logging import Logger
  3 | 
  4 | from mcp.types import EmbeddedResource, ImageContent, TextContent
  5 | 
  6 | from mcp_server_webcrawl.crawlers.base.tests import BaseCrawlerTests
  7 | from mcp_server_webcrawl.crawlers.interrobot.crawler import InterroBotCrawler
  8 | from mcp_server_webcrawl.models.resources import RESOURCES_TOOL_NAME
  9 | from mcp_server_webcrawl.crawlers import get_fixture_directory
 10 | from mcp_server_webcrawl.utils.logger import get_logger
 11 | 
 12 | # these IDs belong to the db test fixture (interrobot.v2.db)
 13 | EXAMPLE_SITE_ID = 1
 14 | PRAGMAR_SITE_ID = 2
 15 | 
 16 | logger: Logger = get_logger()
 17 | 
 18 | class InterroBotTests(BaseCrawlerTests):
 19 |     """
 20 |     Test suite for the InterroBot crawler implementation.
 21 |     Uses all wrapped test methods from BaseCrawlerTests plus InterroBot-specific features.
 22 |     """
 23 | 
 24 |     def setUp(self):
 25 |         """
 26 |         Set up the test environment with fixture data.
 27 |         """
 28 |         super().setUp()
 29 |         self.fixture_path = get_fixture_directory() / "interrobot" / "interrobot.v2.db"
 30 | 
 31 |     def test_interrobot_pulse(self):
 32 |         """
 33 |         Test basic crawler initialization.
 34 |         """
 35 |         crawler = InterroBotCrawler(self.fixture_path)
 36 |         self.assertIsNotNone(crawler)
 37 | 
 38 |     def test_interrobot_sites(self):
 39 |         """
 40 |         Test site retrieval API functionality.
 41 |         """
 42 |         crawler = InterroBotCrawler(self.fixture_path)
 43 |         # Note: InterroBot uses site ID 2 for pragmar instead of calculating from string
 44 |         self.run_pragmar_site_tests(crawler, PRAGMAR_SITE_ID)
 45 | 
 46 |     def test_interrobot_search(self):
 47 |         """
 48 |         Test boolean search functionality.
 49 |         """
 50 |         crawler = InterroBotCrawler(self.fixture_path)
 51 |         self.run_pragmar_search_tests(crawler, PRAGMAR_SITE_ID)
 52 | 
 53 |     def test_interrobot_resources(self):
 54 |         """
 55 |         Test resource retrieval API functionality with various parameters.
 56 |         """
 57 |         crawler = InterroBotCrawler(self.fixture_path)
 58 |         self.run_sites_resources_tests(crawler, PRAGMAR_SITE_ID, EXAMPLE_SITE_ID)
 59 | 
 60 |     def test_interrobot_images(self):
 61 |         """
 62 |         Test InterroBot-specific image handling and thumbnails.
 63 |         """
 64 |         crawler = InterroBotCrawler(self.fixture_path)
 65 |         self.run_pragmar_image_tests(crawler, PRAGMAR_SITE_ID)
 66 | 
 67 |     def test_interrobot_sorts(self):
 68 |         """
 69 |         Test random sort functionality using the '?' sort parameter.
 70 |         """
 71 |         crawler = InterroBotCrawler(self.fixture_path)
 72 |         self.run_pragmar_sort_tests(crawler, PRAGMAR_SITE_ID)
 73 | 
 74 |     def test_interrobot_content_parsing(self):
 75 |         """
 76 |         Test content type detection and parsing.
 77 |         """
 78 |         crawler = InterroBotCrawler(self.fixture_path)
 79 |         self.run_pragmar_content_tests(crawler, PRAGMAR_SITE_ID, False)
 80 | 
 81 |     def test_interrobot_mcp_features(self):
 82 |         """
 83 |         Test InterroBot-specific MCP tool functionality.
 84 |         """
 85 |         crawler = InterroBotCrawler(self.fixture_path)
 86 |         list_tools_result = asyncio.run(crawler.mcp_list_tools())
 87 |         self.assertIsNotNone(list_tools_result)
 88 | 
 89 |     def test_thumbnails_sync(self):
 90 |         """
 91 |         Test thumbnail generation functionality.
 92 |         """
 93 |         asyncio.run(self.__test_thumbnails())
 94 | 
 95 |     async def __test_thumbnails(self):
 96 |         """
 97 |         Thumbnails are tested only for InterroBot. Other fixtures are not
 98 |         dependable: images were either removed to slim the archive or not
 99 |         captured with default settings. Testing thumbnails here is enough.
100 |         """
101 |         crawler = InterroBotCrawler(self.fixture_path)
102 |         thumbnail_args = {
103 |             "datasrc": crawler.datasrc,
104 |             "sites": [PRAGMAR_SITE_ID],
105 |             "extras": ["thumbnails"],
106 |             "query": "type: img AND url: *.png",
107 |             "limit": 4,
108 |         }
109 |         thumbnail_result: list[TextContent | ImageContent | EmbeddedResource] = await crawler.mcp_call_tool(
110 |             RESOURCES_TOOL_NAME, thumbnail_args
111 |         )
112 |         if len(thumbnail_result) > 1:
113 |             self.assertTrue(
114 |                 thumbnail_result[1].type == "image",
115 |                 "ImageContent should be included in thumbnails response"
116 |             )
117 | 
118 |     def test_interrobot_advanced_site_features(self):
119 |         """
120 |         Test InterroBot-specific site features like the urls field.
121 |         """
122 |         crawler = InterroBotCrawler(self.fixture_path)
123 | 
124 |         # urls field retrieval
125 |         site_one_field_json = crawler.get_sites_api(ids=[1], fields=["urls"])
126 |         if site_one_field_json.total > 0:
127 |             result_dict = site_one_field_json._results[0].to_dict()
128 |             self.assertIn("urls", result_dict, "urls field should be present in response")
129 | 
130 |         # multiple custom fields
131 |         site_multiple_fields_json = crawler.get_sites_api(ids=[1], fields=["urls", "created"])
132 |         if site_multiple_fields_json.total > 0:
133 |             result = site_multiple_fields_json._results[0].to_dict()
134 |             self.assertIn("urls", result, "urls field should be present in response")
135 |             self.assertIn("created", result, "created field should be present in response")
136 | 
137 |     def test_report(self):
138 |         """
139 |         Run test report, save to data directory.
140 |         """
141 |         crawler = InterroBotCrawler(self.fixture_path)
142 |         logger.info(self.run_pragmar_report(crawler, PRAGMAR_SITE_ID, "InterroBot"))
143 | 
```
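
The pattern used by `test_thumbnails_sync`, a synchronous test method that drives a private coroutine through `asyncio.run`, can be reduced to a minimal sketch. The class and method names below are hypothetical and nothing from the crawler is involved; the coroutine simply yields to the event loop once and returns a constant.

```python
import asyncio
import io
import unittest


class AsyncWrapperSketch(unittest.TestCase):
    def test_value_sync(self):
        # asyncio.run creates an event loop, runs the coroutine to completion,
        # and closes the loop, so the test runner only sees a regular method.
        self.assertEqual(asyncio.run(self.__fetch_value()), 4)

    async def __fetch_value(self):
        # The double underscore triggers name mangling; the name does not
        # start with "test", so the loader never collects it directly.
        await asyncio.sleep(0)
        return 4


suite = unittest.TestLoader().loadTestsFromTestCase(AsyncWrapperSketch)
# Send the runner's report to an in-memory stream to keep the output quiet.
outcome = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```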

--------------------------------------------------------------------------------
/docs/_sources/guides/archivebox.rst.txt:
--------------------------------------------------------------------------------

```
  1 | ArchiveBox MCP Setup Guide
  2 | ==========================
  3 | 
  4 | Instructions for setting up `mcp-server-webcrawl <https://pragmar.com/mcp-server-webcrawl/>`_ with `ArchiveBox <https://archivebox.io/>`_.
  5 | This allows your LLM (e.g. Claude Desktop) to search content and metadata from websites you've archived using ArchiveBox.
  6 | 
  7 | .. raw:: html
  8 | 
  9 |    <iframe width="560" height="315" src="https://www.youtube.com/embed/0KFqhSYf3f4" frameborder="0" allowfullscreen></iframe>
 10 | 
 11 | Follow along with the video, or the step-action guide below.
 12 | 
 13 | Requirements
 14 | ------------
 15 | 
 16 | Before you begin, ensure you have:
 17 | 
 18 | - `Claude Desktop <https://claude.ai/download>`_ installed
 19 | - `Python <https://python.org>`_ 3.10 or later installed
 20 | - `ArchiveBox <https://archivebox.io/>`_ installed
 21 | - Basic familiarity with command line interfaces
 22 | 
 23 | What is ArchiveBox?
 24 | -------------------
 25 | 
 26 | ArchiveBox is a powerful open-source web archiving solution that offers:
 27 | 
 28 | - Multiple output formats (HTML, PDF, screenshots, WARC, etc.)
 29 | - Comprehensive metadata
 30 | - CLI + webadmin for browsing and managing archives
 31 | - Support for various input sources (URLs, browser bookmarks, RSS feeds)
 32 | - Self-hosted solution for long-term web content preservation
 33 | 
 34 | Installation Steps
 35 | ------------------
 36 | 
 37 | 1. Install mcp-server-webcrawl
 38 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 39 | 
 40 | Open your terminal or command line and install the package::
 41 | 
 42 |     pip install mcp-server-webcrawl
 43 | 
 44 | Verify installation was successful::
 45 | 
 46 |     mcp-server-webcrawl --help
 47 | 
 48 | 2. Install and Set Up ArchiveBox
 49 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 50 | 
 51 | macOS/Linux only. Windows may work under Docker but is untested.
 52 | 
 53 | 1. Install ArchiveBox (macOS/Linux)::
 54 | 
 55 |     pip install archivebox
 56 | 
 57 | 2. On macOS only, install wget with Homebrew (``brew``)::
 58 | 
 59 |     brew install wget
 60 | 
 61 | 3. Create ArchiveBox collections. Unlike other crawlers that focus on single websites, ArchiveBox uses a collection-based approach where each collection can contain multiple URLs. You can create separate collections for different projects or group related URLs together::
 62 | 
 63 |     # Create a directory structure for your collections
 64 |     mkdir ~/archivebox-data
 65 | 
 66 |     # Create an "example" collection
 67 |     mkdir ~/archivebox-data/example
 68 |     cd ~/archivebox-data/example
 69 |     archivebox init
 70 |     archivebox add https://example.com
 71 | 
 72 |     # Create a "pragmar" collection
 73 |     mkdir ~/archivebox-data/pragmar
 74 |     cd ~/archivebox-data/pragmar
 75 |     archivebox init
 76 |     archivebox add https://pragmar.com
 77 | 
 78 | 4. Each ``archivebox init`` creates a complete ArchiveBox instance with its own database and archive directory structure. The typical structure includes::
 79 | 
 80 |     collection-name/
 81 |     ├── archive/          # Archived content organized by timestamp
 82 |     ├── logs/            # ArchiveBox operation logs
 83 |     ├── sources/         # Source URL lists and metadata
 84 |     └── index.sqlite3    # Database containing all metadata
 85 | 
 86 | 3. Configure Claude Desktop
 87 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 88 | 
 89 | 1. Open Claude Desktop
 90 | 2. Go to **File → Settings → Developer → Edit Config**
 91 | 3. Add the following configuration (modify paths as needed):
 92 | 
 93 | .. code-block:: json
 94 | 
 95 |     {
 96 |       "mcpServers": {
 97 |         "webcrawl": {
 98 |           "command": "/path/to/mcp-server-webcrawl",
 99 |           "args": ["--crawler", "archivebox", "--datasrc",
100 |             "/path/to/archivebox-data/"]
101 |         }
102 |       }
103 |     }
104 | 
105 | .. note::
106 |    - On Windows, use ``"mcp-server-webcrawl"`` as the command
107 |    - On macOS/Linux, use the absolute path (output of ``which mcp-server-webcrawl``)
108 |    - The datasrc path should point to the parent directory containing your ArchiveBox collections (e.g., ``~/archivebox-data/``), not to individual collection directories
109 |    - Each collection directory (example, pragmar, etc.) will appear as a separate "site" in MCP
110 | 
111 | 4. Save the file and **completely exit** Claude Desktop (not just close the window)
112 | 5. Restart Claude Desktop
113 | 
114 | 4. Verify and Use
115 | ~~~~~~~~~~~~~~~~~
116 | 
117 | 1. In Claude Desktop, you should now see MCP tools available under Search and Tools
118 | 2. Ask Claude to list your archived sites::
119 | 
120 |     Can you list the crawled sites available?
121 | 
122 | 3. Try searching content from your archives::
123 | 
124 |     Can you find information about [topic] on [archived site]?
125 | 
126 | 4. Use the rich metadata for content discovery::
127 | 
128 |     Can you find all the archived pages related to [keyword] from [archive]?
129 | 
130 | Troubleshooting
131 | ---------------
132 | 
133 | - If Claude doesn't show MCP tools after restart, verify your configuration file is correctly formatted
134 | - Ensure Python and mcp-server-webcrawl are properly installed
135 | - Check that your ArchiveBox archive directory path in the configuration is correct
136 | - Make sure ArchiveBox has successfully archived the websites and created the database
137 | - Verify that files exist in your archive/[timestamp] directories
138 | - Remember that the first time you use a function, Claude will ask for permission
139 | - For large archives, initial indexing may take some time during the first search
140 | 
141 | ArchiveBox's comprehensive archiving capabilities combined with mcp-server-webcrawl provide powerful tools for content preservation, research, and analysis across your archived web content.
142 | 
143 | For more details, including API documentation and other crawler options, visit the `mcp-server-webcrawl documentation <https://github.com/pragmar/mcp-server-webcrawl>`_.
```
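
Before pointing `--datasrc` at a directory, it can help to confirm that the directory layout matches the guide. The sketch below is a hypothetical pre-flight check, not how mcp-server-webcrawl discovers collections: it assumes a collection is any immediate subdirectory containing an `index.sqlite3` file, as in the directory structure shown in the guide, and `find_collections` is an illustrative name.

```python
import tempfile
from pathlib import Path


def find_collections(datasrc: Path) -> list[str]:
    """Return names of subdirectories that look like ArchiveBox collections."""
    return sorted(
        child.name
        for child in datasrc.iterdir()
        if child.is_dir() and (child / "index.sqlite3").is_file()
    )


# Demonstration against a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("example", "pragmar"):
        (root / name).mkdir()
        (root / name / "index.sqlite3").touch()
    (root / "scratch").mkdir()  # no database, so it is not counted
    found = find_collections(root)
```

An empty result for a real path would suggest the path points at a single collection rather than at the parent directory, or that `archivebox init` has not been run yet.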

--------------------------------------------------------------------------------
/docs/_modules/mcp_server_webcrawl/utils/server.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="../../../">
  5 | <head>
  6 |   <meta charset="utf-8" />
  7 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  8 |   <title>mcp_server_webcrawl.utils.server &mdash; mcp-server-webcrawl  documentation</title>
  9 |       <link rel="stylesheet" type="text/css" href="../../../_static/pygments.css?v=80d5e7a1" />
 10 |       <link rel="stylesheet" type="text/css" href="../../../_static/css/theme.css?v=e59714d7" />
 11 | 
 12 |   
 13 |       <script src="../../../_static/jquery.js?v=5d32c60e"></script>
 14 |       <script src="../../../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 15 |       <script src="../../../_static/documentation_options.js?v=5929fcd5"></script>
 16 |       <script src="../../../_static/doctools.js?v=888ff710"></script>
 17 |       <script src="../../../_static/sphinx_highlight.js?v=dc90522c"></script>
 18 |     <script src="../../../_static/js/theme.js"></script>
 19 |     <link rel="index" title="Index" href="../../../genindex.html" />
 20 |     <link rel="search" title="Search" href="../../../search.html" /> 
 21 | </head>
 22 | 
 23 | <body class="wy-body-for-nav"> 
 24 |   <div class="wy-grid-for-nav">
 25 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 26 |       <div class="wy-side-scroll">
 27 |         <div class="wy-side-nav-search" >
 28 | 
 29 |           
 30 |           
 31 |           <a href="../../../index.html" class="icon icon-home">
 32 |             mcp-server-webcrawl
 33 |           </a>
 34 | <div role="search">
 35 |   <form id="rtd-search-form" class="wy-form" action="../../../search.html" method="get">
 36 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 37 |     <input type="hidden" name="check_keywords" value="yes" />
 38 |     <input type="hidden" name="area" value="default" />
 39 |   </form>
 40 | </div>
 41 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 42 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 43 | <ul>
 44 | <li class="toctree-l1"><a class="reference internal" href="../../../installation.html">Installation</a></li>
 45 | <li class="toctree-l1"><a class="reference internal" href="../../../guides.html">Setup Guides</a></li>
 46 | <li class="toctree-l1"><a class="reference internal" href="../../../usage.html">Usage</a></li>
 47 | <li class="toctree-l1"><a class="reference internal" href="../../../modules.html">mcp_server_webcrawl</a></li>
 48 | </ul>
 49 | 
 50 |         </div>
 51 |       </div>
 52 |     </nav>
 53 | 
 54 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 55 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 56 |           <a href="../../../index.html">mcp-server-webcrawl</a>
 57 |       </nav>
 58 | 
 59 |       <div class="wy-nav-content">
 60 |         <div class="rst-content">
 61 |           <div role="navigation" aria-label="Page navigation">
 62 |   <ul class="wy-breadcrumbs">
 63 |       <li><a href="../../../index.html" class="icon icon-home" aria-label="Home"></a></li>
 64 |           <li class="breadcrumb-item"><a href="../../index.html">Module code</a></li>
 65 |           <li class="breadcrumb-item"><a href="../utils.html">mcp_server_webcrawl.utils</a></li>
 66 |       <li class="breadcrumb-item active">mcp_server_webcrawl.utils.server</li>
 67 |       <li class="wy-breadcrumbs-aside">
 68 |       </li>
 69 |   </ul>
 70 |   <hr/>
 71 | </div>
 72 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 73 |            <div itemprop="articleBody">
 74 |              
 75 |   <h1>Source code for mcp_server_webcrawl.utils.server</h1><div class="highlight"><pre>
 76 | <span></span><span class="kn">import</span> <span class="nn">os</span>
 77 | <span class="kn">import</span> <span class="nn">sys</span>
 78 | 
 79 | <div class="viewcode-block" id="initialize_mcp_server">
 80 | <a class="viewcode-back" href="../../../mcp_server_webcrawl.utils.html#mcp_server_webcrawl.utils.server.initialize_mcp_server">[docs]</a>
 81 | <span class="k">def</span> <span class="nf">initialize_mcp_server</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
 82 | <span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
 83 | <span class="sd">    MCP stdio streams require utf-8 explicitly set for Windows (default cp1252)</span>
 84 | <span class="sd">    or internationalized content will fail.</span>
 85 | <span class="sd">    &quot;&quot;&quot;</span>
 86 |     <span class="k">if</span> <span class="n">sys</span><span class="o">.</span><span class="n">platform</span> <span class="o">==</span> <span class="s2">&quot;win32&quot;</span> <span class="ow">and</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">&quot;PYTHONIOENCODING&quot;</span><span class="p">)</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
 87 |         <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">reconfigure</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s2">&quot;utf-8&quot;</span><span class="p">)</span>
 88 |         <span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">reconfigure</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s2">&quot;utf-8&quot;</span><span class="p">)</span>
 89 |         <span class="n">sys</span><span class="o">.</span><span class="n">stderr</span><span class="o">.</span><span class="n">reconfigure</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s2">&quot;utf-8&quot;</span><span class="p">)</span></div>
 90 | 
 91 | </pre></div>
 92 | 
 93 |            </div>
 94 |           </div>
 95 |           <footer>
 96 | 
 97 |   <hr/>
 98 | 
 99 |   <div role="contentinfo">
100 |     <p>&#169; Copyright 2025, pragmar.</p>
101 |   </div>
102 | 
103 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
104 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
105 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
106 |    
107 | 
108 | </footer>
109 |         </div>
110 |       </div>
111 |     </section>
112 |   </div>
113 |   <script>
114 |       jQuery(function () {
115 |           SphinxRtdTheme.Navigation.enable(true);
116 |       });
117 |   </script> 
118 | 
119 | </body>
120 | </html>
```
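
The `initialize_mcp_server` docstring above says that Windows stdio defaults to cp1252 and that internationalized content fails unless UTF-8 is set explicitly. A minimal sketch of that effect follows. It uses in-memory `io.TextIOWrapper` streams as stand-ins for `sys.stdout` (an assumption for illustration), and the helper `write_text` is hypothetical, not part of the package:

```python
import io


def write_text(stream: io.TextIOWrapper, text: str) -> bytes:
    """Write text, flush, and return the bytes held by the underlying buffer."""
    stream.write(text)
    stream.flush()
    return stream.buffer.getvalue()


SAMPLE = "日本語"

# stand-in for a stream that defaults to cp1252 (assumed here, not detected)
legacy = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
try:
    write_text(legacy, SAMPLE)
    failed = False
except UnicodeEncodeError:
    # cp1252 has no mapping for these characters
    failed = True

# the same reconfigure call the function applies to stdin, stdout, and stderr
fixed = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
fixed.reconfigure(encoding="utf-8")
encoded = write_text(fixed, SAMPLE)
```

`TextIOWrapper.reconfigure` is available from Python 3.7. The real function also skips the call when `PYTHONIOENCODING` is already set, which this sketch does not model.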

--------------------------------------------------------------------------------
/src/mcp_server_webcrawl/templates/tests_core.html:
--------------------------------------------------------------------------------

```html
  1 | <!DOCTYPE html>
  2 | <html lang="en">
  3 | <head>
  4 |     <meta charset="utf-8">
  5 |     <title>HTML Basic Page</title>
  6 |     <!--
  7 |     tests depend on this file being stable; don't update it without also planning to update tests.py
  8 |     this file is transformed to markdown to test transform integrity
  9 |     -->
 10 |     <style>
 11 |         body { font-family: Georgia, serif; line-height: 1.6; margin: 2em; background: #fafafa; }
 12 |         h1 { color: #333; border-bottom: 2px solid #666; }
 13 |         h2 { color: #555; margin-top: 2em; }
 14 |         h3 { color: #777; }
 15 |         h4, h5, h6 { color: #888; }
 16 |         p { margin-bottom: 1em; }
 17 |         a { color: #0066cc; text-decoration: underline; }
 18 |         a:hover { color: #004499; }
 19 |         em { font-style: italic; color: #666; }
 20 |         strong, b { font-weight: bold; }
 21 |         i { font-style: italic; }
 22 |         ul, ol { margin: 1em 0; padding-left: 2em; }
 23 |         li { margin-bottom: 0.5em; }
 24 |         table { border-collapse: collapse; width: 100%; margin: 1em 0; }
 25 |         th, td { border: 1px solid #ccc; padding: 0.5em; text-align: left; }
 26 |         th { background: #f0f0f0; font-weight: bold; }
 27 |         blockquote { margin: 1em 2em; padding-left: 1em; border-left: 3px solid #ccc; font-style: italic; }
 28 |         code { background: #f5f5f5; padding: 0.2em 0.4em; font-family: monospace; }
 29 |         pre { background: #f5f5f5; padding: 1em; overflow-x: auto; }
 30 |         hr { border: none; border-top: 1px solid #ccc; margin: 2em 0; }
 31 |         dl { margin: 1em 0; }
 32 |         dt { font-weight: bold; margin-top: 0.5em; }
 33 |         dd { margin-left: 2em; margin-bottom: 0.5em; }
 34 |     </style>
 35 | </head>
 36 | <body>
 37 |     <h1>Lorem Ipsum Dolor Sit Amet</h1>
 38 |     <p>Lorem ipsum dolor sit amet, <strong>consectetur adipiscing elit</strong>. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, <a href="#nowhere">quis nostrud exercitation</a> ullamco laboris nisi ut aliquip ex ea commodo consequat. <em>Duis aute irure dolor</em> in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.</p>
 39 | 
 40 |     <h2>Consectetur Adipiscing Elit</h2>
 41 |     <p>Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. <b>Sed ut perspiciatis</b> unde omnis iste natus error sit voluptatem accusantium doloremque laudantium. <i>Totam rem aperiam</i>, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo.</p>
 42 | 
 43 |     <h3>Nemo Enim Ipsam Voluptatem</h3>
 44 |     <p>Quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.</p>
 45 | 
 46 |     <h4>Sed Quia Non Numquam</h4>
 47 |     <p>Eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam.</p>
 48 | 
 49 |     <h5>Nisi Ut Aliquid Ex Ea</h5>
 50 |     <p>Commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?</p>
 51 | 
 52 |     <h6>At Vero Eos Et Accusamus</h6>
 53 |     <p>Et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident.</p>
 54 | 
 55 |     <hr>
 56 | 
 57 |     <h2>Unordered List Example</h2>
 58 |     <ul>
 59 |         <li>Similique sunt in culpa qui officia deserunt</li>
 60 |         <li>Mollitia animi, id est laborum et dolorum fuga</li>
 61 |         <li>Et harum quidem rerum facilis est et expedita distinctio</li>
 62 |         <li>Nam libero tempore, cum soluta nobis est eligendi optio</li>
 63 |         <li>Cumque nihil impedit quo minus id quod maxime</li>
 64 |     </ul>
 65 | 
 66 |     <h2>Ordered List Example</h2>
 67 |     <ol>
 68 |         <li>Temporibus autem quibusdam et aut officiis debitis</li>
 69 |         <li>Aut reiciendis voluptatibus maiores alias consequatur</li>
 70 |         <li>Aut perferendis doloribus asperiores repellat</li>
 71 |         <li>Itaque earum rerum hic tenetur a sapiente delectus</li>
 72 |         <li>Ut aut reiciendis voluptatibus maiores alias</li>
 73 |     </ol>
 74 | 
 75 |     <h2>Definition List Example</h2>
 76 |     <dl>
 77 |         <dt>Lorem Ipsum</dt>
 78 |         <dd>Dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</dd>
 79 | 
 80 |         <dt>Ut Enim</dt>
 81 |         <dd>Ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</dd>
 82 | 
 83 |         <dt>Duis Aute</dt>
 84 |         <dd>Irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.</dd>
 85 |     </dl>
 86 | 
 87 |     <h2>Table Example</h2>
 88 |     <table>
 89 |         <thead>
 90 |             <tr>
 91 |                 <th>Lorem</th>
 92 |                 <th>Ipsum</th>
 93 |                 <th>Dolor</th>
 94 |                 <th>Sit</th>
 95 |             </tr>
 96 |         </thead>
 97 |         <tbody>
 98 |             <tr>
 99 |                 <td>Consectetur</td>
100 |                 <td>Adipiscing</td>
101 |                 <td>Elit</td>
102 |                 <td>Sed</td>
103 |             </tr>
104 |             <tr>
105 |                 <td>Eiusmod</td>
106 |                 <td>Tempor</td>
107 |                 <td>Incididunt</td>
108 |                 <td>Labore</td>
109 |             </tr>
110 |             <tr>
111 |                 <td>Dolore</td>
112 |                 <td>Magna</td>
113 |                 <td>Aliqua</td>
114 |                 <td>Enim</td>
115 |             </tr>
116 |             <tr>
117 |                 <td>Minim</td>
118 |                 <td>Veniam</td>
119 |                 <td>Quis</td>
120 |                 <td>Nostrud</td>
121 |             </tr>
122 |         </tbody>
123 |     </table>
124 | 
125 |     <h2>More Text Elements</h2>
126 |     <p>Here we have some <code>inline code</code> and a longer code block below:</p>
127 | 
128 |     <pre><code>function lorem() {
129 |     return "ipsum dolor sit amet";
130 | }</code></pre>
131 | 
132 |     <blockquote>
133 |         <p>"Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis."</p>
134 |     </blockquote>
135 | 
136 |     <p>Final paragraph with mixed formatting: <strong>bold text</strong>, <em>emphasized text</em>, <i>italic text</i>, <b>more bold</b>, and a <a href="#top">link back to top</a>. Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus.</p>
137 | </body>
138 | </html>
```

--------------------------------------------------------------------------------
/src/mcp_server_webcrawl/interactive/highlights.py:
--------------------------------------------------------------------------------

```python
  1 | import re
  2 | import curses
  3 | 
  4 | from dataclasses import dataclass
  5 | from typing import List
  6 | 
  7 | from mcp_server_webcrawl.interactive.ui import safe_addstr
  8 | 
  9 | @dataclass
 10 | class HighlightSpan:
 11 |     """
 12 |     Represents a highlight span in text
 13 |     """
 14 |     start: int
 15 |     end: int
 16 |     text: str
 17 | 
 18 |     def __str__(self) -> str:
 19 |         return f"[{self.start}:{self.end} '{self.text}']"
 20 | 
 21 | 
 22 | class HighlightProcessor:
 23 |     """
 24 |     Shared highlight processing utilities
 25 |     """
 26 | 
 27 |     QUOTED_PHRASE_PATTERN = re.compile(r'"([^"]+)"')
 28 |     WORD_PATTERN = re.compile(r"\b\w+\b")
 29 |     SNIPPET_MARKER_PATTERN = re.compile(r"\*\*([a-zA-Z\-_' ]+)\*\*")
 30 |     IGNORE_WORDS = {"AND", "OR", "NOT", "and", "or", "not", "type", "status", "size", "url", "id"}
 31 | 
 32 |     @staticmethod
 33 |     def extract_search_terms(query: str) -> List[str]:
 34 |         """
 35 |         Extract search terms from query, handling quoted phrases and individual keywords.
 36 |         """
 37 |         if not query or not query.strip():
 38 |             return []
 39 | 
 40 |         search_terms = []
 41 |         for match in HighlightProcessor.QUOTED_PHRASE_PATTERN.finditer(query):
 42 |             phrase = match.group(1).strip()
 43 |             if phrase:
 44 |                 search_terms.append(phrase)
 45 | 
 46 |         remaining_query = HighlightProcessor.QUOTED_PHRASE_PATTERN.sub('', query)
 47 | 
 48 |         # extract individual words
 49 |         for match in HighlightProcessor.WORD_PATTERN.finditer(remaining_query):
 50 |             word = match.group().strip()
 51 |             if word and word not in HighlightProcessor.IGNORE_WORDS and len(word) > 2:
 52 |                 search_terms.append(word)
 53 | 
 54 |         return search_terms
 55 | 
 56 |     @staticmethod
 57 |     def find_highlights_in_text(text: str, search_terms: List[str]) -> List[HighlightSpan]:
 58 |         """
 59 |         Find all highlight spans in text for the given search terms.
 60 |         """
 61 |         if not text or not search_terms:
 62 |             return []
 63 | 
 64 |         highlights = []
 65 |         escaped_terms = [re.escape(term.strip("\"'")) for term in search_terms]
 66 |         pattern = re.compile(rf"\b({'|'.join(escaped_terms)})\b", re.IGNORECASE)
 67 | 
 68 |         for match in pattern.finditer(text):
 69 |             span = HighlightSpan(
 70 |                 start=match.start(),
 71 |                 end=match.end(),
 72 |                 text=match.group()
 73 |             )
 74 |             highlights.append(span)
 75 | 
 76 |         return HighlightProcessor.merge_overlapping_highlights(highlights, text)
 77 | 
 78 |     @staticmethod
 79 |     def extract_snippet_highlights(snippet_text: str) -> tuple[str, List[HighlightSpan]]:
 80 |         """
 81 |         Extract highlights from snippet text with **markers**, returning clean text and highlights.
 82 |         """
 83 |         if not snippet_text:
 84 |             return "", []
 85 | 
 86 |         normalized_text = re.sub(r"\s+", " ", snippet_text.strip())
 87 | 
 88 |         clean_text = ""
 89 |         highlights = []
 90 |         last_end = 0
 91 | 
 92 |         for match in HighlightProcessor.SNIPPET_MARKER_PATTERN.finditer(normalized_text):
 93 |             # text before this match
 94 |             clean_text += normalized_text[last_end:match.start()]
 95 | 
 96 |             # highlighted text (without markers)
 97 |             highlight_text = match.group(1)
 98 |             highlight_start = len(clean_text)
 99 |             clean_text += highlight_text
100 |             highlight_end = len(clean_text)
101 | 
102 |             span: HighlightSpan = HighlightSpan(
103 |                 start=highlight_start,
104 |                 end=highlight_end,
105 |                 text=highlight_text
106 |             )
107 |             highlights.append(span)
108 |             last_end = match.end()
109 | 
110 |         # remaining text
111 |         clean_text += normalized_text[last_end:]
112 | 
113 |         return clean_text.strip(), highlights
114 | 
115 |     @staticmethod
116 |     def merge_overlapping_highlights(highlights: List[HighlightSpan], text: str) -> List[HighlightSpan]:
117 |         """Merge overlapping or adjacent highlight spans."""
118 |         if not highlights:
119 |             return []
120 | 
121 |         # sort by start position
122 |         sorted_highlights = sorted(highlights, key=lambda h: h.start)
123 |         merged = []
124 | 
125 |         for highlight in sorted_highlights:
126 |             if not merged:
127 |                 merged.append(highlight)
128 |             else:
129 |                 last = merged[-1]
130 |                 if highlight.start <= last.end:
131 |                     # overlapping/adjacent - merge them
132 |                     end = max(last.end, highlight.end)
133 |                     merged_text = text[last.start:end]
134 |                     merged[-1] = HighlightSpan(
135 |                         start=last.start,
136 |                         end=end,
137 |                         text=merged_text
138 |                     )
139 |                 else:
140 |                     merged.append(highlight)
141 | 
142 |         return merged
143 | 
144 |     @staticmethod
145 |     def render_text_with_highlights(
146 |         stdscr: curses.window,
147 |         text: str,
148 |         highlights: List[HighlightSpan],
149 |         x: int,
150 |         y: int,
151 |         max_width: int,
152 |         normal_style: int,
153 |         hit_style: int
154 |     ) -> None:
155 |         """
156 |         Render text with highlights applied.
157 |         """
158 |         if not text.strip():
159 |             return
160 | 
161 |         display_text: str = text[:max_width] if len(text) > max_width else text
162 |         visible_highlights: List[HighlightSpan] = [h for h in highlights if h.start < len(display_text)]
163 |         current_x: int = x
164 |         pos: int = 0
165 | 
166 |         try:
167 |             for highlight in visible_highlights:
168 |                 # text before highlight
169 |                 if highlight.start > pos:
170 |                     text_before: str = display_text[pos:highlight.start]
171 |                     safe_addstr(stdscr, y, current_x, text_before, normal_style)
172 |                     current_x += len(text_before)
173 |                     pos = highlight.start
174 | 
175 |                 # highlighted text
176 |                 highlight_end: int = min(highlight.end, len(display_text))
177 |                 highlighted_text: str = display_text[highlight.start:highlight_end]
178 |                 if current_x + len(highlighted_text) <= x + max_width:
179 |                     safe_addstr(stdscr, y, current_x, highlighted_text, hit_style)
180 |                     current_x += len(highlighted_text)
181 |                 pos = highlight_end
182 | 
183 |             # remaining text
184 |             if pos < len(display_text):
185 |                 remaining_text: str = display_text[pos:]
186 |                 remaining_width: int = max_width - (current_x - x)
187 |                 if remaining_width > 0:
188 |                     safe_addstr(stdscr, y, current_x, remaining_text[:remaining_width], normal_style)
189 | 
190 |         except curses.error:
191 |             pass
192 | 
```
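
As a rough illustration of how term extraction, matching, and merging in `HighlightProcessor` fit together, the sketch below re-implements the same steps with plain `(start, end)` tuples instead of `HighlightSpan`, and without curses. The function names `extract_terms`, `find_spans`, and `merge_spans` are hypothetical stand-ins, not part of the package:

```python
import re

QUOTED = re.compile(r'"([^"]+)"')
WORD = re.compile(r"\b\w+\b")
IGNORE = {"AND", "OR", "NOT", "and", "or", "not", "type", "status", "size", "url", "id"}


def extract_terms(query: str) -> list[str]:
    """Quoted phrases first, then bare words longer than two characters."""
    terms = [m.group(1).strip() for m in QUOTED.finditer(query) if m.group(1).strip()]
    rest = QUOTED.sub("", query)
    terms += [w for w in WORD.findall(rest) if w not in IGNORE and len(w) > 2]
    return terms


def find_spans(text: str, terms: list[str]) -> list[tuple[int, int]]:
    """Case-insensitive whole-word matches as (start, end) offsets."""
    if not text or not terms:
        return []
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


def merge_spans(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge spans that overlap or touch, after sorting by start offset."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


terms = extract_terms('"lorem ipsum" AND dolor type: html')
spans = find_spans("Lorem ipsum dolor sit amet", terms)
```

Boolean operators and field names are dropped only from the unquoted remainder of the query, so a quoted phrase keeps every word it contains. Spans separated by at least one character, such as the two matches in the sample text, stay separate after merging.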

--------------------------------------------------------------------------------
/docs/guides.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="./">
  5 | <head>
  6 |   <meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
  7 | 
  8 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  9 |   <title>Setup Guides &mdash; mcp-server-webcrawl  documentation</title>
 10 |       <link rel="stylesheet" type="text/css" href="_static/pygments.css?v=80d5e7a1" />
 11 |       <link rel="stylesheet" type="text/css" href="_static/css/theme.css?v=e59714d7" />
 12 | 
 13 |   
 14 |       <script src="_static/jquery.js?v=5d32c60e"></script>
 15 |       <script src="_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 16 |       <script src="_static/documentation_options.js?v=5929fcd5"></script>
 17 |       <script src="_static/doctools.js?v=888ff710"></script>
 18 |       <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 19 |     <script src="_static/js/theme.js"></script>
 20 |     <link rel="index" title="Index" href="genindex.html" />
 21 |     <link rel="search" title="Search" href="search.html" />
 22 |     <link rel="next" title="ArchiveBox MCP Setup Guide" href="guides/archivebox.html" />
 23 |     <link rel="prev" title="Installation" href="installation.html" /> 
 24 | </head>
 25 | 
 26 | <body class="wy-body-for-nav"> 
 27 |   <div class="wy-grid-for-nav">
 28 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 29 |       <div class="wy-side-scroll">
 30 |         <div class="wy-side-nav-search" >
 31 | 
 32 |           
 33 |           
 34 |           <a href="index.html" class="icon icon-home">
 35 |             mcp-server-webcrawl
 36 |           </a>
 37 | <div role="search">
 38 |   <form id="rtd-search-form" class="wy-form" action="search.html" method="get">
 39 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 40 |     <input type="hidden" name="check_keywords" value="yes" />
 41 |     <input type="hidden" name="area" value="default" />
 42 |   </form>
 43 | </div>
 44 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 45 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 46 | <ul class="current">
 47 | <li class="toctree-l1"><a class="reference internal" href="installation.html">Installation</a></li>
 48 | <li class="toctree-l1 current"><a class="current reference internal" href="#">Setup Guides</a><ul>
 49 | <li class="toctree-l2"><a class="reference internal" href="guides/archivebox.html">ArchiveBox MCP Setup Guide</a></li>
 50 | <li class="toctree-l2"><a class="reference internal" href="guides/httrack.html">HTTrack MCP Setup Guide</a></li>
 51 | <li class="toctree-l2"><a class="reference internal" href="guides/interrobot.html">InterroBot MCP Setup Guide</a></li>
 52 | <li class="toctree-l2"><a class="reference internal" href="guides/katana.html">Katana MCP Setup Guide</a></li>
 53 | <li class="toctree-l2"><a class="reference internal" href="guides/siteone.html">SiteOne MCP Setup Guide</a></li>
 54 | <li class="toctree-l2"><a class="reference internal" href="guides/warc.html">WARC MCP Setup Guide</a></li>
 55 | <li class="toctree-l2"><a class="reference internal" href="guides/wget.html">wget MCP Setup Guide</a></li>
 56 | </ul>
 57 | </li>
 58 | <li class="toctree-l1"><a class="reference internal" href="usage.html">Usage</a></li>
 59 | <li class="toctree-l1"><a class="reference internal" href="prompts.html">Prompt Routines</a></li>
 60 | <li class="toctree-l1"><a class="reference internal" href="modules.html">mcp_server_webcrawl</a></li>
 61 | </ul>
 62 | 
 63 |         </div>
 64 |       </div>
 65 |     </nav>
 66 | 
 67 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 68 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 69 |           <a href="index.html">mcp-server-webcrawl</a>
 70 |       </nav>
 71 | 
 72 |       <div class="wy-nav-content">
 73 |         <div class="rst-content">
 74 |           <div role="navigation" aria-label="Page navigation">
 75 |   <ul class="wy-breadcrumbs">
 76 |       <li><a href="index.html" class="icon icon-home" aria-label="Home"></a></li>
 77 |       <li class="breadcrumb-item active">Setup Guides</li>
 78 |       <li class="wy-breadcrumbs-aside">
 79 |             <a href="_sources/guides.rst.txt" rel="nofollow"> View page source</a>
 80 |       </li>
 81 |   </ul>
 82 |   <hr/>
 83 | </div>
 84 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 85 |            <div itemprop="articleBody">
 86 |              
 87 |   <section id="setup-guides">
 88 | <h1>Setup Guides<a class="headerlink" href="#setup-guides" title="Link to this heading"></a></h1>
 89 | <p>This section contains detailed setup guides for mcp-server-webcrawl in various environments and configurations.</p>
 90 | <div class="toctree-wrapper compound">
 91 | <p class="caption" role="heading"><span class="caption-text">Available Guides:</span></p>
 92 | <ul>
 93 | <li class="toctree-l1"><a class="reference internal" href="guides/archivebox.html">ArchiveBox MCP Setup Guide</a></li>
 94 | <li class="toctree-l1"><a class="reference internal" href="guides/httrack.html">HTTrack MCP Setup Guide</a></li>
 95 | <li class="toctree-l1"><a class="reference internal" href="guides/interrobot.html">InterroBot MCP Setup Guide</a></li>
 96 | <li class="toctree-l1"><a class="reference internal" href="guides/katana.html">Katana MCP Setup Guide</a></li>
 97 | <li class="toctree-l1"><a class="reference internal" href="guides/siteone.html">SiteOne MCP Setup Guide</a></li>
 98 | <li class="toctree-l1"><a class="reference internal" href="guides/warc.html">WARC MCP Setup Guide</a></li>
 99 | <li class="toctree-l1"><a class="reference internal" href="guides/wget.html">wget MCP Setup Guide</a></li>
100 | </ul>
101 | </div>
102 | </section>
103 | 
104 | 
105 |            </div>
106 |           </div>
107 |           <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
108 |         <a href="installation.html" class="btn btn-neutral float-left" title="Installation" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
109 |         <a href="guides/archivebox.html" class="btn btn-neutral float-right" title="ArchiveBox MCP Setup Guide" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
110 |     </div>
111 | 
112 |   <hr/>
113 | 
114 |   <div role="contentinfo">
115 |     <p>&#169; Copyright 2025, pragmar.</p>
116 |   </div>
117 | 
118 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
119 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
120 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
121 |    
122 | 
123 | </footer>
124 |         </div>
125 |       </div>
126 |     </section>
127 |   </div>
128 |   <script>
129 |       jQuery(function () {
130 |           SphinxRtdTheme.Navigation.enable(true);
131 |       });
132 |   </script> 
133 | 
134 | </body>
135 | </html>
```

--------------------------------------------------------------------------------
/src/mcp_server_webcrawl/models/resources.py:
--------------------------------------------------------------------------------

```python
  1 | from enum import Enum
  2 | from typing import Final
  3 | from datetime import datetime
  4 | 
  5 | from mcp_server_webcrawl.models.base import BaseModel, METADATA_VALUE_TYPE
  6 | from mcp_server_webcrawl.utils import to_isoformat_zulu
  7 | 
  8 | RESOURCES_TOOL_NAME: Final[str] = "webcrawl_search"
  9 | RESOURCE_EXTRAS_ALLOWED: Final[set[str]] = {"markdown", "snippets", "regex", "thumbnails", "xpath"}
 10 | RESOURCES_LIMIT_DEFAULT: Final[int] = 20
 11 | RESOURCES_LIMIT_MAX: Final[int] = 100
 12 | 
 13 | RESOURCES_FIELDS_BASE: Final[list[str]] = ["id", "url", "site", "type", "status"]
 14 | RESOURCES_FIELDS_DEFAULT: Final[list[str]] = RESOURCES_FIELDS_BASE + ["created", "modified"]
 15 | RESOURCES_FIELDS_OPTIONS: Final[list[str]] = ["created", "modified", "size", "headers", "content"]
 16 | 
 17 | RESOURCES_DEFAULT_FIELD_MAPPING: Final[dict[str, str]] = {
 18 |     "id": "ResourcesFullText.Id",
 19 |     "site": "ResourcesFullText.Project",
 20 |     "created": "Resources.Created",
 21 |     "modified": "Resources.Modified",
 22 |     "url": "ResourcesFullText.Url",
 23 |     "status": "Resources.Status",
 24 |     "size": "Resources.Size",
 25 |     "type": "ResourcesFullText.Type",
 26 |     "headers": "ResourcesFullText.Headers",
 27 |     "content": "ResourcesFullText.Content",
 28 |     "time": "Resources.Time",
 29 |     "fulltext": "ResourcesFullText",
 30 | }
 31 | RESOURCES_DEFAULT_SORT_MAPPING: Final[dict[str, tuple[str, str]]] = {
 32 |     "+id": ("Resources.Id", "ASC"),
 33 |     "-id": ("Resources.Id", "DESC"),
 34 |     "+url": ("ResourcesFullText.Url", "ASC"),
 35 |     "-url": ("ResourcesFullText.Url", "DESC"),
 36 |     "+status": ("Resources.Status", "ASC"),
 37 |     "-status": ("Resources.Status", "DESC"),
 38 |     "+size": ("Resources.Size", "ASC"),
 39 |     "-size": ("Resources.Size", "DESC"),
 40 |     "?": ("Resources.Id", "RANDOM")
 41 | }
 42 | 
 43 | class ResourceResultType(Enum):
 44 |     """
 45 |     Enum representing different types of web resources.
 46 |     """
 47 |     UNDEFINED = ""
 48 |     PAGE = "html"
 49 |     FRAME = "iframe"
 50 |     IMAGE = "img"
 51 |     AUDIO = "audio"
 52 |     VIDEO = "video"
 53 |     FONT = "font"
 54 |     CSS = "style"
 55 |     SCRIPT = "script"
 56 |     FEED = "rss"
 57 |     TEXT = "text"
 58 |     PDF = "pdf"
 59 |     DOC = "doc"
 60 |     OTHER = "other"
 61 | 
 62 |     @classmethod
 63 |     def values(cls) -> list[str]:
 64 |         """
 65 |         Return all values of the enum as a list.
 66 |         """
 67 |         return [member.value for member in cls]
 68 | 
 69 |     @classmethod
 70 |     def to_int_map(cls) -> dict[str, int]:
 71 |         """
 72 |         Return a dictionary mapping each enum value to its integer position.
 73 | 
 74 |         Returns:
 75 |             dict: a dictionary with enum values as keys and their ordinal positions as values.
 76 |         """
 77 |         return {member.value: i for i, member in enumerate(cls)}
 78 | 
 79 | # if types stored as ints within db
 80 | RESOURCES_ENUMERATED_TYPE_MAPPING: Final[dict[int, ResourceResultType]] = {
 81 |     0: ResourceResultType.UNDEFINED,
 82 |     1: ResourceResultType.PAGE,
 83 |     2: ResourceResultType.OTHER,
 84 |     3: ResourceResultType.FEED,
 85 |     4: ResourceResultType.FRAME,
 86 |     5: ResourceResultType.OTHER,
 87 |     6: ResourceResultType.IMAGE,
 88 |     7: ResourceResultType.AUDIO,
 89 |     8: ResourceResultType.VIDEO,
 90 |     9: ResourceResultType.FONT,
 91 |     10: ResourceResultType.CSS,
 92 |     11: ResourceResultType.SCRIPT,
 93 |     12: ResourceResultType.OTHER,
 94 |     13: ResourceResultType.TEXT,
 95 |     14: ResourceResultType.PDF,
 96 |     15: ResourceResultType.DOC
 97 | }
 98 | 
 99 | class ResourceResult(BaseModel):
100 |     """
101 |     Represents a web resource result from a crawl operation.
102 |     """
103 |     def __init__(
104 |         self,
105 |         id: int,
106 |         url: str,
107 |         site: int | None = None,
108 |         crawl: int | None = None,
109 |         type: ResourceResultType = ResourceResultType.UNDEFINED,
110 |         name: str | None = None,
111 |         headers: str | None = None,
112 |         content: str | None = None,
113 |         created: datetime | None = None,
114 |         modified: datetime | None = None,
115 |         status: int | None = None,
116 |         size: int | None = None,
117 |         time: int | None = None,
118 |         metadata: dict[str, METADATA_VALUE_TYPE] | None = None,
119 |     ):
120 |         """
121 |         Initialize a ResourceResult instance.
122 | 
123 |         Args:
124 |             id: resource identifier
125 |             url: resource URL
126 |             site: site identifier the resource belongs to
127 |             crawl: crawl identifier the resource was found in
128 |             type: type of resource
129 |             name: resource name
130 |             headers: HTTP headers
131 |             content: resource content
132 |             created: creation timestamp
133 |             modified: last modification timestamp
134 |             status: HTTP status code
135 |             size: size in bytes
136 |             time: response time in milliseconds
138 |             metadata: additional metadata for the resource
139 |         """
140 |         self.id = id
141 |         self.url = url
142 |         self.site = site
143 |         self.crawl = crawl
144 |         self.type = type
145 |         self.name = name
146 |         self.headers = headers
147 |         self.content = content
148 |         self.created = created
149 |         self.modified = modified
150 |         self.status = status
151 |         self.size = size  # in bytes
152 |         self.time = time  # in millis
153 |         self.metadata = metadata  # reserved
154 | 
155 |         # set externally
156 |         self.__extras: dict[str, str] = {}
157 | 
158 |     def to_dict(self) -> dict[str, METADATA_VALUE_TYPE]:
159 |         """
160 |         Convert the object to a dictionary suitable for JSON serialization.
161 |         """
162 |         result: dict[str, METADATA_VALUE_TYPE] = {
163 |             "id": self.id,
164 |             "url": self.url,
165 |             "site": self.site,
166 |             "crawl": self.crawl,
167 |             "type": self.type.value if self.type else None,
168 |             "name": self.name,
169 |             "headers": self.headers,
170 |             "content": self.content,
171 |             "created": to_isoformat_zulu(self.created) if self.created else None,
172 |             "modified": to_isoformat_zulu(self.modified) if self.modified else None,
173 |             "status": self.status,
174 |             "size": self.size,
175 |             "time": self.time,
176 |             "metadata": self.metadata  # reserved
177 |         }
178 |         if self.__extras:
179 |             result["extras"] = {k: v for k, v in self.__extras.items()}
180 | 
181 |         return {k: v for k, v in result.items() if v is not None and not (k == "metadata" and v == {})}
182 | 
183 |     def set_extra(self, extra_name: str, extra_value: str | None | list[str] | list[dict[str, str | int | float]]) -> None:
184 |         assert extra_name in RESOURCE_EXTRAS_ALLOWED, f"Unexpected extra requested. {extra_name}"
185 |         self.__extras[extra_name] = extra_value
186 | 
187 |     def get_extra(self, extra_name: str) -> str | None | list[str] | list[dict[str, str | int | float]]:
188 |         assert extra_name in RESOURCE_EXTRAS_ALLOWED, f"Unexpected extra requested. {extra_name}"
189 |         if extra_name in self.__extras:
190 |             return self.__extras[extra_name]
191 |         else:
192 |             return None
193 | 
```
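
The filtering step at the end of `to_dict()` above can be illustrated in isolation. This is a minimal sketch, not the project's code: `prune_result` is a hypothetical helper name, and the sample values are made up. It reproduces only the final dictionary comprehension, which drops `None` values and an empty `metadata` dict while keeping falsy but meaningful values such as a size of `0`.

```python
from typing import Any


def prune_result(result: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for the last line of to_dict():
    drop None values and an empty "metadata" dict."""
    return {
        k: v
        for k, v in result.items()
        if v is not None and not (k == "metadata" and v == {})
    }


# Illustrative values only; they do not come from a real crawl.
raw = {
    "id": 1,
    "url": "https://example.com/",
    "name": None,    # dropped: value is None
    "status": 200,
    "size": 0,       # kept: 0 is falsy but not None
    "metadata": {},  # dropped: empty reserved field
}
pruned = prune_result(raw)
print(pruned)
```

Because the check is `is not None` rather than a truthiness test, zero-valued fields survive serialization, which matters for fields like `size` and `time`.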

--------------------------------------------------------------------------------
/docs/_modules/mcp_server_webcrawl/crawlers/katana/crawler.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="../../../../">
  5 | <head>
  6 |   <meta charset="utf-8" />
  7 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  8 |   <title>mcp_server_webcrawl.crawlers.katana.crawler &mdash; mcp-server-webcrawl  documentation</title>
  9 |       <link rel="stylesheet" type="text/css" href="../../../../_static/pygments.css?v=80d5e7a1" />
 10 |       <link rel="stylesheet" type="text/css" href="../../../../_static/css/theme.css?v=e59714d7" />
 11 | 
 12 |   
 13 |       <script src="../../../../_static/jquery.js?v=5d32c60e"></script>
 14 |       <script src="../../../../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 15 |       <script src="../../../../_static/documentation_options.js?v=5929fcd5"></script>
 16 |       <script src="../../../../_static/doctools.js?v=888ff710"></script>
 17 |       <script src="../../../../_static/sphinx_highlight.js?v=dc90522c"></script>
 18 |     <script src="../../../../_static/js/theme.js"></script>
 19 |     <link rel="index" title="Index" href="../../../../genindex.html" />
 20 |     <link rel="search" title="Search" href="../../../../search.html" /> 
 21 | </head>
 22 | 
 23 | <body class="wy-body-for-nav"> 
 24 |   <div class="wy-grid-for-nav">
 25 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 26 |       <div class="wy-side-scroll">
 27 |         <div class="wy-side-nav-search" >
 28 | 
 29 |           
 30 |           
 31 |           <a href="../../../../index.html" class="icon icon-home">
 32 |             mcp-server-webcrawl
 33 |           </a>
 34 | <div role="search">
 35 |   <form id="rtd-search-form" class="wy-form" action="../../../../search.html" method="get">
 36 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 37 |     <input type="hidden" name="check_keywords" value="yes" />
 38 |     <input type="hidden" name="area" value="default" />
 39 |   </form>
 40 | </div>
 41 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 42 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 43 | <ul>
 44 | <li class="toctree-l1"><a class="reference internal" href="../../../../installation.html">Installation</a></li>
 45 | <li class="toctree-l1"><a class="reference internal" href="../../../../guides.html">Setup Guides</a></li>
 46 | <li class="toctree-l1"><a class="reference internal" href="../../../../usage.html">Usage</a></li>
 47 | <li class="toctree-l1"><a class="reference internal" href="../../../../modules.html">mcp_server_webcrawl</a></li>
 48 | </ul>
 49 | 
 50 |         </div>
 51 |       </div>
 52 |     </nav>
 53 | 
 54 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 55 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 56 |           <a href="../../../../index.html">mcp-server-webcrawl</a>
 57 |       </nav>
 58 | 
 59 |       <div class="wy-nav-content">
 60 |         <div class="rst-content">
 61 |           <div role="navigation" aria-label="Page navigation">
 62 |   <ul class="wy-breadcrumbs">
 63 |       <li><a href="../../../../index.html" class="icon icon-home" aria-label="Home"></a></li>
 64 |           <li class="breadcrumb-item"><a href="../../../index.html">Module code</a></li>
 65 |           <li class="breadcrumb-item"><a href="../../crawlers.html">mcp_server_webcrawl.crawlers</a></li>
 66 |       <li class="breadcrumb-item active">mcp_server_webcrawl.crawlers.katana.crawler</li>
 67 |       <li class="wy-breadcrumbs-aside">
 68 |       </li>
 69 |   </ul>
 70 |   <hr/>
 71 | </div>
 72 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 73 |            <div itemprop="articleBody">
 74 |              
 75 |   <h1>Source code for mcp_server_webcrawl.crawlers.katana.crawler</h1><div class="highlight"><pre>
 76 | <span></span><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
 77 | 
 78 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.crawlers.base.indexed</span> <span class="kn">import</span> <span class="n">IndexedCrawler</span>
 79 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.crawlers.katana.adapter</span> <span class="kn">import</span> <span class="n">get_sites</span><span class="p">,</span> <span class="n">get_resources</span>
 80 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.utils.logger</span> <span class="kn">import</span> <span class="n">get_logger</span>
 81 | 
 82 | <span class="n">logger</span> <span class="o">=</span> <span class="n">get_logger</span><span class="p">()</span>
 83 | 
 84 | <div class="viewcode-block" id="KatanaCrawler">
 85 | <a class="viewcode-back" href="../../../../mcp_server_webcrawl.crawlers.katana.html#mcp_server_webcrawl.crawlers.katana.crawler.KatanaCrawler">[docs]</a>
 86 | <span class="k">class</span> <span class="nc">KatanaCrawler</span><span class="p">(</span><span class="n">IndexedCrawler</span><span class="p">):</span>
 87 | <span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
 88 | <span class="sd">    A crawler implementation for HTTP text files.</span>
 89 | <span class="sd">    Provides functionality for accessing and searching web content from captured HTTP exchanges.</span>
 90 | <span class="sd">    &quot;&quot;&quot;</span>
 91 | 
 92 | <div class="viewcode-block" id="KatanaCrawler.__init__">
 93 | <a class="viewcode-back" href="../../../../mcp_server_webcrawl.crawlers.katana.html#mcp_server_webcrawl.crawlers.katana.crawler.KatanaCrawler.__init__">[docs]</a>
 94 |     <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">datasrc</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
 95 | <span class="w">        </span><span class="sd">&quot;&quot;&quot;</span>
 96 | <span class="sd">        Initialize the HTTP text crawler with a data source directory.</span>
 97 | 
 98 | <span class="sd">        Args:</span>
 99 | <span class="sd">            datasrc: The input argument as Path, it must be a directory containing</span>
100 | <span class="sd">                subdirectories with HTTP text files</span>
101 | <span class="sd">        &quot;&quot;&quot;</span>
102 |         <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">datasrc</span><span class="p">,</span> <span class="n">get_sites</span><span class="p">,</span> <span class="n">get_resources</span><span class="p">)</span></div>
103 | </div>
104 | 
105 | </pre></div>
106 | 
107 |            </div>
108 |           </div>
109 |           <footer>
110 | 
111 |   <hr/>
112 | 
113 |   <div role="contentinfo">
114 |     <p>&#169; Copyright 2025, pragmar.</p>
115 |   </div>
116 | 
117 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
118 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
119 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
120 |    
121 | 
122 | </footer>
123 |         </div>
124 |       </div>
125 |     </section>
126 |   </div>
127 |   <script>
128 |       jQuery(function () {
129 |           SphinxRtdTheme.Navigation.enable(true);
130 |       });
131 |   </script> 
132 | 
133 | </body>
134 | </html>
```

--------------------------------------------------------------------------------
/docs/_static/js/versions.js:
--------------------------------------------------------------------------------

```javascript
  1 | const themeFlyoutDisplay = "hidden";
  2 | const themeVersionSelector = true;
  3 | const themeLanguageSelector = true;
  4 | 
  5 | if (themeFlyoutDisplay === "attached") {
  6 |   function renderLanguages(config) {
  7 |     if (!config.projects.translations.length) {
  8 |       return "";
  9 |     }
 10 | 
 11 |     // Insert the current language to the options on the selector
 12 |     let languages = config.projects.translations.concat(config.projects.current);
 13 |     languages = languages.sort((a, b) => a.language.name.localeCompare(b.language.name));
 14 | 
 15 |     const languagesHTML = `
 16 |       <dl>
 17 |         <dt>Languages</dt>
 18 |         ${languages
 19 |           .map(
 20 |             (translation) => `
 21 |         <dd ${translation.slug === config.projects.current.slug ? 'class="rtd-current-item"' : ""}>
 22 |           <a href="${translation.urls.documentation}">${translation.language.code}</a>
 23 |         </dd>
 24 |         `,
 25 |           )
 26 |           .join("\n")}
 27 |       </dl>
 28 |     `;
 29 |     return languagesHTML;
 30 |   }
 31 | 
 32 |   function renderVersions(config) {
 33 |     if (!config.versions.active.length) {
 34 |       return "";
 35 |     }
 36 |     const versionsHTML = `
 37 |       <dl>
 38 |         <dt>Versions</dt>
 39 |         ${config.versions.active
 40 |           .map(
 41 |             (version) => `
 42 |         <dd ${version.slug === config.versions.current.slug ? 'class="rtd-current-item"' : ""}>
 43 |           <a href="${version.urls.documentation}">${version.slug}</a>
 44 |         </dd>
 45 |         `,
 46 |           )
 47 |           .join("\n")}
 48 |       </dl>
 49 |     `;
 50 |     return versionsHTML;
 51 |   }
 52 | 
 53 |   function renderDownloads(config) {
 54 |     if (!Object.keys(config.versions.current.downloads).length) {
 55 |       return "";
 56 |     }
 57 |     const downloadsNameDisplay = {
 58 |       pdf: "PDF",
 59 |       epub: "Epub",
 60 |       htmlzip: "HTML",
 61 |     };
 62 | 
 63 |     const downloadsHTML = `
 64 |       <dl>
 65 |         <dt>Downloads</dt>
 66 |         ${Object.entries(config.versions.current.downloads)
 67 |           .map(
 68 |             ([name, url]) => `
 69 |           <dd>
 70 |             <a href="${url}">${downloadsNameDisplay[name]}</a>
 71 |           </dd>
 72 |         `,
 73 |           )
 74 |           .join("\n")}
 75 |       </dl>
 76 |     `;
 77 |     return downloadsHTML;
 78 |   }
 79 | 
 80 |   document.addEventListener("readthedocs-addons-data-ready", function (event) {
 81 |     const config = event.detail.data();
 82 | 
 83 |     const flyout = `
 84 |       <div class="rst-versions" data-toggle="rst-versions" role="note">
 85 |         <span class="rst-current-version" data-toggle="rst-current-version">
 86 |           <span class="fa fa-book"> Read the Docs</span>
 87 |           v: ${config.versions.current.slug}
 88 |           <span class="fa fa-caret-down"></span>
 89 |         </span>
 90 |         <div class="rst-other-versions">
 91 |           <div class="injected">
 92 |             ${renderLanguages(config)}
 93 |             ${renderVersions(config)}
 94 |             ${renderDownloads(config)}
 95 |             <dl>
 96 |               <dt>On Read the Docs</dt>
 97 |               <dd>
 98 |                 <a href="${config.projects.current.urls.home}">Project Home</a>
 99 |               </dd>
100 |               <dd>
101 |                 <a href="${config.projects.current.urls.builds}">Builds</a>
102 |               </dd>
103 |               <dd>
104 |                 <a href="${config.projects.current.urls.downloads}">Downloads</a>
105 |               </dd>
106 |             </dl>
107 |             <dl>
108 |               <dt>Search</dt>
109 |               <dd>
110 |                 <form id="flyout-search-form">
111 |                   <input
112 |                     class="wy-form"
113 |                     type="text"
114 |                     name="q"
115 |                     aria-label="Search docs"
116 |                     placeholder="Search docs"
117 |                     />
118 |                 </form>
119 |               </dd>
120 |             </dl>
121 |             <hr />
122 |             <small>
123 |               <span>Hosted by <a href="https://about.readthedocs.org/?utm_source=&utm_content=flyout">Read the Docs</a></span>
124 |             </small>
125 |           </div>
126 |         </div>
127 |     `;
128 | 
129 |     // Inject the generated flyout into the body HTML element.
130 |     document.body.insertAdjacentHTML("beforeend", flyout);
131 | 
132 |     // Trigger the Read the Docs Addons Search modal when clicking on the "Search docs" input from inside the flyout.
133 |     document
134 |       .querySelector("#flyout-search-form")
135 |       .addEventListener("focusin", () => {
136 |         const event = new CustomEvent("readthedocs-search-show");
137 |         document.dispatchEvent(event);
138 |       });
139 |   })
140 | }
141 | 
142 | if (themeLanguageSelector || themeVersionSelector) {
143 |   function onSelectorSwitch(event) {
144 |     const option = event.target.selectedIndex;
145 |     const item = event.target.options[option];
146 |     window.location.href = item.dataset.url;
147 |   }
148 | 
149 |   document.addEventListener("readthedocs-addons-data-ready", function (event) {
150 |     const config = event.detail.data();
151 | 
152 |     const versionSwitch = document.querySelector(
153 |       "div.switch-menus > div.version-switch",
154 |     );
155 |     if (themeVersionSelector) {
156 |       let versions = config.versions.active;
157 |       if (config.versions.current.hidden || config.versions.current.type === "external") {
158 |         versions.unshift(config.versions.current);
159 |       }
160 |       const versionSelect = `
161 |     <select>
162 |       ${versions
163 |         .map(
164 |           (version) => `
165 |         <option
166 |   value="${version.slug}"
167 |   ${config.versions.current.slug === version.slug ? 'selected="selected"' : ""}
168 |               data-url="${version.urls.documentation}">
169 |               ${version.slug}
170 |           </option>`,
171 |         )
172 |         .join("\n")}
173 |     </select>
174 |   `;
175 | 
176 |       versionSwitch.innerHTML = versionSelect;
177 |       versionSwitch.firstElementChild.addEventListener("change", onSelectorSwitch);
178 |     }
179 | 
180 |     const languageSwitch = document.querySelector(
181 |       "div.switch-menus > div.language-switch",
182 |     );
183 | 
184 |     if (themeLanguageSelector) {
185 |       if (config.projects.translations.length) {
186 |         // Add the current language to the options on the selector
187 |         let languages = config.projects.translations.concat(
188 |           config.projects.current,
189 |         );
190 |         languages = languages.sort((a, b) =>
191 |           a.language.name.localeCompare(b.language.name),
192 |         );
193 | 
194 |         const languageSelect = `
195 |       <select>
196 |         ${languages
197 |           .map(
198 |             (language) => `
199 |               <option
200 |                   value="${language.language.code}"
201 |                   ${config.projects.current.slug === language.slug ? 'selected="selected"' : ""}
202 |                   data-url="${language.urls.documentation}">
203 |                   ${language.language.name}
204 |               </option>`,
205 |           )
206 |           .join("\n")}
207 |        </select>
208 |     `;
209 | 
210 |         languageSwitch.innerHTML = languageSelect;
211 |         languageSwitch.firstElementChild.addEventListener("change", onSelectorSwitch);
212 |       }
213 |       else {
214 |         languageSwitch.remove();
215 |       }
216 |     }
217 |   });
218 | }
219 | 
220 | document.addEventListener("readthedocs-addons-data-ready", function (event) {
221 |   // Trigger the Read the Docs Addons Search modal when clicking on "Search docs" input from the topnav.
222 |   document
223 |     .querySelector("[role='search'] input")
224 |     .addEventListener("focusin", () => {
225 |       const event = new CustomEvent("readthedocs-search-show");
226 |       document.dispatchEvent(event);
227 |     });
228 | });
```

--------------------------------------------------------------------------------
/docs/_modules/mcp_server_webcrawl/main.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="../../">
  5 | <head>
  6 |   <meta charset="utf-8" />
  7 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  8 |   <title>mcp_server_webcrawl.main &mdash; mcp-server-webcrawl  documentation</title>
  9 |       <link rel="stylesheet" type="text/css" href="../../_static/pygments.css?v=80d5e7a1" />
 10 |       <link rel="stylesheet" type="text/css" href="../../_static/css/theme.css?v=e59714d7" />
 11 | 
 12 |   
 13 |       <script src="../../_static/jquery.js?v=5d32c60e"></script>
 14 |       <script src="../../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 15 |       <script src="../../_static/documentation_options.js?v=5929fcd5"></script>
 16 |       <script src="../../_static/doctools.js?v=888ff710"></script>
 17 |       <script src="../../_static/sphinx_highlight.js?v=dc90522c"></script>
 18 |     <script src="../../_static/js/theme.js"></script>
 19 |     <link rel="index" title="Index" href="../../genindex.html" />
 20 |     <link rel="search" title="Search" href="../../search.html" /> 
 21 | </head>
 22 | 
 23 | <body class="wy-body-for-nav"> 
 24 |   <div class="wy-grid-for-nav">
 25 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 26 |       <div class="wy-side-scroll">
 27 |         <div class="wy-side-nav-search" >
 28 | 
 29 |           
 30 |           
 31 |           <a href="../../index.html" class="icon icon-home">
 32 |             mcp-server-webcrawl
 33 |           </a>
 34 | <div role="search">
 35 |   <form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">
 36 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 37 |     <input type="hidden" name="check_keywords" value="yes" />
 38 |     <input type="hidden" name="area" value="default" />
 39 |   </form>
 40 | </div>
 41 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 42 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 43 | <ul>
 44 | <li class="toctree-l1"><a class="reference internal" href="../../installation.html">Installation</a></li>
 45 | <li class="toctree-l1"><a class="reference internal" href="../../guides.html">Setup Guides</a></li>
 46 | <li class="toctree-l1"><a class="reference internal" href="../../usage.html">Usage</a></li>
 47 | <li class="toctree-l1"><a class="reference internal" href="../../modules.html">mcp_server_webcrawl</a></li>
 48 | </ul>
 49 | 
 50 |         </div>
 51 |       </div>
 52 |     </nav>
 53 | 
 54 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 55 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 56 |           <a href="../../index.html">mcp-server-webcrawl</a>
 57 |       </nav>
 58 | 
 59 |       <div class="wy-nav-content">
 60 |         <div class="rst-content">
 61 |           <div role="navigation" aria-label="Page navigation">
 62 |   <ul class="wy-breadcrumbs">
 63 |       <li><a href="../../index.html" class="icon icon-home" aria-label="Home"></a></li>
 64 |           <li class="breadcrumb-item"><a href="../index.html">Module code</a></li>
 65 |       <li class="breadcrumb-item active">mcp_server_webcrawl.main</li>
 66 |       <li class="wy-breadcrumbs-aside">
 67 |       </li>
 68 |   </ul>
 69 |   <hr/>
 70 | </div>
 71 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 72 |            <div itemprop="articleBody">
 73 |              
 74 |   <h1>Source code for mcp_server_webcrawl.main</h1><div class="highlight"><pre>
 75 | <span></span><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
 76 | 
 77 | <span class="kn">from</span> <span class="nn">mcp.server.stdio</span> <span class="kn">import</span> <span class="n">stdio_server</span>
 78 | 
 79 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.crawlers.base.crawler</span> <span class="kn">import</span> <span class="n">BaseCrawler</span>
 80 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.utils.logger</span> <span class="kn">import</span> <span class="n">get_logger</span><span class="p">,</span> <span class="n">initialize_logger</span>
 81 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.utils.server</span> <span class="kn">import</span> <span class="n">initialize_mcp_server</span>
 82 | 
 83 | <span class="n">logger</span> <span class="o">=</span> <span class="n">get_logger</span><span class="p">()</span>
 84 | 
 85 | <div class="viewcode-block" id="main">
 86 | <a class="viewcode-back" href="../../mcp_server_webcrawl.html#mcp_server_webcrawl.main.main">[docs]</a>
 87 | <span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">(</span><span class="n">crawler</span><span class="p">:</span> <span class="nb">type</span><span class="p">[</span><span class="n">BaseCrawler</span><span class="p">],</span> <span class="n">datasrc</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
 88 |     <span class="n">initialize_logger</span><span class="p">()</span>
 89 |     <span class="n">initialize_mcp_server</span><span class="p">()</span>
 90 |     <span class="k">async</span> <span class="k">with</span> <span class="n">stdio_server</span><span class="p">()</span> <span class="k">as</span> <span class="p">(</span><span class="n">read_stream</span><span class="p">,</span> <span class="n">write_stream</span><span class="p">):</span>
 91 |         <span class="n">crawler</span> <span class="o">=</span> <span class="n">crawler</span><span class="p">(</span><span class="n">datasrc</span><span class="p">)</span>
 92 |         <span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;MCP webcrawl server initialized with adapter </span><span class="si">{</span><span class="n">crawler</span><span class="o">.</span><span class="vm">__class__</span><span class="o">.</span><span class="vm">__name__</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
 93 |         <span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;datasrc: </span><span class="si">{</span><span class="n">datasrc</span><span class="o">.</span><span class="n">absolute</span><span class="p">()</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span>
 94 |         <span class="k">await</span> <span class="n">crawler</span><span class="o">.</span><span class="n">serve</span><span class="p">(</span><span class="n">read_stream</span><span class="p">,</span> <span class="n">write_stream</span><span class="p">)</span>
 95 |         <span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">&quot;MCP webcrawl server exited&quot;</span><span class="p">)</span></div>
 96 | 
 97 | </pre></div>
 98 | 
 99 |            </div>
100 |           </div>
101 |           <footer>
102 | 
103 |   <hr/>
104 | 
105 |   <div role="contentinfo">
106 |     <p>&#169; Copyright 2025, pragmar.</p>
107 |   </div>
108 | 
109 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
110 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
111 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
112 |    
113 | 
114 | </footer>
115 |         </div>
116 |       </div>
117 |     </section>
118 |   </div>
119 |   <script>
120 |       jQuery(function () {
121 |           SphinxRtdTheme.Navigation.enable(true);
122 |       });
123 |   </script> 
124 | 
125 | </body>
126 | </html>
```

--------------------------------------------------------------------------------
/docs/installation.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="./">
  5 | <head>
  6 |   <meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
  7 | 
  8 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  9 |   <title>Installation &mdash; mcp-server-webcrawl  documentation</title>
 10 |       <link rel="stylesheet" type="text/css" href="_static/pygments.css?v=80d5e7a1" />
 11 |       <link rel="stylesheet" type="text/css" href="_static/css/theme.css?v=e59714d7" />
 12 | 
 13 | 
 14 |       <script src="_static/jquery.js?v=5d32c60e"></script>
 15 |       <script src="_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 16 |       <script src="_static/documentation_options.js?v=5929fcd5"></script>
 17 |       <script src="_static/doctools.js?v=888ff710"></script>
 18 |       <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 19 |     <script src="_static/js/theme.js"></script>
 20 |     <link rel="index" title="Index" href="genindex.html" />
 21 |     <link rel="search" title="Search" href="search.html" />
 22 |     <link rel="next" title="Setup Guides" href="guides.html" />
 23 |     <link rel="prev" title="mcp-server-webcrawl" href="index.html" />
 24 | </head>
 25 | 
 26 | <body class="wy-body-for-nav">
 27 |   <div class="wy-grid-for-nav">
 28 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 29 |       <div class="wy-side-scroll">
 30 |         <div class="wy-side-nav-search" >
 31 | 
 32 | 
 33 | 
 34 |           <a href="index.html" class="icon icon-home">
 35 |             mcp-server-webcrawl
 36 |           </a>
 37 | <div role="search">
 38 |   <form id="rtd-search-form" class="wy-form" action="search.html" method="get">
 39 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 40 |     <input type="hidden" name="check_keywords" value="yes" />
 41 |     <input type="hidden" name="area" value="default" />
 42 |   </form>
 43 | </div>
 44 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 45 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 46 | <ul class="current">
 47 | <li class="toctree-l1 current"><a class="current reference internal" href="#">Installation</a><ul>
 48 | <li class="toctree-l2"><a class="reference internal" href="#requirements">Requirements</a></li>
 49 | <li class="toctree-l2"><a class="reference internal" href="#mcp-configuration">MCP Configuration</a></li>
 50 | <li class="toctree-l2"><a class="reference internal" href="#multiple-configurations">Multiple Configurations</a></li>
 51 | <li class="toctree-l2"><a class="reference internal" href="#references">References</a></li>
 52 | </ul>
 53 | </li>
 54 | <li class="toctree-l1"><a class="reference internal" href="guides.html">Setup Guides</a></li>
 55 | <li class="toctree-l1"><a class="reference internal" href="usage.html">Usage</a></li>
 56 | <li class="toctree-l1"><a class="reference internal" href="prompts.html">Prompt Routines</a></li>
 57 | <li class="toctree-l1"><a class="reference internal" href="interactive.html">Interactive Mode</a></li>
 58 | <li class="toctree-l1"><a class="reference internal" href="modules.html">mcp_server_webcrawl</a></li>
 59 | </ul>
 60 | 
 61 |         </div>
 62 |       </div>
 63 |     </nav>
 64 | 
 65 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 66 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 67 |           <a href="index.html">mcp-server-webcrawl</a>
 68 |       </nav>
 69 | 
 70 |       <div class="wy-nav-content">
 71 |         <div class="rst-content">
 72 |           <div role="navigation" aria-label="Page navigation">
 73 |   <ul class="wy-breadcrumbs">
 74 |       <li><a href="index.html" class="icon icon-home" aria-label="Home"></a></li>
 75 |       <li class="breadcrumb-item active">Installation</li>
 76 |       <li class="wy-breadcrumbs-aside">
 77 |             <a href="_sources/installation.rst.txt" rel="nofollow"> View page source</a>
 78 |       </li>
 79 |   </ul>
 80 |   <hr/>
 81 | </div>
 82 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 83 |            <div itemprop="articleBody">
 84 | 
 85 |   <section id="installation">
 86 | <h1>Installation<a class="headerlink" href="#installation" title="Link to this heading"></a></h1>
 87 | <p>Install the package via pip:</p>
 88 | <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pip<span class="w"> </span>install<span class="w"> </span>mcp-server-webcrawl
 89 | </pre></div>
 90 | </div>
 91 | <section id="requirements">
 92 | <h2>Requirements<a class="headerlink" href="#requirements" title="Link to this heading"></a></h2>
 93 | <p>To use mcp-server-webcrawl effectively, you need:</p>
 94 | <ul class="simple">
 95 | <li><p>An MCP-capable LLM host such as Claude Desktop [1]</p></li>
 96 | <li><p>Python [2] installed and available from your command line</p></li>
 97 | <li><p>Basic familiarity with running Python packages</p></li>
 98 | </ul>
 99 | <p>After ensuring these prerequisites are met, run the pip install command above to add the package to your environment.</p>
100 | </section>
101 | <section id="mcp-configuration">
102 | <h2>MCP Configuration<a class="headerlink" href="#mcp-configuration" title="Link to this heading"></a></h2>
103 | <p>To enable your LLM host to access your web crawl data, you’ll need to add an MCP server configuration. From Claude’s developer settings, locate the MCP configuration section and add the appropriate configuration for your crawler type.</p>
104 | <p>Setup guides and videos are available for each supported crawler:</p>
105 | <ul class="simple">
106 | <li><p><a class="reference internal" href="guides/archivebox.html"><span class="doc">ArchiveBox</span></a></p></li>
107 | <li><p><a class="reference internal" href="guides/httrack.html"><span class="doc">HTTrack</span></a></p></li>
108 | <li><p><a class="reference internal" href="guides/interrobot.html"><span class="doc">InterroBot</span></a></p></li>
109 | <li><p><a class="reference internal" href="guides/katana.html"><span class="doc">Katana</span></a></p></li>
110 | <li><p><a class="reference internal" href="guides/siteone.html"><span class="doc">SiteOne</span></a></p></li>
111 | <li><p><a class="reference internal" href="guides/warc.html"><span class="doc">WARC</span></a></p></li>
112 | <li><p><a class="reference internal" href="guides/wget.html"><span class="doc">Wget</span></a></p></li>
113 | </ul>
114 | </section>
115 | 
116 | <section id="references">
117 | <h2>References<a class="headerlink" href="#references" title="Link to this heading"></a></h2>
118 | <p>[1] Claude Desktop: <a class="reference external" href="https://claude.ai">https://claude.ai</a>
119 | [2] Python: <a class="reference external" href="https://python.org">https://python.org</a></p>
120 | </section>
121 | </section>
122 | 
123 | 
124 |            </div>
125 |           </div>
126 |           <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
127 |         <a href="index.html" class="btn btn-neutral float-left" title="mcp-server-webcrawl" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
128 |         <a href="guides.html" class="btn btn-neutral float-right" title="Setup Guides" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
129 |     </div>
130 | 
131 |   <hr/>
132 | 
133 |   <div role="contentinfo">
134 |     <p>&#169; Copyright 2025, pragmar.</p>
135 |   </div>
136 | 
137 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
138 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
139 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
140 | 
141 | 
142 | </footer>
143 |         </div>
144 |       </div>
145 |     </section>
146 |   </div>
147 |   <script>
148 |       jQuery(function () {
149 |           SphinxRtdTheme.Navigation.enable(true);
150 |       });
151 |   </script>
152 | 
153 | </body>
154 | </html>
```

--------------------------------------------------------------------------------
/docs/_modules/mcp_server_webcrawl/crawlers/warc/crawler.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="../../../../">
  5 | <head>
  6 |   <meta charset="utf-8" />
  7 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  8 |   <title>mcp_server_webcrawl.crawlers.warc.crawler &mdash; mcp-server-webcrawl  documentation</title>
  9 |       <link rel="stylesheet" type="text/css" href="../../../../_static/pygments.css?v=80d5e7a1" />
 10 |       <link rel="stylesheet" type="text/css" href="../../../../_static/css/theme.css?v=e59714d7" />
 11 | 
 12 |   
 13 |       <script src="../../../../_static/jquery.js?v=5d32c60e"></script>
 14 |       <script src="../../../../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 15 |       <script src="../../../../_static/documentation_options.js?v=5929fcd5"></script>
 16 |       <script src="../../../../_static/doctools.js?v=888ff710"></script>
 17 |       <script src="../../../../_static/sphinx_highlight.js?v=dc90522c"></script>
 18 |     <script src="../../../../_static/js/theme.js"></script>
 19 |     <link rel="index" title="Index" href="../../../../genindex.html" />
 20 |     <link rel="search" title="Search" href="../../../../search.html" /> 
 21 | </head>
 22 | 
 23 | <body class="wy-body-for-nav"> 
 24 |   <div class="wy-grid-for-nav">
 25 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 26 |       <div class="wy-side-scroll">
 27 |         <div class="wy-side-nav-search" >
 28 | 
 29 |           
 30 |           
 31 |           <a href="../../../../index.html" class="icon icon-home">
 32 |             mcp-server-webcrawl
 33 |           </a>
 34 | <div role="search">
 35 |   <form id="rtd-search-form" class="wy-form" action="../../../../search.html" method="get">
 36 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 37 |     <input type="hidden" name="check_keywords" value="yes" />
 38 |     <input type="hidden" name="area" value="default" />
 39 |   </form>
 40 | </div>
 41 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 42 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 43 | <ul>
 44 | <li class="toctree-l1"><a class="reference internal" href="../../../../installation.html">Installation</a></li>
 45 | <li class="toctree-l1"><a class="reference internal" href="../../../../guides.html">Setup Guides</a></li>
 46 | <li class="toctree-l1"><a class="reference internal" href="../../../../usage.html">Usage</a></li>
 47 | <li class="toctree-l1"><a class="reference internal" href="../../../../modules.html">mcp_server_webcrawl</a></li>
 48 | </ul>
 49 | 
 50 |         </div>
 51 |       </div>
 52 |     </nav>
 53 | 
 54 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 55 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 56 |           <a href="../../../../index.html">mcp-server-webcrawl</a>
 57 |       </nav>
 58 | 
 59 |       <div class="wy-nav-content">
 60 |         <div class="rst-content">
 61 |           <div role="navigation" aria-label="Page navigation">
 62 |   <ul class="wy-breadcrumbs">
 63 |       <li><a href="../../../../index.html" class="icon icon-home" aria-label="Home"></a></li>
 64 |           <li class="breadcrumb-item"><a href="../../../index.html">Module code</a></li>
 65 |           <li class="breadcrumb-item"><a href="../../crawlers.html">mcp_server_webcrawl.crawlers</a></li>
 66 |       <li class="breadcrumb-item active">mcp_server_webcrawl.crawlers.warc.crawler</li>
 67 |       <li class="wy-breadcrumbs-aside">
 68 |       </li>
 69 |   </ul>
 70 |   <hr/>
 71 | </div>
 72 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 73 |            <div itemprop="articleBody">
 74 |              
 75 |   <h1>Source code for mcp_server_webcrawl.crawlers.warc.crawler</h1><div class="highlight"><pre>
 76 | <span></span><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
 77 | 
 78 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.crawlers.base.indexed</span> <span class="kn">import</span> <span class="n">IndexedCrawler</span>
 79 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.crawlers.warc.adapter</span> <span class="kn">import</span> <span class="n">get_sites</span><span class="p">,</span> <span class="n">get_resources</span>
 80 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.utils.logger</span> <span class="kn">import</span> <span class="n">get_logger</span>
 81 | 
 82 | <span class="n">logger</span> <span class="o">=</span> <span class="n">get_logger</span><span class="p">()</span>
 83 | 
 84 | <div class="viewcode-block" id="WarcCrawler">
 85 | <a class="viewcode-back" href="../../../../mcp_server_webcrawl.crawlers.warc.html#mcp_server_webcrawl.crawlers.warc.crawler.WarcCrawler">[docs]</a>
 86 | <span class="k">class</span> <span class="nc">WarcCrawler</span><span class="p">(</span><span class="n">IndexedCrawler</span><span class="p">):</span>
 87 | <span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
 88 | <span class="sd">    A crawler implementation for WARC (Web ARChive) files.</span>
 89 | <span class="sd">    Provides functionality for accessing and searching web archive content.</span>
 90 | <span class="sd">    &quot;&quot;&quot;</span>
 91 | 
 92 | <div class="viewcode-block" id="WarcCrawler.__init__">
 93 | <a class="viewcode-back" href="../../../../mcp_server_webcrawl.crawlers.warc.html#mcp_server_webcrawl.crawlers.warc.crawler.WarcCrawler.__init__">[docs]</a>
 94 |     <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">datasrc</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
 95 | <span class="w">        </span><span class="sd">&quot;&quot;&quot;</span>
 96 | <span class="sd">        Initialize the WARC crawler with a data source directory.</span>
 97 | <span class="sd">        Supported file types: .txt, .warc, and .warc.gz</span>
 98 | 
 99 | <span class="sd">        Args:</span>
100 | <span class="sd">            datasrc: the input argument as Path, must be a directory containing WARC files</span>
101 | 
102 | 
103 | <span class="sd">        Raises:</span>
104 | <span class="sd">            AssertionError: If datasrc is None or not a directory</span>
105 | <span class="sd">        &quot;&quot;&quot;</span>
106 |         <span class="k">assert</span> <span class="n">datasrc</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">,</span> <span class="sa">f</span><span class="s2">&quot;WarcCrawler needs a datasrc, regardless of action&quot;</span>
107 |         <span class="k">assert</span> <span class="n">datasrc</span><span class="o">.</span><span class="n">is_dir</span><span class="p">(),</span> <span class="s2">&quot;WarcCrawler datasrc must be a directory&quot;</span>
108 |         <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">datasrc</span><span class="p">,</span> <span class="n">get_sites</span><span class="p">,</span> <span class="n">get_resources</span><span class="p">)</span></div>
109 | </div>
110 | 
111 | </pre></div>
112 | 
113 |            </div>
114 |           </div>
115 |           <footer>
116 | 
117 |   <hr/>
118 | 
119 |   <div role="contentinfo">
120 |     <p>&#169; Copyright 2025, pragmar.</p>
121 |   </div>
122 | 
123 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
124 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
125 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
126 |    
127 | 
128 | </footer>
129 |         </div>
130 |       </div>
131 |     </section>
132 |   </div>
133 |   <script>
134 |       jQuery(function () {
135 |           SphinxRtdTheme.Navigation.enable(true);
136 |       });
137 |   </script> 
138 | 
139 | </body>
140 | </html>
```
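The `WarcCrawler` docstring above names three supported file types (.txt, .warc, and .warc.gz) and requires `datasrc` to be a directory. As a rough illustration of what that implies for a data source folder, here is a minimal sketch that lists candidate files by suffix. The helper `find_archive_files` and the `demo` function are hypothetical names introduced for this example; they are not part of mcp-server-webcrawl, and the real adapter may select files differently.

```python
import tempfile
from pathlib import Path

# Suffixes taken from the WarcCrawler docstring; the matching logic is illustrative only.
SUPPORTED_SUFFIXES = (".txt", ".warc", ".warc.gz")


def find_archive_files(datasrc: Path) -> list[Path]:
    """Hypothetical helper: files directly under datasrc with a supported suffix."""
    if not datasrc.is_dir():
        raise NotADirectoryError(f"datasrc must be a directory: {datasrc}")
    return sorted(
        path
        for path in datasrc.iterdir()
        if path.is_file() and path.name.lower().endswith(SUPPORTED_SUFFIXES)
    )


def demo() -> list[str]:
    """Build a throwaway directory with sample files and list the matches."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("a.warc", "b.warc.gz", "notes.txt", "image.png"):
            (root / name).touch()
        return [path.name for path in find_archive_files(root)]


if __name__ == "__main__":
    print(demo())
```

The sketch only looks at the top level of the directory; whether nested folders are scanned is not stated in the source above.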

--------------------------------------------------------------------------------
/docs/_modules/mcp_server_webcrawl/crawlers/wget/crawler.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="../../../../">
  5 | <head>
  6 |   <meta charset="utf-8" />
  7 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  8 |   <title>mcp_server_webcrawl.crawlers.wget.crawler &mdash; mcp-server-webcrawl  documentation</title>
  9 |       <link rel="stylesheet" type="text/css" href="../../../../_static/pygments.css?v=80d5e7a1" />
 10 |       <link rel="stylesheet" type="text/css" href="../../../../_static/css/theme.css?v=e59714d7" />
 11 | 
 12 |   
 13 |       <script src="../../../../_static/jquery.js?v=5d32c60e"></script>
 14 |       <script src="../../../../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 15 |       <script src="../../../../_static/documentation_options.js?v=5929fcd5"></script>
 16 |       <script src="../../../../_static/doctools.js?v=888ff710"></script>
 17 |       <script src="../../../../_static/sphinx_highlight.js?v=dc90522c"></script>
 18 |     <script src="../../../../_static/js/theme.js"></script>
 19 |     <link rel="index" title="Index" href="../../../../genindex.html" />
 20 |     <link rel="search" title="Search" href="../../../../search.html" /> 
 21 | </head>
 22 | 
 23 | <body class="wy-body-for-nav"> 
 24 |   <div class="wy-grid-for-nav">
 25 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 26 |       <div class="wy-side-scroll">
 27 |         <div class="wy-side-nav-search" >
 28 | 
 29 |           
 30 |           
 31 |           <a href="../../../../index.html" class="icon icon-home">
 32 |             mcp-server-webcrawl
 33 |           </a>
 34 | <div role="search">
 35 |   <form id="rtd-search-form" class="wy-form" action="../../../../search.html" method="get">
 36 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 37 |     <input type="hidden" name="check_keywords" value="yes" />
 38 |     <input type="hidden" name="area" value="default" />
 39 |   </form>
 40 | </div>
 41 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 42 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 43 | <ul>
 44 | <li class="toctree-l1"><a class="reference internal" href="../../../../installation.html">Installation</a></li>
 45 | <li class="toctree-l1"><a class="reference internal" href="../../../../guides.html">Setup Guides</a></li>
 46 | <li class="toctree-l1"><a class="reference internal" href="../../../../usage.html">Usage</a></li>
 47 | <li class="toctree-l1"><a class="reference internal" href="../../../../modules.html">mcp_server_webcrawl</a></li>
 48 | </ul>
 49 | 
 50 |         </div>
 51 |       </div>
 52 |     </nav>
 53 | 
 54 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 55 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 56 |           <a href="../../../../index.html">mcp-server-webcrawl</a>
 57 |       </nav>
 58 | 
 59 |       <div class="wy-nav-content">
 60 |         <div class="rst-content">
 61 |           <div role="navigation" aria-label="Page navigation">
 62 |   <ul class="wy-breadcrumbs">
 63 |       <li><a href="../../../../index.html" class="icon icon-home" aria-label="Home"></a></li>
 64 |           <li class="breadcrumb-item"><a href="../../../index.html">Module code</a></li>
 65 |           <li class="breadcrumb-item"><a href="../../crawlers.html">mcp_server_webcrawl.crawlers</a></li>
 66 |       <li class="breadcrumb-item active">mcp_server_webcrawl.crawlers.wget.crawler</li>
 67 |       <li class="wy-breadcrumbs-aside">
 68 |       </li>
 69 |   </ul>
 70 |   <hr/>
 71 | </div>
 72 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 73 |            <div itemprop="articleBody">
 74 |              
 75 |   <h1>Source code for mcp_server_webcrawl.crawlers.wget.crawler</h1><div class="highlight"><pre>
 76 | <span></span><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
 77 | 
 78 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.crawlers.base.indexed</span> <span class="kn">import</span> <span class="n">IndexedCrawler</span>
 79 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.crawlers.wget.adapter</span> <span class="kn">import</span> <span class="n">get_sites</span><span class="p">,</span> <span class="n">get_resources</span>
 80 | <span class="kn">from</span> <span class="nn">mcp_server_webcrawl.utils.logger</span> <span class="kn">import</span> <span class="n">get_logger</span>
 81 | 
 82 | <span class="n">logger</span> <span class="o">=</span> <span class="n">get_logger</span><span class="p">()</span>
 83 | 
 84 | <div class="viewcode-block" id="WgetCrawler">
 85 | <a class="viewcode-back" href="../../../../mcp_server_webcrawl.crawlers.wget.html#mcp_server_webcrawl.crawlers.wget.crawler.WgetCrawler">[docs]</a>
 86 | <span class="k">class</span> <span class="nc">WgetCrawler</span><span class="p">(</span><span class="n">IndexedCrawler</span><span class="p">):</span>
 87 | <span class="w">    </span><span class="sd">&quot;&quot;&quot;</span>
 88 | <span class="sd">    A crawler implementation for wget captured sites.</span>
 89 | <span class="sd">    Provides functionality for accessing and searching web content from wget captures.</span>
 90 | <span class="sd">    &quot;&quot;&quot;</span>
 91 | 
 92 | <div class="viewcode-block" id="WgetCrawler.__init__">
 93 | <a class="viewcode-back" href="../../../../mcp_server_webcrawl.crawlers.wget.html#mcp_server_webcrawl.crawlers.wget.crawler.WgetCrawler.__init__">[docs]</a>
 94 |     <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">datasrc</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
 95 | <span class="w">        </span><span class="sd">&quot;&quot;&quot;</span>
 96 | <span class="sd">        Initialize the wget crawler with a data source directory.</span>
 97 | 
 98 | <span class="sd">        Args:</span>
 99 | <span class="sd">            datasrc: the input argument as Path, it must be a directory containing</span>
100 | <span class="sd">                wget captures organized as subdirectories</span>
101 | 
102 | <span class="sd">        Raises:</span>
103 | <span class="sd">            AssertionError: If datasrc is None or not a directory</span>
104 | <span class="sd">        &quot;&quot;&quot;</span>
105 |         <span class="k">assert</span> <span class="n">datasrc</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">,</span> <span class="sa">f</span><span class="s2">&quot;WgetCrawler needs a datasrc, regardless of action&quot;</span>
106 |         <span class="k">assert</span> <span class="n">datasrc</span><span class="o">.</span><span class="n">is_dir</span><span class="p">(),</span> <span class="s2">&quot;WgetCrawler datasrc must be a directory&quot;</span>
107 | 
108 |         <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">datasrc</span><span class="p">,</span> <span class="n">get_sites</span><span class="p">,</span> <span class="n">get_resources</span><span class="p">)</span></div>
109 | </div>
110 | 
111 | </pre></div>
112 | 
113 |            </div>
114 |           </div>
115 |           <footer>
116 | 
117 |   <hr/>
118 | 
119 |   <div role="contentinfo">
120 |     <p>&#169; Copyright 2025, pragmar.</p>
121 |   </div>
122 | 
123 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
124 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
125 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
126 |    
127 | 
128 | </footer>
129 |         </div>
130 |       </div>
131 |     </section>
132 |   </div>
133 |   <script>
134 |       jQuery(function () {
135 |           SphinxRtdTheme.Navigation.enable(true);
136 |       });
137 |   </script> 
138 | 
139 | </body>
140 | </html>
```
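The `WgetCrawler` docstring above expects `datasrc` to be a directory containing wget captures organized as subdirectories, and it checks its argument with `assert`. The sketch below shows one way to inspect such a folder before handing it to the crawler, using explicit exceptions instead of assertions because `assert` statements are skipped when Python runs with `-O`. The helper `list_capture_dirs` and the `demo` function are hypothetical and exist only for this example; how the library itself maps subdirectories to sites is not shown in the source above.

```python
import tempfile
from pathlib import Path
from typing import Optional


def list_capture_dirs(datasrc: Optional[Path]) -> list[str]:
    """Hypothetical helper: names of the immediate subdirectories of datasrc."""
    if datasrc is None:
        raise ValueError("datasrc is required")
    if not datasrc.is_dir():
        raise NotADirectoryError(f"datasrc must be a directory: {datasrc}")
    # Plain files (logs, stray downloads) are ignored; only directories are returned.
    return sorted(child.name for child in datasrc.iterdir() if child.is_dir())


def demo() -> list[str]:
    """Build a throwaway datasrc with two capture folders and one stray file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "example.com").mkdir()
        (root / "docs.example.com").mkdir()
        (root / "wget.log").touch()
        return list_capture_dirs(root)


if __name__ == "__main__":
    print(demo())
```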

--------------------------------------------------------------------------------
/docs/interactive.html:
--------------------------------------------------------------------------------

```html
  1 | 
  2 | 
  3 | <!DOCTYPE html>
  4 | <html class="writer-html5" lang="en" data-content_root="./">
  5 | <head>
  6 |   <meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
  7 | 
  8 |   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  9 |   <title>Interactive Mode &mdash; mcp-server-webcrawl  documentation</title>
 10 |       <link rel="stylesheet" type="text/css" href="_static/pygments.css?v=80d5e7a1" />
 11 |       <link rel="stylesheet" type="text/css" href="_static/css/theme.css?v=e59714d7" />
 12 | 
 13 |   
 14 |       <script src="_static/jquery.js?v=5d32c60e"></script>
 15 |       <script src="_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
 16 |       <script src="_static/documentation_options.js?v=5929fcd5"></script>
 17 |       <script src="_static/doctools.js?v=888ff710"></script>
 18 |       <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 19 |     <script src="_static/js/theme.js"></script>
 20 |     <link rel="index" title="Index" href="genindex.html" />
 21 |     <link rel="search" title="Search" href="search.html" />
 22 |     <link rel="next" title="mcp_server_webcrawl" href="modules.html" />
 23 |     <link rel="prev" title="Prompt Routines" href="prompts.html" /> 
 24 | </head>
 25 | 
 26 | <body class="wy-body-for-nav"> 
 27 |   <div class="wy-grid-for-nav">
 28 |     <nav data-toggle="wy-nav-shift" class="wy-nav-side">
 29 |       <div class="wy-side-scroll">
 30 |         <div class="wy-side-nav-search" >
 31 | 
 32 |           
 33 |           
 34 |           <a href="index.html" class="icon icon-home">
 35 |             mcp-server-webcrawl
 36 |           </a>
 37 | <div role="search">
 38 |   <form id="rtd-search-form" class="wy-form" action="search.html" method="get">
 39 |     <input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
 40 |     <input type="hidden" name="check_keywords" value="yes" />
 41 |     <input type="hidden" name="area" value="default" />
 42 |   </form>
 43 | </div>
 44 |         </div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
 45 |               <p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
 46 | <ul class="current">
 47 | <li class="toctree-l1"><a class="reference internal" href="installation.html">Installation</a></li>
 48 | <li class="toctree-l1"><a class="reference internal" href="guides.html">Setup Guides</a></li>
 49 | <li class="toctree-l1"><a class="reference internal" href="usage.html">Usage</a></li>
 50 | <li class="toctree-l1"><a class="reference internal" href="prompts.html">Prompt Routines</a></li>
 51 | <li class="toctree-l1 current"><a class="current reference internal" href="#">Interactive Mode</a><ul>
 52 | <li class="toctree-l2"><a class="reference internal" href="#usage">Usage</a></li>
 53 | <li class="toctree-l2"><a class="reference internal" href="#screencaps">Screencaps</a></li>
 54 | </ul>
 55 | </li>
 56 | <li class="toctree-l1"><a class="reference internal" href="modules.html">mcp_server_webcrawl</a></li>
 57 | </ul>
 58 | 
 59 |         </div>
 60 |       </div>
 61 |     </nav>
 62 | 
 63 |     <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
 64 |           <i data-toggle="wy-nav-top" class="fa fa-bars"></i>
 65 |           <a href="index.html">mcp-server-webcrawl</a>
 66 |       </nav>
 67 | 
 68 |       <div class="wy-nav-content">
 69 |         <div class="rst-content">
 70 |           <div role="navigation" aria-label="Page navigation">
 71 |   <ul class="wy-breadcrumbs">
 72 |       <li><a href="index.html" class="icon icon-home" aria-label="Home"></a></li>
 73 |       <li class="breadcrumb-item active">Interactive Mode</li>
 74 |       <li class="wy-breadcrumbs-aside">
 75 |             <a href="_sources/interactive.rst.txt" rel="nofollow"> View page source</a>
 76 |       </li>
 77 |   </ul>
 78 |   <hr/>
 79 | </div>
 80 |           <div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
 81 |            <div itemprop="articleBody">
 82 |              
 83 |   <section id="interactive-mode">
 84 | <h1>Interactive Mode<a class="headerlink" href="#interactive-mode" title="Link to this heading"></a></h1>
 85 | <p><strong>No AI, just classic Boolean search of your web-archives in a terminal.</strong></p>
 86 | <p>mcp-server-webcrawl can double as a terminal search for your web archives. You can run it against your local archives, but it gets more interesting when you realize you can ssh into any remote host and view archives sitting on that host. No downloads, syncs, multifactor logins, or other common drudgery required. With interactive mode, you can be in and searching a crawl sitting on a remote server in no time at all.</p>
 87 | <iframe width="560" height="315" style="display: block;margin-bottom:1rem;" src="https://www.youtube.com/embed/8kNkP-zNzs4" frameborder="0" allowfullscreen></iframe><section id="usage">
 88 | <h2>Usage<a class="headerlink" href="#usage" title="Link to this heading"></a></h2>
 89 | <p>Interactive mode exposes the mcp-server-webcrawl search layer as a terminal UI (TUI), bypassing MCP/AI altogether. Core field and Boolean search are supported, along with the human-friendly aspects of the search interface, such as result snippets.</p>
 90 | <p>You launch interactive mode from the terminal, using the --interactive command line argument.</p>
 91 | <div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>mcp-server-webcrawl<span class="w"> </span>--crawler<span class="w"> </span>wget<span class="w"> </span>--datasrc<span class="w"> </span>/path/to/datasrc<span class="w"> </span>--interactive
 92 | <span class="c1"># or manually enter crawler and datasrc in the UI</span>
 93 | mcp-server-webcrawl<span class="w"> </span>--interactive
 94 | </pre></div>
 95 | </div>
 96 | </section>
 97 | <section id="screencaps">
 98 | <h2>Screencaps<a class="headerlink" href="#screencaps" title="Link to this heading"></a></h2>
 99 | <figure class="align-center" id="id1">
100 | <a class="reference internal image-reference" href="_images/interactive.search.webp"><img alt="mcp-server-webcrawl in --interactive mode heading" src="_images/interactive.search.webp" style="width: 100%;" /></a>
101 | <figcaption>
102 | <p><span class="caption-text">Search view, showing snippets with “Solar Eclipse” highlights</span><a class="headerlink" href="#id1" title="Link to this image"></a></p>
103 | </figcaption>
104 | </figure>
105 | <figure class="align-center" id="id2">
106 | <a class="reference internal image-reference" href="_images/interactive.document.webp"><img alt="mcp-server-webcrawl in --interactive mode heading" src="_images/interactive.document.webp" style="width: 100%;" /></a>
107 | <figcaption>
108 | <p><span class="caption-text">Document presented in Markdown, with raw and HTTP headers views available.</span><a class="headerlink" href="#id2" title="Link to this image"></a></p>
109 | </figcaption>
110 | </figure>
111 | </section>
112 | </section>
113 | 
114 | 
115 |            </div>
116 |           </div>
117 |           <footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
118 |         <a href="prompts.html" class="btn btn-neutral float-left" title="Prompt Routines" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
119 |         <a href="modules.html" class="btn btn-neutral float-right" title="mcp_server_webcrawl" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
120 |     </div>
121 | 
122 |   <hr/>
123 | 
124 |   <div role="contentinfo">
125 |     <p>&#169; Copyright 2025, pragmar.</p>
126 |   </div>
127 | 
128 |   Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
129 |     <a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
130 |     provided by <a href="https://readthedocs.org">Read the Docs</a>.
131 |    
132 | 
133 | </footer>
134 |         </div>
135 |       </div>
136 |     </section>
137 |   </div>
138 |   <script>
139 |       jQuery(function () {
140 |           SphinxRtdTheme.Navigation.enable(true);
141 |       });
142 |   </script> 
143 | 
144 | </body>
145 | </html>
```
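The interactive mode page above shows two ways to launch the terminal UI: with `--crawler` and `--datasrc` supplied up front, or with `--interactive` alone. If you script the launch from Python, the argument list could be assembled roughly as follows. The function `build_interactive_command` is a hypothetical wrapper written for this example; only the three flags and the `mcp-server-webcrawl` command name come from the page above.

```python
import shlex
from typing import Optional


def build_interactive_command(
    crawler: Optional[str] = None,
    datasrc: Optional[str] = None,
) -> list[str]:
    """Hypothetical wrapper: assemble argv for launching interactive mode."""
    command = ["mcp-server-webcrawl"]
    # Pass crawler and datasrc only when both are known; otherwise the UI prompts for them.
    if crawler is not None and datasrc is not None:
        command += ["--crawler", crawler, "--datasrc", datasrc]
    command.append("--interactive")
    return command


if __name__ == "__main__":
    # Print shell-quoted forms of both launch styles.
    print(shlex.join(build_interactive_command("wget", "/path/to/datasrc")))
    print(shlex.join(build_interactive_command()))
```

Passing the resulting list to `subprocess.run` from a real terminal would start the UI, assuming the package is installed and on your PATH.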

--------------------------------------------------------------------------------
/prompts/audit404.md:
--------------------------------------------------------------------------------

```markdown
  1 | # Webcrawl 404 Audit Instructions
  2 | 
  3 | ## Query Sequence
  4 | 
  5 | ### 1. Identify Target Domain & Homepage
  6 | 
  7 | **FIRST:** Get available sites and let user choose:
  8 | ```
  9 | webcrawl_sites() - get all available domains
 10 | ```
 11 | 
 12 | **THEN:** Find homepage with sorted URL approach:
 13 | ```
 14 | query: type: html AND url: [target_site_domain]
 15 | limit: 1
 16 | sites: [target_site_id]
 17 | sort: +url
 18 | ```
 19 | 
 20 | **NEXT:** Extract the exact domain (e.g. `example.com`) from the homepage URL. You will use this domain string in all subsequent queries to filter results to on-site pages and, with Boolean NOT, to extract all other 404s separately.
 21 | 
 22 | ### 2. Get Segmented 404s
 23 | 
 24 | All on-site 404s:
 25 | ```
 26 | query: status:404 AND url: example.com
 27 | limit: 100
 28 | sites: [target_site_id]
 29 | ```
 30 | 
 31 | All off-site 404s (outlinks, generally):
 32 | ```
 33 | query: status:404 AND NOT url: example.com
 34 | limit: 100
 35 | sites: [target_site_id]
 36 | ```
 37 | 
 38 | Note the total count from results metadata to understand scale. **If there are 100+ errors**, run additional queries, prioritizing on-site 404s, with offset: 0, 100, 200, 300... until all are captured or you have gathered 400 total results. Ask the user for permission to fetch more if you think it would help and there is an end in sight.
 39 | 
 40 | ### 3. Group URLs by Domain/Subdomain Patterns
 41 | - Identify main domain vs subdomains (e.g., `example.com` vs `corp.example.com`)
 42 | - Check for legacy HTTP domains vs HTTPS
 43 | - Count occurrences of each domain type
 44 | 
 45 | ### 4. Identify Structural Patterns
 46 | Look for these common failure types:
 47 | 
 48 | **Pagination Issues:**
 49 | - URLs containing `page=`, `search_page=`, `/p/`, `offset=`
 50 | - Others, you will know when you see them
 51 | - Usually indicates pagination system generating invalid page numbers
 52 | 
 53 | **API Endpoint Failures:**
 54 | - URLs with `/api/`, `/wp-json/`, `/rest/`, `/oembed/`
 55 | - Others, you will know when you see them
 56 | - Often configuration or authentication issues
 57 | 
 58 | **Legacy Infrastructure:**
 59 | - HTTP vs HTTPS mismatches
 60 | - Old directory structures no longer supported
 61 | - Retired subdomains or CDN endpoints
 62 | 
 63 | **Media/Asset Problems:**
 64 | - File extensions (.m4r, .pdf, .jpg, .mp4)
 65 | - `/multimedia/`, `/images/`, `/downloads/` paths
 66 | - Missing files from content migrations
 67 | 
 68 | **Content Management Issues:**
 69 | - Similar path structures suggesting bulk content moves
 70 | - Deleted pages without proper redirects
 71 | - URL structure changes without migration planning
 72 | 
 73 | ### 5. Calculate Pattern Distribution
 74 | - Count URLs in each pattern category
 75 | - Calculate percentage of total 404s for each theme
 76 | - Identify the dominant failure mode (usually 50%+ of errors)
 77 | 
 78 | ### 6. Offer Advanced Analysis or Tool Research
 79 | 
 80 | After completing the main audit report, offer the user two additional options:
 81 | - **Detailed Analysis:** More comprehensive investigation of specific 404 patterns or high-impact broken pages
 82 | - **Tool Research:** Research and recommend specific tools to address identified 404 problems and implement monitoring
 83 | 
 84 | ## Pattern Analysis Method
 85 | 
 86 | ## Reporting Template
 87 | 
 88 | ### 📊 Summary Metrics
 89 | 
 90 | | Metric | Value | Grade Threshold |
 91 | |--------|-------|----------------|
 92 | | **Total 404s** | X out of Y pages | A: <0.5% \| B: 0.5-1% \| C: 1-2% \| D: 2-3% \| F: >3% |
 93 | | **Error Rate** | Z% | [Calculated Grade] |
 94 | | **Site Health** | [Assessment] | Based on error distribution |
 95 | 
 96 | ### 🔍 Pattern Distribution Analysis
 97 | 
 98 | | Pattern Type | Count | % of Total | Priority | Root Cause | Recommended Fix |
 99 | |--------------|-------|------------|----------|------------|-----------------|
100 | | [Pattern Name] | X | Y% | Critical/High/Medium/Low | [Technical explanation] | [Specific action] |
101 | | [Pattern Name] | X | Y% | Critical/High/Medium/Low | [Technical explanation] | [Specific action] |
102 | | [Pattern Name] | X | Y% | Critical/High/Medium/Low | [Technical explanation] | [Specific action] |
103 | 
104 | ### 🔧 Technical Impact Assessment
105 | 
106 | | Domain/Subdomain | 404 Count | Error Type | Business Impact | Fix Complexity |
107 | |------------------|-----------|------------|-----------------|----------------|
108 | | [main_domain] | X | [Pattern] | [SEO/UX/Revenue] | [Simple/Complex] |
109 | | [subdomain] | X | [Pattern] | [SEO/UX/Revenue] | [Simple/Complex] |
110 | | [external] | X | [Pattern] | [SEO/UX/Revenue] | [Simple/Complex] |
111 | 
112 | ### ⚡ Impact Priority Assessment
113 | 
114 | | Priority Level | Criteria | Example Issues |
115 | |----------------|----------|----------------|
116 | | **🚨 Critical** | Core functionality, revenue impact | Payment pages, login systems |
117 | | **🔴 High** | Major SEO/UX degradation | Product pages, main navigation |
118 | | **🟡 Medium** | Internal links, historical content | Blog archives, old campaigns |
119 | | **🟢 Low** | Edge cases, rarely accessed | Test pages, admin tools |
120 | 
121 | ### 🎯 Quick Win Opportunities
122 | 
123 | | Fix Type | Effort Level | Impact | Implementation Method |
124 | |----------|--------------|--------|----------------------|
125 | | **Simple redirects** | Low | High | 301 redirects for obvious replacements |
126 | | **HTTPS upgrades** | Low | Medium | Automatic HTTP→HTTPS redirect rules |
127 | | **Config fixes** | Medium | High | Server/CDN configuration updates |
128 | | **Asset cleanup** | Medium | Medium | Remove/replace broken media references |
129 | 
130 | ### 🛠️ Solution Stack Reference
131 | 
132 | #### Monitoring & Detection Tools
133 | 
134 | | Tool Category | Recommended Solution | Use Case | Integration Complexity |
135 | |---------------|---------------------|----------|----------------------|
136 | | **Search Monitoring** | Google Search Console | Track SERP 404s, set alerts | Simple |
137 | | **Site Crawling** | Screaming Frog SEO Spider | Comprehensive link analysis | Medium |
138 | | **Automated Monitoring** | Dead Link Checker, Pingdom | Ongoing 404 detection | Medium |
139 | | **Log Analysis** | GoAccess, AWStats | Server-level 404 pattern analysis | Complex |
140 | 
141 | #### Redirect Management Options
142 | 
143 | | Platform | Tool | Strengths | Best For |
144 | |----------|------|-----------|----------|
145 | | **WordPress** | Redirection Plugin | User-friendly interface | Content sites |
146 | | **CDN Level** | Cloudflare Page Rules | Global, cached redirects | High-traffic sites |
147 | | **Server Level** | Nginx/Apache rewrites | Maximum performance | Technical teams |
148 | | **Bulk Operations** | CSV redirect generators | Mass URL migrations | Large site moves |
149 | 
150 | ## What's Next?
151 | 
152 | The audit results give you a clear picture of what you're dealing with - whether it's a few simple redirects, a pattern of broken external links, or something more complex like a pagination system gone wrong. Most 404 issues fall into predictable patterns that have standard solutions.
153 | 
154 | **Ready to dive deeper?** I can help you:
155 | - **Create detailed fix strategies** - Let's prioritize your specific 404 patterns and map out exactly how to address them, including timeline recommendations and implementation approaches
156 | - **Expand the analysis** - Examine more URLs, analyze referrer patterns to see how users find these broken links, or investigate when the breaks started happening
157 | - **Research implementation tools** - Find the right redirect management, monitoring, or automated testing solutions that fit your technical stack and team workflow
158 | 
159 | **What would be most helpful for your next steps?**
160 | 
161 | ## Methodology
162 | 
 163 | You will review this web project from the perspective of an accomplished but patient web developer. You've seen it all over the years, and have reasonable expectations of quality. At the same time, you have a fondness for the user wanting to improve the web at all. It's a noble pursuit that you can encourage without being overbearing. Nobody wants a scolding or patronizing AI. It's a fine line to walk, but you somehow manage it well. As these "reviews" can be hard to hear, you will break news gently, but firmly when things are out of whack.
164 | 
 165 | Where you have tabular data, you aren't afraid to arrange it in an aesthetically pleasing manner. You will prefer tables over unordered lists. Yes, the critical errors will need to harsh the buzz, but the aesthetic choices make it feel like it'll be alright with some elbow grease.
```
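
The pattern-distribution step and the grade thresholds in the prompt above could be sketched in Python roughly as follows. This is a minimal illustration and not part of the repository: `categorize`, `pattern_distribution`, and `grade` are hypothetical helper names, only the media/asset theme is modelled, and the treatment of boundary values (an upper bound such as exactly 1% falls in the lower grade band) is an assumption, since the prompt's ranges overlap at their edges.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical rules derived from the "Media/Asset Problems" theme in the prompt.
MEDIA_EXTENSIONS = (".m4r", ".pdf", ".jpg", ".mp4")
MEDIA_PATHS = ("/multimedia/", "/images/", "/downloads/")


def categorize(url: str) -> str:
    """Assign a 404 URL to a pattern category (only one theme is modelled here)."""
    path = urlparse(url).path.lower()
    if path.endswith(MEDIA_EXTENSIONS) or any(p in path for p in MEDIA_PATHS):
        return "media/asset"
    return "other"


def pattern_distribution(urls):
    """Return {category: (count, percent of total 404s)}, largest category first."""
    counts = Counter(categorize(u) for u in urls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: (n, 100.0 * n / total) for name, n in counts.most_common()}


def grade(error_count: int, total_pages: int) -> str:
    """Map the 404 error rate to a letter grade using the prompt's thresholds.

    Assumption: upper bounds are inclusive (exactly 1% is a B, exactly 3% is a D).
    """
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    rate = 100.0 * error_count / total_pages
    if rate < 0.5:
        return "A"
    if rate <= 1:
        return "B"
    if rate <= 2:
        return "C"
    if rate <= 3:
        return "D"
    return "F"


if __name__ == "__main__":
    sample = [
        "https://example.com/images/a.png",
        "https://example.com/downloads/x.pdf",
        "https://example.com/old-page",
        "https://example.com/media/clip.mp4",
    ]
    for name, (count, pct) in pattern_distribution(sample).items():
        print(f"{name}: {count} ({pct:.1f}%)")
    print("grade:", grade(len(sample), 1000))
```

The first category returned by `pattern_distribution` corresponds to the dominant failure mode that step 5 asks for; a fuller version would add rules for the other themes listed in the prompt.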