# Directory Structure

```
├── .gitignore
├── docs
│   ├── fastmcp.md
│   └── langextract.md
├── LICENSE
├── pyproject.toml
├── README.md
├── SETUP.md
├── src
│   └── langextract_mcp
│       ├── __init__.py
│       ├── resources
│       │   ├── __init__.py
│       │   ├── README.md
│       │   └── supported-models.md
│       └── server.py
└── uv.lock
```

# Files

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

```
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | pip-wheel-metadata/
 24 | share/python-wheels/
 25 | *.egg-info/
 26 | .installed.cfg
 27 | *.egg
 28 | MANIFEST
 29 | 
 30 | # PyInstaller
 31 | *.manifest
 32 | *.spec
 33 | 
 34 | # Installer logs
 35 | pip-log.txt
 36 | pip-delete-this-directory.txt
 37 | 
 38 | # Unit test / coverage reports
 39 | htmlcov/
 40 | .tox/
 41 | .nox/
 42 | .coverage
 43 | .coverage.*
 44 | .cache
 45 | nosetests.xml
 46 | coverage.xml
 47 | *.cover
 48 | *.py,cover
 49 | .hypothesis/
 50 | .pytest_cache/
 51 | 
 52 | # Translations
 53 | *.mo
 54 | *.pot
 55 | 
 56 | # Django stuff:
 57 | *.log
 58 | local_settings.py
 59 | db.sqlite3
 60 | db.sqlite3-journal
 61 | 
 62 | # Flask stuff:
 63 | instance/
 64 | .webassets-cache
 65 | 
 66 | # Scrapy stuff:
 67 | .scrapy
 68 | 
 69 | # Sphinx documentation
 70 | docs/_build/
 71 | 
 72 | # PyBuilder
 73 | target/
 74 | 
 75 | # Jupyter Notebook
 76 | .ipynb_checkpoints
 77 | 
 78 | # IPython
 79 | profile_default/
 80 | ipython_config.py
 81 | 
 82 | # pyenv
 83 | .python-version
 84 | 
 85 | # pipenv
 86 | #Pipfile.lock
 87 | 
 88 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
 89 | __pypackages__/
 90 | 
 91 | # Celery stuff
 92 | celerybeat-schedule
 93 | celerybeat.pid
 94 | 
 95 | # SageMath parsed files
 96 | *.sage.py
 97 | 
 98 | # Environments
 99 | .env
100 | .venv
101 | env/
102 | venv/
103 | ENV/
104 | env.bak/
105 | venv.bak/
106 | 
107 | # Spyder project settings
108 | .spyderproject
109 | .spyproject
110 | 
111 | # Rope project settings
112 | .ropeproject
113 | 
114 | # mkdocs documentation
115 | /site
116 | 
117 | # mypy
118 | .mypy_cache/
119 | .dmypy.json
120 | dmypy.json
121 | 
122 | # Pyre type checker
123 | .pyre/
124 | 
125 | # IDE
126 | .vscode/
127 | .idea/
128 | *.swp
129 | *.swo
130 | *~
131 | 
132 | # macOS
133 | .DS_Store
134 | 
135 | # Windows
136 | Thumbs.db
137 | ehthumbs.db
138 | Desktop.ini
139 | 
140 | # Project specific
141 | results/
142 | *.jsonl
143 | *.html
144 | temp/
145 | test_output/
146 | 
147 | # API keys and sensitive info
148 | .env.local
149 | .env.production
150 | api_keys.txt
151 | 
152 | # Logs
153 | *.log
154 | logs/
155 | 
156 | # Claude
157 | .mcp.json
158 | .CLAUDE.md
159 | .claude/
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

```markdown
  1 | # LangExtract MCP Server
  2 | 
  3 | A FastMCP server for Google's [langextract](https://github.com/google/langextract) library. This server enables AI assistants like Claude Code to extract structured information from unstructured text using Large Language Models through an MCP interface.
  4 | 
  5 | <a href="https://glama.ai/mcp/servers/@larsenweigle/langextract-mcp">
  6 |   <img width="380" height="200" src="https://glama.ai/mcp/servers/@larsenweigle/langextract-mcp/badge" alt="LangExtract Server MCP server" />
  7 | </a>
  8 | 
  9 | ## Overview
 10 | 
 11 | LangExtract is a Python library that uses LLMs to extract structured information from text documents while maintaining precise source grounding. This MCP server exposes langextract's capabilities through the Model Context Protocol. The server includes intelligent caching, persistent connections, and server-side credential management to provide optimal performance in long-running environments like Claude Code.
 12 | 
 13 | ## Quick Setup for Claude Code
 14 | 
 15 | ### Prerequisites
 16 | 
 17 | - Claude Code installed and configured
 18 | - Google Gemini API key ([Get one here](https://aistudio.google.com/app/apikey))
 19 | - Python 3.10 or higher
 20 | 
 21 | ### Installation
 22 | 
 23 | Install directly into Claude Code using the built-in MCP management:
 24 | 
 25 | ```bash
 26 | claude mcp add langextract-mcp -e LANGEXTRACT_API_KEY=your-gemini-api-key -- uv run --with fastmcp fastmcp run src/langextract_mcp/server.py
 27 | ```
 28 | 
 29 | The server will automatically start and integrate with Claude Code. No additional configuration is required.
 30 | 
 31 | ### Verification
 32 | 
 33 | After installation, verify the integration by entering the following in Claude Code:
 34 | 
 35 | ```
 36 | /mcp
 37 | ```
 38 | 
 39 | You should see output indicating the server is running, and you can open the server's entry to view its available tools.
 40 | 
 41 | ## Available Tools
 42 | 
 43 | The server provides the following tools for text extraction workflows:
 44 | 
 45 | **Core Extraction**
 46 | - `extract_from_text` - Extract structured information from provided text
 47 | - `extract_from_url` - Extract information from web content
 48 | - `save_extraction_results` - Save results to JSONL format
 49 | - `generate_visualization` - Create interactive HTML visualizations
 50 | 
 51 | For more information, check out the resources available to the client under `src/langextract_mcp/resources`.
 52 | 
 53 | ## Usage Examples
 54 | 
 55 | I am currently adding the ability for MCP clients to pass file paths to unstructured text.
 56 | 
 57 | ### Basic Text Extraction
 58 | 
 59 | Ask Claude Code to extract information using natural language:
 60 | 
 61 | ```
 62 | Extract medication information from this text: "Patient prescribed 500mg amoxicillin twice daily for infection"
 63 | 
 64 | Use these examples to guide the extraction:
 65 | - Text: "Take 250mg ibuprofen every 4 hours"
 66 | - Expected: medication=ibuprofen, dosage=250mg, frequency=every 4 hours
 67 | ```
 68 | 
 69 | ### Advanced Configuration
 70 | 
 71 | For complex extractions, specify configuration parameters:
 72 | 
 73 | ```
 74 | Extract character emotions from Shakespeare using:
 75 | - Model: gemini-2.5-pro for better literary analysis
 76 | - Multiple passes: 3 for comprehensive extraction
 77 | - Temperature: 0.2 for consistent results
 78 | ```
 79 | 
 80 | ### URL Processing
 81 | 
 82 | Extract information directly from web content:
 83 | 
 84 | ```
 85 | Extract key findings from this research paper: https://arxiv.org/abs/example
 86 | Focus on methodology, results, and conclusions
 87 | ```
 88 | 
 89 | ## Supported Models
 90 | 
 91 | This server currently supports **Google Gemini models only**, optimized for reliable structured extraction with advanced schema constraints:
 92 | 
 93 | - `gemini-2.5-flash` - **Recommended default** - Optimal balance of speed, cost, and quality
 94 | - `gemini-2.5-pro` - Best for complex reasoning and analysis tasks requiring highest accuracy
 95 | 
 96 | The server uses persistent connections, schema caching, and connection pooling for optimal performance with Gemini models. Support for additional providers may be added in future versions.
 97 | 
 98 | ## Configuration Reference
 99 | 
100 | ### Environment Variables
101 | 
102 | Set these during installation or in the server environment:
103 | 
104 | ```bash
105 | LANGEXTRACT_API_KEY=your-gemini-api-key  # Required
106 | ```
107 | 
108 | ### Tool Parameters
109 | 
110 | Configure extraction behavior through tool parameters:
111 | 
112 | ```python
113 | {
114 |     "model_id": "gemini-2.5-flash",     # Language model selection
115 |     "max_char_buffer": 1000,            # Text chunk size
116 |     "temperature": 0.5,                 # Sampling temperature (0.0-1.0)  
117 |     "extraction_passes": 1,             # Number of extraction attempts
118 |     "max_workers": 10                   # Parallel processing threads
119 | }
120 | ```
121 | 
122 | ### Output Format
123 | 
124 | All extractions return consistent structured data:
125 | 
126 | ```python
127 | {
128 |     "document_id": "doc_123",
129 |     "total_extractions": 5,
130 |     "extractions": [
131 |         {
132 |             "extraction_class": "medication", 
133 |             "extraction_text": "amoxicillin",
134 |             "attributes": {"type": "antibiotic"},
135 |             "start_char": 25,
136 |             "end_char": 35
137 |         }
138 |     ],
139 |     "metadata": {
140 |         "model_id": "gemini-2.5-flash",
141 |         "extraction_passes": 1,
142 |         "temperature": 0.5
143 |     }
144 | }
145 | ```
146 | 
147 | ## Use Cases
148 | 
149 | LangExtract MCP Server supports a wide range of use cases across multiple domains:
150 | 
151 | - **Healthcare and life sciences**: Extract medications, dosages, and treatment protocols from clinical notes; structure radiology and pathology reports; process research papers and clinical trial data.
152 | - **Legal and compliance**: Extract contract terms, parties, and obligations; analyze regulatory documents, compliance reports, and case law.
153 | - **Research and academia**: Extract methodologies, findings, and citations from papers; analyze survey responses and interview transcripts; process historical and archival materials.
154 | - **Business intelligence**: Extract insights from customer feedback and reviews; analyze news articles and market reports; process financial documents and earnings reports.
155 | 
156 | ## Support and Documentation
157 | 
158 | **Primary Resources:**
159 | - [LangExtract Documentation](https://github.com/google/langextract) - Core library reference
160 | - [FastMCP Documentation](https://gofastmcp.com/) - MCP server framework
161 | - [Model Context Protocol](https://modelcontextprotocol.io/) - Protocol specification
```

--------------------------------------------------------------------------------
/src/langextract_mcp/resources/README.md:
--------------------------------------------------------------------------------

```markdown
  1 | # LangExtract MCP Server - Client Guide
  2 | 
  3 | A Model Context Protocol (MCP) server that provides structured information extraction from unstructured text using Google's LangExtract library and Gemini models.
  4 | 
  5 | ## Overview
  6 | 
  7 | This MCP server enables AI assistants to extract structured information from text documents while maintaining precise source grounding. Each extraction is mapped to its exact location in the source text, enabling visual highlighting and verification.
  8 | 
  9 | ## Available Tools
 10 | 
 11 | ### Core Extraction Tools
 12 | 
 13 | #### `extract_from_text`
 14 | Extract structured information from provided text using Large Language Models.
 15 | 
 16 | **Parameters:**
 17 | - `text` (string): The text to extract information from
 18 | - `prompt_description` (string): Clear instructions for what to extract
 19 | - `examples` (array): List of example extractions to guide the model
 20 | - `config` (object, optional): Configuration parameters
 21 | 
 22 | #### `extract_from_url`
 23 | Extract structured information from web content by downloading and processing the text.
 24 | 
 25 | **Parameters:**
 26 | - `url` (string): URL to download text from (must start with http:// or https://)
 27 | - `prompt_description` (string): Clear instructions for what to extract
 28 | - `examples` (array): List of example extractions to guide the model
 29 | - `config` (object, optional): Configuration parameters
 30 | 
 31 | #### `save_extraction_results`
 32 | Save extraction results to a JSONL file for later use or visualization.
 33 | 
 34 | **Parameters:**
 35 | - `extraction_results` (object): Results from extract_from_text or extract_from_url
 36 | - `output_name` (string): Name for the output file (without .jsonl extension)
 37 | - `output_dir` (string, optional): Directory to save the file (default: current directory)
 38 | 
 39 | #### `generate_visualization`
 40 | Generate interactive HTML visualization from extraction results.
 41 | 
 42 | **Parameters:**
 43 | - `jsonl_file_path` (string): Path to the JSONL file containing extraction results
 44 | - `output_html_path` (string, optional): Optional path for the HTML output
 45 | 
 46 | ## How to Structure Examples
 47 | 
 48 | Examples are critical for guiding the extraction model. Each example should follow this structure:
 49 | 
 50 | ```json
 51 | {
 52 |   "text": "Example input text",
 53 |   "extractions": [
 54 |     {
 55 |       "extraction_class": "category_name",
 56 |       "extraction_text": "exact text from input",
 57 |       "attributes": {
 58 |         "key1": "value1",
 59 |         "key2": "value2"
 60 |       }
 61 |     }
 62 |   ]
 63 | }
 64 | ```
 65 | 
 66 | ### Key Principles for Examples
 67 | 
 68 | 1. **Use exact text**: `extraction_text` should be verbatim from the input text
 69 | 2. **Don't paraphrase**: Extract the actual words, not interpretations
 70 | 3. **Provide meaningful attributes**: Add context through the attributes dictionary
 71 | 4. **Cover all extraction classes**: Include examples for each type you want to extract
 72 | 5. **Show variety**: Demonstrate different patterns and edge cases
 73 | 
 74 | ## Configuration Options
 75 | 
 76 | The `config` parameter accepts these options:
 77 | 
 78 | - `model_id` (string): Gemini model to use (default: "gemini-2.5-flash")
 79 | - `max_char_buffer` (integer): Text chunk size (default: 1000)
 80 | - `temperature` (float): Sampling temperature 0.0-1.0 (default: 0.5)
 81 | - `extraction_passes` (integer): Number of extraction attempts for better recall (default: 1)
 82 | - `max_workers` (integer): Parallel processing threads (default: 10)
 83 | 
 84 | ## Supported Models
 85 | 
 86 | This server only supports Google Gemini models:
 87 | - `gemini-2.5-flash` - **Recommended default** - Optimal balance of speed, cost, and quality
 88 | - `gemini-2.5-pro` - Best for complex reasoning and analysis tasks
 89 | 
 90 | ## Complete Usage Examples
 91 | 
 92 | ### Example 1: Medical Information Extraction
 93 | 
 94 | ```json
 95 | {
 96 |   "tool": "extract_from_text",
 97 |   "parameters": {
 98 |     "text": "Patient prescribed 500mg amoxicillin twice daily for bacterial infection. Take with food to reduce stomach upset.",
 99 |     "prompt_description": "Extract medication information including drug names, dosages, frequencies, and administration instructions. Use exact text for extractions.",
100 |     "examples": [
101 |       {
102 |         "text": "Take 250mg ibuprofen every 4 hours as needed for pain",
103 |         "extractions": [
104 |           {
105 |             "extraction_class": "medication",
106 |             "extraction_text": "ibuprofen",
107 |             "attributes": {
108 |               "type": "pain_reliever",
109 |               "category": "NSAID"
110 |             }
111 |           },
112 |           {
113 |             "extraction_class": "dosage",
114 |             "extraction_text": "250mg",
115 |             "attributes": {
116 |               "amount": "250",
117 |               "unit": "mg"
118 |             }
119 |           },
120 |           {
121 |             "extraction_class": "frequency",
122 |             "extraction_text": "every 4 hours",
123 |             "attributes": {
124 |               "interval": "4 hours",
125 |               "schedule_type": "as_needed"
126 |             }
127 |           }
128 |         ]
129 |       }
130 |     ],
131 |     "config": {
132 |       "model_id": "gemini-2.5-flash",
133 |       "temperature": 0.2
134 |     }
135 |   }
136 | }
137 | ```
138 | 
139 | ### Example 2: Document Analysis from URL
140 | 
141 | ```json
142 | {
143 |   "tool": "extract_from_url",
144 |   "parameters": {
145 |     "url": "https://example.com/research-paper.html",
146 |     "prompt_description": "Extract research findings, methodologies, and key statistics from academic papers. Focus on quantitative results and experimental methods.",
147 |     "examples": [
148 |       {
149 |         "text": "Our study of 500 participants showed a 23% improvement in accuracy using the new method compared to baseline.",
150 |         "extractions": [
151 |           {
152 |             "extraction_class": "finding",
153 |             "extraction_text": "23% improvement in accuracy",
154 |             "attributes": {
155 |               "metric": "accuracy",
156 |               "change": "improvement",
157 |               "magnitude": "23%"
158 |             }
159 |           },
160 |           {
161 |             "extraction_class": "methodology",
162 |             "extraction_text": "study of 500 participants",
163 |             "attributes": {
164 |               "sample_size": "500",
165 |               "study_type": "comparative"
166 |             }
167 |           }
168 |         ]
169 |       }
170 |     ],
171 |     "config": {
172 |       "model_id": "gemini-2.5-pro",
173 |       "extraction_passes": 2,
174 |       "max_char_buffer": 1500
175 |     }
176 |   }
177 | }
178 | ```
179 | 
180 | ### Example 3: Literary Character Analysis
181 | 
182 | ```json
183 | {
184 |   "tool": "extract_from_text",
185 |   "parameters": {
186 |     "text": "ROMEO: But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
187 |     "prompt_description": "Extract characters, emotions, and literary devices from Shakespeare. Capture the emotional context and relationships between characters.",
188 |     "examples": [
189 |       {
190 |         "text": "HAMLET: To be or not to be, that is the question.",
191 |         "extractions": [
192 |           {
193 |             "extraction_class": "character",
194 |             "extraction_text": "HAMLET",
195 |             "attributes": {
196 |               "play": "Hamlet",
197 |               "emotional_state": "contemplative"
198 |             }
199 |           },
200 |           {
201 |             "extraction_class": "philosophical_statement",
202 |             "extraction_text": "To be or not to be, that is the question",
203 |             "attributes": {
204 |               "theme": "existential",
205 |               "type": "soliloquy"
206 |             }
207 |           }
208 |         ]
209 |       }
210 |     ]
211 |   }
212 | }
213 | ```
214 | 
215 | ### Example 4: Business Intelligence from Customer Feedback
216 | 
217 | ```json
218 | {
219 |   "tool": "extract_from_text",
220 |   "parameters": {
221 |     "text": "The new software update is fantastic! Loading times are 50% faster and the interface is much more intuitive. However, the mobile app still crashes occasionally.",
222 |     "prompt_description": "Extract customer sentiments, specific feedback points, and performance metrics from reviews. Identify both positive and negative aspects.",
223 |     "examples": [
224 |       {
225 |         "text": "Love the new design but the checkout process takes too long - about 3 minutes.",
226 |         "extractions": [
227 |           {
228 |             "extraction_class": "positive_feedback",
229 |             "extraction_text": "Love the new design",
230 |             "attributes": {
231 |               "aspect": "design",
232 |               "sentiment": "positive"
233 |             }
234 |           },
235 |           {
236 |             "extraction_class": "negative_feedback",
237 |             "extraction_text": "checkout process takes too long",
238 |             "attributes": {
239 |               "aspect": "checkout",
240 |               "sentiment": "negative"
241 |             }
242 |           },
243 |           {
244 |             "extraction_class": "metric",
245 |             "extraction_text": "about 3 minutes",
246 |             "attributes": {
247 |               "measurement": "time",
248 |               "value": "3",
249 |               "unit": "minutes"
250 |             }
251 |           }
252 |         ]
253 |       }
254 |     ]
255 |   }
256 | }
257 | ```
258 | 
259 | ## Working with Results
260 | 
261 | ### Saving and Visualizing Extractions
262 | 
263 | After running an extraction, you can save the results and create an interactive visualization:
264 | 
265 | ```json
266 | {
267 |   "tool": "save_extraction_results",
268 |   "parameters": {
269 |     "extraction_results": {...}, // Results from previous extraction
270 |     "output_name": "medical_extractions",
271 |     "output_dir": "./results"
272 |   }
273 | }
274 | ```
275 | 
276 | ```json
277 | {
278 |   "tool": "generate_visualization",
279 |   "parameters": {
280 |     "jsonl_file_path": "./results/medical_extractions.jsonl",
281 |     "output_html_path": "./results/medical_visualization.html"
282 |   }
283 | }
284 | ```
285 | 
286 | ### Expected Output Format
287 | 
288 | All extractions return this structured format:
289 | 
290 | ```json
291 | {
292 |   "document_id": "doc_123",
293 |   "total_extractions": 5,
294 |   "extractions": [
295 |     {
296 |       "extraction_class": "medication",
297 |       "extraction_text": "amoxicillin",
298 |       "attributes": {
299 |         "type": "antibiotic"
300 |       },
301 |       "start_char": 25,
302 |       "end_char": 35
303 |     }
304 |   ],
305 |   "metadata": {
306 |     "model_id": "gemini-2.5-flash",
307 |     "extraction_passes": 1,
308 |     "temperature": 0.5
309 |   }
310 | }
311 | ```
312 | 
313 | ## Best Practices
314 | 
315 | ### Creating Effective Examples
316 | 
317 | 1. **Quality over quantity**: 1-3 high-quality examples are better than many poor ones
318 | 2. **Representative patterns**: Cover the main patterns you expect to see
319 | 3. **Exact text matching**: Always use verbatim text from the input
320 | 4. **Rich attributes**: Use attributes to provide context and categorization
321 | 5. **Edge cases**: Include examples of challenging or ambiguous cases
322 | 
323 | ### Optimizing Performance
324 | 
325 | - Use `gemini-2.5-flash` for most tasks (faster, cost-effective)
326 | - Use `gemini-2.5-pro` for complex reasoning or analysis
327 | - Increase `extraction_passes` for higher recall on long documents
328 | - Decrease `max_char_buffer` for better accuracy on dense text
329 | - Lower `temperature` (0.1-0.3) for consistent, factual extractions
330 | - Higher `temperature` (0.7-0.9) for creative or interpretive tasks
331 | 
332 | ### Error Handling
333 | 
334 | Common issues and solutions:
335 | 
336 | - **"At least one example is required"**: Always provide examples array
337 | - **"Only Gemini models are supported"**: Use `gemini-2.5-flash` or `gemini-2.5-pro`
338 | - **"API key required"**: Server administrator must set LANGEXTRACT_API_KEY
339 | - **"Input text cannot be empty"**: Ensure text parameter has content
340 | - **"URL must start with http://"**: Use full URLs for extract_from_url
341 | 
342 | ## Advanced Features
343 | 
344 | ### Multi-pass Extraction
345 | For comprehensive extraction from long documents:
346 | 
347 | ```json
348 | {
349 |   "config": {
350 |     "extraction_passes": 3,
351 |     "max_workers": 20,
352 |     "max_char_buffer": 800
353 |   }
354 | }
355 | ```
356 | 
357 | ### Precision vs. Recall Tuning
358 | - **High precision**: Lower temperature (0.1-0.3), single pass
359 | - **High recall**: Multiple passes (2-3), higher temperature (0.5-0.7)
360 | 
361 | ### Domain-Specific Configurations
362 | - **Medical texts**: Use `gemini-2.5-pro`, low temperature, multiple passes
363 | - **Legal documents**: Smaller chunks (500-800 chars), precise examples
364 | - **Literary analysis**: Higher temperature, rich attribute examples
365 | - **Technical documentation**: Structured examples, consistent terminology
366 | 
367 | This MCP server provides a powerful interface to Google's LangExtract library, enabling precise structured information extraction with source grounding and interactive visualization capabilities.
```

--------------------------------------------------------------------------------
/src/langextract_mcp/resources/__init__.py:
--------------------------------------------------------------------------------

```python
1 | 
```

--------------------------------------------------------------------------------
/src/langextract_mcp/__init__.py:
--------------------------------------------------------------------------------

```python
1 | """LangExtract MCP Server - FastMCP server for Google's langextract library."""
2 | 
3 | from .server import mcp, main
4 | 
5 | __version__ = "0.1.0"
6 | __all__ = ["mcp", "main"]
```

--------------------------------------------------------------------------------
/SETUP.md:
--------------------------------------------------------------------------------

```markdown
 1 | # LangExtract MCP Server Setup Guide
 2 | 
 3 | ## Quick Setup (No Config Files Needed!)
 4 | 
 5 | This MCP server doesn't use separate configuration files. Everything is handled through environment variables and tool parameters.
 6 | 
 7 | ### Step 1: Get Your API Key
 8 | 1. Go to [Google AI Studio](https://aistudio.google.com/app/apikey)
 9 | 2. Create a new API key
10 | 3. Copy the key (keep it secure!)
11 | 
12 | ### Step 2: Install with Claude Code
13 | ```bash
14 | # Single command installation - no config files needed!
15 | claude mcp add langextract-mcp -e LANGEXTRACT_API_KEY=your-gemini-api-key -- uv run --with fastmcp fastmcp run src/langextract_mcp/server.py
16 | ```
17 | 
18 | That's it! The server will start automatically when Claude Code needs it.
19 | 
20 | ## Configuration Details
21 | 
22 | ### Environment Variables (Set Once)
23 | ```bash
24 | # Required
25 | LANGEXTRACT_API_KEY=your-gemini-api-key
26 | 
27 | # Optional
28 | LANGEXTRACT_DEFAULT_MODEL=gemini-2.5-flash
29 | LANGEXTRACT_MAX_WORKERS=10
30 | ```
31 | 
32 | ### Per-Request Configuration (In Tool Calls)
33 | When using tools, you can configure behavior per request:
34 | 
35 | ```python
36 | {
37 |     "text": "Your text to extract from",
38 |     "config": {
39 |         "model_id": "gemini-2.5-flash",     # Which model to use
40 |         "temperature": 0.5,                 # Randomness (0.0-1.0)
41 |         "extraction_passes": 1,             # How many extraction attempts
42 |         "max_workers": 10                   # Parallel processing
43 |     }
44 | }
45 | ```
46 | 
47 | ## Verification
48 | After installation, ask Claude Code:
49 | ```
50 | Use the get_server_info tool to show the LangExtract server status
51 | ```
52 | 
53 | You should see:
54 | - Server running: ✅
55 | - API key configured: ✅
56 | - Optimization features enabled: ✅
57 | 
58 | ## Troubleshooting
59 | 
60 | **"Server not found"**
61 | ```bash
62 | # Check if registered
63 | claude mcp list
64 | 
65 | # Re-add if missing
66 | claude mcp add langextract-mcp -e LANGEXTRACT_API_KEY=your-key -- uv run --with fastmcp fastmcp run src/langextract_mcp/server.py
67 | ```
68 | 
69 | **"API key not set"**
70 | ```bash
71 | # Check environment
72 | echo $LANGEXTRACT_API_KEY
73 | 
74 | # Set if missing (permanent)
75 | echo 'export LANGEXTRACT_API_KEY=your-key' >> ~/.bashrc
76 | source ~/.bashrc
77 | ```
78 | 
79 | **"Tools not working"**
80 | - Verify API key is valid at [Google AI Studio](https://aistudio.google.com/app/apikey)
81 | - Check network connectivity
82 | - Try a different model (e.g., "gemini-2.5-pro")
83 | 
```

--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------

```toml
 1 | [build-system]
 2 | requires = ["hatchling"]
 3 | build-backend = "hatchling.build"
 4 | 
 5 | [project]
 6 | name = "langextract-mcp"
 7 | version = "0.1.0"
 8 | description = "FastMCP server for Google's langextract library - extract structured information from unstructured text using LLMs"
 9 | readme = "README.md"
10 | license = { text = "Apache-2.0" }
11 | authors = [
12 |     { name = "Larsen Weigle" }
13 | ]
14 | classifiers = [
15 |     "Development Status :: 3 - Alpha",
16 |     "Intended Audience :: Developers",
17 |     "License :: OSI Approved :: Apache Software License",
18 |     "Programming Language :: Python :: 3",
19 |     "Programming Language :: Python :: 3.10",
20 |     "Programming Language :: Python :: 3.11",
21 |     "Programming Language :: Python :: 3.12",
22 |     "Topic :: Software Development :: Libraries :: Python Modules",
23 |     "Topic :: Scientific/Engineering :: Artificial Intelligence",
24 |     "Topic :: Text Processing :: Linguistic",
25 | ]
26 | keywords = ["mcp", "fastmcp", "langextract", "llm", "text-extraction", "nlp", "ai"]
27 | 
28 | requires-python = ">=3.10"
29 | dependencies = [
30 |     "fastmcp>=0.1.0",
31 |     "langextract>=0.1.0",
32 |     "pydantic>=2.0.0",
33 |     "python-dotenv>=1.0.0",
34 |     "httpx>=0.25.0",
35 | ]
36 | 
37 | [project.optional-dependencies]
38 | dev = [
39 |     "pytest>=7.0.0",
40 |     "pytest-asyncio>=0.21.0",
41 |     "black>=23.0.0",
42 |     "isort>=5.12.0",
43 |     "mypy>=1.5.0",
44 |     "pre-commit>=3.0.0",
45 | ]
46 | 
47 | [project.urls]
48 | Homepage = "https://github.com/your-org/langextract-mcp"
49 | Repository = "https://github.com/your-org/langextract-mcp"
50 | Documentation = "https://github.com/your-org/langextract-mcp/blob/main/README.md"
51 | Issues = "https://github.com/your-org/langextract-mcp/issues"
52 | 
53 | [project.scripts]
54 | langextract-mcp = "langextract_mcp.server:main"
55 | 
56 | [tool.hatch.build.targets.wheel]
57 | packages = ["src/langextract_mcp"]
58 | 
59 | [tool.hatch.build.targets.sdist]
60 | include = [
61 |     "/src",
62 |     "/docs",
63 |     "/examples",
64 |     "/README.md",
65 |     "/LICENSE",
66 | ]
67 | 
68 | [tool.black]
69 | line-length = 88
70 | target-version = ['py310']
71 | include = '\.pyi?$'
72 | 
73 | [tool.isort]
74 | profile = "black"
75 | line_length = 88
76 | 
77 | [tool.mypy]
78 | python_version = "3.10"
79 | warn_return_any = true
80 | warn_unused_configs = true
81 | disallow_untyped_defs = true
82 | disallow_incomplete_defs = true
83 | 
84 | [tool.pytest.ini_options]
85 | testpaths = ["tests"]
86 | python_files = ["test_*.py"]
87 | python_classes = ["Test*"]
88 | python_functions = ["test_*"]
89 | addopts = "-v --tb=short"
90 | asyncio_mode = "auto"
```

--------------------------------------------------------------------------------
/src/langextract_mcp/resources/supported-models.md:
--------------------------------------------------------------------------------

```markdown
 1 | # Supported Language Models
 2 | 
 3 | This document provides comprehensive information about the language models supported by the langextract-mcp server.
 4 | 
 5 | ## Currently Supported Models
 6 | 
 7 | The langextract-mcp server currently supports **Google Gemini models only**, which are optimized for reliable structured extraction with schema constraints.
 8 | 
 9 | ### Gemini 2.5 Flash
10 | - **Provider**: Google
11 | - **Model ID**: `gemini-2.5-flash`
12 | - **Description**: Fast, cost-effective model with excellent quality
13 | - **Schema Constraints**: ✅ Supported
14 | - **Recommended For**:
15 |   - General extraction tasks
16 |   - Fast processing requirements
17 |   - Cost-sensitive applications
18 | - **Notes**: Recommended default choice - optimal balance of speed, cost, and quality
19 | 
20 | ### Gemini 2.5 Pro
21 | - **Provider**: Google
22 | - **Model ID**: `gemini-2.5-pro`
23 | - **Description**: Advanced model for complex reasoning tasks
24 | - **Schema Constraints**: ✅ Supported
25 | - **Recommended For**:
26 |   - Complex extractions
27 |   - High accuracy requirements
28 |   - Sophisticated reasoning tasks
29 | - **Notes**: Best quality for complex tasks but higher cost
30 | 
31 | ## Model Recommendations
32 | 
33 | | Use Case | Recommended Model | Reason |
34 | |----------|------------------|---------|
35 | | **Default/General** | `gemini-2.5-flash` | Best balance of speed, cost, and quality |
36 | | **High Quality** | `gemini-2.5-pro` | Superior accuracy and reasoning capabilities |
37 | | **Cost Optimized** | `gemini-2.5-flash` | Most cost-effective option |
38 | | **Complex Reasoning** | `gemini-2.5-pro` | Advanced reasoning for complex extraction tasks |
39 | 
40 | ## Configuration Parameters
41 | 
42 | When using any supported model, you can configure the following parameters:
43 | 
44 | - **`model_id`**: The model identifier (e.g., "gemini-2.5-flash")
45 | - **`max_char_buffer`**: Maximum characters per chunk (default: 1000)
46 | - **`temperature`**: Sampling temperature 0.0-1.0 (default: 0.5)
47 | - **`extraction_passes`**: Number of extraction passes for better recall (default: 1)
48 | - **`max_workers`**: Maximum parallel workers (default: 10)
49 | 
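As an illustration, a config object tuned for a long, dense document might combine these parameters as follows (values are illustrative, not prescriptive):

```python
config = {
    "model_id": "gemini-2.5-pro",   # higher accuracy for complex text
    "max_char_buffer": 800,         # smaller chunks for dense content
    "temperature": 0.2,             # low temperature for consistency
    "extraction_passes": 2,         # extra pass improves recall
    "max_workers": 10,              # parallel chunk processing
}
```
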
50 | ## Limitations
51 | 
52 | - **Provider Support**: Currently supports Google Gemini models only
53 | - **Future Support**: OpenAI and local model support may be added in future versions
54 | - **API Dependencies**: Requires active internet connection and valid API keys
55 | 
56 | ## Schema Constraints
57 | 
58 | All supported Gemini models include schema constraint capabilities, which means:
59 | 
60 | - **Structured Output**: Guaranteed JSON structure based on your examples
61 | - **Type Safety**: Consistent field types across extractions
62 | - **Validation**: Automatic validation of extracted data against schema
63 | - **Reliability**: Reduced hallucination and improved consistency
64 | 
65 | This makes the langextract-mcp server particularly reliable for production applications requiring consistent structured data extraction.
66 | 
```

--------------------------------------------------------------------------------
/docs/fastmcp.md:
--------------------------------------------------------------------------------

```markdown
  1 | # FastMCP Framework Study Notes - Deep Analysis
  2 | 
  3 | ## Overview
  4 | FastMCP is a Python framework for building Model Context Protocol (MCP) servers and clients, designed to enable sophisticated interactions between AI systems and various services. It provides a "fast, Pythonic way" to build MCP servers with comprehensive functionality and enterprise-grade features.
  5 | 
  6 | ## Core Architecture
  7 | 
  8 | ### 1. Servers
  9 | - **Primary Function**: Expose tools as executable capabilities
 10 | - **Authentication**: Support multiple authentication mechanisms
 11 | - **Middleware**: Enable cross-cutting functionality for request/response processing
 12 | - **Resource Management**: Allow resource and prompt management
 13 | - **Monitoring**: Support progress reporting and logging
 14 | 
 15 | ### 2. Clients
 16 | - **Purpose**: Provide programmatic interaction with MCP servers
 17 | - **Authentication**: Support multiple methods (Bearer Token, OAuth)
 18 | - **Processing**: Handle message processing, logging, and progress monitoring
 19 | 
 20 | ## Key Features
 21 | 
 22 | ### Tool Operations
 23 | - Define tools as executable functions
 24 | - Structured user input handling
 25 | - Comprehensive tool management
 26 | 
 27 | ### Resource Management
 28 | - Create and manage resources
 29 | - Prompt templating capabilities
 30 | - Resource organization and access
 31 | 
 32 | ### Authentication & Security
 33 | - Flexible authentication strategies
 34 | - Bearer Token support
 35 | - OAuth integration
 36 | - Authorization provider compatibility
 37 | 
 38 | ### Middleware System
 39 | - Request/response processing
 40 | - Cross-cutting concerns handling
 41 | - Extensible middleware chain
 42 | 
 43 | ### Monitoring & Logging
 44 | - Progress tracking
 45 | - Comprehensive logging
 46 | - User interaction context
 47 | 
 48 | ## Integration Capabilities
 49 | 
 50 | ### Supported Platforms
 51 | - OpenAI API
 52 | - Anthropic
 53 | - Google Gemini
 54 | - FastAPI
 55 | - Starlette/ASGI
 56 | 
 57 | ### Authorization Providers
 58 | - Various authorization providers supported
 59 | - Flexible configuration options
 60 | 
 61 | ## Server Development Guidelines
 62 | 
 63 | ### 1. Tool Definition
 64 | - Define tools as executable functions
 65 | - Implement clear input/output schemas
 66 | - Handle errors gracefully
 67 | 
 68 | ### 2. Authentication Setup
 69 | - Choose appropriate authentication strategy
 70 | - Configure security mechanisms
 71 | - Implement user context handling
 72 | 
 73 | ### 3. Context Configuration
 74 | - Set up logging context
 75 | - Configure user interactions
 76 | - Implement progress tracking
 77 | 
 78 | ### 4. Middleware Implementation
 79 | - Use middleware for common functionality
 80 | - Process requests and responses
 81 | - Handle cross-cutting concerns
 82 | 
 83 | ### 5. Resource Creation
 84 | - Define resources and prompt templates
 85 | - Organize resource access patterns
 86 | - Implement resource management
 87 | 
 88 | ## Unique Selling Points
 89 | 
 90 | 1. **Pythonic Interface**: Natural Python API design
 91 | 2. **Flexible Composition**: Modular server composition
 92 | 3. **Structured Input**: Sophisticated user input handling
 93 | 4. **Comprehensive SDK**: Extensive documentation and tooling
 94 | 5. **Standardized Protocol**: Uses MCP for consistent interactions
 95 | 
 96 | ## FastMCP Implementation Patterns
 97 | 
 98 | ### 1. Server Instantiation
 99 | ```python
100 | from fastmcp import FastMCP
101 | 
102 | # Basic server
103 | mcp = FastMCP("Demo 🚀")
104 | 
105 | # Server with configuration
106 | mcp = FastMCP(
107 |     name="LangExtractServer",
108 |     instructions="Extract structured information from text using LLMs",
109 |     include_tags={"public"},
110 |     exclude_tags={"internal"}
111 | )
112 | ```
113 | 
114 | ### 2. Tool Definition Patterns
115 | ```python
116 | @mcp.tool
117 | def add(a: int, b: int) -> int:
118 |     """Add two numbers"""
119 |     return a + b
120 | 
121 | # Complex parameters with validation
122 | @mcp.tool
123 | def process_data(
124 |     query: str,
125 |     max_results: int = 10,
126 |     sort_by: str = "relevance",
127 |     category: str | None = None
128 | ) -> dict:
129 |     """Process data with parameters"""
130 |     return {"results": []}
131 | ```
132 | 
133 | ### 3. Error Handling
134 | ```python
135 | from fastmcp.exceptions import ToolError
136 | 
137 | @mcp.tool
138 | def divide(a: float, b: float) -> float:
139 |     if b == 0:
140 |         raise ToolError("Cannot divide by zero")
141 |     return a / b
142 | ```
143 | 
144 | ### 4. Authentication Patterns
145 | ```python
146 | from fastmcp.server.auth.providers.jwt import JWTVerifier
147 | 
148 | auth = JWTVerifier(
149 |     jwks_uri="https://your-auth-system.com/.well-known/jwks.json",
150 |     issuer="https://your-auth-system.com",
151 |     audience="your-mcp-server"
152 | )
153 | 
154 | mcp = FastMCP(name="Protected Server", auth=auth)
155 | ```
156 | 
157 | ### 5. Server Execution
158 | ```python
159 | # STDIO transport (default for MCP clients)
160 | mcp.run()
161 | 
162 | # HTTP transport
163 | mcp.run(transport="http", host="0.0.0.0", port=9000)
164 | ```
165 | 
166 | ### 6. Server Composition
167 | ```python
168 | main = FastMCP(name="MainServer")
169 | sub = FastMCP(name="SubServer")
170 | main.mount(sub, prefix="sub")
171 | ```
172 | 
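### 7. Resources and Context
A hedged sketch combining the resource and context features noted above. The URI, names, and task logic are illustrative; `@mcp.resource`, `Context`, `ctx.info`, and `ctx.report_progress` follow FastMCP's documented API:

```python
from fastmcp import FastMCP, Context

mcp = FastMCP(name="Demo")

# Expose reusable reference material as a resource
@mcp.resource("resource://guides/extraction-examples")
def extraction_guide() -> str:
    """Return a guide clients can read before calling tools."""
    return "Structure examples with exact text and rich attributes."

# Tool that logs and reports progress via the request context
@mcp.tool
async def long_task(items: list[str], ctx: Context) -> dict:
    await ctx.info(f"Processing {len(items)} items")
    for i, _ in enumerate(items):
        await ctx.report_progress(progress=i, total=len(items))
    return {"processed": len(items)}
```
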
173 | ## Key Insights for LangExtract MCP Server
174 | 
175 | 1. **Simple Decorator Pattern**: Use `@mcp.tool` for all langextract functions
176 | 2. **Type Safety**: Leverage Python type hints for automatic validation
177 | 3. **Proper Error Handling**: Use `ToolError` for controlled error messaging
178 | 4. **Clean Architecture**: Keep tools simple and focused
179 | 5. **Context Management**: Use FastMCP's built-in context for logging/progress
180 | 6. **Transport Flexibility**: Support both STDIO and HTTP transports
181 | 7. **Authentication Ready**: Design with auth in mind for production use
182 | 
183 | ## Implementation Strategy for langextract
184 | 
185 | Based on deeper FastMCP understanding:
186 | 
187 | 1. **Clean Tool Interface**: Each langextract function as a simple `@mcp.tool`
188 | 2. **Type-Safe Parameters**: Use Pydantic models for complex inputs
189 | 3. **Structured Outputs**: Return proper dictionaries/models
190 | 4. **Error Management**: Comprehensive error handling with `ToolError`
191 | 5. **Context Integration**: Use FastMCP context for progress/logging
192 | 6. **Resource Management**: Expose example templates as MCP resources
193 | 7. **Production Ready**: Authentication and deployment configuration
```

--------------------------------------------------------------------------------
/docs/langextract.md:
--------------------------------------------------------------------------------

```markdown
  1 | # LangExtract Library Study Notes
  2 | 
  3 | ## Overview
  4 | LangExtract is a Python library developed by Google that uses Large Language Models (LLMs) to extract structured information from unstructured text documents based on user-defined instructions. It's designed to process materials like clinical notes, reports, and other documents while maintaining precise source grounding.
  5 | 
  6 | ## Key Features & Differentiators
  7 | 
  8 | ### 1. Precise Source Grounding
  9 | - **Capability**: Maps every extraction to its exact location in the source text
 10 | - **Benefit**: Enables visual highlighting for easy traceability and verification
 11 | - **Implementation**: Through annotation system that tracks character positions
 12 | 
 13 | ### 2. Reliable Structured Outputs
 14 | - **Schema Enforcement**: Consistent output schema based on few-shot examples
 15 | - **Controlled Generation**: Leverages structured output capabilities in supported models (Gemini)
 16 | - **Format Support**: JSON and YAML output formats
 17 | 
 18 | ### 3. Long Document Optimization
 19 | - **Challenge Addressed**: "Needle-in-a-haystack" problem in large documents
 20 | - **Strategy**: Text chunking + parallel processing + multiple extraction passes
 21 | - **Benefit**: Higher recall on complex documents
 22 | 
 23 | ### 4. Interactive Visualization
 24 | - **Output**: Self-contained HTML files for reviewing extractions
 25 | - **Scalability**: Handles thousands of extracted entities
 26 | - **Context**: Shows entities in their original document context
 27 | 
 28 | ### 5. Flexible LLM Support
 29 | - **Cloud Models**: Google Gemini family, OpenAI models
 30 | - **Local Models**: Built-in Ollama interface
 31 | - **Extensibility**: Can be extended to other APIs
 32 | 
 33 | ### 6. Domain Adaptability
 34 | - **No Fine-tuning**: Uses few-shot examples instead of model training
 35 | - **Flexibility**: Works across any domain with proper examples
 36 | - **Customization**: Leverages LLM world knowledge through prompt engineering
 37 | 
 38 | ## Core Architecture
 39 | 
 40 | ### Main Components
 41 | 
 42 | #### 1. Data Models (`data.py`)
 43 | - **ExampleData**: Defines extraction examples with text and expected extractions
 44 | - **Extraction**: Individual extracted entity with class, text, and attributes
 45 | - **Document**: Input document container
 46 | - **AnnotatedDocument**: Result container with extractions and metadata
 47 | 
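A minimal sketch of how these models compose, mirroring how server.py in this repo constructs them (field values are illustrative):

```python
import langextract as lx

example = lx.data.ExampleData(
    text="Take 250mg ibuprofen every 4 hours",
    extractions=[
        lx.data.Extraction(
            extraction_class="medication",
            extraction_text="ibuprofen",  # verbatim span from `text`
            attributes={"type": "pain_reliever"},
        )
    ],
)
```
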
 48 | #### 2. Inference Engine (`inference.py`)
 49 | - **GeminiLanguageModel**: Google Gemini API integration
 50 | - **OpenAILanguageModel**: OpenAI API integration
 51 | - **BaseLanguageModel**: Abstract base for language model implementations
 52 | - **Schema Support**: Structured output generation for supported models
 53 | 
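A sketch of instantiating the Gemini backend, following the call used in this repo's server.py (the API key is a placeholder):

```python
language_model = lx.inference.GeminiLanguageModel(
    model_id="gemini-2.5-flash",
    api_key="your-gemini-api-key",  # placeholder
    temperature=0.5,
    max_workers=10,
)
```
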
 54 | #### 3. Annotation System (`annotation.py`)
 55 | - **Annotator**: Core extraction orchestrator
 56 | - **Text Processing**: Handles chunking and parallel processing
 57 | - **Progress Tracking**: Monitors extraction progress
 58 | 
 59 | #### 4. Resolver System (`resolver.py`)
 60 | - **Purpose**: Parses raw LLM output into structured Extraction objects
 61 | - **Fence Handling**: Extracts content from markdown code blocks
 62 | - **Format Parsing**: Handles JSON/YAML parsing and validation
 63 | 
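server.py in this repo constructs its resolver like this (parsing JSON output without markdown code fences):

```python
resolver = lx.resolver.Resolver(
    fence_output=False,                         # raw JSON, no markdown fences
    format_type=lx.data.FormatType.JSON,
    extraction_attributes_suffix="_attributes",
    extraction_index_suffix=None,
)
```
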
 64 | #### 5. Chunking Engine (`chunking.py`)
 65 | - **Text Segmentation**: Breaks long documents into processable chunks
 66 | - **Buffer Management**: Handles max_char_buffer limits
 67 | - **Overlap Strategy**: Maintains context across chunk boundaries
 68 | 
 69 | #### 6. Visualization (`visualization.py`)
 70 | - **HTML Generation**: Creates interactive visualization files
 71 | - **Entity Highlighting**: Shows extractions in original context
 72 | - **Scalable Interface**: Handles large result sets efficiently
 73 | 
 74 | #### 7. I/O Operations (`io.py`)
 75 | - **URL Download**: Fetches text from web URLs
 76 | - **File Operations**: Saves results to JSONL format
 77 | - **Document Loading**: Handles various input formats
 78 | 
 79 | ### Key API Functions
 80 | 
 81 | #### Primary Interface
 82 | ```python
 83 | lx.extract(
 84 |     text_or_documents,      # Input text, URL, or Document objects
 85 |     prompt_description,     # Extraction instructions
 86 |     examples,              # Few-shot examples
 87 |     model_id="gemini-2.5-flash",
 88 |     # Configuration options...
 89 | )
 90 | ```
 91 | 
 92 | #### Visualization
 93 | ```python
 94 | lx.visualize(jsonl_file_path)  # Generate HTML visualization
 95 | ```
 96 | 
 97 | #### I/O Operations
 98 | ```python
 99 | lx.io.save_annotated_documents(results, output_name, output_dir)
100 | ```
101 | 
102 | ## Configuration Parameters
103 | 
104 | ### Core Parameters
105 | - **model_id**: LLM model selection
106 | - **api_key**: Authentication for cloud models
107 | - **temperature**: Sampling temperature (0.5 recommended)
108 | - **max_char_buffer**: Chunk size limit (1000 default)
109 | 
110 | ### Performance Parameters
111 | - **max_workers**: Parallel processing workers (10 default)
112 | - **batch_length**: Chunks per batch (10 default)
113 | - **extraction_passes**: Multiple extraction attempts (1 default)
114 | 
115 | ### Output Control
116 | - **format_type**: JSON or YAML output
117 | - **fence_output**: Code fence expectations
118 | - **use_schema_constraints**: Structured output enforcement
119 | 
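Putting these together, a typical call combining core, performance, and output parameters might look like the following sketch (`input_text` and `examples` are assumed to be defined as above):

```python
result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract medications with dosage and frequency",
    examples=examples,               # list of lx.data.ExampleData
    model_id="gemini-2.5-flash",
    temperature=0.5,
    max_char_buffer=1000,
    max_workers=10,
    extraction_passes=2,             # second pass for better recall
    use_schema_constraints=True,
)
```
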
120 | ## Supported Models
121 | 
122 | ### Google Gemini
123 | - **gemini-2.5-flash**: Recommended default (speed/cost/quality balance)
124 | - **gemini-2.5-pro**: For complex reasoning tasks
125 | - **Schema Support**: Full structured output support
126 | - **Rate Limits**: Tier 2 quota recommended for production
127 | 
128 | ### OpenAI
129 | - **gpt-4o**: Supported with limitations
130 | - **Requirements**: `fence_output=True`, `use_schema_constraints=False`
131 | - **Note**: Schema constraints not yet implemented for OpenAI
132 | 
133 | ### Local Models
134 | - **Ollama**: Built-in support
135 | - **Extension**: Can be extended to other local APIs
136 | 
137 | ## Use Cases & Examples
138 | 
139 | ### 1. Literary Analysis
140 | - **Characters**: Extract character names and emotional states
141 | - **Relationships**: Identify character interactions and metaphors
142 | - **Context**: Track narrative elements across long texts
143 | 
144 | ### 2. Medical Document Processing
145 | - **Medications**: Extract drug names, dosages, routes, frequencies
146 | - **Clinical Notes**: Structure unstructured medical reports
147 | - **Compliance**: Maintain source grounding for medical accuracy
148 | 
149 | ### 3. Radiology Reports
150 | - **Structured Data**: Convert free-text reports to structured findings
151 | - **Demo Available**: RadExtract on HuggingFace Spaces
152 | 
153 | ### 4. Long Document Processing
154 | - **Full Novels**: Process complete books (e.g., Romeo & Juliet - 147k chars)
155 | - **Performance**: Parallel processing with multiple passes
156 | - **Visualization**: Handle hundreds of entities in context
157 | 
158 | ## Technical Implementation Details
159 | 
160 | ### Text Processing Pipeline
161 | 1. **Input Validation**: Validate text/documents and examples
162 | 2. **URL Handling**: Download content if URL provided
163 | 3. **Chunking**: Break long texts into manageable pieces
164 | 4. **Parallel Processing**: Distribute chunks across workers
165 | 5. **Multiple Passes**: Optional additional extraction rounds
166 | 6. **Resolution**: Parse LLM outputs into structured data
167 | 7. **Annotation**: Create AnnotatedDocument with source grounding
168 | 8. **Visualization**: Generate interactive HTML output
169 | 
170 | ### Error Handling
171 | - **API Failures**: Graceful handling of LLM API errors
172 | - **Parsing Errors**: Robust JSON/YAML parsing with fallbacks
173 | - **Validation**: Schema validation for structured outputs
174 | 
175 | ### Performance Optimization
176 | - **Concurrent Processing**: Parallel chunk processing
177 | - **Efficient Chunking**: Smart text segmentation
178 | - **Progressive Enhancement**: Multiple passes for better recall
179 | - **Memory Management**: Efficient handling of large documents
180 | 
181 | ## MCP Server Design Implications
182 | 
183 | Based on langextract's architecture, a FastMCP server should expose:
184 | 
185 | ### Core Tools
186 | 1. **extract_text**: Main extraction function
187 | 2. **extract_from_url**: URL-based extraction
188 | 3. **visualize_results**: Generate HTML visualization
189 | 4. **validate_examples**: Validate extraction examples
190 | 
191 | ### Configuration Management
192 | 1. **set_model**: Configure LLM model
193 | 2. **set_api_key**: Set authentication
194 | 3. **configure_extraction**: Set extraction parameters
195 | 
196 | ### File Operations
197 | 1. **save_results**: Save to JSONL format
198 | 2. **load_results**: Load previous results
199 | 3. **export_visualization**: Generate and save HTML
200 | 
201 | ### Advanced Features
202 | 1. **batch_extract**: Process multiple documents
203 | 2. **progressive_extract**: Multi-pass extraction
204 | 3. **compare_results**: Compare extraction results
205 | 
206 | ### Resource Management
207 | - **Model Configurations**: Manage different model setups
208 | - **Example Templates**: Store reusable extraction examples
209 | - **Result Archives**: Access previous extraction results
210 | 
211 | ## Dependencies & Installation
212 | - **Core**: Python 3.10+, requests, dotenv
213 | - **LLM APIs**: google-generativeai, openai
214 | - **Processing**: concurrent.futures for parallelization
215 | - **Visualization**: HTML/CSS/JS generation
216 | - **Format Support**: JSON, YAML parsing
217 | 
218 | ## Licensing & Usage
219 | - **License**: Apache 2.0
220 | - **Disclaimer**: Not officially supported Google product
221 | - **Health Applications**: Subject to Health AI Developer Foundations Terms
222 | - **Citation**: Recommended for production/publication use
```

--------------------------------------------------------------------------------
/src/langextract_mcp/server.py:
--------------------------------------------------------------------------------

```python
  1 | """FastMCP server for langextract - optimized for Claude Code integration."""
  2 | 
  3 | import os
  4 | from typing import Any
  5 | from pathlib import Path
  6 | import hashlib
  7 | import json
  8 | 
  9 | import langextract as lx
 10 | from fastmcp import FastMCP
 11 | from fastmcp.resources import FileResource
 12 | from fastmcp.exceptions import ToolError
 13 | from pydantic import BaseModel, Field
 14 | 
 15 | 
 16 | # Simple dictionary types for easier LLM usage
 17 | # ExtractionItem: {"extraction_class": str, "extraction_text": str, "attributes": dict}
 18 | # ExtractionExample: {"text": str, "extractions": list[ExtractionItem]}
 19 | 
 20 | 
 21 | class ExtractionConfig(BaseModel):
 22 |     """Configuration for extraction parameters."""
 23 |     model_id: str = Field(default="gemini-2.5-flash", description="LLM model to use")
 24 |     max_char_buffer: int = Field(default=1000, description="Max characters per chunk")
 25 |     temperature: float = Field(default=0.5, description="Sampling temperature (0.0-1.0)")
 26 |     extraction_passes: int = Field(default=1, description="Number of extraction passes for better recall")
 27 |     max_workers: int = Field(default=10, description="Max parallel workers")
 28 | 
 29 | 
 30 | # Initialize FastMCP server with Claude Code compatibility
 31 | mcp = FastMCP(
 32 |     name="langextract-mcp",
 33 |     instructions="Extract structured information from unstructured text using Google Gemini models. "
 34 |                 "Provides precise source grounding, interactive visualizations, and optimized caching for performance."
 35 | )
 36 | 
 37 | 
 38 | class LangExtractClient:
 39 |     """Optimized langextract client for MCP server usage.
 40 |     
 41 |     This client maintains persistent connections and caches expensive operations
 42 |     like schema generation and prompt templates for better performance in a
 43 |     long-running MCP server context.
 44 |     """
 45 |     
 46 |     def __init__(self):
 47 |         self._language_models: dict[str, Any] = {}
 48 |         self._schema_cache: dict[str, Any] = {}
 49 |         self._prompt_template_cache: dict[str, Any] = {}
 50 |         self._resolver_cache: dict[str, Any] = {}
 51 |         
 52 |     def _get_examples_hash(self, examples: list[dict[str, Any]]) -> str:
 53 |         """Generate a hash for caching based on examples."""
 54 |         examples_str = json.dumps(examples, sort_keys=True)
 55 |         return hashlib.md5(examples_str.encode()).hexdigest()
 56 |     
 57 |     def _get_language_model(self, config: ExtractionConfig, api_key: str, schema: Any | None = None, schema_hash: str | None = None) -> Any:
 58 |         """Get or create a cached language model instance."""
 59 |         # Include schema hash in cache key to prevent schema mutation conflicts
 60 |         model_key = f"{config.model_id}_{config.temperature}_{config.max_workers}_{schema_hash or 'no_schema'}"
 61 |         
 62 |         if model_key not in self._language_models:
 63 |             # Validate that only Gemini models are supported
 64 |             if not config.model_id.startswith('gemini'):
 65 |                 raise ValueError(f"Only Gemini models are supported. Got: {config.model_id}")
 66 |                 
 67 |             language_model = lx.inference.GeminiLanguageModel(
 68 |                 model_id=config.model_id,
 69 |                 api_key=api_key,
 70 |                 temperature=config.temperature,
 71 |                 max_workers=config.max_workers,
 72 |                 gemini_schema=schema
 73 |             )
 74 |             self._language_models[model_key] = language_model
 75 |             
 76 |         return self._language_models[model_key]
 77 |     
 78 |     def _get_schema(self, examples: list[dict[str, Any]], model_id: str) -> tuple[Any, str]:
 79 |         """Get or create a cached schema for the examples.
 80 |         
 81 |         Returns:
 82 |             Tuple of (schema, examples_hash) for use in caching language models
 83 |         """
 84 |         if not model_id.startswith('gemini'):
 85 |             return None, ""
 86 |             
 87 |         examples_hash = self._get_examples_hash(examples)
 88 |         schema_key = f"{model_id}_{examples_hash}"
 89 |         
 90 |         if schema_key not in self._schema_cache:
 91 |             # Convert examples to langextract format
 92 |             langextract_examples = self._create_langextract_examples(examples)
 93 |             
 94 |             # Create prompt template to generate schema
 95 |             prompt_template = lx.prompting.PromptTemplateStructured(description="Schema generation")
 96 |             prompt_template.examples.extend(langextract_examples)
 97 |             
 98 |             # Generate schema
 99 |             schema = lx.schema.GeminiSchema.from_examples(prompt_template.examples)
100 |             self._schema_cache[schema_key] = schema
101 |             
102 |         return self._schema_cache[schema_key], examples_hash
103 |     
104 |     def _get_resolver(self, format_type: str = "JSON") -> Any:
105 |         """Get or create a cached resolver."""
106 |         if format_type not in self._resolver_cache:
107 |             resolver = lx.resolver.Resolver(
108 |                 fence_output=False,
109 |                 format_type=lx.data.FormatType.JSON if format_type == "JSON" else lx.data.FormatType.YAML,
110 |                 extraction_attributes_suffix="_attributes",
111 |                 extraction_index_suffix=None,
112 |             )
113 |             self._resolver_cache[format_type] = resolver
114 |             
115 |         return self._resolver_cache[format_type]
116 |     
117 |     def _create_langextract_examples(self, examples: list[dict[str, Any]]) -> list[lx.data.ExampleData]:
118 |         """Convert dictionary examples to langextract ExampleData objects."""
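    |         # Each example dict is expected to look like:
    |         #   {"text": "...",
    |         #    "extractions": [{"extraction_class": "...",
    |         #                     "extraction_text": "...",
    |         #                     "attributes": {...}}]}  # attributes optional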
119 |         langextract_examples = []
120 |         
121 |         for example in examples:
122 |             extractions = []
123 |             for extraction_data in example["extractions"]:
124 |                 extractions.append(
125 |                     lx.data.Extraction(
126 |                         extraction_class=extraction_data["extraction_class"],
127 |                         extraction_text=extraction_data["extraction_text"],
128 |                         attributes=extraction_data.get("attributes", {})
129 |                     )
130 |                 )
131 |             
132 |             langextract_examples.append(
133 |                 lx.data.ExampleData(
134 |                     text=example["text"],
135 |                     extractions=extractions
136 |                 )
137 |             )
138 |         
139 |         return langextract_examples
140 |     
141 |     def extract(
142 |         self, 
143 |         text_or_url: str,
144 |         prompt_description: str,
145 |         examples: list[dict[str, Any]],
146 |         config: ExtractionConfig,
147 |         api_key: str
148 |     ) -> lx.data.AnnotatedDocument:
149 |         """Optimized extraction using cached components."""
150 |         # Get or generate schema first
151 |         schema, examples_hash = self._get_schema(examples, config.model_id)
152 |         
153 |         # Get cached components with schema-aware caching
154 |         language_model = self._get_language_model(config, api_key, schema, examples_hash)
155 |         resolver = self._get_resolver("JSON")
156 |         
157 |         # Convert examples
158 |         langextract_examples = self._create_langextract_examples(examples)
159 |         
160 |         # Create prompt template
161 |         prompt_template = lx.prompting.PromptTemplateStructured(
162 |             description=prompt_description
163 |         )
164 |         prompt_template.examples.extend(langextract_examples)
165 |         
166 |         # Create annotator
167 |         annotator = lx.annotation.Annotator(
168 |             language_model=language_model,
169 |             prompt_template=prompt_template,
170 |             format_type=lx.data.FormatType.JSON,
171 |             fence_output=False,
172 |         )
173 |         
174 |         # Perform extraction
175 |         if text_or_url.startswith(('http://', 'https://')):
176 |             # Download text first
177 |             text = lx.io.download_text_from_url(text_or_url)
178 |         else:
179 |             text = text_or_url
180 |             
181 |         return annotator.annotate_text(
182 |             text=text,
183 |             resolver=resolver,
184 |             max_char_buffer=config.max_char_buffer,
185 |             batch_length=10,
186 |             additional_context=None,
187 |             debug=False,  # Disable debug for cleaner MCP output
188 |             extraction_passes=config.extraction_passes,
189 |         )
190 | 
191 | 
192 | # Global client instance for the server lifecycle
193 | _langextract_client = LangExtractClient()
194 | 
195 | 
196 | def _get_api_key() -> str | None:
197 |     """Get API key from environment (server-side only for security)."""
198 |     return os.environ.get("LANGEXTRACT_API_KEY")
199 | 
200 | 
201 | def _format_extraction_result(result: lx.data.AnnotatedDocument, config: ExtractionConfig, source_url: str | None = None) -> dict[str, Any]:
202 |     """Format langextract result for MCP response."""
203 |     extractions = []
204 |     
205 |     for extraction in result.extractions or []:
    |         # Source grounding lives on char_interval; it is None when alignment fails
206 |         extractions.append({
207 |             "extraction_class": extraction.extraction_class,
208 |             "extraction_text": extraction.extraction_text,
209 |             "attributes": extraction.attributes,
210 |             "start_char": extraction.char_interval.start_pos if extraction.char_interval else None,
211 |             "end_char": extraction.char_interval.end_pos if extraction.char_interval else None,
212 |         })
213 |     
214 |     response = {
215 |         "document_id": result.document_id or "anonymous",
216 |         "total_extractions": len(extractions),
217 |         "extractions": extractions,
218 |         "metadata": {
219 |             "model_id": config.model_id,
220 |             "extraction_passes": config.extraction_passes,
221 |             "max_char_buffer": config.max_char_buffer,
222 |             "temperature": config.temperature,
223 |         }
224 |     }
225 |     
226 |     if source_url:
227 |         response["source_url"] = source_url
228 |         
229 |     return response
230 | 
231 | # ============================================================================
232 | # Tools
233 | # ============================================================================
234 | 
235 | @mcp.tool
236 | def extract_from_text(
237 |     text: str,
238 |     prompt_description: str,
239 |     examples: list[dict[str, Any]],
240 |     model_id: str = "gemini-2.5-flash",
241 |     max_char_buffer: int = 1000,
242 |     temperature: float = 0.5,
243 |     extraction_passes: int = 1,
244 |     max_workers: int = 10
245 | ) -> dict[str, Any]:
246 |     """
247 |     Extract structured information from text using langextract.
248 |     
249 |     Uses Large Language Models to extract structured information from unstructured text
250 |     based on user-defined instructions and examples. Each extraction is mapped to its
251 |     exact location in the source text for precise source grounding.
252 |     
253 |     Args:
254 |         text: The text to extract information from
255 |         prompt_description: Clear instructions for what to extract
256 |         examples: List of example extractions to guide the model
257 |         model_id: LLM model to use (default: "gemini-2.5-flash")
258 |         max_char_buffer: Max characters per chunk (default: 1000)
259 |         temperature: Sampling temperature 0.0-1.0 (default: 0.5)
260 |         extraction_passes: Number of extraction passes for better recall (default: 1)
261 |         max_workers: Max parallel workers (default: 10)
262 |         
263 |     Returns:
264 |         Dictionary containing extracted entities with source locations and metadata
265 |         
266 |     Raises:
267 |         ToolError: If extraction fails due to invalid parameters or API issues
268 |     """
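    |     # Illustrative `examples` payload (values are hypothetical):
    |     #   [{"text": "Patient was given 250 mg of amoxicillin.",
    |     #     "extractions": [{"extraction_class": "medication",
    |     #                      "extraction_text": "amoxicillin",
    |     #                      "attributes": {"dosage": "250 mg"}}]}]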
269 |     try:
270 |         if not examples:
271 |             raise ToolError("At least one example is required for reliable extraction")
272 |         
273 |         if not prompt_description.strip():
274 |             raise ToolError("Prompt description cannot be empty")
275 |             
276 |         if not text.strip():
277 |             raise ToolError("Input text cannot be empty")
278 |         
279 |         # Validate that only Gemini models are supported
280 |         if not model_id.startswith('gemini'):
281 |             raise ToolError(
282 |                 f"Only Google Gemini models are supported. Got: {model_id}. "
283 |                 f"Use 'list_supported_models' tool to see available options."
284 |             )
285 |         
286 |         # Create config object from individual parameters
287 |         config = ExtractionConfig(
288 |             model_id=model_id,
289 |             max_char_buffer=max_char_buffer,
290 |             temperature=temperature,
291 |             extraction_passes=extraction_passes,
292 |             max_workers=max_workers
293 |         )
294 |         
295 |         # Get API key (server-side only for security)
296 |         api_key = _get_api_key()
297 |         if not api_key:
298 |             raise ToolError(
299 |                 "API key required. Server administrator must set LANGEXTRACT_API_KEY environment variable."
300 |             )
301 |         
302 |         # Perform optimized extraction using cached client
303 |         result = _langextract_client.extract(
304 |             text_or_url=text,
305 |             prompt_description=prompt_description,
306 |             examples=examples,
307 |             config=config,
308 |             api_key=api_key
309 |         )
310 |         
311 |         return _format_extraction_result(result, config)
312 |         
    |     except ToolError:
    |         raise  # don't re-wrap the validation errors raised above
313 |     except ValueError as e:
314 |         raise ToolError(f"Invalid parameters: {str(e)}")
315 |     except Exception as e:
316 |         raise ToolError(f"Extraction failed: {str(e)}")
317 | 
318 | 
319 | @mcp.tool
320 | def extract_from_url(
321 |     url: str,
322 |     prompt_description: str,
323 |     examples: list[dict[str, Any]],
324 |     model_id: str = "gemini-2.5-flash",
325 |     max_char_buffer: int = 1000,
326 |     temperature: float = 0.5,
327 |     extraction_passes: int = 1,
328 |     max_workers: int = 10
329 | ) -> dict[str, Any]:
330 |     """
331 |     Extract structured information from text content at a URL.
332 |     
333 |     Downloads text from the specified URL and extracts structured information
334 |     using Large Language Models. Ideal for processing web articles, documents,
335 |     or any text content accessible via HTTP/HTTPS.
336 |     
337 |     Args:
338 |         url: URL to download text from (must start with http:// or https://)
339 |         prompt_description: Clear instructions for what to extract
340 |         examples: List of example extractions to guide the model
341 |         model_id: LLM model to use (default: "gemini-2.5-flash")
342 |         max_char_buffer: Max characters per chunk (default: 1000)
343 |         temperature: Sampling temperature 0.0-1.0 (default: 0.5)
344 |         extraction_passes: Number of extraction passes for better recall (default: 1)
345 |         max_workers: Max parallel workers (default: 10)
346 |         
347 |     Returns:
348 |         Dictionary containing extracted entities with source locations and metadata
349 |         
350 |     Raises:
351 |         ToolError: If URL is invalid, download fails, or extraction fails
352 |     """
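    |     # The `examples` payload uses the same shape as in extract_from_text
    |     # (see the illustrative example there).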
353 |     try:
354 |         if not url.startswith(('http://', 'https://')):
355 |             raise ToolError("URL must start with http:// or https://")
356 |             
357 |         if not examples:
358 |             raise ToolError("At least one example is required for reliable extraction")
359 |         
360 |         if not prompt_description.strip():
361 |             raise ToolError("Prompt description cannot be empty")
362 |         
363 |         # Validate that only Gemini models are supported
364 |         if not model_id.startswith('gemini'):
365 |             raise ToolError(
366 |                 f"Only Google Gemini models are supported. Got: {model_id}. "
367 |                 f"Use 'list_supported_models' tool to see available options."
368 |             )
369 |         
370 |         # Create config object from individual parameters
371 |         config = ExtractionConfig(
372 |             model_id=model_id,
373 |             max_char_buffer=max_char_buffer,
374 |             temperature=temperature,
375 |             extraction_passes=extraction_passes,
376 |             max_workers=max_workers
377 |         )
378 |         
379 |         # Get API key (server-side only for security)
380 |         api_key = _get_api_key()
381 |         if not api_key:
382 |             raise ToolError(
383 |                 "API key required. Server administrator must set LANGEXTRACT_API_KEY environment variable."
384 |             )
385 |         
386 |         # Perform optimized extraction using cached client
387 |         result = _langextract_client.extract(
388 |             text_or_url=url,
389 |             prompt_description=prompt_description,
390 |             examples=examples,
391 |             config=config,
392 |             api_key=api_key
393 |         )
394 |         
395 |         return _format_extraction_result(result, config, source_url=url)
396 |         
    |     except ToolError:
    |         raise  # don't re-wrap the validation errors raised above
397 |     except ValueError as e:
398 |         raise ToolError(f"Invalid parameters: {str(e)}")
399 |     except Exception as e:
400 |         raise ToolError(f"URL extraction failed: {str(e)}")
401 | 
402 | 
403 | @mcp.tool  
404 | def save_extraction_results(
405 |     extraction_results: dict[str, Any],
406 |     output_name: str,
407 |     output_dir: str = "."
408 | ) -> dict[str, Any]:
409 |     """
410 |     Save extraction results to a JSONL file for later use or visualization.
411 |     
412 |     Saves the extraction results in JSONL (JSON Lines) format, which is commonly
413 |     used for structured data and can be loaded for visualization or further processing.
414 |     
415 |     Args:
416 |         extraction_results: Results from extract_from_text or extract_from_url
417 |         output_name: Name for the output file (without .jsonl extension)
418 |         output_dir: Directory to save the file (default: current directory)
419 |         
420 |     Returns:
421 |         Dictionary with file path and save confirmation
422 |         
423 |     Raises:
424 |         ToolError: If save operation fails
425 |     """
426 |     try:
427 |         # Create output directory if it doesn't exist
428 |         output_path = Path(output_dir)
429 |         output_path.mkdir(parents=True, exist_ok=True)
430 |         
431 |         # Create full file path
432 |         file_path = output_path / f"{output_name}.jsonl"
433 |         
434 |         # Save results as one JSON object per line (JSONL); the module-level
435 |         # json import is reused here, so no local import is needed
436 |         with open(file_path, 'w', encoding='utf-8') as f:
437 |             json.dump(extraction_results, f, ensure_ascii=False)
438 |             f.write('\n')
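    |         # Caveat: this writes the MCP-formatted result, not langextract's native
    |         # annotated-document JSONL (as produced by lx.io.save_annotated_documents),
    |         # so files saved here may not load directly into generate_visualization.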
439 |         
440 |         return {
441 |             "message": "Results saved successfully",
442 |             "file_path": str(file_path.absolute()),
443 |             "total_extractions": extraction_results.get("total_extractions", 0)
444 |         }
445 |         
446 |     except Exception as e:
447 |         raise ToolError(f"Failed to save results: {str(e)}")
448 | 
449 | 
450 | @mcp.tool
451 | def generate_visualization(
452 |     jsonl_file_path: str,
453 |     output_html_path: str | None = None
454 | ) -> dict[str, Any]:
455 |     """
456 |     Generate interactive HTML visualization from extraction results.
457 |     
458 |     Creates an interactive HTML file that shows extracted entities highlighted
459 |     in their original text context. The visualization is self-contained and
460 |     can handle thousands of entities with color coding and hover details.
461 |     
462 |     Args:
463 |         jsonl_file_path: Path to the JSONL file containing extraction results
464 |         output_html_path: Optional path for the HTML output (default: auto-generated)
465 |         
466 |     Returns:
467 |         Dictionary with HTML file path and generation details
468 |         
469 |     Raises:
470 |         ToolError: If visualization generation fails
471 |     """
472 |     try:
473 |         # Validate input file exists
474 |         input_path = Path(jsonl_file_path)
475 |         if not input_path.exists():
476 |             raise ToolError(f"Input file not found: {jsonl_file_path}")
477 |         
478 |         # Generate visualization using langextract
479 |         html_content = lx.visualize(str(input_path))
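    |         # In Jupyter/Colab contexts lx.visualize returns an HTML object rather
    |         # than a plain string; unwrap it before writing to disk
    |         if hasattr(html_content, "data"):
    |             html_content = html_content.data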
480 |         
481 |         # Determine output path
482 |         if output_html_path:
483 |             output_path = Path(output_html_path)
484 |         else:
485 |             output_path = input_path.parent / f"{input_path.stem}_visualization.html"
486 |         
487 |         # Ensure output directory exists
488 |         output_path.parent.mkdir(parents=True, exist_ok=True)
489 |         
490 |         # Write HTML file
491 |         with open(output_path, 'w', encoding='utf-8') as f:
492 |             f.write(html_content)
493 |         
494 |         return {
495 |             "message": "Visualization generated successfully",
496 |             "html_file_path": str(output_path.absolute()),
497 |             "file_size_bytes": len(html_content.encode('utf-8'))
498 |         }
499 |         
    |     except ToolError:
    |         raise  # keep specific errors (e.g. missing input file) intact
500 |     except Exception as e:
501 |         raise ToolError(f"Failed to generate visualization: {str(e)}")
502 | 
503 | # ============================================================================
504 | # Resources
505 | # ============================================================================
506 | 
507 | # Get the directory containing this server.py file
508 | server_dir = Path(__file__).parent
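    | import sys  # resource logging below goes to stderr: on stdio transport,
    |             # stdout is reserved for MCP protocol messages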
509 | 
510 | readme_path = (server_dir / "resources" / "README.md").resolve()
511 | if readme_path.exists():
512 |     print(f"Adding README resource: {readme_path}", file=sys.stderr)
513 |     # Use a file:// URI scheme
514 |     readme_resource = FileResource(
515 |         uri=f"file://{readme_path.as_posix()}",
516 |         path=readme_path, # Path to the actual file
517 |         name="README File",
518 |         description="The README for the langextract-mcp server.",
519 |         mime_type="text/markdown",
520 |         tags={"documentation"}
521 |     )
522 |     mcp.add_resource(readme_resource)
523 | 
524 | 
525 | supported_models_path = (server_dir / "resources" / "supported-models.md").resolve()
526 | if supported_models_path.exists():
527 |     print(f"Adding Supported Models resource: {supported_models_path}", file=sys.stderr)
528 |     supported_models_resource = FileResource(
529 |         uri=f"file://{supported_models_path.as_posix()}",
530 |         path=supported_models_path,
531 |         name="Supported Models",
532 |         description="The supported models for the langextract-mcp server.",
533 |         mime_type="text/markdown",
534 |         tags={"documentation"}
535 |     )
536 |     mcp.add_resource(supported_models_resource)
537 | 
538 | 
539 | def main():
540 |     """Main function to run the FastMCP server."""
541 |     mcp.run()
542 | 
543 | 
544 | if __name__ == "__main__":
545 |     main()
546 | 
```