# Directory Structure
```
├── .gitignore
├── docs
│ ├── fastmcp.md
│ └── langextract.md
├── LICENSE
├── pyproject.toml
├── README.md
├── SETUP.md
├── src
│ └── langextract_mcp
│ ├── __init__.py
│ ├── resources
│ │ ├── __init__.py
│ │ ├── README.md
│ │ └── supported-models.md
│ └── server.py
└── uv.lock
```
# Files
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .nox/
42 | .coverage
43 | .coverage.*
44 | .cache
45 | nosetests.xml
46 | coverage.xml
47 | *.cover
48 | *.py,cover
49 | .hypothesis/
50 | .pytest_cache/
51 |
52 | # Translations
53 | *.mo
54 | *.pot
55 |
56 | # Django stuff:
57 | *.log
58 | local_settings.py
59 | db.sqlite3
60 | db.sqlite3-journal
61 |
62 | # Flask stuff:
63 | instance/
64 | .webassets-cache
65 |
66 | # Scrapy stuff:
67 | .scrapy
68 |
69 | # Sphinx documentation
70 | docs/_build/
71 |
72 | # PyBuilder
73 | target/
74 |
75 | # Jupyter Notebook
76 | .ipynb_checkpoints
77 |
78 | # IPython
79 | profile_default/
80 | ipython_config.py
81 |
82 | # pyenv
83 | .python-version
84 |
85 | # pipenv
86 | #Pipfile.lock
87 |
88 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
89 | __pypackages__/
90 |
91 | # Celery stuff
92 | celerybeat-schedule
93 | celerybeat.pid
94 |
95 | # SageMath parsed files
96 | *.sage.py
97 |
98 | # Environments
99 | .env
100 | .venv
101 | env/
102 | venv/
103 | ENV/
104 | env.bak/
105 | venv.bak/
106 |
107 | # Spyder project settings
108 | .spyderproject
109 | .spyproject
110 |
111 | # Rope project settings
112 | .ropeproject
113 |
114 | # mkdocs documentation
115 | /site
116 |
117 | # mypy
118 | .mypy_cache/
119 | .dmypy.json
120 | dmypy.json
121 |
122 | # Pyre type checker
123 | .pyre/
124 |
125 | # IDE
126 | .vscode/
127 | .idea/
128 | *.swp
129 | *.swo
130 | *~
131 |
132 | # macOS
133 | .DS_Store
134 |
135 | # Windows
136 | Thumbs.db
137 | ehthumbs.db
138 | Desktop.ini
139 |
140 | # Project specific
141 | results/
142 | *.jsonl
143 | *.html
144 | temp/
145 | test_output/
146 |
147 | # API keys and sensitive info
148 | .env.local
149 | .env.production
150 | api_keys.txt
151 |
152 | # Logs
153 | *.log
154 | logs/
155 |
156 |
157 | # Claude
158 | .mcp.json
159 | .CLAUDE.md
160 | .claude/
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
```markdown
1 | # LangExtract MCP Server
2 |
3 | A FastMCP server for Google's [langextract](https://github.com/google/langextract) library. This server enables AI assistants like Claude Code to extract structured information from unstructured text using Large Language Models through an MCP interface.
4 |
5 | <a href="https://glama.ai/mcp/servers/@larsenweigle/langextract-mcp">
6 | <img width="380" height="200" src="https://glama.ai/mcp/servers/@larsenweigle/langextract-mcp/badge" alt="LangExtract Server MCP server" />
7 | </a>
8 |
9 | ## Overview
10 |
11 | LangExtract is a Python library that uses LLMs to extract structured information from text documents while maintaining precise source grounding. This MCP server exposes langextract's capabilities through the Model Context Protocol. The server includes intelligent caching, persistent connections, and server-side credential management to provide optimal performance in long-running environments like Claude Code.
12 |
13 | ## Quick Setup for Claude Code
14 |
15 | ### Prerequisites
16 |
17 | - Claude Code installed and configured
18 | - Google Gemini API key ([Get one here](https://aistudio.google.com/app/apikey))
19 | - Python 3.10 or higher
20 |
21 | ### Installation
22 |
23 | Install directly into Claude Code using the built-in MCP management:
24 |
25 | ```bash
26 | claude mcp add langextract-mcp -e LANGEXTRACT_API_KEY=your-gemini-api-key -- uv run --with fastmcp fastmcp run src/langextract_mcp/server.py
27 | ```
28 |
29 | The server will automatically start and integrate with Claude Code. No additional configuration is required.
30 |
31 | ### Verification
32 |
33 | After installation, verify the integration by entering the following in Claude Code:
34 |
35 | ```
36 | /mcp
37 | ```
38 |
39 | You should see output indicating the server is running, and you can open the server entry to view its available tools.
40 |
41 | ## Available Tools
42 |
43 | The server provides the following tools for text extraction workflows:
44 |
45 | **Core Extraction**
46 | - `extract_from_text` - Extract structured information from provided text
47 | - `extract_from_url` - Extract information from web content
48 | - `save_extraction_results` - Save results to JSONL format
49 | - `generate_visualization` - Create interactive HTML visualizations
50 |
51 | For more information, check out the resources available to the client under `src/langextract_mcp/resources`.
52 |
53 | ## Usage Examples
54 |
55 | I am currently adding the ability for MCP clients to pass file paths to unstructured text.
56 |
57 | ### Basic Text Extraction
58 |
59 | Ask Claude Code to extract information using natural language:
60 |
61 | ```
62 | Extract medication information from this text: "Patient prescribed 500mg amoxicillin twice daily for infection"
63 |
64 | Use these examples to guide the extraction:
65 | - Text: "Take 250mg ibuprofen every 4 hours"
66 | - Expected: medication=ibuprofen, dosage=250mg, frequency=every 4 hours
67 | ```
68 |
69 | ### Advanced Configuration
70 |
71 | For complex extractions, specify configuration parameters:
72 |
73 | ```
74 | Extract character emotions from Shakespeare using:
75 | - Model: gemini-2.5-pro for better literary analysis
76 | - Multiple passes: 3 for comprehensive extraction
77 | - Temperature: 0.2 for consistent results
78 | ```
79 |
80 | ### URL Processing
81 |
82 | Extract information directly from web content:
83 |
84 | ```
85 | Extract key findings from this research paper: https://arxiv.org/abs/example
86 | Focus on methodology, results, and conclusions
87 | ```
88 |
89 | ## Supported Models
90 |
91 | This server currently supports **Google Gemini models only**, optimized for reliable structured extraction with advanced schema constraints:
92 |
93 | - `gemini-2.5-flash` - **Recommended default** - Optimal balance of speed, cost, and quality
94 | - `gemini-2.5-pro` - Best for complex reasoning and analysis tasks requiring highest accuracy
95 |
96 | The server uses persistent connections, schema caching, and connection pooling for optimal performance with Gemini models. Support for additional providers may be added in future versions.
97 |
98 | ## Configuration Reference
99 |
100 | ### Environment Variables
101 |
102 | Set during installation or in server environment:
103 |
104 | ```bash
105 | LANGEXTRACT_API_KEY=your-gemini-api-key # Required
106 | ```
107 |
108 | ### Tool Parameters
109 |
110 | Configure extraction behavior through tool parameters:
111 |
112 | ```python
113 | {
114 | "model_id": "gemini-2.5-flash", # Language model selection
115 | "max_char_buffer": 1000, # Text chunk size
116 | "temperature": 0.5, # Sampling temperature (0.0-1.0)
117 | "extraction_passes": 1, # Number of extraction attempts
118 | "max_workers": 10 # Parallel processing threads
119 | }
120 | ```
121 |
122 | ### Output Format
123 |
124 | All extractions return consistent structured data:
125 |
126 | ```python
127 | {
128 | "document_id": "doc_123",
129 | "total_extractions": 5,
130 | "extractions": [
131 | {
132 | "extraction_class": "medication",
133 | "extraction_text": "amoxicillin",
134 | "attributes": {"type": "antibiotic"},
135 | "start_char": 25,
136 | "end_char": 35
137 | }
138 | ],
139 | "metadata": {
140 | "model_id": "gemini-2.5-flash",
141 | "extraction_passes": 1,
142 | "temperature": 0.5
143 | }
144 | }
145 | ```
146 |
147 | ## Use Cases
148 |
149 | LangExtract MCP Server supports a wide range of use cases across multiple domains. In healthcare and life sciences, it can extract medications, dosages, and treatment protocols from clinical notes, structure radiology and pathology reports, and process research papers or clinical trial data. For legal and compliance applications, it enables extraction of contract terms, parties, and obligations, as well as analysis of regulatory documents, compliance reports, and case law. In research and academia, the server is useful for extracting methodologies, findings, and citations from papers, analyzing survey responses and interview transcripts, and processing historical or archival materials. For business intelligence, it helps extract insights from customer feedback and reviews, analyze news articles and market reports, and process financial documents and earnings reports.
150 |
151 | ## Support and Documentation
152 |
153 | **Primary Resources:**
154 | - [LangExtract Documentation](https://github.com/google/langextract) - Core library reference
155 | - [FastMCP Documentation](https://gofastmcp.com/) - MCP server framework
156 | - [Model Context Protocol](https://modelcontextprotocol.io/) - Protocol specification
157 |
```
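
As a concrete illustration of the tool parameters and output format above, here is a minimal sketch of invoking `extract_from_text` outside Claude Code using FastMCP's Python client (the script path, sample payload, and result handling are assumptions; within Claude Code the tool call is issued for you):

```python
import asyncio

from fastmcp import Client  # FastMCP's programmatic MCP client


async def main() -> None:
    # Point the client at the server script; FastMCP runs it over STDIO.
    async with Client("src/langextract_mcp/server.py") as client:
        result = await client.call_tool("extract_from_text", {
            "text": "Patient prescribed 500mg amoxicillin twice daily",
            "prompt_description": "Extract medication names, dosages, and frequencies.",
            "examples": [{
                "text": "Take 250mg ibuprofen every 4 hours",
                "extractions": [{
                    "extraction_class": "medication",
                    "extraction_text": "ibuprofen",
                    "attributes": {"type": "pain_reliever"},
                }],
            }],
            "model_id": "gemini-2.5-flash",
            "temperature": 0.2,
        })
        print(result)  # Follows the structure shown under "Output Format"


asyncio.run(main())
```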
--------------------------------------------------------------------------------
/src/langextract_mcp/resources/README.md:
--------------------------------------------------------------------------------
```markdown
1 | # LangExtract MCP Server - Client Guide
2 |
3 | A Model Context Protocol (MCP) server that provides structured information extraction from unstructured text using Google's LangExtract library and Gemini models.
4 |
5 | ## Overview
6 |
7 | This MCP server enables AI assistants to extract structured information from text documents while maintaining precise source grounding. Each extraction is mapped to its exact location in the source text, enabling visual highlighting and verification.
8 |
9 | ## Available Tools
10 |
11 | ### Core Extraction Tools
12 |
13 | #### `extract_from_text`
14 | Extract structured information from provided text using Large Language Models.
15 |
16 | **Parameters:**
17 | - `text` (string): The text to extract information from
18 | - `prompt_description` (string): Clear instructions for what to extract
19 | - `examples` (array): List of example extractions to guide the model
20 | - `config` (object, optional): Configuration parameters
21 |
22 | #### `extract_from_url`
23 | Extract structured information from web content by downloading and processing the text.
24 |
25 | **Parameters:**
26 | - `url` (string): URL to download text from (must start with http:// or https://)
27 | - `prompt_description` (string): Clear instructions for what to extract
28 | - `examples` (array): List of example extractions to guide the model
29 | - `config` (object, optional): Configuration parameters
30 |
31 | #### `save_extraction_results`
32 | Save extraction results to a JSONL file for later use or visualization.
33 |
34 | **Parameters:**
35 | - `extraction_results` (object): Results from extract_from_text or extract_from_url
36 | - `output_name` (string): Name for the output file (without .jsonl extension)
37 | - `output_dir` (string, optional): Directory to save the file (default: current directory)
38 |
39 | #### `generate_visualization`
40 | Generate interactive HTML visualization from extraction results.
41 |
42 | **Parameters:**
43 | - `jsonl_file_path` (string): Path to the JSONL file containing extraction results
44 | - `output_html_path` (string, optional): Optional path for the HTML output
45 |
46 | ## How to Structure Examples
47 |
48 | Examples are critical for guiding the extraction model. Each example should follow this structure:
49 |
50 | ```json
51 | {
52 | "text": "Example input text",
53 | "extractions": [
54 | {
55 | "extraction_class": "category_name",
56 | "extraction_text": "exact text from input",
57 | "attributes": {
58 | "key1": "value1",
59 | "key2": "value2"
60 | }
61 | }
62 | ]
63 | }
64 | ```
65 |
66 | ### Key Principles for Examples:
67 |
68 | 1. **Use exact text**: `extraction_text` should be verbatim from the input text
69 | 2. **Don't paraphrase**: Extract the actual words, not interpretations
70 | 3. **Provide meaningful attributes**: Add context through the attributes dictionary
71 | 4. **Cover all extraction classes**: Include examples for each type you want to extract
72 | 5. **Show variety**: Demonstrate different patterns and edge cases
73 |
74 | ## Configuration Options
75 |
76 | The `config` parameter accepts these options:
77 |
78 | - `model_id` (string): Gemini model to use (default: "gemini-2.5-flash")
79 | - `max_char_buffer` (integer): Text chunk size (default: 1000)
80 | - `temperature` (float): Sampling temperature 0.0-1.0 (default: 0.5)
81 | - `extraction_passes` (integer): Number of extraction attempts for better recall (default: 1)
82 | - `max_workers` (integer): Parallel processing threads (default: 10)
83 |
84 | ## Supported Models
85 |
86 | This server only supports Google Gemini models:
87 | - `gemini-2.5-flash` - **Recommended default** - Optimal balance of speed, cost, and quality
88 | - `gemini-2.5-pro` - Best for complex reasoning and analysis tasks
89 |
90 | ## Complete Usage Examples
91 |
92 | ### Example 1: Medical Information Extraction
93 |
94 | ```json
95 | {
96 | "tool": "extract_from_text",
97 | "parameters": {
98 | "text": "Patient prescribed 500mg amoxicillin twice daily for bacterial infection. Take with food to reduce stomach upset.",
99 | "prompt_description": "Extract medication information including drug names, dosages, frequencies, and administration instructions. Use exact text for extractions.",
100 | "examples": [
101 | {
102 | "text": "Take 250mg ibuprofen every 4 hours as needed for pain",
103 | "extractions": [
104 | {
105 | "extraction_class": "medication",
106 | "extraction_text": "ibuprofen",
107 | "attributes": {
108 | "type": "pain_reliever",
109 | "category": "NSAID"
110 | }
111 | },
112 | {
113 | "extraction_class": "dosage",
114 | "extraction_text": "250mg",
115 | "attributes": {
116 | "amount": "250",
117 | "unit": "mg"
118 | }
119 | },
120 | {
121 | "extraction_class": "frequency",
122 | "extraction_text": "every 4 hours",
123 | "attributes": {
124 | "interval": "4 hours",
125 | "schedule_type": "as_needed"
126 | }
127 | }
128 | ]
129 | }
130 | ],
131 | "config": {
132 | "model_id": "gemini-2.5-flash",
133 | "temperature": 0.2
134 | }
135 | }
136 | }
137 | ```
138 |
139 | ### Example 2: Document Analysis from URL
140 |
141 | ```json
142 | {
143 | "tool": "extract_from_url",
144 | "parameters": {
145 | "url": "https://example.com/research-paper.html",
146 | "prompt_description": "Extract research findings, methodologies, and key statistics from academic papers. Focus on quantitative results and experimental methods.",
147 | "examples": [
148 | {
149 | "text": "Our study of 500 participants showed a 23% improvement in accuracy using the new method compared to baseline.",
150 | "extractions": [
151 | {
152 | "extraction_class": "finding",
153 | "extraction_text": "23% improvement in accuracy",
154 | "attributes": {
155 | "metric": "accuracy",
156 | "change": "improvement",
157 | "magnitude": "23%"
158 | }
159 | },
160 | {
161 | "extraction_class": "methodology",
162 | "extraction_text": "study of 500 participants",
163 | "attributes": {
164 | "sample_size": "500",
165 | "study_type": "comparative"
166 | }
167 | }
168 | ]
169 | }
170 | ],
171 | "config": {
172 | "model_id": "gemini-2.5-pro",
173 | "extraction_passes": 2,
174 | "max_char_buffer": 1500
175 | }
176 | }
177 | }
178 | ```
179 |
180 | ### Example 3: Literary Character Analysis
181 |
182 | ```json
183 | {
184 | "tool": "extract_from_text",
185 | "parameters": {
186 | "text": "ROMEO: But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
187 | "prompt_description": "Extract characters, emotions, and literary devices from Shakespeare. Capture the emotional context and relationships between characters.",
188 | "examples": [
189 | {
190 | "text": "HAMLET: To be or not to be, that is the question.",
191 | "extractions": [
192 | {
193 | "extraction_class": "character",
194 | "extraction_text": "HAMLET",
195 | "attributes": {
196 | "play": "Hamlet",
197 | "emotional_state": "contemplative"
198 | }
199 | },
200 | {
201 | "extraction_class": "philosophical_statement",
202 | "extraction_text": "To be or not to be, that is the question",
203 | "attributes": {
204 | "theme": "existential",
205 | "type": "soliloquy"
206 | }
207 | }
208 | ]
209 | }
210 | ]
211 | }
212 | }
213 | ```
214 |
215 | ### Example 4: Business Intelligence from Customer Feedback
216 |
217 | ```json
218 | {
219 | "tool": "extract_from_text",
220 | "parameters": {
221 | "text": "The new software update is fantastic! Loading times are 50% faster and the interface is much more intuitive. However, the mobile app still crashes occasionally.",
222 | "prompt_description": "Extract customer sentiments, specific feedback points, and performance metrics from reviews. Identify both positive and negative aspects.",
223 | "examples": [
224 | {
225 | "text": "Love the new design but the checkout process takes too long - about 3 minutes.",
226 | "extractions": [
227 | {
228 | "extraction_class": "positive_feedback",
229 | "extraction_text": "Love the new design",
230 | "attributes": {
231 | "aspect": "design",
232 | "sentiment": "positive"
233 | }
234 | },
235 | {
236 | "extraction_class": "negative_feedback",
237 | "extraction_text": "checkout process takes too long",
238 | "attributes": {
239 | "aspect": "checkout",
240 | "sentiment": "negative"
241 | }
242 | },
243 | {
244 | "extraction_class": "metric",
245 | "extraction_text": "about 3 minutes",
246 | "attributes": {
247 | "measurement": "time",
248 | "value": "3",
249 | "unit": "minutes"
250 | }
251 | }
252 | ]
253 | }
254 | ]
255 | }
256 | }
257 | ```
258 |
259 | ## Working with Results
260 |
261 | ### Saving and Visualizing Extractions
262 |
263 | After running an extraction, you can save the results and create an interactive visualization:
264 |
265 | ```json
266 | {
267 | "tool": "save_extraction_results",
268 | "parameters": {
269 | "extraction_results": {...}, // Results from previous extraction
270 | "output_name": "medical_extractions",
271 | "output_dir": "./results"
272 | }
273 | }
274 | ```
275 |
276 | ```json
277 | {
278 | "tool": "generate_visualization",
279 | "parameters": {
280 | "jsonl_file_path": "./results/medical_extractions.jsonl",
281 | "output_html_path": "./results/medical_visualization.html"
282 | }
283 | }
284 | ```
285 |
286 | ### Expected Output Format
287 |
288 | All extractions return this structured format:
289 |
290 | ```json
291 | {
292 | "document_id": "doc_123",
293 | "total_extractions": 5,
294 | "extractions": [
295 | {
296 | "extraction_class": "medication",
297 | "extraction_text": "amoxicillin",
298 | "attributes": {
299 | "type": "antibiotic"
300 | },
301 | "start_char": 25,
302 | "end_char": 35
303 | }
304 | ],
305 | "metadata": {
306 | "model_id": "gemini-2.5-flash",
307 | "extraction_passes": 1,
308 | "temperature": 0.5
309 | }
310 | }
311 | ```
312 |
313 | ## Best Practices
314 |
315 | ### Creating Effective Examples
316 |
317 | 1. **Quality over quantity**: 1-3 high-quality examples are better than many poor ones
318 | 2. **Representative patterns**: Cover the main patterns you expect to see
319 | 3. **Exact text matching**: Always use verbatim text from the input
320 | 4. **Rich attributes**: Use attributes to provide context and categorization
321 | 5. **Edge cases**: Include examples of challenging or ambiguous cases
322 |
323 | ### Optimizing Performance
324 |
325 | - Use `gemini-2.5-flash` for most tasks (faster, cost-effective)
326 | - Use `gemini-2.5-pro` for complex reasoning or analysis
327 | - Increase `extraction_passes` for higher recall on long documents
328 | - Decrease `max_char_buffer` for better accuracy on dense text
329 | - Lower `temperature` (0.1-0.3) for consistent, factual extractions
330 | - Higher `temperature` (0.7-0.9) for creative or interpretive tasks
331 |
332 | ### Error Handling
333 |
334 | Common issues and solutions:
335 |
336 | - **"At least one example is required"**: Always provide examples array
337 | - **"Only Gemini models are supported"**: Use `gemini-2.5-flash` or `gemini-2.5-pro`
338 | - **"API key required"**: Server administrator must set LANGEXTRACT_API_KEY
339 | - **"Input text cannot be empty"**: Ensure text parameter has content
340 | - **"URL must start with http://"**: Use full URLs for extract_from_url
341 |
342 | ## Advanced Features
343 |
344 | ### Multi-pass Extraction
345 | For comprehensive extraction from long documents:
346 |
347 | ```json
348 | {
349 | "config": {
350 | "extraction_passes": 3,
351 | "max_workers": 20,
352 | "max_char_buffer": 800
353 | }
354 | }
355 | ```
356 |
357 | ### Precision vs. Recall Tuning
358 | - **High precision**: Lower temperature (0.1-0.3), single pass
359 | - **High recall**: Multiple passes (2-3), higher temperature (0.5-0.7)
360 |
361 | ### Domain-Specific Configurations
362 | - **Medical texts**: Use `gemini-2.5-pro`, low temperature, multiple passes
363 | - **Legal documents**: Smaller chunks (500-800 chars), precise examples
364 | - **Literary analysis**: Higher temperature, rich attribute examples
365 | - **Technical documentation**: Structured examples, consistent terminology
366 |
367 | This MCP server provides a powerful interface to Google's LangExtract library, enabling precise structured information extraction with source grounding and interactive visualization capabilities.
```
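
The example structure above maps one-to-one onto langextract's data classes. A minimal sketch of the conversion the server performs internally, mirroring `_create_langextract_examples` in `server.py`:

```python
import langextract as lx


def to_langextract_examples(examples: list[dict]) -> list[lx.data.ExampleData]:
    """Convert the JSON example shape above into langextract ExampleData objects."""
    converted = []
    for example in examples:
        extractions = [
            lx.data.Extraction(
                extraction_class=item["extraction_class"],
                extraction_text=item["extraction_text"],
                attributes=item.get("attributes", {}),
            )
            for item in example["extractions"]
        ]
        converted.append(
            lx.data.ExampleData(text=example["text"], extractions=extractions)
        )
    return converted
```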
--------------------------------------------------------------------------------
/src/langextract_mcp/resources/__init__.py:
--------------------------------------------------------------------------------
```python
1 |
```
--------------------------------------------------------------------------------
/src/langextract_mcp/__init__.py:
--------------------------------------------------------------------------------
```python
1 | """LangExtract MCP Server - FastMCP server for Google's langextract library."""
2 |
3 | from .server import mcp, main
4 |
5 | __version__ = "0.1.0"
6 | __all__ = ["mcp", "main"]
```
--------------------------------------------------------------------------------
/SETUP.md:
--------------------------------------------------------------------------------
```markdown
1 | # LangExtract MCP Server Setup Guide
2 |
3 | ## Quick Setup (No Config Files Needed!)
4 |
5 | This MCP server doesn't use separate configuration files. Everything is handled through environment variables and tool parameters.
6 |
7 | ### Step 1: Get Your API Key
8 | 1. Go to [Google AI Studio](https://aistudio.google.com/app/apikey)
9 | 2. Create a new API key
10 | 3. Copy the key (keep it secure!)
11 |
12 | ### Step 2: Install with Claude Code
13 | ```bash
14 | # Single command installation - no config files needed!
15 | claude mcp add langextract-mcp -e LANGEXTRACT_API_KEY=your-gemini-api-key -- uv run --with fastmcp fastmcp run src/langextract_mcp/server.py
16 | ```
17 |
18 | That's it! The server will start automatically when Claude Code needs it.
19 |
20 | ## Configuration Details
21 |
22 | ### Environment Variables (Set Once)
23 | ```bash
24 | # Required
25 | LANGEXTRACT_API_KEY=your-gemini-api-key
26 |
27 | # Optional
28 | LANGEXTRACT_DEFAULT_MODEL=gemini-2.5-flash
29 | LANGEXTRACT_MAX_WORKERS=10
30 | ```
31 |
32 | ### Per-Request Configuration (In Tool Calls)
33 | When using tools, you can configure behavior per request:
34 |
35 | ```python
36 | {
37 | "text": "Your text to extract from",
38 | "config": {
39 | "model_id": "gemini-2.5-flash", # Which model to use
40 | "temperature": 0.5, # Randomness (0.0-1.0)
41 | "extraction_passes": 1, # How many extraction attempts
42 | "max_workers": 10 # Parallel processing
43 | }
44 | }
45 | ```
46 |
47 | ## Verification
48 | After installation, ask Claude Code:
49 | ```
50 | Use the get_server_info tool to show the LangExtract server status
51 | ```
52 |
53 | You should see:
54 | - Server running: ✅
55 | - API key configured: ✅
56 | - Optimization features enabled: ✅
57 |
58 | ## Troubleshooting
59 |
60 | **"Server not found"**
61 | ```bash
62 | # Check if registered
63 | claude mcp list
64 |
65 | # Re-add if missing
66 | claude mcp add langextract-mcp -e LANGEXTRACT_API_KEY=your-key -- uv run --with fastmcp fastmcp run src/langextract_mcp/server.py
67 | ```
68 |
69 | **"API key not set"**
70 | ```bash
71 | # Check environment
72 | echo $LANGEXTRACT_API_KEY
73 |
74 | # Set if missing (permanent)
75 | echo 'export LANGEXTRACT_API_KEY=your-key' >> ~/.bashrc
76 | source ~/.bashrc
77 | ```
78 |
79 | **"Tools not working"**
80 | - Verify API key is valid at [Google AI Studio](https://aistudio.google.com/app/apikey)
81 | - Check network connectivity
82 | - Try with different model (e.g., "gemini-2.5-pro")
83 |
```
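
For a quick sanity check outside Claude Code, a minimal sketch that mirrors the server-side credential lookup (`_get_api_key` in `server.py`); the fallback defaults and printed summary are illustrative, and only `LANGEXTRACT_API_KEY` is required:

```python
import os

# Mirrors the server's lookup: credentials stay server-side and are
# never passed through tool calls.
api_key = os.environ.get("LANGEXTRACT_API_KEY")
if not api_key:
    raise SystemExit("LANGEXTRACT_API_KEY is not set; see Step 1 above.")

# Optional variables from the section above; defaults here are illustrative.
model = os.environ.get("LANGEXTRACT_DEFAULT_MODEL", "gemini-2.5-flash")
max_workers = int(os.environ.get("LANGEXTRACT_MAX_WORKERS", "10"))
print(f"OK: API key set, default model={model}, max_workers={max_workers}")
```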
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
```toml
1 | [build-system]
2 | requires = ["hatchling"]
3 | build-backend = "hatchling.build"
4 |
5 | [project]
6 | name = "langextract-mcp"
7 | version = "0.1.0"
8 | description = "FastMCP server for Google's langextract library - extract structured information from unstructured text using LLMs"
9 | readme = "README.md"
10 | license = { text = "Apache-2.0" }
11 | authors = [
12 | { name = "Larsen Weigle" }
13 | ]
14 | classifiers = [
15 | "Development Status :: 3 - Alpha",
16 | "Intended Audience :: Developers",
17 | "License :: OSI Approved :: Apache Software License",
18 | "Programming Language :: Python :: 3",
19 | "Programming Language :: Python :: 3.10",
20 | "Programming Language :: Python :: 3.11",
21 | "Programming Language :: Python :: 3.12",
22 | "Topic :: Software Development :: Libraries :: Python Modules",
23 | "Topic :: Scientific/Engineering :: Artificial Intelligence",
24 | "Topic :: Text Processing :: Linguistic",
25 | ]
26 | keywords = ["mcp", "fastmcp", "langextract", "llm", "text-extraction", "nlp", "ai"]
27 |
28 | requires-python = ">=3.10"
29 | dependencies = [
30 | "fastmcp>=0.1.0",
31 | "langextract>=0.1.0",
32 | "pydantic>=2.0.0",
33 | "python-dotenv>=1.0.0",
34 | "httpx>=0.25.0",
35 | ]
36 |
37 | [project.optional-dependencies]
38 | dev = [
39 | "pytest>=7.0.0",
40 | "pytest-asyncio>=0.21.0",
41 | "black>=23.0.0",
42 | "isort>=5.12.0",
43 | "mypy>=1.5.0",
44 | "pre-commit>=3.0.0",
45 | ]
46 |
47 | [project.urls]
48 | Homepage = "https://github.com/your-org/langextract-mcp"
49 | Repository = "https://github.com/your-org/langextract-mcp"
50 | Documentation = "https://github.com/your-org/langextract-mcp/blob/main/README.md"
51 | Issues = "https://github.com/your-org/langextract-mcp/issues"
52 |
53 | [project.scripts]
54 | langextract-mcp = "langextract_mcp.server:main"
55 |
56 | [tool.hatch.build.targets.wheel]
57 | packages = ["src/langextract_mcp"]
58 |
59 | [tool.hatch.build.targets.sdist]
60 | include = [
61 | "/src",
62 | "/docs",
63 | "/examples",
64 | "/README.md",
65 | "/LICENSE",
66 | ]
67 |
68 | [tool.black]
69 | line-length = 88
70 | target-version = ['py310']
71 | include = '\.pyi?$'
72 |
73 | [tool.isort]
74 | profile = "black"
75 | line_length = 88
76 |
77 | [tool.mypy]
78 | python_version = "3.10"
79 | warn_return_any = true
80 | warn_unused_configs = true
81 | disallow_untyped_defs = true
82 | disallow_incomplete_defs = true
83 |
84 | [tool.pytest.ini_options]
85 | testpaths = ["tests"]
86 | python_files = ["test_*.py"]
87 | python_classes = ["Test*"]
88 | python_functions = ["test_*"]
89 | addopts = "-v --tb=short"
90 | asyncio_mode = "auto"
```
--------------------------------------------------------------------------------
/src/langextract_mcp/resources/supported-models.md:
--------------------------------------------------------------------------------
```markdown
1 | # Supported Language Models
2 |
3 | This document provides comprehensive information about the language models supported by the langextract-mcp server.
4 |
5 | ## Currently Supported Models
6 |
7 | The langextract-mcp server currently supports **Google Gemini models only**, which are optimized for reliable structured extraction with schema constraints.
8 |
9 | ### Gemini 2.5 Flash
10 | - **Provider**: Google
11 | - **Model ID**: `gemini-2.5-flash`
12 | - **Description**: Fast, cost-effective model with excellent quality
13 | - **Schema Constraints**: ✅ Supported
14 | - **Recommended For**:
15 | - General extraction tasks
16 | - Fast processing requirements
17 | - Cost-sensitive applications
18 | - **Notes**: Recommended default choice - optimal balance of speed, cost, and quality
19 |
20 | ### Gemini 2.5 Pro
21 | - **Provider**: Google
22 | - **Model ID**: `gemini-2.5-pro`
23 | - **Description**: Advanced model for complex reasoning tasks
24 | - **Schema Constraints**: ✅ Supported
25 | - **Recommended For**:
26 | - Complex extractions
27 | - High accuracy requirements
28 | - Sophisticated reasoning tasks
29 | - **Notes**: Best quality for complex tasks but higher cost
30 |
31 | ## Model Recommendations
32 |
33 | | Use Case | Recommended Model | Reason |
34 | |----------|------------------|---------|
35 | | **Default/General** | `gemini-2.5-flash` | Best balance of speed, cost, and quality |
36 | | **High Quality** | `gemini-2.5-pro` | Superior accuracy and reasoning capabilities |
37 | | **Cost Optimized** | `gemini-2.5-flash` | Most cost-effective option |
38 | | **Complex Reasoning** | `gemini-2.5-pro` | Advanced reasoning for complex extraction tasks |
39 |
40 | ## Configuration Parameters
41 |
42 | When using any supported model, you can configure the following parameters:
43 |
44 | - **`model_id`**: The model identifier (e.g., "gemini-2.5-flash")
45 | - **`max_char_buffer`**: Maximum characters per chunk (default: 1000)
46 | - **`temperature`**: Sampling temperature 0.0-1.0 (default: 0.5)
47 | - **`extraction_passes`**: Number of extraction passes for better recall (default: 1)
48 | - **`max_workers`**: Maximum parallel workers (default: 10)
49 |
50 | ## Limitations
51 |
52 | - **Provider Support**: Currently supports Google Gemini models only
53 | - **Future Support**: OpenAI and local model support may be added in future versions
54 | - **API Dependencies**: Requires active internet connection and valid API keys
55 |
56 | ## Schema Constraints
57 |
58 | All supported Gemini models include schema constraint capabilities, which means:
59 |
60 | - **Structured Output**: Guaranteed JSON structure based on your examples
61 | - **Type Safety**: Consistent field types across extractions
62 | - **Validation**: Automatic validation of extracted data against schema
63 | - **Reliability**: Reduced hallucination and improved consistency
64 |
65 | This makes the langextract-mcp server particularly reliable for production applications requiring consistent structured data extraction.
66 |
```
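
A minimal sketch of how schema constraints are derived from few-shot examples, using the same langextract calls the server makes in `server.py` (the example content is illustrative):

```python
import langextract as lx

# Few-shot examples drive both prompting and schema generation.
template = lx.prompting.PromptTemplateStructured(description="Extract medications")
template.examples.append(
    lx.data.ExampleData(
        text="Take 250mg ibuprofen every 4 hours",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="ibuprofen",
                attributes={"type": "pain_reliever"},
            )
        ],
    )
)

# The derived schema is what enforces the structured-output guarantees
# (consistent fields and types) described above for Gemini models.
schema = lx.schema.GeminiSchema.from_examples(template.examples)
```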
--------------------------------------------------------------------------------
/docs/fastmcp.md:
--------------------------------------------------------------------------------
```markdown
1 | # FastMCP Framework Study Notes - Deep Analysis
2 |
3 | ## Overview
4 | FastMCP is a Python framework for building Model Context Protocol (MCP) servers and clients, designed to enable sophisticated interactions between AI systems and various services. It provides a "fast, Pythonic way" to build MCP servers with comprehensive functionality and enterprise-grade features.
5 |
6 | ## Core Architecture
7 |
8 | ### 1. Servers
9 | - **Primary Function**: Expose tools as executable capabilities
10 | - **Authentication**: Support multiple authentication mechanisms
11 | - **Middleware**: Enable cross-cutting functionality for request/response processing
12 | - **Resource Management**: Allow resource and prompt management
13 | - **Monitoring**: Support progress reporting and logging
14 |
15 | ### 2. Clients
16 | - **Purpose**: Provide programmatic interaction with MCP servers
17 | - **Authentication**: Support multiple methods (Bearer Token, OAuth)
18 | - **Processing**: Handle message processing, logging, and progress monitoring
19 |
20 | ## Key Features
21 |
22 | ### Tool Operations
23 | - Define tools as executable functions
24 | - Structured user input handling
25 | - Comprehensive tool management
26 |
27 | ### Resource Management
28 | - Create and manage resources
29 | - Prompt templating capabilities
30 | - Resource organization and access
31 |
32 | ### Authentication & Security
33 | - Flexible authentication strategies
34 | - Bearer Token support
35 | - OAuth integration
36 | - Authorization provider compatibility
37 |
38 | ### Middleware System
39 | - Request/response processing
40 | - Cross-cutting concerns handling
41 | - Extensible middleware chain
42 |
43 | ### Monitoring & Logging
44 | - Progress tracking
45 | - Comprehensive logging
46 | - User interaction context
47 |
48 | ## Integration Capabilities
49 |
50 | ### Supported Platforms
51 | - OpenAI API
52 | - Anthropic
53 | - Google Gemini
54 | - FastAPI
55 | - Starlette/ASGI
56 |
57 | ### Authorization Providers
58 | - Various authorization providers supported
59 | - Flexible configuration options
60 |
61 | ## Server Development Guidelines
62 |
63 | ### 1. Tool Definition
64 | - Define tools as executable functions
65 | - Implement clear input/output schemas
66 | - Handle errors gracefully
67 |
68 | ### 2. Authentication Setup
69 | - Choose appropriate authentication strategy
70 | - Configure security mechanisms
71 | - Implement user context handling
72 |
73 | ### 3. Context Configuration
74 | - Set up logging context
75 | - Configure user interactions
76 | - Implement progress tracking
77 |
78 | ### 4. Middleware Implementation
79 | - Use middleware for common functionality
80 | - Process requests and responses
81 | - Handle cross-cutting concerns
82 |
83 | ### 5. Resource Creation
84 | - Define resources and prompt templates
85 | - Organize resource access patterns
86 | - Implement resource management
87 |
88 | ## Unique Selling Points
89 |
90 | 1. **Pythonic Interface**: Natural Python API design
91 | 2. **Flexible Composition**: Modular server composition
92 | 3. **Structured Input**: Sophisticated user input handling
93 | 4. **Comprehensive SDK**: Extensive documentation and tooling
94 | 5. **Standardized Protocol**: Uses MCP for consistent interactions
95 |
96 | ## FastMCP Implementation Patterns
97 |
98 | ### 1. Server Instantiation
99 | ```python
100 | from fastmcp import FastMCP
101 |
102 | # Basic server
103 | mcp = FastMCP("Demo 🚀")
104 |
105 | # Server with configuration
106 | mcp = FastMCP(
107 | name="LangExtractServer",
108 | instructions="Extract structured information from text using LLMs",
109 | include_tags={"public"},
110 | exclude_tags={"internal"}
111 | )
112 | ```
113 |
114 | ### 2. Tool Definition Patterns
115 | ```python
116 | @mcp.tool
117 | def add(a: int, b: int) -> int:
118 | """Add two numbers"""
119 | return a + b
120 |
121 | # Complex parameters with validation
122 | @mcp.tool
123 | def process_data(
124 | query: str,
125 | max_results: int = 10,
126 | sort_by: str = "relevance",
127 | category: str | None = None
128 | ) -> dict:
129 | """Process data with parameters"""
130 | return {"results": []}
131 | ```
132 |
133 | ### 3. Error Handling
134 | ```python
135 | from fastmcp.exceptions import ToolError
136 |
137 | @mcp.tool
138 | def divide(a: float, b: float) -> float:
139 | if b == 0:
140 | raise ToolError("Cannot divide by zero")
141 | return a / b
142 | ```
143 |
144 | ### 4. Authentication Patterns
145 | ```python
146 | from fastmcp.server.auth.providers.jwt import JWTVerifier
147 |
148 | auth = JWTVerifier(
149 | jwks_uri="https://your-auth-system.com/.well-known/jwks.json",
150 | issuer="https://your-auth-system.com",
151 | audience="your-mcp-server"
152 | )
153 |
154 | mcp = FastMCP(name="Protected Server", auth=auth)
155 | ```
156 |
157 | ### 5. Server Execution
158 | ```python
159 | # STDIO transport (default for MCP clients)
160 | mcp.run()
161 |
162 | # HTTP transport
163 | mcp.run(transport="http", host="0.0.0.0", port=9000)
164 | ```
165 |
166 | ### 6. Server Composition
167 | ```python
168 | main = FastMCP(name="MainServer")
169 | sub = FastMCP(name="SubServer")
170 | main.mount(sub, prefix="sub")
171 | ```
172 |
173 | ## Key Insights for LangExtract MCP Server
174 |
175 | 1. **Simple Decorator Pattern**: Use `@mcp.tool` for all langextract functions
176 | 2. **Type Safety**: Leverage Python type hints for automatic validation
177 | 3. **Proper Error Handling**: Use `ToolError` for controlled error messaging
178 | 4. **Clean Architecture**: Keep tools simple and focused
179 | 5. **Context Management**: Use FastMCP's built-in context for logging/progress
180 | 6. **Transport Flexibility**: Support both STDIO and HTTP transports
181 | 7. **Authentication Ready**: Design with auth in mind for production use
182 |
183 | ## Implementation Strategy for langextract
184 |
185 | Based on deeper FastMCP understanding:
186 |
187 | 1. **Clean Tool Interface**: Each langextract function as a simple `@mcp.tool`
188 | 2. **Type-Safe Parameters**: Use Pydantic models for complex inputs
189 | 3. **Structured Outputs**: Return proper dictionaries/models
190 | 4. **Error Management**: Comprehensive error handling with `ToolError`
191 | 5. **Context Integration**: Use FastMCP context for progress/logging
192 | 6. **Resource Management**: Expose example templates as MCP resources
193 | 7. **Production Ready**: Authentication and deployment configuration
```
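
The notes list resource creation among the development guidelines but show no pattern for it; a minimal sketch using FastMCP's `@mcp.resource` decorator (the URI and template content are assumptions):

```python
from fastmcp import FastMCP

mcp = FastMCP(name="LangExtractServer")


# Expose a reusable few-shot example template as a read-only MCP resource.
@mcp.resource("resource://example-templates/medication")
def medication_template() -> str:
    """An example template clients can fetch before calling extraction tools."""
    return (
        '{"text": "Take 250mg ibuprofen every 4 hours", '
        '"extractions": [{"extraction_class": "medication", '
        '"extraction_text": "ibuprofen"}]}'
    )
```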
--------------------------------------------------------------------------------
/docs/langextract.md:
--------------------------------------------------------------------------------
```markdown
1 | # LangExtract Library Study Notes
2 |
3 | ## Overview
4 | LangExtract is a Python library developed by Google that uses Large Language Models (LLMs) to extract structured information from unstructured text documents based on user-defined instructions. It's designed to process materials like clinical notes, reports, and other documents while maintaining precise source grounding.
5 |
6 | ## Key Features & Differentiators
7 |
8 | ### 1. Precise Source Grounding
9 | - **Capability**: Maps every extraction to its exact location in the source text
10 | - **Benefit**: Enables visual highlighting for easy traceability and verification
11 | - **Implementation**: Through annotation system that tracks character positions
12 |
13 | ### 2. Reliable Structured Outputs
14 | - **Schema Enforcement**: Consistent output schema based on few-shot examples
15 | - **Controlled Generation**: Leverages structured output capabilities in supported models (Gemini)
16 | - **Format Support**: JSON and YAML output formats
17 |
18 | ### 3. Long Document Optimization
19 | - **Challenge Addressed**: "Needle-in-a-haystack" problem in large documents
20 | - **Strategy**: Text chunking + parallel processing + multiple extraction passes
21 | - **Benefit**: Higher recall on complex documents
22 |
23 | ### 4. Interactive Visualization
24 | - **Output**: Self-contained HTML files for reviewing extractions
25 | - **Scalability**: Handles thousands of extracted entities
26 | - **Context**: Shows entities in their original document context
27 |
28 | ### 5. Flexible LLM Support
29 | - **Cloud Models**: Google Gemini family, OpenAI models
30 | - **Local Models**: Built-in Ollama interface
31 | - **Extensibility**: Can be extended to other APIs
32 |
33 | ### 6. Domain Adaptability
34 | - **No Fine-tuning**: Uses few-shot examples instead of model training
35 | - **Flexibility**: Works across any domain with proper examples
36 | - **Customization**: Leverages LLM world knowledge through prompt engineering
37 |
38 | ## Core Architecture
39 |
40 | ### Main Components
41 |
42 | #### 1. Data Models (`data.py`)
43 | - **ExampleData**: Defines extraction examples with text and expected extractions
44 | - **Extraction**: Individual extracted entity with class, text, and attributes
45 | - **Document**: Input document container
46 | - **AnnotatedDocument**: Result container with extractions and metadata
47 |
48 | #### 2. Inference Engine (`inference.py`)
49 | - **GeminiLanguageModel**: Google Gemini API integration
50 | - **OpenAILanguageModel**: OpenAI API integration
51 | - **BaseLanguageModel**: Abstract base for language model implementations
52 | - **Schema Support**: Structured output generation for supported models
53 |
54 | #### 3. Annotation System (`annotation.py`)
55 | - **Annotator**: Core extraction orchestrator
56 | - **Text Processing**: Handles chunking and parallel processing
57 | - **Progress Tracking**: Monitors extraction progress
58 |
59 | #### 4. Resolver System (`resolver.py`)
60 | - **Purpose**: Parses raw LLM output into structured Extraction objects
61 | - **Fence Handling**: Extracts content from markdown code blocks
62 | - **Format Parsing**: Handles JSON/YAML parsing and validation
63 |
64 | #### 5. Chunking Engine (`chunking.py`)
65 | - **Text Segmentation**: Breaks long documents into processable chunks
66 | - **Buffer Management**: Handles max_char_buffer limits
67 | - **Overlap Strategy**: Maintains context across chunk boundaries
68 |
69 | #### 6. Visualization (`visualization.py`)
70 | - **HTML Generation**: Creates interactive visualization files
71 | - **Entity Highlighting**: Shows extractions in original context
72 | - **Scalable Interface**: Handles large result sets efficiently
73 |
74 | #### 7. I/O Operations (`io.py`)
75 | - **URL Download**: Fetches text from web URLs
76 | - **File Operations**: Saves results to JSONL format
77 | - **Document Loading**: Handles various input formats
78 |
79 | ### Key API Functions
80 |
81 | #### Primary Interface
82 | ```python
83 | lx.extract(
84 | text_or_documents, # Input text, URL, or Document objects
85 | prompt_description, # Extraction instructions
86 | examples, # Few-shot examples
87 | model_id="gemini-2.5-flash",
88 | # Configuration options...
89 | )
90 | ```
91 |
92 | #### Visualization
93 | ```python
94 | lx.visualize(jsonl_file_path) # Generate HTML visualization
95 | ```
96 |
97 | #### I/O Operations
98 | ```python
99 | lx.io.save_annotated_documents(results, output_name, output_dir)
100 | ```
101 |
102 | ## Configuration Parameters
103 |
104 | ### Core Parameters
105 | - **model_id**: LLM model selection
106 | - **api_key**: Authentication for cloud models
107 | - **temperature**: Sampling temperature (0.5 recommended)
108 | - **max_char_buffer**: Chunk size limit (1000 default)
109 |
110 | ### Performance Parameters
111 | - **max_workers**: Parallel processing workers (10 default)
112 | - **batch_length**: Chunks per batch (10 default)
113 | - **extraction_passes**: Multiple extraction attempts (1 default)
114 |
115 | ### Output Control
116 | - **format_type**: JSON or YAML output
117 | - **fence_output**: Code fence expectations
118 | - **use_schema_constraints**: Structured output enforcement
119 |
120 | ## Supported Models
121 |
122 | ### Google Gemini
123 | - **gemini-2.5-flash**: Recommended default (speed/cost/quality balance)
124 | - **gemini-2.5-pro**: For complex reasoning tasks
125 | - **Schema Support**: Full structured output support
126 | - **Rate Limits**: Tier 2 quota recommended for production
127 |
128 | ### OpenAI
129 | - **gpt-4o**: Supported with limitations
130 | - **Requirements**: `fence_output=True`, `use_schema_constraints=False`
131 | - **Note**: Schema constraints not yet implemented for OpenAI
132 |
133 | ### Local Models
134 | - **Ollama**: Built-in support
135 | - **Extension**: Can be extended to other local APIs
136 |
137 | ## Use Cases & Examples
138 |
139 | ### 1. Literary Analysis
140 | - **Characters**: Extract character names and emotional states
141 | - **Relationships**: Identify character interactions and metaphors
142 | - **Context**: Track narrative elements across long texts
143 |
144 | ### 2. Medical Document Processing
145 | - **Medications**: Extract drug names, dosages, routes, frequencies
146 | - **Clinical Notes**: Structure unstructured medical reports
147 | - **Compliance**: Maintain source grounding for medical accuracy
148 |
149 | ### 3. Radiology Reports
150 | - **Structured Data**: Convert free-text reports to structured findings
151 | - **Demo Available**: RadExtract on HuggingFace Spaces
152 |
153 | ### 4. Long Document Processing
154 | - **Full Novels**: Process complete books (e.g., Romeo & Juliet - 147k chars)
155 | - **Performance**: Parallel processing with multiple passes
156 | - **Visualization**: Handle hundreds of entities in context
157 |
158 | ## Technical Implementation Details
159 |
160 | ### Text Processing Pipeline
161 | 1. **Input Validation**: Validate text/documents and examples
162 | 2. **URL Handling**: Download content if URL provided
163 | 3. **Chunking**: Break long texts into manageable pieces
164 | 4. **Parallel Processing**: Distribute chunks across workers
165 | 5. **Multiple Passes**: Optional additional extraction rounds
166 | 6. **Resolution**: Parse LLM outputs into structured data
167 | 7. **Annotation**: Create AnnotatedDocument with source grounding
168 | 8. **Visualization**: Generate interactive HTML output
169 |
170 | ### Error Handling
171 | - **API Failures**: Graceful handling of LLM API errors
172 | - **Parsing Errors**: Robust JSON/YAML parsing with fallbacks
173 | - **Validation**: Schema validation for structured outputs
174 |
175 | ### Performance Optimization
176 | - **Concurrent Processing**: Parallel chunk processing
177 | - **Efficient Chunking**: Smart text segmentation
178 | - **Progressive Enhancement**: Multiple passes for better recall
179 | - **Memory Management**: Efficient handling of large documents
180 |
181 | ## MCP Server Design Implications
182 |
183 | Based on langextract's architecture, a FastMCP server should expose:
184 |
185 | ### Core Tools
186 | 1. **extract_text**: Main extraction function
187 | 2. **extract_from_url**: URL-based extraction
188 | 3. **visualize_results**: Generate HTML visualization
189 | 4. **validate_examples**: Validate extraction examples
190 |
191 | ### Configuration Management
192 | 1. **set_model**: Configure LLM model
193 | 2. **set_api_key**: Set authentication
194 | 3. **configure_extraction**: Set extraction parameters
195 |
196 | ### File Operations
197 | 1. **save_results**: Save to JSONL format
198 | 2. **load_results**: Load previous results
199 | 3. **export_visualization**: Generate and save HTML
200 |
201 | ### Advanced Features
202 | 1. **batch_extract**: Process multiple documents
203 | 2. **progressive_extract**: Multi-pass extraction
204 | 3. **compare_results**: Compare extraction results
205 |
206 | ### Resource Management
207 | - **Model Configurations**: Manage different model setups
208 | - **Example Templates**: Store reusable extraction examples
209 | - **Result Archives**: Access previous extraction results
210 |
211 | ## Dependencies & Installation
212 | - **Core**: Python 3.10+, requests, dotenv
213 | - **LLM APIs**: google-generativeai, openai
214 | - **Processing**: concurrent.futures for parallelization
215 | - **Visualization**: HTML/CSS/JS generation
216 | - **Format Support**: JSON, YAML parsing
217 |
218 | ## Licensing & Usage
219 | - **License**: Apache 2.0
220 | - **Disclaimer**: Not officially supported Google product
221 | - **Health Applications**: Subject to Health AI Developer Foundations Terms
222 | - **Citation**: Recommended for production/publication use
```
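
Tying the pipeline together, a minimal end-to-end sketch of the primary interface described above (assumes `LANGEXTRACT_API_KEY` is set in the environment; the sample texts and output name are illustrative):

```python
import langextract as lx

examples = [
    lx.data.ExampleData(
        text="Take 250mg ibuprofen every 4 hours",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="ibuprofen",
                attributes={"type": "pain_reliever"},
            )
        ],
    )
]

# Chunking, parallel processing, and resolution all happen inside extract().
result = lx.extract(
    text_or_documents="Patient prescribed 500mg amoxicillin twice daily",
    prompt_description="Extract medication names, dosages, and frequencies.",
    examples=examples,
    model_id="gemini-2.5-flash",
)

# Persist to JSONL; lx.visualize() can then render the interactive HTML.
lx.io.save_annotated_documents([result], output_name="extractions.jsonl", output_dir=".")
```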
--------------------------------------------------------------------------------
/src/langextract_mcp/server.py:
--------------------------------------------------------------------------------
```python
1 | """FastMCP server for langextract - optimized for Claude Code integration."""
2 |
3 | import os
4 | from typing import Any
5 | from pathlib import Path
6 | import hashlib
7 | import json
8 |
9 | import langextract as lx
10 | from fastmcp import FastMCP
11 | from fastmcp.resources import FileResource
12 | from fastmcp.exceptions import ToolError
13 | from pydantic import BaseModel, Field
14 |
15 |
16 | # Simple dictionary types for easier LLM usage
17 | # ExtractionItem: {"extraction_class": str, "extraction_text": str, "attributes": dict}
18 | # ExtractionExample: {"text": str, "extractions": list[ExtractionItem]}
19 |
20 |
21 | class ExtractionConfig(BaseModel):
22 | """Configuration for extraction parameters."""
23 | model_id: str = Field(default="gemini-2.5-flash", description="LLM model to use")
24 | max_char_buffer: int = Field(default=1000, description="Max characters per chunk")
25 | temperature: float = Field(default=0.5, description="Sampling temperature (0.0-1.0)")
26 | extraction_passes: int = Field(default=1, description="Number of extraction passes for better recall")
27 | max_workers: int = Field(default=10, description="Max parallel workers")
28 |
29 |
30 | # Initialize FastMCP server with Claude Code compatibility
31 | mcp = FastMCP(
32 | name="langextract-mcp",
33 | instructions="Extract structured information from unstructured text using Google Gemini models. "
34 | "Provides precise source grounding, interactive visualizations, and optimized caching for performance."
35 | )
36 |
37 |
38 | class LangExtractClient:
39 | """Optimized langextract client for MCP server usage.
40 |
41 | This client maintains persistent connections and caches expensive operations
42 | like schema generation and prompt templates for better performance in a
43 | long-running MCP server context.
44 | """
45 |
46 | def __init__(self):
47 | self._language_models: dict[str, Any] = {}
48 | self._schema_cache: dict[str, Any] = {}
49 | self._prompt_template_cache: dict[str, Any] = {}
50 | self._resolver_cache: dict[str, Any] = {}
51 |
52 | def _get_examples_hash(self, examples: list[dict[str, Any]]) -> str:
53 | """Generate a hash for caching based on examples."""
54 | examples_str = json.dumps(examples, sort_keys=True)
55 | return hashlib.md5(examples_str.encode()).hexdigest()
56 |
57 | def _get_language_model(self, config: ExtractionConfig, api_key: str, schema: Any | None = None, schema_hash: str | None = None) -> Any:
58 | """Get or create a cached language model instance."""
59 | # Include schema hash in cache key to prevent schema mutation conflicts
60 | model_key = f"{config.model_id}_{config.temperature}_{config.max_workers}_{schema_hash or 'no_schema'}"
61 |
62 | if model_key not in self._language_models:
63 | # Validate that only Gemini models are supported
64 | if not config.model_id.startswith('gemini'):
65 | raise ValueError(f"Only Gemini models are supported. Got: {config.model_id}")
66 |
67 | language_model = lx.inference.GeminiLanguageModel(
68 | model_id=config.model_id,
69 | api_key=api_key,
70 | temperature=config.temperature,
71 | max_workers=config.max_workers,
72 | gemini_schema=schema
73 | )
74 | self._language_models[model_key] = language_model
75 |
76 | return self._language_models[model_key]
77 |
78 | def _get_schema(self, examples: list[dict[str, Any]], model_id: str) -> tuple[Any, str]:
79 | """Get or create a cached schema for the examples.
80 |
81 | Returns:
82 | Tuple of (schema, examples_hash) for use in caching language models
83 | """
84 | if not model_id.startswith('gemini'):
85 | return None, ""
86 |
87 | examples_hash = self._get_examples_hash(examples)
88 | schema_key = f"{model_id}_{examples_hash}"
89 |
90 | if schema_key not in self._schema_cache:
91 | # Convert examples to langextract format
92 | langextract_examples = self._create_langextract_examples(examples)
93 |
94 | # Create prompt template to generate schema
95 | prompt_template = lx.prompting.PromptTemplateStructured(description="Schema generation")
96 | prompt_template.examples.extend(langextract_examples)
97 |
98 | # Generate schema
99 | schema = lx.schema.GeminiSchema.from_examples(prompt_template.examples)
100 | self._schema_cache[schema_key] = schema
101 |
102 | return self._schema_cache[schema_key], examples_hash
103 |
104 | def _get_resolver(self, format_type: str = "JSON") -> Any:
105 | """Get or create a cached resolver."""
106 | if format_type not in self._resolver_cache:
107 | resolver = lx.resolver.Resolver(
108 | fence_output=False,
109 | format_type=lx.data.FormatType.JSON if format_type == "JSON" else lx.data.FormatType.YAML,
110 | extraction_attributes_suffix="_attributes",
111 | extraction_index_suffix=None,
112 | )
113 | self._resolver_cache[format_type] = resolver
114 |
115 | return self._resolver_cache[format_type]
116 |
117 | def _create_langextract_examples(self, examples: list[dict[str, Any]]) -> list[lx.data.ExampleData]:
118 | """Convert dictionary examples to langextract ExampleData objects."""
119 | langextract_examples = []
120 |
121 | for example in examples:
122 | extractions = []
123 | for extraction_data in example["extractions"]:
124 | extractions.append(
125 | lx.data.Extraction(
126 | extraction_class=extraction_data["extraction_class"],
127 | extraction_text=extraction_data["extraction_text"],
128 | attributes=extraction_data.get("attributes", {})
129 | )
130 | )
131 |
132 | langextract_examples.append(
133 | lx.data.ExampleData(
134 | text=example["text"],
135 | extractions=extractions
136 | )
137 | )
138 |
139 | return langextract_examples
140 |
141 | def extract(
142 | self,
143 | text_or_url: str,
144 | prompt_description: str,
145 | examples: list[dict[str, Any]],
146 | config: ExtractionConfig,
147 | api_key: str
148 | ) -> lx.data.AnnotatedDocument:
149 | """Optimized extraction using cached components."""
150 | # Get or generate schema first
151 | schema, examples_hash = self._get_schema(examples, config.model_id)
152 |
153 | # Get cached components with schema-aware caching
154 | language_model = self._get_language_model(config, api_key, schema, examples_hash)
155 | resolver = self._get_resolver("JSON")
156 |
157 | # Convert examples
158 | langextract_examples = self._create_langextract_examples(examples)
159 |
160 | # Create prompt template
161 | prompt_template = lx.prompting.PromptTemplateStructured(
162 | description=prompt_description
163 | )
164 | prompt_template.examples.extend(langextract_examples)
165 |
166 | # Create annotator
167 | annotator = lx.annotation.Annotator(
168 | language_model=language_model,
169 | prompt_template=prompt_template,
170 | format_type=lx.data.FormatType.JSON,
171 | fence_output=False,
172 | )
173 |
174 | # Perform extraction
175 | if text_or_url.startswith(('http://', 'https://')):
176 | # Download text first
177 | text = lx.io.download_text_from_url(text_or_url)
178 | else:
179 | text = text_or_url
180 |
181 | return annotator.annotate_text(
182 | text=text,
183 | resolver=resolver,
184 | max_char_buffer=config.max_char_buffer,
185 | batch_length=10,
186 | additional_context=None,
187 | debug=False, # Disable debug for cleaner MCP output
188 | extraction_passes=config.extraction_passes,
189 | )
190 |
191 |
192 | # Global client instance for the server lifecycle
193 | _langextract_client = LangExtractClient()
194 |
195 |
196 | def _get_api_key() -> str | None:
197 | """Get API key from environment (server-side only for security)."""
198 | return os.environ.get("LANGEXTRACT_API_KEY")
199 |
200 |
201 | def _format_extraction_result(result: lx.data.AnnotatedDocument, config: ExtractionConfig, source_url: str | None = None) -> dict[str, Any]:
202 | """Format langextract result for MCP response."""
203 | extractions = []
204 |
205 | for extraction in result.extractions or []:
206 | extractions.append({
207 | "extraction_class": extraction.extraction_class,
208 | "extraction_text": extraction.extraction_text,
209 | "attributes": extraction.attributes,
210 | "start_char": getattr(extraction, 'start_char', None),
211 | "end_char": getattr(extraction, 'end_char', None),
212 | })
213 |
214 | response = {
215 | "document_id": result.document_id if result.document_id else "anonymous",
216 | "total_extractions": len(extractions),
217 | "extractions": extractions,
218 | "metadata": {
219 | "model_id": config.model_id,
220 | "extraction_passes": config.extraction_passes,
221 | "max_char_buffer": config.max_char_buffer,
222 | "temperature": config.temperature,
223 | }
224 | }
225 |
226 | if source_url:
227 | response["source_url"] = source_url
228 |
229 | return response
230 |
231 | # ============================================================================
232 | # Tools
233 | # ============================================================================
234 |
235 | @mcp.tool
236 | def extract_from_text(
237 | text: str,
238 | prompt_description: str,
239 | examples: list[dict[str, Any]],
240 | model_id: str = "gemini-2.5-flash",
241 | max_char_buffer: int = 1000,
242 | temperature: float = 0.5,
243 | extraction_passes: int = 1,
244 | max_workers: int = 10
245 | ) -> dict[str, Any]:
246 | """
247 | Extract structured information from text using langextract.
248 |
249 | Uses Large Language Models to extract structured information from unstructured text
250 | based on user-defined instructions and examples. Each extraction is mapped to its
251 | exact location in the source text for precise source grounding.
252 |
253 | Args:
254 | text: The text to extract information from
255 | prompt_description: Clear instructions for what to extract
256 | examples: List of example extractions to guide the model
257 | model_id: LLM model to use (default: "gemini-2.5-flash")
258 | max_char_buffer: Max characters per chunk (default: 1000)
259 | temperature: Sampling temperature 0.0-1.0 (default: 0.5)
260 | extraction_passes: Number of extraction passes for better recall (default: 1)
261 | max_workers: Max parallel workers (default: 10)
262 |
263 | Returns:
264 | Dictionary containing extracted entities with source locations and metadata
265 |
266 | Raises:
267 | ToolError: If extraction fails due to invalid parameters or API issues
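268 | 
269 |     Example (illustrative values):
270 |         extract_from_text(
271 |             text="Patient was given 250 mg amoxicillin.",
272 |             prompt_description="Extract medications and their dosages.",
273 |             examples=[{
274 |                 "text": "Take 400 mg ibuprofen daily.",
275 |                 "extractions": [{
276 |                     "extraction_class": "medication",
277 |                     "extraction_text": "ibuprofen",
278 |                     "attributes": {"dosage": "400 mg"}
279 |                 }]
280 |             }]
281 |         )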
268 | """
269 | try:
270 | if not examples:
271 | raise ToolError("At least one example is required for reliable extraction")
272 |
273 | if not prompt_description.strip():
274 | raise ToolError("Prompt description cannot be empty")
275 |
276 | if not text.strip():
277 | raise ToolError("Input text cannot be empty")
278 |
279 | # Validate that only Gemini models are supported
280 | if not model_id.startswith('gemini'):
281 | raise ToolError(
282 | f"Only Google Gemini models are supported. Got: {model_id}. "
283 | f"Use 'list_supported_models' tool to see available options."
284 | )
285 |
286 | # Create config object from individual parameters
287 | config = ExtractionConfig(
288 | model_id=model_id,
289 | max_char_buffer=max_char_buffer,
290 | temperature=temperature,
291 | extraction_passes=extraction_passes,
292 | max_workers=max_workers
293 | )
294 |
295 | # Get API key (server-side only for security)
296 | api_key = _get_api_key()
297 | if not api_key:
298 | raise ToolError(
299 | "API key required. Server administrator must set LANGEXTRACT_API_KEY environment variable."
300 | )
301 |
302 | # Perform optimized extraction using cached client
303 | result = _langextract_client.extract(
304 | text_or_url=text,
305 | prompt_description=prompt_description,
306 | examples=examples,
307 | config=config,
308 | api_key=api_key
309 | )
310 |
311 | return _format_extraction_result(result, config)
312 |
313 |     except ToolError:
314 |         raise  # Preserve the specific validation messages raised above
315 |     except ValueError as e:
316 |         raise ToolError(f"Invalid parameters: {str(e)}") from e
317 |     except Exception as e:
318 |         raise ToolError(f"Extraction failed: {str(e)}") from e
317 |
318 |
319 | @mcp.tool
320 | def extract_from_url(
321 | url: str,
322 | prompt_description: str,
323 | examples: list[dict[str, Any]],
324 | model_id: str = "gemini-2.5-flash",
325 | max_char_buffer: int = 1000,
326 | temperature: float = 0.5,
327 | extraction_passes: int = 1,
328 | max_workers: int = 10
329 | ) -> dict[str, Any]:
330 | """
331 | Extract structured information from text content at a URL.
332 |
333 | Downloads text from the specified URL and extracts structured information
334 | using Large Language Models. Ideal for processing web articles, documents,
335 | or any text content accessible via HTTP/HTTPS.
336 |
337 | Args:
338 | url: URL to download text from (must start with http:// or https://)
339 | prompt_description: Clear instructions for what to extract
340 | examples: List of example extractions to guide the model
341 | model_id: LLM model to use (default: "gemini-2.5-flash")
342 | max_char_buffer: Max characters per chunk (default: 1000)
343 | temperature: Sampling temperature 0.0-1.0 (default: 0.5)
344 | extraction_passes: Number of extraction passes for better recall (default: 1)
345 | max_workers: Max parallel workers (default: 10)
346 |
347 | Returns:
348 | Dictionary containing extracted entities with source locations and metadata
349 |
350 | Raises:
351 | ToolError: If URL is invalid, download fails, or extraction fails
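352 | 
353 |     Example (illustrative):
354 |         extract_from_url(
355 |             url="https://example.com/article.txt",
356 |             prompt_description="Extract people and their roles.",
357 |             examples=[...],  # same shape as in extract_from_text
358 |         )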
352 | """
353 | try:
354 | if not url.startswith(('http://', 'https://')):
355 | raise ToolError("URL must start with http:// or https://")
356 |
357 | if not examples:
358 | raise ToolError("At least one example is required for reliable extraction")
359 |
360 | if not prompt_description.strip():
361 | raise ToolError("Prompt description cannot be empty")
362 |
363 | # Validate that only Gemini models are supported
364 | if not model_id.startswith('gemini'):
365 | raise ToolError(
366 | f"Only Google Gemini models are supported. Got: {model_id}. "
367 | f"Use 'list_supported_models' tool to see available options."
368 | )
369 |
370 | # Create config object from individual parameters
371 | config = ExtractionConfig(
372 | model_id=model_id,
373 | max_char_buffer=max_char_buffer,
374 | temperature=temperature,
375 | extraction_passes=extraction_passes,
376 | max_workers=max_workers
377 | )
378 |
379 | # Get API key (server-side only for security)
380 | api_key = _get_api_key()
381 | if not api_key:
382 | raise ToolError(
383 | "API key required. Server administrator must set LANGEXTRACT_API_KEY environment variable."
384 | )
385 |
386 | # Perform optimized extraction using cached client
387 | result = _langextract_client.extract(
388 | text_or_url=url,
389 | prompt_description=prompt_description,
390 | examples=examples,
391 | config=config,
392 | api_key=api_key
393 | )
394 |
395 | return _format_extraction_result(result, config, source_url=url)
396 |
397 |     except ToolError:
398 |         raise  # Preserve the specific validation messages raised above
399 |     except ValueError as e:
400 |         raise ToolError(f"Invalid parameters: {str(e)}") from e
401 |     except Exception as e:
402 |         raise ToolError(f"URL extraction failed: {str(e)}") from e
401 |
402 |
403 | @mcp.tool
404 | def save_extraction_results(
405 | extraction_results: dict[str, Any],
406 | output_name: str,
407 | output_dir: str = "."
408 | ) -> dict[str, Any]:
409 | """
410 |     Save extraction results to a JSONL file for later use or further processing.
411 |
412 | Saves the extraction results in JSONL (JSON Lines) format, which is commonly
413 |     used for structured data and can be reloaded for further processing.
414 |
415 | Args:
416 | extraction_results: Results from extract_from_text or extract_from_url
417 | output_name: Name for the output file (without .jsonl extension)
418 | output_dir: Directory to save the file (default: current directory)
419 |
420 | Returns:
421 | Dictionary with file path and save confirmation
422 |
423 | Raises:
424 | ToolError: If save operation fails
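425 | 
426 |     Example (illustrative):
427 |         save_extraction_results(results, "meds", "./out")  # -> ./out/meds.jsonl
428 | 
429 |     Note: lx.visualize expects langextract's own document JSONL, so files saved
430 |     here may need conversion before use with generate_visualization.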
425 | """
426 | try:
427 | # Create output directory if it doesn't exist
428 | output_path = Path(output_dir)
429 | output_path.mkdir(parents=True, exist_ok=True)
430 |
431 | # Create full file path
432 | file_path = output_path / f"{output_name}.jsonl"
433 |
434 | # Save results to JSONL format
435 | import json
436 | with open(file_path, 'w', encoding='utf-8') as f:
437 | json.dump(extraction_results, f, ensure_ascii=False)
438 | f.write('\n')
439 |
440 | return {
441 | "message": "Results saved successfully",
442 | "file_path": str(file_path.absolute()),
443 | "total_extractions": extraction_results.get("total_extractions", 0)
444 | }
445 |
446 |     except Exception as e:
447 |         raise ToolError(f"Failed to save results: {str(e)}") from e
448 |
449 |
450 | @mcp.tool
451 | def generate_visualization(
452 | jsonl_file_path: str,
453 | output_html_path: str | None = None
454 | ) -> dict[str, Any]:
455 | """
456 | Generate interactive HTML visualization from extraction results.
457 |
458 | Creates an interactive HTML file that shows extracted entities highlighted
459 | in their original text context. The visualization is self-contained and
460 | can handle thousands of entities with color coding and hover details.
461 |
462 | Args:
463 | jsonl_file_path: Path to the JSONL file containing extraction results
464 | output_html_path: Optional path for the HTML output (default: auto-generated)
465 |
466 | Returns:
467 | Dictionary with HTML file path and generation details
468 |
469 | Raises:
470 | ToolError: If visualization generation fails
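471 | 
472 |     Example (illustrative):
473 |         generate_visualization("./out/meds.jsonl")
474 |         # -> ./out/meds_visualization.html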
471 | """
472 |     # Validate the input outside the try block so this ToolError's message
473 |     # is not re-wrapped by the generic handler below
474 |     input_path = Path(jsonl_file_path)
475 |     if not input_path.exists():
476 |         raise ToolError(f"Input file not found: {jsonl_file_path}")
477 |     try:
477 |
478 |         # lx.visualize may return an IPython HTML object; unwrap .data if present
479 |         html_content = lx.visualize(str(input_path))
480 |         html_content = getattr(html_content, 'data', html_content)
480 |
481 | # Determine output path
482 | if output_html_path:
483 | output_path = Path(output_html_path)
484 | else:
485 | output_path = input_path.parent / f"{input_path.stem}_visualization.html"
486 |
487 | # Ensure output directory exists
488 | output_path.parent.mkdir(parents=True, exist_ok=True)
489 |
490 | # Write HTML file
491 | with open(output_path, 'w', encoding='utf-8') as f:
492 | f.write(html_content)
493 |
494 | return {
495 | "message": "Visualization generated successfully",
496 | "html_file_path": str(output_path.absolute()),
497 | "file_size_bytes": len(html_content.encode('utf-8'))
498 | }
499 |
500 |     except Exception as e:
501 |         raise ToolError(f"Failed to generate visualization: {str(e)}") from e
502 |
503 | # ============================================================================
504 | # Resources
505 | # ============================================================================
506 |
507 | # Get the directory containing this server.py file
508 | server_dir = Path(__file__).parent
509 |
510 | readme_path = (server_dir / "resources" / "README.md").resolve()
511 | if readme_path.exists():
512 |     # Register under a file:// URI. Avoid print() to stdout here: stray output
513 |     # can corrupt stdio-based MCP transports.
514 | readme_resource = FileResource(
515 | uri=f"file://{readme_path.as_posix()}",
516 | path=readme_path, # Path to the actual file
517 | name="README File",
518 | description="The README for the langextract-mcp server.",
519 | mime_type="text/markdown",
520 | tags={"documentation"}
521 | )
522 | mcp.add_resource(readme_resource)
523 |
524 |
525 | supported_models_path = (server_dir / "resources" / "supported-models.md").resolve()
526 | if supported_models_path.exists():
527 |     # (Same stdout caveat as above.)
528 | supported_models_resource = FileResource(
529 | uri=f"file://{supported_models_path.as_posix()}",
530 | path=supported_models_path,
531 | name="Supported Models",
532 | description="The supported models for the langextract-mcp server.",
533 | mime_type="text/markdown",
534 | tags={"documentation"}
535 | )
536 | mcp.add_resource(supported_models_resource)
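537 | 
538 | # Clients can discover the resources above via standard MCP resource listing
539 | # and fetch them by their file:// URIs.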
537 |
538 |
539 | def main():
540 | """Main function to run the FastMCP server."""
541 | mcp.run()
542 |
543 |
544 | if __name__ == "__main__":
545 | main()
546 |
```