mlexpertio/mcp-pdf-reader # codebase.md

# Directory Structure

```
├── .gitignore
├── .python-version
├── .vscode
│   └── settings.json
├── cursor-rule-example.mdc
├── data
│   └── .gitignore
├── LICENSE
├── pdf_converter.py
├── pyproject.toml
├── README.md
├── server.py
└── uv.lock
```

# Files

--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------

```
3.12.10
```

--------------------------------------------------------------------------------
/data/.gitignore:
--------------------------------------------------------------------------------

```
*
!.gitignore
```

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

```
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
.embeddings/
.DS_Store
.env
mlruns/
mlartifacts/
.ruff_cache/
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

````markdown
# MCP PDF Reader

A Model Context Protocol (MCP) server that provides tools for reading and processing PDF documents. Built with Docling for document conversion and text extraction.

## Features

- **MCP Server** with tools for PDF document processing
- **Document Text Extraction**: Convert PDF content to clean Markdown format
- **Document Discovery**: List and access available PDF files

## Tools

The server provides two tools:

- **`get_document_list`**: Returns a list of all available PDF files in the data directory
- **`get_document_text`**: Extracts and returns the full text content of a specified PDF file in Markdown format

## Install

Make sure you have [`uv` installed](https://docs.astral.sh/uv/getting-started/installation/).

Clone the repository:

```bash
git clone git@github.com:mlexpertio/mcp-pdf-reader.git
cd mcp-pdf-reader
```

Install Python:

```bash
uv python install 3.12.10
```

Create and activate a virtual environment:

```bash
uv venv
source .venv/bin/activate
```

Install dependencies:

```bash
uv sync
```

## Usage

### Add PDF Documents

Place your PDF files in the `data/` directory. The server detects them automatically and makes them available through the tools.

### Run MCP Server

Start the MCP server:

```bash
python server.py
```

The server runs using stdio transport and can be integrated with any MCP-compatible client.

### Development and Testing

Use the MCP inspector to test the server:

```bash
mcp dev server.py
```

This opens a web interface where you can test the available tools and inspect their responses.

## Use in VSCode/Cursor

You can use the MCP integration in your editor. Open `Tools & Integrations` -> `New MCP Server` and edit the `mcp.json` file to include the following, setting `command` to the path of your `uv` binary and replacing `PATH_TO_YOUR_PROJECT` with the path to this project:

```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "/opt/homebrew/bin/uv",
      "args": ["run", "--directory", "PATH_TO_YOUR_PROJECT", "server.py"]
    }
  }
}
```

## License

See the LICENSE file for details.
````

--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------

```toml
[project]
name = "mcp-pdf-reader"
version = "0.1.0"
description = "Local MCP server that extracts text from PDF files"
readme = "README.md"
requires-python = ">=3.12.10"
dependencies = [
    "docling>=2.43.0",
    "mcp[cli]>=1.12.3",
    "pypdfium2>=4.30.0",
]

[dependency-groups]
dev = [
    "ruff>=0.12.7",
]
```

--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------

```json
{
  "[python]": {
    "editor.formatOnSave": true,
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.codeActionsOnSave": {
      "source.fixAll": "explicit",
      "source.organizeImports": "explicit"
    }
  },
  "notebook.formatOnSave.enabled": true,
  "notebook.codeActionsOnSave": {
    "notebook.source.fixAll": "explicit",
    "notebook.source.organizeImports": "explicit"
  },
  "terminal.integrated.fontFamily": "MesloLGS NF",
  "terminal.integrated.fontLigatures.enabled": true,
  "jupyter.interactiveWindow.textEditor.executeSelection": true,
  "python.analysis.autoImportCompletions": true,
  "python.analysis.typeCheckingMode": "basic"
}
```

--------------------------------------------------------------------------------
/server.py:
--------------------------------------------------------------------------------

```python
import os
from pathlib import Path

from mcp.server.fastmcp import FastMCP

from pdf_converter import convert_to_markdown, create_pdf_converter

mcp = FastMCP()


# Project root; override it with the APP_HOME environment variable.
APP_HOME = Path(os.getenv("APP_HOME", str(Path(__file__).parent)))
# Directory that is scanned for PDF files.
DATA_DIR = APP_HOME / "data"


# Build the converter once at import time and reuse it for every request.
doc_converter = create_pdf_converter()


@mcp.tool()
def get_document_text(filename: str) -> str:
    """
    Use this tool to get the content of a document given its filename.

    Args:
        filename (str): The filename of the document to get.

    Returns:
        str: The content of the document in Markdown format.
    """
    # Note: the filename is joined to DATA_DIR without any validation.
    doc_path = DATA_DIR / filename
    return convert_to_markdown(doc_path, doc_converter)


@mcp.tool()
def get_document_list() -> list[str]:
    """
    Use this tool to get the list of documents.

    Returns:
        list[str]: The list of document filenames.
    """
    # Only files directly inside DATA_DIR are listed; subdirectories are not searched.
    return sorted(path.name for path in DATA_DIR.glob("*.pdf"))


if __name__ == "__main__":
    # Serve over stdio so an MCP client can launch the server as a subprocess.
    mcp.run(transport="stdio")
```
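`get_document_text` joins the caller-supplied `filename` to `DATA_DIR` as-is, so a name such as `../notes.pdf` would point outside the data directory. One possible guard is sketched below. `resolve_pdf_path` is a hypothetical helper that does not exist in the repository, and the `.pdf` suffix check is an assumption about what the tool should accept.

```python
from pathlib import Path


def resolve_pdf_path(data_dir: Path, filename: str) -> Path:
    """Return the resolved path of `filename` inside `data_dir`, or raise."""
    base = data_dir.resolve()
    candidate = (base / filename).resolve()
    # Reject absolute paths and names like "../x.pdf" that escape the data directory.
    if not candidate.is_relative_to(base):
        raise ValueError(f"{filename!r} is outside the data directory")
    # Assumption: only files with a .pdf suffix should be readable.
    if candidate.suffix.lower() != ".pdf":
        raise ValueError(f"{filename!r} is not a PDF file")
    if not candidate.is_file():
        raise FileNotFoundError(f"{filename!r} was not found in the data directory")
    return candidate
```

Inside the tool, `doc_path = resolve_pdf_path(DATA_DIR, filename)` could replace the plain join, so invalid names fail before the converter is invoked.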

--------------------------------------------------------------------------------
/pdf_converter.py:
--------------------------------------------------------------------------------

```python
from pathlib import Path

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableStructureOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def create_pdf_converter() -> DocumentConverter:
    """Create a Docling converter that accepts only PDF input."""
    return DocumentConverter(
        allowed_formats=[InputFormat.PDF],
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=PdfPipelineOptions(
                    # OCR is disabled, so the text comes from the PDF's embedded text layer.
                    do_ocr=False,
                    # Table structure recognition is disabled.
                    do_table_structure=False,
                    table_structure_options=TableStructureOptions(
                        do_cell_matching=False
                    ),
                ),
                # Parse PDFs with the pypdfium2 backend.
                backend=PyPdfiumDocumentBackend,
            )
        },
    )


def convert_to_markdown(pdf_path: Path, converter: DocumentConverter) -> str:
    """Convert the PDF at `pdf_path` and return its content as Markdown."""
    document = converter.convert(pdf_path).document
    return document.export_to_markdown()
```
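`convert_to_markdown` converts the file again on every call. If the same document is requested repeatedly, the result could be cached. The sketch below is one way to do that under stated assumptions: `cached_convert` and `_cache` are hypothetical names, the conversion function is passed in as a plain callable, and a file is treated as unchanged when its modification time and size are unchanged.

```python
from pathlib import Path
from typing import Callable

# Keyed by (resolved path, modification time in ns, size in bytes).
# Entries for older versions of a file are never evicted in this sketch.
_cache: dict[tuple[str, int, int], str] = {}


def cached_convert(pdf_path: Path, convert: Callable[[Path], str]) -> str:
    """Return the cached conversion of `pdf_path`, converting only when needed."""
    stat = pdf_path.stat()
    key = (str(pdf_path.resolve()), stat.st_mtime_ns, stat.st_size)
    if key not in _cache:
        _cache[key] = convert(pdf_path)
    return _cache[key]
```

With the functions above, a call might look like `cached_convert(doc_path, lambda p: convert_to_markdown(p, doc_converter))`.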