# Directory Structure

```
├── .gitignore
├── .python-version
├── .vscode
│   └── settings.json
├── cursor-rule-example.mdc
├── data
│   └── .gitignore
├── LICENSE
├── pdf_converter.py
├── pyproject.toml
├── README.md
├── server.py
└── uv.lock
```

# Files

--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------

```
3.12.10
```

--------------------------------------------------------------------------------
/data/.gitignore:
--------------------------------------------------------------------------------

```
*
!.gitignore
```

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

```
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
.embeddings/
.DS_Store
.env
mlruns/
mlartifacts/
.ruff_cache/
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

```markdown
# MCP PDF Reader

A Model Context Protocol (MCP) server that provides tools for reading and processing PDF documents. Built with Docling for document conversion and text extraction.

## Features

- **MCP Server** with tools for PDF document processing
- **Document Text Extraction**: Convert PDF content to clean Markdown format
- **Document Discovery**: List and access available PDF files

## Tools

The server provides two main tools:

- **`get_document_list`**: Returns a list of all available PDF files in the data directory
- **`get_document_text`**: Extracts and returns the full text content of a specified PDF file in Markdown format

## Install

Make sure you have [`uv` installed](https://docs.astral.sh/uv/getting-started/installation/).

Clone the repository:

```bash
git clone [email protected]:mlexpertio/mcp-pdf-reader.git
cd mcp-pdf-reader
```

Install Python:

```bash
uv python install 3.12.10
```

Create and activate a virtual environment:

```bash
uv venv
source .venv/bin/activate
```

Install dependencies:

```bash
uv sync
```

## Usage

### Add PDF Documents

Place your PDF files in the `data/` directory. The server will automatically detect and make them available through the tools.

### Run MCP Server

Start the MCP server:

```bash
python server.py
```

The server runs using stdio transport and can be integrated with any MCP-compatible client.

### Development and Testing

Use the MCP inspector to test the server:

```bash
mcp dev server.py
```

This will open a web interface where you can test the available tools and inspect their responses.

## Use in VSCode/Cursor

You can use the MCP integration in your editor.
Go to `Tools & Integrations` -> `New MCP Server` and edit the `mcp.json` file to include the following:

```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "/opt/homebrew/bin/uv", // path to your uv binary
      "args": ["run", "--directory", "PATH_TO_YOUR_PROJECT", "server.py"]
    }
  }
}
```

## License

See LICENSE file for details.

```

--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------

```toml
[project]
name = "mcp-pdf-reader"
version = "0.1.0"
description = "Local MCP server that extracts text from PDF files"
readme = "README.md"
requires-python = ">=3.12.10"
dependencies = [
    "docling>=2.43.0",
    "mcp[cli]>=1.12.3",
    "pypdfium2>=4.30.0",
]

[dependency-groups]
dev = [
    "ruff>=0.12.7",
]
```

--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------

```json
{
  "[python]": {
    "editor.formatOnSave": true,
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.codeActionsOnSave": {
      "source.fixAll": "explicit",
      "source.organizeImports": "explicit"
    }
  },
  "notebook.formatOnSave.enabled": true,
  "notebook.codeActionsOnSave": {
    "notebook.source.fixAll": "explicit",
    "notebook.source.organizeImports": "explicit",
  },
  "terminal.integrated.fontFamily": "MesloLGS NF",
  "terminal.integrated.fontLigatures.enabled": true,
  "jupyter.interactiveWindow.textEditor.executeSelection": true,
  "python.analysis.autoImportCompletions": true,
  "python.analysis.typeCheckingMode": "basic"
}
```

--------------------------------------------------------------------------------
/server.py:
--------------------------------------------------------------------------------

```python
import os
from pathlib import Path

from mcp.server.fastmcp import FastMCP

from pdf_converter import convert_to_markdown, create_pdf_converter

mcp = FastMCP()


# APP_HOME can be overridden through the environment; it defaults to the
# directory that contains this file.
APP_HOME = Path(os.getenv("APP_HOME", Path(__file__).parent))
DATA_DIR = APP_HOME / "data"


# Build the converter once at import time so every tool call reuses it.
doc_converter = create_pdf_converter()


@mcp.tool()
def get_document_text(filename: str) -> str:
    """
    Use this tool to get the content of a document given its filename.

    Args:
        filename (str): The filename of the document to get.

    Returns:
        str: The content of the document in Markdown format.
    """
    # The filename is joined to DATA_DIR as given; no validation is performed.
    doc_path = DATA_DIR / filename
    return convert_to_markdown(doc_path, doc_converter)


@mcp.tool()
def get_document_list() -> list[str]:
    """
    Use this tool to get the list of documents.

    Returns:
        list[str]: The list of document filenames.
    """
    # Only top-level *.pdf files in DATA_DIR are listed (no recursion).
    return sorted(path.name for path in DATA_DIR.glob("*.pdf"))


if __name__ == "__main__":
    mcp.run(transport="stdio")

```

--------------------------------------------------------------------------------
/pdf_converter.py:
--------------------------------------------------------------------------------

```python
from pathlib import Path

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableStructureOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def create_pdf_converter() -> DocumentConverter:
    """Build a PDF-only converter with OCR and table-structure recognition disabled."""
    return DocumentConverter(
        allowed_formats=[InputFormat.PDF],
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=PdfPipelineOptions(
                    do_ocr=False,
                    do_table_structure=False,
                    table_structure_options=TableStructureOptions(
                        do_cell_matching=False
                    ),
                ),
                backend=PyPdfiumDocumentBackend,
            )
        },
    )


def convert_to_markdown(pdf_path: Path, converter: DocumentConverter) -> str:
    """Convert the PDF at pdf_path and return its content as Markdown."""
    document = converter.convert(pdf_path).document
    return document.export_to_markdown()

```
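
In `server.py`, `get_document_text` joins the client-supplied `filename` directly onto `DATA_DIR`, so a value such as `../notes.pdf` or an absolute path would point outside the data directory. The sketch below shows one possible way to guard against that. The helper name `resolve_document_path` is hypothetical and not part of the repository; it relies only on the standard library.

```python
from pathlib import Path


def resolve_document_path(data_dir: Path, filename: str) -> Path:
    """Return the resolved path of `filename` inside `data_dir`.

    Hypothetical helper (not in the repository). Raises ValueError for names
    that escape `data_dir` or are not PDFs, and FileNotFoundError when the
    file does not exist.
    """
    base = data_dir.resolve()
    candidate = (base / filename).resolve()
    # Reject anything that resolves outside the data directory,
    # e.g. "../secret.pdf" or an absolute path.
    if not candidate.is_relative_to(base):
        raise ValueError(f"{filename!r} is outside the data directory")
    # Assumption: only PDF files should be served by the tool.
    if candidate.suffix.lower() != ".pdf":
        raise ValueError(f"{filename!r} is not a PDF file")
    if not candidate.is_file():
        raise FileNotFoundError(f"{filename!r} not found")
    return candidate
```

If adopted, `get_document_text` could call `resolve_document_path(DATA_DIR, filename)` in place of the plain `DATA_DIR / filename` join, so that invalid names fail before the converter is invoked.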