trafflux/pdf-reader-mcp # codebase.md

# Directory Structure

```
├── Dockerfile
├── README.md
├── requirements.txt
└── src
    └── server.py
```

# Files

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

````markdown
# PDF Reader MCP Server

A Model Context Protocol (MCP) server that provides tools for reading and extracting text from PDF files, supporting both local files and URLs.

## Author

Philip Van de Walker  
Email: [email protected]  
GitHub: https://github.com/trafflux

## Features

- Read text content from local PDF files
- Read text content from PDF URLs
- Error handling for corrupt or invalid PDFs
- Volume mounting for accessing local PDFs
- Auto-detection of PDF encoding
- Standardized JSON output format

## Installation

1. Clone the repository:

```bash
git clone https://github.com/trafflux/pdf-reader-mcp.git
cd pdf-reader-mcp
```

2. Build the Docker image:

```bash
docker build -t mcp/pdf-reader .
```

## Usage

### Running the Server

To run the server with access to local PDF files:

```bash
docker run -i --rm -v /path/to/pdfs:/pdfs mcp/pdf-reader
```

Replace `/path/to/pdfs` with the host directory that contains your PDF files; it is mounted at `/pdfs` inside the container.

If you do not need local PDF files:

```bash
docker run -i --rm mcp/pdf-reader
```

### MCP Configuration

Add to your MCP settings configuration:

```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "-v",
        "/path/to/pdfs:/pdfs",
        "mcp/pdf-reader"
      ],
      "disabled": false,
      "autoApprove": []
    }
  }
}
```

Without local PDF files:

```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "mcp/pdf-reader"],
      "disabled": false,
      "autoApprove": []
    }
  }
}
```

### Available Tools

1. `read_local_pdf`

   - Purpose: Read text content from a local PDF file
   - Input:
     ```json
     {
       "path": "/pdfs/document.pdf"
     }
     ```
   - Output:
     ```json
     {
       "success": true,
       "data": {
         "text": "Extracted content..."
       }
     }
     ```

2. `read_pdf_url`
   - Purpose: Read text content from a PDF URL
   - Input:
     ```json
     {
       "url": "https://example.com/document.pdf"
     }
     ```
   - Output:
     ```json
     {
       "success": true,
       "data": {
         "text": "Extracted content..."
       }
     }
     ```

## Error Handling

The server handles various error cases with clear error messages:

- Invalid or corrupt PDF files
- Missing files
- Failed URL requests
- Permission issues
- Network connectivity problems

Error responses follow the format:

```json
{
  "success": false,
  "error": "Detailed error message"
}
```

## Dependencies

- Python 3.11+
- PyPDF2: PDF parsing and text extraction
- requests: HTTP client for fetching PDFs from URLs
- MCP SDK: Model Context Protocol implementation

## Project Structure

```
.
├── Dockerfile          # Container configuration
├── README.md          # This documentation
├── requirements.txt   # Python dependencies
└── src/
    ├── __init__.py    # Package initialization
    └── server.py      # Main server implementation
```

## License

Copyright 2025 Philip Van de Walker

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Contact

For questions, issues, or contributions, please contact Philip Van de Walker:

- Email: [email protected]
- GitHub: https://github.com/trafflux
````
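
The README documents a single response envelope for both tools: `success` plus either `data.text` or `error`. As a rough illustration of how a caller might consume that envelope, here is a minimal sketch. The names `PdfReaderError` and `unwrap_pdf_text` are hypothetical and not part of this repository; the sketch assumes the tool result has already been decoded into a Python `dict`.

```python
from typing import Any, Dict


class PdfReaderError(RuntimeError):
    """Hypothetical error type raised when the envelope reports a failure."""


def unwrap_pdf_text(result: Dict[str, Any]) -> str:
    """Return data.text from a success envelope, or raise on an error envelope."""
    if result.get("success") is True:
        return result["data"]["text"]
    # Error envelopes carry a human-readable message under "error".
    raise PdfReaderError(result.get("error", "unknown error"))


if __name__ == "__main__":
    ok = {"success": True, "data": {"text": "Extracted content..."}}
    print(unwrap_pdf_text(ok))
```

Raising on `success: false` keeps the failure path explicit, so calling code cannot mistake an error message for extracted text.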

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------

```
PyPDF2>=3.0.0
requests>=2.31.0
```

--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------

```dockerfile
# PDF Reader MCP Server Dockerfile
# Author: Philip Van de Walker
# Email: [email protected]
# Repo: https://github.com/trafflux/pdf-reader-mcp
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Use Python 3.11 slim image as base
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install git for MCP SDK installation
RUN apt-get update && \
    apt-get install -y git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Install MCP SDK directly from GitHub repository
RUN pip install git+https://github.com/modelcontextprotocol/python-sdk.git

# Install project Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy source code into container
COPY src/ .

# Command to run the server
# The container expects a volume mount at /pdfs for accessing local PDF files
ENTRYPOINT ["python", "server.py"]
```
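
The image's `ENTRYPOINT` starts `server.py`, and local PDFs are expected under a `/pdfs` volume mount. The two `docker run` variants from the README differ only in whether that mount is present. The sketch below shows one way a launcher script might assemble the argument list; `build_docker_args` is a hypothetical helper, and the `mcp/pdf-reader` default is simply the tag used in the README's build step.

```python
from typing import List, Optional


def build_docker_args(host_pdf_dir: Optional[str] = None,
                      image: str = "mcp/pdf-reader") -> List[str]:
    """Build the `docker` argument list for starting the server over stdio."""
    # -i keeps stdin open for the stdio transport; --rm removes the container on exit.
    args = ["run", "-i", "--rm"]
    if host_pdf_dir:
        # Mount the host directory at /pdfs, the path the container expects.
        args += ["-v", f"{host_pdf_dir}:/pdfs"]
    args.append(image)
    return args


if __name__ == "__main__":
    print(build_docker_args("/path/to/pdfs"))
```

The returned list matches the `args` array of the README's MCP settings, so it can be passed to `subprocess.run(["docker", *args])` or serialized into a settings file.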

--------------------------------------------------------------------------------
/src/server.py:
--------------------------------------------------------------------------------

```python
"""
PDF Reader MCP Server
--------------------

A Model Context Protocol (MCP) server that provides tools for reading and extracting text from PDF files.
Supports both local files and URLs, with comprehensive error handling and standardized output format.

Author: Philip Van de Walker
Email: [email protected]
Repo: https://github.com/trafflux/pdf-reader-mcp

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

This module implements an MCP server with two main tools:
- read_local_pdf: Extracts text from local PDF files
- read_pdf_url: Extracts text from PDFs accessed via URLs

The server uses FastMCP for simplified tool registration and standardized error handling.
All text extraction is done using PyPDF2 with proper error handling for various edge cases.
"""

import io
import logging
from typing import Any, Dict

import PyPDF2
import requests
from mcp.server.fastmcp import FastMCP

logger = logging.getLogger(__name__)

# Upper bound, in seconds, on how long a URL fetch may block. Without a
# timeout, requests.get() can wait indefinitely on an unresponsive host.
REQUEST_TIMEOUT_SECONDS = 30

# Create server instance using FastMCP
mcp = FastMCP("pdf-reader")


def extract_text_from_pdf(pdf_file) -> str:
    """Extract text content from a binary file-like object containing a PDF."""
    try:
        reader = PyPDF2.PdfReader(pdf_file)
        # Treat a page with no extractable text as an empty string.
        pages = [page.extract_text() or "" for page in reader.pages]
        return "\n".join(pages).strip()
    except Exception as e:
        logger.error(f"Failed to extract text from PDF: {e}")
        raise ValueError(f"Failed to extract text from PDF: {e}") from e


@mcp.tool()
async def read_local_pdf(path: str) -> Dict[str, Any]:
    """Read text content from a local PDF file."""
    try:
        with open(path, "rb") as file:
            text = extract_text_from_pdf(file)
        return {"success": True, "data": {"text": text}}
    except FileNotFoundError:
        logger.error(f"PDF file not found: {path}")
        return {"success": False, "error": f"PDF file not found: {path}"}
    except Exception as e:
        # Covers permission errors and extraction failures alike.
        logger.error(str(e))
        return {"success": False, "error": str(e)}


@mcp.tool()
async def read_pdf_url(url: str) -> Dict[str, Any]:
    """Read text content from a PDF URL."""
    try:
        response = requests.get(url, timeout=REQUEST_TIMEOUT_SECONDS)
        response.raise_for_status()
        # Wrap the downloaded bytes so PyPDF2 can read them like a file.
        pdf_file = io.BytesIO(response.content)
        text = extract_text_from_pdf(pdf_file)
        return {"success": True, "data": {"text": text}}
    except requests.RequestException as e:
        logger.error(f"Failed to fetch PDF from URL: {e}")
        return {"success": False, "error": f"Failed to fetch PDF from URL: {e}"}
    except Exception as e:
        logger.error(str(e))
        return {"success": False, "error": str(e)}


def main() -> None:
    """Run the MCP server."""
    try:
        mcp.run()
    except Exception as e:
        logger.error(f"Error starting server: {e}")
        raise


if __name__ == "__main__":
    main()
```
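
`read_local_pdf` opens whatever path it is given. If the intent is to expose only the mounted `/pdfs` directory, one possible hardening step is to resolve the requested path and reject anything that lands outside that root. The sketch below is an assumption about how that could look, not behaviour of the server above: `PDF_ROOT` and `resolve_under_root` are hypothetical names, and `/pdfs` is taken from the README's volume mount.

```python
from pathlib import Path

# Assumption: the mount point used in the README's `docker run -v ...:/pdfs`.
PDF_ROOT = Path("/pdfs")


def resolve_under_root(path: str, root: Path = PDF_ROOT) -> Path:
    """Resolve `path` relative to `root` and reject results outside `root`."""
    root_resolved = root.resolve()
    # Joining with an absolute path replaces the root, so absolute inputs
    # are checked against the root as well.
    candidate = (root_resolved / path).resolve()
    if candidate != root_resolved and root_resolved not in candidate.parents:
        raise ValueError(f"Path escapes {root_resolved}: {path}")
    return candidate


if __name__ == "__main__":
    print(resolve_under_root("document.pdf"))
    try:
        resolve_under_root("../etc/passwd")
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Resolving before comparing means `..` segments and symlinks are collapsed first, so a plain string-prefix check is not needed.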