deepseekmine/mcp-pdf-reader # codebase.md

This is page 1 of 2. Use http://codebase.md/deepseekmine/mcp-pdf-reader?lines=false&page={x} to view the full context.

# Directory Structure

```
├── .idea
│   ├── vcs.xml
│   └── workspace.xml
├── pdf_server.py
├── README.md
└── requirements.txt
```

# Files

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

```markdown
# 📄 MCP PDF Server

A PDF file reading server based on [FastMCP](https://github.com/minimaxir/fastmcp).

Supports PDF text extraction, OCR recognition, and image extraction via the MCP protocol, with a built-in web debugger for easy testing.

---

## 🚀 Features

- **read_pdf_text**  
  Extracts embedded (selectable) text from a PDF, page by page.

- **read_by_ocr**  
  Uses OCR to recognize text from scanned or image-based PDFs.

- **read_pdf_images**  
  Extracts all images from a specified PDF page (Base64 encoded output).

---

## 📂 Project Structure

```
mcp-pdf-server/
├── pdf_resources/        # Directory for PDF files (created at startup if missing)
├── pdf_server.py         # Main server entry point
├── requirements.txt      # Python dependencies
└── README.md             # Project documentation
```

---

## ⚙️ Installation

Recommended Python version: 3.9+

```bash
pip install pymupdf mcp
```

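> Note: PyMuPDF's OCR relies on Tesseract. To use the OCR tool, install Tesseract with the language data you need and make its `tessdata` folder discoverable, for example through the `TESSDATA_PREFIX` environment variable.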

---

## 🔦 Start the Server

Run the following command:

```bash
python pdf_server.py
```

You should see logs like:

```
Serving on http://127.0.0.1:6231
```

---

## 🌐 Web Debugging Interface

Open your browser and visit:

```
http://127.0.0.1:6231
```

- Select a tool from the left panel
- Fill in parameters on the right panel
- Click "Run" to test the tool

No coding required — easily debug and test via the web UI.

---

## 🛠️ API Tool List

| Tool | Description | Input Parameters | Returns |
|:-----|:------------|:-----------------|:--------|
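| `read_pdf_text` | Extracts embedded text from PDF pages | `file_path`, `start_page` (default 1), `end_page` (default: last page) | `page_count` and `pages`, a list of `{page_number, text}` entries |
| `read_by_ocr` | Recognizes text via OCR, falling back to embedded text if OCR fails on a page | `file_path`, `start_page`, `end_page`, `language` (default `eng`), `dpi` (default 300) | `text`, `page_count`, and `extracted_pages` |
| `read_pdf_images` | Extracts images from a PDF page | `file_path`, `page_number` (default 1) | `images`, a list of entries with `image_id`, `width`, `height`, `format`, and `image_b64` |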

---

## 📝 Example Usage

Extract text from pages 1 to 5:

```bash
mcp run read_pdf_text --args '{"file_path": "pdf_resources/example.pdf", "start_page": 1, "end_page": 5}'
```

Perform OCR recognition on page 1:

```bash
mcp run read_by_ocr --args '{"file_path": "pdf_resources/example.pdf", "start_page": 1, "end_page": 1, "language": "eng"}'
```

Extract all images from page 3:

```bash
mcp run read_pdf_images --args '{"file_path": "pdf_resources/example.pdf", "page_number": 3}'
```

---

## 📢 Notes

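- `file_path` is used as given: pass an absolute path, or a path relative to the directory the server was started from (for example `pdf_resources/example.pdf`).
- OCR needs OCR support in the environment (see Installation). If OCR fails on a page, `read_by_ocr` logs a warning and falls back to that page's embedded text.
- Large files can be slow and memory-hungry, especially with OCR at a high `dpi`; request smaller page ranges if calls time out.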

---

## 📜 License

This project is licensed under the MIT License.  
For commercial use, please credit the original source.

---
```
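The README states that `read_pdf_images` returns Base64-encoded images. A minimal client-side sketch of turning such a result back into files might look like the following. It assumes the result shape produced by `pdf_server.py` (an `images` list whose entries carry `image_id`, `format`, and `image_b64`); the helper name `save_images` and the output directory are hypothetical and not part of the project.

```python
import base64
from pathlib import Path
from typing import Any, Dict, List


def save_images(result: Dict[str, List[Dict[str, Any]]], out_dir: str) -> List[Path]:
    """Decode each Base64 image in a read_pdf_images-style result and write it to out_dir.

    `save_images` is a hypothetical helper, not part of the server.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    written: List[Path] = []
    for image in result.get("images", []):
        # Build the file name from the server-provided id and format, e.g. "p3_img0.png".
        path = target / f"{image['image_id']}.{image['format']}"
        path.write_bytes(base64.b64decode(image["image_b64"]))
        written.append(path)
    return written
```

Keeping the decoding on the client side means the server response stays plain JSON, at the cost of roughly a one-third size increase from Base64 encoding.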

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------

```
pymupdf
mcp

```

--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------

```
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="VcsDirectoryMappings">
    <mapping directory="$PROJECT_DIR$/.." vcs="Git" />
    <mapping directory="$PROJECT_DIR$" vcs="Git" />
  </component>
</project>
```

--------------------------------------------------------------------------------
/pdf_server.py:
--------------------------------------------------------------------------------

```python
from typing import Optional, List, Dict, Any
import os, json
import base64
import fitz
from mcp.server.fastmcp import FastMCP
import uuid
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('mcp-pdf-server')

PDF_DIR = os.path.join(os.getcwd(), "pdf_resources")
os.makedirs(PDF_DIR, exist_ok=True)

mcp = FastMCP("PDF Reader", version="1.0.0", description="MCP server for PDF reading")

pdf_resources = {}
pdf_cache = {}


@mcp.tool()
def read_pdf_text(file_path: str, start_page: int = 1,
        end_page: Optional[int] = None) -> Dict[str, Any]:
    """
    Read embedded text from a PDF file, one entry per page.

    Args:
        file_path: Path to the PDF file
        start_page: Start page (1-based)
        end_page: End page (inclusive)

    Returns:
        Dict containing:
            - page_count: total number of pages
            - pages: list of {page_number, text}
    """
    if not os.path.exists(file_path):
        raise ValueError(f"File not found: {file_path}")

    doc = fitz.open(file_path)
    total_pages = len(doc)

    if end_page is None:
        end_page = total_pages
    # Accept a reversed range, then clamp so both bounds stay inside the document
    if start_page > end_page:
        start_page, end_page = end_page, start_page
    start_page = max(start_page, 1)
    end_page = min(end_page, total_pages)

    pages = []

    for page_num in range(start_page - 1, end_page):
        page = doc[page_num]
        page_text = page.get_text()

        pages.append({"page_number": page_num + 1, "text": page_text.strip()})

    doc.close()

    # The dict is already JSON-serializable, so no dumps/loads round trip is needed
    return {"page_count": total_pages, "pages": pages}


@mcp.tool()
def read_by_ocr(file_path: str, start_page: int = 1, end_page: Optional[int] = None,
        language: str = "eng", dpi: int = 300) -> Dict[str, Any]:
    """
    Read text from PDF file using OCR.
    Args:
        file_path: Path to the PDF file
        start_page: Start page (1-based)
        end_page: End page (inclusive)
        language: OCR language code
        dpi: OCR DPI
    Returns:
        Dict with extracted text, page_count, extracted_pages
    """
    if not os.path.exists(file_path):
        raise ValueError(f"File not found: {file_path}")

    doc = fitz.open(file_path)
    total_pages = len(doc)

    if end_page is None:
        end_page = total_pages
    # Accept a reversed range, then clamp so both bounds stay inside the document
    if start_page > end_page:
        start_page, end_page = end_page, start_page
    start_page = max(start_page, 1)
    end_page = min(end_page, total_pages)

    text_content = ""
    for page_num in range(start_page - 1, end_page):
        page = doc[page_num]
        try:
            textpage = page.get_textpage_ocr(flags=3, language=language, dpi=dpi, full=True)
            page_text = page.get_text(textpage=textpage)
        except Exception as e:
            logger.warning(f"OCR failed on page {page_num + 1}, fallback to normal text: {e}")
            page_text = page.get_text()

        text_content += page_text + "\n\n"

    doc.close()

    return {"text": text_content, "page_count": total_pages,
        "extracted_pages": list(range(start_page, end_page + 1))}


@mcp.tool()
def read_pdf_images(file_path: str, page_number: int = 1) -> Dict[str, List[Dict[str, Any]]]:
    """
    Extract images from a specific page in PDF.
    Args:
        file_path: Path to the PDF file
        page_number: Page number (1-based)
    Returns:
        Dict with list of images (base64 format)
    """
    if not os.path.exists(file_path):
        raise ValueError(f"File not found: {file_path}")

    doc = fitz.open(file_path)
    total_pages = len(doc)

    if page_number < 1 or page_number > total_pages:
        doc.close()  # release the file handle before reporting the error
        raise ValueError(f"Page number {page_number} out of range (1-{total_pages})")

    page = doc[page_number - 1]
    image_list = page.get_images(full=True)

    images = []
    for idx, img in enumerate(image_list):
        xref = img[0]
        base_image = doc.extract_image(xref)
        image_data = base_image["image"]
        image_ext = base_image["ext"]
        image_b64 = base64.b64encode(image_data).decode('utf-8')

        images.append({"image_id": f"p{page_number}_img{idx}", "width": base_image["width"],
            "height": base_image["height"], "format": image_ext, "image_b64": image_b64})

    doc.close()

    return {"images": images}


if __name__ == "__main__":
    logger.info("Starting MCP PDF Server...")
    logger.info(f"PDF resources will be stored in: {PDF_DIR}")
    mcp.run()

```
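Both `read_pdf_text` and `read_by_ocr` turn 1-based, inclusive page arguments into a 0-based `range`. A standalone sketch of that normalization as a pure function can make the edge cases easy to unit-test without opening a PDF. The names `normalize_page_range` and `page_indices` are hypothetical, and clamping after the swap is one possible policy, not necessarily the author's intent.

```python
from typing import Optional, Tuple


def normalize_page_range(start_page: int, end_page: Optional[int],
                         total_pages: int) -> Tuple[int, int]:
    """Return a 1-based inclusive (start, end) pair limited to 1..total_pages.

    If the returned start is greater than end, the requested range does not
    overlap the document and no pages should be read.
    """
    if end_page is None:
        end_page = total_pages
    # Accept reversed input such as (5, 2) by swapping the bounds.
    if start_page > end_page:
        start_page, end_page = end_page, start_page
    # Clamp last, so a swap can never push a bound outside the document.
    return max(start_page, 1), min(end_page, total_pages)


def page_indices(start_page: int, end_page: Optional[int], total_pages: int) -> range:
    """Convert the normalized 1-based range into 0-based page indices."""
    start, end = normalize_page_range(start_page, end_page, total_pages)
    return range(start - 1, end)
```

For a 5-page document, `list(page_indices(2, 3, 5))` gives `[1, 2]`, while a request entirely past the end, such as `page_indices(10, 12, 5)`, gives an empty range instead of an out-of-bounds index.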