xraywu/mcp-pdf-extraction-server # codebase.md

# Directory Structure

```
├── .gitignore
├── .python-version
├── pyproject.toml
├── README.md
├── src
│   └── pdf_extraction
│       ├── __init__.py
│       ├── __main__.py
│       ├── pdf_extractor.py
│       └── server.py
└── uv.lock
```

# Files

--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------

```
3.11

```

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

```
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv

```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

```markdown
# PDF Extraction MCP Server (Claude Code Fork)

MCP server to extract contents from PDF files, with fixes for Claude Code CLI installation.

This fork includes critical fixes for installing and running the server with Claude Code (the CLI version).

## What's Different in This Fork

1. **Added `__main__.py`** - Enables the package to be run as a module with `python -m pdf_extraction`
2. **Claude Code specific instructions** - Clear installation steps that work with Claude Code CLI
3. **Tested installation process** - Verified working with `claude mcp add` command

## Components

### Tools

The server implements one tool:
- **extract-pdf-contents**: Extract contents from a local PDF file
  - Takes `pdf_path` as a required string argument (local file path)
  - Takes `pages` as an optional string argument (comma-separated, 1-based page numbers; negative values count from the end, e.g. `-1` for the last page)
  - Supports both PDF text extraction and OCR for scanned documents

## Installation for Claude Code CLI

### Prerequisites

- Python 3.11 or higher
- pip or conda
- Claude Code CLI installed (`claude` command)

### Step 1: Clone and Install

```bash
# Clone this fork
git clone https://github.com/lh/mcp-pdf-extraction-server.git
cd mcp-pdf-extraction-server

# Install in development mode
pip install -e .
```

### Step 2: Find the Installed Command

```bash
# Check where pdf-extraction was installed
which pdf-extraction
# Example output: /opt/homebrew/Caskroom/miniconda/base/bin/pdf-extraction
```

### Step 3: Add to Claude Code

```bash
# Add the server using the full path from above
claude mcp add pdf-extraction /opt/homebrew/Caskroom/miniconda/base/bin/pdf-extraction

# Verify it was added
claude mcp list
```

### Step 4: Use in Claude

```bash
# Start a new Claude session
claude

# In Claude, type:
/mcp

# You should see:
# MCP Server Status
# • pdf-extraction: connected
```

## Usage Example

Once connected, you can ask Claude to extract PDF contents:

```
"Can you extract the content from the PDF at /path/to/document.pdf?"

"Extract pages 1, 2, 3 and the last page from /path/to/document.pdf"
```

## Troubleshooting

### Server Not Connecting

1. Make sure you started a NEW Claude session after adding the server
2. Verify the command path is correct: `ls -la $(which pdf-extraction)`
3. Test the command directly (it should hang waiting for input): `pdf-extraction`

### Module Not Found Errors

If you get Python import errors:
1. Make sure you're using the same Python environment where you installed the package
2. Try using the full Python path (the `--` keeps `-m` from being parsed as an option of `claude` itself): `claude mcp add pdf-extraction -- /path/to/python -m pdf_extraction`

### Installation Issues

If `pip install -e .` fails:
1. Make sure you have Python 3.11+: `python --version`
2. Try creating a fresh virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -e .
   ```

## For Claude Desktop Users

This fork is specifically for Claude Code CLI. If you're using Claude Desktop (the GUI app), please refer to the [original repository](https://github.com/xraywu/mcp-pdf-extraction-server) for installation instructions.

## Dependencies

- mcp>=1.2.0
- pypdf2>=3.0.1
- pytesseract>=0.3.10 (for OCR support)
- Pillow>=10.0.0
- pydantic>=2.10.1,<3.0.0
- pymupdf>=1.24.0

## Contributing

Contributions are welcome! The main change in this fork is the addition of `__main__.py` to make the package runnable as a module.

## License

Same as the original repository.

## Credits

Original server by [@xraywu](https://github.com/xraywu)
Claude Code fixes by [@lh](https://github.com/lh)
```
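
Steps 2 and 3 of the README above (locate the installed console script, then register its full path) can also be scripted. The following is a minimal sketch, assuming the `pdf-extraction` script is on `PATH` in the active environment; `build_add_command` is a hypothetical helper and not part of this repository. It only builds the argument list and does not invoke `claude`.

```python
import shutil
from typing import List, Optional


def build_add_command(server_name: str, executable: Optional[str] = None) -> List[str]:
    """Return the argv for `claude mcp add <name> <full path>`.

    Hypothetical helper for illustration; not part of the repository.
    """
    # Equivalent of `which pdf-extraction`: search PATH for the console script
    # unless the caller already knows the full path.
    path = executable or shutil.which(server_name)
    if path is None:
        raise FileNotFoundError(
            f"{server_name!r} was not found on PATH; is the right environment active?"
        )
    return ["claude", "mcp", "add", server_name, path]
```

Passing the result to `subprocess.run` would perform the registration; printing it first is a cheap way to confirm that the path points into the environment where the package was installed.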

--------------------------------------------------------------------------------
/src/pdf_extraction/__main__.py:
--------------------------------------------------------------------------------

```python
from . import main

if __name__ == "__main__":
    main()
```

--------------------------------------------------------------------------------
/src/pdf_extraction/__init__.py:
--------------------------------------------------------------------------------

```python
from . import server
import asyncio

def main():
    """Main entry point for the package."""
    asyncio.run(server.main())

# Optionally expose other important items at package level
__all__ = ['main', 'server']
```

--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------

```toml
[project]
name = "pdf-extraction"
version = "0.1.0"
description = "MCP server to extract contents from PDF files"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
 "mcp>=1.2.0",
 "pypdf2>=3.0.1",
 "pytesseract>=0.3.10",
 "Pillow>=10.0.0",
 "pydantic>=2.10.1,<3.0.0",
 "pymupdf>=1.24.0"
]

[build-system]
requires = [ "hatchling",]
build-backend = "hatchling.build"

[project.scripts]
pdf-extraction = "pdf_extraction:main"

```
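
The `[project.scripts]` entry above maps the `pdf-extraction` command to the string `pdf_extraction:main`, meaning "import the module `pdf_extraction` and call its `main` attribute". The sketch below illustrates how such a `module:attribute` string can be resolved. It is a simplified illustration, not the code that installers actually generate, and `resolve_entry_point` is a hypothetical name.

```python
import importlib
from typing import Any


def resolve_entry_point(spec: str) -> Any:
    """Resolve a "module:attribute" string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj: Any = importlib.import_module(module_name)
    # The attribute part may be dotted, e.g. "pkg.mod:Class.method".
    if attr_path:
        for attr in attr_path.split("."):
            obj = getattr(obj, attr)
    return obj


# Demonstrated on a standard-library target so the snippet runs anywhere;
# with this package installed, "pdf_extraction:main" would resolve the same way.
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```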

--------------------------------------------------------------------------------
/src/pdf_extraction/server.py:
--------------------------------------------------------------------------------

```python
from mcp.server.models import InitializationOptions
import mcp.types as types
from mcp.server import NotificationOptions, Server
import mcp.server.stdio
from .pdf_extractor import PDFExtractor


# MCP server configuration
server = Server("pdf_extraction")

# MCP tool configuration
@server.list_tools()
async def handle_list_tools() -> list[types.Tool]:
    """
    Tools for PDF contents extraction
    """
    return [
        types.Tool(
            name="extract-pdf-contents",
            description="Extract contents from a local PDF file, given comma-separated page numbers. Negative page numbers are supported.",
            inputSchema={
                "type": "object",
                "properties": {
                    "pdf_path": {"type": "string"},
                    "pages": {"type": "string"},
                },
                "required": ["pdf_path"],
            },
        )
    ]

@server.call_tool()
async def handle_call_tool(
    name: str, arguments: dict | None
) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
    """
    Tools for PDF content extraction
    """
    if name == "extract-pdf-contents":
        if not arguments:
            raise ValueError("Missing arguments")

        pdf_path = arguments.get("pdf_path")
        pages = arguments.get("pages")

        if not pdf_path:
            raise ValueError("Missing file path")


        extractor = PDFExtractor()
        extracted_text = extractor.extract_content(pdf_path, pages)
        return [
            types.TextContent(
                type="text",
                text=extracted_text,
            )
        ]
    else:
        raise ValueError(f"Unknown tool: {name}")


# Main entry point that starts the server
async def main():
    # Run the server using stdin/stdout streams
    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await server.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="pdf_extraction",
                server_version="0.1.0",
                capabilities=server.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={},
                ),
            ),
        )
```
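
For orientation, a client invokes the tool defined above by sending a JSON-RPC `tools/call` request over the server's stdin. The sketch below builds such a message using only the standard library. It assumes the usual MCP `initialize` handshake has already taken place, and `build_tool_call` is a hypothetical helper rather than part of this repository or the `mcp` package.

```python
import json
from typing import Optional


def build_tool_call(pdf_path: str, pages: Optional[str] = None, request_id: int = 1) -> str:
    """Serialize a `tools/call` request for the extract-pdf-contents tool."""
    # `pdf_path` is required by the tool's input schema; `pages` is optional.
    arguments = {"pdf_path": pdf_path}
    if pages is not None:
        arguments["pages"] = pages
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "extract-pdf-contents", "arguments": arguments},
    }
    return json.dumps(message)


# Example: request the first and the last page of a (placeholder) file.
print(build_tool_call("/path/to/document.pdf", pages="1,-1"))
```

In practice an MCP client library takes care of framing and the handshake; the sketch only shows which fields `handle_call_tool` ends up receiving as `name` and `arguments`.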

--------------------------------------------------------------------------------
/src/pdf_extraction/pdf_extractor.py:
--------------------------------------------------------------------------------

```python
from PyPDF2 import PdfReader
from pytesseract import image_to_string
from PIL import Image
import fitz  # PyMuPDF
import io
from typing import List, Optional


class PDFExtractor:
    """PDF content extractor that supports both regular and scanned PDFs."""
    
    def __init__(self):
        pass

    def is_scanned_pdf(self, pdf_path: str) -> bool:
        """Check whether the PDF is a scanned (image-only) document."""
        reader = PdfReader(pdf_path)
        for page in reader.pages:
            if page.extract_text().strip():
                return False
        return True

    def extract_text_from_scanned(self, pdf_path: str, pages: List[int]) -> str:
        """Extract text from a scanned PDF using OCR."""
        extracted_text = []

        # Open the document in a context manager so the file handle is released.
        with fitz.open(pdf_path) as doc:
            for page_num in pages:
                page = doc.load_page(page_num)
                pix = page.get_pixmap()
                img = Image.open(io.BytesIO(pix.tobytes()))

                # OCR with Simplified Chinese and English language data
                text = image_to_string(img, lang='chi_sim+eng')
                extracted_text.append(f"Page {page_num + 1}:\n{text}")

        return "\n\n".join(extracted_text)

    def extract_text_from_normal(self, pdf_path: str, pages: List[int]) -> str:
        """Extract text from a regular (text-based) PDF."""
        reader = PdfReader(pdf_path)
        extracted_text = []
        
        for page_num in pages:
            page = reader.pages[page_num]
            extracted_text.append(f"Page {page_num + 1}:\n{page.extract_text()}")
        
        return "\n\n".join(extracted_text)

    def parse_pages(self, pages_str: Optional[str], total_pages: int) -> List[int]:
        """Parse a comma-separated page string into sorted, 0-based page indices."""
        if not pages_str:
            return list(range(total_pages))
        
        pages = []
        for part in pages_str.split(','):
            if not part.strip():
                continue
            try:
                page_num = int(part.strip())
                if page_num < 0:
                    page_num = total_pages + page_num
                elif page_num > 0:
                    page_num = page_num - 1
                else:
                    # Page numbers are 1-based; this ValueError is caught below, so "0" is skipped.
                    raise ValueError("PDF page number cannot be 0")
                if 0 <= page_num < total_pages:
                    pages.append(page_num)
            except ValueError:
                continue
        return sorted(set(pages))

    def extract_content(self, pdf_path: str, pages: Optional[str]) -> str:
        """Main entry point: extract the contents of a PDF."""
        if not pdf_path:
            raise ValueError("PDF path must not be empty")

        try:
            # Check whether the PDF is a scanned document
            is_scanned = self.is_scanned_pdf(pdf_path)

            # Parse the requested page numbers
            reader = PdfReader(pdf_path)
            total_pages = len(reader.pages)
            selected_pages = self.parse_pages(pages, total_pages)

            # Choose the extraction method based on the PDF type
            if is_scanned:
                text = self.extract_text_from_scanned(pdf_path, selected_pages)
            else:
                text = self.extract_text_from_normal(pdf_path, selected_pages)

            return text
        except Exception as e:
            raise ValueError(f"Failed to extract PDF contents: {str(e)}") from e
```
```
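
To make the page-selection rules of `parse_pages` concrete, here is a standalone sketch that mirrors its logic without importing PyPDF2 or the package itself. It is a re-statement for illustration, not the class method. Inputs are 1-based, negative numbers count from the end, and anything that is not a plain integer (including range syntax such as `1-3`) is silently skipped.

```python
from typing import List, Optional


def parse_pages(pages_str: Optional[str], total_pages: int) -> List[int]:
    """Map a comma-separated page string to sorted, unique 0-based indices."""
    if not pages_str:
        # No selection means every page.
        return list(range(total_pages))

    pages = set()
    for part in pages_str.split(","):
        try:
            number = int(part.strip())
        except ValueError:
            continue  # empty or non-integer parts are ignored
        if number == 0:
            continue  # page numbers are 1-based, so 0 is invalid
        index = total_pages + number if number < 0 else number - 1
        if 0 <= index < total_pages:
            pages.add(index)
    return sorted(pages)


assert parse_pages("1,3,-1", 5) == [0, 2, 4]   # first, third, and last page
assert parse_pages(None, 3) == [0, 1, 2]       # default: all pages
assert parse_pages("1-3", 5) == []             # ranges are not understood
assert parse_pages("0, 9, x", 5) == []         # invalid entries are dropped
```

One consequence worth knowing: when every entry is invalid, the selection is empty and the extractor returns an empty string rather than raising an error.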