# Directory Structure
```
├── .dockerignore
├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── bug_report.yml
│   │   ├── config.yml
│   │   ├── feature_request.yml
│   │   └── question.yml
│   ├── pull_request_template.md
│   └── workflows
│       ├── deploy.yml
│       ├── ping-server.yml
│       └── tests.yml
├── .gitignore
├── .python-version
├── assets
│   ├── cursor-usage.png
│   ├── langdock-usage.png
│   └── paperclip.svg
├── CONTRIBUTING.md
├── docker-compose.prod.yml
├── docker-compose.yml
├── Dockerfile
├── LICENSE.md
├── README.md
├── requirements.txt
├── src
│   ├── core
│   │   ├── __init__.py
│   │   ├── arxiv.py
│   │   ├── openalex.py
│   │   ├── osf.py
│   │   └── providers.py
│   ├── prompts.py
│   ├── server.py
│   ├── tools.py
│   └── utils
│       ├── __init__.py
│       ├── pdf2md.py
│       └── sanitize_api_queries.py
└── tests
    ├── __init__.py
    ├── test_metadata_retrieval.py
    └── test_pdf_retrieval.py
```
# Files
--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------
```
1 | 3.12.7
2 |
```
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
1 | venv
2 | env
3 | .venv
4 | .env
5 | __pycache__/
6 | *.xlsx
7 | *.csv
8 | .DS_Store
9 | .luarc.json
10 |
11 | *node_modules/
12 |
```
--------------------------------------------------------------------------------
/.dockerignore:
--------------------------------------------------------------------------------
```
1 | # Git
2 | .git
3 | .gitignore
4 | .gitattributes
5 |
6 | # Docker
7 | Dockerfile*
8 | docker-compose*
9 | .dockerignore
10 |
11 | # Documentation
12 | *.md
13 | docs/
14 |
15 | # Python
16 | __pycache__/
17 | *.py[cod]
18 | *$py.class
19 | *.so
20 | .Python
21 | build/
22 | develop-eggs/
23 | dist/
24 | downloads/
25 | eggs/
26 | .eggs/
27 | lib/
28 | lib64/
29 | parts/
30 | sdist/
31 | var/
32 | wheels/
33 | *.egg-info/
34 | .installed.cfg
35 | *.egg
36 | MANIFEST
37 |
38 | # Virtual environments
39 | .env
40 | .venv
41 | env/
42 | venv/
43 | ENV/
44 | env.bak/
45 | venv.bak/
46 | .conda/
47 |
48 | # IDEs
49 | .cursor/
50 | .vscode/
51 | .idea/
52 | *.swp
53 | *.swo
54 | *~
55 |
56 | # OS
57 | .DS_Store
58 | .DS_Store?
59 | ._*
60 | .Spotlight-V100
61 | .Trashes
62 | ehthumbs.db
63 | Thumbs.db
64 |
65 | # Testing
66 | .tox/
67 | .nox/
68 | .coverage
69 | .pytest_cache/
70 | htmlcov/
71 | .cache
72 | tests/
73 |
74 | # Jupyter Notebook
75 | .ipynb_checkpoints
76 |
77 | # Environment variables
78 | .env*
79 | env.prod.template
80 |
81 | # Logs
82 | *.log
83 | logs/
84 |
85 | # Temporary files
86 | *.tmp
87 | *.temp
88 | .tmp/
89 | .temp/
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
```markdown
1 | <div align="center">
2 | <img src="assets/paperclip.svg" alt="Paperclip Logo" width="48" height="48">
3 |
4 | # Paperclip MCP Server
5 | </div>
6 |
7 | > 📎 Paperclip is a Model Context Protocol (MCP) server that enables searching and retrieving research papers from arXiv, the Open Science Framework (OSF) API, and OpenAlex.
8 |
9 | [![Tests](https://github.com/matsjfunke/paperclip/actions/workflows/tests.yml/badge.svg)](https://github.com/matsjfunke/paperclip/actions/workflows/tests.yml)
10 | [![Health Check](https://github.com/matsjfunke/paperclip/actions/workflows/ping-server.yml/badge.svg)](https://github.com/matsjfunke/paperclip/actions/workflows/ping-server.yml)
11 | [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE.md)
12 |
13 | ## Quick Start
14 |
15 | Set up the Paperclip MCP server in your MCP host via the server URL `https://paperclip.matsjfunke.com/mcp`; no authentication is needed.
16 |
17 | Example JSON for cursor:
18 |
19 | ```json
20 | {
21 | "mcpServers": {
22 | "paperclip": {
23 | "url": "https://paperclip.matsjfunke.com/mcp"
24 | }
25 | }
26 | }
27 | ```
28 |
29 | ## Table of Contents
30 |
31 | - [Quick Start](#quick-start)
32 | - [Usage Examples](#usage-examples)
33 | - [Supported Paper Providers](#supported-paper-providers)
34 | - [Preprint Providers to be added](#preprint-providers-to-be-added)
35 | - [Contributing](#contributing)
36 |
37 | ## Usage Examples
38 |
39 | Here are examples of Paperclip integrated with popular MCP clients:
40 |
41 | **Cursor IDE:**
42 |
43 | ![Cursor usage](assets/cursor-usage.png)
44 |
45 | **Langdock:**
46 |
47 | ![Langdock usage](assets/langdock-usage.png)
48 |
49 | ## Supported Paper Providers
50 |
51 | - [AfricArXiv](https://africarxiv.org)
52 | - [AgriXiv](https://agrirxiv.org)
53 | - [ArabXiv](https://arabixiv.org)
54 | - [arXiv](https://arxiv.org)
55 | - [BioHackrXiv](http://guide.biohackrxiv.org/about.html)
56 | - [BodoArXiv](https://bodoarxiv.wordpress.com)
57 | - [COP Preprints](https://www.collegeofphlebology.com)
58 | - [EarthArXiv](https://eartharxiv.org)
59 | - [EcoEvoRxiv](https://www.ecoevorxiv.com)
60 | - [ECSarxiv](https://ecsarxiv.org)
61 | - [EdArXiv](https://edarxiv.org)
62 | - [EngrXiv](https://engrxiv.org)
63 | - [FocusArchive](https://osf.io/preprints/focusarchive)
64 | - [Frenxiv](https://frenxiv.org)
65 | - [INArxiv](https://rinarxiv.lipi.go.id)
66 | - [IndiaRxiv](https://osf.io/preprints/indiarxiv)
67 | - [Law Archive](https://library.law.yale.edu/research/law-archive)
68 | - [LawArXiv](https://osf.io/preprints/lawarxiv)
69 | - [LISSA](https://osf.io/preprints/lissa)
70 | - [LiveData](https://osf.io/preprints/livedata)
71 | - [MarXiv](https://osf.io/preprints/marxiv)
72 | - [MediArXiv](https://mediarxiv.com)
73 | - [MetaArXiv](https://osf.io/preprints/metaarxiv)
74 | - [MindRxiv](https://osf.io/preprints/mindrxiv)
75 | - [NewAddictionSx](https://osf.io/preprints/newaddictionsx)
76 | - [NutriXiv](https://niblunc.org)
77 | - [OpenAlex](https://openalex.org)
78 | - [OSF Preprints](https://osf.io/preprints/osf)
79 | - [PaleoRxiv](https://osf.io/preprints/paleorxiv)
80 | - [PsyArXiv](https://psyarxiv.com)
81 | - [SocArXiv](https://socopen.org/welcome)
82 | - [SportRxiv](http://sportrxiv.org)
83 | - [Thesis Commons](https://osf.io/preprints/thesiscommons)
84 |
85 | ## Preprint Providers to be added
86 |
87 | [List of preprint repositories](https://en.wikipedia.org/wiki/List_of_preprint_repositories)
88 |
89 | - bioRxiv & medRxiv share the same underlying API structure (https://api.biorxiv.org/pubs/[server]/[interval]/[cursor], where [server] can be "biorxiv" or "medrxiv")
90 | - ChemRxiv
91 | - [hal open science](https://hal.science/?lang=en)
92 | - [research square](https://www.researchsquare.com/)
93 | - [osf preprints](https://osf.io/preprints)
94 | - [preprints.org](https://preprints.org)
95 | - [science open](https://www.scienceopen.com/)
96 | - [SSRN](https://www.ssrn.com/index.cfm/en/the-lancet/)
97 | - [synthical](https://synthical.com/feed/new)
98 |
99 | ## Contributing
100 |
101 | Interested in contributing to Paperclip? Check out our [Contributing Guide](CONTRIBUTING.md) for development setup instructions, testing procedures, and more!
102 |
```
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
```markdown
1 | MIT License
2 |
3 | Copyright (c) 2025 Mats Julius Funke
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
```
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
```markdown
1 | # Contributing to Paperclip
2 |
3 | Thank you for your interest in contributing to Paperclip! This guide will help you get started with development.
4 |
5 | ## Development Setup
6 |
7 | ### Prerequisites
8 |
9 | - Python 3.12+
10 | - pip
11 |
12 | ### Installation
13 |
14 | 1. **Fork and clone the repository**
15 |
16 | - Fork this repository on GitHub
17 | - Clone your fork:
18 |
19 | ```bash
20 | git clone https://github.com/YOUR_USERNAME/paperclip.git
21 | cd paperclip
22 | ```
23 |
24 | 2. **Create and activate virtual environment**
25 |
26 | ```bash
27 | python -m venv .venv
28 | source .venv/bin/activate # On Windows: .venv\Scripts\activate
29 | ```
30 |
31 | 3. **Install dependencies**
32 |
33 | ```bash
34 | pip install -r requirements.txt
35 | ```
36 |
37 | 4. **Add new dependencies (if needed)**
38 | ```bash
39 | pip install <new-lib>
40 | pip freeze > requirements.txt
41 | ```
42 |
43 | ### Running the Server with Hot Reload
44 |
45 | ```bash
46 | # Run with hot reload
47 | watchmedo auto-restart --patterns="*.py" --recursive -- python src/server.py
48 | # Run Server using fastmcp
49 | fastmcp run src/server.py --transport http --host 0.0.0.0 --port 8000
50 | # use docker compose
51 | docker-compose up --build
52 | ```
53 |
54 | The server will automatically restart when you make changes to any `.py` files.
55 |
56 | ## Testing
57 |
58 | Use the [MCP Inspector](https://inspector.modelcontextprotocol.io/) to interact with the server.
59 |
60 | ```bash
61 | pnpx @modelcontextprotocol/inspector
62 | ```
63 |
64 | ### Unit Tests
65 |
66 | Run the unit tests to verify the functionality of individual components:
67 |
68 | ```bash
69 | # Run all tests
70 | python -m unittest discover tests -v
71 | ```
72 |
73 | ## Contributing Changes
74 |
75 | ### Creating a Pull Request
76 |
77 | 1. **Create a feature branch**
78 |
79 | ```bash
80 | git checkout -b feat/your-feature-name
81 | # or for bug fixes:
82 | git checkout -b fix/issue-description
83 | ```
84 |
85 | 2. **Make your changes**
86 |
87 | - Write your code following the existing style
88 | - Add tests for new functionality
89 | - Update documentation as needed
90 |
91 | 3. **Commit your changes and push to your fork**
92 |
93 | ```bash
94 | git push origin feat/your-feature-name
95 | ```
96 |
97 | 4. **Open a Pull Request**
98 |
99 | - Go to the original repository on GitHub
100 | - Click "New Pull Request"
101 | - Select your branch from your fork
102 | - Fill out the PR template with:
103 | - Clear description of changes
104 | - Link to related issues (if applicable)
105 | - Testing steps you've performed
106 |
107 | ### Pull Request Guidelines
108 |
109 | - **Keep PRs focused**: One feature or fix per PR
110 | - **Write clear descriptions**: Explain what changes you made and why
111 | - **Test your changes**: Ensure all tests pass before submitting
112 | - **Update documentation**: Add or update docs for new features
113 | - **Be responsive**: Address feedback and questions promptly
114 |
```
--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
```python
1 | # Tests package for Paperclip MCP Server
```
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/config.yml:
--------------------------------------------------------------------------------
```yaml
1 | blank_issues_enabled: true
2 | contact_links:
3 | - name: 📖 Documentation
4 | url: https://github.com/matsjfunke/paperclip/blob/main/README.md
5 | about: Check the README for documentation
6 |
```
--------------------------------------------------------------------------------
/src/utils/__init__.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Utility functions for the paperclip MCP server.
3 | """
4 |
5 | from .pdf2md import extract_pdf_to_markdown
6 | from .sanitize_api_queries import sanitize_api_queries
7 |
8 | __all__ = ["sanitize_api_queries", "extract_pdf_to_markdown"]
9 |
```
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
```yaml
1 | version: "3.8"
2 |
3 | services:
4 | paperclip:
5 | build:
6 | context: .
7 | image: paperclip-image
8 | container_name: paperclip
9 | ports:
10 | - 8000:8000
11 | volumes:
12 | - ./:/app # mount local backend dir to /app in container to enable live reloading of code changes
13 | command: watchmedo auto-restart --patterns="*.py" --recursive -- python src/server.py --transport http --host 0.0.0.0 --port 8000
14 |
```
--------------------------------------------------------------------------------
/assets/paperclip.svg:
--------------------------------------------------------------------------------
```
1 | <svg xmlns="http://www.w3.org/2000/svg" width="48" height="48" viewBox="0 0 24 24" fill="none" stroke="white" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="icon icon-tabler icons-tabler-outline icon-tabler-paperclip"><path stroke="none" d="M0 0h24v24H0z" fill="none"/><path d="M15 7l-6.5 6.5a1.5 1.5 0 0 0 3 3l6.5 -6.5a3 3 0 0 0 -6 -6l-6.5 6.5a4.5 4.5 0 0 0 9 9l6.5 -6.5" /></svg>
```
--------------------------------------------------------------------------------
/.github/pull_request_template.md:
--------------------------------------------------------------------------------
```markdown
1 | ## Description
2 |
3 | Brief description of what this PR does.
4 |
5 | ## Changes Made
6 |
7 | - [ ] List specific changes
8 | - [ ] Include any new features
9 | - [ ] Mention any bug fixes
10 |
11 | ## Testing
12 |
13 | - [ ] All existing tests pass
14 | - [ ] Added tests for new functionality (if applicable)
15 | - [ ] Tested basic functionality manually with the MCP Inspector
16 |
17 | ## Related Issues
18 |
19 | Closes #[issue-number] (if applicable)
20 |
21 | ## Additional Notes
22 |
23 | Any additional context or considerations for reviewers.
24 |
```
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
```dockerfile
1 | FROM python:3.12-slim-bullseye
2 |
3 | WORKDIR /app
4 |
5 | COPY requirements.txt .
6 | RUN pip install --no-cache-dir -r requirements.txt
7 |
8 | # Update package lists and install system libraries (assumed to be required by the PDF processing dependencies)
9 | RUN apt-get update && \
10 | apt-get upgrade -y && \
11 | apt-get install -y --no-install-recommends \
12 | libgl1-mesa-glx \
13 | libglib2.0-0 \
14 | && apt-get clean && \
15 | rm -rf /var/lib/apt/lists/*
16 |
17 | RUN pip install --upgrade pip
18 | COPY ./src .
19 |
20 | EXPOSE 8000
```
--------------------------------------------------------------------------------
/.github/workflows/tests.yml:
--------------------------------------------------------------------------------
```yaml
1 | name: Tests
2 |
3 | on:
4 | push:
5 | branches: [main]
6 | workflow_dispatch:
7 |
8 | jobs:
9 | test:
10 | runs-on: ubuntu-latest
11 |
12 | steps:
13 | - name: Checkout code
14 | uses: actions/checkout@v4
15 |
16 | - name: Set up Python 3.12
17 | uses: actions/setup-python@v4
18 | with:
19 | python-version: 3.12
20 |
21 | - name: Install dependencies
22 | run: |
23 | python -m pip install --upgrade pip
24 | pip install -r requirements.txt
25 |
26 | - name: Run all tests
27 | run: |
28 | python -m unittest discover tests -v
29 |
```
--------------------------------------------------------------------------------
/src/core/__init__.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Core package for paperclip MCP server.
3 | """
4 |
5 | from .arxiv import (
6 | fetch_arxiv_papers,
7 | fetch_single_arxiv_paper_metadata,
8 | )
9 | from .osf import (
10 | fetch_osf_preprints,
11 | fetch_osf_providers,
12 | fetch_single_osf_preprint_metadata,
13 | )
14 | from .openalex import (
15 | fetch_openalex_papers,
16 | fetch_single_openalex_paper_metadata,
17 | )
18 |
19 |
20 | from .providers import get_all_providers, validate_provider, fetch_osf_providers
21 |
22 | __all__ = [
23 | "fetch_arxiv_papers",
24 | "fetch_osf_preprints",
25 | "fetch_osf_providers",
26 | "fetch_single_arxiv_paper_metadata",
27 | "fetch_single_osf_preprint_metadata",
28 | "fetch_openalex_papers",
29 | "fetch_single_openalex_paper_metadata",
30 | "get_all_providers",
31 | "validate_provider",
32 | "fetch_osf_providers",
33 | ]
34 |
```
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
```
1 | annotated-types==0.7.0
2 | anyio==4.9.0
3 | attrs==25.3.0
4 | Authlib==1.6.1
5 | certifi==2025.8.3
6 | cffi==1.17.1
7 | charset-normalizer==3.4.2
8 | click==8.2.1
9 | cryptography==45.0.5
10 | cyclopts==3.22.5
11 | dnspython==2.7.0
12 | docstring_parser==0.17.0
13 | docutils==0.22
14 | email_validator==2.2.0
15 | exceptiongroup==1.3.0
16 | fastmcp==2.11.0
17 | h11==0.16.0
18 | httpcore==1.0.9
19 | httpx==0.28.1
20 | httpx-sse==0.4.1
21 | idna==3.10
22 | isodate==0.7.2
23 | jsonschema==4.25.0
24 | jsonschema-path==0.3.4
25 | jsonschema-specifications==2025.4.1
26 | lazy-object-proxy==1.11.0
27 | markdown-it-py==3.0.0
28 | MarkupSafe==3.0.2
29 | mcp==1.12.3
30 | mdurl==0.1.2
31 | more-itertools==10.7.0
32 | openapi-core==0.19.5
33 | openapi-pydantic==0.5.1
34 | openapi-schema-validator==0.6.3
35 | openapi-spec-validator==0.7.2
36 | parse==1.20.2
37 | pathable==0.4.4
38 | pycparser==2.22
39 | pydantic==2.11.7
40 | pydantic-settings==2.10.1
41 | pydantic_core==2.33.2
42 | Pygments==2.19.2
43 | PyMuPDF==1.26.3
44 | pymupdf4llm==0.0.27
45 | pyperclip==1.9.0
46 | python-dotenv==1.1.1
47 | python-multipart==0.0.20
48 | PyYAML==6.0.2
49 | referencing==0.36.2
50 | requests==2.32.4
51 | rfc3339-validator==0.1.4
52 | rich==14.1.0
53 | rich-rst==1.3.1
54 | rpds-py==0.26.0
55 | six==1.17.0
56 | sniffio==1.3.1
57 | sse-starlette==3.0.2
58 | starlette==0.47.2
59 | typing-inspection==0.4.1
60 | typing_extensions==4.14.1
61 | urllib3==2.5.0
62 | uvicorn==0.35.0
63 | watchdog==6.0.0
64 | Werkzeug==3.1.1
65 |
```
--------------------------------------------------------------------------------
/src/prompts.py:
--------------------------------------------------------------------------------
```python
1 | from fastmcp import FastMCP
2 |
3 |
4 | prompt_mcp = FastMCP()
5 |
6 | @prompt_mcp.prompt
7 | def list_paper_providers() -> str:
8 | """List all available paper providers."""
9 | return "List all available paper providers."
10 |
11 | @prompt_mcp.prompt
12 | def find_attention_is_all_you_need() -> str:
13 | """Finds the Attention is all you need paper in arxiv."""
14 | return "Search for Attention is all you need in arxiv"
15 |
16 | @prompt_mcp.prompt
17 | def get_paper_by_id() -> str:
18 | """Prompt to use the get_paper_by_id tool."""
19 | return "Retrieve the full content (including abstract, sections, and references) of the paper with ID: 1706.03762"
20 |
21 | @prompt_mcp.prompt
22 | def get_paper_metadata_by_id() -> str:
23 | """Prompt to use the get_paper_metadata_by_id tool."""
24 | return "Retrieve the metadata of the paper with ID: 1706.03762"
25 |
26 | @prompt_mcp.prompt
27 | def get_paper_by_url() -> str:
28 | """Prompt to use the get_paper_by_url tool."""
29 | return "Retrieve the full content (including abstract, sections, and references) of the paper with URL: https://arxiv.org/pdf/1706.03762"
30 |
31 | @prompt_mcp.prompt
32 | def search_across_providers() -> str:
33 | """Prompt for searching across all providers (not specifying a provider)."""
34 | return "Search for papers across all providers with the query: MCP"
```
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.yml:
--------------------------------------------------------------------------------
```yaml
1 | name: Bug Report
2 | description: Report a bug or unexpected behavior
3 | title: "[Bug]: "
4 | labels: ["bug"]
5 | body:
6 | - type: markdown
7 | attributes:
8 | value: |
9 | Thanks for taking the time to report a bug 🫶 ! Please fill out the information below to help us investigate.
10 |
11 | - type: textarea
12 | id: description
13 | attributes:
14 | label: Bug Description
15 | description: A clear and concise description of what the bug is.
16 | placeholder: Describe what happened and what you expected to happen instead.
17 | validations:
18 | required: true
19 |
20 | - type: textarea
21 | id: reproduction
22 | attributes:
23 | label: Steps to Reproduce
24 | description: Steps to reproduce the behavior
25 | validations:
26 | required: true
27 |
28 | - type: textarea
29 | id: expected
30 | attributes:
31 | label: Expected Behavior
32 | description: What you expected to happen
33 | validations:
34 | required: true
35 |
36 | - type: textarea
37 | id: actual
38 | attributes:
39 | label: Actual Behavior
40 | description: What actually happened (include full error message if applicable)
41 | validations:
42 | required: true
43 |
44 | - type: textarea
45 | id: additional-context
46 | attributes:
47 | label: Additional Context
48 | description: Any other context about the problem here.
49 |
```
--------------------------------------------------------------------------------
/src/server.py:
--------------------------------------------------------------------------------
```python
1 | from typing import Annotated
2 | import asyncio
3 |
4 | from fastmcp import FastMCP
5 |
6 | from core import (
7 | fetch_arxiv_papers,
8 | fetch_openalex_papers,
9 | fetch_osf_preprints,
10 | fetch_single_arxiv_paper_metadata,
11 | fetch_single_openalex_paper_metadata,
12 | fetch_single_osf_preprint_metadata,
13 | fetch_osf_providers,
14 | get_all_providers,
15 | )
16 | from utils.pdf2md import download_pdf_and_parse_to_markdown, download_paper_and_parse_to_markdown, extract_pdf_to_markdown
17 | from prompts import prompt_mcp
18 | from tools import tools_mcp
19 |
20 | mcp = FastMCP(
21 | name="Paperclip MCP Server",
22 | instructions="""
23 | This server provides tools to search, retrieve, and read academic papers from multiple sources.
24 | - Search papers across providers with filters for query text, subjects, and publication date
25 | - Read full paper content in markdown format
26 | - Retrieve paper metadata without downloading content (e.g. title, authors, abstract, publication date, journal info, and download URLs)
27 | """,
28 | )
29 |
30 |
31 | # Import subservers
32 | async def setup():
33 | await mcp.import_server(prompt_mcp, prefix="prompt")
34 | await mcp.import_server(tools_mcp, prefix="tools")
35 |
36 | if __name__ == "__main__":
37 | asyncio.run(setup())
38 | mcp.run(transport="http", host="0.0.0.0", port=8000)
39 |
```
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/question.yml:
--------------------------------------------------------------------------------
```yaml
1 | name: Question
2 | description: Ask a question about paperclip
3 | title: "[Question]: "
4 | labels: ["question"]
5 | body:
6 | - type: markdown
7 | attributes:
8 | value: |
9 | Have a question about paperclip? We're here to help! Please provide as much detail as possible.
10 |
11 | - type: textarea
12 | id: question
13 | attributes:
14 | label: Your Question
15 | description: What would you like to know about paperclip?
16 | placeholder: Ask your question here...
17 | validations:
18 | required: true
19 |
20 | - type: textarea
21 | id: context
22 | attributes:
23 | label: Context
24 | description: |
25 | What are you trying to achieve? Providing context helps us give better answers.
26 | placeholder: |
27 | e.g., "I'm trying to understand how paperclip handles..."
28 | or "I want to use paperclip to..."
29 |
30 | - type: checkboxes
31 | id: checklist
32 | attributes:
33 | label: Checklist
34 | description: Please confirm you've done the following
35 | options:
36 | - label: I've checked the README and documentation
37 | required: true
38 | - label: I've searched existing issues for similar questions
39 | required: true
40 |
41 | - type: textarea
42 | id: additional-info
43 | attributes:
44 | label: Additional Information
45 | description: Any other details that might be helpful
46 |
```
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.yml:
--------------------------------------------------------------------------------
```yaml
1 | name: Feature Request
2 | description: Suggest a new feature or improvement
3 | title: "[Feature]: "
4 | labels: ["enhancement"]
5 | body:
6 | - type: markdown
7 | attributes:
8 | value: |
9 | Thanks for suggesting a new feature! Please fill out the information below to help us understand your request.
10 |
11 | - type: textarea
12 | id: summary
13 | attributes:
14 | label: Feature Summary
15 | description: A clear and concise description of the feature you'd like to see added.
16 | placeholder: Briefly describe the feature you're requesting.
17 | validations:
18 | required: true
19 |
20 | - type: textarea
21 | id: problem
22 | attributes:
23 | label: Problem or Use Case
24 | description: What problem does this feature solve? What use case does it address?
25 | placeholder: |
26 | e.g., "I often need to... but currently paperclip doesn't support..."
27 | or "It would be helpful if paperclip could..."
28 | validations:
29 | required: true
30 |
31 | - type: textarea
32 | id: solution
33 | attributes:
34 | label: Proposed Solution
35 | description: How would you like this feature to work?
36 | placeholder: |
37 | Describe your ideal solution. Consider:
38 | - What command/option would trigger this feature?
39 | - What would the output look like?
40 | - How should it interact with existing features?
41 | validations:
42 | required: true
43 |
44 | - type: textarea
45 | id: alternatives
46 | attributes:
47 | label: Alternatives Considered
48 | description: Have you considered any alternative solutions or workarounds?
49 | placeholder: |
50 | e.g., "I currently work around this by..."
51 | or "Other tools like X handle this by..."
52 |
53 | - type: textarea
54 | id: additional-context
55 | attributes:
56 | label: Additional Context
57 | description: |
58 | Any other context, screenshots, or examples that would help us understand your request.
59 |
```
--------------------------------------------------------------------------------
/src/utils/sanitize_api_queries.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | Text processing utilities for API interactions.
3 | """
4 |
5 | import re
6 |
7 |
8 | def sanitize_api_queries(text: str, max_length: int = 200) -> str:
9 | """
10 | Clean text for API queries by removing problematic characters and formatting.
11 |
12 | Args:
13 | text: The text to clean
14 | max_length: Maximum allowed length (default 200)
15 |
16 | Returns:
17 | Cleaned text suitable for API queries
18 | """
19 | if not text:
20 | return text
21 |
22 | # Remove or replace problematic characters
23 | cleaned = text
24 |
25 | # Replace curly "smart" quotes with straight ASCII quotes
26 | cleaned = cleaned.replace("\u201c", '"').replace("\u201d", '"').replace("\u2018", "'").replace("\u2019", "'")
27 |
28 | # Remove or replace other problematic special characters
29 | cleaned = cleaned.replace("\xa0", " ")  # Non-breaking space
30 | cleaned = cleaned.replace("\u00a0", " ")  # Unicode non-breaking space
31 | cleaned = cleaned.replace("\n", " ").replace("\r", " ").replace("\t", " ") # Line breaks and tabs
32 |
33 | # Replace multiple spaces with single space
34 | cleaned = re.sub(r"\s+", " ", cleaned)
35 |
36 | # Remove leading/trailing whitespace
37 | cleaned = cleaned.strip()
38 |
39 | # Handle length limits
40 | if len(cleaned) > max_length:
41 | cleaned = cleaned[: max_length - 3] + "..."
42 |
43 | # Remove or replace characters that commonly cause URL encoding issues
44 | problematic_chars = ["<", ">", "{", "}", "|", "\\", "^", "`", "[", "]"]
45 | for char in problematic_chars:
46 | cleaned = cleaned.replace(char, "")
47 |
48 | # Replace colons which seem to cause OSF API issues
49 | cleaned = cleaned.replace(":", " -")
50 |
51 | # Replace other potentially problematic punctuation
52 | cleaned = cleaned.replace(";", ",") # Semicolons to commas
53 | cleaned = cleaned.replace("?", "") # Remove question marks
54 | cleaned = cleaned.replace("!", "") # Remove exclamation marks
55 | cleaned = cleaned.replace("#", "") # Remove hashtags
56 | cleaned = cleaned.replace("%", "") # Remove percent signs
57 |
58 | # Clean up any double spaces created by replacements
59 | cleaned = re.sub(r"\s+", " ", cleaned).strip()
60 |
61 | return cleaned
62 |
```
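A minimal usage sketch for `sanitize_api_queries`, assuming it is run from the repository root with `src` on the import path (the same setup the tests use):

```python
import sys

sys.path.append("src")  # assumes execution from the repository root

from utils.sanitize_api_queries import sanitize_api_queries

# Line breaks, colons, and trailing punctuation are normalized before the
# text is sent to provider APIs.
raw = 'Attention:\n"Is All You Need"?'
print(sanitize_api_queries(raw, max_length=50))  # e.g. 'Attention - "Is All You Need"'
```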
--------------------------------------------------------------------------------
/src/core/providers.py:
--------------------------------------------------------------------------------
```python
1 | from typing import Any, Dict, List
2 |
3 | import requests
4 |
5 |
6 | def fetch_osf_providers() -> List[Dict[str, Any]]:
7 | """Fetch current list of valid OSF preprint providers from API"""
8 | url = "https://api.osf.io/v2/preprint_providers/"
9 | response = requests.get(url)
10 | response.raise_for_status()
11 | data = response.json()
12 |
13 | # Create provider objects from the response
14 | providers = []
15 | for provider in data["data"]:
16 | provider_obj = {
17 | "id": provider["id"],
18 | "type": "osf",
19 | "description": provider["attributes"]["description"],
20 | "taxonomies": provider["relationships"]["taxonomies"]["links"]["related"]["href"],
21 | "preprints": provider["relationships"]["preprints"]["links"]["related"]["href"],
22 | }
23 | providers.append(provider_obj)
24 |
25 | return sorted(providers, key=lambda p: p["id"])
26 |
27 |
28 | def get_external_providers() -> List[Dict[str, Any]]:
29 | """Get list of external (non-OSF) preprint providers"""
30 | return [
31 | {
32 | "id": "arxiv",
33 | "type": "standalone",
34 | "description": "arXiv is a free distribution service and an open-access archive for scholarly articles in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.",
35 | },
36 | {
37 | "id": "openalex",
38 | "type": "standalone",
39 | "description": "OpenAlex is a comprehensive index of scholarly works across all disciplines.",
40 | },
41 | ]
42 |
43 |
44 | def get_all_providers() -> List[Dict[str, Any]]:
45 | """Get combined list of all available providers"""
46 | osf_providers = fetch_osf_providers()
47 | external_providers = get_external_providers()
48 | all_providers = osf_providers + external_providers
49 | return sorted(all_providers, key=lambda p: p["id"].lower())
50 |
51 |
52 | def validate_provider(provider_id: str) -> bool:
53 | """Validate if a provider ID exists in the given providers list"""
54 | valid_ids = [p["id"] for p in get_all_providers()]
55 | return provider_id in valid_ids
56 |
```
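A minimal sketch of how the provider helpers compose, assuming network access to api.osf.io and `src` on the import path:

```python
import sys

sys.path.append("src")  # assumes execution from the repository root

from core.providers import get_all_providers, validate_provider

# get_all_providers() merges the live OSF provider list with the standalone
# arxiv/openalex entries and sorts them case-insensitively by id.
providers = get_all_providers()
print(len(providers), [p["id"] for p in providers[:5]])

print(validate_provider("arxiv"))         # True
print(validate_provider("not-a-source"))  # False
```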
--------------------------------------------------------------------------------
/.github/workflows/deploy.yml:
--------------------------------------------------------------------------------
```yaml
1 | name: Deploy to VPS
2 |
3 | on:
4 | push:
5 | branches: [main]
6 | workflow_dispatch:
7 |
8 | jobs:
9 | deploy:
10 | runs-on: ubuntu-latest
11 |
12 | steps:
13 | - name: Checkout code
14 | uses: actions/checkout@v4
15 |
16 | - name: Deploy to VPS
17 | uses: appleboy/ssh-action@master
18 | with:
19 | host: ${{ secrets.VPS_HOST }}
20 | username: ${{ secrets.VPS_USERNAME }}
21 | key: ${{ secrets.VPS_SSH_KEY }}
22 | passphrase: ${{ secrets.VPS_SSH_KEY_PASSPHRASE }}
23 | script: |
24 | cd /opt/paperclip
25 | git fetch origin main
26 | git reset --hard origin/main
27 |
28 | echo "🔄 Stopping containers..."
29 | docker-compose -f docker-compose.prod.yml down
30 |
31 | echo "🏗️ Building containers..."
32 | docker-compose -f docker-compose.prod.yml build --no-cache
33 |
34 | echo "🚀 Starting containers..."
35 | docker-compose -f docker-compose.prod.yml up -d
36 |
37 | echo "⏳ Waiting for containers to start..."
38 | sleep 10
39 |
40 | echo "📊 Container status:"
41 | docker-compose -f docker-compose.prod.yml ps --format "table {{.Service}}\t{{.Status}}\t{{.Ports}}"
42 |
43 | echo "📋 Recent logs from all services (filtered):"
44 | docker-compose -f docker-compose.prod.yml logs --tail=30 | grep -E "(ERROR|WARN|INFO|Ready|Starting|Listening)" | head -50
45 |
46 | echo "🧹 Cleaning up..."
47 | docker system prune -f
48 |
49 | - name: Show specific service logs on failure
50 | if: failure()
51 | uses: appleboy/ssh-action@master
52 | with:
53 | host: ${{ secrets.VPS_HOST }}
54 | username: ${{ secrets.VPS_USERNAME }}
55 | key: ${{ secrets.VPS_SSH_KEY }}
56 | passphrase: ${{ secrets.VPS_SSH_KEY_PASSPHRASE }}
57 | script: |
58 | cd /opt/paperclip
59 | echo "🔍 Filtered logs for debugging:"
60 | echo "--- Traefik status ---"
61 | docker-compose -f docker-compose.prod.yml logs --tail=50 traefik | grep -E "(ERROR|WARN|Ready|Starting|tls|certificate)" | head -30
62 | echo "--- Paperclip MCP Server status ---"
63 | docker-compose -f docker-compose.prod.yml logs --tail=50 paperclip-mcp | grep -E "(ERROR|WARN|Ready|Starting|Listening|Build)" | head -30
64 |
```
--------------------------------------------------------------------------------
/.github/workflows/ping-server.yml:
--------------------------------------------------------------------------------
```yaml
1 | name: Health Check
2 |
3 | on:
4 | schedule:
5 | # Run once per day at 10:00 UTC
6 | - cron: "0 10 * * *"
7 | workflow_dispatch:
8 |
9 | jobs:
10 | ping:
11 | runs-on: ubuntu-latest
12 |
13 | steps:
14 | - name: Checkout code
15 | uses: actions/checkout@v4
16 |
17 | - name: Set up Python
18 | uses: actions/setup-python@v4
19 | with:
20 | python-version: "3.11"
21 |
22 | - name: Install dependencies
23 | run: |
24 | python -m pip install --upgrade pip
25 | pip install fastmcp==2.11.0
26 |
27 | - name: Create ping script
28 | run: |
29 | cat > ping_server.py << 'EOF'
30 | import asyncio
31 | import sys
32 | import os
33 | from datetime import datetime
34 | from fastmcp.client.client import Client
35 |
36 | async def ping_server():
37 | server_url = 'https://paperclip.matsjfunke.com/mcp'
38 |
39 | print(f"🏓 Pinging MCP server at: {server_url}")
40 | print(f"⏰ Timestamp: {datetime.now().isoformat()}")
41 |
42 | try:
43 | # Create client instance
44 | client = Client(server_url)
45 |
46 | # Connect and ping
47 | async with client:
48 | print("✅ Successfully connected to server")
49 |
50 | # Send ping
51 | ping_result = await client.ping()
52 |
53 | if ping_result:
54 | print("🎯 Ping successful! Server is responsive")
55 | return True
56 | else:
57 | print("❌ Ping failed! Server did not respond properly")
58 | return False
59 |
60 | except Exception as e:
61 | print(f"💥 Error connecting to server: {str(e)}")
62 | print(f"🔧 Error type: {type(e).__name__}")
63 | return False
64 |
65 | if __name__ == "__main__":
66 | result = asyncio.run(ping_server())
67 | if not result:
68 | sys.exit(1)
69 | EOF
70 |
71 | - name: Run ping test
72 | run: python ping_server.py
73 |
74 | - name: Report ping failure
75 | if: failure()
76 | run: |
77 | echo "🚨 Server ping failed!"
78 | echo "⚠️ This could indicate:"
79 | echo " - Server is down or not responding"
80 | echo " - Network connectivity issues"
81 | echo " - Server is overloaded"
82 | echo " - Configuration problems"
83 | echo ""
84 | echo "🔍 Check the deploy workflow and server logs for more details"
85 |
86 | - name: Report ping success
87 | if: success()
88 | run: |
89 | echo "✅ Server ping successful!"
90 | echo "🟢 Paperclip server is healthy and responsive"
91 |
```
--------------------------------------------------------------------------------
/tests/test_metadata_retrieval.py:
--------------------------------------------------------------------------------
```python
1 | #!/usr/bin/env python3
2 | """
3 | Unit tests for metadata retrieval functionality.
4 | """
5 |
6 | import unittest
7 | import sys
8 | import os
9 |
10 | # Add src to path to import server modules
11 | sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'src'))
12 |
13 | from core import (
14 | fetch_single_arxiv_paper_metadata,
15 | fetch_single_openalex_paper_metadata,
16 | fetch_single_osf_preprint_metadata,
17 | )
18 |
19 |
20 | class TestMetadataRetrieval(unittest.TestCase):
21 | """Test class for paper metadata retrieval."""
22 |
23 | def setUp(self):
24 | """Set up test fixtures."""
25 | self.osf_id = "2stpg"
26 | self.openalex_id = "W4385245566"
27 | self.arxiv_id = "1709.06308v1"
28 |
29 | # Expected results based on tmp.py output
30 | self.expected_osf_title = "The Economy of Attention and the Novel"
31 | self.expected_openalex_title = "Attention Is All You Need"
32 | self.expected_arxiv_title = "Exploring Human-like Attention Supervision in Visual Question Answering"
33 |
34 | def test_osf_metadata_retrieval(self):
35 | """Test OSF paper metadata retrieval."""
36 | result = fetch_single_osf_preprint_metadata(self.osf_id)
37 |
38 | # Assert that result is a dictionary and not an error
39 | self.assertIsInstance(result, dict)
40 | self.assertNotEqual(result.get("status"), "error")
41 |
42 | # Assert title and ID
43 | self.assertEqual(result.get("title"), self.expected_osf_title)
44 | self.assertEqual(result.get("id"), self.osf_id)
45 |
46 | def test_openalex_metadata_retrieval(self):
47 | """Test OpenAlex paper metadata retrieval."""
48 | result = fetch_single_openalex_paper_metadata(self.openalex_id)
49 |
50 | # Assert that result is a dictionary and not an error
51 | self.assertIsInstance(result, dict)
52 | self.assertNotEqual(result.get("status"), "error")
53 |
54 | # Assert title and ID
55 | self.assertEqual(result.get("title"), self.expected_openalex_title)
56 | self.assertEqual(result.get("id"), self.openalex_id)
57 |
58 | def test_arxiv_metadata_retrieval(self):
59 | """Test ArXiv paper metadata retrieval."""
60 | result = fetch_single_arxiv_paper_metadata(self.arxiv_id)
61 |
62 | # Assert that result is a dictionary and not an error
63 | self.assertIsInstance(result, dict)
64 | self.assertNotEqual(result.get("status"), "error")
65 |
66 | # Assert title and ID
67 | self.assertEqual(result.get("title"), self.expected_arxiv_title)
68 | self.assertEqual(result.get("id"), self.arxiv_id)
69 |
70 | def test_metadata_contains_required_fields(self):
71 | """Test that metadata contains essential fields."""
72 | result = fetch_single_arxiv_paper_metadata(self.arxiv_id)
73 |
74 | # Assert required fields are present
75 | self.assertIn("title", result)
76 | self.assertIn("id", result)
77 | self.assertIsNotNone(result.get("title"))
78 | self.assertIsNotNone(result.get("id"))
79 |
80 |
81 | if __name__ == "__main__":
82 | unittest.main()
```
--------------------------------------------------------------------------------
/docker-compose.prod.yml:
--------------------------------------------------------------------------------
```yaml
1 | version: "3.8"
2 |
3 | services:
4 | traefik:
5 | image: traefik:v3.0
6 | container_name: traefik
7 | command:
8 | - "--api.insecure=false" # Disable insecure API dashboard for production security
9 | - "--providers.docker=true" # Auto-discover services via Docker labels
10 | # Only expose services that explicitly set traefik.enable=true (security best practice)
11 | - "--providers.docker.exposedbydefault=false"
12 | - "--entrypoints.web.address=:80" # HTTP entrypoint redirects to HTTPS
13 | - "--entrypoints.websecure.address=:443" # HTTPS entrypoint
14 | - "--certificatesresolvers.myresolver.acme.tlschallenge=true" # Automatic SSL certificate generation via Let's Encrypt TLS challenge
15 | - "--certificatesresolvers.myresolver.acme.email=mats.funke@gmail.com" # Email required for Let's Encrypt certificate notifications and recovery
16 | - "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json" # SSL certificates persist across container restarts
17 | # Force HTTP to HTTPS redirect for security (all traffic must be encrypted)
18 | - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
19 | - "--entrypoints.web.http.redirections.entrypoint.scheme=https"
20 | ports:
21 | - "80:80"
22 | - "443:443"
23 | volumes:
24 | - /var/run/docker.sock:/var/run/docker.sock:ro
25 | - traefik_letsencrypt:/letsencrypt # SSL certificates persist across container restarts
26 | networks:
27 | - web
28 | restart: unless-stopped
29 |
30 | paperclip-mcp:
31 | build:
32 | context: .
33 | dockerfile: Dockerfile
34 | container_name: paperclip-mcp
35 | command: python server.py
36 | labels:
37 | - "traefik.enable=true"
38 |
39 | # Define service first to avoid Traefik auto-generating conflicting services
40 | - "traefik.http.services.paperclip-mcp.loadbalancer.server.port=8000"
41 |
42 | # MCP server route - accessible via HTTPS
43 | - "traefik.http.routers.paperclip-mcp.rule=Host(`paperclip.matsjfunke.com`)"
44 | - "traefik.http.routers.paperclip-mcp.entrypoints=websecure"
45 | - "traefik.http.routers.paperclip-mcp.tls.certresolver=myresolver"
46 | - "traefik.http.routers.paperclip-mcp.service=paperclip-mcp"
47 |
48 | # CORS headers required for MCP protocol compatibility with AI clients
49 | - "traefik.http.middlewares.mcp-cors.headers.accesscontrolallowmethods=GET,POST,OPTIONS,PUT,DELETE"
50 | - "traefik.http.middlewares.mcp-cors.headers.accesscontrolallowheaders=Content-Type,Authorization,Accept,Origin,User-Agent,DNT,Cache-Control,X-Mx-ReqToken,Keep-Alive,X-Requested-With,If-Modified-Since,mcp-session-id"
51 | - "traefik.http.middlewares.mcp-cors.headers.accesscontrolalloworiginlist=*"
52 | - "traefik.http.middlewares.mcp-cors.headers.accesscontrolmaxage=86400"
53 |
54 | # Apply CORS middleware to the router
55 | - "traefik.http.routers.paperclip-mcp.middlewares=mcp-cors"
56 |
57 | networks:
58 | - web
59 | restart: unless-stopped
60 | depends_on:
61 | - traefik
62 | environment:
63 | - PYTHONPATH=/app
64 |
65 | volumes:
66 | # Named volume for Let's Encrypt certificates persistence across container restarts
67 | traefik_letsencrypt:
68 |
69 | networks:
70 | # Internal network for container communication (external=false for security)
71 | web:
72 | external: false
73 |
```
--------------------------------------------------------------------------------
/tests/test_pdf_retrieval.py:
--------------------------------------------------------------------------------
```python
1 | #!/usr/bin/env python3
2 | """
3 | Unit tests for PDF retrieval functionality.
4 | """
5 |
6 | import unittest
7 | import sys
8 | import os
9 | import asyncio
10 |
11 | # Add src to path to import server modules
12 | sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'src'))
13 |
14 | from core import (
15 | fetch_single_arxiv_paper_metadata,
16 | fetch_single_openalex_paper_metadata,
17 | fetch_single_osf_preprint_metadata,
18 | )
19 | from utils.pdf2md import download_paper_and_parse_to_markdown
20 |
21 |
22 | class TestPdfRetrieval(unittest.TestCase):
23 | """Test class for paper PDF retrieval and content extraction."""
24 |
25 | def setUp(self):
26 | """Set up test fixtures."""
27 | self.osf_id = "2stpg"
28 | self.openalex_id = "W4385245566"
29 | self.arxiv_id = "1709.06308v1"
30 |
31 | # Expected content starts based on tmp_pdf.py output
32 | self.expected_osf_start = "#### The Economy of Attention and the Novel"
33 | self.expected_openalex_start = "Skip to main content"
34 | self.expected_arxiv_start = "## **Exploring Human-like Attention Supervision in Visual Question Answering**"
35 |
36 | def test_osf_pdf_retrieval(self):
37 | """Test OSF paper PDF retrieval and content extraction."""
38 | metadata = fetch_single_osf_preprint_metadata(self.osf_id)
39 | result = asyncio.run(download_paper_and_parse_to_markdown(
40 | metadata=metadata,
41 | pdf_url_field="download_url",
42 | paper_id=self.osf_id,
43 | write_images=False
44 | ))
45 |
46 | # Assert that result is successful
47 | self.assertIsInstance(result, dict)
48 | self.assertEqual(result.get("status"), "success")
49 |
50 | # Assert content is retrieved and has expected start
51 | content = result.get("content", "")
52 | self.assertGreater(len(content), 1000) # Should have substantial content
53 | self.assertTrue(content.startswith(self.expected_osf_start))
54 |
55 | def test_openalex_pdf_retrieval(self):
56 | """Test OpenAlex paper PDF retrieval and content extraction."""
57 | metadata = fetch_single_openalex_paper_metadata(self.openalex_id)
58 | result = asyncio.run(download_paper_and_parse_to_markdown(
59 | metadata=metadata,
60 | pdf_url_field="pdf_url",
61 | paper_id=self.openalex_id,
62 | write_images=False
63 | ))
64 |
65 | # Assert that result is successful
66 | self.assertIsInstance(result, dict)
67 | self.assertEqual(result.get("status"), "success")
68 |
69 | # Assert content is retrieved and has expected start
70 | content = result.get("content", "")
71 | self.assertGreater(len(content), 1000) # Should have substantial content
72 | self.assertTrue(content.startswith(self.expected_openalex_start))
73 |
74 | def test_arxiv_pdf_retrieval(self):
75 | """Test ArXiv paper PDF retrieval and content extraction."""
76 | metadata = fetch_single_arxiv_paper_metadata(self.arxiv_id)
77 | result = asyncio.run(download_paper_and_parse_to_markdown(
78 | metadata=metadata,
79 | pdf_url_field="download_url",
80 | paper_id=self.arxiv_id,
81 | write_images=False
82 | ))
83 |
84 | # Assert that result is successful
85 | self.assertIsInstance(result, dict)
86 | self.assertEqual(result.get("status"), "success")
87 |
88 | # Assert content is retrieved and has expected start
89 | content = result.get("content", "")
90 | self.assertGreater(len(content), 1000) # Should have substantial content
91 | self.assertTrue(content.startswith(self.expected_arxiv_start))
92 |
93 | def test_pdf_content_contains_markdown(self):
94 | """Test that PDF content is properly converted to markdown."""
95 | metadata = fetch_single_arxiv_paper_metadata(self.arxiv_id)
96 | result = asyncio.run(download_paper_and_parse_to_markdown(
97 | metadata=metadata,
98 | pdf_url_field="download_url",
99 | paper_id=self.arxiv_id,
100 | write_images=False
101 | ))
102 |
103 | # Assert successful retrieval
104 | self.assertEqual(result.get("status"), "success")
105 |
106 | content = result.get("content", "")
107 |
108 | # Assert markdown characteristics are present
109 | self.assertIn("##", content) # Should contain markdown headers
110 | self.assertIn("**", content) # Should contain bold text
111 | self.assertGreater(len(content.split('\n')), 50) # Should have many lines
112 |
113 | def test_pdf_retrieval_includes_metadata(self):
114 | """Test that PDF retrieval includes paper metadata."""
115 | metadata = fetch_single_osf_preprint_metadata(self.osf_id)
116 | result = asyncio.run(download_paper_and_parse_to_markdown(
117 | metadata=metadata,
118 | pdf_url_field="download_url",
119 | paper_id=self.osf_id,
120 | write_images=False
121 | ))
122 |
123 | # Assert successful retrieval
124 | self.assertEqual(result.get("status"), "success")
125 |
126 | # Assert metadata is included
127 | result_metadata = result.get("metadata", {})
128 | self.assertIsInstance(result_metadata, dict)
129 | self.assertIn("title", result_metadata)
130 | self.assertIn("id", result_metadata)
131 |
132 |
133 | if __name__ == "__main__":
134 | unittest.main()
```
--------------------------------------------------------------------------------
/src/core/arxiv.py:
--------------------------------------------------------------------------------
```python
1 | import xml.etree.ElementTree as ET
2 | from typing import Any, Dict, Optional
3 | from urllib.parse import quote, urlencode
4 |
5 | import requests
6 |
7 | from utils import sanitize_api_queries
8 |
9 |
10 | def fetch_arxiv_papers(
11 | query: Optional[str] = None,
12 | category: Optional[str] = None,
13 | author: Optional[str] = None,
14 | title: Optional[str] = None,
15 | max_results: int = 100,
16 | start_index: int = 0,
17 | ) -> Dict[str, Any]:
18 | """
19 | Fetch papers from arXiv API using various search parameters.
20 |
21 | Args:
22 | query: General search query
23 | category: arXiv category (e.g., 'cs.AI', 'physics.gen-ph')
24 | author: Author name to search for
25 | title: Title keywords to search for
26 | max_results: Maximum number of results to return (default 100, capped at 20 per request)
27 | start_index: Starting index for pagination (default 0)
28 |
29 | Returns:
30 | Dictionary containing papers data from arXiv API
31 | """
32 | # Build search query
33 | search_parts = []
34 |
35 | if query:
36 | search_parts.append(f"all:{sanitize_api_queries(query, max_length=200)}")
37 | if category:
38 | search_parts.append(f"cat:{sanitize_api_queries(category, max_length=50)}")
39 | if author:
40 | search_parts.append(f"au:{sanitize_api_queries(author, max_length=100)}")
41 | if title:
42 | search_parts.append(f"ti:{sanitize_api_queries(title, max_length=200)}")
43 |
44 | if not search_parts:
45 | # Default search if no parameters provided
46 | search_query = "all:*"
47 | else:
48 | search_query = " AND ".join(search_parts)
49 |
50 | # Build API URL
51 | base_url = "http://export.arxiv.org/api/query"
52 | params = {"search_query": search_query, "start": start_index, "max_results": min(max_results, 20)}
53 |
54 | query_string = urlencode(params, safe=":", quote_via=quote)
55 | url = f"{base_url}?{query_string}"
56 |
57 | try:
58 | response = requests.get(url, timeout=30)
59 | response.raise_for_status()
60 |
61 | # Parse XML response
62 | root = ET.fromstring(response.content)
63 |
64 | # Extract namespace
65 | ns = {"atom": "http://www.w3.org/2005/Atom", "arxiv": "http://arxiv.org/schemas/atom"}
66 |
67 | papers = []
68 | for entry in root.findall("atom:entry", ns):
69 | paper = _parse_arxiv_entry(entry, ns)
70 | papers.append(paper)
71 |
72 | return {
73 | "data": papers,
74 | "meta": {"total_results": len(papers), "start_index": start_index, "max_results": max_results, "search_query": search_query},
75 | }
76 |
77 | except requests.exceptions.RequestException as e:
78 | raise ValueError(f"Request failed: {str(e)}")
79 | except ET.ParseError as e:
80 | raise ValueError(f"Failed to parse arXiv response: {str(e)}")
81 |
82 |
83 | def _parse_arxiv_entry(entry, ns):
84 | """Parse a single arXiv entry from XML."""
85 | # Extract basic info
86 | arxiv_id = entry.find("atom:id", ns).text.split("/")[-1] if entry.find("atom:id", ns) is not None else ""
87 | title = entry.find("atom:title", ns).text.strip() if entry.find("atom:title", ns) is not None else ""
88 | summary = entry.find("atom:summary", ns).text.strip() if entry.find("atom:summary", ns) is not None else ""
89 | published = entry.find("atom:published", ns).text if entry.find("atom:published", ns) is not None else ""
90 | updated = entry.find("atom:updated", ns).text if entry.find("atom:updated", ns) is not None else ""
91 |
92 | # Extract authors
93 | authors = []
94 | for author in entry.findall("atom:author", ns):
95 | name_elem = author.find("atom:name", ns)
96 | if name_elem is not None:
97 | authors.append(name_elem.text)
98 |
99 | # Extract categories
100 | categories = []
101 | for category in entry.findall("atom:category", ns):
102 | term = category.get("term")
103 | if term:
104 | categories.append(term)
105 |
106 | # Extract links (PDF, abstract)
107 | pdf_url = ""
108 | abstract_url = ""
109 | for link in entry.findall("atom:link", ns):
110 | if link.get("type") == "application/pdf":
111 | pdf_url = link.get("href", "")
112 | elif link.get("rel") == "alternate":
113 | abstract_url = link.get("href", "")
114 |
115 | # Extract DOI if available
116 | doi = ""
117 | doi_elem = entry.find("arxiv:doi", ns)
118 | if doi_elem is not None:
119 | doi = doi_elem.text
120 |
121 | return {
122 | "id": arxiv_id,
123 | "title": title,
124 | "summary": summary,
125 | "authors": authors,
126 | "categories": categories,
127 | "published": published,
128 | "updated": updated,
129 | "pdf_url": pdf_url,
130 | "abstract_url": abstract_url,
131 | "doi": doi,
132 | }
133 |
134 |
135 | def fetch_single_arxiv_paper_metadata(paper_id: str) -> Dict[str, Any]:
136 | """
137 | Fetch metadata for a single arXiv paper by ID.
138 |
139 | Args:
140 | paper_id: arXiv paper ID (e.g., '2301.00001' or 'cs.AI/0001001')
141 |
142 | Returns:
143 | Dictionary containing paper metadata
144 | """
145 | # Validate paper exists first
146 | pdf_url = f"https://arxiv.org/pdf/{paper_id}"
147 | response = requests.head(pdf_url, timeout=10)
148 | if response.status_code != 200:
149 | raise ValueError(f"arXiv paper not found: {paper_id}")
150 |
151 | # Fetch metadata from API
152 | try:
153 | api_url = f"http://export.arxiv.org/api/query?id_list={paper_id}"
154 | response = requests.get(api_url, timeout=30)
155 | response.raise_for_status()
156 |
157 | # Parse XML response
158 | root = ET.fromstring(response.content)
159 | ns = {"atom": "http://www.w3.org/2005/Atom", "arxiv": "http://arxiv.org/schemas/atom"}
160 |
161 | entry = root.find("atom:entry", ns)
162 | if entry is None:
163 | raise ValueError(f"No metadata found for paper: {paper_id}")
164 |
165 | metadata = _parse_arxiv_entry(entry, ns)
166 | metadata["download_url"] = pdf_url
167 |
168 | return metadata
169 |
170 | except requests.exceptions.RequestException as e:
171 | raise ValueError(f"Failed to fetch paper metadata: {str(e)}")
172 | except ET.ParseError as e:
173 | raise ValueError(f"Failed to parse arXiv response: {str(e)}")
```
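A short sketch of calling the arXiv helpers directly (live requests to export.arxiv.org; assumes `src` is on the import path):

```python
import sys

sys.path.append("src")  # assumes execution from the repository root

from core.arxiv import fetch_arxiv_papers, fetch_single_arxiv_paper_metadata

# Combined title + category search; fetch_arxiv_papers caps each request at 20 results.
results = fetch_arxiv_papers(title="attention is all you need", category="cs.CL", max_results=5)
for paper in results["data"]:
    print(paper["id"], "-", paper["title"])

# Single-paper lookup returns the parsed metadata plus a resolved download_url.
meta = fetch_single_arxiv_paper_metadata("1706.03762")
print(meta["title"], meta["download_url"])
```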
--------------------------------------------------------------------------------
/src/utils/pdf2md.py:
--------------------------------------------------------------------------------
```python
1 | """
2 | PDF processing utilities using pymupdf4llm.
3 |
4 | download_paper_and_parse_to_markdown()      download_pdf_and_parse_to_markdown()
5 |          (with metadata)                             (direct URL)
6 |                 |                                         |
7 |                 v                                         v
8 |   Extract PDF URL from metadata               Generate filename from URL
9 |                 |                                         |
10 |                +--------------------+--------------------+
11 |                                     |
12 |                                     v
13 |                     _download_and_parse_pdf_core()
14 |                                     |
15 |                                     v
16 |                          requests.get(pdf_url)
17 |                                     |
18 |                                     v
19 |                       extract_pdf_to_markdown()
20 |                                     |
21 |                                     v
22 |                    Return (content, size, message)
23 |                                     |
24 |                         +-----------+-----------+
25 |                         |                       |
26 |                         v                       v
27 |                 Format response          Format response
28 |                 with metadata            with pdf_url
29 |
30 | The shared core logic eliminates code duplication while maintaining
31 | distinct interfaces for metadata-based vs direct URL workflows.
32 | """
33 |
34 | import os
35 | from typing import Optional
36 | import tempfile
37 | import httpx
38 | import requests
39 |
40 | import pymupdf4llm as pdfmd
41 |
42 |
43 | async def extract_pdf_to_markdown(file_input, filename: Optional[str] = None, write_images: bool = False) -> str:
44 | """
45 | Extract PDF content to markdown using pymupdf4llm.
46 |
47 | Args:
48 | file_input: Can be either:
49 | - A file path (str) to an existing PDF
50 | - File bytes/content (bytes) that will be written to temp file
51 | - A file object with .read() method (for async file handling)
52 | filename: Optional filename to use for temp file (only used when file_input is bytes/file object)
53 | write_images: Whether to extract and write images (default: False)
54 |
55 | Returns:
56 | Markdown content as string
57 | """
58 | temp_path = None
59 |
60 | try:
61 | # Handle different input types
62 | if isinstance(file_input, str) and os.path.exists(file_input):
63 | # Direct file path
64 | md = pdfmd.to_markdown(file_input, write_images=write_images)
65 | return md
66 |
67 | elif isinstance(file_input, bytes):
68 | # File bytes - write to temp file
69 | temp_filename = filename or "temp_pdf.pdf"
70 | temp_path = f"/tmp/{temp_filename}"
71 | with open(temp_path, "wb") as f:
72 | f.write(file_input)
73 | md = pdfmd.to_markdown(temp_path, write_images=write_images)
74 | return md
75 |
76 | elif hasattr(file_input, "read"):
77 | # File object (like FastAPI UploadFile)
78 | temp_filename = filename or getattr(file_input, "filename", "temp_pdf.pdf")
79 | temp_path = f"/tmp/{temp_filename}"
80 |
81 |             # Handle both sync and async file objects: read() may return
82 |             # bytes directly (sync file) or an awaitable (async file such
83 |             # as FastAPI's UploadFile). Awaiting only when needed avoids
84 |             # calling read() twice on a sync object, which would leave the
85 |             # content empty on the second call.
86 |             content = file_input.read()
87 |             if hasattr(content, "__await__"):
88 |                 # Async file object: resolve the awaitable to get the bytes
89 |                 content = await content
90 |             # `content` now holds the raw file bytes in either case
91 |
92 | with open(temp_path, "wb") as f:
93 | f.write(content)
94 | md = pdfmd.to_markdown(temp_path, write_images=write_images)
95 | return md
96 |
97 | else:
98 | raise ValueError(f"Unsupported file_input type: {type(file_input)}")
99 |
100 | finally:
101 | # Clean up temporary file
102 | if temp_path and os.path.exists(temp_path):
103 | try:
104 | os.unlink(temp_path)
105 | except Exception:
106 | pass # Ignore cleanup errors
107 |
108 |
109 | async def _download_and_parse_pdf_core(
110 | pdf_url: str,
111 | filename: str = "paper.pdf",
112 | write_images: bool = False
113 | ) -> tuple[str, int, str]:
114 | # Download PDF
115 | pdf_response = requests.get(pdf_url, timeout=60)
116 | pdf_response.raise_for_status()
117 |
118 | # Parse PDF to markdown
119 | markdown_content = await extract_pdf_to_markdown(
120 | pdf_response.content,
121 | filename=filename,
122 | write_images=write_images
123 | )
124 |
125 | file_size = len(pdf_response.content)
126 | message = f"Successfully parsed PDF content ({file_size} bytes)"
127 |
128 | return markdown_content, file_size, message
129 |
130 |
131 | async def download_paper_and_parse_to_markdown(
132 | metadata: dict,
133 | pdf_url_field: str = "download_url",
134 | paper_id: str = "",
135 | write_images: bool = False
136 | ) -> dict:
137 | # Extract PDF URL from metadata
138 | pdf_url = metadata.get(pdf_url_field)
139 | if not pdf_url:
140 | return {
141 | "status": "error",
142 | "message": f"No PDF URL found in metadata field '{pdf_url_field}'",
143 | "metadata": metadata
144 | }
145 |
146 | try:
147 | filename = f"{paper_id}.pdf" if paper_id else "paper.pdf"
148 | markdown_content, file_size, message = await _download_and_parse_pdf_core(
149 | pdf_url, filename, write_images
150 | )
151 |
152 | return {
153 | "status": "success",
154 | "metadata": metadata,
155 | "content": markdown_content,
156 | "file_size": file_size,
157 | "message": message,
158 | }
159 |
160 | except requests.exceptions.RequestException as e:
161 | return {
162 | "status": "error",
163 | "message": f"Network error: {str(e)}",
164 | "metadata": metadata
165 | }
166 | except Exception as e:
167 | return {
168 | "status": "error",
169 | "message": f"Error parsing PDF: {str(e)}",
170 | "metadata": metadata
171 | }
172 |
173 |
174 | async def download_pdf_and_parse_to_markdown(pdf_url: str, write_images: bool = False) -> dict:
175 | try:
176 | filename = pdf_url.split('/')[-1] if '/' in pdf_url else "paper.pdf"
177 | if not filename.endswith('.pdf'):
178 | filename = "paper.pdf"
179 |
180 | markdown_content, file_size, message = await _download_and_parse_pdf_core(
181 | pdf_url, filename, write_images
182 | )
183 |
184 | return {
185 | "status": "success",
186 | "content": markdown_content,
187 | "file_size": file_size,
188 | "pdf_url": pdf_url,
189 | "message": message,
190 | }
191 |
192 | except requests.exceptions.RequestException as e:
193 | return {
194 | "status": "error",
195 | "message": f"Network error downloading PDF: {str(e)}",
196 | "pdf_url": pdf_url
197 | }
198 | except Exception as e:
199 | return {
200 | "status": "error",
201 | "message": f"Error parsing PDF: {str(e)}",
202 | "pdf_url": pdf_url
203 | }
204 |
```
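Both download helpers are coroutines, so a caller outside the MCP server needs an event loop. A minimal usage sketch, assuming `src/` is on the import path (as it is for the server) and using a placeholder PDF URL:

```python
import asyncio

from utils.pdf2md import download_pdf_and_parse_to_markdown


async def main() -> None:
    # Placeholder URL; substitute any publicly reachable PDF
    result = await download_pdf_and_parse_to_markdown("https://arxiv.org/pdf/2407.06405v1")
    if result["status"] == "success":
        print(result["message"])        # e.g. "Successfully parsed PDF content (... bytes)"
        print(result["content"][:500])  # first 500 characters of the markdown
    else:
        print("Error:", result["message"])


if __name__ == "__main__":
    asyncio.run(main())
```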
--------------------------------------------------------------------------------
/src/tools.py:
--------------------------------------------------------------------------------
```python
1 |
2 | from typing import Annotated
3 |
4 | from fastmcp import FastMCP
5 |
6 | from core import (
7 | fetch_arxiv_papers,
8 | fetch_openalex_papers,
9 | fetch_osf_preprints,
10 | fetch_single_arxiv_paper_metadata,
11 | fetch_single_openalex_paper_metadata,
12 | fetch_single_osf_preprint_metadata,
13 | fetch_osf_providers,
14 | get_all_providers,
15 | )
16 | from utils.pdf2md import download_pdf_and_parse_to_markdown, download_paper_and_parse_to_markdown
17 |
18 | tools_mcp = FastMCP()
19 |
20 | @tools_mcp.tool(
21 | name="list_providers",
22 |     description="Get the complete list of all available academic paper providers, including ArXiv and the Open Science Framework (OSF) discipline-specific preprint servers. Returns provider IDs for use with search_papers.",
23 | )
24 | async def list_providers() -> dict:
25 |     """
26 |     Fetch the provider list from the OSF API and combine it with the other
27 |     supported providers that are hard-coded (e.g. arxiv, openalex).
28 |     """
29 | providers = get_all_providers()
30 |
31 | return {
32 | "providers": providers,
33 | "total_count": len(providers),
34 | }
35 |
36 |
37 | @tools_mcp.tool(
38 | name="search_papers",
39 |     description="Find papers using the supported filters and retrieve their metadata.",
40 | )
41 | async def search_papers(
42 | query: Annotated[str | None, "Text search query for title, author, content"] = None,
43 | provider: Annotated[str | None, "Provider ID to filter preprints (e.g., psyarxiv, socarxiv, arxiv, openalex, osf)"] = None,
44 | subjects: Annotated[str | None, "Subject categories to filter by (e.g., psychology, neuroscience)"] = None,
45 | date_published_gte: Annotated[str | None, "Filter preprints published on or after this date (e.g., 2024-01-01)"] = None,
46 | ) -> dict:
47 | if provider and provider not in [p["id"] for p in get_all_providers()]:
48 | return {
49 |             "error": f"Provider: {provider} not found. Please use list_providers to get the complete list of all available providers.",
50 | }
51 | if not provider:
52 | all_results = []
53 |
54 | arxiv_results = fetch_arxiv_papers(query=query, category=subjects)
55 | all_results.append(arxiv_results)
56 |
57 | openalex_results = fetch_openalex_papers(
58 | query=query,
59 | concepts=subjects,
60 | date_published_gte=date_published_gte
61 | )
62 | all_results.append(openalex_results)
63 |
64 | osf_results = fetch_osf_preprints(
65 | provider_id="osf",
66 | subjects=subjects,
67 | date_published_gte=date_published_gte,
68 | query=query,
69 | )
70 | all_results.append(osf_results)
71 |
72 | return {
73 | "papers": all_results,
74 |             "total_count": len(all_results),  # number of provider result sets, not individual papers
75 | "providers_searched": ["arxiv", "openalex", "osf"],
76 | }
77 | if provider == "osf" or provider in [p["id"] for p in fetch_osf_providers()]:
78 |         return fetch_osf_preprints(provider_id=provider,
79 | subjects=subjects,
80 | date_published_gte=date_published_gte,
81 | query=query,
82 | )
83 | elif provider == "arxiv":
84 | return fetch_arxiv_papers(
85 | query=query,
86 | category=subjects,
87 | )
88 | elif provider == "openalex":
89 | return fetch_openalex_papers(
90 | query=query,
91 | concepts=subjects,
92 | date_published_gte=date_published_gte,
93 | )
94 |
95 |
96 | @tools_mcp.tool(
97 | name="get_paper_by_id",
98 | description="Download and convert an academic paper to markdown format by its ID. Returns full paper content including title, abstract, sections, and references. Supports ArXiv (e.g., '2407.06405v1'), OpenAlex (e.g., 'W4385245566'), and OSF IDs.",
99 | )
100 | async def get_paper_by_id(paper_id: str) -> dict:
101 | try:
102 | # Check if it's an OpenAlex paper ID (starts with 'W' followed by numbers)
103 | if paper_id.startswith("W") and paper_id[1:].isdigit():
104 | # OpenAlex paper ID format (e.g., "W4385245566")
105 | metadata = fetch_single_openalex_paper_metadata(paper_id)
106 | return await download_paper_and_parse_to_markdown(
107 | metadata=metadata,
108 | pdf_url_field="pdf_url",
109 | paper_id=paper_id,
110 | write_images=False
111 | )
112 | # Check if it's an arXiv paper ID (contains 'v' followed by version number or matches arXiv format)
113 | elif "." in paper_id and ("v" in paper_id or len(paper_id.split(".")[0]) == 4):
114 | # arXiv paper ID format (e.g., "2407.06405v1" or "cs.AI/0001001")
115 | metadata = fetch_single_arxiv_paper_metadata(paper_id)
116 | return await download_paper_and_parse_to_markdown(
117 | metadata=metadata,
118 | pdf_url_field="download_url",
119 | paper_id=paper_id,
120 | write_images=False
121 | )
122 | else:
123 | # OSF paper ID format
124 | metadata = fetch_single_osf_preprint_metadata(paper_id)
125 | # Handle error case from OSF metadata function
126 | if isinstance(metadata, dict) and metadata.get("status") == "error":
127 | return metadata
128 | return await download_paper_and_parse_to_markdown(
129 | metadata=metadata,
130 | pdf_url_field="download_url",
131 | paper_id=paper_id,
132 | write_images=False
133 | )
134 | except ValueError as e:
135 | return {"status": "error", "message": str(e), "metadata": {}}
136 |
137 |
138 | @tools_mcp.tool(
139 | name="get_paper_metadata_by_id",
140 | description="Get metadata for an academic paper by its ID without downloading full content. Returns title, authors, abstract, publication date, journal info, and download URLs. Supports ArXiv, OpenAlex, and OSF IDs.",
141 | )
142 | async def get_paper_metadata_by_id(preprint_id: str) -> dict:
143 | # Check if it's an OpenAlex paper ID (starts with 'W' followed by numbers)
144 | if preprint_id.startswith("W") and preprint_id[1:].isdigit():
145 | # OpenAlex paper ID format (e.g., "W4385245566")
146 | return fetch_single_openalex_paper_metadata(preprint_id)
147 | # Check if it's an arXiv paper ID (contains 'v' followed by version number or matches arXiv format)
148 | elif "." in preprint_id and ("v" in preprint_id or len(preprint_id.split(".")[0]) == 4):
149 | # arXiv paper ID format (e.g., "2407.06405v1" or "cs.AI/0001001")
150 | return fetch_single_arxiv_paper_metadata(preprint_id)
151 | else:
152 | # OSF paper ID format
153 | return fetch_single_osf_preprint_metadata(preprint_id)
154 |
155 |
156 | @tools_mcp.tool(
157 | name="get_paper_content_by_url",
158 | description="Download and convert the PDF of a paper to markdown format from a direct PDF URL. Returns full paper content parsed from the PDF including title, abstract, sections, and references.",
159 | )
160 | async def get_paper_content_by_url(pdf_url: str) -> dict:
161 | return await download_pdf_and_parse_to_markdown(pdf_url)
```
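The ID routing in `get_paper_by_id` and `get_paper_metadata_by_id` is a simple heuristic: `W` followed by digits means OpenAlex, a dotted ID with a version suffix or a four-digit prefix means arXiv, and everything else is treated as an OSF preprint ID. A hypothetical helper (not part of `src/tools.py`) that isolates the same branching:

```python
def classify_paper_id(paper_id: str) -> str:
    """Hypothetical helper mirroring the branching used by get_paper_by_id."""
    # OpenAlex work IDs look like "W4385245566"
    if paper_id.startswith("W") and paper_id[1:].isdigit():
        return "openalex"
    # New-style arXiv IDs look like "2407.06405v1"
    if "." in paper_id and ("v" in paper_id or len(paper_id.split(".")[0]) == 4):
        return "arxiv"
    # Everything else is treated as an OSF preprint ID (e.g. "abc12")
    return "osf"


assert classify_paper_id("W4385245566") == "openalex"
assert classify_paper_id("2407.06405v1") == "arxiv"
assert classify_paper_id("xyz34") == "osf"
```

Note that an old-style arXiv identifier without a version suffix (e.g. `cs.AI/0001001`) does not satisfy the arXiv branch and would fall through to the OSF case under this heuristic.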
--------------------------------------------------------------------------------
/src/core/openalex.py:
--------------------------------------------------------------------------------
```python
1 | import requests
2 | from typing import Any, Dict, Optional
3 | from urllib.parse import urlencode
4 |
5 | from utils import sanitize_api_queries
6 |
7 |
8 | def fetch_openalex_papers(
9 | query: Optional[str] = None,
10 | author: Optional[str] = None,
11 | title: Optional[str] = None,
12 | publisher: Optional[str] = None,
13 | institution: Optional[str] = None,
14 | concepts: Optional[str] = None,
15 | date_published_gte: Optional[str] = None,
16 | max_results: int = 20,
17 | page: int = 1,
18 | ) -> Dict[str, Any]:
19 | """
20 | Fetch papers from the OpenAlex API using various search parameters.
21 |
22 | Args:
23 | query: General search query (full-text search)
24 | author: Author name to search for
25 | title: Title keywords to search for
26 | publisher: Publisher name to search for
27 | institution: Institution name to search for
28 | concepts: Concepts to filter by (e.g., 'computer science', 'artificial intelligence')
29 | date_published_gte: Published date greater than or equal to (YYYY-MM-DD)
30 | max_results: Maximum number of results to return (default 20, max 200)
31 | page: Page number for pagination (default 1)
32 |
33 | Returns:
34 | Dictionary containing papers data from OpenAlex API
35 | """
36 | base_url = "https://api.openalex.org/works"
37 | filters = {}
38 |
39 | if query:
40 | filters["search"] = sanitize_api_queries(query, max_length=500)
41 | if author:
42 | filters["filter"] = f"authors.author_name.search:{sanitize_api_queries(author, max_length=200)}"
43 | if title:
44 | if "filter" in filters:
45 |             filters["filter"] += f",title.search:{sanitize_api_queries(title, max_length=500)}"
46 | else:
47 | filters["filter"] = f"title.search:{sanitize_api_queries(title, max_length=500)}"
48 | if publisher:
49 | if "filter" in filters:
50 | filters["filter"] += f",publisher.search:{sanitize_api_queries(publisher, max_length=200)}"
51 | else:
52 | filters["filter"] = f"publisher.search:{sanitize_api_queries(publisher, max_length=200)}"
53 | if institution:
54 | if "filter" in filters:
55 | filters["filter"] += f",institutions.institution_name.search:{sanitize_api_queries(institution, max_length=200)}"
56 | else:
57 | filters["filter"] = f"institutions.institution_name.search:{sanitize_api_queries(institution, max_length=200)}"
58 | if concepts:
59 | # OpenAlex concepts can be tricky, simple search might work for now
60 | if "filter" in filters:
61 | filters["filter"] += f",concepts.display_name.search:{sanitize_api_queries(concepts, max_length=200)}"
62 | else:
63 | filters["filter"] = f"concepts.display_name.search:{sanitize_api_queries(concepts, max_length=200)}"
64 | if date_published_gte:
65 | if "filter" in filters:
66 |             filters["filter"] += f",from_publication_date:{date_published_gte}"
67 | else:
68 |             filters["filter"] = f"from_publication_date:{date_published_gte}"
69 |
70 | # Add pagination and results limit
71 | filters["per_page"] = min(max_results, 200) # OpenAlex max per_page is 200
72 | filters["page"] = page
73 |
74 | try:
75 | query_string = urlencode(filters, safe=":,") # Allow colons and commas in filter values
76 | url = f"{base_url}?{query_string}"
77 | response = requests.get(url, timeout=30)
78 | response.raise_for_status()
79 | data = response.json()
80 |
81 | papers = []
82 | for result in data.get("results", []):
83 | paper = _parse_openalex_work(result)
84 | papers.append(paper)
85 |
86 | return {
87 | "data": papers,
88 | "meta": {
89 | "total_results": data.get("meta", {}).get("count", 0),
90 | "page": page,
91 | "per_page": filters["per_page"],
92 | "search_query": query, # Only include general query for simplicity
93 | },
94 | "links": data.get("meta", {}).get("next_page", ""),
95 | }
96 |
97 | except requests.exceptions.RequestException as e:
98 | raise ValueError(f"Request failed: {str(e)}")
99 |
100 |
101 | def _parse_openalex_work(work_data: Dict[str, Any]) -> Dict[str, Any]:
102 | """Parse a single OpenAlex work entry."""
103 | # Extract authors
104 | authors = []
105 | for authorship in work_data.get("authorships", []):
106 | author = authorship.get("author", {})
107 | if author and author.get("display_name"):
108 | authors.append(author["display_name"])
109 |
110 | # Extract concepts
111 | concepts = []
112 | for concept in work_data.get("concepts", []):
113 | if concept.get("display_name"):
114 | concepts.append(concept["display_name"])
115 |
116 | # Extract PDF URL from primary location or alternative locations
117 | pdf_url = ""
118 | primary_location = work_data.get("primary_location") or {}
119 | if primary_location and primary_location.get("pdf_url"):
120 | pdf_url = primary_location["pdf_url"]
121 | elif primary_location and primary_location.get("landing_page_url"):
122 | pdf_url = primary_location.get("landing_page_url", "")
123 | else:
124 | # Check all locations for a PDF URL if primary doesn't have one
125 | for location in work_data.get("locations", []):
126 | if location.get("pdf_url"):
127 | pdf_url = location["pdf_url"]
128 | break
129 |
130 | # Extract abstract from inverted index
131 | abstract = ""
132 | abstract_inverted_index = work_data.get("abstract_inverted_index", {})
133 | if abstract_inverted_index:
134 | abstract = _reconstruct_abstract_from_inverted_index(abstract_inverted_index)
135 |
136 | # Extract OpenAlex ID from URL
137 | openalex_id = work_data.get("id", "")
138 | if openalex_id.startswith("https://openalex.org/"):
139 | openalex_id = openalex_id.replace("https://openalex.org/", "")
140 |
141 | # Get primary location source info
142 | primary_source = ""
143 | if primary_location and primary_location.get("source"):
144 | source = primary_location.get("source") or {}
145 | primary_source = source.get("display_name", "")
146 |
147 | return {
148 | "id": openalex_id,
149 | "doi": work_data.get("doi", ""),
150 | "title": work_data.get("title", "") or work_data.get("display_name", ""),
151 | "abstract": abstract,
152 | "authors": authors,
153 | "publication_date": work_data.get("publication_date", ""),
154 | "publication_year": work_data.get("publication_year"),
155 | "cited_by_count": work_data.get("cited_by_count", 0),
156 | "concepts": concepts,
157 | "primary_location_url": (work_data.get("primary_location") or {}).get("landing_page_url", ""),
158 | "primary_source": primary_source,
159 | "pdf_url": pdf_url,
160 | "open_access_status": (work_data.get("open_access") or {}).get("oa_status", "closed"),
161 | "is_open_access": (work_data.get("primary_location") or {}).get("is_oa", False),
162 | "type": work_data.get("type", ""),
163 | "relevance_score": work_data.get("relevance_score", 0),
164 | }
165 |
166 |
167 | def _reconstruct_abstract_from_inverted_index(inverted_index: Dict[str, Any]) -> str:
168 | """Reconstruct abstract text from OpenAlex's inverted index format."""
169 | if not inverted_index:
170 | return ""
171 |
172 | try:
173 | # Create a list to hold words at their positions
174 | word_positions = []
175 |
176 | for word, positions in inverted_index.items():
177 | if isinstance(positions, list):
178 | for position in positions:
179 | word_positions.append((position, word))
180 |
181 | # Sort by position and reconstruct text
182 | word_positions.sort(key=lambda x: x[0])
183 | abstract_words = [word for _, word in word_positions]
184 |
185 | return " ".join(abstract_words)
186 | except Exception:
187 | # If reconstruction fails, return empty string
188 | return ""
189 |
190 |
191 | def fetch_single_openalex_paper_metadata(paper_id: str) -> Dict[str, Any]:
192 | """
193 | Fetch metadata for a single OpenAlex paper by ID.
194 |
195 | Args:
196 | paper_id: OpenAlex paper ID (e.g., 'W2741809809')
197 |
198 | Returns:
199 | Dictionary containing paper metadata
200 | """
201 | base_url = "https://api.openalex.org/works"
202 | url = f"{base_url}/{paper_id}"
203 |
204 | try:
205 | response = requests.get(url, timeout=30)
206 | response.raise_for_status()
207 | work_data = response.json()
208 |
209 | if not work_data.get("id"):
210 | raise ValueError(f"No metadata found for paper: {paper_id}")
211 |
212 | metadata = _parse_openalex_work(work_data)
213 | return metadata
214 |
215 | except requests.exceptions.RequestException as e:
216 | raise ValueError(f"Failed to fetch paper metadata: {str(e)}")
```
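OpenAlex returns abstracts as an inverted index (word → list of positions) rather than plain text. A small worked example of the reconstruction performed by `_reconstruct_abstract_from_inverted_index`; the sample index below is made up for illustration:

```python
# Made-up sample in the shape of OpenAlex's abstract_inverted_index
inverted_index = {
    "Large": [0],
    "language": [1],
    "models": [2],
    "are": [3],
    "trained": [4],
    "on": [5],
    "web-scale": [6],
    "corpora.": [7],
}

# Same steps as _reconstruct_abstract_from_inverted_index: flatten to
# (position, word) pairs, sort by position, then join with spaces.
word_positions = [(pos, word) for word, positions in inverted_index.items() for pos in positions]
word_positions.sort(key=lambda x: x[0])
abstract = " ".join(word for _, word in word_positions)

print(abstract)  # "Large language models are trained on web-scale corpora."
```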
--------------------------------------------------------------------------------
/src/core/osf.py:
--------------------------------------------------------------------------------
```python
1 | from typing import Any, Dict, Optional
2 | from urllib.parse import quote, urlencode
3 |
4 | import requests
5 |
6 | from utils import sanitize_api_queries
7 |
8 | from .providers import fetch_osf_providers, validate_provider
9 |
10 |
11 | def fetch_osf_preprints(
12 | provider_id: Optional[str] = None,
13 | subjects: Optional[str] = None,
14 | date_published_gte: Optional[str] = None,
15 | query: Optional[str] = None,
16 | ) -> Dict[str, Any]:
17 | """
18 | NOTE: The OSF API only supports a limited set of filters. Many common filters
19 | like title, DOI, creator, etc. are NOT supported by the OSF API.
20 |
21 | When query is provided, uses the trove search endpoint which supports full-text search.
22 |
23 | Args:
24 | provider_id: The provider ID (e.g., 'psyarxiv', 'socarxiv')
25 | subjects: Subject filter (e.g., 'psychology', 'neuroscience')
26 | date_published_gte: Published date greater than or equal to (YYYY-MM-DD)
27 | query: Text search query for title, author, content (uses trove endpoint)
28 |
29 | Returns:
30 | Dictionary containing preprints data from OSF API or trove search
31 | """
32 | # If query is provided, use trove search endpoint
33 | if query:
34 | return fetch_osf_preprints_via_trove(query, provider_id)
35 |
36 | # Build query parameters (only using OSF API supported filters)
37 | filters = {}
38 |
39 | if provider_id:
40 | filters["filter[provider]"] = sanitize_api_queries(provider_id, max_length=50)
41 | if subjects:
42 | filters["filter[subjects]"] = sanitize_api_queries(subjects, max_length=100)
43 | if date_published_gte:
44 | filters["filter[date_published][gte]"] = date_published_gte # Dates don't need cleaning
45 |
46 | # Build URL with filters
47 | base_url = "https://api.osf.io/v2/preprints/"
48 | if filters:
49 | query_string = urlencode(filters, safe="", quote_via=quote)
50 | url = f"{base_url}?{query_string}"
51 | else:
52 | url = base_url
53 |
54 | try:
55 | response = requests.get(url, timeout=30)
56 | response.raise_for_status()
57 | return response.json()
58 | except requests.exceptions.HTTPError as e:
59 | if response.status_code == 400:
60 | if len(filters) > 1:
61 | simple_filters = {}
62 | if provider_id:
63 | simple_filters["filter[provider]"] = sanitize_api_queries(provider_id, max_length=50)
64 |
65 | simple_query = urlencode(simple_filters, safe="", quote_via=quote)
66 | simple_url = f"{base_url}?{simple_query}"
67 |
68 | try:
69 | simple_response = requests.get(simple_url, timeout=30)
70 | simple_response.raise_for_status()
71 | result = simple_response.json()
72 |
73 | # Add a note about the simplified search
74 | if "meta" not in result:
75 | result["meta"] = {}
76 | result["meta"][
77 | "search_note"
78 | ] = f"Original search failed (400 error), showing all results for provider '{provider_id}'. You may need to filter results manually."
79 | return result
80 | except:
81 | pass
82 |
83 | raise ValueError(f"Bad request (400) - The search parameters may be invalid. Original error: {str(e)}")
84 | else:
85 | raise e
86 | except requests.exceptions.RequestException as e:
87 | raise ValueError(f"Request failed: {str(e)}")
88 |
89 |
90 | def fetch_osf_preprints_via_trove(query: str, provider_id: Optional[str] = None) -> Dict[str, Any]:
91 | """
92 | Fetch preprints using the trove search endpoint and transform to standard format.
93 | """
94 | from urllib.parse import quote_plus
95 |
96 | # Build trove search URL
97 | base_url = "https://share.osf.io/trove/index-card-search"
98 | params = {
99 | "cardSearchFilter[resourceType]": "Preprint",
100 | "cardSearchText[*,creator.name,isContainedBy.creator.name]": sanitize_api_queries(query, max_length=200),
101 | "page[size]": "20", # Match our default page size
102 | "sort": "-relevance",
103 | }
104 |
105 | # Validate provider if specified (we'll filter results later)
106 | if provider_id:
107 | if not validate_provider(provider_id):
108 | osf_providers = fetch_osf_providers()
109 | valid_ids = [p["id"] for p in osf_providers]
110 | raise ValueError(f"Invalid OSF provider: {provider_id}. Valid OSF providers: {valid_ids}")
111 |
112 | # Build query string manually to handle complex parameter names
113 | query_parts = []
114 | for key, value in params.items():
115 | query_parts.append(f"{quote_plus(key)}={quote_plus(str(value))}")
116 | query_string = "&".join(query_parts)
117 | url = f"{base_url}?{query_string}"
118 |
119 | try:
120 | headers = {"Accept": "application/json"}
121 | response = requests.get(url, headers=headers, timeout=30)
122 | response.raise_for_status()
123 | trove_data = response.json()
124 |
125 | # Transform trove format to standard OSF API format
126 | transformed_data = []
127 | for item in trove_data.get("data", []):
128 | # Extract OSF ID from @id field
129 | osf_id = ""
130 | if "@id" in item and "osf.io/" in item["@id"]:
131 | osf_id = item["@id"].split("/")[-1]
132 |
133 | # Filter by provider if specified
134 | if provider_id:
135 | # Check if this item is from the specified provider
136 | publisher_info = item.get("publisher", [])
137 | if isinstance(publisher_info, list) and len(publisher_info) > 0:
138 | publisher_id = publisher_info[0].get("@id", "")
139 | # Extract provider ID from publisher URL (e.g., "https://osf.io/preprints/psyarxiv" -> "psyarxiv")
140 | if provider_id not in publisher_id:
141 | continue # Skip this item if it doesn't match the provider
142 | else:
143 | continue # Skip if no publisher info
144 |
145 | # Transform to standard format
146 | transformed_item = {
147 | "id": osf_id,
148 | "type": "preprints",
149 | "attributes": {
150 | "title": extract_first_value(item.get("title", [])),
151 | "description": extract_first_value(item.get("description", [])),
152 | "date_created": extract_first_value(item.get("dateCreated", [])),
153 | "date_published": extract_first_value(item.get("dateAccepted", [])),
154 | "date_modified": extract_first_value(item.get("dateModified", [])),
155 | "doi": extract_doi_from_identifiers(item.get("identifier", [])),
156 | "tags": [kw.get("@value", "") for kw in item.get("keyword", [])],
157 | "subjects": [subj.get("prefLabel", [{}])[0].get("@value", "") for subj in item.get("subject", [])],
158 | },
159 | "relationships": {},
160 | "links": {"self": item.get("@id", "")},
161 | }
162 | transformed_data.append(transformed_item)
163 |
164 | # Return in standard OSF API format
165 | return {
166 | "data": transformed_data,
167 | "meta": {
168 | "version": "2.0", # Match OSF API version
169 | "total": trove_data.get("meta", {}).get("total", len(transformed_data)),
170 | "search_note": f"Results from trove search for query: '{query}'",
171 | },
172 | "links": {
173 | "first": trove_data.get("links", {}).get("first", ""),
174 | "next": trove_data.get("links", {}).get("next", ""),
175 | "last": "",
176 | "prev": "",
177 | "meta": "",
178 | },
179 | }
180 |
181 | except requests.exceptions.RequestException as e:
182 | raise ValueError(f"Trove search failed: {str(e)}")
183 |
184 |
185 | def extract_first_value(field_list):
186 | """Extract the first @value from a field list."""
187 | if isinstance(field_list, list) and len(field_list) > 0:
188 | if isinstance(field_list[0], dict) and "@value" in field_list[0]:
189 | return field_list[0]["@value"]
190 | elif isinstance(field_list[0], str):
191 | return field_list[0]
192 | return ""
193 |
194 |
195 | def extract_doi_from_identifiers(identifiers):
196 | """Extract DOI from identifier list."""
197 | for identifier in identifiers:
198 | if isinstance(identifier, dict) and "@value" in identifier:
199 | value = identifier["@value"]
200 | if "doi.org" in value or value.startswith("10."):
201 | return value
202 | return ""
203 |
204 |
205 | def fetch_single_osf_preprint_metadata(preprint_id: str) -> Dict[str, Any]:
206 | try:
207 | preprint_url = f"https://api.osf.io/v2/preprints/{preprint_id}"
208 | response = requests.get(preprint_url, timeout=30)
209 | response.raise_for_status()
210 | preprint_data = response.json()
211 |
212 | primary_file_url = preprint_data["data"]["relationships"]["primary_file"]["links"]["related"]["href"]
213 | file_response = requests.get(primary_file_url, timeout=30)
214 | file_response.raise_for_status()
215 | file_data = file_response.json()
216 |
217 | # Get the download URL
218 | download_url = file_data["data"]["links"]["download"]
219 |
220 | # Prepare metadata first
221 | attributes = preprint_data["data"]["attributes"]
222 | metadata = {
223 | "id": preprint_id,
224 | "title": attributes.get("title", ""),
225 | "description": attributes.get("description", ""),
226 | "date_created": attributes.get("date_created", ""),
227 | "date_published": attributes.get("date_published", ""),
228 | "date_modified": attributes.get("date_modified", ""),
229 | "is_published": attributes.get("is_published", False),
230 | "is_preprint_orphan": attributes.get("is_preprint_orphan", False),
231 | "license_record": attributes.get("license_record", {}),
232 | "doi": attributes.get("doi", ""),
233 | "tags": attributes.get("tags", []),
234 | "subjects": attributes.get("subjects", []),
235 | "download_url": download_url,
236 | }
237 |
238 | if not download_url:
239 | return {"status": "error", "message": "Download URL not available", "metadata": metadata}
240 |
241 | return metadata
242 | except requests.exceptions.RequestException as e:
243 | raise ValueError(f"Failed to fetch preprint metadata: {str(e)}")
244 |
```
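The trove transformation above relies on two small helpers to pull plain values out of JSON-LD-style fields. A short usage sketch with made-up field data; the import assumes `src/` is on the import path, as it is for the server:

```python
from core.osf import extract_doi_from_identifiers, extract_first_value

# Made-up trove-style fields for illustration; real values come from
# https://share.osf.io/trove/index-card-search responses
title_field = [{"@value": "A Preprint About Paperclips"}]
identifier_field = [
    {"@value": "https://osf.io/abc12"},
    {"@value": "https://doi.org/10.31234/osf.io/abc12"},
]

print(extract_first_value(title_field))                # "A Preprint About Paperclips"
print(extract_doi_from_identifiers(identifier_field))  # "https://doi.org/10.31234/osf.io/abc12"
```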