# Directory Structure
```
├── .gitignore
├── Cargo.lock
├── Cargo.toml
├── extension.toml
├── pdf_rag
│ ├── .aiignore
│ ├── .python-version
│ ├── pyproject.toml
│ ├── README.md
│ ├── src
│ │ └── pdf_rag
│ │ ├── __init__.py
│ │ ├── env.py
│ │ ├── rag.py
│ │ └── server.py
│ └── uv.lock
├── README.md
└── src
└── lib.rs
```
# Files
--------------------------------------------------------------------------------
/pdf_rag/.python-version:
--------------------------------------------------------------------------------
```
3.12
```
--------------------------------------------------------------------------------
/pdf_rag/.aiignore:
--------------------------------------------------------------------------------
```
.env
.venv
.idea
pdfsearch.sqlite
```
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
/target
__pycache__
.idea
.venv
.ruff_cache
*.wasm
*.sqlite
**/.env
```
--------------------------------------------------------------------------------
/pdf_rag/README.md:
--------------------------------------------------------------------------------
```markdown
# About
The heart of the extension - this is the MCP server logic.
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
```markdown
# PDF Search for Zed
A document search extension for Zed that lets you semantically search through a
PDF document and use the results in Zed's AI Assistant.
## Prerequisites
This extension currently requires:
1. An `OpenAI` API key (to generate embeddings)
2. `uv` installed on your system
**Note:** While the current setup requires an OpenAI API key for generating embeddings, we plan to implement a self-contained alternative in future versions. Community feedback will help prioritize these improvements.
## Quick Start
1. Clone the repository
```bash
git clone https://github.com/freespirit/pdfsearch-zed.git
```
2. Set up the Python environment for the MCP server:
```bash
cd pdfsearch-zed/pdf_rag
uv venv
uv sync
```
3. [Install Dev Extension](https://zed.dev/docs/extensions/developing-extensions) in Zed
4. Build the search db
```bash
cd /path/to/pdfsearch-zed/pdf_rag
echo "OPENAI_API_KEY=sk-..." > src/pdf_rag/.env
# This may take a couple of minutes, depending on the documents' size
# You can provide multiple files and directories as arguments.
# - files would be chunked.
# - a directory would be considered as if its files contains chunks.
# E.g. they won't be further split.
uv run src/pdf_rag/rag.py build "file1.pdf" "dir1" "file2.md" ...
```
5. Configure Zed
```json
"context_servers": {
"pdfsearch-context-server": {
"settings": {
"extension_path": "/path/to/pdfsearch-zed"
}
}
}
```
## Usage
1. Open Zed's AI Assistant panel
2. Type `/pdfsearch` followed by your search query
3. The extension will search the PDF and add relevant sections to the AI
Assistant's context
## Future Improvements
- [x] Self-contained vector store
- [ ] Self-contained embeddings
- [ ] Automated index building on first run
- [ ] Configurable result size
- [x] Support for multiple PDFs
- [x] Optional: Additional file formats beyond PDF
## Project Structure
- `pdf_rag/`: Python-based MCP server implementation
- `src/`: Zed extension code
- `extension.toml` and `Cargo.toml`: Zed extension configuration files
## Known Limitations
- Manual index building is required before first use
- Requires external services (OpenAI)
```
--------------------------------------------------------------------------------
/Cargo.toml:
--------------------------------------------------------------------------------
```toml
[package]
name = "pdfsearch"
version = "0.1.0"
edition = "2021"
[lib]
crate-type = ["cdylib"]
[dependencies]
serde = "1.0.217"
zed_extension_api = "0.3.0"
```
--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/__init__.py:
--------------------------------------------------------------------------------
```python
import asyncio
from . import env, server
def main():
"""Main entry point for the package."""
asyncio.run(server.main())
__all__ = ["main", "server"]
```
--------------------------------------------------------------------------------
/extension.toml:
--------------------------------------------------------------------------------
```toml
id = "pdfsearch"
name = "PDF Search"
description = "A PDF search (RAG-style) MCP for Zed"
version = "0.2.0"
schema_version = 1
authors = ["stano <TBD email>"]
repository = "https://github.com/freespirit/pdfsearch-zed"
[context_servers.my-context-server]
name = "PDF Search"
```
--------------------------------------------------------------------------------
/pdf_rag/pyproject.toml:
--------------------------------------------------------------------------------
```toml
[project]
name = "pdf_rag"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"httpx>=0.28.1",
"libsql-experimental==0.0.41",
"mcp[cli]>=1.6.0",
"openai>=1.58.1",
"pypdf>=5.1.0",
"tiktoken>=0.8.0",
]
[build-system]
requires = [ "hatchling",]
build-backend = "hatchling.build"
[project.scripts]
pdf_rag = "pdf_rag:main"
```
--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/env.py:
--------------------------------------------------------------------------------
```python
import os
def load_env_file(file_path='.env'):
with open(file_path, 'r') as file:
for line in file:
line = line.strip()
if not line or line.startswith('#'):
continue
if '=' in line:
key, value = line.split('=', 1)
key = key.strip()
value = value.strip()
# Remove quotes if present
if value and (value[0] == value[-1] == '"' or value[0] == value[-1] == "'"):
value = value[1:-1]
os.environ[key] = value
```
--------------------------------------------------------------------------------
/src/lib.rs:
--------------------------------------------------------------------------------
```rust
use std::path::Path;
use serde::Deserialize;
use zed_extension_api::{
self as zed, serde_json, settings::ContextServerSettings, ContextServerId,
};
#[derive(Debug, Deserialize)]
struct PdfSearchContextServerSettings {
extension_path: String,
}
struct MyExtension {}
impl zed::Extension for MyExtension {
fn new() -> Self
where
Self: Sized,
{
Self {}
}
fn context_server_command(
&mut self,
context_server_id: &ContextServerId,
project: &zed::Project,
) -> zed::Result<zed::Command> {
let settings = ContextServerSettings::for_project("pdfsearch-context-server", project)?;
let Some(settings) = settings.settings else {
return Err("Missing `context_servers` settings for `pdfsearch-context-server`".into());
};
let settings: PdfSearchContextServerSettings =
serde_json::from_value(settings).map_err(|e| e.to_string())?;
let mcp_python_module = String::from("pdf_rag");
let extension_path = Path::new(settings.extension_path.as_str());
let mcp_server_path = extension_path.join(&mcp_python_module);
Ok(zed::Command {
command: "uv".to_string(),
args: vec![
format!("--directory={}", mcp_server_path.to_string_lossy()),
"run".to_string(),
mcp_python_module,
],
env: vec![],
})
}
}
zed::register_extension!(MyExtension);
```
--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/server.py:
--------------------------------------------------------------------------------
```python
import asyncio
import mcp.server.stdio
import mcp.types as types
from mcp.server import NotificationOptions, Server
from mcp.server.models import InitializationOptions
from pdf_rag.env import load_env_file
from pdf_rag.rag import RAG
load_env_file()
server = Server("pdfsearch-server")
@server.list_prompts()
async def list_prompts() -> list[types.Prompt]:
return [
types.Prompt(
name="pdfsearch",
description="Do a RAG-style expansion of your prompt, enriching it with relevant information from the PDF.",
arguments=[
types.PromptArgument(
name="input",
description="What to look for in the document.",
required=True,
)
]
)
]
@server.list_tools()
async def list_tools() -> list[types.Tool]:
return [
types.Tool(
name="pdfsearch",
description="Retrieve relevant information from a document.",
inputSchema={
"type": "object",
"required": ["query"],
"properties": {
"query": {
"type": "string",
"description": "User provided query (to search for in the documents)",
}
},
},
)
]
@server.get_prompt()
async def get_prompt(
name: str,
arguments: dict[str, str] | None = None
) -> types.GetPromptResult:
if name != "pdfsearch":
raise ValueError(f"Prompt not found: {name}")
user_input = arguments.get("input") if arguments else ""
result = await _search(user_input)
return types.GetPromptResult(
messages=[
types.PromptMessage(
role="user",
content=types.TextContent(
type="text",
text=result,
),
)
]
)
@server.call_tool()
async def call_a_tool(
name: str,
arguments: dict
) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
if name != "pdfsearch":
raise ValueError(f"Unknown tool: {name}")
# if "query" not in arguments:
# raise ValueError("Missing required argument 'query'")
result = f"received: {arguments}"
try:
user_input = arguments["query"]
result = await _search(user_input)
except Exception as e:
result = str(e.__repr__())
return [types.TextContent(type="text", text=result)]
async def _search(user_input):
# TODO figure out when to build the vector db
rag = RAG()
related_chunks = await rag.search(user_input)
response = ""
for chunk in related_chunks:
response += "<text>\n"
response += chunk
response += "</text>\n"
response += "\n"
return response
async def main():
# Run the server using stdin/stdout streams
async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
await server.run(
read_stream,
write_stream,
InitializationOptions(
server_name="pdf_rag",
server_version="0.1.0",
capabilities=server.get_capabilities(
notification_options=NotificationOptions(),
experimental_capabilities={},
),
),
)
# This is needed if you'd like to connect to a custom client
if __name__ == "__main__":
asyncio.run(main())
```
--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/rag.py:
--------------------------------------------------------------------------------
```python
"""THe RAG backbone of this MCP server"""
import argparse
import asyncio
from pathlib import Path
from typing import List
import libsql_experimental as libsql
import tiktoken
from openai import AsyncOpenAI
from pypdf import PdfReader
from tqdm import tqdm
from typing_extensions import Tuple
from pdf_rag.env import load_env_file
VECTOR_COLLECTION_NAME = "document_chunks"
EMBEDDING_DIMENSIONS = 1024
QUERY_DROP = f"DROP TABLE IF EXISTS {VECTOR_COLLECTION_NAME}"
QUERY_CREATE = (
f"CREATE TABLE IF NOT EXISTS {VECTOR_COLLECTION_NAME} ("
f" text TEXT,"
f" embedding F32_BLOB({EMBEDDING_DIMENSIONS})"
f");"
)
QUERY_INSERT = (
f"INSERT INTO {VECTOR_COLLECTION_NAME} (text, embedding) VALUES (?, vector32(?))"
)
QUERY_SEARCH = (
"SELECT text, vector_distance_cos(embedding, vector32(?)) "
f" FROM {VECTOR_COLLECTION_NAME} "
f" ORDER BY vector_distance_cos(embedding, vector32(?)) ASC "
f" LIMIT 10;"
)
load_env_file()
class RAG:
"""Basic document RAG system.
Splits a document in chunks and later retrieves the most relevant chunks for a given query.
"""
openai: AsyncOpenAI
db_file: Path
def __init__(self, db_path: str = "pdfsearch.sqlite"):
self.openai = AsyncOpenAI()
self.db_file = Path(db_path)
def build_search_db(self):
conn = libsql.connect(str(self.db_file.absolute()))
conn.execute(QUERY_DROP)
conn.execute(QUERY_CREATE)
conn.commit()
def add_knowledge(self, text: str, embedding: List[float]):
conn = libsql.connect(str(self.db_file.absolute()))
conn.execute(QUERY_INSERT, (text, str(embedding)))
conn.commit()
async def search(self, query: str) -> List[str]:
embedding = await embed(query, self.openai)
conn = libsql.connect(str(self.db_file.absolute()))
search_result = conn.execute(
QUERY_SEARCH,
(str(embedding), str(embedding)),
).fetchall()
conn.commit()
for row in search_result:
print(row[0][:25], row[1])
return [row[0] for row in search_result]
def chunkify(
text: str, max_tokens: int = 512, overlap: int = 64, tokenizer=None
) -> List[str]:
"""Split text into token-based chunks with overlap."""
if not text.strip():
return []
if tokenizer is None:
tokenizer = tiktoken.get_encoding("cl100k_base")
# Tokenize the entire text first
tokenized_text = tokenizer.encode(text)
num_tokens = len(tokenized_text)
chunks = []
# Iterate over the tokenized text in `max_tokens` increments with specified overlap
start_idx = 0
while start_idx < num_tokens:
end_idx = min(start_idx + max_tokens, num_tokens)
chunk_tokens = tokenized_text[start_idx:end_idx]
chunk_text = tokenizer.decode(chunk_tokens) # Decode the tokens back to text
chunks.append(chunk_text)
start_idx += max_tokens - overlap # Move window forward with overlap
return chunks
async def embed(text: str, client: AsyncOpenAI) -> list[float]:
response = await client.embeddings.create(
input=text, model="text-embedding-3-small", dimensions=1024
)
return response.data[0].embedding
async def embed_pdf(
file_path: Path, should_split: bool = True
) -> List[Tuple[str, List[float]]]:
pdf_text = ""
with open(file_path, "rb") as pdf_file:
reader = PdfReader(pdf_file)
for page in reader.pages:
pdf_text += page.extract_text(extraction_mode="plain")
chunks = chunkify(text=pdf_text) if should_split else [pdf_text]
tasks = [
embed(chunk, AsyncOpenAI())
for chunk in tqdm(chunks, desc=f"Embedding {file_path.name}")
]
embeddings = await asyncio.gather(*tasks)
return list(zip(chunks, embeddings))
async def embed_text(
text: str,
should_split: bool = True,
) -> List[Tuple[str, List[float]]]:
chunks = chunkify(text=text) if should_split else [text]
tasks = [
embed(chunk, AsyncOpenAI())
for chunk in tqdm(chunks, desc=f"Embedding {text[:25]}")
]
embeddings = await asyncio.gather(*tasks)
return list(zip(chunks, embeddings))
async def main():
parser = argparse.ArgumentParser(description="PDF document RAG system")
parser.add_argument(
"action",
choices=["build", "search", "chunkify"],
help="Action to perform",
)
parser.add_argument(
"inputs", nargs="+", help="Input files/directories or a search query"
)
args = parser.parse_args()
if args.action == "build":
rag = RAG()
rag.build_search_db()
# Process each input path
total_chunks = 0
for input_path in args.inputs:
path = Path(input_path)
if path.is_dir():
tasks = []
texts = []
for text_file in tqdm(
path.glob("*.txt"), desc=f"Embedding files in {path.name}"
):
text = text_file.read_text()
tasks.append(embed(text, AsyncOpenAI()))
texts.append(text)
embeddings = await asyncio.gather(*tasks)
for text, embedding in zip(texts, embeddings):
rag.add_knowledge(text, embedding)
elif path.is_file() and path.suffix.lower() == ".pdf":
embeddings = await embed_pdf(path)
for chunk, embedding in embeddings:
rag.add_knowledge(chunk, embedding)
total_chunks += len(embeddings)
else: # Assume a single file in other text format - txt, md...
text = path.read_text()
embeddings = await embed_text(text)
for text, embedding in embeddings:
rag.add_knowledge(text, embedding)
total_chunks += 1
print(f"Inserted {total_chunks} chunks of text")
elif args.action == "search":
rag = RAG()
result = rag.search(args.inputs[0])
elif args.action == "chunkify":
pdf_text = ""
with open(args.inputs[0], "rb") as pdf_file:
reader = PdfReader(pdf_file)
for page in reader.pages:
pdf_text += page.extract_text(extraction_mode="plain")
chunks = chunkify(text=pdf_text)
if __name__ == "__main__":
asyncio.run(main())
```