# Directory Structure
```
├── .gitignore
├── Cargo.lock
├── Cargo.toml
├── extension.toml
├── pdf_rag
│   ├── .aiignore
│   ├── .python-version
│   ├── pyproject.toml
│   ├── README.md
│   ├── src
│   │   └── pdf_rag
│   │       ├── __init__.py
│   │       ├── env.py
│   │       ├── rag.py
│   │       └── server.py
│   └── uv.lock
├── README.md
└── src
    └── lib.rs
```
# Files
--------------------------------------------------------------------------------
/pdf_rag/.python-version:
--------------------------------------------------------------------------------
```
3.12
```
--------------------------------------------------------------------------------
/pdf_rag/.aiignore:
--------------------------------------------------------------------------------
```
.env
.venv
.idea
pdfsearch.sqlite
```
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
/target
__pycache__


.idea
.venv
.ruff_cache

*.wasm
*.sqlite

**/.env
```
--------------------------------------------------------------------------------
/pdf_rag/README.md:
--------------------------------------------------------------------------------
```markdown
# About

The heart of the extension: the MCP server logic.
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
````markdown
# PDF Search for Zed

A document search extension for Zed that lets you semantically search through a
PDF document and use the results in Zed's AI Assistant.

## Prerequisites

This extension currently requires:

1. An OpenAI API key (to generate embeddings)
2. `uv` installed on your system

**Note:** While the current setup requires an OpenAI API key for generating embeddings, we plan to implement a self-contained alternative in future versions. Community feedback will help prioritize these improvements.

## Quick Start

1. Clone the repository

```bash
git clone https://github.com/freespirit/pdfsearch-zed.git
```

2. Set up the Python environment for the MCP server:

```bash
cd pdfsearch-zed/pdf_rag
uv venv
uv sync
```

3. [Install Dev Extension](https://zed.dev/docs/extensions/developing-extensions) in Zed

4. Build the search db

```bash
cd /path/to/pdfsearch-zed/pdf_rag

# The .env file is read from the current working directory (pdf_rag/).
echo "OPENAI_API_KEY=sk-..." > .env

# This may take a couple of minutes, depending on the size of the documents.
# You can pass multiple files and directories as arguments:
#   - files are split into chunks;
#   - each `.txt` file in a directory is treated as a single chunk
#     and is not split further.
uv run src/pdf_rag/rag.py build "file1.pdf" "dir1" "file2.md" ...
```

5. Configure Zed

```json
"context_servers": {
    "pdfsearch-context-server": {
        "settings": {
            "extension_path": "/path/to/pdfsearch-zed"
        }
    }
}
```

## Usage

1. Open Zed's AI Assistant panel
2. Type `/pdfsearch` followed by your search query
3. The extension will search the PDF and add relevant sections to the AI
   Assistant's context

## Future Improvements

- [x] Self-contained vector store
- [ ] Self-contained embeddings
- [ ] Automated index building on first run
- [ ] Configurable result size
- [x] Support for multiple PDFs
- [x] Optional: Additional file formats beyond PDF

## Project Structure

- `pdf_rag/`: Python-based MCP server implementation
- `src/`: Zed extension code
- `extension.toml` and `Cargo.toml`: Zed extension configuration files

## Known Limitations

- Manual index building is required before first use
- Requires external services (OpenAI)
````
--------------------------------------------------------------------------------
/Cargo.toml:
--------------------------------------------------------------------------------
```toml
[package]
name = "pdfsearch"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

[dependencies]
serde = "1.0.217"
zed_extension_api = "0.3.0"
```
--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/__init__.py:
--------------------------------------------------------------------------------
```python
import asyncio

from . import env, server


def main():
    """Main entry point for the package."""
    asyncio.run(server.main())


__all__ = ["main", "server"]
```
--------------------------------------------------------------------------------
/extension.toml:
--------------------------------------------------------------------------------
```toml
id = "pdfsearch"
name = "PDF Search"
description = "A PDF search (RAG-style) MCP for Zed"
version = "0.2.0"
schema_version = 1
authors = ["stano <TBD email>"]
repository = "https://github.com/freespirit/pdfsearch-zed"


[context_servers.my-context-server]
name = "PDF Search"
```
--------------------------------------------------------------------------------
/pdf_rag/pyproject.toml:
--------------------------------------------------------------------------------
```toml
[project]
name = "pdf_rag"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "httpx>=0.28.1",
    "libsql-experimental==0.0.41",
    "mcp[cli]>=1.6.0",
    "openai>=1.58.1",
    "pypdf>=5.1.0",
    "tiktoken>=0.8.0",
    # rag.py imports tqdm directly, so declare it instead of relying on a transitive install
    "tqdm",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project.scripts]
pdf_rag = "pdf_rag:main"
```
--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/env.py:
--------------------------------------------------------------------------------
```python
import os


def load_env_file(file_path='.env'):
    """Load KEY=VALUE pairs from `file_path` into `os.environ`.

    The path is resolved against the current working directory.
    Existing variables with the same name are overwritten.
    """
    with open(file_path, 'r') as file:
        for line in file:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith('#'):
                continue

            if '=' in line:
                key, value = line.split('=', 1)
                key = key.strip()
                value = value.strip()

                # Remove quotes if present
                if value and (value[0] == value[-1] == '"' or value[0] == value[-1] == "'"):
                    value = value[1:-1]

                os.environ[key] = value
```
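The parsing rules in `load_env_file` can be tried out without touching the file system or `os.environ`. The following is a minimal sketch, assuming a hypothetical helper `parse_env_lines` that is not part of the repository; it applies the same skip, split, and unquote steps to a list of lines, except that it only strips quotes from values of at least two characters.

```python
def parse_env_lines(lines):
    """Hypothetical helper: parse KEY=VALUE lines the way load_env_file does."""
    parsed = {}
    for line in lines:
        line = line.strip()
        # Blank lines, comments, and lines without '=' are ignored
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        # Strip one pair of matching single or double quotes
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        parsed[key] = value
    return parsed


sample = [
    "# comment",
    "",
    'OPENAI_API_KEY="sk-example"',
    "NAME = 'pdf search'",
    "URL=https://example.com/?a=b",
]
result = parse_env_lines(sample)
print(result)
```

The sample values, including `sk-example`, are placeholders for illustration only.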
--------------------------------------------------------------------------------
/src/lib.rs:
--------------------------------------------------------------------------------
```rust
use std::path::Path;

use serde::Deserialize;
use zed_extension_api::{
    self as zed, serde_json, settings::ContextServerSettings, ContextServerId,
};

#[derive(Debug, Deserialize)]
struct PdfSearchContextServerSettings {
    extension_path: String,
}

struct MyExtension {}

impl zed::Extension for MyExtension {
    fn new() -> Self
    where
        Self: Sized,
    {
        Self {}
    }

    fn context_server_command(
        &mut self,
        context_server_id: &ContextServerId,
        project: &zed::Project,
    ) -> zed::Result<zed::Command> {
        let settings = ContextServerSettings::for_project("pdfsearch-context-server", project)?;
        let Some(settings) = settings.settings else {
            return Err("Missing `context_servers` settings for `pdfsearch-context-server`".into());
        };
        let settings: PdfSearchContextServerSettings =
            serde_json::from_value(settings).map_err(|e| e.to_string())?;

        let mcp_python_module = String::from("pdf_rag");
        let extension_path = Path::new(settings.extension_path.as_str());
        let mcp_server_path = extension_path.join(&mcp_python_module);

        Ok(zed::Command {
            command: "uv".to_string(),
            args: vec![
                format!("--directory={}", mcp_server_path.to_string_lossy()),
                "run".to_string(),
                mcp_python_module,
            ],
            env: vec![],
        })
    }
}

zed::register_extension!(MyExtension);
```
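To make the launch command concrete, here is a small Python sketch of the argument list that `context_server_command` assembles from the `extension_path` setting. The function name `context_server_argv` is hypothetical and exists only for illustration; the path is the placeholder used in the README.

```python
from pathlib import PurePosixPath


def context_server_argv(extension_path: str, module: str = "pdf_rag") -> list[str]:
    """Hypothetical mirror of the Rust code: build the `uv` command line."""
    # The MCP server project lives in <extension_path>/pdf_rag
    server_dir = PurePosixPath(extension_path) / module
    # `uv --directory=<dir> run <script>` runs the project script from that directory
    return ["uv", f"--directory={server_dir}", "run", module]


argv = context_server_argv("/path/to/pdfsearch-zed")
print(" ".join(argv))
```

Because `uv` runs with `pdf_rag/` as its working directory, relative paths such as `.env` and `pdfsearch.sqlite` in the Python code resolve inside that directory.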
--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/server.py:
--------------------------------------------------------------------------------
```python
import asyncio

import mcp.server.stdio
import mcp.types as types
from mcp.server import NotificationOptions, Server
from mcp.server.models import InitializationOptions

from pdf_rag.env import load_env_file
from pdf_rag.rag import RAG

load_env_file()

server = Server("pdfsearch-server")


@server.list_prompts()
async def list_prompts() -> list[types.Prompt]:
    return [
        types.Prompt(
            name="pdfsearch",
            description="Do a RAG-style expansion of your prompt, enriching it with relevant information from the PDF.",
            arguments=[
                types.PromptArgument(
                    name="input",
                    description="What to look for in the document.",
                    required=True,
                )
            ],
        )
    ]


@server.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="pdfsearch",
            description="Retrieve relevant information from a document.",
            inputSchema={
                "type": "object",
                "required": ["query"],
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "User provided query (to search for in the documents)",
                    }
                },
            },
        )
    ]


@server.get_prompt()
async def get_prompt(
    name: str,
    arguments: dict[str, str] | None = None
) -> types.GetPromptResult:
    if name != "pdfsearch":
        raise ValueError(f"Prompt not found: {name}")

    # Fall back to an empty query when no arguments (or no "input") are given
    user_input = (arguments or {}).get("input", "")
    result = await _search(user_input)
    return types.GetPromptResult(
        messages=[
            types.PromptMessage(
                role="user",
                content=types.TextContent(
                    type="text",
                    text=result,
                ),
            )
        ]
    )


@server.call_tool()
async def call_a_tool(
    name: str,
    arguments: dict
) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
    if name != "pdfsearch":
        raise ValueError(f"Unknown tool: {name}")
    # if "query" not in arguments:
    #     raise ValueError("Missing required argument 'query'")

    result = f"received: {arguments}"
    try:
        user_input = arguments["query"]
        result = await _search(user_input)
    except Exception as e:
        # Report the failure to the client as text instead of raising
        result = repr(e)

    return [types.TextContent(type="text", text=result)]


async def _search(user_input):
    # TODO figure out when to build the vector db
    rag = RAG()
    related_chunks = await rag.search(user_input)
    # Wrap every retrieved chunk in <text> tags, separated by a blank line
    response = ""
    for chunk in related_chunks:
        response += "<text>\n"
        response += chunk
        response += "</text>\n"
        response += "\n"
    return response


async def main():
    # Run the server using stdin/stdout streams
    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await server.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="pdf_rag",
                server_version="0.1.0",
                capabilities=server.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={},
                ),
            ),
        )


# This is needed if you'd like to connect to a custom client
if __name__ == "__main__":
    asyncio.run(main())
```
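The text that `_search` hands back to the assistant is a plain concatenation of the retrieved chunks, each wrapped in `<text>` tags. A minimal sketch of that formatting step, isolated from the database and the OpenAI client, could look like this; `format_chunks` is a hypothetical name used only here.

```python
def format_chunks(chunks):
    """Hypothetical helper: wrap each chunk in <text> tags, as _search does."""
    # Each chunk yields "<text>\n" + chunk + "</text>\n" followed by a blank line
    return "".join(f"<text>\n{chunk}</text>\n\n" for chunk in chunks)


formatted = format_chunks(["first chunk\n", "second chunk"])
print(formatted)
```

Note that the closing tag follows the chunk directly, so a chunk without a trailing newline ends on the same line as `</text>`.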
--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/rag.py:
--------------------------------------------------------------------------------
```python
"""The RAG backbone of this MCP server."""

import argparse
import asyncio
import sys
from pathlib import Path
from typing import List, Tuple

import libsql_experimental as libsql
import tiktoken
from openai import AsyncOpenAI
from pypdf import PdfReader
from tqdm import tqdm

from pdf_rag.env import load_env_file

VECTOR_COLLECTION_NAME = "document_chunks"
EMBEDDING_DIMENSIONS = 1024

QUERY_DROP = f"DROP TABLE IF EXISTS {VECTOR_COLLECTION_NAME}"

QUERY_CREATE = (
    f"CREATE TABLE IF NOT EXISTS {VECTOR_COLLECTION_NAME} ("
    f"    text TEXT,"
    f"    embedding F32_BLOB({EMBEDDING_DIMENSIONS})"
    f");"
)

QUERY_INSERT = (
    f"INSERT INTO {VECTOR_COLLECTION_NAME} (text, embedding) VALUES (?, vector32(?))"
)

# Return the 10 chunks with the smallest cosine distance to the query embedding
QUERY_SEARCH = (
    "SELECT text, vector_distance_cos(embedding, vector32(?)) "
    f"  FROM {VECTOR_COLLECTION_NAME} "
    f"  ORDER BY vector_distance_cos(embedding, vector32(?)) ASC "
    f"  LIMIT 10;"
)

load_env_file()


class RAG:
    """Basic document RAG system.

    Splits a document in chunks and later retrieves the most relevant chunks for a given query.
    """

    openai: AsyncOpenAI
    db_file: Path

    def __init__(self, db_path: str = "pdfsearch.sqlite"):
        self.openai = AsyncOpenAI()
        self.db_file = Path(db_path)

    def build_search_db(self):
        """Drop and recreate the chunk table, discarding any previous index."""
        conn = libsql.connect(str(self.db_file.absolute()))
        conn.execute(QUERY_DROP)
        conn.execute(QUERY_CREATE)
        conn.commit()

    def add_knowledge(self, text: str, embedding: List[float]):
        conn = libsql.connect(str(self.db_file.absolute()))
        conn.execute(QUERY_INSERT, (text, str(embedding)))
        conn.commit()

    async def search(self, query: str) -> List[str]:
        embedding = await embed(query, self.openai)

        conn = libsql.connect(str(self.db_file.absolute()))
        search_result = conn.execute(
            QUERY_SEARCH,
            (str(embedding), str(embedding)),
        ).fetchall()
        conn.commit()

        # Debug output goes to stderr: stdout carries the MCP stdio transport
        # when this method is called from server.py.
        for row in search_result:
            print(row[0][:25], row[1], file=sys.stderr)

        return [row[0] for row in search_result]


def chunkify(
    text: str, max_tokens: int = 512, overlap: int = 64, tokenizer=None
) -> List[str]:
    """Split text into token-based chunks with overlap."""
    if overlap >= max_tokens:
        # Otherwise the window would never move forward
        raise ValueError("overlap must be smaller than max_tokens")

    if not text.strip():
        return []

    if tokenizer is None:
        tokenizer = tiktoken.get_encoding("cl100k_base")

    # Tokenize the entire text first
    tokenized_text = tokenizer.encode(text)
    num_tokens = len(tokenized_text)
    chunks = []

    # Iterate over the tokenized text in `max_tokens` increments with specified overlap
    start_idx = 0
    while start_idx < num_tokens:
        end_idx = min(start_idx + max_tokens, num_tokens)
        chunk_tokens = tokenized_text[start_idx:end_idx]
        chunk_text = tokenizer.decode(chunk_tokens)  # Decode the tokens back to text
        chunks.append(chunk_text)

        start_idx += max_tokens - overlap  # Move window forward with overlap

    return chunks


async def embed(text: str, client: AsyncOpenAI) -> list[float]:
    response = await client.embeddings.create(
        input=text, model="text-embedding-3-small", dimensions=EMBEDDING_DIMENSIONS
    )
    return response.data[0].embedding


async def embed_pdf(
    file_path: Path, should_split: bool = True
) -> List[Tuple[str, List[float]]]:
    pdf_text = ""
    with open(file_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        for page in reader.pages:
            pdf_text += page.extract_text(extraction_mode="plain")

    chunks = chunkify(text=pdf_text) if should_split else [pdf_text]
    client = AsyncOpenAI()  # One client shared by all embedding requests
    tasks = [
        embed(chunk, client)
        for chunk in tqdm(chunks, desc=f"Embedding {file_path.name}")
    ]
    embeddings = await asyncio.gather(*tasks)

    return list(zip(chunks, embeddings))


async def embed_text(
    text: str,
    should_split: bool = True,
) -> List[Tuple[str, List[float]]]:
    chunks = chunkify(text=text) if should_split else [text]
    client = AsyncOpenAI()  # One client shared by all embedding requests
    tasks = [
        embed(chunk, client)
        for chunk in tqdm(chunks, desc=f"Embedding {text[:25]}")
    ]
    embeddings = await asyncio.gather(*tasks)

    return list(zip(chunks, embeddings))


async def main():
    parser = argparse.ArgumentParser(description="PDF document RAG system")
    parser.add_argument(
        "action",
        choices=["build", "search", "chunkify"],
        help="Action to perform",
    )
    parser.add_argument(
        "inputs", nargs="+", help="Input files/directories or a search query"
    )
    args = parser.parse_args()

    if args.action == "build":
        rag = RAG()
        rag.build_search_db()

        # Process each input path
        total_chunks = 0
        for input_path in args.inputs:
            path = Path(input_path)

            if path.is_dir():
                # Each .txt file in the directory is embedded whole, as one chunk
                client = AsyncOpenAI()
                tasks = []
                texts = []
                for text_file in tqdm(
                    path.glob("*.txt"), desc=f"Embedding files in {path.name}"
                ):
                    text = text_file.read_text()
                    tasks.append(embed(text, client))
                    texts.append(text)

                embeddings = await asyncio.gather(*tasks)
                for text, embedding in zip(texts, embeddings):
                    rag.add_knowledge(text, embedding)
                total_chunks += len(texts)

            elif path.is_file() and path.suffix.lower() == ".pdf":
                embeddings = await embed_pdf(path)
                for chunk, embedding in embeddings:
                    rag.add_knowledge(chunk, embedding)
                total_chunks += len(embeddings)
            else:  # Assume a single file in other text format - txt, md...
                text = path.read_text()
                embeddings = await embed_text(text)
                for chunk, embedding in embeddings:
                    rag.add_knowledge(chunk, embedding)
                total_chunks += len(embeddings)

        print(f"Inserted {total_chunks} chunks of text")

    elif args.action == "search":
        rag = RAG()
        # search() is a coroutine and prints the matches itself
        await rag.search(args.inputs[0])

    elif args.action == "chunkify":
        pdf_text = ""
        with open(args.inputs[0], "rb") as pdf_file:
            reader = PdfReader(pdf_file)
            for page in reader.pages:
                pdf_text += page.extract_text(extraction_mode="plain")

        chunks = chunkify(text=pdf_text)
        # Show the chunks so the split can be inspected
        for index, chunk in enumerate(chunks):
            print(f"--- chunk {index} ---")
            print(chunk)


if __name__ == "__main__":
    asyncio.run(main())
```
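The windowing in `chunkify` is easier to follow when looking only at token indices. The sketch below reproduces the loop with the same defaults (512 tokens per chunk, 64 tokens of overlap) but returns `(start, end)` index pairs instead of decoded text, so it runs without `tiktoken`. The helper name `window_bounds` and the explicit `overlap` check are illustrative assumptions.

```python
def window_bounds(num_tokens, max_tokens=512, overlap=64):
    """Hypothetical helper: (start, end) token ranges a chunkify-style loop visits."""
    if overlap >= max_tokens:
        # With a non-positive step the window would never advance
        raise ValueError("overlap must be smaller than max_tokens")

    bounds = []
    start = 0
    while start < num_tokens:
        end = min(start + max_tokens, num_tokens)
        bounds.append((start, end))
        # Step forward by the window size minus the shared overlap
        start += max_tokens - overlap
    return bounds


bounds = window_bounds(1000)
print(bounds)
```

For a 1000-token text the windows start every 448 tokens, so consecutive chunks share 64 tokens and the last chunk is shorter than the others.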