freespirit/pdfsearch-zed # codebase.md

# Directory Structure

```
├── .gitignore
├── Cargo.lock
├── Cargo.toml
├── extension.toml
├── pdf_rag
│   ├── .aiignore
│   ├── .python-version
│   ├── pyproject.toml
│   ├── README.md
│   ├── src
│   │   └── pdf_rag
│   │       ├── __init__.py
│   │       ├── env.py
│   │       ├── rag.py
│   │       └── server.py
│   └── uv.lock
├── README.md
└── src
    └── lib.rs
```

# Files

--------------------------------------------------------------------------------
/pdf_rag/.python-version:
--------------------------------------------------------------------------------

```
3.12
```

--------------------------------------------------------------------------------
/pdf_rag/.aiignore:
--------------------------------------------------------------------------------

```
.env
.venv
.idea
pdfsearch.sqlite
```

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

```
/target
__pycache__

.idea
.venv
.ruff_cache

*.wasm
*.sqlite

**/.env
```

--------------------------------------------------------------------------------
/pdf_rag/README.md:
--------------------------------------------------------------------------------

```markdown
# About

The heart of the extension - this is the MCP server logic.
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

````markdown
# PDF Search for Zed

A document search extension for Zed that lets you semantically search through a
PDF document and use the results in Zed's AI Assistant.

## Prerequisites

This extension currently requires:

1. An `OpenAI` API key (to generate embeddings)
2. `uv` installed on your system

**Note:** While the current setup requires an OpenAI API key for generating embeddings, we plan to implement a self-contained alternative in future versions. Community feedback will help prioritize these improvements.

## Quick Start

1. Clone the repository

```bash
git clone https://github.com/freespirit/pdfsearch-zed.git
```

2. Set up the Python environment for the MCP server:

```bash
cd pdfsearch-zed/pdf_rag
uv venv
uv sync
```

3. [Install Dev Extension](https://zed.dev/docs/extensions/developing-extensions) in Zed

4. Build the search db

```bash
cd /path/to/pdfsearch-zed/pdf_rag

echo "OPENAI_API_KEY=sk-..." > src/pdf_rag/.env

# This may take a couple of minutes, depending on the size of the documents.
# You can pass multiple files and directories as arguments:
#  - each file is split into chunks;
#  - each `.txt` file in a directory is treated as a single chunk,
#    i.e. it is not split further.
uv run src/pdf_rag/rag.py build "file1.pdf" "dir1" "file2.md" ...
```

5. Configure Zed

```json
"context_servers": {
    "pdfsearch-context-server": {
        "settings": {
            "extension_path": "/path/to/pdfsearch-zed"
        }
    }
}
```

## Usage

1. Open Zed's AI Assistant panel
2. Type `/pdfsearch` followed by your search query
3. The extension will search the indexed documents and add relevant sections to
   the AI Assistant's context

## Future Improvements

- [x] Self-contained vector store
- [ ] Self-contained embeddings
- [ ] Automated index building on first run
- [ ] Configurable result size
- [x] Support for multiple PDFs
- [x] Optional: Additional file formats beyond PDF

## Project Structure

- `pdf_rag/`: Python-based MCP server implementation
- `src/`: Zed extension code
- `extension.toml` and `Cargo.toml`: Zed extension configuration files

## Known Limitations

- Manual index building is required before first use
- Requires external services (OpenAI)

````
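
The README says that files are chunked before they are embedded. The idea of overlapping token windows can be sketched as follows. This is an illustration only: `sliding_windows` is a hypothetical helper that works on a plain list of integers instead of real tokens, and its defaults (512 tokens per window, 64 tokens of overlap) are borrowed from `chunkify` in `rag.py`.

```python
from typing import List, Sequence


def sliding_windows(
    tokens: Sequence[int], max_tokens: int = 512, overlap: int = 64
) -> List[List[int]]:
    """Split `tokens` into windows of at most `max_tokens` items.

    Consecutive windows share `overlap` items, so text cut at a window
    boundary still appears with some context in the next window.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in the range [0, max_tokens)")

    step = max_tokens - overlap
    windows: List[List[int]] = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        windows.append(list(tokens[start:end]))
        if end == len(tokens):
            break  # the last token is covered, no further window is needed
        start += step
    return windows


if __name__ == "__main__":
    # Ten fake tokens, windows of four, one token shared between neighbours
    for window in sliding_windows(list(range(10)), max_tokens=4, overlap=1):
        print(window)
```

A larger overlap gives each chunk more shared context at the cost of more embedding requests.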

--------------------------------------------------------------------------------
/Cargo.toml:
--------------------------------------------------------------------------------

```toml
[package]
name = "pdfsearch"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["cdylib"]

[dependencies]
serde = "1.0.217"
zed_extension_api = "0.3.0"
```

--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/__init__.py:
--------------------------------------------------------------------------------

```python
import asyncio

from . import env, server


def main():
    """Main entry point for the package."""
    asyncio.run(server.main())


__all__ = ["main", "server"]
```

--------------------------------------------------------------------------------
/extension.toml:
--------------------------------------------------------------------------------

```toml
id = "pdfsearch"
name = "PDF Search"
description = "A PDF search (RAG-style) MCP for Zed"
version = "0.2.0"
schema_version = 1
authors = ["stano <TBD email>"]
repository = "https://github.com/freespirit/pdfsearch-zed"


# The id matches the settings key read in src/lib.rs and shown in the README
[context_servers.pdfsearch-context-server]
name = "PDF Search"
```

--------------------------------------------------------------------------------
/pdf_rag/pyproject.toml:
--------------------------------------------------------------------------------

```toml
[project]
name = "pdf_rag"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "httpx>=0.28.1",
    "libsql-experimental==0.0.41",
    "mcp[cli]>=1.6.0",
    "openai>=1.58.1",
    "pypdf>=5.1.0",
    "tiktoken>=0.8.0",
]

[build-system]
requires = [ "hatchling",]
build-backend = "hatchling.build"

[project.scripts]
pdf_rag = "pdf_rag:main"
```

--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/env.py:
--------------------------------------------------------------------------------

```python
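import os
from pathlib import Path


def load_env_file(file_path: str | Path | None = None) -> None:
    """Load KEY=VALUE pairs from a .env file into os.environ.

    Without an explicit path, look for `.env` in the current working
    directory first and next to this module second. If neither exists,
    do nothing, so variables already exported in the shell keep working.
    """
    if file_path is None:
        candidates = [Path(".env"), Path(__file__).parent / ".env"]
        file_path = next((p for p in candidates if p.is_file()), None)
        if file_path is None:
            return

    with open(file_path, "r") as file:
        for line in file:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue

            if "=" in line:
                key, value = line.split("=", 1)
                key = key.strip()
                value = value.strip()

                # Remove one pair of matching surrounding quotes if present
                if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
                    value = value[1:-1]

                os.environ[key] = value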
```
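
The parsing rules in `load_env_file` (skip blanks and comments, split on the first `=`, strip one pair of matching quotes) are easier to check in isolation. A minimal sketch, assuming a hypothetical helper `parse_env_line` that is not part of the repository and only mirrors those rules for a single line:

```python
from typing import Optional, Tuple


def parse_env_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (key, value) for a KEY=VALUE line, or None if it carries no pair."""
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return None

    # Split on the first "=" only, so values may contain "=" themselves
    key, value = line.split("=", 1)
    key, value = key.strip(), value.strip()

    # Drop one pair of matching surrounding quotes
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        value = value[1:-1]
    return key, value


if __name__ == "__main__":
    for raw in ['OPENAI_API_KEY="sk-..."', "# a comment", "A = b=c"]:
        print(repr(raw), "->", parse_env_line(raw))
```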

--------------------------------------------------------------------------------
/src/lib.rs:
--------------------------------------------------------------------------------

```rust
use std::path::Path;

use serde::Deserialize;
use zed_extension_api::{
    self as zed, serde_json, settings::ContextServerSettings, ContextServerId,
};

/// Settings read from the `context_servers` section of Zed's settings.
#[derive(Debug, Deserialize)]
struct PdfSearchContextServerSettings {
    extension_path: String,
}

struct MyExtension {}

impl zed::Extension for MyExtension {
    fn new() -> Self
    where
        Self: Sized,
    {
        Self {}
    }

    fn context_server_command(
        &mut self,
        _context_server_id: &ContextServerId,
        project: &zed::Project,
    ) -> zed::Result<zed::Command> {
        let settings = ContextServerSettings::for_project("pdfsearch-context-server", project)?;
        let Some(settings) = settings.settings else {
            return Err("Missing `context_servers` settings for `pdfsearch-context-server`".into());
        };
        let settings: PdfSearchContextServerSettings =
            serde_json::from_value(settings).map_err(|e| e.to_string())?;

        // The Python MCP server lives in `<extension_path>/pdf_rag`
        let mcp_python_module = String::from("pdf_rag");
        let extension_path = Path::new(settings.extension_path.as_str());
        let mcp_server_path = extension_path.join(&mcp_python_module);

        // Equivalent to: uv --directory=<extension_path>/pdf_rag run pdf_rag
        Ok(zed::Command {
            command: "uv".to_string(),
            args: vec![
                format!("--directory={}", mcp_server_path.to_string_lossy()),
                "run".to_string(),
                mcp_python_module,
            ],
            env: vec![],
        })
    }
}

zed::register_extension!(MyExtension);
```
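
When the server does not start from Zed, it can help to run the same command by hand. The sketch below rebuilds the argument list that `context_server_command` returns. `build_server_command` is a hypothetical helper, not part of the extension, and it assumes POSIX-style paths.

```python
from pathlib import PurePosixPath
from typing import List


def build_server_command(extension_path: str, module: str = "pdf_rag") -> List[str]:
    """Mirror the command assembled in src/lib.rs for a given `extension_path`."""
    server_dir = PurePosixPath(extension_path) / module
    return ["uv", f"--directory={server_dir}", "run", module]


if __name__ == "__main__":
    # Paste the printed line into a terminal to start the server manually
    print(" ".join(build_server_command("/path/to/pdfsearch-zed")))
```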

--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/server.py:
--------------------------------------------------------------------------------

```python
import asyncio

import mcp.server.stdio
import mcp.types as types
from mcp.server import NotificationOptions, Server
from mcp.server.models import InitializationOptions

from pdf_rag.env import load_env_file
from pdf_rag.rag import RAG

load_env_file()

server = Server("pdfsearch-server")


@server.list_prompts()
async def list_prompts() -> list[types.Prompt]:
    return [
        types.Prompt(
            name="pdfsearch",
            description="Do a RAG-style expansion of your prompt, enriching it with relevant information from the PDF.",
            arguments=[
                types.PromptArgument(
                    name="input",
                    description="What to look for in the document.",
                    required=True,
                )
            ]
        )
    ]


@server.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="pdfsearch",
            description="Retrieve relevant information from a document.",
            inputSchema={
                "type": "object",
                "required": ["query"],
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "User provided query (to search for in the documents)",
                    }
                },
            },
        )
    ]


@server.get_prompt()
async def get_prompt(
        name: str,
        arguments: dict[str, str] | None = None
) -> types.GetPromptResult:
    if name != "pdfsearch":
        raise ValueError(f"Prompt not found: {name}")

    # A missing argument dict and a missing "input" key both fall back to ""
    user_input = (arguments or {}).get("input", "")
    result = await _search(user_input)
    return types.GetPromptResult(
        messages=[
            types.PromptMessage(
                role="user",
                content=types.TextContent(
                    type="text",
                    text=result,
                ),
            )
        ]
    )


@server.call_tool()
async def call_a_tool(
        name: str,
        arguments: dict
) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
    if name != "pdfsearch":
        raise ValueError(f"Unknown tool: {name}")

    try:
        result = await _search(arguments["query"])
    except Exception as e:
        # Report the failure (including a missing "query") to the client as text
        result = repr(e)

    return [types.TextContent(type="text", text=result)]


async def _search(user_input: str) -> str:
    # TODO figure out when to build the vector db
    rag = RAG()
    related_chunks = await rag.search(user_input)
    # Wrap every chunk in <text> tags so the chunks stay distinguishable
    response = "".join(f"<text>\n{chunk}</text>\n" for chunk in related_chunks)
    return response + "\n"


async def main():
    # Run the server using stdin/stdout streams
    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await server.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="pdf_rag",
                server_version="0.1.0",
                capabilities=server.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={},
                ),
            ),
        )


# This is needed if you'd like to connect to a custom client
if __name__ == "__main__":
    asyncio.run(main())
```
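
Both the prompt and the tool return the same text payload, built by `_search`. The shape of that payload can be shown without a database or an API key. In this sketch, `format_chunks` is a hypothetical stand-alone function that reproduces only the string assembly from `_search`; the sample chunks are made up.

```python
from typing import Iterable


def format_chunks(chunks: Iterable[str]) -> str:
    """Wrap each chunk in <text> tags and end the response with a blank line."""
    parts = []
    for chunk in chunks:
        # No newline is added after the chunk, matching _search in server.py
        parts.append("<text>\n" + chunk + "</text>\n")
    return "".join(parts) + "\n"


if __name__ == "__main__":
    print(format_chunks(["First made-up chunk.\n", "Second made-up chunk.\n"]))
```

Because the closing tag follows the chunk directly, a chunk that does not end in a newline has `</text>` on the same line as its last words.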

--------------------------------------------------------------------------------
/pdf_rag/src/pdf_rag/rag.py:
--------------------------------------------------------------------------------

```python
"""The RAG backbone of this MCP server"""

import argparse
import asyncio
import sys
from pathlib import Path
from typing import List, Optional, Tuple

import libsql_experimental as libsql
import tiktoken
from openai import AsyncOpenAI
from pypdf import PdfReader
from tqdm import tqdm

from pdf_rag.env import load_env_file

VECTOR_COLLECTION_NAME = "document_chunks"
EMBEDDING_DIMENSIONS = 1024

QUERY_DROP = f"DROP TABLE IF EXISTS {VECTOR_COLLECTION_NAME}"

QUERY_CREATE = (
    f"CREATE TABLE IF NOT EXISTS {VECTOR_COLLECTION_NAME} ("
    f"  text TEXT,"
    f"  embedding F32_BLOB({EMBEDDING_DIMENSIONS})"
    f");"
)

QUERY_INSERT = (
    f"INSERT INTO {VECTOR_COLLECTION_NAME} (text, embedding) VALUES (?, vector32(?))"
)

QUERY_SEARCH = (
    "SELECT text, vector_distance_cos(embedding, vector32(?)) "
    f"  FROM {VECTOR_COLLECTION_NAME} "
    f"  ORDER BY vector_distance_cos(embedding, vector32(?)) ASC "
    f"  LIMIT 10;"
)

load_env_file()


class RAG:
    """Basic document RAG system.

    Splits a document into chunks and later retrieves the most relevant chunks for a given query.
    """

    openai: AsyncOpenAI
    db_file: Path

    def __init__(self, db_path: str = "pdfsearch.sqlite"):
        self.openai = AsyncOpenAI()
        self.db_file = Path(db_path)

    def build_search_db(self):
        """Drop any existing index and create an empty table."""
        conn = libsql.connect(str(self.db_file.absolute()))
        conn.execute(QUERY_DROP)
        conn.execute(QUERY_CREATE)
        conn.commit()

    def add_knowledge(self, text: str, embedding: List[float]):
        conn = libsql.connect(str(self.db_file.absolute()))
        conn.execute(QUERY_INSERT, (text, str(embedding)))
        conn.commit()

    async def search(self, query: str) -> List[str]:
        """Return the text of the stored chunks closest to `query`."""
        embedding = await embed(query, self.openai)

        conn = libsql.connect(str(self.db_file.absolute()))
        search_result = conn.execute(
            QUERY_SEARCH,
            (str(embedding), str(embedding)),
        ).fetchall()
        conn.commit()

        # Log to stderr: stdout carries the MCP protocol when run by the server
        for row in search_result:
            print(row[0][:25], row[1], file=sys.stderr)

        return [row[0] for row in search_result]


def chunkify(
    text: str, max_tokens: int = 512, overlap: int = 64, tokenizer=None
) -> List[str]:
    """Split text into token-based chunks with overlap."""
    if not text.strip():
        return []
    if not 0 <= overlap < max_tokens:
        # Otherwise the window would never move forward
        raise ValueError("overlap must be in the range [0, max_tokens)")

    if tokenizer is None:
        tokenizer = tiktoken.get_encoding("cl100k_base")

    # Tokenize the entire text first
    tokenized_text = tokenizer.encode(text)
    num_tokens = len(tokenized_text)
    chunks = []

    # Iterate over the tokenized text in `max_tokens` increments with specified overlap
    start_idx = 0
    while start_idx < num_tokens:
        end_idx = min(start_idx + max_tokens, num_tokens)
        chunk_tokens = tokenized_text[start_idx:end_idx]
        chunks.append(tokenizer.decode(chunk_tokens))  # Decode the tokens back to text

        if end_idx == num_tokens:
            break  # Avoid a trailing chunk made only of already-covered tokens
        start_idx += max_tokens - overlap  # Move window forward with overlap

    return chunks


async def embed(text: str, client: AsyncOpenAI) -> List[float]:
    response = await client.embeddings.create(
        input=text, model="text-embedding-3-small", dimensions=EMBEDDING_DIMENSIONS
    )
    return response.data[0].embedding


def read_pdf_text(file_path: Path) -> str:
    """Extract the plain text of every page in a PDF."""
    with open(file_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        return "".join(
            page.extract_text(extraction_mode="plain") for page in reader.pages
        )


async def embed_pdf(
    file_path: Path, should_split: bool = True
) -> List[Tuple[str, List[float]]]:
    pdf_text = read_pdf_text(file_path)
    return await embed_text(pdf_text, should_split, label=file_path.name)


async def embed_text(
    text: str,
    should_split: bool = True,
    label: Optional[str] = None,
) -> List[Tuple[str, List[float]]]:
    chunks = chunkify(text=text) if should_split else [text]
    client = AsyncOpenAI()  # One client shared by all requests
    tasks = [
        embed(chunk, client)
        for chunk in tqdm(chunks, desc=f"Embedding {label or text[:25]}")
    ]
    embeddings = await asyncio.gather(*tasks)

    return list(zip(chunks, embeddings))


async def main():
    parser = argparse.ArgumentParser(description="PDF document RAG system")
    parser.add_argument(
        "action",
        choices=["build", "search", "chunkify"],
        help="Action to perform",
    )
    parser.add_argument(
        "inputs", nargs="+", help="Input files/directories or a search query"
    )
    args = parser.parse_args()

    if args.action == "build":
        rag = RAG()
        rag.build_search_db()

        # Process each input path
        total_chunks = 0
        for input_path in args.inputs:
            path = Path(input_path)

            if path.is_dir():
                # Every .txt file in a directory is embedded whole, as one chunk
                client = AsyncOpenAI()
                texts = [text_file.read_text() for text_file in path.glob("*.txt")]
                tasks = [
                    embed(text, client)
                    for text in tqdm(texts, desc=f"Embedding files in {path.name}")
                ]
                embeddings = list(zip(texts, await asyncio.gather(*tasks)))
            elif path.is_file() and path.suffix.lower() == ".pdf":
                embeddings = await embed_pdf(path)
            else:  # Assume a single file in other text format - txt, md...
                embeddings = await embed_text(path.read_text())

            for chunk, embedding in embeddings:
                rag.add_knowledge(chunk, embedding)
            total_chunks += len(embeddings)

        print(f"Inserted {total_chunks} chunks of text")

    elif args.action == "search":
        rag = RAG()
        # `search` is a coroutine and has to be awaited
        for chunk in await rag.search(args.inputs[0]):
            print(chunk)

    elif args.action == "chunkify":
        chunks = chunkify(text=read_pdf_text(Path(args.inputs[0])))
        for index, chunk in enumerate(chunks):
            print(f"--- chunk {index} ---")
            print(chunk)


if __name__ == "__main__":
    asyncio.run(main())
```
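
`QUERY_SEARCH` orders the stored chunks by `vector_distance_cos` and keeps the ten closest. The ranking can be illustrated in plain Python, assuming that cosine distance is defined as one minus cosine similarity. `cosine_distance` and `top_k` are hypothetical helpers for this sketch, and the two-dimensional vectors are toy data, not real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], rows: Sequence[Tuple[str, Sequence[float]]], k: int = 10
) -> List[str]:
    """Return the texts of the `k` rows whose embeddings are closest to `query`."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    rows = [
        ("far", [0.0, 1.0]),
        ("near", [1.0, 0.1]),
        ("opposite", [-1.0, 0.0]),
    ]
    print(top_k([1.0, 0.0], rows, k=2))
```

The database does the same ordering in SQL, so this sketch is only useful for reasoning about which chunks a query is likely to return.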