# Directory Structure

```
├── .gitignore
├── .python-version
├── LICENSE
├── package.json
├── pnpm-lock.yaml
├── pyproject.toml
├── README-CN.md
├── README.md
├── setup.sh
├── src
│   ├── index.ts
│   ├── Markdownify.ts
│   ├── server.ts
│   ├── tools.ts
│   └── UVX.ts
├── tsconfig.json
└── uv.lock
```

# Files

--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------

```
3.11
```

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

```
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
lerna-debug.log*
.pnpm-debug.log*

# Test and output files
*.pdf
*.docx
*.md
!README.md
!README-CN.md
output*.md
*.bak

# Diagnostic reports (https://nodejs.org/api/report.html)
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json

# Runtime data
pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage
*.lcov

# nyc test coverage
.nyc_output

# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# Bower dependency directory (https://bower.io/)
bower_components

# node-waf configuration
.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release

# Dependency directories
node_modules/
jspm_packages/
.venv/

# Snowpack dependency directory (https://snowpack.dev/)
web_modules/

# TypeScript cache
*.tsbuildinfo

# Optional npm cache directory
.npm

# Optional eslint cache
.eslintcache

# Optional stylelint cache
.stylelintcache

# Microbundle cache
.rpt2_cache/
.rts2_cache_cjs/
.rts2_cache_es/
.rts2_cache_umd/

# Optional REPL history
.node_repl_history

# Output of 'npm pack'
*.tgz

# Yarn Integrity file
.yarn-integrity

# dotenv environment variable files
.env
.env.development.local
.env.test.local
.env.production.local
.env.local

# Build output
dist/
```

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

```markdown
# Markdownify MCP Server - UTF-8 Enhanced

This is an enhanced version of the [original Markdownify MCP project](https://github.com/cursor-ai/markdownify-mcp), with improved UTF-8 encoding support and optimized handling of multilingual content.

[中文文档 (Chinese documentation)](README-CN.md)

## Enhancements
- Added comprehensive UTF-8 encoding support
- Optimized handling of multilingual content
- Fixed encoding issues on Windows systems
- Improved error handling mechanisms

## Key Differences from Original Project

1. Enhanced Encoding Support:
   - Full UTF-8 support across all operations
   - Proper handling of Chinese, Japanese, Korean and other non-ASCII characters
   - Fixed Windows-specific encoding issues (cmd.exe and PowerShell compatibility)

2. Improved Error Handling:
   - Detailed error messages in both English and Chinese
   - Better exception handling for network issues
   - Graceful fallback mechanisms for conversion failures

3. Extended Functionality:
   - Added support for batch processing multiple files
   - Enhanced YouTube video transcript handling
   - Improved metadata extraction from various file formats
   - Better preservation of document formatting

4. Performance Optimizations:
   - Optimized memory usage for large file conversions
   - Faster processing of multilingual content
   - Reduced dependency conflicts

5. Better Development Experience:
   - Comprehensive debugging options
   - Detailed logging system
   - Environment-specific configuration support
   - Clear documentation in both English and Chinese

## Features

Supports converting various file types to Markdown:
- PDF files
- Images (with metadata)
- Audio (with transcription)
- Word documents (DOCX)
- Excel spreadsheets (XLSX)
- PowerPoint presentations (PPTX)
- Web content:
  - YouTube video transcripts
  - Search results
  - General web pages
- Existing Markdown files

## Quick Start

1. Clone this repository:
   ```bash
   git clone https://github.com/JDJR2024/markdownify-mcp-utf8.git
   cd markdownify-mcp-utf8
   ```

2. Install dependencies:
   ```bash
   pnpm install
   ```
   Note: This will also install `uv` and related Python dependencies.

3. Build the project:
   ```bash
   pnpm run build
   ```

4. Start the server:
   ```bash
   pnpm start
   ```

## Requirements

- Node.js 16.0 or higher
- Python 3.11 or higher
- pnpm package manager
- Git

## Detailed Installation Guide

### 1. Environment Setup

1. Install Node.js:
   - Download from [Node.js official website](https://nodejs.org/)
   - Verify installation: `node --version`

2. Install pnpm:
   ```bash
   npm install -g pnpm
   pnpm --version
   ```

3. Install Python:
   - Download from [Python official website](https://www.python.org/downloads/)
   - Ensure Python is added to PATH during installation
   - Verify installation: `python --version`

4. (Windows Only) Configure UTF-8 Support:
   ```bash
   # Set system-wide UTF-8
   setx PYTHONIOENCODING UTF-8
   # Set current session UTF-8
   set PYTHONIOENCODING=UTF-8
   # Enable UTF-8 in command prompt
   chcp 65001
   ```

### 2. Project Setup

1. Clone the repository:
   ```bash
   git clone https://github.com/JDJR2024/markdownify-mcp-utf8.git
   cd markdownify-mcp-utf8
   ```

2. Create and activate Python virtual environment:
   ```bash
   # Windows
   python -m venv .venv
   .venv\Scripts\activate

   # Linux/macOS
   python3 -m venv .venv
   source .venv/bin/activate
   ```

3. Install project dependencies:
   ```bash
   # Install Node.js dependencies
   pnpm install

   # Install Python dependencies (will be handled by setup.sh)
   ./setup.sh
   ```

4. Build the project:
   ```bash
   pnpm run build
   ```

### 3. Verification

1. Start the server:
   ```bash
   pnpm start
   ```

2. Test the installation:
   ```bash
   # Convert a web page
   python convert_utf8.py "https://example.com"

   # Convert a local file
   python convert_utf8.py "path/to/your/file.docx"
   ```

## Usage Guide

### Basic Usage

1. Converting Web Pages:
   ```bash
   python convert_utf8.py "https://example.com"
   ```
   The converted markdown will be saved as `converted_result.md`

2. Converting Local Files:
   ```bash
   # Convert DOCX
   python convert_utf8.py "document.docx"

   # Convert PDF
   python convert_utf8.py "document.pdf"

   # Convert PowerPoint
   python convert_utf8.py "presentation.pptx"

   # Convert Excel
   python convert_utf8.py "spreadsheet.xlsx"
   ```

3. Converting YouTube Videos:
   ```bash
   python convert_utf8.py "https://www.youtube.com/watch?v=VIDEO_ID"
   ```

### Advanced Usage

1. Environment Variables:
   ```bash
   # Set custom UV path
   export UV_PATH="/custom/path/to/uv"

   # Set custom output directory
   export MARKDOWN_OUTPUT_DIR="/custom/output/path"
   ```

2. Batch Processing:
   Create a batch file (e.g., `convert_batch.txt`) with URLs or file paths:
   ```text
   https://example1.com
   https://example2.com
   file1.docx
   file2.pdf
   ```

   Then run:
   ```bash
   while read -r line; do
     python convert_utf8.py "$line"
   done < convert_batch.txt
   ```

### Troubleshooting

1. Common Issues:
   - If you see encoding errors, ensure UTF-8 is properly set
   - For permission issues on Windows, run as Administrator
   - For Python path issues, ensure virtual environment is activated

2. Debugging:
   ```bash
   # Enable debug output
   export DEBUG=true
   python convert_utf8.py "your_file.docx"
   ```

## Usage

### Command Line

Convert web page to Markdown:
```bash
python convert_utf8.py "https://example.com"
```

Convert local file:
```bash
python convert_utf8.py "path/to/your/file.docx"
```

### Desktop App Integration

To integrate this server with a desktop app, add the following to your app's server configuration:

```js
{
  "mcpServers": {
    "markdownify": {
      "command": "node",
      "args": [
        "{ABSOLUTE_PATH}/dist/index.js"
      ],
      "env": {
        "UV_PATH": "/path/to/uv"
      }
    }
  }
}
```

## Troubleshooting

1. Encoding Issues
   - If you encounter character encoding issues, ensure the `PYTHONIOENCODING` environment variable is set to `utf-8`
   - Windows users may need to run `chcp 65001` to enable UTF-8 support

2. Permission Issues
   - Ensure you have sufficient file read/write permissions
   - On Windows, you may need to run as administrator

## Acknowledgments

This project is based on the original work by Zach Caceres. Thanks to the original author for their outstanding contribution.

## License

This project continues to be licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Before submitting a Pull Request, please:
1. Ensure your code follows the project's coding standards
2. Add necessary tests and documentation
3. Update relevant sections in the README

## Contact

For issues or suggestions:
1. Submit an Issue: https://github.com/JDJR2024/markdownify-mcp-utf8/issues
2. Create a Pull Request: https://github.com/JDJR2024/markdownify-mcp-utf8/pulls
3. Email: [email protected]
```

--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------

```toml
[project]
name = "ocr"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "markitdown>=0.0.1a3",
]
```

--------------------------------------------------------------------------------
/setup.sh:
--------------------------------------------------------------------------------

```bash
#!/bin/bash

echo 'Installing Python dependencies for OCR...'

echo 'Installing uv'
curl -LsSf https://astral.sh/uv/install.sh | sh

echo 'Using uv to install markitdown'
uv sync

echo 'Finished installing Python dependencies'
```

--------------------------------------------------------------------------------
/tsconfig.json:
--------------------------------------------------------------------------------

```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "Node16",
    "moduleResolution": "node16",
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true
  },
  "include": ["src/**/*.ts"],
  "exclude": ["node_modules", "dist", "src/**/*.test.ts"]
}
```

--------------------------------------------------------------------------------
/src/index.ts:
--------------------------------------------------------------------------------

```typescript
#!/usr/bin/env node

import { createServer } from "./server.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

async function main() {
  const transport = new StdioServerTransport();
  const server = createServer();
  await server.connect(transport);
}

main().catch((error) => {
  console.error("Fatal error in main():", error);
  process.exit(1);
});
```

--------------------------------------------------------------------------------
/src/UVX.ts:
--------------------------------------------------------------------------------

```typescript
import { exec } from "child_process";
import { promisify } from "util";

const execAsync = promisify(exec);

export default class UVX {
  uvxPath: string;

  constructor(uvxPath: string) {
    this.uvxPath = uvxPath;
  }

  get path() {
    return this.uvxPath;
  }

  static async setup() {
    // const { stdout: uvxPath, stderr } = await execAsync("which uvx", {
    //   env: {
    //     ...process.env,
    //   },
    // });

    // if (stderr) {
    //   throw new Error(
    //     "uvx not found in path, you must install uvx before running this server",
    //   );
    // }
    // HACK ALERT!
    return new UVX("/Users/zachcaceres/.local/bin/uvx");
  }

  async installDeps() {
    // This is a hack to make sure that markitdown is installed before it's called in the OCRProcessor
    try {
      await execAsync(`${this.uvxPath} markitdown example.pdf`);
    } catch {
      console.log("UVX markitdown should be ready now");
    }
  }
}
```

--------------------------------------------------------------------------------
/package.json:
--------------------------------------------------------------------------------

```json
{
  "name": "mcp-markdownify-server-utf8",
  "version": "0.0.1",
  "description": "MCP Markdownify Server with UTF-8 Support - Model Context Protocol Server for Converting Almost Anything to Markdown",
  "license": "MIT",
  "author": "quasimodo-XY (Based on work by @zcaceres)",
  "homepage": "https://github.com/JDJR2024/markdownify-mcp-utf8",
  "bugs": "https://github.com/JDJR2024/markdownify-mcp-utf8/issues",
  "type": "module",
  "bin": {
    "mcp-markdownify-server": "dist/index.js"
  },
  "files": [
    "dist"
  ],
  "scripts": {
    "build": "tsc && shx chmod +x dist/*.js",
    "prepare": "npm run build",
    "dev": "tsc --watch",
    "preinstall": "./setup.sh",
    "start": "node dist/index.js",
    "test": "bun test",
    "test:watch": "bun test --watch"
  },
  "dependencies": {
    "@modelcontextprotocol/sdk": "1.0.1",
    "zod": "^3.24.1"
  },
  "devDependencies": {
    "@types/node": "^22.9.3",
    "bun": "^1.1.41",
    "sdk": "link:@types/modelcontextprotocol/sdk",
    "shx": "^0.3.4",
    "ts-jest": "^29.2.5",
    "typescript": "^5.6.2"
  },
  "keywords": [
    "markdown",
    "converter",
    "utf8",
    "multilingual",
    "mcp",
    "model-context-protocol"
  ]
}
```

--------------------------------------------------------------------------------
/src/Markdownify.ts:
--------------------------------------------------------------------------------

```typescript
import { exec } from "child_process";
import { promisify } from "util";
import path from "path";
import fs from "fs";
import os from "os";
import { fileURLToPath } from "url";

const execAsync = promisify(exec);

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

export type MarkdownResult = {
  path: string;
  text: string;
};

export class Markdownify {
  private static async _markitdown(
    filePath: string,
    projectRoot: string,
    uvPath: string,
  ): Promise<string> {
    const venvPath = path.join(projectRoot, ".venv");
    const markitdownPath = path.join(venvPath, "Scripts", "markitdown.exe");

    if (!fs.existsSync(markitdownPath)) {
      throw new Error("markitdown executable not found");
    }

    const { stdout, stderr } = await execAsync(
      `${venvPath}\\Scripts\\activate.bat && ${markitdownPath} "${filePath}"`,
    );

    if (stderr) {
      throw new Error(`Error executing command: ${stderr}`);
    }

    return stdout;
  }

  private static async saveToTempFile(content: string): Promise<string> {
    const tempOutputPath = path.join(
      os.tmpdir(),
      `markdown_output_${Date.now()}.md`,
    );
    fs.writeFileSync(tempOutputPath, content);
    return tempOutputPath;
  }

  static async toMarkdown({
    filePath,
    url,
    projectRoot = path.resolve(__dirname, ".."),
    uvPath = "~/.local/bin/uv",
  }: {
    filePath?: string;
    url?: string;
    projectRoot?: string;
    uvPath?: string;
  }): Promise<MarkdownResult> {
    try {
      let inputPath: string;
      let isTemporary = false;

      if (url) {
        const response = await fetch(url);
        const content = await response.text();
        inputPath = await this.saveToTempFile(content);
        isTemporary = true;
      } else if (filePath) {
        inputPath = filePath;
      } else {
        throw new Error("Either filePath or url must be provided");
      }

      const text = await this._markitdown(inputPath, projectRoot, uvPath);
      const outputPath = await this.saveToTempFile(text);

      if (isTemporary) {
        fs.unlinkSync(inputPath);
      }

      return { path: outputPath, text };
    } catch (e: unknown) {
      if (e instanceof Error) {
        throw new Error(`Error processing to Markdown: ${e.message}`);
      } else {
        throw new Error("Error processing to Markdown: Unknown error occurred");
      }
    }
  }

  static async get({
    filePath,
  }: {
    filePath: string;
  }): Promise<MarkdownResult> {
    if (!fs.existsSync(filePath)) {
      throw new Error("File does not exist");
    }

    const text = await fs.promises.readFile(filePath, "utf-8");

    return {
      path: filePath,
      text: text,
    };
  }
}
```

--------------------------------------------------------------------------------
/src/server.ts:
--------------------------------------------------------------------------------

```typescript
import { z } from "zod";
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import { Markdownify } from "./Markdownify.js";
import * as tools from "./tools.js";
import { CallToolRequest } from "@modelcontextprotocol/sdk/types.js";

const RequestPayloadSchema = z.object({
  filepath: z.string().optional(),
  url: z.string().optional(),
  projectRoot: z.string().optional(),
  uvPath: z.string().optional(),
});

export function createServer() {
  const server = new Server(
    {
      name: "mcp-markdownify-server",
      version: "0.1.0",
    },
    {
      capabilities: {
        tools: {},
      },
    },
  );

  server.setRequestHandler(ListToolsRequestSchema, async () => {
    return {
      tools: Object.values(tools),
    };
  });

  server.setRequestHandler(
    CallToolRequestSchema,
    async (request: CallToolRequest) => {
      const { name, arguments: args } = request.params;

      const validatedArgs = RequestPayloadSchema.parse(args);

      try {
        let result;
        switch (name) {
          case tools.YouTubeToMarkdownTool.name:
          case tools.BingSearchResultToMarkdownTool.name:
          case tools.WebpageToMarkdownTool.name:
            if (!validatedArgs.url) {
              throw new Error("URL is required for this tool");
            }
            result = await Markdownify.toMarkdown({
              url: validatedArgs.url,
              projectRoot: validatedArgs.projectRoot,
              uvPath: validatedArgs.uvPath || process.env.UV_PATH,
            });
            break;

          case tools.PDFToMarkdownTool.name:
          case tools.ImageToMarkdownTool.name:
          case tools.AudioToMarkdownTool.name:
          case tools.DocxToMarkdownTool.name:
          case tools.XlsxToMarkdownTool.name:
          case tools.PptxToMarkdownTool.name:
            if (!validatedArgs.filepath) {
              throw new Error("File path is required for this tool");
            }
            result = await Markdownify.toMarkdown({
              filePath: validatedArgs.filepath,
              projectRoot: validatedArgs.projectRoot,
              uvPath: validatedArgs.uvPath || process.env.UV_PATH,
            });
            break;

          case tools.GetMarkdownFileTool.name:
            if (!validatedArgs.filepath) {
              throw new Error("File path is required for this tool");
            }
            result = await Markdownify.get({
              filePath: validatedArgs.filepath,
            });
            break;

          default:
            throw new Error("Tool not found");
        }

        return {
          content: [
            { type: "text", text: `Output file: ${result.path}` },
            { type: "text", text: `Converted content:` },
            { type: "text", text: result.text },
          ],
          isError: false,
        };
      } catch (e) {
        if (e instanceof Error) {
          return {
            content: [{ type: "text", text: `Error: ${e.message}` }],
            isError: true,
          };
        } else {
          console.error(e);
          return {
            content: [{ type: "text", text: `Error: Unknown error occurred` }],
            isError: true,
          };
        }
      }
    },
  );

  return server;
}
```

--------------------------------------------------------------------------------
/src/tools.ts:
--------------------------------------------------------------------------------

```typescript
import { ToolSchema } from "@modelcontextprotocol/sdk/types.js";

export const YouTubeToMarkdownTool = ToolSchema.parse({
  name: "youtube-to-markdown",
  description:
    "Convert a YouTube video to markdown, including transcript if available",
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "URL of the YouTube video",
      },
    },
    required: ["url"],
  },
});

export const PDFToMarkdownTool = ToolSchema.parse({
  name: "pdf-to-markdown",
  description: "Convert a PDF file to markdown",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the PDF file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const BingSearchResultToMarkdownTool = ToolSchema.parse({
  name: "bing-search-to-markdown",
  description: "Convert a Bing search results page to markdown",
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "URL of the Bing search results page",
      },
    },
    required: ["url"],
  },
});

export const WebpageToMarkdownTool = ToolSchema.parse({
  name: "webpage-to-markdown",
  description: "Convert a webpage to markdown",
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "URL of the webpage to convert",
      },
    },
    required: ["url"],
  },
});

export const ImageToMarkdownTool = ToolSchema.parse({
  name: "image-to-markdown",
  description:
    "Convert an image to markdown, including metadata and description",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the image file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const AudioToMarkdownTool = ToolSchema.parse({
  name: "audio-to-markdown",
  description:
    "Convert an audio file to markdown, including transcription if possible",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the audio file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const DocxToMarkdownTool = ToolSchema.parse({
  name: "docx-to-markdown",
  description: "Convert a DOCX file to markdown",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the DOCX file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const XlsxToMarkdownTool = ToolSchema.parse({
  name: "xlsx-to-markdown",
  description: "Convert an XLSX file to markdown",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the XLSX file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const PptxToMarkdownTool = ToolSchema.parse({
  name: "pptx-to-markdown",
  description: "Convert a PPTX file to markdown",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the PPTX file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const GetMarkdownFileTool = ToolSchema.parse({
  name: "get-markdown-file",
  description: "Get a markdown file by absolute file path",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path to file of markdown'd text",
      },
    },
    required: ["filepath"],
  },
});
```
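
Note on `/src/Markdownify.ts`: `_markitdown` builds the executable path as `.venv/Scripts/markitdown.exe`, which is the Windows virtual-environment layout only. The following is a minimal sketch, not code from the repository, of how the path could be chosen per platform. It assumes the standard `venv` layout (`Scripts\` with an `.exe` suffix on Windows, `bin/` elsewhere); the helper name `resolveMarkitdownPath` and the sample project roots are hypothetical.

```typescript
import * as path from "node:path";

// Hypothetical helper (not part of the repository): returns where a standard
// Python virtual environment places the markitdown entry point.
export function resolveMarkitdownPath(
  projectRoot: string,
  platform: string = process.platform,
): string {
  if (platform === "win32") {
    // Windows venvs keep executables under Scripts\ with an .exe suffix.
    return path.win32.join(projectRoot, ".venv", "Scripts", "markitdown.exe");
  }
  // Linux and macOS venvs keep them under bin/ without a suffix.
  return path.posix.join(projectRoot, ".venv", "bin", "markitdown");
}

// Example project roots are placeholders chosen for illustration.
console.log(resolveMarkitdownPath("/srv/app", "linux"));
console.log(resolveMarkitdownPath("C:\\srv\\app", "win32"));
```

Passing the platform as a parameter keeps the function deterministic in tests; a caller inside `_markitdown` would presumably omit it and let it default to `process.platform`.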