# Directory Structure
```
├── .gitignore
├── .python-version
├── LICENSE
├── package.json
├── pnpm-lock.yaml
├── pyproject.toml
├── README-CN.md
├── README.md
├── setup.sh
├── src
│   ├── index.ts
│   ├── Markdownify.ts
│   ├── server.ts
│   ├── tools.ts
│   └── UVX.ts
├── tsconfig.json
└── uv.lock
```
# Files
--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------
```
3.11
```
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
```
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
lerna-debug.log*
.pnpm-debug.log*
# Test and output files
*.pdf
*.docx
*.md
!README.md
!README-CN.md
output*.md
*.bak
# Diagnostic reports (https://nodejs.org/api/report.html)
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json
# Runtime data
pids
*.pid
*.seed
*.pid.lock
# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov
# Coverage directory used by tools like istanbul
coverage
*.lcov
# nyc test coverage
.nyc_output
# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
.grunt
# Bower dependency directory (https://bower.io/)
bower_components
# node-waf configuration
.lock-wscript
# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release
# Dependency directories
node_modules/
jspm_packages/
.venv/
# Snowpack dependency directory (https://snowpack.dev/)
web_modules/
# TypeScript cache
*.tsbuildinfo
# Optional npm cache directory
.npm
# Optional eslint cache
.eslintcache
# Optional stylelint cache
.stylelintcache
# Microbundle cache
.rpt2_cache/
.rts2_cache_cjs/
.rts2_cache_es/
.rts2_cache_umd/
# Optional REPL history
.node_repl_history
# Output of 'npm pack'
*.tgz
# Yarn Integrity file
.yarn-integrity
# dotenv environment variable files
.env
.env.development.local
.env.test.local
.env.production.local
.env.local
# Build output
dist/
```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
```markdown
# Markdownify MCP Server - UTF-8 Enhanced
This is an enhanced version of the [original Markdownify MCP project](https://github.com/cursor-ai/markdownify-mcp), with improved UTF-8 encoding support and optimized handling of multilingual content.
[中文文档](README-CN.md)
## Enhancements
- Added comprehensive UTF-8 encoding support
- Optimized handling of multilingual content
- Fixed encoding issues on Windows systems
- Improved error handling mechanisms
## Key Differences from Original Project
1. Enhanced Encoding Support:
   - Full UTF-8 support across all operations
   - Proper handling of Chinese, Japanese, Korean and other non-ASCII characters
   - Fixed Windows-specific encoding issues (cmd.exe and PowerShell compatibility)
2. Improved Error Handling:
   - Detailed error messages in both English and Chinese
   - Better exception handling for network issues
   - Graceful fallback mechanisms for conversion failures
3. Extended Functionality:
   - Added support for batch processing multiple files
   - Enhanced YouTube video transcript handling
   - Improved metadata extraction from various file formats
   - Better preservation of document formatting
4. Performance Optimizations:
   - Optimized memory usage for large file conversions
   - Faster processing of multilingual content
   - Reduced dependency conflicts
5. Better Development Experience:
   - Comprehensive debugging options
   - Detailed logging system
   - Environment-specific configuration support
   - Clear documentation in both English and Chinese
## Features
Supports converting various file types to Markdown:
- PDF files
- Images (with metadata)
- Audio (with transcription)
- Word documents (DOCX)
- Excel spreadsheets (XLSX)
- PowerPoint presentations (PPTX)
- Web content:
  - YouTube video transcripts
  - Search results
  - General web pages
- Existing Markdown files
## Quick Start
1. Clone this repository:
```bash
git clone https://github.com/JDJR2024/markdownify-mcp-utf8.git
cd markdownify-mcp-utf8
```
2. Install dependencies:
```bash
pnpm install
```
Note: This will also install `uv` and related Python dependencies.
3. Build the project:
```bash
pnpm run build
```
4. Start the server:
```bash
pnpm start
```
## Requirements
- Node.js 18.0 or higher (the server uses the built-in `fetch`)
- Python 3.11 or higher (as required by `pyproject.toml`)
- pnpm package manager
- Git
## Detailed Installation Guide
### 1. Environment Setup
1. Install Node.js:
   - Download from [Node.js official website](https://nodejs.org/)
   - Verify installation: `node --version`
2. Install pnpm:
```bash
npm install -g pnpm
pnpm --version
```
3. Install Python:
   - Download from [Python official website](https://www.python.org/downloads/)
   - Ensure Python is added to PATH during installation
   - Verify installation: `python --version`
4. (Windows Only) Configure UTF-8 Support in Command Prompt:
```bat
REM Persist UTF-8 for future sessions (current user)
setx PYTHONIOENCODING UTF-8
REM Set UTF-8 for the current session
set PYTHONIOENCODING=UTF-8
REM Switch the console code page to UTF-8
chcp 65001
```
### 2. Project Setup
1. Clone the repository:
```bash
git clone https://github.com/JDJR2024/markdownify-mcp-utf8.git
cd markdownify-mcp-utf8
```
2. Create and activate Python virtual environment:
```bash
# Windows
python -m venv .venv
.venv\Scripts\activate
# Linux/macOS
python3 -m venv .venv
source .venv/bin/activate
```
3. Install project dependencies:
```bash
# Install Node.js dependencies
pnpm install
# Install Python dependencies (will be handled by setup.sh)
./setup.sh
```
4. Build the project:
```bash
pnpm run build
```
### 3. Verification
1. Start the server:
```bash
pnpm start
```
2. Test the installation:
```bash
# Convert a web page
python convert_utf8.py "https://example.com"
# Convert a local file
python convert_utf8.py "path/to/your/file.docx"
```
## Usage Guide
### Basic Usage
1. Converting Web Pages:
```bash
python convert_utf8.py "https://example.com"
```
The converted markdown will be saved as `converted_result.md`
2. Converting Local Files:
```bash
# Convert DOCX
python convert_utf8.py "document.docx"
# Convert PDF
python convert_utf8.py "document.pdf"
# Convert PowerPoint
python convert_utf8.py "presentation.pptx"
# Convert Excel
python convert_utf8.py "spreadsheet.xlsx"
```
3. Converting YouTube Videos:
```bash
python convert_utf8.py "https://www.youtube.com/watch?v=VIDEO_ID"
```
### Advanced Usage
1. Environment Variables:
```bash
# Set custom UV path
export UV_PATH="/custom/path/to/uv"
# Set custom output directory
export MARKDOWN_OUTPUT_DIR="/custom/output/path"
```
2. Batch Processing:
Create a batch file (e.g., `convert_batch.txt`) with URLs or file paths:
```text
https://example1.com
https://example2.com
file1.docx
file2.pdf
```
Then run:
```bash
while read -r line; do python convert_utf8.py "$line"; done < convert_batch.txt
```
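The batch file format above (one URL or file path per line) could also be parsed in TypeScript before each entry is dispatched. The following is a minimal sketch under that assumption; `parseBatchList` and `BatchEntry` are hypothetical names and are not part of this project.

```typescript
// Hypothetical helper: classify each non-empty line of a batch file
// as either a URL or a local file path.
type BatchEntry = { kind: "url" | "file"; value: string };

function parseBatchList(text: string): BatchEntry[] {
  return text
    .split(/\r?\n/) // accept both Unix and Windows line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines
    .map(
      (value): BatchEntry => ({
        kind: /^https?:\/\//i.test(value) ? "url" : "file",
        value,
      }),
    );
}
```

Each resulting entry can then be routed to a URL-based or file-based conversion.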
### Troubleshooting
1. Common Issues:
   - If you see encoding errors, ensure UTF-8 is properly set
   - For permission issues on Windows, run as Administrator
   - For Python path issues, ensure virtual environment is activated
2. Debugging:
```bash
# Enable debug output
export DEBUG=true
python convert_utf8.py "your_file.docx"
```
## Usage
### Desktop App Integration
To integrate this server with a desktop app, add the following to your app's server configuration:
```json
{
  "mcpServers": {
    "markdownify": {
      "command": "node",
      "args": [
        "{ABSOLUTE_PATH}/dist/index.js"
      ],
      "env": {
        "UV_PATH": "/path/to/uv"
      }
    }
  }
}
```
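The `{ABSOLUTE_PATH}` placeholder above must be replaced with the absolute location of the built project. As a rough sketch, the configuration object could be generated rather than edited by hand; `buildServerConfig` is a hypothetical helper, and the directory and `uv` path passed to it are assumptions supplied by the caller.

```typescript
import * as path from "node:path";

// Hypothetical helper: build the server configuration shown above with an
// absolute path to dist/index.js derived from the repository directory.
function buildServerConfig(repoDir: string, uvPath: string) {
  return {
    mcpServers: {
      markdownify: {
        command: "node",
        // path.resolve always yields an absolute path
        args: [path.resolve(repoDir, "dist", "index.js")],
        env: { UV_PATH: uvPath },
      },
    },
  };
}
```

Serializing the result with `JSON.stringify(config, null, 2)` produces text in the same shape as the block above.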
## Troubleshooting
1. Encoding Issues
   - If you encounter character encoding issues, ensure the `PYTHONIOENCODING` environment variable is set to `utf-8`
   - Windows users may need to run `chcp 65001` to enable UTF-8 support
2. Permission Issues
   - Ensure you have sufficient file read/write permissions
   - On Windows, you may need to run as administrator
## Acknowledgments
This project is based on the original work by Zach Caceres. Thanks to the original author for their outstanding contribution.
## License
This project continues to be licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Before submitting a Pull Request, please:
1. Ensure your code follows the project's coding standards
2. Add necessary tests and documentation
3. Update relevant sections in the README
## Contact
For issues or suggestions:
1. Submit an Issue: https://github.com/JDJR2024/markdownify-mcp-utf8/issues
2. Create a Pull Request: https://github.com/JDJR2024/markdownify-mcp-utf8/pulls
3. Email: [email protected]
```
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
```toml
[project]
name = "ocr"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "markitdown>=0.0.1a3",
]
```
--------------------------------------------------------------------------------
/setup.sh:
--------------------------------------------------------------------------------
```bash
#!/bin/bash
echo 'Installing Python dependencies for OCR...'
echo 'Installing uv'
curl -LsSf https://astral.sh/uv/install.sh | sh
echo 'Using uv to install markitdown'
uv sync
echo 'Finished installing Python dependencies'
```
--------------------------------------------------------------------------------
/tsconfig.json:
--------------------------------------------------------------------------------
```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "Node16",
    "moduleResolution": "node16",
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true
  },
  "include": ["src/**/*.ts"],
  "exclude": ["node_modules", "dist", "src/**/*.test.ts"]
}
```
--------------------------------------------------------------------------------
/src/index.ts:
--------------------------------------------------------------------------------
```typescript
#! /usr/bin/env node
import { createServer } from "./server.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

async function main() {
  const transport = new StdioServerTransport();
  const server = createServer();
  await server.connect(transport);
}

main().catch((error) => {
  console.error("Fatal error in main():", error);
  process.exit(1);
});
```
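Because `StdioServerTransport` exchanges protocol messages over the process's stdin and stdout, diagnostic output is best kept on stderr, as the `console.error` call above already does. The following is a minimal sketch of a logging helper built on that assumption; `formatLog` and `logToStderr` are hypothetical names, and the `[markdownify]` prefix is only illustrative.

```typescript
type LogLevel = "info" | "error";

// Hypothetical helper: produce a single-line log message.
function formatLog(level: LogLevel, message: string): string {
  return `[markdownify] ${level.toUpperCase()}: ${message}`;
}

// Write diagnostics to stderr so stdout stays reserved for protocol traffic.
function logToStderr(level: LogLevel, message: string): void {
  process.stderr.write(formatLog(level, message) + "\n");
}
```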
--------------------------------------------------------------------------------
/src/UVX.ts:
--------------------------------------------------------------------------------
```typescript
import { exec } from "child_process";
import { promisify } from "util";

const execAsync = promisify(exec);

export default class UVX {
  uvxPath: string;

  constructor(uvxPath: string) {
    this.uvxPath = uvxPath;
  }

  get path() {
    return this.uvxPath;
  }

  static async setup() {
    // Locate uvx on the PATH instead of relying on a machine-specific path.
    // `where` is the Windows counterpart of the POSIX `which` command.
    const locator = process.platform === "win32" ? "where uvx" : "which uvx";
    try {
      const { stdout } = await execAsync(locator, { env: { ...process.env } });
      // `where` may print several matches; use the first one.
      const uvxPath = stdout.split(/\r?\n/)[0].trim();
      if (uvxPath) {
        return new UVX(uvxPath);
      }
    } catch {
      // The locator exits non-zero when uvx is missing; fall through.
    }
    throw new Error(
      "uvx not found in path, you must install uvx before running this server",
    );
  }

  async installDeps() {
    // This is a hack to make sure that markitdown is installed before it's called in the OCRProcessor
    try {
      await execAsync(`"${this.uvxPath}" markitdown example.pdf`);
    } catch {
      // Log to stderr: stdout carries MCP protocol messages over stdio.
      console.error("UVX markitdown should be ready now");
    }
  }
}
```
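The `which uvx` lookup in the commented-out portion of `UVX.ts` is POSIX-specific. One alternative is to scan the `PATH` directories directly, which avoids spawning a helper process. The following is a minimal sketch, not the project's method; `findOnPath` is a hypothetical helper and the list of Windows extensions it tries is an assumption.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: search a PATH-style string for an executable file.
// Returns the first match, or null when nothing is found.
function findOnPath(
  name: string,
  envPath: string = process.env.PATH ?? "",
): string | null {
  // Windows executables usually carry an extension; try common ones first.
  const extensions =
    process.platform === "win32" ? [".exe", ".cmd", ".bat", ""] : [""];
  for (const dir of envPath.split(path.delimiter)) {
    if (!dir) continue; // ignore empty PATH segments
    for (const ext of extensions) {
      const candidate = path.join(dir, name + ext);
      try {
        if (fs.statSync(candidate).isFile()) return candidate;
      } catch {
        // No such file in this directory; keep looking.
      }
    }
  }
  return null;
}
```

A caller could fall back to an explicit setting such as the `UV_PATH` environment variable when the function returns `null`.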
--------------------------------------------------------------------------------
/package.json:
--------------------------------------------------------------------------------
```json
{
  "name": "mcp-markdownify-server-utf8",
  "version": "0.0.1",
  "description": "MCP Markdownify Server with UTF-8 Support - Model Context Protocol Server for Converting Almost Anything to Markdown",
  "license": "MIT",
  "author": "quasimodo-XY (Based on work by @zcaceres)",
  "homepage": "https://github.com/JDJR2024/markdownify-mcp-utf8",
  "bugs": "https://github.com/JDJR2024/markdownify-mcp-utf8/issues",
  "type": "module",
  "bin": {
    "mcp-markdownify-server": "dist/index.js"
  },
  "files": [
    "dist"
  ],
  "scripts": {
    "build": "tsc && shx chmod +x dist/*.js",
    "prepare": "npm run build",
    "dev": "tsc --watch",
    "preinstall": "./setup.sh",
    "start": "node dist/index.js",
    "test": "bun test",
    "test:watch": "bun test --watch"
  },
  "dependencies": {
    "@modelcontextprotocol/sdk": "1.0.1",
    "zod": "^3.24.1"
  },
  "devDependencies": {
    "@types/node": "^22.9.3",
    "bun": "^1.1.41",
    "sdk": "link:@types/modelcontextprotocol/sdk",
    "shx": "^0.3.4",
    "ts-jest": "^29.2.5",
    "typescript": "^5.6.2"
  },
  "keywords": [
    "markdown",
    "converter",
    "utf8",
    "multilingual",
    "mcp",
    "model-context-protocol"
  ]
}
```
--------------------------------------------------------------------------------
/src/Markdownify.ts:
--------------------------------------------------------------------------------
```typescript
import { execFile } from "child_process";
import { promisify } from "util";
import path from "path";
import fs from "fs";
import os from "os";
import { fileURLToPath } from "url";

const execFileAsync = promisify(execFile);
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

export type MarkdownResult = {
  path: string;
  text: string;
};

export class Markdownify {
  private static async _markitdown(
    filePath: string,
    projectRoot: string,
    uvPath: string,
  ): Promise<string> {
    // Virtual environments keep executables in "Scripts" on Windows and in
    // "bin" elsewhere. The entry point can be run directly, so the venv does
    // not need to be activated first. `uvPath` is currently unused.
    const isWindows = process.platform === "win32";
    const markitdownPath = path.join(
      projectRoot,
      ".venv",
      isWindows ? "Scripts" : "bin",
      isWindows ? "markitdown.exe" : "markitdown",
    );

    if (!fs.existsSync(markitdownPath)) {
      throw new Error(`markitdown executable not found at ${markitdownPath}`);
    }

    // execFile passes the file path as a single argument without a shell, so
    // spaces and quotes in file names need no escaping. PYTHONIOENCODING asks
    // the Python child process to emit UTF-8.
    const { stdout, stderr } = await execFileAsync(markitdownPath, [filePath], {
      env: { ...process.env, PYTHONIOENCODING: "utf-8" },
    });

    if (stderr) {
      throw new Error(`Error executing command: ${stderr}`);
    }

    return stdout;
  }

  private static async saveToTempFile(content: string): Promise<string> {
    const tempOutputPath = path.join(
      os.tmpdir(),
      `markdown_output_${Date.now()}.md`,
    );
    fs.writeFileSync(tempOutputPath, content);
    return tempOutputPath;
  }

  static async toMarkdown({
    filePath,
    url,
    projectRoot = path.resolve(__dirname, ".."),
    uvPath = "~/.local/bin/uv",
  }: {
    filePath?: string;
    url?: string;
    projectRoot?: string;
    uvPath?: string;
  }): Promise<MarkdownResult> {
    try {
      let inputPath: string;
      let isTemporary = false;

      if (url) {
        const response = await fetch(url);
        // Fail early on HTTP errors instead of converting an error page.
        if (!response.ok) {
          throw new Error(`Failed to fetch ${url}: HTTP ${response.status}`);
        }
        const content = await response.text();
        inputPath = await this.saveToTempFile(content);
        isTemporary = true;
      } else if (filePath) {
        inputPath = filePath;
      } else {
        throw new Error("Either filePath or url must be provided");
      }

      const text = await this._markitdown(inputPath, projectRoot, uvPath);
      const outputPath = await this.saveToTempFile(text);

      // Remove the downloaded copy once it has been converted.
      if (isTemporary) {
        fs.unlinkSync(inputPath);
      }

      return { path: outputPath, text };
    } catch (e: unknown) {
      if (e instanceof Error) {
        throw new Error(`Error processing to Markdown: ${e.message}`);
      } else {
        throw new Error("Error processing to Markdown: Unknown error occurred");
      }
    }
  }

  static async get({
    filePath,
  }: {
    filePath: string;
  }): Promise<MarkdownResult> {
    if (!fs.existsSync(filePath)) {
      throw new Error("File does not exist");
    }

    const text = await fs.promises.readFile(filePath, "utf-8");

    return {
      path: filePath,
      text: text,
    };
  }
}
```
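The default `uvPath` above begins with `~`, which a shell would expand but Node.js path and file-system APIs do not. If such a path were ever used directly, it would need to be expanded first. The following is a minimal sketch of that step; `expandHome` is a hypothetical helper, not something this project defines.

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: replace a leading "~" with the user's home directory.
// Paths that do not start with "~" are returned unchanged.
function expandHome(p: string, home: string = os.homedir()): string {
  if (p === "~") return home;
  if (p.startsWith("~/") || p.startsWith("~\\")) {
    return path.join(home, p.slice(2));
  }
  return p;
}
```

The `home` parameter exists mainly so the function can be tested without depending on the current user.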
--------------------------------------------------------------------------------
/src/server.ts:
--------------------------------------------------------------------------------
```typescript
import { z } from "zod";
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import {
  CallToolRequest,
  CallToolRequestSchema,
  ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import { Markdownify } from "./Markdownify.js";
import * as tools from "./tools.js";

const RequestPayloadSchema = z.object({
  filepath: z.string().optional(),
  url: z.string().optional(),
  projectRoot: z.string().optional(),
  uvPath: z.string().optional(),
});

export function createServer() {
  const server = new Server(
    {
      name: "mcp-markdownify-server",
      version: "0.1.0",
    },
    {
      capabilities: {
        tools: {},
      },
    },
  );

  server.setRequestHandler(ListToolsRequestSchema, async () => {
    return {
      tools: Object.values(tools),
    };
  });

  server.setRequestHandler(
    CallToolRequestSchema,
    async (request: CallToolRequest) => {
      const { name, arguments: args } = request.params;

      try {
        // Validate inside the try block so that malformed arguments are
        // reported as a tool error rather than an unhandled exception.
        const validatedArgs = RequestPayloadSchema.parse(args);
        let result;
        switch (name) {
          // Tools that convert a remote resource identified by URL.
          case tools.YouTubeToMarkdownTool.name:
          case tools.BingSearchResultToMarkdownTool.name:
          case tools.WebpageToMarkdownTool.name:
            if (!validatedArgs.url) {
              throw new Error("URL is required for this tool");
            }
            result = await Markdownify.toMarkdown({
              url: validatedArgs.url,
              projectRoot: validatedArgs.projectRoot,
              uvPath: validatedArgs.uvPath || process.env.UV_PATH,
            });
            break;

          // Tools that convert a local file.
          case tools.PDFToMarkdownTool.name:
          case tools.ImageToMarkdownTool.name:
          case tools.AudioToMarkdownTool.name:
          case tools.DocxToMarkdownTool.name:
          case tools.XlsxToMarkdownTool.name:
          case tools.PptxToMarkdownTool.name:
            if (!validatedArgs.filepath) {
              throw new Error("File path is required for this tool");
            }
            result = await Markdownify.toMarkdown({
              filePath: validatedArgs.filepath,
              projectRoot: validatedArgs.projectRoot,
              uvPath: validatedArgs.uvPath || process.env.UV_PATH,
            });
            break;

          // Returns an existing Markdown file without conversion.
          case tools.GetMarkdownFileTool.name:
            if (!validatedArgs.filepath) {
              throw new Error("File path is required for this tool");
            }
            result = await Markdownify.get({
              filePath: validatedArgs.filepath,
            });
            break;

          default:
            throw new Error("Tool not found");
        }

        return {
          content: [
            { type: "text", text: `Output file: ${result.path}` },
            { type: "text", text: `Converted content:` },
            { type: "text", text: result.text },
          ],
          isError: false,
        };
      } catch (e) {
        if (e instanceof Error) {
          return {
            content: [{ type: "text", text: `Error: ${e.message}` }],
            isError: true,
          };
        } else {
          console.error(e);
          return {
            content: [{ type: "text", text: `Error: Unknown error occurred` }],
            isError: true,
          };
        }
      }
    },
  );

  return server;
}
```
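The `switch` above sorts tools into those that need a `url` and those that need a `filepath`. The same rule can be expressed as a lookup table, which keeps the required argument next to each tool name. The following is a sketch that reuses the tool names from `tools.ts`; `REQUIRED_ARG` and `requireArg` are hypothetical and do not exist in this project.

```typescript
type ArgName = "url" | "filepath";

// Hypothetical table: which argument each tool needs.
const REQUIRED_ARG = new Map<string, ArgName>([
  ["youtube-to-markdown", "url"],
  ["bing-search-to-markdown", "url"],
  ["webpage-to-markdown", "url"],
  ["pdf-to-markdown", "filepath"],
  ["image-to-markdown", "filepath"],
  ["audio-to-markdown", "filepath"],
  ["docx-to-markdown", "filepath"],
  ["xlsx-to-markdown", "filepath"],
  ["pptx-to-markdown", "filepath"],
  ["get-markdown-file", "filepath"],
]);

// Return the required argument for a tool, or throw with the same messages
// the switch statement uses.
function requireArg(
  toolName: string,
  args: Partial<Record<ArgName, string>>,
): { key: ArgName; value: string } {
  const key = REQUIRED_ARG.get(toolName);
  if (!key) throw new Error("Tool not found");
  const value = args[key];
  if (!value) {
    throw new Error(
      key === "url"
        ? "URL is required for this tool"
        : "File path is required for this tool",
    );
  }
  return { key, value };
}
```

Adding a tool then means adding one table row instead of another `case` label.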
--------------------------------------------------------------------------------
/src/tools.ts:
--------------------------------------------------------------------------------
```typescript
import { ToolSchema } from "@modelcontextprotocol/sdk/types.js";

export const YouTubeToMarkdownTool = ToolSchema.parse({
  name: "youtube-to-markdown",
  description:
    "Convert a YouTube video to markdown, including transcript if available",
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "URL of the YouTube video",
      },
    },
    required: ["url"],
  },
});

export const PDFToMarkdownTool = ToolSchema.parse({
  name: "pdf-to-markdown",
  description: "Convert a PDF file to markdown",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the PDF file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const BingSearchResultToMarkdownTool = ToolSchema.parse({
  name: "bing-search-to-markdown",
  description: "Convert a Bing search results page to markdown",
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "URL of the Bing search results page",
      },
    },
    required: ["url"],
  },
});

export const WebpageToMarkdownTool = ToolSchema.parse({
  name: "webpage-to-markdown",
  description: "Convert a webpage to markdown",
  inputSchema: {
    type: "object",
    properties: {
      url: {
        type: "string",
        description: "URL of the webpage to convert",
      },
    },
    required: ["url"],
  },
});

export const ImageToMarkdownTool = ToolSchema.parse({
  name: "image-to-markdown",
  description:
    "Convert an image to markdown, including metadata and description",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the image file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const AudioToMarkdownTool = ToolSchema.parse({
  name: "audio-to-markdown",
  description:
    "Convert an audio file to markdown, including transcription if possible",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the audio file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const DocxToMarkdownTool = ToolSchema.parse({
  name: "docx-to-markdown",
  description: "Convert a DOCX file to markdown",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the DOCX file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const XlsxToMarkdownTool = ToolSchema.parse({
  name: "xlsx-to-markdown",
  description: "Convert an XLSX file to markdown",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the XLSX file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const PptxToMarkdownTool = ToolSchema.parse({
  name: "pptx-to-markdown",
  description: "Convert a PPTX file to markdown",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path of the PPTX file to convert",
      },
    },
    required: ["filepath"],
  },
});

export const GetMarkdownFileTool = ToolSchema.parse({
  name: "get-markdown-file",
  description: "Get a markdown file by absolute file path",
  inputSchema: {
    type: "object",
    properties: {
      filepath: {
        type: "string",
        description: "Absolute path to file of markdown'd text",
      },
    },
    required: ["filepath"],
  },
});
```
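Every definition in `tools.ts` has the same shape: a name, a description, and one required string argument. As an illustration only, that repetition could be captured by a small factory. In the sketch below, `defineTool` and `ToolDefinition` are hypothetical names, and the result is a plain object; the original file additionally passes each object through `ToolSchema.parse`, which this dependency-free sketch omits.

```typescript
// Hypothetical type mirroring the object literals in tools.ts.
type ToolDefinition = {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: "string"; description: string }>;
    required: string[];
  };
};

// Hypothetical factory: build a tool definition with one required argument.
function defineTool(
  name: string,
  description: string,
  argName: "url" | "filepath",
  argDescription: string,
): ToolDefinition {
  return {
    name,
    description,
    inputSchema: {
      type: "object",
      properties: {
        [argName]: { type: "string", description: argDescription },
      },
      required: [argName],
    },
  };
}

// Example: the same data as DocxToMarkdownTool, before schema validation.
const docxTool = defineTool(
  "docx-to-markdown",
  "Convert a DOCX file to markdown",
  "filepath",
  "Absolute path of the DOCX file to convert",
);
```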