kreuzberg
Community · by kreuzberg-dev
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Go, and TypeScript/Node.js—or use via CLI, REST API, or MCP server.
Description
# Kreuzberg

**A polyglot document intelligence framework with a Rust core.** Extract text, metadata, and structured information from 56 formats, including PDFs, Office documents, and images. Available for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, and C#, or use via CLI, REST API, or MCP server.

> **🚀 Version 4.0.0 Release Candidate**
> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter. Help us make the stable release better!

## Why use Kreuzberg

- **Truly polyglot** – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#
- **Production-ready** – Battle-tested with comprehensive error handling and validation
- **56 formats** – PDF, Office documents, images, HTML, XML, emails, archives, and more
- **OCR built-in** – Multiple backends (Tesseract, EasyOCR, PaddleOCR) with table extraction support
- **Flexible deployment** – Use as library, CLI tool, REST API server, or MCP server
- **Memory efficient** – Streaming parsers with constant memory usage for multi-GB files

📖 **[Complete Documentation](https://kreuzberg.dev/)** • 🚀 **[Installation Guides](#installation)**

## Kreuzberg Cloud (Coming Soon)

Don't want to manage Rust infrastructure? **Kreuzberg Cloud** is a managed document extraction API launching at the beginning of 2026.

- Hosted REST API with async jobs and webhooks
- Built-in chunking and embeddings for RAG pipelines
- Premium OCR backends for 95%+ accuracy
- No infrastructure to maintain

## Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

- **[Python](packages/python/README.md)** – Installation, basic usage, async/sync APIs
- **[Ruby](packages/ruby/README.md)** – Installation, basic usage, configuration
- **[TypeScript/Node.js](crates/kreuzberg-node/README.md)** – Installation, types, promises
- **[Go](packages/go/README.md)** – Installation, native library setup, sync/async extraction + batch APIs
  _Note: Windows builds use MinGW and don't support embeddings (ONNX Runtime requires MSVC)_
- **[Java](packages/java/README.md)** – Installation, FFM API usage, Maven/Gradle setup
- **[C#](packages/csharp/README.md)** – Installation, P/Invoke usage, NuGet package
- **[Rust](crates/kreuzberg/README.md)** – Crate usage, features, async/sync APIs
- **[CLI](https://kreuzberg.dev/cli/usage/)** – Command-line usage, batch processing, options

## Supported Formats

### Documents & Productivity

| Format | Extensions | Metadata | Tables | Images |
|--------|-----------|----------|--------|--------|
| PDF | `.pdf` | ✅ | ✅ | ✅ |
| Word | `.docx`, `.doc` | ✅ | ✅ | ✅ |
| Excel | `.xlsx`, `.xls`, `.ods` | ✅ | ✅ | ❌ |
| PowerPoint | `.pptx`, `.ppt` | ✅ | ✅ | ✅ |
| Rich Text | `.rtf` | ✅ | ❌ | ❌ |
| EPUB | `.epub` | ✅ | ❌ | ❌ |

### Images

All image formats support OCR: `.jpg`, `.jpeg`, `.png`, `.tiff`, `.tif`, `.bmp`, `.gif`, `.webp`, `.jp2`

### Web & Structured Data

| Format | Extensions | Features |
|--------|-----------|----------|
| HTML | `.html`, `.htm` | Metadata extraction, link preservation |
| XML | `.xml` | Streaming parser for multi-GB files |
| JSON | `.json` | Intelligent field detection |
| YAML | `.yaml` | Structure preservation |
| TOML | `.toml` | Configuration parsing |

### Email & Archives

| Format | Extensions | Features |
|--------|-----------|----------|
| Email | `.eml`, `.msg` | Full metadata, attachment extraction |
| Archives | `.zip`, `.tar`, `.gz`, `.7z` | File listing, metadata |

### Academic & Technical

LaTeX (`.tex`), BibTeX (`.bib`), Jupyter (`.ipynb`), reStructuredText (`.rst`), Org Mode (`.org`), Markdown (`.md`)

**[Complete Format Documentation](https://kreuzberg.dev/reference/formats/)**

## Key Features

### OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reco
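The extension lists under Supported Formats can be read as a simple lookup from file extension to format family. The sketch below is only an illustration of that mapping in plain Python: the category names, `FORMAT_CATEGORIES`, and `categorize` are hypothetical helpers and not part of Kreuzberg's API, and the table covers only the extensions listed in this README, not the full set of 56 formats.

```python
from pathlib import Path

# Extension groups copied from the Supported Formats section of this README.
# Category names are illustrative assumptions, not Kreuzberg identifiers.
FORMAT_CATEGORIES = {
    "document": {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".ods",
                 ".pptx", ".ppt", ".rtf", ".epub"},
    "image": {".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp",
              ".gif", ".webp", ".jp2"},
    "web_structured": {".html", ".htm", ".xml", ".json", ".yaml", ".toml"},
    "email": {".eml", ".msg"},
    "archive": {".zip", ".tar", ".gz", ".7z"},
    "academic": {".tex", ".bib", ".ipynb", ".rst", ".org", ".md"},
}


def categorize(path):
    """Return the format family for a path's final extension, or None if unlisted."""
    suffix = Path(path).suffix.lower()  # only the last suffix, case-insensitive
    for category, extensions in FORMAT_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None
```

A pre-filter like this could, for example, separate image inputs (which the README says all support OCR) from other files before they are handed to an extraction call.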
Similar servers
mcp-chain-of-draft-server
Chain of Draft Server is a powerful AI-driven tool that helps developers make better decisions through systematic, iterative refinement of thoughts and designs. It integrates seamlessly with popular AI agents and provides a structured approach to reasoning, API design, architecture decisions, code reviews, and implementation planning.
mcp-use-ts
mcp-use is the framework for MCP with the best DX - Build AI agents, create MCP servers with UI widgets, and debug with built-in inspector. Includes client SDK, server SDK, React hooks, and powerful dev tools.
mesh
Define and compose secure MCPs in TypeScript. Generate AI workflows and agents with React + Tailwind UI. Deploy anywhere.
rhinomcp
RhinoMCP connects Rhino 3D to AI agents through the Model Context Protocol (MCP).