
# Kreuzberg

[![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
[![PyPI](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
[![npm](https://img.shields.io/npm/v/@kreuzberg/node)](https://www.npmjs.com/package/@kreuzberg/node)
[![RubyGems](https://badge.fury.io/rb/kreuzberg.svg)](https://rubygems.org/gems/kreuzberg)
[![Go Reference](https://pkg.go.dev/badge/github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg.svg)](https://pkg.go.dev/github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg)
[![Maven Central](https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg)](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
[![NuGet](https://img.shields.io/nuget/v/Goldziher.Kreuzberg)](https://www.nuget.org/packages/Goldziher.Kreuzberg/)
[![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**A polyglot document intelligence framework with a Rust core.**

Extract text, metadata, and structured information from PDFs, Office documents, images, and more, covering 56 formats. Available for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, and C#, or use it via the CLI, REST API, or MCP server.

> **🚀 Version 4.0.0 Release Candidate**
> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter. Help us make the stable release better!

## Why use Kreuzberg

- **Truly polyglot** – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#
- **Production-ready** – Battle-tested with comprehensive error handling and validation
- **56 formats** – PDF, Office documents, images, HTML, XML, emails, archives, and more
- **OCR built-in** – Multiple backends (Tesseract, EasyOCR, PaddleOCR) with table extraction support
- **Flexible deployment** – Use as a library, CLI tool, REST API server, or MCP server
- **Memory efficient** – Streaming parsers with constant memory usage for multi-GB files

📖 **[Complete Documentation](https://kreuzberg.dev/)** • 🚀 **[Installation Guides](#installation)**

## Kreuzberg Cloud (Coming Soon)

Don't want to manage Rust infrastructure? **Kreuzberg Cloud** is a managed document extraction API launching at the beginning of 2026.

- Hosted REST API with async jobs and webhooks
- Built-in chunking and embeddings for RAG pipelines
- Premium OCR backends for 95%+ accuracy
- No infrastructure to maintain

## Installation

Each language binding provides comprehensive documentation with examples and best practices.
Choose your platform to get started:

- **[Python](packages/python/README.md)** – Installation, basic usage, async/sync APIs
- **[Ruby](packages/ruby/README.md)** – Installation, basic usage, configuration
- **[TypeScript/Node.js](crates/kreuzberg-node/README.md)** – Installation, types, promises
- **[Go](packages/go/README.md)** – Installation, native library setup, sync/async extraction + batch APIs
  _Note: Windows builds use MinGW and don't support embeddings (ONNX Runtime requires MSVC)_
- **[Java](packages/java/README.md)** – Installation, FFM API usage, Maven/Gradle setup
- **[C#](packages/csharp/README.md)** – Installation, P/Invoke usage, NuGet package
- **[Rust](crates/kreuzberg/README.md)** – Crate usage, features, async/sync APIs
- **[CLI](https://kreuzberg.dev/cli/usage/)** – Command-line usage, batch processing, options

## Supported Formats

### Documents & Productivity

| Format | Extensions | Metadata | Tables | Images |
|--------|-----------|----------|--------|--------|
| PDF | `.pdf` | ✅ | ✅ | ✅ |
| Word | `.docx`, `.doc` | ✅ | ✅ | ✅ |
| Excel | `.xlsx`, `.xls`, `.ods` | ✅ | ✅ | ❌ |
| PowerPoint | `.pptx`, `.ppt` | ✅ | ✅ | ✅ |
| Rich Text | `.rtf` | ✅ | ❌ | ❌ |
| EPUB | `.epub` | ✅ | ❌ | ❌ |

### Images

All image formats support OCR: `.jpg`, `.jpeg`, `.png`, `.tiff`, `.tif`, `.bmp`, `.gif`, `.webp`, `.jp2`

### Web & Structured Data

| Format | Extensions | Features |
|--------|-----------|----------|
| HTML | `.html`, `.htm` | Metadata extraction, link preservation |
| XML | `.xml` | Streaming parser for multi-GB files |
| JSON | `.json` | Intelligent field detection |
| YAML | `.yaml` | Structure preservation |
| TOML | `.toml` | Configuration parsing |

### Email & Archives

| Format | Extensions | Features |
|--------|-----------|----------|
| Email | `.eml`, `.msg` | Full metadata, attachment extraction |
| Archives | `.zip`, `.tar`, `.gz`, `.7z` | File listing, metadata |

### Academic & Technical

LaTeX (`.tex`), BibTeX (`.bib`), Jupyter (`.ipynb`), reStructuredText (`.rst`), Org Mode (`.org`), Markdown (`.md`)

**[Complete Format Documentation](https://kreuzberg.dev/reference/formats/)**

## Key Features

### OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reco
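
As a rough illustration of the Supported Formats tables above, the extension groups can be written down as a small lookup that a caller might use to pre-sort files before extraction. This is only a sketch: the category names and the `categorize` and `is_ocr_image` helpers are hypothetical and are not part of Kreuzberg's API, and the extension sets cover only the formats listed in this README, not the complete format documentation.

```python
from pathlib import Path
from typing import Optional

# Extension groups copied from the "Supported Formats" section of this README.
# The category names are illustrative labels, not Kreuzberg identifiers.
FORMAT_CATEGORIES = {
    "document": {".pdf", ".docx", ".doc", ".xlsx", ".xls", ".ods",
                 ".pptx", ".ppt", ".rtf", ".epub"},
    "image": {".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp",
              ".gif", ".webp", ".jp2"},
    "web_structured": {".html", ".htm", ".xml", ".json", ".yaml", ".toml"},
    "email": {".eml", ".msg"},
    "archive": {".zip", ".tar", ".gz", ".7z"},
    "academic_technical": {".tex", ".bib", ".ipynb", ".rst", ".org", ".md"},
}


def categorize(path: str) -> Optional[str]:
    """Return the README category for a file, or None if its extension is not listed."""
    suffix = Path(path).suffix.lower()  # only the last extension, e.g. ".gz"
    for category, extensions in FORMAT_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


def is_ocr_image(path: str) -> bool:
    """True if the file is one of the image formats the README lists as supporting OCR."""
    return categorize(path) == "image"


if __name__ == "__main__":
    for name in ["report.PDF", "scan.tiff", "backup.tar.gz", "notes.txt"]:
        print(name, "->", categorize(name))
```

The lookup matches on the final extension only and ignores case, so a name such as `backup.tar.gz` is classified by `.gz`. Anything not listed in the tables returns `None` rather than a guess.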
