Kreuzberg v4.7.0: Code Intelligence for 248 Languages

Kreuzberg v4.7.0 is now available. This is a Rust-core document intelligence library that works with Python, TypeScript/Node.js, Go, Ruby, Java, C#, PHP, Elixir, R, C, and WASM.

Code Intelligence and Extraction

The main highlight is code intelligence and extraction. Kreuzberg now supports 248 programming languages through the tree-sitter-language-pack library. This makes efficient code parsing available both as a library embedded directly in agents and through MCP, so agents can work with code repositories, review pull requests, index codebases, and analyze source files.

Kreuzberg extracts at the AST level:

Functions
Classes
Imports
Exports
Symbols
Docstrings

Code chunking respects scope boundaries, so a chunk does not cut through the middle of a function or class.
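
To make the idea of AST-level extraction and scope-aware chunking concrete, here is a minimal sketch. It is not Kreuzberg's API: it uses Python's standard-library ast module on Python source only, whereas Kreuzberg builds on tree-sitter grammars. The names extract_outline and scope_chunks are hypothetical and exist only for this illustration.

```python
import ast

SOURCE = '''
import os
from pathlib import Path

class Loader:
    """Load files from disk."""

    def read(self, path):
        """Return the file contents."""
        return Path(path).read_text()

def main():
    print(os.getcwd())
'''


def extract_outline(source: str) -> dict:
    """Collect imports, classes, functions, and docstrings from Python source.

    Hypothetical helper for illustration; not part of Kreuzberg.
    """
    tree = ast.parse(source)
    outline = {"imports": [], "classes": [], "functions": [], "docstrings": {}}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            outline["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            outline["imports"].append(node.module or "")
        elif isinstance(node, ast.ClassDef):
            outline["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            outline["functions"].append(node.name)
        # Record a docstring for any class or function that has one.
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                outline["docstrings"][node.name] = doc
    return outline


def scope_chunks(source: str) -> list:
    """Split source into chunks, one per top-level class or function.

    Each chunk covers a whole definition, so no chunk ends mid-scope.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks


if __name__ == "__main__":
    print(extract_outline(SOURCE))
    for chunk in scope_chunks(SOURCE):
        print("--- chunk ---")
        print(chunk)
```

Chunking on definition boundaries, rather than on a fixed character count, keeps each chunk self-contained, which is the property the announcement describes.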

Markdown Quality Improvements

Poor document extraction causes problems further down the pipeline. The team created a benchmark harness that scores output with Structural F1 (SF1) and Text F1 across more than 350 documents and 23 formats, then optimized extraction against those scores.

Specific improvements:

LaTeX: SF1 improved from 0% to 100%
XLSX: SF1 improved from 30% to 100%
PDF tables: SF1 improved from 15.5% to 53.7%

All 23 formats now score above 80% SF1, so the output that downstream pipelines receive is structurally correct by default.
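
The announcement does not define how Structural F1 and Text F1 are computed. As a rough sketch under that caveat, one plausible reading is a standard F1 score over multisets: block types (heading, table, list, and so on) for the structural score, and word tokens for the text score. The function f1_score and the sample data below are assumptions made for illustration, not the benchmark harness itself.

```python
from collections import Counter


def f1_score(expected, predicted):
    """Multiset F1 between expected and predicted items.

    Illustrative only: the real benchmark may align or weight items differently.
    Returns 0.0 when there is no overlap (including when either side is empty).
    """
    exp, pred = Counter(expected), Counter(predicted)
    overlap = sum((exp & pred).values())  # items matched on both sides
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


# Structural score: compare the sequence of block types in the reference
# document with the block types the extractor produced.
expected_blocks = ["heading", "paragraph", "table", "list"]
predicted_blocks = ["heading", "paragraph", "paragraph", "list"]
structural_f1 = f1_score(expected_blocks, predicted_blocks)

# Text score: compare word tokens.
expected_text = "the quick brown fox".split()
predicted_text = "the quick fox".split()
text_f1 = f1_score(expected_text, predicted_text)

if __name__ == "__main__":
    print(f"structural F1: {structural_f1:.3f}")
    print(f"text F1: {text_f1:.3f}")
```

In this toy case the extractor misses the table and emits an extra paragraph, so three of four blocks match on each side and the structural score is 0.75, even though most of the text is intact. That gap is why structure and text are worth scoring separately.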

Other Key Features

New markdown rendering layer and new HTML output support
OpenWebUI integration as a document extraction backend, with options for docling-serve compatibility or direct connection
Unified architecture where every extractor creates a standard typed document representation
TOON wire format: a compact document encoding that reduces LLM prompt token usage by 30 to 50%
Semantic chunk labeling
JSON output
Strict configuration validation
Improved security

Availability

Kreuzberg is available on GitHub: https://github.com/kreuzberg-dev/kreuzberg

Kreuzberg Cloud, a hosted version for teams that want the same extraction quality without managing infrastructure, will be out soon. More information is available at https://kreuzberg.dev

Contributions are welcome.

📖 Read the full source: r/LocalLLaMA