Kreuzberg v4.7.0 adds code intelligence for 248 languages and improved markdown extraction

By OpenClawRadar. Published: April 14, 2026.

Kreuzberg v4.7.0 is now available. Kreuzberg is a document intelligence library with a Rust core that works with Python, TypeScript/Node.js, Go, Ruby, Java, C#, PHP, Elixir, R, C, and WASM.

Code Intelligence and Extraction

The main highlight is code intelligence and extraction. Kreuzberg now supports 248 programming languages through the tree-sitter-language-pack library. This provides efficient code parsing both as a library that agents can call directly and via MCP, so agents can work with code repositories, review pull requests, index codebases, and analyze source files.

Kreuzberg extracts the following at the AST level, with code chunking that respects scope boundaries:

  • Functions
  • Classes
  • Imports
  • Exports
  • Symbols
  • Docstrings
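
To make AST-level extraction and scope-aware chunking concrete, here is a minimal sketch that uses Python's standard `ast` module on Python source only. It is not Kreuzberg's API and does not use tree-sitter; the function names `extract_structure` and `chunk_by_scope` and the sample `SOURCE` are hypothetical, chosen for illustration.

```python
import ast

# Hypothetical sample input, not taken from Kreuzberg.
SOURCE = '''
import os
from pathlib import Path

class Loader:
    """Load files from disk."""

    def read(self, path):
        """Return the file text."""
        return Path(path).read_text()

def main():
    print(os.getcwd())
'''


def extract_structure(source: str) -> dict:
    """Collect imports, classes, functions, and docstrings from Python source."""
    tree = ast.parse(source)
    info = {"imports": [], "classes": [], "functions": [], "docstrings": {}}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            info["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            info["imports"].append(node.module or "")
        elif isinstance(node, ast.ClassDef):
            info["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            info["functions"].append(node.name)
        # Docstrings are attached to class and function definitions.
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                info["docstrings"][node.name] = doc
    return info


def chunk_by_scope(source: str) -> list[str]:
    """Return one chunk per top-level statement, so no scope is cut in half."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno and end_lineno are 1-based and inclusive.
        chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks


structure = extract_structure(SOURCE)
chunks = chunk_by_scope(SOURCE)
```

Splitting on top-level AST nodes rather than on a fixed character count keeps each class or function whole inside a single chunk, which is the property the phrase "respects scope boundaries" describes.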

Markdown Quality Improvements

Poor document extraction causes problems further down the pipeline. The team built a benchmark harness that scores output with Structural F1 (SF1) and Text F1 across more than 350 documents and 23 formats, then optimized the extractors against those scores.
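
The announcement does not define how the two scores are computed. As a rough sketch, one plausible reading is a multiset F1: Structural F1 over the block types found in the output, and Text F1 over its tokens. The block taxonomy in `structural_items`, the helper names, and the sample strings below are assumptions for illustration, not the benchmark's actual definitions.

```python
from collections import Counter


def f1_score(predicted: list[str], reference: list[str]) -> float:
    """Multiset F1: harmonic mean of precision and recall over shared items."""
    if not predicted and not reference:
        return 1.0
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def structural_items(markdown: str) -> list[str]:
    """Map each non-empty line to a coarse block type (assumed taxonomy)."""
    items = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            items.append(f"heading{level}")
        elif stripped.startswith(("- ", "* ")):
            items.append("list_item")
        elif stripped.startswith("|"):
            items.append("table_row")
        else:
            items.append("paragraph")
    return items


# Hypothetical ground truth and an extraction that lost the heading marker.
reference = "# Title\n\nIntro text.\n\n- a\n- b\n"
extracted = "Title\n\nIntro text.\n\n- a\n- b\n"

structural_f1 = f1_score(structural_items(extracted), structural_items(reference))
text_f1 = f1_score(extracted.split(), reference.split())
```

In this example the text is almost fully preserved while the heading is demoted to a paragraph, so the structural score drops further than the text score. That gap is why a benchmark would track the two separately.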

Specific improvements:

  • LaTeX: SF1 improved from 0% to 100%
  • XLSX: SF1 improved from 30% to 100%
  • PDF tables: SF1 improved from 15.5% to 53.7%

All 23 formats now score above 80% SF1. The output that downstream pipelines receive is now structurally correct by default.


Other Key Features

  • New markdown rendering layer and new HTML output support
  • OpenWebUI integration as a document extraction backend
  • Options for docling-serve compatibility or direct connection
  • Unified architecture where every extractor creates a standard typed document representation
  • TOON wire format: a compact document encoding that reduces LLM prompt token usage by 30 to 50%
  • Semantic chunk labeling
  • JSON output
  • Strict configuration validation
  • Improved security
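
To illustrate why a compact tabular encoding can shrink prompts, the sketch below writes uniform records as one header line plus one row per record and compares the result with JSON. It is loosely modeled on TOON's tabular layout, with no quoting or escaping, and is not Kreuzberg's encoder. The `encode_table` helper and the sample `chunks` data are hypothetical, and character count is only a rough proxy for token count.

```python
import json


def encode_table(name: str, rows: list[dict]) -> str:
    """Encode uniform dicts as a header line followed by one CSV-like line per row."""
    fields = list(rows[0].keys())
    # Field names appear once in the header instead of once per record.
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    body = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *body])


# Hypothetical labeled chunks from an extracted document.
chunks = [
    {"id": 1, "label": "heading", "page": 1},
    {"id": 2, "label": "paragraph", "page": 1},
    {"id": 3, "label": "table", "page": 2},
]

compact = encode_table("chunks", chunks)
verbose = json.dumps({"chunks": chunks})

print(compact)
print(len(compact), "characters vs", len(verbose), "for JSON")
```

The saving comes from stating the keys once rather than repeating them in every object, so it grows with the number of uniform records.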

Availability

Kreuzberg is available on GitHub: https://github.com/kreuzberg-dev/kreuzberg

Kreuzberg Cloud, a hosted version for teams that want the same extraction quality without managing infrastructure, will be out soon. More information at: https://kreuzberg.dev

Contributions are welcome.

📖 Read the full source: r/LocalLLaMA
