Kreuzberg v4.7.0 adds code intelligence for 248 languages and improved markdown extraction

Kreuzberg v4.7.0 is now available. This is a Rust-core document intelligence library that works with Python, TypeScript/Node.js, Go, Ruby, Java, C#, PHP, Elixir, R, C, and WASM.
Code Intelligence and Extraction
The main highlight is code intelligence and extraction. Kreuzberg now supports 248 languages through the tree-sitter-language-pack library. Code parsing is available both as a direct library integration for agents and via MCP, so agents can work with code repositories, review pull requests, index codebases, and analyze source files.
Kreuzberg extracts the following at the AST level:
- Functions
- Classes
- Imports
- Exports
- Symbols
- Docstrings
Code chunking respects scope boundaries, so a chunk does not split a function or class partway through.
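To make AST-level extraction and scope-aware chunking concrete, here is a minimal sketch for Python source only. It uses Python's standard `ast` module as a stand-in, not Kreuzberg's API or tree-sitter; the function names `extract_structure` and `chunk_by_scope` and the `SAMPLE` input are hypothetical and exist only for illustration.

```python
from __future__ import annotations

import ast

# Hypothetical sample input, used only for this illustration.
SAMPLE = '''import os
from pathlib import Path

class Greeter:
    """Says hello."""
    def greet(self, name):
        return f"hello {name}"

def main():
    """Entry point."""
    print(Greeter().greet("world"))
'''


def extract_structure(source: str) -> dict:
    """Collect functions, classes, and imports (with docstrings) from Python source."""
    tree = ast.parse(source)
    functions, classes, imports = [], [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append({"name": node.name, "docstring": ast.get_docstring(node)})
        elif isinstance(node, ast.ClassDef):
            classes.append({"name": node.name, "docstring": ast.get_docstring(node)})
        elif isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.append(node.module or "")
    return {"functions": functions, "classes": classes, "imports": imports}


def chunk_by_scope(source: str) -> list[str]:
    """Return one chunk per top-level function or class, never cutting inside a scope."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno
            # Decorators sit above the def/class line; keep them in the same chunk.
            if node.decorator_list:
                start = min(d.lineno for d in node.decorator_list)
            chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks


if __name__ == "__main__":
    print(extract_structure(SAMPLE))
    for chunk in chunk_by_scope(SAMPLE):
        print("--- chunk ---")
        print(chunk)
```

The same idea carries over to other languages through a tree-sitter grammar: chunk boundaries are taken from syntax nodes rather than from fixed character or line counts.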
Markdown Quality Improvements
Poor document extraction causes problems further down the pipeline. The team built a benchmark harness that scores Structural F1 (SF1) and Text F1 across more than 350 documents and 23 formats, then optimized against it.
Specific improvements:
- LaTeX: 0% to 100% SF1
- XLSX: 30% to 100% SF1
- PDF tables: 15.5% to 53.7% SF1
All 23 formats now score above 80% SF1. The output that pipelines receive is now structurally correct by default.
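The announcement does not define how Structural F1 and Text F1 are computed, so the following is only a sketch of one plausible reading: an F1 score over the multiset overlap between predicted and expected items. Treating structure as a list of block labels and text as whitespace-separated tokens is an assumption, and the helper names `f1_score`, `structural_f1`, and `text_f1` are hypothetical.

```python
from collections import Counter


def f1_score(predicted, expected) -> float:
    """F1 over the multiset overlap of two item sequences."""
    pred, gold = Counter(predicted), Counter(expected)
    if not pred and not gold:
        return 1.0  # nothing expected and nothing produced: treated as a perfect match
    overlap = sum((pred & gold).values())  # per-item minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


def structural_f1(predicted_blocks, expected_blocks) -> float:
    """Assumed reading of SF1: compare block labels such as 'heading' or 'table'."""
    return f1_score(predicted_blocks, expected_blocks)


def text_f1(predicted_text: str, expected_text: str) -> float:
    """Assumed reading of Text F1: compare whitespace-separated tokens."""
    return f1_score(predicted_text.split(), expected_text.split())


if __name__ == "__main__":
    expected = ["heading", "paragraph", "table", "list"]
    predicted = ["heading", "paragraph", "paragraph"]
    print(structural_f1(predicted, expected))
    print(text_f1("the quick brown fox", "the quick red fox"))
```

Under this reading, an extractor that drops every table or heading loses recall even when the text itself is intact, which is why a structural score can sit near 0% while a text score stays high.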
Other Key Features
- New markdown rendering layer and new HTML output support
- OpenWebUI integration as a document extraction backend
- Options for docling-serve compatibility or direct connection
- Unified architecture where every extractor creates a standard typed document representation
- TOON wire format: a compact document encoding that reduces LLM prompt token usage by 30% to 50%
- Semantic chunk labeling
- JSON output
- Strict configuration validation
- Improved security
Availability
Kreuzberg is available on GitHub: https://github.com/kreuzberg-dev/kreuzberg
Kreuzberg Cloud, a hosted version for teams that want the same extraction quality without managing infrastructure, will be out soon. More information: https://kreuzberg.dev
Contributions are welcome.
📖 Read the full source: r/LocalLLaMA