LiteParse: Fast Open-Source Document Parser for AI Agents

LiteParse is an open-source document parser built for fast, local parsing with spatial text extraction and bounding boxes. It needs no cloud services and no GPU, and it processes hundreds of pages in seconds.
Key Features
- Apache 2.0 licensed open-source tool
- Spatial text parsing with bounding boxes for precise text positioning
- No dependency on local or frontier VLMs (Vision Language Models)
- Runs on any machine without GPU requirements
- Supports multiple file formats: PDFs, Office documents, images
- Reported to be more accurate than similar tools such as PyPDF, PyMuPDF, and MarkItDown
- One-line installation as a skill for 40+ AI agents including Claude Code, Cursor, OpenClaw, Windsurf
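To make the bounding-box feature concrete, here is a minimal TypeScript sketch of what spatial output enables: selecting only the text that falls inside a region of a page. The `TextItem` shape, its field names, and the top-left coordinate origin are assumptions made for illustration, not LiteParse's documented schema; check the tool's JSON output for the real field names.

```typescript
// Hypothetical shape for one piece of extracted text with its bounding box.
// Field names and units are assumptions, not LiteParse's actual schema.
interface TextItem {
  text: string;
  x: number; // left edge
  y: number; // top edge (origin assumed at the top-left of the page)
  width: number;
  height: number;
}

interface Region {
  x: number;
  y: number;
  width: number;
  height: number;
}

// True when the item's box lies entirely inside the region.
function isInside(item: TextItem, region: Region): boolean {
  return (
    item.x >= region.x &&
    item.y >= region.y &&
    item.x + item.width <= region.x + region.width &&
    item.y + item.height <= region.y + region.height
  );
}

// Collect the text inside a region in reading order:
// top to bottom first, then left to right.
function textInRegion(items: TextItem[], region: Region): string {
  return items
    .filter((item) => isInside(item, region))
    .sort((a, b) => a.y - b.y || a.x - b.x)
    .map((item) => item.text)
    .join(" ");
}

// Example: pull the header band (top 100 units) from a 612-unit-wide page.
const sampleItems: TextItem[] = [
  { text: "2024", x: 300, y: 40, width: 40, height: 12 },
  { text: "Invoice", x: 50, y: 40, width: 60, height: 12 },
  { text: "Total: $120", x: 50, y: 700, width: 90, height: 12 },
];
const header = textInRegion(sampleItems, { x: 0, y: 0, width: 612, height: 100 });
console.log(header); // Invoice 2024
```

Positional filtering like this is the kind of task that plain text extraction cannot support, since it discards where each word sat on the page.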
Installation Options
CLI Tool Installation:
npm i -g @llamaindex/liteparse
Then use:
lit parse document.pdf
lit screenshot document.pdf
For macOS and Linux via Homebrew:
brew tap run-llama/liteparse
brew install llamaindex-liteparse
Agent Skill Installation:
npx skills add run-llama/llamaparse-agent-skills --skill liteparse
Usage Examples
Basic parsing:
lit parse document.pdf
lit parse document.pdf --format json -o output.json
lit parse document.pdf --target-pages "1-5,10,15-20"
lit parse document.pdf --no-ocr
Batch parsing:
lit batch-parse ./input-directory ./output-directory
Screenshot generation (useful for LLM agents):
lit screenshot document.pdf -o ./screenshots
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
lit screenshot document.pdf --dpi 300 -o ./screenshots
lit screenshot document.pdf --target-pages "1-10" -o ./screenshots
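The `--target-pages` values in the commands above mix single pages and ranges separated by commas. The sketch below expands such a string into an explicit page list, which can help when an agent needs to know exactly which pages a command will touch. It illustrates the syntax only: `expandPageSpec` is a hypothetical helper, it assumes 1-based pages and inclusive ranges, and LiteParse's own parsing and validation rules may differ.

```typescript
// Hypothetical helper: expand a page spec such as "1-5,10,15-20" into page numbers.
// Assumes 1-based pages and inclusive ranges; LiteParse's own rules may differ.
function expandPageSpec(spec: string): number[] {
  const pages = new Set<number>();
  for (const rawPart of spec.split(",")) {
    const part = rawPart.trim();
    // Accept either a single page ("10") or a range ("15-20").
    const match = /^(\d+)(?:-(\d+))?$/.exec(part);
    if (!match) {
      throw new Error(`Invalid page spec segment: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  // Return pages sorted ascending with duplicates removed.
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(expandPageSpec("1-5,10,15-20").join(","));
// 1,2,3,4,5,10,15,16,17,18,19,20
```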
Library Usage
Install as a dependency:
npm install @llamaindex/liteparse
# or
pnpm add @llamaindex/liteparse
Basic usage:
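import { LiteParse } from '@llamaindex/liteparse';
// Enable OCR so text in scanned pages and embedded images is extracted too
const parser = new LiteParse({ ocrEnabled: true });
// parse() takes a file path and resolves to the parsed result
const result = await parser.parse('document.pdf');
console.log(result.text);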
Buffer/Uint8Array input (the parser itself does no disk I/O; readFile below only supplies the bytes):
import { LiteParse } from '@llamaindex/liteparse';
import { readFile } from 'fs/promises';
const parser = new LiteParse();
const pdfBytes = await readFile('document.pdf');
const result = await parser.parse(pdfBytes);
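When bytes arrive from memory (an upload, a queue, an HTTP response), there is no file extension to indicate what they contain. As a minimal sketch, a pipeline that expects only PDFs could check for the `%PDF-` signature that PDF files start with before calling the parser. The `BytesParser` interface and `parsePdfBytes` wrapper are hypothetical names introduced here; the interface only mirrors the `parser.parse(pdfBytes)` and `result.text` usage shown above and is not an official LiteParse type.

```typescript
// Minimal structural type for anything with a parse(bytes) method, mirroring how
// parser.parse(pdfBytes) and result.text are used above. Not an official type.
interface BytesParser {
  parse(input: Uint8Array): Promise<{ text: string }>;
}

// PDF files start with the ASCII signature "%PDF-".
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d];

// True when the first five bytes match the PDF signature.
function looksLikePdf(bytes: Uint8Array): boolean {
  return PDF_SIGNATURE.every((byte, i) => bytes[i] === byte);
}

// Hypothetical wrapper for a PDF-only pipeline: reject other inputs
// before handing the bytes to the parser.
async function parsePdfBytes(parser: BytesParser, bytes: Uint8Array): Promise<string> {
  if (!looksLikePdf(bytes)) {
    throw new Error("Input does not start with the %PDF- signature");
  }
  const result = await parser.parse(bytes);
  return result.text;
}

console.log(looksLikePdf(new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]))); // true
```

Since LiteParse also accepts Office documents and images, a guard like this only makes sense where non-PDF input would indicate an upstream bug.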
Technical Details
- Flexible OCR system with built-in Tesseract.js (zero setup)
- Supports HTTP servers for OCR (EasyOCR, PaddleOCR, custom)
- Standard OCR API specification
- Multiple output formats: JSON and plain text
- Standalone binary with no cloud dependencies
- Multi-platform support: Linux, macOS (Intel/ARM), Windows
For complex documents with dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs, the creators recommend LlamaParse, their cloud-based document parser built for production document pipelines.
📖 Read the full source: HN AI Agents