DocMason: Build a Local Knowledge Base from Office Files

What DocMason Does

DocMason is a local, file-based knowledge base system designed for deep research over private work documents. The core concept is "The repo is the app. Codex is the runtime." It compiles office files into structured evidence bundles that AI agents can reason over while maintaining strict provenance tracking.

Key Features from Source

Handles multiple office document types: PPTX, DOCX, XLSX, PDF, and EML (email) files
Extracts multimodal information including IT architecture diagrams and Excel sheet data
Maintains document structure and visual semantics (slide layouts, presenter notes, spreadsheet references, formatting signals)
Runs locally with no cloud ingestion or hidden backends
Provides incremental knowledge base syncing when files are added or revised
Enforces strict data contracts and provenance boundaries
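
The incremental syncing item above can be pictured as a comparison of content hashes between two scans of the source folder. The following is a minimal sketch of that general technique, not DocMason's actual implementation: the manifest shape and the function names (file_digest, scan, diff_manifests) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def scan(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its content digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, changed, or removed between two scans."""
    return {
        "added": sorted(k for k in new if k not in old),
        "changed": sorted(k for k in new if k in old and old[k] != new[k]),
        "removed": sorted(k for k in old if k not in new),
    }
```

Only the paths listed under "added" and "changed" would need to be recompiled, which is what makes a sync incremental rather than a full rebuild.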

How It Works

DocMason operates as a production-grade runtime that makes AI agents respect the original document structure. Instead of flattening complex files into unstructured text blobs, it writes deterministic, file-based evidence and runs its retrieval algorithms offline on your machine.
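
To make "deterministic file-based evidence" with provenance more concrete, here is a hedged sketch of what a single evidence record could look like. The EvidenceUnit fields and the to_bundle function are hypothetical and are not taken from DocMason's data contracts; the point is only that each extracted piece carries a pointer back to its source, and that serialization is stable.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvidenceUnit:
    """One extracted piece of a document (hypothetical schema)."""
    source_file: str  # path of the original document
    locator: str      # position in the document, e.g. "slide 3" or "Sheet1!B2"
    kind: str         # e.g. "slide_text", "presenter_notes", "cell"
    text: str         # extracted content


def to_bundle(units: list[EvidenceUnit]) -> str:
    """Serialize units in a fixed order so equal input yields identical output."""
    ordered = sorted(units, key=lambda u: (u.source_file, u.locator, u.kind, u.text))
    return json.dumps([asdict(u) for u in ordered], sort_keys=True, indent=2)
```

Because the units are sorted and the JSON keys are ordered, rebuilding from unchanged files produces byte-identical output, which keeps diffs between builds meaningful.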

Getting Started

Two setup paths are described in the source:

Path A (Start Small):

Drop work files into the DocMason/original_doc/ folder
Open the DocMason folder in Codex
Ask questions naturally; DocMason guides you through environment setup
Approve prompts when building the knowledge base

Path B (Stage Entire Folders):

Drop department-level folders into DocMason/original_doc/
Open in Codex and tell it: "Please prepare the DocMason environment."
Then: "Please build the knowledge base."
Once complete, ask complex research questions against the entire corpus

The system is designed so you don't need to memorize internal commands - just speak naturally to your AI agent within a valid workspace.
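
Staging files, the first step of either path, is an ordinary file copy and can be done by hand. For large department folders, a small helper script may be convenient. The sketch below is not part of DocMason: the stage function is hypothetical, and the extension filter is an assumption based on the file types listed under Key Features.

```python
import shutil
from pathlib import Path

# Assumed from the feature list; DocMason may accept other types as well.
SUPPORTED = {".pptx", ".docx", ".xlsx", ".pdf", ".eml"}


def stage(source: Path, original_doc: Path) -> list[Path]:
    """Copy supported files from source into original_doc, keeping subfolders."""
    staged = []
    for path in sorted(source.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            target = original_doc / path.relative_to(source)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also preserves file timestamps
            staged.append(target)
    return staged
```

A call such as stage(Path("finance_dept"), Path("DocMason/original_doc")) would mirror the folder structure under original_doc/; the source folder name here is only an example.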

Technical Details

DocMason addresses specific limitations of existing document AI tools:

Preserves visual layout, presenter notes, and chart-text relationships in slide decks
Maintains multi-sheet references and nested tables in spreadsheets
Retains formatting semantics like red text for "Risk" or indentation for hierarchies
Enables cross-document reasoning for multi-part proposals

The repository structure includes adapters, knowledge_base, runtime, skills, and sample_corpus directories, with configuration managed through docmason.yaml and pyproject.toml files.
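
As a quick sanity check that a folder looks like a DocMason workspace, the names above can be tested for existence. This sketch assumes all of them sit at the top level of the repository, which the source does not state explicitly; missing_entries is a hypothetical helper, not a DocMason command.

```python
from pathlib import Path

# Names taken from the description above; their top-level placement is assumed.
EXPECTED_DIRS = ["adapters", "knowledge_base", "runtime", "skills", "sample_corpus"]
EXPECTED_FILES = ["docmason.yaml", "pyproject.toml"]


def missing_entries(root: Path) -> list[str]:
    """Return the expected directories and files not found under root."""
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

An empty result means every expected entry was found; anything else lists what is absent, with directories marked by a trailing slash.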

📖 Read the full source: HN AI Agents