Lightfeed Extractor: TypeScript Library for LLM Web Data Extraction

Lightfeed Extractor is a TypeScript library built for robust web data extraction using LLMs and Playwright browser automation. It addresses common pain points in web scraping pipelines where traditional CSS selectors break when sites change layout, and raw LLM approaches struggle with HTML noise, malformed JSON output, and URL issues.

Key Features

HTML to LLM-ready markdown conversion: Extracts main content while stripping navigation bars, headers, footers, and tracking junk. Includes optional image inclusion and URL cleaning.
LLM extraction with Zod schemas: Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama) and uses Zod schemas for type-safe extraction with real validation.
JSON recovery: Sanitizes and recovers partial data from malformed LLM output instead of failing entirely. If 19 out of 20 products parse correctly, you get those 19.
Built-in browser automation: Uses Playwright with support for local, serverless, or remote browsers. Includes anti-bot patches for reliable web scraping.
AI browser navigation integration: Pairs with @lightfeed/browser-agent for AI-driven page navigation before extraction.
URL handling: Manages relative URLs, removes invalid ones, repairs markdown-escaped links, and cleans tracking parameters.

Installation and Usage

Install via npm:

npm install @lightfeed/extractor

Then install your preferred LLM provider:

# OpenAI
npm install @langchain/openai
# Google Gemini
npm install @langchain/google-genai
# Anthropic
npm install @langchain/anthropic
# Ollama (local models)
npm install @langchain/ollama

Example usage for e-commerce product extraction:

import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat, Browser } from "@lightfeed/extractor";
import { z } from "zod";

// Define schema for product catalog extraction
const productCatalogSchema = z.object({
  products: z.array(
    z.object({
      name: z.string().describe("Product name or title"),
      brand: z.string().optional().describe("Brand name"),
      price: z.number().describe("Current price"),
      originalPrice: z.number().optional().describe("Original price if on sale"),
      rating: z.number().optional().describe("Product rating out of 5"),
      reviewCount: z.number().optional().describe("Number of reviews"),
      productUrl: z.string().url().describe("Link to product detail page"),
      imageUrl: z.string().url().optional().describe("Product image URL")
    })
  ).describe("List of bread and bakery products")
});
// Create browser instance
const browser = new Browser({
  type: "local", // also supporting serverless and remote browser
  headless: false
});

The library is Apache 2.0 licensed and used in production at Lightfeed for data pipelines that scrape websites and extract structured data. It's designed for developers building web scraping workflows who want to avoid writing repetitive boilerplate for HTML cleanup, markdown conversion, LLM calls, JSON parsing, error recovery, and schema validation.

📖 Read the full source: HN LLM Tools

Lightfeed Extractor: TypeScript Library for Robust Web Data Extraction with LLMs

Key Features

Installation and Usage

👀 See Also

Open-Sourced CLAUDE.md Keeps Claude Code Agents Productive for Hours, Not Looping

Claude AI Session Compaction Issues and Workarounds

Handoffs Pattern in Claude Workflows: Two-File Split vs One-Doc Summary

OpenClaw Implements Agent History Compression to Reduce Context Usage