Lightfeed Extractor: TypeScript Library for Robust Web Data Extraction with LLMs

Lightfeed Extractor is a TypeScript library for robust web data extraction using LLMs and Playwright browser automation. It addresses common pain points in web scraping pipelines: traditional CSS selectors break when sites change layout, and raw LLM approaches struggle with noisy HTML, malformed JSON output, and relative, invalid, or tracking-laden URLs.
Key Features
- HTML to LLM-ready markdown conversion: Extracts main content while stripping navigation bars, headers, footers, and tracking junk. Can optionally keep images and clean URLs.
- LLM extraction with Zod schemas: Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama) and uses Zod schemas for type-safe extraction with runtime validation of the output.
- JSON recovery: Sanitizes and recovers partial data from malformed LLM output instead of failing entirely. If 19 out of 20 products parse correctly, you get those 19.
- Built-in browser automation: Uses Playwright with support for local, serverless, or remote browsers. Includes anti-bot patches for reliable web scraping.
- AI browser navigation integration: Pairs with @lightfeed/browser-agent for AI-driven page navigation before extraction.
- URL handling: Manages relative URLs, removes invalid ones, repairs markdown-escaped links, and cleans tracking parameters.
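The URL handling bullet above can be illustrated with a minimal sketch built only on the standard `URL` API. This is not the library's implementation: `cleanUrl`, the `TRACKING_PARAMS` list, and the set of markdown-escaped characters are all assumptions made for illustration.

```typescript
// Hypothetical helper, not part of @lightfeed/extractor.
// Assumed list of tracking parameters; "utm_*" keys are handled separately.
const TRACKING_PARAMS = ["fbclid", "gclid", "msclkid"];

function cleanUrl(raw: string, sourceUrl: string): string | null {
  // Undo markdown escaping such as "\_" or "\&" inside a link.
  const unescaped = raw.trim().replace(/\\([_*&=~#-])/g, "$1");

  let url: URL;
  try {
    // Resolve relative paths against the page the content came from.
    url = new URL(unescaped, sourceUrl);
  } catch {
    return null; // unparseable input is dropped
  }

  // Keep only web links; this also rejects "javascript:" and "mailto:".
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;

  // Collect the keys first, then delete, to avoid mutating during iteration.
  const toDelete: string[] = [];
  url.searchParams.forEach((_value, key) => {
    if (key.startsWith("utm_") || TRACKING_PARAMS.includes(key)) {
      toDelete.push(key);
    }
  });
  toDelete.forEach((key) => url.searchParams.delete(key));

  return url.toString();
}
```

Called with a relative, markdown-escaped link such as `/products/rye\_bread?utm_source=news&id=7` and the page URL `https://example.com/bakery/`, this sketch resolves the path against the page, restores the underscore, and drops `utm_source` while keeping `id`.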
Installation and Usage
Install via npm:
npm install @lightfeed/extractor
Then install your preferred LLM provider:
# OpenAI
npm install @langchain/openai
# Google Gemini
npm install @langchain/google-genai
# Anthropic
npm install @langchain/anthropic
# Ollama (local models)
npm install @langchain/ollama
Example usage for e-commerce product extraction:
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { extract, ContentFormat, Browser } from "@lightfeed/extractor";
import { z } from "zod";

// Define schema for product catalog extraction
const productCatalogSchema = z.object({
  products: z.array(
    z.object({
      name: z.string().describe("Product name or title"),
      brand: z.string().optional().describe("Brand name"),
      price: z.number().describe("Current price"),
      originalPrice: z.number().optional().describe("Original price if on sale"),
      rating: z.number().optional().describe("Product rating out of 5"),
      reviewCount: z.number().optional().describe("Number of reviews"),
      productUrl: z.string().url().describe("Link to product detail page"),
      imageUrl: z.string().url().optional().describe("Product image URL")
    })
  ).describe("List of bread and bakery products")
});

// Create browser instance
const browser = new Browser({
  type: "local", // also supporting serverless and remote browser
  headless: false
});
The library is Apache 2.0 licensed and used in production at Lightfeed for data pipelines that scrape websites and extract structured data. It's designed for developers building web scraping workflows who want to avoid writing repetitive boilerplate for HTML cleanup, markdown conversion, LLM calls, JSON parsing, error recovery, and schema validation.
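To make the error-recovery idea concrete (keeping the 19 valid products when 1 of 20 fails), here is a minimal, dependency-free sketch of per-item validation. It is an illustration under stated assumptions, not the library's code: `recoverProducts`, `isProduct`, and the reduced `Product` shape are hypothetical, and a hand-written type guard stands in for the Zod schema check the library is described as using.

```typescript
// Hypothetical sketch of per-item recovery; not the library's implementation.
interface Product {
  name: string;
  price: number;
}

// Type guard standing in for a schema check on a single item.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.name === "string" && typeof v.price === "number";
}

function recoverProducts(llmOutput: string): { products: Product[]; dropped: number } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(llmOutput);
  } catch {
    // This sketch does not attempt to repair syntactically broken JSON.
    return { products: [], dropped: 0 };
  }

  const items = (parsed as { products?: unknown } | null)?.products;
  if (!Array.isArray(items)) return { products: [], dropped: 0 };

  // Validate each item on its own so one bad entry does not sink the rest.
  const products = items.filter(isProduct);
  return { products, dropped: items.length - products.length };
}
```

The design choice is to validate array items individually instead of validating the whole payload at once, so a single item with a wrong type (for example a price returned as a string) is dropped and counted while the remaining items are kept.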
📖 Read the full source: HN LLM Tools