Cull: Open-Source Dataset Curation Engine for AI Image Pipelines

Cull is a machine curation engine for AI image datasets, built and maintained by u/Compunerd3. It automates the entire pipeline (scraping, classifying, captioning, and sorting) and outputs a folder of triaged images with SD prompts, ready for LoRA or finetune training.
End-to-End Pipeline
- Scraping: Supports Civitai (.com and .red), X/Twitter, Reddit, Discord, and any URL gallery-dl supports — Pixiv, DeviantArt, booru family, ArtStation, Tumblr, FurAffinity/e621, Imgur, Flickr, and ~340 others.
- Queue: Each image and its source-side prompt are dropped into a local queue. Dedup is per source, with no database.
- Classification: Uses a vision-language model via multiple LM Studio instances (local) or Groq (cloud) — any OpenAI-compatible endpoint. Strict 17-field JSON schema ensures structured output.
- Sorting: Keepers go into category folders with a .txt prompt and a .vision.json audit record. Two score gates (quality + topic relevance) tunable in the UI.
- Dashboard: Flask + Alpine.js UI with start/stop, source toggles, gallery, prompt editor, ZIP export, and per-source stats.
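
The sorting step above can be sketched roughly as follows. This is a minimal illustration, not Cull's actual code: the field names `quality`, `relevance`, and `category`, the threshold values, and the function names `passes_gates` and `sort_image` are all assumptions, since the source does not list the 17 schema fields or the default gate values.

```python
import json
import shutil
from pathlib import Path
from typing import Optional

# Hypothetical thresholds; in Cull the two gates are tuned from the dashboard.
QUALITY_MIN = 6
RELEVANCE_MIN = 7


def passes_gates(record: dict,
                 quality_min: int = QUALITY_MIN,
                 relevance_min: int = RELEVANCE_MIN) -> bool:
    """Return True only if the record clears both score gates."""
    return (record.get("quality", 0) >= quality_min
            and record.get("relevance", 0) >= relevance_min)


def sort_image(image: Path, prompt: str, record: dict,
               out_root: Path) -> Optional[Path]:
    """Copy a keeper into its category folder with prompt and audit files."""
    if not passes_gates(record):
        return None  # rejected: nothing is written
    dest_dir = out_root / record.get("category", "uncategorized")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / image.name
    shutil.copy2(image, dest)
    # Sidecar files share the image's stem: <stem>.txt and <stem>.vision.json
    (dest_dir / f"{image.stem}.txt").write_text(prompt, encoding="utf-8")
    (dest_dir / f"{image.stem}.vision.json").write_text(
        json.dumps(record, indent=2), encoding="utf-8")
    return dest
```

Keeping the full classifier record next to each image means a rejected or misfiled image can be audited later without re-running the vision model.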
Use Cases
The author used Cull for a 300-image LoRA and a 100,000-image finetune dataset. Set a topic (e.g., "Female Influencer" or {artist} style art), toggle AUTO_CAPTION_ENABLED, and walk away. For prompt-less archives, point LOCAL_IMPORT_DIR at a folder of JPEGs, toggle off the prompt requirement, and turn on auto-captioning; each image then gets an SD prompt, booru tags, or a natural-language caption.
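
To make the local-import case concrete, here is a hedged sketch of how a folder scan might interact with the two toggles. Only the setting names LOCAL_IMPORT_DIR and AUTO_CAPTION_ENABLED come from the post; the function `find_import_jobs`, its parameters, and the convention of a same-named `.txt` sidecar holding an existing prompt are assumptions made for illustration.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def find_import_jobs(import_dir: Path, require_prompt: bool,
                     auto_caption: bool) -> list:
    """List images to process and flag the ones that need a generated caption."""
    jobs = []
    for image in sorted(import_dir.rglob("*")):
        if not image.is_file() or image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Assumption: an existing prompt lives in a same-named .txt sidecar.
        sidecar = image.with_suffix(".txt")
        has_prompt = sidecar.is_file()
        if require_prompt and not has_prompt:
            continue  # prompt requirement is on: skip prompt-less images
        jobs.append({
            "image": image,
            "prompt": sidecar.read_text(encoding="utf-8") if has_prompt else None,
            "needs_caption": auto_caption and not has_prompt,
        })
    return jobs
```

With `require_prompt=False` and `auto_caption=True`, every image in the folder is queued and the prompt-less ones are marked for captioning, which matches the archive workflow described above.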
Technical Details
- Vision worker pluggable: Subclass BaseVisionWorker, register. Two LM Studio endpoints run in parallel; keepalive worker pings every 15s to avoid idle-unload; optional idle-unloader to free VRAM.
- AI assistant integration: Ships with Claude Code skill bundle in .claude/skills/ (cull-helper, lmstudio-vision, metadata-schema) and three sub-agents; works with Claude Code, Cursor, Aider, Codex.
- Self-updater: Toast in dashboard, click Update, pulls from origin/main and relaunches.
- Stack: Python 3.10+, Flask, Alpine.js, Pillow, Playwright (X scraper), gallery-dl. Single machine, no Redis, no DB, no Docker.
- License: MIT.
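
The "subclass and register" pattern for vision workers might look something like the sketch below. The post does not describe the real BaseVisionWorker interface, so the class here is a stand-in that only shares the name: the `classify` method, the `register` decorator, the `WORKERS` registry, and `StubVisionWorker` are all hypothetical.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a backend name to its worker class.
WORKERS = {}


def register(name: str):
    """Class decorator that adds a worker class to the registry."""
    def decorator(cls):
        WORKERS[name] = cls
        return cls
    return decorator


class BaseVisionWorker(ABC):
    """Stand-in base class; the real interface in Cull may differ."""

    @abstractmethod
    def classify(self, image_path: str) -> dict:
        """Return a structured classification record for one image."""


@register("stub")
class StubVisionWorker(BaseVisionWorker):
    """Toy backend that returns fixed scores instead of calling a model."""

    def classify(self, image_path: str) -> dict:
        return {"image": image_path, "quality": 5, "relevance": 5}


def make_worker(name: str) -> BaseVisionWorker:
    """Instantiate a registered backend by name."""
    return WORKERS[name]()
```

A registry like this lets a new backend be added by dropping in one class, without touching the code that picks a worker by name.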
Roadmap
Planned: more vision-worker backends, improved requeue UI, small headless CLI, video scraping and classification.
Repo: https://github.com/tlennon-ie/cull | Screenshots: https://imgur.com/a/kSvsAW9
📖 Read the full source: r/LocalLLaMA