Cull: Open-Source AI Image Dataset Curation Engine

Cull is a machine curation engine for AI image datasets, built and maintained by u/Compunerd3. It automates the whole pipeline (scraping, classifying, captioning, and sorting) and outputs a folder of triaged images with SD prompts, ready for LoRA or finetune training.

End-to-End Pipeline

Scraping: Supports Civitai (.com and .red), X/Twitter, Reddit, Discord, and any URL that gallery-dl supports, including Pixiv, DeviantArt, the booru family, ArtStation, Tumblr, FurAffinity/e621, Imgur, Flickr, and ~340 others.
Queue: Each image and its source-side prompt are dropped into a local queue. Deduplication is per source and uses no database.
Classification: Uses a vision-language model served by multiple LM Studio instances (local), Groq (cloud), or any other OpenAI-compatible endpoint. A strict 17-field JSON schema enforces structured output.
Sorting: Keepers go into category folders, each with a .txt prompt and a .vision.json audit record. Two score gates (quality and topic relevance) are tunable in the UI.
Dashboard: Flask + Alpine.js UI with start/stop, source toggles, gallery, prompt editor, ZIP export, and per-source stats.
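
The sorting step above can be sketched roughly as follows. This is a minimal illustration, not Cull's actual code: the verdict keys `quality_score`, `topic_score`, and `category`, the threshold values, and the `sort_image` helper are all assumptions made for the example. Only the `.txt` prompt and `.vision.json` audit sidecars come from the description above.

```python
import json
import shutil
from pathlib import Path
from typing import Optional

# Hypothetical defaults; in Cull the two gates are tuned in the UI.
MIN_QUALITY = 6.0
MIN_TOPIC = 7.0


def sort_image(image: Path, prompt: str, verdict: dict, out_root: Path,
               min_quality: float = MIN_QUALITY,
               min_topic: float = MIN_TOPIC) -> Optional[Path]:
    """Copy a keeper into its category folder with sidecars, else return None.

    `verdict` stands in for the vision model's JSON output. The keys read
    here are assumed names, not Cull's real 17-field schema.
    """
    # Both gates must pass for the image to be kept.
    if verdict["quality_score"] < min_quality:
        return None
    if verdict["topic_score"] < min_topic:
        return None

    dest_dir = out_root / verdict["category"]
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / image.name
    shutil.copy2(image, dest)

    # Sidecars: photo.jpg -> photo.txt and photo.vision.json
    dest.with_suffix(".txt").write_text(prompt, encoding="utf-8")
    dest.with_suffix(".vision.json").write_text(
        json.dumps(verdict, indent=2), encoding="utf-8")
    return dest
```

Keeping the full verdict next to each image means a later change to either threshold can be re-applied from the audit records without re-running the vision model.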

Use Cases

The author used Cull for a 300-image LoRA and a 100,000-image finetune dataset. Set a topic (e.g., "Female Influencer" or {artist} style art), toggle AUTO_CAPTION_ENABLED, and walk away. For prompt-less archives, point LOCAL_IMPORT_DIR at a folder of JPEGs, turn off the prompt requirement, and turn on auto-captioning; each image then gets an SD prompt, booru tags, or a natural-language caption.
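
The prompt-less import case might be modeled roughly as below. This is a sketch under assumptions: the text names `LOCAL_IMPORT_DIR` and `AUTO_CAPTION_ENABLED` but does not say how they are set, so passing them in as a plain mapping (for example `os.environ`), the hypothetical `REQUIRE_SOURCE_PROMPT` flag, and the `plan_local_import` helper are illustrative only.

```python
from pathlib import Path
from typing import List, Mapping

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def flag(settings: Mapping[str, str], name: str, default: bool) -> bool:
    """Interpret a string setting as an on/off switch."""
    raw = settings.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def plan_local_import(settings: Mapping[str, str]) -> List[dict]:
    """List the JPEGs under LOCAL_IMPORT_DIR that would be queued."""
    # REQUIRE_SOURCE_PROMPT is a made-up name for the "prompt requirement".
    if flag(settings, "REQUIRE_SOURCE_PROMPT", True):
        # Local files carry no source-side prompt, so none would qualify.
        return []

    auto_caption = flag(settings, "AUTO_CAPTION_ENABLED", False)
    jobs = []
    for path in sorted(Path(settings["LOCAL_IMPORT_DIR"]).rglob("*")):
        if path.is_file() and path.suffix.lower() in JPEG_SUFFIXES:
            jobs.append({"image": path, "prompt": None,
                         "needs_caption": auto_caption})
    return jobs
```

With the prompt requirement left on, the plan is empty, which is why the text says to turn it off before importing an archive that has no prompts.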

Technical Details

Pluggable vision worker: Subclass BaseVisionWorker and register it. Two LM Studio endpoints run in parallel; a keepalive worker pings every 15s to avoid idle-unload, and an optional idle-unloader frees VRAM.
AI assistant integration: Ships with a Claude Code skill bundle in .claude/skills/ (cull-helper, lmstudio-vision, metadata-schema) and three sub-agents. Works with Claude Code, Cursor, Aider, and Codex.
Self-updater: A toast appears in the dashboard; clicking Update pulls from origin/main and relaunches.
Stack: Python 3.10+, Flask, Alpine.js, Pillow, Playwright (X scraper), gallery-dl. Single machine, no Redis, no DB, no Docker.
License: MIT.
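
The "subclass and register" pattern for vision workers, mentioned under Technical Details, could look something like the sketch below. Only the class name `BaseVisionWorker` comes from the text; its `classify` method, the `WORKERS` registry, the `register` decorator, `make_worker`, and the dummy backend are assumptions made for illustration, and the real interface may differ.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class BaseVisionWorker(ABC):
    """Stand-in for Cull's base class; the method name is assumed."""

    @abstractmethod
    def classify(self, image_path: str) -> dict:
        """Return a structured verdict for one image."""


# Hypothetical registry mapping a backend name to its worker class.
WORKERS: Dict[str, Type[BaseVisionWorker]] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that adds a worker to the registry under `name`."""
    def decorator(cls: type) -> type:
        if not issubclass(cls, BaseVisionWorker):
            raise TypeError(f"{cls.__name__} must subclass BaseVisionWorker")
        WORKERS[name] = cls
        return cls
    return decorator


@register("dummy")
class DummyVisionWorker(BaseVisionWorker):
    """Toy backend that rejects everything; useful for dry runs."""

    def classify(self, image_path: str) -> dict:
        return {"image": image_path, "quality_score": 0.0, "keep": False}


def make_worker(name: str) -> BaseVisionWorker:
    """Instantiate a registered worker by its backend name."""
    return WORKERS[name]()
```

A registry keyed by name lets the backend be chosen from a setting, for example `make_worker("dummy").classify("photo.jpg")`, without the rest of the pipeline knowing which class is behind it.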