Lemonade by AMD: Open Source Local LLM Server for GPU and NPU

What Lemonade Is
Lemonade is a local AI server built by AMD and the local AI community that runs text, image, and speech models on GPUs and NPUs. It's open source, designed to be private, and claims to be ready in minutes on any PC.
Key Features and Specifications
- Native C++ Backend: Lightweight service that is only 2MB
- One Minute Install: Simple installer that sets up the stack automatically
- OpenAI API Compatible: Works with hundreds of apps out-of-box and integrates in minutes
- Auto-configures for your hardware: Configures dependencies for your GPU and NPU
- Multi-engine compatibility: Works with llama.cpp, Ryzen AI SW, FastFlowLM, and more
- Multiple Models at Once: Run more than one model at the same time
- Cross-platform: A consistent experience across Windows, Linux, and macOS (beta)
- Built-in app: A GUI that lets you download, try, and switch models quickly
- Unified API: One local service for every modality including chat, vision, image generation, transcription, and speech generation
Model Support and Performance
The server can load models like gpt-oss-120b or Qwen-Coder-Next for advanced tool use. For tuning, you can use --no-mmap to speed up load times and increase context size to 64 or more. The source mentions that with 128 GB of unified RAM, you can load larger models.
Ecosystem Integration
Lemonade is integrated in many apps and works out-of-box with hundreds more thanks to the OpenAI API standard. Mentioned integrations include Open WebUI, n8n, Gaia Infinity, Arcade, GitHub Copilot, OpenHands, Dify, Deep Tutor, and Iterate.ai.
Community and Development
The project has 2.1k stars on GitHub and an active Discord community with 117 online at the time of the source. It's described as being built by the local AI community for every PC, with the philosophy that local AI should be free, open, fast, and private.
📖 Read the full source: HN LLM Tools
👀 See Also

Codeset improves coding agents with repo-specific context from git history
Codeset generates static files from git history that provide context like past bugs, root causes, and co-change relationships. Testing showed 5.3pp improvement on codeset-gym-python and 2pp on SWE-Bench Pro with OpenAI Codex.

How Mendral Cut LLM Costs by Upgrading to Opus: Triager Pattern, SQL Access, and Sub-Agent Architecture
Mendral switched from Sonnet to Opus 4.6 for CI failure analysis but reduced costs by using a Haiku triager to divert 80% of failures, giving agents SQL access to ClickHouse instead of pushing logs, and spawning cheap sub-agents to do the actual digging.

KV Cache Reuse for Long Conversations on Apple Silicon Delivers 200x Speedup
A developer implemented session-based KV cache reuse for local LLM inference using Apple's MLX framework, achieving a 200x improvement in time-to-first-token at 100K context length. The approach keeps the KV cache in memory across conversation turns, processing only new tokens.

Multi-LLM Paper-Trading Bot with Claude Opus as Lead Engineer and Gemini as Strategist: Architecture Breakdown
A solo builder shares a 4,900-LOC paper-trading bot on Alpaca where Claude Opus 4 (Engineer) has veto power over Gemini Pro (Strategist), with a 270+ entry disagreement log called the Strategist Codex.