Meeting Summarization on a 6GB GPU: qwen3.5:0.8B Works at 57s, Granite 4 350M Hallucinates

VoiceFlow is an open-source (MIT) dictation and transcription tool that runs completely locally — the only network call is an optional LLM summary endpoint (Ollama, llama.cpp, Groq, OpenAI). v1.6.0, released today, adds a meeting recorder: mic + system audio mixed into a stereo file, transcribed by faster-whisper, then summarized by any endpoint you configure.
Benchmark: Sub-1B Models on Real Meeting Transcripts
On an RTX 3060 Laptop 6GB (~4.3GB free after Whisper loads) running Ollama 0.23 on Arch Linux, with a real 4-minute meeting transcript (~2900 chars):
- qwen3.5:0.8B (873M, Q8_0): the default num_ctx (4096) got eaten by thinking tokens. Fix (Ollama Modelfile):
    FROM qwen3.5:0.8b
    PARAMETER num_ctx 16384
  After fix: 1562-char structured summary (TL;DR, decisions, action items, open questions) in 57 seconds, using 2.2GB VRAM. Works.
- Granite 4.0 350M: faster (0.6–2.8s per summary), properly structured output, but hallucinated badly. On a transcript about Anthropic acquiring Bun, it returned “Anthropic's acquisition by Anthropic” and invented Binance. On another meeting, it produced a Star Trek bridge log (“Starship Cassiopeia”). Keywords were present but relationships scrambled.
Conclusion: qwen3.5:0.8B is the working floor for local meeting summarization; nothing sub-500M has produced coherent output on real conversational data yet.
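The larger context window can also be requested per call instead of being baked into a Modelfile. The sketch below is one possible way to do that against Ollama's /api/generate endpoint, using only the Python standard library. The prompt wording, the helper names (build_summary_request, summarize) and the 300-second timeout are illustrative assumptions, not VoiceFlow's actual code.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_summary_request(transcript: str, model: str = "qwen3.5:0.8b",
                          num_ctx: int = 16384) -> dict:
    """Build a non-streaming request body with an enlarged context window."""
    # Hypothetical prompt: the section names mirror the summary structure
    # reported in the benchmark, not a prompt taken from VoiceFlow.
    prompt = (
        "Summarize this meeting transcript with the sections: "
        "TL;DR, Decisions, Action items, Open questions.\n\n" + transcript
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # per-request context size override
    }


def summarize(transcript: str, timeout: float = 300.0) -> str:
    """Send the request to a running Ollama server and return the summary text."""
    body = json.dumps(build_summary_request(transcript)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling summarize(open("meeting.txt").read()) would need a running Ollama server with the model already pulled. A larger num_ctx increases VRAM use, so on a 6GB card the value has to fit in what is left after Whisper loads.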
Free Cloud Option: Groq's llama-3.3-70B
Groq's free tier running llama-3.3-70B returns summaries in ~2 seconds, and the output is “tighter” than the local 0.8B model's. The only failure was a 4-hour transcript that exceeded Groq's context window. For most meeting lengths, it's a solid free alternative.
The Open Question: Long-Context Summarization on Low VRAM
The author asks the community: for 1-2 hour transcripts (~30K–60K tokens) on a 6-8GB GPU, what works? Options: wider context (eating VRAM), chunked map-reduce, or a different small model that holds structure on long inputs — without needing 24GB.
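Of the options listed, chunked map-reduce is the easiest to sketch without knowing the model. The version below is a minimal illustration, not a tested recommendation: it splits the transcript into overlapping character windows, summarizes each one, then summarizes the joined partial summaries. The function names, the 8000-character window and the 400-character overlap are assumptions chosen for the example, and summarize stands for whatever callable sends text to your endpoint.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 8000,
                      overlap: int = 400) -> List[str]:
    """Split text into windows of at most max_chars that overlap by `overlap`."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += max_chars - overlap  # step forward, keeping some shared context
    return chunks


def map_reduce_summary(text: str, summarize: Callable[[str], str],
                       max_chars: int = 8000, overlap: int = 400) -> str:
    """Summarize each chunk (map), then summarize the joined partials (reduce)."""
    chunks = split_into_chunks(text, max_chars, overlap)
    if len(chunks) <= 1:
        return summarize(text)  # short input: a single pass is enough
    partials = [summarize(chunk) for chunk in chunks]
    # Single reduce pass; for very long meetings the joined partials could
    # themselves exceed the context window and need another round.
    return summarize("\n\n".join(partials))
```

Splitting on character counts can cut a speaker turn in half, which is why the windows overlap. Splitting on transcript segment boundaries instead would likely hold structure better, at the cost of a little more bookkeeping.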
VoiceFlow ships as a single .exe (Windows) or .AppImage (Linux), built with Pyloid + React + faster-whisper + SQLite. CUDA auto-detect with CPU fallback. Onboarding (model, mic, hotkey) takes ~1 minute.
📖 Read the full source: r/LocalLLaMA