Fine-Tuning Qwen 14B for Discord Autocomplete

A developer shared their experience on how they fine-tuned the Qwen 14B model to function as an autocomplete tool using their Discord messages. This setup closely resembles tools like GitHub Copilot, where suggestions are made as you type.
The developer used approximately 250 conversations sourced from Discord, obtained through a scraping tool, as their dataset. Each conversation was formatted as chat-ml training samples, particularly focusing on messages where the user said something last, without code blocks or links. This choice indicates a focus on conversational tone rather than technical content.
The Qwen 14B model was fine-tuned using the unsloth.ai platform and QLoRA on a Kaggle GPU, with the entire training process lasting roughly 15 minutes due to the small dataset size. They then merged the fine-tuned model into a .gguf format for local use via ollama.com.
The frontend of this autocomplete tool is implemented as a Chrome extension. It captures the last few messages and the user's ongoing input to build a chat-ml prompt with the appropriate context, which is then used to generate a completion from the Ollama-provided model. A zero-width Unicode character is cleverly used to indicate where the suggestion begins, while pressing shift+tab will accept the suggestion.
The current setup is operational on Discord, with potential future expansions to support other sites. The developer also suggests experimenting with different model sizes, as the current 14B model nearly maximally uses the available memory. They propose that 4B or 8B models might be viable alternatives, albeit with potential data limitations.
Source code and further details are available on the developer's GitHub at github.com/b44ken/finetune.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Yozora-fm: Interactive Anime Music Galaxy Visualization
Yozora-fm is an interactive visualization where each star represents an anime opening or ending song, with over 9,000 tracks mapped by genre and era. Users can click stars to play videos or explore the galaxy interface.

TideSurf: DOM compression tool reduces web agent token usage 30x, speeds TTFT 12x
TideSurf v0.3 converts rendered DOM to markdown-like compressed format, reducing token consumption by 32x on GitHub pages versus raw DOM while adding 18 interactive tools for LLM agents.

FixAI: Browser Game Teaches Consumer Law by Fighting Corporate AI Bots
FixAI is a browser game with 36 levels where players argue against corporate or government AI systems using real consumer laws. Built with Vanilla JS, Node/Express, and Claude Haiku, it features a resistance scoring system and educational explanations of legal arguments.

Running NemoClaw with Local vLLM: Setup Notes and Agent Engineering Observations
A developer documented running NVIDIA's NemoClaw sandboxed AI agent platform with a local Nemotron 9B v2 model via vLLM on WSL2. Key findings include inference routing details, parser compatibility issues, and observations about the agent engineering gap.