Running MiniMax M2.7 Q8_0 128K on 2x3090 with CPU Offloading – Real-World Benchmarks and Config

In a recent r/LocalLLaMA post, a user shares their experience pushing the MiniMax M2.7 model (at Q8_0 quantization) to 128K context on a 2x3090 setup with 256GB DDR4 and a secondhand 10900X CPU. The key challenge: running a large MoE model with unquantized KV cache on relatively low-end hardware for its class.
Performance Numbers
The user reports:
- Prompt processing: ~50 tokens per second
- Token generation: ~10 tokens per second
- Described as “very slow but usable for coding agent workflows”
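To put these rates in perspective, here is a rough back-of-the-envelope sketch. It assumes "128K" means 131,072 tokens, that the reported rates stay constant across the whole context (in practice they usually fall as context grows), and that no prompt caching helps; the function name is illustrative.

```python
# Rough wall-clock estimate from the reported throughput figures.
PROMPT_TPS = 50.0  # reported prompt processing speed (approximate)
GEN_TPS = 10.0     # reported generation speed (approximate)


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Time to ingest a prompt and generate a reply, assuming constant rates."""
    return prompt_tokens / PROMPT_TPS + output_tokens / GEN_TPS


if __name__ == "__main__":
    full_context = 128 * 1024  # assumption: "128K" = 131,072 tokens
    for prompt, output in [(8_000, 500), (32_000, 500), (full_context, 500)]:
        minutes = estimate_seconds(prompt, output) / 60
        print(f"{prompt:>7} prompt + {output} output tokens: ~{minutes:.1f} min")
```

Under these assumptions, a cold 8K-token prompt costs a few minutes and a cold full-context prompt roughly three quarters of an hour, so reusing an already-processed prompt prefix matters a great deal for agent workflows on this hardware.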
Configuration
They use ik-llama-cuda (a llama.cpp fork) with the following flags (from their NixOS config):
${ik-llama-cuda}/bin/llama-server \
-m ${modelPath} \
--host 0.0.0.0 \
--port ${toString cfg.port} \
-c ${toString cfg.contextLength} \
-ngl 999 \
--cpu-moe \
-sm graph \
-fa on \
-t 16 \
-tb 16 \
-b 4096 \
-ub 4096 \
-np 1 \
-muge \
-ger \
--jinja \
--metrics \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
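--min-p 0.01
Notable flags:
- --cpu-moe: keeps the MoE expert weights in system RAM, so expert computation runs on the CPU
- -sm graph: sets the multi-GPU split mode (-sm) to "graph"
- -fa on: enables flash attention
- -t 16 / -tb 16: 16 threads each for generation and for batch (prompt) processing
- -b 4096 / -ub 4096: logical batch size and physical micro-batch (ubatch) size
- -muge: probably short for --merge-up-gate-experts rather than anything to do with memory usage; confirm with llama-server --help on your build
- -ger: probably short for --grouped-expert-routing; confirm the same way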
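Because the config passes --metrics, the server should expose a Prometheus-style text endpoint (in upstream llama.cpp this is /metrics) that can be used to check throughput figures like the ones above. The sketch below parses that text format. The metric names and the sample values are assumptions for illustration, not output captured from this setup, and the parser ignores optional trailing timestamps.

```python
# Minimal parser for the Prometheus text exposition format.
def parse_metrics(text: str) -> dict[str, float]:
    """Map metric name -> value, skipping comments and blank lines."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if not name:
            continue  # a lone token carries no metric name
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


# Made-up sample for illustration only; not real server output.
SAMPLE = """\
# HELP llamacpp:prompt_tokens_seconds Average prompt throughput in tokens/s.
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 48.7
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 9.8
"""

if __name__ == "__main__":
    parsed = parse_metrics(SAMPLE)
    print("prompt tok/s:", parsed["llamacpp:prompt_tokens_seconds"])
    print("generation tok/s:", parsed["llamacpp:predicted_tokens_seconds"])
```

Against a live server, the same function could be fed the body of an HTTP GET to the metrics endpoint on the configured host and port, for example via urllib.request.urlopen.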
Context & Motivation
The user chose Q8_0 to avoid the “weird behavior” they saw at lower quants. They note that no draft model for speculative decoding was released for M2.7; one would likely have improved generation speed. Their priority is accuracy over speed, as long as generation doesn’t take “literally all day.”
Takeaway for Developers
This is a practical datapoint for anyone running large MoE models on multi-GPU setups backed by plenty of system RAM. With --cpu-moe, the expert weights stay in system RAM, which leaves VRAM for the remaining layers and the KV cache and makes a 128K context feasible on 48GB of total VRAM, at the cost of slower generation. For coding agent workflows where latency is less critical, this tradeoff may be acceptable.
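The VRAM pressure from a long, unquantized KV cache can be estimated with the usual formula for grouped-query attention. The sketch below is generic: the layer count, KV head count, and head dimension in the example are hypothetical placeholders, not MiniMax M2.7's actual architecture, and "unquantized" is assumed to mean 2 bytes (f16) per element.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for a standard attention stack.

    Each layer stores one K and one V vector of n_kv_heads * head_dim
    elements per token, hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical architecture values, chosen only to illustrate the formula.
    size = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                          n_ctx=128 * 1024)
    print(f"KV cache at 128K context: {size / 2**30:.1f} GiB")
```

With those placeholder values the cache alone comes to 30 GiB, which shows why the expert weights have to live outside VRAM for a context of this length to fit on two 24GB cards.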
📖 Read the full source: r/LocalLLaMA