MiniMax M2.7 Q8_0 128K on 2x3090: Benchmarks & Config

In a recent r/LocalLLaMA post, a user shares their experience pushing the MiniMax M2.7 model (at Q8_0 quantization) to 128K context on a 2x3090 setup with 256GB DDR4 and a secondhand 10900X CPU. The key challenge: running a large MoE model with unquantized KV cache on relatively low-end hardware for its class.

Performance Numbers

The user reports:

Prompt processing: ~50 tokens per second
Token generation: ~10 tokens per second
Described as “very slow but usable for coding agent workflows”

Configuration

They use ik-llama-cuda (a llama.cpp fork) with the following flags (from their NixOS config):

${ik-llama-cuda}/bin/llama-server \
  -m ${modelPath} \
  --host 0.0.0.0 \
  --port ${toString cfg.port} \
  -c ${toString cfg.contextLength} \
  -ngl 999 \
  --cpu-moe \
  -sm graph \
  -fa on \
  -t 16 \
  -tb 16 \
  -b 4096 \
  -ub 4096 \
  -np 1 \
  -muge \
  -ger \
  --jinja \
  --metrics \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 40 \
  --min-p 0.01

Notable flags:

--cpu-moe – offloads MoE expert computation to CPU
-sm graph – enables graph-based scheduling
-fa on – flash attention
-t 16 / -tb 16 – 16 threads for compute and batch respectively
-b 4096 / -ub 4096 – batch and ubatch size
-muge – memory-usage-guided expert loading (probably)
-ger – GPU expert routing

Context & Motivation

The user reports Q8_0 was chosen to mitigate “weird behavior” seen at lower quants. They note that the model’s draft model for speculative decoding was not released for M2.7, which could have improved speed. They are primarily interested in accuracy over speed, as long as generation doesn’t take “literally all day.”

Takeaway for Developers

This is a practical datapoint for anyone running large MoE models on multi-GPU setups with system RAM. The --cpu-moe approach allows scaling context far beyond VRAM limits, albeit at reduced speed. For coding agent workflows where latency is less critical, this tradeoff may be acceptable.

📖 Read the full source: r/LocalLLaMA