Steerling-8B: An Interpretable Language Model with Token-Level Attribution

✍️ OpenClawRadar | 📅 Published: February 24, 2026 | 🔗 Source

Model Architecture and Capabilities

Steerling-8B is built on a causal discrete diffusion model backbone that enables steering generation across multi-token sequences rather than only at the next-token level. The key design decomposes the model's embeddings into three explicit pathways: approximately 33,000 supervised "known" concepts, approximately 100,000 "discovered" concepts the model learns on its own, and a residual component that captures remaining information.

The training losses are designed to route predictive signal through the concepts without a fundamental performance tradeoff. Concepts feed into the logits through a linear path, so every prediction decomposes exactly into per-concept contributions. These contributions can be edited at inference time without retraining.
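To make the linear path concrete, here is a minimal toy sketch of how logits can decompose exactly into per-concept contributions plus a residual term. It is not Steerling's actual code: the sizes, the random weights, and every name (`activations`, `readout`, `residual_logits`, the helper functions) are hypothetical and chosen only for illustration.

```python
import random

random.seed(0)

NUM_CONCEPTS = 4  # toy size; the real concept inventory is far larger
VOCAB_SIZE = 5    # toy vocabulary

# Hypothetical concept activations for a single token position.
activations = [random.uniform(0.0, 1.0) for _ in range(NUM_CONCEPTS)]
# Hypothetical linear readout: one row of vocabulary weights per concept.
readout = [[random.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
           for _ in range(NUM_CONCEPTS)]
# Hypothetical residual-pathway contribution to the logits.
residual_logits = [random.gauss(0.0, 0.1) for _ in range(VOCAB_SIZE)]


def per_concept_contributions(acts, weights):
    """Return contributions[c][v] = acts[c] * weights[c][v]."""
    return [[a * w for w in row] for a, row in zip(acts, weights)]


def total_logits(contributions, residual):
    """Sum per-concept contributions and the residual into final logits."""
    return [sum(column) + r
            for column, r in zip(zip(*contributions), residual)]


contributions = per_concept_contributions(activations, readout)
logits = total_logits(contributions, residual_logits)

# The same logits computed directly, without the per-concept breakdown.
direct = [
    sum(activations[c] * readout[c][v] for c in range(NUM_CONCEPTS))
    + residual_logits[v]
    for v in range(VOCAB_SIZE)
]
```

Because the readout is linear, `logits` and `direct` agree up to floating-point error, and each row of `contributions` shows how much one concept adds to each vocabulary logit.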

Performance and Interpretability Metrics

Despite being trained with significantly less compute than comparable models, Steerling-8B achieves competitive performance across standard benchmarks. It outperforms both LLaMA2-7B and DeepSeek-7B on overall average while using fewer FLOPs, and remains within range of models trained with 2-10× more compute.

On a held-out validation set, over 84% of token-level contribution comes from the concept module, indicating the model is not just using the residual to make predictions. When the residual pathway is removed, performance on several LM Harness tasks shows only a small effect, suggesting the model's predictive signal is largely routed through concepts rather than hidden channels.
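The article does not define how the concept share is measured, so the following is only one plausible reading: for each token, compare the magnitude of the concept pathway's logit contribution with that of the residual pathway, then average over tokens. The function name `concept_share`, the absolute-value normalization, and the example numbers are all assumptions made for illustration.

```python
def concept_share(concept_logit, residual_logit):
    """Fraction of a token's logit magnitude attributed to the concept pathway.

    Assumed metric: |concept| / (|concept| + |residual|).
    """
    total = abs(concept_logit) + abs(residual_logit)
    return abs(concept_logit) / total if total else 0.0


# Made-up (concept_logit, residual_logit) pairs for three predicted tokens.
tokens = [(2.1, 0.3), (1.4, 0.2), (0.9, 0.4)]

mean_share = sum(concept_share(c, r) for c, r in tokens) / len(tokens)
print(f"mean concept share: {mean_share:.3f}")
```

A value close to 1 would indicate that the residual contributes little to the predicted token, which is the behavior the validation figure above describes.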

Steerling can detect known concepts in text with 96.2% AUC (area under the curve).

Practical Features

For any group of output tokens that Steerling generates, users can trace these tokens to:

  • Input context: The specific prompt tokens that influenced the output
  • Concepts: Human-understandable topics in the model's representations, covering both tone (such as "analytical, clinical") and content (such as "Genetic alteration methodologies")
  • Training data: The training data sources that drove the output, showing distribution across sources like ArXiv, Wikipedia, and FLAN

The model enables inference-time alignment via concept control, replacing thousands of safety training examples with explicit concept-level steering. It also allows suppressing or amplifying specific concepts at inference time without retraining.
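As a rough illustration of concept-level steering, the toy sketch below rescales concept activations before a linear readout and shows how the preferred token changes. It does not use the Steerling package or its API; the two concepts, the two-word vocabulary, the weights, and the helper names `steer` and `logits_from` are hypothetical.

```python
# Hypothetical concepts and vocabulary for a two-by-two toy model.
CONCEPTS = ["clinical tone", "casual tone"]
VOCAB = ["patient", "dude"]

# readout[c][v]: how strongly concept c pushes the logit of token v.
readout = [
    [2.0, -1.0],   # "clinical tone" favors "patient"
    [-1.0, 2.0],   # "casual tone" favors "dude"
]
activations = [1.0, 0.5]  # made-up concept activations at one position


def steer(acts, scales):
    """Return a copy of acts with selected concept indices rescaled."""
    return [a * scales.get(i, 1.0) for i, a in enumerate(acts)]


def logits_from(acts, weights):
    """Linear readout: logit[v] = sum over c of acts[c] * weights[c][v]."""
    vocab_size = len(weights[0])
    return [sum(a * row[v] for a, row in zip(acts, weights))
            for v in range(vocab_size)]


baseline = logits_from(activations, readout)

# Suppress concept 0 (scale 0.0) and amplify concept 1 (scale 2.0).
steered = logits_from(steer(activations, {0: 0.0, 1: 2.0}), readout)

print("baseline top token:", VOCAB[baseline.index(max(baseline))])
print("steered top token: ", VOCAB[steered.index(max(steered))])
```

No weights change in this sketch; only the activations are edited at inference time, which is the property the paragraph above attributes to concept control.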

Available Artifacts

  • Model weights available on Hugging Face
  • Companion code on GitHub
  • Package on PyPI

📖 Read the full source: HN AI Agents
