Steerling-8B: An Interpretable Language Model with Token-Level Attribution

Model Architecture and Capabilities
Steerling-8B is built on a causal discrete diffusion model backbone that enables steering generation across multi-token sequences rather than only at the next-token level. The key design decomposes the model's embeddings into three explicit pathways: approximately 33,000 supervised "known" concepts, approximately 100,000 "discovered" concepts the model learns on its own, and a residual component that captures remaining information.
The training losses are designed to route predictive signal through the concepts without a fundamental performance tradeoff. Concepts feed into the logits through a linear path, so every prediction decomposes exactly into per-concept contributions plus a residual term. These contributions can be edited at inference time without retraining.
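The exact decomposition follows from linearity and can be illustrated with a toy sketch. Everything below (function names, sizes, weights, and activation values) is hypothetical and chosen for illustration; it is not Steerling's actual API or parameters.

```python
# Toy illustration of a linear concept-to-logit path.
# All names and numbers are hypothetical, not taken from Steerling-8B.

def concept_logit_contributions(activations, concept_to_logit):
    """contributions[k][v] = activation of concept k times its weight for token v."""
    return [[a * w for w in row] for a, row in zip(activations, concept_to_logit)]

def total_logits(contributions, residual_logits):
    """Sum per-concept contributions and the residual term for each vocabulary entry."""
    vocab_size = len(residual_logits)
    return [
        sum(contrib[v] for contrib in contributions) + residual_logits[v]
        for v in range(vocab_size)
    ]

activations = [0.5, 2.0, 0.0]                 # three concepts
concept_to_logit = [[1.0, -1.0],              # 3 concepts x 2 vocabulary tokens
                    [0.5, 0.25],
                    [3.0, 3.0]]
residual_logits = [0.1, -0.2]                 # what the concepts do not explain

contributions = concept_logit_contributions(activations, concept_to_logit)
logits = total_logits(contributions, residual_logits)

for k, contrib in enumerate(contributions):
    print(f"concept {k}: {contrib}")
print("logits:", logits)
```

Because the path is linear, the per-concept rows plus the residual add up to the logits with nothing left over, which is what makes the attribution exact rather than approximate.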
Performance and Interpretability Metrics
Despite being trained with significantly less compute than comparable models, Steerling-8B achieves competitive performance across standard benchmarks. It outperforms both LLaMA2-7B and DeepSeek-7B on overall average while using fewer FLOPs, and remains within range of models trained with 2-10× more compute.
On a held-out validation set, over 84% of token-level contribution comes from the concept module, indicating that the model is not relying mainly on the residual to make predictions. When the residual pathway is removed, performance on several LM Harness tasks changes only slightly, suggesting that the model's predictive signal is largely routed through concepts rather than hidden channels.
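The summary does not say how the concept share is computed. One plausible reading, shown here purely as an assumption, is the fraction of absolute contribution mass that comes from concepts rather than the residual, averaged over tokens. The function name and the numbers are hypothetical.

```python
# Hypothetical metric: share of absolute contribution mass carried by concepts.
# The actual definition behind the reported 84% figure is not given in the text.

def concept_share(concept_contribs, residual_contrib):
    """Fraction of |contribution| coming from concepts for a single token."""
    concept_mass = sum(abs(c) for c in concept_contribs)
    total = concept_mass + abs(residual_contrib)
    return concept_mass / total if total > 0 else 0.0

# (per-concept contributions to the predicted token's logit, residual contribution)
tokens = [
    ([1.2, -0.3, 0.5], 0.4),
    ([0.1, 0.9], -0.2),
]

shares = [concept_share(concepts, residual) for concepts, residual in tokens]
mean_share = sum(shares) / len(shares)
print(f"mean concept share: {mean_share:.3f}")
```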
Steerling can detect known concepts in text with 96.2% AUC (area under the curve).
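For readers unfamiliar with the metric, the sketch below computes a ROC AUC from concept-detection scores, assuming the reported figure is the area under the ROC curve. It uses the pairwise-ranking definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The labels and scores are made up for illustration.

```python
# ROC AUC via pairwise comparison of positive and negative scores.
# Labels and scores are illustrative, not Steerling outputs.

def roc_auc(labels, scores):
    """Probability that a positive example outranks a negative one (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0, 1, 0]               # 1 = concept present in the text
scores = [0.9, 0.8, 0.3, 0.6, 0.5, 0.1]   # detector confidence for the concept
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 1.0 means every positive outranks every negative, while 0.5 is chance level.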
Practical Features
For any group of output tokens that Steerling generates, users can trace these tokens to:
- Input context: The specific prompt tokens that influenced the output
- Concepts: Human-understandable topics in the model's representations (both tone like "analytical, clinical" and content like "Genetic alteration methodologies")
- Training data: The training data sources that drove the output, showing distribution across sources like ArXiv, Wikipedia, and FLAN
The model enables inference-time alignment via concept control, replacing thousands of safety training examples with explicit concept-level steering. It also allows suppressing or amplifying specific concepts at inference time without retraining.
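A minimal sketch of what suppressing or amplifying a concept could look like, assuming a linear concept-to-logit path. The concept names, vocabulary, weights, and the `steer` helper are all hypothetical and do not reflect Steerling's actual interface.

```python
import math

# Toy concept steering: scale concept activations before they reach the logits.
# Every name and number here is hypothetical.

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer(activations, scales):
    """Scale selected concepts: 0.0 suppresses, values above 1.0 amplify."""
    return {name: a * scales.get(name, 1.0) for name, a in activations.items()}

def next_token_probs(activations, weights, residual_logits):
    logits = list(residual_logits)
    for name, a in activations.items():
        for v, w in enumerate(weights[name]):
            logits[v] += a * w
    return softmax(logits)

vocab = ["therefore", "lol"]
weights = {"analytical": [2.0, -1.0], "casual": [-1.0, 2.0]}
activations = {"analytical": 1.0, "casual": 1.0}
residual_logits = [0.0, 0.0]

baseline = next_token_probs(activations, weights, residual_logits)
steered = next_token_probs(steer(activations, {"casual": 0.0}), weights, residual_logits)

print(dict(zip(vocab, baseline)))
print(dict(zip(vocab, steered)))
```

In this toy setup, suppressing the "casual" concept shifts probability toward "therefore" with no change to the weights, which mirrors the idea of steering at inference time instead of retraining.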
Available Artifacts
- Model weights available on Hugging Face
- Companion code on GitHub
- Package on PyPI
📖 Read the full source: HN AI Agents