DeepSeek-V4 Pro and Flash: 1.6T Parameters, 1M Token Context, Hybrid Attention

DeepSeek AI has released a preview of the DeepSeek-V4 series on Hugging Face. The lineup includes two Mixture-of-Experts (MoE) language models:
- DeepSeek-V4-Pro: 1.6 trillion total parameters, 49 billion activated per token
- DeepSeek-V4-Flash: 284 billion total parameters, 13 billion activated per token
Both models support a context length of one million tokens.
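The share of parameters activated per token follows directly from the figures above. The short Python sketch below only restates that arithmetic; the dictionary layout and the helper name `active_fraction` are illustrative choices, not part of any DeepSeek tooling.

```python
# Parameter counts as stated in the announcement, in billions.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters activated per token."""
    return active_b / total_b


for name, params in MODELS.items():
    frac = active_fraction(params["total_b"], params["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

By these numbers, roughly 3% of the Pro model's parameters and under 5% of the Flash model's parameters are active for any given token.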
Architectural Upgrades
The V4 series introduces a hybrid attention mechanism combining:
- Compressed Sparse Attention (CSA)
- Heavily Compressed Attention (HCA)
At the 1M-token context length, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2.
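To make the ratios above easier to compare, the sketch below converts them into percentage reductions and "times fewer" factors. It uses only the two ratios quoted in the text; the function names are hypothetical helpers introduced for illustration.

```python
# Ratios relative to DeepSeek-V3.2 at 1M-token context, as quoted in the text.
FLOPS_RATIO = 0.27  # single-token inference FLOPs
KV_RATIO = 0.10     # KV cache size


def reduction(ratio: float) -> float:
    """Fraction saved relative to the baseline (e.g. 0.27 -> 0.73)."""
    return 1.0 - ratio


def improvement_factor(ratio: float) -> float:
    """How many times smaller the cost is than the baseline."""
    return 1.0 / ratio


for label, ratio in (("FLOPs", FLOPS_RATIO), ("KV cache", KV_RATIO)):
    print(
        f"{label}: {reduction(ratio):.0%} lower, "
        f"about {improvement_factor(ratio):.1f}x smaller than V3.2"
    )
```

Read this way, the quoted figures correspond to roughly 3.7x fewer FLOPs per token and a 10x smaller KV cache at the 1M-token setting.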
Additionally, the models incorporate Manifold-Constrained Hyper-Connections (mHC) to strengthen residual connections, improving training stability.
Model Details
- Repository: deepseek-ai/DeepSeek-V4-Pro on Hugging Face
- Pipeline tag: text-generation
- Auto model class: AutoModelForCausalLM
- License: MIT
- Weights: sharded safetensors, including BF16, F32, F8_E8M0, F8_E4M3, and INT8 formats
- Parameter count reported in the safetensors metadata: ~862 billion, which is lower than the 1.6 trillion headline figure; the source does not explain the difference
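Given the repository name and Auto model class listed above, loading would presumably follow the standard Hugging Face `transformers` pattern. The sketch below is a minimal illustration under several assumptions: that an installed `transformers` release supports this architecture, that `accelerate` is available for `device_map="auto"`, and that no extra flags (such as `trust_remote_code`) are required. None of this is confirmed by the source, and a model of this size needs a multi-GPU or multi-node setup in practice.

```python
MODEL_ID = "deepseek-ai/DeepSeek-V4-Pro"  # repository name from the model card


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and model via the generic Auto classes.

    Assumption: the standard from_pretrained path works for this repository.
    """
    # Imported lazily so the module can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device_map="auto" shards weights across available devices (needs accelerate).
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return tokenizer, model


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 64) -> str:
    """Run a single text-generation call and return the decoded string."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Usage would be `tokenizer, model = load_model()` followed by `generate(tokenizer, model, "Summarize this repository:")`. The same two functions should apply to the Flash variant by passing its repository name, which the source does not list.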
Benchmarks and Efficiency
The technical report (not yet fully public) credits the hybrid attention design with the improvement in long-context efficiency. Expressed as reductions, the 1M-token figures mean DeepSeek-V4-Pro uses 73% fewer single-token inference FLOPs and 90% less KV cache than DeepSeek-V3.2.
For developers running long-context applications (e.g., document analysis, codebase understanding, multi-turn agents), this makes DeepSeek-V4 a compelling option for working with very long contexts without a proportional increase in compute and memory cost.
Who It's For
This release targets developers building AI agents that need to process very long documents, large codebases, or multi-turn conversations with full context retention.
📖 Read the full source: HN AI Agents