DeepSeek-V4 Pro & Flash: 1.6T Params, 1M Token Context

DeepSeek AI has released a preview of the DeepSeek-V4 series on Hugging Face. The lineup includes two Mixture-of-Experts (MoE) language models:

DeepSeek-V4-Pro: 1.6 trillion total parameters, 49 billion activated per token
DeepSeek-V4-Flash: 284 billion total parameters, 13 billion activated per token

Both models support a context length of one million tokens.
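To make the sparsity of the two MoE models concrete, here is a minimal sketch that computes what fraction of each model's parameters is activated per token. It uses only the figures quoted above; the names `MODELS` and `active_fraction` are illustrative and not part of any DeepSeek tooling.

```python
# Figures quoted in the text: (total parameters, activated per token).
MODELS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}


def active_fraction(total: float, active: float) -> float:
    """Fraction of all parameters used to process a single token."""
    return active / total


for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
```

By these numbers, Pro activates roughly 3% of its weights per token and Flash roughly 4.6%, so per-token compute tracks the activated count rather than the total.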

Architectural Upgrades

The V4 series introduces a hybrid attention mechanism combining:

Compressed Sparse Attention (CSA)
Heavily Compressed Attention (HCA)

At the 1M-token context length, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2.
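The relative figures above can be restated as savings and as speed-up factors. The sketch below does that arithmetic; the helper names are hypothetical, and the only inputs are the 27% and 10% ratios from the text.

```python
def reduction(remaining_fraction: float) -> float:
    """Share of the baseline cost that is saved."""
    return 1.0 - remaining_fraction


def improvement_factor(remaining_fraction: float) -> float:
    """How many times smaller the cost is than the baseline."""
    return 1.0 / remaining_fraction


# DeepSeek-V4-Pro relative to DeepSeek-V3.2 at a 1M-token context.
FLOPS_REMAINING = 0.27     # single-token inference FLOPs
KV_CACHE_REMAINING = 0.10  # KV cache size

for label, remaining in [("FLOPs", FLOPS_REMAINING), ("KV cache", KV_CACHE_REMAINING)]:
    print(
        f"{label}: {reduction(remaining):.0%} lower, "
        f"{improvement_factor(remaining):.1f}x smaller than V3.2"
    )
```

This works out to about 3.7x fewer FLOPs per generated token and a 10x smaller KV cache at that context length.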

Additionally, the models incorporate Manifold-Constrained Hyper-Connections (mHC) to strengthen residual connections, improving training stability.

Model Details

Repository: deepseek-ai/DeepSeek-V4-Pro on Hugging Face
Pipeline tag: text-generation
Auto model class: AutoModelForCausalLM
License: MIT
Weights: sharded safetensors, including BF16, F32, F8_E8M0, F8_E4M3, and INT8 formats
Parameter count reported in the safetensors metadata: ~862 billion. This is lower than the 1.6 trillion headline figure, and the source does not explain the difference.
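Given the repository name and auto model class listed above, loading would presumably follow the standard Hugging Face `transformers` pattern. The following is a hedged sketch, not a verified recipe: it assumes an installed `transformers` release that supports the V4 architecture, that `accelerate` is available for `device_map="auto"`, and that the hardware can hold the checkpoint. The source does not say whether extra flags such as `trust_remote_code` are needed. The function name `load_v4_pro` is hypothetical.

```python
REPO_ID = "deepseek-ai/DeepSeek-V4-Pro"  # repository named in the listing above


def load_v4_pro(repo_id: str = REPO_ID):
    """Load tokenizer and model. Calling this downloads the full sharded checkpoint."""
    # Imported lazily so that defining this function does not require
    # transformers to be installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype="auto",  # keep the dtypes stored in the checkpoint
        device_map="auto",   # assumption: accelerate is installed to place shards
    )
    return tokenizer, model


# Usage (not executed here, since it triggers a very large download):
# tokenizer, model = load_v4_pro()
```

At this parameter count the weights will not fit on a single accelerator, so in practice a multi-node setup or a dedicated inference server is more realistic than this single-process call.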

Benchmarks and Efficiency

The technical report (not yet fully public) attributes the long-context efficiency gains to the hybrid attention design. In the 1M-token setting, DeepSeek-V4-Pro uses 73% fewer single-token inference FLOPs and a 90% smaller KV cache than DeepSeek-V3.2, which is the same result as the 27% and 10% figures quoted above.

For developers running long-context applications (e.g., document analysis, codebase understanding, multi-turn agents), this makes DeepSeek-V4 a compelling option: context can grow toward one million tokens without compute and memory costs growing in proportion.

Who It's For

This release targets developers building AI agents that need to process very long documents, large codebases, or multi-turn conversations with full context retention.

📖 Read the full source: HN AI Agents