Qwen 3.5-Omni Variant: OpenRouter Healer Alpha Analysis

Technical specifications and evidence

Healer Alpha is described as having "vision, hearing, reasoning, and action capabilities" with native perception of visual and audio inputs. The model accepts text, image, audio, and video inputs and outputs text with a maximum output length of 65,536 tokens.

The 262,144-token context window is a key identifier: this exact number (2^18) matches Qwen 3.5's native context length precisely, rather than a rounded figure such as 256,000. Other models list different lengths: GPT-5.4 at 272K, Gemini at 1M, and Claude at 200K to 1M.
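The arithmetic behind that fingerprint is easy to check. The sketch below is illustrative only: the helper name is made up for this example, and reading "272K" and "200K" as decimal thousands is an assumption about how those listings round.

```python
def is_power_of_two(n: int) -> bool:
    """True when n has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


healer_alpha_context = 262_144  # context length quoted in the listing

# bit_length() - 1 is the exponent when the value is an exact power of two.
exponent = healer_alpha_context.bit_length() - 1

# Assumption: "K" in the other listings means decimal thousands.
other_contexts = {"GPT-5.4 (272K)": 272_000, "Claude (200K)": 200_000}

print(is_power_of_two(healer_alpha_context), exponent)
print(healer_alpha_context == 256 * 1024)   # binary "256K"
print(healer_alpha_context - 256_000)       # gap to decimal "256K"
for name, size in other_contexts.items():
    print(name, is_power_of_two(size))
```

The gap of 6,144 tokens between the binary and decimal readings of "256K" is what makes the exact figure distinctive.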

Architecture knowledge and capabilities

When asked about Qwen architectures, Healer Alpha produced a 2,000+ word technical explanation covering:

Qwen3-Omni Thinker-Talker architecture with reasoning/generation split
Cross-modal fusion and CosyVoice vocoder integration
GDN (gated normalization mechanism) and MoE expert routing
262K context handling using Ring Attention, KV cache optimization, FlashAttention tiling, YaRN/NTK-aware RoPE scaling, and curriculum learning
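
For readers unfamiliar with one of the listed techniques, here is a minimal sketch of NTK-aware RoPE base scaling as it is commonly described. It is not taken from the model's answer and is not confirmed to be how any Qwen model handles long context; the head dimension and scale factor are hypothetical values chosen for illustration.

```python
import math


def rope_inv_freq(dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies, one per pair of head dimensions."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def ntk_scaled_base(dim: int, scale: float, base: float = 10000.0) -> float:
    """NTK-aware adjustment: enlarge the base so the slowest rotation
    slows down by `scale` while the fastest one is left untouched."""
    return base * scale ** (dim / (dim - 2))


head_dim = 128  # hypothetical head dimension
scale = 8.0     # hypothetical context extension factor

original = rope_inv_freq(head_dim)
scaled = rope_inv_freq(head_dim, ntk_scaled_base(head_dim, scale))

# Highest frequency (index 0) is unchanged; lowest is divided by ~scale.
print(original[0], scaled[0])
print(math.isclose(original[-1] / scaled[-1], scale, rel_tol=1e-9))
```

The design idea is that high-frequency components, which encode local token order, keep their resolution, while low-frequency components are stretched to cover the longer window.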

In contrast, when asked about DeepSeek or xAI architectures, it returned minimal responses or none at all.

Chinese language proficiency and error metadata

The model demonstrated native-level classical Chinese poetry composition, writing a 七言绝句 (seven-character quatrain) about AI with proper tonal structure and classical imagery. It also provided a literary analysis of its own poem.

During heavy probing, error responses revealed metadata: {"message": "Provider returned error", "code": 502, "metadata": {"provider_name": "Stealth"}}
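
A minimal sketch of pulling the provider name out of such an error body is shown below. The payload is the one quoted above; the helper name is hypothetical, and the code assumes the error arrives as a JSON string with this exact shape.

```python
import json
from typing import Optional

# Error body as quoted in the text above.
raw_error = (
    '{"message": "Provider returned error", "code": 502, '
    '"metadata": {"provider_name": "Stealth"}}'
)


def provider_from_error(payload: str) -> Optional[str]:
    """Return metadata.provider_name if present, else None."""
    data = json.loads(payload)
    return data.get("metadata", {}).get("provider_name")


print(provider_from_error(raw_error))
```

Using chained `.get()` calls means a payload without a `metadata` field yields `None` instead of raising a `KeyError`.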

Model identification reasoning

The analysis suggests this could be a merged "Qwen 3.5-Omni" variant combining Qwen 3.5's 262K context and hybrid GDN-MoE architecture with Qwen3-Omni's audio/video capabilities. This would represent a new, unreleased model, consistent with OpenRouter's pattern of stealth-testing unreleased models that need real-world data before launch.

The use of "hearing" instead of "audio" in the description matches Qwen3-Omni's emphasis on end-to-end speech/audio understanding. The model refuses to identify itself in structured self-assessment tests, maintaining its stealth nature.

📖 Read the full source: r/LocalLLaMA