Anam Cara-3: Faster, Interactive AI Avatars with Audio-to-Video Pipeline

Anam has released its latest model, Cara-3, designed to create interactive avatars. The model uses a two-stage pipeline: a diffusion transformer first converts audio into motion embeddings (head position, eye gaze, lip shape, and expression), and these embeddings are then applied to a reference image to generate video frames. Because the second stage works from a reference image, any face can be animated without retraining.
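The two-stage split can be sketched as a pair of functions that share only the motion embedding. This is a minimal illustration, not Anam's code: `MotionEmbedding`, `audio_to_motion`, `render_frames`, and `animate` are hypothetical names, and both stages are trivial stand-ins (a loudness heuristic in place of the diffusion transformer, a string in place of a rendered frame).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MotionEmbedding:
    """Hypothetical container for the four motion signals named in the article."""
    head_position: float
    eye_gaze: float
    lip_shape: float
    expression: float


def audio_to_motion(audio_chunks: List[List[float]]) -> List[MotionEmbedding]:
    """Stage 1 stand-in: one embedding per audio chunk.

    A real system would run a diffusion transformer here; this toy version
    maps mean absolute amplitude to lip opening and leaves the rest at zero.
    """
    motions = []
    for chunk in audio_chunks:
        energy = sum(abs(s) for s in chunk) / max(len(chunk), 1)
        motions.append(MotionEmbedding(0.0, 0.0, energy, 0.0))
    return motions


def render_frames(reference_image: str, motions: List[MotionEmbedding]) -> List[str]:
    """Stage 2 stand-in: describe each frame instead of rendering pixels."""
    return [f"{reference_image}:lip={m.lip_shape:.2f}" for m in motions]


def animate(reference_image: str, audio_chunks: List[List[float]]) -> List[str]:
    """Run both stages; the reference image only enters at stage 2."""
    return render_frames(reference_image, audio_to_motion(audio_chunks))
```

Since the reference image is only an input to the second stage, the same motion embeddings can drive a different face by changing one argument, which is the property the article describes as animating any face without retraining.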

Cara-3 achieves a time-to-first-frame of approximately 70 ms on an H200 and supports many concurrent avatar sessions on a single GPU. The speed is partly due to a novel flow matching variant used for the audio-to-motion stage, which Anam adopted because conventional techniques proved unstable.
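The article does not describe Anam's variant, so the following is only a sketch of how sampling works in standard flow matching: start from a noise vector and integrate a velocity field from t = 0 to t = 1 with a few Euler steps. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the velocity field is a hand-written stand-in for a trained network (the closed-form field for a straight-line path toward a single known target).

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for a learned network: the field of the linear path to `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Illustrative values only: a "noise" start and a made-up motion target.
noise = [0.9, -1.3, 0.2]
target = [0.1, 0.4, -0.7]
sample = euler_sample(make_toy_velocity(target), noise, steps=8)
```

Fewer integration steps mean fewer network evaluations per sample, which is one reason flow-based samplers are attractive for low-latency generation in general.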

An independent blind evaluation found that Cara-3 outperformed competitors including HeyGen, Tavus, and D-ID, scoring 24% higher on average across the metrics tested. The evaluation also suggests that responsiveness matters more to user experience than visual quality: responsiveness had a Spearman correlation coefficient of 0.697, compared with 0.473 for visual quality.
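For readers unfamiliar with the statistic, a Spearman coefficient is the Pearson correlation computed on ranks rather than raw values, so it measures how consistently one score rises with another. Below is a minimal standard-library sketch; the function names are chosen for illustration and the ratings are made-up numbers, not data from the evaluation.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson applied to the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up ratings for five hypothetical sessions.
responsiveness = [2, 3, 5, 4, 1]
overall = [1, 3, 5, 4, 2]
rho = spearman(responsiveness, overall)
```

A coefficient of 1 means the two rankings agree perfectly and 0 means no monotonic relationship, so 0.697 indicates a noticeably stronger association than 0.473.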

Anam has also open-sourced Metaxy, the backbone of its training data pipeline, which supports iterative development without rerunning costly steps.

📖 Read the full source: HN AI Agents
