Netflix Releases VOID: Video Object and Interaction Deletion Model on Hugging Face

✍️ OpenClawRadar | 📅 Published: April 14, 2026

What VOID Does

VOID removes an object from a video together with every interaction it induces on the scene. That covers not only secondary effects such as shadows and reflections, but also physical interactions, for example objects that fall once a person is removed.

Technical Requirements

  • Requires a GPU with 40GB+ VRAM (e.g., A100)
  • Built on CogVideoX-Fun-V1.5-5b-InP
  • Fine-tuned for video inpainting with interaction-aware quadmask conditioning
  • Quadmask is a 4-value mask that encodes: primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep)
  • Resolution: 384x672 (default)
  • Max frames: 197
  • Scheduler: DDIM
  • Precision: BF16 with FP8 quantization for memory efficiency

Model Files

  • void_pass1.safetensors - Base inpainting model (required)
  • void_pass2.safetensors - Warped-noise refinement for temporal consistency (optional)

Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.

Quick Start

The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result.

git clone https://github.com/netflix/void-model.git
cd void-model

CLI Usage

# Install dependencies
pip install -r requirements.txt

# Download the base model
huggingface-cli download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# Download VOID checkpoints
huggingface-cli download netflix/void-model \
    --local-dir .

# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
    --config config/quadmask_cogvideox.py \
    --config.data.data_rootdir="./sample" \
    --config.experiment.run_seqs="lime" \
    --config.experiment.save_path="./outputs" \
    --config.video_model.transformer_path="./void_pass1.safetensors"
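
To script the Pass 1 run, for example across several sequence folders, the same command can be assembled in Python. This is a minimal sketch, not part of the repo: the helper name build_pass1_command is hypothetical, and it uses only the script path and flags from the command above.

```python
import shlex


def build_pass1_command(data_root: str, seq: str, save_path: str, checkpoint: str) -> list[str]:
    """Return the Pass 1 inference command as an argument list.

    Hypothetical helper: it only re-assembles the flags shown in the
    article's CLI example and adds no options of its own.
    """
    return [
        "python", "inference/cogvideox_fun/predict_v2v.py",
        "--config", "config/quadmask_cogvideox.py",
        f"--config.data.data_rootdir={data_root}",
        f"--config.experiment.run_seqs={seq}",
        f"--config.experiment.save_path={save_path}",
        f"--config.video_model.transformer_path={checkpoint}",
    ]


cmd = build_pass1_command("./sample", "lime", "./outputs", "./void_pass1.safetensors")

# Print a copy-pasteable shell line for inspection before running anything.
print(shlex.join(cmd))
```

From the repo root, the list could then be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids shell quoting problems in paths.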

Input Format

Each video needs three files in a folder:

  • input_video.mp4 - source video
  • quadmask_0.mp4 - 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
  • prompt.json - {"bg": "description of scene after removal"}
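
The folder layout above can be prepared and checked with a few lines of Python. This is a sketch under stated assumptions: the helper names write_prompt and missing_files are hypothetical, and the background description is made-up example text. Only the three file names and the "bg" key come from the list above.

```python
import json
import tempfile
from pathlib import Path

# File names as listed in the input format above.
REQUIRED_FILES = ("input_video.mp4", "quadmask_0.mp4", "prompt.json")


def write_prompt(sample_dir: Path, background: str) -> Path:
    """Write prompt.json with the scene description under the "bg" key."""
    sample_dir.mkdir(parents=True, exist_ok=True)
    path = sample_dir / "prompt.json"
    path.write_text(json.dumps({"bg": background}), encoding="utf-8")
    return path


def missing_files(sample_dir: Path) -> list[str]:
    """Return the required file names that are absent from sample_dir."""
    return [name for name in REQUIRED_FILES if not (sample_dir / name).is_file()]


# Demo in a temporary directory; the description text is a made-up example.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "lime"
    write_prompt(sample, "a kitchen counter with a cutting board")
    missing = missing_files(sample)

print(missing)  # the two video files have not been added yet
```

Running a check like this before inference catches a missing mask or prompt early, before any model weights are loaded.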

The repo includes a mask generation pipeline (VLM-MASK-REASONER/) that creates quadmasks from raw videos using SAM2 + Gemini.
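
That pipeline is not reproduced here. To illustrate what the four gray levels mean at the pixel level, the sketch below composes one quadmask frame from two boolean masks with NumPy. The function name compose_quadmask is hypothetical, and treating "overlap" as the pixels covered by both the primary object and the affected region is an assumption, not something the source states.

```python
import numpy as np

# Gray levels as listed for quadmask_0.mp4 in the input format.
REMOVE, OVERLAP, AFFECTED, KEEP = 0, 63, 127, 255


def compose_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Combine two boolean (H, W) masks into one uint8 quadmask frame.

    Assumption: "overlap" means pixels that belong to both masks.
    """
    if primary.shape != affected.shape:
        raise ValueError("masks must have the same shape")
    primary = primary.astype(bool)
    affected = affected.astype(bool)

    frame = np.full(primary.shape, KEEP, dtype=np.uint8)  # start as background
    frame[affected] = AFFECTED                            # e.g. items that will fall
    frame[primary] = REMOVE                               # the object to delete
    frame[primary & affected] = OVERLAP                   # covered by both masks
    return frame


# Toy 1x4 frame: one pixel per category.
primary = np.array([[True, True, False, False]])
affected = np.array([[False, True, True, False]])
frame = compose_quadmask(primary, affected)
print(frame.tolist())  # [[0, 63, 127, 255]]
```

Writing the overlap level last means it takes precedence wherever the two masks intersect, regardless of the order of the two earlier assignments.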

Training Details

  • Trained on paired counterfactual videos generated from two sources: HUMOTO (human-object interactions rendered in Blender with physics simulation) and Kubric (object-only interactions using Google Scanned Objects)
  • Training was run on 8x A100 80GB GPUs using DeepSpeed ZeRO Stage 2

Architecture

  • Base: CogVideoX 3D Transformer (5B parameters)
  • Input: Video + quadmask + text prompt describing the scene after removal
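
As a back-of-the-envelope check on the precision settings listed under Technical Requirements, the weight memory of a 5B-parameter transformer can be estimated from bytes per parameter. This is plain arithmetic and counts the transformer weights only. Other model components and activations add to the total, so it is not a measurement of VOID's actual footprint.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


PARAMS = 5e9  # 5B parameters, as stated for the base transformer

bf16_gib = weight_memory_gib(PARAMS, 2)  # BF16 stores 2 bytes per parameter
fp8_gib = weight_memory_gib(PARAMS, 1)   # FP8 stores 1 byte per parameter

print(f"BF16 weights: {bf16_gib:.1f} GiB")  # BF16 weights: 9.3 GiB
print(f"FP8 weights:  {fp8_gib:.1f} GiB")   # FP8 weights:  4.7 GiB
```

Under this estimate, quantizing the transformer to FP8 halves its weight memory.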

📖 Read the full source: HN AI Agents
