Building a Precise Video Language with Human-AI Oversight

Zhiqiu Lin; Chancharik Mitra; Siyuan Cen; Isaac Li; Yuhan Huang; Yu Tong Tiffany Ling; Hewei Wang; Irene Pi; Shihang Zhu; Ryan Rao; George Liu; Jiaxi Li; Ruojin Li; Yili Han; Yilun Du; Deva Ramanan

arXiv:2604.21718 · cs.CV · April 28, 2026

TL;DR

This paper introduces a framework that combines structured specifications, human critique, and model fine-tuning to make video captioning and generation more precise, and it reports outperforming closed-source models such as Gemini-3.1-Pro.

Contribution

The paper presents a human-AI oversight framework that uses structured visual primitives and critique-based supervision to improve video-language models.

Findings

01. Critique quality directly influences downstream performance.
02. The approach outperforms closed-source models like Gemini-3.1-Pro.
03. Fine-tuning with human oversight enables detailed control over video generation.

Abstract

Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and…
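The critique-and-revise loop described in the abstract can be pictured as a small data record. The sketch below is illustrative only and is not the paper's schema: the class `OversightRecord`, its field names, the `ASPECTS` identifiers, and the helper `to_preference_pair` are all hypothetical, and treating a revised post-caption as preferred over its pre-caption is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Aspect names follow the abstract's list (subjects, scenes, motion, spatial,
# camera dynamics); the identifiers themselves are hypothetical.
ASPECTS = ("subject", "scene", "motion", "spatial", "camera")


@dataclass(frozen=True)
class OversightRecord:
    """One hypothetical annotation: model draft, expert critique, expert revision."""
    video_id: str
    aspect: str
    pre_caption: str   # caption drafted by the model
    critique: str      # expert's note on what the draft got wrong
    post_caption: str  # caption after the expert's revision

    def __post_init__(self) -> None:
        # Reject aspects outside the assumed specification.
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect!r}")


def to_preference_pair(record: OversightRecord) -> Optional[dict]:
    """Return a chosen/rejected pair, or None if the draft was left unchanged."""
    if record.post_caption.strip() == record.pre_caption.strip():
        return None  # no revision, so no preference signal (assumed convention)
    return {
        "video_id": record.video_id,
        "chosen": record.post_caption,
        "rejected": record.pre_caption,
        "critique": record.critique,
    }
```

Under these assumptions, a record whose post-caption differs from its pre-caption yields one preference pair, while an unrevised draft yields none.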

Code & Models

Models

chancharikm/CHAI_SFT_model_8b
model · 276 dl

Datasets

chancharikm/CHAI_testset
dataset · 945 dl
