IF-VidCap: Can Video Caption Models Follow Instructions?
Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang

TL;DR
This paper introduces IF-VidCap, a benchmark for evaluating how well video captioning models follow specific user instructions, highlighting the need for controllable, instruction-aware captioning systems.
Contribution
The paper presents a new benchmark, IF-VidCap, with a systematic framework to assess instruction-following in video captioning models, and provides a comprehensive evaluation of over 20 models.
Findings
Open-source models are closing the performance gap with proprietary ones.
Models specialized in dense captioning underperform on complex instructions.
Future research should improve both descriptive richness and instruction-following fidelity.
Abstract
Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the…
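The format-correctness dimension described above lends itself to deterministic, rule-based checks. The following is a minimal sketch of that idea, not the benchmark's actual implementation: the constraint helpers (`max_words`, `bullet_count`, `must_include`), the `Check` type, and the fraction-of-checks-passed score are all hypothetical choices made for illustration, and none of them are taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str                      # hypothetical constraint label
    passed: Callable[[str], bool]  # returns True if the caption satisfies it


def max_words(limit: int) -> Check:
    # Length constraint: caption has at most `limit` whitespace-separated tokens.
    return Check(f"max_words<={limit}", lambda c: len(c.split()) <= limit)


def bullet_count(n: int) -> Check:
    # Structure constraint: exactly `n` lines that start with "- ".
    def count(c: str) -> int:
        return sum(1 for line in c.splitlines() if line.lstrip().startswith("- "))
    return Check(f"bullets=={n}", lambda c: count(c) == n)


def must_include(word: str) -> Check:
    # Keyword constraint: `word` appears as a whole word, case-insensitively.
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return Check(f"includes:{word}", lambda c: bool(pattern.search(c)))


def format_score(caption: str, checks: list[Check]) -> tuple[float, dict[str, bool]]:
    # Assumed scoring rule: fraction of checks satisfied (1.0 if there are none).
    results = {chk.name: chk.passed(caption) for chk in checks}
    score = sum(results.values()) / len(results) if results else 1.0
    return score, results


caption = "- A dog runs across a beach.\n- It catches a red ball."
checks = [max_words(20), bullet_count(2), must_include("ball"), must_include("cat")]
score, results = format_score(caption, checks)
print(score, results)
```

Content correctness (whether the caption says true things about the video) cannot be checked this way and would need a separate semantic judge; this sketch covers only the deterministic half.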
Peer Reviews
Decision: ICLR 2026 Poster
1. IF-VidCap is the first benchmark to explicitly evaluate instruction-following in video captioning (27 constraint types), addressing a critical gap beyond traditional accuracy/fluency metrics.
2. Provides 1,400 video-instruction-checklist triplets built via a two-stage pipeline (auto-generation + human refinement), with an 83.6% modification rate and consensus-based validation.
3. Combines deterministic rule-based checks with semantic LLM-as-Judge QA, achieving 96.33% agreement with human judgments for reliable assessment.
1. 1,400 samples is relatively small compared to text-only instruction-following benchmarks (e.g., IFEval, CFBench). In addition, videos average 20.5s and max out at 60s, so the benchmark does not test long-form temporal reasoning or multi-scene narratives.
2. Evaluation focuses on compliance, not quality: it does not assess the fluency, coherence, or creativity of generated captions.
3. Training data distribution gap: the training data uses a "caption-to-instruction" generation method, which may not reflect real user instruction distributions.
This work demonstrates significant strengths through its creation of IF-VidCap, the first benchmark systematically evaluating instruction-following in video captioning with complex, real-world constraints. The benchmark is built on high-quality, carefully curated data and features a comprehensive, human-validated evaluation protocol. Its extensive experiments across ~20 diverse models yield clear insights into scaling effects and model capabilities, while the accompanying training dataset proves
The benchmark has several limitations, including its focus on short videos, which excludes long-form content and constrained summarization tasks. Its evaluation, while efficient, relies on automated LLM judgments that may miss nuanced errors and depends on proprietary models, raising reproducibility concerns. Although fine-tuning demonstrates improvement, the absolute performance gains remain modest, and the analysis lacks a deeper investigation into the underlying reasons. Furthermore, the paper
1. Valuable and timely evaluation benchmark: the proposed dataset fills a significant gap in assessing instruction-following behavior for video captioning models.
2. Covers a wide range of different settings, including multiple constraint types, compositional tasks, and diverse video sources.
3. Includes a fine-tuning dataset, enabling reproducibility and extension for future research.
4. Two-format setting (rule-based vs. open-ended checking) is well-designed and helps assess both structur
1. Lack of detail on video selection and preprocessing: It’s unclear how the 350 base videos were chosen and filtered beyond general quality criteria. The authors should provide a full list or dataset summary for reproducibility.
2. Limited discussion on annotation consistency: Although human refinement is mentioned, inter-annotator agreement or quality control statistics are not detailed.
3. Benchmark scope limitation: The dataset focuses primarily on short or medium-length videos (2–60 secon
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Generative Adversarial Networks and Image Synthesis
