IF-VidCap: Can Video Caption Models Follow Instructions?

Shihao Li; Yuanxing Zhang; Jiangtao Wu; Zhide Lei; Yiwen He; Runzhe Wen; Chenxi Liao; Chengkang Jiang; An Ping; Shuo Gao; Suhan Wang; Zhaozhou Bian; Zijun Zhou; Jingyi Xie; Jiayi Zhou; Jing Wang; Yifan Yao; Weihao Xie; Yingshui Tan; Yanghai Wang; Qianqian Xie; Zhaoxiang Zhang; Jiaheng Liu

arXiv:2510.18726·cs.CV·October 22, 2025

TL;DR

This paper introduces IF-VidCap, a benchmark for evaluating how well video captioning models follow specific user instructions, highlighting the need for controllable, instruction-aware captioning systems.

Contribution

The paper presents a new benchmark, IF-VidCap, with a systematic framework to assess instruction-following in video captioning models, and provides a comprehensive evaluation of over 20 models.

Findings

1. Open-source models are closing the performance gap with proprietary ones.
2. Models specialized in dense captioning underperform on complex instructions.
3. Future research should improve both descriptive richness and instruction-following fidelity.

Abstract

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. IF-VidCap is the first benchmark to explicitly evaluate instruction-following in video captioning (27 constraint types), addressing a critical gap beyond traditional accuracy/fluency metrics.
2. It provides 1,400 video-instruction-checklist triplets built via a two-stage pipeline (auto-generation + human refinement), with an 83.6% modification rate and consensus-based validation.
3. It combines deterministic rule-based checks with semantic LLM-as-Judge QA, achieving 96.33% agreement with human judgments for reliable assessment.
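The pairing of rule-based checks with LLM-as-Judge QA described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the constraint helpers (`max_words`, `json_with_keys`), the `evaluate` function, the score names, and the keyword-matching `toy_judge` that stands in for a real LLM judge are hypothetical and are not taken from the benchmark's code or its 27 constraint types.

```python
import json
from typing import Callable, List, Tuple

# Hypothetical deterministic checks; not the benchmark's actual constraints.
def max_words(limit: int) -> Callable[[str], bool]:
    """Return a check that passes when the caption has at most `limit` words."""
    return lambda caption: len(caption.split()) <= limit

def json_with_keys(keys: List[str]) -> Callable[[str], bool]:
    """Return a check that passes when the caption is a JSON object with `keys`."""
    def check(caption: str) -> bool:
        try:
            obj = json.loads(caption)
        except json.JSONDecodeError:
            return False
        return isinstance(obj, dict) and set(keys).issubset(obj)
    return check

def evaluate(
    caption: str,
    rule_checks: List[Callable[[str], bool]],
    qa_items: List[Tuple[str, bool]],
    judge: Callable[[str, str], bool],
) -> dict:
    """Score a caption against a checklist.

    rule_checks: deterministic functions of the caption.
    qa_items: (question, expected yes/no answer) pairs for the judge.
    judge: callable answering a yes/no question about the caption;
           in practice this would wrap an LLM call (assumed, not shown).
    """
    rule_results = [check(caption) for check in rule_checks]
    qa_results = [judge(caption, q) == expected for q, expected in qa_items]
    return {
        "rule_pass_rate": sum(rule_results) / len(rule_results),
        "qa_pass_rate": sum(qa_results) / len(qa_results),
        "all_passed": all(rule_results) and all(qa_results),
    }

def toy_judge(caption: str, question: str) -> bool:
    """Stand-in for an LLM judge: checks whether the question's last word
    appears in the caption. A real judge would reason about semantics."""
    keyword = question.rstrip("?").split()[-1].lower()
    return keyword in caption.lower()

caption = '{"summary": "A dog runs on a beach.", "objects": ["dog", "beach"]}'
rules = [max_words(30), json_with_keys(["summary", "objects"])]
questions = [
    ("Does the caption mention a dog?", True),
    ("Does the caption mention a cat?", False),
]
result = evaluate(caption, rules, questions, toy_judge)
print(result)
```

Keeping the judge behind a plain callable separates the deterministic checks, which can be unit-tested exactly, from the semantic ones, whose reliability depends on the judge model.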

Weaknesses

1. The 1,400 samples are relatively few compared to text-only instruction-following benchmarks (e.g., IFEval, CFBench). The videos average 20.5s and max out at 60s, so the benchmark does not test long-form temporal reasoning or multi-scene narratives.
2. Evaluation focuses on compliance, not quality: it does not assess the fluency, coherence, or creativity of generated captions.
3. Training data distribution gap: the training set uses a "caption-to-instruction" generation method, which may not reflect real user instruction distributions.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

This work demonstrates significant strengths through its creation of IF-VidCap, the first benchmark systematically evaluating instruction-following in video captioning with complex, real-world constraints. The benchmark is built on high-quality, carefully curated data and features a comprehensive, human-validated evaluation protocol. Its extensive experiments across ~20 diverse models yield clear insights into scaling effects and model capabilities, while the accompanying training dataset proves

Weaknesses

The benchmark has several limitations, including its focus on short videos which excludes long-form content and constrained summarization tasks. Its evaluation, while efficient, relies on automated LLM judgments that may miss nuanced errors and depends on proprietary models, raising reproducibility concerns. Although fine-tuning demonstrates improvement, the absolute performance gains remain modest, and the analysis lacks a deeper investigation into the underlying reasons. Furthermore, the paper

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. Valuable and timely evaluation benchmark: the proposed dataset fills a significant gap in assessing instruction-following behavior for video captioning models.
2. Covers a wide range of different settings, including multiple constraint types, compositional tasks, and diverse video sources.
3. Includes a fine-tuning dataset, enabling reproducibility and extension for future research.
4. Two-format setting (rule-based vs. open-ended checking) is well-designed and helps assess both structur

Weaknesses

1. Lack of detail on video selection and preprocessing: It’s unclear how the 350 base videos were chosen and filtered beyond general quality criteria. The authors should provide a full list or dataset summary for reproducibility.
2. Limited discussion on annotation consistency: Although human refinement is mentioned, inter-annotator agreement or quality control statistics are not detailed.
3. Benchmark scope limitation: The dataset focuses primarily on short or medium-length videos (2–60 secon

Code & Models

Datasets

NJU-LINK/IF-VidCap
dataset · 369 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Generative Adversarial Networks and Image Synthesis