CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval
Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang

TL;DR
CaReBench is a new benchmark with detailed annotations and tailored metrics for evaluating fine-grained video captioning and retrieval, addressing limitations of existing short-description datasets.
Contribution
It introduces a comprehensive benchmark with spatial and temporal annotations, new evaluation metrics, and a unified baseline model for detailed video understanding tasks.
Findings
The benchmark enables detailed evaluation of spatial and temporal biases.
The proposed baseline achieves competitive results in both retrieval and captioning.
New metrics ReBias and CapST provide insights into model biases and performance.
Abstract
Video understanding, including video captioning and retrieval, remains a major challenge for video-language models (VLMs). Existing video retrieval and captioning benchmarks include only short descriptions, which limits their ability to evaluate detailed video understanding. To address this problem, we present CaReBench, a testing benchmark for fine-grained video captioning and retrieval with 1,000 high-quality pairs of videos and human-annotated detailed captions. Uniquely, it provides manually separated spatial annotations and temporal annotations for each video. Based on this design, we introduce two evaluation metrics, ReBias and CapST, specifically tailored for video retrieval and video captioning tasks, respectively. These metrics enable a comprehensive investigation into the spatial and temporal biases inherent in VLMs. In addition, to handle both video retrieval and video…
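Because each video has separate spatial and temporal annotations, retrieval quality can be measured against each caption type independently. The sketch below is only an illustration of that idea, not the paper's ReBias definition (which this page does not give). It computes standard text-to-video Recall@K on two hypothetical similarity matrices and reports the difference. The function name `recall_at_k`, the toy scores, and the `gap` quantity are all assumptions made for this example.

```python
from typing import Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int) -> float:
    """Text-to-video Recall@K.

    sim[i][j] is the similarity between caption i and video j. The
    ground-truth video for caption i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Indices of the k videos most similar to caption i.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(sim)


# Hypothetical similarity scores for 3 videos (toy numbers, not from the paper).
spatial_sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.3],
    [0.1, 0.3, 0.7],
]
temporal_sim = [
    [0.6, 0.7, 0.1],
    [0.2, 0.5, 0.4],
    [0.3, 0.6, 0.2],
]

spatial_r1 = recall_at_k(spatial_sim, k=1)
temporal_r1 = recall_at_k(temporal_sim, k=1)

# Illustrative gap only; a positive value means spatial captions retrieve
# the correct video more often than temporal captions in this toy setup.
gap = spatial_r1 - temporal_r1
print(f"spatial R@1={spatial_r1:.2f}, temporal R@1={temporal_r1:.2f}, gap={gap:.2f}")
```

In practice the similarity matrices would come from a model's caption and video embeddings. Here they are hard-coded so the example runs on its own.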
Peer Reviews
Decision: ICLR 2026 Poster
1. **Innovative Benchmark Design:** CaReBench introduces a uniquely structured dataset with hierarchical annotations that explicitly separate spatial and temporal descriptions, filling a clear gap in current benchmarks.
2. **New Evaluation Metrics:** ReBias and CapST are well-motivated and address the shortcomings of existing metrics (e.g., CIDEr, AutoDQ, VDCScore), providing more interpretable and fine-grained evaluations.
3. **Unified Framework:** The CARE model elegantly unifies retrieval a
1. **Reliance on LLM-Based Evaluation:** CapST depends on an LLM (DeepSeek-V3) as the evaluator, which may introduce bias. The paper lacks inter-rater consistency checks or human alignment experiments.
2. **Theoretical Depth:** The paper mainly focuses on empirical contributions. The “unified mapping” idea ($\phi: \mathbb{R}^{T \times H \times W \times C} \to \mathbb{R}^{D}$) is intriguing but not theoretically explored.
3. **Incomplete Bias Mitigation:** While ReBias reveals spatiotemporal imbalance, the proposed model still shows clear tempor
1. CaReBench demonstrates notable strengths across key areas. Its originality lies in the novel hierarchical annotation schema, which divides video content into summary, objects, actions, and miscellaneous categories and addresses a critical gap in video understanding evaluation. The introduction of specialized ReBias and CapST metrics further showcases innovation, leveraging the dataset's structure to examine biases and caption quality.
2. The work exhibits strong methodological rigor through its
1. The work has several limitations that warrant consideration. While the paper effectively identifies significant spatiotemporal biases in models, it stops at measurement and does not propose methods to mitigate these biases. This focus on benchmarking over developing corrective techniques somewhat limits its immediate practical impact on improving model design.
2. The generalization capability of the proposed CARE model remains uncertain. Its evaluation on standard benchmarks beyond CaReBench
1. Proposes CaReBench, a fine-grained benchmark with detailed spatial and temporal annotations and new metrics ReBias and CapST for analyzing spatiotemporal bias.
2. Introduces a unified model CARe that jointly handles video captioning and retrieval through a two-stage training framework.
3. Demonstrates strong experimental results and clear analysis, supported by high-quality human annotations and well-presented methodology.
See questions.
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
