CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Jiyuan Wang; Huan Ouyang; Jiuzhou Lin; Chunyu Lin; Dewen Fan; Boheng Zhang; Haonan Fan; Fei Zuo; Jia Sun; Huaiqing Wang; Honglie Wang; Yiyang Fan; Zhenlong Yuan; Zijun Li; Yongrui Heng; Guosheng Lin; Fan Yang; Tingting Gao

arXiv:2605.11723 · cs.CV · May 13, 2026

TL;DR

This paper introduces CaC, a hierarchical video reward model that improves anomaly detection accuracy and interpretability through a coarse-to-fine spatiotemporal reasoning process, trained on a large-scale annotated dataset.

Contribution

The paper presents a novel hierarchical anomaly reward model with a new large-scale dataset and a three-stage training paradigm incorporating reinforcement learning and IoU-based rewards.
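To make the idea of an IoU-based reward concrete, here is a minimal sketch of intersection-over-union for a temporal window and for a bounding box. The summary does not specify how CaC combines or weights these terms, so the function names (`temporal_iou`, `box_iou`), the `(start, end)` and `(x1, y1, x2, y2)` conventions, and the use of raw IoU as the score are all assumptions made for illustration.

```python
from typing import Tuple

# Assumed conventions (not taken from the paper):
Window = Tuple[float, float]             # (start, end), with end >= start
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def temporal_iou(pred: Window, gt: Window) -> float:
    """IoU of two 1-D time intervals; 0.0 when they do not overlap."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gt: Box) -> float:
    """IoU of two axis-aligned boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0
```

A predicted window or box that matches the annotation exactly scores 1.0, and a disjoint one scores 0.0, which gives a reinforcement learning objective a graded signal for localization quality rather than a binary hit or miss.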

Findings

01. CaC achieves a 25.7% accuracy improvement on anomaly benchmarks.

02. Using CaC as a reward reduces generated-video anomalies by 11.7%.

03. The model demonstrates stable focus on subtle anomalies through structured reasoning.

Abstract

In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model first learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and is then optimized by a reinforcement learning strategy based on two-turn Group Relative…
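The coarse-to-fine inference order described in the abstract (temporal scan, then spatial grounding inside the localized window, then a final judgment) can be sketched as plain control flow. This is only an illustration of the ordering: `scan`, `ground`, and `judge` are hypothetical callables standing in for model calls, the `Judgment` container is invented here, and returning "not anomalous" when no window is found is an assumption, not something the abstract states.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Sequence, Tuple

Window = Tuple[int, int]                 # inclusive frame indices (assumed convention)
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) (assumed convention)


@dataclass
class Judgment:
    """Hypothetical result container; not the paper's output format."""
    window: Optional[Window]
    boxes: List[Box] = field(default_factory=list)
    anomalous: bool = False


def coarse_to_fine(
    frames: Sequence[Any],
    scan: Callable[[Sequence[Any]], Optional[Window]],
    ground: Callable[[Any], Box],
    judge: Callable[[Sequence[Any], List[Box]], bool],
) -> Judgment:
    # Stage 1: global temporal scan over the whole video.
    window = scan(frames)
    if window is None:
        # Assumption: no anchored window means nothing further to inspect.
        return Judgment(window=None)

    # Stage 2: spatial grounding only on frames inside the anchored window.
    start, end = window
    clip = frames[start : end + 1]
    boxes = [ground(frame) for frame in clip]

    # Stage 3: final judgment from the localized evidence.
    return Judgment(window=window, boxes=boxes, anomalous=judge(clip, boxes))
```

Passing the stages in as callables keeps the sketch independent of any particular model API; in practice each would wrap a Vision-Language Model prompt.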
