ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu

TL;DR
ShotBench is a new benchmark for evaluating cinematic understanding in vision-language models, revealing their current limitations and guiding the development of improved models like ShotVL.
Contribution
The paper introduces ShotBench, a comprehensive cinematic language understanding benchmark, and ShotQA, a large-scale dataset, along with a fine-tuned model ShotVL that surpasses existing approaches.
Findings
Current VLMs score below 60% accuracy on ShotBench.
ShotVL significantly outperforms existing models on cinematic QA tasks.
Open-sourced data, models, and code to accelerate research.
Abstract
Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
