Seeking Universal Shot Language Understanding Solutions
Haoxin Liu, Harshavardhan Kamarthi, Zhiyuan Zhao, Hongjie Chen, B. Aditya Prakash

TL;DR
This paper introduces SLU-SUITE, a large dataset and evaluation suite for shot language understanding in films, and proposes universal models that outperform existing methods on diverse cinematic tasks.
Contribution
The paper presents SLU-SUITE, a comprehensive dataset for cinematic shot understanding, and introduces universal SLU models, UniShot and AgentShots, that achieve state-of-the-art performance across multiple tasks.
Findings
SLU-SUITE contains 490K QA pairs across 33 tasks.
Universal models outperform task-specific methods on in-domain tasks.
The universal models surpass commercial VLMs by 22% on out-of-domain tasks.
Abstract
Shot language understanding (SLU) is crucial for cinematic analysis but remains challenging because it spans diverse cinematographic dimensions and depends on subjective expert judgment. While vision-language models (VLMs) have shown strong ability in general visual understanding, recent studies reveal judgment discrepancies between VLMs and film experts on SLU tasks. To address this gap, we introduce SLU-SUITE, a comprehensive training and evaluation suite containing 490K human-annotated QA pairs across 33 tasks spanning six film-grounded dimensions. Using SLU-SUITE, we present two new insights into VLM-based SLU: one from the model side, which diagnoses key bottlenecks among modules, and one from the data side, which quantifies cross-dimensional influences among tasks. These findings motivate our universal SLU solutions from two complementary paradigms: UniShot, a balanced one-for-all generalist trained via…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
