Long Story Short: Story-level Video Understanding from 20K Short Films
Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, Ivan Laptev

TL;DR
This paper introduces SF20K, a large dataset of 20,143 amateur films designed to advance long-term video understanding and reasoning, and demonstrates that instruction tuning on this dataset enhances model performance.
Contribution
The paper presents SF20K, the largest publicly available long-form movie dataset, enabling new long-term video tasks and addressing data leakage issues in existing datasets.
Findings
SF20K enables long-term video reasoning tasks.
Recent VLMs perform well on SF20K.
Instruction tuning improves model performance significantly.
Abstract
Recent developments in vision-language models have significantly advanced video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often depict activities of one person in a single scene. Although existing movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos, and frequently encounter data leakage issues given the use of subtitles and other information about commercial movies during LLM pretraining. To address the above limitations, we propose Short-Films 20K (SF20K), the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-term video tasks in the form of multiple-choice and open-ended question answering. Our…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
Methods: Sparse Evolutionary Training
