SF20K Competition 2025: Summary and findings
Ridouane Ghermi, Xi Wang, Vicky Kalogeiton, Ivan Laptev

TL;DR
The SF20K Competition 2025 targets story-level video understanding with an open-ended question-answering benchmark built on amateur short films. Its results show that shot-level processing, multi-stage pipelines, and subtitle quality all matter for long-form video QA.
Contribution
This paper introduces the SF20K benchmark, evaluates diverse models, and uncovers that narrative understanding and subtitle quality are critical for long-form video QA.
Findings
Shot-level processing outperforms uniform frame sampling.
Smaller multi-stage pipelines can match larger models' performance.
Subtitle quality significantly impacts question-answering accuracy.
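The first finding contrasts two ways of choosing which frames a model sees. The sketch below illustrates the difference and is not the method of any competing team. It assumes shot boundaries are already available as (start, end) frame ranges from some external shot detector; the function names `uniform_sample` and `shot_level_sample` are hypothetical.

```python
from typing import List, Tuple


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick k frame indices evenly spaced over the video.

    Each index is the midpoint of one of k equal-length segments.
    """
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def shot_level_sample(shots: List[Tuple[int, int]]) -> List[int]:
    """Pick the middle frame of every shot.

    `shots` holds (start, end) frame ranges with `end` exclusive.
    Empty or inverted ranges are skipped.
    """
    return [(start + end) // 2 for start, end in shots if end > start]


# Toy video: 100 frames, one long shot flanked by two short ones (assumed data).
shots = [(0, 10), (10, 90), (90, 100)]
uniform_idx = uniform_sample(100, k=3)
shot_idx = shot_level_sample(shots)
```

In this toy case, with the same budget of three frames, uniform sampling places every frame inside the long middle shot, while shot-level sampling takes one frame from each shot. That is one plausible reason shot-aware processing could help on edited films, though the summary itself does not state the cause.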
Abstract
This report presents the results and findings of the first edition of the Short-Films 20K (SF20K) Competition, held in conjunction with the SLoMO Workshop at ICCV 2025. The competition is designed to advance story-level video understanding beyond short-clip action recognition, introducing an open-ended video question-answering task built on a corpus of amateur short films. This setup ensures that models must rely on multimodal understanding rather than memorization of popular movies. Evaluation is conducted using the SF20K-Test benchmark (95 movies, 979 question-answer pairs) and scored via LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The competition attracted 22 teams and 286 submissions across two tracks: a Main Track with unrestricted model size and a Special Track limited to models under 8 billion parameters. The winning team achieved 65.7% accuracy on the Main Track and…
