TL;DR
Video-Oasis is a diagnostic suite that audits existing video understanding benchmarks. It finds that 54% of their samples are solvable without visual input or temporal context, and that state-of-the-art models barely exceed random guessing on the rest.
Contribution
Rather than proposing another benchmark, the paper introduces a diagnostic suite that systematically evaluates existing video understanding benchmarks, exposes their limitations, and offers guidelines for future work.
Findings
54% of existing benchmark samples are solvable without visual input or temporal context
State-of-the-art models barely outperform random guessing on the remaining samples
Provides practical guidelines for designing more effective video understanding algorithms
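The first two findings can be pictured as a simple filtering pass over benchmark samples. The sketch below is illustrative only: this summary does not describe the paper's actual protocol, so the `Sample` fields, the `diagnose` function, and the toy data are all hypothetical. It assumes a multiple-choice setting in which each sample has already been scored under three conditions (question text only, a single frame, and the full video).

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # All fields are hypothetical; they are not taken from the Video-Oasis release.
    num_options: int      # number of multiple-choice options
    blind_correct: bool   # answered correctly from the question text alone
    static_correct: bool  # answered correctly from a single frame (no temporal context)
    full_correct: bool    # answered correctly with the full video


def diagnose(samples):
    """Split samples into shortcut-solvable and remaining, then score the remainder."""
    if not samples:
        raise ValueError("need at least one sample")
    # A sample counts as a shortcut if it is solvable without vision or without time.
    remaining = [s for s in samples if not (s.blind_correct or s.static_correct)]
    shortcut_rate = 1 - len(remaining) / len(samples)
    if remaining:
        accuracy = sum(s.full_correct for s in remaining) / len(remaining)
        # Expected accuracy of uniform random guessing on the remaining samples.
        chance = sum(1 / s.num_options for s in remaining) / len(remaining)
    else:
        accuracy = chance = float("nan")
    return {
        "shortcut_rate": shortcut_rate,
        "remaining_accuracy": accuracy,
        "chance": chance,
    }


# Toy data, made up for illustration; not results from the paper.
toy = [
    Sample(4, True, False, True),
    Sample(4, False, True, True),
    Sample(4, False, False, True),
    Sample(4, False, False, False),
]
report = diagnose(toy)
print(report)
```

Comparing `remaining_accuracy` against `chance` is what makes a statement such as "barely outperform random guessing" checkable: the gap between the two is the part of model performance that cannot be attributed to shortcuts, under the assumptions above.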
Abstract
The inherent complexity of video understanding makes it difficult to attribute whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. While many benchmarks have emerged to assess high-level reasoning, the essential criteria that constitute video understanding remain largely overlooked. Instead of introducing yet another benchmark, we take a step back to re-examine the current landscape of video understanding. In this work, we provide Video-Oasis, a sustainable diagnostic suite designed to systematically evaluate existing evaluations and distill spatio-temporal challenges for video understanding. Our analysis reveals two critical findings: (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models exhibit performance barely exceeding random guessing. To…
