TL;DR
GridProbe is a training-free, test-time frame selection method for long-video vision-language models. It probes a frozen model's own answer posteriors to pick question-relevant frames, reducing compute while maintaining accuracy and adding interpretability.
Contribution
It proposes a training-free, posterior-probing inference paradigm that adaptively selects relevant frames based on question difficulty, improving efficiency without retraining.
Findings
Matches baseline accuracy with 3.36x less compute on Video-MME-v2.
Pareto-dominates the baseline on LongVideoBench at 0.35x the compute.
Decoupling selector and QA models enhances efficiency and accuracy.
Abstract
Long-video understanding in VLMs is bottlenecked by a single monolithic forward pass over thousands of frames at quadratic attention cost. A common mitigation is to first select a small subset of informative frames before the forward pass; training-free selectors typically do this via auxiliary encoder-space similarities. Such signals are capped by contrastive pretraining, which usually fails on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). We propose GridProbe, an efficient training-free posterior-probing inference paradigm that scores evidence in answer space using a frozen VLM's own reasoning and then selects question-relevant frames adaptively, resulting in sub-quadratic attention cost with little to no accuracy loss. We arrange frames on a grid and run lightweight row R and column C probes, where each probe reads its peak posterior as a…
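The grid idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it says how probe outputs are used, so the combination rule below (multiplying a frame's row confidence by its column confidence) is an assumption, as are the names grid_probe_select, toy_probe, and probe_cost_units. The probe callable stands in for a frozen VLM that sees only the given frames and returns its peak answer posterior. The cost helper assumes attention cost grows with the square of the number of frames in a pass.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical probe type: takes frame indices, returns a confidence in [0, 1].
Probe = Callable[[Sequence[int]], float]


def grid_probe_select(num_frames: int, rows: int, cols: int,
                      probe: Probe, keep: int) -> List[int]:
    """Pick `keep` frames by scoring a rows x cols grid with row/column probes."""
    if rows * cols < num_frames:
        raise ValueError("grid too small for the number of frames")

    # Lay frame indices out row-major; trailing cells may stay empty (None).
    grid: List[List[Optional[int]]] = [
        [r * cols + c if r * cols + c < num_frames else None for c in range(cols)]
        for r in range(rows)
    ]

    def confidence(cells: Sequence[Optional[int]]) -> float:
        frames = [i for i in cells if i is not None]
        return probe(frames) if frames else 0.0

    # One probe per row and one per column: rows + cols probes in total.
    row_conf = [confidence(grid[r]) for r in range(rows)]
    col_conf = [confidence([grid[r][c] for r in range(rows)]) for c in range(cols)]

    # ASSUMPTION: a frame's score is row confidence times column confidence.
    scored = [
        (row_conf[r] * col_conf[c], grid[r][c])
        for r in range(rows) for c in range(cols)
        if grid[r][c] is not None
    ]
    # Highest score first; ties broken by earlier frame index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return sorted(idx for _, idx in scored[:keep])


def probe_cost_units(rows: int, cols: int) -> int:
    """Total probe cost if each pass costs (frames in the pass) squared.

    ASSUMPTION: each row probe sees `cols` frames, each column probe `rows`.
    """
    return rows * cols ** 2 + cols * rows ** 2


def toy_probe(frames: Sequence[int]) -> float:
    """Stand-in for a VLM probe: confident only if a relevant frame is present."""
    relevant = {5, 6}  # hypothetical ground-truth evidence frames
    return 0.9 if relevant & set(frames) else 0.25


if __name__ == "__main__":
    print(grid_probe_select(12, 3, 4, toy_probe, keep=2))  # [5, 6]
    print(probe_cost_units(3, 4), "vs", (3 * 4) ** 2)      # 84 vs 144
```

Under these assumptions, a square grid over N frames needs about 2 * N^1.5 cost units for probing versus N^2 for one monolithic pass, which is one way to read the abstract's "sub-quadratic" claim. The final QA pass over the kept frames adds a further keep^2 units.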
