Frame-Voyager: Learning to Query Frames for Video Large Language Models
Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xiaolei Xu, Zhenbang Sun, Bingni Zhang, Jiawei Wu, Hao Zhang, Qianru Sun

TL;DR
Frame-Voyager is a novel method that learns to select the most informative frame combinations for video question answering, improving the efficiency and performance of Video-LLMs by addressing input length constraints.
Contribution
It introduces a learning-based frame querying approach with a new data labeling pipeline, enhancing video understanding in Video-LLMs beyond traditional sampling methods.
Findings
Significantly improves Video-LLM performance on video question answering benchmarks.
Selects informative frame combinations effectively, easing the input-length constraint of Video-LLMs.
Demonstrates versatility across different Video-LLMs and tasks.
Abstract
Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks. However, they are constrained by the maximum length of input tokens, making it impractical to input entire videos. Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for the information density variations in the videos or the complex instructions in the tasks, leading to sub-optimal performance. In this paper, we propose Frame-Voyager that learns to query informative frame combinations, based on the given textual queries in the task. To train Frame-Voyager, we introduce a new data collection and labeling pipeline, by ranking frame combinations using a pre-trained Video-LLM. Given a video of M frames, we traverse its T-frame combinations, feed them into a Video-LLM, and rank them based on Video-LLM's prediction losses. Using this…
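The labeling step described in the abstract (traverse the T-frame combinations of an M-frame video, score each with a Video-LLM, rank by prediction loss) can be illustrated with a minimal sketch. The function names `rank_frame_combinations` and `toy_loss` are hypothetical, and `toy_loss` is only a stand-in for a real Video-LLM loss; it assumes that frames 2 and 5 contain the answer. This is not the authors' implementation.

```python
from itertools import combinations
from math import comb
from typing import Callable, List, Tuple

Combo = Tuple[int, ...]


def rank_frame_combinations(
    num_frames: int,
    t: int,
    loss_fn: Callable[[Combo], float],
) -> List[Tuple[Combo, float]]:
    """Enumerate all t-frame combinations of num_frames frames and
    sort them by ascending loss (lower loss = more useful combination)."""
    scored = [(c, loss_fn(c)) for c in combinations(range(num_frames), t)]
    scored.sort(key=lambda item: item[1])
    return scored


def toy_loss(combo: Combo) -> float:
    """Hypothetical stand-in for a Video-LLM prediction loss.

    Assumption for illustration only: frames 2 and 5 carry the answer,
    and the loss is the number of those frames missing from the combination.
    """
    relevant = {2, 5}
    return float(len(relevant - set(combo)))


# Toy example: M = 8 candidate frames, T = 2 frames per combination.
ranking = rank_frame_combinations(num_frames=8, t=2, loss_fn=toy_loss)
best_combo, best_loss = ranking[0]
print(f"{len(ranking)} combinations scored; best = {best_combo}, loss = {best_loss}")
```

In a real pipeline, `loss_fn` would feed the selected frames and the question into a pre-trained Video-LLM and return its loss on the ground-truth answer, so each call is far more expensive than in this toy version.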
Peer Reviews
Decision: ICLR 2025 Poster
This is a reasonable extension from previous keyframe selection work [1] where it relies more on single keyframe selection, and does not consider temporal relations/modeling among frames. The proposed pseudo-label scheme from a VLM + list of combinations of frames makes sense to me. Authors also conduct extensive experiments across diverse popular benchmarks to show their effectiveness. Overall, I think the proposed Frame-Voyager makes a good contribution to the keyframe selection in video-lan
1. As the paper mentioned, this pseudo-label strategy is not scalable, and I think the frame combination part is a bit tricky. This is like creating artificial rewards according to video (like a sandbox in RL) to train the reward model. However, the reward is not always reliable from a VLM even though we compute it according to GT answers. For example, as shown in some previous studies [2], the model will generate correct answers when provided with wrong localized clips. It is true that we mig
1. The motivation is clear and easy-to-follow. 2. The presentation of the paper is of high quality. 3. The analysis is comprehensive.
1. The inflexibility of the proposed frame selection method. Different videos usually have different “information density”, which means that some videos can be represented by even a single frame while others need densely sampled frames. This would also be affected by the query type. The number of keyframes in the proposed framework seems to be a fixed hyper-parameter, which would be hard to generalize to all different videos in the wild (for different length / query types, need different hyper-para
**1. Originality:** The novel formulation of frame selection as a ranking problem is compelling. This approach minimizes the need for extensive manual labeling to obtain ground truth, offering a fresh and efficient perspective on the task. **2. Significance:** Frame selection is a critical challenge in video understanding due to the high dimensionality of video data and the computational demands involved. Traditional uniform sampling often misses relevant content, making this problem a key focu
**1. Data Collection Cost:** The number of frame combinations increases exponentially with M and T (L166), leading the authors to limit M to 16 or 32 and T to 2 or 4. Despite these restrictions, evaluating $C(32, 4) \approx 36K$ combinations for a single QA sample is still costly, raising concerns about data collection efficiency. Thus, this paper needs to provide more details on the computational resources and time required for data collection. **2. Generalization to More Frames:** The method
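The combination counts behind the data collection cost concern can be checked directly. The sketch below computes C(M, T) for the values of M and T quoted in the review; pairing every M with every T is an assumption made here for illustration, and the review does not say which pairings were actually used.

```python
from math import comb

# Values of M (candidate frames) and T (frames per combination) quoted in the review.
# Listing all four pairings is an assumption for illustration.
counts = {(m, t): comb(m, t) for m in (16, 32) for t in (2, 4)}

for (m, t), n in counts.items():
    print(f"C({m}, {t}) = {n}")
```

Each combination requires one Video-LLM forward pass per QA sample, so the count translates directly into labeling cost.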
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
