Exploring the Design Space of Visual Context Representation in Video MLLMs
Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Wayne Xin Zhao, Bingning Wang, Weipeng Chen, and Ji-Rong Wen

TL;DR
This paper systematically explores the design space of visual context representation in Video Multimodal Large Language Models, optimizing frame and token selection strategies to enhance video understanding performance.
Contribution
It formulates visual context representation as a constrained optimization problem and derives optimal selection strategies through extensive empirical analysis.
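The constrained formulation can be illustrated with a minimal sketch: pick the number of frames and the number of tokens per frame that minimize a loss, subject to their product fitting within the visual context window. The names `best_allocation` and `toy_loss` are hypothetical, and `toy_loss` is a placeholder standing in for a fitted loss function, not the formula derived in the paper.

```python
from typing import Callable, Iterable, Tuple


def best_allocation(
    budget: int,
    frame_options: Iterable[int],
    token_options: Iterable[int],
    loss_fn: Callable[[int, int], float],
) -> Tuple[int, int]:
    """Return the (frames, tokens_per_frame) pair with the lowest loss
    among all pairs whose product fits within the visual context budget."""
    token_options = list(token_options)  # reused for every frame count
    feasible = [
        (frames, tokens)
        for frames in frame_options
        for tokens in token_options
        if frames * tokens <= budget
    ]
    if not feasible:
        raise ValueError("no (frames, tokens) pair fits the budget")
    return min(feasible, key=lambda pair: loss_fn(*pair))


def toy_loss(frames: int, tokens: int) -> float:
    # Placeholder assumption: loss falls as either quantity grows.
    # This is NOT the paper's fitted formula.
    return 1.0 / frames + 1.0 / tokens


if __name__ == "__main__":
    options = [1, 2, 4, 8, 16, 32, 64]
    print(best_allocation(64, options, options, toy_loss))
```

With a symmetric placeholder loss like this one, the search settles on a balanced split of the budget. A loss fitted to real training runs would generally shift that balance toward frames or toward tokens.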
Findings
Optimal frame and token selection strategies improve model performance.
Scaling effects influence the effectiveness of selection strategies.
Derived formulas align with empirical results for best performance.
Abstract
Video Multimodal Large Language Models (MLLMs) have shown a remarkable ability to understand video semantics across various downstream tasks. Despite these advances, systematic research on visual context representation is still lacking; the term refers to the scheme for selecting frames from a video and then selecting tokens from each frame. In this paper, we explore the design space for visual context representation and aim to improve the performance of video MLLMs by finding more effective representation schemes. First, we formulate the task of visual context representation as a constrained optimization problem, and model the language modeling loss as a function of the number of frames and the number of embeddings (or tokens) per frame, given the maximum visual context window size. Then, we explore the scaling effects in frame selection and token selection respectively, and…
Taxonomy
Topics: Semantic Web and Ontologies · Speech and dialogue systems · Recommender Systems and Techniques
