Exploring the Design Space of Visual Context Representation in Video MLLMs

Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Wayne Xin Zhao, Bingning Wang, Weipeng Chen, and Ji-Rong Wen

arXiv:2410.13694 · cs.CV · October 18, 2024

TL;DR

This paper systematically explores the design space of visual context representation in Video Multimodal Large Language Models, optimizing frame and token selection strategies to enhance video understanding performance.

Contribution

It formulates visual context representation as a constrained optimization problem and derives optimal selection strategies through extensive empirical analysis.
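As a rough sketch of what such a formulation could look like, the problem can be written as a budgeted minimization. The symbols here are assumptions chosen for illustration and may not match the paper's notation: T is the number of sampled frames, M is the number of tokens kept per frame, \mathcal{L}(T, M) is the language modeling loss, and L_{\max} is the maximum visual context window size.

```latex
\min_{T,\,M}\; \mathcal{L}(T, M)
\quad \text{subject to} \quad T \cdot M \le L_{\max}
```

Under this reading, spending more of the budget on frames leaves fewer tokens for each frame, and the other way around, so the best choice depends on how the loss responds to each quantity.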

Findings

01

Optimal frame and token selection strategies improve model performance.

02

Scaling effects influence the effectiveness of selection strategies.

03

Derived formulas align with empirical results for best performance.

Abstract

Video Multimodal Large Language Models (MLLMs) have shown a remarkable capability for understanding video semantics across various downstream tasks. Despite these advances, there is still a lack of systematic research on visual context representation, which refers to the scheme for selecting frames from a video and then selecting tokens from each frame. In this paper, we explore the design space for visual context representation and aim to improve the performance of video MLLMs by finding more effective representation schemes. First, we formulate the task of visual context representation as a constrained optimization problem and model the language modeling loss as a function of the number of frames and the number of embeddings (or tokens) per frame, given the maximum visual context window size. Then, we explore the scaling effects in frame selection and token selection respectively, and…
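The trade-off described in the abstract (how many frames versus how many tokens per frame, under a fixed visual context budget) can be illustrated with a small grid search. This is a minimal sketch, not the paper's method: `best_allocation`, `toy_loss`, the candidate grids, and the budget value are all hypothetical, and `toy_loss` is a made-up placeholder rather than a loss function fitted in the paper.

```python
from typing import Callable, Iterable, Optional, Tuple


def best_allocation(
    loss_fn: Callable[[int, int], float],
    budget: int,
    frame_options: Iterable[int],
    token_options: Iterable[int],
) -> Optional[Tuple[int, int, float]]:
    """Return the (frames, tokens_per_frame, loss) with the lowest loss
    among candidates whose total visual tokens fit within the budget.
    Returns None when no candidate fits."""
    token_options = list(token_options)  # allow reuse across the outer loop
    best: Optional[Tuple[int, int, float]] = None
    for frames in frame_options:
        for tokens in token_options:
            # Constraint: total visual tokens must fit in the context window.
            if frames * tokens > budget:
                continue
            loss = loss_fn(frames, tokens)
            if best is None or loss < best[2]:
                best = (frames, tokens, loss)
    return best


def toy_loss(frames: int, tokens: int) -> float:
    """Hypothetical placeholder loss (an assumption for illustration only):
    it decreases as either the frame count or the per-frame token count grows."""
    return 1.0 + 2.0 / frames ** 0.5 + 1.0 / tokens ** 0.5


# Hypothetical candidate grids and budget, chosen only for the example.
FRAME_OPTIONS = [8, 16, 32, 64, 128]
TOKEN_OPTIONS = [16, 49, 100, 196]
BUDGET = 6000

print(best_allocation(toy_loss, BUDGET, FRAME_OPTIONS, TOKEN_OPTIONS))
```

In practice the loss for each candidate would come from measured training or evaluation runs, or from a function fitted to them, rather than from a closed-form placeholder like the one above.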

Code & Models

Repositories

rucaibox/opt-visor
PyTorch · Official

Taxonomy

Topics: Semantic Web and Ontologies · Speech and dialogue systems · Recommender Systems and Techniques