Inference Compute-Optimal Video Vision Language Models
Peiqi Wang, ShengYun Peng, Xuewen Zhang, Hanchao Yu, Yibo Yang, Lifu Huang, Fujun Liu, Qifan Wang

TL;DR
This paper explores how to optimally allocate inference compute in video vision language models by analyzing the effects of model size, frame count, and visual tokens, providing practical guidance for resource-constrained settings.
Contribution
It introduces a method to identify the inference compute-optimal configuration of video vision language models considering key scaling factors and resource constraints.
Findings
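Task performance depends on the three scaling factors and on finetuning data size.
Changing the finetuning data size shifts the compute-optimal frontier.
The findings translate into practical recommendations for choosing these scaling factors under a fixed inference compute budget.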
Abstract
This work investigates the optimal allocation of inference compute across three key scaling factors in video vision language models: language model size, frame count, and the number of visual tokens per frame. While prior work typically focuses on optimizing model efficiency or improving performance without considering resource constraints, we instead identify the optimal model configuration under fixed inference compute budgets. We conduct large-scale training sweeps and careful parametric modeling of task performance to identify the inference compute-optimal frontier. Our experiments reveal how task performance depends on scaling factors and finetuning data size, as well as how changes in data size shift the compute-optimal frontier. These findings translate to practical tips for selecting these scaling factors.
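As a rough illustration of what "compute-optimal under a fixed budget" means, the sketch below enumerates candidate configurations and keeps the one with the lowest predicted error among those that fit the budget. It is not the paper's method. The cost model (about 2 FLOPs per language model parameter per visual token) is a common rule of thumb assumed here, and `toy_error` is a purely hypothetical stand-in for the paper's fitted parametric model, whose actual form is not given in this summary. All function names are illustrative.

```python
from itertools import product


def inference_flops(n_params, n_frames, tokens_per_frame):
    # Assumption: a forward pass costs ~2 FLOPs per parameter per visual token,
    # and the language model dominates the cost (vision encoder ignored).
    return 2.0 * n_params * n_frames * tokens_per_frame


def toy_error(n_params, n_frames, tokens_per_frame):
    # Hypothetical additive power law; exponents are made up for illustration
    # and are NOT taken from the paper.
    return (n_params / 1e9) ** -0.3 + n_frames ** -0.5 + tokens_per_frame ** -0.4


def best_config(budget_flops, sizes, frames, tokens, error_fn=toy_error):
    # Keep only (model size, frame count, tokens per frame) triples within budget.
    feasible = [
        cfg for cfg in product(sizes, frames, tokens)
        if inference_flops(*cfg) <= budget_flops
    ]
    if not feasible:
        return None  # no configuration fits the budget
    # Choose the feasible configuration with the lowest predicted error.
    return min(feasible, key=lambda cfg: error_fn(*cfg))


if __name__ == "__main__":
    cfg = best_config(
        budget_flops=1e13,
        sizes=[1e9, 7e9],
        frames=[8, 32],
        tokens=[16, 64],
    )
    print("selected (params, frames, tokens per frame):", cfg)
```

Repeating the search over a range of budgets traces out a frontier of selected configurations. With a performance model fitted to real training sweeps in place of `toy_error`, the same loop would show how the preferred trade-off among the three factors moves as the budget grows.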
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
