Inference Compute-Optimal Video Vision Language Models

Peiqi Wang; ShengYun Peng; Xuewen Zhang; Hanchao Yu; Yibo Yang; Lifu Huang; Fujun Liu; Qifan Wang

arXiv:2505.18855·cs.CV·May 27, 2025

TL;DR

This paper explores how to optimally allocate inference compute in video vision language models by analyzing the effects of model size, frame count, and visual tokens, providing practical guidance for resource-constrained settings.

Contribution

The paper introduces a method for identifying the inference compute-optimal configuration of a video vision language model, that is, the choice of language model size, frame count, and visual tokens per frame that performs best under a fixed inference compute budget.

Findings

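01 Task performance depends on the scaling factors (language model size, frame count, and visual tokens per frame) and on finetuning data size.

02 Changing the finetuning data size shifts the compute-optimal frontier.

03 The results translate into practical recommendations for selecting the scaling factors under a limited compute budget.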

Abstract

This work investigates the optimal allocation of inference compute across three key scaling factors in video vision language models: language model size, frame count, and the number of visual tokens per frame. While prior work typically focuses on optimizing model efficiency or improving performance without considering resource constraints, we instead identify the optimal model configuration under a fixed inference compute budget. We conduct large-scale training sweeps and careful parametric modeling of task performance to identify the inference compute-optimal frontier. Our experiments reveal how task performance depends on the scaling factors and finetuning data size, as well as how changes in data size shift the compute-optimal frontier. These findings translate into practical tips for selecting these scaling factors.
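To make the idea of a compute-optimal configuration concrete, here is a minimal sketch of a budget-constrained search over the three scaling factors. It is not the paper's method: the cost formula is the common rule of thumb of roughly 2 FLOPs per parameter per token for a forward pass, and `toy_error` is a hypothetical additive power law with made-up coefficients standing in for a fitted parametric performance model. All function names and the candidate grids are assumptions for illustration.

```python
from itertools import product


def inference_flops(params, frames, tokens_per_frame, text_tokens=0):
    """Approximate forward-pass cost (assumption: ~2 FLOPs per parameter per token)."""
    return 2.0 * params * (frames * tokens_per_frame + text_tokens)


def toy_error(params, frames, tokens_per_frame):
    """Hypothetical placeholder for a fitted performance model.

    The functional form and every coefficient here are invented for
    illustration; they are NOT taken from the paper.
    """
    return (
        0.2
        + 5.0 * params ** -0.15
        + 0.5 * frames ** -0.5
        + 0.5 * tokens_per_frame ** -0.5
    )


def best_config_under_budget(budget_flops, sizes, frame_counts, token_counts,
                             error_fn=toy_error):
    """Return the lowest-error configuration whose cost fits the budget, or None."""
    feasible = [
        (error_fn(n, t, v), n, t, v)
        for n, t, v in product(sizes, frame_counts, token_counts)
        if inference_flops(n, t, v) <= budget_flops
    ]
    if not feasible:
        return None
    _, n, t, v = min(feasible)
    return {"params": n, "frames": t, "tokens_per_frame": v}


if __name__ == "__main__":
    # Illustrative candidate grids (assumed values, not from the paper).
    config = best_config_under_budget(
        budget_flops=1e13,
        sizes=[1e9, 7e9],
        frame_counts=[4, 16, 64],
        token_counts=[16, 64, 196],
    )
    print(config)
```

Repeating the search across a range of budgets traces out a frontier of best configurations per budget; replacing `toy_error` with a model fitted to real training sweeps is what would make that frontier meaningful.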

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques