Scaling Capability in Token Space: An Analysis of Large Vision Language Model

Tenghui Li; Guoxu Zhou; Xuyang Zhao; Qibin Zhao

arXiv:2412.18387 · cs.AI · December 30, 2025

TL;DR

This paper develops a theoretical framework to analyze how the number of vision tokens influences the performance of large vision-language models, revealing two distinct scaling regimes and validating predictions with empirical data.

Contribution

It introduces a mathematical model linking vision token count to model performance, identifying sublinear and linear scaling regimes, and empirically validating these predictions across benchmarks.

Findings

01

Two scaling regimes identified: sublinear and linear.

02

Model performance aligns with the theoretical scaling predictions.

03

Empirical results support the proposed mathematical framework.

Abstract

Large language models have demonstrated predictable scaling behaviors with respect to model parameters and training data. This study investigates whether a similar scaling relationship exists for vision-language models with respect to the number of vision tokens. A mathematical framework is developed to characterize the relationship between the number of vision tokens and the expected divergence of distance between vision-referencing sequences. The theoretical analysis reveals two distinct scaling regimes: sublinear scaling for fewer vision tokens and linear scaling for more vision tokens. This aligns with model performance relationships of the form \(S(n) \approx c / n^{\alpha(n)}\), where the scaling exponent relates to the correlation structure between vision token representations. Empirical validations across multiple vision-language benchmarks show that model performance matches the prediction…
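To make the functional form \(S(n) \approx c / n^{\alpha(n)}\) concrete, the following minimal sketch evaluates a power law with a constant exponent and recovers that exponent from the slope between two points in log-log space. The function names `power_law` and `local_exponent`, the constant `c = 2.0`, the exponent `0.5`, and the token counts are all hypothetical and chosen for illustration; they are not values or code from the paper, whose exponent \(\alpha(n)\) varies with \(n\).

```python
import math


def power_law(n, c, alpha):
    """Evaluate S(n) = c / n**alpha for a constant exponent alpha (an assumption)."""
    return c / n ** alpha


def local_exponent(n1, s1, n2, s2):
    """Estimate alpha between two (n, S) points as the negative log-log slope."""
    return -(math.log(s2) - math.log(s1)) / (math.log(n2) - math.log(n1))


# Synthetic data only: these are not measurements from the paper.
token_counts = [16, 64, 256]
scores = [power_law(n, c=2.0, alpha=0.5) for n in token_counts]

# Exponent estimated on each consecutive pair of token counts.
alphas = [
    local_exponent(token_counts[i], scores[i], token_counts[i + 1], scores[i + 1])
    for i in range(len(token_counts) - 1)
]
```

With real benchmark measurements in place of the synthetic `scores`, a pairwise estimate that changes as `n` grows would be one simple way to observe an exponent that depends on the token count.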

Code & Models

Repositories

tenghuilee/scalingcapfusedvisionlm
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques