Scaling Capability in Token Space: An Analysis of Large Vision Language Model
Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao

TL;DR
This paper develops a theoretical framework to analyze how the number of vision tokens influences the performance of large vision-language models, revealing two distinct scaling regimes and validating predictions with empirical data.
Contribution
It introduces a mathematical model linking vision token count to model performance, identifying sublinear and linear scaling regimes, and empirically validating these predictions across benchmarks.
Findings
Two scaling regimes identified: sublinear for fewer vision tokens and linear for more vision tokens.
Model performance aligns with the theoretical scaling predictions.
Empirical results support the proposed mathematical framework.
Abstract
Large language models have demonstrated predictable scaling behaviors with respect to model parameters and training data. This study investigates whether a similar scaling relationship exists for vision-language models with respect to the number of vision tokens. A mathematical framework is developed to characterize the relationship between the number of vision tokens and the expected divergence of distance between vision-referencing sequences. The theoretical analysis reveals two distinct scaling regimes: sublinear scaling for fewer vision tokens and linear scaling for more vision tokens. This aligns with model performance relationships of the form \(S(n) \approx c / n^{\alpha(n)}\), where the scaling exponent relates to the correlation structure between vision token representations. Empirical validations across multiple vision-language benchmarks show that model performance matches the prediction…
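As an illustration of the form \(S(n) \approx c / n^{\alpha(n)}\), the sketch below estimates \(c\) and a single exponent \(\alpha\) by a straight-line fit in log-log space. It is not the paper's procedure: it assumes the exponent is roughly constant over the fitted range of \(n\), the function name `fit_power_law` is hypothetical, and the data are synthetic values generated for the example rather than results from the paper.

```python
import numpy as np

def fit_power_law(n, s):
    """Fit s ~ c / n**alpha by least squares on (log n, log s).

    Assumes alpha is constant over the given range of n, which is a
    simplification of the n-dependent exponent alpha(n) in the abstract.
    """
    n = np.asarray(n, dtype=float)
    s = np.asarray(s, dtype=float)
    # log s = log c - alpha * log n, so the slope is -alpha
    slope, intercept = np.polyfit(np.log(n), np.log(s), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic, noise-free data (NOT from the paper): c = 2.0, alpha = 0.5
n_tokens = np.array([16, 64, 256, 1024])
scores = 2.0 / n_tokens ** 0.5

c_hat, alpha_hat = fit_power_law(n_tokens, scores)
print(f"c = {c_hat:.3f}, alpha = {alpha_hat:.3f}")
```

Fitting separate windows of small and large \(n\) in this way would be one simple way to check whether the estimated exponent changes between regimes, under the same constant-exponent assumption within each window.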
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
