When LLaVA Meets Objects: Token Composition for Vision-Language-Models

Soumya Jahagirdar; Walid Bousselham; Anna Kukleva; Hilde Kuehne

arXiv:2602.04864 · cs.CV · February 10, 2026

TL;DR

Mask-LLaVA is a vision-language framework that combines visual features at several levels (mask-based object tokens, global tokens, and local patch tokens) into a compact representation. The number of visual tokens can then be reduced at inference time without retraining, while performance stays competitive.

Contribution

The paper proposes a framework that combines mask-based object representations with global tokens and local patch tokens, so that tokens, in particular object tokens, can be dropped dynamically at inference time.

Findings

01

Achieves competitive benchmark results with fewer tokens.

02

Enables dynamic token selection at test time.

03

Maintains performance without retraining when reducing tokens.

Abstract

Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, which increases compute requirements, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object representations with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly reduce the number of tokens, especially mask-based object tokens, at test time. This makes it possible to adapt the token count during inference without retraining the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks showing results competitive to current…
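
The token composition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how object tokens are computed or which ones are dropped, so the mask-averaging step, the function names `mask_pool` and `compose_tokens`, the `max_objects` parameter, and all shapes are assumptions made purely for illustration.

```python
import numpy as np

def mask_pool(patch_feats, masks):
    # patch_feats: (P, D) patch features; masks: (M, P) binary masks over patches.
    # Assumed pooling choice: average the patch features inside each mask.
    masks = masks.astype(patch_feats.dtype)
    area = np.clip(masks.sum(axis=1, keepdims=True), 1.0, None)  # avoid divide-by-zero
    return (masks / area) @ patch_feats  # (M, D): one token per mask

def compose_tokens(global_tok, object_toks, local_toks, max_objects=None):
    # Concatenate [global | object | local] tokens into one visual sequence.
    # max_objects (hypothetical knob) keeps only the first k object tokens at test time.
    if max_objects is not None:
        object_toks = object_toks[:max_objects]
    return np.concatenate([global_tok[None, :], object_toks, local_toks], axis=0)

rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(16, 8))                   # 16 patches, 8-dim features
masks = rng.integers(0, 2, size=(5, 16))                 # 5 stand-in object masks
global_tok = patch_feats.mean(axis=0)                    # stand-in global token
local_toks = patch_feats.reshape(4, 4, 8).mean(axis=1)   # 4 coarse local tokens
object_toks = mask_pool(patch_feats, masks)

full = compose_tokens(global_tok, object_toks, local_toks)                    # all tokens
reduced = compose_tokens(global_tok, object_toks, local_toks, max_objects=2)  # fewer object tokens
print(full.shape, reduced.shape)  # (10, 8) (7, 8)
```

In this sketch, the same composed sequence is shortened only by changing `max_objects`, which mirrors the idea of adapting the token count at inference without retraining. The global and local tokens are left untouched.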

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis