Matryoshka Query Transformer for Large Vision-Language Models
Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang

TL;DR
This paper introduces the Matryoshka Query Transformer (MQT), a flexible model that encodes images into variable numbers of visual tokens, enabling adaptable computational costs while maintaining high performance across multiple vision-language benchmarks.
Contribution
The paper proposes MQT, a novel approach that lets the number of visual tokens be chosen at inference time, so computational cost can be reduced with little loss of accuracy, and demonstrates its effectiveness within the LLaVA framework.
Findings
MQT-LLaVA matches LLaVA-1.5 performance using at most 256 visual tokens instead of LLaVA-1.5's fixed 576.
Reducing to 16 visual tokens costs only 2.4 points on MMBench.
On certain tasks, such as ScienceQA and MMMU, as few as 2 visual tokens still yield only a small performance drop.
Abstract
Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m <= M latent query tokens and train the model using only these first m…
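The training procedure described in the abstract (sample m <= M, keep only the first m latent queries) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the uniform sampling of m, the toy sizes, and the single-head attention without learned projections are all assumptions made for illustration.

```python
import math
import random

def sample_num_queries(max_queries, rng):
    # Assumption: m is drawn uniformly from 1..M; the source only says "randomly".
    return rng.randint(1, max_queries)

def cross_attend(queries, image_features):
    # Toy single-head dot-product attention (no learned projections):
    # each latent query attends over all image features and yields one visual token.
    dim = len(image_features[0])
    tokens = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim)
                  for f in image_features]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        tokens.append([
            sum(w * f[d] for w, f in zip(weights, image_features)) / total
            for d in range(dim)
        ])
    return tokens

def encode_image(latent_queries, image_features, m):
    # Keep only the first m of the M latent queries, as in the abstract.
    return cross_attend(latent_queries[:m], image_features)

# Toy sizes (hypothetical): M = 8 latent queries, 12 patch features, width 4.
rng = random.Random(0)
M, num_patches, dim = 8, 12, 4
latent_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(M)]
image_features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]

# One "training step": sample m and encode with the first m queries only.
m = sample_num_queries(M, rng)
visual_tokens = encode_image(latent_queries, image_features, m)
print(m, len(visual_tokens))  # the two numbers are equal
```

In this sketch the queries never interact with one another, so the tokens produced with a small m are exactly a prefix of those produced with a larger m. Whether that holds in the actual model depends on architectural details not given here.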
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Cognitive Computing and Networks
Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
