Matryoshka Query Transformer for Large Vision-Language Models

Wenbo Hu; Zi-Yi Dou; Liunian Harold Li; Amita Kamath; Nanyun Peng; Kai-Wei Chang

arXiv:2405.19315 · cs.CV · June 10, 2024 · 1 citation

Open Access · 1 Repo · 1 Model

TL;DR

This paper introduces the Matryoshka Query Transformer (MQT), a flexible model that encodes images into variable numbers of visual tokens, enabling adaptable computational costs while maintaining high performance across multiple vision-language benchmarks.

Contribution

The paper proposes MQT, an approach that lets the number of visual tokens be chosen at inference time, so computational cost can be reduced with only a small loss in accuracy, and demonstrates its effectiveness within the LLaVA framework.

Findings

01

MQT-LLaVA matches LLaVA-1.5 performance using at most 256 visual tokens instead of LLaVA-1.5's fixed 576.

02

Reducing to 16 tokens costs only 2.4 points on MMBench.

03

On some tasks, such as ScienceQA and MMMU, as few as 2 tokens suffice, with performance drops of only 3% and 6%, respectively.

Abstract

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m <= M latent query tokens and train the model using only these first m…
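The mechanism described in the abstract, keeping only the first m of M latent queries, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single cross-attention step in plain Python, toy sizes, and uniform sampling of m, all of which are assumptions made for illustration. The names `cross_attend`, `encode`, and `sample_m` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_feats):
    """Single-head cross-attention: each query attends over all image features."""
    dim = len(image_feats[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in image_feats]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, image_feats))
                    for d in range(dim)])
    return out


def encode(latent_queries, image_feats, m):
    """Keep only the first m latent queries, so the output has m visual tokens."""
    if not 1 <= m <= len(latent_queries):
        raise ValueError("m must be between 1 and M")
    return cross_attend(latent_queries[:m], image_feats)


def sample_m(M, rng):
    """Training-time choice of m. Uniform sampling is an assumption of this sketch."""
    return rng.randint(1, M)


rng = random.Random(0)
M, DIM, NUM_PATCHES = 8, 4, 16  # toy sizes, not the paper's
latent_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(M)]
image_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

# Training step: draw m <= M and encode with the first m queries only.
m = sample_m(M, rng)
train_tokens = encode(latent_queries, image_feats, m)

# Inference: the caller picks any budget up to M.
few_tokens = encode(latent_queries, image_feats, 2)
all_tokens = encode(latent_queries, image_feats, M)
```

In this simplified sketch the queries do not interact with each other, so the first two rows of `all_tokens` equal `few_tokens` exactly. That nesting is a property of the sketch and should not be read as a claim about the full model.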

Code & Models

Repositories

gordonhu608/mqt-llava
PyTorch · Official

Models

gordonhu/MQT-LLaVA-7b
model · 7 downloads · 5 likes

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Cognitive Computing and Networks

Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections