Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models

Mohamad Zamini; Diksha Shukla

arXiv:2512.18910 · cs.CV · December 23, 2025

Open Access

TL;DR

Delta-LLaVA is a token-efficient vision-language model that combines low-rank alignment with lightweight specialization layers, improving inference speed and training efficiency while maintaining strong performance.

Contribution

The paper proposes a novel base-then-specialize alignment approach with a low-rank DeltaProjection for efficient token formation in vision-language models.

Findings

01 Inference throughput improves by up to 55%.

02 End-to-end training accelerates by 4-5x in pretraining.

03 Consistent performance gains across multiple benchmarks.

Abstract

Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost of processing dense visual tokens remains a major bottleneck. A critical component in this pipeline is the visual projector, which bridges the vision encoder and the language model. Standard designs often employ a simple multi-layer perceptron for direct token mapping, but this approach scales poorly with high-resolution inputs, introducing significant redundancy. We present Delta-LLaVA, a token-efficient projector that employs a low-rank DeltaProjection to align multi-level vision features into a compact subspace before further interaction. On top of this base alignment, lightweight Transformer blocks act as specialization layers, capturing both global and local structure under constrained token budgets. Extensive experiments…
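
To make the low-rank idea in the abstract concrete, here is a minimal sketch of a generic low-rank projection: tokens are mapped through a narrow rank-r bottleneck instead of a single dense matrix. This is not the paper's DeltaProjection; the function names, the dimensions, and the random weights are all hypothetical, and the multi-level feature fusion and Transformer specialization layers described above are not modeled.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random weights (illustrative only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def low_rank_project(tokens, down, up):
    """Map tokens (n x d_in) to (n x d_out) through a rank-r bottleneck.

    down has shape (d_in x r) and up has shape (r x d_out), so the
    effective projection matrix down @ up has rank at most r.
    """
    return matmul(matmul(tokens, down), up)

def param_counts(d_in, d_out, rank):
    """Return (dense, low_rank) weight counts for the two designs, biases ignored."""
    dense = d_in * d_out
    low_rank = rank * (d_in + d_out)
    return dense, low_rank

# Hypothetical sizes chosen only to keep the example small.
rng = random.Random(0)
d_in, d_out, rank, n_tokens = 16, 32, 4, 6

tokens = random_matrix(n_tokens, d_in, rng)   # stand-in for vision features
down = random_matrix(d_in, rank, rng)         # d_in -> r
up = random_matrix(rank, d_out, rng)          # r -> d_out

projected = low_rank_project(tokens, down, up)
dense, low_rank = param_counts(d_in, d_out, rank)

print(len(projected), len(projected[0]))  # token count and output width
print(dense, low_rank)                    # weights: dense vs. low-rank
```

With these toy sizes the factored projection needs 192 weights where a dense one needs 512. The saving grows as the rank shrinks relative to the feature widths, which is the general motivation for low-rank alignment. How Delta-LLaVA reduces the number of visual tokens is a separate mechanism that this sketch does not attempt to reproduce.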

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning