HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

Md Jahidul Islam

arXiv:2603.16653 · cs.CV · March 18, 2026

TL;DR

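HeBA adapts vision-language models such as CLIP with modality-specific structural inductive biases: visual tokens pass through 2D depthwise-separable convolutions, text tokens pass through dense linear projections, and the adapters are regularized through a bottleneck. The reported result is improved stability and accuracy across 11 few-shot benchmarks.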

Contribution

HeBA presents a novel architectural framework with modality-specific processing, bottleneck regularization, and active gradient initialization for enhanced VLM adaptation.

Findings

1. Achieves state-of-the-art results on 11 few-shot benchmarks.
2. Demonstrates improved stability and convergence speed.
3. Outperforms conventional homogeneous adapter designs.

Abstract

Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities -- spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike…
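
To make the contrast between the two branch types concrete, the following is a minimal sketch that counts the learnable parameters of a dense linear projection, a 2D depthwise-separable convolution, and a standard 2D convolution mapping the same number of input channels to the same number of output channels. It is not the paper's implementation: the helper names and the sizes (`WIDTH = 768`, `BOTTLENECK = 64`, `KERNEL = 3`) are hypothetical values chosen for illustration, and the formulas are the textbook parameter counts for these layer types.

```python
# Illustrative only: helper names and sizes are assumptions, not taken from the paper.

def dense_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters of a dense linear projection d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


def standard_conv_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameters of a standard 2D convolution with a k x k kernel."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def depthwise_separable_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameters of a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = c_in * k * k + (c_in if bias else 0)   # one k x k filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical sizes: a 768-wide token embedding projected down to a 64-wide bottleneck.
WIDTH, BOTTLENECK, KERNEL = 768, 64, 3

if __name__ == "__main__":
    print("dense projection:        ", dense_params(WIDTH, BOTTLENECK))
    print("depthwise-separable conv:", depthwise_separable_params(WIDTH, BOTTLENECK, KERNEL))
    print("standard conv:           ", standard_conv_params(WIDTH, BOTTLENECK, KERNEL))
```

Under these assumed sizes, the depthwise-separable layer adds spatial mixing over a 3 x 3 neighbourhood for a small parameter overhead relative to the dense projection, whereas a standard convolution with the same kernel costs roughly nine times as much. This is one plausible reason to prefer the depthwise-separable form in a lightweight visual adapter.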

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis