HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models
Md Jahidul Islam

TL;DR
HeBA adapts vision-language models with modality-specific structural inductive biases: convolutional processing for visual tokens, dense linear projections for text tokens, and bottleneck regularization. The authors report improved stability and accuracy across 11 few-shot benchmarks.
Contribution
HeBA presents a novel architectural framework with modality-specific processing, bottleneck regularization, and active gradient initialization for enhanced VLM adaptation.
Findings
Achieves state-of-the-art results on 11 few-shot benchmarks.
Demonstrates improved stability and convergence speed.
Outperforms conventional homogeneous adapter designs.
Abstract
Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities: spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike…
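The heterogeneity idea in point (1) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the residual connections, the ReLU, the 3x3 kernel size, and the down/up projection shapes are all assumptions made for illustration. It also assumes the visual tokens are patch tokens only, laid out row by row on an H x W grid (no class token).

```python
import numpy as np

def depthwise_separable_conv2d(x, dw_kernels, pw_weights):
    """x: (C, H, W); dw_kernels: (C, k, k); pw_weights: (C_out, C)."""
    c, h, w = x.shape
    k = dw_kernels.shape[-1]
    pad = k // 2
    # Zero-pad the spatial dimensions so the output keeps the H x W size.
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c, h, w), dtype=float)
    # Depthwise step: each channel is filtered by its own k x k kernel.
    for i in range(k):
        for j in range(k):
            out += dw_kernels[:, i, j, None, None] * xp[:, i:i + h, j:j + w]
    # Pointwise step: a 1x1 convolution mixes channels at each location.
    return np.einsum("oc,chw->ohw", pw_weights, out)

def visual_adapter(tokens, grid_hw, dw_kernels, pw_weights):
    """Hypothetical visual branch: tokens (N, D) -> (N, D), with N = H * W."""
    h, w = grid_hw
    n, d = tokens.shape
    assert n == h * w, "patch tokens must fill the H x W grid"
    # Restore the 2D patch layout so neighbouring patches are adjacent.
    grid = tokens.T.reshape(d, h, w)
    mixed = depthwise_separable_conv2d(grid, dw_kernels, pw_weights)
    # Residual connection (an assumption, common in adapter designs).
    return tokens + mixed.reshape(d, n).T

def text_adapter(tokens, w_down, w_up):
    """Hypothetical text branch: dense down/up projections, tokens (T, D)."""
    hidden = np.maximum(tokens @ w_down, 0.0)  # ReLU is an assumption
    return tokens + hidden @ w_up

# Example shapes only; the weights here are random, not trained.
rng = np.random.default_rng(0)
d, h, w, r = 8, 4, 4, 2  # r is a hypothetical bottleneck width
vis = visual_adapter(rng.normal(size=(h * w, d)), (h, w),
                     rng.normal(size=(d, 3, 3)), rng.normal(size=(d, d)))
txt = text_adapter(rng.normal(size=(5, d)),
                   rng.normal(size=(d, r)), rng.normal(size=(r, d)))
```

In the visual branch, each output patch depends only on its 3x3 spatial neighbourhood, which is one way to read "preserve spatial correlations". The text branch has no notion of position and mixes feature dimensions only.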
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
