Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning
Yumiao Zhao, Bo Jiang, Yuhe Ding, Xiao Wang, Jin Tang, Bin Luo

TL;DR
This paper introduces LatHAdapter, a hyperbolic space-based fine-tuning method for vision-language models that captures semantic hierarchies to improve few-shot classification, especially for unknown classes.
Contribution
The paper proposes LatHAdapter, an adapter that models latent semantic hierarchies in hyperbolic space, addressing two limitations of existing adapters: they fail to capture one-to-many category-image associations and struggle to associate unknown categories with images.
Findings
Outperforms existing fine-tuning methods on four few-shot tasks.
Enhances generalization to unknown classes.
Effectively models semantic hierarchies in hyperbolic space.
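As a rough illustration of why hyperbolic space suits hierarchies, the sketch below computes geodesic distance in the unit Poincaré ball. This summary does not say which hyperbolic model LatHAdapter uses, so the Poincaré ball, the curvature of -1, and the function name poincare_distance are assumptions made for illustration and are not the authors' implementation.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit
    Poincare ball (curvature -1). Illustrative only; not from the paper."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)


# Two pairs with the same Euclidean gap (0.1), one near the origin and
# one near the boundary of the ball.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
print(near_origin < near_boundary)
```

Distances grow rapidly toward the boundary, so a general concept placed near the origin can stay relatively close to many specific items placed further out while those items remain far from one another. This is the usual intuition behind embedding tree-like, one-to-many structure in hyperbolic space.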
Abstract
Adapter-based approaches have garnered attention for fine-tuning pre-trained Vision-Language Models (VLMs) on few-shot classification tasks. These methods aim to learn a lightweight module that better aligns visual and category-level textual representations, thereby improving performance on downstream few-shot learning tasks. However, existing adapters generally align the category text and visual modalities via explicit spatial proximity in the underlying embedding space, which i) fails to capture the inherent one-to-many associations between categories and image samples and ii) struggles to establish accurate associations between unknown categories and images. To address these issues, inspired by recent works on hyperbolic learning, we develop a novel Latent Hierarchical Adapter (LatHAdapter) for fine-tuning VLMs on downstream few-shot classification tasks. The core of…
