HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter
Yumiao Zhao, Bo Jiang, Xiao Wang, Qin Xu, Jin Tang

TL;DR
This paper introduces HeGraphAdapter, a novel method that constructs a heterogeneous graph to better model interactions between visual and textual modalities, improving the adaptation of vision-language models for various downstream tasks.
Contribution
The paper proposes a heterogeneous graph adapter that captures intra- and inter-modality relationships, enhancing vision-language model tuning beyond existing similarity-based methods.
Findings
Significant performance improvements on 11 benchmark datasets.
Effective modeling of intra- and inter-modality interactions.
Enhanced classification accuracy with the proposed approach.
Abstract
Adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models (VLMs) to downstream tasks. However, after reviewing existing adapters, we find that they generally fail to fully explore the interactions between different modalities when constructing task-specific knowledge. Also, existing works usually focus only on similarity matching between positive text prompts, making it challenging to distinguish classes with highly similar visual content. To address these issues, we propose a novel Heterogeneous Graph Adapter for tuning VLMs on downstream tasks. Specifically, we first construct a unified heterogeneous graph model, which contains i) visual nodes, positive text nodes and negative text nodes, and ii) several types of edge connections to comprehensively model the intra-modality, inter-modality and…
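The graph structure described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names `cosine` and `build_hetero_graph`, the use of cosine similarity as an edge weight, and the choice to connect every pair of node types are all assumptions made for illustration. The sketch only shows what "three node types with typed intra- and inter-modality edges" might look like as a data structure.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_hetero_graph(visual, pos_text, neg_text):
    """Build a toy heterogeneous graph (hypothetical helper, not from the paper).

    Returns (nodes, edges), where
      nodes: {node_type: list of feature vectors}
      edges: {(src_type, dst_type): {(i, j): weight}}
    Edge weights are cosine similarities, which is an assumption.
    """
    nodes = {"visual": visual, "pos_text": pos_text, "neg_text": neg_text}
    edges = {}
    # One edge type per ordered pair of node types: same-type pairs give
    # intra-modality edges, mixed pairs give inter-modality edges.
    for src, dst in product(nodes, repeat=2):
        weights = {}
        for i, u in enumerate(nodes[src]):
            for j, v in enumerate(nodes[dst]):
                if src == dst and i == j:
                    continue  # skip self-loops
                weights[(i, j)] = cosine(u, v)
        edges[(src, dst)] = weights
    return nodes, edges


# Toy features for two classes (made-up numbers, not real CLIP embeddings).
visual = [[1.0, 0.0], [0.0, 1.0]]
pos_text = [[0.9, 0.1], [0.1, 0.9]]
neg_text = [[-1.0, 0.0], [0.0, -1.0]]

nodes, edges = build_hetero_graph(visual, pos_text, neg_text)
for edge_type, weights in edges.items():
    print(edge_type, len(weights))
```

In this toy setup each visual node is most similar to the positive text node of its own class and least similar to the matching negative text node, which is the kind of signal a graph adapter could then propagate across node types.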
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training · Adapter · Focus · Graph Neural Network
