HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter

Yumiao Zhao; Bo Jiang; Xiao Wang; Qin Xu; Jin Tang

arXiv:2410.07854·cs.CV·October 11, 2024

TL;DR

This paper introduces HeGraphAdapter, a novel method that constructs a heterogeneous graph to better model interactions between visual and textual modalities, improving the adaptation of vision-language models for various downstream tasks.

Contribution

The paper proposes a heterogeneous graph adapter that captures intra- and inter-modality relationships, enhancing vision-language model tuning beyond existing similarity-based methods.

Findings

01. Significant performance improvements on 11 benchmark datasets.

02. Effective modeling of intra- and inter-modality interactions.

03. Enhanced classification accuracy with the proposed approach.

Abstract

Adapter-based tuning methods have shown significant potential in transferring knowledge from pre-trained Vision-Language Models (VLMs) to downstream tasks. However, after reviewing existing adapters, we find that they generally fail to fully explore the interactions between different modalities when constructing task-specific knowledge. In addition, existing works usually focus only on similarity matching between positive text prompts, making it challenging to distinguish classes with highly similar visual content. To address these issues, in this paper we propose a novel Heterogeneous Graph Adapter for tuning VLMs on downstream tasks. Specifically, we first construct a unified heterogeneous graph model, which contains i) visual nodes, positive text nodes and negative text nodes, and ii) several types of edge connections to comprehensively model the intra-modality, inter-modality and…
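The graph described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `build_hetero_graph`, the choice of cosine similarity as the edge weight, the all-pairs connectivity, and the toy 2-D features are all assumptions made for illustration. It only shows what "three node types with typed edges between them" could look like as a data structure.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_hetero_graph(visual, pos_text, neg_text):
    """Build typed nodes and typed, similarity-weighted edges.

    Assumption: every ordered pair of node types gets its own edge set,
    and every edge is weighted by cosine similarity. The paper's actual
    edge types and weights may differ.
    """
    nodes = {"visual": visual, "pos_text": pos_text, "neg_text": neg_text}
    edges = {}
    for src_type, dst_type in product(nodes, repeat=2):
        weights = {}
        for i, u in enumerate(nodes[src_type]):
            for j, v in enumerate(nodes[dst_type]):
                if src_type == dst_type and i == j:
                    continue  # no self-loops within a node type
                weights[(i, j)] = cosine(u, v)
        edges[(src_type, dst_type)] = weights
    return nodes, edges


# Toy features (hypothetical): two classes, one feature vector per class and type.
visual = [[1.0, 0.0], [0.0, 1.0]]
pos_text = [[1.0, 0.0], [0.0, 1.0]]
neg_text = [[-1.0, 0.0], [0.0, -1.0]]

nodes, edges = build_hetero_graph(visual, pos_text, neg_text)
```

Keeping one edge set per (source type, destination type) pair lets intra-modality edges, such as visual to visual, and inter-modality edges, such as visual to negative text, be processed with separate parameters later, which is the usual motivation for a heterogeneous rather than a homogeneous graph.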

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training · Adapter · Focus · Graph Neural Network