Improving vision-language alignment with graph spiking hybrid Networks

Siyu Zhang; Wenzhe Liu; Yeming Chen; Yiming Wu; Heming Zheng; Cheng; Cheng

arXiv:2501.19069·cs.CV·March 4, 2025

Open Access

TL;DR

This paper introduces a novel graph spiking hybrid network that leverages panoptic segmentation and contrastive learning to improve vision-language alignment by capturing rich semantic relations and contextual features.

Contribution

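The paper proposes GSHN, a Graph Spiking Hybrid Network that combines Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs), and uses panoptic segmentation and a novel pre-training method to enhance semantic representation in vision-language (VL) tasks.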

Findings

1. GSHN outperforms existing models on multiple VL benchmarks.

2. The use of contrastive learning improves embedding similarity and model robustness.

3. Panoptic segmentation enhances the quality of visual semantic features.

Abstract

Bridging the semantic gap between vision and language (VL) requires a good alignment strategy, one that handles semantic diversity, the abstract representation of visual information, and model generalization. Recent works represent visual semantics with detector-based bounding boxes or regularly partitioned patches. While these paradigms have made strides, they remain insufficient for fully capturing the nuanced contextual relations among objects. This paper proposes a comprehensive visual semantic representation module that uses panoptic segmentation to generate coherent fine-grained semantic features. Furthermore, we propose a novel Graph Spiking Hybrid Network (GSHN) that integrates the complementary advantages of Spiking Neural Networks (SNNs) and Graph Attention Networks (GATs) to encode visual semantic…

Taxonomy

Topics: Advanced Graph Neural Networks · Multimodal Machine Learning Applications · Robotics and Automated Systems

Methods: Softmax · Attention Is All You Need · Contrastive Learning