Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment

Yuze Zheng; Zixuan Li; Xiangxian Li; Jinxing Liu; Yuqing Wang; Xiangxu Meng; and Lei Meng

arXiv:2407.18854 · cs.CV · July 29, 2024

TL;DR

This paper introduces MARNet, a multimodal alignment and reconstruction network utilizing diffusion models to improve cross-modal feature alignment and robustness in image classification tasks.

Contribution

The paper proposes a novel MARNet framework with a diffusion-based cross-modal reconstruction module to enhance multimodal alignment and resistance to visual noise.

Findings

1. MARNet improves image feature quality on benchmark datasets.
2. The framework enhances robustness against visual noise.
3. It can be easily integrated into existing classification models.

Abstract

Image classification models often demonstrate unstable performance in real-world applications due to variations in image information, driven by differing visual perspectives of subject objects and lighting discrepancies. To mitigate these challenges, existing studies commonly incorporate additional modal information matching the visual data to regularize the model's learning process, enabling the extraction of high-quality visual features from complex image regions. Specifically, in the realm of multimodal learning, cross-modal alignment is recognized as an effective strategy, harmonizing different modal information by learning a domain-consistent latent feature space for visual and semantic features. However, this approach may face limitations due to the heterogeneity between multimodal information, such as differences in feature distribution and structure. To address this issue, we…
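To make the idea of a shared latent space for visual and semantic features concrete, here is a minimal sketch of one generic alignment objective: project each modality into a common space and penalize low cosine similarity between paired features. This is an illustration only, not MARNet's loss or architecture, which this truncated abstract does not specify. The function names, the linear projections `w_visual` and `w_semantic`, and the toy dimensions are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def project(matrix, vec):
    """Apply a linear projection (given as a list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cosine_alignment_loss(visual_feats, semantic_feats, w_visual, w_semantic):
    """Mean (1 - cosine similarity) over paired visual/semantic features.

    Hypothetical helper: both modalities are mapped into a shared space by
    their own projection, then compared. 0 means perfectly aligned pairs,
    2 means the paired vectors point in opposite directions.
    """
    if len(visual_feats) != len(semantic_feats):
        raise ValueError("features must come in visual/semantic pairs")
    total = 0.0
    for v, s in zip(visual_feats, semantic_feats):
        zv = l2_normalize(project(w_visual, v))
        zs = l2_normalize(project(w_semantic, s))
        total += 1.0 - sum(a * b for a, b in zip(zv, zs))
    return total / len(visual_feats)


# Toy setup (assumed sizes): 3-d visual features, 2-d semantic features,
# both projected into a 2-d shared space.
w_visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_semantic = [[1.0, 0.0], [0.0, 1.0]]

aligned = cosine_alignment_loss([[1.0, 0.0, 5.0]], [[2.0, 0.0]], w_visual, w_semantic)
orthogonal = cosine_alignment_loss([[1.0, 0.0, 5.0]], [[0.0, 3.0]], w_visual, w_semantic)
opposed = cosine_alignment_loss([[1.0, 0.0, 5.0]], [[-1.0, 0.0]], w_visual, w_semantic)
```

In training, such a loss would be minimized with respect to the projection weights. A purely pairwise objective like this one does not by itself address the distribution and structure mismatch between modalities that the abstract raises.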

Taxonomy

Topics: 3D Surveying and Cultural Heritage · Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques

Methods: Diffusion