Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment
Yuze Zheng, Zixuan Li, Xiangxian Li, Jinxing Liu, Yuqing Wang, Xiangxu Meng, and Lei Meng

TL;DR
This paper introduces MARNet, a multimodal alignment and reconstruction network utilizing diffusion models to improve cross-modal feature alignment and robustness in image classification tasks.
Contribution
The paper proposes a novel MARNet framework with a diffusion-based cross-modal reconstruction module to enhance multimodal alignment and resistance to visual noise.
Findings
MARNet improves image feature quality on benchmark datasets
The framework enhances robustness against visual noise
It can be integrated into existing classification models easily
Abstract
Image classification models often demonstrate unstable performance in real-world applications due to variations in image information, driven by differing visual perspectives of subject objects and lighting discrepancies. To mitigate these challenges, existing studies commonly incorporate additional modal information matching the visual data to regularize the model's learning process, enabling the extraction of high-quality visual features from complex image regions. Specifically, in the realm of multimodal learning, cross-modal alignment is recognized as an effective strategy, harmonizing different modal information by learning a domain-consistent latent feature space for visual and semantic features. However, this approach may face limitations due to the heterogeneity between multimodal information, such as differences in feature distribution and structure. To address this issue, we…
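The cross-modal alignment idea mentioned in the abstract, learning a shared latent space in which visual and semantic features agree, can be illustrated with a generic contrastive alignment loss. This is a minimal sketch under stated assumptions and not MARNet's method: the abstract is truncated before the method is described, and the diffusion-based reconstruction module is not shown here. The function names, the linear projections, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math

def project(features, weights):
    """Linearly map each feature vector into the shared latent space.

    `weights` is a list of rows with shape (input_dim, latent_dim).
    """
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(vec, col)) for col in columns]
            for vec in features]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine_similarity_matrix(a, b):
    """Entry [i][j] is the cosine similarity between a[i] and b[j]."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def alignment_loss(visual, semantic, temperature=0.1):
    """Symmetric contrastive loss: matched pairs (same index) should be
    more similar than mismatched pairs, in both directions."""
    sim = cosine_similarity_matrix(visual, semantic)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits_t))

# Toy data (hypothetical): 3-d visual features, 2-d semantic features,
# both projected into a 2-d shared space.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
semantic = [[2.0, 0.0], [0.0, 3.0]]
w_visual = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_semantic = [[1.0, 0.0], [0.0, 1.0]]

z_visual = project(visual, w_visual)
z_semantic = project(semantic, w_semantic)

aligned = alignment_loss(z_visual, z_semantic)
misaligned = alignment_loss(z_visual, z_semantic[::-1])
print(aligned < misaligned)
```

In this toy setup the loss is lower when each image feature is paired with its own semantic feature than when the pairs are swapped, which is the behaviour an alignment objective of this kind is meant to reward. The heterogeneity problem raised in the abstract concerns cases where such a shared space is hard to learn because the two modalities differ in distribution and structure.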
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques
Methods: Diffusion
