Semantic-Space-Intervened Diffusive Alignment for Visual Classification

Zixuan Li; Lei Meng; Guoqing Chao; Wei Wu; Xiaoshuo Yan; Yimeng Yang; Zhuang Qi; Xiangxu Meng

arXiv:2505.05721 · cs.CV · May 27, 2025

TL;DR

This paper introduces SeDA, a novel diffusion-based method that progressively aligns visual and textual features via a shared semantic space, significantly improving cross-modal visual classification accuracy.

Contribution

SeDA employs a bi-stage diffusion framework with a semantic space bridge and stepwise feature interaction, advancing cross-modal alignment techniques.

Findings

01. SeDA outperforms existing methods in multiple visual classification scenarios.

02. The diffusion-controlled semantic models improve cross-modal feature consistency.

03. Progressive alignment enhances the integration of textual information into visual features.

Abstract

Cross-modal alignment is an effective approach to improving visual classification. Existing studies typically enforce a one-step mapping that uses deep neural networks to project the visual features so that they mimic the distribution of textual features. However, they typically have difficulty finding such a projection because the two modalities differ in both the distribution of class-wise samples and the range of their feature values. To address this issue, this paper proposes a novel Semantic-Space-Intervened Diffusive Alignment method, termed SeDA, which models a semantic space as a bridge in the visual-to-textual projection, given that both types of features share the same class-level information in classification. More importantly, a bi-stage diffusion framework is developed to enable the progressive alignment between the two modalities. Specifically, SeDA first employs a Diffusion-Controlled…
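
The abstract contrasts a one-step mapping with progressive, diffusion-based alignment. As a rough illustration of what "progressive" means in a diffusion setting, the sketch below implements only the generic forward noising step of a standard denoising diffusion model on a batch of feature vectors. It is not SeDA's method: the linear schedule, the 1000 steps, the 512-dimensional random features, and the function names `linear_beta_schedule` and `forward_diffuse` are all assumptions made for illustration.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (hypothetical schedule, common DDPM defaults)."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
# Cumulative signal retention: shrinks gradually from ~1 toward 0 as t grows.
alpha_bar = np.cumprod(1.0 - betas)

# Placeholder "visual features": random vectors standing in for real embeddings.
visual_features = rng.standard_normal((4, 512))

noisy_early = forward_diffuse(visual_features, 10, alpha_bar, rng)   # mostly signal
noisy_late = forward_diffuse(visual_features, 999, alpha_bar, rng)   # mostly noise
```

Because the signal coefficient decays over many small steps, a model trained to reverse this process moves features a little at a time rather than in a single projection, which is the general intuition behind stepwise alignment.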

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Diffusion