Rethinking Fusion: Disentangled Learning of Shared and Modality-Specific Information for Stance Detection

Zhiyu Xie; Fuqiang Niu; Genan Dai; Qianlong Wang; Li Dong; Bowen Zhang; Hu Huang

arXiv:2601.21675·cs.MM·January 30, 2026

Open Access

TL;DR

This paper introduces DiME, a novel multi-modal stance detection architecture that disentangles modality-specific and shared information, leading to improved performance over existing methods in both in-target and zero-shot scenarios.

Contribution

DiME explicitly separates modality-specific and shared signals using specialized experts and a gating mechanism, advancing multi-modal stance detection techniques.
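As a rough illustration of the gating idea described above, the sketch below fuses three expert outputs with softmax weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `softmax` and `gated_fusion`, the use of plain Python lists as feature vectors, and the choice of a softmax gate are all hypothetical, since this page does not specify how DiME computes its gate.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(expert_outputs, gate_logits):
    """Fuse expert vectors as a weighted sum (hypothetical helper).

    expert_outputs: one equal-length vector per expert, e.g. the
        textual-dominant, visual-dominant, and shared experts.
    gate_logits: one unnormalized score per expert. In a real model a
        gating network would produce these; here they are passed in.
    Returns the fused vector and the normalized gate weights.
    """
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return fused, weights


# Toy inputs: three 2-dimensional expert outputs and equal gate scores.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = gated_fusion(experts, [0.0, 0.0, 0.0])
```

With equal gate scores the fused vector is simply the mean of the expert outputs. Raising one logit shifts the result toward that expert, which is the sense in which a gate can weight the experts differently per input.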

Findings

01 DiME outperforms baseline models on four benchmark datasets.

02 Disentangling modality-specific and shared information improves accuracy.

03 The approach is effective in both in-target and zero-shot settings.

Abstract

Multi-modal stance detection (MSD) aims to determine an author's stance toward a given target using both textual and visual content. While recent methods leverage multi-modal fusion and prompt-based learning, most fail to distinguish between modality-specific signals and cross-modal evidence, leading to suboptimal performance. We propose DiME (Disentangled Multi-modal Experts), a novel architecture that explicitly separates stance information into textual-dominant, visual-dominant, and cross-modal shared components. DiME first uses a target-aware Chain-of-Thought prompt to generate reasoning-guided textual input. Then, dual encoders extract modality features, which are processed by three expert modules with specialized loss functions: contrastive learning for modality-specific experts and cosine alignment for shared representation learning. A gating network adaptively fuses expert…
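The abstract names two loss families: contrastive learning for the modality-specific experts and cosine alignment for the shared representation. The sketch below shows one common form of each. It is an assumption-laden illustration, not the paper's formulation: the InfoNCE-style contrastive loss, the `1 - cosine` alignment loss, the `temperature` parameter, and every function name here are hypothetical choices, because the exact objectives are not given on this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for zero vectors.
    return dot / (norm_u * norm_v + 1e-12)


def cosine_alignment_loss(text_shared, image_shared):
    """Assumed alignment loss: near 0 when the shared text and image
    features point the same way, 1 when they are orthogonal."""
    return 1.0 - cosine(text_shared, image_shared)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Assumed InfoNCE-style contrastive loss for one anchor.

    Pulls the anchor toward `positive` and pushes it away from each
    vector in `negatives`. Computed as log-sum-exp of all scaled
    similarities minus the scaled positive similarity.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy check: an aligned positive should give a lower contrastive loss
# than an orthogonal one.
good = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a training loop these terms would typically be summed with the stance classification loss using weighting coefficients, which are likewise not specified here.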

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Gaze Tracking and Assistive Technology