Multimodal Structure Learning: Disentangling Shared and Specific Topology via Cross-Modal Graphical Lasso

Fei Wang; Yutong Zhang; Xiong Wang

arXiv:2604.03953·cs.CV·April 7, 2026

TL;DR

This paper introduces CM-GLasso, a novel method for disentangling shared and specific multimodal topologies using cross-modal graphical models, improving interpretability and performance in vision-language tasks.

Contribution

It proposes a unified framework combining cross-attention, semantic node extraction, and joint structure learning to better disentangle multimodal dependencies.

Findings

01. Achieves state-of-the-art results on eight benchmarks.

02. Effectively disentangles shared and category-specific structures.

03. Improves interpretability of multimodal representations.

Abstract

Learning interpretable multimodal representations inherently relies on uncovering the conditional dependencies between heterogeneous features. However, applying sparse graph estimation techniques, such as Graphical Lasso (GLasso), to visual-linguistic domains is severely bottlenecked by high-dimensional noise, modality misalignment, and the confounding of shared versus category-specific topologies. In this paper, we propose Cross-Modal Graphical Lasso (CM-GLasso), which overcomes these fundamental limitations. By coupling a novel text-visualization strategy with a unified vision-language encoder, we strictly align multimodal features into a shared latent space. We introduce a cross-attention distillation mechanism that condenses high-dimensional patches into explicit semantic nodes, naturally extracting spatial-aware cross-modal priors. Furthermore, we unify tailored GLasso estimation and…
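
For context, the standard Graphical Lasso that the abstract builds on can be written as the penalized Gaussian log-likelihood problem below. This is the textbook formulation only, not the tailored CM-GLasso objective, which the truncated abstract does not spell out. The symbols are assumptions introduced here for illustration: S is the empirical covariance of the features, Θ is the estimated precision (inverse covariance) matrix, and λ ≥ 0 controls sparsity.

```latex
% Standard Graphical Lasso (textbook form; not the paper's CM-GLasso objective).
% S: empirical covariance, \Theta: precision matrix, \lambda \ge 0: sparsity weight.
\hat{\Theta}
  = \arg\max_{\Theta \succ 0}
    \; \log\det\Theta
    \;-\; \operatorname{tr}(S\Theta)
    \;-\; \lambda \lVert \Theta \rVert_{1}
```

Under a Gaussian model, an entry Θ_ij = 0 means features i and j are conditionally independent given all the others, so the nonzero pattern of the estimate defines the edges of the learned graph. Some formulations apply the ℓ1 penalty to the off-diagonal entries only.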
