Toward Structural Multimodal Representations: Specialization, Selection, and Sparsification via Mixture-of-Experts
Hahyeon Choi, Nojun Kwak

TL;DR
This paper introduces S3, a structural framework for multimodal learning that decomposes inputs into semantic experts, enabling selective routing and sparsification for improved accuracy and compactness.
Contribution
S3 represents multimodal inputs as a set of semantic experts with task-specific routing and pruning, in contrast to contrastive or InfoMax-driven methods that encode all signals into a single fixed embedding.
Findings
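S3 improves accuracy across four MultiBench benchmarks.
Accuracy follows a consistent reverse U-shaped trend against sparsity, peaking at intermediate sparsity levels.
The results suggest that structured, selectable representations are a practical alternative to a single fixed embedding.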
Abstract
We propose S3 (Specialization, Selection, Sparsification), a framework that rethinks multimodal learning through a structural perspective. Instead of encoding all signals into a fixed embedding, S3 decomposes multimodal inputs into semantic experts and selectively routes them for each task. Specialization forms concept-level experts in a shared latent space, Selection adapts routing for task-specific needs, and Sparsification prunes low-utility paths to yield compact, information-minimal representations. Across four MultiBench benchmarks, S3 improves accuracy and shows a consistent reverse U-shaped sparsity-performance trend, with peak performance at intermediate sparsity. These results suggest that structuring multimodal representations as selectable semantic components provides a practical and principled alternative to contrastive learning or InfoMax-driven approaches.
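The three stages described in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the linear experts, the softmax gate with one logit vector per task, and the top-k pruning rule are all assumptions chosen for illustration, as are the names `SparseExpertRouter`, `top_k_sparsify`, and the task labels. In particular, routing here depends only on the task, which is a simplification.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def top_k_sparsify(weights, k):
    """Sparsification (assumed rule): keep the k largest weights, renormalize."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set(order[:k])
    kept_total = sum(weights[i] for i in keep)
    return [weights[i] / kept_total if i in keep else 0.0
            for i in range(len(weights))]


class SparseExpertRouter:
    """Hypothetical toy model: linear experts plus per-task routing."""

    def __init__(self, in_dim, latent_dim, num_experts, tasks, seed=0):
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        # Specialization: each expert maps the input into a shared latent space.
        self.experts = [
            [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(latent_dim)]
            for _ in range(num_experts)
        ]
        # Selection: one routing-logit vector per task (untrained here).
        self.task_logits = {
            t: [rng.gauss(0, 1) for _ in range(num_experts)] for t in tasks
        }

    def routing_weights(self, task, k):
        """Task-specific gate followed by top-k pruning."""
        return top_k_sparsify(softmax(self.task_logits[task]), k)

    def forward(self, x, task, k):
        """Weighted sum of the outputs of the experts that survive pruning."""
        out = [0.0] * self.latent_dim
        for w, W in zip(self.routing_weights(task, k), self.experts):
            if w == 0.0:
                continue  # pruned path: this expert is never evaluated
            for j, v in enumerate(matvec(W, x)):
                out[j] += w * v
        return out


router = SparseExpertRouter(in_dim=6, latent_dim=4, num_experts=8,
                            tasks=["sentiment", "humor"])
x = [0.5, -1.0, 0.25, 0.0, 1.5, -0.75]
weights = router.routing_weights("sentiment", k=3)
representation = router.forward(x, "sentiment", k=3)
```

In this sketch, `k` plays the role of the sparsity level: sweeping it from 1 to the number of experts is one way such a model could be probed for the sparsity-performance trend the abstract reports, although the weights here are random and untrained.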
