FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning
Van Duc Cuong, Ta Dinh Tam, Tran Duc Chinh, Nguyen Thi Hanh

TL;DR
FLUID introduces a token-level multimodal fusion method with learnable query transforms, adaptive gating, and expert specialization, significantly improving robustness and accuracy in multimodal classification tasks.
Contribution
The paper proposes a novel token-level fusion pipeline with learnable queries, contrastive alignment, and a Mixture-of-Experts for scalable, noise-resilient multimodal learning.
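As a rough illustration of what contrastive alignment between modalities can look like, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired image and text embeddings. The summary does not specify the paper's exact objective, so the function name `contrastive_alignment_loss`, the `temperature` value, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_alignment_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    img, txt: arrays of shape (batch, dim); row i of each is a matched pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(img))
    # Matched pairs sit on the diagonal; score them in both directions.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = contrastive_alignment_loss(emb, emb)           # perfectly matched pairs
misaligned_loss = contrastive_alignment_loss(emb, emb[::-1])  # every pair mismatched
```

Under these assumptions, matched pairs should yield a lower loss than mismatched ones, which is the property an alignment stage relies on.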
Findings
Achieves 91% accuracy on the GLAMI-1M benchmark.
Outperforms prior methods in robustness to noise and class imbalance.
Demonstrates effectiveness of components through ablation studies.
Abstract
Multimodal classification requires robust integration of visual and textual signals, yet common fusion strategies are brittle and vulnerable to modality-specific noise. In this paper, we present FLUID (Flow-Latent Unified Integration via Token Distillation for Expert Specialization), a principled token-level pipeline that improves cross-modal robustness and scalability. FLUID contributes three core elements: (1) Q-transforms, learnable query tokens that distill and retain salient token-level features from modality-specific backbones; (2) a two-stage fusion scheme that enforces cross-modal consistency via contrastive alignment and then performs adaptive, task-aware fusion through a gating mechanism and a Q-bottleneck that selectively compresses information for downstream reasoning; and (3) a lightweight, load-balanced Mixture-of-Experts at prediction time…
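The first two elements of the abstract can be pictured with a minimal sketch: a small set of query tokens cross-attends over a backbone's token sequence to distill it into a fixed number of summary tokens, and a sigmoid gate then mixes the image and text summaries. This is not the authors' implementation. The function names `distill_tokens` and `gated_fusion`, the single-head attention without projections, the gate shape, and all dimensions are assumptions chosen to keep the example short, and the random arrays stand in for parameters that would be learned.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def distill_tokens(queries, tokens):
    """Cross-attention from m query tokens to n backbone tokens.

    queries: (m, d), tokens: (n, d) -> distilled summary of shape (m, d).
    Simplified: single head, no learned key/value projections.
    """
    d = queries.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d), axis=-1)  # (m, n)
    return attn @ tokens


def gated_fusion(img_q, txt_q, w_gate, b_gate):
    """Per-token, per-channel sigmoid gate between two modalities (assumed form)."""
    z = np.concatenate([img_q, txt_q], axis=-1) @ w_gate + b_gate  # (m, d)
    gate = 1.0 / (1.0 + np.exp(-z))
    return gate * img_q + (1.0 - gate) * txt_q


rng = np.random.default_rng(0)
d, m = 32, 4                               # hypothetical width and query count
img_tokens = rng.normal(size=(49, d))      # e.g. a 7x7 grid of patch tokens
txt_tokens = rng.normal(size=(12, d))      # e.g. 12 word-piece tokens
queries_img = rng.normal(size=(m, d))      # stand-ins for learned queries
queries_txt = rng.normal(size=(m, d))
w_gate = rng.normal(size=(2 * d, d)) * 0.1
b_gate = np.zeros(d)

img_q = distill_tokens(queries_img, img_tokens)
txt_q = distill_tokens(queries_txt, txt_tokens)
fused = gated_fusion(img_q, txt_q, w_gate, b_gate)
```

Because the gate is a convex weight, each fused value lies between the corresponding image and text values, so a noisy modality can be down-weighted per channel rather than dropped entirely.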
