MOVER: Multimodal Optimal Transport with Volume-based Embedding Regularization
Haochen You, Baojing Liu

TL;DR
MOVER introduces a multimodal learning framework combining optimal transport and volume-based regularization to achieve semantically aligned, structured representations across multiple modalities, improving retrieval and generalization.
Contribution
The paper combines optimal transport with geometric (volume-based) regularization for multimodal embedding, improving cross-modal alignment and the structure of the embedding space.
Findings
Outperforms prior state-of-the-art methods on text-video-audio retrieval
Improves zero-shot and finetuned retrieval performance
Enhances structural consistency and generalization in embeddings
Abstract
Recent advances in multimodal learning have largely relied on pairwise contrastive objectives to align different modalities, such as text, video, and audio, in a shared embedding space. While effective in bi-modal setups, these approaches struggle to generalize across multiple modalities and often lack semantic structure in high-dimensional spaces. In this paper, we propose MOVER, a novel framework that combines optimal transport-based soft alignment with volume-based geometric regularization to build semantically aligned and structured multimodal representations. By integrating a transport-guided matching mechanism with a geometric volume minimization objective (GAVE), MOVER encourages consistent alignment across all modalities in a modality-agnostic manner. Experiments on text-video-audio retrieval tasks demonstrate that MOVER significantly outperforms prior state-of-the-art methods…
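The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so everything here is an assumption made for illustration. The soft alignment is approximated with entropic optimal transport (Sinkhorn iterations) over a cosine cost with uniform marginals, and the volume term is taken to be the square root of the Gram determinant of the three unit-normalized modality embeddings of each sample. The function names, the choice of text as the anchor modality, and the parameters `reg` and `volume_weight` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cosine_cost(x, y):
    """Pairwise cost matrix: 1 - cosine similarity between rows of x and y."""
    return 1.0 - l2_normalize(x) @ l2_normalize(y).T


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over rows (assumption)
    b = np.full(m, 1.0 / m)  # uniform mass over columns (assumption)
    kernel = np.exp(-cost / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def gram_volume(text, video, audio):
    """Per-sample volume of the parallelotope spanned by the three
    unit-normalized modality vectors: sqrt(det(Gram))."""
    stacked = np.stack(
        [l2_normalize(text), l2_normalize(video), l2_normalize(audio)], axis=1
    )  # shape (batch, 3, dim)
    gram = stacked @ stacked.transpose(0, 2, 1)  # shape (batch, 3, 3)
    det = np.linalg.det(gram)
    # Clip tiny negative determinants caused by floating-point error.
    return np.sqrt(np.clip(det, 0.0, None))


def sketch_loss(text, video, audio, reg=0.1, volume_weight=1.0):
    """Hypothetical combined objective: transport cost from text to each
    other modality plus a weighted mean volume term."""
    transport = 0.0
    for other in (video, audio):
        cost = cosine_cost(text, other)
        transport += float(np.sum(sinkhorn_plan(cost, reg) * cost))
    volume = float(np.mean(gram_volume(text, video, audio)))
    return transport + volume_weight * volume
```

Under these assumptions the volume term is close to 0 when the three embeddings of a sample point in the same direction and equals 1 when they are mutually orthogonal, so minimizing it pulls the modalities of a sample together while the transport plan supplies a soft, many-to-many matching across the batch.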
Taxonomy
Topics: Energy Efficient Wireless Sensor Networks · Speech and Audio Processing
