MOVER: Multimodal Optimal Transport with Volume-based Embedding Regularization

Haochen You; Baojing Liu

arXiv:2508.12149·cs.AI·August 19, 2025

Open Access

TL;DR

MOVER introduces a multimodal learning framework combining optimal transport and volume-based regularization to achieve semantically aligned, structured representations across multiple modalities, improving retrieval and generalization.

Contribution

The paper combines optimal transport with volume-based geometric regularization for multimodal embedding, aiming to improve cross-modal alignment and the structure of the embedding space.

Findings

01 Outperforms state-of-the-art in text-video-audio retrieval
02 Improves zero-shot and finetuned retrieval performance
03 Enhances structural consistency and generalization in embeddings

Abstract

Recent advances in multimodal learning have largely relied on pairwise contrastive objectives to align different modalities, such as text, video, and audio, in a shared embedding space. While effective in bi-modal setups, these approaches struggle to generalize across multiple modalities and often lack semantic structure in high-dimensional spaces. In this paper, we propose MOVER, a novel framework that combines optimal transport-based soft alignment with volume-based geometric regularization to build semantically aligned and structured multimodal representations. By integrating a transport-guided matching mechanism with a geometric volume minimization objective (GAVE), MOVER encourages consistent alignment across all modalities in a modality-agnostic manner. Experiments on text-video-audio retrieval tasks demonstrate that MOVER significantly outperforms prior state-of-the-art methods…
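The two ingredients named in the abstract can be illustrated with a small sketch. It is not the authors' implementation, because the abstract gives no formulas. It makes three assumptions: the soft alignment is computed with standard Sinkhorn iterations on a cosine-distance cost, "volume" means the Gram-determinant volume of the parallelotope spanned by one sample's per-modality embeddings, and all function names and hyperparameters (`reg`, `n_iters`) are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals.

    Assumption: standard Sinkhorn scaling stands in for the
    transport-guided matching described in the abstract.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)        # uniform mass over rows (e.g. texts)
    b = np.full(m, 1.0 / m)        # uniform mass over columns (e.g. videos)
    K = np.exp(-cost / reg)        # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)            # rescale to match row marginals
        v = b / (K.T @ u)          # rescale to match column marginals
    return u[:, None] * K * v[None, :]


def gram_volume(vectors):
    """Volume spanned by the rows of `vectors`: sqrt(det(V V^T)).

    Assumption: an illustrative reading of a "volume-based" objective.
    The value is 0 when the rows are collinear and 1 when they are
    orthonormal, so minimizing it pulls the modalities together.
    """
    gram = vectors @ vectors.T
    det = float(np.linalg.det(gram))
    return float(np.sqrt(max(det, 0.0)))  # clamp tiny negative round-off


# Toy data: 4 samples, 64-dim embeddings for three modalities.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 64))
text = l2_normalize(base)
video = l2_normalize(base + 0.05 * rng.normal(size=(4, 64)))
audio = l2_normalize(base + 0.05 * rng.normal(size=(4, 64)))

# Soft text-to-video alignment from a cosine-distance cost matrix.
plan = sinkhorn_plan(1.0 - text @ video.T)

# Per-sample volume of the (text, video, audio) triple.
volumes = [gram_volume(np.stack([text[i], video[i], audio[i]])) for i in range(4)]
```

In this toy setup the transport plan concentrates its mass on matched text-video pairs, and the per-sample volumes are small because the three modality vectors nearly coincide. How MOVER actually combines the transport and volume terms into a training loss is not stated in the abstract and is not reproduced here.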

Taxonomy

Topics: Energy Efficient Wireless Sensor Networks · Speech and Audio Processing