Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David A. Clifton, Jie Chen

TL;DR
This paper introduces EMCL, a novel contrastive learning method using expectation-maximization to learn compact, more discriminative video-and-language representations, significantly improving retrieval performance.
Contribution
The paper proposes EMCL, a new approach that finds a compact basis for the latent space, reducing its rank and enhancing semantic representation power in video-and-language tasks.
Findings
Outperforms previous state-of-the-art methods on three benchmark datasets.
Enhances representation discriminability and retrieval accuracy.
Can be integrated into existing models without additional training.
Abstract
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features can be concisely represented as linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representation power for the semantics. Extensive…
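The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `em_compact_bases`, the softmax-over-dot-products E-step, the weighted-mean M-step, the random initialisation, and all default parameter values are assumptions chosen for illustration. It only shows how alternating E and M steps can yield a small set of bases, and how re-expressing each feature as a linear combination of those bases bounds the rank of the resulting feature matrix by the number of bases.

```python
import numpy as np


def _responsibilities(features, bases, temperature):
    """Softmax over bases of scaled dot-product similarity (assumed kernel)."""
    logits = features @ bases.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def em_compact_bases(features, num_bases=8, num_iters=5, temperature=1.0, seed=0):
    """Hypothetical helper: re-express features over a small set of bases.

    features: (N, D) array, e.g. video and text features stacked row-wise
              (stacking both modalities is an assumption of this sketch).
    Returns (reconstructed, bases, resp) with shapes (N, D), (K, D), (N, K).
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    # Assumed initialisation: pick K distinct feature rows as starting bases.
    bases = features[rng.choice(n, size=num_bases, replace=False)].copy()

    for _ in range(num_iters):
        # E-step: soft assignment of every feature to every basis.
        resp = _responsibilities(features, bases, temperature)
        # M-step: each basis becomes the responsibility-weighted mean of features.
        bases = (resp.T @ features) / (resp.sum(axis=0)[:, None] + 1e-8)

    # Final assignment against the updated bases, then reconstruction:
    # every row is a linear combination of the K bases, so rank <= K.
    resp = _responsibilities(features, bases, temperature)
    reconstructed = resp @ bases
    return reconstructed, bases, resp
```

In a retrieval setup, the reconstructed features would stand in for the raw ones when computing text-video similarities; how they are combined with the original features, and whether the step is applied during training or only at inference, is left open here.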
Taxonomy
Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning · Contrastive Language-Image Pre-training
