MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning

Thong Nguyen; Yi Bin; Xiaobao Wu; Xinshuai Dong; Zhiyuan Hu; Khoi Le; Cong-Duy Nguyen; See-Kiong Ng; and Luu Anh Tuan

arXiv:2407.03788 · cs.CV · October 11, 2024

TL;DR

MAMA introduces a contrastive learning framework with angular margin regularization and dynamic sample weighting to improve video-language representations, addressing data quality and concept distribution issues for better downstream task performance.

Contribution

The paper proposes MAMA, a novel contrastive learning approach with angular margin and adaptive weighting, enhancing video-language representation quality and robustness.

Findings

01. Achieves superior performance on video question answering datasets.

02. Improves text-video retrieval accuracy.

03. Effectively handles data quality and concept imbalance issues.

Abstract

Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous data typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous data also possess an uneven distribution of concepts, thereby hampering the downstream performance across unpopular subjects. To address these problems, we propose MAMA, a new approach to learning video-language representations by utilizing a contrastive objective with a subtractive angular margin to regularize cross-modal representations in their effort to reach perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, MAMA utilizes a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights which…
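The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an InfoNCE-style loss in one direction (video to text), assumes the margin is subtracted from the angle of the matched pair and clamped at zero, and uses hypothetical names and values throughout (`angular_margin_infonce`, `mlp_weight`, `margin`, `temperature`, and the fixed MLP parameters). How the weighting MLP is trained is not covered here; only its forward pass is shown.

```python
import math

def angular_margin_infonce(sim, margin=0.1, temperature=0.07):
    """Per-sample contrastive losses with a subtractive angular margin.

    sim[i][j] is the cosine similarity between video i and text j;
    the diagonal holds the matched (positive) pairs.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = []
        for j, s in enumerate(row):
            s = max(-1.0, min(1.0, s))  # keep acos in its valid domain
            if i == j:
                # Assumption: shrink the positive angle by the margin,
                # so the pair is treated as closer than it really is.
                theta = math.acos(s)
                s = math.cos(max(0.0, theta - margin))
            logits.append(s / temperature)
        # Numerically stable log-sum-exp over the row.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return losses

def mlp_weight(loss, w1, b1, w2, b2):
    """Map a scalar loss to a weight in (0, 1) with a one-hidden-layer MLP."""
    hidden = [max(0.0, w * loss + b) for w, b in zip(w1, b1)]  # ReLU
    out = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-out))  # sigmoid

# Toy 3x3 similarity matrix (illustrative values only).
sim = [[0.9, 0.2, 0.1],
       [0.3, 0.6, 0.2],
       [0.1, 0.4, 0.5]]
losses = angular_margin_infonce(sim)

# Hypothetical fixed MLP parameters; in practice these would be learned.
w1, b1, w2, b2 = [0.5, -0.3], [0.0, 0.1], [1.0, -1.0], 0.0
weights = [mlp_weight(l, w1, b1, w2, b2) for l in losses]
weighted_loss = sum(w * l for w, l in zip(weights, losses)) / len(losses)
```

Under these assumptions, a positive margin raises the logit of the matched pair, so each per-sample loss is lower than it would be with `margin=0.0`. This weakens the pull toward perfect similarity for pairs that may be imperfectly aligned.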

Code & Models

Repositories

nguyentthong/MAMA · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Focus · ALIGN