Hierarchical Banzhaf Interaction for General Video-Language Representation Learning
Peng Jin, Hao Li, Li Yuan, Shuicheng Yan, Jie Chen

TL;DR
This paper introduces Hierarchical Banzhaf Interaction, a novel game-theoretic approach for fine-grained video-language representation learning, improving multimodal understanding for various downstream tasks.
Contribution
The paper models video-text interaction as a multivariate cooperative game, treating video clips and textual words as players, and proposes Hierarchical Banzhaf Interaction to capture fine-grained semantic correspondences at multiple levels.
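For readers unfamiliar with the underlying quantity, the sketch below computes the standard two-player Banzhaf interaction index from cooperative game theory by brute-force enumeration of coalitions. It is not the paper's hierarchical method or its learned value function: the player names and `toy_value` are hypothetical and chosen only to show that a matched frame-word pair receives a high interaction score while an unrelated pair receives none.

```python
from itertools import combinations


def banzhaf_interaction(players, value, i, j):
    """Standard Banzhaf interaction index between players i and j.

    Averages, over every coalition S drawn from the remaining players,
    the extra value i and j create together beyond what each adds alone:
        v(S + {i, j}) - v(S + {i}) - v(S + {j}) + v(S)
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += (value(s | {i, j}) - value(s | {i})
                      - value(s | {j}) + value(s))
    # There are 2^(n-2) coalitions S, each weighted equally.
    return total / 2 ** len(others)


# Hypothetical toy game: players are video frames and text words.
def toy_value(coalition):
    score = 0.0
    # The dog frame and the word "dog" only pay off when both are present.
    if {"frame_dog", "word_dog"} <= coalition:
        score += 1.0
    # A function word contributes a little on its own, with no synergy.
    if "word_the" in coalition:
        score += 0.25
    return score


players = ["frame_dog", "frame_sky", "word_dog", "word_the"]
matched = banzhaf_interaction(players, toy_value, "frame_dog", "word_dog")
unmatched = banzhaf_interaction(players, toy_value, "frame_sky", "word_dog")
print(matched, unmatched)
```

The enumeration is exponential in the number of players, so it is only practical for toy games of this size.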
Findings
Achieves superior performance on text-video retrieval benchmarks.
Effectively handles fine-grained semantic interactions.
Demonstrates strong generalization across multiple tasks.
Abstract
Multimodal representation learning, with contrastive learning, plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video-text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
