Hierarchical Banzhaf Interaction for General Video-Language Representation Learning

Peng Jin; Hao Li; Li Yuan; Shuicheng Yan; Jie Chen

arXiv:2412.20964·cs.CV·January 1, 2025

TL;DR

This paper introduces Hierarchical Banzhaf Interaction, a game-theoretic approach to fine-grained video-language representation learning that treats video clips and textual words as players in a cooperative game. The learned representations are intended to transfer to a range of downstream tasks.

Contribution

The paper models video-text interaction as a multivariate cooperative game and proposes Hierarchical Banzhaf Interaction to capture fine-grained semantic correspondences between video clips and textual words from hierarchical perspectives.

Findings

01 Achieves superior performance on text-video retrieval benchmarks.

02 Effectively handles fine-grained semantic interactions.

03 Demonstrates strong generalization across multiple tasks.

Abstract

Multimodal representation learning, with contrastive learning, plays an important role in the artificial intelligence domain. As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs. However, to enhance and refine such coarse-grained global interactions, more detailed interactions are necessary for fine-grained multimodal learning. In this study, we introduce a new approach that models video-text as game players using multivariate cooperative game theory to handle uncertainty during fine-grained semantic interactions with diverse granularity, flexible combination, and vague intensity. Specifically, we design the Hierarchical Banzhaf Interaction to simulate the fine-grained correspondence between video clips and textual words from hierarchical perspectives. Furthermore, to…
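To make the game-theoretic framing concrete, the sketch below computes the textbook pairwise Banzhaf interaction index by brute force on a toy game whose players are two video clips and two words. For players i and j, the index averages v(S ∪ {i, j}) − v(S ∪ {i}) − v(S ∪ {j}) + v(S) over every coalition S of the remaining players. This is an illustration under stated assumptions, not the paper's implementation: the player names, the `toy_value` function, and its matched pairs are hypothetical, and exhaustive enumeration is exponential in the number of players, so it only suits tiny examples.

```python
from itertools import combinations

# Hypothetical players: two video clips and two words (names are illustrative).
PLAYERS = ["v1", "v2", "w1", "w2"]

# Hypothetical ground truth: clip v1 matches word w1, clip v2 matches word w2.
MATCHED_PAIRS = [("v1", "w1"), ("v2", "w2")]


def toy_value(coalition):
    """Toy value function: count matched clip-word pairs fully inside the coalition."""
    return sum(1 for clip, word in MATCHED_PAIRS
               if clip in coalition and word in coalition)


def banzhaf_interaction(players, value, i, j):
    """Textbook pairwise Banzhaf interaction index between players i and j.

    Averages the joint marginal contribution of {i, j} over all coalitions S
    drawn from the remaining players.
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += (value(s | {i, j}) - value(s | {i})
                      - value(s | {j}) + value(s))
    return total / (2 ** len(others))


print(banzhaf_interaction(PLAYERS, toy_value, "v1", "w1"))  # 1.0
print(banzhaf_interaction(PLAYERS, toy_value, "v1", "w2"))  # 0.0
```

A positive index means the two players contribute more together than separately, which is the signal a fine-grained alignment objective would want to reward for a matching clip and word. The unmatched pair scores zero in this toy game because adding both players never creates value beyond what each adds alone.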

Code & Models

Repositories

jpthu17/HBI (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition