ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu, Yi Yang

TL;DR
ActBERT is a self-supervised method for learning joint video-text representations. It combines global action cues, local regional objects, and linguistic descriptions in an ENtangled Transformer block (ENT), and the authors report improved performance on multiple downstream video-language tasks.
Contribution
The paper proposes an ENtangled Transformer block (ENT) and a global-local action modeling approach for self-supervised video-text representation learning.
Findings
Outperforms state-of-the-art methods on video-text tasks
Effective in text-video retrieval and video captioning
Demonstrates strong generalization across multiple tasks
Abstract
In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clues extraction from contextual information. It enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e.,…
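The three-stream encoding described in the abstract can be pictured with a toy sketch. The abstract does not give the ENT equations, so everything below is an assumption made for illustration: the function names (`attend`, `entangled_step`), the use of a single attention head, and the omission of learned projections, residual connections, and feed-forward layers. It only shows one plausible way that a global action stream could mediate the exchange between word tokens and region tokens, and should not be read as the authors' implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def entangled_step(action, words, regions):
    """Hypothetical, simplified three-stream mixing step (not the paper's ENT)."""
    # 1. The global action tokens query each local stream, giving
    #    action-guided summaries of the text and of the regions.
    action_on_words = attend(action, words, words)
    action_on_regions = attend(action, regions, regions)
    # 2. Each local stream attends to the action-guided summary of the
    #    other stream, so text and regions interact through the action cue.
    new_words = attend(words, action_on_regions, action_on_regions)
    new_regions = attend(regions, action_on_words, action_on_words)
    # 3. The action stream is updated by plain self-attention.
    new_action = attend(action, action, action)
    return new_action, new_words, new_regions


def rand_tokens(n, d=4):
    """Random stand-ins for real embeddings (illustrative only)."""
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
action, words, regions = rand_tokens(2), rand_tokens(5), rand_tokens(3)
new_action, new_words, new_regions = entangled_step(action, words, regions)
print(len(new_action), len(new_words), len(new_regions))
```

Each stream keeps its own token count and width, so a step like this could in principle be stacked; a real model would add learned query, key, and value projections and multiple heads.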
Videos
ActBERT: Learning Global-Local Video-Text Representations (YouTube)
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
Methods: Linear Layer · Attentive Walk-Aggregating Graph Neural Network · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections
