ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu; Yi Yang

arXiv:2011.07231 · cs.CV · November 17, 2020 · 1 citation

TL;DR

ActBERT is a self-supervised method for learning joint video-text representations. It combines global action cues, local regional objects, and linguistic descriptions in a single transformer architecture and is evaluated on multiple downstream video-and-language tasks.

Contribution

The paper proposes an ENtangled Transformer (ENT) block and a global-local action modeling approach for self-supervised video-text representation learning.

Findings

01 Outperforms state-of-the-art methods on video-text tasks

02 Effective in text-video retrieval and video captioning

03 Demonstrates strong generalization across multiple tasks

Abstract

In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clue extraction from contextual information. It enforces the joint video-text representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and-language tasks, i.e.,…
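The abstract describes the ENT block only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that global action tokens first query the text and region streams, and that the resulting action-guided summaries then serve as keys and values for the opposite stream. The function names (`attend`, `entangled_block_sketch`), the single attention head, and the absence of learned projections, layer normalization, and feed-forward layers are all simplifying assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        )
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return out

def add(x, y):
    """Element-wise sum of two equally shaped lists of vectors (residual)."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]

def entangled_block_sketch(action, region, text):
    """Hypothetical three-stream block: action cues mediate text-region mixing.

    action, region, text: lists of equal-width feature vectors.
    Returns updated (action, region, text) with unchanged shapes.
    """
    # Assumption: global action tokens summarize each of the other streams.
    action_from_text = attend(action, text, text)
    action_from_region = attend(action, region, region)
    # Assumption: each local stream reads the action-guided summary of the
    # other stream, so action acts as the bridge between words and objects.
    text_out = add(text, attend(text, action_from_region, action_from_region))
    region_out = add(region, attend(region, action_from_text, action_from_text))
    # The action stream is refined with plain self-attention here.
    action_out = add(action, attend(action, action, action))
    return action_out, region_out, text_out
```

Calling `entangled_block_sketch` with, say, two action vectors, three region vectors, and four word vectors of the same width returns three lists with the same shapes as the inputs, which is what allows such blocks to be stacked.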

Code & Models

Repositories

PaddlePaddle/PaddleVideo/blob/develop/docs/en/model_zoo/multimodal/actbert.md
paddle · Official

Videos

ActBERT: Learning Global-Local Video-Text Representations · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques

Methods: Linear Layer · Attentive Walk-Aggregating Graph Neural Network · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections