Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

Jie Lei; Linjie Li; Luowei Zhou; Zhe Gan; Tamara L. Berg; Mohit Bansal; Jingjing Liu

arXiv:2102.06183 · cs.CV · February 12, 2021 · 48 cites

Open Access · 1 Repo

TL;DR

ClipBERT introduces a sparse sampling framework for video-and-language learning. It trains end-to-end on only a few short clips per video and outperforms methods built on dense features across diverse datasets, demonstrating a less-is-more principle.

Contribution

The paper proposes ClipBERT, a novel framework that uses sparse sampling for efficient end-to-end video-and-language learning, reducing computational costs and improving accuracy.

Findings

01 Outperforms dense feature methods on multiple datasets

02 Effective across videos of varying lengths and domains

03 End-to-end training with sparse clips is more accurate

Abstract

The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision models and text features from language models. These feature extractors are trained independently and usually on tasks different from the target domains, rendering these fixed features sub-optimal for downstream tasks. Moreover, due to the high computational overload of dense video features, it is often difficult (or infeasible) to plug feature extractors directly into existing approaches for easy finetuning. To provide a remedy to this dilemma, we propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks, by employing sparse sampling, where only a single or a few sparsely sampled short clips from a video are used at each training step. Experiments on…
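The sparse sampling idea described in the abstract can be illustrated with a minimal sketch. It assumes clips are drawn uniformly at random over the video, which the abstract does not specify; the function name `sample_sparse_clips` and the parameters `num_clips` and `clip_len` are hypothetical and are not taken from the ClipBERT repository.

```python
import random


def sample_sparse_clips(num_frames, num_clips=2, clip_len=16, rng=None):
    """Pick a few short clips from a video for one training step.

    Hypothetical helper: uniform random start positions are an assumption,
    not necessarily the sampling scheme used by ClipBERT.
    Returns a list of (start, end) frame index pairs, with end exclusive.
    """
    rng = rng or random.Random()
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")
    # Latest start index that still leaves room for a full clip.
    max_start = num_frames - clip_len
    # Draw start positions independently; clips may overlap in this sketch.
    starts = sorted(rng.randint(0, max_start) for _ in range(num_clips))
    return [(start, start + clip_len) for start in starts]


if __name__ == "__main__":
    # A 300-frame video contributes only 2 x 16 = 32 frames to this step,
    # instead of dense features extracted from every frame.
    print(sample_sparse_clips(300, num_clips=2, clip_len=16))
```

Calling the function again at the next training step draws a fresh set of clips, so different parts of the video can be seen over the course of training while each individual step stays cheap.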

Code & Models

Repositories

jayleicn/ClipBERT (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: ClipBERT