Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu

TL;DR
ClipBERT introduces a sparse sampling framework for video-and-language learning that enables end-to-end training on only a few short clips per video. It outperforms methods built on dense video features across diverse datasets, demonstrating a less-is-more principle.
Contribution
The paper proposes ClipBERT, a novel framework that uses sparse sampling for efficient end-to-end video-and-language learning, reducing computational costs and improving accuracy.
Findings
Outperforms dense feature methods on multiple datasets
Effective across videos of varying lengths and domains
End-to-end training with sparse clips is more accurate than learning from dense offline-extracted features
Abstract
The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision models and text features from language models. These feature extractors are trained independently and usually on tasks different from the target domains, rendering these fixed features sub-optimal for downstream tasks. Moreover, due to the high computational overload of dense video features, it is often difficult (or infeasible) to plug feature extractors directly into existing approaches for easy finetuning. To provide a remedy to this dilemma, we propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks, by employing sparse sampling, where only a single or a few sparsely sampled short clips from a video are used at each training step. Experiments on…
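The sparse sampling idea in the abstract (a single or a few short clips per video at each training step) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_sparse_clips`, its parameters, and the choice to draw one clip from each equal-length segment of the video are all assumptions made for illustration.

```python
import random


def sample_sparse_clips(num_frames, num_clips=2, clip_len=16, seed=None):
    """Pick frame indices for a few short clips from one video.

    Hypothetical helper, not from the paper. The video is split into
    `num_clips` equal segments and one clip start is drawn at random
    inside each segment (an assumed sampling scheme).
    """
    if num_clips < 1 or clip_len < 1:
        raise ValueError("num_clips and clip_len must be positive")
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")

    rng = random.Random(seed)
    seg_len = num_frames / num_clips
    last_valid_start = num_frames - clip_len  # clip must end inside the video

    clips = []
    for i in range(num_clips):
        seg_start = int(i * seg_len)
        seg_end = int((i + 1) * seg_len)
        # Latest start inside this segment that still fits in the video.
        max_start = min(max(seg_end - 1, seg_start), last_valid_start)
        min_start = min(seg_start, max_start)
        start = rng.randint(min_start, max_start)
        clips.append(list(range(start, start + clip_len)))
    return clips


# Each call stands in for one training step: only these frames would be
# decoded and passed through the model, instead of the whole video.
for step in range(3):
    clips = sample_sparse_clips(num_frames=300, num_clips=2, clip_len=16)
    print(step, [(c[0], c[-1]) for c in clips])
```

Because a fresh set of clips is drawn on every call, different training steps see different parts of the same video, while each step processes only `num_clips * clip_len` frames.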
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: ClipBERT
