Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

TL;DR
This paper introduces FrozenBiLM, a zero-shot VideoQA method using frozen bidirectional language models combined with visual inputs, outperforming previous approaches across multiple datasets.
Contribution
The work demonstrates that frozen bidirectional language models can be effectively adapted for zero-shot VideoQA, providing a stronger and more cost-efficient alternative to autoregressive models.
Findings
Outperforms state-of-the-art in zero-shot VideoQA on multiple datasets.
Effective integration of visual inputs with frozen BiLM using light trainable modules.
Competitive results in few-shot and fully-supervised settings.
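The second finding, a frozen language model combined with light trainable modules, can be illustrated with a toy sketch. This is not the authors' implementation: the class `FrozenPlusProjection`, its dimensions, the random stand-in for pretrained weights, and the squared-error objective are all hypothetical and chosen only to show that gradient updates touch the small projection while the frozen weights stay fixed.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class FrozenPlusProjection:
    """Toy model: a frozen matrix (stand-in for a pretrained language model)
    preceded by a small trainable linear projection of visual features."""

    def __init__(self, visual_dim, embed_dim, seed=0):
        rng = random.Random(seed)
        # Frozen weights: assumed to come from pretraining, never updated here.
        self.frozen = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
                       for _ in range(embed_dim)]
        # Trainable weights: project visual features into the embedding space.
        self.proj = [[0.0] * visual_dim for _ in range(embed_dim)]

    def forward(self, visual):
        return matvec(self.frozen, matvec(self.proj, visual))

    def train_step(self, visual, target, lr=0.05):
        """One gradient step on a squared-error loss, updating only self.proj."""
        hidden = matvec(self.proj, visual)
        output = matvec(self.frozen, hidden)
        error = [o - t for o, t in zip(output, target)]
        # Backpropagate through the frozen matrix without modifying it.
        grad_hidden = [sum(self.frozen[i][j] * error[i] for i in range(len(error)))
                       for j in range(len(hidden))]
        for j, g in enumerate(grad_hidden):
            for k, v in enumerate(visual):
                self.proj[j][k] -= lr * g * v
        return 0.5 * sum(e * e for e in error)


model = FrozenPlusProjection(visual_dim=3, embed_dim=4)
frozen_before = [row[:] for row in model.frozen]
visual_feature = [1.0, 0.0, 0.0]
target_embedding = [0.5, -0.2, 0.1, 0.3]
losses = [model.train_step(visual_feature, target_embedding) for _ in range(20)]
```

After the 20 steps, `model.frozen` is identical to `frozen_before` and only `model.proj` has moved, which is the property that keeps training cheap when the frozen component is large.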
Abstract
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where…
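Step (iii) is cut off in the abstract above, so the paper's exact inference procedure is not reproduced here. As a generic illustration of answering a question through masked language modeling, the sketch below reads the model's scores at a masked position and compares only the candidate answers. The function `answer_from_mask_logits`, the toy vocabulary, the prompt wording, and the logit values are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_from_mask_logits(mask_logits, vocab, candidates):
    """Return the most probable candidate at the masked position,
    plus the probability of each candidate (renormalized over candidates)."""
    index = {token: i for i, token in enumerate(vocab)}
    scores = [mask_logits[index[c]] for c in candidates]
    probs = softmax(scores)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], dict(zip(candidates, probs))


# Hypothetical prompt with one masked slot for the answer.
prompt = "Question: what is the man playing? Answer: [MASK]."

# Made-up scores a masked language model might assign at the [MASK] position.
vocab = ["cat", "dog", "guitar", "the", "kitchen"]
mask_logits = [1.2, 0.3, 2.5, 4.0, -0.7]

answer, probs = answer_from_mask_logits(mask_logits, vocab, ["cat", "dog", "guitar"])
print(answer)  # guitar
```

Restricting the comparison to candidate answers matters in this toy case: "the" has the highest raw score at the masked position but is not a valid answer, so it is never considered.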
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax
