Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Antoine Yang; Antoine Miech; Josef Sivic; Ivan Laptev; Cordelia Schmid

arXiv:2206.08155·cs.CV·October 11, 2022·64 cites

TL;DR

This paper introduces FrozenBiLM, a zero-shot VideoQA method using frozen bidirectional language models combined with visual inputs, outperforming previous approaches across multiple datasets.

Contribution

The work demonstrates that frozen bidirectional language models can be effectively adapted for zero-shot VideoQA, providing a stronger and more cost-efficient alternative to autoregressive models.

Findings

01. Outperforms the state of the art in zero-shot VideoQA on multiple datasets.

02. Integrates visual inputs with the frozen BiLM through light trainable modules.

03. Achieves competitive results in few-shot and fully-supervised settings.

Abstract

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where…
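Step (iii) of the abstract is cut off in this listing, so the following is only a minimal sketch of how answer selection through masked language modeling could work, not the authors' implementation. The prompt template, the function names `build_prompt` and `answer_from_mask_logits`, and the toy logits are all hypothetical. The frozen BiLM and the trainable visual modules are not modeled at all: a plain dictionary stands in for the logits the model would produce at the masked position.

```python
import math


def build_prompt(question, mask_token="[MASK]"):
    # Hypothetical template: the masked slot is where the answer is predicted.
    return f"Question: {question} Answer: {mask_token}."


def answer_from_mask_logits(mask_logits, answer_vocab):
    """Pick the candidate answer with the highest score at the masked position.

    mask_logits: dict mapping token -> logit at the masked position
                 (a stand-in for the output of a frozen bidirectional LM).
    answer_vocab: list of single-token candidate answers.
    Returns (best_answer, probabilities over answer_vocab).
    """
    # Candidates without a logit get -inf, which becomes probability 0.
    scores = [mask_logits.get(a, float("-inf")) for a in answer_vocab]
    top = max(scores)  # raises ValueError if answer_vocab is empty
    if top == float("-inf"):
        raise ValueError("no candidate answer has a logit")
    # Softmax restricted to the candidate answers (shifted for stability).
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = {a: e / total for a, e in zip(answer_vocab, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy logits, made up for illustration only.
toy_logits = {"dog": 2.0, "cat": 0.5, "car": -1.0}
prompt = build_prompt("What animal is on the sofa?")
best, probs = answer_from_mask_logits(toy_logits, ["cat", "dog", "bird"])
print(prompt)
print(best)
```

Restricting the softmax to a candidate list keeps the comparison among plausible answers only. In a real system the logits would come from the language model conditioned on both the video features and the text prompt.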

Code & Models

Videos

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax