RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning
Yunchuan Ma, Laiyun Qing, Guorong Li, Yuankai Qi, Amin Beheshti, Quan Z. Sheng, Qingming Huang

TL;DR
RETTA introduces a zero-shot video captioning framework that leverages large-scale pretrained models and test-time adaptation with learnable tokens, significantly improving captioning performance without requiring ground truth data.
Contribution
The paper proposes a novel retrieval-enhanced test-time adaptation method that enables large pretrained models to generate accurate video captions in a zero-shot setting.
Findings
Achieves 5.1%-32.4% CIDEr improvements over state-of-the-art zero-shot methods.
Efficiently adapts models in just 16 iterations without ground truth data.
Demonstrates effectiveness on MSR-VTT, MSVD, and VATEX datasets.
Abstract
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pretrained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among these four…
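The core idea in the abstract, updating a small set of learnable tokens at test time while the pretrained models stay frozen, can be illustrated with a toy sketch. This is not RETTA's actual objective and uses none of XCLIP, CLIP, AnglE, or GPT-2: the frozen scorer is assumed to be plain cosine similarity to a fixed "video embedding", and the names `adapt_tokens`, `video_embedding`, and `initial_tokens` are hypothetical. Only the 16-step budget is taken from the Findings above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cosine_grad(t, v):
    """Analytic gradient of cosine(t, v) with respect to t."""
    nt = math.sqrt(sum(x * x for x in t))
    nv = math.sqrt(sum(y * y for y in v))
    c = cosine(t, v)
    return [vi / (nt * nv) - c * ti / (nt * nt) for ti, vi in zip(t, v)]

def adapt_tokens(tokens, video_embedding, steps=16, lr=0.5):
    """Gradient ascent on the learnable tokens only.

    The scorer (here: cosine to a fixed vector, an assumption standing in
    for frozen pretrained models) is never updated, and no ground-truth
    caption is used.
    """
    tokens = list(tokens)
    history = [cosine(tokens, video_embedding)]
    for _ in range(steps):
        grad = cosine_grad(tokens, video_embedding)
        tokens = [t + lr * g for t, g in zip(tokens, grad)]
        history.append(cosine(tokens, video_embedding))
    return tokens, history

# Hypothetical toy values, chosen only for illustration.
video_embedding = [0.2, -0.5, 0.9, 0.1]
initial_tokens = [1.0, 1.0, 1.0, 1.0]

adapted, history = adapt_tokens(initial_tokens, video_embedding)
print(f"score before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

In a real system the scalar score would come from frozen vision-language models evaluated on the generated text, and the gradient would be obtained by automatic differentiation rather than a closed form.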
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging
Methods: Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Multi-Head Attention · Dense Connections · Attention Dropout · Weight Decay · Cosine Annealing · Dropout
