RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning

Yunchuan Ma; Laiyun Qing; Guorong Li; Yuankai Qi; Amin Beheshti; Quan Z. Sheng; Qingming Huang

arXiv:2405.07046 · cs.CV · October 29, 2025

TL;DR

RETTA is a zero-shot video captioning framework that combines large-scale pretrained models with test-time adaptation of learnable tokens, improving CIDEr by 5.1% to 32.4% over prior zero-shot methods without requiring ground-truth data.

Contribution

The paper proposes a novel retrieval-enhanced test-time adaptation method that enables large pretrained models to generate accurate video captions in a zero-shot setting.

Findings

01 Achieves 5.1% to 32.4% CIDEr improvements over state-of-the-art zero-shot methods.

02 Adapts at test time in just 16 iterations, without ground-truth data.

03 Demonstrates effectiveness on the MSR-VTT, MSVD, and VATEX datasets.

Abstract

Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pretrained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models, chosen for their source-code availability: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among these four…
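
To make the idea of test-time adaptation with learnable tokens more concrete, the following is a toy sketch only, not the paper's implementation. It assumes a single learnable vector (`tokens`) and a fixed vector (`video_emb`) standing in for the output of a frozen encoder, and it updates the tokens for 16 gradient steps to raise their cosine similarity with that fixed vector. No ground-truth caption appears anywhere in the loop. The function names, the cosine objective, and the learning rate are all illustrative assumptions; the actual losses and the way the tokens interact with XCLIP, CLIP, AnglE, and GPT-2 are not described in the text above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_grad(t, v):
    """Analytic gradient of cosine(t, v) with respect to t."""
    norm_t = math.sqrt(sum(x * x for x in t))
    norm_v = math.sqrt(sum(x * x for x in v))
    dot = sum(x * y for x, y in zip(t, v))
    return [vi / (norm_t * norm_v) - dot * ti / (norm_t ** 3 * norm_v)
            for ti, vi in zip(t, v)]

def adapt_tokens(tokens, video_emb, steps=16, lr=0.5):
    """Hypothetical test-time adaptation loop (toy stand-in).

    Only `tokens` is updated; `video_emb` plays the role of a frozen
    model's output. The objective is a self-supervised similarity
    score, so no ground-truth caption is needed.
    """
    tokens = list(tokens)
    history = [cosine(tokens, video_emb)]
    for _ in range(steps):
        grad = cosine_grad(tokens, video_emb)
        # Gradient ascent: move the tokens toward higher similarity.
        tokens = [t + lr * g for t, g in zip(tokens, grad)]
        history.append(cosine(tokens, video_emb))
    return tokens, history

# Assumed toy inputs: a 3-d "video embedding" and an initial token vector.
video_emb = [1.0, 0.0, 0.0]
init_tokens = [0.1, 1.0, -0.5]
adapted, history = adapt_tokens(init_tokens, video_emb)
print(f"similarity before: {history[0]:.3f}, after 16 steps: {history[-1]:.3f}")
```

The point of the sketch is the shape of the loop: frozen scoring signal, a small set of trainable parameters, and a fixed, small iteration budget per test video.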

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging

Methods: Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Multi-Head Attention · Dense Connections · Attention Dropout · Weight Decay · Cosine Annealing · Dropout