SGCap: Decoding Semantic Group for Zero-shot Video Captioning

Zeyu Pan; Ping Li; Wenxiao Wang

arXiv:2508.01270 · cs.CV · August 5, 2025

Open Access

TL;DR

SGCap is a zero-shot video captioning method that models temporal dynamics and semantic diversity. It outperforms previous zero-shot methods and is competitive with fully supervised models.

Contribution

The paper proposes the Semantic Group Decoding (SGD) strategy, along with Key Sentences Selection and Probability Sampling Supervision modules, for zero-shot video captioning.

Findings

01 Outperforms previous zero-shot methods on benchmarks.

02 Achieves performance competitive with fully supervised models.

03 Effectively models temporal dynamics and semantic diversity.

Abstract

Zero-shot video captioning, which aims to generate sentences describing videos without training the model on video-text pairs, remains underexplored. Existing zero-shot image captioning methods typically adopt a text-only training paradigm, where a language decoder reconstructs single-sentence embeddings obtained from CLIP. However, directly extending them to the video domain is suboptimal, as applying average pooling over all frames neglects temporal dynamics. To address this challenge, we propose a Semantic Group Captioning (SGCap) method for zero-shot video captioning. In particular, it develops the Semantic Group Decoding (SGD) strategy to employ multi-frame information while explicitly modeling inter-frame temporal relationships. Furthermore, existing zero-shot captioning methods that rely on cosine similarity for sentence retrieval and reconstruct the description supervised by…
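
The point that average pooling over frames neglects temporal dynamics can be illustrated with a small sketch. It is not the SGCap implementation: the frame embeddings are hypothetical toy vectors rather than CLIP outputs, and `mean_pool` is a helper name introduced here for illustration. The sketch only shows that the pooled vector is the same for every frame order, so a decoder that sees only the pooled vector cannot tell a video from its reversed or shuffled version.

```python
from itertools import permutations


def mean_pool(frames):
    """Average a list of equal-length frame embeddings element-wise."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def close(u, v, tol=1e-12):
    """Element-wise comparison of two vectors within a small tolerance."""
    return all(abs(a - b) < tol for a, b in zip(u, v))


# Hypothetical 3-dimensional frame embeddings (toy values, not CLIP features).
frames = [
    [1.0, 0.0, 0.0],  # frame 1
    [0.0, 1.0, 0.0],  # frame 2
    [0.0, 0.0, 1.0],  # frame 3
]

forward = mean_pool(frames)         # frames in original order
backward = mean_pool(frames[::-1])  # same frames, reversed in time

# Check every possible ordering of the frames against the original order.
same_for_all_orders = all(
    close(mean_pool(list(order)), forward) for order in permutations(frames)
)

print(close(forward, backward))
print(same_for_all_orders)
```

Both checks hold because addition is commutative. Any representation meant to capture temporal relationships therefore has to keep per-frame information, or its order, in some form.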

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis