SGCap: Decoding Semantic Group for Zero-shot Video Captioning
Zeyu Pan, Ping Li, Wenxiao Wang

TL;DR
SGCap is a zero-shot video captioning approach that models temporal dynamics and semantic diversity. The authors report that it significantly outperforms previous zero-shot methods and is competitive with fully supervised models.
Contribution
It proposes the Semantic Group Decoding strategy along with Key Sentences Selection and Probability Sampling Supervision modules for effective zero-shot video captioning.
Findings
Outperforms previous zero-shot methods on benchmarks.
Achieves performance competitive with fully supervised models.
Effectively models temporal dynamics and semantic diversity.
Abstract
Zero-shot video captioning aims to generate sentences for describing videos without training the model on video-text pairs, which remains underexplored. Existing zero-shot image captioning methods typically adopt a text-only training paradigm, where a language decoder reconstructs single-sentence embeddings obtained from CLIP. However, directly extending them to the video domain is suboptimal, as applying average pooling over all frames neglects temporal dynamics. To address this challenge, we propose a Semantic Group Captioning (SGCap) method for zero-shot video captioning. In particular, it develops the Semantic Group Decoding (SGD) strategy to employ multi-frame information while explicitly modeling inter-frame temporal relationships. Furthermore, existing zero-shot captioning methods that rely on cosine similarity for sentence retrieval and reconstruct the description supervised by…
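The abstract's point that average pooling over all frames neglects temporal dynamics can be illustrated with a small sketch. It uses random vectors as stand-ins for per-frame embeddings (an assumption for illustration only; no CLIP model is loaded, and the names `frames` and `mean_pool` are hypothetical). It shows only that mean pooling is invariant to frame order, so a video and its time-reversed copy map to the same pooled vector. It is not the SGCap method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-frame embeddings: 8 frames, 512 dimensions.
# Random values for illustration, not real CLIP outputs.
frames = rng.standard_normal((8, 512))

# The same "video" played backwards: identical frames, opposite temporal order.
reversed_frames = frames[::-1]


def mean_pool(x: np.ndarray) -> np.ndarray:
    """Average over the frame axis, then L2-normalize the result."""
    pooled = x.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


pooled_forward = mean_pool(frames)
pooled_backward = mean_pool(reversed_frames)

# The frame sequences differ, yet the pooled vectors match up to rounding:
# the pooled representation carries no information about frame order.
sequences_differ = not np.array_equal(frames, reversed_frames)
pooled_match = np.allclose(pooled_forward, pooled_backward)
print(sequences_differ, pooled_match)
```

Any order-sensitive content, such as whether an object enters or leaves the scene, is therefore lost after pooling, which is the limitation the abstract attributes to directly extending image-captioning pipelines to video.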
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
