Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
Xin Wang, Jiawei Wu, Da Zhang, Yu Su, William Yang Wang

TL;DR
This paper introduces a novel zero-shot video captioning task and proposes a Topic-Aware Mixture of Experts model that leverages external semantic knowledge to generalize to unseen activities.
Contribution
The paper presents a new zero-shot video captioning framework using topic-aware experts and external semantic embeddings, enabling generalization to unseen activities.
Findings
Effective in describing unseen activities
Utilizes external topic-related text corpus
Shows strong generalization ability
Abstract
Although promising results have been achieved in video captioning, existing models are limited to the fixed inventory of activities in the training corpus, and do not generalize to open vocabulary scenarios. Here we introduce a novel task, zero-shot video captioning, that aims at describing out-of-domain videos of unseen activities. Videos of different activities usually require different captioning strategies in many aspects, i.e. word selection, semantic construction, and style expression etc, which poses a great challenge to depict novel activities without paired training data. But meanwhile, similar activities share some of those aspects in common. Therefore, We propose a principled Topic-Aware Mixture of Experts (TAMoE) model for zero-shot video captioning, which learns to compose different experts based on different topic embeddings, implicitly transferring the knowledge learned…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
