End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding
Mengze Li, Tianbao Wang, Haoyu Zhang, Shengyu Zhang, Zhou Zhao, Jiaxu Miao, Wenqiao Zhang, Wenming Tan, Jin Wang, Peng Wang, Shiliang Pu, Fei Wu

TL;DR
This paper introduces IT-OS, an end-to-end model for one-shot natural language spatial video grounding that effectively handles irrelevant frames and limited supervision using an information tree and self-supervised learning.
Contribution
The paper proposes the novel Information Tree module and self-supervised tasks to improve one-shot video grounding with minimal annotations.
Findings
Effective elimination of irrelevant frames via information tree
Improved grounding accuracy with limited supervision
Outperforms existing methods on benchmark datasets
Abstract
Natural language spatial video grounding aims to detect the relevant objects in video frames with descriptive sentences as the query. Despite great advances, most existing methods rely on dense video frame annotations, which require a tremendous amount of human effort. To achieve effective grounding under a limited annotation budget, we investigate one-shot video grounding, and learn to ground natural language in all video frames with only one frame labeled, in an end-to-end manner. One major challenge of end-to-end one-shot video grounding is the existence of video frames that are irrelevant to either the language query or the labeled frames. Another challenge relates to the limited supervision, which might result in ineffective representation learning. To address these challenges, we designed an end-to-end model via Information Tree for One-Shot video grounding (IT-OS). Its…
