End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding

Mengze Li; Tianbao Wang; Haoyu Zhang; Shengyu Zhang; Zhou Zhao; Jiaxu Miao; Wenqiao Zhang; Wenming Tan; Jin Wang; Peng Wang; Shiliang Pu; Fei Wu

arXiv:2203.08013·cs.CV·May 24, 2022

TL;DR

This paper introduces IT-OS, an end-to-end model for one-shot natural language spatial video grounding that effectively handles irrelevant frames and limited supervision using an information tree and self-supervised learning.

Contribution

The paper proposes the novel Information Tree module and self-supervised tasks to improve one-shot video grounding with minimal annotations.

Findings

01. Effective elimination of irrelevant frames via the information tree

02. Improved grounding accuracy with limited supervision

03. Outperforms existing methods on benchmark datasets

Abstract

Natural language spatial video grounding aims to detect the relevant objects in video frames with descriptive sentences as the query. In spite of the great advances, most existing methods rely on dense video frame annotations, which require a tremendous amount of human effort. To achieve effective grounding under a limited annotation budget, we investigate one-shot video grounding, and learn to ground natural language in all video frames with solely one frame labeled, in an end-to-end manner. One major challenge of end-to-end one-shot video grounding is the existence of video frames that are irrelevant to either the language query or the labeled frames. Another challenge relates to the limited supervision, which might result in ineffective representation learning. To address these challenges, we designed an end-to-end model via Information Tree for One-Shot video grounding (IT-OS). Its…
