Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding
Zhu Zhang, Zhou Zhao, Zhijie Lin, Baoxing Huai, Nicholas Jing Yuan

TL;DR
This paper introduces an object-aware multi-branch relation network that improves spatio-temporal video grounding by effectively modeling object relations in unaligned data and multi-form sentences.
Contribution
It proposes a novel multi-branch relation network with diversity loss for better object relation modeling in unaligned video grounding tasks.
Findings
Outperforms existing methods on benchmark datasets.
Effectively distinguishes notable objects in complex scenes.
Enhances relation reasoning between key objects.
Abstract
Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence. Currently, most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences. This challenging task requires capturing critical object relations to identify the queried target. However, existing approaches cannot distinguish notable objects and are left with ineffective relation modeling among unnecessary objects. Thus, we propose a novel object-aware multi-branch relation network for object-aware relation discovery. Concretely, we first devise multiple branches to develop object-aware region modeling, where each branch focuses on a crucial object mentioned in the sentence. We then propose multi-branch relation reasoning to capture critical…
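The abstract excerpt does not define the diversity loss named under Contribution, so the following is only an illustrative sketch of one common way to push branches toward different objects, not the authors' formulation. It assumes each branch produces an attention distribution over the same items (for example, sentence words) and penalizes the mean pairwise overlap between those distributions. The function names, the overlap measure, and the example scores are all hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def diversity_penalty(branch_attention):
    """Mean pairwise overlap between branch attention distributions.

    branch_attention: list of K distributions over the same N items.
    The overlap of two branches is their dot product, so the penalty is
    0.0 when every pair of branches attends to disjoint items and grows
    as branches collapse onto the same item.
    This is an assumed, generic penalty, not the loss from the paper.
    """
    k = len(branch_attention)
    if k < 2:
        return 0.0
    overlap = 0.0
    pairs = 0
    for i in range(k):
        for j in range(i + 1, k):
            overlap += sum(
                a * b for a, b in zip(branch_attention[i], branch_attention[j])
            )
            pairs += 1
    return overlap / pairs


# Hypothetical scores: three branches attending over four sentence words.
focused = [
    softmax([9.0, 0.0, 0.0, 0.0]),  # branch 1 concentrates on word 0
    softmax([0.0, 9.0, 0.0, 0.0]),  # branch 2 concentrates on word 1
    softmax([0.0, 0.0, 9.0, 0.0]),  # branch 3 concentrates on word 2
]
collapsed = [softmax([9.0, 0.0, 0.0, 0.0]) for _ in range(3)]  # all on word 0

focused_penalty = diversity_penalty(focused)
collapsed_penalty = diversity_penalty(collapsed)
```

Under these assumptions, collapsed_penalty is larger than focused_penalty, so adding such a term to a training objective would discourage several branches from locking onto the same object.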
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
