Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos

Sijie Song; Xudong Lin; Jiaying Liu; Zongming Guo; Shih-Fu Chang

arXiv:2103.12346 · cs.CV · March 24, 2021 · 1 citation

Open Access

TL;DR

This paper introduces a novel co-grounding framework with semantic attention for referring expression comprehension in videos, improving accuracy and consistency over previous multi-stage methods by integrating temporal and attribute-based cues.

Contribution

The paper proposes a one-stage co-grounding approach that combines semantic attention and cross-frame feature learning, advancing video and image referring expression comprehension.

Findings

01 Outperforms previous methods on the VID and LiOTB datasets

02 Achieves higher accuracy and stability in video grounding

03 Improves performance on the RefCOCO image dataset

Abstract

In this paper, we address the problem of referring expression comprehension in videos, which is challenging due to complex expressions and scene dynamics. Unlike previous methods, which solve the problem in multiple stages (i.e., tracking, proposal-based matching), we tackle the problem from a novel perspective, co-grounding, with an elegant one-stage framework. We enhance single-frame grounding accuracy with semantic attention learning and improve cross-frame grounding consistency with co-grounding feature learning. Semantic attention learning explicitly parses referring cues in different attributes to reduce the ambiguity in complex expressions. Co-grounding feature learning boosts visual feature representations by integrating temporal correlation to reduce the ambiguity caused by scene dynamics. Experimental results demonstrate the superiority of our framework on the…
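
The idea of strengthening per-frame features with temporal correlation can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes each frame is already summarized by a single feature vector, and it substitutes a plain cosine-similarity softmax for whatever correlation module the paper uses. The names `co_ground`, `cosine`, `softmax`, and the `temperature` parameter are hypothetical, introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def co_ground(frame_feats, temperature=1.0):
    """Add to each frame's feature a similarity-weighted average of the other frames.

    Assumption: one feature vector per frame. Frames that look alike
    contribute more to each other's context, which is one simple way to
    inject cross-frame (temporal) correlation into per-frame features.
    """
    enhanced = []
    for i, f in enumerate(frame_feats):
        others = [g for j, g in enumerate(frame_feats) if j != i]
        if not others:
            # A single frame has no temporal context to borrow from.
            enhanced.append(list(f))
            continue
        weights = softmax([cosine(f, g) / temperature for g in others])
        context = [
            sum(w * g[k] for w, g in zip(weights, others))
            for k in range(len(f))
        ]
        # Residual connection: keep the original feature, add the context.
        enhanced.append([a + c for a, c in zip(f, context)])
    return enhanced


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    for feat in co_ground(frames):
        print([round(x, 3) for x in feat])
```

In this toy input the first two frames are identical, so each draws most of its context from the other, while the third frame contributes less. A lower `temperature` makes the weighting sharper.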

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning