RefCrowd: Grounding the Target in Crowd with Referring Expressions

Heqian Qiu; Hongliang Li; Taijin Zhao; Lanxiao Wang; Qingbo Wu; Fanman Meng

arXiv:2206.08172 · cs.CV · June 17, 2022

TL;DR

This paper introduces RefCrowd, a new dataset and a novel Fine-grained Multi-modal Attribute Contrastive Network (FMAC) for referring expression comprehension in crowded scenes, addressing the challenge of distinguishing similar individuals using natural language.

Contribution

The paper presents a new dataset, RefCrowd, and a novel FMAC model that effectively grounds referring expressions in crowded scenes by focusing on fine-grained attribute features.

Findings

01 FMAC outperforms existing methods on RefCrowd and other REF datasets.

02 RefCrowd enables more accurate crowd understanding with natural language.

03 The end-to-end toolbox facilitates further research in multi-modal understanding.

Abstract

Crowd understanding has attracted widespread interest in the vision domain because of its practical significance. Unfortunately, no prior work has explored crowd understanding in the multi-modal domain that bridges natural language and computer vision. Referring expression comprehension (REF) is a representative multi-modal task of this kind. Current REF studies focus mainly on grounding a target object among multiple distinctive categories in general scenarios, which makes them difficult to apply to complex real-world crowd understanding. To fill this gap, we propose a new challenging dataset, called RefCrowd, which aims at finding the target person in a crowd with referring expressions. It requires not only sufficiently mining the natural language information but also carefully focusing on subtle differences between the target and a crowd of persons with similar appearance, so as to…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition