Parallel Attention: A Unified Framework for Visual Object Discovery through Dialogs and Queries
Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, Anton van den Hengel

TL;DR
The paper introduces the PLAN network, a unified, explainable framework that uses dual recurrent attention mechanisms to improve object discovery in images based on natural language expressions, including dialogs.
Contribution
It proposes a novel parallel attention framework that jointly relates language and visual content for referring expression comprehension, outperforming existing methods.
Findings
Outperforms state-of-the-art on RefCOCO, RefCOCO+, and GuessWhat?! datasets.
Handles variable-length language inputs, from short phrases to multi-round dialogs.
Provides visualizable and explainable attention mechanisms.
Abstract
Recognising objects according to a pre-defined fixed set of class labels has been well studied in Computer Vision. There are, however, a great many practical applications where the subjects that may be of interest are not known beforehand, or are not so easily delineated. In many of these cases natural language dialog is a natural way to specify the subject of interest, and the task of achieving this capability (a.k.a. Referring Expression Comprehension) has recently attracted attention. To this end we propose a unified framework, the ParalleL AttentioN (PLAN) network, to discover the object in an image that is being referred to in variable-length natural expression descriptions, from short phrase queries to long multi-round dialogs. The PLAN network has two attention mechanisms that relate parts of the expressions to both the global visual content and also directly to object candidates.…
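The two attention mechanisms mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single, non-recurrent step using plain dot-product attention in NumPy, and every name in it (`softmax`, `parallel_attention`, `words`, `regions`, `candidates`) as well as the feature sizes are assumptions chosen for illustration. One branch weights the words of the expression against a global image context, the other weights the words separately for each object candidate, and the two language summaries are combined to score the candidates.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def parallel_attention(words, regions, candidates):
    """Hypothetical one-step sketch of two parallel attention branches.

    words:      (W, D) word features of the expression or dialog
    regions:    (R, D) global image region features
    candidates: (C, D) object candidate features
    Returns candidate probabilities (C,) and the global word weights (W,).
    """
    # Branch 1 (assumed form): words attended by the global visual context,
    # here simply the mean of the region features.
    context = regions.mean(axis=0)                    # (D,)
    word_w_global = softmax(words @ context)          # (W,)
    lang_global = word_w_global @ words               # (D,)

    # Branch 2 (assumed form): words attended separately by each candidate.
    word_w_cand = softmax(candidates @ words.T, axis=1)   # (C, W)
    lang_cand = word_w_cand @ words                       # (C, D)

    # Score each candidate against both language summaries.
    scores = (lang_cand * candidates).sum(axis=1) + candidates @ lang_global
    return softmax(scores), word_w_global


# Random features stand in for real word and visual embeddings.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 8))
regions = rng.normal(size=(49, 8))
candidates = rng.normal(size=(3, 8))

cand_probs, word_weights = parallel_attention(words, regions, candidates)
best = int(np.argmax(cand_probs))
```

The word weights returned by each branch are what makes this style of model inspectable: they can be plotted over the expression to show which words drove the choice of candidate. Because the attention sums over however many words are supplied, the same function accepts a short phrase or a concatenated multi-round dialog.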
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
