One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning
Zhipeng Zhang, Zhimin Wei, Zhongzhen Huang, Rui Niu, Peng Wang

TL;DR
This paper introduces a dynamic multi-step reasoning network for referring expression comprehension that adjusts reasoning steps on-the-fly, improving performance across various datasets with expressions of differing complexity.
Contribution
It proposes a Transformer-based model that uses reinforcement learning to decide the number of reasoning steps dynamically for each expression, addressing the limitation of prior models that fix the step count before inference.
Findings
Achieves state-of-the-art results on multiple REC datasets.
Improves handling of complex, long, and compositional expressions.
Demonstrates significant performance gains over fixed-step models.
Abstract
Referring Expression Comprehension (REC) is one of the most important tasks in visual reasoning; it requires a model to detect the target object referred to by a natural language expression. Among the proposed pipelines, one-stage Referring Expression Comprehension (OSREC) has become the dominant trend since it merges the region proposal and selection stages. Many state-of-the-art OSREC models adopt a multi-hop reasoning strategy because a single expression frequently mentions a sequence of objects, and multi-hop reasoning is needed to analyze the semantic relations among them. However, one unsolved issue of these models is that the number of reasoning steps needs to be pre-defined and fixed before inference, ignoring the varying complexity of expressions. In this paper, we propose a Dynamic Multi-step Reasoning Network, which allows the reasoning steps to be dynamically adjusted based on the…
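The contrast between a fixed and a dynamic number of reasoning steps can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' network: the names `dynamic_reasoning`, `refine`, and `halt_score` are hypothetical, the "state" is a single number standing in for how ambiguous the grounding still is, and a deterministic threshold replaces the learned, reinforcement-trained stopping policy described in the paper.

```python
def dynamic_reasoning(state, step_fn, halt_fn, max_steps=8, threshold=0.5):
    """Apply step_fn until halt_fn(state) reaches threshold or max_steps is hit.

    Hypothetical helper: the paper learns when to stop with reinforcement
    learning; here a fixed threshold on a hand-written score is assumed.
    """
    steps = 0
    while steps < max_steps:
        state = step_fn(state)          # one reasoning hop
        steps += 1
        if halt_fn(state) >= threshold:  # stop decision made after each hop
            break
    return state, steps


def refine(ambiguity):
    """Toy reasoning hop: each hop halves the remaining ambiguity."""
    return ambiguity / 2.0


def halt_score(ambiguity):
    """Toy stop score in [0, 1]: higher when little ambiguity remains."""
    return max(0.0, 1.0 - ambiguity)


# A "simple" expression (low initial ambiguity) stops after one hop,
# while a "complex" one takes more hops before the stop score is reached.
_, simple_steps = dynamic_reasoning(1.0, refine, halt_score)
_, complex_steps = dynamic_reasoning(4.0, refine, halt_score)
```

With these toy definitions the simple input halts after 1 hop and the complex input after 3, whereas a fixed-step model would spend the same number of hops on both. The `max_steps` cap is an assumed safeguard so the loop always terminates.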
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Dense Connections · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding
