Object Ordering with Bidirectional Matchings for Visual Reasoning
Hao Tan, Mohit Bansal

TL;DR
This paper introduces an end-to-end neural model for visual reasoning that uses joint bidirectional attention and an RL-based pointer network to match language phrases with objects in complex image arrangements, improving on the previous state of the art on the NLVR dataset by 4-6% absolute.
Contribution
The paper presents an end-to-end neural model for the NLVR task that combines joint bidirectional attention between visual and language inputs with an RL-based pointer network that orders a varying number of unordered objects to match the order of the statement phrases.
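To make the object-ordering idea concrete, here is a minimal sketch of sampling an ordering one object at a time and scoring it with a REINFORCE-style loss. It is an illustration under stated assumptions, not the authors' implementation: the names `sample_ordering` and `reinforce_loss` are hypothetical, and the per-object scores are fixed inputs here, whereas a pointer network would recompute them at every step from its decoder state.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_ordering(scores, rng):
    """Sample a permutation by repeatedly pointing at one remaining object.

    Assumption: `scores` are fixed per-object scores. A learned pointer
    network would produce fresh scores at each decoding step.
    """
    remaining = list(range(len(scores)))
    order, log_prob = [], 0.0
    while remaining:
        probs = softmax([scores[i] for i in remaining])
        pick = rng.choices(range(len(remaining)), weights=probs, k=1)[0]
        log_prob += math.log(probs[pick])  # accumulate log-probability of the choice
        order.append(remaining.pop(pick))
    return order, log_prob


def reinforce_loss(log_prob, reward, baseline=0.0):
    """REINFORCE surrogate loss: minimizing it raises the probability of
    orderings whose reward exceeds the baseline."""
    return -(reward - baseline) * log_prob


rng = random.Random(0)
order, log_prob = sample_ordering([2.0, 0.5, -1.0, 0.0], rng)
```

In a training loop the reward would come from the downstream task (for example, whether the final true/false prediction is correct), which is why a policy-gradient estimator is needed: the discrete choice of ordering is not differentiable.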
Findings
Achieves a 4-6% absolute improvement over the previous state of the art on NLVR.
Effectively matches object orderings with natural language descriptions.
Demonstrates strong performance on both the structured-representation and raw-image versions of the dataset.
Abstract
Visual reasoning with compositional natural language instructions, e.g., based on the newly-released Cornell Natural Language Visual Reasoning (NLVR) dataset, is a challenging task, where the model needs to have the ability to create an accurate mapping between the diverse phrases and the several objects placed in complex arrangements in the image. Further, this mapping needs to be processed to answer the question in the statement given the ordering and relationship of the objects across three similar images. In this paper, we propose a novel end-to-end neural model for the NLVR task, where we first use joint bidirectional attention to build a two-way conditioning between the visual information and the language phrases. Next, we use an RL-based pointer network to sort and process the varying number of unordered objects (so as to match the order of the statement phrases) in each of the…
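The two-way conditioning described in the abstract can be sketched as follows. This is a simplified illustration, assuming plain dot-product similarity between word vectors and object vectors; the function name `bidirectional_attention` and the toy inputs are hypothetical, and the paper's actual scoring function and encoders may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] as a list."""
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def bidirectional_attention(words, objects):
    """Attend from words to objects and from objects to words.

    Assumption: similarity is a raw dot product between same-sized vectors.
    """
    # similarity[i][j]: score between word i and object j
    similarity = [[dot(w, o) for o in objects] for w in words]
    # Language -> vision: each word distributes attention over objects.
    word_to_obj = [softmax(row) for row in similarity]
    # Vision -> language: each object distributes attention over words.
    obj_to_word = [
        softmax([similarity[i][j] for i in range(len(words))])
        for j in range(len(objects))
    ]
    # Context vectors: each side summarized in terms of the other.
    word_ctx = [weighted_sum(a, objects) for a in word_to_obj]
    obj_ctx = [weighted_sum(a, words) for a in obj_to_word]
    return word_to_obj, obj_to_word, word_ctx, obj_ctx


# Toy example: 2 word vectors, 3 object vectors, dimension 2.
words = [[1.0, 0.0], [0.0, 1.0]]
objects = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
word_to_obj, obj_to_word, word_ctx, obj_ctx = bidirectional_attention(words, objects)
```

Each row of `word_to_obj` and `obj_to_word` is a probability distribution, so both the phrases and the objects end up with a representation conditioned on the other modality.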
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
