"Show me the cup": Reference with Continuous Representations
Gemma Boleda, Sebastian Padó, Marco Baroni

TL;DR
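This paper presents a neural network model that, given a definite description and a set of objects represented by natural images, points to the intended object or signals failure when the description has no unique referent. It is competitive with a manually engineered pipeline on both purely visual referents and referents defined by visual and linguistic properties together.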
Contribution
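The paper introduces a neural network model of reference built on continuous representations. Trained directly on reference acts, it individuates objects in a shared scene, i.e., it tracks and distinguishes an arbitrary number of candidate referents.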
Findings
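The model is competitive with a pipeline manually engineered for the same task.
It handles referents that are purely visual as well as referents characterized by a combination of visual and linguistic properties.
It indicates failure when the description has no unique referent, i.e., when the referent is ambiguous or absent.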
Abstract
One of the most basic functions of language is to refer to objects in a shared scene. Modeling reference with continuous representations is challenging because it requires individuation, i.e., tracking and distinguishing an arbitrary number of referents. We introduce a neural network model that, given a definite description and a set of objects represented by natural images, points to the intended object if the expression has a unique referent, or indicates a failure, if it does not. The model, directly trained on reference acts, is competitive with a pipeline manually engineered to perform the same task, both when referents are purely visual, and when they are characterized by a combination of visual and linguistic properties.
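The task described in the abstract (point to the unique referent, or report failure) can be illustrated with a minimal sketch. This is not the paper's model, which is a neural network trained on reference acts; it is a hand-written threshold rule over toy vectors. The names `cosine` and `resolve_reference`, the `threshold` value, and the vector inputs are all assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def resolve_reference(query, objects, threshold=0.8):
    """Return the index of the single object matching the query, or None.

    Hypothetical stand-in for the task only: `query` plays the role of an
    embedded description, `objects` the role of embedded images. None covers
    both failure cases: no matching object, or more than one.
    """
    matches = [i for i, obj in enumerate(objects) if cosine(query, obj) >= threshold]
    return matches[0] if len(matches) == 1 else None


cup = [1.0, 0.0]
unique_scene = [[1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]  # exactly one cup-like object
ambiguous_scene = [[1.0, 0.0], [1.0, 0.05]]          # two cup-like objects
absent_scene = [[0.0, 1.0]]                          # no cup-like object

print(resolve_reference(cup, unique_scene))
print(resolve_reference(cup, ambiguous_scene))
print(resolve_reference(cup, absent_scene))
```

The sketch hard-codes the uniqueness check as a count over thresholded matches. The model described in the abstract is instead trained directly on reference acts to produce the pointing or failure behavior.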
