Generation and Comprehension of Unambiguous Object Descriptions
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, Kevin Murphy

TL;DR
This paper introduces a deep learning-based method for generating and understanding unambiguous object descriptions in images, outperforming previous approaches and providing a new large-scale dataset for the task.
Contribution
The paper presents a novel deep learning model for unambiguous referring expression generation and comprehension, along with a new large-scale dataset based on MS-COCO.
Findings
The method outperforms previous approaches that generate object descriptions without accounting for other potentially ambiguous objects in the scene.
The dataset enables objective evaluation of referring expression tasks.
The toolbox facilitates visualization and assessment of model performance.
Abstract
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MS-COCO. We have released the dataset and a toolbox for visualization and evaluation; see https://github.com/mjhucla/Google_Refexp_toolbox
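The comprehension task in the abstract (infer which object an expression describes) and its objective evaluation can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation or the toolbox API: the `score` callable stands in for whatever model rates how well an expression fits a region, the names `comprehend`, `is_correct`, and `toy_score` are hypothetical, and the 0.5 intersection-over-union threshold is a common detection-style convention that the abstract itself does not state.

```python
from typing import Callable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def comprehend(
    expression: str,
    candidates: Sequence[Box],
    score: Callable[[str, Box], float],
) -> Box:
    """Return the candidate region the scorer rates highest for the expression."""
    return max(candidates, key=lambda box: score(expression, box))


def is_correct(predicted: Box, ground_truth: Box, threshold: float = 0.5) -> bool:
    """Count a prediction as correct when it overlaps the true box enough.

    The 0.5 default is an assumed convention, not taken from the abstract.
    """
    return iou(predicted, ground_truth) >= threshold


def toy_score(expression: str, box: Box) -> float:
    """Hypothetical stand-in for a learned model: prefers left-most boxes
    when the expression mentions 'left', right-most boxes otherwise."""
    return -box[0] if "left" in expression else box[0]


candidates = [(0.0, 0.0, 2.0, 2.0), (5.0, 0.0, 7.0, 2.0)]
picked = comprehend("the man on the left", candidates, toy_score)
print(picked, is_correct(picked, (0.0, 0.0, 2.0, 3.0)))
```

Because the output of comprehension is a single region, a prediction can be checked against the annotated box directly, which is what makes this task easier to score than free-form captioning.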
Code & Models
- 🤗 google/paligemma2-3b-ft-docci-448-jax · model · ♡ 2
- 🤗 google/paligemma2-10b-ft-docci-448-jax · model · ♡ 2
- 🤗 google/paligemma2-3b-mix-224 · model · 43k dl · ♡ 48
- 🤗 google/paligemma2-3b-mix-448-jax · model · 2 dl · ♡ 2
- 🤗 google/paligemma2-3b-ft-docci-448 · model · 36k dl · ♡ 13
- 🤗 google/paligemma2-10b-ft-docci-448 · model · 927 dl · ♡ 17
- 🤗 google/paligemma2-10b-mix-224-jax · model
- 🤗 google/paligemma2-3b-mix-448 · model · 3.8k dl · ♡ 57
- 🤗 google/paligemma2-10b-mix-224 · model · 194 dl · ♡ 10
- 🤗 google/paligemma2-10b-mix-448 · model · 551 dl · ♡ 35
Videos
Generation and Comprehension of Unambiguous Object Descriptions · YouTube
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
