Language-Mediated, Object-Centric Representation Learning

Ruocheng Wang; Jiayuan Mao; Samuel J. Gershman; Jiajun Wu

arXiv:2012.15814 · cs.LG · June 9, 2021 · 1 citation

Open Access

TL;DR

LORL introduces a novel framework that combines vision and language to learn object-centric scene representations, improving unsupervised object discovery and aiding downstream tasks like referring expression comprehension.

Contribution

LORL integrates language concepts with language-agnostic unsupervised object discovery algorithms such as MONet and Slot Attention, improving their performance and tying the learned representations to interpretable concepts.

Findings

01. LORL improves object discovery accuracy on benchmark datasets.

02. Language integration enhances concept learning and downstream task performance.

03. LORL is compatible with various unsupervised object discovery methods.

Abstract

We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object discovery and segmentation, notably MONet and Slot Attention. While these algorithms learn an object-centric representation just by reconstructing the input image, LORL enables them to further learn to associate the learned representations to concepts, i.e., words for object categories, properties, and spatial relationships, from language input. These object-centric concepts derived from language facilitate the learning of object-centric representations. LORL can be integrated with various unsupervised object discovery algorithms that are language-agnostic. Experiments show that the integration of LORL consistently improves the performance of unsupervised…
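The abstract describes two training signals: reconstructing the input image, and associating per-object representations with concepts from language. The toy sketch below shows one way such a joint objective could be combined. It is not the authors' implementation: the cosine-similarity scoring rule, the "is there any object matching this concept?" question form, and every name and parameter (`concept_probability`, `joint_loss`, `scale`, `weight`) are assumptions made for illustration.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length lists of pixel values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def concept_probability(slot, concept, scale=5.0):
    """Hypothetical scoring rule: sigmoid of the scaled cosine similarity
    between one object slot and one concept (word) embedding."""
    return 1.0 / (1.0 + math.exp(-scale * cosine(slot, concept)))


def joint_loss(image, reconstruction, slots, concept, answer, weight=1.0):
    """Reconstruction loss plus a language loss for a yes/no question of the
    assumed form 'is there any object matching this concept?'."""
    recon = mse(image, reconstruction)
    # An object exists if at least one slot matches, so take the max over slots.
    p = max(concept_probability(s, concept) for s in slots)
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to keep the log finite
    lang = -math.log(p) if answer else -math.log(1.0 - p)
    return recon + weight * lang


# Two toy object slots and a toy embedding for the word "red".
slots = [[1.0, 0.0], [0.0, 1.0]]
red = [1.0, 0.0]
image = [0.2, 0.8, 0.5]

loss_if_yes = joint_loss(image, image, slots, red, answer=True)
loss_if_no = joint_loss(image, image, slots, red, answer=False)
```

In this toy setup the first slot aligns with the "red" embedding, so answering "yes" yields a lower loss than answering "no". In a trainable model, gradients from the language term would flow back into the slot representations, which is the sense in which language could shape the object-centric representation.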

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: Mixture model network