Language-Mediated, Object-Centric Representation Learning
Ruocheng Wang, Jiayuan Mao, Samuel J. Gershman, Jiajun Wu

TL;DR
LORL introduces a novel framework that combines vision and language to learn object-centric scene representations, improving unsupervised object discovery and aiding downstream tasks like referring expression comprehension.
Contribution
LORL is the first to integrate language concepts with unsupervised object discovery algorithms, enhancing their performance and interpretability.
Findings
Integrating LORL consistently improves the performance of the underlying unsupervised object discovery methods.
Language integration enhances concept learning and downstream task performance.
LORL can be integrated with various language-agnostic unsupervised object discovery methods, notably MONet and Slot Attention.
Abstract
We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object discovery and segmentation, notably MONet and Slot Attention. While these algorithms learn an object-centric representation just by reconstructing the input image, LORL enables them to further learn to associate the learned representations to concepts, i.e., words for object categories, properties, and spatial relationships, from language input. These object-centric concepts derived from language facilitate the learning of object-centric representations. LORL can be integrated with various unsupervised object discovery algorithms that are language-agnostic. Experiments show that the integration of LORL consistently improves the performance of unsupervised…
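The idea of associating learned object representations with language concepts can be illustrated with a small sketch. This is not the paper's implementation: it assumes, purely for illustration, that each object slot and each concept word is a fixed-length embedding and that a concept is matched to the slot with the highest cosine similarity. The function names (`cosine`, `ground_concept`, `best_slot`) and the toy 3-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ground_concept(slots, concept_vec):
    """Score every object slot against one concept embedding."""
    return [cosine(slot, concept_vec) for slot in slots]


def best_slot(slots, concept_vec):
    """Index of the slot that best matches the concept."""
    scores = ground_concept(slots, concept_vec)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical slot embeddings, standing in for the per-object outputs of an
# object discovery model such as MONet or Slot Attention.
slots = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 1.0],
]

# Hypothetical concept embeddings for two words.
concepts = {
    "red": [1.0, 0.0, 0.0],
    "cube": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    for word, vec in concepts.items():
        print(word, "-> slot", best_slot(slots, vec))
```

In a trained system both the slot embeddings and the concept embeddings would be learned jointly, so that the language signal shapes the object representations; here they are fixed by hand only to show the matching step.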
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Mixture model network
