Bridging the gap to real-world language-grounded visual concept learning
Whie Jung, Semin Kim, Junee Kim, Seunghoon Hong

TL;DR
This paper introduces a scalable, adaptive framework for language-grounded visual concept learning in real-world scenes. It enables the discovery and manipulation of diverse concept axes without prior knowledge and without additional model parameters for each concept.
Contribution
It proposes a universal prompting strategy and compositional anchoring objective for adaptive concept axis identification and grounding in real-world images.
Findings
Outperforms existing methods in visual concept editing tasks.
Demonstrates strong compositional generalization across diverse datasets.
Effectively discovers and manipulates multiple concept axes without prior knowledge.
Abstract
Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color and shape, and are typically explored in synthetic datasets. In this work, we propose a scalable framework that adaptively identifies image-related concept axes and grounds visual concepts along these axes in real-world scenes. Leveraging a pretrained vision-language model and our universal prompting strategy, our framework identifies diverse image-related axes without any prior knowledge. Our universal concept encoder adaptively binds visual features to the discovered axes without introducing additional model parameters for each concept. To ground visual concepts along the discovered axes, we optimize a compositional anchoring objective, which…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
