Bridging the gap to real-world language-grounded visual concept learning

Whie Jung; Semin Kim; Junee Kim; Seunghoon Hong

arXiv:2510.21412 · cs.CV · November 11, 2025

TL;DR

This paper introduces a scalable, adaptive framework for language-grounded visual concept learning in real-world scenes. It discovers and manipulates diverse concept axes without prior knowledge and without adding model parameters for each concept.

Contribution

It proposes a universal prompting strategy and a compositional anchoring objective for adaptively identifying concept axes and grounding visual concepts along them in real-world images.

Findings

01

Outperforms existing methods in visual concept editing tasks.

02

Demonstrates strong compositional generalization across diverse datasets.

03

Effectively discovers and manipulates multiple concept axes without prior knowledge.

Abstract

Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color and shape, and are typically explored in synthetic datasets. In this work, we propose a scalable framework that adaptively identifies image-related concept axes and grounds visual concepts along these axes in real-world scenes. Leveraging a pretrained vision-language model and our universal prompting strategy, our framework identifies a diverse set of image-related axes without any prior knowledge. Our universal concept encoder adaptively binds visual features to the discovered axes without introducing additional model parameters for each concept. To ground visual concepts along the discovered axes, we optimize a compositional anchoring objective, which…

Videos

Bridging the gap to real-world language-grounded visual concept learning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis