If you can describe it, they can see it: Cross-Modal Learning of Visual Concepts from Textual Descriptions

Carlo Alberto Barbano; Luca Molinaro; Massimiliano Ciranni; Emanuele Aiello; Vito Paolo Pastore; Marco Grangetto

arXiv:2411.15611 · cs.CV · December 18, 2025



Open Access

TL;DR

This paper introduces a novel method called Knowledge Transfer (KT) that enables vision-language models to learn new visual concepts solely from textual descriptions, enhancing their zero-shot capabilities without needing visual examples.

Contribution

The paper proposes a new approach to teach VLMs new concepts using only text, reusing existing knowledge within the same model, unlike previous methods requiring visual data or external generators.

Findings

01. KT effectively introduces new visual concepts from a single text description.

02. The approach refines existing concept representations within VLMs.

03. KT significantly boosts zero-shot performance across multiple VLM tasks.

Abstract

Humans can visualize new and unknown concepts from a natural language description, drawing on their experience and prior knowledge. Inspired by this, we present a way to extend this ability to Vision-Language Models (VLMs), teaching them novel concepts using only a textual description. We refer to this approach as Knowledge Transfer (KT). Our hypothesis is that the knowledge of a pre-trained VLM can be reused to represent previously unknown concepts. Given a textual description of the novel concept, KT works by aligning relevant features of the visual encoder, obtained through model inversion, to its text representation. Unlike approaches that rely on visual examples or external generative models, KT transfers knowledge within the same VLM by injecting visual knowledge directly from the text. Through an extensive evaluation on several VLM tasks, including…
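
The model inversion mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a small linear map `W` in place of the visual encoder and a fixed vector `t` in place of the text embedding of a description, and it shows only the general inversion idea: gradient ascent on the input so that its visual feature points toward the text embedding. All names (`W`, `t`, `invert`, and the helpers) are hypothetical, and the alignment step described in the abstract is not shown.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # Toy "visual encoder": a linear map from input space to feature space.
    return [dot(row, x) for row in W]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

def cosine_grad(z, t):
    # Gradient of cosine(z, t) with respect to z.
    nz = math.sqrt(dot(z, z)) + 1e-12
    nt = math.sqrt(dot(t, t)) + 1e-12
    c = dot(z, t) / (nz * nt)
    return [ti / (nz * nt) - c * zi / (nz * nz) for zi, ti in zip(z, t)]

def invert(W, t, x0, steps=100, lr=0.1):
    # Gradient ascent on the input x so that W x aligns with t.
    # The encoder W stays frozen; only the input is optimized.
    x = list(x0)
    for _ in range(steps):
        g_z = cosine_grad(matvec(W, x), t)
        # Chain rule: d cos / d x = W^T g_z
        g_x = [sum(W[i][j] * g_z[i] for i in range(len(W)))
               for j in range(len(x))]
        x = [xj + lr * gj for xj, gj in zip(x, g_x)]
    return x

# Assumed toy values: a 3x4 encoder with orthonormal rows and a 3-d "text embedding".
W = [[0.6, 0.8, 0.0, 0.0],
     [-0.8, 0.6, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
t = [0.0, 1.0, 1.0]

x0 = [0.1, -0.2, 0.05, 0.3]
x_inv = invert(W, t, x0)
print("similarity before:", cosine(matvec(W, x0), t))
print("similarity after: ", cosine(matvec(W, x_inv), t))
```

In this toy setting the fourth input dimension never affects the feature, so many different inputs map to the same embedding. This shows that an inverted input is one of many possible solutions, not a unique reconstruction.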


Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems