If you can describe it, they can see it: Cross-Modal Learning of Visual Concepts from Textual Descriptions
Carlo Alberto Barbano, Luca Molinaro, Massimiliano Ciranni, Emanuele Aiello, Vito Paolo Pastore, Marco Grangetto

TL;DR
This paper introduces a novel method called Knowledge Transfer (KT) that enables vision-language models to learn new visual concepts solely from textual descriptions, enhancing their zero-shot capabilities without needing visual examples.
Contribution
The paper proposes a new approach to teach VLMs new concepts using only text, reusing existing knowledge within the same model, unlike previous methods requiring visual data or external generators.
Findings
KT effectively introduces new visual concepts from a single text description.
The approach refines existing concept representations within VLMs.
KT significantly boosts zero-shot performance across multiple VLM tasks.
Abstract
Humans can visualize new and unknown concepts from their natural language description, based on their experience and previous knowledge. Inspired by this, we present a way to extend this ability to Vision-Language Models (VLMs), teaching them novel concepts using only a textual description. We refer to this approach as Knowledge Transfer (KT). Our hypothesis is that the knowledge of a pre-trained VLM can be re-used to represent previously unknown concepts. Provided with a textual description of the novel concept, KT works by aligning relevant features of the visual encoder, obtained through model inversion, to its text representation. Unlike approaches relying on visual examples or external generative models, KT transfers knowledge within the same VLM by injecting visual knowledge directly from the text. Through an extensive evaluation on several VLM tasks, including…
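The two-stage idea in the abstract (invert the model to synthesize an input that matches a text description, then align the visual encoder to that text) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the "visual encoder" is a small linear map `W0`, the "text embedding" is a fixed vector, the objective is plain cosine similarity rather than a contrastive loss, and the function names `invert` and `align` are hypothetical.

```python
import math

def matvec(W, x):
    """Apply the toy linear 'visual encoder' W to an input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def cosine_grad(v, t):
    """Gradient of cosine(v, t) with respect to v."""
    nv, nt, c = norm(v), norm(t), cosine(v, t)
    return [tj / (nv * nt) - c * vj / (nv * nv) for vj, tj in zip(v, t)]

def invert(W, t, x, steps=20, lr=0.1):
    """Stage 1 (assumed form): model inversion. The encoder W stays frozen
    and the input x is optimised so its embedding moves toward the text."""
    x = list(x)
    for _ in range(steps):
        g = cosine_grad(matvec(W, x), t)
        # Chain rule: d cos / d x = W^T g
        grad_x = [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(x))]
        x = [xj + lr * gj for xj, gj in zip(x, grad_x)]
    return x

def align(W, t, x, steps=20, lr=0.1):
    """Stage 2 (assumed form): fine-tune the encoder on the inverted input
    so that its embedding aligns with the text embedding."""
    W = [list(row) for row in W]
    for _ in range(steps):
        g = cosine_grad(matvec(W, x), t)
        # Chain rule: d cos / d W[i][j] = g[i] * x[j]
        W = [[W[i][j] + lr * g[i] * x[j] for j in range(len(x))] for i in range(len(W))]
    return W

# Hypothetical toy data: a 3-dimensional "image" and 2-dimensional embeddings.
W0 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
text_embedding = [1.0, 0.0]   # stands in for the encoded description
x0 = [0.0, 1.0, 0.0]          # starting input, orthogonal to the text

sim_start = cosine(matvec(W0, x0), text_embedding)
x_inv = invert(W0, text_embedding, x0)
sim_inverted = cosine(matvec(W0, x_inv), text_embedding)
W1 = align(W0, text_embedding, x_inv)
sim_aligned = cosine(matvec(W1, x_inv), text_embedding)
print(sim_start, sim_inverted, sim_aligned)
```

In this sketch the similarity between the image embedding and the text embedding rises after each stage. A real VLM would replace the linear map with a deep visual encoder and would need regularization during inversion, details this summary does not specify.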
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
