Large Language Models Implicitly Learn to See and Hear Just By Reading
Prateek Verma, Mert Pilanci

TL;DR
This paper reveals that large language models trained solely on text can inherently develop the ability to understand images and audio, demonstrating a form of multi-modal perception without explicit multi-modal training.
Contribution
It shows that text-only training enables LLMs to implicitly learn visual and auditory understanding, expanding their capabilities beyond language processing.
Findings
Text-trained LLM weights can be used to classify images from datasets such as CIFAR-10 and Fashion-MNIST.
The same text-trained weights aid audio classification on the FSD-50K and GTZAN datasets.
The results support the view that text LLMs learn internal circuits that can be reused for multi-modal understanding.
Abstract
This paper presents a fascinating find: By training an auto-regressive LLM on text tokens, the text model inherently develops an internal ability to understand images and audio, thereby developing the ability to see and hear just by reading. Popular audio and visual LLMs fine-tune text LLMs to give text output conditioned on image and audio embeddings. On the other hand, our architecture takes in patches of images, audio waveforms or tokens as input. It gives us the embeddings or category labels typical of a classification pipeline. We show the generality of text weights in aiding audio classification for datasets FSD-50K and GTZAN. Further, we show this working for image classification on CIFAR-10 and Fashion-MNIST, as well as on image patches. This pushes the notion of text-LLMs learning powerful internal circuits that can be utilized by activating necessary…
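The abstract says the model takes image patches or audio waveforms as input in place of text tokens. As a rough illustration of that input step only, the sketch below cuts a grayscale image into non-overlapping patches and a waveform into fixed-length frames, each of which could then be projected to an embedding and fed to a transformer. The function names, the 4x4 patch size, and the 256-sample frame length are assumptions made for this example; the abstract does not specify the paper's actual front end.

```python
def image_to_patches(image, patch):
    """Split an H x W image (a list of rows) into non-overlapping
    patch x patch tiles, each flattened in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            patches.append(flat)
    return patches


def waveform_to_frames(samples, frame):
    """Split a 1-D waveform into non-overlapping frames of `frame`
    samples, dropping any trailing remainder."""
    count = len(samples) // frame
    return [samples[i * frame:(i + 1) * frame] for i in range(count)]


# A 28 x 28 grayscale image (the Fashion-MNIST resolution) filled with
# synthetic pixel values, cut into 4 x 4 patches (assumed patch size).
image = [[r * 28 + c for c in range(28)] for r in range(28)]
patches = image_to_patches(image, 4)

# A synthetic 1000-sample waveform cut into 256-sample frames
# (assumed frame length).
frames = waveform_to_frames(list(range(1000)), 256)
```

Each entry of `patches` or `frames` plays the role a token plays in the text setting: one position in the input sequence. How those vectors are embedded and which text-trained weights are reused is left to the paper itself.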
