Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor, Matthieu Cord

TL;DR
This paper investigates how frozen Large Language Models (LLMs) can generalize to multimodal inputs like images, videos, and audio, revealing an implicit alignment that explains their multimodal capabilities and suggesting ways to improve efficiency and evaluation.
Contribution
The study uncovers the implicit multimodal alignment in frozen LLMs, linking architecture to multimodal generalization and proposing metrics and methods for model optimization.
Findings
Perceptual tokens are distinguishable from textual tokens within LLMs.
Perceptual and textual tokens activate similar LLM weights, even though their representations differ.
Implicit alignment correlates with task performance and reduces hallucinations.
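One way to make the idea of alignment between perceptual and textual tokens concrete is a similarity score over hidden states. The sketch below is illustrative only and is not the paper's metric: the function names, the max-then-mean pooling, and the synthetic arrays standing in for LLM hidden states are all assumptions. It scores how closely each "perceptual" token vector matches its nearest "textual" token vector by cosine similarity.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def alignment_score(perceptual, textual):
    """Hypothetical alignment proxy (an assumption, not the paper's definition).

    For each perceptual token, take its highest cosine similarity to any
    textual token, then average over the perceptual tokens.
    """
    sims = cosine_similarity_matrix(perceptual, textual)
    return float(sims.max(axis=1).mean())


# Synthetic stand-ins for hidden states; real inputs would come from an LLM layer.
rng = np.random.default_rng(0)
textual = rng.normal(size=(8, 16))
# "Aligned" perceptual tokens: rescaled textual vectors plus small noise.
# They differ in norm (so they are distinguishable) but point the same way.
aligned = 2.0 * textual[:4] + 0.01 * rng.normal(size=(4, 16))
# "Unrelated" perceptual tokens: independent random vectors.
unrelated = rng.normal(size=(4, 16))

score_aligned = alignment_score(aligned, textual)
score_unrelated = alignment_score(unrelated, textual)
```

In this toy setup the rescaled vectors score higher than the independent ones, which mirrors the distinction in the findings: tokens can be distinguishable (here, by norm) while still being aligned in direction.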
Abstract
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks without any multimodal finetuning. They are the building block for Large Multimodal Models, yet we still lack a proper understanding of their success. In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representations, aiming to understand their generalization beyond textual inputs. Findings: (1) Perceptual tokens are easily distinguishable from textual ones inside LLMs, with significantly different representations, and complete translation to textual tokens does not exist. Yet (2) both perceptual and textual tokens activate similar LLM weights. Despite being different, (3) perceptual and textual tokens are implicitly aligned inside LLMs. We call this the implicit multimodal alignment (IMA), and argue that this is linked to architectural design,…
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques
