Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs

Mustafa Shukor; Matthieu Cord

arXiv:2405.16700·cs.CV·October 8, 2024

TL;DR

This paper investigates how frozen Large Language Models (LLMs) can generalize to multimodal inputs like images, videos, and audio, revealing an implicit alignment that explains their multimodal capabilities and suggesting ways to improve efficiency and evaluation.

Contribution

The study uncovers the implicit multimodal alignment in frozen LLMs, linking architecture to multimodal generalization and proposing metrics and methods for model optimization.

Findings

1. Perceptual tokens are easily distinguishable from textual tokens inside LLMs and have significantly different representations.

2. Despite these differences, perceptual and textual tokens activate similar LLM weights.

3. Implicit alignment between perceptual and textual tokens correlates with task performance and with fewer hallucinations.

Abstract

Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks without any multimodal finetuning. They are the building block for Large Multimodal Models, yet we still lack a proper understanding of their success. In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representations, aiming to understand their generalization beyond textual inputs. Findings: perceptual tokens (1) are easily distinguishable from textual ones inside LLMs, with significantly different representations, and complete translation to textual tokens does not exist. Yet (2) both perceptual and textual tokens activate similar LLM weights. Despite being different, (3) perceptual and textual tokens are implicitly aligned inside LLMs; we call this the implicit multimodal alignment (IMA), and argue that this is linked to architectural design,…
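To make the idea of comparing perceptual and textual tokens concrete, here is a minimal sketch of one way a cross-modal similarity score could be computed from hidden states. It assumes cosine similarity as the measure and averages it over all perceptual/textual token pairs. The function names and the toy vectors are hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat a zero vector as having no direction, hence no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def mean_cross_modal_similarity(
    perceptual: Sequence[Vector], textual: Sequence[Vector]
) -> float:
    """Average cosine similarity over every (perceptual, textual) token pair.

    Assumption: each inner sequence is the hidden state of one token taken
    at the same layer of the frozen LLM.
    """
    if not perceptual or not textual:
        raise ValueError("both token sets must be non-empty")
    total = sum(cosine(p, t) for p in perceptual for t in textual)
    return total / (len(perceptual) * len(textual))


# Toy hidden states (illustrative values only, not real model activations).
image_tokens = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
text_tokens = [[0.7, 0.3, 0.0], [0.1, 0.9, 0.2]]
score = mean_cross_modal_similarity(image_tokens, text_tokens)
print(f"mean cross-modal cosine similarity: {score:.3f}")
```

In practice the vectors would come from a real model's hidden states at a chosen layer, and the score could be tracked per layer to see where the two modalities move closer together.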

Code & Models

Repositories

mshukor/ima-lmms (PyTorch, official)

Models

mshukor/IMA-DePALM (Hugging Face model)

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques