LMFusion: Adapting Pretrained Language Models for Multimodal Generation

Weijia Shi; Xiaochuang Han; Chunting Zhou; Weixin Liang; Xi Victoria Lin; Luke Zettlemoyer; Lili Yu

arXiv:2412.15188·cs.CL·February 6, 2025

Open Access

TL;DR

LMFusion is a framework that adds multimodal capabilities to pretrained text-only language models, letting them understand and generate both text and images without retraining the entire model.

Contribution

It introduces a method to adapt existing LLMs for multimodal tasks by adding parallel transformer modules for images, training only the new modules while freezing the original language model.

Findings

01 Improves image understanding by 20% over multimodal models pretrained from scratch

02 Improves image generation by 3.6% over the same from-scratch baseline

03 Uses only 50% of the FLOPs compared to training from scratch

Abstract

We present LMFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LMFusion leverages existing Llama-3's weights for processing texts autoregressively while introducing additional and parallel transformer modules for processing images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and only training the image-specific modules, LMFusion preserves the language capabilities of text-only LLMs while developing strong visual…
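The routing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it uses plain Python lists instead of a tensor library, a single attention head with no masking, a one-matrix feedforward layer, and hypothetical names (`MODULES`, `make_modules`, `layer`, `D`). It only shows the idea that each position passes through its own modality's normalization, query-key-value projections, and feedforward layer, while one self-attention step mixes text and image positions, and that only the image-side modules are marked trainable.

```python
import math
import random

random.seed(0)
D = 4  # toy hidden size (an assumption for illustration)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rms_norm(x, gain, eps=1e-6):
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / scale for g, v in zip(gain, x)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_modules(trainable):
    # One set of modality-specific parameters: norm gain, QKV projections, feedforward.
    return {
        "gain": [1.0] * D,
        "wq": rand_matrix(D, D),
        "wk": rand_matrix(D, D),
        "wv": rand_matrix(D, D),
        "ffn": rand_matrix(D, D),
        "trainable": trainable,
    }

# Text modules stand in for the frozen pretrained LLM; image modules are the new ones.
MODULES = {"text": make_modules(trainable=False), "image": make_modules(trainable=True)}

def layer(hidden, modalities):
    # 1) Route each position through its own modality's norm and QKV projections.
    q, k, v = [], [], []
    for x, m in zip(hidden, modalities):
        p = MODULES[m]
        h = rms_norm(x, p["gain"])
        q.append(matvec(p["wq"], h))
        k.append(matvec(p["wk"], h))
        v.append(matvec(p["wv"], h))
    # 2) Shared self-attention: every position attends over text and image positions alike.
    attended = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(D) for kj in k]
        weights = softmax(scores)
        attended.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(D)])
    # 3) Residual connection, then the modality-specific feedforward layer.
    out = []
    for x, a, m in zip(hidden, attended, modalities):
        p = MODULES[m]
        r = [xi + ai for xi, ai in zip(x, a)]
        f = matvec(p["ffn"], rms_norm(r, p["gain"]))
        out.append([ri + max(0.0, fi) for ri, fi in zip(r, f)])
    return out

# A mixed sequence: text, image, text.
modalities = ["text", "image", "text"]
hidden = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in modalities]
out = layer(hidden, modalities)

# Only the image-side modules would receive gradient updates.
trainable = [name for name, p in MODULES.items() if p["trainable"]]
print(trainable)  # ['image']
```

In a real model the `trainable` flag would correspond to excluding the text-specific parameters from the optimizer, which is how the text-only behavior of the original model is left untouched.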

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Multimodal Machine Learning Applications