Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Han Lin; Jaemin Cho; Amir Zadeh; Chuan Li; Mohit Bansal

arXiv:2508.05954 · cs.CV · August 11, 2025

TL;DR

Bifrost-1 introduces a unified framework that efficiently combines pretrained multimodal LLMs and diffusion models using patch-level CLIP image embeddings, enabling high-quality controllable image generation without extensive retraining.

Contribution

The paper proposes a novel method that bridges pretrained multimodal LLMs and diffusion models via patch-level CLIP latents, reducing training costs and preserving reasoning abilities.

Findings

01. Achieves high-fidelity controllable image generation.
02. Performs comparably or better than previous methods.
03. Requires substantially less training compute.

Abstract

There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when…
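To make "patch-level" concrete, the toy sketch below contrasts a grid of per-patch vectors with a single pooled vector. It is not the Bifrost-1 implementation and does not use CLIP: the function names `patch_embeddings` and `pooled_embedding` are hypothetical, and raw pixel values stand in for learned embeddings. It only illustrates the shape of the representation, under the assumption that the image is cut into non-overlapping square patches as in a ViT-style encoder.

```python
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def patch_embeddings(image: Image, patch: int) -> List[List[float]]:
    """Return one vector per non-overlapping patch, in row-major patch order.

    Assumption: a real CLIP vision encoder would apply a learned projection
    and transformer layers; here each "embedding" is just the flattened pixels.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [image[top + dy][left + dx]
                 for dy in range(patch)
                 for dx in range(patch)]
            )
    return tokens


def pooled_embedding(tokens: List[List[float]]) -> List[float]:
    """Average all patch vectors into one global vector (spatial layout is lost)."""
    count = len(tokens)
    return [sum(column) / count for column in zip(*tokens)]


# A 4x4 image with pixel value 4*row + col, split into 2x2 patches.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = patch_embeddings(image, patch=2)   # 4 vectors, one per image region
pooled = pooled_embedding(tokens)           # 1 vector for the whole image
```

The patch-level form keeps one vector per image region, so its size grows with the number of patches, whereas the pooled form collapses the image into a single vector. A spatially indexed grid of this kind is the sort of conditioning signal a ControlNet-style adapter typically consumes, though how Bifrost-1 wires it in is described only in the paper itself.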

Code & Models

Models

🤗 hanlincs/Bifrost-1 (model · 14 dl · ♡ 2)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling