ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

Ethan Chern; Jiadi Su; Yan Ma; Pengfei Liu

arXiv:2407.06135 · cs.CL · July 9, 2024 · 1 citation

TL;DR

Anole is an open-source, autoregressive large multimodal model capable of interleaved image and text generation, addressing limitations of previous models by integrating visual and textual modalities natively and efficiently.

Contribution

The paper introduces Anole, a native, autoregressive multimodal model built on Meta AI's Chameleon and fine-tuned with a data-efficient, parameter-efficient strategy to produce high-quality interleaved image-text output.

Findings

01  High-quality, coherent multimodal generation demonstrated

02  Open-sourced model and training framework provided

03  Efficient fine-tuning strategy enhances performance

Abstract

Previous open-source large multimodal models (LMMs) have faced several limitations: (1) they often lack native integration, requiring adapters to align visual representations with pre-trained large language models (LLMs); (2) many are restricted to single-modal generation; (3) while some support multimodal generation, they rely on separate diffusion models for visual modeling and generation. To mitigate these limitations, we present Anole, an open, autoregressive, native large multimodal model for interleaved image-text generation. We build Anole from Meta AI's Chameleon, adopting an innovative fine-tuning strategy that is both data-efficient and parameter-efficient. Anole demonstrates high-quality, coherent multimodal generation capabilities. We have open-sourced our model, training framework, and instruction tuning data.
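To make "native" interleaved generation concrete, here is a minimal sketch of how a single autoregressive token stream can carry both text and image content without an adapter or a separate diffusion model. Everything in it is an illustrative assumption rather than Anole's actual tokenizer: the vocabulary sizes, the `BOI`/`EOI` sentinel ids, and the `split_interleaved` helper are hypothetical toy values chosen only to show the idea.

```python
from typing import List, Tuple

# Hypothetical toy vocabulary layout (NOT Anole's real token ids).
TEXT_VOCAB = 100                      # ids 0..99 stand in for text tokens
IMAGE_VOCAB = 16                      # ids 100..115 stand in for discrete image codes
BOI = TEXT_VOCAB + IMAGE_VOCAB        # assumed begin-of-image sentinel (116)
EOI = BOI + 1                         # assumed end-of-image sentinel (117)


def is_image_code(tok: int) -> bool:
    """Return True if the id falls in the toy image-code range."""
    return TEXT_VOCAB <= tok < TEXT_VOCAB + IMAGE_VOCAB


def split_interleaved(tokens: List[int]) -> List[Tuple[str, List[int]]]:
    """Split one flat token stream into ("text", ids) and ("image", ids) segments."""
    segments: List[Tuple[str, List[int]]] = []
    current: List[int] = []
    in_image = False
    for tok in tokens:
        if tok == BOI:
            if in_image:
                raise ValueError("nested begin-of-image sentinel")
            if current:
                segments.append(("text", current))
            current, in_image = [], True
        elif tok == EOI:
            if not in_image:
                raise ValueError("end-of-image without matching begin")
            segments.append(("image", current))
            current, in_image = [], False
        else:
            # Image codes may only appear between the sentinels, text ids only outside.
            if in_image != is_image_code(tok):
                raise ValueError(f"token {tok} is in the wrong modality span")
            current.append(tok)
    if in_image:
        raise ValueError("unterminated image span")
    if current:
        segments.append(("text", current))
    return segments


stream = [1, 2, BOI, 100, 101, EOI, 3]
segments = split_interleaved(stream)
```

In a real system, the image segments would be passed to an image detokenizer and the text segments to a text detokenizer. The point of the sketch is only that one model emitting one sequence is enough to interleave the two modalities.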

Code & Models

Repositories

gair-nlp/anole (jax · Official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications

Methods: ALIGN · Diffusion