Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

Siqi Kou; Jiachun Jin; Zhihong Liu; Chang Liu; Ye Ma; Jian Jia; Quan Chen; Peng Jiang; Zhijie Deng

arXiv:2412.00127·cs.CV·April 17, 2025

TL;DR

Orthus is a novel autoregressive transformer that effectively generates images from text, answers questions based on visual inputs, and creates complex interleaved image-text content by handling discrete and continuous modalities with modality-specific heads.

Contribution

The paper introduces Orthus, a new AR multimodal model with modality-specific heads that improves image-text generation and understanding, using a soft alternative to VQ and a diffusion head for continuous image features.

Findings

01. Outperforms baselines like Show-o and Chameleon on standard benchmarks.

02. Achieves a GenEval score of 0.58 and MME-P score of 1265.8 with 7B parameters.

03. Efficient training within 72 A100 GPU hours.

Abstract

We introduce Orthus, an autoregressive (AR) transformer that excels in generating images given textual prompts, answering questions based on visual inputs, and even crafting lengthy image-text interleaved content. Unlike prior work on unified multimodal modeling, Orthus simultaneously copes with discrete text tokens and continuous image features under the AR modeling principle. The continuous treatment of visual signals minimizes the information loss for both image understanding and generation, while the fully AR formulation renders the characterization of the correlation between modalities straightforward. The key mechanism enabling Orthus to leverage these advantages lies in its modality-specific heads -- one regular language modeling (LM) head predicts discrete text tokens and one diffusion head generates continuous image features conditioned on the output of the backbone. We devise…
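
The two-head idea in the abstract can be illustrated with a toy sketch. This is not the Orthus implementation: the sizes, the random weights, the function names (`lm_head`, `diffusion_head`, `predict_next`), and the simplified denoising update are all assumptions made for illustration. It only shows the routing: one backbone output vector goes to a softmax head when the next element is text, and to an iterative denoiser conditioned on that vector when the next element is an image feature.

```python
import math
import random

# Toy sizes chosen for illustration; they are not taken from the paper.
HIDDEN, VOCAB, FEAT = 8, 5, 4

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def rand_matrix(rows, cols):
    """Small random weight matrix (a stand-in for trained parameters)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


W_LM = rand_matrix(VOCAB, HIDDEN)              # hypothetical LM-head weights
W_EPS = rand_matrix(FEAT, FEAT + HIDDEN + 1)   # hypothetical denoiser weights


def lm_head(h):
    """Map a backbone output to a probability distribution over text tokens."""
    logits = matvec(W_LM, h)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


def diffusion_head(h, steps=10):
    """Produce a continuous feature by iterative denoising conditioned on h.

    The update rule is a deliberately simplified assumption; a real
    diffusion sampler follows a proper noise schedule.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(FEAT)]  # start from pure noise
    for t in range(steps, 0, -1):
        # Predict noise from the current sample, the condition h, and the step.
        eps = matvec(W_EPS, x + h + [t / steps])
        x = [xi - ei / steps for xi, ei in zip(x, eps)]
    return x


def predict_next(h, modality):
    """Route the backbone output to the head matching the target modality."""
    if modality == "text":
        probs = lm_head(h)
        return max(range(VOCAB), key=probs.__getitem__)  # greedy token id
    if modality == "image":
        return diffusion_head(h)
    raise ValueError(f"unknown modality: {modality}")


h = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]  # stand-in for a backbone output
token_id = predict_next(h, "text")        # discrete: an integer token id
image_feature = predict_next(h, "image")  # continuous: a list of floats
```

In this sketch the text path returns a discrete index while the image path returns a real-valued vector, so no vector quantization step is needed on the image side; how Orthus actually trains and samples its diffusion head is described in the paper itself.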

Code & Models

Models

SJTU-DENG-Lab/Orthus-7B-base (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling

Methods: Diffusion