Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
Siqi Kou, Jiachun Jin, Zhihong Liu, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng

TL;DR
Orthus is a novel autoregressive transformer that effectively generates images from text, answers questions based on visual inputs, and creates complex interleaved image-text content by handling discrete and continuous modalities with modality-specific heads.
Contribution
The paper introduces Orthus, a new AR multimodal model with modality-specific heads that improves image-text generation and understanding, using a soft alternative to VQ and a diffusion head for continuous image features.
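To illustrate what a "soft alternative to VQ" could look like, here is a minimal sketch. It assumes the soft variant replaces the nearest-neighbor codebook lookup with a softmax-weighted average of codebook vectors; that formulation, the `temperature` parameter, and the helper names `soft_quantize` and `hard_quantize` are illustrative assumptions, not details taken from this summary.

```python
import math


def _sq_dists(feature, codebook):
    # Squared Euclidean distance from the feature to every codebook entry.
    return [sum((f - c) ** 2 for f, c in zip(feature, code)) for code in codebook]


def hard_quantize(feature, codebook):
    """Standard VQ: snap the feature to its nearest codebook entry."""
    dists = _sq_dists(feature, codebook)
    return list(codebook[dists.index(min(dists))])


def soft_quantize(feature, codebook, temperature=1.0):
    """Assumed soft variant: softmax-weighted mix of codebook entries."""
    dists = _sq_dists(feature, codebook)
    d_min = min(dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(weights)
    weights = [w / total for w in weights]
    # The result is a continuous vector rather than a single discrete code.
    return [
        sum(w * code[i] for w, code in zip(weights, codebook))
        for i in range(len(feature))
    ]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feature = [0.9, 0.1]
soft = soft_quantize(feature, codebook, temperature=1.0)
nearly_hard = soft_quantize(feature, codebook, temperature=1e-3)
hard = hard_quantize(feature, codebook)
```

Under this assumption, a very small temperature recovers the hard lookup, while larger temperatures keep information that a single discrete code would discard.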
Findings
Outperforms baselines like Show-o and Chameleon on standard benchmarks.
Achieves a GenEval score of 0.58 and MME-P score of 1265.8 with 7B parameters.
Efficient training within 72 A100 GPU hours.
Abstract
We introduce Orthus, an autoregressive (AR) transformer that excels in generating images given textual prompts, answering questions based on visual inputs, and even crafting lengthy image-text interleaved contents. Unlike prior arts on unified multimodal modeling, Orthus simultaneously copes with discrete text tokens and continuous image features under the AR modeling principle. The continuous treatment of visual signals minimizes the information loss for both image understanding and generation while the fully AR formulation renders the characterization of the correlation between modalities straightforward. The key mechanism enabling Orthus to leverage these advantages lies in its modality-specific heads -- one regular language modeling (LM) head predicts discrete text tokens and one diffusion head generates continuous image features conditioning on the output of the backbone. We devise…
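The two-head design described in the abstract can be sketched as a simple routing step. This is a toy illustration, not the authors' implementation: the function names, the explicit `modality` argument, and the weight layout are hypothetical, and the "diffusion head" here is a stand-in that iteratively refines noise toward the backbone output rather than a trained denoising network.

```python
import random


def lm_head(hidden, vocab_weights):
    """Discrete head: score each vocabulary entry and return the argmax token id."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in vocab_weights]
    return logits.index(max(logits))


def toy_diffusion_head(hidden, steps=10, rng=None):
    """Continuous head (toy stand-in): start from Gaussian noise and refine it
    step by step, conditioned on the backbone output `hidden`."""
    rng = rng or random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in hidden]
    for _ in range(steps):
        # Assumption: the "denoiser" simply pulls x halfway toward `hidden`.
        x = [xi + 0.5 * (h - xi) for xi, h in zip(x, hidden)]
    return x


def generate_step(hidden, modality, vocab_weights):
    """Route one backbone output to the head that matches the next position."""
    if modality == "text":
        return lm_head(hidden, vocab_weights)
    return toy_diffusion_head(hidden)


hidden = [1.0, 0.0]
vocab_weights = [[0.0, 1.0], [1.0, 0.0]]  # two-token toy vocabulary
token_id = generate_step(hidden, "text", vocab_weights)
image_feature = generate_step(hidden, "image", vocab_weights)
```

Both heads read the same backbone output; only the output type differs (an integer token id versus a continuous feature vector). How the model decides which modality comes next is not stated in this summary, so the sketch passes it in explicitly.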
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling
Methods: Diffusion
