Unified Text-Image Generation with Weakness-Targeted Post-Training

Jiahui Chen; Philippe Hansen-Estruch; Xiaochuang Han; Yushi Hu; Emily Dinan; Amita Kamath; Michal Drozdzal; Reyhane Askari-Hemmat; Luke Zettlemoyer; Marjan Ghazvininejad

arXiv:2601.04339 · cs.CV · January 22, 2026

Open Access

TL;DR

This paper presents a post-training method that lets unified text-image generation models transition autonomously from text reasoning to image synthesis within a single inference process, improving T2I performance on four benchmarks.

Contribution

The paper introduces a reward-weighted post-training approach that uses targeted synthetic data to improve fully unified multimodal generation models.
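
To make the idea of reward weighting concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not specify the weighting scheme, so the softmax normalisation, the `temperature` parameter, and the function names `reward_weights` and `reward_weighted_nll` are all assumptions made for illustration. The sketch only shows the general pattern of scaling each sample's negative log-likelihood by a weight derived from its reward.

```python
import math


def reward_weights(rewards, temperature=1.0):
    """Turn scalar rewards into non-negative weights that sum to 1.

    Softmax weighting is an assumption here, not the paper's stated scheme.
    """
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def reward_weighted_nll(token_logprobs, rewards, temperature=1.0):
    """Reward-weighted average of per-sample negative log-likelihoods.

    token_logprobs: one list of per-token log-probabilities per sample
                    (hypothetical input; text and image tokens alike).
    rewards:        one scalar reward per sample.
    """
    weights = reward_weights(rewards, temperature)
    # Length-normalised negative log-likelihood of each sample.
    losses = [-sum(lp) / len(lp) for lp in token_logprobs]
    return sum(w * loss for w, loss in zip(weights, losses))


if __name__ == "__main__":
    # Two toy samples; the second has the higher reward, so its loss
    # contributes more to the weighted objective.
    logprobs = [[-0.5, -1.5], [-2.0, -2.0]]
    print(reward_weighted_nll(logprobs, rewards=[0.2, 0.9]))
```

With equal rewards the weights are uniform and the objective reduces to the ordinary mean loss. Lowering `temperature` concentrates the weight on the highest-reward samples.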

Findings

01. Improved T2I performance on four benchmarks
02. Targeted post-training data outperforms broad datasets
03. Reward-weighted training enhances cross-modal generation

Abstract

Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship