StyleAR: Customizing Multimodal Autoregressive Model for Style-Aligned Text-to-Image Generation

Yi Wu; Lingting Zhu; Shengju Qian; Lei Liu; Wandi Qiao; Lequan Yu; Bin Li

arXiv:2505.19874·cs.CV·May 27, 2025

Open Access·3 Reviews

TL;DR

StyleAR introduces a novel method for style-aligned text-to-image generation by leveraging binary stylized data, a specialized data curation process, and style-enhanced tokens, overcoming data acquisition challenges.

Contribution

We propose StyleAR, a new approach combining data curation and model innovations to enable style-aligned generation using limited stylized data.

Findings

01 Outperforms existing methods in style consistency and quality.

02 Effectively utilizes binary stylized data for training.

03 Demonstrates superior qualitative and quantitative results.

Abstract

In the current research landscape, multimodal autoregressive (AR) models have shown exceptional capabilities across various domains, including visual understanding and generation. However, complex tasks such as style-aligned text-to-image generation present significant challenges, particularly in data acquisition. In analogy to instruction-following tuning for image editing of AR models, style-aligned generation requires a reference style image and prompt, resulting in a text-image-to-image triplet where the output shares the style and semantics of the input. However, acquiring large volumes of such triplet data with specific styles is considerably more challenging than obtaining conventional text-to-image data used for training generative models. To address this issue, we propose StyleAR, an innovative approach that combines a specially designed data curation method with our proposed…
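To make the data distinction in the abstract concrete, the sketch below contrasts a text-image-to-image triplet with a binary (prompt, stylized image) pair. All class, field, and function names are hypothetical. The conversion helper, which reuses a stylized image as its own style reference, is an assumption about how binary data could stand in for triplets; the text above does not confirm it as the authors' procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TripletSample:
    """Reference style image + prompt -> target image (hard to collect at scale)."""
    style_ref_path: str
    prompt: str
    target_path: str


@dataclass(frozen=True)
class BinarySample:
    """Prompt + stylized image only (the 'binary' data named in the TL;DR)."""
    prompt: str
    image_path: str


def as_self_reference_triplet(sample: BinarySample) -> TripletSample:
    # ASSUMPTION: the stylized image doubles as its own style reference.
    # This is an illustrative guess, not a detail stated in the source.
    return TripletSample(
        style_ref_path=sample.image_path,
        prompt=sample.prompt,
        target_path=sample.image_path,
    )


if __name__ == "__main__":
    pair = BinarySample(prompt="a cat on a sofa", image_path="cat_watercolor.png")
    print(as_self_reference_triplet(pair))
```

If the reference and target are the same image, a model could in principle copy content rather than style, which is the content leakage concern Reviewer 01 mentions below.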

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

- The qualitative results are strong, showing that the method often produces visually pleasing stylized outputs compared to diffusion-based baselines.
- The proposed style-enhanced token mechanism addresses the content leakage problem that is common in existing style transfer methods.
- The work explores style alignment in autoregressive models, which is a less-studied direction compared to diffusion-based approaches.

Weaknesses

- The methodological novelty appears limited. The overall pipeline mainly involves constructing (text, stylized image) pairs and fine-tuning the AR model, while the style token extraction and integration mechanism resembles an adaptation of existing approaches.
- The proposed model requires training, which makes it computationally more expensive than training-free stylization approaches.
- Quantitative performance lags behind some baselines.

Reviewer 02·Rating 4·Confidence 4

Strengths

1. The paper eliminates the need for difficult-to-acquire triplet data (as illustrated in Fig. 3), which can lower the data barrier for style-aligned generation tasks.
2. From the user study and the qualitative results, StyleAR achieves better performance compared to existing approaches.

Weaknesses

1. The idea in this method is a little bit confusing:
   a) The paper's core premise is difficult to follow. A central claim is that the method does not require triplet data, yet the authors use InstantStyle to synthesize stylized data for training. This seems contradictory.
   b) Furthermore, it is not explained what loss or constraint is used to enforce style consistency during training.
   c) The claim that Gaussian noise (n) "weakens irrelevant semantic features" is also questionable, as thi

Reviewer 03·Rating 2·Confidence 5

Strengths

The chapter organization of the paper is clear, and the selected images are feasible.

Weaknesses

The introduction in Section 3.2 on training data is not sufficiently addressed. How can binary groups ensure that the prompts cover a wide range of styles? The explanation is very unclear, and it does not convey the efforts made in terms of data. The training framework seems similar to the approach of diffusion models for style control, lacking novelty. Many recent state-of-the-art methods were not included in the comparison. Why is CLIP still used as the style encoder? In fact, there are alr

Taxonomy

Topics: Human Motion and Animation

Methods: Contrastive Language-Image Pre-training