TL;DR
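StyleVAR is an autoregressive framework for image style transfer, built on Visual Autoregressive Modeling (VAR), that formulates style transfer as conditional discrete sequence modeling in a learned latent space. It outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP metrics.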
Contribution
The paper proposes a VAR-based autoregressive model for controllable style transfer. A blended cross-attention mechanism injects style and content, with a scale-dependent coefficient balancing the two, and reinforcement fine-tuning with GRPO is applied to improve perceptual quality.
Findings
Outperforms the AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP metrics.
Reinforcement fine-tuning with GRPO improves perceptual alignment.
Transfers textures effectively while preserving semantic structure, especially for landscapes and architecture.
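For readers unfamiliar with GRPO, the core idea can be sketched as follows: several outputs are sampled for the same input, each is scored, and every score is normalized against the mean and standard deviation of its own group. This is a minimal sketch of that normalization step only. The function name `group_relative_advantages`, the `eps` term, and the example reward values are illustrative assumptions; the summary does not say which reward StyleVAR uses or how its policy update is configured.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (hypothetical helper).

    Each advantage is the reward minus the group mean, divided by the
    group standard deviation. eps is an assumed guard against division
    by zero when all rewards in the group are equal.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Made-up perceptual scores for four stylizations of the same input pair.
adv = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
print(adv)
```

Samples that score above their group average receive a positive advantage and are reinforced; samples below it are discouraged. No separate value network is needed for this step.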
Abstract
We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive…
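The blended cross-attention described in the abstract might look roughly like the sketch below, assuming the simplest possible reading: style and content features each serve as queries over the target's token history, and the two attention outputs are mixed by a per-scale coefficient. Everything here is an assumption for illustration (the names `attend` and `blended_cross_attention`, the shapes, the absence of learned projections and multiple heads, and the value of `alpha`); it is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, history):
    """Scaled dot-product attention with the history as keys and values.

    queries: (n_queries, d), history: (n_history, d). Learned projection
    matrices are omitted for brevity (an assumption of this sketch).
    """
    d = queries.shape[-1]
    scores = queries @ history.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ history


def blended_cross_attention(style_feats, content_feats, history, alpha):
    """Mix style-queried and content-queried views of the target history.

    alpha is the assumed scale-dependent blending coefficient in [0, 1]:
    alpha = 1 uses only style queries, alpha = 0 only content queries.
    """
    style_out = attend(style_feats, history)
    content_out = attend(content_feats, history)
    return alpha * style_out + (1.0 - alpha) * content_out


rng = np.random.default_rng(0)
history = rng.normal(size=(5, 8))        # target tokens from earlier scales
style_feats = rng.normal(size=(4, 8))    # style features at the current scale
content_feats = rng.normal(size=(4, 8))  # content features at the current scale

blended = blended_cross_attention(style_feats, content_feats, history, alpha=0.3)
print(blended.shape)
```

Because keys and values come only from the target's own history, the conditioning changes what is emphasized without exposing future tokens, which is one plausible reading of how the mechanism preserves the autoregressive structure.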
