3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory
Xinyang Song, Libin Wang, Weining Wang, Zhiwei Li, Jianxin Sun, Dandan Zheng, Jingdong Chen, Qi Li, Zhenan Sun

TL;DR
3SGen is a unified, task-aware image generation framework that handles subject-, style-, and structure-driven conditioning within a single model, using an Adaptive Task-specific Memory to reduce inter-task interference and a VAE branch to preserve fine-grained visual detail.
Contribution
The paper presents 3SGen, a novel unified model with adaptive memory for simultaneous subject, style, and structure conditioning, improving task disentanglement and scalability.
Findings
Outperforms existing methods on multiple benchmarks.
Effectively disentangles and combines different conditioning modes.
Scales well to complex, compositional inputs.
Abstract
Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose…
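To make the gating-plus-memory idea concrete, here is a minimal sketch of a gated memory read. It is not the paper's ATM implementation, which the abstract does not specify. Everything in it is an assumption made for illustration: the names `read_memory`, `gated_retrieve`, and `gate_weights`, the softmax gate, the dot-product attention over memory items, and the toy 3-dimensional vectors. A query is scored against one gate vector per task, each task's memory items are read with attention, and the reads are mixed by the gate.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def read_memory(query, items):
    """Attention-weighted read over one task's memory items (assumed design)."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    return [sum(w * item[i] for w, item in zip(weights, items)) for i in range(dim)]


def gated_retrieve(query, memory, gate_weights):
    """Mix per-task memory reads with a softmax gate (hypothetical, not from the paper)."""
    tasks = list(memory)
    gate = softmax([dot(query, gate_weights[t]) for t in tasks])
    reads = [read_memory(query, memory[t]) for t in tasks]
    dim = len(query)
    mixed = [sum(g * r[i] for g, r in zip(gate, reads)) for i in range(dim)]
    return dict(zip(tasks, gate)), mixed


# Toy memory: two items per task, each axis standing in for one kind of prior.
memory = {
    "subject": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "style": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
    "structure": [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]],
}
# Toy gate vectors, one per task; in a real model these would be learned.
gate_weights = {
    "subject": [4.0, 0.0, 0.0],
    "style": [0.0, 4.0, 0.0],
    "structure": [0.0, 0.0, 4.0],
}

# A style-like query: the gate should put most of its weight on "style".
gate, prior = gated_retrieve([0.0, 1.0, 0.0], memory, gate_weights)
print(gate)
print(prior)
```

In this toy setup, a query aligned with the style axis routes most of the gate weight to the style memory, so the mixed prior is dominated by style items while the subject and structure memories contribute little. That is the kind of separation between condition types the abstract attributes to the ATM module.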
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection · Aesthetic Perception and Analysis
