TL;DR
USO introduces a unified framework for style and subject-driven image generation by disentangling content and style, leveraging a large dataset, a novel learning scheme, and reward optimization to achieve state-of-the-art results.
Contribution
The paper proposes a novel unified model that combines style and subject-driven generation through disentangled learning and reward optimization, along with a new benchmark for evaluation.
Findings
USO achieves state-of-the-art performance in style similarity and subject fidelity.
The disentangled learning scheme effectively separates content and style features.
The USO-Bench provides a comprehensive evaluation of style and subject consistency.
Abstract
Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style…
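As a rough illustration of the data structure the abstract describes, the sketch below models one (content, style, stylized) triplet and a simple style reward. Everything here is an assumption made for illustration: the names `Triplet`, `cosine_similarity`, and `style_reward` are hypothetical, and scoring style as cosine similarity between embeddings is a stand-in, not the paper's confirmed reward formulation.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass(frozen=True)
class Triplet:
    """One training sample. Field names are hypothetical, not from the paper."""
    content_image: str   # path to the content reference image
    style_image: str     # path to the style reference image
    stylized_image: str  # path to the content rendered in that style (target)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as carrying no style signal
    return dot / (norm_a * norm_b)


def style_reward(generated_emb: Sequence[float],
                 reference_emb: Sequence[float]) -> float:
    """Assumed reward: higher when the generated image's style embedding
    is closer to the style reference's embedding."""
    return cosine_similarity(generated_emb, reference_emb)
```

In a setup like this, the embeddings would come from whatever style encoder is used to score outputs; the reviews below question how reliable such an encoder is as a reward signal.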
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The core innovations of the described work include: 1. Unification of style and subject-driven generation under a single framework by addressing the disentanglement and recombination of content and style. 2. Construction of a large-scale triplet dataset with content images, style images, and stylized content images. 3. Implementation of a content-style disentangled learning scheme. 4. Integration of a style reward-learning paradigm (SRL) to enhance model performance. The expression in this pape
1. Content Leakage in Style Encoder: The use of SigLIP as the style encoder is a notable weakness, as SigLIP features inherently encapsulate semantic information alongside style. This leads to potential content leakage, undermining the core objective of content-style disentanglement. A more dedicated style encoder should be considered to purify the style representation. 2. Lack of Explicit Content Consistency Constraint: The mechanism for preserving content consistency, particularly against lay
1. The results look good. 2. The synthesis pipeline is relatively clear. 3. When compared with recent methods, the advancement of USO is evident. 4. It can handle various style control tasks.
The training details of the expert models in the two synthesis pipelines are not mentioned. Figure 2 is not clear. The authors implemented USO based on UNO; to what extent does the achieved ID preservation capability stem from UNO? When using CSD as the style reward, will it affect or even compromise the evaluation? The authors did not measure the copy-paste phenomenon. In some cases, the facial pose does not appear to change significantly. More reasonable supplementary cases may be needed.
1. USO is the first framework to unify subject-driven generation and style transfer. 2. USO achieves state-of-the-art subject fidelity and visual consistency on benchmark datasets.
1. CSD is trained with contrastive learning that uses the artist name as the style label. Artworks by a single artist can differ in style, and most images in CSD's training dataset are oil paintings, resulting in an unreliable style reward score. 2. The proposed framework is similar to OmniStyle (replacing the VAE with SigLIP), and the proposed reward learning is widely used in previous work, so it lacks novelty. 3. The paper shows in Figure 4 that baselines fall short in text following, but USO does not achieve the
1. USO validates that style-driven generation and subject-driven generation can be jointly trained. The model obtains good results on subject-driven generation benchmarks. 2. The authors conducted extensive experiments to demonstrate the effectiveness of the method. 3. The paper is well organized and easy to follow overall.
Major Weakness: 1. The Cross-Task Triplet Curation Framework lacks novelty and omits many implementation details. First, the data synthesis framework lacks novelty. As presented in Figure 2, the USO data curation framework reverses the style target image to generate a style reference and a content reference, which is not new because a similar idea, a style removal module, was proposed and shown to be effective in 2023 [1]. Second, many implementation details have been hidden. For example, what
