Deciphering Personalization: Towards Fine-Grained Explainability in Natural Language for Personalized Image Generation Models
Haoming Wang, Wei Gao

TL;DR
This paper introduces FineXL, a novel method for providing fine-grained natural language explanations of personalization in image generation models, enhancing interpretability and understanding of multiple personalization aspects.
Contribution
The paper presents FineXL, a new technique that offers detailed natural language explanations and scores for different personalization aspects in image generation models.
Findings
Improves explainability accuracy by 56% across scenarios.
Provides detailed natural language descriptions for multiple personalization aspects.
Enhances understanding of personalized image generation models.
Abstract
Image generation models are usually personalized in practice to better meet individual users' heterogeneous needs, but most personalized models lack explainability about how they are personalized. Such explainability can be provided via visual features in generated images, but these are difficult for human users to understand. Explainability in natural language is a better choice, but existing approaches to natural-language explainability are coarse-grained: they cannot precisely identify the multiple aspects of personalization or the varying levels of personalization in each aspect. To address this limitation, in this paper we present a new technique, namely FineXL, towards Fine-grained eXplainability in natural Language for personalized image generation models. FineXL can provide natural…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Accurate problem positioning close to practical needs: Focus on the core user demand for "accurate explanation" when selecting personalized models, with experimental design close to real application scenarios (e.g., multi-dimensional personalization, adaptation to different model architectures).
2. Outstanding method generality and practicality: No additional training is required, and it can be directly adapted to mainstream image generation architectures such as diffusion models, GANs, and
1. Insufficient exploration of the limitations of the linear representation assumption: The paper assumes that high-level distribution differences can be linearly decomposed into combinations of low-level concepts, but does not verify the applicability of this assumption in complex scenarios - for example, when there are interaction effects between personalization dimensions or concepts are high-dimensional non-linear features, whether the error of linear decomposition will increase significantl
- The idea of helping publishers and users select their own models has value for real-world applications.
- Linearly decomposing image representations to form "keywords" for a personalized model is interesting.
- The scope of the experiments and design is limited. If the scope is limited to style personalization, it is hard to cover many general cases in choosing personalized models, such as subject-driven personalization and abstract concept personalization.
- Lack of elaboration and implementation for the claim in Figure 1. The authors claim that this system can help model publishers choose the data mixture, but this is missing from the paper. Could the authors clarify this part?
- Concerns about the necessity: The a
Practical objective: Goes beyond “is it personalized” to which aspects changed and by how much, in readable concepts (e.g., vivid/abstract/bold), useful for diagnosis and preference alignment.
End-to-end and controllable: Clear pipeline (concept discovery → concept vectorization → linear decomposition) with orthogonality screening and a residual threshold to curb redundancy and control decomposition depth.
Representation checks: Explicit probes of alignment/linearity/orthogonality across multi
* **VLM dependency:** Concept discovery relies on a VLM and can be sensitive to the choice of model and prompt templates; the paper could more fully quantify how this propagates to final explanation quality.
* **Strong linear/orthogonality assumptions:** Real concepts are often correlated (e.g., vivid ↔ contrast). Even with orthogonality metrics and thresholds, leakage or non-unique decompositions may occur.
* **Limited stability reporting:** Thresholds for orthogonality (e_{\text{ortho}}) and d
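The pipeline this review describes (concept vectors, orthogonality screening, then linear decomposition with a residual threshold) can be illustrated with a minimal sketch. This is not FineXL's implementation: it swaps in generic greedy matching pursuit over toy 3-D vectors, and every name and value in it (`screen_orthogonal`, `decompose`, `max_abs_cos`, `residual_tol`, the concept vectors themselves) is a hypothetical choice made for illustration.

```python
import numpy as np


def screen_orthogonal(concepts, max_abs_cos=0.3):
    """Keep a concept only if its |cosine| with every kept concept is small.

    `max_abs_cos` is a hypothetical orthogonality threshold.
    """
    kept = []
    for name, vec in concepts.items():
        unit = vec / np.linalg.norm(vec)
        if all(abs(unit @ other) <= max_abs_cos for _, other in kept):
            kept.append((name, unit))
    return kept


def decompose(diff, kept, residual_tol=0.1, max_terms=5):
    """Greedy matching pursuit (an assumed stand-in, not the paper's solver).

    Repeatedly removes the concept with the largest |projection| on the
    residual until the relative residual norm drops to `residual_tol`.
    """
    total = np.linalg.norm(diff)
    if total == 0.0:
        return {}, 0.0
    residual = diff.astype(float).copy()
    scores = {}
    for _ in range(max_terms):
        if np.linalg.norm(residual) <= residual_tol * total:
            break
        name, unit = max(kept, key=lambda item: abs(residual @ item[1]))
        coeff = float(residual @ unit)
        if abs(coeff) < 1e-12:
            break  # no remaining concept explains the residual
        scores[name] = scores.get(name, 0.0) + coeff
        residual -= coeff * unit
    return scores, float(np.linalg.norm(residual) / total)


# Toy concept vectors (illustrative only); "saturated" nearly duplicates "vivid".
concepts = {
    "vivid": np.array([1.0, 0.0, 0.0]),
    "saturated": np.array([0.9, 0.1, 0.0]),
    "abstract": np.array([0.0, 1.0, 0.0]),
    "bold": np.array([0.0, 0.0, 1.0]),
}
# Hypothetical difference between personalized and base model representations.
diff = np.array([2.0, 0.5, 0.0])

kept = screen_orthogonal(concepts)
scores, rel_residual = decompose(diff, kept)
print(scores, rel_residual)
```

With these toy inputs, "saturated" is screened out as a near-duplicate of "vivid", and the difference vector is attributed to "vivid" and "abstract" only. The second weakness listed above applies directly to this kind of sketch: when the concept vectors are correlated rather than near-orthogonal, a greedy decomposition like this one need not be unique.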
- The general direction of the paper, categorizing differences between models, is promising.
- The paper is well written and easy to follow.
- **Focus on stylistic concepts.** The concepts used to characterize differences between models across generations appear to be mostly stylistic. Visual concepts can exist at different levels of abstraction, such as patterns or textures, color palettes, camera parameters, etc. Some of these concepts can be difficult to capture with language, and their decomposition can be learned [1,2]. Additionally, personalized models may encode semantic or cultural biases that were not studied.
- **Categorizing
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
