Directional Textual Inversion for Personalized Text-to-Image Generation
Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung Shim

TL;DR
This paper introduces Directional Textual Inversion (DTI), a novel method that improves personalized text-to-image generation by fixing embedding norms and optimizing only direction, enabling better fidelity and interpolation of concepts.
Contribution
The paper proposes DTI, which constrains embedding magnitudes and optimizes direction on the hypersphere, addressing TI's failures on complex prompts and enabling smooth concept interpolation.
Findings
DTI outperforms TI in text fidelity and subject similarity.
Directional optimization enhances semantic coherence and interpolation.
Hyperspherical parameterization enables smooth concept transitions.
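The smooth transitions in the last finding come from interpolating along the sphere rather than along a straight line. Below is a minimal sketch of spherical linear interpolation (SLERP) between two unit directions. The function name `slerp` and the toy 2-D vectors are illustrative assumptions, not code from the paper.

```python
import math

def slerp(u, w, t):
    """Spherical linear interpolation between unit vectors u and w, 0 <= t <= 1."""
    cos_omega = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, w))))
    omega = math.acos(cos_omega)  # angle between the two directions
    if omega < 1e-8:
        return list(u)  # directions coincide; nothing to interpolate
    if math.pi - omega < 1e-8:
        raise ValueError("antipodal directions: the great-circle path is not unique")
    s = math.sin(omega)
    a = math.sin((1.0 - t) * omega) / s
    b = math.sin(t * omega) / s
    return [a * x + b * y for x, y in zip(u, w)]

# Toy example: two orthogonal unit directions (hypothetical concept directions).
u = [1.0, 0.0]
w = [0.0, 1.0]
mid = slerp(u, w, 0.5)
linear_mid = [0.5 * (x + y) for x, y in zip(u, w)]

slerp_norm = math.hypot(*mid)          # stays on the unit circle
linear_norm = math.hypot(*linear_mid)  # shrinks toward the origin
```

In this toy case the SLERP midpoint keeps unit norm, while the straight-line midpoint has a smaller norm. That is why a fixed-magnitude, direction-only parameterization pairs naturally with SLERP.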
Abstract
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across…
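The update described in the abstract can be sketched as follows: hold the embedding norm fixed, take a Riemannian SGD step on the unit sphere (project the gradient onto the tangent space, step, renormalize), and add the constant-direction gradient that a von Mises-Fisher prior contributes. This is only an illustration under stated assumptions. The quadratic data loss stands in for the diffusion training loss, and `FIXED_NORM`, `kappa`, `mu`, `target`, and the learning rate are hypothetical values, not the paper's settings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]

def map_gradient(data_grad, mu, kappa):
    # The negative log vMF prior is -kappa * (mu . v) + const, so its gradient
    # with respect to v is the constant vector -kappa * mu.
    return [g - kappa * m for g, m in zip(data_grad, mu)]

def riemannian_sgd_step(v, euclid_grad, lr):
    # Remove the radial component to get the tangent-space gradient at v,
    # take a step, then renormalize back onto the unit sphere (retraction).
    radial = dot(euclid_grad, v)
    tangent = [g - radial * x for g, x in zip(euclid_grad, v)]
    return normalize([x - lr * t for x, t in zip(v, tangent)])

FIXED_NORM = 0.4                  # hypothetical in-distribution magnitude
target = [0.0, 0.4, 0.0]          # toy target embedding (stand-in for real data)
mu = normalize([1.0, 1.0, 0.0])   # hypothetical prior mean direction
v = normalize([1.0, 0.0, 0.0])    # initial direction

for _ in range(200):
    e = [FIXED_NORM * x for x in v]  # embedding = fixed norm * direction
    # Gradient of the toy loss 0.5 * ||e - target||^2 with respect to v.
    data_grad = [FIXED_NORM * (ei - ti) for ei, ti in zip(e, target)]
    v = riemannian_sgd_step(v, map_gradient(data_grad, mu, kappa=0.01), lr=0.5)

embedding = [FIXED_NORM * x for x in v]
embedding_norm = math.sqrt(dot(embedding, embedding))
```

Only the direction `v` is ever updated, so the embedding norm stays at `FIXED_NORM` throughout training by construction.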
Peer Reviews
Decision: ICLR 2026 Poster
1. The observation that norm inflation dilutes the impact of the positional embedding and of residual updates within the Transformer is useful. 2. The claims are supported with theoretical justification. 3. The writing is clear and easy to follow.
1. The baselines are limited to heavily outdated methods. Further comparison is needed [1-3]. 2. Norm inflation of the learned token has already been explored, not only in personalization [4] but also in VLM prompt tuning [5]. A quantitative comparison with [4] is provided, but it is not clearly explained how the proposed approach differs conceptually. 3. Fig. 3: Is SLERP interpolation specific to DTI? For a fair comparison, SLERP should also be applied to TI for concept interpo
1) The idea behind the presented method is clear, and the text is well written. 2) The paper provides a solid theoretical basis for the proposed modifications. 3) The authors demonstrate the additional capabilities of their method. 4) The paper contains many experiments and an extensive ablation study.
1) The work contains a lot of theoretical discussion, but most of it is based on asymptotic estimates (e.g. Lemmas 1 and 2, Proposition 1). The problems described do not seem obvious to me, and I am not sure they would arise in practice. (questions 1, 2) 2) I am unsure whether TI and CrossInit are evaluated correctly on SDXL (Table 2). It seems that the methods are slightly overfitted. (question 3) 3) Some parts of the work (e.g. Fig. 1, Tab. 7) lack details on how they were obtained. (questi
The paper exhibits several notable strengths that enhance its scholarly contribution. First, the analysis of Textual Inversion (TI) limitations is particularly insightful, as it introduces a new perspective by identifying and rigorously examining embedding norm inflation—a previously underexplored issue. This is supported by empirical evidence, such as the demonstration of out-of-distribution norms and semantic drift, and complemented by theoretical foundations that explain how large magnitude
The paper's introduction emphasizes TI's poor performance with complex prompts, yet the subsequent technical analysis in Section 2 appears decoupled from this specific issue. The theoretical framework primarily explains general performance degradation due to norm inflation but does not explicitly model or analyze the distinct challenges of compositional prompts, creating a slight disconnect between the stated motivation and the core technical solution. While the paper claims TI is an "efficient
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Topic Modeling
