TL;DR
LoRAtorio introduces a train-free, intrinsic method for composing multiple LoRA adapters in diffusion models, improving personalization and open-ended skill combination without retraining.
Contribution
The paper proposes a novel latent-space, similarity-based composition framework for multiple LoRA adapters, addressing domain drift and enabling dynamic inference-time selection.
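To make the idea of similarity-based composition concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`patch_cosine_similarity`, `compose_predictions`), the non-overlapping patch layout, the temperature `tau`, and the choice to give higher weight to patches where an adapter is less similar to the base model are all hypothetical. That last choice is motivated only by the abstract's observation that adapters diverge from the base model when in distribution.

```python
import numpy as np

def patch_cosine_similarity(a, b, patch):
    """Cosine similarity between a and b over non-overlapping patch x patch tiles.

    a, b: arrays of shape (C, H, W), with H and W divisible by `patch`.
    Returns an array of shape (H // patch, W // patch).
    """
    c, h, w = a.shape

    def tiles(x):
        # Split the spatial grid into tiles and flatten each tile (with channels).
        x = x.reshape(c, h // patch, patch, w // patch, patch)
        x = x.transpose(1, 3, 0, 2, 4)
        return x.reshape(h // patch, w // patch, -1)

    ta, tb = tiles(a), tiles(b)
    dot = (ta * tb).sum(axis=-1)
    norm = np.linalg.norm(ta, axis=-1) * np.linalg.norm(tb, axis=-1)
    return dot / np.maximum(norm, 1e-8)

def compose_predictions(eps_base, eps_loras, patch=2, tau=1.0):
    """Blend per-adapter noise predictions with patch-wise weights.

    Assumption (hypothetical): an adapter gets MORE weight in patches where
    its prediction is LESS similar to the base model's prediction.
    Returns (combined prediction of shape (C, H, W), weights of shape (N, Hp, Wp)).
    """
    sims = np.stack([patch_cosine_similarity(e, eps_base, patch) for e in eps_loras])
    # Softmax over adapters of the negated similarity (numerically stabilised).
    logits = -sims / tau
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Expand patch-level weights back to the full spatial resolution.
    full = np.repeat(np.repeat(weights, patch, axis=1), patch, axis=2)
    combined = (full[:, None] * np.stack(eps_loras)).sum(axis=0)
    return combined, weights

# Toy usage: one adapter matches the base model, the other opposes it.
rng = np.random.default_rng(0)
eps_base = rng.normal(size=(4, 4, 4))
combined, weights = compose_predictions(eps_base, [eps_base.copy(), -eps_base])
```

In this toy case the second adapter receives the larger weight in every patch, because it differs most from the base prediction. A real pipeline would apply such a blend to denoiser outputs at each sampling step.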
Findings
Achieves up to 1.3% improvement in CLIPScore
72.43% win rate in GPT-4V evaluations
Effective generalization across multiple diffusion models
Abstract
Low-Rank Adaptation (LoRA) has become a widely adopted technique in text-to-image diffusion models, enabling the personalisation of visual concepts such as characters, styles, and objects. However, existing approaches struggle to effectively compose multiple LoRA adapters, particularly in open-ended settings where the number and nature of required skills are not known in advance. In this work, we present LoRAtorio, a novel train-free framework for multi-LoRA composition that leverages intrinsic model behaviour. Our method is motivated by two key observations: (1) LoRA adapters trained on narrow domains produce denoised outputs that diverge from the base model, and (2) when operating out-of-distribution, LoRA outputs show behaviour closer to the base model than when conditioned in distribution. The balance between these two observations allows for exceptional performance in the single…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper is well structured and easy to follow. 2. The proposed approach shows better results as the number of active LoRA adapters increases. 3. Re-centering of the unconditional noise could be used independently. 4. Both UNet-based and DiT-based models are evaluated. 5. The human and VLM-based evaluations are fully described. 6. Extensive appendix.
1. The multi-LoRA composition task with dynamic LoRA selection probably requires a more detailed description, as it currently lacks motivation (at least some potential use cases). 2. The majority of the comparisons use CLIPScore, which is a good proxy metric; however, a more extensive human or VLM-based evaluation is suggested. 3. Only compositions of LoRAs for Character, Style, and Background are considered. No compositions with LoRAs for faster inference (e.g., LCM) are checked. 4. See questions.
1. The authors propose a spatially-aware similarity metric to use as a proxy for a LoRA adapter's confidence, with sound theoretical motivation. 2. The authors extend the task of multi-LoRA composition to a dynamic module selection setting, which is a good, real-world skill composition scenario.
1. The first contribution seems to be incremental - MultLFG (2nd best method) proposes "... training-free frequency-aware multi-LoRA merging. The key idea is to decompose LoRA-based noise predictions into frequency subbands and perform adaptive merging based on relevance scores." (https://arxiv.org/pdf/2505.20525), whereas this paper proposes patched cosine distance instead of frequency subbands. 2. The second contribution - re-centering - is, per your results in Table 6a, only better by 0.01 (3
The paper demonstrates originality by proposing a train-free, intrinsically guided framework for multi-LoRA composition, departing from the reliance on weight merging or learned gating. The quality of the work is evident in the methodology, including spatial patch-based weighting, re-centered guidance, and dynamic module selection. The paper is clearly written, with effective visualizations and thorough empirical support.
While the paper presents an innovative and effective approach, there are several notable weaknesses that merit attention. First, the authors do not release their code, which hinders reproducibility and weakens the reliability of the claimed results. Second, the core mechanism—spatial patch-based weighting—raises concerns when dealing with heterogeneous LoRA types. For example, style-oriented LoRAs may introduce global stylistic shifts across all spatial regions, while object-specific LoRAs affec
