AutoLoRA: Automatic LoRA Retrieval and Fine-Grained Gated Fusion for Text-to-Image Generation
Zhiwen Li, Zhongjie Duan, Die Chen, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen

TL;DR
AutoLoRA introduces a semantic-driven retrieval and dynamic fusion framework for LoRA modules, enhancing text-to-image generation by enabling scalable, data-efficient model customization without relying on original training data.
Contribution
The paper presents a novel framework combining semantic-based LoRA retrieval and fine-grained gated fusion for improved multi-LoRA integration in image generation models.
Findings
Significant performance improvements in image quality.
Effective semantic retrieval of LoRA modules without original training data.
Dynamic, context-specific fusion of multiple LoRA modules.
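The retrieval finding above can be illustrated with a minimal sketch, assuming LoRA modules and prompts have already been mapped into one shared embedding space. The embeddings, LoRA names, and the `retrieve_loras` helper are hypothetical placeholders for illustration, not the paper's encoder or API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_loras(prompt_emb, lora_embs, k=2):
    """Return the names of the k LoRAs most similar to the prompt embedding."""
    ranked = sorted(
        lora_embs,
        key=lambda name: cosine(prompt_emb, lora_embs[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy, hand-made embeddings (assumption: a real system would obtain these
# from a learned weight encoder and a text encoder).
lora_embs = {
    "watercolor_style": [0.9, 0.1, 0.0],
    "pixel_art_style": [0.0, 1.0, 0.1],
    "corgi_concept": [0.1, 0.0, 0.95],
}
prompt_emb = [0.8, 0.0, 0.6]  # stands in for "a watercolor painting of a corgi"

top = retrieve_loras(prompt_emb, lora_embs, k=2)
print(top)
```

Because ranking uses only the embeddings, no LoRA metadata or training data is consulted at query time, which is the property the finding describes.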
Abstract
Despite recent advances in photorealistic image generation through large-scale models like FLUX and Stable Diffusion v3, the practical deployment of these architectures remains constrained by their inherent intractability to parameter fine-tuning. While low-rank adaptation (LoRA) has demonstrated efficacy in enabling model customization with minimal parameter overhead, the effective utilization of distributed open-source LoRA modules faces three critical challenges: sparse metadata annotation, the requirement for zero-shot adaptation capabilities, and suboptimal strategies for multi-LoRA fusion. To address these limitations, we introduce a novel framework that enables semantic-driven LoRA retrieval and dynamic aggregation through two key components: (1) a weight-encoding-based LoRA retriever that establishes a shared semantic space between LoRA parameter matrices and…
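The abstract is cut off before it describes the fusion component, so the following is only a generic sketch of input-dependent gated fusion of several LoRA updates on one linear layer. The gate form (a sigmoid of a learned vector dotted with the input) and all names are assumptions for illustration, not the paper's mechanism.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_lora_forward(W, loras, gate_vecs, x):
    """Compute y = W x + sum_i g_i(x) * B_i (A_i x).

    Assumption: g_i(x) = sigmoid(w_i . x), one scalar gate per LoRA.
    `loras` is a list of (A, B) low-rank factor pairs.
    """
    y = matvec(W, x)
    gates = []
    for (A, B), w in zip(loras, gate_vecs):
        g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        delta = matvec(B, matvec(A, x))  # low-rank update applied to x
        y = [yi + g * di for yi, di in zip(y, delta)]
        gates.append(g)
    return y, gates

# Toy example: a 2x2 identity base layer and two rank-1 LoRAs.
W = [[1.0, 0.0], [0.0, 1.0]]
loras = [
    ([[1.0, 0.0]], [[0.5], [0.0]]),  # (A, B) of LoRA 0
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # (A, B) of LoRA 1
]
gate_vecs = [[0.0, 0.0], [-100.0, 0.0]]  # LoRA 1 is gated off for this input
x = [2.0, 1.0]

y, gates = gated_lora_forward(W, loras, gate_vecs, x)
print(y, gates)
```

The point of the gate is that each LoRA's contribution depends on the current input rather than on a fixed merge weight, so an irrelevant module can be suppressed without retraining it.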
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- **Motivation is reasonable**: Clear formulation of a weight-probe encoder for data-free LoRA semantics and cross-interaction gating with layerwise scaling; ablations show the gate helps and Global-LoRA improves fusion.
- **Empirical results cover multiple aspects**: Evaluations span synthetic prompts, rewritten DiffusionDB prompts, object–style fusion, and random multi-LoRA fusion, with consistent gains.
- **Clarity of Figures and Diagrams**: The embedding clusters by th
- **Motivation is good, but experimental setup is naive**: The experimental setup for a generation model like this is too naive to deliver a useful signal for the community to deploy or explore this direction. Using simple concept/style LoRAs is not useful.
- **Experimental Setup Fairness**: LoRAs are pre-filtered to those that improve MPS/HPS/VQA, and many prompts are synthetic or rewritten, reducing ecological validity.
- **Compare to Model Souping**: There are previous works in the dir
The paper addresses a timely and practical challenge in the use of community-generated LoRA adapters for text-to-image generation, proposing a system that operates without requiring access to training data or metadata. Its main contribution lies in introducing a weight-based retrieval mechanism that encodes LoRA parameters into a shared embedding space with textual prompts using contrastive learning. This design is original in attempting to interpret LoRA weights semantically, without relying on
While the paper presents an interesting attempt to formalize LoRA retrieval and fusion, several critical weaknesses limit its credibility and contribution. First, the retriever training setup appears fundamentally underpowered: the model is trained on only 162 LoRA modules, an extremely small dataset for contrastive learning of this type. Given that CLIP-style embeddings require hundreds of thousands of diverse examples to generalize, it is doubtful that meaningful cross-modal alignment could em
1. The paper introduces a strategy for encoding LoRA weights to enable effective retrieval.
2. It proposes a novel gated fusion mechanism for combining multiple LoRAs, which does not require training on specific LoRAs and scales independently of their number.
3. The authors conduct extensive experiments and ablation studies to demonstrate the contribution of each component in their method.

1. The paper includes a limited number of visual examples; additional qualitative results would help to better assess output quality.
2. Some proofreading is needed, especially in section 3. For instance, line 149 contains: “.. the input dimension of the linear layer (corresponds to k in *your* notation)”
3. It is unclear whether the retrieved LoRAs are indeed the most relevant ones, and whether the text descriptions of generated images sufficiently capture the characteristics of each LoRA, see
1. **Novelty of "Model as Semantic"**: The core idea of encoding the LoRA weights directly, rather than relying solely on textual metadata, is novel and highly promising. The intuition that the model parameters (weights) themselves hold semantic meaning, akin to word embeddings in NLP, is insightful. Using this embedding space for retrieval is a strong and original conceptual contribution.
2. **Methodological Simplicity**: The architecture designed for this weight embedding, projecting LoRA modu

My recommendation for rejection is primarily driven by severe concerns regarding the paper's experimental soundness, which are detailed below.

### Primary Concern:
- **Potential for Circular Reasoning (Cherry-Picking)**: My most significant concern, leading to the "Poor" soundness score, is the potential for selection bias in the experimental validation. In Appendix C (e.g., Lines 698-701), the authors state that from an initial pool of 1,100 LoRAs, they "retained" a curated pool of 162 based
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Advanced Image Fusion Techniques
