TL;DR
MIRO introduces multi-reward conditioning during pretraining of text-to-image models, enhancing image quality, diversity, and training efficiency by directly learning user preferences.
Contribution
This paper presents MIRO, a novel pretraining method that conditions on multiple rewards, improving T2I generation quality and training speed over traditional single-reward approaches.
Findings
Achieves state-of-the-art results on the GenEval compositional benchmark.
Improves user-preference scores such as PickAScore, ImageReward, HPSv2.
Speeds up training of text-to-image models.
Abstract
The default paradigm of post-training text-to-image generators includes post-hoc selection of generated images, and subsequent training with one reward model to align the generator to the reward, typically user preference. This discards informative data as well as optimizes only for a single reward, hence harming diversity, semantic fidelity and efficiency. Instead, we propose MIRO, a method that conditions the model on multiple rewards during training, thus letting the model learn user preferences directly. MIRO pre-training both improves the visual quality of the generated images and speeds up the training, achieving state of the art on the GenEval compositional benchmark and user-preference scores (PickAScore, ImageReward, HPSv2).
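The abstract's central idea, conditioning the generator on several reward scores at training time, can be illustrated with a minimal sketch. This is not the authors' implementation: the quantile binning, the summed per-reward embeddings, and the names `quantile_bins`, `reward_condition`, and `tables` are all assumptions made for illustration. The sketch only shows how a vector of reward scores per image could be turned into one conditioning vector, and how a high-reward target could be requested at inference.

```python
import numpy as np


def quantile_bins(scores: np.ndarray, num_bins: int) -> np.ndarray:
    """Map raw scores of shape (N, K) to integer bins in [0, num_bins - 1].

    Each of the K reward columns is binned independently by rank, so rewards
    with different scales become comparable (an assumption of this sketch).
    """
    n, k = scores.shape
    bins = np.empty((n, k), dtype=np.int64)
    for j in range(k):
        # Double argsort gives the rank of each image within reward j.
        ranks = scores[:, j].argsort().argsort()
        bins[:, j] = ranks * num_bins // n
    return bins


def reward_condition(bins: np.ndarray, tables: np.ndarray) -> np.ndarray:
    """Sum one embedding per reward.

    bins:   (N, K) integer bin indices.
    tables: (K, num_bins, D) embedding tables (hypothetical; learned in practice).
    Returns an (N, D) conditioning vector that a generator could consume
    alongside the text embedding.
    """
    k = bins.shape[1]
    per_reward = [tables[j][bins[:, j]] for j in range(k)]
    return np.stack(per_reward).sum(axis=0)


def target_bins(num_rewards: int, num_bins: int) -> np.ndarray:
    """At inference, ask for the top bin of every reward (one sample)."""
    return np.full((1, num_rewards), num_bins - 1, dtype=np.int64)
```

During training, every image keeps its own reward bins, so low-scoring images still contribute instead of being filtered out. At sampling time, `target_bins` requests the highest bin for every reward, and individual entries could be lowered to trade one reward against another.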
Peer Reviews
Decision·Submitted to ICLR 2026
- Converging up to 19.1x faster on AestheticScore is a massive training speedup. At inference, it achieves SOTA quality with 370x less compute than FLUX-dev. It also makes Best-of-N sampling way, way cheaper.
- The paper shows (in Fig 2) that single-reward models totally overfit and tank other metrics. By training on 7 different rewards, MIRO is forced to find a healthy balance, and it ends up doing great on all of them.
- It's especially good at tough compositional tasks like Pos
- The whole framework is now completely dependent on the quality of your N reward models. If those models are biased or flawed, MIRO will just learn to be biased and flawed.
- The paper mentions augmenting 16M images. You have to run seven different reward models over all 16M images before you can even start your faster training. That's a huge, non-trivial compute cost that has to be paid first.
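The up-front annotation cost raised in this review can be made concrete with a short back-of-the-envelope calculation. The image and reward counts come from the review itself. The throughput figure is a purely hypothetical placeholder, not a number from the paper, and the helper names are illustrative.

```python
def reward_annotation_passes(num_images: int, num_rewards: int) -> int:
    """Total reward-model forward passes needed to label the whole dataset."""
    return num_images * num_rewards


def annotation_hours(passes: int, images_per_second: float) -> float:
    """Wall-clock hours on one device at an assumed, constant throughput."""
    return passes / images_per_second / 3600.0


# Counts quoted in the review: 16M images, seven reward models.
passes = reward_annotation_passes(16_000_000, 7)

# ASSUMPTION: 1000 images/s per reward model is a made-up throughput,
# used only to show how the estimate scales.
hours = annotation_hours(passes, images_per_second=1000.0)
```

Seven rewards over 16M images come to 112M forward passes. The cost scales linearly with both the dataset size and the number of reward models, and it is paid once before training rather than at every step.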
* The idea of embedding multiple reward signals into pretraining is conceptually simple yet powerful, unifying data quality, efficiency, and controllability within one framework.
* Empirical results are strong and consistent: MIRO outperforms single‑reward and baseline models across aesthetic and alignment benchmarks, reaching state‑of‑the‑art GenEval and user preference scores with much lower compute.
* The method yields clear interpretability and controllability at inference time, enabling exp
* The paper lacks ablation studies analyzing sensitivity to the number and choice of reward models; it is unclear how redundant or correlated rewards affect performance or training stability.
* Although MIRO shows significant efficiency improvements, the presentation of computational cost may be incomplete. Details on hardware, batch size, and training duration are sparse, making comparisons to larger models somewhat uneven.
1. The proposed approach demonstrates accelerated convergence during training, as evidenced by Figure 3, which illustrates MIRO’s significantly faster optimization compared to baseline methods.
2. MIRO consistently outperforms the baseline across all evaluated benchmarks, with Table 1 highlighting superior performance on GenEval and PartiPrompts metrics.
3. The integration of multiple reward models during pretraining represents an innovative strategy in text-to-image (T2I) generation, addressing
1. Experimental Limitations: The experimental design raises critical concerns regarding scalability and generalizability. The study focuses solely on a 0.36B parameter model—a relatively small architecture in T2I research—and trains it on only 16M image-text pairs. These constraints undermine confidence in the method’s ability to scale to industry-standard large models (e.g., 10B+ parameters) or real-world datasets. Additionally, the conclusions drawn from such limited experiments lack sufficien
