MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation
Kai Dong, Tingting Bai

TL;DR
MAR-MAER is a hierarchical autoregressive framework for text-to-image generation that improves image quality and handles ambiguous prompts by aligning representations with human metrics and incorporating controlled randomness.
Contribution
MAR-MAER introduces a metric-aware embedding regularization method and a probabilistic latent model for prompt ambiguity, and is reported to surpass previous models in both quality and diversity.
Findings
Achieves +1.6 CLIPScore over baseline
Improves HPSv2 by +5.3 points
Produces more diverse images for ambiguous prompts
Abstract
Autoregressive (AR) models have demonstrated significant success in text-to-image generation. However, they face two major challenges. First, the generated images do not always meet the quality standards that humans expect. Second, these models struggle with ambiguous prompts that can be interpreted in several valid ways. To address these issues, we introduce MAR-MAER, a hierarchical autoregressive framework that combines two main components. The first is a metric-aware embedding regularization method. The second is a probabilistic latent model for handling ambiguous semantics. Our method uses a lightweight projection head trained with an adaptive kernel regression loss. This aligns the model's internal representations with human-preferred quality metrics, such as CLIPScore and HPSv2. As a result, the…
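The abstract does not spell out the adaptive kernel regression loss, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes a leave-one-out Nadaraya-Watson regressor with a Gaussian kernel over projected embeddings, and it assumes "adaptive" means the bandwidth is set from the batch (here, by the median heuristic). The function names `adaptive_bandwidth` and `kernel_regression_loss` are hypothetical.

```python
import numpy as np


def adaptive_bandwidth(dist_sq: np.ndarray) -> float:
    """Median heuristic (an assumption): bandwidth = median off-diagonal squared distance."""
    n = dist_sq.shape[0]
    off_diag = dist_sq[~np.eye(n, dtype=bool)]
    return max(float(np.median(off_diag)), 1e-8)


def kernel_regression_loss(projected: np.ndarray, scores: np.ndarray) -> float:
    """Leave-one-out Nadaraya-Watson regression error.

    projected: (n, d) outputs of a projection head for n samples.
    scores:    (n,) metric values per sample (e.g. CLIPScore or HPSv2).
    The loss is small when samples that are close in the projected space
    also have similar metric scores.
    """
    # Pairwise squared Euclidean distances between projected embeddings.
    diff = projected[:, None, :] - projected[None, :, :]
    dist_sq = np.sum(diff ** 2, axis=-1)

    # Gaussian kernel with a batch-dependent bandwidth.
    weights = np.exp(-dist_sq / adaptive_bandwidth(dist_sq))

    # Leave-one-out: a sample must not predict its own score.
    np.fill_diagonal(weights, 0.0)
    weights /= weights.sum(axis=1, keepdims=True) + 1e-12

    # Each score is predicted as a kernel-weighted average of the others.
    predicted = weights @ scores
    return float(np.mean((predicted - scores) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(32, 16))   # stand-in for internal representations
    head = rng.normal(size=(16, 4)) * 0.1    # stand-in for a lightweight linear head
    metric = rng.uniform(size=32)            # stand-in for per-image metric scores
    print(kernel_regression_loss(embeddings @ head, metric))
```

In a training setup, a loss of this shape would be minimized with respect to the projection head's parameters in an autodiff framework; NumPy is used here only to keep the arithmetic visible.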
