TL;DR
This paper introduces a new evaluation score, ICT (Image-Contained-Text), and an HP (High-Preference) score model to better align image generation with human aesthetic preferences, surpassing existing text-image alignment metrics in accuracy and improving the quality of generated images.
Contribution
It proposes a novel evaluation metric (the ICT score) and a new preference model (the HP score model) trained solely on the image modality, improving image aesthetic assessment beyond traditional text-image alignment.
Findings
The ICT score surpasses existing evaluation metrics by over 10% in accuracy.
The HP score model enhances image aesthetics and detail quality.
The approach improves the optimization of state-of-the-art text-to-image models.
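To make the accuracy figure above more concrete, here is a minimal sketch of how preference accuracy is commonly computed for a scoring function. It assumes, without confirmation from this summary, that accuracy means the fraction of human-labeled image pairs in which the score ranks the preferred image higher. The names `pairwise_accuracy`, `toy_scores`, and `toy_pairs` are hypothetical, and the toy lookup table merely stands in for a real scorer; it is not the ICT or HP model.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pairwise_accuracy(
    pairs: Sequence[Tuple[T, T]],
    score: Callable[[T], float],
) -> float:
    """Fraction of (preferred, rejected) pairs ranked correctly by `score`.

    Assumption: each pair lists the human-preferred item first.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(
        1 for preferred, rejected in pairs if score(preferred) > score(rejected)
    )
    return correct / len(pairs)


# Hypothetical stand-in scores keyed by image id (not real model outputs).
toy_scores = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.8}
# Each tuple is (human-preferred image, rejected image).
toy_pairs = [("a", "b"), ("c", "d"), ("d", "b"), ("a", "c")]

accuracy = pairwise_accuracy(toy_pairs, toy_scores.__getitem__)
print(accuracy)  # 0.75: three of the four pairs are ranked as humans ranked them
```

Under this reading, a gain of over 10% would mean the new score orders noticeably more pairs the way human annotators did than earlier metrics do.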
Abstract
Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, ICT (Image-Contained-Text) score, that achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality to enhance image aesthetics and detail quality while…
