Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation
Dimitrios Christodoulou, Mads Kuhlmann-Jørgensen

TL;DR
This paper introduces a large-scale, diverse human annotation framework for evaluating text-to-image models, enabling comprehensive and bias-mitigated ranking of model performance based on subjective criteria.
Contribution
It presents an efficient annotation method leveraging global human feedback, collecting over 2 million votes to evaluate multiple models on subjective aspects.
Findings
Collected over 2 million annotations across 4,512 images from a diverse, global annotator pool
Effective ranking of models based on subjective criteria
Reduced bias through demographic diversity
Abstract
Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.
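The abstract does not say how individual votes are turned into a ranking. As a minimal sketch, assuming each annotation is a pairwise choice between two models' images for the same prompt, one simple aggregation is a per-model win rate. The function names, the `(winner, loser)` vote format, and the toy votes below are hypothetical illustrations, not the authors' method or data.

```python
from collections import defaultdict


def win_rates(votes):
    """Return each model's share of won comparisons.

    votes: iterable of (winner, loser) model-name pairs, one per annotation.
    This pairwise format is an assumption; the source does not specify it.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Every model that appeared in at least one comparison gets a rate.
    return {model: wins[model] / games[model] for model in games}


def rank_models(votes):
    """Sort models from highest to lowest win rate."""
    rates = win_rates(votes)
    return sorted(rates, key=rates.get, reverse=True)


# Made-up votes with placeholder model names, for illustration only.
toy_votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
ranking = rank_models(toy_votes)
print(ranking)
```

In this setting the aggregation would be run once per criterion (style preference, coherence, text-to-image alignment). Win rate ignores which opponents a model faced, so a rating scheme such as Bradley-Terry or Elo may be preferable when comparisons are unevenly distributed across model pairs.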
Taxonomy
Topics: Explainable Artificial Intelligence (XAI)
