TL;DR
This paper introduces Semantic Object Accuracy (SOA), an evaluation metric for text-to-image models that measures whether generated images contain the objects mentioned in their captions. It also proposes a model that represents individual objects explicitly to improve generation.
Contribution
The paper presents a novel object-aware evaluation metric (SOA) and a generative model that explicitly models objects, improving both how text-to-image synthesis is assessed and the quality of the generated images.
Findings
SOA correlates well with human judgment
Object-aware models outperform global models
SOA provides more meaningful evaluation than Inception Score
Abstract
Generative adversarial networks conditioned on textual image descriptions are capable of generating realistic-looking images. However, current methods still struggle to generate images based on complex image captions from a heterogeneous domain. Furthermore, quantitatively evaluating these text-to-image models is challenging, as most evaluation metrics only judge image quality but not the conformity between the image and its caption. To address these challenges we introduce a new model that explicitly models individual objects within an image and a new evaluation metric called Semantic Object Accuracy (SOA) that specifically evaluates images given an image caption. The SOA uses a pre-trained object detector to evaluate if a generated image contains objects that are mentioned in the image caption, e.g. whether an image generated from "a car driving down the street" contains a car. We…
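The idea behind SOA described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword map `CAPTION_KEYWORDS`, the helper `objects_in_caption`, the `detect` callable standing in for a pre-trained object detector, and the two averages returned at the end are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Set, Tuple

# Hypothetical mapping from object label to caption words that signal it.
# A real setup would need a much larger, curated vocabulary.
CAPTION_KEYWORDS: Dict[str, Set[str]] = {
    "car": {"car", "cars"},
    "dog": {"dog", "dogs", "puppy"},
}


def objects_in_caption(caption: str) -> Set[str]:
    """Return the object labels whose keywords appear in the caption."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return {label for label, kws in CAPTION_KEYWORDS.items() if words & kws}


def semantic_object_accuracy(
    samples: Iterable[Tuple[str, Any]],
    detect: Callable[[Any], Set[str]],
) -> Tuple[Dict[str, float], float, float]:
    """Score (caption, generated_image) pairs.

    `detect` is an assumed wrapper around a pre-trained object detector that
    returns the set of labels found in an image.
    Returns per-label accuracy, the mean over labels, and the mean over all
    (image, expected label) pairs.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for caption, image in samples:
        expected = objects_in_caption(caption)
        if not expected:
            continue  # caption names no known object, so it is not scored
        found = detect(image)
        for label in expected:
            totals[label] += 1
            hits[label] += int(label in found)
    per_label = {label: hits[label] / totals[label] for label in totals}
    label_avg = sum(per_label.values()) / len(per_label) if per_label else 0.0
    total = sum(totals.values())
    pair_avg = sum(hits.values()) / total if total else 0.0
    return per_label, label_avg, pair_avg


# Toy usage: each "image" is just the set of labels a detector would report.
toy_samples = [
    ("a car driving down the street", {"car", "road"}),
    ("a dog on a sofa", {"cat"}),
    ("two cars parked", {"car"}),
]
print(semantic_object_accuracy(toy_samples, detect=lambda image: image))
```

In practice `detect` would run a detector on pixels and threshold its confidence scores. How the scores should be aggregated is a design choice: averaging over labels weights rare objects equally, while averaging over pairs favours frequent ones.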
