TL;DR
This paper introduces a large-scale human evaluation dataset for estimating the quality of image captions without ground-truth references, enabling better filtering of low-quality captions in real-world applications.
Contribution
It presents a new human evaluation process, a large dataset of over 600k ratings, and baseline models for caption quality estimation that improve caption filtering.
Findings
QE models trained on coarse ratings effectively detect low-quality captions
Large-scale dataset enables robust training of caption quality estimators
Filtering low-quality captions improves user experience
Abstract
Automatic image captioning has improved significantly over the last few years, but the problem is far from being solved, with state-of-the-art models still often producing low-quality captions when used in the wild. In this paper, we focus on the task of Quality Estimation (QE) for image captions, which attempts to model the caption quality from a human perspective and without access to ground-truth references, so that it can be applied at prediction time to detect low-quality captions produced on previously unseen images. For this task, we develop a human evaluation process that collects coarse-grained caption annotations from crowdsourced users, which is then used to collect a large-scale dataset spanning more than 600k caption quality ratings. We then carefully validate the quality of the collected ratings and establish baseline models for this new QE task. Finally, we further…
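To make the prediction-time use of a QE model concrete, here is a minimal sketch of the two steps the abstract describes: turning coarse crowdsourced ratings into a per-caption score, and dropping captions whose predicted score is too low. Everything named here is an assumption for illustration only: the binary 0/1 rating scheme, the averaging rule, the 0.5 threshold, and the names Caption, aggregate_ratings, and filter_captions are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Caption:
    """A generated caption paired with the image it describes (hypothetical type)."""
    image_id: str
    text: str


def aggregate_ratings(ratings: Sequence[int]) -> float:
    """Average coarse ratings into a score in [0, 1].

    Assumption: each rating is binary (0 = bad, 1 = good). The paper's
    actual rating scale and aggregation rule may differ.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r not in (0, 1) for r in ratings):
        raise ValueError("ratings must be 0 or 1")
    return sum(ratings) / len(ratings)


def filter_captions(
    captions: Sequence[Caption],
    score_fn: Callable[[Caption], float],
    threshold: float = 0.5,  # assumed cut-off, to be tuned on held-out data
) -> List[Caption]:
    """Keep only captions whose predicted quality reaches the threshold.

    score_fn stands in for a trained QE model: it sees the caption (and,
    in practice, the image) but no ground-truth reference captions.
    """
    return [c for c in captions if score_fn(c) >= threshold]
```

In practice score_fn would wrap a trained model; the threshold trades off how many captions are shown against how many low-quality ones slip through.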
