TL;DR
This paper introduces a policy gradient method that directly optimizes SPIDEr, a linear combination of the SPICE (semantic) and CIDEr (syntactic) image captioning metrics. The resulting captions align better with human preferences than captions from previous training methods.
Contribution
It proposes a policy gradient approach that optimizes the SPIDEr metric for image captioning using Monte Carlo rollouts, rather than mixing MLE training with policy gradient as the prior MIXER approach does.
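As a rough illustration of what a Monte Carlo rollout does in this setting, the sketch below estimates the expected final reward of a partial caption by sampling several completions and averaging their scores. It is a toy under stated assumptions, not the paper's implementation: `toy_policy`, `toy_reward`, `VOCAB`, and `MAX_LEN` are hypothetical stand-ins (a uniform next-word distribution and a word-overlap score in place of a trained captioning model and SPIDEr).

```python
import random

# Hypothetical toy vocabulary and length cap, chosen only for illustration.
VOCAB = ["a", "dog", "cat", "runs", "<eos>"]
MAX_LEN = 5


def toy_policy(prefix):
    """Stand-in for a captioning model: uniform next-token probabilities."""
    return [1.0 / len(VOCAB)] * len(VOCAB)


def toy_reward(caption, reference):
    """Stand-in for SPIDEr: fraction of reference words found in the caption."""
    ref = set(reference)
    return len(ref & set(caption)) / len(ref)


def rollout(prefix, rng):
    """Complete a partial caption by sampling until <eos> or MAX_LEN tokens."""
    seq = list(prefix)
    while len(seq) < MAX_LEN and (not seq or seq[-1] != "<eos>"):
        probs = toy_policy(seq)
        seq.append(rng.choices(VOCAB, weights=probs, k=1)[0])
    return seq


def estimate_value(prefix, reference, num_rollouts, rng):
    """Monte Carlo estimate of the expected final reward of a partial caption."""
    total = 0.0
    for _ in range(num_rollouts):
        total += toy_reward(rollout(prefix, rng), reference)
    return total / num_rollouts
```

For example, `estimate_value(["a", "dog"], ["a", "dog", "runs"], 200, random.Random(0))` averages the toy reward over 200 sampled completions of the prefix "a dog". In a policy gradient update, a per-word estimate of this kind would weight the gradient of the log-probability of the word just generated.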
Findings
Optimized captions are more semantically faithful and syntactically fluent.
The method yields captions preferred by human raters over traditional MLE-trained models.
The approach simplifies optimization and improves results across metrics.
Abstract
Current image captioning methods are usually trained via (penalized) maximum likelihood estimation. However, the log-likelihood score of a caption does not correlate well with human assessments of quality. Standard syntactic evaluation metrics, such as BLEU, METEOR and ROUGE, are also not well correlated. The newer SPICE and CIDEr metrics are better correlated, but have traditionally been hard to optimize for. In this paper, we show how to use a policy gradient (PG) method to directly optimize a linear combination of SPICE and CIDEr (a combination we call SPIDEr): the SPICE score ensures our captions are semantically faithful to the image, while the CIDEr score ensures our captions are syntactically fluent. The PG method we propose improves on the prior MIXER approach by using Monte Carlo rollouts instead of mixing MLE training with PG. We show empirically that our algorithm leads to…
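The objective the abstract describes can be written out as follows. This is a sketch in generic notation, not the paper's own: the weights $w_1, w_2$ are placeholders (the excerpt above does not give their values), $c$ is a generated caption, $S$ the set of reference captions, $x$ the image, $\pi_\theta$ the captioning model, $\hat{Q}$ a rollout-based estimate of the expected final reward, and $b$ a baseline. The second line is the standard REINFORCE-style estimator with a per-word value, and the paper's exact estimator may differ in detail.

```latex
\begin{align}
\mathrm{SPIDEr}(c, S) &= w_1\,\mathrm{SPICE}(c, S) + w_2\,\mathrm{CIDEr}(c, S) \\
\nabla_\theta J(\theta) &\approx \sum_{t=1}^{T}
  \bigl(\hat{Q}(c_{1:t}) - b\bigr)\,
  \nabla_\theta \log \pi_\theta(c_t \mid c_{1:t-1}, x)
\end{align}
```

Here $\hat{Q}(c_{1:t})$ would be obtained by averaging the SPIDEr score of several sampled completions of the partial caption $c_{1:t}$.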
Videos
Improved Image Captioning via Policy Gradient optimization of SPIDEr · YouTube
