Evaluating Automatically Generated Phoneme Captions for Images

Justin van der Hout; Zoltán D'Haese; Mark Hasegawa-Johnson; Odette Scharenborg

arXiv:2007.15916·cs.CL·August 3, 2020

TL;DR

This paper investigates how to evaluate Image2Speech, the task of generating spoken image descriptions. It implements a phoneme-based system, converts the generated phoneme captions to words, and correlates objective metric scores with human ratings.

Contribution

It introduces a phoneme-based Image2Speech system, compares objective metrics with human judgments, and highlights the need for phoneme-aware evaluation metrics.

Findings

01

Among the tested metrics, BLEU4 correlates best with human ratings

02

Current metrics are limited by word-based assumptions

03

Phoneme-level evaluation metrics are needed for better assessment
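
To make the idea of a phoneme-level metric concrete, here is a minimal sketch of clipped n-gram precision, the building block of BLEU, computed over phoneme tokens instead of words. It is not the paper's metric and not full BLEU4 (there is no brevity penalty and no geometric mean over n-gram orders). The function names and the ARPAbet-style example tokens are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of `candidate` against one `reference`.

    Tokens can be anything hashable; here they are phoneme symbols,
    so matching happens at the phoneme level rather than the word level.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        # Candidate is shorter than n tokens: no n-grams to score.
        return 0.0
    # Each candidate n-gram is credited at most as often as it
    # occurs in the reference (the "clipping" step).
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Hypothetical phoneme captions (ARPAbet-style symbols, made up here).
reference = ["AH", "D", "AO", "G", "R", "AH", "N", "Z"]
candidate = ["AH", "D", "AO", "G", "R", "AH", "N"]
bigram_precision = clipped_precision(candidate, reference, 2)
```

Because the tokens are phonemes, a caption that differs from the reference by a single sound still receives partial credit, whereas a word-level comparison would count the whole word as a mismatch.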

Abstract

Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions were converted into sentences of words. The captions were rated by human evaluators for their goodness of describing the image. Finally, several objective metric scores of the results were correlated with these human ratings. Although BLEU4 does not perfectly correlate with human ratings, it obtained the highest correlation among the investigated metrics, and is the best currently existing metric for the Image2Speech task. Current metrics are limited by the fact that they assume their input to be…
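
The final step described in the abstract, correlating objective metric scores with human ratings, can be sketched as follows. This is a minimal illustration only: the text above does not say which correlation coefficient was used, so both Pearson and Spearman are shown, and the function names and the toy numbers are assumptions, not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy values invented for illustration: one metric score and one
# human rating per caption.
metric_scores = [0.10, 0.25, 0.40, 0.15, 0.30]
human_ratings = [1, 3, 4, 2, 2]
r = pearson(metric_scores, human_ratings)
rho = spearman(metric_scores, human_ratings)
```

Running the same comparison once per metric and ranking the metrics by their coefficient is one way to arrive at a statement such as "metric X correlates best with human ratings".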
