Multimodal Pivots for Image Caption Translation

Julian Hitschler; Shigehiko Schamoni; Stefan Riezler

arXiv:1601.03916·cs.CL·February 3, 2021

TL;DR

This paper introduces multimodal pivoting for image caption translation: it retrieves visually similar images that are captioned in the target language and uses their captions to rerank translation outputs, without requiring large amounts of in-domain parallel data.

Contribution

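The paper proposes image retrieval over a database of target-language captioned images, followed by crosslingual reranking of translation outputs against the captions of the most similar images. This reduces dependence on in-domain parallel data.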

Findings

01 Achieved a 1 BLEU point improvement over strong baselines

02 Utilized CNN-based image similarity for crosslingual reranking

03 Demonstrated effectiveness with monolingual captioned image datasets

Abstract

We present an approach to improve statistical machine translation of image descriptions by multimodal pivots defined in visual space. The key idea is to perform image retrieval over a database of images that are captioned in the target language, and use the captions of the most similar images for crosslingual reranking of translation outputs. Our approach does not depend on the availability of large amounts of in-domain parallel data, but only relies on available large datasets of monolingually captioned images, and on state-of-the-art convolutional neural networks to compute image similarities. Our experimental evaluation shows improvements of 1 BLEU point over strong baselines.
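
The retrieve-then-rerank idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine similarity over precomputed image feature vectors, the token-overlap relevance score, and the interpolation weight are all assumptions chosen for illustration. In practice the feature vectors would come from a convolutional neural network, and the decoder scores from the translation system's n-best list.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_pivots(query_features, database, k=2):
    """Return target-language captions of the k most similar images.

    database: list of (feature_vector, target_language_caption) pairs.
    The feature vectors are assumed to be precomputed, e.g. by a CNN.
    """
    ranked = sorted(
        database,
        key=lambda item: cosine(query_features, item[0]),
        reverse=True,
    )
    return [caption for _, caption in ranked[:k]]


def overlap(hypothesis, caption):
    """Fraction of hypothesis tokens also found in the caption (clipped counts).

    A simple stand-in for a relevance score; an assumption of this sketch.
    """
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(caption.lower().split())
    matched = sum((hyp & ref).values())
    total = sum(hyp.values())
    return matched / total if total else 0.0


def rerank(hypotheses, pivots, weight=0.5):
    """Reorder (translation, decoder_score) pairs using pivot captions.

    The final score interpolates the decoder score with the best overlap
    against any pivot caption. Both the interpolation and the default
    weight are hypothetical choices, and scores are assumed to lie in [0, 1].
    """
    def score(item):
        text, decoder_score = item
        relevance = max((overlap(text, c) for c in pivots), default=0.0)
        return (1 - weight) * decoder_score + weight * relevance

    return sorted(hypotheses, key=score, reverse=True)


# Toy data: two-dimensional "image features" with target-language captions.
database = [
    ([1.0, 0.0], "a dog runs across the grass"),
    ([0.9, 0.1], "a brown dog is running on a lawn"),
    ([0.0, 1.0], "a man rides a bicycle"),
]
pivots = retrieve_pivots([1.0, 0.05], database, k=2)
n_best = [
    ("a dog leads over the lawn", 0.55),
    ("a dog runs over the grass", 0.50),
]
best_translation = rerank(n_best, pivots)[0][0]
```

In this toy example the hypothesis with the lower decoder score shares more words with the retrieved captions, so reranking moves it to the top. Note that only the captioned image database is needed at this stage, not parallel text.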
