Paraphrase Acquisition from Image Captions
Marcel Gohsen, Matthias Hagen, Martin Potthast, Benno Stein

TL;DR
This paper uses image captions from the Web, in particular from the English Wikipedia, as a source of paraphrases. It builds a new paraphrase dataset (Wikipedia-IPC), analyzes its syntactic and semantic characteristics, and reports that the extraction method is reliable.
Contribution
It introduces Wikipedia-IPC, a dataset of image caption paraphrases mined from the English Wikipedia, together with the underlying mining technology, and compares the new resource with known paraphrase corpora in terms of syntactic and semantic paraphrase similarity.
Findings
Different captions assigned to the same image in the English Wikipedia yield usable paraphrases, which are collected in the Wikipedia-IPC dataset.
Paraphrases from different sources exhibit distinct syntactic and semantic styles.
The proposed mining method reliably identifies paraphrases with high annotation agreement.
Abstract
We propose to use image captions from the Web as a previously underutilized resource for paraphrases (i.e., texts with the same "message") and to create and analyze a corresponding dataset. When an image is reused on the Web, an original caption is often assigned. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To demonstrate the suitability of this idea, we analyze captions in the English Wikipedia, where editors frequently relabel the same image for different articles. The paper introduces the underlying mining technology, the resulting Wikipedia-IPC dataset, and compares known paraphrase corpora with respect to their syntactic and semantic paraphrase similarity to our new resource. In this context, we introduce characteristic maps along the two similarity dimensions to identify the style of paraphrases coming from different…
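The hypothesis in the abstract, that different captions of the same image form a set of mutual paraphrases, can be illustrated with a minimal Python sketch. This is not the authors' mining pipeline: the record format, the `normalize` helper, the example file names, and the token-level Jaccard score (a crude stand-in for a syntactic similarity measure) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def normalize(caption):
    """Lowercase and collapse whitespace (illustrative normalization only)."""
    return " ".join(caption.lower().split())


def mine_caption_pairs(records):
    """Group captions by image and pair up distinct captions of the same image.

    `records` is an iterable of (image_id, caption) tuples; this input
    format is hypothetical and not taken from the paper.
    """
    captions_by_image = defaultdict(set)
    for image_id, caption in records:
        text = normalize(caption)
        if text:
            captions_by_image[image_id].add(text)

    pairs = []
    for image_id, captions in captions_by_image.items():
        # Every unordered pair of distinct captions is a paraphrase candidate.
        for first, second in combinations(sorted(captions), 2):
            pairs.append((image_id, first, second))
    return pairs


def token_jaccard(first, second):
    """Token-set overlap as a simple surface similarity (assumed measure)."""
    tokens_a, tokens_b = set(first.split()), set(second.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Made-up example records; the file names and captions are illustrative.
records = [
    ("File:Eiffel.jpg", "The Eiffel Tower at night"),
    ("File:Eiffel.jpg", "Eiffel Tower illuminated after dark"),
    ("File:Eiffel.jpg", "the eiffel tower  at night"),  # duplicate after normalization
    ("File:Cat.jpg", "A cat"),  # single caption, so no pair
]
pairs = mine_caption_pairs(records)
scores = [token_jaccard(first, second) for _, first, second in pairs]
```

Images with only one distinct caption produce no candidates, and exact duplicates are dropped before pairing. A real pipeline would presumably need further filtering and a semantic similarity measure alongside the surface one; neither is shown here.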
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
