Resolving Language and Vision Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes

Gordon Christie; Ankit Laddha; Aishwarya Agrawal; Stanislaw Antol; Yash Goyal; Kevin Kochersberger; Dhruv Batra

arXiv:1604.02125·cs.CV·September 27, 2016

Open Access

TL;DR

This paper introduces a joint approach to semantic segmentation and prepositional phrase attachment resolution in captioned images, improving accuracy by reasoning about image content and language ambiguities simultaneously.

Contribution

The paper presents a joint reasoning framework that combines semantic segmentation with prepositional phrase attachment resolution, and it reports more accurate results than either module operating in isolation.

Findings

01. Outperforms the Stanford Parser by 17.91% and 12.83% in two experiments

02. Produces diverse hypotheses for segmentation and attachment resolution

03. Joint reasoning yields more accurate results than separate modules

Abstract

We present an approach to simultaneously perform semantic segmentation and prepositional phrase attachment resolution for captioned images. Some ambiguities in language cannot be resolved without simultaneously reasoning about an associated image. If we consider the sentence "I shot an elephant in my pajamas", looking at language alone (and not using common sense), it is unclear if it is the person or the elephant wearing the pajamas or both. Our approach produces a diverse set of plausible hypotheses for both semantic segmentation and prepositional phrase attachment resolution that are then jointly reranked to select the most consistent pair. We show that our semantic segmentation and prepositional phrase attachment resolution modules have complementary strengths, and that joint reasoning produces more accurate results than any module operating in isolation. Multiple hypotheses are…
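The joint reranking step described in the abstract can be pictured with a minimal sketch. It assumes each module has already produced a short list of scored hypotheses and that a pairwise consistency score is available for every (segmentation, parse) pair. The additive scoring rule, the function name `rerank_jointly`, and all numbers below are illustrative assumptions, not the authors' model or data.

```python
from itertools import product


def rerank_jointly(seg_scores, parse_scores, consistency):
    """Pick the (segmentation, parse) pair with the highest combined score.

    seg_scores[i]     : score of the i-th segmentation hypothesis (hypothetical)
    parse_scores[j]   : score of the j-th attachment hypothesis (hypothetical)
    consistency[i][j] : how well hypothesis i and hypothesis j agree (hypothetical)

    The combined score is a plain sum; this is an assumption made for
    illustration only.
    """
    best_pair, best_score = None, float("-inf")
    for i, j in product(range(len(seg_scores)), range(len(parse_scores))):
        score = seg_scores[i] + parse_scores[j] + consistency[i][j]
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score


# Toy inputs: each module prefers its hypothesis 0 when used alone,
# but those two hypotheses disagree with each other.
seg_scores = [3.0, 2.0]
parse_scores = [3.0, 2.0]
consistency = [
    [-4.0, 0.0],
    [1.0, 3.0],
]

best_pair, best_score = rerank_jointly(seg_scores, parse_scores, consistency)
print(best_pair, best_score)
```

In this toy setup each module's own top choice is hypothesis 0, yet the pair selected jointly is (1, 1) because the consistency term outweighs the individual scores. That mirrors the abstract's claim that the most consistent pair need not be each module's top-ranked output.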

Taxonomy

Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Language, Metaphor, and Cognition