Resolving Language and Vision Ambiguities Together: Joint Segmentation & Prepositional Attachment Resolution in Captioned Scenes
Gordon Christie, Ankit Laddha, Aishwarya Agrawal, Stanislaw Antol, Yash Goyal, Kevin Kochersberger, Dhruv Batra

TL;DR
This paper introduces a joint approach to semantic segmentation and prepositional phrase attachment resolution in captioned images, improving accuracy by reasoning about image content and language ambiguities simultaneously.
Contribution
The paper presents a joint reasoning framework that combines semantic segmentation with prepositional phrase attachment resolution, producing more accurate results than either module operating in isolation.
Findings
Outperforms Stanford Parser by 17.91% and 12.83% in two experiments
Produces diverse hypotheses for segmentation and attachment resolution
Joint reasoning yields more accurate results than separate modules
Abstract
We present an approach to simultaneously perform semantic segmentation and prepositional phrase attachment resolution for captioned images. Some ambiguities in language cannot be resolved without simultaneously reasoning about an associated image. If we consider the sentence "I shot an elephant in my pajamas", looking at language alone (and not using common sense), it is unclear if it is the person or the elephant wearing the pajamas or both. Our approach produces a diverse set of plausible hypotheses for both semantic segmentation and prepositional phrase attachment resolution that are then jointly reranked to select the most consistent pair. We show that our semantic segmentation and prepositional phrase attachment resolution modules have complementary strengths, and that joint reasoning produces more accurate results than any module operating in isolation. Multiple hypotheses are…
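Illustrative sketch
The reranking step described in the abstract can be illustrated with a small Python sketch. This is not the paper's model: the function name `rerank_jointly`, the additive scoring rule, the `weight` parameter, the toy `consistency` function, and all scores below are assumptions made for illustration. It only shows the general idea of taking several hypotheses from each module and picking the pair that is jointly most consistent.

```python
from itertools import product


def rerank_jointly(seg_hyps, parse_hyps, consistency, weight=1.0):
    """Pick the (segmentation, attachment) pair with the highest joint score.

    seg_hyps and parse_hyps are lists of (hypothesis, module_score) tuples.
    consistency(seg, parse) returns a number that is larger when the two
    hypotheses agree. The additive combination is an assumed scoring rule.
    """
    best_pair, best_score = None, float("-inf")
    for (i, (seg, s_score)), (j, (parse, p_score)) in product(
        enumerate(seg_hyps), enumerate(parse_hyps)
    ):
        total = s_score + p_score + weight * consistency(seg, parse)
        if total > best_score:
            best_pair, best_score = (i, j), total
    return best_pair, best_score


# Toy data for "I shot an elephant in my pajamas" (all values hypothetical).
# A segmentation hypothesis is reduced to "which object the pajamas overlap".
seg_hyps = [
    ({"pajamas": "person"}, 0.60),
    ({"pajamas": "elephant"}, 0.50),
]
# A parse hypothesis is reduced to (prepositional object, attachment head).
parse_hyps = [
    (("pajamas", "elephant"), 0.55),
    (("pajamas", "person"), 0.50),
]


def consistency(seg, parse):
    """Return 1.0 when the image and the parse attach the object alike."""
    pp_object, head = parse
    return 1.0 if seg.get(pp_object) == head else 0.0


best_pair, best_score = rerank_jointly(seg_hyps, parse_hyps, consistency)
print(best_pair, round(best_score, 2))
```

In this toy setup the top-ranked parse on its own attaches the pajamas to the elephant, but the joint score selects the second parse hypothesis because it agrees with the top segmentation hypothesis.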
Taxonomy
Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Language, Metaphor, and Cognition
