Grounded Semantic Composition for Visual Scenes

P. Gorniak; D. Roy

arXiv:1107.0031·cs.AI·July 4, 2011


TL;DR

This paper introduces a visually-grounded language understanding model that composes individual word meanings to interpret complex spatial referring expressions in scenes. The implemented system selects the correct referent for a large percentage of test cases.

Contribution

It presents a model that embeds word-level visually-grounded semantics in a compositional parsing framework to understand complex referring expressions in scenes.

Findings

01 Interprets a broad range of spatial referring expressions

02 Selects the correct referent for a large percentage of test cases

03 Analysis of the system's successes and failures shows how visual context influences the semantics of utterances

Abstract

We present a visually-grounded language understanding model based on a study of how people verbally describe objects in scenes. The emphasis of the model is on the combination of individual word meanings to produce meanings for complex referring expressions. The model has been implemented, and it is able to understand a broad range of spatial referring expressions. We describe our implementation of word level visually-grounded semantics and their embedding in a compositional parsing framework. The implemented system selects the correct referents in response to natural language expressions for a large percentage of test cases. In an analysis of the system's successes and failures we reveal how visual context influences the semantics of utterances and propose future extensions to the model that take such context into account.
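
The idea of combining word-level grounded meanings into a meaning for a whole referring expression can be illustrated with a toy sketch. This is not the authors' implementation, and the abstract does not specify their lexicon or parser. Everything here is a hypothetical assumption made for illustration: the `Obj` scene representation, the `LEXICON` entries, and the simple right-to-left composition rule in `interpret`. Each word is treated as a function that narrows a set of candidate objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Hypothetical scene object: an id, a color label, and a 2D position."""
    name: str
    color: str
    x: float
    y: float


def color_filter(color):
    """Word meaning for a color term: keep only objects of that color."""
    return lambda objs: [o for o in objs if o.color == color]


def leftmost(objs):
    """Word meaning for 'leftmost': pick the candidate with the smallest x."""
    return [min(objs, key=lambda o: o.x)] if objs else []


def rightmost(objs):
    """Word meaning for 'rightmost': pick the candidate with the largest x."""
    return [max(objs, key=lambda o: o.x)] if objs else []


# Assumed toy lexicon; words not listed here carry no grounded meaning.
LEXICON = {
    "green": color_filter("green"),
    "purple": color_filter("purple"),
    "leftmost": leftmost,
    "rightmost": rightmost,
}


def interpret(phrase, scene):
    """Compose word meanings right to left over the candidate set.

    For 'leftmost purple', the color filter runs first and the spatial
    selection then applies to the remaining candidates.
    """
    candidates = list(scene)
    for word in reversed(phrase.lower().split()):
        meaning = LEXICON.get(word)
        if meaning is not None:  # skip ungrounded words such as 'the', 'one'
            candidates = meaning(candidates)
    return candidates


scene = [
    Obj("a", "green", 0.1, 0.5),
    Obj("b", "purple", 0.3, 0.2),
    Obj("c", "green", 0.7, 0.4),
    Obj("d", "purple", 0.9, 0.8),
]

referents = interpret("the leftmost purple one", scene)
print([o.name for o in referents])
```

The order of composition matters in this sketch: "leftmost" is evaluated relative to the purple objects only, so the overall leftmost object (which is green) is not selected. This is one simple way in which the meaning of a word can depend on the set of objects it is applied to.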
