Multimodal Contextualized Semantic Parsing from Speech

Jordan Voas, Raymond Mooney, and David Harwath

arXiv:2406.06438 · cs.CL · June 11, 2024 · 1 citation

TL;DR

This paper introduces SPICE, a new task for multimodal semantic parsing in context, along with a dataset and model to improve agents' understanding of speech and visual data in dynamic environments.

Contribution

The paper presents SPICE, a task that provides a structured framework for multimodal semantic parsing in context, and introduces the VG-SPICE dataset together with the AViD-SP model developed for use on it.

Findings

01. VG-SPICE dataset challenges agents with visual scene graph construction from speech.

02. AViD-SP model demonstrates effective multimodal integration.

03. Framework improves contextual awareness in artificial agents.

Abstract

We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication. We develop the VG-SPICE dataset, crafted to challenge agents with visual scene graph construction from spoken conversational exchanges, highlighting speech and visual data integration. We also present the Audio-Vision Dialogue Scene Parser (AViD-SP) developed for use on VG-SPICE. These innovations aim to improve multimodal information processing and integration. Both the VG-SPICE dataset and the AViD-SP model are publicly available.
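
As a rough illustration of the idea in the abstract (a prior context is updated with information parsed from a new utterance), the sketch below applies a list of update operations to a small scene graph. It is not the AViD-SP model and does not follow the VG-SPICE data format: the `SceneGraph` class, the `apply_update` function, and the operation names `add_node`, `add_attribute`, and `add_edge` are all hypothetical, chosen only to make the update step concrete.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical context: nodes with attribute sets plus labeled edges."""
    nodes: dict[str, set[str]] = field(default_factory=dict)
    edges: set[tuple[str, str, str]] = field(default_factory=set)


def apply_update(context: SceneGraph, update: list[tuple]) -> SceneGraph:
    """Return a new context with the parsed update operations applied.

    The operation vocabulary here is an assumption for illustration,
    not the formal language used in the paper.
    """
    # Copy the prior context so the original is left unchanged.
    nodes = {name: set(attrs) for name, attrs in context.nodes.items()}
    edges = set(context.edges)
    for op, *args in update:
        if op == "add_node":
            (name,) = args
            nodes.setdefault(name, set())
        elif op == "add_attribute":
            name, attr = args
            nodes.setdefault(name, set()).add(attr)
        elif op == "add_edge":
            src, relation, dst = args
            # Make sure both endpoints exist before linking them.
            nodes.setdefault(src, set())
            nodes.setdefault(dst, set())
            edges.add((src, relation, dst))
        else:
            raise ValueError(f"unknown operation: {op}")
    return SceneGraph(nodes, edges)


# Prior context: the agent already knows about a wooden table.
prior = SceneGraph(nodes={"table": {"wooden"}})

# Stand-in for a parser's output for an utterance such as
# "there is a red cup on the table".
parsed = [
    ("add_node", "cup"),
    ("add_attribute", "cup", "red"),
    ("add_edge", "cup", "on", "table"),
]

updated = apply_update(prior, parsed)
```

In this sketch the parser's job would be to produce `parsed` from speech and visual input given `prior`; the update function itself is deliberately simple so that the resulting context stays inspectable.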

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques