SGGNet$^2$: Speech-Scene Graph Grounding Network for Speech-guided Navigation

Dohyun Kim; Yeseung Kim; Jaehwi Jang; Minjae Song; Woojin Choi; Daehyung Park

arXiv:2307.07468·cs.RO·April 16, 2024·1 cites

TL;DR

This paper introduces SGGNet$^2$, a robust speech-scene graph grounding network that improves spoken language understanding for robot navigation by leveraging acoustic similarities and integrating ASR systems.

Contribution

The paper presents a novel extension of the scene-graph grounding network that incorporates acoustic similarity from ASR to enhance speech grounding accuracy.

Findings

01

Effective grounding of spoken commands demonstrated.

02

Improved navigation performance on real robot.

03

Robustness to speech variability shown.

Abstract

Spoken language serves as an accessible and efficient interface, enabling non-experts and disabled users to interact with complex assistant robots. However, accurately grounding language utterances poses a significant challenge due to the acoustic variability in speakers' voices and environmental noise. In this work, we propose a novel speech-scene graph grounding network (SGGNet$^2$) that robustly grounds spoken utterances by leveraging the acoustic similarity between correctly recognized and misrecognized words obtained from automatic speech recognition (ASR) systems. To incorporate the acoustic similarity, we extend our previous grounding model, the scene-graph-based grounding network (SGGNet), with the ASR model from NVIDIA NeMo. We accomplish this by feeding the latent vector of speech pronunciations into the BERT-based grounding network within SGGNet. We evaluate the…
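To make the idea of grounding a misrecognized word more concrete, here is a minimal toy sketch. It is not the paper's method: the abstract describes feeding a learned latent vector of speech pronunciations into a BERT-based network, whereas this sketch swaps in plain character-level string similarity (Python's `difflib.SequenceMatcher`) as a crude stand-in for acoustic similarity. The function names `similarity` and `ground`, the node labels, and the misrecognized word "cop" are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-overlap ratio in [0, 1].

    Assumption: used here only as a rough proxy for acoustic similarity;
    the paper uses a learned latent vector from an ASR model instead.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ground(transcript_word: str, node_labels: list[str]) -> tuple[str, float]:
    """Return the scene-graph node label most similar to the ASR output word."""
    scored = [(label, similarity(transcript_word, label)) for label in node_labels]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical scene-graph node labels and a hypothetical ASR error
# ("cup" transcribed as "cop").
nodes = ["cup", "table", "sofa"]
best_label, best_score = ground("cop", nodes)
print(best_label, round(best_score, 3))
```

Even this crude proxy recovers the intended node when the ASR output differs by one character, which hints at why a similarity signal can help. A learned acoustic representation would presumably also capture confusions that spelling alone misses.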

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques