Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling

Maximillian Chen; Ruoxi Sun; Sercan Ö. Arık

arXiv:2412.15995·cs.CL·December 23, 2024

TL;DR

This paper presents a data-centric, multi-task learning approach that significantly improves multimodal understanding in conversational speech modeling, achieving state-of-the-art results with limited data and introducing a new dataset for spoken dialogue.

Contribution

It introduces a novel multi-task learning paradigm and a new dataset, ASK-QA, to enhance multimodal speech understanding with minimal data in conversational AI.

Findings

1. Achieved state-of-the-art on Spoken-SQuAD with only 10% of training data.
2. Developed a multi-task learning framework utilizing auxiliary tasks.
3. Introduced ASK-QA, a new dataset for multi-turn spoken dialogue.

Abstract

Conversational assistants are increasingly popular across diverse real-world applications, highlighting the need for advanced multimodal speech modeling. Speech, as a natural mode of communication, encodes rich user-specific characteristics such as speaking rate and pitch, making it critical for effective interaction. Our work introduces a data-centric customization approach for efficiently enhancing multimodal understanding in conversational speech modeling. Central to our contributions is a novel multi-task learning paradigm that involves designing auxiliary tasks to utilize a small amount of speech data. Our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, using only 10% of the training data with open-weight models, establishing a robust and efficient framework for audio-centric conversational modeling. We also introduce ASK-QA, the first dataset for…

Taxonomy

Topics: Speech and dialogue systems