Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling
Maximillian Chen, Ruoxi Sun, Sercan Ö. Arık

TL;DR
This paper presents a data-centric, multi-task learning approach that significantly improves multimodal understanding in conversational speech modeling, achieving state-of-the-art results with limited data and introducing a new dataset for spoken dialogue.
Contribution
It introduces a novel multi-task learning paradigm and a new dataset, ASK-QA, to enhance multimodal speech understanding with minimal data in conversational AI.
Findings
Achieved state-of-the-art on Spoken-SQuAD with only 10% of training data.
Developed a multi-task learning framework utilizing auxiliary tasks.
Introduced ASK-QA, a new dataset for multi-turn spoken dialogue.
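The findings above combine two ideas: training on a small fraction of the data and adding auxiliary tasks alongside the main one. The following is a minimal sketch of how such a data mix could be assembled, not the authors' pipeline. The summary does not say which auxiliary tasks the paper uses, so the `transcribe` task, the field names (`audio`, `question`, `answer`, `transcript`), and the helper functions `subsample` and `build_multitask_mix` are all hypothetical.

```python
import random


def subsample(examples, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of the examples."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * fraction))
    return rng.sample(examples, k)


def build_multitask_mix(examples, task_builders):
    """Expand every example into one training item per task."""
    mixed = []
    for ex in examples:
        for task_name, build in task_builders.items():
            mixed.append({"task": task_name, **build(ex)})
    return mixed


# Hypothetical task definitions: the primary spoken QA task plus one
# illustrative auxiliary task (transcription). The real auxiliary tasks
# are not given in this summary.
task_builders = {
    "spoken_qa": lambda ex: {
        "input": ex["audio"],
        "prompt": ex["question"],
        "target": ex["answer"],
    },
    "transcribe": lambda ex: {
        "input": ex["audio"],
        "prompt": "Transcribe the audio.",
        "target": ex["transcript"],
    },
}

# Toy stand-in for a spoken QA training set (file names are placeholders).
dataset = [
    {
        "audio": f"clip_{i}.wav",
        "question": f"question {i}",
        "answer": f"answer {i}",
        "transcript": f"transcript {i}",
    }
    for i in range(100)
]

subset = subsample(dataset, fraction=0.10)  # keep 10% of the training data
train_items = build_multitask_mix(subset, task_builders)
```

In this sketch each retained audio clip is reused once per task, so the auxiliary task adds supervision without requiring any additional speech data.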
Abstract
Conversational assistants are increasingly popular across diverse real-world applications, highlighting the need for advanced multimodal speech modeling. Speech, as a natural mode of communication, encodes rich user-specific characteristics such as speaking rate and pitch, making it critical for effective interaction. Our work introduces a data-centric customization approach for efficiently enhancing multimodal understanding in conversational speech modeling. Central to our contributions is a novel multi-task learning paradigm that involves designing auxiliary tasks to utilize a small amount of speech data. Our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, using only 10% of the training data with open-weight models, establishing a robust and efficient framework for audio-centric conversational modeling. We also introduce ASK-QA, the first dataset for…
Taxonomy
Topics: Speech and dialogue systems
