Comprehensive Audio Query Handling System with Integrated Expert Models and Contextual Understanding
Vakada Naveen, Arvind Krishna Sridhar, Yinyi Guo, Erik Visser

TL;DR
This paper introduces an integrated audio query system combining specialized models and a large language model to accurately handle diverse audio-related questions, demonstrating superior performance on custom and benchmark datasets.
Contribution
The system uniquely combines intent classification, expert audio models, and a large language model for comprehensive audio query handling, with improved accuracy over existing models.
Findings
A BERT-based intent classifier outperforms an LLM few-shot classifier.
Significant accuracy improvements on custom audio tasks.
Outperforms 7B-parameter models on the MMAU sound benchmark.
Abstract
This paper presents a comprehensive chatbot system designed to handle a wide range of audio-related queries by integrating multiple specialized audio processing models. The proposed system uses an intent classifier, trained on a diverse audio query dataset, to route queries about audio content to expert models such as Automatic Speech Recognition (ASR), Speaker Diarization, Music Identification, and Text-to-Audio generation. A 3.8B-parameter LLM then takes input from an Audio Context Detection (ACD) module, which extracts audio event information from the audio, and post-processes the text-domain outputs of the expert models to compute the final response to the user. We evaluated the system on custom audio tasks and the MMAU sound set benchmark. The custom datasets were motivated by target use cases not covered in industry benchmarks and included ACD-timestamp-QA (Question Answering) as well as…
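The routing flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the intent labels, the keyword-based `classify_intent` (a stand-in for the trained classifier), and the stub experts, ACD module, and LLM are hypothetical and do not reflect the authors' implementation. Only a subset of the expert models is shown.

```python
from typing import Callable, Dict

# Hypothetical intent labels mapped to stub expert models (assumption:
# the paper's actual label set and model interfaces are not shown here).
EXPERTS: Dict[str, Callable[[str], str]] = {
    "asr": lambda audio: f"[transcript of {audio}]",
    "diarization": lambda audio: f"[speaker turns in {audio}]",
    "music_id": lambda audio: f"[song match for {audio}]",
}


def classify_intent(query: str) -> str:
    """Keyword stand-in for the trained intent classifier (assumption)."""
    words = set(query.lower().replace("?", "").split())
    if words & {"who", "speaker", "speakers"}:
        return "diarization"
    if words & {"song", "music"}:
        return "music_id"
    return "asr"  # default route in this sketch


def detect_audio_context(audio: str) -> str:
    """Stub for the Audio Context Detection (ACD) module."""
    return f"[audio events in {audio}]"


def llm_respond(query: str, context: str, expert_output: str) -> str:
    """Stub for the LLM that post-processes the text-domain outputs."""
    return f"Q: {query} | context: {context} | expert: {expert_output}"


def answer(query: str, audio: str) -> str:
    # 1. Route the query to an expert model by predicted intent.
    intent = classify_intent(query)
    expert_output = EXPERTS[intent](audio)
    # 2. Gather audio event context, then let the LLM compose the reply.
    context = detect_audio_context(audio)
    return llm_respond(query, context, expert_output)
```

For example, `answer("What song is playing?", "clip.wav")` routes to the hypothetical `music_id` expert in this sketch before the stub LLM assembles the reply.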
Taxonomy
Topics: Music and Audio Processing
Methods: Linear Layer · Softmax · Linear Warmup With Linear Decay · Multi-Head Attention · WordPiece · Dropout · Sparse Evolutionary Training · Dense Connections · Layer Normalization
