Attention-guided Evidence Grounding for Spoken Question Answering
Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao

TL;DR
This paper introduces an end-to-end attention-guided evidence grounding framework for spoken question answering that improves accuracy and efficiency by explicitly locating relevant evidence in the model's latent space, reducing hallucinations and latency.
Contribution
The paper proposes a novel end-to-end framework using cross-modal attention and supervised fine-tuning to improve evidence grounding in spoken QA systems, outperforming cascaded baselines.
Findings
Reduces hallucinations in spoken QA.
Achieves a 62% reduction in inference latency.
Outperforms cascaded baseline systems.
Abstract
Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains,…
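The abstract describes two ideas: using cross-modal attention to locate evidence, and fine-tuning so that attention concentrates on query-relevant segments. The sketch below illustrates both on a toy attention matrix. It is not the authors' implementation. The function names, the averaging over query frames, the segment spans, and the negative-log-mass loss are all assumptions made for illustration; the paper's actual scoring rule and LFE objective are not given in this summary.

```python
import numpy as np

def segment_scores(attn, segments):
    """Score each context segment by mean attention received (hypothetical rule).

    attn: array of shape (num_query_frames, num_context_tokens), assumed to hold
          attention weights from spoken-query frames to context tokens.
    segments: list of (start, end) token spans, end exclusive.
    """
    token_mass = attn.mean(axis=0)  # average attention per context token
    return np.array([token_mass[s:e].mean() for s, e in segments])

def select_evidence(attn, segments, k=1):
    """Return indices of the k highest-scoring segments, in document order."""
    scores = segment_scores(attn, segments)
    top = np.argsort(-scores)[:k]
    return sorted(top.tolist())

def focus_loss(attn, segments, evidence_idx):
    """Illustrative stand-in for an attention-calibration objective:
    negative log of the normalized attention mass on labeled evidence segments."""
    token_mass = attn.mean(axis=0)
    token_mass = token_mass / token_mass.sum()
    mass = sum(token_mass[s:e].sum()
               for i, (s, e) in enumerate(segments) if i in evidence_idx)
    return float(-np.log(mass + 1e-9))

# Toy example: 2 query frames attending over 4 context tokens, 2 segments.
attn = np.array([[0.1, 0.1, 0.6, 0.2],
                 [0.0, 0.2, 0.7, 0.1]])
segments = [(0, 2), (2, 4)]
chosen = select_evidence(attn, segments, k=1)  # segment 1 receives most attention
```

In this toy setup, the loss is lower when the labeled evidence is the segment that already receives most of the attention, which is the behavior a calibration objective of this kind would reward.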
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Multimodal Machine Learning Applications
