Attention-guided Evidence Grounding for Spoken Question Answering

Ke Yang; Bolin Chen; Yuejie Li; Yueying Hua; Jianhao Nie; Yueping He; Bowen Li; Chengjun Mao

arXiv:2603.16292 · cs.CL · March 19, 2026

Open Access

TL;DR

This paper introduces an end-to-end attention-guided evidence grounding framework for spoken question answering that improves accuracy and efficiency by explicitly locating relevant evidence in the model's latent space, reducing hallucinations and latency.

Contribution

The paper proposes a novel end-to-end framework using cross-modal attention and supervised fine-tuning to improve evidence grounding in spoken QA systems, outperforming cascaded baselines.

Findings

01 Reduces hallucinations in spoken QA.

02 Achieves 62% inference latency reduction.

03 Outperforms cascaded baseline systems.

Abstract

Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains,…
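
To make the idea of attention-guided grounding concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the excerpt above does not specify how attention is aggregated or what the LFE objective is, so the function names (`segment_attention_scores`, `top_k_segments`, `focus_loss`), the mean-over-query-tokens aggregation, and the negative-log-mass loss are all illustrative assumptions. The sketch shows one plausible way to turn a query-to-context attention matrix into per-segment evidence scores, and one plausible training signal that rewards attention mass on gold evidence segments.

```python
import math


def segment_attention_scores(attention, segments):
    """Aggregate attention into one score per context segment.

    attention: list of rows, one per query (speech) token; each row holds
               weights over context tokens (assumed to sum to 1).
    segments:  list of half-open (start, end) token spans in the context.
    Assumption: score = attention mass in the span, averaged over query tokens.
    """
    n_query = len(attention)
    scores = []
    for start, end in segments:
        mass = sum(sum(row[start:end]) for row in attention) / n_query
        scores.append(mass)
    return scores


def top_k_segments(scores, k):
    """Return indices of the k highest-scoring segments."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def focus_loss(scores, evidence_ids, eps=1e-9):
    """Hypothetical focus objective: negative log of the share of
    attention mass that falls on the gold evidence segments."""
    total = sum(scores)
    evidence_mass = sum(scores[i] for i in evidence_ids)
    return -math.log(evidence_mass / (total + eps) + eps)


# Toy example: 2 query tokens attending over 6 context tokens in 3 segments.
attention = [
    [0.05, 0.05, 0.40, 0.40, 0.05, 0.05],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
]
segments = [(0, 2), (2, 4), (4, 6)]

scores = segment_attention_scores(attention, segments)
best = top_k_segments(scores, k=1)
loss = focus_loss(scores, evidence_ids=[1])
```

In this toy setup the middle segment receives most of the attention mass, so it is selected as evidence and the loss is small; a diffuse attention distribution would spread mass across segments and raise the loss, which is the behavior a calibration objective of this kind would penalize.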

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Multimodal Machine Learning Applications