End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering

Jiliang Hu; Zuchao Li; Baoyuan Qi; Liu Guoming; Ping Wang

arXiv:2511.09282·cs.SD·April 14, 2026

TL;DR

This paper introduces CLSR, an end-to-end contrastive language-speech retriever that improves long-form spoken question answering by effectively extracting relevant segments from lengthy audio recordings.

Contribution

The paper presents a novel contrastive retriever that bridges acoustic and textual modalities, outperforming existing speech retrieval methods in long-form spoken question answering.

Findings

01. CLSR surpasses existing speech retrievers and pipeline approaches on four datasets.

02. Experimental results show improved accuracy in extracting question-relevant segments.

03. CLSR provides a robust foundation for practical long-form SQA applications.

Abstract

Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle to process long audio. Following the success of retrieval-augmented generation, a speech-related retriever shows promise for preprocessing long-form speech, but the performance of existing speech-related retrievers is lacking. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for the downstream SQA task. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby more effectively bridging the gap between modalities. Experimental results across four cross-modal retrieval…
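
As a rough illustration of the general idea behind a contrastive retriever (not the authors' implementation), the sketch below scores candidate segments against a question by cosine similarity and computes a one-directional InfoNCE-style loss in which the i-th question is paired with the i-th segment. The function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all assumptions made for this example. The intermediate step that converts acoustic features into text-like representations is not modeled here; the vectors stand in for whatever embeddings the encoders would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(question_vecs, segment_vecs, temperature=0.07):
    """Question-to-segment InfoNCE loss (hypothetical helper).

    Assumes question_vecs[i] is the positive match for segment_vecs[i];
    every other segment in the batch acts as a negative.
    """
    losses = []
    for i, q in enumerate(question_vecs):
        logits = [cosine(q, s) / temperature for s in segment_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


def top_k_segments(question_vec, segment_vecs, k=2):
    """Indices of the k segments most similar to the question."""
    order = sorted(
        range(len(segment_vecs)),
        key=lambda j: cosine(question_vec, segment_vecs[j]),
        reverse=True,
    )
    return order[:k]


# Toy embeddings (assumed values, for illustration only).
question = [1.0, 0.0]
segments = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
selected = top_k_segments(question, segments, k=2)
```

In a retrieval-augmented setup like the one the abstract describes, only the selected segments would be passed on to the downstream SQA model, so the long recording never has to be processed in full.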
