Investigation on the Robustness of Acoustic Foundation Models on Post Exercise Speech

Xiangyuan Xue; Yuyu Wang; Ruijie Yao; Xiaoyue Ni; Xiaofan Jiang; Jingping Nie

arXiv:2603.27508 · cs.SD · March 31, 2026

TL;DR

This study benchmarks acoustic foundation models for automatic speech recognition (ASR) on post-exercise speech. Most models degrade relative to resting speech, robustness varies by model, and in-domain fine-tuning improves CTC-based models while Whisper adapts unstably.

Contribution

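The paper evaluates sequence-to-sequence models (Whisper and FunASR/Paraformer) and self-supervised encoders with CTC decoding (Wav2Vec2, HuBERT, and WavLM) on post-exercise speech under a unified protocol, covering both off-the-shelf inference and in-domain fine-tuning.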

Findings

01

Most models degrade on post-exercise speech compared to resting speech.

02

FunASR achieves the strongest baseline robustness with 14.57% WER.

03

Fine-tuning improves CTC-based models, but Whisper shows unstable adaptation.
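
The findings are reported as word error rate (WER), and the abstract also reports character error rate (CER). As a minimal sketch of how these metrics are commonly computed (edit distance divided by reference length), the following may help. The function names, the example sentences, and the choice to drop spaces before computing CER are assumptions for illustration; the paper's exact text normalization is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces are dropped here (an assumed convention)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Hypothetical example: a repeated word and a dropped word,
# the kind of error a breathy, halting utterance might produce.
reference = "i need to catch my breath"
hypothesis = "i i need to catch breath"
print(f"WER = {wer(reference, hypothesis):.2%}")
print(f"CER = {cer(reference, hypothesis):.2%}")
```

Because the denominator is the reference length, insertions caused by repetitions can push WER above 100% on short utterances.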

Abstract

Automatic speech recognition (ASR) has been extensively studied on neutral and stationary speech, yet its robustness under post-exercise physiological shift remains underexplored. Compared with resting speech, post-exercise speech often contains micro-breaths, non-semantic pauses, unstable phonation, and repetitions caused by reduced breath support, making transcription more difficult. In this work, we benchmark acoustic foundation models on post-exercise speech under a unified evaluation protocol. We compare sequence-to-sequence models (Whisper and FunASR/Paraformer) and self-supervised encoders with CTC decoding (Wav2Vec2, HuBERT, and WavLM), under both off-the-shelf inference and post-exercise in-domain fine-tuning. Across the Static/Post-All benchmark, most models degrade on post-exercise speech, while FunASR shows the strongest baseline robustness at 14.57% WER and 8.21% CER on…
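
The abstract groups Wav2Vec2, HuBERT, and WavLM as encoders with CTC decoding. The sketch below shows standard greedy CTC decoding on toy data: take the best token per frame, collapse consecutive repeats, then drop blanks. It is not necessarily the decoding configuration used in the paper; the vocabulary, the scores, and the `blank_id=0` convention are assumptions for illustration.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC decoding over per-frame score vectors.

    frame_scores: list of per-frame score lists, one score per vocab entry.
    vocab: list mapping token id to its string (assumed toy vocabulary).
    blank_id: index of the CTC blank token (assumed to be 0 here).
    """
    # Step 1: pick the highest-scoring token id in each frame.
    best_ids = [max(range(len(scores)), key=scores.__getitem__)
                for scores in frame_scores]

    # Step 2: collapse consecutive repeats, then remove blanks.
    decoded = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            decoded.append(vocab[token_id])
        previous = token_id
    return "".join(decoded)


# Toy example: frames read h, h, <blank>, i, i, <blank>, i
vocab = ["<blank>", "h", "i"]
frame_scores = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
    [0.10, 0.20, 0.70],
    [0.80, 0.10, 0.10],
    [0.20, 0.10, 0.70],
]
print(ctc_greedy_decode(frame_scores, vocab))
```

The blank between the two runs of `i` is what lets CTC emit the same character twice in a row; without it the repeats would collapse into one.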
