Investigation on the Robustness of Acoustic Foundation Models on Post Exercise Speech
Xiangyuan Xue, Yuyu Wang, Ruijie Yao, Xiaoyue Ni, Xiaofan Jiang, Jingping Nie

TL;DR
This study benchmarks acoustic foundation models for automatic speech recognition (ASR) on post-exercise speech. Robustness varies by model, and in-domain fine-tuning helps CTC-based models while Whisper adapts unstably.
Contribution
The paper evaluates sequence-to-sequence and CTC-based ASR models on post-exercise speech under a unified protocol, covering both off-the-shelf inference and in-domain fine-tuning.
Findings
Most models degrade on post-exercise speech compared to resting speech.
FunASR shows the strongest baseline robustness, at 14.57% WER and 8.21% CER.
Fine-tuning improves CTC-based models, but Whisper shows unstable adaptation.
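The WER and CER figures quoted here are edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length in words (WER) or characters (CER). The following is a minimal sketch of that standard definition, not the paper's evaluation code. The function names are hypothetical, the example sentence is made up, and two choices are assumptions: no text normalization is applied, and spaces are excluded when counting characters for CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces are dropped (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Made-up example: a repeated word and a dropped word, the kind of
    # disfluency the abstract attributes to reduced breath support.
    ref = "i need to catch my breath"
    hyp = "i i need to catch breath"
    print(f"WER = {wer(ref, hyp):.2%}")
    print(f"CER = {cer(ref, hyp):.2%}")
```

Because insertions count as errors, a hypothesis that transcribes repetitions or breath noises as extra words can push WER above 100% on short utterances.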
Abstract
Automatic speech recognition (ASR) has been extensively studied on neutral and stationary speech, yet its robustness under post-exercise physiological shift remains underexplored. Compared with resting speech, post-exercise speech often contains micro-breaths, non-semantic pauses, unstable phonation, and repetitions caused by reduced breath support, making transcription more difficult. In this work, we benchmark acoustic foundation models on post-exercise speech under a unified evaluation protocol. We compare sequence-to-sequence models (Whisper and FunASR/Paraformer) and self-supervised encoders with CTC decoding (Wav2Vec2, HuBERT, and WavLM), under both off-the-shelf inference and post-exercise in-domain fine-tuning. Across the Static/Post-All benchmark, most models degrade on post-exercise speech, while FunASR shows the strongest baseline robustness at 14.57% WER and 8.21% CER on…
