MORE: Multi-Objective Adversarial Attacks on Speech Recognition
Xiaoxue Gao, Zexin Li, Yiming Chen, Nancy F. Chen

TL;DR
This paper introduces MORE, a novel multi-objective adversarial attack on speech recognition models that simultaneously degrades accuracy and inference efficiency, revealing vulnerabilities in ASR systems.
Contribution
The paper proposes a hierarchical multi-objective attack framework with a new REDO objective, effectively increasing transcription length and computational cost in ASR models.
Findings
MORE significantly increases transcription length and computational cost.
The attack maintains high word error rates across various scenarios.
It exposes vulnerabilities in ASR robustness beyond accuracy alone.
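The two quantities these findings refer to, word error rate and transcription length, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `word_error_rate` and `length_ratio` are hypothetical, whitespace tokenization is an assumption, and the example strings are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def length_ratio(reference: str, hypothesis: str) -> float:
    """Hypothesis word count relative to the reference word count."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return len(hypothesis.split()) / ref_len


# Invented example: a repetitive, over-long transcription of a short utterance.
clean = "turn on the light"
attacked = "turn the light light light light"
wer = word_error_rate(clean, attacked)
ratio = length_ratio(clean, attacked)
```

A transcription that is both wrong and longer than the reference raises both numbers at once, which is the combined effect the findings describe. Word count is used here only as a rough stand-in for decoding cost.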
Abstract
The emergence of large-scale automatic speech recognition (ASR) models such as Whisper has greatly expanded their adoption across diverse real-world applications. Ensuring robustness against even minor input perturbations is therefore critical for maintaining reliable performance in real-time environments. While prior work has mainly examined accuracy degradation under adversarial attacks, robustness with respect to efficiency remains largely unexplored. This narrow focus provides only a partial understanding of ASR model vulnerabilities. To address this gap, we conduct a comprehensive study of ASR robustness under multiple attack scenarios. We introduce MORE, a multi-objective repetitive doubling encouragement attack, which jointly degrades recognition accuracy and inference efficiency through a hierarchical staged repulsion-anchoring mechanism. Specifically, we reformulate…
Peer Reviews
Decision · Submitted to ICLR 2026
The experiments consider multiple DL models and datasets, with seemingly more rigorous evaluation methodologies than many previous papers in this area.
(1) Even though the paper claims to focus on robustness with respect to efficiency against attacks, the experimental results appear to primarily emphasize accuracy or, in some cases, the trade-off instead. However, accuracy has already been extensively explored in prior works (Raina et al., 2024; Raina & Gales, 2024; Olivier & Raj, 2022b; Madry et al., 2018a; Dong et al., 2018; Wang & He, 2021; Gao et al., 2024). Therefore, the overall contribution of this paper seems quite limited. (2) Lack of
The joint focus on accuracy and efficiency robustness is original and well-motivated. Across five Whisper models, the proposed MORE method consistently yields longer, incorrect outputs (strong attack efficacy), clearly outperforming baselines. The authors discuss potential misuse and propose mitigations, which strengthen the paper’s responsibility stance. The appendix rigorously analyzes computational cost, showing depth of understanding.
The hierarchical optimization's convergence or general properties are not analyzed mathematically. More theoretical analyses should be included. If I understand correctly, using output length as a proxy for computational efficiency is reasonable but somewhat coarse. Can authors provide some more efficiency analyses? Only Whisper-based ASR models are tested. The paper should include the test of other architectures. There is no black-box or transfer evaluation against closed models or API syste
1) Leveraging a naturally occurring failure mode for the attack seems like a nice idea. 2) The paper is mostly clear and easy to read (although see Questions for some commenting on this). 3) The proposed method is benchmarked against multiple existing methods, and the results also include a good ablation study on the individual components.
1) The reporting of experimental results should be improved (see Questions for details). 2) There is no implementation/source code available (although according to the Reproducibility statement it might be released later). 3) E.g., lines 107-111: I do not find the argument for the need to combine the accuracy and sequence length attacks too convincing; any working accuracy degradation attack will make a given system unusable even if it is lightning fast. To me, the interesting part here is the pos
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Speech Recognition and Synthesis
