MORE: Multi-Objective Adversarial Attacks on Speech Recognition

Xiaoxue Gao; Zexin Li; Yiming Chen; Nancy F. Chen

arXiv:2601.01852·eess.AS·January 15, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces MORE, a novel multi-objective adversarial attack on speech recognition models that simultaneously degrades accuracy and inference efficiency, revealing vulnerabilities in ASR systems.

Contribution

The paper proposes a hierarchical multi-objective attack framework with a new REDO objective, effectively increasing transcription length and computational cost in ASR models.

Findings

01. MORE significantly increases transcription length and computational cost.

02. The attack maintains high word error rates across various scenarios.

03. It exposes vulnerabilities in ASR robustness beyond accuracy alone.
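
The findings above refer to two quantities: word error rate (accuracy) and transcription length (used as a stand-in for inference cost). As a rough illustration only, the sketch below computes both from plain transcripts. The function names `word_error_rate` and `length_ratio`, the whitespace tokenization, and the example strings are assumptions made for this illustration; they are not taken from the paper, whose exact metrics and text normalization may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes simple whitespace tokenization (an illustrative choice).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def length_ratio(clean_hypothesis: str, attacked_hypothesis: str) -> float:
    """How many times longer the attacked transcript is than the clean one."""
    clean_len = len(clean_hypothesis.split())
    if clean_len == 0:
        raise ValueError("clean hypothesis must contain at least one word")
    return len(attacked_hypothesis.split()) / clean_len


if __name__ == "__main__":
    # Hypothetical transcripts, invented for this example.
    reference = "turn on the lights"
    clean = "turn on the lights"
    attacked = "turn on the the the the the the"
    print("WER (clean):   ", word_error_rate(reference, clean))
    print("WER (attacked):", word_error_rate(reference, attacked))
    print("Length ratio:  ", length_ratio(clean, attacked))
```

Because insertions count as errors, a repetitive, over-long transcript can push the word error rate above 1.0, which is one reason to report output length separately from accuracy when efficiency is the concern.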

Abstract

The emergence of large-scale automatic speech recognition (ASR) models such as Whisper has greatly expanded their adoption across diverse real-world applications. Ensuring robustness against even minor input perturbations is therefore critical for maintaining reliable performance in real-time environments. While prior work has mainly examined accuracy degradation under adversarial attacks, robustness with respect to efficiency remains largely unexplored. This narrow focus provides only a partial understanding of ASR model vulnerabilities. To address this gap, we conduct a comprehensive study of ASR robustness under multiple attack scenarios. We introduce MORE, a multi-objective repetitive doubling encouragement attack, which jointly degrades recognition accuracy and inference efficiency through a hierarchical staged repulsion-anchoring mechanism. Specifically, we reformulate…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The experiments consider multiple DL models and datasets, with seemingly more rigorous evaluation methodologies than many previous papers in this area.

Weaknesses

(1) Even though the paper claims to focus on robustness with respect to efficiency against attacks, the experimental results appear to primarily emphasize accuracy or, in some cases, the trade-off instead. However, accuracy has already been extensively explored in prior works (Raina et al., 2024; Raina & Gales, 2024; Olivier & Raj, 2022b; Madry et al., 2018a; Dong et al., 2018; Wang & He, 2021; Gao et al., 2024). Therefore, the overall contribution of this paper seems quite limited. (2) Lack of

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The joint focus on accuracy and efficiency robustness is original and well-motivated. Across five Whisper models, the proposed MORE method consistently yields longer, incorrect outputs (strong attack efficacy), clearly outperforming baselines. The authors discuss potential misuse and propose mitigations, which strengthen the paper’s responsibility stance. The appendix rigorously analyzes computational cost, showing depth of understanding.

Weaknesses

The convergence and general properties of the hierarchical optimization are not analyzed mathematically. More theoretical analysis should be included. If I understand correctly, using output length as a proxy for computational efficiency is reasonable but somewhat coarse. Can the authors provide further efficiency analyses? Only Whisper-based ASR models are tested. The paper should also test other architectures. There is no black-box or transfer evaluation against closed models or API syste

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1) Leveraging a naturally occurring failure mode for the attack seems like a nice idea. 2) The paper is mostly clear and easy to read (although see Questions for some comments on this). 3) The proposed method is benchmarked against multiple existing methods, and the results also include a good ablation study on the individual components.

Weaknesses

1) The reporting of experimental results should be improved (see Questions for details). 2) There is no implementation/source code available (although according to the Reproducibility statement it might be released later). 3) E.g., lines 107-111: I do not find the argument for the need to combine the accuracy and sequence length attacks too convincing; any working accuracy degradation attack will make a given system unusable even if it is lightning fast. To me, the interesting part here is the pos

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Speech Recognition and Synthesis