Reasoning Robustness of LLMs to Adversarial Typographical Errors
Esther Gan, Yiran Zhao, Liying Cheng, Yancan Mao, Anirudh Goyal, Kenji Kawaguchi, Min-Yen Kan, Michael Shieh

TL;DR
This paper investigates how large language models' reasoning abilities are affected by typographical errors, introducing an attack algorithm and a benchmark to evaluate their robustness to such adversarial perturbations.
Contribution
It proposes the ATA algorithm for generating adversarial typos and the R2ATA benchmark to evaluate LLMs' reasoning robustness against typographical errors.
Findings
LLMs are sensitive to minimal typographical changes.
Accuracy drops as the number of typos grows: on GSM8K, Mistral-7B-Instruct falls from 43.7% to 38.6% with 1 character edit and to 19.2% with 8 character edits.
The R2ATA benchmark reveals transferability and robustness issues across models.
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning using Chain-of-Thought (CoT) prompting. However, CoT can be biased by users' instructions. In this work, we study the reasoning robustness of LLMs to typographical errors, which can naturally occur in users' queries. We design an Adversarial Typo Attack (ATA) algorithm that iteratively samples typos for words that are important to the query and selects the edit that is most likely to succeed in attacking. ATA shows that LLMs are sensitive to minimal adversarial typographical changes. Notably, with 1 character edit, Mistral-7B-Instruct's accuracy drops from 43.7% to 38.6% on GSM8K, while with 8 character edits the performance further drops to 19.2%. To extend our evaluation to larger and closed-source LLMs, we develop the R2ATA benchmark, which assesses models'…
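The loop the abstract describes (pick important words, sample typos for them, keep the edit most likely to succeed, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the edit set in `sample_typo`, the `importance` and `attack_score` callables, and all parameter names are assumptions made for illustration. In a real setting `attack_score` would presumably query the target model; here it is a toy stand-in so the example runs on its own.

```python
import random
from typing import Callable, List

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def sample_typo(word: str, rng: random.Random) -> str:
    """Apply one random character edit (hypothetical edit set, not from the paper)."""
    if len(word) < 2:
        return word + rng.choice(LETTERS)
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "swap", "insert", "replace"])
    letter = rng.choice(LETTERS)
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "insert":
        return word[:i] + letter + word[i:]
    return word[:i] + letter + word[i + 1:]


def typo_attack(
    query: str,
    importance: Callable[[List[str], int], float],  # hypothetical word-importance scorer
    attack_score: Callable[[str], float],           # hypothetical: higher = more likely to fool the model
    n_edits: int = 2,
    n_top_words: int = 3,
    n_samples: int = 8,
    seed: int = 0,
) -> str:
    """Greedy sketch: one edit per step, keeping the highest-scoring sampled typo."""
    rng = random.Random(seed)
    words = query.split()
    for _step in range(n_edits):
        # Rank word positions by importance and keep only the top few.
        ranked = sorted(range(len(words)),
                        key=lambda j: importance(words, j), reverse=True)
        best_words, best_score = None, float("-inf")
        for j in ranked[:n_top_words]:
            for _trial in range(n_samples):
                typo = sample_typo(words[j], rng)
                if typo == words[j]:
                    continue  # the sampled edit was a no-op, skip it
                candidate = words[:j] + [typo] + words[j + 1:]
                score = attack_score(" ".join(candidate))
                if score > best_score:
                    best_words, best_score = candidate, score
        if best_words is None:
            break  # no effective edit was sampled in this step
        words = best_words
    return " ".join(words)


# Toy stand-ins: longer words count as more important, and the "attack score"
# is just the number of character positions that differ from the original.
query = "Natalia sold clips to 48 of her friends in April"
toy_importance = lambda words, j: len(words[j])
toy_score = lambda text: sum(1 for a, b in zip(text, query) if a != b)

adversarial = typo_attack(query, toy_importance, toy_score, n_edits=2)
print(adversarial)
```

Each step applies exactly one character edit, so `n_edits` plays the role of the edit budget (1 or 8 character edits in the figures quoted above). The greedy choice per step is a simplification; the source does not say how candidates are compared beyond selecting the edit most likely to succeed.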
Taxonomy
Topics: Natural Language Processing Techniques
