Unpacking Robustness in Inflectional Languages: Adversarial Evaluation and Mechanistic Insights

Paweł Walkowiak; Marek Klonowski; Marcin Oleksy; Arkadiusz Janz

arXiv:2505.07856·cs.CL·May 14, 2025

TL;DR

This paper investigates how adversarial attacks affect models in inflectional languages, with Polish and English as the languages studied, and introduces a new evaluation protocol and benchmark to understand the role of inflection in model robustness.

Contribution

It presents a novel evaluation protocol and benchmark for analyzing adversarial robustness in inflectional languages, incorporating mechanistic interpretability techniques.

Findings

01. Adversarial attacks' impact varies with inflectional morphology.

02. The new benchmark reveals inflection-related vulnerabilities in models.

03. Mechanistic insights link inflection features to robustness performance.

Abstract

Various techniques are used to generate adversarial examples, including methods such as TextBugger, which introduce minor, barely visible perturbations to words that change model behaviour. Another class of techniques substitutes words with their synonyms in a way that preserves the text's meaning but alters its predicted class, with TextFooler being a prominent example of such attacks. Most adversarial example generation methods are developed and evaluated primarily on non-inflectional languages, typically English. In this work, we evaluate and explain how adversarial attacks perform in inflectional languages. To explain the impact of inflection on model behaviour and its robustness under attack, we designed a novel protocol inspired by mechanistic interpretability, based on the Edge Attribution Patching (EAP) method. The proposed evaluation protocol relies on…
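To make the idea of "minor, hardly visible perturbations" concrete, here is a minimal Python sketch of two character-level edits of the kind TextBugger-style attacks use: swapping adjacent inner characters and replacing a character with a visually similar one. This is an illustration only, not the attack pipeline used in the paper. The function names and the look-alike table are assumptions made for this example, and a real attack would also choose which words to perturb based on their influence on the model's prediction.

```python
import random

# Hypothetical look-alike table, chosen for illustration only.
LOOKALIKES = {"o": "0", "l": "1", "a": "@", "e": "3"}


def swap_inner_chars(word: str, rng: random.Random) -> str:
    """Swap two adjacent inner characters, keeping the first and last fixed."""
    if len(word) < 4:
        return word  # too short to perturb without touching the edges
    i = rng.randrange(1, len(word) - 2)  # i and i + 1 are both inner positions
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    # Note: if the two swapped characters are identical, the word is unchanged.
    return "".join(chars)


def substitute_lookalike(word: str) -> str:
    """Replace the first character that has an entry in LOOKALIKES."""
    for i, ch in enumerate(word):
        if ch in LOOKALIKES:
            return word[:i] + LOOKALIKES[ch] + word[i + 1:]
    return word  # no replaceable character found


def perturb(text: str, target: str, seed: int = 0) -> str:
    """Apply a character swap to every whitespace token equal to `target`."""
    rng = random.Random(seed)
    tokens = [swap_inner_chars(t, rng) if t == target else t for t in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(substitute_lookalike("film"))
    print(perturb("ten film jest naprawdę świetny", "świetny"))
```

In an inflectional language the same lemma appears in many surface forms, so an exact-match target such as the one in `perturb` covers only one inflected form; this is one simple way to see why the abstract treats inflection as a separate factor in evaluating such attacks.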

Taxonomy

Methods: Activation Patching