Enhancing adversarial robustness in Natural Language Inference using explanations

Alexandros Koulakos; Maria Lymperaiou; Giorgos Filandrianos; Giorgos Stamou

arXiv:2409.07423 · cs.CL · August 4, 2025

TL;DR

This paper shows that fine-tuning NLI classifiers on natural language explanations, rather than on premise-hypothesis pairs, improves their robustness to adversarial attacks. It also examines how well widely used language generation metrics for explanation quality correlate with human judgment.

Contribution

The paper validates natural language explanations as a model-agnostic defence strategy for improving adversarial robustness in NLI, and investigates which explanation evaluation metrics align with human perception.

Findings

01 Explanation fine-tuning improves adversarial robustness.

02 Language generation metrics correlate with human judgment.

03 Resource-efficient approach without heavy computational costs.
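
As an illustration of finding 02, checking whether an automatic metric agrees with human judgment usually comes down to a correlation between two score lists. The sketch below computes a Spearman rank correlation in plain Python. The page does not say which correlation coefficient the paper uses, so Spearman is an assumption here, and the scores are made-up placeholder values, not results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one automatic metric score and one human rating
# per generated explanation (illustrative values only).
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1, 3, 2, 5]

if __name__ == "__main__":
    print(f"Spearman rho = {spearman(metric_scores, human_ratings):.3f}")
```

A rank-based coefficient is used in this sketch because human ratings are often ordinal; a linear (Pearson) correlation on the raw scores would be an equally plausible choice.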

Abstract

The surge of state-of-the-art Transformer-based models has undoubtedly pushed the limits of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the underexplored task of Natural Language Inference (NLI), since models trained on popular well-suited datasets are susceptible to adversarial attacks, allowing subtle input interventions to mislead the model. In this work, we validate the usage of natural language explanation as a model-agnostic defence strategy through extensive experimentation: only by fine-tuning a classifier on the explanation rather than premise-hypothesis inputs, robustness under various adversarial attacks is achieved in comparison to explanation-free baselines. Moreover, since there is no standard strategy of testing the semantic validity of the generated explanations, we research the correlation of widely used language generation metrics…
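
The pipeline the abstract describes can be pictured as two stages: an explainer turns the premise-hypothesis pair into a natural language explanation, and a classifier predicts the label from that explanation alone. The following is a minimal sketch of that data flow, not the authors' implementation. The class name `ExplainThenPredict` and the two toy stand-in functions are hypothetical; in practice both stages would be fine-tuned neural models.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

LABELS = ("entailment", "neutral", "contradiction")


@dataclass
class ExplainThenPredict:
    """Two-stage NLI pipeline (hypothetical sketch)."""

    # Maps (premise, hypothesis) to a free-text explanation.
    explainer: Callable[[str, str], str]
    # Maps an explanation to one of LABELS.
    classifier: Callable[[str], str]

    def predict(self, premise: str, hypothesis: str) -> Tuple[str, str]:
        explanation = self.explainer(premise, hypothesis)
        # Key point: the classifier never sees the premise or hypothesis,
        # only the generated explanation.
        label = self.classifier(explanation)
        return label, explanation


def toy_explainer(premise: str, hypothesis: str) -> str:
    """Toy stand-in for a fine-tuned explanation generator."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    if hypothesis_words <= premise_words:
        return "The premise implies the hypothesis."
    return "The premise does not mention everything in the hypothesis."


def toy_classifier(explanation: str) -> str:
    """Toy stand-in for a classifier fine-tuned on explanations."""
    return "entailment" if "implies" in explanation else "neutral"


if __name__ == "__main__":
    pipeline = ExplainThenPredict(toy_explainer, toy_classifier)
    label, explanation = pipeline.predict(
        "A man is playing a guitar on stage",
        "A man is playing a guitar",
    )
    print(label, "|", explanation)
```

Because the two stages are passed in as plain callables, either one can be swapped without touching the other, which is one way to read the "model-agnostic" claim in the abstract.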

Code & Models

Repositories

alexkoulakos/explain-then-predict
PyTorch · Official

Videos

Enhancing adversarial robustness in Natural Language Inference using explanations · Underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Natural Language Processing Techniques · Topic Modeling