Pay Attention to Real World Perturbations! Natural Robustness Evaluation in Machine Reading Comprehension
Yulong Wu, Viktor Schlegel, Riza Batista-Navarro

TL;DR
This paper introduces a framework for evaluating the robustness of machine reading comprehension models against real-world textual perturbations using Wikipedia edit history, revealing significant performance drops and potential mitigation strategies.
Contribution
It presents a novel natural perturbation evaluation method for MRC models, highlighting the gap between synthetic and real-world robustness and exploring training-based improvements.
Findings
Natural perturbations cause performance degradation in state-of-the-art models.
Errors induced by natural perturbations also transfer to Flan-T5 and larger LLMs.
Training on perturbed data improves robustness but does not fully close the gap.
Abstract
As neural language models achieve human-comparable performance on Machine Reading Comprehension (MRC) and see widespread adoption, ensuring their robustness in real-world scenarios has become increasingly important. Current robustness evaluation research, though, primarily develops synthetic perturbation methods, leaving it unclear how well they reflect real-life scenarios. Considering this, we present a framework to automatically examine MRC models on naturally occurring textual perturbations, by replacing paragraphs in MRC benchmarks with their counterparts based on available Wikipedia edit history. Such a perturbation type is natural, as its design does not stem from an artificial generative process, and is inherently distinct from the previously investigated synthetic approaches. In a large-scale study encompassing SQuAD datasets and various model architectures we observe that natural…
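The evaluation the abstract describes (answering the same question against an original paragraph and its edited counterpart, then comparing scores) could be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes SQuAD-style exact match and token-level F1 as the metrics, and `robustness_drop`, `toy_model`, and the two example records are hypothetical names and data invented for the sketch.

```python
import re
import string
from collections import Counter
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def robustness_drop(
    examples: List[Dict[str, str]],
    answer_fn: Callable[[str, str], str],
) -> Dict[str, float]:
    """Score answer_fn on original and perturbed paragraphs and report the gap."""
    totals = {"em_original": 0.0, "em_perturbed": 0.0,
              "f1_original": 0.0, "f1_perturbed": 0.0}
    for ex in examples:
        for version in ("original", "perturbed"):
            pred = answer_fn(ex["question"], ex[version])
            totals[f"em_{version}"] += exact_match(pred, ex["answer"])
            totals[f"f1_{version}"] += token_f1(pred, ex["answer"])
    scores = {name: total / len(examples) for name, total in totals.items()}
    scores["em_drop"] = scores["em_original"] - scores["em_perturbed"]
    scores["f1_drop"] = scores["f1_original"] - scores["f1_perturbed"]
    return scores


def toy_model(question: str, paragraph: str) -> str:
    """Hypothetical stand-in for an MRC model: returns the first 4-digit number."""
    match = re.search(r"\b\d{4}\b", paragraph)
    return match.group(0) if match else ""


# Hypothetical paragraph pairs; real pairs would come from Wikipedia revisions.
EXAMPLES = [
    {"question": "When was the bridge opened?",
     "original": "The bridge opened in 1932.",
     "perturbed": "The bridge, completed after long delays, opened in 1932.",
     "answer": "1932"},
    {"question": "When was the city founded?",
     "original": "The city was founded in 1850.",
     "perturbed": "After a survey in 1849, the city was founded in 1850.",
     "answer": "1850"},
]

scores = robustness_drop(EXAMPLES, toy_model)
print(scores)
```

Here the second edit inserts a distracting year ahead of the answer, so the toy model fails only on the perturbed version. Reporting the original score, the perturbed score, and their difference keeps the degradation separate from the model's baseline accuracy.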
Peer Reviews
Decision · Submitted to ICLR 2025
- Proposes a framework using Wikipedia edit history to generate natural perturbations in MRC benchmarks
- Evaluates model performance across encoder-only, encoder-decoder, and decoder-only architectures
- Shows that natural perturbations can degrade performance and that these errors transfer to larger models
- Demonstrates that adversarial training with both natural and synthetic perturbations can help mitigate these issues
I am generally optimistic about the paper, but I have the following minor concerns. The analysis section could be more in-depth.
1. Perturbations and models
- No investigation of how perturbation magnitude affects performance or why certain perturbations affect the model more than others.
- Missing analysis of the interaction between model size and robustness. This is extremely important, as some observations might not always be predictable or transferable to smaller model sizes.
- I
1. It is interesting to use Wikipedia edit history to construct perturbations.
2. The perturbed set is verified by humans to ensure that the perturbed examples are still valid.
3. Results show that natural perturbation is a powerful attack on LMs.
1. It is unclear whether stronger models, e.g. GPT-4o, would still suffer from this challenge. While weaker models like BERT suffer from natural perturbations, it is important to show that this remains a challenge for recent, stronger LLMs.
2. The perturbation method relies on Wikipedia edit history, limiting its applicability to non-Wikipedia-based datasets.
3. The performance drops on non-SQuAD datasets like DROP are relatively small, e.g. LLaMA-2 exhibits a drop of less than 2 points, which
* Comprehensive evaluation across multiple architecture types (encoder, decoder, and encoder-decoder models) rather than only decoder models, with multiple models of each type
* Somewhat comprehensive evaluation across multiple QA datasets, with caveats: see below
* Paragraphs are generally well-written and easy to understand
**TLDR** The authors pose an interesting question, but the execution of the study contains unexpected design decisions that are not well-justified. The exact improvement of their claimed methodology over existing work is also unclear.
**Details:**
* The authors call out the similarities between their method and Belinkov & Bisk (2018), but do not make clear the differences and improvements over the latter, if any. The claimed contribution ("novel Wikipedia revision history-based framework") also do
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Neural Networks and Applications
Methods: Flan-T5
