Testing the Generalization Power of Neural Network Models Across NLI Benchmarks
Aarne Talman, Stergios Chatzikyriakidis

TL;DR
This paper investigates the limited generalization of neural network models across different natural language inference benchmarks, revealing that models often fail to transfer well between datasets despite similar inference tasks.
Contribution
The study trains six high-performing neural network models on different NLI datasets and tests each on test sets taken from other corpora designed for the same task, exposing poor cross-dataset generalization and the limitations of current datasets.
Findings
Models trained on one NLI dataset perform poorly on test sets from other benchmarks, even when the notion of inference is the same or similar.
Large pre-trained models improve transfer when datasets are similar.
Current NLI datasets lack coverage of inference nuances.
Abstract
Neural network models have been very successful in natural language inference, with the best models reaching 90% accuracy in some benchmarks. However, the success of these models turns out to be largely benchmark specific. We show that models trained on a natural language inference dataset drawn from one benchmark fail to perform well in others, even if the notion of inference assumed in these benchmarks is the same or similar. We train six high performing neural network models on different datasets and show that each one of these has problems of generalizing when we replace the original test set with a test set taken from another corpus designed for the same task. In light of these results, we argue that most of the current neural network models are not able to generalize well in the task of natural language inference. We find that using large pre-trained language models helps with…
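The evaluation protocol described in the abstract (train on one corpus, then swap in a test set from another corpus) can be sketched as a small accuracy matrix. This is a minimal illustration, not the authors' code: the names `accuracy`, `cross_dataset_matrix`, and `generalization_gap`, the corpus names "A" and "B", the toy examples, the three-way label set, and the stand-in predictor are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: an NLI example is (premise, hypothesis, gold label),
# and a trained model is reduced to a function from a sentence pair to a label.
Example = Tuple[str, str, str]
Predictor = Callable[[str, str], str]


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("test set is empty")
    correct = sum(1 for premise, hyp, gold in examples if predict(premise, hyp) == gold)
    return correct / len(examples)


def cross_dataset_matrix(
    models: Dict[str, Predictor],
    test_sets: Dict[str, List[Example]],
) -> Dict[Tuple[str, str], float]:
    """Score every model (keyed by its training corpus) on every test set."""
    return {
        (train_name, test_name): accuracy(predict, examples)
        for train_name, predict in models.items()
        for test_name, examples in test_sets.items()
    }


def generalization_gap(
    matrix: Dict[Tuple[str, str], float], train_name: str, test_name: str
) -> float:
    """In-domain accuracy minus accuracy on another corpus's test set."""
    return matrix[(train_name, train_name)] - matrix[(train_name, test_name)]


# Toy stand-in for a model "trained on corpus A": it always answers "entailment".
def always_entailment(premise: str, hypothesis: str) -> str:
    return "entailment"


# Invented test sets for two hypothetical corpora, A and B.
test_sets = {
    "A": [
        ("A man is sleeping.", "A person is asleep.", "entailment"),
        ("Two dogs run.", "Animals are moving.", "entailment"),
    ],
    "B": [
        ("A woman reads.", "A woman is reading.", "entailment"),
        ("A woman reads.", "Nobody is reading.", "contradiction"),
        ("A child plays.", "The child is happy.", "neutral"),
        ("A child plays.", "The child is asleep.", "contradiction"),
    ],
}

models = {"A": always_entailment}
matrix = cross_dataset_matrix(models, test_sets)
gap = generalization_gap(matrix, "A", "B")
```

In this toy setup the diagonal entry `("A", "A")` plays the role of the in-domain score and the off-diagonal entry `("A", "B")` the cross-benchmark score; a large gap between the two is the kind of drop the paper reports.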
