The ICL Consistency Test
Lucas Weber, Elia Bruni, Dieuwke Hupkes

TL;DR
The paper introduces the ICL consistency test, a benchmark that evaluates how consistently large language models make predictions across different setups on the same data. Its results point to a lack of robust generalisation.
Contribution
It presents a new benchmark and metric for assessing consistency in prompt-based models, highlighting their limitations in generalisation across varied setups.
Findings
All tested models show inconsistent predictions across setups.
At the fine-grained level, the metric identifies which properties of a setup make predictions unstable.
Models lack robust generalisation according to the new consistency measure.
Abstract
Just like the previous generation of task-tuned models, large language models (LLMs) that are adapted to tasks via prompt-based methods like in-context learning (ICL) perform well in some setups but not in others. This lack of consistency in prompt-based learning hints at a lack of robust generalisation. We here introduce the ICL consistency test, a contribution to the GenBench collaborative benchmark task (CBT), which evaluates how consistently a model makes predictions across many different setups while using the same data. The test is based on different established natural language inference tasks. We provide preprocessed data constituting 96 different 'setups' and a metric that estimates model consistency across these setups. The metric is provided on a fine-grained level to understand what properties of a setup render predictions unstable and on an aggregated level to compare…
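To make the idea of a consistency metric across setups concrete, here is a minimal sketch. The text above does not say which statistic the metric uses, so the sketch assumes Cohen's kappa, a chance-corrected agreement score, averaged over all pairs of setups. The function names and the toy predictions are hypothetical and are not taken from the benchmark's code.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length sequences of predicted labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Fraction of items on which the two setups give the same prediction.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each setup's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both setups always predict the same single label.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(predictions_by_setup):
    """Average kappa over all pairs of setups (assumed aggregation)."""
    pairs = list(combinations(sorted(predictions_by_setup), 2))
    if not pairs:
        raise ValueError("need at least two setups")
    total = sum(
        cohen_kappa(predictions_by_setup[s], predictions_by_setup[t])
        for s, t in pairs
    )
    return total / len(pairs)


# Hypothetical predictions of one model on the same four items under three setups.
toy_predictions = {
    "setup_a": ["entail", "entail", "contra", "contra"],
    "setup_b": ["entail", "entail", "contra", "contra"],
    "setup_c": ["entail", "contra", "entail", "contra"],
}
score = mean_pairwise_kappa(toy_predictions)
```

In this toy case, setups a and b agree perfectly while setup c agrees with each of them only at chance level, so the aggregated score falls well below 1. Comparing pairs of setups that differ in a single property would give a fine-grained view in the same spirit.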
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
