Language Models Are Poor Learners of Directional Inference
Tianyi Li, Mohammad Javad Hosseini, Sabine Weber, Mark Steedman

TL;DR
This paper critically evaluates language models' ability to understand directional inferences, revealing their limitations and introducing a new multilingual benchmark to better assess this aspect of natural language understanding.
Contribution
It highlights the inadequacy of current datasets for testing directional inference and introduces BoOQA, a robust benchmark for evaluating language models on this task.
Findings
Language models perform poorly on directional inference tasks.
Existing datasets either fail to test directionality or contain artifacts that models can learn as a proxy for entailment, which inflates reported results.
BoOQA provides a more reliable evaluation framework.
Abstract
We examine LMs' competence of directional predicate entailments by supervised fine-tuning with prompts. Our analysis shows that contrary to their apparent success on standard NLI, LMs show limited ability to learn such directional inference; moreover, existing datasets fail to test directionality, and/or are infested by artefacts that can be learnt as proxy for entailments, yielding over-optimistic results. In response, we present BoOQA (Boolean Open QA), a robust multi-lingual evaluation benchmark for directional predicate entailments, extrinsic to existing training sets. On BoOQA, we establish baselines and show evidence of existing LM-prompting models being incompetent directional entailment learners, in contrast to entailment graphs, however limited by sparsity.
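To make "directional" concrete, the following is a minimal sketch and not the paper's method or data. The predicate pairs, the function names, and the Jaccard scorer are all hypothetical illustrations. It builds a toy test set in which each pair appears in both directions with opposite labels, then shows that any symmetric scorer is stuck at chance on such a set, whatever threshold is used.

```python
from typing import Callable, List, Tuple

# Hypothetical toy pairs (not from BoOQA or any dataset in the paper):
# the first predicate entails the second, but not the reverse.
DIRECTIONAL_PAIRS: List[Tuple[str, str]] = [
    ("X bought Y", "X owns Y"),
    ("X defeated Y", "X played Y"),
    ("X was elected president of Y", "X ran for president of Y"),
]

Example = Tuple[str, str, bool]


def build_directional_test(pairs: List[Tuple[str, str]]) -> List[Example]:
    """Emit each pair in both directions with opposite gold labels."""
    examples: List[Example] = []
    for premise, hypothesis in pairs:
        examples.append((premise, hypothesis, True))   # forward: entailed
        examples.append((hypothesis, premise, False))  # reverse: not entailed
    return examples


def jaccard(a: str, b: str) -> float:
    """A symmetric token-overlap score, standing in for any symmetric model."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def accuracy(scorer: Callable[[str, str], float],
             examples: List[Example],
             threshold: float) -> float:
    """Fraction of examples where (score >= threshold) matches the gold label."""
    correct = sum((scorer(p, h) >= threshold) == label for p, h, label in examples)
    return correct / len(examples)


if __name__ == "__main__":
    test_set = build_directional_test(DIRECTIONAL_PAIRS)
    for threshold in (0.0, 0.25, 0.5, 1.0):
        print(threshold, accuracy(jaccard, test_set, threshold))
```

Because a symmetric scorer assigns the same score to both directions of a pair, it gets exactly one of the two examples right, so accuracy stays at 0.5 for every threshold. A test set that includes only one direction, or pairs negatives that are unrelated rather than reversed, would not expose this.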
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Test
