PRobELM: Plausibility Ranking Evaluation for Language Models
Zhangdie Yuan, Eric Chamoun, Rami Aly, Chenxi Whitehouse, Andreas Vlachos

TL;DR
PRobELM is a new benchmark for evaluating language models' ability to rank scenarios based on plausibility using world knowledge, bridging the gap between factual accuracy and plausible reasoning.
Contribution
The paper introduces PRobELM, a novel benchmark dataset that assesses models' plausibility ranking capabilities leveraging world knowledge, constructed from Wikidata and aligned with training data timelines.
Findings
Model size and architecture influence plausibility performance.
Recent training data improves plausibility assessment.
Factual accuracy does not guarantee better plausibility ranking.
Abstract
This paper introduces PRobELM (Plausibility Ranking Evaluation for Language Models), a benchmark designed to assess language models' ability to discern more plausible from less plausible scenarios through their parametric knowledge. While benchmarks such as TruthfulQA emphasise factual accuracy or truthfulness, and others such as COPA explore plausible scenarios without explicitly incorporating world knowledge, PRobELM seeks to bridge this gap by evaluating models' capabilities to prioritise plausible scenarios that leverage world knowledge over less plausible alternatives. This design allows us to assess the potential of language models for downstream use cases such as literature-based discovery where the focus is on identifying information that is likely but not yet known. Our benchmark is constructed from a dataset curated from Wikidata edit histories, tailored to align the temporal…
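The ranking setup described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the paper's method: the scenario strings and their scores are hand-written placeholders standing in for a language model's scores, the function names are hypothetical, and reciprocal rank is just one common way to score a ranking (the text above does not say which metric the benchmark uses).

```python
from typing import Callable, List, Sequence


def rank_by_plausibility(
    scenarios: Sequence[str], score: Callable[[str], float]
) -> List[str]:
    """Return scenarios sorted from most to least plausible under `score`."""
    return sorted(scenarios, key=score, reverse=True)


def reciprocal_rank(ranked: Sequence[str], target: str) -> float:
    """Return 1 / (1-indexed position of `target` in the ranking)."""
    return 1.0 / (list(ranked).index(target) + 1)


# Hypothetical placeholder scores (higher = more plausible). In practice
# these would come from a model, e.g. as sequence log-likelihoods.
toy_scores = {
    "scenario A": -4.2,
    "scenario B": -1.3,
    "scenario C": -7.9,
}

# Rank the candidates, then check where the assumed reference scenario lands.
ranked = rank_by_plausibility(list(toy_scores), toy_scores.get)
rr = reciprocal_rank(ranked, "scenario A")
```

Here "scenario A" plays the role of the scenario treated as most plausible by the reference data; a model that ranks it first would get a reciprocal rank of 1.0, and lower placements are penalised accordingly.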
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Focus · ALIGN
