PRobELM: Plausibility Ranking Evaluation for Language Models

Zhangdie Yuan; Eric Chamoun; Rami Aly; Chenxi Whitehouse; Andreas Vlachos

arXiv:2404.03818 · cs.CL · April 28, 2025 · 1 citation

TL;DR

PRobELM is a new benchmark for evaluating language models' ability to rank scenarios based on plausibility using world knowledge, bridging the gap between factual accuracy and plausible reasoning.

Contribution

The paper introduces PRobELM, a novel benchmark dataset that assesses models' plausibility ranking capabilities leveraging world knowledge, constructed from Wikidata and aligned with training data timelines.

Findings

01. Model size and architecture influence plausibility performance.

02. Recent training data improves plausibility assessment.

03. Factual accuracy does not guarantee better plausibility ranking.

Abstract

This paper introduces PRobELM (Plausibility Ranking Evaluation for Language Models), a benchmark designed to assess language models' ability to discern more plausible from less plausible scenarios through their parametric knowledge. While benchmarks such as TruthfulQA emphasise factual accuracy or truthfulness, and others such as COPA explore plausible scenarios without explicitly incorporating world knowledge, PRobELM seeks to bridge this gap by evaluating models' capabilities to prioritise plausible scenarios that leverage world knowledge over less plausible alternatives. This design allows us to assess the potential of language models for downstream use cases such as literature-based discovery where the focus is on identifying information that is likely but not yet known. Our benchmark is constructed from a dataset curated from Wikidata edit histories, tailored to align the temporal…
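As a rough illustration of what ranking scenarios by plausibility involves, the sketch below sorts candidate scenarios by a score and reports where a chosen target lands in the ranking. Everything in it is an assumption made for illustration: the function names, the convention that a lower score means more plausible (as it would with a perplexity-style score), the reciprocal-rank metric, and the toy score values. None of it is taken from the PRobELM code or data.

```python
from typing import Callable, List, Sequence


def rank_by_plausibility(
    scenarios: Sequence[str], score: Callable[[str], float]
) -> List[str]:
    """Sort scenarios from most to least plausible.

    Assumption: a lower score means more plausible (perplexity-style).
    """
    return sorted(scenarios, key=score)


def reciprocal_rank(ranked: Sequence[str], gold: str) -> float:
    """Return 1 / (1-based position of the gold scenario in the ranking)."""
    return 1.0 / (list(ranked).index(gold) + 1)


# Toy scores invented for illustration only; a real run would query a model.
TOY_SCORES = {
    "scenario A": 14.2,
    "scenario B": 9.7,
    "scenario C": 21.5,
}

ranked = rank_by_plausibility(list(TOY_SCORES), TOY_SCORES.__getitem__)
rr = reciprocal_rank(ranked, gold="scenario A")
print(ranked, rr)
```

Replacing the hypothetical `TOY_SCORES` lookup with a function that scores each scenario using a language model would turn this into an actual evaluation loop; the ranking and metric code would stay the same.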

Code & Models

Datasets

MoyYuan/PRobELM (dataset, 10 downloads)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Focus · ALIGN