WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models

Youssef Benchekroun; Megi Dervishi; Mark Ibrahim; Jean-Baptiste Gaya; Xavier Martinet; Grégoire Mialon; Thomas Scialom; Emmanuel Dupoux; Dieuwke Hupkes; Pascal Vincent

arXiv:2311.15930 · cs.CL · November 28, 2023 · 1 citation

TL;DR

WorldSense is a synthetic benchmark designed to evaluate large language models' ability to maintain consistent world models and draw inferences, revealing persistent errors and biases even in advanced models like GPT-4.

Contribution

The paper introduces WorldSense, a bias-resistant synthetic benchmark for assessing grounded reasoning in large language models, and provides analysis of model performance and generalization.

Findings

01 State-of-the-art models make errors even with as few as three objects

02 Models exhibit heavy response biases that persist under chain-of-thought prompting and in-context learning

03 Fine-tuning improves performance but limits generalization

Abstract

We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. WorldSense is a synthetic benchmark with three problem types, each with its own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts with the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while…
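
The decorrelation idea in the abstract can be illustrated with a toy generator. The sketch below is not the authors' code or the WorldSense data format: the names, the sentence templates, and the helpers `make_problem`, `make_dataset`, and `response_bias` are all hypothetical. It assumes a three-object left-to-right ordering task, draws entity names at random so vocabulary carries no information about the answer, and fixes the labels to an exact 50/50 split so a model that always gives the same response scores at chance.

```python
import random

# Hypothetical vocabulary; the benchmark's actual wording is not reproduced here.
NAMES = ["Alice", "Bob", "Carol", "Dan", "Eve", "Frank"]


def make_problem(rng, want_true):
    """Build one three-object ordering problem whose label is chosen in advance."""
    # Hidden arrangement, left to right: a, b, c. Names are sampled at random,
    # so no particular name is tied to a position or to the answer.
    a, b, c = rng.sample(NAMES, 3)
    premises = [f"{a} is to the left of {b}.", f"{b} is to the left of {c}."]
    rng.shuffle(premises)  # premise order is independent of the label
    # Pick the query direction that yields the requested truth value.
    x, y = (a, c) if want_true else (c, a)
    question = f"Is {x} to the left of {y}?"
    return " ".join(premises) + " " + question, want_true


def make_dataset(n, seed=0):
    """Return n (text, label) pairs with labels balanced between True and False."""
    rng = random.Random(seed)
    labels = [i % 2 == 0 for i in range(n)]
    rng.shuffle(labels)
    return [make_problem(rng, label) for label in labels]


def response_bias(predictions):
    """Share of True responses minus 0.5; zero means no preference on a balanced set."""
    return sum(predictions) / len(predictions) - 0.5


if __name__ == "__main__":
    data = make_dataset(100)
    always_true = [True] * len(data)
    accuracy = sum(p == label for p, (_, label) in zip(always_true, data)) / len(data)
    print(accuracy, response_bias(always_true))
```

On a balanced set like this, a constant responder shows the largest possible bias while staying at chance accuracy, which is why the two quantities are worth reporting separately.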

Code & Models

Repositories

facebookresearch/worldsense
Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management