WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models
Youssef Benchekroun, Megi Dervishi, Mark Ibrahim, Jean-Baptiste Gaya, Xavier Martinet, Grégoire Mialon, Thomas Scialom, Emmanuel Dupoux, Dieuwke Hupkes, Pascal Vincent

TL;DR
WorldSense is a synthetic benchmark designed to evaluate large language models' ability to maintain consistent world models and draw inferences, revealing persistent errors and biases even in advanced models like GPT-4.
Contribution
The paper introduces WorldSense, a bias-resistant synthetic benchmark for assessing grounded reasoning in large language models, and provides analysis of model performance and generalization.
Findings
State-of-the-art models make errors even with as few as three objects
Models exhibit response biases regardless of prompting techniques
Fine-tuning improves performance but limits generalization
Abstract
We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. WorldSense is a synthetic benchmark with three problem types, each with its own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts with the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while…
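To make the idea of decorrelating the correct response from surface form concrete, here is a minimal Python sketch. It is not the WorldSense generator: the height relation, the names, and the helpers `relation`, `make_problem`, `make_balanced_set`, and `accuracy` are all assumptions made for illustration. It builds three-object ordering problems in which true and false statements occur equally often and the comparator wording is chosen at random, so a responder that always gives the same answer scores at chance.

```python
import random

def relation(taller, shorter, rng):
    # State the same fact with either comparator, chosen at random,
    # so the choice of word does not reveal the answer.
    if rng.random() < 0.5:
        return f"{taller} is taller than {shorter}."
    return f"{shorter} is shorter than {taller}."

def make_problem(rng, names, answer):
    # Hypothetical hidden order: low < mid < high (by height).
    low, mid, high = rng.sample(names, 3)
    premises = [relation(mid, low, rng), relation(high, mid, rng)]
    rng.shuffle(premises)
    if answer:
        statement = relation(high, low, rng)  # follows by transitivity
    else:
        statement = relation(low, high, rng)  # reversed, hence false
    return {"premises": " ".join(premises),
            "statement": statement,
            "answer": answer}

def make_balanced_set(n_pairs, names, seed=0):
    # One true and one false problem per pair keeps the labels balanced.
    rng = random.Random(seed)
    problems = []
    for _ in range(n_pairs):
        problems.append(make_problem(rng, names, True))
        problems.append(make_problem(rng, names, False))
    rng.shuffle(problems)
    return problems

def accuracy(problems, predict):
    # `predict` maps a problem dict to True or False.
    hits = sum(predict(p) == p["answer"] for p in problems)
    return hits / len(problems)
```

With a set built this way, a constant responder such as `lambda p: True` reaches an accuracy of exactly 0.5, so any systematic preference for one answer shows up as a drop toward chance rather than being hidden by an unbalanced label distribution.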
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
