Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding
Yun-Shiuan Chuang, Sameer Narendran, Nikunj Harlalka, Alexander Cheung, Sizhe Gao, Siddharth Suresh, Junjie Hu, Timothy T. Rogers

TL;DR
This paper introduces three guesstimation datasets for LLMs and shows that taking the median of multiple sampled responses, an approach inspired by Wisdom of Crowds, improves LLM accuracy on approximate reasoning tasks. The authors take this as evidence that LLMs encode a world model.
Contribution
The paper presents three novel guesstimation datasets (MARBLES, FUTURE, and ELECPRED) and proposes Wisdom of Crowds (WOC) decoding, showing that it improves LLM performance on real-world estimation tasks.
Findings
Median aggregation improves LLM guesstimation accuracy over greedy, self-consistency, and mean decoding.
LLMs exhibit Wisdom of Crowds effects similar to humans.
WOC decoding enhances LLM reasoning on diverse estimation tasks.
Abstract
Guesstimation -- the task of making approximate quantitative estimates about objects or events -- is a common real-world skill, yet remains underexplored in large language model (LLM) research. We introduce three guesstimation datasets: MARBLES, FUTURE, and ELECPRED, spanning physical estimation (e.g., how many marbles fit in a cup) to abstract predictions (e.g., the 2024 U.S. presidential election). Inspired by the social science concept of Wisdom of Crowds (WOC) -- where the median of multiple estimates improves accuracy -- we propose WOC decoding for LLMs. We replicate WOC effects in human participants and find that LLMs exhibit similar benefits: median aggregation across sampled responses consistently improves accuracy over greedy decoding, self-consistency decoding, and mean decoding. This suggests that LLMs encode a world model that supports approximate reasoning. Our results position…
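The aggregation step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names `parse_estimate` and `aggregate`, the number-extraction regex, and the sample responses are all hypothetical, and the step that samples responses from an LLM is assumed to have already happened. The "majority" value stands in for self-consistency decoding (the most frequent answer), which the abstract lists as a baseline.

```python
import re
from collections import Counter
from statistics import mean, median


def parse_estimate(text):
    """Return the first number found in a response string, or None.

    Hypothetical parsing rule for illustration; the paper's own
    answer-extraction procedure is not given in this summary.
    """
    match = re.search(r"-?\d[\d,]*\.?\d*", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def aggregate(responses):
    """Aggregate sampled responses by median, mean, and majority vote."""
    estimates = [e for e in map(parse_estimate, responses) if e is not None]
    if not estimates:
        raise ValueError("no numeric estimate found in any response")
    # Most frequent value; ties resolve to the value seen first.
    majority_value, _ = Counter(estimates).most_common(1)[0]
    return {
        "median": median(estimates),      # WOC decoding
        "mean": mean(estimates),          # mean decoding
        "majority": majority_value,       # self-consistency-style vote
    }


# Made-up responses to a "how many marbles fit in a cup" style prompt.
responses = [
    "About 250 marbles.",
    "300",
    "I'd guess 280",
    "Roughly 5,000",
    "260",
]
result = aggregate(responses)
print(result)
```

In this made-up sample, a single outlier (5,000) pulls the mean far from the other estimates while the median stays among them, which is the usual intuition for preferring the median when individual estimates are noisy.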
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Data Stream Mining Techniques
