TL;DR
KnowledgeBerg is a benchmark that tests whether large language models can systematically cover a bounded knowledge universe and reason compositionally over it. Representative open-source models show severe limitations on both tasks.
Contribution
The paper introduces KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, to evaluate LLMs' knowledge coverage and set-based reasoning capabilities.
Findings
Representative open-source LLMs achieve only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning.
Test-time augmentation improves model performance by up to 4.35 points.
Failures are due to missing knowledge, lack of awareness, and incorrect reasoning execution.
Abstract
Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge universe and (ii) compositional set-based reasoning over that universe, a phenomenon we term "the tip of the iceberg." We formalize this challenge through two orthogonal dimensions: knowledge width, the cardinality of the required universe, and reasoning depth, the number of compositional set operations. We introduce KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, with universes grounded in authoritative sources to ensure reproducibility. Representative open-source LLMs demonstrate severe limitations, achieving only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning. Diagnostic analyses reveal three stages of failure:…
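To make the two dimensions and the enumeration metric concrete, here is a minimal Python sketch. It is not the paper's evaluation code: the function name `enumeration_f1`, the lowercase/strip normalization, and the placeholder items are all assumptions for illustration, and the benchmark's actual answer matching may differ.

```python
def enumeration_f1(predicted, gold):
    """Set-level F1 between a predicted enumeration and a gold universe.

    Assumption: items match after lowercasing and stripping whitespace.
    The benchmark's real matching rule is not stated in the abstract.
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical bounded universe; the labels are placeholders, not benchmark data.
universe = {"a", "b", "c", "d"}
knowledge_width = len(universe)  # cardinality of the required universe

# A question that composes two set operations over the universe
# (one intersection, then one difference), so reasoning depth is 2.
has_property_x = {"a", "b", "c"}
has_property_y = {"b", "c", "d"}
excluded = {"c"}
answer = (has_property_x & has_property_y) - excluded
reasoning_depth = 2

# A model that lists two correct items and one spurious one.
score = enumeration_f1(["A", "b ", "x"], universe)
print(knowledge_width, reasoning_depth, sorted(answer), round(score, 4))
```

In this toy case precision is 2/3 and recall is 2/4, so the F1 is 4/7. A model can only answer the composed question reliably if it first recovers the whole universe, which is why low enumeration scores bound what the reasoning stage can achieve.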
