BSBench: will your LLM find the largest prime number?
K. O. T. Erziev

TL;DR
This paper introduces BSBench, a benchmark for testing large language models on questions with no reasonable answer, revealing their limitations in handling impossible queries.
Contribution
It presents a novel benchmark and a dataset modification method to evaluate LLMs on impossible questions, highlighting their current performance gaps.
Findings
Models perform far from perfectly on impossible questions
Existing datasets can be adapted for challenging benchmarks
Code and data are publicly available
Abstract
We propose that benchmarking LLMs on questions that have no reasonable answer actually isn't as silly as it sounds. We also present a benchmark that allows such testing, along with a method for modifying existing datasets, and find that existing models perform far from perfectly on such questions. Our code and data artifacts are available at https://github.com/L3G5/impossible-bench
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques
