BSBench: will your LLM find the largest prime number?

K. O. T. Erziev

arXiv:2506.04535·cs.CL·June 6, 2025

TL;DR

This paper introduces BSBench, a benchmark for testing large language models on questions with no reasonable answer, revealing their limitations in handling impossible queries.

Contribution

It presents a novel benchmark and a dataset modification method to evaluate LLMs on impossible questions, highlighting their current performance gaps.

Findings

01 Models perform far from perfectly on impossible questions

02 Existing datasets can be modified to build such benchmarks

03 Code and data are publicly available

Abstract

We propose that benchmarking LLMs on questions that have no reasonable answer isn't as silly as it sounds. We also present a benchmark that allows such testing and a method to modify existing datasets, and find that existing models perform far from perfectly on such questions. Our code and data artifacts are available at https://github.com/L3G5/impossible-bench

Code & Models

Repositories

l3g5/impossible-bench (Official)

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques