Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models
Daman Arora, Himanshu Gaurav Singh, Mausam

TL;DR
This paper introduces JEEBench, a challenging problem-solving benchmark for large language models on which even the best model scores below 40%, and proposes a confidence-based method to improve response accuracy.
Contribution
The paper presents JEEBench, a benchmark of 515 challenging pre-engineering mathematics, physics and chemistry problems drawn from the IIT JEE-Advanced exam, and develops a confidence-thresholding technique to improve model response selection.
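The summary does not say how the confidence score is computed. As a minimal sketch, assuming confidence is taken to be the vote share of the most common answer across several sampled responses (a self-consistency style estimate), thresholding could look like the following. The function name `select_answer`, the `threshold` parameter and its default value are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def select_answer(
    samples: Sequence[str], threshold: float = 0.5
) -> Tuple[Optional[str], float]:
    """Return (answer, confidence); answer is None when the model abstains.

    Assumption: confidence is the fraction of sampled responses that agree
    with the most common answer. The paper may define it differently.
    """
    if not samples:
        return None, 0.0
    # Majority answer and its vote count across the sampled responses.
    answer, votes = Counter(samples).most_common(1)[0]
    confidence = votes / len(samples)
    # Answer only when agreement reaches the threshold; otherwise abstain.
    if confidence >= threshold:
        return answer, confidence
    return None, confidence
```

For example, `select_answer(["A", "A", "B", "A"], threshold=0.6)` keeps answer "A" because three of four samples agree, while four mutually different samples fall below the threshold and the question is left unanswered.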
Findings
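Even the best model, GPT-4, scores below 40%, including with self-consistency, self-refinement and chain-of-thought prompting.
Common failure modes include errors in algebraic manipulation and difficulty grounding abstract concepts into mathematical equations.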
Prompting alone cannot effectively assess answer risk.
Abstract
The performance of large language models (LLMs) on existing reasoning benchmarks has significantly improved over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem solving abilities of LLMs. We curate 515 challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is essential for solving problems in this benchmark. Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%. The typical failure modes of GPT-4, the best model, are errors in algebraic manipulation, difficulty in grounding abstract concepts into mathematical equations…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Label Smoothing · Dense Connections · Weight Decay · Residual Connection · Linear Warmup With Cosine Annealing · Position-Wise Feed-Forward Layer · Absolute Position Encodings
