Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

Daman Arora; Himanshu Gaurav Singh; Mausam

arXiv:2305.15074 · cs.CL · October 24, 2023 · 6 cites

TL;DR

This paper introduces JEEBench, a highly challenging problem-solving benchmark for large language models, highlighting current limitations and proposing a confidence-based method to improve response accuracy.

Contribution

The paper presents JEEBench, a benchmark of 515 challenging pre-engineering mathematics, physics, and chemistry problems drawn from the IIT JEE-Advanced exam, and develops a confidence-thresholding technique to improve model response selection.
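To make the idea of confidence thresholding concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes confidence is estimated as the vote share of the most common answer among several sampled responses, and the function names (`select_response`, `score`), the default `threshold`, and the `reward`/`penalty` values are all hypothetical choices for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def select_response(samples: Sequence[str], threshold: float = 0.5) -> Optional[str]:
    """Return the most common sampled answer if its vote share reaches
    `threshold`; otherwise return None to signal abstention.

    Assumption: vote share among samples is used as the confidence estimate.
    """
    if not samples:
        return None
    answer, votes = Counter(samples).most_common(1)[0]
    confidence = votes / len(samples)
    return answer if confidence >= threshold else None


def score(prediction: Optional[str], gold: str,
          reward: float = 1.0, penalty: float = 0.25) -> float:
    """Hypothetical marking scheme: abstaining scores 0, a correct answer
    earns `reward`, and a wrong answer costs `penalty`."""
    if prediction is None:
        return 0.0
    return reward if prediction == gold else -penalty


if __name__ == "__main__":
    confident = ["A", "A", "B", "A"]   # top answer has 3 of 4 votes
    uncertain = ["A", "B", "C", "D"]   # top answer has 1 of 4 votes
    print(select_response(confident), select_response(uncertain))
```

Under a marking scheme that penalizes wrong answers, abstaining on low-agreement questions can raise the total score, which is the motivation for thresholding rather than always answering.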

Findings

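01 The best model scores below 40%, even with chain-of-thought prompting, self-consistency, and self-refinement.

02 GPT-4's typical failure modes include errors in algebraic manipulation and difficulty grounding abstract concepts in mathematical equations.

03 Prompting alone cannot effectively assess answer risk.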

Abstract

The performance of large language models (LLMs) on existing reasoning benchmarks has significantly improved over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem solving abilities of LLMs. We curate 515 challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is essential for solving problems in this benchmark. Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40%. The typical failure modes of GPT-4, the best model, are errors in algebraic manipulation, difficulty in grounding abstract concepts into mathematical equations…

Code & Models

Repositories

hgaurav2k/jeebench (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Label Smoothing · Dense Connections · Weight Decay · Residual Connection · Linear Warmup With Cosine Annealing · Position-Wise Feed-Forward Layer · Absolute Position Encodings