CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Junnan Dong; Zijin Hong; Yuanchen Bei; Feiran Huang; Xinrun Wang; Xiao Huang

arXiv:2410.17558 · cs.AI · October 28, 2024

TL;DR

CLR-Bench introduces a comprehensive evaluation framework for large language models on complex college-level reasoning tasks, highlighting their limited reasoning capabilities despite high accuracy on final answers.

Contribution

The paper presents a new benchmark with detailed explanations and two novel metrics to better assess LLMs' reasoning and explanation abilities in college-level disciplines.

Findings

01 LLMs score markedly lower on reasoning than on final-answer accuracy.

02 GPT-4 turbo drops from 63.31% on answer accuracy to 39.00% once its reasoning is evaluated.

03 LLMs tend to guess answers rather than demonstrate true understanding.

Abstract

Large language models (LLMs) have demonstrated remarkable performance across various language understanding tasks. While emerging benchmarks have been proposed to evaluate LLMs in domains such as mathematics and computer science, they merely measure the accuracy of the final prediction on multiple-choice questions. This is insufficient to verify whether an LLM actually understands the choice it selects. To fill this gap, we present CLR-Bench to comprehensively evaluate LLMs in complex college-level reasoning. Specifically, (i) we prioritize 16 challenging college disciplines in computer science and artificial intelligence. The dataset contains 5 types of questions, and each question is associated with detailed explanations from experts. (ii) To enable a fair evaluation of LLMs' reasoning ability, we formalize the criteria with two novel metrics.…
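The abstract is cut off before the two metrics are defined, so the following is only an illustrative sketch of the general idea, not the paper's definitions. It assumes each model response has already been graded on two hypothetical boolean fields (`answer_correct` and `rationale_correct`), and contrasts an answer-only score with a stricter score that also requires a correct rationale. All names and the toy data are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GradedResponse:
    """Hypothetical grading record; the field names are assumptions, not the paper's."""
    answer_correct: bool      # final prediction matches the reference answer
    rationale_correct: bool   # explanation judged consistent with the expert explanation


def answer_accuracy(responses: Sequence[GradedResponse]) -> float:
    """Fraction of responses whose final answer is correct."""
    if not responses:
        raise ValueError("need at least one graded response")
    return sum(r.answer_correct for r in responses) / len(responses)


def joint_accuracy(responses: Sequence[GradedResponse]) -> float:
    """Fraction of responses with both a correct answer and a correct rationale."""
    if not responses:
        raise ValueError("need at least one graded response")
    return sum(r.answer_correct and r.rationale_correct for r in responses) / len(responses)


# Toy data (not from the paper): three correct answers, only one backed by a sound rationale.
responses = [
    GradedResponse(answer_correct=True, rationale_correct=True),
    GradedResponse(answer_correct=True, rationale_correct=False),
    GradedResponse(answer_correct=True, rationale_correct=False),
    GradedResponse(answer_correct=False, rationale_correct=False),
]

print(f"answer-only accuracy: {answer_accuracy(responses):.2f}")
print(f"answer + rationale accuracy: {joint_accuracy(responses):.2f}")
```

A gap between the two numbers is the kind of signal the findings describe: a model can pick the right option without being able to justify it.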

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Linear Layer · Dense Connections · Multi-Head Attention · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding · Layer Normalization