CLR-Bench: Evaluating Large Language Models in College-level Reasoning
Junnan Dong, Zijin Hong, Yuanchen Bei, Feiran Huang, Xinrun Wang, Xiao Huang

TL;DR
CLR-Bench introduces a comprehensive evaluation framework for large language models on complex college-level reasoning tasks, highlighting their limited reasoning capabilities despite high accuracy on final answers.
Contribution
The paper presents a new benchmark with detailed explanations and two novel metrics to better assess LLMs' reasoning and explanation abilities in college-level disciplines.
Findings
LLMs perform poorly on reasoning tasks compared to answer accuracy.
GPT-4 turbo drops from 63.31% when only the final answer is scored to 39.00% when its reasoning is scored as well.
LLMs tend to guess answers rather than demonstrate true understanding.
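To put the GPT-4 turbo figures in perspective, the gap can be expressed both in percentage points and relative to the starting score. The snippet below is only a back-of-the-envelope sketch: the function name `accuracy_drop` is hypothetical and not from the paper, and the two inputs are the 63.31% and 39.00% figures quoted in the findings.

```python
def accuracy_drop(answer_acc: float, reasoning_acc: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Hypothetical helper for illustration; not part of CLR-Bench.
    """
    absolute = answer_acc - reasoning_acc
    relative = 100.0 * absolute / answer_acc
    return absolute, relative


# Figures quoted in the findings for GPT-4 turbo.
absolute, relative = accuracy_drop(63.31, 39.00)
print(f"absolute drop: {absolute:.2f} percentage points")
print(f"relative drop: {relative:.1f}%")
```

Under these inputs the score falls by about 24.31 percentage points, which is roughly 38.4% of the original value.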
Abstract
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks. While emerging benchmarks have been proposed to evaluate LLMs in various domains such as mathematics and computer science, they merely measure the accuracy in terms of the final prediction on multi-choice questions. However, it remains insufficient to verify the essential understanding of LLMs given a chosen choice. To fill this gap, we present CLR-Bench to comprehensively evaluate the LLMs in complex college-level reasoning. Specifically, (i) we prioritize 16 challenging college disciplines in computer science and artificial intelligence. The dataset contains 5 types of questions, while each question is associated with detailed explanations from experts. (ii) To quantify a fair evaluation of LLMs' reasoning ability, we formalize the criteria with two novel metrics.…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Linear Layer · Dense Connections · Multi-Head Attention · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding · Layer Normalization
