SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

Liangtai Sun; Yang Han; Zihan Zhao; Da Ma; Zhennan Shen; Baocai Chen; Lu Chen; Kai Yu

arXiv:2308.13149·cs.CL·November 8, 2024·5 cites

TL;DR

SciEval is a multi-disciplinary benchmark for evaluating the scientific research abilities of large language models. It is designed to reduce data leakage and to cover subjective questions, and its results show room for improvement, especially on dynamic questions.

Contribution

The paper introduces SciEval, a comprehensive benchmark organized around Bloom's taxonomy that includes dynamic and subjective questions for evaluating LLMs' scientific research capabilities.

Findings

01 GPT-4 achieves the best performance among the evaluated LLMs

02 A significant performance gap remains on dynamic questions

03 The benchmark is designed to address data leakage and the lack of subjective question evaluation

Abstract

Recently, there has been growing interest in using Large Language Models (LLMs) for scientific research. Numerous benchmarks have been proposed to evaluate the ability of LLMs for scientific research. However, current benchmarks are mostly based on pre-collected objective questions. This design suffers from the data leakage problem and lacks an evaluation of subjective Q/A ability. In this paper, we propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues. Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability. In particular, we design a "dynamic" subset based on scientific principles to protect the evaluation from potential data leakage. Both objective and subjective questions are included in SciEval. These characteristics make SciEval a more effective benchmark for scientific research…
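
The idea of a "dynamic" subset built from scientific principles can be illustrated with a minimal sketch. It is not the authors' generation procedure: the function name `make_dynamic_question`, the choice of Newton's second law as the template, and the distractor scheme are all assumptions made for illustration. The point is only that questions regenerated from a principle with fresh parameters are unlikely to appear verbatim in a model's training data.

```python
import random


def make_dynamic_question(seed: int) -> dict:
    """Build one multiple-choice question from F = m * a.

    Hypothetical template for illustration; not taken from SciEval itself.
    """
    rng = random.Random(seed)  # seeded so a given question is reproducible
    mass_kg = rng.randint(1, 20)
    accel = rng.randint(1, 10)
    answer = mass_kg * accel

    # Distractors are offset by multiples of the acceleration, so they are
    # always distinct from the answer and from each other (accel >= 1).
    options = [answer, answer + accel, answer + 2 * accel, answer + 3 * accel]
    rng.shuffle(options)

    question = (
        f"A body of mass {mass_kg} kg accelerates at {accel} m/s^2. "
        "What net force (in N) acts on it?"
    )
    return {
        "question": question,
        "options": options,
        "answer_index": options.index(answer),
        "mass_kg": mass_kg,
        "accel": accel,
    }


if __name__ == "__main__":
    q = make_dynamic_question(seed=2024)
    print(q["question"])
    for label, value in zip("ABCD", q["options"]):
        print(f"  {label}. {value}")
    print("Correct:", "ABCD"[q["answer_index"]])
```

Changing the seed for each evaluation round yields a fresh question set, while a fixed seed keeps any single round reproducible.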

Code & Models

Datasets

OpenDFM/SciEval
dataset · 464 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Dropout · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer