SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Tu Anh Dinh; Carlos Mullov; Leonard Bärmann; Zhaolin Li; Danni Liu; Simon Reiß; Jueun Lee; Nathan Lerzer; Fabian Ternava; Jianfeng Gao; Tobias Röddiger; Alexander Waibel; Tamim Asfour; Michael Beigl; Rainer Stiefelhagen; Carsten Dachsbacher; Klemens Böhm; Jan Niehues

arXiv:2406.10421·cs.CL·October 3, 2024·1 cites

TL;DR

SciEx is a comprehensive benchmark of university-level scientific exam questions in multiple languages and formats, evaluated with human and automatic grading to assess LLMs' problem-solving abilities and grading accuracy.

Contribution

This paper introduces SciEx, a novel multilingual, multi-modal scientific exam benchmark with expert and automatic grading, enabling evaluation of LLMs' solving and grading capabilities.

Findings

01

The best-performing LLM achieves an average exam grade of only 59.4% on SciEx.

02

LLMs are effective as automatic graders, reaching a 0.948 correlation with expert grading.

03

SciEx remains challenging for state-of-the-art LLMs.
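
The grader-agreement finding above can be illustrated with a small sketch of how such a correlation might be computed. This is not the authors' evaluation code. It assumes a Pearson correlation over per-exam grades, and the names `pearson`, `expert_grades`, and `llm_grades` and all grade values are hypothetical placeholders rather than SciEx data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and each sequence's sum of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical grades (percent of achievable points) for the same set of
# exam submissions, scored once by a human expert and once by an LLM grader.
expert_grades = [62.0, 45.5, 71.0, 38.0, 55.0]
llm_grades = [60.0, 48.0, 75.0, 35.5, 58.0]

agreement = pearson(expert_grades, llm_grades)
print(f"grader agreement (Pearson r): {agreement:.3f}")
```

A value close to 1 means the LLM grader ranks and spaces submissions much as the expert does. It does not by itself show that the absolute grades match, since a grader that is uniformly too lenient can still correlate highly.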

Abstract

With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks that can evaluate the ability of LLMs in different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases, or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper we propose SciEx, a benchmark consisting of university computer science exam questions, to evaluate LLMs' ability to solve scientific tasks. SciEx is (1) multilingual, containing both English and German exams, (2) multi-modal, containing questions that involve images, and (3) varied, containing different types of freeform questions at different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are…

Code & Models

Repositories

TuAnh23/SciEx
Official

Datasets

tuanh23/SciEx
Dataset · 17 downloads

Videos

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education