SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
Tu Anh Dinh, Carlos Mullov, Leonard Bärmann, Zhaolin Li, Danni Liu, Simon Reiß, Jueun Lee, Nathan Lerzer, Fabian Ternava, Jianfeng Gao, Tobias Röddiger, Alexander Waibel, Tamim Asfour, Michael Beigl, Rainer Stiefelhagen, Carsten Dachsbacher, Klemens Böhm, Jan Niehues

TL;DR
SciEx is a comprehensive benchmark of university-level scientific exam questions in multiple languages and formats, evaluated with human and automatic grading to assess LLMs' problem-solving abilities and grading accuracy.
Contribution
This paper introduces SciEx, a novel multilingual, multi-modal scientific exam benchmark with expert and automatic grading, enabling evaluation of LLMs' solving and grading capabilities.
Findings
The best-performing LLM achieves an average exam grade of only 59.4% on SciEx.
LLMs are effective as automatic graders, achieving a 0.948 Pearson correlation with expert grading.
SciEx remains challenging for state-of-the-art LLMs.
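
To make the grader-agreement figure above concrete, here is a minimal sketch of how a correlation between expert grades and LLM-assigned grades could be computed, assuming the reported value is a Pearson coefficient over per-answer or per-exam grades. The grade lists and the `pearson` helper are hypothetical illustrations, not data or code from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical grades in percent: NOT taken from the SciEx paper.
expert_grades = [55.0, 72.5, 40.0, 88.0, 61.5]
llm_grades = [58.0, 70.0, 45.0, 85.0, 60.0]

r = pearson(expert_grades, llm_grades)
print(f"Pearson r between expert and LLM grading: {r:.3f}")
```

A value close to 1 means the automatic grader ranks and scales answers much like the human experts do; it does not by itself show that the absolute grades match.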
Abstract
With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks that can evaluate the ability of LLMs in different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases, or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper we propose SciEx, a benchmark consisting of university computer science exam questions, to evaluate LLMs' ability to solve scientific tasks. SciEx is (1) multilingual, containing both English and German exams, (2) multi-modal, containing questions that involve images, and (3) varied, containing different types of freeform questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education
