EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models

Yu Huang; Liang Guo; Wanqian Guo; Zhe Tao; Yang Lv; Zhihao Sun; Dongfang Zhao

arXiv:2405.11265·cs.CL·May 21, 2024·2 cites

Open Access · 1 dataset

TL;DR

EnviroExam is a benchmark that assesses large language models' environmental science knowledge with questions drawn from university curricula. It reveals performance gaps between models and can inform model selection and fine-tuning.

Contribution

This paper introduces EnviroExam, a novel evaluation framework based on academic curricula, to systematically assess large language models in environmental science.

Findings

01 61.3% of the 31 evaluated models passed the 5-shot tests

02 48.39% of the 31 evaluated models passed the 0-shot tests

03 Performance varies significantly among models

Abstract

In the field of environmental science, it is crucial to have robust evaluation metrics for large language models to ensure their efficacy and accuracy. We propose EnviroExam, a comprehensive evaluation method designed to assess the knowledge of large language models in the field of environmental science. EnviroExam is based on the curricula of top international universities, covering undergraduate, master's, and doctoral courses, and includes 936 questions across 42 core courses. By conducting 0-shot and 5-shot tests on 31 open-source large language models, EnviroExam reveals the performance differences among these models in the domain of environmental science and provides detailed evaluation standards. The results show that 61.3% of the models passed the 5-shot tests, while 48.39% passed the 0-shot tests. By introducing the coefficient of variation as an indicator, we evaluate the…
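The abstract names the coefficient of variation as an indicator and reports pass rates over 31 models. The sketch below shows one common way to compute both quantities. It is not taken from the paper: the formula used is the standard definition (standard deviation divided by mean), the function names and the `threshold` parameter are hypothetical, and the scores are made-up illustration values, not results from EnviroExam.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """Return population standard deviation divided by the mean.

    Assumption: the standard definition of the coefficient of variation;
    the paper's exact formulation is not shown in this excerpt.
    """
    mu = mean(scores)
    if mu == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return pstdev(scores) / mu


def pass_rate(total_scores, threshold):
    """Fraction of models whose total score is at least `threshold`.

    `threshold` is a hypothetical parameter; the passing score used by
    the authors is not stated in this excerpt.
    """
    return sum(s >= threshold for s in total_scores) / len(total_scores)


# Hypothetical per-course accuracies for two models (not data from the paper).
steady = [70.0, 72.0, 68.0, 70.0]
uneven = [95.0, 40.0, 85.0, 60.0]

# Both lists have the same mean, but the second is far more dispersed,
# so its coefficient of variation is larger.
cv_steady = coefficient_of_variation(steady)
cv_uneven = coefficient_of_variation(uneven)
```

Under this definition, a lower value means a model's scores are more even across courses. As a consistency check on the reported figures, 19 of 31 models rounds to 61.3% and 15 of 31 rounds to 48.39%.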

Datasets

enviroscientist/EnviroExam
dataset · 137 dl

Taxonomy

Topics: Scientific Computing and Data Management · Topic Modeling · Research Data Management Practices