SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications
Emily Herron, Junqi Yin, Feiyi Wang

TL;DR
SciTrust 2.0 is a framework, with novel benchmarks, for evaluating the trustworthiness of large language models in scientific research along four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics.
Contribution
The paper presents new open-ended truthfulness benchmarks and a new scientific-ethics benchmark, and evaluates seven LLMs (four science-specialized, three general-purpose), revealing gaps in the trustworthiness of the science-specialized models.
Findings
General-purpose models outperform science-specialized models in trustworthiness.
GPT-o4-mini excels in truthfulness and adversarial robustness.
Science-specialized models show weaknesses in ethical reasoning and safety.
Abstract
Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. Our framework incorporates novel, open-ended truthfulness benchmarks developed through a verified reflection-tuning pipeline and expert validation, alongside a novel ethics benchmark for scientific research contexts covering eight subcategories including dual-use research and bias. We evaluated seven prominent LLMs, including four science-specialized models and three general-purpose industry models, using multiple evaluation metrics including accuracy, semantic similarity measures,…
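As a rough illustration of the two metric families the abstract names (accuracy and semantic similarity), the sketch below scores model answers against references. It is not the paper's evaluation code: the function names `accuracy` and `bow_cosine` are hypothetical, and the bag-of-words cosine is only a crude stand-in, assumed here for self-containment, for whatever embedding-based similarity measures the authors actually used.

```python
import math
from collections import Counter


def accuracy(predictions, references):
    """Fraction of predictions that match the reference exactly
    (ignoring case and surrounding whitespace)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors.

    Assumption: a lexical proxy for semantic similarity; a real
    evaluation would more likely compare sentence embeddings.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0
```

Accuracy suits multiple-choice items with a single correct label, while a similarity score is the usual fallback for open-ended answers where exact matching is too strict.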
