Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators
Prasoon Bajpai, Niladri Chatterjee, Subhabrata Dutta, Tanmoy Chakraborty

TL;DR
This study evaluates the reliability of large language models as science communicators using a novel dataset, revealing strengths of certain open-access models and exposing significant limitations in current models' factual accuracy and trustworthiness.
Contribution
Introduces SCiPS-QA, a new dataset for scientific question-answering, and benchmarks multiple LLMs, highlighting their strengths and weaknesses in science communication tasks.
Findings
Llama-3-70B often outperforms GPT-4 Turbo in evaluation.
Most open-access models underperform compared to GPT-4 Turbo.
GPT models frequently fail to reliably verify responses and can deceive human evaluators.
Abstract
Large Language Models (LLMs) and AI assistants driven by these models are experiencing exponential growth in usage among both expert and amateur users. In this work, we focus on evaluating the reliability of current LLMs as science communicators. Unlike existing benchmarks, our approach emphasizes assessing these models on scientific question-answering tasks that require a nuanced understanding and awareness of answerability. We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts, along with a benchmarking suite that evaluates LLMs for correctness and consistency across various criteria. We benchmark three proprietary LLMs from the OpenAI GPT family and 13 open-access LLMs from the Meta Llama-2, Llama-3, and Mistral families. While most open-access models significantly underperform compared to GPT-4 Turbo, our experiments identify…
Taxonomy
Topics
Biomedical Text Mining and Ontologies
Methods
Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Position-Wise Feed-Forward Layer · Label Smoothing · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Absolute Position Encodings
