Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators

Prasoon Bajpai; Niladri Chatterjee; Subhabrata Dutta; Tanmoy Chakraborty

arXiv:2409.14037·cs.CL·September 24, 2024

TL;DR

This study evaluates the reliability of large language models as science communicators using a novel dataset, revealing strengths of certain open-access models and exposing significant limitations in current models' factual accuracy and trustworthiness.

Contribution

Introduces SCiPS-QA, a new dataset for scientific question-answering, and benchmarks multiple LLMs, highlighting their strengths and weaknesses in science communication tasks.

Findings

01

Llama-3-70B often outperforms GPT-4 Turbo on several evaluation criteria.

02

Most open-access models underperform compared to GPT-4 Turbo.

03

GPT models frequently fail to reliably verify responses and can deceive human evaluators.

Abstract

Large Language Models (LLMs) and AI assistants driven by these models are experiencing exponential growth in usage among both expert and amateur users. In this work, we focus on evaluating the reliability of current LLMs as science communicators. Unlike existing benchmarks, our approach emphasizes assessing these models on scientific question-answering tasks that require a nuanced understanding and awareness of answerability. We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts, along with a benchmarking suite that evaluates LLMs for correctness and consistency across various criteria. We benchmark three proprietary LLMs from the OpenAI GPT family and 13 open-access LLMs from the Meta Llama-2, Llama-3, and Mistral families. While most open-access models significantly underperform compared to GPT-4 Turbo, our experiments identify…
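
The abstract mentions evaluating models for correctness and consistency on Yes/No queries. As a rough illustration only, the sketch below scores repeated model answers against gold labels. The function names, the majority-vote rule, and the toy data are assumptions made for this example; they are not the paper's metric definitions or the SCiPS-QA format.

```python
from collections import Counter

# Hypothetical scoring helpers; not the paper's actual benchmarking suite.

def majority_answer(samples):
    """Return the most frequent answer among repeated samples for one query."""
    return Counter(samples).most_common(1)[0][0]

def accuracy(gold, sampled):
    """Fraction of queries whose majority answer matches the gold label."""
    hits = sum(majority_answer(s) == g for g, s in zip(gold, sampled))
    return hits / len(gold)

def consistency(sampled):
    """Mean share of samples that agree with the majority answer, per query."""
    shares = [Counter(s).most_common(1)[0][1] / len(s) for s in sampled]
    return sum(shares) / len(shares)

# Toy data (made up): three queries, four sampled answers each.
gold = ["Yes", "No", "Yes"]
sampled = [
    ["Yes", "Yes", "Yes", "Yes"],
    ["No", "Yes", "No", "No"],
    ["No", "No", "Yes", "No"],
]

print(accuracy(gold, sampled))
print(consistency(sampled))
```

Separating the two scores shows that a model can be consistent on a query while still being wrong, as in the third toy query.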

Code & Models

Repositories

Prasoon1207/llm-science-miscommunication
PyTorch · Official

Videos

Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators· underline

Taxonomy

Topics

Biomedical Text Mining and Ontologies

Methods

Attention Is All You Need · Cosine Annealing · Linear Layer · Weight Decay · Position-Wise Feed-Forward Layer · Label Smoothing · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Absolute Position Encodings