There are significant differences among artificial intelligence large language models when answering scientific questions

Francisco Javier Álvarez-Martínez; Luis Esteban; Lucas Frungillo; Estefanía Butassi; Alessandro Zambon; María Herranz-López; Mario Aranda; Federica Pollastro; Anne Sylvie Tixier; Jose V. Garcia-Perez; David Arráez-Román; Andrew Ross; Pedro Mena; Ru Angelie Edrada-Ebel; James Lyng; Vicente Micol; Fernando Borrás-Rocher; Enrique Barrajón-Catalán

PMC · DOI: 10.3389/frai.2025.1664303 · October 9, 2025

Open Access

TL;DR

This study compares how well five free large language models answer scientific questions, as rated by sixteen expert reviewers. Claude 3.5 Sonnet scored highest, followed by Gemini.

Contribution

The study provides a comparative evaluation of five LLMs for scientific accuracy and highlights the need for ethical frameworks in AI use.

Findings

01

Claude 3.5 Sonnet scored highest in depth, accuracy, and clarity among the evaluated models.

02

RAG techniques and refined prompts improved LLM performance, but some models still require development.

03

Reviewers' trust in AI increased after evaluation, though ethical concerns about transparency remained.

Abstract

This study investigates the efficacy of large language models (LLMs) for generating accurate scientific responses through a comparative evaluation of five prominent free models: Claude 3.5 Sonnet, Gemini, ChatGPT 4o, Mistral Large 2, and Llama 3.1 70B. Sixteen expert scientific reviewers assessed these models in terms of depth, accuracy, relevance, and clarity. Claude 3.5 Sonnet emerged as the highest scoring model, followed by Gemini, with notable variability among the other models. Additionally, retrieval-augmented generation (RAG) techniques were applied to improve LLM performance, and prompts were refined to improve answers. The results indicate that although LLMs such as Claude 3.5 Sonnet have potential for scientific tasks, other models may require more development or additional prompt engineering to reach comparable accuracy. Reviewers’ perceptions of artificial intelligence…

Figures: 10

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education