# There are significant differences among artificial intelligence large language models when answering scientific questions

**Authors:** Francisco Javier Álvarez-Martínez, Luis Esteban, Lucas Frungillo, Estefanía Butassi, Alessandro Zambon, María Herranz-López, Mario Aranda, Federica Pollastro, Anne Sylvie Tixier, Jose V. Garcia-Perez, David Arráez-Román, Andrew Ross, Pedro Mena, Ru Angelie Edrada-Ebel, James Lyng, Vicente Micol, Fernando Borrás-Rocher, Enrique Barrajón-Catalán

PMC · DOI: 10.3389/frai.2025.1664303 · 2025-10-09

## TL;DR

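This study compares how well five free large language models answer scientific questions, as rated by sixteen expert reviewers. Claude 3.5 Sonnet scored highest, followed by Gemini, with notable variability among the remaining models.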

## Contribution

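The study provides a comparative evaluation of five free LLMs, scored by expert reviewers on depth, accuracy, relevance, and clarity, and highlights the need for structured evaluation frameworks and ethical considerations in the scientific use of AI.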

## Key findings

- Claude 3.5 Sonnet scored highest in depth, accuracy, and clarity among the evaluated models.
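- RAG techniques and refined prompts were applied to improve LLM performance, but models other than Claude 3.5 Sonnet may still need further development or prompt engineering to reach comparable accuracy.
- Reviewers' perceptions of AI utility and trustworthiness shifted positively after the evaluation, while ethical concerns about transparency and disclosure remained unchanged.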

## Abstract

This study investigates the efficacy of large language models (LLMs) for generating accurate scientific responses through a comparative evaluation of five prominent free models: Claude 3.5 Sonnet, Gemini, ChatGPT 4o, Mistral Large 2, and Llama 3.1 70B.

Sixteen expert scientific reviewers assessed these models in terms of depth, accuracy, relevance, and clarity.
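
As a rough illustration of how ratings of this kind could be combined, the sketch below averages reviewer scores per criterion and ranks models by the mean of those averages. It is not the authors' method: the 1-5 scale, the equal weighting of criteria, the function names `aggregate` and `rank`, and the placeholder models and scores are all assumptions made for the example.

```python
from statistics import mean

# The four criteria named in the abstract.
CRITERIA = ("depth", "accuracy", "relevance", "clarity")


def aggregate(ratings):
    """Average scores per model and criterion.

    ratings: iterable of (reviewer, model, {criterion: score}) tuples.
    Returns {model: {criterion: mean score, ..., "overall": mean of criterion means}}.
    """
    per_model = {}
    for _reviewer, model, scores in ratings:
        bucket = per_model.setdefault(model, {c: [] for c in CRITERIA})
        for c in CRITERIA:
            bucket[c].append(scores[c])

    summary = {}
    for model, bucket in per_model.items():
        means = {c: mean(values) for c, values in bucket.items()}
        # Equal weighting of criteria is an assumption, not taken from the paper.
        means["overall"] = mean(means[c] for c in CRITERIA)
        summary[model] = means
    return summary


def rank(summary):
    """Return model names ordered from highest to lowest overall score."""
    return sorted(summary, key=lambda m: summary[m]["overall"], reverse=True)


# Hypothetical placeholder data on an assumed 1-5 scale; not results from the study.
example_ratings = [
    ("reviewer_1", "model_a", {"depth": 4, "accuracy": 5, "relevance": 4, "clarity": 5}),
    ("reviewer_2", "model_a", {"depth": 3, "accuracy": 4, "relevance": 4, "clarity": 4}),
    ("reviewer_1", "model_b", {"depth": 2, "accuracy": 3, "relevance": 3, "clarity": 4}),
    ("reviewer_2", "model_b", {"depth": 3, "accuracy": 3, "relevance": 2, "clarity": 3}),
]

if __name__ == "__main__":
    summary = aggregate(example_ratings)
    for model in rank(summary):
        print(model, summary[model])
```

Averaging per criterion before computing the overall score keeps each criterion visible on its own, which matters when models differ in, say, clarity but not accuracy.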

Claude 3.5 Sonnet emerged as the highest scoring model, followed by Gemini, with notable variability among the other models. Additionally, retrieval-augmented generation (RAG) techniques were applied to improve LLM performance, and prompts were refined to improve answers. The results indicate that although LLMs such as Claude 3.5 Sonnet have potential for scientific tasks, other models may require more development or additional prompt engineering to reach comparable accuracy. Reviewers’ perceptions of artificial intelligence (AI) utility and trustworthiness showed a positive shift after evaluation. However, ethical concerns, particularly with respect to transparency and disclosure, remained consistent.

The study highlights the need for structured frameworks for evaluating LLMs and ethical considerations essential for responsible AI integration in scientific research. These findings should be interpreted with caution, as the limited sample size and domain-specific focus of the exam questions restrict the generalizability of the results.

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12547693/full.md

---
Source: https://tomesphere.com/paper/PMC12547693