Understanding LLM Scientific Reasoning through Promptings and Model's Explanation on the Answers

Alice Rueda; Mohammed S. Hassan; Argyrios Perivolaris; Bazen G. Teferra; Reza Samavi; Sirisha Rambhatla; Yuqi Wu; Yanbo Zhang; Bo Cao; Divya Sharma; Sridhar Krishnan; Venkat Bhat

arXiv:2505.01482 · cs.AI · July 28, 2025

TL;DR

This paper evaluates the scientific reasoning abilities of GPT-4o using various prompt engineering techniques on the GPQA dataset, revealing strengths in pattern recognition and highlighting areas for improvement in logical inference.

Contribution

It systematically compares multiple prompt engineering methods to assess LLM reasoning, proposing future research directions for enhancing logical inference in AI.

Findings

01. Self-consistency achieved 52.99% accuracy

02. Simple prompt techniques perform best in reasoning tasks

03. LLMs rely more on pattern recognition than true logic

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and problem-solving across various domains. However, their ability to perform complex, multi-step reasoning tasks, which are essential for applications in science, medicine, and law, remains an area of active investigation. This paper examines the reasoning capabilities of contemporary LLMs, analyzing their strengths, limitations, and potential for improvement. The study uses prompt engineering techniques on the Graduate-Level Google-Proof Q&A (GPQA) dataset to assess the scientific reasoning of GPT-4o. Five popular prompt engineering techniques and two tailored prompting methods were tested: baseline direct answer (zero-shot), chain-of-thought (CoT), zero-shot CoT, self-ask, self-consistency, decomposition, and multipath prompting. Our findings indicate that while LLMs exhibit emergent…
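For readers unfamiliar with self-consistency prompting, the following is a minimal sketch of the general idea: sample several answers to the same multiple-choice question and keep the majority vote. It is not the authors' implementation. The names `sample_answer`, `self_consistency`, and `accuracy`, along with the default of five samples, are assumptions made for illustration, and `sample_answer` stands in for whatever model call returns an answer letter.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(
    sample_answer: Callable[[str], str],  # hypothetical stand-in for a model call
    question: str,
    n_samples: int = 5,  # assumed value, not taken from the paper
) -> str:
    """Sample several answers to one question and return the most frequent."""
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    # most_common(1) returns a list holding the single (answer, count) pair
    return votes.most_common(1)[0][0]


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold answer letters."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

In use, `sample_answer` would wrap a sampled (non-greedy) model call so that repeated calls can disagree; the vote then smooths over individual reasoning paths that go wrong.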

Taxonomy

Topics: Semantic Web and Ontologies