SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation
Debarshi Kundu

TL;DR
This paper introduces SciFaultyQA, a benchmark dataset of intentionally faulty science questions, and proposes a GAN-inspired synthetic data generation method to evaluate and improve large language models' ability to detect flawed questions.
Contribution
It presents a new dataset and a GAN-inspired synthetic data generation approach for benchmarking LLMs on faulty science question detection, along with methods to reduce detection errors.
Findings
LLMs often fail to recognize faulty questions, providing nonsensical answers.
The proposed synthetic dataset improves evaluation of LLMs' fault detection capabilities.
Novel approaches can reduce errors in identifying flawed questions.
Abstract
Consider the problem: "If one man and one woman can produce one child in one year, how many children will be produced by one woman and three men in 0.5 years?" Current large language models (LLMs) such as GPT-4o, GPT-o1-preview, and Gemini Flash frequently answer "0.5," which does not make sense. While these models sometimes acknowledge the unrealistic nature of the question, in many cases (8 out of 10 trials), they provide the nonsensical answer of "0.5 child." Additionally, temporal variation has been observed: if an LLM answers correctly once (by recognizing the faulty nature of the question), subsequent responses are more likely to also reflect this understanding. However, this is inconsistent. These types of questions have motivated us to develop a dataset of science questions, SciFaultyQA, where the questions themselves are intentionally faulty. We observed that LLMs often…
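The "8 out of 10 trials" observation in the abstract can be read as a per-question fault detection rate over repeated queries. The following is a minimal sketch of that measurement, not the paper's evaluation code: the keyword list `FAULT_MARKERS`, the helpers `flags_fault`, `fault_detection_rate`, and `make_stub_model`, and the canned responses are all hypothetical, and a real evaluation would call an actual LLM and likely use a more robust judge than keyword matching.

```python
from typing import Callable, List

# Hypothetical markers: phrases taken here as a sign that a response
# recognized the question as faulty. This list is an assumption.
FAULT_MARKERS = (
    "unrealistic",
    "does not make sense",
    "cannot",
    "not possible",
    "faulty",
    "flawed",
)


def flags_fault(response: str) -> bool:
    """Return True if the response contains any fault marker."""
    text = response.lower()
    return any(marker in text for marker in FAULT_MARKERS)


def fault_detection_rate(
    ask: Callable[[str], str], question: str, trials: int = 10
) -> float:
    """Ask the same question `trials` times and return the fraction of
    responses that flag the question as faulty."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    responses = [ask(question) for _ in range(trials)]
    detected = sum(flags_fault(r) for r in responses)
    return detected / trials


def make_stub_model(canned: List[str]) -> Callable[[str], str]:
    """Build a stand-in for an LLM that replays canned responses in order."""
    replies = iter(canned)

    def ask(question: str) -> str:
        return next(replies)

    return ask


if __name__ == "__main__":
    question = (
        "If one man and one woman can produce one child in one year, "
        "how many children will be produced by one woman and three men "
        "in 0.5 years?"
    )
    # Illustrative data only: 8 nonsensical answers, 2 that flag the fault.
    canned = ["0.5 child."] * 8 + ["This question is unrealistic."] * 2
    print(fault_detection_rate(make_stub_model(canned), question))  # prints 0.2
```

With these made-up responses the detection rate is 0.2, matching a case where the nonsensical answer appears in 8 of 10 trials. Because the abstract reports that later responses depend on earlier ones, trials run within one conversation and trials run in fresh sessions would need to be tracked separately.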
Taxonomy
Topics: Software Engineering Research · Topic Modeling
