SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation

Debarshi Kundu

arXiv:2412.11988·cs.CL·December 17, 2024

TL;DR

This paper introduces SciFaultyQA, a benchmark dataset of intentionally faulty science questions, and proposes a GAN-inspired synthetic data generation method to evaluate and improve large language models' ability to detect flawed questions.

Contribution

It presents a new dataset and a GAN-inspired synthetic data generation approach for benchmarking LLMs on faulty science question detection, along with methods to reduce detection errors.

Findings

1. LLMs often fail to recognize faulty questions, providing nonsensical answers.
2. The proposed synthetic dataset improves evaluation of LLMs' fault detection capabilities.
3. Novel approaches can reduce errors in identifying flawed questions.

Abstract

Consider the problem: "If one man and one woman can produce one child in one year, how many children will be produced by one woman and three men in 0.5 years?" Current large language models (LLMs) such as GPT-4o, GPT-o1-preview, and Gemini Flash frequently answer "0.5," which does not make sense. While these models sometimes acknowledge the unrealistic nature of the question, in many cases (8 out of 10 trials), they provide the nonsensical answer of "0.5 child." Additionally, temporal variation has been observed: if an LLM answers correctly once (by recognizing the faulty nature of the question), subsequent responses are more likely to also reflect this understanding. However, this is inconsistent. These types of questions have motivated us to develop a dataset of science questions, SciFaultyQA, where the questions themselves are intentionally faulty. We observed that LLMs often…
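The repeated-trial observation in the abstract (a nonsensical answer in 8 out of 10 trials) suggests a simple per-question metric: the fraction of responses that flag the question as faulty. The sketch below is one possible way to compute it, not the paper's evaluation code. The cue phrases, the function names `flags_fault` and `fault_detection_rate`, and the sample responses are all hypothetical and chosen only to mirror the proportion quoted in the abstract.

```python
from typing import Iterable

# Hypothetical cue phrases (an assumption, not from the paper). A real study
# would likely need human annotation or a separate grader model instead.
FAULT_CUES = (
    "does not make sense",
    "not possible",
    "unrealistic",
    "flawed",
    "cannot be answered",
)


def flags_fault(response: str) -> bool:
    """Return True if the response appears to reject the question's premise."""
    text = response.lower()
    return any(cue in text for cue in FAULT_CUES)


def fault_detection_rate(responses: Iterable[str]) -> float:
    """Fraction of responses that flag the question as faulty."""
    responses = list(responses)
    if not responses:
        raise ValueError("need at least one response")
    return sum(flags_fault(r) for r in responses) / len(responses)


# Made-up responses for illustration only (not data from the paper):
# 8 of 10 give the nonsensical answer, 2 of 10 reject the premise.
sample_responses = ["The answer is 0.5 child."] * 8 + [
    "This question is flawed: adding men does not change gestation time.",
    "The premise is unrealistic, so the question cannot be answered as posed.",
]

rate = fault_detection_rate(sample_responses)
print(f"fault detection rate: {rate:.1f}")
```

Keyword matching is a deliberately crude stand-in for a judge. It will miss rejections phrased differently and can misfire on answers that merely mention one of the cues, so any real measurement would need a stronger classifier.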

Code & Models

Repositories

debarshikundupsu/scifaultyqa (Official)

Taxonomy

Topics: Software Engineering Research · Topic Modeling