Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing

Humam Khan; Md Tabrez Nafis; Shahab Saquib Sohail; Aqeel Khalique; Rehan Hasan Khan

arXiv:2605.04171·cs.CL·May 7, 2026

TL;DR

This study evaluates the hallucination tendencies of four large language models (ChatGPT, Grok, Gemini, and Copilot) in academic writing tasks, introduces a weighted metric called the Hallucination Index (HI), and analyzes how task type and prompting influence factual accuracy.

Contribution

It systematically compares LLMs on academic tasks, introduces the Hallucination Index, and highlights factors affecting hallucination beyond model architecture.

Findings

01

Grok and Copilot excel in reference generation but struggle with abstract prompts.

02

Gemini and ChatGPT maintain better tone control but have higher hallucination risks.

03

Hallucination behavior varies with task type and prompting conditions, not just model architecture.

Abstract

Large language models (LLMs) show extraordinary abilities, but they are still prone to hallucinations, especially when used to generate academic content. We investigated four popular LLMs (ChatGPT, Grok, Gemini, and Copilot) for hallucinations in academic writing. We designed 80 prompts across four categories: reference generation, factual explanation, abstract generation, and writing improvement. We evaluated the models using a 0-5 rubric score that checks factual accuracy, reference validity, coherence, style consistency, and academic tone. A novel weighted metric, the Hallucination Index (HI), was introduced to measure hallucination in the responses generated by the models. We found that Grok and Copilot perform better on…
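The abstract names the rubric criteria and says the Hallucination Index is weighted, but this page does not give the formula. The sketch below shows one plausible shape for such a metric, purely as an illustration: the weights, the normalization to a 0-1 range, the "higher means more hallucination" convention, and the names `ASSUMED_WEIGHTS` and `hallucination_index` are all assumptions, not the authors' definition.

```python
from typing import Mapping

# The five rubric criteria listed in the abstract, each scored 0-5.
CRITERIA = (
    "factual_accuracy",
    "reference_validity",
    "coherence",
    "style_consistency",
    "academic_tone",
)

MAX_SCORE = 5.0

# ASSUMPTION: illustrative weights only. The paper's actual weights are not
# stated on this page. Factual criteria are weighted more heavily here.
ASSUMED_WEIGHTS = {
    "factual_accuracy": 0.35,
    "reference_validity": 0.35,
    "coherence": 0.10,
    "style_consistency": 0.10,
    "academic_tone": 0.10,
}


def hallucination_index(
    scores: Mapping[str, float],
    weights: Mapping[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Return a hypothetical index in [0, 1]; higher means more hallucination.

    ASSUMPTION: the index is one minus the weighted, normalized rubric score.
    """
    for criterion in CRITERIA:
        if criterion not in scores:
            raise ValueError(f"missing rubric score: {criterion}")
        if not 0 <= scores[criterion] <= MAX_SCORE:
            raise ValueError(f"score out of range 0-5: {criterion}")

    total_weight = sum(weights[c] for c in CRITERIA)
    weighted_quality = sum(weights[c] * scores[c] for c in CRITERIA)
    return 1.0 - weighted_quality / (total_weight * MAX_SCORE)


if __name__ == "__main__":
    # A made-up response: fluent and well-toned, but with weak references.
    example = {
        "factual_accuracy": 3,
        "reference_validity": 1,
        "coherence": 5,
        "style_consistency": 5,
        "academic_tone": 5,
    }
    print(round(hallucination_index(example), 3))
```

Under these assumed weights, a response can score perfectly on coherence, style, and tone and still receive a high index if its references are invalid, which matches the paper's point that fluency and factuality are separate.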
