ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature
Aarush Sinha, Viraj Virk, Dipshikha Chakraborty, P.S. Sreeja

TL;DR
This paper introduces ArxEval, a pipeline that measures how often language models hallucinate when generating responses about scientific literature, using ArXiv as the source repository to compare model reliability.
Contribution
It presents a novel evaluation pipeline with two tasks, Jumbled Titles and Mixed Titles, to measure hallucination in language models handling scientific texts.
Findings
Fifteen language models were evaluated for hallucination frequency.
The pipeline provides comparative insights into model reliability.
Results highlight varying levels of factual accuracy among models.
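The comparison described in these findings comes down to a per-model hallucination frequency. The following is a minimal sketch of that bookkeeping, not the authors' implementation: the record format, the function names `hallucination_rates` and `rank_models`, and the toy labels are all hypothetical, and it assumes each response has already been judged as hallucinated or not by some separate step.

```python
from collections import defaultdict


def hallucination_rates(records):
    """Return {model: fraction of responses flagged as hallucinated}.

    `records` is an iterable of (model_name, is_hallucinated) pairs.
    This format is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for model, is_hallucinated in records:
        totals[model] += 1
        if is_hallucinated:
            flagged[model] += 1
    return {model: flagged[model] / totals[model] for model in totals}


def rank_models(rates):
    """Sort models from lowest to highest hallucination rate."""
    return sorted(rates.items(), key=lambda item: item[1])


# Toy labels, invented purely to exercise the functions (not paper results).
toy_records = [
    ("model_a", True),
    ("model_a", False),
    ("model_b", False),
    ("model_b", False),
]
rates = hallucination_rates(toy_records)
ranking = rank_models(rates)
```

Ranking by a single rate is the simplest possible comparison; a real evaluation would likely report rates per task and account for differences in sample size.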
Abstract
Language Models (LMs) now play an increasingly large role in information generation and synthesis, so the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination: generating plausible but false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in any domain that requires a high level of factual correctness, such as academia and education. This work presents a pipeline for evaluating how frequently language models hallucinate when generating responses about scientific literature. We propose ArxEval, an evaluation pipeline with two tasks using ArXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation includes fifteen widely used language models and provides comparative insights into their reliability in…
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies
