ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature

Aarush Sinha; Viraj Virk; Dipshikha Chakraborty; P.S. Sreeja

arXiv:2501.10483·cs.CL·January 23, 2025

Open Access

TL;DR

This paper introduces ArxEval, a pipeline for measuring how often language models hallucinate when generating responses about scientific literature, using ArXiv as the data repository to compare model reliability.

Contribution

The paper presents an evaluation pipeline with two tasks, Jumbled Titles and Mixed Titles, for measuring hallucination in language models that handle scientific text.

Findings

01. Fifteen language models were evaluated for hallucination frequency.

02. The pipeline provides comparative insights into model reliability.

03. Results highlight varying levels of factual accuracy among models.

Abstract

Language models (LMs) now play an increasingly large role in information generation and synthesis, so the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination: generating plausible but false information, including invented citations and nonexistent research papers. Such inaccuracy is dangerous in any domain that requires a high level of factual correctness, such as academia and education. This work presents a pipeline for evaluating how frequently language models hallucinate when generating responses about scientific literature. We propose ArxEval, an evaluation pipeline with two tasks that use ArXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation includes fifteen widely used language models and provides comparative insights into their reliability in…

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies