From Queries to Criteria: Understanding How Astronomers Evaluate LLMs
Alina Hyk, Kiera McCormick, Mian Zhong, Ioana Ciucă, Sanjib Sharma, John F Wu, J. E. G. Peek, Kartheik G. Iyer, Ziang Xiao, Anjalie Field

TL;DR
This paper investigates how astronomers evaluate LLMs by analyzing user interactions with an astronomy-focused retrieval-augmented generation bot, and uses the findings to recommend better evaluation benchmarks for scientific research applications.
Contribution
It provides a detailed understanding of user evaluation criteria and proposes concrete benchmarks tailored for assessing LLMs in astronomy.
Findings
Users' evaluations of the LLM depend on the type of question asked and on the quality of the response.
User criteria include accuracy, relevance, and scientific validity.
A sample benchmark for evaluating LLMs in astronomy is constructed from the paper's recommendations.
Abstract
There is growing interest in leveraging LLMs to aid in astronomy and other scientific research, but benchmarks for LLM evaluation in general have not kept pace with the increasingly diverse ways that real people evaluate and use these models. In this study, we seek to improve evaluation procedures by building an understanding of how users evaluate LLMs. We focus on a particular use case: an LLM-powered retrieval-augmented generation bot for engaging with astronomical literature, which we deployed via Slack. Our inductive coding of 368 queries to the bot over four weeks and our follow-up interviews with 11 astronomers reveal how humans evaluated this system, including the types of questions asked and the criteria for judging responses. We synthesize our findings into concrete recommendations for building better benchmarks, which we then employ in constructing a sample benchmark for…
