TL;DR
This paper introduces a reproducible evaluation framework for attributed information retrieval using large language models, addressing the challenges of open-ended queries and diverse answer attribution in information-seeking scenarios.
Contribution
It proposes a flexible benchmarking framework for attributed information seeking with various LLM architectures, enabling systematic evaluation of correctness and attribution.
Findings
Different architectural scenarios significantly affect answer correctness.
The framework can be applied with any backbone LLM.
Experiments highlight the impact of scenario choices on attribution quality.
Abstract
With the growing success of Large Language Models (LLMs) in information-seeking scenarios, search engines are now adopting generative approaches to provide answers along with in-line citations as attribution. While existing work focuses mainly on attributed question answering, in this paper we target information-seeking scenarios, which are often more challenging due to the open-ended nature of the queries and the size of the label space in terms of the diversity of candidate-attributed answers per query. We propose a reproducible framework to evaluate and benchmark attributed information seeking, using any backbone LLM and different architectural designs: (1) Generate, (2) Retrieve then Generate, and (3) Generate then Retrieve. Experiments using HAGRID, an attributed information-seeking dataset, show the impact of different scenarios on both the correctness and attributability of…
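The three architectural designs named in the abstract differ mainly in where retrieval sits relative to generation. The following is a minimal sketch of that difference, not the paper's implementation: the function names, the prompt wording, the toy corpus, and the stub `toy_llm` and `toy_retriever` are all hypothetical placeholders, and any backbone LLM or retriever with the same call shape could be swapped in.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical interfaces: an LLM maps a prompt to text, and a retriever
# maps (text, k) to the k best (doc_id, passage) pairs.
LLM = Callable[[str], str]
Retriever = Callable[[str, int], List[Tuple[str, str]]]

def generate(query: str, llm: LLM) -> str:
    """(1) Generate: the LLM answers from its own parameters, no retrieval."""
    return llm(query)

def retrieve_then_generate(query: str, llm: LLM, retriever: Retriever,
                           k: int = 2) -> Tuple[str, List[str]]:
    """(2) Retrieve then Generate: passages are fetched first and placed in the prompt."""
    docs = retriever(query, k)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in docs)
    prompt = f"Answer using the passages and cite them by id.\n{context}\nQuestion: {query}"
    return llm(prompt), [doc_id for doc_id, _ in docs]

def generate_then_retrieve(query: str, llm: LLM, retriever: Retriever,
                           k: int = 1) -> Tuple[str, List[str]]:
    """(3) Generate then Retrieve: the answer is produced first, then used to find support."""
    answer = llm(query)
    docs = retriever(answer, k)
    return answer, [doc_id for doc_id, _ in docs]

# Toy stand-ins so the sketch runs without any model or index (assumptions).
CORPUS = {
    "d1": "Paris is the capital of France",
    "d2": "Berlin is the capital of Germany",
}

def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def toy_retriever(text: str, k: int) -> List[Tuple[str, str]]:
    # Rank passages by word overlap with the input; ties broken by doc id.
    ranked = sorted(CORPUS.items(),
                    key=lambda item: (-len(_words(text) & _words(item[1])), item[0]))
    return ranked[:k]

def toy_llm(prompt: str) -> str:
    # Stub that ignores the prompt; a real backbone LLM would go here.
    return "Paris is the capital of France"

if __name__ == "__main__":
    q = "What is the capital of France?"
    print(generate(q, toy_llm))
    print(retrieve_then_generate(q, toy_llm, toy_retriever))
    print(generate_then_retrieve(q, toy_llm, toy_retriever))
```

In this sketch the second design lets the cited passages shape the answer, while the third attaches citations after the fact, which is one plausible reason the choice of scenario could influence both correctness and attributability.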
