Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models

Bernd Bohnet; Vinh Q. Tran; Pat Verga; Roee Aharoni; Daniel Andor; Livio Baldini Soares; Massimiliano Ciaramita; Jacob Eisenstein; Kuzman Ganchev; Jonathan Herzig; Kai Hui; Tom Kwiatkowski; Ji Ma; Jianmo Ni; Lierni Sestorain Saralegui; Tal Schuster; William W. Cohen; Michael Collins; Dipanjan Das; Donald Metzler; Slav Petrov; Kellie Webster

arXiv:2212.08037·cs.CL·February 14, 2023·26 cites

TL;DR

This paper introduces Attributed Question Answering, proposing an evaluation framework and benchmarking architectures to assess how well large language models can attribute generated text, which is crucial for information-seeking applications.

Contribution

It formulates Attributed QA as a new task, develops a reproducible evaluation framework, and benchmarks various architectures for attribution capabilities.

Findings

01

An automatic metric that correlates with human annotations, which serve as the gold standard, is suitable for measuring attribution during development.

02

Current state-of-the-art methods show varying performance on attribution tasks.

03

The experiments give some hints on how to build LLMs with attribution.

Abstract

Large language models (LLMs) have shown impressive results while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial in this setting. We formulate and study Attributed QA as a key first step in the development of attributed LLMs. We propose a reproducible evaluation framework for the task and benchmark a broad set of architectures. We take human annotations as a gold standard and show that a correlated automatic metric is suitable for development. Our experimental work gives concrete answers to two key questions (How to measure attribution? and How well do current state-of-the-art methods perform on attribution?), and gives some hints as to how to address a third (How to build LLMs with attribution?).
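The abstract's claim that an automatic metric correlates with human annotations can be illustrated with a small sketch. The code below is not the paper's evaluation code: the system names, the binary per-example judgments, and the choice of system-level Pearson correlation are all assumptions made for illustration, and the numbers are toy values, not results from the paper.

```python
from math import sqrt


def system_scores(judgments):
    """Average binary per-example judgments (1 = attributed, 0 = not) per system."""
    return {name: sum(labels) / len(labels) for name, labels in judgments.items()}


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy judgments for three made-up systems (NOT data from the paper).
human = {
    "system_a": [1, 1, 0, 1],
    "system_b": [0, 1, 0, 0],
    "system_c": [1, 1, 1, 1],
}
automatic = {
    "system_a": [1, 0, 0, 1],
    "system_b": [0, 0, 0, 1],
    "system_c": [1, 1, 1, 0],
}

human_scores = system_scores(human)
auto_scores = system_scores(automatic)

# Compare the two metrics over the same systems, in the same order.
names = sorted(human_scores)
r = pearson([human_scores[n] for n in names], [auto_scores[n] for n in names])
print(f"system-level Pearson correlation: {r:.3f}")
```

A high correlation on real annotations would support using the automatic metric during development, with human annotation kept as the final reference.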

Code & Models

Repositories

google-research-datasets/attributed-qa
tf · Official

Datasets

osunlp/AttributionBench
dataset · 882 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)