Measuring Large Language Models Capacity to Annotate Journalistic Sourcing
Subramaniam Vincent, Phoebe Wang, Zhan Shi, Sahas Koka, Yi Fang

TL;DR
This paper proposes a benchmark to evaluate large language models' ability to identify and annotate sourcing in journalistic stories, highlighting current limitations and the need for improved transparency in AI-assisted journalism.
Contribution
It introduces a novel scenario, dataset, and metrics for assessing LLMs on sourcing annotation in journalism, addressing a gap in existing benchmarks.
Findings
LLMs currently struggle to identify all sourced statements.
Matching source types remains challenging for LLMs.
Spotting source justifications is particularly difficult.
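The three findings above correspond to three things a scorer could check: whether sourced statements were found at all, whether the source type matches, and whether a justification was spotted. The following is a minimal sketch of such a comparison against human annotations, not the paper's actual metrics, which this summary does not specify. The `Annotation` fields, the `score_annotations` function, and the example source-type labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    # All fields are illustrative assumptions, not the paper's schema.
    statement_id: int        # index of a sourced statement in the story
    source_type: str         # e.g. "named person", "document", "anonymous"
    has_justification: bool  # whether a justification for the source was marked


def score_annotations(gold, predicted):
    """Compare model annotations with human (gold) annotations for one story."""
    gold_by_id = {a.statement_id: a for a in gold}
    pred_by_id = {a.statement_id: a for a in predicted}
    matched = gold_by_id.keys() & pred_by_id.keys()

    # Finding 1: did the model identify the sourced statements?
    recall = len(matched) / len(gold_by_id) if gold_by_id else 0.0
    precision = len(matched) / len(pred_by_id) if pred_by_id else 0.0

    # Findings 2 and 3: on statements both sides marked, do the labels agree?
    def agreement(field):
        if not matched:
            return 0.0
        hits = sum(
            getattr(gold_by_id[i], field) == getattr(pred_by_id[i], field)
            for i in matched
        )
        return hits / len(matched)

    return {
        "statement_recall": recall,
        "statement_precision": precision,
        "source_type_accuracy": agreement("source_type"),
        "justification_accuracy": agreement("has_justification"),
    }


gold = [
    Annotation(1, "named person", True),
    Annotation(2, "document", False),
    Annotation(3, "anonymous", True),
    Annotation(4, "named person", False),
]
predicted = [
    Annotation(1, "named person", True),
    Annotation(2, "anonymous", True),
    Annotation(5, "document", False),
]
result = score_annotations(gold, predicted)
```

Keeping statement identification separate from label agreement means a model that finds few sourced statements is not also penalized on source type for the ones it missed.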
Abstract
Since the launch of ChatGPT in late 2022, the capacities of Large Language Models and their evaluation have been under constant discussion in both academic research and industry. Scenarios and benchmarks have been developed in several areas such as law, medicine and math (Bommasani et al., 2023), and model variants are evaluated continuously. One area that has not received sufficient scenario development attention is journalism, and in particular journalistic sourcing and ethics. Journalism is a crucial truth-determination function in democracy (Vincent, 2023), and sourcing is a crucial pillar of all original journalistic output. Evaluating the capacities of LLMs to annotate stories for the different signals of sourcing and how reporters justify them is a crucial scenario that warrants a benchmark approach. It offers potential to build automated systems to…
Taxonomy
Topics: Computational and Text Analysis Methods
Methods: Softmax · Attention Is All You Need
