Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
Hailey Onweller, Elias Lumer, Austin Huber, Pia Ramchandani, Vamse Kumar Subbiah, Corey Feld

TL;DR
This paper introduces a scalable framework for evaluating source attribution in LLM-generated reports, assessing link accessibility, relevance, and factual accuracy to address citation reliability issues.
Contribution
It presents the first comprehensive evaluation framework using an AST parser and LLM-based judges to systematically assess citation validity in LLM outputs.
Findings
Frontier models achieve over 94% link validity and 80% relevance, but only 39-77% factual accuracy.
Fewer than half of open-source models generate cited reports successfully in one shot.
Fact Check accuracy decreases by 42% as retrieval calls increase from 2 to 150.
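As a rough illustration of how per-dimension figures like those above could be computed, here is a minimal sketch that averages judge verdicts across citations. The `JudgedCitation` record, the `pass_rates` helper, and the choice to score every dimension over all citations (rather than only those with a working link) are assumptions made for this example, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class JudgedCitation:
    """Hypothetical record of one citation's verdicts on the three dimensions."""
    link_works: bool
    relevant_content: bool
    fact_check: bool


def pass_rates(judged: list[JudgedCitation]) -> dict[str, float]:
    """Return the fraction of citations passing each dimension.

    Assumption: every dimension is averaged over all citations.
    """
    n = len(judged)
    if n == 0:
        return {}
    return {
        "link_works": sum(j.link_works for j in judged) / n,
        "relevant_content": sum(j.relevant_content for j in judged) / n,
        "fact_check": sum(j.fact_check for j in judged) / n,
    }


# Illustrative verdicts only; these are not results from the paper.
sample = [
    JudgedCitation(True, True, True),
    JudgedCitation(True, True, False),
    JudgedCitation(True, True, True),
    JudgedCitation(True, False, False),
]
rates = pass_rates(sample)
```

A real evaluation might instead condition the later dimensions on the link being reachable, which would change the denominators.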
Abstract
Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Unlike methods that verify claims in isolation, our framework closes the loop by retrieving the actual cited content, enabling human or model evaluators to judge each citation against its source. Citations are evaluated along three dimensions: (1) Link Works verifies URL accessibility, (2) Relevant Content measures topical…
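The extraction step can be sketched as follows. This is not the authors' AST parser: as a plainly simpler stand-in, it uses a regular expression to pull inline Markdown links out of a report and pairs each with the line it appears on. The `Citation` record, the `extract_citations` helper, and the one-claim-per-line convention are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Matches inline Markdown links of the form [label](http(s)://url).
# The negative lookbehind skips images, which are written ![alt](src).
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")


@dataclass
class Citation:
    """Hypothetical record for one inline citation."""
    label: str
    url: str
    claim: str  # text of the line carrying the citation, link markup removed


def extract_citations(markdown: str) -> list[Citation]:
    """Collect every inline link in the report, line by line."""
    citations = []
    for line in markdown.splitlines():
        for match in INLINE_LINK.finditer(line):
            # Replace each link with its label so the claim reads as plain text.
            claim = INLINE_LINK.sub(r"\1", line).strip()
            citations.append(Citation(match.group(1), match.group(2), claim))
    return citations


report = (
    "Solar capacity grew in 2023 [IEA](https://example.com/iea).\n"
    "![chart](https://example.com/c.png)\n"
    "No citation here."
)
cites = extract_citations(report)
```

Each extracted `Citation` could then be handed to the three checks: fetch `url` to test accessibility, and compare the retrieved page against `claim` for relevance and factual consistency. A regex misses reference-style links and nested brackets, which is one reason a real parser is the more robust choice.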
