ContextCite: Attributing Model Generation to Context
Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry

TL;DR
This paper introduces ContextCite, a scalable method for attributing language model outputs to specific parts of their input context, aiding in verification, response improvement, and attack detection.
Contribution
The paper presents a novel, scalable approach called ContextCite for identifying which parts of context influence model outputs, applicable to any language model.
Findings
Effective in verifying generated statements
Improves response quality by context pruning
Detects poisoning attacks
Abstract
How do language models use information provided as context when generating a response? Can we infer whether a particular generated statement is actually grounded in the context, a misinterpretation, or fabricated? To help answer these questions, we introduce the problem of context attribution: pinpointing the parts of the context (if any) that led a model to generate a particular statement. We then present ContextCite, a simple and scalable method for context attribution that can be applied on top of any existing language model. Finally, we showcase the utility of ContextCite through three applications: (1) helping verify generated statements (2) improving response quality by pruning the context and (3) detecting poisoning attacks. We provide code for ContextCite at https://github.com/MadryLab/context-cite.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsContext-Aware Activity Recognition Systems · Semantic Web and Ontologies
MethodsPruning
