LLMs Corrupt Your Documents When You Delegate
Philippe Laban, Tobias Schnabel, Jennifer Neville

TL;DR
This paper evaluates how reliably large language models handle delegated workflows across many professional domains. It finds that models often corrupt documents silently and severely over long interactions, and that even frontier models are affected.
Contribution
Introduces DELEGATE-52, a large-scale benchmark to assess LLMs' performance in long delegated workflows across 52 domains, highlighting their unreliability and error accumulation.
Findings
Frontier models corrupt an average of 25% of document content by the end of long workflows.
Agentic tool use does not improve document fidelity.
Degradation severity increases with document size and interaction length.
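To make a figure like "25% of document content" concrete, here is a minimal sketch of one way to estimate how much of a document was lost between the start and end of a workflow. It is an illustration only, not the metric used by DELEGATE-52: the function name `corrupted_fraction` and the line-level comparison are assumptions introduced here, built on Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def corrupted_fraction(original: str, final: str) -> float:
    """Return the share of the original's lines with no match in the final text.

    Hypothetical helper for illustration; not the paper's scoring method.
    """
    orig_lines = original.splitlines()
    final_lines = final.splitlines()
    if not orig_lines:
        # An empty original has nothing that could be corrupted.
        return 0.0
    # autojunk=False disables difflib's heuristic that ignores very frequent lines.
    matcher = SequenceMatcher(None, orig_lines, final_lines, autojunk=False)
    # Sum the sizes of all in-order matching runs of lines.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(orig_lines)


if __name__ == "__main__":
    before = "a\nb\nc\nd"
    after = "a\nb\nX\nd"  # one of the four lines was altered
    print(corrupted_fraction(before, after))
```

A line-level comparison like this is coarse: it treats a one-character change and a full rewrite of a line the same way, and it cannot judge whether a change was a legitimate requested edit. A domain-aware comparison would be needed for formats such as code or music notation.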
Abstract
Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve…
