AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X
Haiwen Li, Michiel A. Bakker

TL;DR
This study evaluates the real-world performance of an LLM-based fact-checking system deployed on X. It shows that the system can produce helpful, cross-partisan notes at scale, and it emphasizes the importance of platform-specific evaluation methods.
Contribution
First field evaluation of an LLM-driven fact-checking pipeline on a live social media platform, comparing its notes against human-written notes using real platform data.
Findings
LLM notes received more positive ratings than human notes across diverse raters.
LLM notes achieved higher helpfulness scores among raters evaluating all notes on a post.
The study highlights the importance of platform-aware evaluation for deploying AI fact-checkers.
Abstract
Large language models show promising capabilities for contextual fact-checking on social media: they can verify contested claims through deep research, synthesize evidence from multiple sources, and draft explanations at scale. However, prior work evaluates LLM fact-checking only in controlled settings using benchmarks or crowdworker judgments, leaving open how these systems perform in authentic platform environments. We present the first field evaluation of LLM-based fact-checking deployed on a live social media platform, testing performance directly through X Community Notes' AI writer feature over a three-month period. Our LLM writer, a multi-step pipeline that handles multimodal content (text, images, and videos), conducts web and platform-native search, and writes contextual notes, was deployed to write 1,614 notes on 1,597 tweets and compared against 1,332 human-written notes on…
