Evaluating Commercial AI Chatbots as News Intermediaries
Mirac Suzgun, Emily Shen, Federico Bianchi, Alexander Spangher, Thomas Icard, Daniel E. Ho, Dan Jurafsky, James Zou

TL;DR
This study systematically evaluates six AI chatbots' accuracy in reporting recent news facts across languages and regions, revealing high accuracy on straightforward questions but significant vulnerabilities to false premises and retrieval biases.
Contribution
It provides the first comprehensive, multi-region, multi-language assessment of commercial AI chatbots' news reporting capabilities and identifies key failure patterns and biases.
Findings
Best systems achieve over 90% accuracy on recent factual questions.
Retrieval failures, rather than reasoning failures, account for over 70% of errors.
Models are vulnerable to false premises, with accuracy dropping significantly.
Abstract
AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11-13% under free-response evaluation, and the loss is 16-17% across the full cohort. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs.…
