Accuracy and Political Bias of News Source Credibility Ratings by Large Language Models
Kai-Cheng Yang, Filippo Menczer

TL;DR
This study audits how nine large language models rate news source credibility, finding accuracy differences by model size, only moderate agreement with human experts, and a consistent liberal bias in political contexts.
Contribution
It provides a comprehensive audit of LLMs' news credibility ratings, highlighting biases and differences across model sizes and configurations.
Findings
Larger models more often refuse to rate sources.
Models agree highly among themselves (average Spearman's ρ=0.79).
Ratings only moderately align with human experts (average ρ=0.50).
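The two agreement figures above are Spearman rank correlations. As a rough illustration of how such numbers can be computed, here is a minimal sketch in plain Python. The ratings, model names, and the helper names (`average_ranks`, `spearman_on_shared`, `mean_pairwise_rho`) are hypothetical and not taken from the paper. Treating a refusal as `None` and dropping it from each pairwise comparison is also an assumption; the paper may handle refusals differently.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_on_shared(x, y):
    """Spearman's rho over sources that both raters actually rated.

    Assumption: a refusal is stored as None and the source is skipped.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys))


def mean_pairwise_rho(ratings):
    """Average Spearman's rho over all pairs of raters in a dict."""
    return mean(
        spearman_on_shared(ratings[a], ratings[b])
        for a, b in combinations(ratings, 2)
    )


# Hypothetical toy data: one credibility score per source, None = refusal.
ratings = {
    "model_a": [0.9, 0.8, 0.3, 0.1, None],
    "model_b": [0.8, 0.9, 0.4, 0.2, 0.6],
    "model_c": [1.0, 0.7, 0.2, 0.3, 0.5],
}
expert = [0.95, 0.85, 0.40, 0.05, 0.70]

print("model-model:", round(mean_pairwise_rho(ratings), 2))
print(
    "model-expert:",
    round(mean(spearman_on_shared(r, expert) for r in ratings.values()), 2),
)
```

Because the correlation is computed on ranks, it only measures whether two raters order the sources the same way. Two models can agree strongly with each other by this measure and still both differ from the expert ordering, which is the pattern the findings describe.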
Abstract
Search engines increasingly leverage large language models (LLMs) to generate direct answers, and AI chatbots now access the Internet for fresh data. As information curators for billions of users, LLMs must assess the accuracy and reliability of different sources. This paper audits nine widely used LLMs from three leading providers (OpenAI, Google, and Meta) to evaluate their ability to discern credible and high-quality information sources from low-credibility ones. We find that while LLMs can rate most tested news outlets, larger models more frequently refuse to provide ratings due to insufficient information, whereas smaller models are more prone to making errors in their ratings. For sources where ratings are provided, LLMs exhibit a high level of agreement among themselves (average Spearman's ρ=0.79), but their ratings align only moderately with human expert evaluations…
Taxonomy
Topics: Misinformation and Its Impacts · Topic Modeling · Explainable Artificial Intelligence (XAI)
