The SMeL Test: A simple benchmark for media literacy in language models

Gustaf Ahdritz; Anat Kleiman

arXiv:2508.02074·cs.CL·August 8, 2025

3 Reviews

TL;DR

The paper introduces the SMeL Test, a benchmark that evaluates language models' ability to filter out untrustworthy online content, revealing that current models have limited media literacy and are prone to hallucination.

Contribution

The paper presents the SMeL Test, a minimal benchmark for media literacy in LLMs, and evaluates a variety of models on it, highlighting their shortcomings in filtering out untrustworthy information.

Findings

01

No model consistently filters untrustworthy content.

02

Reasoning is associated with higher scores, but even the best API model tested hallucinates up to 70% of the time.

03

Larger models do not always outperform smaller ones.

Abstract

The internet is rife with unattributed, deliberately misleading, or otherwise untrustworthy content. Though large language models (LLMs) are often tasked with autonomous web browsing, the extent to which they have learned the simple heuristics human researchers use to navigate this noisy environment is not currently known. In this paper, we introduce the Synthetic Media Literacy Test (SMeL Test), a minimal benchmark that tests the ability of language models to actively filter out untrustworthy information in context. We benchmark a variety of commonly used instruction-tuned LLMs, including reasoning models, and find that no model consistently succeeds; while reasoning in particular is associated with higher scores, even the best API model we test hallucinates up to 70% of the time. Remarkably, larger and more capable models do not necessarily outperform their smaller counterparts. We…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- The topic is timely.
- The three elegantly proposed categories of evaluation tasks (ignoring dubious sources, resolving contradictions, and active filtering) are all meaningful approaches to misinformation detection.
- The data used are rich, ranging from Encyclopedia Britannica (academic) to Reddit (a casual internet community forum) to the least trustworthy source (i.e., "unknown").
- Well-presented, well-summarized results: the performances depending on the model scales, reaso

Weaknesses

The work is too heuristic and lacks the technical concepts needed to evaluate its validity.
- The work depends heavily on the data contents, and the data currently used do not seem to come with standardized methods for evaluating their validity, which makes the work hard to replicate.
- The technical depth of the work is unclear. It reads more like a blog post or a report analyzing model results.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

**Significant topic**: The overall problem this paper studies is relevant and contemporary. Frontier models often rely on search, and the internet is becoming increasingly cluttered with untrustworthy data. Hence, understanding how well LLMs handle sources of different trustworthiness helps users understand risks and potentially fix issues.

**Useful and rigorous datasets**: The dataset generation procedures (both the synthetic benchmark and the real-world news articles) seem to be done rigo

Weaknesses

**Active filtering task is too limited**: The Section 2 description of the "active filtering" task mentions filtering between many sources. This is (in my opinion) the most interesting task, because it corresponds to "deep research", which is most affected by untrustworthy sources. However, later (Section 3), it becomes clear that this task only uses two documents. I think only using pairs of documents is highly restrictive and does not serve as a proxy for real-world performance of "deep resear

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The proposed topic is important and timely
- The paper is well-written and the benchmark seems relatively well-designed
- A plethora of models are evaluated and there are numerous ablations

Weaknesses

I think the paper is strong, but these are the minor weaknesses I see.
1. Some of the tasks are relatively arbitrary (e.g., ignoring dubious sources requires the model to ignore the sources without really asking the model to do so). In many cases, I expect the model will need to be clearly told to ignore those sources. However, other sections of the eval do have this, and I think this is more compelling.
2. I think a general problem for this area of research is that whether some sources are
