Dataset of News Articles with Provenance Metadata for Media Relevance Assessment

Tomas Peterka; Matyas Bohacek

arXiv:2506.09847 · cs.CL · June 12, 2025

Open Access

TL;DR

This paper introduces the News Media Provenance Dataset, a resource for assessing media relevance by analyzing image provenance in news articles, and evaluates baseline models on tasks of origin relevance detection.

Contribution

The paper presents a new dataset with provenance metadata for news images and formulates two tasks to evaluate media relevance, highlighting current model strengths and limitations.

Findings

01. Zero-shot performance on location of origin relevance (LOR) is promising.

02. Performance on date and time of origin relevance (DTOR) is limited.

03. There is room for specialized architectures to improve results.

Abstract

Out-of-context and misattributed imagery is the leading form of media manipulation in today's misinformation and disinformation landscape. Existing methods for detecting this practice often consider only whether the semantics of the imagery correspond to the text narrative, so they miss manipulation whenever the depicted objects or scenes loosely match the narrative at hand. To tackle this, we introduce the News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We find that, while zero-shot performance on LOR is promising, performance on DTOR lags behind, leaving room for specialized architectures and future work.
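To make the two tasks concrete, the following is a minimal sketch of how a zero-shot LOR or DTOR query might be posed to an LLM and scored. It is not the authors' pipeline. The record layout, field names, prompt wording, and the binary yes/no label are all assumptions made for illustration, and the call to the model itself is left out.

```python
from dataclasses import dataclass


@dataclass
class ProvenanceExample:
    """Hypothetical record layout; field names are assumptions, not the dataset's schema."""
    article_text: str
    image_location: str   # where the image was captured, per its provenance tag
    image_datetime: str   # when the image was captured, per its provenance tag
    lor_label: bool       # assumed binary label: is the location of origin relevant?
    dtor_label: bool      # assumed binary label: is the date/time of origin relevant?


def build_prompt(example: ProvenanceExample, task: str) -> str:
    """Build a zero-shot yes/no prompt for either task ("LOR" or "DTOR")."""
    if task == "LOR":
        provenance = f"Image location of origin: {example.image_location}"
        question = "Is the image's location of origin relevant to the article?"
    elif task == "DTOR":
        provenance = f"Image date and time of origin: {example.image_datetime}"
        question = "Is the image's date and time of origin relevant to the article?"
    else:
        raise ValueError(f"unknown task: {task}")
    return (
        f"Article:\n{example.article_text}\n\n"
        f"{provenance}\n\n"
        f"{question} Answer yes or no."
    )


def parse_answer(response: str) -> bool:
    """Map a free-text response to a boolean; anything not starting with 'yes' counts as no."""
    return response.strip().lower().startswith("yes")


def accuracy(predictions: list[bool], labels: list[bool]) -> float:
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

In use, each prompt would be sent to a model, the reply passed through `parse_answer`, and the results compared against `lor_label` or `dtor_label` with `accuracy`. Framing both tasks with the same prompt shape keeps the comparison between LOR and DTOR limited to the provenance field being judged.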

Taxonomy

Topics: Misinformation and Its Impacts · Digital Media Forensic Detection · Authorship Attribution and Profiling