CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English

Daryna Dementieva; Evgeniya Sukhodolskaya; Alexander Fraser

arXiv:2510.19628 · cs.CL · October 23, 2025

TL;DR

This paper introduces CrossNews-UA, a scalable cross-lingual news similarity benchmark for Ukrainian, Polish, Russian, and English, with detailed annotations to improve fake news detection across languages.

Contribution

It presents a novel crowdsourcing pipeline for creating a multilingual news similarity dataset and evaluates various models on this benchmark.

Findings

01. Transformer-based models show promising performance
02. Traditional models struggle with multilingual semantic similarity
03. The dataset enables better cross-lingual fake news detection

Abstract

In the era of social networks and rapid misinformation spread, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison offers a promising way to verify information by drawing on external sources in different languages (Chen and Shu, 2024). However, existing datasets for cross-lingual news analysis (Chen et al., 2022a) were manually curated by journalists and experts, which limits their scalability and adaptability to new languages. In this work, we address this gap by introducing a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment. Using this pipeline, we collected a novel dataset, CrossNews-UA, of news pairs with Ukrainian as the central language, paired with linguistically and contextually relevant languages: Polish, Russian, and English.…

Code & Models

Datasets

ukr-detect/crossnews-ua · dataset · 38 downloads

Taxonomy

Topics: Misinformation and Its Impacts · Text Readability and Simplification · Topic Modeling