When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification

Hanna Shcharbakova; Tatiana Anikina; Natalia Skachkova; Josef van Genabith

arXiv:2507.20700 · cs.CL · July 29, 2025

TL;DR

This study evaluates five language models on multilingual claim verification across 25 languages and finds that the smaller XLM-R outperforms much larger LLMs on fine-grained, seven-class fact-checking, with significant implications for deploying effective fact-checking systems.

Contribution

The paper provides a comprehensive comparison showing that smaller models such as XLM-R outperform larger LLMs in multilingual fact verification, establishes new benchmarks, and highlights challenges in evidence utilization and bias.

Findings

01 XLM-R achieves 57.7% macro-F1, outperforming all tested LLMs.

02 The tested LLMs (7-12B parameters) show limited effectiveness in fine-grained multilingual verification.

03 The analysis identifies biases and evidence-utilization issues in LLMs.

Abstract

The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (encoder-based XLM-R and mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best…
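Since the headline result is reported in macro-F1, a minimal sketch of how that metric is computed may help in reading the numbers. This is not the authors' evaluation code: the function name `macro_f1` and the toy labels are illustrative assumptions, and the sketch averages over every label that appears in either the gold or the predicted list.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over labels seen in gold or pred.

    Illustrative helper (hypothetical name), not the paper's code.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # class p was predicted wrongly
            fn[g] += 1   # class g was missed

    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (assumed for illustration, not the X-Fact label set).
gold = ["true", "true", "true", "false"]
pred = ["true", "true", "true", "true"]   # always predicts the majority class
score = macro_f1(gold, pred)              # well below the 75% accuracy
```

Because every class contributes equally to the average, a model that ignores rare veracity categories is penalized even when its accuracy looks high, which is why macro-F1 suits a seven-class scheme with uneven label frequencies.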
