Truth Knows No Language: Evaluating Truthfulness Beyond English

Blanca Calvo Figueras; Eneko Sagarzazu; Julen Etxaniz; Jeremy Barnes; Pablo Gamallo; Iria de-Dios-Flores; Rodrigo Agerri

arXiv:2502.09387·cs.CL·January 15, 2026

TL;DR

This paper extends the TruthfulQA benchmark to Basque, Catalan, Galician, and Spanish and evaluates LLM truthfulness across languages. It finds that translation and language resource level influence performance, and that LLM-as-a-Judge aligns more closely with human judgments than multiple-choice metrics do.

Contribution

It introduces a multilingual truthfulness benchmark, compares LLMs across languages, and demonstrates the effectiveness of machine translation for extending evaluations.

Findings

01

LLMs perform best in English and worst in Basque, the lowest-resourced language

02

LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics do

03

Machine translation is a viable method for multilingual benchmarking

Abstract

We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness…
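The comparison between automatic metrics and human judgments described in the abstract can be illustrated with a small sketch. It computes a Pearson correlation between human truthfulness labels and two automatic scores. The function name `pearson`, the variable names, and all numeric values are assumptions made for illustration; they are toy inputs, not data or results from the paper, and the paper may use a different agreement statistic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy values invented for illustration only; NOT results from the paper.
human = [1, 0, 1, 1, 0, 0]                  # 1 = a human judged the answer truthful
judge = [1, 0, 1, 0, 0, 0]                  # hypothetical LLM-as-a-Judge verdicts
mc_score = [0.6, 0.5, 0.4, 0.7, 0.6, 0.3]   # hypothetical multiple-choice scores

r_judge = pearson(human, judge)
r_mc = pearson(human, mc_score)
print(f"judge vs. human: {r_judge:.3f}")
print(f"multiple-choice vs. human: {r_mc:.3f}")
```

Under this setup, the metric with the higher correlation tracks human judgments more closely on the sample being compared.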

Code & Models

Repositories

hitz-zentroa/truthfulqa-multi (official)

Videos

Truth Knows No Language: Evaluating Truthfulness Beyond English · Underline
