Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation

Xinwei Wu; Heng Liu; Jiang Zhou; Xiaohu Zhao; Linlong Xu; Longyue Wang; Weihua Luo; Kaifu Zhang

arXiv:2510.24073·cs.CL·October 29, 2025

TL;DR

This paper introduces a new diagnostic framework and benchmark, HalloMTBench, to systematically identify and analyze hallucination failures in multilingual large language models during translation tasks.

Contribution

It proposes a taxonomy that separates Instruction Detachment from Source Detachment, builds a human-verified multilingual benchmark covering 11 English-to-X translation directions, and evaluates 17 LLMs to uncover failure patterns.

Findings

01

Identified distinct hallucination triggers related to model scale and linguistic biases

02

Developed a high-quality dataset with 5,435 instances for evaluation

03

Revealed that RL can amplify language mixing issues in LLMs

Abstract

Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Existing MT benchmarks, however, do not expose these failures in multilingual LLMs. To surface hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We employ 4 frontier LLMs to generate candidates, then scrutinize these candidates with an ensemble of LLM judges and expert validation. In this way, we curate 5,435 high-quality instances. We evaluate 17 LLMs on HalloMTBench. Results reveal distinct "hallucination triggers": unique failure patterns reflecting model scale, source length sensitivity, linguistic biases, and Reinforcement-Learning…
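As a rough illustration of how the two top-level taxonomy labels could be used when tallying benchmark results, the sketch below stores one label per instance and reports each category's share. The record layout (`Instance` and its field names), the helper functions, and the toy rows are all assumptions made for illustration; they are not the dataset's actual schema or the authors' evaluation code, and the sketch does not attempt to define what each category covers.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class HallucinationType(Enum):
    # The two top-level categories named in the abstract.
    INSTRUCTION_DETACHMENT = "instruction_detachment"
    SOURCE_DETACHMENT = "source_detachment"


@dataclass(frozen=True)
class Instance:
    # Hypothetical record layout; the real dataset schema may differ.
    target_lang: str
    source: str
    translation: str
    label: HallucinationType


def count_by_type(instances):
    """Count instances per hallucination type, including zero counts."""
    counts = Counter(inst.label for inst in instances)
    return {t: counts.get(t, 0) for t in HallucinationType}


def share_by_type(instances):
    """Return each type's fraction of all instances (0.0 for empty input)."""
    counts = count_by_type(instances)
    total = sum(counts.values())
    if total == 0:
        return {t: 0.0 for t in HallucinationType}
    return {t: c / total for t, c in counts.items()}


if __name__ == "__main__":
    # Toy placeholder rows, not real benchmark data; language codes are arbitrary.
    toy = [
        Instance("xx", "src-1", "out-1", HallucinationType.SOURCE_DETACHMENT),
        Instance("xx", "src-2", "out-2", HallucinationType.INSTRUCTION_DETACHMENT),
        Instance("yy", "src-3", "out-3", HallucinationType.SOURCE_DETACHMENT),
    ]
    for kind, share in share_by_type(toy).items():
        print(f"{kind.value}: {share:.2f}")
```

Grouping the same tally by `target_lang` or by model would be the natural next step for comparing failure patterns across translation directions or systems.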

Code & Models

Datasets

AIDC-AI/HalloMTBench
dataset· 24 dl