Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation
Xinwei Wu, Heng Liu, Jiang Zhou, Xiaohu Zhao, Linlong Xu, Longyue Wang, Weihua Luo, Kaifu Zhang

TL;DR
This paper introduces a new diagnostic framework and benchmark, HalloMTBench, to systematically identify and analyze hallucination failures in multilingual large language models during translation tasks.
Contribution
It proposes a taxonomy to distinguish types of hallucinations, creates a comprehensive multilingual benchmark, and evaluates multiple LLMs to uncover failure patterns.
Findings
Identified distinct hallucination triggers related to model scale and linguistic biases
Developed a high-quality dataset with 5,435 instances for evaluation
Revealed that RL can amplify language mixing issues in LLMs
Abstract
Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Existing MT benchmarks, however, fail to expose these failures in multilingual LLMs. To surface hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We use 4 frontier LLMs to generate candidates, then scrutinize these candidates with an ensemble of LLM judges and expert validation. In this way, we curate 5,435 high-quality instances. We evaluate 17 LLMs on HalloMTBench. Results reveal distinct "hallucination triggers": unique failure patterns reflecting model scale, source length sensitivity, linguistic biases, and Reinforcement-Learning…
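The curation flow described in the abstract (candidate generation, an ensemble of LLM judges, then expert validation) can be pictured with a minimal sketch. The abstract does not say how the judges' verdicts are combined, so the strict-majority vote below is an assumption, and every name here (`Label`, `Candidate`, `ensemble_label`, `select_for_expert_review`) is hypothetical rather than taken from the paper or its code. Only the two taxonomy categories come from the source.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    # The two hallucination categories named in the abstract, plus a
    # hypothetical "clean" label for translations with no hallucination.
    NONE = "no_hallucination"
    INSTRUCTION_DETACHMENT = "instruction_detachment"
    SOURCE_DETACHMENT = "source_detachment"


@dataclass
class Candidate:
    source: str        # English source text
    translation: str   # model output to be judged
    target_lang: str   # target language code, e.g. "fr"


def ensemble_label(votes):
    """Return the label backed by a strict majority of judges, else None.

    Assumption: strict-majority voting. The paper may aggregate differently.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def select_for_expert_review(candidates, judges):
    """Keep candidates that a majority of judges flag as hallucinated.

    `judges` is a list of callables mapping a Candidate to a Label; in
    practice each would wrap an LLM call, which is omitted here.
    """
    selected = []
    for cand in candidates:
        votes = [judge(cand) for judge in judges]
        label = ensemble_label(votes)
        if label is not None and label is not Label.NONE:
            selected.append((cand, label))
    return selected
```

In this sketch, candidates without a strict majority are dropped rather than escalated; routing them to human experts instead would be an equally plausible design.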
