Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Hanna Yukhymenko; Anton Alexandrov; Martin Vechev

arXiv:2602.22207 · cs.CL · February 26, 2026

TL;DR

This paper introduces an automated, scalable translation pipeline for benchmarks that maintains semantic integrity, improving multilingual evaluation accuracy for large language models across eight European languages.

Contribution

The work presents a translation framework that uses test-time compute scaling and multi-round ranking to produce high-quality multilingual benchmarks while preserving task structure and linguistic nuances.
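To make the idea of ranking candidate translations over several rounds concrete, here is a minimal sketch. It is not the paper's T-RANK algorithm, whose details this page does not give. It assumes only that several candidate translations have already been sampled and that some judge can order them. The names `multi_round_rank`, `Ranker`, and `toy_ranker` are hypothetical, and the toy judge stands in for what would presumably be a model call.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical judge interface (an assumption, not from the paper): given the
# candidates in the order they are shown, return positions in that list from
# best to worst.
Ranker = Callable[[Sequence[str]], List[int]]


def multi_round_rank(
    candidates: Sequence[str],
    ranker: Ranker,
    rounds: Optional[int] = None,
) -> str:
    """Return the candidate with the lowest summed rank over several rounds.

    Each round rotates the presentation order by one slot, so no candidate
    is always shown first. Ties go to the earliest original candidate.
    """
    n = len(candidates)
    if n == 0:
        raise ValueError("need at least one candidate")
    rounds = n if rounds is None else rounds

    totals = [0] * n  # summed rank position per original candidate
    for r in range(rounds):
        order = [(i + r) % n for i in range(n)]  # original indices, rotated
        shown = [candidates[i] for i in order]
        ranking = ranker(shown)  # positions within `shown`, best first
        for place, shown_idx in enumerate(ranking):
            totals[order[shown_idx]] += place

    best = min(range(n), key=lambda i: (totals[i], i))
    return candidates[best]


def toy_ranker(shown: Sequence[str]) -> List[int]:
    # Illustrative stand-in for a model judgment: prefers longer strings.
    return sorted(range(len(shown)), key=lambda i: (-len(shown[i]), i))


if __name__ == "__main__":
    print(multi_round_rank(["a", "abc", "ab"], toy_ranker))
```

Rotating the order is one simple way to reduce the effect of a judge that favors whichever candidate it sees first. With as many rounds as candidates, every candidate occupies every slot exactly once.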

Findings

01. Translations outperform existing resources in quality metrics.

02. Framework enables accurate multilingual model evaluation.

03. Releases include tools and benchmarks for community use.

Abstract

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Explainable Artificial Intelligence (XAI)