Pearmut: Human Evaluation of Translation Made Trivial

Vilém Zouhar; Tom Kocmi

arXiv:2601.02933·cs.CL·April 21, 2026

TL;DR

Pearmut is a user-friendly platform that simplifies human evaluation for multilingual NLP, especially machine translation, making it as accessible as automatic metrics.

Contribution

The paper introduces a lightweight, extensible tool that supports standard and new evaluation protocols, with features such as document-level context, attention checks, and static and dynamic assignment strategies.

Findings

01. Supports multiple evaluation protocols, including DA, ESA, and MQM.

02. Makes reliable human evaluation a routine part of model development.

03. Features document-level context, attention checks, and an extensible design.

Abstract

Human evaluation is the gold standard for multilingual NLP, but in practice it is often skipped and replaced with automatic metrics because existing tools make it notoriously complex and slow to set up, with substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and supports the evaluation of multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and can be extended to support new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESA^AI pre-annotations, and both static and dynamic assignment strategies. Pearmut enables reliable human evaluation to become a practical,…

Code & Models

Repositories

zouharvi/pearmut
github

Datasets

zouharvi/hearing2translate-humeval
dataset· 92 dl

Videos

Pearmut: Human Evaluation of Translation Made Trivial· underline