PETra: A Multilingual Corpus of Pragmatic Explicitation in Translation

Doreen Osmelak; Koel Dutta Chowdhury; Uliana Sentsova; Cristina España-Bonet; Josef van Genabith

arXiv:2511.02721·cs.CL·April 2, 2026

TL;DR

This paper introduces PragExTra, a multilingual corpus and detection framework for pragmatic explicitation in translation, enabling computational analysis of cultural and contextual enrichments across eight language pairs.

Contribution

It presents the first multilingual corpus and detection method for pragmatic explicitation, demonstrating improved classifier accuracy with active learning across languages.

Findings

01 Entity and system-level explicitations are the most frequent types.

02 Active learning improves classifier accuracy by 7-8 percentage points.

03 The classifier achieves up to 0.88 accuracy and 0.82 F1 across languages.
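
To make the active-learning finding concrete, here is a minimal sketch of one common selection strategy, uncertainty sampling, in which the candidates the classifier is least sure about are sent to human annotators first. The listing does not say which selection strategy the authors use, so the strategy, the function name `select_most_uncertain`, and the example probabilities are all assumptions made for illustration.

```python
from typing import List, Sequence


def select_most_uncertain(probabilities: Sequence[float], k: int) -> List[int]:
    """Return indices of the k candidates whose predicted probability
    of being an explicitation is closest to 0.5 (hypothetical helper)."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),  # distance from the decision boundary
    )
    return ranked[:k]


# Illustrative classifier scores for five candidate spans (made-up values).
scores = [0.95, 0.52, 0.10, 0.45, 0.70]
to_annotate = select_most_uncertain(scores, k=2)
print(to_annotate)
```

In a full loop, the selected candidates would be labeled by annotators, added to the training set, and the classifier retrained before the next selection round.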

Abstract

Translators often enrich texts with background details that make implicit cultural meanings explicit for new audiences. This phenomenon, known as pragmatic explicitation, has been widely discussed in translation theory but rarely modeled computationally. We introduce PragExTra, the first multilingual corpus and detection framework for pragmatic explicitation. The corpus covers eight language pairs from TED-Multi and Europarl and includes additions such as entity descriptions, measurement conversions, and translator remarks. We identify candidate explicitation cases through null alignments and refine them using active learning with human annotation. Our results show that entity and system-level explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 across languages. PragExTra establishes…
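
The candidate-identification step mentioned in the abstract can be illustrated with a small sketch. It assumes word alignments are available as (source index, target index) pairs and treats any run of target tokens with no aligned source token as a candidate. The function name `unaligned_target_spans`, the `min_len` threshold, and the example sentence are hypothetical and not taken from the paper; the authors' actual alignment tool and filtering rules are not described in this listing.

```python
from typing import List, Set, Tuple


def unaligned_target_spans(
    target_tokens: List[str],
    alignment: Set[Tuple[int, int]],
    min_len: int = 2,
) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for each run of at least min_len
    consecutive target tokens that no source token aligns to."""
    aligned_targets = {tgt for _, tgt in alignment}
    spans = []
    start = None
    # Iterate one position past the end so a trailing run is flushed.
    for i in range(len(target_tokens) + 1):
        is_null = i < len(target_tokens) and i not in aligned_targets
        if is_null and start is None:
            start = i  # a new unaligned run begins here
        elif not is_null and start is not None:
            if i - start >= min_len:
                spans.append((start, i, " ".join(target_tokens[start:i])))
            start = None
    return spans


# Made-up example: source "He studied at MIT ." and a German target
# that adds a description of the entity.
target = ["Er", "studierte", "am", "MIT", ",", "einer",
          "Universität", "in", "den", "USA", "."]
links = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 10)}
candidates = unaligned_target_spans(target, links)
print(candidates)
```

Spans found this way are only candidates: many unaligned tokens are function words or alignment errors, which is why a classifier and human annotation are needed to refine them.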

Code & Models

Datasets

Doosme/PETra
dataset · 29 downloads