DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

Xiying Zhao; Zhoufutu Wen; Zhixuan Chen; Jingzhe Ding; Jianpeng Jiao; Shuai Li; Xi Li; Danni Liang; Shengda Long; Qianqian Liu; Xianbo Wu; Hongwan Gao; Xiang Gao; Liang Hu; Jiashuo Liu; Mengyun Liu; Weiran Shi; Chenghao Yang; Qianyu Yang; Xuanliang Zhang; Ge Zhang; Wenhao Huang; Yuwen Tang

arXiv:2511.10984 · cs.CL · December 18, 2025

Open Access

TL;DR

DiscoX is a new benchmark for discourse-level Chinese-English translation in expert domains, highlighting the challenges faced by current models and providing a system for automatic, fine-grained evaluation aligned with human judgment.

Contribution

The paper introduces DiscoX, a discourse-level translation benchmark, together with an associated evaluation system, Metric-S, filling a gap in the assessment of expert-domain translation.

Findings

01 Current LLMs lag behind human experts in discourse-level translation.

02 Metric-S correlates well with human judgments and outperforms existing metrics.

03 DiscoX reveals significant performance gaps, emphasizing the need for further research.
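
Finding 02 is a claim about agreement between an automatic metric and human ratings. As a minimal sketch of how such agreement is commonly quantified, the block below computes Pearson and Spearman correlation between two score lists. It is a generic illustration, not the paper's evaluation protocol: the function names and the sample scores are hypothetical, and this page does not say which statistic the authors report.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical scores for five translations (illustrative only).
    human_scores = [62.0, 75.5, 48.0, 90.0, 71.0]
    metric_scores = [60.0, 80.0, 50.0, 88.0, 65.0]
    print("Pearson :", round(pearson(human_scores, metric_scores), 3))
    print("Spearman:", round(spearman(human_scores, metric_scores), 3))
```

Spearman is often reported alongside Pearson because it depends only on how the systems or segments are ordered, so it is insensitive to differences in scale between the metric and the human rating scheme.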

Abstract

The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification