A Bilingual Parallel Corpus with Discourse Annotations
Yuchen Eleanor Jiang, Tianyu Liu, Shuming Ma, Dongdong Zhang, Mrinmaya Sachan, Ryan Cotterell

TL;DR
This paper introduces BWB, a large Chinese-English parallel corpus built from Chinese novels translated into English by experts, together with a discourse-annotated test set, to support research in document-level machine translation.
Contribution
It presents the BWB corpus and an annotated test set, addressing the lack of parallel document corpora for advancing document-level MT systems.
Findings
The BWB corpus is publicly available for research.
The annotated test set probes discourse phenomena in MT.
The resource supports development of more context-aware MT models.
Abstract
Machine translation (MT) has almost achieved human parity at sentence-level translation. In response, the MT community has, in part, shifted its focus to document-level translation. However, the development of document-level MT systems is hampered by the lack of parallel document corpora. This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set. The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena. Our resource is freely available, and we hope it will serve as a guide and inspiration for more work in document-level machine translation.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Test
