PETCI: A Parallel English Translation Dataset of Chinese Idioms

Kenan Tang (The University of Chicago)

arXiv:2202.09509 · cs.CL · February 22, 2022 · 5 cites

TL;DR

PETCI is a new parallel dataset of Chinese idioms and their English translations designed to enhance idiom translation in machine translation systems and aid language learners.

Contribution

The paper introduces PETCI, a large, scalable Chinese-English idiom translation dataset combining human and machine efforts, and evaluates baseline models for translation quality.

Findings

01 Baseline models perform poorly on idiom translation.

02 Structure-aware classification models effectively distinguish good translations.

03 PETCI can be expanded easily without specialized expertise.

Abstract

Idioms are an important language phenomenon in Chinese, but idiom translation is notoriously hard. Current machine translation models perform poorly on idiom translation, while idioms are sparse in many translation datasets. We present PETCI, a parallel English translation dataset of Chinese idioms, aiming to improve idiom translation by both humans and machines. The dataset is built by leveraging human and machine effort. Baseline generation models show unsatisfactory abilities to improve translation, but structure-aware classification models show good performance in distinguishing good translations. Furthermore, the size of PETCI can be easily increased without expertise. Overall, PETCI can be helpful to language learners and machine translation systems.

Code & Models

Repositories

kt2k01/petci (PyTorch, official)

Datasets

kenantang/IdiomTranslate30
dataset · 30 dl

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning in Bioinformatics