Neural Machine Translation Model with a Large Vocabulary Selected by Branching Entropy
Zi Long, Ryuichiro Kimura, Takehito Utsuro, Tomoharu Mitsuhashi, Mikio Yamamoto

TL;DR
This paper introduces a method combining neural machine translation with phrase selection via branching entropy to handle large vocabularies and technical terms, significantly improving translation accuracy for patent documents.
Contribution
It proposes a vocabulary selection technique using branching entropy to improve NMT translation of technical patent documents without language-specific knowledge.
Findings
Significant improvement in translation quality over baseline NMT.
Reduction of under-translation errors by about 50%.
Effective handling of technical terms in patent translation.
Abstract
Neural machine translation (NMT), a new approach to machine translation, has achieved promising results comparable to those of traditional approaches such as statistical machine translation (SMT). Despite its recent success, NMT cannot handle a larger vocabulary because the training complexity and decoding complexity proportionally increase with the number of target words. This problem becomes even more serious when translating patent documents, which contain many technical terms that are observed infrequently. In this paper, we propose to select phrases that contain out-of-vocabulary words using the statistical approach of branching entropy. This allows the proposed NMT system to be applied to a translation task of any language pair without any language-specific knowledge about technical term identification. The selected phrases are then replaced with tokens during training and…
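The abstract names branching entropy as the statistic used to select phrases but does not spell out how it is computed. The following is a minimal sketch, assuming the standard definition of right branching entropy: the entropy of the distribution over tokens that follow a given n-gram. The function name `right_branching_entropy` and the toy corpus are hypothetical and are not taken from the paper. The paper's actual selection criteria, thresholds, and token replacement step are not shown here.

```python
import math
from collections import Counter, defaultdict


def right_branching_entropy(sentences, n):
    """Map each n-gram to the entropy (in bits) of the token that follows it.

    Assumption: this is the textbook definition
    H(X | w_1..w_n) = -sum_x P(x | w_1..w_n) * log2 P(x | w_1..w_n),
    estimated from raw counts. It is not the authors' implementation.
    """
    followers = defaultdict(Counter)
    for tokens in sentences:
        # Count every token that appears directly after each n-gram.
        for i in range(len(tokens) - n):
            context = tuple(tokens[i:i + n])
            followers[context][tokens[i + n]] += 1

    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


# Hypothetical toy corpus of tokenized sentences.
toy_corpus = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["x", "y", "z"],
    ["x", "y", "z"],
]
scores = right_branching_entropy(toy_corpus, n=2)
```

In this toy corpus, `("a", "b")` is followed by two equally likely tokens, so its entropy is 1 bit, while `("x", "y")` is always followed by `"z"`, so its entropy is 0. Low entropy after a token sequence suggests the sequence continues as a fixed phrase, and high entropy suggests a phrase boundary, which is the intuition behind using this statistic to find phrase candidates without language-specific rules.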
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques
