JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus

Masaaki Nagata; Katsuki Chousa; Norihito Yasuda

arXiv:2508.16303·cs.CL·August 25, 2025

TL;DR

This paper introduces JaParaPat, a large-scale Japanese-English patent corpus with over 300 million sentence pairs, enhancing translation quality and resource availability for patent-related NLP tasks.

Contribution

The paper contributes a comprehensive Japanese-English patent corpus that improves translation accuracy, together with a novel sentence alignment method for patent documents.

Findings

01

Achieved a 20 BLEU point improvement in translation accuracy.

02

Constructed a corpus of over 300 million sentence pairs.

03

Enhanced patent translation quality significantly.

Abstract

We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the published unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from DOCDB, a bibliographic database maintained by the European Patent Office (EPO). Based on the patent families, we extracted approximately 1.4M Japanese-English document pairs that are translations of each other. From these document pairs we extracted about 350M sentence pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. We…
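The document-pairing step described in the abstract can be illustrated with a minimal sketch. It assumes each publication record has already been reduced to a `(family_id, country, doc_id)` tuple; that layout, the function name `pair_by_family`, and the sample identifiers are hypothetical and do not reflect the actual DOCDB schema or the authors' implementation. The paper may also apply further criteria (for example priority claims or publication kind) that this sketch does not model.

```python
from collections import defaultdict
from itertools import product


def pair_by_family(publications):
    """Pair Japanese and US publications that share a patent family.

    `publications` is an iterable of (family_id, country, doc_id) tuples.
    This record layout is an assumption made for illustration only.
    """
    # Collect JP and US members of every family; other offices are ignored.
    by_family = defaultdict(lambda: {"JP": [], "US": []})
    for family_id, country, doc_id in publications:
        if country in ("JP", "US"):
            by_family[family_id][country].append(doc_id)

    # A family yields candidate pairs only if it has both a JP and a US member.
    pairs = []
    for family_id in sorted(by_family):
        members = by_family[family_id]
        pairs.extend(product(members["JP"], members["US"]))
    return pairs


# Hypothetical sample records, not real publication numbers.
records = [
    ("F1", "JP", "JP-A"),
    ("F1", "US", "US-A"),
    ("F2", "JP", "JP-B"),  # no US counterpart, so no pair
    ("F3", "US", "US-C"),  # no JP counterpart, so no pair
    ("F1", "EP", "EP-A"),  # other offices are skipped
]

print(pair_by_family(records))
```

Only family `F1` contains both a Japanese and a US member, so it is the only one that produces a candidate document pair. Sentence alignment would then run within each such pair.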
