CinPatent: Datasets for Patent Classification
Minh-Tien Nguyen, Nhung Bui, Manh Tran-Tien, Linh Le, and Huy-The Vu

TL;DR
This paper introduces two new large-scale patent classification datasets in English and Japanese, compares multiple classification methods, and provides insights into their performance and the contribution of different patent sections.
Contribution
The paper provides the first systematic performance comparison of baseline methods on two new patent datasets, one in English and one in Japanese, and releases these datasets for future research.
Findings
AttentionXML consistently outperforms the other strong baselines on both datasets.
The different parts of a patent (title, abstract, description, and claims) contribute differently to classification performance.
Performance varies with training data segmentation.
Abstract
Patent classification is the task of assigning several codes (classes) to each input patent. Because the task is in high demand, several datasets and methods have been introduced. However, the lack of both a systematic performance comparison of baselines and access to some datasets creates a gap for the task. To fill the gap, we introduce two new datasets in English and Japanese collected by using CPC codes. The English dataset includes 45,131 patent documents with 425 labels and the Japanese dataset contains 54,657 documents with 523 labels. To facilitate future studies, we compare the performance of strong multi-label text classification methods on the two datasets. Experimental results show that AttentionXML is consistently better than other strong baselines. An ablation study is also conducted on two aspects: the contribution of different parts (title, abstract, description, and claims) of a…
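The multi-label setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the field names (`title`, `abstract`, `claims`), the helper functions, and the toy records are all hypothetical, and the CPC-style codes are used only as example labels. The granularity of the labels in the actual datasets is not stated here. The sketch shows how each patent's set of codes can be turned into a multi-hot target vector, and how an input text can be composed from a chosen subset of patent parts, as in a section ablation.

```python
from typing import Dict, List, Sequence


def build_label_index(label_sets: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each distinct code to a column index (codes sorted alphabetically)."""
    codes = sorted({code for labels in label_sets for code in labels})
    return {code: i for i, code in enumerate(codes)}


def to_multi_hot(labels: Sequence[str], index: Dict[str, int]) -> List[int]:
    """Encode one patent's codes as a 0/1 vector with one slot per known code."""
    vec = [0] * len(index)
    for code in labels:
        vec[index[code]] = 1
    return vec


def compose_input(patent: Dict[str, str], sections: Sequence[str]) -> str:
    """Join the requested parts of a patent, skipping missing or empty ones."""
    return " ".join(patent[s] for s in sections if patent.get(s))


# Hypothetical toy records; not taken from the datasets.
patents = [
    {"title": "Battery cell", "abstract": "A lithium cell.", "codes": ["H01M", "Y02E"]},
    {"title": "Search index", "abstract": "An inverted index.", "codes": ["G06F"]},
]

index = build_label_index([p["codes"] for p in patents])
targets = [to_multi_hot(p["codes"], index) for p in patents]
text_title_only = compose_input(patents[0], ["title"])
text_title_abstract = compose_input(patents[0], ["title", "abstract"])

print(index)    # {'G06F': 0, 'H01M': 1, 'Y02E': 2}
print(targets)  # [[0, 1, 1], [1, 0, 0]]
```

A multi-label classifier would then be trained to predict one such vector per document; changing the `sections` argument is one simple way to compare which parts of a patent are fed to the model.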
Taxonomy
Topics: Text and Document Classification Technologies · Advanced Text Analysis Techniques · Topic Modeling
