PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation

Ruixuan Luo; Jingjing Xu; Yi Zhang; Zhiyuan Zhang; Xuancheng Ren; Xu Sun

arXiv:1906.11455 · cs.CL · May 31, 2022 · 108 cites

TL;DR

PKUSEG is a multi-domain Chinese word segmentation toolkit that provides domain-specific models and employs a novel domain adaptation method using synthetic data to improve performance across various domains.

Contribution

The paper introduces PKUSEG, a toolkit with domain-specific models and a domain adaptation paradigm using synthetic data for Chinese word segmentation.

Findings

01. High performance across multiple domains
02. Effective domain adaptation with synthetic data
03. Supports POS tagging and model training

Abstract

Chinese word segmentation (CWS) is a fundamental step of Chinese natural language processing. In this paper, we build a new toolkit, named PKUSEG, for multi-domain word segmentation. Unlike existing single-model toolkits, PKUSEG targets multi-domain word segmentation and provides separate models for different domains, such as web, medicine, and tourism. Besides, due to the lack of labeled data in many domains, we propose a domain adaptation paradigm to introduce cross-domain semantic knowledge via a translation system. Through this method, we generate synthetic data using a large amount of unlabeled data in the target domain and then obtain a word segmentation model for the target domain. We also further refine the performance of the default model with the help of synthetic data. Experiments show that PKUSEG achieves high performance on multiple domains. The new toolkit also supports…
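
To make the "separate models for different domains" idea concrete, here is a minimal sketch of a segmenter that dispatches to a per-domain resource. It does not use the PKUSEG package and does not reproduce its models: forward maximum matching over toy dictionaries stands in for a trained model, and the names `DOMAIN_LEXICONS`, `forward_max_match`, and `MultiDomainSegmenter` are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy lexicons, one per domain. A real toolkit would load a
# trained segmentation model for each domain instead of a word list.
DOMAIN_LEXICONS: Dict[str, Set[str]] = {
    "default": {"我们", "喜欢", "北京"},
    "medicine": {"患者", "服用", "阿司匹林", "症状", "缓解"},
    "tourism": {"景区", "门票", "酒店", "导游"},
}


def forward_max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation (a stand-in for a trained model).

    At each position, take the longest lexicon word of at most max_len
    characters; fall back to a single character when nothing matches.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


class MultiDomainSegmenter:
    """Selects a domain-specific lexicon, falling back to the default one."""

    def __init__(self, lexicons: Dict[str, Set[str]], default_domain: str = "default"):
        if default_domain not in lexicons:
            raise ValueError(f"missing default domain: {default_domain}")
        self.lexicons = lexicons
        self.default_domain = default_domain

    def cut(self, text: str, domain: Optional[str] = None) -> List[str]:
        # Unknown or unspecified domains use the default lexicon.
        lexicon = self.lexicons.get(domain or self.default_domain,
                                    self.lexicons[self.default_domain])
        return forward_max_match(text, lexicon)
```

In this sketch, calling `MultiDomainSegmenter(DOMAIN_LEXICONS).cut(text, domain="medicine")` keeps a term such as 阿司匹林 as one token, while the default lexicon splits it into single characters. That difference is the motivation for choosing a model that matches the input domain.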

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications