Searching for Optimal Subword Tokenization in Cross-domain NER
Ruotian Ma, Yiding Tan, Xin Zhou, Xuanting Chen, Di Liang, Sirui Wang, Wei Wu, Tao Gui, Qi Zhang

TL;DR
This paper introduces X-Piece, a subword-level approach for cross-domain NER that re-tokenizes input words to better align subword distributions between domains, improving performance especially when combined with existing domain-invariant methods.
Contribution
The paper proposes a novel subword-level re-tokenization method for cross-domain NER using optimal transport, addressing input distribution shift directly at the word level.
Findings
X-Piece improves NER performance across four datasets.
Combining X-Piece with domain-invariant representation learning (DIRL) methods yields further gains.
The approach effectively aligns subword distributions between domains.
Abstract
Input distribution shift is one of the vital problems in unsupervised domain adaptation (UDA). The most popular UDA approaches focus on domain-invariant representation learning (DIRL), trying to align the features from different domains into similar feature distributions. However, these approaches ignore the direct alignment of input word distributions between domains, which is a vital factor in word-level classification tasks such as cross-domain NER. In this work, we shed new light on cross-domain NER by introducing a subword-level solution, X-Piece, for input word-level distribution shift in NER. Specifically, we re-tokenize the input words of the source domain to approach the target subword distribution, which is formulated and solved as an optimal transport problem. As this approach focuses on the input level, it can also be combined with previous DIRL methods for further improvement.…
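The optimal transport formulation mentioned in the abstract can be illustrated with a generic entropy-regularized solver (Sinkhorn iterations). This is a minimal sketch under stated assumptions, not the authors' implementation: the toy word frequencies, the target subword frequencies, the cost matrix, and the names `sinkhorn`, `word_freq`, `target_subword_freq`, and `cost` are all hypothetical and chosen only to show how a transport plan can move source word mass onto a target subword distribution.

```python
import numpy as np


def sinkhorn(a, b, cost, reg=0.5, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    Returns a plan P whose row sums match `a` and whose column sums
    approach `b`, while favoring low-cost (word, subword) pairs.
    """
    K = np.exp(-cost / reg)           # Gibbs kernel, strictly positive
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # rescale columns toward b
        u = a / (K @ v)               # rescale rows toward a
    return u[:, None] * K * v[None, :]


# Hypothetical toy data: 3 source-domain words and 4 candidate subwords.
word_freq = np.array([0.5, 0.3, 0.2])                  # source word distribution
target_subword_freq = np.array([0.4, 0.3, 0.2, 0.1])   # target subword distribution

# Assumed cost: 0 if the subword can occur in a segmentation of the word, else 1.
cost = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

plan = sinkhorn(word_freq, target_subword_freq, cost)

# Normalizing each row gives a per-word distribution over subwords, which
# could be read as a sampling distribution when re-tokenizing source words.
subword_given_word = plan / plan.sum(axis=1, keepdims=True)
```

In this sketch the row marginals of `plan` reproduce the source word frequencies and the column marginals approach the target subword frequencies. How the paper defines its cost, candidate segmentations, and any label-aware constraints is not stated in the text above, so those parts are left as assumptions.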
Taxonomy
Topics: Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis
Methods: ALIGN
