Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models
Zihong Zhang, Liqi He, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du

TL;DR
This paper investigates the limits of unsupervised word segmentation using Large Language Models (LLMs), demonstrating their ability to follow prompts for segmentation, and introduces a novel hybrid method called LLACA that combines LLM insights with Aho-Corasick automata for improved performance.
Contribution
The paper proposes a new framework to evaluate LLMs' semantic understanding through word segmentation and introduces LLACA, a hybrid unsupervised segmentation method combining LLMs with Aho-Corasick automata.
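As a rough illustration of the automaton half of such a hybrid, the sketch below builds an Aho-Corasick automaton over a small vocabulary and uses its matches to segment unspaced text. It is not the authors' LLACA implementation, which this summary does not describe. The vocabulary (imagined here as words collected from LLM-produced segmentations), the function names, and the fewest-tokens scoring rule are all assumptions made for the example.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton (trie + failure links) over `words`."""
    goto, fail, out = [{}], [0], [[]]  # out[n] = lengths of words ending at n
    for w in words:
        if not w:
            continue  # skip empty strings
        node = 0
        for ch in w:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(len(w))
    # Breadth-first pass so each failure target is finished before it is used.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start, end) spans of every vocabulary word found in `text`."""
    goto, fail, out = automaton
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for length in out[node]:
            matches.append((i + 1 - length, i + 1))
    return matches


def segment(text, words):
    """Assumed scoring rule: pick the segmentation with the fewest tokens.

    Characters not covered by a vocabulary word become one-character tokens.
    """
    starts_by_end = {}
    for start, end in find_matches(text, build_automaton(words)):
        starts_by_end.setdefault(end, []).append(start)
    n = len(text)
    cost = [0] + [0] * n
    prev = [0] * (n + 1)
    for end in range(1, n + 1):
        cost[end], prev[end] = cost[end - 1] + 1, end - 1  # single-char fallback
        for start in starts_by_end.get(end, []):
            if cost[start] + 1 < cost[end]:
                cost[end], prev[end] = cost[start] + 1, start
    tokens, pos = [], n
    while pos > 0:
        tokens.append(text[prev[pos]:pos])
        pos = prev[pos]
    return tokens[::-1]


vocab = ["北京", "大学", "北京大学", "生"]  # hypothetical vocabulary
print(segment("北京大学生", vocab))
```

The automaton finds all vocabulary occurrences in a single left-to-right pass over the text, which is why it pairs naturally with a dictionary-style segmenter. Any scoring rule could replace the fewest-tokens choice used here.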
Findings
LLMs can follow prompts to segment raw text into words.
Models with more parameters tend to segment better across multiple languages.
LLACA improves segmentation accuracy over traditional methods.
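This summary does not say which metric the paper reports. A common way to score word segmentation is word-level precision, recall, and F1 over character spans, sketched below under that assumption. The function names and the example sentences are hypothetical.

```python
def word_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def segmentation_prf(predicted, gold):
    """Word-level precision, recall, and F1 between two segmentations."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("segmentations must cover the same text")
    p, g = word_spans(predicted), word_spans(gold)
    correct = len(p & g)  # a word counts only if both boundaries match
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction splits one gold word in two.
print(segmentation_prf(["北京", "大学", "生"], ["北京大学", "生"]))
```

Scoring whole spans rather than individual boundaries penalizes both over-segmentation and under-segmentation, since a word is counted as correct only when its start and end both agree with the reference.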
Abstract
Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). Leveraging the advanced pattern recognition…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
