Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models

Zihong Zhang; Liqi He; Zuchao Li; Lefei Zhang; Hai Zhao; Bo Du

arXiv:2505.19631 · cs.CL · May 27, 2025

TL;DR

This paper investigates the limits of unsupervised word segmentation with Large Language Models (LLMs). It shows that LLMs can follow simple prompts to segment raw text into words, and it introduces LLACA, a hybrid unsupervised method that combines LLM-derived insights with an Aho-Corasick automaton.

Contribution

The paper proposes a new framework to evaluate LLMs' semantic understanding through word segmentation and introduces LLACA, a hybrid unsupervised segmentation method combining LLMs with Aho-Corasick automata.

Findings

01. LLMs can follow simple prompts to segment raw text into words.

02. Models with more parameters tend to segment better across multiple languages.

03. LLACA improves segmentation accuracy over traditional methods.
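
This page does not say how segmentation accuracy is measured. A common choice for word segmentation is word-level precision, recall, and F1 over character spans, and the sketch below assumes that metric. The function names `to_spans` and `word_f1` are hypothetical and do not come from the paper or its repository.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def word_f1(gold, pred):
    """Word-level precision, recall, and F1 for one segmented sentence.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly. This metric is an assumption, not
    necessarily the one used in the paper.
    """
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["自然语言", "处理", "很", "有趣"]
pred = ["自然", "语言", "处理", "很", "有趣"]
precision, recall, f1 = word_f1(gold, pred)
print(precision, recall, round(f1, 4))
```

Here the prediction splits one gold word in two, so three of its five words match the four gold words exactly.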

Abstract

Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). Leveraging the advanced pattern recognition…
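
For readers unfamiliar with the automaton that LLACA is named after, here is a minimal sketch of a generic Aho-Corasick matcher used for dictionary-based segmentation. It is not the authors' LLACA implementation. The vocabulary is a hand-written, hypothetical stand-in for whatever words an LLM might supply, and the greedy longest-match rule in `segment` is an assumption chosen for brevity. The names `AhoCorasick`, `find_all`, and `segment` are also hypothetical.

```python
from collections import deque


class AhoCorasick:
    """Generic Aho-Corasick automaton over a list of non-empty words."""

    def __init__(self, words):
        self.goto = [{}]   # per-node character transitions
        self.fail = [0]    # per-node failure link
        self.out = [[]]    # per-node lengths of words ending here
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(len(word))

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep failure link 0 (root).
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find_all(self, text):
        """Return (start, end) spans of every dictionary word in text."""
        node, spans = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for length in self.out[node]:
                spans.append((i - length + 1, i + 1))
        return spans


def segment(text, automaton):
    """Greedy longest-match segmentation; unmatched characters stand alone."""
    longest = {}
    for start, end in automaton.find_all(text):
        longest[start] = max(longest.get(start, start + 1), end)
    words, i = [], 0
    while i < len(text):
        end = longest.get(i, i + 1)
        words.append(text[i:end])
        i = end
    return words


# Hypothetical vocabulary, standing in for LLM-suggested words.
vocab = ["自然", "语言", "自然语言", "处理"]
print(segment("自然语言处理很有趣", AhoCorasick(vocab)))
```

The automaton finds every dictionary occurrence in a single pass over the text, which is why it suits large vocabularies. Choosing among overlapping matches is a separate decision that this sketch resolves greedily.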

Code & Models

Repositories

hkr04/llaca (Official)

Videos

Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling