Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings
Mumin Jia, Jairo Diaz-Rodriguez

TL;DR
This paper introduces Embed-KCPD, a novel unsupervised text segmentation method using kernel change-point detection on sentence embeddings, supported by new theory and validated on benchmarks and real-world data.
Contribution
It presents the first dependence-aware theory for KCPD in language, along with a training-free segmentation algorithm and a simulation framework for validation.
Findings
Embed-KCPD outperforms strong baselines on standard benchmarks.
Theoretical guarantees show that each true change point is recovered within a window that is small relative to segment length.
Simulation results validate the predicted scaling behavior of the method.
Abstract
Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under m-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known…
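The idea of estimating boundaries by minimizing a penalized kernel change-point objective over sentence embeddings can be illustrated with a small sketch. This is not the authors' implementation: the function names `gram_matrix` and `kcpd_segment`, the cosine and RBF kernel choices, the linear penalty `beta` per segment, and the exact dynamic program are all assumptions made for illustration, and the synthetic vectors stand in for real sentence embeddings.

```python
import numpy as np

def gram_matrix(X, kernel="cosine", gamma=1.0):
    """Kernel Gram matrix on L2-normalised rows of X (shape n x d)."""
    Z = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = Z @ Z.T
    if kernel == "cosine":
        return cos
    if kernel == "rbf":
        # For unit vectors, ||x - y||^2 = 2 - 2 * cos(x, y).
        return np.exp(-gamma * (2.0 - 2.0 * cos))
    raise ValueError(f"unknown kernel: {kernel}")

def kcpd_segment(X, beta, kernel="cosine", gamma=1.0):
    """Return sorted change-point indices that minimise
    (sum of within-segment kernel scatter) + beta * (number of segments).
    Hypothetical helper; the penalty form is an assumption."""
    n = X.shape[0]
    K = gram_matrix(X, kernel, gamma)

    # Prefix sums so that any segment cost is available in O(1).
    diag = np.concatenate([[0.0], np.cumsum(np.diag(K))])
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)

    def cost(s, e):
        # Kernel scatter of the segment covering indices s, ..., e - 1.
        block = S[e, e] - S[s, e] - S[e, s] + S[s, s]
        return (diag[e] - diag[s]) - block / (e - s)

    # Exact O(n^2) dynamic program over the position of the last boundary.
    best = np.full(n + 1, np.inf)
    best[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)
    for e in range(1, n + 1):
        for s in range(e):
            value = best[s] + cost(s, e) + beta
            if value < best[e]:
                best[e], prev[e] = value, s

    # Backtrack from the end of the document to collect boundaries.
    boundaries = []
    e = n
    while e > 0:
        s = prev[e]
        if s > 0:
            boundaries.append(int(s))
        e = s
    return sorted(boundaries)

# Synthetic stand-in for sentence embeddings: three "topics" of 20 sentences.
rng = np.random.default_rng(0)
means = np.eye(8)[:3]
X = np.vstack([m + 0.05 * rng.standard_normal((20, 8)) for m in means])
boundaries = kcpd_segment(X, beta=1.0)
print(boundaries)
```

In this sketch a larger `beta` yields fewer segments and a smaller one yields more. In practice the rows of `X` would come from a sentence encoder, and the penalty would need to be calibrated to the data.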
Taxonomy
Topics: Topic Modeling · Authorship Attribution and Profiling · Text and Document Classification Technologies
