TreeSeg: Hierarchical Topic Segmentation of Large Transcripts
Dimitrios C. Gklezakos, Timothy Misiak, Diamond Bishop

TL;DR
TreeSeg is a hierarchical topic segmentation method for large transcripts that combines off-the-shelf embedding models with divisive clustering. It is robust to noisy ASR transcripts and outperforms baselines on the ICSI and AMI corpora.
Contribution
The paper introduces TreeSeg, a hierarchical segmentation approach that is noise-robust and scalable, and presents TinyRec, a new annotated corpus.
Findings
Outperforms all baseline methods on the ICSI and AMI datasets
Robust to noisy ASR transcripts
Efficiently handles large transcripts
Abstract
From organizing recorded videos and meetings into chapters, to breaking down large inputs in order to fit them into the context window of commoditized Large Language Models (LLMs), topic segmentation of large transcripts emerges as a task of increasing significance. Still, accurate segmentation presents many challenges, including (a) the noisy nature of the Automatic Speech Recognition (ASR) software typically used to obtain the transcripts, (b) the lack of diverse labeled data and (c) the difficulty in pin-pointing the ground-truth number of segments. In this work we present TreeSeg, an approach that combines off-the-shelf embedding models with divisive clustering, to generate hierarchical, structured segmentations of transcripts in the form of binary trees. Our approach is robust to noise and can handle large transcripts efficiently. We evaluate TreeSeg on the ICSI and AMI corpora,…
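The idea of combining utterance embeddings with divisive clustering to produce a binary tree can be illustrated with a minimal sketch. This is not the authors' implementation: the split criterion (minimizing the summed squared distance of each side to its mean embedding), the stopping rules (`min_size`, `max_depth`) and all names below are assumptions made for illustration, and the toy 2-D vectors stand in for the output of a real embedding model.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class Node:
    """A segment covering utterances [start, end) in the transcript."""
    start: int
    end: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def leaves(self) -> List[Tuple[int, int]]:
        """Return the flat segmentation given by the tree's leaves."""
        if self.left is None or self.right is None:
            return [(self.start, self.end)]
        return self.left.leaves() + self.right.leaves()


def sse(vectors: Sequence[Vector]) -> float:
    """Sum of squared distances of the vectors to their mean."""
    n = len(vectors)
    if n == 0:
        return 0.0
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    return sum((v[d] - mean[d]) ** 2 for v in vectors for d in range(dim))


def best_split(emb: Sequence[Vector], start: int, end: int,
               min_size: int) -> Optional[int]:
    """Pick the boundary that makes both sides most homogeneous.

    Assumed criterion: minimize the total within-side squared error.
    Returns None when the span is too short to split.
    """
    best_idx, best_cost = None, float("inf")
    for k in range(start + min_size, end - min_size + 1):
        cost = sse(emb[start:k]) + sse(emb[k:end])
        if cost < best_cost:
            best_idx, best_cost = k, cost
    return best_idx


def build_tree(emb: Sequence[Vector], start: int, end: int,
               min_size: int = 2, max_depth: int = 3) -> Node:
    """Divisive (top-down) clustering: split a span in two, then recurse."""
    node = Node(start, end)
    if max_depth == 0:
        return node
    k = best_split(emb, start, end, min_size)
    if k is None:
        return node
    node.left = build_tree(emb, start, k, min_size, max_depth - 1)
    node.right = build_tree(emb, k, end, min_size, max_depth - 1)
    return node


# Toy "embeddings": three utterances near (1, 0), then three near (0, 1).
toy_embeddings = [
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
    [0.0, 1.0], [0.1, 0.9], [0.0, 0.9],
]
tree = build_tree(toy_embeddings, 0, len(toy_embeddings),
                  min_size=2, max_depth=2)
print(tree.leaves())  # [(0, 3), (3, 6)]
```

Because the result is a tree rather than a flat list, a coarser or finer segmentation can be read off by cutting at a different depth, which is one way to sidestep having to fix the number of segments in advance. The brute-force split search here recomputes the error for every candidate boundary and is only suitable for small inputs.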
