Topic Segmentation in the Wild: Towards Segmentation of Semi-structured & Unstructured Chats
Reshmi Ghosh, Harjeet Singh Kajal, Sharanya Kamath, Dhuri Shrivastava, Samyadeep Basu, Soundararajan Srinivasan

TL;DR
This paper investigates the challenges of applying existing topic segmentation models to unstructured and semi-structured chats, highlighting the limited transferability of models trained on structured texts and proposing domain-specific training for better results.
Contribution
It provides a comprehensive analysis of how current topic segmentation models perform on unstructured texts and demonstrates that training on the target domain matters more than pre-training on large structured corpora.
Findings
Pre-training on structured texts like Wiki-727K does not transfer well to unstructured texts.
Training from scratch on small unstructured datasets significantly improves segmentation performance.
Current models struggle with generalization to semi-structured and unstructured chat data.
Abstract
Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) Current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured texts. (b) Training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin.
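To make the task in the abstract concrete, here is a minimal sketch of how a segmentation can be represented: one binary label per utterance, where 1 marks the last utterance of a segment. The label convention, the helper name `split_into_segments`, and the toy chat are illustrative assumptions, not the paper's formulation or data.

```python
from typing import List


def split_into_segments(utterances: List[str], boundaries: List[int]) -> List[List[str]]:
    """Group utterances into contiguous segments.

    Assumed convention (not taken from the paper): boundaries[i] == 1
    means utterance i is the last one in its segment.
    """
    if len(utterances) != len(boundaries):
        raise ValueError("utterances and boundaries must have the same length")

    segments: List[List[str]] = []
    current: List[str] = []
    for utterance, is_segment_end in zip(utterances, boundaries):
        current.append(utterance)
        if is_segment_end:
            segments.append(current)
            current = []
    # Any trailing utterances without a closing boundary form a final segment.
    if current:
        segments.append(current)
    return segments


# Hypothetical chat, invented for illustration only.
chat = [
    "hey, did the build pass?",
    "yes, all green now",
    "btw are you joining lunch?",
    "sure, 12:30 works",
    "great, see you then",
]
labels = [0, 1, 0, 0, 1]

segments = split_into_segments(chat, labels)
```

Under this representation, a segmentation model only has to predict one label per utterance, and the contiguous segments follow from those labels.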
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Wikis in Education and Collaboration
