HashSet -- A Dataset For Hashtag Segmentation
Prashant Kodali, Akshala Bhatnagar, Naman Ahuja, Manish Shrivastava, Ponnurangam Kumaraguru

TL;DR
HashSet is a new, large dataset for hashtag segmentation that includes manually annotated and loosely supervised data, offering diverse hashtag examples to better evaluate and improve model performance across different writing styles and domains.
Contribution
The paper introduces HashSet, a comprehensive dataset for hashtag segmentation with diverse, real-world hashtags, addressing limitations of existing datasets and enabling more robust model evaluation.
Findings
State-of-the-art models perform worse on HashSet than on existing benchmarks, suggesting that the diversity of hashtags in a dataset affects measured performance.
HashSet includes 1.9k manual annotations and 3.3M loosely supervised hashtags.
Dataset diversity reveals the need for improved models in hashtag segmentation.
Abstract
Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information like topic and sentiment, which are useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways -- transliterating and mixing languages, spelling variations, creative named entities. Benchmark datasets used for the hashtag segmentation task -- STAN, BOUN -- are small in size and extracted from a single set of tweets. However, datasets should reflect the variations in writing styles of hashtags and also account for domain and language specificity, failing which the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising: a) 1.9k…
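To make the task definition concrete, here is a minimal Python sketch of a dictionary-based segmenter. It is not the method evaluated in the paper: the vocabulary, the function name segment_hashtag, and the "fewest tokens" tie-breaking rule are all assumptions chosen for illustration.

```python
# Toy vocabulary: an assumption for illustration only, not drawn from HashSet.
VOCAB = {"we", "love", "nlp", "hash", "tag", "hashtag", "segmentation"}


def segment_hashtag(hashtag, vocab=VOCAB):
    """Split a hashtag into vocabulary words, preferring the fewest tokens.

    Falls back to the unsplit, lowercased text when no full split exists.
    """
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] holds the shortest token list covering text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [text]


if __name__ == "__main__":
    print(segment_hashtag("#WeLoveNLP"))
    print(segment_hashtag("#hashtagsegmentation"))
```

A fixed vocabulary like this cannot handle the transliterated, mixed-language, or creatively spelled hashtags the abstract describes, which is the kind of variation a more diverse dataset is meant to expose.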
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques
