Generative Deduplication For Social Media Data Selection

Xianming Li; Jing Li

arXiv:2401.05883·cs.CL·October 4, 2024·1 cites

TL;DR

This paper introduces a generative deduplication framework that effectively removes semantically duplicate social media data, reducing training data size and enhancing NLP model performance.

Contribution

It presents a novel self-supervised generative model with noise augmentation for universal social media data deduplication, improving efficiency and effectiveness.

Findings

01 Reduces training samples more effectively than baselines.

02 Improves NLP performance on social media data.

03 Enhances social media language understanding.

Abstract

Social media data exhibits severe redundancy caused by its noisy nature, which increases training time and introduces model bias. To address this issue, we propose a novel Generative Deduplication framework for social media data selection that removes semantically duplicate data. While related work focuses on data selection within task-specific training, our model acts as an efficient pre-processing method to universally enhance social media NLP pipelines. Specifically, we train a generative model via self-supervised learning to predict a keyword that captures the semantics of noisy social media text for deduplication. Meanwhile, time-dimensional Gaussian noise is added to increase training complexity and avoid learning trivial features. Extensive experiments suggest that our model reduces training samples more than the baselines while also improving performance. The results show our…
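The keyword-based selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_keyword` is a hypothetical stand-in for the trained generative model, grouping posts by an identical predicted keyword is a simplifying assumption about how duplicates are flagged, and `add_gaussian_noise` assumes that "time-dimensional" means noise drawn independently at each sequence position of an embedding.

```python
import random
import re
from typing import Callable, List

STOPWORDS = {"the", "a", "an", "is", "are", "so", "this", "that",
             "in", "on", "of", "to", "and"}


def toy_keyword(text: str) -> str:
    """Hypothetical stand-in for the generative model's keyword prediction.

    Returns the longest non-stopword token; a real system would decode
    the keyword from a trained sequence-to-sequence model instead.
    """
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if t not in STOPWORDS]
    return max(tokens, key=len) if tokens else ""


def deduplicate(texts: List[str],
                predict_keyword: Callable[[str], str] = toy_keyword) -> List[str]:
    """Keep the first post seen for each predicted keyword (assumed rule)."""
    seen = set()
    kept = []
    for text in texts:
        key = predict_keyword(text)
        if key in seen:
            continue  # treated as a semantic duplicate of an earlier post
        seen.add(key)
        kept.append(text)
    return kept


def add_gaussian_noise(embeddings: List[List[float]],
                       sigma: float = 0.1,
                       seed: int = 0) -> List[List[float]]:
    """Add zero-mean Gaussian noise to every value at every time step."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in step] for step in embeddings]


posts = [
    "Huge earthquake in Tokyo!!!",
    "earthquake is so scary",
    "New phone launch today",
]
kept = deduplicate(posts)
noisy = add_gaussian_noise([[0.0, 1.0], [2.0, 3.0]], sigma=0.1)
```

In this toy run the first two posts share the keyword "earthquake", so only the first is kept alongside the unrelated third post. Passing a different `predict_keyword` callable is where a trained model would plug in.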

Taxonomy

Topics: Data Quality and Management · Digital and Cyber Forensics · Topic Modeling