Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings
Marco Di Giovanni, Marco Brambilla

TL;DR
This paper presents a language-independent method to create large datasets of weakly similar sentence pairs from Twitter data, enabling training of semantic embeddings without manual labeling, and demonstrates its effectiveness across multiple NLP tasks.
Contribution
The authors introduce an approach that automatically generates large-scale corpora of weakly similar sentence pairs from Twitter, so that semantic embeddings can be trained in multiple languages without manual annotation.
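As a rough illustration of how reply and quote links can yield weakly similar pairs, the sketch below pairs each reply or quote with the tweet it refers to. The record layout (`id`, `text`, `in_reply_to_id`, `quoted_id`) and the helper name `build_weak_pairs` are hypothetical, chosen for this example only; they are neither the Twitter API schema nor the authors' pipeline.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: field names are assumptions for illustration.
Tweet = Dict[str, Optional[str]]


def build_weak_pairs(tweets: List[Tweet]) -> List[Tuple[str, str]]:
    """Pair each reply or quote with the text of the tweet it refers to."""
    text_by_id = {t["id"]: t["text"] for t in tweets}
    pairs: List[Tuple[str, str]] = []
    for t in tweets:
        for link in ("in_reply_to_id", "quoted_id"):
            parent_id = t.get(link)
            # Keep the pair only if the referenced tweet was also collected.
            if parent_id is not None and parent_id in text_by_id:
                pairs.append((text_by_id[parent_id], t["text"]))
    return pairs


tweets = [
    {"id": "1", "text": "New paper on sentence embeddings is out",
     "in_reply_to_id": None, "quoted_id": None},
    {"id": "2", "text": "Nice, does it cover non-English languages?",
     "in_reply_to_id": "1", "quoted_id": None},
    {"id": "3", "text": "Worth reading for anyone doing STS",
     "in_reply_to_id": None, "quoted_id": "1"},
    {"id": "4", "text": "Reply to a tweet we never collected",
     "in_reply_to_id": "99", "quoted_id": None},
]
pairs = build_weak_pairs(tweets)
```

No language-specific processing appears in the pairing step, which is consistent with the language-independent framing; any filtering or cleaning a real pipeline would need is left out here.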
Findings
Models trained on Twitter data outperform previous unsupervised methods.
Increasing the corpus size keeps improving embedding quality, up to 2 million samples.
The approach generalizes well to various semantic similarity tasks.
Abstract
Semantic sentence embeddings are usually supervisedly built minimizing distances between pairs of embeddings of sentences labelled as semantically similar by annotators. Since big labelled datasets are rare, in particular for non-English languages, and expensive, recent studies focus on unsupervised approaches that require not-paired input sentences. We instead propose a language-independent approach to build large datasets of pairs of informal texts weakly similar, without manual human effort, exploiting Twitter's intrinsic powerful signals of relatedness: replies and quotes of tweets. We use the collected pairs to train a Transformer model with triplet-like structures, and we test the generated embeddings on Twitter NLP similarity tasks (PIT and TURL) and STSb. We also introduce four new sentence ranking evaluation benchmarks of informal texts, carefully extracted from the initial…
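The abstract mentions training with "triplet-like structures" without giving the loss in this excerpt. As a minimal sketch, assuming a standard triplet margin loss with in-batch negatives (a swapped-in textbook formulation, not necessarily the one the authors use), each anchor is pulled toward its paired text and pushed away from the partners of the other anchors in the batch. The function names and the margin value are illustrative assumptions.

```python
import math
from typing import List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def in_batch_triplet_loss(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    margin: float = 1.0,  # assumed value, for illustration only
) -> float:
    """Mean hinge loss over all (anchor, positive, in-batch negative) triplets.

    positives[i] is the weakly similar partner of anchors[i]; every other
    positives[j] (j != i) is used as a negative for anchors[i].
    Requires a batch of at least two pairs.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        d_pos = euclidean(anchor, positives[i])
        for j, negative in enumerate(positives):
            if j == i:
                continue
            d_neg = euclidean(anchor, negative)
            losses.append(max(0.0, d_pos - d_neg + margin))
    return sum(losses) / len(losses)


# Well-separated pairs: each anchor coincides with its own partner.
aligned = in_batch_triplet_loss([[0.0, 0.0], [10.0, 0.0]],
                                [[0.0, 0.0], [10.0, 0.0]])
# Swapped partners: each anchor sits on the wrong partner.
swapped = in_batch_triplet_loss([[0.0, 0.0], [10.0, 0.0]],
                                [[10.0, 0.0], [0.0, 0.0]])
```

The aligned batch incurs no loss while the swapped batch is penalized, which is the behavior a triplet-style objective is meant to produce. In practice the embeddings would come from the Transformer encoder and the loss would be computed in an autodiff framework.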
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Authorship Attribution and Profiling
Methods: Attention Is All You Need · Test · Linear Layer · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing
