Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings

Marco Di Giovanni; Marco Brambilla

arXiv:2110.02030·cs.CL·October 6, 2021

TL;DR

This paper presents a language-independent method to create large datasets of weakly similar sentence pairs from Twitter data, enabling training of semantic embeddings without manual labeling, and demonstrates its effectiveness across multiple NLP tasks.

Contribution

The authors introduce an approach that automatically generates large-scale collections of weakly similar sentence pairs from Twitter replies and quotes, so that semantic sentence embeddings can be trained in multiple languages without manual annotation.
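As a rough illustration of the pair-collection idea, the sketch below links each tweet to the tweet it replies to or quotes, the two relatedness signals named in the abstract. The `Tweet` record, its field names, and `build_weak_pairs` are hypothetical and are not taken from the authors' repository; real Twitter API payloads are richer, and the authors' own filtering steps are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Tweet:
    # Hypothetical minimal record; field names are assumptions for this sketch.
    tweet_id: str
    text: str
    in_reply_to_id: Optional[str] = None  # id of the tweet this one replies to
    quoted_id: Optional[str] = None       # id of the tweet this one quotes


def build_weak_pairs(tweets: Iterable[Tweet]) -> List[Tuple[str, str]]:
    """Return (source_text, response_text) pairs linked by a reply or a quote."""
    tweets = list(tweets)
    by_id: Dict[str, Tweet] = {t.tweet_id: t for t in tweets}
    pairs: List[Tuple[str, str]] = []
    for t in tweets:
        for parent_id in (t.in_reply_to_id, t.quoted_id):
            # Skip links that point to tweets missing from the collection.
            parent = by_id.get(parent_id) if parent_id else None
            if parent is not None and parent.text.strip() and t.text.strip():
                pairs.append((parent.text, t.text))
    return pairs


corpus = [
    Tweet("1", "New phone battery lasts two days"),
    Tweet("2", "Mine barely makes it to lunch", in_reply_to_id="1"),
    Tweet("3", "Finally a phone I do not need to charge daily", quoted_id="1"),
    Tweet("4", "Replying to a tweet we never collected", in_reply_to_id="99"),
]
pairs = build_weak_pairs(corpus)
```

Nothing in this step depends on the language of the text, which is consistent with the language-independent claim: the pairing uses only the reply and quote links.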

Findings

01

Models trained on Twitter data outperform previous unsupervised methods.

02

Embedding quality keeps improving as the training corpus grows, even at 2 million samples.

03

The approach generalizes well to various semantic similarity tasks.

Abstract

Semantic sentence embeddings are usually built in a supervised fashion, by minimizing the distance between embeddings of sentence pairs that annotators have labelled as semantically similar. Because large labelled datasets are rare, particularly for non-English languages, and expensive to produce, recent studies focus on unsupervised approaches that require only unpaired input sentences. We instead propose a language-independent approach to build large datasets of weakly similar pairs of informal texts, without manual human effort, exploiting Twitter's intrinsic powerful signals of relatedness: replies and quotes of tweets. We use the collected pairs to train a Transformer model with triplet-like structures, and we test the generated embeddings on Twitter NLP similarity tasks (PIT and TURL) and STSb. We also introduce four new sentence ranking evaluation benchmarks of informal texts, carefully extracted from the initial…
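The abstract does not spell out what "triplet-like structures" means. One common objective of that kind uses in-batch negatives: each anchor should score higher with its own paired text than with the other texts in the batch. The sketch below is an assumption about the general family of losses, not the authors' exact objective; `in_batch_ranking_loss` and the `scale` value are illustrative, and the toy 2-D vectors stand in for Transformer sentence embeddings.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    scale: float = 20.0,  # illustrative temperature, an assumption
) -> float:
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch acts as a negative."""
    if len(anchors) != len(positives):
        raise ValueError("anchors and positives must have the same length")
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy embeddings: each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [aligned[1], aligned[0]]  # pairs deliberately mismatched

loss_aligned = in_batch_ranking_loss(anchors, aligned)
loss_shuffled = in_batch_ranking_loss(anchors, shuffled)
```

With correctly paired inputs the loss is lower than with mismatched pairs, which is the signal that pulls embeddings of a tweet and its reply or quote closer together during training.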

Code & Models

Repositories

marco-digio/twitter4sse
Official

Models

🤗
digio/Twitter4SSE
model · 41 downloads · 7 likes

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Authorship Attribution and Profiling

Methods: Attention Is All You Need · Test · Linear Layer · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Byte Pair Encoding · Dropout · Dense Connections · Label Smoothing