Learning Semantic Similarity for Very Short Texts

Cedric De Boom; Steven Van Canneyt; Steven Bohez; Thomas Demeester; Bart Dhoedt

arXiv:1512.00765·cs.IR·December 3, 2015

TL;DR

This paper explores combining word embeddings with tf-idf to improve semantic similarity detection in very short texts, outperforming traditional methods in initial experiments.

Contribution

It introduces a hybrid approach that merges dense word embeddings with tf-idf, advancing short text semantic matching beyond existing naive combination techniques.

Findings

01 Hybrid method outperforms traditional similarity measures

02 Combining embeddings with tf-idf improves semantic matching

03 Effective for very short text fragments

Abstract

Leveraging data on social media, such as Twitter and Facebook, requires information retrieval algorithms to be able to relate very short text fragments to each other. Traditional text similarity methods such as tf-idf cosine similarity, which are based on word overlap, mostly fail to produce good results in this case, since there is little or no word overlap. Recently, distributed word representations, or word embeddings, have been shown to successfully allow words to match on the semantic level. In order to pair short text fragments, viewed as a concatenation of separate words, an adequate distributed sentence representation is needed; in the existing literature it is often obtained by naively combining the individual word representations. We therefore investigated several text representations as a combination of word embeddings in the context of semantic pair matching. This paper investigates the…
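The contrast the abstract draws can be illustrated with a minimal sketch. It assumes a toy three-document corpus and hand-made 3-dimensional word vectors (`TOY_EMBEDDINGS`, invented here purely for illustration), and it uses an idf-weighted mean of word vectors as one simple way to combine embeddings with tf-idf information. This is not claimed to be the weighting scheme the paper itself arrives at.

```python
import math
from collections import Counter

# Hypothetical toy embeddings, invented for this illustration only.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "sleeps": [0.1, 0.9, 0.0],
    "naps": [0.2, 0.8, 0.0],
    "stocks": [0.0, 0.1, 0.9],
    "fall": [0.1, 0.0, 0.9],
}
DIM = 3


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def idf_table(corpus):
    """Inverse document frequency per word: log(N / df) + 1."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}


def tfidf_vector(doc, idf, vocab):
    """Sparse bag-of-words vector: term frequency times idf, one slot per vocab word."""
    tf = Counter(doc)
    return [tf[word] * idf[word] for word in vocab]


def idf_weighted_mean(doc, idf):
    """Dense text vector: idf-weighted average of the word embeddings in doc."""
    total = [0.0] * DIM
    weight_sum = 0.0
    for word in doc:
        if word not in TOY_EMBEDDINGS:
            continue  # skip out-of-vocabulary words
        weight = idf.get(word, 1.0)
        for i, x in enumerate(TOY_EMBEDDINGS[word]):
            total[i] += weight * x
        weight_sum += weight
    return [x / weight_sum for x in total] if weight_sum else total


corpus = [["cat", "sleeps"], ["kitten", "naps"], ["stocks", "fall"]]
idf = idf_table(corpus)
vocab = sorted(idf)

# Word-overlap view: the first two texts share no words.
tfidf_sim = cosine(tfidf_vector(corpus[0], idf, vocab),
                   tfidf_vector(corpus[1], idf, vocab))

# Embedding view: the same two texts, and an unrelated pair for comparison.
emb_sim_related = cosine(idf_weighted_mean(corpus[0], idf),
                         idf_weighted_mean(corpus[1], idf))
emb_sim_unrelated = cosine(idf_weighted_mean(corpus[0], idf),
                           idf_weighted_mean(corpus[2], idf))

print(tfidf_sim, emb_sim_related, emb_sim_unrelated)
```

With no shared words, the tf-idf cosine similarity of the first two texts is zero, while the embedding-based representation still places them close together and far from the unrelated third text. In a realistic setting the toy vectors would be replaced by pretrained embeddings and the idf values estimated from a large corpus.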
