A Distribution-Based Threshold for Determining Sentence Similarity
Gioele Cadamuro, Marco Gruppo

TL;DR
This paper introduces a neural network-based method to determine a distribution-based threshold for sentence similarity, especially for sentences with highly specific information, and demonstrates its transferability across domains.
Contribution
The authors propose a thresholding approach in which a siamese neural network embeds sentence pairs, and the distributions of distances for similar and dissimilar pairs are analyzed to find a threshold that separates the two, improving similarity detection accuracy.
Findings
Effective threshold derived from distance distributions
Method generalizes well to different datasets
Improves accuracy in identifying similar sentences that differ only in highly specific information
Abstract
We hereby present a solution to a semantic textual similarity (STS) problem in which it is necessary to match two sentences containing, as the only distinguishing factor, highly specific information (such as names, addresses, identification codes), and from which we need to derive a definition for when they are similar and when they are not. The solution revolves around the use of a neural network, based on the siamese architecture, to create the distributions of the distances between similar and dissimilar pairs of sentences. The goal of these distributions is to find a discriminating factor, that we call "threshold", which represents a well-defined quantity that can be used to distinguish vector distances of similar pairs from vector distances of dissimilar pairs in new predictions and later analyses. In addition, we developed a way to score the predictions by combining attributes…
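The abstract does not say how the threshold is computed from the two distance distributions, so the following is only a minimal sketch under stated assumptions. It assumes the distances have already been produced by some embedding model, that a pair counts as similar when its distance is at or below the threshold, and that the cut-off is chosen by maximizing balanced accuracy over the observed distances. The names `pick_threshold` and `is_similar` and the sample values are hypothetical and do not come from the paper.

```python
from typing import Sequence


def pick_threshold(similar: Sequence[float], dissimilar: Sequence[float]) -> float:
    """Pick a distance cut-off separating similar from dissimilar pairs.

    Assumption (not from the paper): a pair is predicted "similar" when its
    distance is <= the threshold, and the best cut-off is the one that
    maximizes balanced accuracy on the two samples.
    """
    values = sorted(set(similar) | set(dissimilar))
    # Candidate cut-offs are midpoints between consecutive observed distances.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values

    def balanced_accuracy(t: float) -> float:
        true_positive_rate = sum(d <= t for d in similar) / len(similar)
        true_negative_rate = sum(d > t for d in dissimilar) / len(dissimilar)
        return (true_positive_rate + true_negative_rate) / 2

    return max(candidates, key=balanced_accuracy)


def is_similar(distance: float, threshold: float) -> bool:
    """Classify a new pair from its embedding distance."""
    return distance <= threshold


# Made-up distances standing in for siamese-network outputs.
similar_distances = [0.05, 0.10, 0.12, 0.20]
dissimilar_distances = [0.60, 0.75, 0.80, 0.95]

threshold = pick_threshold(similar_distances, dissimilar_distances)
print(f"threshold = {threshold:.3f}")
print(is_similar(0.15, threshold), is_similar(0.70, threshold))
```

When the two distributions overlap, no cut-off separates them perfectly, and the choice of criterion (balanced accuracy here, but precision-oriented or density-based rules are equally plausible) determines the trade-off between missed matches and false matches.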
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
