Optimal Transport-based Alignment of Learned Character Representations for String Similarity
Derek Tam, Nicholas Monath, Ari Kobren, Aaron Traylor, Rajarshi Das, Andrew McCallum

TL;DR
This paper introduces STANCE, a novel learned string similarity model that uses optimal transport and neural networks to improve alias detection and downstream coreference tasks.
Contribution
The paper presents STANCE, a new method combining optimal transport and neural encoding for string similarity, outperforming existing models on multiple datasets.
Findings
STANCE or one of its variants outperforms both state-of-the-art and classic, parameter-free similarity models on four of the five datasets
Constructed five new alias detection datasets for evaluation
Improves cross-document coreference by 2.8 B^3 F1 points
Abstract
String similarity models are vital for record linkage, entity resolution, and search. In this work, we present STANCE, a learned model for computing the similarity of two strings. Our approach encodes the characters of each string, aligns the encodings using Sinkhorn Iteration (alignment is posed as an instance of optimal transport) and scores the alignment with a convolutional neural network. We evaluate STANCE's ability to detect whether two strings can refer to the same entity, a task we term alias detection. We construct five new alias detection datasets (and make them publicly available). We show that STANCE or one of its variants outperforms both state-of-the-art and classic, parameter-free similarity models on four of the five datasets. We also demonstrate STANCE's ability to improve downstream tasks by applying it to an instance of cross-document coreference and show that it…
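The alignment step named in the abstract can be illustrated with a minimal sketch of Sinkhorn iteration for entropy-regularized optimal transport. This is not the authors' implementation: the random character embeddings stand in for the learned encoder, and the cosine-distance cost, uniform marginals, regularization strength `reg`, iteration count, and the function names `sinkhorn_plan` and `toy_alignment` are all assumptions made for illustration. The convolutional scoring network is not reproduced.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.5, n_iters=200):
    """Approximate an entropy-regularized transport plan for a cost matrix.

    Assumes uniform mass over the characters of each string (an
    illustrative choice, not taken from the paper).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # target row marginal (string 1)
    b = np.full(m, 1.0 / m)  # target column marginal (string 2)
    K = np.exp(-cost / reg)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows toward marginal a
        v = b / (K.T @ u)    # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def toy_alignment(s, t, dim=8, seed=0):
    """Align two strings using random per-character embeddings.

    The embeddings are a hypothetical stand-in for a learned character
    encoder; the cost is one minus cosine similarity.
    """
    rng = np.random.default_rng(seed)
    table = {c: rng.normal(size=dim) for c in sorted(set(s + t))}
    x = np.stack([table[c] for c in s])
    y = np.stack([table[c] for c in t])
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y /= np.linalg.norm(y, axis=1, keepdims=True)
    sim = x @ y.T            # character-by-character similarity matrix
    plan = sinkhorn_plan(1.0 - sim)
    return sim, plan


if __name__ == "__main__":
    sim, plan = toy_alignment("robert", "bob")
    print(plan.shape)        # one row per character of the first string
    print(plan.sum(axis=0))  # column sums match the uniform marginal
```

In this sketch each entry of `plan` is the amount of mass moved between one character of the first string and one character of the second, so larger entries mark character pairs that the alignment considers matched. A learned model would then score such an alignment rather than read it off directly.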
Taxonomy
Topics: Topic Modeling · Data Quality and Management · Natural Language Processing Techniques
