Contextualized Semantic Distance between Highly Overlapped Texts
Letian Peng, Zuchao Li, Hai Zhao

TL;DR
This paper introduces Neighboring Distribution Divergence (NDD), a semantic distance metric that uses masked language modeling to compare highly overlapped texts, with the aim of improving sensitivity to small edits and supporting domain adaptation in NLP tasks.
Contribution
It proposes a mask-and-predict strategy with NDD, addressing limitations of traditional metrics in overlapped texts and enabling unsupervised text compression and domain adaptation.
Findings
NDD outperforms traditional metrics in semantic similarity tasks.
The method improves text compression without training.
NDD surpasses supervised methods in domain adaptation.
Abstract
Overlapping frequently occurs in paired texts in natural language processing tasks such as text editing and semantic similarity evaluation. Better evaluation of the semantic distance between overlapped sentences benefits a language system's understanding and guides generation. Because conventional semantic metrics are based on word representations, they are vulnerable to the disturbance of overlapped components with similar representations. This paper aims to address the issue with a mask-and-predict strategy. We take the words in the longest common sequence (LCS) as neighboring words and use masked language modeling (MLM) from pre-trained language models (PLMs) to predict the distributions at their positions. Our metric, Neighboring Distribution Divergence (NDD), represents the semantic distance by calculating the divergence between distributions in the overlapped parts.…
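The mask-and-predict idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which divergence is used or how positions are weighted, so the unweighted sum of KL divergences here is an assumption. The names `lcs_pairs`, `kl_divergence`, `ndd`, and `toy_predict` are hypothetical, and `toy_predict` is a hand-written stand-in for a real masked language model so that the example runs without any external library.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Distribution = Dict[str, float]
# A predictor takes a token list with one position masked, plus that
# position, and returns a probability distribution over candidate tokens.
Predictor = Callable[[Sequence[str], int], Distribution]


def lcs_pairs(a: Sequence[str], b: Sequence[str]) -> List[Tuple[int, int]]:
    """Return aligned index pairs (i, j) of one longest common subsequence."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def kl_divergence(p: Distribution, q: Distribution, eps: float = 1e-12) -> float:
    """KL(p || q) over the support of p; q is floored at eps to avoid log(0)."""
    return sum(
        pv * math.log(pv / max(q.get(tok, 0.0), eps))
        for tok, pv in p.items()
        if pv > 0
    )


def ndd(a: Sequence[str], b: Sequence[str], predict: Predictor) -> float:
    """Sum of divergences between predictions at the shared (LCS) positions.

    Assumption: an unweighted sum of KL divergences.
    """
    total = 0.0
    for i, j in lcs_pairs(a, b):
        masked_a = list(a)
        masked_a[i] = "[MASK]"
        masked_b = list(b)
        masked_b[j] = "[MASK]"
        total += kl_divergence(predict(masked_a, i), predict(masked_b, j))
    return total


def toy_predict(tokens: Sequence[str], position: int) -> Distribution:
    """Hypothetical stand-in for an MLM: the prediction shifts if 'not' is in context."""
    if "not" in tokens:
        return {"neg": 0.9, "pos": 0.1}
    return {"pos": 0.9, "neg": 0.1}


original = "the movie is good".split()
edited = "the movie is not good".split()
same_score = ndd(original, original, toy_predict)
edit_score = ndd(original, edited, toy_predict)
```

With this toy predictor, identical texts score zero, while inserting "not" changes the predicted distribution at every shared position, so the score becomes positive even though the two sentences overlap almost entirely. A real use would replace `toy_predict` with distributions from a pre-trained masked language model.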
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
