Comparative analysis of word embeddings in assessing semantic similarity of complex sentences
Dhivya Chandrasekaran, Vijay Mago

TL;DR
This paper investigates how the complexity of sentences affects the performance of different word embeddings and language models in assessing semantic similarity, revealing a notable decline in accuracy with increased sentence complexity.
Contribution
It introduces a new complex sentence dataset and analyzes the sensitivity of various embeddings to sentence complexity, highlighting limitations of current models.
Findings
Pearson's and Spearman's correlations with human judgments drop 10-20% as sentence complexity increases
Existing benchmarks may overestimate model capabilities on complex sentences
Complexity impacts the reliability of semantic similarity assessments
Abstract
Semantic textual similarity is one of the open research challenges in the field of Natural Language Processing. Extensive research has been carried out in this field, and recent transformer-based models achieve near-perfect results on existing benchmark datasets such as the STS dataset and the SICK dataset. In this paper, we study the sentences in these datasets and analyze the sensitivity of various word embeddings to the complexity of the sentences. We build a complex-sentence dataset comprising 50 sentence pairs with associated semantic similarity values provided by 15 human annotators. Readability analysis is performed to highlight the increase in complexity between the sentences in the existing benchmark datasets and those in the proposed dataset. Further, we perform a comparative analysis of the performance of various word embeddings and language models on the…
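The evaluation the abstract describes, scoring sentence pairs with an embedding model and comparing the scores against human similarity ratings, can be illustrated with a minimal sketch. This is not the authors' pipeline: the word vectors, sentence pairs, and human scores below are made-up toy values, mean-pooling with cosine similarity is only one common way to score a pair, and the helper names (`sentence_vector`, `cosine`, `pearson`, `spearman`) are hypothetical.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration (not real embeddings).
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "sleeps": [0.1, 0.9, 0.0],
    "runs": [0.2, 0.7, 0.3],
    "stock": [0.0, 0.1, 0.9],
    "falls": [0.1, 0.3, 0.8],
}


def sentence_vector(sentence):
    """Mean-pool the vectors of known words in a whitespace-tokenized sentence."""
    vectors = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in sentence")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical sentence pairs with hypothetical human ratings on a 0-5 scale.
pairs = [
    ("cat sleeps", "dog sleeps"),
    ("cat runs", "dog sleeps"),
    ("cat sleeps", "stock falls"),
]
human_scores = [4.5, 3.0, 0.5]

model_scores = [cosine(sentence_vector(a), sentence_vector(b)) for a, b in pairs]
print("Pearson:", pearson(model_scores, human_scores))
print("Spearman:", spearman(model_scores, human_scores))
```

Running the same comparison separately on a simple-sentence set and a complex-sentence set, then comparing the two correlation values, is the kind of measurement behind a statement that performance falls as complexity rises.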
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
