Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs
Doohee You, S Fraiberger

TL;DR
This paper evaluates deduplication methods for economic research paper titles using NLP and LLMs, finding that semantic similarity measures agree across methods and that duplicates appear to be rare.
Contribution
It compares traditional and semantic similarity-based deduplication techniques, including LLMs, for economic research titles, providing insights into their relative effectiveness.
Findings
Potentially low prevalence of duplicates, based on semantic similarity observed across methods
Semantic measures align with NLP- and LLM-based distance metrics
Validation against a human-annotated ground truth set supports the distance-metric results
Abstract
This study investigates efficient deduplication techniques for a large NLP dataset of economic research paper titles. We explore various pairing methods alongside established distance measures (Levenshtein distance, cosine similarity) and an sBERT model for semantic evaluation. Our findings suggest a potentially low prevalence of duplicates based on the observed semantic similarity across different methods. Further exploration with a human-annotated ground truth set has been completed for a more conclusive assessment. The results support the findings from the NLP- and LLM-based distance metrics.
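As an illustration of the two established distance measures named in the abstract, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the bag-of-words vectors used for cosine similarity, the exhaustive all-pairs comparison, and the 0.9 thresholds are assumptions chosen for the example. An sBERT-based variant would replace the token-count vectors with sentence embeddings.

```python
from collections import Counter
from itertools import combinations
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of lowercase token-count vectors (an assumed representation)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def candidate_duplicates(titles, lev_threshold=0.9, cos_threshold=0.9):
    """Return index pairs whose titles exceed either (hypothetical) threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(titles), 2):
        if (levenshtein_similarity(a.lower(), b.lower()) >= lev_threshold
                or cosine_similarity(a, b) >= cos_threshold):
            pairs.append((i, j))
    return pairs
```

Comparing every pair is quadratic in the number of titles, so on a large dataset a pairing or blocking step would normally narrow the candidates first. The sketch omits that step.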
Taxonomy
Topics: Data Quality and Management · Semantic Web and Ontologies · Topic Modeling
Methods: Sparse Evolutionary Training · Sentence-BERT
