Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs

Doohee You; S Fraiberger

arXiv:2410.01141 · cs.CL · July 2, 2025

TL;DR

This paper evaluates various deduplication methods for economic research paper titles using NLP and LLMs, highlighting the effectiveness of semantic similarity measures and the low prevalence of duplicates.

Contribution

It compares traditional and semantic similarity-based deduplication techniques, including LLMs, for economic research titles, providing insights into their relative effectiveness.

Findings

1. Low prevalence of duplicates based on semantic similarity

2. Semantic measures align with NLP and LLM-based distance metrics

3. Validation against a human-annotated ground truth set supports the distance-metric results

Abstract

This study investigates efficient deduplication techniques for a large NLP dataset of economic research paper titles. We explore various pairing methods alongside established distance measures (Levenshtein distance, cosine similarity) and an sBERT model for semantic evaluation. Our findings suggest a potentially low prevalence of duplicates, based on the semantic similarity observed across the different methods. For a more conclusive assessment, we also completed an evaluation against a human-annotated ground truth set; its result supports the findings from the NLP- and LLM-based distance metrics.
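
The two established measures named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the whitespace tokenization, the length normalization, and the example titles are all hypothetical, and the cosine similarity here runs on raw word-count vectors. A semantic variant would presumably apply the same cosine formula to sBERT sentence embeddings instead.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical.
    Dividing by the longer length is an assumption of this sketch."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between lowercase word-count vectors."""
    counts_a = Counter(a.lower().split())
    counts_b = Counter(b.lower().split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical title pair, made up for illustration only.
title_1 = "Trade Liberalization and Wage Inequality"
title_2 = "Trade liberalisation and wage inequality"
print("Levenshtein similarity:", levenshtein_similarity(title_1.lower(), title_2.lower()))
print("Cosine similarity:", cosine_similarity(title_1, title_2))
```

The two scores capture different things: edit distance is sensitive to character-level changes such as spelling variants, while word-level cosine similarity ignores word order but treats any spelling difference as a different token. Neither captures paraphrase, which is the gap a sentence-embedding model is meant to fill.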

Taxonomy

Topics: Data Quality and Management · Semantic Web and Ontologies · Topic Modeling

Methods: Sparse Evolutionary Training · Sentence-BERT