Combining Embeddings and Domain Knowledge for Job Posting Duplicate Detection
Matthias Engelbach, Dennis Klau, Maximilien Kintz, Alexander Ulrich

TL;DR
This paper presents a method for detecting duplicate job descriptions that combines string similarity, text embeddings, and keyword matching. The combination outperforms each technique used alone, and the resulting tool is deployed in production.
Contribution
It introduces a novel combination of overlap-based similarity, embeddings, and curated skill lists for effective duplicate detection in job postings.
Findings
Combining multiple methods improves detection accuracy.
The approach outperforms individual techniques.
The tool is successfully deployed in production.
Abstract
Job descriptions are posted on many online channels, including company websites, job boards or social media platforms. These descriptions are usually published with varying text for the same job, due to the requirements of each platform or to target different audiences. However, for the purpose of automated recruitment and assistance of people working with these texts, it is helpful to aggregate job postings across platforms and thus detect duplicate descriptions that refer to the same job. In this work, we propose an approach for detecting duplicates in job descriptions. We show that combining overlap-based character similarity with text embedding and keyword matching methods leads to convincing results. In particular, we show that although no approach individually achieves satisfying performance, a combination of string comparison, deep textual embeddings, and the use of curated…
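To make the idea of combining the three signals concrete, here is a minimal sketch in Python. It is not the authors' implementation: the weighted sum, the weights, the threshold, the `SKILLS` set, and all function names are assumptions chosen for illustration. In particular, a simple token-count cosine similarity stands in for the deep text embeddings mentioned in the abstract, so that the example runs with the standard library only.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical curated skill list (illustrative only, not from the paper).
SKILLS = {"python", "sql", "java", "docker", "kubernetes"}


def tokens(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def string_similarity(a, b):
    """Character-level overlap ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def vector_similarity(a, b):
    """Cosine similarity of token counts.

    Assumption: this is a stand-in for an embedding model; a real system
    would compare dense sentence or document embeddings instead.
    """
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def skill_similarity(a, b):
    """Jaccard overlap of the curated skills mentioned in each text."""
    sa = SKILLS & set(tokens(a))
    sb = SKILLS & set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def duplicate_score(a, b, weights=(0.3, 0.4, 0.3)):
    """Weighted sum of the three signals (weights are assumed values)."""
    parts = (
        string_similarity(a, b),
        vector_similarity(a, b),
        skill_similarity(a, b),
    )
    return sum(w * p for w, p in zip(weights, parts))


def is_duplicate(a, b, threshold=0.6):
    """Flag a pair as duplicate when the score reaches an assumed threshold."""
    return duplicate_score(a, b) >= threshold
```

In practice the weights and threshold would need tuning on labeled pairs of postings, and the combination rule itself could be something other than a weighted sum, for example a classifier trained on the three scores.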
Taxonomy
Topics: Spam and Phishing Detection · Authorship Attribution and Profiling · Topic Modeling
