Combining Embeddings and Domain Knowledge for Job Posting Duplicate Detection

Matthias Engelbach; Dennis Klau; Maximilien Kintz; Alexander Ulrich

arXiv:2406.06257·cs.CL·June 11, 2024

Open Access

TL;DR

This paper presents a method for detecting duplicate job descriptions that combines string similarity, text embeddings, and keyword matching. The combination outperforms each technique used alone, and the resulting tool is in production use.

Contribution

It introduces a novel combination of overlap-based similarity, embeddings, and curated skill lists for effective duplicate detection in job postings.

Findings

01 Combining multiple methods improves detection accuracy.

02 The approach outperforms individual techniques.

03 The tool is successfully deployed in production.

Abstract

Job descriptions are posted on many online channels, including company websites, job boards, and social media platforms. The same job is usually published with varying text, either because of the requirements of each platform or to target different audiences. However, for automated recruitment and for assisting the people who work with these texts, it is helpful to aggregate job postings across platforms and thus detect duplicate descriptions that refer to the same job. In this work, we propose an approach for detecting duplicates in job descriptions. We show that combining overlap-based character similarity with text embedding and keyword matching methods leads to convincing results. In particular, we show that although no approach individually achieves satisfying performance, a combination of string comparison, deep textual embeddings, and the use of curated…
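
To make the idea of combining the three signals concrete, here is a minimal sketch and not the authors' implementation. It assumes a weighted sum of three scores, each in [0, 1]: a character-level similarity from Python's `difflib`, a bag-of-words cosine similarity standing in for a learned text embedding, and a Jaccard overlap on a small skill list. The `SKILLS` set, the weights, the threshold, and all function names are hypothetical choices made for illustration only.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical curated skill list; the lists used in the paper are not shown here.
SKILLS = {"python", "sql", "java", "docker", "kubernetes"}


def tokens(text):
    """Lowercase word tokens (letters, digits, '+' and '#')."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def char_overlap(a, b):
    """Character-level similarity in [0, 1] from difflib's matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def bow_cosine(a, b):
    """Cosine similarity of token-count vectors.

    Assumption: this is only a stand-in for a deep text embedding model.
    """
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def skill_jaccard(a, b, skills=SKILLS):
    """Jaccard overlap of the listed skills mentioned in each text."""
    sa, sb = skills & set(tokens(a)), skills & set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def duplicate_score(a, b, weights=(0.3, 0.4, 0.3)):
    """Weighted sum of the three signals; the weights are illustrative."""
    w_char, w_emb, w_skill = weights
    return (
        w_char * char_overlap(a, b)
        + w_emb * bow_cosine(a, b)
        + w_skill * skill_jaccard(a, b)
    )


def is_duplicate(a, b, threshold=0.6):
    """Flag a pair as duplicate when the combined score reaches the threshold."""
    return duplicate_score(a, b) >= threshold
```

In practice the weights and threshold would have to be tuned on labeled pairs, and the bag-of-words function would be replaced by an actual embedding model; the sketch only shows how separate signals can feed one decision.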

Taxonomy

Topics: Spam and Phishing Detection · Authorship Attribution and Profiling · Topic Modeling