Contrastive Bi-Encoder Models for Multi-Label Skill Extraction: Enhancing ESCO Ontology Matching with BERT and Attention Mechanisms
Yongming Sun

TL;DR
This paper introduces a zero-shot, contrastive bi-encoder framework using LLM-generated synthetic data and hierarchical skill generation to improve multi-label skill extraction from job ads, especially in non-English contexts.
Contribution
It presents a novel zero-shot approach combining LLM-synthesized training data with a hierarchical, contrastive bi-encoder model for skill extraction, reducing reliance on manual annotations.
Findings
Hierarchical skill generation improves fluency and discriminability.
The model achieves strong zero-shot performance on Chinese job ads.
The model outperforms TF-IDF and standard BERT baselines in retrieval tasks.
Abstract
Fine-grained labor market analysis increasingly relies on mapping unstructured job advertisements to standardized skill taxonomies such as ESCO. This mapping is naturally formulated as an Extreme Multi-Label Classification (XMLC) problem, but supervised solutions are constrained by the scarcity and cost of large-scale, taxonomy-aligned annotations, especially in non-English settings where job-ad language diverges substantially from formal skill definitions. We propose a zero-shot skill extraction framework that eliminates the need for manually labeled job-ad training data. The framework uses a Large Language Model (LLM) to synthesize training instances from ESCO definitions, and introduces hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. On top of the synthetic corpus, we train a contrastive…
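The abstract is cut off before it describes the training objective, so the following is only a minimal sketch of how a contrastive bi-encoder for skill retrieval is commonly set up, not the paper's implementation. It assumes an in-batch InfoNCE-style loss, cosine similarity, and a temperature of 0.05; none of these are confirmed by the text above. The function names, the toy 2-dimensional vectors (standing in for BERT encoder outputs), and the skill IDs "a", "b", "c" are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_scores(query_vecs, skill_vecs):
    """Score matrix: one row per job-ad sentence, one column per skill."""
    queries = [l2_normalize(v) for v in query_vecs]
    skills = [l2_normalize(v) for v in skill_vecs]
    return [[sum(a * b for a, b in zip(q, s)) for s in skills] for q in queries]


def info_nce_loss(query_vecs, skill_vecs, temperature=0.05):
    """In-batch contrastive loss (assumed objective, not taken from the paper).

    Query i is paired with skill i as its positive; every other skill
    in the batch acts as a negative.
    """
    scores = cosine_scores(query_vecs, skill_vecs)
    total = 0.0
    for i, row in enumerate(scores):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(scores)


def top_k_skills(query_vec, skill_vecs, skill_ids, k=2):
    """Multi-label prediction by retrieval: return the k best-scoring skill IDs."""
    row = cosine_scores([query_vec], skill_vecs)[0]
    ranked = sorted(zip(skill_ids, row), key=lambda pair: pair[1], reverse=True)
    return [skill_id for skill_id, _ in ranked[:k]]


# Toy embeddings; a real system would obtain these from the two encoders.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
predicted = top_k_skills(
    [1.0, 0.1],
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]],
    ["a", "b", "c"],
    k=2,
)
```

In this sketch the loss is near zero when each job-ad vector points toward its paired skill and large when the pairs are swapped, which is the property that lets the trained encoders rank taxonomy skills against unseen job-ad text at inference time.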
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining
