Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE
Christian Møller Dahl, Torben Johansen, Christian Vedel

TL;DR
This paper presents OccCANINE, an open-source tool that automates occupational coding with high accuracy, significantly reducing manual effort and enabling broader research applications across multiple coding systems.
Contribution
The paper introduces a fine-tuned CANINE model for automatic occupational standardization, achieving 96% accuracy and generalizing across multiple coding systems.
Findings
Achieves 96% accuracy, precision, and recall.
Reduces coding time from weeks to minutes.
Generalizes to multiple occupational coding systems.
Abstract
This paper introduces OccCANINE, an open-source tool that maps occupational descriptions to HISCO codes. Manual coding is slow and error-prone; OccCANINE replaces weeks of work with results in minutes. We fine-tune CANINE on 15.8 million description-code pairs from 29 sources in 13 languages. The model achieves 96 percent accuracy, precision, and recall. We also show that the approach generalizes to three systems (OCC1950, OCCICEM, and ISCO-68) and release them as open source. By breaking the "HISCO barrier," OccCANINE democratizes access to high-quality occupational coding, enabling broader research in economics, economic history, and related disciplines.
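The abstract reports accuracy, precision, and recall without defining them here. As a rough illustration only, the sketch below shows one common way such scores could be computed when each description may carry one or more codes: per-description set overlap, averaged over all descriptions. The function name `score_predictions`, the averaging scheme, and the placeholder labels are assumptions made for this example; they are not taken from the paper and are not real HISCO codes.

```python
from typing import Dict, Iterable, Sequence


def score_predictions(
    true_codes: Sequence[Iterable[str]],
    predicted_codes: Sequence[Iterable[str]],
) -> Dict[str, float]:
    """Average per-description accuracy, precision, and recall.

    Assumption for this sketch: each description has a set of codes, and
    scores are averaged over descriptions (not pooled over codes).
    """
    if len(true_codes) != len(predicted_codes):
        raise ValueError("true_codes and predicted_codes must have equal length")
    if len(true_codes) == 0:
        raise ValueError("at least one description is required")

    exact = precision_sum = recall_sum = 0.0
    for truth_raw, pred_raw in zip(true_codes, predicted_codes):
        truth, pred = set(truth_raw), set(pred_raw)
        overlap = len(truth & pred)
        # Accuracy: the predicted set must match the true set exactly.
        exact += truth == pred
        # Precision: share of predicted codes that are correct.
        # An empty prediction counts as perfect only if the truth is empty too.
        precision_sum += overlap / len(pred) if pred else float(not truth)
        # Recall: share of true codes that were predicted.
        recall_sum += overlap / len(truth) if truth else float(not pred)

    n = len(true_codes)
    return {
        "accuracy": exact / n,
        "precision": precision_sum / n,
        "recall": recall_sum / n,
    }


# Placeholder labels, invented for illustration (not real occupational codes).
truth = [{"A"}, {"B", "C"}, {"D"}]
pred = [{"A"}, {"B"}, {"E"}]
scores = score_predictions(truth, pred)
print(scores)
```

Averaging per description keeps a single description with many codes from dominating the result; pooling over all codes would be an equally defensible choice and can give different numbers.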
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Usability and User Interface Design · Safety Systems Engineering in Autonomy
