Resume Information Extraction via Post-OCR Text Processing
Selahattin Serdar Helli, Senem Tanberk, Sena Nur Cavsak

TL;DR
This paper presents a method for extracting resume information by classifying post-OCR text groups using NLP models, with a focus on the effectiveness of DistilBERT in this process.
Contribution
It introduces a novel approach combining OCR, object recognition, and NLP classification for resume data extraction, highlighting the performance of DistilBERT.
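The combination described above can be sketched as a three-stage pipeline. This is a minimal illustration, not the authors' implementation: it assumes text regions are detected first, each region is then read by OCR, and the resulting text group is classified. The names `TextGroup`, `extract_text_groups`, and the three `toy_*` stand-ins are hypothetical; in practice the callables would wrap a YOLOv8 detector, an OCR engine, and a fine-tuned DistilBERT classifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class TextGroup:
    box: Box    # region returned by the detector
    text: str   # OCR output for that region
    label: str  # predicted class, e.g. "education"


def extract_text_groups(
    page,
    detect_regions: Callable[[object], List[Box]],
    read_text: Callable[[object, Box], str],
    classify_text: Callable[[str], str],
) -> List[TextGroup]:
    """Detect regions, OCR each one, and classify the resulting text group."""
    groups = []
    for box in detect_regions(page):
        text = read_text(page, box).strip()
        if not text:  # skip regions where OCR produced nothing usable
            continue
        groups.append(TextGroup(box, text, classify_text(text)))
    return groups


# Toy stand-ins (assumptions for illustration only): the "page" is a dict
# mapping a box to the text that OCR would return for it.
toy_page: Dict[Box, str] = {
    (0, 0, 100, 40): "BSc Computer Science, 2019",
    (0, 50, 100, 90): "   ",
    (0, 100, 100, 140): "English (fluent)",
}


def toy_detect(page):
    return sorted(page)  # boxes in top-to-bottom order


def toy_ocr(page, box):
    return page[box]


def toy_classify(text):
    return "language" if "English" in text else "education"


groups = extract_text_groups(toy_page, toy_detect, toy_ocr, toy_classify)
for g in groups:
    print(g.label, "|", g.text)
```

Passing the three stages in as callables keeps the sketch independent of any particular detector, OCR engine, or classifier, so each can be swapped without touching the pipeline logic.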
Findings
DistilBERT achieved the highest F1 scores among models tested.
YOLOv8 effectively identified text regions for classification.
The approach improves resume information extraction accuracy.
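For readers unfamiliar with the metric, the sketch below shows how per-class F1 and a macro average could be computed over the five text-group classes. It is an illustration only: the averaging scheme is an assumption (the summary does not state which one the authors used), the function names are hypothetical, and the labels in the example are made up, not results from the paper.

```python
from collections import Counter

LABELS = ["education", "experience", "talent", "personal", "language"]


def per_class_f1(y_true, y_pred, labels=LABELS):
    """Return {label: F1} where F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A class with no true or predicted samples is scored 0.0 here.
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Made-up labels for four text groups.
y_true = ["education", "education", "experience", "talent"]
y_pred = ["education", "experience", "experience", "talent"]
f1_by_class = per_class_f1(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(f1_by_class, overall)
```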
Abstract
Information extraction (IE), one of the main tasks of natural language processing (NLP), has recently gained importance in the processing of resumes. Studies that extract information from CV text have generally relied on sentence classification with NLP models. This study aims to extract information by classifying all of the text groups obtained after pre-processing the resumes with Optical Character Recognition (OCR) and object recognition with the YOLOv8 model. The text dataset consists of 286 resumes collected for 5 different (education, experience, talent, personal and language) job descriptions in the IT industry. The dataset created for object recognition consists of 1198 resumes, which were collected from open sources on the internet and labeled as sets of text. BERT, BERT-t, DistilBERT, RoBERTa and XLNet were used as models. F1 score variances were used to compare the…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Topic Modeling
Methods: You Only Look Once · Multi-Head Attention · Attention Is All You Need · Attention Dropout · WordPiece · Adam · Byte Pair Encoding · Residual Connection · Weight Decay · Softmax
