CER-HV: A Human-in-the-Loop Framework for Cleaning Datasets Applied to Arabic-Script HTR
Sana Al-azzawi, Elisa Barney, Marcus Liwicki

TL;DR
This paper introduces CER-HV, a human-in-the-loop framework that ranks samples by character error rate (CER) to detect label errors in Arabic-script handwritten text datasets and has a human verify the top-ranked samples before they are cleaned. Cleaning improves evaluation CER by 0.3-1.8%, depending on how noisy the dataset is.
Contribution
The paper presents CER-HV, a novel framework combining CER-based noise detection with human verification, revealing dataset issues and enhancing HTR accuracy across multiple Arabic-script languages.
Findings
Datasets contain significant transcription, segmentation, and orientation errors.
CER-HV improves evaluation CER by 0.3-1.8% depending on dataset noisiness.
State-of-the-art CRNN performance achieved on multiple Arabic-script HTR datasets.
Abstract
Handwritten text recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR, despite recent advances in model architectures, datasets, and benchmarks. We show that data quality is a significant limiting factor in many published datasets and propose CER-HV (CER-based Ranking with Human Verification) as a framework to detect and clean label errors. CER-HV combines a CER-based noise detector, built on a carefully configured Convolutional Recurrent Neural Network (CRNN) with early stopping to avoid overfitting noisy samples, and a human-in-the-loop (HITL) step that verifies high-ranking samples. The framework reveals that several existing datasets contain previously underreported problems, including transcription, segmentation, orientation, and non-text content errors. These have been identified with up to 90 percent precision in the Muharaf and 80-86 percent in the…
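The ranking step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a trained recognizer has already produced a prediction for every labeled sample, computes the character error rate (CER) of each prediction against its label, and returns the highest-CER samples as candidates for human verification. The names `Sample`, `cer`, and `rank_for_review` are hypothetical, and the CRNN training, early stopping, and the review interface are out of scope here.

```python
from dataclasses import dataclass


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(label: str, prediction: str) -> float:
    """CER = edit distance divided by the label length."""
    if not label:
        # Assumption: an empty label with a non-empty prediction counts as CER 1.0.
        return float(len(prediction) > 0)
    return edit_distance(label, prediction) / len(label)


@dataclass
class Sample:
    sample_id: str    # hypothetical identifier for a line image
    label: str        # transcription stored in the dataset
    prediction: str   # output of the recognizer for the same image


def rank_for_review(samples, top_k):
    """Return the top_k (cer, sample_id) pairs, highest CER first."""
    scored = [(cer(s.label, s.prediction), s.sample_id) for s in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data with Latin placeholders; real labels would be Arabic-script strings.
samples = [
    Sample("a", "abcd", "abcd"),  # prediction agrees with the label
    Sample("b", "abcd", "zzzz"),  # strong disagreement: likely label or image issue
    Sample("c", "abcd", "abxd"),  # one substituted character
]
review_queue = rank_for_review(samples, top_k=2)
```

A high CER only flags a sample as suspicious; the disagreement may come from a model mistake rather than a label error, which is why the framework routes the ranked samples to a human. Note that this sketch compares raw Unicode code points, so any text normalization would have to be applied to labels and predictions beforehand.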
Taxonomy
Topics: Handwritten Text Recognition Techniques · Topic Modeling · Advanced Neural Network Applications
