Automated Detection of Dosing Errors in Clinical Trial Narratives: A Multi-Modal Feature Engineering Approach with LightGBM
Mohammad AL-Smadi

TL;DR
This paper introduces an automated, multi-modal feature engineering approach that uses LightGBM to detect dosing errors in clinical trial narratives, reaching 0.8725 test ROC-AUC despite severe class imbalance (4.9% positive rate).
Contribution
It combines diverse feature types, including traditional NLP features (TF-IDF, character n-grams), dense semantic embeddings, domain-specific medical patterns, and transformer-based scores, and demonstrates the effectiveness of feature selection and sparse lexical features in clinical text classification.
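To make the idea of combining sparse lexical features with dense ones concrete, here is a minimal pure-Python sketch of character n-gram TF-IDF followed by concatenation with a dense vector. It is an illustration under stated assumptions, not the paper's pipeline: the function names (`char_ngrams`, `tfidf_vectors`, `concat_features`), the trigram size, the smoothed IDF formula, and the placeholder dense values are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Compute one sparse TF-IDF dict (n-gram -> weight) per document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: how many documents contain each n-gram.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        # Smoothed IDF (an assumption of this sketch) keeps weights positive
        # even for n-grams that occur in every document.
        vectors.append({
            g: (tf / total) * (math.log((1 + n_docs) / (1 + df[g])) + 1)
            for g, tf in c.items()
        })
    return vectors


def concat_features(sparse_vec, vocab, dense_vec):
    """Lay a sparse dict out over a fixed vocabulary, then append dense values."""
    return [sparse_vec.get(g, 0.0) for g in vocab] + list(dense_vec)


docs = ["dose 5 mg", "dose 50 mg"]
vecs = tfidf_vectors(docs)
# The dense values below are placeholders standing in for a sentence embedding.
row = concat_features(vecs[0], ["dos", "50 "], [0.25, -0.1])
```

In this toy corpus the trigram `"50 "` appears only in the second narrative, so it receives a higher weight there than the shared trigram `"dos"`. Rare, specific character patterns like this are what make sparse lexical features useful alongside dense embeddings.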
Findings
Achieves 0.8725 test ROC-AUC on the CT-DEB benchmark dataset.
Removing sentence embeddings significantly reduces performance.
Selecting the top 500-1000 features yields the best results, outperforming the full feature set.
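The feature-selection finding can be illustrated with a short sketch. The summary does not say how features are ranked, so this example assumes a per-feature importance score is already available (for instance, from a previously trained model). The helper names `select_top_k` and `project` are hypothetical.

```python
def select_top_k(importances, k):
    """Return the indices of the k largest importances, in ascending index order."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])


def project(rows, keep):
    """Keep only the selected feature columns in every row."""
    return [[row[i] for i in keep] for row in rows]


# Toy example: 4 features, keep the 2 with the highest importance scores.
importances = [0.1, 0.5, 0.3, 0.0]
keep = select_top_k(importances, 2)
reduced = project([[1, 2, 3, 4], [5, 6, 7, 8]], keep)
```

With 3,451 features, the same idea would use `k` in the 500-1000 range and retrain the model on the reduced matrix.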
Abstract
Clinical trials require strict adherence to medication protocols, yet dosing errors remain a persistent challenge affecting patient safety and trial integrity. We present an automated system for detecting dosing errors in unstructured clinical trial narratives using gradient boosting with comprehensive multi-modal feature engineering. Our approach combines 3,451 features spanning traditional NLP (TF-IDF, character n-grams), dense semantic embeddings (all-MiniLM-L6-v2), domain-specific medical patterns, and transformer-based scores (BiomedBERT, DeBERTa-v3), used to train a LightGBM model. Features are extracted from nine complementary text fields (median 5,400 characters per sample), ensuring complete coverage across all 42,112 clinical trial narratives. On the CT-DEB benchmark dataset with severe class imbalance (4.9% positive rate), we achieve 0.8725 test ROC-AUC through 5-fold ensemble…
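The evaluation setup mentioned in the abstract (a 5-fold ensemble scored by ROC-AUC on imbalanced data) can be sketched in pure Python. This is a simplified illustration under assumptions, not the authors' implementation. Stratifying the folds and averaging per-fold model scores are common ways to build such an ensemble, but the truncated abstract does not confirm either. The helper names are hypothetical, and the scores are made-up toy values.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample a fold id, spreading every class evenly across folds."""
    rng = random.Random(seed)
    folds = [0] * len(labels)
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[i] = pos % n_folds
    return folds


def average_predictions(per_model_scores):
    """Average the test-set scores of several fold models, sample by sample."""
    n_models = len(per_model_scores)
    return [sum(col) / n_models for col in zip(*per_model_scores)]


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic in the
    number of samples and is meant only for small illustrations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data with a 5% positive rate, close to the 4.9% reported for CT-DEB.
labels = [1] * 5 + [0] * 95
folds = stratified_folds(labels)

# Two hypothetical fold models scoring the same four test samples.
ensemble = average_predictions([[0.9, 0.2, 0.3, 0.7], [0.9, 0.0, 0.5, 0.5]])
auc = roc_auc([1, 0, 1, 0], ensemble)
```

Because ROC-AUC depends only on how positives rank relative to negatives, it stays informative when positives are rare. Plain accuracy would not: predicting "no error" for every narrative already reaches about 95% at this positive rate.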
