Detection of flawed multiple-choice questions in preclinical medical education using item difficulty and discrimination indices: a six-year analysis
Varanya Srisomsak, Chantacha Sitticharoon, Issarawan Keadkraichaiwat, Sunan Meethes, Inpreeya Inpaen

TL;DR
This study shows that using statistical thresholds alone misses some flawed multiple-choice questions in medical exams, highlighting the need for expert review alongside quantitative analysis.
Contribution
The study provides empirical evidence that static psychometric thresholds miss a meaningful share of flawed exam items (14.3% in this dataset).
Findings
14.3% of flawed items were missed when relying solely on p-value and rpb-value thresholds.
Flawed items tended to be more difficult and less discriminative than uncorrected items.
Expert review is necessary alongside quantitative metrics to ensure exam quality.
Abstract
Multiple-choice question (MCQ) exams may include flawed items that undermine validity. Psychometric indicators such as item difficulty (p-value) and the point-biserial coefficient (rpb-value) are widely used to identify problematic questions. Evidence on using p-value (< 0.25) and/or rpb-value (< 0) thresholds to detect flawed items remains limited. This study aimed to provide a proof of concept using a large, real-world dataset, evaluating how often flawed items were missed when relying solely on static thresholds. Exam analyses from 32 preclinical courses (academic years 2017–2022) were reviewed. Items meeting the predefined thresholds were flagged. In addition, any item was manually reviewed when the most frequently chosen answer was not the keyed correct answer or when multiple options had similar p-values. Flagged items were sent to course directors for verification, and only confirmed items were recorded as corrections.…
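The flagging rules described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`point_biserial`, `review_item`), the use of the uncorrected total score (which includes the item itself), and the example responses are assumptions made for illustration. Only the two thresholds (p < 0.25, rpb < 0) and the "most frequently chosen answer is not the keyed answer" check come from the text.

```python
from collections import Counter
from math import sqrt

P_THRESHOLD = 0.25   # difficulty threshold stated in the abstract
RPB_THRESHOLD = 0.0  # point-biserial threshold stated in the abstract


def point_biserial(item_correct, total_scores):
    """Point-biserial correlation between a 0/1 item score and the total score.

    Assumption: the total score is uncorrected (it still includes the item).
    """
    n = len(item_correct)
    n_correct = sum(item_correct)
    if n_correct == 0 or n_correct == n:
        return 0.0  # undefined when all or none answer correctly; treated as 0 here
    p = n_correct / n
    mean_all = sum(total_scores) / n
    sd_all = sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    if sd_all == 0:
        return 0.0  # no spread in total scores, so no correlation can be computed
    mean_right = sum(t for c, t in zip(item_correct, total_scores) if c) / n_correct
    mean_wrong = sum(t for c, t in zip(item_correct, total_scores) if not c) / (n - n_correct)
    return (mean_right - mean_wrong) / sd_all * sqrt(p * (1 - p))


def review_item(responses, key, total_scores):
    """Apply the static thresholds and the modal-answer check to one item."""
    item_correct = [1 if r == key else 0 for r in responses]
    p = sum(item_correct) / len(responses)
    rpb = point_biserial(item_correct, total_scores)
    modal_option = Counter(responses).most_common(1)[0][0]
    return {
        "p": p,
        "rpb": rpb,
        # "and/or" in the abstract is read here as: either threshold triggers a flag
        "flagged_by_threshold": p < P_THRESHOLD or rpb < RPB_THRESHOLD,
        # manual review when the most frequently chosen option is not the key
        "needs_manual_review": modal_option != key,
    }


# Hypothetical item: moderate difficulty and positive discrimination,
# yet most examinees chose an option other than the key.
result = review_item(["A", "A", "B", "B", "B"], "A", [9, 8, 5, 4, 3])
print(result["flagged_by_threshold"], result["needs_manual_review"])
```

In this made-up example the item passes both static thresholds but is still picked up by the modal-answer check, which is the kind of case the study argues requires review beyond p-value and rpb-value cut-offs.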
Taxonomy
Topics: Psychometric Methodologies and Testing · Medical Education and Admissions · Reliability and Agreement in Measurement
