Towards Unbiased Evaluation of Detecting Unanswerable Questions in EHRSQL
Yongjin Yang, Sihyeon Kim, SangMook Kim, Gyubok Lee, Se-Young Yun, Edward Choi

TL;DR
This paper highlights the data bias in unanswerable questions within the EHRSQL dataset for EHR QA systems and proposes a data split strategy to mitigate this bias, improving evaluation reliability.
Contribution
It identifies a bias caused by N-gram patterns in unanswerable questions and introduces a simple data split adjustment to reduce this bias in EHR QA evaluation.
Findings
Bias due to N-gram patterns in unanswerable questions
Data split adjustment reduces bias impact
Improved evaluation reliability on MIMIC-III
Abstract
Incorporating unanswerable questions into EHR QA systems is crucial for testing the trustworthiness of a system, as providing non-existent responses can mislead doctors in their diagnoses. The EHRSQL dataset stands out as a promising benchmark because it is the only dataset that incorporates unanswerable questions in the EHR QA system alongside practical questions. However, in this work, we identify a data bias in these unanswerable questions; they can often be discerned simply by filtering with specific N-gram patterns. Such biases jeopardize the authenticity and reliability of QA system evaluations. To tackle this problem, we propose a simple debiasing method of adjusting the split between the validation and test sets to neutralize the undue influence of N-gram filtering. By experimenting on the MIMIC-III dataset, we demonstrate both the existing data bias in EHRSQL and the…
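To make the idea of N-gram filtering concrete, the following is a minimal sketch, not the authors' implementation. It assumes a filter that collects word N-grams appearing in several unanswerable training questions and in no answerable one, then flags any new question containing such an N-gram. The function names (`ngrams`, `fit_ngram_filter`, `flag_unanswerable`), the `min_count` threshold, and the toy questions are all hypothetical and are not taken from EHRSQL or the paper.

```python
from collections import Counter


def ngrams(text, n):
    """Return the set of word n-grams in a lowercased, whitespace-tokenized question."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def fit_ngram_filter(questions, labels, n=2, min_count=2):
    """Collect n-grams seen in at least `min_count` unanswerable questions
    and in no answerable question. `labels[i]` is True if question i is
    unanswerable. (Hypothetical criterion, chosen for illustration.)"""
    unanswerable_counts = Counter()
    answerable_grams = set()
    for question, is_unanswerable in zip(questions, labels):
        grams = ngrams(question, n)
        if is_unanswerable:
            unanswerable_counts.update(grams)
        else:
            answerable_grams |= grams
    return {
        gram
        for gram, count in unanswerable_counts.items()
        if count >= min_count and gram not in answerable_grams
    }


def flag_unanswerable(question, patterns, n=2):
    """Flag a question as unanswerable if it contains any learned pattern."""
    return bool(ngrams(question, n) & patterns)


# Toy, made-up questions (not from EHRSQL); True marks an unanswerable one.
train_questions = [
    "what is the phone number of patient 123",
    "tell me the phone number of the doctor",
    "what is the gender of patient 123",
    "tell me the age of patient 456",
]
train_labels = [True, True, False, False]

patterns = fit_ngram_filter(train_questions, train_labels)
flag_phone = flag_unanswerable("give me the phone number of patient 789", patterns)
flag_age = flag_unanswerable("what is the age of patient 789", patterns)
```

If a surface-level filter of this kind separates most unanswerable questions in the test set, a high detection score says little about whether a QA model understands what the database can answer, which is the evaluation concern the abstract raises.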
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
