Moving beyond the empty cell: The threat of decontextualized healthcare data
Aya El Mir, Eric Bezerra de Sousa, Ignacio Mesina-Estarrón, Leo Anthony Celi, Moad Hani, Mohammed Benjelloun, Neha Nageswaran, Saïd Mahmoudi, Shaheen Siddiqui, Sreeram Sadasivam, William Greig Mitchell

TL;DR
This paper argues that missing or inaccurate healthcare data is not just a technical issue but a result of social and systemic factors that AI can amplify if ignored.
Contribution
The paper introduces a new perspective on decontextualized data, emphasizing the need for context-aware AI in healthcare.
Findings
Traditional statistical methods fail to address the root causes of decontextualized data.
AI trained on decontextualized data can reinforce health inequities.
Context-aware approaches using metadata and social determinants can lead to more ethical AI systems.
Abstract
Missing, inaccurate, or poorly documented data in healthcare is often treated as a technical problem to be statistically resolved via imputation, deletion, or modeling assumptions about randomness. However, such inaccuracies stem from far more complex socioeconomic and geopolitical issues, rather than from “errors of data entry” to be ameliorated with statistical modeling techniques. We outline that what is really missing or inaccurate is the context in which the data is collected, and that only by understanding this context can we begin to prevent artificial intelligence’s (AI’s) amplification of misleading, decontextualized data. We critically examine how traditional modeling methods fail to account for the factors that influence what data gets recorded, and for whom. We show how AI systems trained on decontextualized data reinforce health inequities at scale. And, we review recent…
Taxonomy
Topics: Machine Learning in Healthcare · Data-Driven Disease Surveillance · Ethics and Social Impacts of AI
1. Introduction
Statistical techniques for managing missing, inaccurate, or poorly documented healthcare data, such as imputation and deletion, have been extensively researched and applied for decades, on the premise that data is missing either completely at random (MCAR), at random (MAR), or not at random (MNAR) [1,2]. These techniques, however, ignore an incontrovertible truth about such healthcare data: that it is essentially never missing, inaccurate, or flawed at random. What is really missing is an understanding of the context of the data. Rather than representing a data cell to be imputed, deleted, or filled for better statistical model performance, such data is the result of long-standing, intertwined geopolitical and socio-economic complexities, which demand far more nuanced and insightful approaches than those currently used if we are to meaningfully interpret them.
When approaching missing, inaccurate or poorly documented healthcare data, statistical methods rarely appreciate how the data came about—how data was collected (or not collected) and by whom, the socio-economic policies or habits facilitating data collection (or not), the devices used to capture certain data signals (or mis-capture/not capture signals), and how final datasets are curated and made available for analysis [3].
Data may be missing because of socioeconomic barriers to documentation. For instance, in intensive care units (ICU), overnight blood sugar level (BSL) data may not be documented due to language discordance between staff and patient, which necessitates an overnight interpreter and extra work: it is easier to let the non-English speaking patient sleep without checking their BSL than to wake them and concurrently find a translator [4]. Similarly, overnight bed-ulcer status might not be documented because a patient is too obese to be rolled by skeleton night staff, so the check is deferred until morning. These data aren’t just “missing”; they’re the result of pernicious social barriers prohibiting data collection.
Data may not make it to the point of collection in the first place—a form of “invisible” missing data. Black patients suffering out-of-hospital cardiac arrest (OOHCA) are less likely to be resuscitated and subsequently transported to a medical facility [5], making their data less likely to be captured. Aboriginal & Torres Strait Islander Australians (hereafter “Indigenous Australians”) are more likely to self-discharge from ICU against medical advice, similarly making their critical illness data impossible to capture in the first place [6]. And female patients are known to be more likely to have acute coronary syndrome (ACS) misdiagnosed by male physicians [7], making subsequent cardiovascular data less likely to be recorded.
Even when documented, data may be so inaccurate that the real underlying information is cloaked by erroneous readings. For instance, pulse oximetry data is much more inaccurate for patients with heavily pigmented skin [8], with falsely high readings leading to systemic under-treatment and worse outcomes [9]. Again, the missing data context and subsequent inaccuracy here are not random, but indicative of inequalities between those for whom equipment calibration and data collection have been optimized (patients with lightly pigmented skin) and those from whom data is actually being collected in a clinical setting.
When AI models are built with decontextualized data, they absorb these blind spots to generate flawed predictions at scale: the well-described “garbage in, garbage out” AI phenomenon [10]. While statistical methods that more accurately address contextual complexities embedded in data may exist, they cannot be developed by those who do not understand the data context at all points in the pipeline, from collection through documentation, curation, storage, and distribution [3]. Current statistical modeling methods to handle missing, inaccurate, and poorly recorded data (hereafter collectively termed “decontextualized data”) dismiss the context dictating why it is flawed and seek only to improve model accuracy parameters. Machine learning communities working in isolation lack an understanding of the geopolitical and socio-economic context of such missing data and are unable to develop such statistical tools alone. Without this understanding, gaping contextual cavities in data will continue to exist, posing most risk to those whose data is most contextually flawed: historically, those already disadvantaged by current healthcare structures [11].
This paper is a call to collaborative action between those who understand data context and those developing statistical techniques to resolve decontextualized data. We propose solutions and methodological improvements for tackling decontextualized data so that we might stand a chance at developing insightful methods to minimize its impact on already marginalized populations.
2. Rethinking missing data: The importance of context and critique of traditional approaches
Understanding the mechanisms that perpetuate decontextualized data requires examining the structural factors underlying data collection. The data lifecycle can be divided into pipeline stages, up to the point of acquisition and use by statisticians, to help highlight geopolitical and socio-economic contextual issues.
a. Data availability: Historically disadvantaged populations are less likely to have their data collected—“Invisible” missingness contextualization
Healthcare data may be unavailable to document in the first place, a form of ‘invisible’ missing data. This data is traditionally representative of patients from lower socioeconomic backgrounds, racial or ethnic minority groups, or communities with fragmented healthcare services across multiple regions. These are the populations whose healthcare stands most to gain by the considered use of AI, but who will conversely become relatively more disadvantaged without adequate contextual understanding of their healthcare data.
The health-related disadvantage of Indigenous Australians is extensively described, with substantially higher rates of chronic disease and difficulty accessing primary or tertiary healthcare [12–16]. Self-discharge rates against medical advice for Indigenous Australians are the highest in the world [17]; indeed, they are almost four times more likely to self-discharge during an ICU admission than non-Indigenous Australians [6]. Whilst the reasons for self-discharge are complex, they likely reflect an unfavorable perception of the unfamiliar hospital environment, perceived differences in care, institutionalized racism, and/or communication barriers [6,18,19]. Those who self-discharge against medical advice have almost 50% higher 8-year mortality compared to those who do not (adjusted hazard ratio 1.46; 95% confidence interval 1.01–2.1) [6]. Although these patients are a particularly high-risk group, their healthcare data becomes impossible to document once they self-discharge, and its omission from morbidity/mortality prediction models means predictions underestimate how unwell these patients are.
“Invisible” missing data similarly applies to Black patients suffering OOHCA in the United States (US). Black patients experience significantly higher rates of OOHCA than White patients [20] and in rural US settings are up to five times less likely to survive OOHCA compared to patients living in more affluent areas of the country [21]. Survival discrepancies following OOHCA are not due to biological variation between Black and White patients, but to contextual factors: lower likelihood of OOHCA being witnessed by bystanders, lower likelihood of bystander cardiopulmonary resuscitation if witnessed, lower availability of automated external defibrillators, longer emergency response service times, and lower likelihood of early advanced airway management once responders arrive [5]. Together, these lead to lower rates of hospitalization after OOHCA. Further, those who do make it to hospital are less likely to receive urgent interventions correlated with survival, like targeted temperature management, coronary angiography, or cardiac catheterization, due to variability in hospital capabilities (e.g., 24/7 cardiac catheter labs) and standardized post-arrest pathways [5,21], further contextualizing why survival rates are poorer. Models trained on decontextualized data to predict survival after OOHCA for Black patients omit the data of those who did not reach hospital in the first place and, for those who did, may erroneously attribute a substantially higher risk of mortality to race rather than to the context in which data was collected, i.e., lower-capability healthcare settings. Without an understanding of this context, lifesaving interventions may be withdrawn earlier on the assumption that outcomes will be poor because of race.
When certain populations are systematically underrepresented in datasets, such as Indigenous Australians or Black patients in the US, predictive models can also underestimate future healthcare needs. By failing to understand the data context, models misinterpret relatively lower historical healthcare expenditure (which is actually due to relatively under-reported healthcare data) as indicating lower future healthcare needs than are actually required, as has been demonstrated by Obermeyer et al. [22].
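The mechanism described above can be illustrated with a toy simulation. The sketch below is not the analysis of Obermeyer et al. [22]; it is a minimal, hypothetical example in which two groups have identical true need, but only an assumed 60% of one group's need reaches the record as cost. The group labels, the `capture` fraction, the 20% selection cutoff, and all function names are assumptions made purely for illustration.

```python
import random


def simulate_patients(n=10000, seed=1):
    """Build a hypothetical cohort in which both groups share the same true need."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        group = "well_documented" if rng.random() < 0.5 else "under_documented"
        need = rng.gauss(50.0, 10.0)  # true need: same distribution in both groups
        # Assumed fraction of need that reaches the record as expenditure.
        capture = 1.0 if group == "well_documented" else 0.6
        recorded_cost = need * capture + rng.gauss(0.0, 2.0)
        patients.append({"group": group, "need": need, "recorded_cost": recorded_cost})
    return patients


def share_selected(patients, key, group, top_fraction=0.2):
    """Share of `group` among the top `top_fraction` of patients ranked by `key`."""
    ranked = sorted(patients, key=lambda p: p[key], reverse=True)
    top = ranked[: int(len(ranked) * top_fraction)]
    return sum(p["group"] == group for p in top) / len(top)


patients = simulate_patients()
by_need = share_selected(patients, "need", "under_documented")
by_cost = share_selected(patients, "recorded_cost", "under_documented")
print(f"Under-documented share when ranking by true need:     {by_need:.3f}")
print(f"Under-documented share when ranking by recorded cost: {by_cost:.3f}")
```

Under these assumptions, ranking by true need selects the two groups in roughly equal proportion, whereas ranking by recorded cost selects almost no one from the under-documented group, even though their need is the same. A model trained to predict recorded cost would inherit the same distortion.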
b. Data collection and documentation: Incomplete, inaccurate, or misleading documentation due to clinician discretion or workforce variability—Clinical environment contextualization
Even when recorded, healthcare data is often incomplete, inconsistent or misleading due to the contextual conditions under which measurements are recorded. For example, the frequency of BSL measurements in ICU varies between patients depending on physician judgement, anxiety, and perceived risk of hypoglycemia. When patients are perceived by clinicians as more at risk of hypoglycemia, possibly due to clinician anxiety or perceived (rather than actual) risk, they are more likely to have more frequent BSL measurements. A recent review of the MIMIC-IV dataset analyzing almost 25,000 patients suggested as much, demonstrating that Black and Hispanic patients underwent BSL measurements 6% and 11% more frequently than White patients respectively, even when fully adjusted for illness severity and comorbidities (95% confidence intervals for the incidence rate ratios 1.01–1.12 and 1.01–1.21, respectively) [23]. When context is ignored, spurious conclusions may be drawn, e.g., that frequent BSL monitoring is linked to higher illness severity and poorer outcomes, when in reality measurement frequency may be more reflective of perceived illness severity and risk of BSL fluctuation, or of clinician anxiety [23].
Staffing and shift models in ICU also affect documentation. Workflow interruptions (e.g., staff handover and shift changes) [23] and language barriers (e.g., language discordance between nursing staff and patients) make BSL recording less likely during those periods [4]. The same publication using MIMIC-IV data demonstrated that English-speaking patients are 8% more likely to have BSL measurements taken in ICUs in the US than non-English speaking patients (95% confidence interval for the incidence rate ratio 1.01–1.15), possibly because they are more likely to “speak up” without language discordance barriers [23]. When such decontextualized data is analyzed, conclusions that certain patients had more frequent readings due to higher illness severity (rather than language concordance) are likely to be drawn. Furthermore, higher ICU mortality for non-English speaking patients without interpreters present is well-documented, independent of BSL measurement frequency [24,25]. This is additionally concerning given that less frequent BSL monitoring (which disproportionately affects non-English speaking patients) is also associated with higher BSL fluctuations, higher hospital morbidity/mortality, and longer length of admission [26–29].
c. Data saving and storage: Understanding the context behind the capacity for consistent, comprehensive data storage
Understanding the context of variation in data storage capacities between healthcare facilities is equally critical. The availability of extensive datasets and significant computational power is an essential precursor to building robust, equitable models [11]. Regional and institutional disparity means those with advanced technological infrastructure and large-scale data storage capabilities, such as affluent regions of the US (home to over 40% of the world’s databases alone), are overrepresented in the data used to train predictive models [11]. Consequently, while models may perform well in the environments where their data originated, like the US, they lack generalizability and could perpetuate healthcare inequities when applied to populations from data-poor regions, like the Global South [30]. Even within affluent, data-rich areas with large-scale data storage capacity like the US, socially, ethnically, or racially marginalized groups are more likely to have fragmented care across multiple institutions, and lower internet access/literacy preventing the use of online portals and patient-reported outcomes [31]. This worsens scattered, incomplete data documentation for already-marginalized groups within affluent regions, similarly making model findings less generalizable to them. Ignoring the context of data provenance and the limitations imposed by heterogeneous data storage capacities further increases the risk of model bias.
d. How current statistical approaches fail to recognize healthcare data context
Most machine learning pipelines continue to rely on traditional statistical approaches focused solely on data completion. Techniques such as mean, median, and mode imputation; k-nearest neighbors (KNN); regression-based methods; forward and backward filling; interpolation; multiple imputation (including multiple imputation by chained equations (MICE)); and model-based techniques like Kalman filters and mixed-effects models remain widespread [32–34]. These methods, while mathematically sound, fall short in addressing contextual truths embedded in data. They typically assume that missingness can be explained within the data itself, often under the MAR or MCAR paradigms, assumptions that are not only overly optimistic but frequently invalid, ignoring the interplay of socioeconomic and geopolitical factors influencing missing data as described above.
For example, KNN and regression imputation assume similarity or linearity, which fails in marginalized or underrepresented populations where data is systematically lacking. Techniques like multiple imputation may improve model robustness, but their computational cost can increase with data scale, the number of imputations, and imputation model complexity. Crucially, they still rely on the flawed assumption that missingness is ignorable when properly modeled. These approaches prioritize statistical completeness over epistemological clarity, and risk creating an illusion of data integrity while masking the very structural barriers that created the missingness in the first place. Without this awareness of the importance of data context, algorithmic outputs risk reproducing the very inequities they seek to mitigate.
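The failure mode described above can be made concrete with a small simulation. The sketch below is a hypothetical illustration, not a method from the cited literature: two groups have different true values, and the group with higher values is recorded far less often. The group labels, means, recording probabilities, and function names are all assumptions. Overall-mean imputation then produces a complete column while pulling the under-recorded group toward the better-documented one.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Hypothetical cohort: group B has higher true values and is recorded less often."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = "A" if rng.random() < 0.5 else "B"
        true_value = rng.gauss(6.0 if group == "A" else 8.0, 1.0)
        # Assumed recording probabilities: context (group), not chance alone,
        # determines whether a value is documented.
        p_recorded = 0.9 if group == "A" else 0.3
        observed = true_value if rng.random() < p_recorded else None
        rows.append((group, true_value, observed))
    return rows


def mean_impute(rows):
    """Fill every missing value with the overall mean of the observed values."""
    fill = statistics.fmean(o for _, _, o in rows if o is not None)
    return [(g, t, o if o is not None else fill) for g, t, o in rows]


def group_mean(rows, group, column):
    """Mean of one column (1 = true value, 2 = observed/imputed) within a group."""
    return statistics.fmean(r[column] for r in rows if r[0] == group)


rows = simulate()
imputed = mean_impute(rows)

true_a, imputed_a = group_mean(rows, "A", 1), group_mean(imputed, "A", 2)
true_b, imputed_b = group_mean(rows, "B", 1), group_mean(imputed, "B", 2)
print(f"Group A: true mean {true_a:.2f}, after imputation {imputed_a:.2f}")
print(f"Group B: true mean {true_b:.2f}, after imputation {imputed_b:.2f}")
```

In this toy setting the imputed dataset has no empty cells, yet the mean for group B is substantially underestimated while group A is barely affected. Conditioning the imputation on group would reduce the bias here only because group membership is recorded; when the context that drives missingness is itself undocumented, no adjustment within the data can recover it.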
e. Epistemology: Acknowledging our own limitations
Recognizing missing context requires epistemic humility: acknowledgment that our understanding of data is shaped and limited by our own individual experiences and exposures. Data analysis often proceeds as though those interpreting the data possess full knowledge of its origins. Before any exploratory analysis begins, those responsible for data analysis and modeling, i.e., the machine learning community, must ask “how did this data come about?”, acknowledging that the social and structural conditions surrounding data generation are as important as the data itself, to avoid amplifying its embedded biases and inequalities.
3. From technical fixes to systemic transformation: Reconceptualizing the AI lifecycle
Logistical decisions and statistical methods applied at early stages of the AI pipeline (i.e., imputation and other downstream adjustments) do not ameliorate risks posed by decontextualized healthcare data [22,32–34]. Because these approaches intervene only after data have been produced, they cannot confront the contextual and structural inequalities embedded in healthcare data at its inception [3,35]. Addressing the challenges posed by decontextualized healthcare data requires reconceptualizing the entire AI lifecycle, from decisions about when and what data are collected, to model development, deployment, and external validation. It also requires stakeholder acknowledgement that data is fundamentally shaped by upstream social, political, and institutional forces rather than something to be manipulated with downstream technical considerations [3,35–38].
Meaningfully addressing decontextualized healthcare data requires genuine collaboration with communities and domain experts who understand the lived contexts of data generation. Indigenous Data Sovereignty principles demonstrate that when communities exercise authority over what data is collected, how it is interpreted, and for what purposes it is used, data completeness and fidelity improve because structural barriers to documentation are directly addressed [39–43]. Likewise, clinicians, nurses, social workers, allied healthcare workers and patient communities hold essential contextual expertise that enables assessment of whether data accurately capture clinical phenomena or instead reflect broader social inequities or contextual circumstance [3,11,35,44]. Their involvement must be foundational, not peripheral, if we are to properly contextualize the data we use in healthcare.
Evaluation of model performance must also extend beyond statistical performance. The central question should not be “does this model perform well?” but “should this model exist at all, and under what conditions?” [37,38,45–47]. This necessitates participatory forms of validation that assess whether systems reinforce existing inequities or support community-defined goals [37–41,47–49].
Technical innovations in fairness, privacy, and distributed computational approaches that limit raw data extraction and secondary data use without local oversight should function as guardrails, not primary solutions, within a reimagined lifecycle where context precedes computation [22,45,46,50–55]. This paradigm shift requires institutional incentives that prioritize transparency, community involvement, community benefit, and contextual depth over technical novelty and model performance [37,38,47]. Only through such systemic transformation can AI systems avoid amplifying health inequities deeply rooted in the use of decontextualized healthcare data.
4. Conclusion
No amount of algorithmic refinement or sophisticated modeling techniques can compensate for datasets that fundamentally fail to capture the context of their own creation. Throughout this paper, we have examined the structural, epistemic, and algorithmic dimensions of decontextualized healthcare data, emphasizing that as AI becomes more deeply embedded in clinical decision-making, the risks of training models on such data grow increasingly urgent.
Rather than aiming for mathematical accuracy, we must foster a paradigm shift that moves toward deep, interdisciplinary collaboration. We must recognize missing, incomplete, and inaccurate data not as a statistical nuisance, but as an ethical and systemic signal. This requires a concerted effort between the machine learning community and the domain experts (clinicians, healthcare workers, and community representatives) who understand data context. It is only through this partnership that we can begin to build models that account for complex realities and for what the data is trying to communicate.
Ultimately, we must reimagine AI systems not simply as tools of optimization, but as opportunities for repair: systems that do not obscure what is decontextualized, but illuminate it, giving voice to those historically left out, and shaping a more equitable and accountable future.
References
- 1. Pham TM, Pandis N, White IR. Missing data: issues, concepts, methods. Semin Orthod. 2024;30(1):37–44. doi: 10.1053/j.sodo.2024.01.007
- 2. Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A survey on missing data in machine learning. J Big Data. 2021;8(1):140. doi: 10.1186/s40537-021-00516-9. PMID: 34722113; PMCID: PMC8549433
- 3. Mitchell WG, Wawira JG, Celi LA. Rebooting artificial intelligence for health. PLOS Glob Public Health. 2025;5(1):e0004171. doi: 10.1371/journal.pgph.0004171. PMID: 39823406; PMCID: PMC11741560
- 4. Twersky SE, Jefferson R, Garcia-Ortiz L, Williams E, Pina C. The impact of limited English proficiency on healthcare access and outcomes in the U.S.: a scoping review. Healthcare (Basel). 2024;12(3):364. doi: 10.3390/healthcare12030364. PMID: 38338249; PMCID: PMC10855368
- 5. Mehta NK, Allam S, Mazimba S, Karim S. Racial, ethnic, and socioeconomic disparities in out-of-hospital cardiac arrest within the United States: now is the time for change. Heart Rhythm O2. 2022;3(6Part B):857–63. doi: 10.1016/j.hroo.2022.07.009. PMID: 36588995; PMCID: PMC9795269
- 6. Mitchell WG, Deane A, Brown A, Bihari S, Wong H, Ramadoss R, et al. Long term outcomes for Aboriginal and Torres Strait Islander Australians after hospital intensive care. Med J Aust. 2020;213(1):16–21. doi: 10.5694/mja2.50649. PMID: 32484925
- 7. Greenwood BN, Carnahan S, Huang L. Patient-physician gender concordance and increased mortality among female heart attack patients. Proc Natl Acad Sci U S A. 2018;115(34):8569–74. doi: 10.1073/pnas.1800097115. PMID: 30082406; PMCID: PMC6112736
- 8. Kyriacou PA, Charlton PH, Al-Halawani R, Shelley KH. Inaccuracy of pulse oximetry with dark skin pigmentation: clinical implications and need for improvement. Br J Anaesth. 2023;130(1):e33–6. doi: 10.1016/j.bja.2022.03.011. PMID: 35430087
