Multimodal pulse oximeters to support the integrated management of childhood illnesses: A usability and diagnostic accuracy assessment from a multi-country hybrid type 2 study
Helen L. Storey, Tessa L. Fielding, Julia Mwesigwa, Rebecca K. Green, Megan E. Parker, Anmol Jacob, Samwel Lwambura, Mumbe Kitonga, Leila Maina, Mansi Tyagi, Alice Mwikamba, Caroline Ngunu, Anuj Kumar Pandey, Ndèye Marème Sougou, Jean Tine, Angharad Steele, Tedila Habte

TL;DR
This study evaluates new pulse oximeters that measure multiple health metrics to help diagnose childhood illnesses more effectively in low-resource settings.
Contribution
The study introduces and validates multimodal pulse oximeters with enhanced features for diagnosing childhood illnesses in low- and middle-income countries.
Findings
Multimodal pulse oximeters showed good usability and high satisfaction among healthcare providers.
Device performance for hypoxemia, tachycardia, and fever exceeded 80% agreement across age categories.
Respiratory rate measurements showed greater variability between devices.
Abstract
Nearly 5 million children die each year of preventable causes, with pneumonia being a key contributor. The Integrated Management of Childhood Illnesses guidelines improve health care workers’ diagnostic and management capabilities by relying mostly on clinical signs. Though there have been successes, challenges in the consistent application of IMCI and the accurate diagnosis of conditions like hypoxemia remain. Next generation pulse oximeters add functionality to stand alone pulse oximeters, like measurement of respiratory rate, temperature, and hemoglobin. While the TIMCI project sought to address gaps in the introduction of pulse oximetry in India, Kenya, Senegal, and Tanzania, research was also conducted to strengthen the market for multimodal pulse oximetry (PO) devices by filling evidence gaps around ideal product attributes and the validation of available and near to market…
Funding: Unitaid; Bill and Melinda Gates Foundation (http://dx.doi.org/10.13039/100000865)
Introduction
The Integrated Management of Childhood Illnesses (IMCI) strategy, introduced in the mid-1990s, aims to reduce child mortality and morbidity by providing integrated guidelines to help health care workers (HCWs) effectively diagnose and manage the major causes of childhood illness, through a structured approach to case management [1–3]. In 2023, an estimated 4.8 million children under five died of preventable causes, mostly due to pneumonia (20%), malaria (16%), and diarrhea (15%) [4]. Among these deaths, fever is a common symptom often overlapping with coughing and fast breathing. Despite progress in malaria detection and treatment, diagnosing and managing children with other diseases remains challenging. Identifying risk factors like hypoxemia and other conditions of severe illness is essential for guiding treatment decisions to avert death.
Challenges exist with the use of IMCI, such as inconsistent adherence by health workers [4–8] and sole reliance on clinical signs to support identification and treatment of severely sick children [9–13], which is inadequate for detecting hypoxemia [14,15]. Hypoxemia, or low blood oxygen saturation, can be easily measured with a pulse oximeter (PO). Use of PO in hospitals and primary care settings saves lives and resources [13,16,17], and its global use and awareness have grown since the COVID-19 pandemic [18]. Respiratory rate is essential for diagnosing pneumonia in IMCI, but it is often under-measured and subject to human error, particularly in young children with fast breathing [8,19–21]. Integrated pulse oximeters that measure blood oxygen saturation and respiratory rate are now commercially available, and though challenges with manually and automatically measuring respiratory rate remain [21], early studies have demonstrated positive usability and acceptance of this multimodal PO technology among community health workers [21,22], as well as good performance in hospital settings [23,24]. Nevertheless, more evidence is needed to build confidence in the capabilities of these devices in primary health care (PHC) settings and demonstrate the operational feasibility of their integration into existing IMCI care practices, especially with the added functionality of temperature measurement.
Furthermore, medical device manufacturers are leveraging photoplethysmography (PPG), the underlying technology in pulse oximeters and a measurement of the circulatory system, to detect other health indicators using machine-learning/artificial intelligence (ML/AI) models. Based on light absorption or reflection, changes in blood flow are detected as changes in light intensity through the blood and tissues [24], allowing additional measurements such as heart rate, respiration, blood pressure, and anemia. Consumer products such as smartwatches and smartphone applications also detect light reflective PPG measurement using integrated sensors such as the camera and flashlight. As smartwatch and smartphone-based sensors become more powerful [25], it is important to consider the potential role of PPG derived smartphone-based clinical screening tools to digitize health data, facilitate risk-based stratification of patients, and support faster decision-making on individual and population-based care. While consumer products are not replacements for medical devices, the current methods in IMCI for assessing fast breathing by counting and anemia by consideration of pallor [10–13], are subjective and result in considerable variation across providers. Even still, these assessments are critical for frontline health workers to inform decision-making [26], and highlight the need for clinical screening tools to expand the reach of medical devices, if they can be reliable and user friendly. Demonstrating that new tools perform as intended in the target context of use builds the evidence needed for introduction and eventual scale-up.
Two key barriers that manufacturers face in advancing next generation multimodal POs are the difficulty of validating new devices or algorithms, particularly among hard-to-reach populations like children, and the lack of clear guidance on necessary product attributes. Developing new medical devices requires significant investment, and when validation is expensive, demand is uncertain, and the market is price sensitive, companies may view the risk as too high, despite the potential to save lives. Product development partners de-risk research and development for new products where the potential for health impact is greater than the perceived return on investment for a company. Sustaining a low-margin or no-margin public health product may be more feasible for a company if the upfront investment is minimized.
The Tools for the Integrated Management of Childhood Illness (TIMCI) project was a 5-year Unitaid-funded project focused on the interventions of pulse oximetry and clinical decision support algorithms (CDSAs) in primary care. Results from research on the introduction and evaluation of PO and CDSAs in primary care in Kenya, Senegal, Tanzania, and Uttar Pradesh, India are reported elsewhere [27,28]. To strengthen the market for new multimodal PO devices, research was also conducted to fill evidence gaps around ideal product attributes, and the validation and operational fit of available and near to market PPG-derived clinical measurement tools (medical device and smartphone-based screening technologies) through a hybrid type 2 study design. The mixed methods type 2 effectiveness-implementation study measured the performance and feasibility of identified multimodal pulse oximeter devices by primary care providers in Kenya, Senegal, Tanzania, and Uttar Pradesh, India. Research was also conducted to define the requirements of next generation multimodal PO devices through a target product profile (TPP) that communicates optimal and minimum product attributes, and an open-source data repository of reference measurements from the diagnostic accuracy study that aims to catalyze PPG-derived, multimodal PO medical device and smartphone-based clinical screening technology development. Here we report the performance results of the hybrid type 2 study with a manufacturer-agnostic perspective and share resources from the TPP development and the open-source data repository.
Methods
Ethics statement
Before evaluation by ethical review committees, the research described in this manuscript underwent scientific merit review by two separate reviewers. The study was then reviewed independently by ethical review boards in each country. For Tanzania, this included the Ifakara Health Institute IRB and the National Health Research Ethics Committee. For India, this included a) the National Health Ministry, Uttar Pradesh, b) the Institutional Ethics Committee of King George Medical University and c) the Health Ministry’s Screening Committee of the Indian Council of Medical Research. For Senegal, this was the Comité National d’Ethique pour la Recherche en Santé. For Kenya, this was the Amref Health Africa Ethics and Scientific Review Committee and the National Council of Science and Technology. Written informed consent was obtained from all providers and caregivers of the children involved in this study.
Inclusivity in research: Additional information regarding the ethical, cultural, and scientific considerations specific to inclusivity in global research is included in supplemental information (S2 Checklist).
TIMCI and parent study
The TIMCI project aimed to support healthcare providers to identify and manage severe illness, by introducing pulse oximetry and CDSAs to primary care facilities in India, Kenya, Senegal and Tanzania. To address evidence gaps, an evaluation was designed to include pragmatic cluster randomized controlled trials in India and Tanzania (NCT04910750), and quasi-experimental pre-post studies in Kenya and Senegal (NCT05065320), complemented by embedded mixed-method studies in all countries [29]. TIMCI also aimed to accelerate the development and market entry of non-invasive devices that augment the measurement features of standard pulse oximeters. These devices, called multimodal pulse oximeters, measure additional vital signs such as respiratory rate, temperature, and/or hemoglobin, that could improve healthcare providers’ ability to accurately diagnose, treat, and/or refer their patients. Barriers to market entry of multimodal pulse oximeters for use in LMICs have included lack of information to inform appropriate product design, as well as uncertain demand. To address gaps, a TPP was developed, and a field validation was conducted to evaluate the accuracy and operational feasibility of select multimodal pulse oximeters either in or near to market.
TPP development
The TPP was developed through a multi-phase process, starting with a literature review and expert interviews to develop a first draft. Workshops were then conducted in Kenya, Tanzania, and Senegal to obtain feedback on the draft TPP from a national stakeholder perspective. The workshops took place either in person (Senegal, Tanzania) or virtually (Kenya) and included interactive, human-centered design activities led in small groups or individually. Participants were selected for their expertise in child health research, primary care practice, and national IMCI guideline development and implementation. Following the workshops, an online survey was conducted to assess agreement on the minimum and optimal requirements for 23 product attributes. For each requirement, respondents were asked to indicate whether they “agree”, “mostly agree”, “neither agree or disagree”, “mostly disagree”, “fully disagree”, or “other (do not have expertise to comment)”. An agreement threshold of at least 60% of respondents selecting “agree”, “mostly agree”, or “neither agree or disagree” was set in advance. Finally, select manufacturers (n = 7) were engaged in a discussion where the draft TPP was shared, and challenges and opportunities with key product attributes were explored. Further details on development are available in supplemental information (S1 Text) and the final TPP is accessible at https://www.path.org/our-impact/resources/multimodal-pulse-oximeter-tpp.
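As an illustration of how the agreement threshold could be applied to one requirement, the following Python sketch computes the share of responses that count toward agreement. The function name, and the choice to keep every response (including "other") in the denominator, are assumptions made for this example and are not taken from the survey analysis itself.

```python
# Response options counted toward agreement, as listed in the text.
AGREEMENT_OPTIONS = {"agree", "mostly agree", "neither agree or disagree"}
THRESHOLD = 0.60  # agreement threshold stated in the text


def meets_agreement_threshold(responses, threshold=THRESHOLD):
    """Return (share, passed) for one requirement's survey responses.

    Assumption: all responses, including "other", stay in the denominator.
    """
    if not responses:
        raise ValueError("at least one response is required")
    share = sum(r in AGREEMENT_OPTIONS for r in responses) / len(responses)
    return share, share >= threshold


if __name__ == "__main__":
    example = ["agree"] * 6 + ["mostly disagree"] * 3 + ["fully disagree"]
    print(meets_agreement_threshold(example))
```

A requirement with six of ten qualifying responses would sit exactly on the threshold and pass under this reading.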
Type 2 hybrid study
Study design.
The objective of this study was to evaluate identified multimodal pulse oximeter devices used by primary care providers in Kenya, Senegal, Tanzania, and Uttar Pradesh, India. To assess the performance and feasibility of multimodal PO devices, a hybrid design was used to conduct a prospective mixed methods diagnostic accuracy and implementation study in two primary care facilities in each country [27,28] (Fig 1). Recruitment occurred from 15 May 2023–18 December 2023 in Tanzania, 28 August 2023–27 December 2023 in India, 24 May 2023–09 January 2024 in Senegal, and 16 March 2023–15 December 2023 in Kenya. In this paper we discuss the results from the usability and diagnostic accuracy study.
Fig 1. Type 2 hybrid study flow chart by key phases, activities, and timeframe.
Usability assessment.
To measure the usability and acceptability of the devices by healthcare providers and caregivers at the primary care level, a usability assessment was conducted. The usability assessment observed user-product interactions over 2 rounds of use to measure error modes and rates and time to result. A post-use survey measured user satisfaction among primary care providers. These assessments took place in January and February of 2023. All multimodal PO devices were assessed by 6–8 healthcare providers per country, a sample size that identifies around 80% of common user errors [30,31]. Each provider had familiarity with general POs in day-to-day practice and assessed all devices. Device order was randomized for each provider. Detailed observations were recorded using structured data collection forms, and a System Usability Scale (SUS) score was used to assess usability across devices [32,33] (Fig 2).
Fig 2. Workflow diagram for usability assessment and example results for each device.
Diagnostic accuracy study.
The primary objective of the diagnostic accuracy study was to determine the performance of available multimodal PO devices, used by primary care providers assessing children 0–59 months seeking care at the primary care level, compared to an accepted reference standard. A cross-sectional diagnostic accuracy study compared the index devices to an accepted reference standard to measure accuracy. The research activity occurred only after the patient received routine care according to current practices in the facility. A convenience sampling approach was used for recruitment. For each participant, one index device and all reference devices were used simultaneously in 3 identical measurements recorded in sequence. Video recordings were captured as a single file, which was redacted of identifying attributes and edited into smaller clips for annotation purposes. Calmness and perfusion index were noted, as well as any device errors or issues preventing complete data collection for each participant. Measurements obtained at the same time from index and reference devices were compared for agreement (Fig 3). The following indicators of performance were assessed: Bland Altman plots, intraclass correlation coefficients, confusion matrices for percent agreement, and mean absolute error. Following device assessment, semi-structured interviews were performed with primary care providers and caregivers to assess acceptability of the index devices. For each device assessed, 5–10 caregivers were surveyed for their perceptions on acceptability of each device. Device specific results are presented without attribution to product name. All products have strengths and weaknesses and continue to be improved through ongoing manufacturer updates. Product evaluations are specific to the product version at the time of the study.
Fig 3. Workflow diagram used by research assistants during diagnostic accuracy study data collection, and illustration of positioning of participants and devices.
Study population.
The study took place in four countries: Kenya, Senegal, Tanzania and India (Uttar Pradesh). Two facilities per country were selected in partnership with research partners and country stakeholders. Adjusting for study timeline constraints, the usability assessment was conducted in Kenya, Tanzania, and Senegal, while the diagnostic accuracy study was conducted in Kenya, Tanzania, and India. Index and reference measurements were collected on children 0–59 months seeking care at the primary care facility. Due to varying device fit and performance based on size, three age categories were evaluated and defined as: under 2 months (0–1), 2 months to under 12 months (2–11), and 12 months to under 60 months (12–59). For devices with prior evidence of use in children, all three age categories were assessed, while devices without prior use data in children were only assessed in the 12–59 months age category. Evaluating older children first provided a chance to demonstrate feasibility with those who were less likely to encounter issues related to smaller fingers or increased movement. All facilities had experience with pulse oximetry use. The study enrolled children 0–59 months presenting with an illness, and for whom caregivers provided written consent. Children were excluded from the study if they were in the immediate post-natal period or first day of life, presenting for care due to trauma, admitted for inpatient care, were critically ill requiring emergency or immediate referral and care, or caregivers did not provide written consent.
Device measurements.
Index devices: A landscaping of PO and multimodal PO device manufacturers was conducted to identify devices for evaluation. A combination of desk research and outreach to relevant partners, stakeholders, and networks identified 16 potential technologies. Following discussions with manufacturers to align products to the TPP, and in-house verification benchmarking in the PATH engineering lab in Seattle, 5 technologies were selected for inclusion in the hybrid study. Device 1 measures blood oxygen saturation, respiratory rate, pulse rate, and temperature; it is a handheld instrument with a finger clip attached by cord and displays measurements on the device. Device 2 measures blood oxygen saturation, respiratory rate, and pulse rate; it is a handheld instrument with a finger clip attached by cord and displays measurements on the device. Device 3 measures blood oxygen saturation, respiratory rate, and pulse rate; it is a fingertip clip instrument and displays measurements on the device. Device 4 measures blood oxygen saturation, respiratory rate, pulse rate, temperature, and additional clinical parameters that were not assessed in this study; it is an instrument that the patient holds in the palm of their hand and connects to a tablet by Bluetooth to display measurements. Device 5 measures blood oxygen saturation, respiratory rate, pulse rate, and temperature; it is a forehead band instrument and connects to a tablet by Bluetooth to display measurements. From a regulatory perspective, device 1 is FDA approved and CE marked, device 5 is CE marked, and devices 2, 3, 4, and 6 are in development. (See S2 Text for detailed specifications of multimodal PO devices by product attribute.)
Through our landscaping work on next generation PO devices and noninvasive anemia measurement [34], PPG-derived clinical measurement capabilities were also identified on smartphone platforms. Using existing sensors in the phone, some clinical measurements under development include pulse rate, blood oxygen saturation, respiratory rate, and anemia [21,24,35,36]. To assess the usability of this type of smartphone device, a consumer application available for download was used as a comparable form factor and user interface for measuring heart rate and respiratory rate, and results are presented as device 6. Additionally, an Android device (Samsung Galaxy A13) was included in the performance study to noninvasively capture clinical data using the existing sensors, which included video data from placing a finger over the camera to derive the PPG waveform, photos of eye conjunctiva, and photos of nailbeds. Children ages 2–11 months and 12–59 months were assessed, excluding the youngest age category due to the limited research on young babies and an abundance of caution to protect this vulnerable population. Because the algorithm that classifies these data based on a threshold is still in development, the smartphone-derived clinical data are included in our data repository to support future research by others.
Reference devices: A reference standard for each clinical measurement was used at the same time as the index device to determine agreement of the devices. Abnormal clinical measurements are defined as follows for each clinical parameter based on reference measurement: blood oxygen saturation (<90%), pulse rate (>179 beats per minute for less than 12 months, >139 beats per minute for 12–60 months), temperature (≥37.5 °C), and respiratory rate (>49 breaths per minute for less than 12 months, >39 breaths per minute for 12–35 months, and >29 breaths per minute for 36–60 months) (Table 1).
Table 1: Summary of primary and secondary reference measurements and devices used.
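The abnormal-value cutoffs listed above can be written as a small classifier, and percent agreement can then be computed as the share of paired measurements that fall on the same side of the cutoff. The Python sketch below is illustrative only: the function and parameter names are hypothetical, and it assumes that index readings are classified with the same cutoffs as the reference readings.

```python
def is_abnormal(parameter, value, age_months):
    """Classify one reading using the cutoffs given in the text."""
    if parameter == "spo2":              # blood oxygen saturation, %
        return value < 90
    if parameter == "pulse_rate":        # beats per minute
        return value > (179 if age_months < 12 else 139)
    if parameter == "temperature":       # degrees Celsius
        return value >= 37.5
    if parameter == "respiratory_rate":  # breaths per minute
        if age_months < 12:
            return value > 49
        if age_months < 36:
            return value > 39
        return value > 29
    raise ValueError(f"unknown parameter: {parameter}")


def percent_agreement(pairs, parameter):
    """pairs: iterable of (index_value, reference_value, age_months).

    Assumption: index and reference readings share the same cutoffs.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    matches = sum(
        is_abnormal(parameter, idx, age) == is_abnormal(parameter, ref, age)
        for idx, ref, age in pairs
    )
    return 100.0 * matches / len(pairs)
```

For example, an index reading of 95% paired with a reference reading of 89% would count as a disagreement for hypoxemia, even though the raw values differ by only six percentage points.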
Respiratory rate annotation: Respiratory rate technologies utilize different methodologies such as accelerometers to measure chest movement, capnography to measure respiration, acoustics to measure breath sounds, and PPG to measure blood circulation in tissue [21]. In a controlled setting, capnography is the gold standard for respiratory rate measurement. However, a human counter is the current practice in primary care settings and the preferred comparison for new technologies. To minimize interrater and intrarater variability in respiratory rate measurement by health care providers [9,19,37,39,40], videos were used to standardize the human counter, an approach that has been used in previous studies [20,38]. Recording of the videos was exactly aligned to the time period of the index test measurements.
Prior to annotation, video recordings were edited into 60-second clips, with a maximum of three measurement clips per participant, according to the developed standard operating procedure. Each 60-second clip was reviewed and categorized to identify the highest quality measurement clip to use in annotation for each participant, resulting in a maximum of one measurement per participant.
To conduct the video annotation, a panel of six reviewers was formed. Three clinical nurses and three health officers were recruited in Hawassa, Ethiopia. All reviewers had undergone previous training in line with the Integrated Management of Neonatal and Child Illnesses (IMNCI) strategy [41], and had practical experience managing sick children under five years of age. To ensure annotation in accordance with guidelines, and to standardize annotation between reviewers, the panel underwent a three-day training. The panel training was conducted in line with the standard operating procedure developed for the annotation, and included modules on respiratory rate counting, data collection methods and procedures, as well as instructions on how to use the annotation software tool (Philips Foundation, Belgium). This adapted software was used during training and data collection, using videos collected from a previous study [37]. The training concluded with individual blinded competency tests in which each reviewer was assigned five videos with known respiratory rates; a count was considered accurate if it fell within ±2 breaths per minute of the video’s known value. All reviewers met or exceeded the pass mark of 80% before commencing data collection [42].
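The competency test described above reduces to a simple scoring rule. The Python sketch below is a hypothetical illustration, with function names chosen for this example: a count is scored as accurate when it lies within 2 breaths per minute of the known rate, and a reviewer passes at 80% or more accurate counts.

```python
def competency_score(counts, known_rates, tolerance=2):
    """Fraction of a reviewer's counts within the tolerance of the known rate."""
    if len(counts) != len(known_rates) or not counts:
        raise ValueError("counts and known_rates must be non-empty and equal length")
    hits = sum(abs(c - k) <= tolerance for c, k in zip(counts, known_rates))
    return hits / len(counts)


def passes_competency(counts, known_rates, pass_mark=0.80):
    """True when the reviewer meets or exceeds the pass mark."""
    return competency_score(counts, known_rates) >= pass_mark
```

With five test videos, the 80% pass mark corresponds to at least four accurate counts.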
For annotation, each reviewer was randomly assigned to one of three annotation groups each day (two reviewers per group). Videos were also assigned randomly and in proportion to the quality category designated to ensure videos of varying quality were distributed across the annotation groups. Breaths were marked using the annotation tool at the point of full chest expansion, with a full breath cycle being defined as the observation of one full inhalation and one full exhalation. Each video was annotated by both group members, and agreement was defined as a difference in respiratory rate between the two annotations of 2.5 breaths or less. If there was agreement, the annotation of the video was considered complete. If there was no agreement, the video was randomly assigned to a third panel member who was blinded to the results of the first two annotations. In addition to annotation and respiratory rate data collected through the annotation tool, supplementary data including annotator feedback was recorded. All data were recorded in SurveyCTO (Dobility, Inc, USA) before being exported for analysis.
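The daily pairing of reviewers and the agreement rule can be sketched in Python as follows. The names are hypothetical, and the sketch covers only the two steps stated in the text (random assignment to pairs, and the check that decides whether a third annotation is needed); how the final respiratory rate is derived after a third annotation is not specified here and is not modeled.

```python
import random


def assign_groups(reviewers, rng):
    """Randomly split the reviewers into groups of two."""
    if len(reviewers) % 2 != 0:
        raise ValueError("an even number of reviewers is required")
    shuffled = list(reviewers)
    rng.shuffle(shuffled)
    return [tuple(shuffled[i:i + 2]) for i in range(0, len(shuffled), 2)]


def needs_third_annotation(rate_a, rate_b, max_discrepancy=2.5):
    """True when the two annotations differ by more than the allowed discrepancy."""
    return abs(rate_a - rate_b) > max_discrepancy


if __name__ == "__main__":
    groups = assign_groups(["R1", "R2", "R3", "R4", "R5", "R6"], random.Random())
    print(groups)
    print(needs_third_annotation(40, 43))
```

Passing a seeded `random.Random` instance makes a given day's assignment reproducible, which can help when auditing the annotation log.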
Data management and analysis.
Research data was collected through structured data collection forms, surveys, case report forms, and video recording using REDCap (made available through grant UL1 TR002319) and reviewed daily for completeness and accuracy. No patient identifying information was collected as the focus of the evaluation was the provider and the device.
Usability was assessed with Likert scale questions on ease of use and user confidence. Errors were classified as minor or critical. Hands-on time to result, defined as how long the device takes to provide measurements for all the clinical parameters it can evaluate, was measured by provider. System Usability Scale (SUS) composite scores were calculated as described elsewhere [32]. To assess acceptability of the index devices by providers, a survey was administered containing open ended and Likert scale questions, as well as a semi-quantitative survey on product attributes. The diagnostic accuracy study was analyzed by clinical measurement (blood oxygen saturation, pulse rate, respiratory rate, or temperature) for each device and age category using Bland Altman plots, mean absolute error, and percent agreement, following STARD guidelines [43,44]. The completed STARD checklist is available in supplemental information (S1 Checklist).
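For readers unfamiliar with SUS scoring, the Python sketch below shows the standard ten-item calculation: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5 to give a score from 0 to 100. It assumes the unmodified ten-item questionnaire with responses on a 1 to 5 scale; the function name is hypothetical and the study's own calculation is the one cited in the text.

```python
def sus_score(responses):
    """Standard SUS composite score from ten responses on a 1-5 scale."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for position, response in enumerate(responses, start=1):
        if position % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5


if __name__ == "__main__":
    print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))
```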
Baseline characteristics of participants, including providers and child/caregiver pairs, and usability assessment data were summarized using descriptive statistics. Bland Altman plots were generated for each clinical measurement, device and age category, to evaluate bias in the comparison of the index device to the reference device. The mean difference and the standard deviation of the differences were calculated to give the bias and upper/lower limits of agreement. Mean absolute error was calculated to describe the relative magnitude of measurement difference, giving insight into accuracy and precision with a single value. Directional drift indicated whether the index device measured high or low with respect to the reference value. Within each clinical measurement, units of measurement are the same so values are comparable. Bias and error were also evaluated after removing outlier values that were more than 3 standard deviations from the mean, indicating a likely erroneous value. For all the reference devices except annotated respiratory rate, 3 measurements were attempted for each child consultation, all conducted by a single provider. Measurement attempts were averaged together to give an overall percent agreement per participant and an overall percent agreement per age category. Four age categories were used for fast breathing in alignment with clinical cutoffs.
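The agreement statistics described above can be illustrated with a short Python sketch using only the standard library. It is not the study's R code. It assumes that differences are taken as index minus reference, that limits of agreement use the conventional 1.96 multiplier (not stated in the text), and that the 3 standard deviation outlier rule is applied to the paired differences; all function names are hypothetical.

```python
from statistics import mean, stdev


def bland_altman_summary(index, reference, loa_multiplier=1.96):
    """Bias, limits of agreement, and mean absolute error for paired readings."""
    if len(index) != len(reference) or len(index) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [i - r for i, r in zip(index, reference)]  # index minus reference
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return {
        "bias": bias,
        "lower_loa": bias - loa_multiplier * sd,
        "upper_loa": bias + loa_multiplier * sd,
        "mae": mean([abs(d) for d in diffs]),
    }


def drop_outliers(diffs, n_sd=3.0):
    """Remove differences more than n_sd standard deviations from their mean."""
    centre, spread = mean(diffs), stdev(diffs)
    return [d for d in diffs if abs(d - centre) <= n_sd * spread]


if __name__ == "__main__":
    print(bland_altman_summary([98, 95, 97, 92], [97, 96, 95, 93]))
```

A positive bias under this sign convention means the index device reads higher than the reference on average; mean absolute error ignores direction and so remains informative when positive and negative differences cancel.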
Using the method by Lu et al. (2016) [45], the calculated sample size for the diagnostic accuracy study was 50 observations per targeted age category, rounding up to account for attrition or failed measurements. As an accuracy of at least ±2 breaths per minute translates to a maximum allowable difference of 2 plus the confidence interval of the limit of agreement, it was calculated that a mean difference of ≤1.3 with a standard deviation of 1.0 would have reasonable precision to detect a maximum allowable difference of 4.0 with a sample size of 47. Based on prior evidence of use in children, 3 devices were evaluated in all 3 age categories, 1 device was evaluated in the 2 older age categories, and 2 devices were evaluated only in the oldest age category. In total, enrollment of 650 children was the target of the diagnostic accuracy study per country.
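The enrollment target can be checked with simple arithmetic. The Python sketch below assumes that the target is the number of device-by-age-category combinations multiplied by 50 observations each, which reproduces the 650 figure; the variable names are hypothetical.

```python
OBSERVATIONS_PER_CELL = 50  # per device and age category, from the text

# (number of devices, number of age categories assessed for each of them)
DESIGN = [
    (3, 3),  # three devices in all three age categories
    (1, 2),  # one device in the two older age categories
    (2, 1),  # two devices in the oldest age category only
]


def enrollment_target(design=DESIGN, per_cell=OBSERVATIONS_PER_CELL):
    """Total observations implied by the device-by-age design."""
    return sum(devices * categories * per_cell for devices, categories in design)


if __name__ == "__main__":
    print(enrollment_target())  # 450 + 100 + 100 = 650
```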
All analyses were conducted by clinical measurement, device and age category. Quantitative analyses were performed in R using the BlandAltmanLeh and irr packages and conducted collaboratively using GitHub [46,47]. Nutrition measures were calculated using the zscorer package [48].
Data repository
With approval from all reviewing ethics committees, reference measurements from the diagnostic accuracy study are available in a data repository to facilitate secondary research on noninvasive and smartphone-based technologies for children. The repository contains deidentified reference data collected from the patient monitor, thermometer, Hemocue device, chromameter, custom PPG device, anthropometric assessments, redacted evaluation videos, and associated data from the smartphone application. The repository is structured to meet ethical, legal, and scientific oversight requirements to allow secondary research for potential commercial use, similar to biorepositories developed previously [49]. The data repository governance plan was reviewed by all research partners and the PATH Office of Research Affairs. Database storage is maintained on an AWS platform and requests for access to the data can be made online [50].
Results
To summarize form factors, Device 1 is a handheld unit with a corded finger clip that measures blood oxygen saturation, respiratory rate, pulse, and temperature, displaying results directly. Devices 2 and 3 also measure blood oxygen saturation, respiratory rate, and pulse; Device 2 is handheld with a corded clip, while Device 3 is a fingertip clip, both showing data on the device. Device 4, held in the palm, measures blood oxygen saturation, respiratory rate, pulse, and temperature, sending readings to a tablet. Device 5 is a forehead band measuring blood oxygen saturation, respiratory rate, pulse, and temperature, and also sending readings to a tablet. From a regulatory standpoint, device 1 has both FDA approval and CE marking; device 5 holds a CE mark; devices 2, 3, 4, and 6 are currently under development.
Usability assessment
The usability assessment was completed in 3 countries with a total of 18 providers. In India, device training had to proceed prior to completion of the usability assessment due to scheduling, and in Kenya and Tanzania, one device was not available in time to assess prior to training, resulting in only 6 assessments conducted for this device in Senegal.
Device 1 was assessed by 18 users, of whom 94% were “very satisfied” with the device, 83% noted it was “very easy” to use, and 94% noted it would be “very useful” at their facility. Complete device use involved device setup, positioning sensor 1 on the patient, obtaining sensor 1 measurements, positioning sensor 2, and obtaining sensor 2 measurement. In round 1 of device use, 5 users experienced a critical error and 11 experienced a minor error or required a prompt to keep proceeding in the task, while in round 2, 2 users experienced a critical error and 3 users experienced a minor error.
Device 2 was assessed by 17 users, of whom 94% were “very satisfied” with the device, 94% noted it was “very easy” to use, and 94% noted it would be “very useful” at their facility. Complete device use involved device setup, positioning the sensor, and obtaining measurements. In round 1, no users experienced a critical error and 8 users experienced a minor error or required a prompt to keep proceeding in the task, while in round 2, 1 user experienced a critical error and 4 users experienced a minor error.
Device 3 was assessed by 18 users, of whom 89% were “very satisfied” with the device, 94% noted it was “very easy” to use, and 89% noted it would be “very useful” at their facility. Complete device use involved device setup, positioning the sensor, and obtaining measurements. In round 1, 1 user experienced a critical error and 10 experienced a minor error or required a prompt to keep proceeding in the task, while in round 2, no users experienced a critical error and 2 users experienced a minor error.
Device 4 was assessed by 18 users, of whom 33% were “very satisfied” with the device, 28% noted it was “very easy” to use, and 67% noted it would be “very useful” at their facility. Complete device use involved device setup, positioning sensor 1 on the patient, obtaining sensor 1 measurement, positioning sensor 2, obtaining sensor 2 measurements, positioning sensor 3, and obtaining sensor 3 measurement. In round 1, 8 users experienced a critical error and 10 experienced a minor error or required a prompt to keep proceeding in the task, while in round 2, 1 user experienced a critical error and 10 users experienced a minor error.
Device 5 was assessed by 5 users, of whom 20% were “very satisfied” with the device, 0% noted it was “very easy” to use, and 60% noted it would be “very useful” at their facility. Complete device use involved device setup, positioning the sensor, and obtaining measurements. In round 1, 3 of 6 users experienced a critical error and 2 experienced a minor error or required a prompt to keep proceeding in the task, while in round 2, 1 user experienced a critical error, no users experienced a minor error, and 2 users were unable to complete the second round due to device malfunctions.
Finally, Device 6 was assessed by 16 users, of whom 25% were “very satisfied” with the device, 44% noted it was “very easy” to use, and 56% noted it would be “very useful” at their facility. Complete device use involved device setup, positioning sensor 1 on the patient, obtaining sensor 1 measurements, positioning sensor 2, and obtaining sensor 2 measurement. In round 1, 2 users experienced a critical error and 14 experienced a minor error or required a prompt to keep proceeding in the task, while in round 2, no users experienced a critical error and 8 users experienced a minor error.
Fig 4 shows the average time to result and SUS score by device. Among the 6 devices, Device 3 had the shortest average time to result in round 1 at 2.5 minutes, decreasing by over 1 minute in round 2 to 1.2 minutes. Device 4 had the longest average time to result in round 1 at 11.9 minutes, decreasing by nearly 8 minutes in round 2 to 4.2 minutes. Finally, the average SUS scores for all devices were 70 or over, indicating acceptable usability. Device 1 had the highest score at 93, while Device 4 had the lowest score at 70. Full results are available in supplemental information (S1 Fig).
Fig 4: Average time to result and SUS score by device.
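To make the SUS figures easier to interpret, the following is a minimal sketch of how a single respondent's score is typically computed. It assumes the standard 10-item System Usability Scale with 1 to 5 Likert responses and alternating positively and negatively worded items; the function name `sus_score` is hypothetical, and the study's exact questionnaire administration is not described in this section.

```python
def sus_score(responses):
    """Return a 0-100 SUS score from ten Likert responses (1-5).

    Assumes standard SUS scoring: odd-numbered items (1, 3, 5, 7, 9) are
    positively worded and contribute (response - 1); even-numbered items
    are negatively worded and contribute (5 - response). The sum of the
    ten contributions (0-40) is multiplied by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")

    total = 0
    for position, response in enumerate(responses):
        if position % 2 == 0:      # items 1, 3, 5, 7, 9
            total += response - 1
        else:                      # items 2, 4, 6, 8, 10
            total += 5 - response
    return total * 2.5


# Illustrative (made-up) respondent who mostly agrees with positive items
# and mostly disagrees with negative items.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # 75.0
```

A device-level SUS score such as those in Fig 4 would then be the average of these per-respondent scores.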
Diagnostic accuracy study
Devices 1–5 were evaluated for performance. A total of 1980 participants were enrolled in the diagnostic accuracy study across the 3 countries, of which 47% were female. Most children were accompanied by their mother (79%) and common symptoms included cough (87%), fever (60%), and difficulty breathing (17%). Provider diagnoses were most often respiratory illness (83%) and fever (34%). Of the 77% (1520/1980) of participants who received treatment, 58% of the treatments were antibiotics. Rates of stunting, wasting, and anemia varied by country, as did abnormal measurements for blood oxygen, pulse, temperature, and respiratory rate. For detailed participant characteristics, see Table 2.
Table 2: Participant characteristics in the diagnostic accuracy study.
Bias: Bland Altman plots were constructed for each clinical measurement, device, and age category showing mean differences or bias between the index and reference measurement. Fig 5 indicates that average blood oxygen saturation bias across five devices ranged from -0.49 to 3.04 (12–59 months), 0.22 to 4.12 (2–11 months), and -0.17 to 5.05 (0–1 months). Fig 6 shows pulse rate bias ranged from 0.57 to 2.23, 7.58 to 14.91, and 9.13 to 13.13 for the same age groups respectively. In Fig 7, respiratory rate bias varied from -16.68 to 4.23, -8.91 to 11.25, and -1.19 to 22.45. Finally, Fig 8 shows temperature bias from 0.34 to 2.03, -0.36 to 2.05, and -0.43 to 1.98 across these age categories.
Fig 5: Bland Altman plots of blood oxygen saturation by device and age category. Data labels displayed represent M1 (black circle), M2 (pink triangle), and M3 (grey square).
Fig 6: Bland Altman plots of pulse rate by device and age category. Data labels displayed represent M1 (black circle), M2 (pink triangle), and M3 (grey square).
Fig 7: Bland Altman plots of respiratory rate by device and age category. Data labels displayed represent M1 (black circle), M2 (pink triangle), and M3 (grey square).
Fig 8: Bland Altman plots of temperature by device and age category. Data labels displayed represent M1 (black circle), M2 (pink triangle), and M3 (grey square).
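As an illustration of the quantities summarized in the Bland Altman plots, the sketch below computes the bias (mean difference), the standard deviation of the differences, and 95% limits of agreement for paired readings. It assumes differences are taken as index minus reference, so a positive bias means the index device reads higher, and it uses the conventional bias ± 1.96 SD limits. The function name `bland_altman` and the sample values are hypothetical, and this simple form does not adjust for the repeated measurements (M1 to M3) taken per child.

```python
import statistics


def bland_altman(index_values, reference_values):
    """Return bias, SD of differences, and 95% limits of agreement.

    Differences are computed as index minus reference (an assumption),
    so positive bias indicates overestimation by the index device.
    """
    if len(index_values) != len(reference_values):
        raise ValueError("index and reference must have the same length")
    if len(index_values) < 2:
        raise ValueError("at least two paired measurements are required")

    diffs = [i - r for i, r in zip(index_values, reference_values)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return {
        "bias": bias,
        "sd": sd,
        "loa_lower": bias - 1.96 * sd,
        "loa_upper": bias + 1.96 * sd,
    }


# Made-up SpO2 readings (%) for four paired measurements.
result = bland_altman([97, 95, 99, 92], [96, 93, 98, 93])
print(result["bias"])  # 0.75
```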
Error: Mean absolute errors, which are the same as mean differences from the Bland Altman plots, are provided along with standard deviations and limits of agreement in Table 3. For blood oxygen saturation, pulse rate, temperature, and respiratory rate, measurement errors varied by device but varied similarly by age group across devices. Across devices, removing outliers minimally changed the magnitude of error, particularly for respiratory rate, which included data cleaning as part of video annotation.
Table 3: Mean absolute error, standard deviation, and 95% limit of agreement with and without outliers, by clinical parameter, age category, and device.
Percent agreement: Overall percent agreement, positive and negative percent agreement, and positive and negative predictive values were determined by clinical measurement, device, and age category. Percent agreement and predictive values for hypoxemia, tachycardia, fever, and fast breathing varied by device and age group, with generally high agreement for hypoxemia and tachycardia, moderate agreement for fever, and variable agreement for fast breathing. For full details and breakdowns, see Table 4. Because each measurement except annotated respiratory rate was captured 3 times in a row, agreement was also evaluated by measurement attempt; these results are shared in supplemental material along with fast breathing results using acoustic respiratory rate (S3 Text).
Table 4: Percent agreement and predictive values by clinical measurement, device, and age category (n = abnormal measurement by reference device).
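The agreement measures in Table 4 all derive from a 2×2 cross-classification of index and reference results. The sketch below shows one way to compute them from paired abnormal/normal flags. The function name `agreement_metrics` and the sample data are hypothetical; the clinical thresholds used to classify a reading as abnormal are not restated here, so the function takes booleans directly.

```python
def agreement_metrics(index_abnormal, reference_abnormal):
    """Compute agreement and predictive values from paired boolean flags.

    True means the measurement was classified as abnormal (for example,
    hypoxemia). Ratios with a zero denominator are returned as None.
    """
    if len(index_abnormal) != len(reference_abnormal):
        raise ValueError("inputs must have the same length")
    if not index_abnormal:
        raise ValueError("at least one paired result is required")

    tp = fp = fn = tn = 0
    for idx, ref in zip(index_abnormal, reference_abnormal):
        if idx and ref:
            tp += 1          # both abnormal
        elif idx and not ref:
            fp += 1          # index abnormal, reference normal
        elif ref:
            fn += 1          # index normal, reference abnormal
        else:
            tn += 1          # both normal

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "overall_agreement": ratio(tp + tn, tp + fp + fn + tn),
        "positive_percent_agreement": ratio(tp, tp + fn),
        "negative_percent_agreement": ratio(tn, tn + fp),
        "positive_predictive_value": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
    }


# Made-up flags for five children.
metrics = agreement_metrics(
    [True, False, False, True, False],   # index device
    [True, True, False, False, False],   # reference device
)
print(metrics["overall_agreement"])  # 0.6
```

When abnormal cases are rare, a single disagreement moves positive percent agreement and positive predictive value far more than it moves the negative measures, which is why small case counts make the positive measures less stable.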
Discussion
This multi-dimensional research generated findings on product validation, diagnostic accuracy, and insights on market entry. Through the usability assessment, all study devices showed good usability with few observed user errors, high perceived satisfaction and usefulness, and high system usability scores, though devices with more procedures had more user challenges, particularly involving connected capabilities. In the diagnostic accuracy study involving 1980 participants, percent agreements for hypoxemia, tachycardia, and fever among all devices were greater than 80% for all age categories; respiratory rate measurement showed more variability in overall percent agreement across devices (39–91%). In general, mean absolute error across the 5 devices was lower in the older age categories than in the younger age category for all clinical measurements. Diverse inputs from stakeholders and manufacturers defined key attributes of next generation multimodal devices, including high interest in respiratory rate, temperature, and hemoglobin, and features like providing a quick result, long battery life, and applicability for all ages.
Multiple indicators of accuracy were included in this analysis, offering a more nuanced assessment of device performance and addressing the varying priorities of stakeholder groups. Starting with the most straightforward indicator, overall agreement for all devices was over 85% when measuring hypoxemia, tachycardia, and fever, except in one subcategory. Positive percent agreement and positive predictive values were low, largely due to the small number of cases (n = 46 for hypoxemia, n = 332 for tachycardia, n = 127 for fever, and n = 341 for fast breathing), whereas negative percent agreement and negative predictive value were higher and may be useful for all clinical measurements, including detection of fast breathing. Additionally, overall agreement appeared to improve with repeated measures (table in supplemental information). Given that fast breathing is measured only 10–20% of the time that it should be measured, the negative predictive value of a multimodal device may assist providers in determining that a patient does not have fast breathing, even if a positive detection requires further screening. Additionally, these noninvasive devices provide results in 1–4 minutes, according to the usability results, which may make repeated measurement of positive results a practical approach for improving accuracy in operational workflows. More research is needed to assess the performance of these devices if used in a two or three test strategy, or with consideration for pre-test probability, as this also adds complexity [51].
Giving a more nuanced interpretation, the Bland Altman plots show varying patterns of data spread across clinical measurements. For blood oxygen saturation, the funnel shape suggests measurement error is greater at lower values, and the bias overestimates the true value at lower values. For pulse rate, the bias is more consistent across the range of values, but where error occurs it overestimates the true value. For respiratory rate, the pattern for device 1 suggests consistent bias across the range of values but more overestimation in the younger age categories, while device 5 has a pattern suggesting consistent bias with more underestimation in the older age categories, and devices 3 and 4 show more overestimation at higher values. Finally, for temperature, the device 1 pattern suggests consistent underestimation, device 5 suggests overestimation at lower values and underestimation at higher values, and device 4 has a distinct pattern, likely because the contact thermometer took more time to develop a stable reading, so the 3 measurement attempts varied in a consistent manner. From a clinical care perspective, overestimation of blood oxygen saturation and underestimation of pulse rate, respiratory rate, and temperature are the more dangerous biases, particularly around the thresholds for abnormal, as false negatives would result in missing children who may be sicker and could benefit from further care sooner. This level of detail provides information useful for device development and ongoing improvement, especially as these devices are at different stages of product development, including those already on the market and approved by regulatory authorities, those design-locked but not yet available commercially, and those seeking to expand product claims to younger age categories. The form factors were also highly diverse, with traditional handheld designs, finger clip designs, and connected designs that required a tablet to operate.
Overall, across devices and clinical measurements, performance was often best in the oldest age category, similar to other recent studies [38,52,53]. Larger digits and minimal or more controlled movements make it easier for medical devices to capture more accurate data.
From a user and market perspective, the usability assessment highlighted that fewer device procedures improved usability while connected capabilities increased user challenge. Despite these challenges, repeated use led to fewer errors, demonstrating that even complex devices could be learned over time. Throughout this research, stakeholders and providers expressed high interest in devices with expanded feature sets, particularly the integration of temperature measurement, which sometimes was valued over anemia assessment despite thermometers being affordable and accessible, reflecting evolving needs since the Covid-19 pandemic increased attention toward non-contact thermometers [54]. Incorporating such features into multimodal diagnostic tools may enhance the user experience and clinical utility, especially if development involves diverse stakeholder input to ensure relevance and adoption [55]. However, as devices become more advanced and capable of capturing additional clinical parameters such as hemoglobin and blood pressure through PPG-derived measurements [24,56–60], the mental workload on providers also increases. This underscores the importance of supportive innovations, such as AI-driven decision support tools and automated documentation systems, to help providers efficiently interpret and act on the growing volume of health data. Balancing feature expansion with usability and workflow integration is therefore essential to ensure these technologies meet both user needs and market demand, ultimately optimizing the provider experience [61].
Study limitations included complicated device setup, possibly excessive performance measures, and an assessment of perceived demand for next-generation multimodal devices that may differ from actual demand. In the diagnostic accuracy study, managing the complexity of device setup and conducting simultaneous measurements across multiple instruments posed significant challenges, particularly in maintaining a calm and cooperative environment for the children involved. Though this could lead to spurious readings, extensive practice with the technology and daily data review ensured high quality data collection. Additionally, a sensitivity analysis was performed to understand if outlier measurements had impacted the results, which showed little change. While we included a broad range of diagnostic performance measures to provide a comprehensive assessment, this may have introduced complexity that could obscure findings or reduce interpretability for some audiences. However, the different metrics are of varying utility for different audiences, such as manufacturers, health care providers, or policymakers, and therefore sharing the nuance was prioritized over potential information overload. Finally, TPPs play a vital role in articulating requirements and aligning the global health community and industry stakeholders with product class needs. Nevertheless, our study reflects only perceived demand; generating evidence of demonstrated demand is essential to triangulate and ensure new products are appropriately positioned for market entry.
In conclusion, this study generated key evidence around the performance of current multimodal pulse oximeter devices across a range of form factors and with varying product attributes. The results are presented in a manufacturer-neutral manner to account for continuing product improvements throughout different stages of product development. While all devices were usable with training and consistent practice, less frequent use may challenge provider recall and ease of use, resulting in diminishing adherence over time in their already brief consultation visits. Moreover, rising stakeholder interest in feature sets should be considered alongside increasing requirements for providers to implement further measurements or assessments, since this could overburden health professionals if essential features such as integration and connectivity are not prioritized as well. The manufacturer and research community have a critical opportunity to leverage emerging technologies to improve the experience of providers, and as a result, the experience of patients.
Supporting information
S1 Checklist. Completed STARD checklist. (DOCX)
S2 Checklist. Inclusivity in global research questionnaire. (DOCX)
S1 Text. Development of a target product profile for multimodal PO devices. (DOCX)
S2 Text. Specifications of multimodal PO devices by product attribute, highlighting good (green) to poor (red) alignment to minimum requirements of the target product profile. (DOCX)
S3 Text. Overall percent agreement by measurement attempt (M1-3) and age category. (DOCX)
S1 Fig. Detailed usability results by device. (DOCX)
- 1. WHO. Towards a grand convergence for child survival and health. 2016. https://www.who.int/publications/i/item/WHO-MCA-16.04
- 2. WHO. Integrated management of childhood illness. Child Health and Development. https://www.who.int/teams/maternal-newborn-child-adolescent-health-and-ageing/child-health/integrated-management-of-childhood-illness
- 3. WHO. Integrated management of childhood illness - Chart booklet. 2014. https://www.who.int/publications/m/item/integrated-management-of-childhood-illness---chart-booklet-(march-2014)
- 4. UNICEF. Levels and trends in child mortality, report 2024. New York, NY, USA. 2025. https://data.unicef.org/resources/levels-and-trends-in-child-mortality-2024/
- 5. Bjornstad E, Preidis GA, Lufesi N, Olson D, Kamthunzi P, Hosseinipour MC, et al. Determining the quality of IMCI pneumonia care in Malawian children. Paediatr Int Child Health. 2014;34(1):29–36. doi: 10.1179/2046905513Y.0000000070. PMID: 24091151. PMCID: PMC4424282.
- 6. Horwood C, Vermaak K, Rollins N, Haskins L, Nkosi P, Qazi S. An evaluation of the quality of IMCI assessments among IMCI trained health workers in South Africa. PLoS One. 2009;4(6):e5937. doi: 10.1371/journal.pone.0005937. PMID: 19536288. PMCID: PMC2693922.
- 7. Krüger C, Heinzel-Gutenbrunner M, Ali M. Adherence to the integrated management of childhood illness guidelines in Namibia, Kenya, Tanzania and Uganda: evidence from the national service provision assessment surveys. BMC Health Serv Res. 2017;17(1):822. doi: 10.1186/s12913-017-2781-3. PMID: 29237494. PMCID: PMC5729502.
- 8. Nguyen DTK, Leung KK, McIntyre L, Ghali WA, Sauve R. Does integrated management of childhood illness (IMCI) training improve the skills of health workers? A systematic review and meta-analysis. PLoS One. 2013;8(6):e66030. doi: 10.1371/journal.pone.0066030. PMID: 23776599. PMCID: PMC3680429.
