Establishment and Validation of Serum Ferritin Reference Intervals Based on Real-World Big Data and Multi-Strategy Partitioning Algorithms
Yixin Xu, Xiaojuan Wu, Junlong Zhang, Qian Niu, Bei Cai, Qiang Miao

TL;DR
This study establishes accurate reference intervals for serum ferritin levels using big data and advanced statistical methods, improving diagnostic accuracy for a local population.
Contribution
A novel multi-strategy partitioning framework for deriving population-specific serum ferritin reference intervals using real-world data and decision tree analysis.
Findings
Males had significantly higher serum ferritin concentrations than females.
Age was significantly associated with serum ferritin in females but not in males.
Study-derived reference intervals outperformed manufacturer-provided intervals in validation.
Abstract
Background/Objectives: We aimed to establish and validate population-based reference intervals (RIs) for serum ferritin (SF) using an indirect, data-driven approach based on real-world laboratory data and to optimize partitioning strategies. Methods: SF results from 29,723 apparently healthy individuals who underwent health examinations at West China Hospital between 2020 and 2024 were retrospectively analyzed. SF was measured on a Roche Cobas e801 electrochemiluminescence immunoassay platform. After Box–Cox transformation, outliers were removed using an iterative Tukey method. Potential partitioning factors were evaluated, and data-driven age cut-points were explored using decision tree regression and verified with the Harris–Boyd criteria. RIs were estimated using nonparametric percentile methods and validated in an independent cohort of 2494 individuals. Results: SF concentrations…
Funding
- National Key Technologies R&D Program of China
- Project of Sichuan Provincial Department of Science and Technology
Taxonomy
Topics: Iron Metabolism and Disorders · Hemoglobinopathies and Related Disorders · Clinical Laboratory Practices and Quality Control
1. Introduction
Serum ferritin (SF) is the primary intracellular iron storage protein and is widely distributed in the liver, spleen, and bone marrow. In clinical practice, SF is a sensitive indicator of body iron stores: low concentrations suggest iron deficiency or iron deficiency anemia, whereas elevated concentrations may reflect iron overload, inflammation, or malignancy. Consequently, SF is frequently used for the diagnosis and monitoring of iron-metabolism disorders [1,2]. However, reference intervals (RIs) for SF—defined here as population-based intervals derived from a reference cohort, which differs from individualized or custom RIs based on an individual’s longitudinal results—vary markedly across populations, regions, and assay platforms [3]. In China, many laboratories still rely on manufacturer-provided RIs derived from foreign populations, which may not account for differences in genetics, diet, and environmental factors [4,5]. Using inadequately established or nontransferable RIs can lead to both false-positive and false-negative interpretations. For SF, an overly narrow upper limit (UL) may label physiological elevations (e.g., in postmenopausal women) as “abnormal”, triggering unnecessary repeat testing, imaging, or referrals, whereas overly broad limits may delay recognition of iron deficiency or overload. In addition, ferritin is an acute-phase reactant, and inappropriate RIs may amplify the confounding effects of subclinical inflammation, metabolic conditions, or liver disease on result interpretation, ultimately undermining clinical decision-making and resource utilization.
Sex and age are well-documented determinants of SF concentrations. Several studies report that females have lower SF concentrations than males, with a pronounced increase after menopause, suggesting that age plays a crucial role in female SF dynamics [6,7]. The Clinical and Laboratory Standards Institute (CLSI) EP28-A3c guideline emphasizes considering partitioning factors, such as sex and age, when establishing RIs [8,9]. However, the RIs currently used in our laboratory (provided by Roche Diagnostics) were derived from a limited cohort of 224 German adults (120 males aged 20–60 years and 104 premenopausal females aged 17–60 years) and lack comprehensive age stratification (males: 30–400 µg/L; females: 13–150 µg/L). Such RIs may underestimate physiological variability, particularly in peri- and postmenopausal women [10,11].
In recent years, indirect methods that utilize large-scale laboratory databases have gained wide acceptance for establishing RIs [12,13]. These approaches reduce subjective bias inherent to direct sampling and have been successfully applied to various clinical biomarkers [14,15,16,17]. In this study, we employed an indirect, data-driven approach using a large real-world laboratory dataset. Decision tree analysis and the Harris–Boyd method were combined to evaluate the necessity of sex- and age-specific partitioning. Our objective was to establish robust, locally applicable RIs for SF to improve diagnostic accuracy and to provide a methodological framework for other biomarkers.
2. Materials and Methods
2.1. Study Population
We retrospectively collected SF results from 46,963 individuals who underwent routine health examinations at the Health Management Center of West China Hospital, Sichuan University, between January 2020 and December 2024. When an individual had multiple eligible SF measurements during the study period, only the first measurement was retained to ensure independence of observations and to avoid over-representation of frequently tested individuals in the indirect RI estimation. After excluding 702 individuals with incomplete data, 46,261 individuals remained as the initial study population. To ensure biological validity and representativeness of the reference population, we further excluded individuals with abnormal liver function tests (alanine aminotransferase, aspartate aminotransferase), abnormal renal function (serum creatinine), or abnormal routine hematology parameters (hemoglobin, red blood cell count, white blood cell count, platelet count). Subjects with a history of malignancy, recent surgery or hospitalization, active infections, or samples affected by hemolysis, icterus, or lipemia were also excluded. Ultimately, 29,723 apparently healthy individuals (17,846 males and 11,877 females) were included in the RI establishment cohort. An independent validation cohort of 2494 individuals (1490 males, 1004 females) examined between January and May 2025 was used to validate the newly established RIs. This study was approved by the Ethics Committee of West China Hospital, Sichuan University (Approval No. 2022-1682) and was conducted in accordance with the Declaration of Helsinki and relevant institutional guidelines. The requirement for informed consent was waived because only de-identified retrospective data were used.
2.2. Instruments and Reagents
SF was measured on a Roche Cobas e801 electrochemiluminescence immunoassay analyzer (Roche Diagnostics GmbH, Mannheim, Germany), using Roche Elecsys^®^ Ferritin reagent kits (Roche Diagnostics GmbH, Mannheim, Germany). Calibration was performed with Roche Elecsys^®^ Ferritin CalSet calibrators, and internal quality control was performed daily using two levels of Roche Elecsys^®^ PreciControl Tumor Marker controls (low and high). Westgard multi-rules (1_3S_, 2_2S_, R_4S_) were applied. The laboratory’s analytical performance specification for SF during the study period was CV ≤ 5% (per the immunoassay quality management plan), and the cumulative CVs for both QC levels remained within this target. The laboratory participated continuously in external quality assessment/proficiency testing organized by the National Center for Clinical Laboratories (NCCL, Beijing, China) and the College of American Pathologists (CAP), consistently achieving satisfactory performance throughout the study period.
2.3. Establishment of Reference Intervals
Initially, data normality was assessed visually using histograms. Data exhibiting evident skewness underwent a Box–Cox transformation in R software (version 3.6.3) to approximate a normal distribution. The optimal transformation parameter (λ) was determined by maximum likelihood estimation (Formula (1)). Histograms were then used to re-evaluate the distribution of the transformed data.
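For reference, the standard one-parameter Box–Cox transformation, which is assumed here to be the form that Formula (1) denotes, can be written as follows. The symbol $X^{(\lambda)}$ for the transformed value is a notational assumption, not taken from the paper.

```latex
X^{(\lambda)} =
\begin{cases}
\dfrac{X^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\ln X, & \lambda = 0,
\end{cases}
\qquad X > 0 .
```

Up to a shift and rescaling, λ = 2 gives a square transformation and λ = 0.5 a square-root transformation, while λ = 0 gives the natural logarithm.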
In Formula (1), X represents the original data and λ is the transformation parameter chosen to best approximate a normal distribution. For positive X values, the Box–Cox transformation takes various forms: λ = 2 corresponds to a square transformation, λ = 0.5 to a square root transformation, and λ = 0 to a natural logarithmic transformation. In practice, statistical software such as R (version 3.6.3) or Python (version 3.13.1) is used to determine the optimal value of λ that maximizes the likelihood of normality. In this study, λ was set to 0.284 and the Box–Cox transformation successfully converted the distribution of SF data from non-normal to approximately normal, with details provided in Table 1 and Figure 1.
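The maximum likelihood choice of λ can be illustrated with a small grid search. This is a minimal sketch, not the authors' R code: it assumes the standard Box–Cox form and the usual profile log-likelihood under a normal model, and the function names, grid range, and demo data are all hypothetical.

```python
import math
from statistics import NormalDist


def boxcox(x, lam):
    """Box-Cox transform of a single positive value (standard form, assumed)."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def boxcox_loglik(values, lam):
    """Profile log-likelihood of lambda under a normal model (constants dropped)."""
    n = len(values)
    transformed = [boxcox(v, lam) for v in values]
    mean = sum(transformed) / n
    var = sum((t - mean) ** 2 for t in transformed) / n  # ML variance, divides by n
    jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(var) + jacobian


def estimate_lambda(values, lo=-1.0, hi=2.0, step=0.01):
    """Grid search for the lambda that maximizes the profile log-likelihood."""
    n_steps = int(round((hi - lo) / step))
    grid = [round(lo + k * step, 10) for k in range(n_steps + 1)]
    return max(grid, key=lambda lam: boxcox_loglik(values, lam))


# Hypothetical demo data: exact log-normal quantiles, so lambda close to 0 is expected.
demo = [math.exp(NormalDist().inv_cdf((i + 0.5) / 200)) for i in range(200)]
demo_lambda = estimate_lambda(demo)
```

A coarse grid is enough to show the idea; statistical packages typically use a numerical optimizer instead.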
The Tukey method was applied iteratively to identify and remove outliers, defined as values below P_25_ − 1.5 × IQR or above P_75_ + 1.5 × IQR, until no further outliers remained [18]. After data normalization and cleaning, RI partitioning was evaluated in two stages. First, sex-specific differences were assessed using the Harris–Boyd standard normal deviate method (Formula (2)). Second, for age (a continuous variable), scatterplots were inspected and decision tree regression was implemented to explore data-driven age cut-points for RI partitioning. Age was modeled as a continuous predictor and SF (Box–Cox transformed) as the response. A recursive partitioning algorithm was fitted to minimize within-node variance (equivalently, maximize between-node separation) and to propose split points that improved model fit (R^2^). R^2^ measures the goodness of fit across all subgroups after each split, ranging from 0 (no fit) to 1 (exact fit). The age partition point corresponding to the highest R^2^ was selected as the optimal threshold [19,20]. Candidate age partitions were subsequently confirmed by the Harris–Boyd criteria (Z and Z*); partitioning was considered statistically justified when Z > Z* and the standard deviation ratio (SD ratio) exceeded 1 [9].
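The iterative Tukey step can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the quantile rule (linear interpolation) is only one of several conventions, and the function names and the `max_rounds` safeguard are hypothetical. In the study the fences would be applied to the Box–Cox transformed values.

```python
def quantile(sorted_vals, p):
    """Linear-interpolation quantile on pre-sorted data (one common convention)."""
    pos = p * (len(sorted_vals) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] + frac * (sorted_vals[upper] - sorted_vals[lower])


def tukey_iterative(values, k=1.5, max_rounds=100):
    """Drop values outside [P25 - k*IQR, P75 + k*IQR] repeatedly until none remain."""
    kept = sorted(values)
    for _ in range(max_rounds):
        q1, q3 = quantile(kept, 0.25), quantile(kept, 0.75)
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        filtered = [v for v in kept if low <= v <= high]
        if len(filtered) == len(kept):  # converged: no further outliers
            break
        kept = filtered  # fences are recomputed on the reduced data next round
    return kept
```

Recomputing the fences after each pass is what distinguishes the iterative variant from a single Tukey filter.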
In Formula (2), x̄_1_ and x̄_2_ are the means, s_1_ and s_2_ are the standard deviations, and n_1_ and n_2_ are the sample sizes of the two groups being compared.
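The Harris–Boyd quantities can be computed from subgroup summary statistics. The sketch below assumes the commonly cited form of the test, Z = |x̄1 − x̄2| / sqrt(s1²/n1 + s2²/n2) with critical value Z* = 3·sqrt((n1 + n2)/240), and takes the SD ratio as the larger SD divided by the smaller; the function name is hypothetical.

```python
import math


def harris_boyd(mean1, sd1, n1, mean2, sd2, n2):
    """Return (Z, Z*, SD ratio) for two subgroups from their summary statistics."""
    z = abs(mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z_star = 3.0 * math.sqrt((n1 + n2) / 240.0)  # critical value scales with sample size
    sd_ratio = max(sd1, sd2) / min(sd1, sd2)  # larger SD over smaller SD
    return z, z_star, sd_ratio


# Summary statistics reported in Section 3.1 for males and females after outlier removal.
z, z_star, sd_ratio = harris_boyd(403.72, 230.03, 17568, 133.93, 108.89, 11831)
```

With these inputs the SD ratio is about 2.11, in line with the value reported for the between-sex comparison. The Z value from this sketch need not match Table 2, since the scale on which the authors computed it is not stated here.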
Finally, RIs were estimated using nonparametric methods (2.5th–97.5th percentiles), and 90% confidence intervals (CIs) were computed by bootstrap resampling. To assess clinical applicability, we evaluated the proportion of individuals in the validation cohort whose SF values fell outside these intervals in accordance with CLSI EP28-A3c guidelines. An RI was considered valid if fewer than 10% of individuals had values outside the interval.
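A nonparametric RI with bootstrap confidence intervals might be computed along the following lines. This is a sketch under assumptions: it uses the rank rule r = p(n + 1) with linear interpolation, a percentile bootstrap, and hypothetical function names and defaults (500 resamples, fixed seed); the paper does not state which variants were used.

```python
import random


def percentile_rank(sorted_vals, p):
    """Value at rank p*(n+1) in sorted data, interpolating between neighbors."""
    n = len(sorted_vals)
    rank = min(max(p * (n + 1), 1.0), float(n))  # clamp to the valid rank range
    lower = int(rank)
    if lower >= n:
        return sorted_vals[-1]
    frac = rank - lower
    return sorted_vals[lower - 1] + frac * (sorted_vals[lower] - sorted_vals[lower - 1])


def reference_interval(values):
    """Nonparametric RI: the 2.5th and 97.5th percentiles."""
    s = sorted(values)
    return percentile_rank(s, 0.025), percentile_rank(s, 0.975)


def bootstrap_ci(values, n_boot=500, level=0.90, seed=0):
    """Percentile-bootstrap CIs for the lower and upper reference limits."""
    rng = random.Random(seed)
    n = len(values)
    lowers, uppers = [], []
    for _ in range(n_boot):
        resample = [values[rng.randrange(n)] for _ in range(n)]
        lo, hi = reference_interval(resample)
        lowers.append(lo)
        uppers.append(hi)
    lowers.sort()
    uppers.sort()
    tail = (1.0 - level) / 2.0
    lower_ci = (percentile_rank(lowers, tail), percentile_rank(lowers, 1.0 - tail))
    upper_ci = (percentile_rank(uppers, tail), percentile_rank(uppers, 1.0 - tail))
    return lower_ci, upper_ci
```

For example, `reference_interval(list(range(1, 120)))` returns limits at ranks 3 and 117 of the 119 sorted values.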
2.4. Statistical Analysis
Statistical analyses were performed using SPSS v23.0 (IBM Corp., Armonk, NY, USA) and R (version 3.6.3, 2020). R was used for Box–Cox transformation, iterative Tukey outlier exclusion, decision tree regression, Harris–Boyd calculations (Z, Z* and SD ratio), non-parametric percentile RI estimation, and graphical visualization. SPSS was used for descriptive statistics and hypothesis testing. Data normality was evaluated using histograms and skewness-kurtosis tests. Normally distributed variables were expressed as mean ± standard deviation and analyzed by independent samples t-tests. Categorical data were expressed as counts (percentages) and analyzed using chi-squared tests. Pearson correlation and simple linear regression were used to assess associations between SF and age. Statistical significance was set at p < 0.05.
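The descriptive shape statistics and the correlation coefficient mentioned above can be sketched as follows. The estimators here are the simple moment-based ones (with kurtosis reported as excess kurtosis, so a normal distribution gives 0); SPSS applies small-sample adjustments, so its values can differ slightly. The function names are hypothetical.

```python
import math


def skewness_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)
```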
3. Results
3.1. SF Data Characteristics
A total of 29,723 apparently healthy individuals (17,846 males and 11,877 females) were included in the final analysis. Initial assessment revealed a marked positive skew in the SF distribution for both sexes (Figure 1A,B). Following Box–Cox transformation, the distributions approximated normality, with marked improvements in skewness and kurtosis (Figure 1C,D; Table 1). Specifically, in males, skewness decreased from 1.681 to 0.190 and kurtosis from 4.512 to 0.350. In females, skewness decreased from 1.952 to 0.163 and kurtosis decreased from 6.478 to −0.254. After iterative outlier removal using the Tukey method, the final dataset for RI modeling comprised 17,568 males (aged 16–91 years, mean SF 403.72 ± 230.03 µg/L) and 11,831 females (aged 14–91 years, mean SF 133.93 ± 108.89 µg/L) (Table 1).
3.2. RI Partitioning Analysis
To determine whether sex- and age-specific partitioning of SF RIs was necessary, we compared SF concentrations between males and females after outlier exclusion. Between-sex differences were substantial (p < 0.001; Figure 2A) and met the Harris–Boyd criteria for partitioning (Z > Z* and SD ratio = 2.11; Table 2). We next evaluated age as a partitioning factor within each sex. In males, SF showed only a weak inverse correlation with age (r = −0.049, p < 0.001; Figure 2B) and decision tree regression suggested a split at 61 years with poor model fit (R^2^ = 0.0102; Table 3); this split was not supported by the Harris–Boyd test (Z < Z*; Table 2). In females, SF increased markedly with age (r = 0.466, p < 0.001; Figure 2C) and the decision tree regression identified 50 years as the optimal cut-off with a substantially better model fit (R^2^ = 0.2467, Table 3), which was subsequently confirmed by the Harris–Boyd method (Z > Z* and SD ratio = 1.55; Table 2). Therefore, the final partitioning scheme consisted of a single group for males and two age-based subgroups for females (≤50 years and >50 years).
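The search for a single age cut-point can be illustrated with a one-level regression tree. This is a simplified sketch, not the authors' R workflow: it evaluates every observed age as a candidate split (left node: age ≤ cut), keeps the one with the smallest within-node sum of squares, and reports the corresponding R^2^. The function names and demo data are hypothetical, and full tree software additionally handles deeper splits and pruning.

```python
def _sum_sq(values):
    """Sum of squared deviations from the mean of a node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_single_split(ages, responses):
    """Return (cut, R^2) for the age cut-point that minimizes within-node variance."""
    total_ss = _sum_sq(responses)
    if total_ss == 0:
        raise ValueError("response is constant; no split can improve the fit")
    best_cut, best_r2 = None, 0.0
    for cut in sorted(set(ages))[:-1]:  # the largest age cannot be a split point
        left = [y for a, y in zip(ages, responses) if a <= cut]
        right = [y for a, y in zip(ages, responses) if a > cut]
        r2 = 1.0 - (_sum_sq(left) + _sum_sq(right)) / total_ss
        if best_cut is None or r2 > best_r2:
            best_cut, best_r2 = cut, r2
    return best_cut, best_r2


# Hypothetical demo: a step change in the response after age 50.
demo_ages = list(range(20, 80))
demo_resp = [0.0 if a <= 50 else 1.0 for a in demo_ages]
demo_cut, demo_r2 = best_single_split(demo_ages, demo_resp)
```

A low R^2^ at the best split, as reported for males, indicates that age explains little of the variation even at its most favorable cut-point.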
3.3. Establishment of SF Reference Intervals
Based on the final partitioning scheme, RIs covering the central 95% of the SF distribution in this regional healthy population were established using the nonparametric percentile method (P2.5–P97.5). The resulting RIs were 98.02–997.78 µg/L for males aged 16–91 years; 10.30–299.55 µg/L for females aged 14–50 years; and 36.61–507.00 µg/L for females aged > 50 years (Table 4). For comparison, the manufacturer-provided RIs are 30.00–400.00 µg/L for males and 13.00–150.00 µg/L for females without age stratification. Notably, the ULs of the study-derived RIs exceeded the manufacturer’s ULs, particularly in females > 50 years (Table 5).
3.4. Validation and Comparison of RIs
To evaluate the clinical applicability of the study-derived RIs, validation was conducted using an independent cohort of 2494 individuals. The pass rates of the study-derived RIs were significantly higher than those of the manufacturer-provided RIs across all subgroups: 93.83% vs. 56.71% in males (p < 0.001); 94.72% vs. 73.97% in females aged ≤ 50 years (p < 0.001); and 94.52% vs. 37.12% in females aged > 50 years (p < 0.001).
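The pass-rate check can be sketched as a short function. It assumes the rule stated in Section 2.3 (an RI is valid if fewer than 10% of validation values fall outside it); the function name and the demo values are hypothetical.

```python
def validate_interval(values, lower, upper, max_outside=0.10):
    """Return (pass rate, is_valid) for a candidate RI against validation data."""
    outside = sum(1 for v in values if v < lower or v > upper)
    fraction_outside = outside / len(values)
    return 1.0 - fraction_outside, fraction_outside < max_outside


# Hypothetical demo: 100 values, 5 of which fall outside the interval [3, 97].
demo_pass_rate, demo_valid = validate_interval(list(range(1, 101)), 3, 97)
```

Applying the same function to the study-derived and manufacturer-provided limits on one validation subgroup gives directly comparable pass rates.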
4. Discussion
Population-based RIs are fundamental for interpreting laboratory results and for minimizing misclassification in clinical decision making. SF, as the primary form of iron storage in the body, plays an important clinical role in assessing iron metabolism, aiding in the diagnosis of iron-deficiency anemia, iron overload, and certain malignancies [2,21,22]. However, SF concentrations are influenced by multiple factors including sex, age, ethnicity, geographic region, and analytical platform, resulting in considerable variability in RIs across populations [3,5,7,16]. Therefore, establishing locally appropriate RIs based on large regional datasets is essential to ensure accurate interpretation of laboratory results [13].
In this study, by using a large real-world health examination dataset and an indirect RI approach, we established and validated sex- and age-specific SF RIs for apparently healthy individuals in southwestern China. We observed that SF concentrations were significantly higher in males than in females, consistent with previous reports [23,24], reflecting differences in iron storage between sexes that may be attributable to physiological factors, greater muscle mass, and differences in estrogen regulation of iron metabolism [25]. After applying a multi-strategy partitioning algorithm, no further age partitioning was required for males, and the overall male RI was 98.02–997.78 µg/L. For females, our analysis clearly identified 50 years as an important threshold, with SF levels significantly higher in females aged > 50 years compared to those aged ≤ 50 years. The RIs established were 10.30–299.55 µg/L for females aged ≤ 50 years and 36.61–507.00 µg/L for females aged > 50 years. This finding is consistent with results from other studies using different analytical platforms in China [5], and supports the physiological increase in iron storage following menopause due to cessation of menstrual blood loss [7,10]. The observed sex and age patterns are biologically plausible. Compared with women of reproductive age, men typically have higher iron stores because they lack menstrual iron loss and may have greater dietary iron intake and body iron reserves [26]. In women, the rise in SF after midlife is consistent with the reduction and eventual cessation of menstrual blood loss after menopause, together with hormonal regulation of iron metabolism and hepcidin signaling [27,28]. These physiological changes support the need for age-specific interpretation in females and help explain why a single female RI without age stratification may lead to frequent false-positive results in older women.
Notably, the ULs of the study-derived RIs—particularly in females > 50 years—exceeded those provided by the manufacturer. This may reflect both true physiological variability in the local cohort and the right-tailed distribution of ferritin, for which the 97.5th percentile may remain sensitive to residual heterogeneity even after rigorous exclusion criteria and iterative outlier removal. Because SF is also an acute-phase reactant, unrecognized low-grade inflammation and metabolic or hepatic factors (e.g., fatty liver), alcohol intake, or unrecorded supplement use may contribute to higher SF values in an “apparently healthy” dataset. Therefore, SF values near the UL should be interpreted in clinical context rather than in isolation. For clinical implementation, these RIs are intended for adults comparable to the health examination population and measured using the same analytical platform. Importantly, an RI is not a diagnostic decision limit: some patients with disease may still have SF values within the RI, and values outside the RI do not by themselves establish a diagnosis. When SF is above the UL, repeat testing and evaluation with complementary indices (e.g., hemoglobin and red cell indices, transferrin saturation, C-reactive protein, and liver function tests) can help distinguish iron overload from inflammation-related hyperferritinemia; when SF is low or borderline, assessment of iron deficiency should similarly integrate clinical context and additional iron studies [1,2,4]. Where available, interpretation should also consider prior results, longitudinal trends and clinical suspicion.
Methodologically, this study followed CLSI EP28-A3c recommendations, used a transparent transformation and outlier strategy, and combined decision tree analysis with Harris–Boyd testing to provide an objective, reproducible partitioning workflow. Such approaches have been widely used in recent pediatric RI studies [29,30,31], and are becoming an important trend in indirect RI modeling. The RIs established in this study underwent external verification in an independent cohort and yielded pass rates > 93.8% in all subgroups, substantially higher than the manufacturer-provided RIs, supporting improved local applicability. Similar findings have been reported in other studies both in China and internationally [32,33,34], suggesting that manufacturer-provided RIs, typically derived from selected populations in other countries, may not adequately reflect the actual status of local populations, potentially leading to overdiagnosis, unnecessary anxiety, and inefficient use of healthcare resources [35].
Nonetheless, this study has certain limitations including the single-center retrospective design, the lack of individual information on diet, supplementation, alcohol intake and comorbidities, and platform specificity. In addition, while finer age stratification (e.g., by decades) may reveal gradual trends, excessive partitioning can compromise statistical robustness and was not justified by our objective partitioning criteria in this dataset. Future studies incorporating multicenter datasets and different analytical systems are needed to improve the representativeness and broader applicability of the RIs.
5. Conclusions
In summary, using a large real-world health examination dataset and an indirect modeling strategy, we established population-based serum ferritin RIs for apparently healthy individuals, measured on the Roche Cobas e801 platform. These RIs offer improved regional representativeness and clinical utility, providing robust support for the accurate diagnosis of iron metabolism disorders. The proposed framework is reproducible and may be extended to other biomarkers using real-world laboratory databases.
References
- 1. Camaschella C. Iron deficiency. Blood 2019, 133, 30–39. doi:10.1182/blood-2018-05-815944. PMID: 30401704.
- 2. Pasricha S.R., Tye-Din J., Muckenthaler M.U., Swinkels D.W. Iron deficiency. Lancet 2021, 397, 233–248. doi:10.1016/S0140-6736(20)32594-0. PMID: 33285139.
- 3. Truong J., Naveed K., Beriault D., Lightfoot D., Fralick M., Sholzberg M. The origin of ferritin reference intervals: A systematic review. Lancet Haematol. 2024, 11, e530–e539. doi:10.1016/S2352-3026(24)00103-0. PMID: 38937026.
- 4. Sezgin G., Monagle P., Loh T.P., Ignjatovic V., Hoq M., Pearce C., McLeod A., Westbrook J., Li L., Georgiou A. Clinical thresholds for diagnosing iron deficiency: Comparison of functional assessment of serum ferritin to population based centiles. Sci. Rep. 2020, 10, 18233. doi:10.1038/s41598-020-75435-5. PMID: 33106588. PMCID: PMC7589482.
- 5. Wang Q.P., Guo L.Y., Lu Z.Y., Gu J.W. Reference intervals established using indirect method for serum ferritin assayed on Abbott Architect i2000(SR) analyzer in Chinese adults. J. Clin. Lab. Anal. 2020, 34, e23083. doi:10.1002/jcla.23083. PMID: 31674712. PMCID: PMC7083431.
- 6. Floegel A., Intemann T., Siani A., Moreno L.A., Molnar D., Veidebaum T., Hadjigeorgiou C., De Henauw S., Hunsberger M., Eiben G. Cohort-Based Reference Values for Serum Ferritin and Transferrin and Longitudinal Determinants of Iron Status in European Children Aged 3–15 Years. J. Nutr. 2024, 154, 658–669. doi:10.1016/j.tjnut.2023.12.001. PMID: 38048991. PMCID: PMC10900138.
- 7. Addo O.Y., Mei Z., Jefferds M.E.D., Jenkins M., Flores-Ayala R., Williams A.M., Young M.F., Luo H., Ko Y.-A., Papassotiriou I. Physiologically based serum ferritin thresholds for iron deficiency among women and children from Africa, Asia, Europe, and central America: A multinational comparative study. Lancet Glob. Health 2025, 13, e831–e842. doi:10.1016/S2214-109X(25)00009-9. PMID: 40288394.
- 8. Harris E.K., Boyd J.C. On dividing reference data into subgroups to produce separate reference ranges. Clin. Chem. 1990, 36, 265–270. doi:10.1093/clinchem/36.2.265. PMID: 2302771.
