Modelling Cumulative Effects of Air Pollution on Respiratory Illnesses by Performing Spline Estimation of Constrained, Additive Single-Index Model
Xingfa Zhang, Siyu Wang, Quanxi Shao, Sijia Wang, Yuezi Wei

TL;DR
This paper introduces a new model to study how air pollution and weather affect respiratory illnesses, using data from Hong Kong including the SARS epidemic.
Contribution
A novel semiparametric index model is proposed to capture cumulative and nonlinear effects of air pollution and weather on respiratory illnesses.
Findings
SO2, NO2, and PM10 effects decay quickly, while O3, NOx, RH, and temperature have stable accumulation periods.
The proposed model outperforms previous models in fitting performance for health monitoring.
Public health measures during the SARS epidemic were accounted for using a growth curve model.
Abstract
It is widely recognised that air pollutants including sulphur dioxide (SO2), respirable suspended particulates (PM10), nitrogen oxides (NOx), nitrogen dioxide (NO2), and ozone (O3), as well as weather conditions such as temperature (Temp) and relative humidity (RH), are major causes of respiratory illnesses. To quantify the unknown and highly nonlinear relationships between these factors and respiratory illness, and the cumulative effect from exposure to symptoms, in this paper, we propose a semiparametric index model with constraints to capture the cumulative effect additively and the nonlinearity nonparametrically. As a case study, the model is applied to a dataset from the Hong Kong SAR. As the data period includes the SARS (severe acute respiratory syndrome) epidemic in 2003, we further construct a growth curve model to account for the extra impact of public health measures. The…
Funding
- Guangdong Basic and Applied Basic Research Foundation
- Funding for Science and Technology Projects in Guangzhou
1. Introduction
Air pollution is among the major environmental problems in both developed and developing countries (WHO 2003) [1]. Many studies have addressed the impacts of air pollution on respiratory illnesses around the world (USEPA 1996) [2]. This research is motivated by the challenge of modelling the effect of ambient air pollution on respiratory illnesses. It is widely agreed that exposure to ambient air pollution may cause serious respiratory illnesses and that weather conditions may also contribute to their seriousness [3,4,5,6,7,8,9,10,11,12,13,14]. However, quantifying the effects of the various pollutants and weather conditions is difficult, because their impact on the onset of respiratory illnesses is highly nonlinear and the factors may interact with one another. The mechanism underlying these nonlinearities and interactions is unknown. To account for a possible incubation period and/or cumulative effects, we may include lags of the potential factors, which results in a large number of covariates in the regression. To address this issue, semiparametric models are frequently used, such as the cumulative effect model, index model, generalised additive model [15,16,17,18,19], and fuzzy process model [20,21]. The cumulative effect model has value in addressing the questions: “Is there a threshold below which no effects of the pollutants on health are expected to occur in all people?” and “What averaging period (time pattern) is the most relevant from the point of view of health?”. However, the length of the cumulative period itself has not been investigated.
The cumulative effects of pollutants and weather factors on respiratory illnesses have been recognised and investigated by researchers [22,23,24]. In data analyses, the ‘weekend-effect’ (change in hygiene habits) and environmental effects should be modelled simultaneously.
With a normal lifestyle, the effect of environmental and climatic factors on respiratory illness should be relatively stable, because the pattern of individual exposure to the environment and community is relatively stable. However, sudden events may change this established pattern and affect the number of cases of respiratory illnesses. This is true for the data we collected from Hong Kong, as the SARS (severe acute respiratory syndrome) epidemic in Hong Kong in 2003 changed people’s lifestyles during that time [25]. There are at least two options for such a dataset. We can simply use the data within a period in which there was no significant event. Alternatively, we can introduce an extra component to model the effect of this significant event. In addition to making fuller use of the available information, the latter approach can also provide insights into the impact of the significant event.
In this paper, we modify the cumulative model of Xia and Tong (2006) [9] by including a component for the effect of SARS and by determining the length of the cumulative period, which offers guidance on the incubation period. For computational efficiency, we also propose the use of the spline method for the effect function, which is more efficient than the local polynomial method. The paper is organised as follows. The data used in this study are described in the next section, followed by the modelling and estimation procedure. The results are then provided before a discussion and conclusions.
2. Dataset and Methods
2.1. Dataset
Hong Kong became a Special Administrative Region (HKSAR) of the People’s Republic of China on 1 July 1997, after a century and a half of British administration. It is located south of Mainland China, with a population of 7 million and an area of 1103 km² covering Hong Kong Island, Kowloon, and the more rural New Territories. Its climate is sub-tropical, with temperatures between 10 degrees Celsius in winter and 33 degrees Celsius in summer.
Air pollution data were obtained from the Environmental Protection Department of HKSAR (www.epd.gov.hk, accessed on 18 February 2025). These include the daily average levels of sulphur dioxide (SO2, μg m⁻³), respirable suspended particulates (PM10, μg m⁻³), nitrogen oxides (NOx, μg m⁻³), nitrogen dioxide (NO2, μg m⁻³), and ozone (O3, μg m⁻³). Weather data were obtained from the Hong Kong Observatory, including temperature (Temp, degrees Celsius) and relative humidity (RH). Instead of detailed concentrations of individual pollutants, the Air Pollution Index (API) is usually reported to the public. However, we did not use the API in this study as it constitutes secondary data derived from the highest indices of several key pollutants by comparing the measured concentrations with their respective health-related Air Quality Objectives (AQOs) established under the Air Pollution Control Ordinance.
Daily hospital admissions for respiratory diseases were obtained from the Hospital Authority of the HKSAR (the same data source used in Shao et al., 2010 and Wong et al., 2013 [10,25]). The period of study was from 1 January 2000 to 31 December 2005, totalling 2192 days. The number of respiratory illnesses is plotted in Figure 1. As we can see, there is a sudden drop following the outbreak of SARS. We divide the data into two parts. The pre-SARS part comprises the data before 8 March 2003, and the post-SARS part comprises the data from 23 June 2003 onwards (inclusive). We do not use the data during the SARS epidemic, as this was a transition period of change. Visual comparisons in the form of box plots are given in Figure 2 for all variables. We can see that the temperature (mean changing from 22.96 to 23.54, STD from 5.03 to 5.23) and RH (mean from 78.01 to 76.92, STD from 9.72 to 11.21) are similar for the pre- and post-SARS periods. However, SO2 (mean from 16.95 to 22.60, STD from 11.14 to 16.43), PM10 (mean from 51.04 to 58.07, STD from 24.89 to 31.32), and O3 (mean from 31.80 to 35.45, STD from 15.70 to 19.89) increase, while NOx (mean from 115.80 to 107.80, STD from 53.49 to 47.63) and NO2 (mean from 57.63 to 58.23, STD from 19.32 to 22.11) are relatively stable. In contrast, the number of illnesses decreases (mean from 220 to 207, STD from 38.91 to 64.11). For detailed summary statistics, refer to Table 1 in [25].
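The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `period_of` and `summarise` are hypothetical, and the boundaries follow the text (pre-SARS strictly before 8 March 2003, post-SARS from 23 June 2003 inclusive).

```python
from datetime import date, timedelta
from statistics import mean, stdev

STUDY_START = date(2000, 1, 1)
STUDY_END = date(2005, 12, 31)
SARS_START = date(2003, 3, 8)   # pre-SARS = strictly before this day
SARS_END = date(2003, 6, 23)    # post-SARS = this day onwards (inclusive)


def period_of(day):
    """Label a calendar day as 'pre', 'sars' (excluded period), or 'post'."""
    if day < SARS_START:
        return "pre"
    if day >= SARS_END:
        return "post"
    return "sars"


def summarise(days, values):
    """Mean and sample standard deviation of a daily series in each retained period."""
    groups = {"pre": [], "post": []}
    for day, value in zip(days, values):
        label = period_of(day)
        if label in groups:
            groups[label].append(value)
    return {label: (mean(v), stdev(v)) for label, v in groups.items()}


# Every calendar day of the study period (2192 days, including two leap years).
days = [STUDY_START + timedelta(days=i)
        for i in range((STUDY_END - STUDY_START).days + 1)]
counts = {label: sum(period_of(d) == label for d in days)
          for label in ("pre", "sars", "post")}
```

Calling `summarise(days, series)` on any daily series aligned with `days` returns the pre- and post-SARS mean and standard deviation in the form quoted in the paragraph above.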
2.2. Methods
Let be our response variable, which is the number of daily admissions due to respiratory problems to a regional hospital, and be the covariates or variables affecting the response variable. The potential covariates in our study include sulphur dioxide (SO2), nitrogen dioxide (NO2), nitrogen oxides (NOx), respirable suspended particulates (PM10), ozone (O3), temperature (Temp), and relative humidity (RH). Let (=8 March 2003) and (23 June 2003) be the start and end of the SARS epidemic, respectively.
A general cumulative model is defined as
where is a constant that appears in the model for model identification, representing the expectation of on the default day (Friday in our case) before SARS, with zero cumulative effects and the average humidity and temperature. Let
model the impact of public health measures due to the SARS epidemic. Furthermore,
models the effect of the day of the week in the hospital admission system, with
Moreover,
models the cumulative effect of environmental and climatic factors, where is the observation of the jth variable at time i, { } are unknown functions, and is the length of the cumulative period of the jth variable ( ). To ensure identifiability, we assume that ( ). For meaningful interpretation, it is also assumed that the effect functions for SO2, PM10, NOx, NO2, and O3 are monotonically non-decreasing and . The framework of (1) is similar to model (1) in the paper of Wong et al. (2013) [25], though the setting for the unknown function is different. In [25], was set as a multi-index form with the purpose of dimension reduction, which made it difficult to study the cumulative effect of individual variables directly. In this paper, is set as an additive single-index form with constraints. Such a setting enables us to describe the cumulative effect function of each considered variable.
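The additive structure described above can be sketched as follows. All symbols are assumptions chosen here for illustration: y_i is the response on day i, c_0 the constant, S the SARS (growth curve) component, W the day-of-week component, g_j the effect function of the jth variable, θ_j its vector of lag weights, p_j the length of its cumulative period, and x_{i,j} its observation on day i. The specific forms of S and W, the exact identifiability constraints, and whether the lag window starts at lag 0 are deliberately left open.

```latex
y_i = c_0 + S(i) + W(i)
      + \sum_{j=1}^{7} g_j\!\left(\theta_j^{\top} X_{ij}\right) + \varepsilon_i,
\qquad
X_{ij} = \left(x_{i,j},\, x_{i-1,j},\, \dots,\, x_{i-p_j,j}\right)^{\top}
```

In this form each variable enters through a single index (one weighted average of its recent history), and the indices are combined additively, which is what allows the cumulative effect of each variable to be read off separately.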
Xia and Tong [9] recommended the backfitting algorithm together with the minimum average variance estimation (MAVE) and local polynomial method for the unknown functions { }. However, the local polynomial approach is not computationally efficient for an index model with high dimensions because of the large matrix involved. Spline methods have a great computational advantage in approximating the unknown nonparametric effect function [26,27,28]. The number of knots and their locations need to be pre-determined in spline methods. Wang and Yang [28] suggested a set number of knots to be used based on the sample size and equally spaced knots for a uniform distribution. As the predictors in our model setting were not uniformly distributed, we determined the knot locations through the probability space of the empirical cumulative distribution, that is, the knots were equally spaced in the quantiles of the empirical distribution, giving roughly the same sample number in each segment.
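The quantile-based knot placement described above can be illustrated with a short sketch. The function names and the linear-interpolation rule for the quantiles are assumptions made here for illustration, not details taken from the paper.

```python
def quantile_knots(values, n_interior):
    """Interior knots placed at equally spaced empirical quantiles of `values`."""
    ordered = sorted(values)
    n = len(ordered)
    knots = []
    for m in range(1, n_interior + 1):
        # Position of the m/(N+1) quantile in the sorted sample (linear interpolation).
        pos = m * (n - 1) / (n_interior + 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        knots.append(ordered[lo] * (1 - frac) + ordered[hi] * frac)
    return knots


def segment_counts(values, knots):
    """Number of observations falling in each of the len(knots) + 1 segments."""
    counts = [0] * (len(knots) + 1)
    for v in values:
        counts[sum(v > k for k in knots)] += 1
    return counts
```

For a skewed predictor, knots that are equally spaced in value would leave some segments nearly empty, whereas the knots returned by `quantile_knots` give `segment_counts` that are roughly equal across segments.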
For our additive single-index model, instead of updating all the cumulative effect functions simultaneously, we updated the functions separately by iteration. To do this, we needed an algorithm similar to the single-index model but with constraints. Following [28], the single-index model in our setting is estimated as below.
Let be the residuals after removing all other intermediate effects during the iteration. To update the cumulative effect and the corresponding weights of a particular covariate variable, say the jth variable, we need to consider the estimation of a single-index model
where (bear in mind that is a column vector) and . Without constraints, the spline function can be formed as below [28].
For a fixed θ_j, let
where is the rescaled centred cumulative distribution with
In the implementation, a is chosen to be the 95th percentile of , where
Under suitable assumptions, the regression function can be written in terms of as
To form the spline, we pre-select an integer . Following [28], we use We divide [0, 1] into (N + 1) subintervals for and , where is a sequence of equally spaced points, called interior knots. We augment these interior points so that . The jth B-spline of order k for this knot sequence, denoted by , is recursively defined by [29].
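The recursive definition of the B-spline basis referred to above (the Cox-de Boor recursion) can be sketched as below. The boundary convention, in which the end points 0 and 1 are each repeated `order` times, is an assumption made here, as are the function names; `order=4` corresponds to cubic splines.

```python
def augmented_knots(n_interior, order=4):
    """Equally spaced interior knots on [0, 1] with repeated boundary knots."""
    step = 1.0 / (n_interior + 1)
    interior = [m * step for m in range(1, n_interior + 1)]
    return [0.0] * order + interior + [1.0] * order


def bspline_basis(u, knots, order=4):
    """Values at u of all B-splines of the given order (Cox-de Boor recursion)."""
    n_spans = len(knots) - 1
    # Order 1: indicator of the knot span containing u; the last non-empty
    # span is closed on the right so that u = 1 is covered.
    last_span = max(i for i in range(n_spans) if knots[i] < knots[i + 1])
    basis = []
    for i in range(n_spans):
        inside = knots[i] <= u < knots[i + 1]
        if i == last_span and u == knots[i + 1]:
            inside = True
        basis.append(1.0 if inside else 0.0)
    # Raise the order one step at a time; terms with a zero denominator are dropped.
    for k in range(2, order + 1):
        raised = []
        for i in range(len(knots) - k):
            left_den = knots[i + k - 1] - knots[i]
            right_den = knots[i + k] - knots[i + 1]
            left = (u - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = (knots[i + k] - u) / right_den * basis[i + 1] if right_den > 0 else 0.0
            raised.append(left + right)
        basis = raised
    return basis
```

With N interior knots the function returns N + 4 cubic basis values, which are non-negative and sum to one at every point of [0, 1].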
For a fixed θ_j, the cubic spline estimator of and the corresponding estimator of are
where is the space of all functions on [0, 1] that have continuous second-order derivatives and are polynomials of degree 3 on each subinterval. The estimator of the coefficient θ_j is obtained by minimising the above objective function over the coefficient space. More detail can be found in Section 3 of [28].
The monotonicity of θ_j can be enforced by using a constrained optimisation method when minimising the objective function in Equation (10); in our study, the Gradient Projection Method was adopted. The monotonicity of an effect function is guaranteed by monotone B-spline coefficients, which can be estimated simply by restricted linear regression. Recall that, in our study, the effect functions for SO2, PM10, NOx, NO2, and O3 were assumed to be non-decreasing. As the lag of the cumulative effect (i.e., the dimension of the single index) was unknown, a criterion was needed to determine the optimal lag. Information criteria are frequently used for this purpose. Akaike’s information criterion [30] was adopted in this study because of its simplicity and effectiveness.
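A restricted linear regression with non-decreasing coefficients can be sketched as follows. The solver shown is a plain projected gradient descent on the coefficient increments, chosen here for illustration. It is not claimed to be the routine used in the study, and the function name and the `design` argument (a matrix of basis values, one row per observation) are hypothetical.

```python
def fit_nondecreasing_coefficients(design, y, n_iter=2000):
    """Least squares for y ~ design @ c subject to c[0] <= c[1] <= ... <= c[-1].

    The coefficients are written as cumulative sums of increments d, with d[0]
    free and d[k] >= 0 for k >= 1, and projected gradient descent is run on d.
    """
    n, p = len(design), len(design[0])
    # Column k of `a` is the sum of design columns k..p-1, so design @ c == a @ d.
    a = [[sum(row[k:]) for k in range(p)] for row in design]
    # Step 1/L, where L = 2 * ||a||_F^2 bounds the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * sum(v * v for row in a for v in row))
    d = [0.0] * p
    for _ in range(n_iter):
        resid = [sum(a[i][k] * d[k] for k in range(p)) - y[i] for i in range(n)]
        grad = [2.0 * sum(a[i][k] * resid[i] for i in range(n)) for k in range(p)]
        d = [d[k] - step * grad[k] for k in range(p)]
        # Projection step: clip the increments (but not the level d[0]) at zero.
        d = [d[0]] + [max(v, 0.0) for v in d[1:]]
    coef, total = [], 0.0
    for v in d:
        total += v
        coef.append(total)
    return coef
```

Because every increment is clipped at zero, the returned coefficients are non-decreasing by construction, and non-decreasing B-spline coefficients are sufficient for the fitted spline itself to be non-decreasing.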
2.2.1. Computing Algorithm with Given Lags
With given lags for the cumulative effect, model (1) can be estimated by the following steps. Before the start of the iteration, the initial estimates of all parameters and functions are set by the following procedure:
Step 1. Initialise the estimates of constant and the parameter in the weekly effect and SARS effect by least-squares fitting of on and with a constant term. Denote the estimates as , and .
Step 2. Let , where and are the fitted values using and , respectively.
Step 3. The cumulative effect functions and the corresponding weights are initialised as below.
Step 3.1. Estimate the first cumulative effect function and the corresponding weights using the single-index model (6) with and . Denote the estimates of the cumulative effect function and the corresponding weights by and , respectively.
Step 3.2. For , estimate the jth cumulative effect function and the corresponding weights using the single-index model (6) with and . Denote the estimates of the cumulative effect function and the corresponding weights by and , respectively.
{End}
Once the initialisation is performed, the iteration can be implemented as below.
Step 4. In the mth iteration, (m = 1, 2, …), compute and update the constant and the coefficients of the weekly effect and SARS effect denoted by , and using least-squares fitting of on and with a constant term.
Step 5. Let , where and are the fitted values using and , respectively.
Step 6. The cumulative effect functions and the corresponding weights are estimated as below.
Step 6.1. Estimate the first cumulative effect function and the corresponding weights using the single-index model (6) with and . Denote the estimates of the cumulative effect function and the corresponding weights by and , respectively.
Step 6.2. For , estimate the jth cumulative effect function and the corresponding weights using the single-index model (6) with and . Denote the estimates of the cumulative effect function and the corresponding weights by and respectively.
Step 6.3. Update the residual by and check the convergence. If the convergence criterion is not met, go to Step 4.
{End}
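The alternating structure of Steps 1 to 6 can be sketched as a generic backfitting loop. This is a minimal sketch under stated assumptions: `fit_parametric` and `fit_component` are hypothetical callables standing in for the least-squares fit of the constant, weekly, and SARS terms and for the constrained single-index spline fit, respectively, and the stopping rule based on the residual sum of squares is an assumption.

```python
def backfit(y, fit_parametric, fit_component, n_components, tol=1e-8, max_iter=100):
    """Alternate between a parametric part and additive components on partial residuals.

    fit_parametric(target) -> fitted values of the constant, weekly, and SARS terms.
    fit_component(j, target) -> fitted values of the j-th cumulative effect.
    """
    n = len(y)
    parametric = fit_parametric(y)                      # initial parametric fit
    effects = [[0.0] * n for _ in range(n_components)]
    resid = [y[i] - parametric[i] for i in range(n)]
    previous = None
    for _ in range(max_iter):
        for j in range(n_components):
            # Partial residual: remove everything except the j-th component.
            others = [sum(effects[m][i] for m in range(n_components) if m != j)
                      for i in range(n)]
            target = [y[i] - parametric[i] - others[i] for i in range(n)]
            effects[j] = fit_component(j, target)
        total = [sum(e[i] for e in effects) for i in range(n)]
        # Refit the parametric part with the cumulative effects removed.
        parametric = fit_parametric([y[i] - total[i] for i in range(n)])
        resid = [y[i] - parametric[i] - total[i] for i in range(n)]
        rss = sum(r * r for r in resid)
        if previous is not None and abs(previous - rss) <= tol * (1.0 + previous):
            break
        previous = rss
    return parametric, effects, resid
```

Updating one component at a time keeps each step a one-dimensional fitting problem, which is the reason given in the text for iterating over the effect functions rather than estimating them all at once.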
2.2.2. Search Approach for (Locally) Optimal Lags
In Steps 1–6 in the above algorithm, for given lags , one can estimate model (1) and calculate the related AIC (Akaike information criterion) value, denoted as AIC ( ). The AIC is a measure of the goodness of fit of a statistical model. Assuming that the model errors are independently normally distributed, k is the number of parameters in the fitted model, n is the number of observations, and SSR is the sum of squared residuals; then, AIC becomes
Due to the fixed sample size, AIC ( ) can be calculated by
where MSE is the mean squared error, and is number of parameters, which includes both spline parameters and model parameters (however, the spline parameters are not shown in the arguments of the AIC definition for ease of notation and can be ignored in the AIC calculation because their numbers are fixed and therefore do not affect the order of AIC values). Next, we give a method to estimate the (locally) optimal lags for model (1) based on AIC values.
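For reference, the standard derivation of the AIC under independent Gaussian errors is sketched below, using the symbols k, n, SSR, and MSE defined above. Whether the paper uses exactly this scaling and these additive constants is an assumption; any version that differs only by a constant shift or a positive rescaling ranks the candidate lags in the same order.

```latex
\begin{aligned}
\mathrm{AIC} &= 2k - 2\ln\hat{L}, \\
-2\ln\hat{L} &= n\ln(2\pi) + n\ln\!\left(\frac{\mathrm{SSR}}{n}\right) + n
  \quad\text{(Gaussian errors, } \hat{\sigma}^2 = \mathrm{SSR}/n\text{)}, \\
\mathrm{AIC} &= 2k + n\ln\!\left(\frac{\mathrm{SSR}}{n}\right) + n\left(\ln 2\pi + 1\right), \\
\frac{\mathrm{AIC}}{n} - \left(\ln 2\pi + 1\right) &= \ln(\mathrm{MSE}) + \frac{2k}{n},
  \qquad \mathrm{MSE} = \frac{\mathrm{SSR}}{n}.
\end{aligned}
```

Adding the same constant c to k for every candidate model shifts every value of the last expression by 2c/n, which is why a fixed number of spline parameters does not change the ordering of the AIC values.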
Step 1. Set , with taking values from the integer set: . Here, is a gap between integers and is a given upper bound for lag estimation. Denote the minimum point of the AIC ( ) with respect to to be and then the initial estimates for lags are set as .
Step 2. To renew the values of lag let , , with taking integer values from the interval . Here, is a relatively small integer. Find the minimum point of
with respect to and denote it as . Then, the updated value for is .
Step 3. Check the convergence by comparing the distance between ( ) and ( ). If the convergence criterion is not met, set and go to Step 2.
In our study, the above Step 1 was executed with and in Step 2 was set to be 5.
{End}
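The search in Steps 1 to 3 can be sketched as a coarse grid search followed by coordinate-wise refinement. This is a minimal sketch under stated assumptions: `aic` is a hypothetical callable that fits the model for a given list of lags and returns its AIC value, the lags are updated one at a time using the most recent values of the others, and the lower bound of 1 on each lag is an assumption.

```python
def search_lags(aic, n_vars, gap, upper, radius, max_rounds=50):
    """Coarse common-lag grid search followed by coordinate-wise refinement."""
    # Step 1: all variables share one lag p on the grid {gap, 2*gap, ...} up to `upper`.
    grid = range(gap, upper + 1, gap)
    common = min(grid, key=lambda p: aic([p] * n_vars))
    lags = [common] * n_vars
    for _ in range(max_rounds):
        old = list(lags)
        # Step 2: update one lag at a time within +/- radius, holding the others fixed.
        for j in range(n_vars):
            lo = max(1, lags[j] - radius)
            hi = min(upper, lags[j] + radius)
            lags[j] = min(range(lo, hi + 1),
                          key=lambda p: aic(lags[:j] + [p] + lags[j + 1:]))
        # Step 3: stop when a full sweep leaves every lag unchanged.
        if lags == old:
            break
    return lags
```

Each sweep costs at most `n_vars * (2 * radius + 1)` model fits, compared with a number of fits that grows exponentially in `n_vars` for an exhaustive search over all lag combinations. The price is that the result is only locally optimal, as the section title indicates.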
Finally, we used the bootstrap method to calculate confidence bounds for the estimates.
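The text does not say which bootstrap variant was used. As a generic illustration only, a percentile bootstrap interval for an arbitrary statistic can be sketched as follows. The resampling of independent observations, the function name, and the defaults are assumptions; for a fitted time-series model, a residual or block bootstrap would be the more natural choice.

```python
import random


def bootstrap_interval(data, statistic, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence bounds for statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    replicates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = replicates[int(tail * (n_boot - 1))]
    upper = replicates[int((1.0 - tail) * (n_boot - 1))]
    return lower, upper
```

For a model parameter, `statistic` would refit the model on each resample and return the estimate of interest, so the cost is `n_boot` full model fits.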
3. Results and Discussion
To obtain an approximately normal distribution, we applied the logarithmic transform to the number of daily respiratory illnesses, as is common practice. Unlike previous studies, which used fixed lags for the cumulative effect, this study estimated (locally) optimal lags for each predictor. The weekly effect due to hospital admission and the SARS effect due to habit change were considered as before. The maximum lag was pre-determined as 200 days. Friday was used as the default day in the weekly effect component (Equation (3)).
The optimal lags affecting respiratory illnesses are given in Table 1, and the associated weights are depicted in Figure 3. It can be seen that O3 has the longest cumulative effect on respiratory illnesses and NO2 has the shortest, while the other pollutants have similar durations. The effects of SO2, NO2, and PM10 decay quickly, while the other variables have a period of accumulation (18–38 days for O3, 2–30 days for NOx, 1–13 days for RH, and 4–12 days for temperature). The cumulative effects of individual predictors are plotted in Figure 4 against the weighted averages. It can be seen that the effects of pollutants tend to level off after certain cumulative averages. Such results are consistent with those of [9], but the threshold point and the effect function range for each pollutant are different. This could be because the model covariates and constraints are different. While the cumulative effects of the pollutants are comparable to those of [9], the cumulative effects of weather factors (RH and temperature) are different from the previous findings and are of interest. It can be seen that RH has the least effect on respiratory illnesses at around 60%, which is the most comfortable level for human beings, and reaches the highest effect at around 80%. This finding is consistent with [31]. For temperature, both cold and hot weather tend to increase the incidence of respiratory diseases, and the most suitable temperature is around 28 degrees Celsius. However, there is a local peak at around 26 degrees Celsius, which may relate to late spring and early autumn, during which the temperatures are not settled and, as a result, may cause slightly more incidents.
As by-products, the parameter estimates for the constant term (Equation (1)), weekly effect (Equation (3)), and health measures due to the SARS epidemic (Equation (5)) are given in Table 2. The results are comparable with the results in [25]. The number of reported cases on the weekend is lower than that on weekdays. The health measures imposed during the SARS epidemic had a significant effect on the reported cases. The observations, fitted values, and residuals are shown in Figure 5, from which it can be seen that the fitted values capture the observation trend well, with the residuals approximating a white noise process. The root mean squared error between observed and fitted values is 0.1515 on the logarithmic scale (versus 0.1802 for the model in [25]), which translates to RMSE = 32.50 for the original number of cases (versus 37.25 for the model in [25]). This implies that our model fits the data slightly better than the model in [25].
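The two error figures quoted above correspond to the same fit evaluated on two scales. A minimal sketch of that calculation is given below, assuming a natural logarithm and a plain exponential back-transformation without bias correction; both are assumptions, and the function names are hypothetical.

```python
import math


def rmse(observed, fitted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, fitted)) / len(observed))


def rmse_both_scales(counts, fitted_log):
    """RMSE on the log scale and on the original count scale."""
    log_counts = [math.log(c) for c in counts]
    on_log_scale = rmse(log_counts, fitted_log)
    # Back-transform the fitted log values by exponentiation before comparing counts.
    on_count_scale = rmse(counts, [math.exp(f) for f in fitted_log])
    return on_log_scale, on_count_scale
```

Since the daily counts average a little over 200, a log-scale error of about 0.15 corresponds to a count-scale error of roughly 30 cases, which is in line with the two reported values.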
4. Summary and Conclusions
This paper proposes an estimation method based on splines for a constrained, additive single-index model, motivated by modelling the effect of air pollution on respiratory illnesses. The additivity is based on the consideration of multiple covariates, and the single index is for the cumulative effect. The monotonic constraint is used to model the decreasing effect over time. The spline method is used in the non-parametric effect functions due to its computing efficiency. The bootstrap is used to evaluate the uncertainty bounds in the model parameters. An algorithm was designed and implemented in MATLAB 2023.
When modelling the effect of air pollution on respiratory illnesses in Hong Kong during 2000 to 2005, the weekly effect and impact of health measures during the SARS epidemic were modelled by day-of-the-week variables and the growth curve function, respectively. The results showed that the pollution variables had similar cumulative periods of around one month, except NO2 with the shortest period (5 days) and O3 with the longest period (52 days). These results may aid further theoretical and practical studies of air pollution. The effects of weather factors are consistent with the existing findings and can be explained by common knowledge in public health. The weekly effect and impact of the SARS epidemic are consistent with previous studies, and the fitting of the model is better. Consequently, the model framework and algorithm can be used for other applications where a similar model structure is applicable.
Related issues also need to be further studied, including the asymptotic properties of model estimators and variable selection. Then, confidence intervals based on the asymptotic variance may be obtained, which could be more accurate than those obtained by the bootstrap method. It also makes sense to try different variables or consider the interactive effect between variables in explaining respiratory illnesses. For example, jointly considering NO with NO2 could make more sense than the case of NOx with NO2, because NOx is the sum of NO2 and NO. We leave these considerations for future research.
References
- [1] World Health Organization. Reports on a WHO/HEI Working Group, Bonn, Germany; 2003. Available online: https://apps.who.int/iris/bitstream/handle/10665/42789/9241562439.pdf?sequence=1 (accessed on 28 September 2024).
- [2] U.S. Environmental Protection Agency. Report No. 40 CFR Part 50; 1996. Available online: https://www.ecfr.gov/current/title-40/chapter-I/subchapter-C/part-50?toc=1 (accessed on 28 September 2024).
- [3] Delfino R.J., Becklake M.R., Hanley J.A. The relationship of urgent hospital admissions for respiratory illnesses to photochemical air pollution levels in Montreal. Environ. Res. 1994, 67, 1–19. doi:10.1006/enrs.1994.1061. PMID: 7925191.
- [4] Künzli N., Kaiser R., Medina S., Studnicka M., Chanel O., Filliger P., Herry M., Horak F. Jr., Puybonnieux-Texier V., Quénel P. Public-health impact of outdoor and traffic-related air pollution: A European assessment. Lancet 2000, 356, 795–801. doi:10.1016/S0140-6736(00)02653-2. PMID: 11022926.
- [5] Smoyer K.E., Kalkstein L.S., Greene J.S., Ye H. The impact of weather and pollution on human mortality in Birmingham, Alabama and Philadelphia. Int. J. Climatol. 2000, 20, 881–897. doi:10.1002/1097-0088(20000630)20:8<881::AID-JOC507>3.0.CO;2-V.
- [6] Aunan K., Pan X.C. Exposure-response functions for health effects of ambient air pollution applicable for China: A meta-analysis. Sci. Total Environ. 2004, 329, 3–16. doi:10.1016/j.scitotenv.2004.03.008. PMID: 15262154.
- [7] Jalaludin B.B., O’Toole B.I., Leeder S.R. Acute effects of urban ambient air pollution on respiratory symptoms, asthma medication use, and doctor visits for asthma in a cohort of Australian children. Environ. Res. 2004, 95, 32–42. doi:10.1016/S0013-9351(03)00038-0. PMID: 15068928.
- [8] Maynard R. Key airborne pollutants-the impact on health. Sci. Total Environ. 2004, 334–335, 9–13. doi:10.1016/j.scitotenv.2004.04.025. PMID: 15504488.
