Enhancing Multi-model Inference with Natural Selection

Ching-Wei Cheng; Guang Cheng

arXiv:1906.02389·math.ST·June 7, 2019

Enhancing Multi-model Inference with Natural Selection

Ching-Wei Cheng, Guang Cheng

PDF

Open Access

TL;DR

This paper introduces a genetic algorithm approach to improve the quality of candidate models in multi-model inference, demonstrating theoretical convergence and empirical effectiveness in statistical applications.

Contribution

The paper develops a genetic algorithm framework with adaptive termination and schema theory for better candidate model selection in multi-model inference.

Findings

01

Genetic algorithm effectively improves candidate model quality.

02

The adaptive termination criterion reduces computational cost.

03

Empirical results show superior performance in real data examples.

Abstract

Multi-model inference covers a wide range of modern statistical applications such as variable selection, model confidence set, model averaging and variable importance. The performance of multi-model inference depends on the availability of candidate models, whose quality has been rarely studied in literature. In this paper, we study genetic algorithm (GA) in order to obtain high-quality candidate models. Inspired by the process of natural selection, GA performs genetic operations such as selection, crossover and mutation iteratively to update a collection of potential solutions (models) until convergence. The convergence properties are studied based on the Markov chain theory and used to design an adaptive termination criterion that vastly reduces the computational cost. In addition, a new schema theory is established to characterize how the current model set is improved through…

Tables6

Table 1. Table 1: Variable selection results and GIC values of the selected models for the riboflavin dataset.

Method	Active Covariates	GIC
Proposed GA	XHLA-at	$- 20.520$
Multisplit procedure (Meinshausen et al., 2009)^†	YXLD-at	$- 14.357$
Stability selection (Meinshausen and Bühlmann, 2010)^†	YXLD-at, YOAB-at, LYSC-at	$- 1.431$
Debiased Lasso (Javanmard and Montanari, 2014a)	YXLD-at, YXLE-at	$15.643$
B-TREX (Lederer and Muller, 2015)	YXLD-at, YOAB-at, YXLE-at	$10.624$
${AV}_{\infty}$ (Chichignoud et al., 2016)	YXLD-at, YOAB-at, YEBC-at,	$- 5.681$
	ARGF-at, XHLB-at
RP, SA, and Ridge-type projection (Bühlmann, 2013)^†	None	$- 11.775$
^†Obtained by Bühlmann et al. (2014) using the R package hdi.

Table 2. Table 2: Results of the relative size of 95 % percent 95 95{\%} SMSs and model averaging for the riboflavin dataset.

			RMSE of Model Averaging
Method	$# (Candidate Models)$	$# (Models in 95 % SMS)$	GIC-based	AL
GA	$67$	$56$	$0.6941$	$0.6162$
RP	$54$	$1$	$0.9139$	N/A
SA	$16$	$11$	$0.9139$	N/A

Table 3. Table 3: Summary of the best models for the residential building dataset.

Method	Active Variables	GIC
GA	PF-5, PF-7, EVI-05-Lag1, EVI-07-Lag4, EVI-12-Lag5, EVI-13-Lag4	$2571.49$
RP	(None)	$3788.05$
SA	PF-2, PF-3, PF-4, PF-5, PF-6, PF-7	$2699.63$

Table 4. Table 4: SOIL values of the important variables for the residential building dataset. SOIL values less than 0.05 0.05 0.05 are not listed.

	SOIL
Variable Code	GA	RP	SA
PF-2			$1.000$
PF-3			$1.000$
PF-4			$1.000$
PF-5	$1.000$		$1.000$
PF-6			$1.000$
PF-7	$1.000$		$1.000$
EVI-05-Lag1	$1.000$
EVI-07-Lag4	$1.000$
EVI-12-Lag5	$1.000$
EVI-13-Lag4	$1.000$
EVI-19-Lag1		$1.000$

Table 5. Table 5: Results of relative size of 95 % percent 95 95{\%} SMSs and model averaging for the residential building dataset.

			RMSE of Model Averaging
Method	$# (Candidate Models)$	$# (Models in 95 % SMS)$	GIC-based	AL
GA	$48$	$41$	$27.5553$	$28.4411$
RP	$11$	$3$	$104.9914$	N/A
SA	$84$	$13$	$32.7367$	$32.2841$

Table 6. Table S.1: Variable coding for the residential building dataset.

Variable ID	Descriptions	Unit
Project Physical and Financial (PF) Variables
PF-1	Project locality defined in terms of zip codes	N/A
PF-2	Total floor area of the building	$m^{2}$
PF-3	Lot area	$m^{2}$
PF-4	Total preliminary estimated construction cost based on the prices at the beginning of the project	$10^{7} {IRR}^{m}$
PF-5	Preliminary estimated construction cost based on the prices at the beginning of the project	$10^{5} {IRR}^{m}$
PF-6	Equivalent preliminary estimated construction cost based on the prices at the beginning of the project in a selected base year	$10^{5} {IRR}^{m}$
PF-7	Duration of construction	Time resolution
PF-8	Price of the unit at the beginning of the project per $m^{2}$	$10^{5} {IRR}^{m}$
Economic Variables and Indexes (EVI)
EVI-01	The number of building permits issued	N/A
EVI-02	Building services index (BSI) for a preselected base year	N/A
EVI-03	Wholesale price index (WPI) of building materials for the base year	N/A
EVI-04	Total floor areas of building permits issued by the city/municipality	$m^{2}$
EVI-05	Cumulative liquidity	$10^{7} {IRR}^{m}$
EVI-06	Private sector investment in new buildings	$10^{7} {IRR}^{m}$
EVI-07	Land price index for the base year	$10^{7} {IRR}^{m}$
EVI-08	The number of loans extended by banks in a time resolution	N/A
EVI-09	The amount of loans extended by banks in a time resolution	$10^{7} {IRR}^{m}$
EVI-10	The interest rate for loan in a time resolution	$%$
EVI-11	The average construction cost of buildings by private sector at the time of completion of construction	$10^{5} {IRR}^{m} / m^{2}$
EVI-12	The average of construction cost of buildings by private sector at the beginning of the construction	$10^{5} {IRR}^{m} / m^{2}$
EVI-13	Official exchange rate with respect to dollars	${IRR}^{m}$
EVI-14	Nonofficial (street market) exchange rate with respect to dollars	${IRR}^{m}$
EVI-15	Consumer price index (CPI) in the base year	N/A
EVI-16	CPI of housing, water, fuel and power in the base year	N/A
EVI-17	Stock market index	N/A
EVI-18	Population of the city	N/A
EVI-19	Gold price per ounce	${IRR}^{m}$

Equations204

Y = X β^{0} + ε,

Y = X β^{0} + ε,

f (u) = ⎩ ⎨ ⎧ - GIC (u) v \in \overline{Ψ} (t), ∣ v ∣ < n min - GIC (v) if ∣ u ∣ < n if ∣ u ∣ \geq n,

f (u) = ⎩ ⎨ ⎧ - GIC (u) v \in \overline{Ψ} (t), ∣ v ∣ < n min - GIC (v) if ∣ u ∣ < n if ∣ u ∣ \geq n,

GIC (u) = n lo g σ_{u}^{2} + κ_{n} ∣ u ∣,

GIC (u) = n lo g σ_{u}^{2} + κ_{n} ∣ u ∣,

w_{k} = \frac{exp ( f _{k} /2 )}{\sum _{l = 1}^{K} exp ( f _{l} /2 )}, k = 1, \dots, K .

w_{k} = \frac{exp ( f _{k} /2 )}{\sum _{l = 1}^{K} exp ( f _{l} /2 )}, k = 1, \dots, K .

u_{j}^{c, k} = {u_{j}^{p 1, k} u_{j}^{p 2, k} with probability 1/2 otherwise, j = 1, \dots, d .

u_{j}^{c, k} = {u_{j}^{p 1, k} u_{j}^{p 2, k} with probability 1/2 otherwise, j = 1, \dots, d .

u_{j}^{k} (t) = {1 - u_{j}^{c, k} u_{j}^{c, k} with probability π_{m} otherwise, j = 1, \dots, d .

u_{j}^{k} (t) = {1 - u_{j}^{c, k} u_{j}^{c, k} with probability π_{m} otherwise, j = 1, \dots, d .

\overset{π}{ˉ}_{m, j}^{k} = ⎩ ⎨ ⎧ \frac{γ _{j}^{- 1}}{\sum _{l \in V_{+}^{k}} γ _{l}^{- 1}} ∣ V_{+}^{k} ∣ π_{m} \frac{γ _{j}}{\sum _{l \in V_{-}^{k}} γ _{l}} ∣ V_{-}^{k} ∣ π_{m} if j \in V_{+}^{k} if j \in V_{-}^{k} .

\overset{π}{ˉ}_{m, j}^{k} = ⎩ ⎨ ⎧ \frac{γ _{j}^{- 1}}{\sum _{l \in V_{+}^{k}} γ _{l}^{- 1}} ∣ V_{+}^{k} ∣ π_{m} \frac{γ _{j}}{\sum _{l \in V_{-}^{k}} γ _{l}} ∣ V_{-}^{k} ∣ π_{m} if j \in V_{+}^{k} if j \in V_{-}^{k} .

u_{j}^{k} (t) = {1 - u_{j}^{c, k} u_{j}^{c, k} with probability \overset{π}{ˉ}_{m, j}^{k} otherwise, j = 1, \dots, d .

u_{j}^{k} (t) = {1 - u_{j}^{c, k} u_{j}^{c, k} with probability \overset{π}{ˉ}_{m, j}^{k} otherwise, j = 1, \dots, d .

P^{*}

P^{*}

\displaystyle K^{*}\approx\big{\lceil}1+\log(-d/\log P^{*})/\log 2\big{\rceil},

\displaystyle K^{*}\approx\big{\lceil}1+\log(-d/\log P^{*})/\log 2\big{\rceil},

\displaystyle K=4\big{\lceil}1+\log(-d/\log P^{*})/\log 2\big{\rceil}

\displaystyle K=4\big{\lceil}1+\log(-d/\log P^{*})/\log 2\big{\rceil}

\displaystyle H^{t}_{0}:~{}\bar{f}\big{(}\Psi(t)\big{)}=\bar{f}\big{(}\Psi(t-10)\big{)}\quad\text{v.s.}\quad H^{t}_{1}:~{}\bar{f}\big{(}\Psi(t)\big{)}\not=\bar{f}\big{(}\Psi(t-10)\big{)},

\displaystyle H^{t}_{0}:~{}\bar{f}\big{(}\Psi(t)\big{)}=\bar{f}\big{(}\Psi(t-10)\big{)}\quad\text{v.s.}\quad H^{t}_{1}:~{}\bar{f}\big{(}\Psi(t)\big{)}\not=\bar{f}\big{(}\Psi(t-10)\big{)},

A = [A_{11} A_{21} O A_{22}],

A = [A_{11} A_{21} O A_{22}],

u^{*} := u \in M arg max f (u)

u^{*} := u \in M arg max f (u)

\displaystyle\mathcal{M}_{\max}=\Big{\{}\bm{u}=\big{\{}u^{1},\ldots,u^{K}\big{\}}\in\mathcal{M}^{K}:u^{*}\in\bm{u}\Big{\}}.

\displaystyle\mathcal{M}_{\max}=\Big{\{}\bm{u}=\big{\{}u^{1},\ldots,u^{K}\big{\}}\in\mathcal{M}^{K}:u^{*}\in\bm{u}\Big{\}}.

\displaystyle\lim_{t\to\infty}P\big{(}u^{*}\in\Psi(t)\big{)}=1.

\displaystyle\lim_{t\to\infty}P\big{(}u^{*}\in\Psi(t)\big{)}=1.

\displaystyle\lim_{t\to\infty}\mathsf{E}\big{[}\bar{f}\big{(}\Psi(t)\big{)}\big{]}=\sum_{\bm{u}\in\mathcal{M}^{K}}\pi(\bm{u})\bar{f}(\bm{u}),

\displaystyle\lim_{t\to\infty}\mathsf{E}\big{[}\bar{f}\big{(}\Psi(t)\big{)}\big{]}=\sum_{\bm{u}\in\mathcal{M}^{K}}\pi(\bm{u})\bar{f}(\bm{u}),

\displaystyle P\big{(}u^{*}\in\Psi(t)\big{)}\geq 1-\alpha

\displaystyle P\big{(}u^{*}\in\Psi(t)\big{)}\geq 1-\alpha

\displaystyle{\uparrow}(H)=\big{\{}u\in\mathcal{M}:u_{j}=H_{j}~{}\text{or}~{}H_{j}=\ast~{}\text{for each}~{}j=1,\ldots,p\big{\}}.

\displaystyle{\uparrow}(H)=\big{\{}u\in\mathcal{M}:u_{j}=H_{j}~{}\text{or}~{}H_{j}=\ast~{}\text{for each}~{}j=1,\ldots,p\big{\}}.

\displaystyle m(H,t+1)\sim\textit{Binomial}\big{(}K,\alpha(H,t)\big{)}.

\displaystyle m(H,t+1)\sim\textit{Binomial}\big{(}K,\alpha(H,t)\big{)}.

α_{se l} (H, t) = k : u^{k} \in ↑ (H) \sum w_{k} .

α_{se l} (H, t) = k : u^{k} \in ↑ (H) \sum w_{k} .

α (H, t)

α (H, t)

\displaystyle\qquad+\sum_{k,l:u^{k},u^{l}\not\in{\uparrow}(H)}w_{k}w_{l}\frac{(2\pi_{m})^{h_{kl}}(1-\pi_{m})^{\operatorname{ord}(H)}}{\big{[}2(1-\pi_{m})\big{]}^{\delta_{H}(u^{k},u^{l})}},

\displaystyle\alpha(H,t)\geq(1-\pi_{m})^{\operatorname{ord}(H)}\alpha_{sel}(H,t)^{2}+2^{-\operatorname{ord}(H)}\alpha_{sel}(H,t)\big{[}1-\alpha_{sel}(H,t)\big{]}+\big{[}1-\alpha_{sel}(H,t)\big{]}^{2}\pi_{m}^{\operatorname{ord}(H)}.

\displaystyle\alpha(H,t)\geq(1-\pi_{m})^{\operatorname{ord}(H)}\alpha_{sel}(H,t)^{2}+2^{-\operatorname{ord}(H)}\alpha_{sel}(H,t)\big{[}1-\alpha_{sel}(H,t)\big{]}+\big{[}1-\alpha_{sel}(H,t)\big{]}^{2}\pi_{m}^{\operatorname{ord}(H)}.

u \neq = u^{0}, ∣ u ∣ < \tilde{s} in f μ^{⊤} (I - H_{u}) μ \geq C_{2} n,

u \neq = u^{0}, ∣ u ∣ < \tilde{s} in f μ^{⊤} (I - H_{u}) μ \geq C_{2} n,

u \in M_{\tilde{s}} - {u^{0}} min GIC (u) - GIC (u^{0}) > 0,

u \in M_{\tilde{s}} - {u^{0}} min GIC (u) - GIC (u^{0}) > 0,

u (t) = u \in Ψ_{\tilde{s}} (t) arg min GIC (u) .

u (t) = u \in Ψ_{\tilde{s}} (t) arg min GIC (u) .

t \to \infty lim n \to \infty lim P (u (t) = u^{0}) = 1.

t \to \infty lim n \to \infty lim P (u (t) = u^{0}) = 1.

\displaystyle\mathcal{A}_{\alpha}=\big{\{}u\in\Psi:H_{0,u}~{}\text{is not rejected at a significance level}~{}\alpha\big{\}}

\displaystyle\mathcal{A}_{\alpha}=\big{\{}u\in\Psi:H_{0,u}~{}\text{is not rejected at a significance level}~{}\alpha\big{\}}

H_{0, u} : Model u is not worse than u^{#} vs. H_{1, u} : Model u is worse than u^{#} .

H_{0, u} : Model u is not worse than u^{#} vs. H_{1, u} : Model u is worse than u^{#} .

H_{0, u}^{d i s} : u and u^{#} are indistinguishable vs. H_{1, u}^{d i s} : u and u^{#} are distinguishable

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGaussian Processes and Bayesian Inference

Full text

Enhancing Multi-model Inference with Natural Selection

Ching-Wei Cheng PhD student. Department of Statistics, Purdue University, West Lafayette, IN 47906; e-mail: [email protected].

Guang Cheng Corresponding Author. Professor, Department of Statistics, Purdue University, IN 47906; e-mail: [email protected]. Guang Cheng gratefully acknowledges NSF DMS-1712907, DMS-1811812, DMS-1821183, and Office of Naval Research, (ONR N00014-18-2759).

Abstract

Multi-model inference covers a wide range of modern statistical applications such as variable selection, model confidence set, model averaging and variable importance. The performance of multi-model inference depends on the availability of candidate models, whose quality has been rarely studied in literature. In this paper, we study genetic algorithm (GA) in order to obtain high-quality candidate models. Inspired by the process of natural selection, GA performs genetic operations such as selection, crossover and mutation iteratively to update a collection of potential solutions (models) until convergence. The convergence properties are studied based on the Markov chain theory and used to design an adaptive termination criterion that vastly reduces the computational cost. In addition, a new schema theory is established to characterize how the current model set is improved through evolutionary process. Extensive numerical experiments are carried out to verify our theory and demonstrate the empirical power of GA, and new findings are obtained for two real data examples.

Keywords: Convergence analysis; evolvability; genetic algorithm; Markov chain; multi-model inference; schema theory.

1 Introduction

A collection of candidate models serves as a first and important step of multi-model inference, whose spectrum covers variable selection, model confidence set, model averaging and variable importance (Burnham and Anderson, 2004; Anderson, 2008). The importance of a candidate model set is highlighted in Lavou and Droz (2009): “all results of the multi-model analyses are conditional on the (candidate) model set.” However, in literature, candidate models are either given (e.g., Hansen et al. (2011); Hansen (2014)) or generated without any justifications (e.g., Ando and Li (2014); Ye et al. (2018)). As far as we know, there is no statistical guarantee on the quality of such candidate models, no matter the parameter dimension is fixed or diverges.

In this paper, we study genetic algorithm (GA, Holland (1975); Mitchell (1996); Whitley (1994)) in order to search for high-quality candidate models over the whole model space. GA is a class of iterative algorithms inspired by the process of natural selection, and often used for global optimization or search problems; see Figure 1. There are two key elements of GA: a genetic representation of the solution domain, i.e., a binary sequence, and a fitness function to evaluate the candidate solutions such as all kinds of information criteria. A GA begins with an initial population of a given size that is improved through iterative application of genetic operations, such as selection, crossover and mutation, until convergence; see Figure 2.

Specifically, we employ three basic genetic operations, i.e., selection, crossover and mutation, for the GA. In each generation (the population in each iteration), we adopt elitism and proportional selection so that the fittest model is kept into the next generation, and that fitter models are more likely to be chosen as the “parent” models to breed the next generation, respectively. Uniform crossover is then performed to generate one “child” model by recombining the genes from each pair of parent models. Finally, a mutation operator is applied to randomly alter chosen child genes. Besides the uniform mutation, we propose a new adaptive mutation strategy using the variable association strength to enhance the variable selection performance. The genetic operations are iteratively performed until the size of the new generation reaches that of the previous one; see Figure 4. It is worth noting that the crossover operator generates new models similar to their parents (i.e., local search), while the mutation operator increases the population diversity to prevent GAs from being trapped in local optimum (thus resulting in global search). See Section 2 for more details.

In theory, we investigate the convergence properties of the GA in Theorem 3.1 based on the Markov chain theory. A practical consequence is to design an adaptive termination strategy that significantly reduces the computational cost. Furthermore, we prove that a fitter schema (a collection of solutions with specific structures; see Definition 3.2) is more likely to survive and be expanded in the next generation, using the schema theory (Theorem 3.3 and Corollary 3.1). This implies that the average fitness of the subsequent population gets improved, which entitled the “survival of the fittest” phenomenon of the natural selection.

Our results are applied to variable selection and model confidence set (MCS). In the former, the GA generates a manageable number of models (that is much smaller than all models up to some pre-determined size), over which the true model is found; see Proposition 4.1. As for the latter, the collected models in the model confidence sets constructed by the GA are shown to be not statistically worse than the true model with a certain level of confidence; see Proposition 4.2.

As far as we are aware, two other methods can also be used to prepare candidate models: (i) collecting distinct models on regularization paths of penalized estimation methods (e.g., Lasso (Tibshirani, 1996), SCAD (Fan and Li, 2001) and MCP (Zhang, 2010)), called as “regularization paths (RP)” method; (ii) a simulated annealing (SA) algorithm recently proposed by Nevo and Ritov (2017). The former has no rigorous control on the quality of candidate models since model evaluation is not taken into account, and the latter needs a pre-determined model size and an efficiency threshold to filter out bad models. In comparison, the GA uses information-criterion based fitness function to search for good models, and produces models of various sizes. As a result, the candidate models produced by the GA lead to much improved multi-model inference results, as demonstrated in Sections 5 and 6. Ando and Li (2014) and Lan et al. (2018) proposed approaches to prepare candidate models that do not work for general multi-model inference applications. Best subset selection and forward stepwise regression can generate solution paths similar to the Lasso (Tibshirani, 2015; Hastie et al., 2017). However, the former imposes intractable computational burden and the latter lacks of comprehensive theoretical investigation.

Extensive simulation studies are carried out in Section 5 to demonstrate the power of the GA in comparison with the RP and the SA in terms of computation time, quality of the candidate model set, and performance of multi-model inference applications. In particular, the GA-best model exhibits the best variable selection performance in terms of the high positive selection rate and low false positive rate. For model averaging and variable importance, the GA results in at least comparable performance to the RP and the SA, but exhibits greater robustness than the SA. Additionally, the GA is also shown to possess better applicability than the RP in optimal high-dimensional model averaging.

Two real data examples are next carried out to illustrate the practical utility of the GA. For the riboflavin dataset (Bühlmann et al., 2014), the GA-best model finds an informative gene which has not stood out in the literature (e.g., Bühlmann et al., 2014; Javanmard and Montanari, 2014a; Lederer and Muller, 2015; Chichignoud et al., 2016; Hilafu and Yin, 2017). For the residential building dataset (Rafiei and Adeli, 2016, 2018), we identify factors, such as preliminary estimated construction cost, duration of construction, and $1$ -year delayed land price index and exchange rate, relevant to construction costs. These findings are further confirmed by the variable importance results using the SOIL (Ye et al., 2018). Moreover, compared with the aforementioned competing methods, we again find that the GA generates the best candidate model set and results in the best model averaging performance on both datasets.

The rest of this paper is organized as follows. In Section 2 we present the GA for global model search, and list several possible ways for improving the implementation. In Section 3 the GA is analyzed using the Markov chain and schema theories. In Section 4 we illustrate how the GA assists multi-model inference tools such as variable selection and model confidence set. Sections 5 and 6 present extensive simulation studies and two real data analysis. In Section 7, we discuss future works. All proofs are presented in the supplementary materials.

2 Methodology

Consider a linear regression model

[TABLE]

where $\bm{Y}=(Y_{1},\ldots,Y_{n})^{\top}$ is the response vector, $\bm{X}=[\bm{X}_{1},\ldots,\bm{X}_{d}]$ is the design matrix with $\bm{X}_{j}$ representing the $j$ -th column for $j=1,\ldots,d$ , and $\bm{\varepsilon}=(\varepsilon_{1},\ldots,\varepsilon_{n})^{\top}$ is the noise vector with $\mathsf{E}[\varepsilon_{i}]=0$ and $\mathsf{Var}(\varepsilon_{i})=\sigma^{2}$ . Suppose $\bm{\beta}^{0}=(\beta^{0}_{1},\ldots,\beta^{0}_{d})^{\top}$ is $s$ -sparse (i.e., $\|\bm{\beta}^{0}\|_{0}=s$ ) with $s\ll\min(n,d)$ . Throughout this paper, $s$ and $d$ are allowed to grow with $n$ .

Genetic representation for variable selection.

The genetic representation of a model is defined as a binary sequence of length $d$ , say $u=(u_{1},\ldots,u_{d})$ , and variable $j$ is said to be active (inactive) if $u_{j}=1$ ( $u_{j}=0$ ). For example, $u=(1,1,1,0,0,0,0,0,0,0)$ denotes the model with $d=10$ variables but only the first three variables being active. Note that $|u|=\sum_{j=1}^{d}u_{j}$ denote the model size. Denote $\bm{X}_{u}$ as the submatrix of $\bm{X}$ subject to $u$ , and $\mathcal{M}=\{0,1\}^{d}$ as the model space.

Fitness function.

Let $\Psi(t)$ denote the $t$ -th generation of population, and $\overline{\Psi}(t)=\cup_{t^{\prime}=0}^{t}\Psi(t^{\prime})$ the collection of all models that have appeared up to the $t$ -th generation. For any model $u\in\Psi(t)$ , the fitness function is then defined as

[TABLE]

where

[TABLE]

is the generalized information criterion (GIC, Nishii (1984); Shao (1997)) and $\widehat{\sigma}_{u}^{2}=\bm{Y}^{\top}\big{[}\bm{I}_{n}-\bm{X}_{u}(\bm{X}_{u}^{\top}\bm{X}_{u})^{-1}\bm{X}_{u}^{\top}\big{]}\bm{Y}/n$ is the mean squared error evaluated by the model $u$ . GIC covers many types of information criteria (e.g., AIC (Akaike, 1973) with $\kappa_{n}=2$ , BIC (Schwarz, 1978) with $\kappa_{n}=\log n$ , modified BIC (Wang et al., 2009) with $\kappa_{n}=\log\log|u|\log n$ with $d<n$ and extended BIC (Chen and Chen, 2008) with $\kappa_{n}\asymp\log n+2\log d$ with $p\geq n$ ). Since GIC cannot be computed for $|u|\geq n$ , we define it as the worst fitness value up to the current generation. The rational is that any model with size larger than $n$ should be unfavorable to all models with size smaller than $n$ given the assumption that $s\ll\min(n,d)$ . This definition warrants an unconstrained optimization, which is convenient for subsequent theoretical analysis. This is different from other ways to deal with the “infeasible solutions” in the GA literature, e.g., the “death penalty” in Chehouri et al. (2016) and Zhang et al. (2014), which lead to constrained optimization.

2.1 A Genetic Algorithm for Candidate Model Search

We propose a genetic algorithm to search for good candidate models in Algorithm 1. Specifically, we use the RP method to generate an initial population, and then adopt proportional selection, uniform crossover and mutation operators to constitute the evolutionary process. Besides uniform mutation, we propose another mutation strategy based on the strength of variable association for improving empirical performances. An adaptive termination strategy is also proposed to enhance the computational efficiency. See Algorithm 1 for the overview of the GA.

Initialization:

The initial population $\Psi(0)=\big{\{}u^{1}(0),\ldots,u^{K}(0)\big{\}}$ only has very minimal requirement as follows: (i) $K\geq 2$ and (ii) $|u^{k}(0)|<n$ for some $k=1,\ldots,K$ (i.e., at least one model with commutable GIC). The condition (i) allows the GA to explore through the model space $\mathcal{M}$ ; see Section 3.1, and (ii) ensures $f\big{(}u^{k}(0)\big{)},k=1,\ldots,K$ , are all available. The choice of $K$ will be discussed in Section 2.2.1. For fast convergence of the GA, we recommend the RP method to generate initial population. Please see Figure 3 for how models produced by RP are improved by the GA in terms of GIC.

Given the $(t-1)$ -th generation $\Psi(t-1)=\big{\{}u^{1}(t-1),\ldots,u^{K}(t-1)\big{\}}$ , the GA produces the next generation $\Psi(t)=\big{\{}u^{1}(t),\ldots,u^{K}(t)\big{\}}$ through proportional selection, uniform crossover and mutation operations. See Figure 4 to visualize the evolution process. In what follows, we give details for each step in our main algorithm.

In the elitism selection step, we choose $u^{*}(t-1):=\operatorname*{arg\,max}_{u\in\Psi(t-1)}f(u)$ , i.e., the best model in $\Psi(t-1)$ is kept into $\Psi(t)$ , and define it as $u^{1}(t)$ for simplicity. The proportional selection step chooses parent models from $\Psi(t-1)$ (including $u^{*}(t-1)$ ) based on the exponentially scaled fitness as follows. Define the fitness $f_{k}=f\big{(}u^{k}(t-1)\big{)}$ according to (2.2). For $k=1,\ldots,K$ , first compute the weight $w_{k}$ for $u^{k}(t-1)$ as

[TABLE]

Then $(K-1)$ pairs of models are randomly selected with replacement from $\Psi(t-1)$ , where the probability of selecting $u^{k}(t-1)$ is $w_{k}$ . Note that the exponentially scaled information criteria are often used for model weighting in multi-model inference (e.g., Burnham and Anderson, 2004; Wagenmakers and Farrell, 2004; Hoeting et al., 1999).

Each pair of parent models produces a child model by performing uniform crossover with equal mixing rate (i.e., each child position has equal chance to be passed from the two parents). That is, let $u^{p1,k}=(u^{p1,k}_{1},\ldots,u^{p1,k}_{d})$ and $u^{p2,k}=(u^{p2,k}_{1},\ldots,u^{p2,k}_{d})$ be the chosen parent models, and then the genes in the child model $u^{c,k}=(u^{c,k}_{1},\ldots,u^{c,k}_{d})$ is determined by

[TABLE]

In the last step, we apply mutation to the child model $u^{c,k}$ . Given a mutation probability $\pi_{m}$ (usually low, such as $\pi_{m}=0.01$ or $1/d$ ), we consider the following two mutation schemes. Denote by $u^{k}(t)=\big{(}u^{1}_{d}(t),\ldots,u^{k}_{d}(t)\big{)}$ the resulting model after mutation being applied to $u^{c,k}$ .

•

Uniform mutation: Genes in $u^{c,k}$ are randomly flipped with probability $\pi_{m}$ , i.e.,

[TABLE]

•

Adaptive mutation: We propose a data-dependent mutation operator based on the variable association measures $\gamma_{j}$ . For example, $\gamma_{j}$ can be either the marginal correlation learning $\big{|}\widehat{\mathsf{Cor}}(\bm{X}_{j},\bm{Y})\big{|}$ (Fan and Lv, 2008) or the high-dimensional ordinary least-squares projection $\big{|}\bm{X}_{j}(\bm{X}\bm{X}^{\top})^{-1}\bm{Y}\big{|}$ (Wang and Leng, 2016, available only for $d\geq n$ ). Let $V^{k}_{+}=\{j:u^{c,k}_{j}=1\}$ and $V^{k}_{-}=\{j:u^{c,k}_{j}=0\}$ . Define the mutation probability for the $u^{c,k}_{j}$ as

[TABLE]

Then the proposed mutation operation is performed by

[TABLE]

By defining $\bar{\pi}^{k}_{m,j}$ this way, unimportant active variables are more likely to be deactivated, and important inactive variables are more likely to be activated. Also, it can be easily seen that this mutation operation results in the same expected number of deactivated and activated genes as those of uniform mutation operation. As far as we are aware, this is the first data dependent mutation method in the GA literature.

In numerical experiments, we note that the adaptive mutation performs slightly better than the uniform mutation. For space constraint, we just focus on the adaptive mutation with $\pi_{m}=1/d$ .

As for termination, we propose an adaptive criterion by testing whether the average fitness becomes stabilized; see Section 2.2.2 for more details. This is very different from the user specified criteria used in GA literature such as the largest number of generations (e.g., Murrugarra et al., 2016) or the minimal change of the best solution (e.g., Aue et al., 2014).

Remark 2.1.

We note that the models collected by the GA are in nature sparse since their sizes are around the true model size $s$ ; see Figure 5 for example. This empirically appealing feature allows us to construct GA-based sparse model confidence sets in the later sections.

2.2 Computational Considerations

Computational concern has been the major critiques that prevent GAs from being popular over other optimization methods such as gradient descent in machine learning and statistical communities. In our experience, the most computational cost is taken by the calculation of the fitness evaluation, which could be alleviated by reducing the population size (Section 2.2.1) and the number of generations (Section 2.2.2).

2.2.1 Population Sizing

The population size $K$ plays an important role in GAs. It is obvious that larger population makes GAs computationally more expensive. On the other hand, empirical results indicate that small population would jeopardize performance (e.g., Koumousis and Katsaras, 2006; Piszcz and Soule, 2006; Lobo and Lima, 2005). We found that the minimum population size suggested in Reeves (1993) makes a good balance. The idea is to have a population such that every possible solution in the search space should be reachable from an randomly generated initial population by crossover only. In binary gene coding cases, it means that the solutions in the initial population cannot be all [math] or $1$ for any position. For any $K$ , the probability of such an event can be found by

[TABLE]

Accordingly, for every given $P^{*}$ , we can calculate the minimum population size

[TABLE]

where $\lceil a\rceil$ is the smallest integer larger than $a\in\mathbb{R}$ . For example, a population of size $K=25$ is enough to ensure that the required probability exceeds $99.99{\%}$ when $d=1{,}000$ .

In our implementation, we conservatively use

[TABLE]

with $P^{*}=0.9999$ , to specify the population size according to model dimension.

2.2.2 Adaptive Termination

To adaptively terminate, we perform an independent two-sample $t$ -test on whether the average fitness of $\Psi(t)$ and $\Psi(t-10)$ are the same at a significance level $0.05$ :

[TABLE]

where $\bar{f}\big{(}\Psi(t)\big{)}$ is the average fitness of the $t$ -th generation. The $T$ -th generation is set to be the final generation if $T$ is the smallest $t\geq 10$ such that $H^{t}_{0}$ is rejected. The generation gap $10$ is meant to weaken the correlation between the two generations being tested. Note that the GA can be regarded as a Markov chain (see Section 3.1 for details) and therefore there exists dependence among generations. Hence, it is not appropriate to perform two-sample $t$ -test of the average fitness from two consecutive generations.

Remark 2.2.

This termination criterion is constructed based on the limiting distribution derived for the associated Markov chain (see the discussion below Theorem 3.1 for more details) and results in huge computational efficiency. In the literature, the GA iteration is often terminated at a fixed, predetermined number of generations, say $T_{\max}$ , which is usually large such as $50,100$ or even larger (e.g. Höglund, 2017; Jafar-Zanjani et al., 2018). Our termination criterion, on the other hand, entitles a scientific check for the convergence. With the RP used for generating the initial population, the GA enters the stationary distribution (as the average fitness is tested to be stabilized) in just a few generations (say, around $20$ generations). In addition, we note that the computational cost incurred by the independent two-sample $t$ -test (2.8) is negligible, as the fitness values are computed as the models are generated.

3 Theoretical Properties

In this section, we study the theoretical properties of the GA, which belongs to the so-called canonical GA (CGA) family111CGAs are also called as simple or standard GAs in the literature. CGA uses binary sequence for solution representation, and updates a fixed-sized population via selection, crossover and mutation operators. (Holland, 1975). The proposed GA is a CGA that specifically employs elitism and proportional selection, uniform crossover and uniform or adaptive mutation as described in Section 2. We first investigate the convergence properties for a general CGA family based on Markov chain theory, i.e., Theorem 3.1. Furthermore, Theorem 3.2 presents a brand new theoretical framework to construct MCSs for the globally best model. We next establish a new schema theory (Theorem 3.3 and Corollary 3.1) to elicit the evolutionary mechanism for the GA. It is worthy noting that the theoretical results established in this section apply to the general CGA framework and hence not restricted to the specific variable selection problem.

3.1 Convergence Analysis

In this section, we show that the Markov chain associated with a CGA class has a unique stationary distribution from which the asymptotic inclusion of the globally best model, i.e., global convergence, can be deduced. Such a result justifies the adaptive termination rule in Section 2.2.2, and can be used to reduce the search space for variable selection problems; see Proposition 4.1. Note that the theoretical results obtained in this section hold for any finite sample size.

Recall that $\Psi(t)=\big{\{}u^{1}(t),\ldots,u^{K}(t)\big{\}}$ represents the $t$ -th generation of the population, and denote by $\big{\{}\Psi(t)\big{\}}_{t\geq 0}$ the associated Markov chain with values on the finite state (population) space $\mathcal{M}^{K}$ . The corresponding transition matrix is defined as $\bm{P}=\big{[}P_{\bm{u}\bm{v}}\big{]}_{\bm{u},\bm{v}\in\mathcal{M}^{K}}$ , where $P_{\bm{u}\bm{v}}=P\big{(}\Psi(t+1)=\bm{v}\big{|}\Psi(t)=\bm{u}\big{)}$ . We need the following definitions for our subsequent analysis.

Definition 3.1.

A square matrix $\bm{A}=[a_{ij}]\in\mathbb{R}^{K\times K}$ is said to be non-negative (positive) if $a_{ij}\geq 0$ ( $a_{ij}>0$ ) for all $i,j\in\{1,\ldots,d\}$ . A non-negative square matrix $\bm{A}$ is said to be

(a)

primitive if there exists a positive integer $k$ such that $\bm{A}^{k}$ is positive; 2. (b)

reducible if there exists two square matrices $\bm{A}_{11}$ and $\bm{A}_{22}$ and a matrix $\bm{A}_{21}$ with suitable dimensions such that $\bm{A}$ can be expressed as the form

[TABLE]

where $\bm{O}$ denotes a zero matrix with suitable dimensions, by applying the same permutations to rows and columns; 3. (c)

irreducible if it is not reducible; 4. (d)

stochastic if $\sum_{j=1}^{K}a_{ij}=1$ for all $i=1,\ldots,K$ .

Let

[TABLE]

denote the best model in $\mathcal{M}$ and suppose it is unique, i.e., $f(u^{*})>f(u)$ for all $u\in\mathcal{M}-\{u^{*}\}$ . Moreover, denote the collection of states that contains $u^{*}$ by

[TABLE]

The following theorem describes two important convergence properties.

Theorem 3.1.

Let $\bm{P}$ denote the transition probability matrix of the Markov chain associated with a CGA with elitism selection, population size $K\geq 2$ and mutation rate $\pi_{m}\in(0,1)$ .

(a)

There exists a unique stationary distribution $\bm{\pi}=\big{(}\pi(\bm{u}):\bm{u}\in\mathcal{M}^{K}\big{)}^{\top}$ that satisfies $\bm{\pi}^{\top}=\bm{\pi}^{\top}\bm{P}$ and $\pi(\bm{u})=\lim_{t\to\infty}P\big{(}\Psi(t)=\bm{u}\big{)}$ with $\pi(\bm{u})>0$ for $\bm{u}\in\mathcal{M}_{\max}$ and $\pi(\bm{u})=0$ for $\bm{u}\not\in\mathcal{M}_{\max}$ . 2. (b)

(Theorem 6 of Rudolph (1994)**) We have

[TABLE]

As far as we are aware, the existence of the stationary distribution stated in Theorem 3.1 (a) for CGAs with elitism selection is new, even though similar results for non-elitist CGAs has been presented for over decades (e.g., Rudolph, 1994; Dorea et al., 2010). This is in contrast with the GA literature that typically concerns global convergence (e.g., Rudolph, 1994; Agapie, 1998; Dorea et al., 2010) rather than the stationary distribution. As for Theorem 3.1 (b), the elitism selection is a necessary condition (Rudolph, 1994; Agapie, 1998) and it is different from the path-consistency property of non-convex penalization approaches (e.g., Kim and Kwon (2012); Wang et al. (2013)) in that the former captures the best model for any sample size $n$ as $t\to\infty$ and the latter targets at the true model as $n\to\infty$ . Later, Theorem 3.1 (b) is extended to a selection consistency result as $n\to\infty$ ; see Proposition 4.1.

Part (a) of Theorem 3.1 has the following implication. Recall that $\bar{f}(\bm{u})$ is the average fitness of any population $\bm{u}$ , and thus we have

[TABLE]

which is a constant given data $(\bm{X},\bm{Y})$ . This indicates that the average fitness oscillates around a constant in the long run, as $\Psi(t)$ becomes stabilized (i.e., the associated Markov chain converges). This justifies the termination check in (2.8).

Remark 3.1.

It is worth noting that Theorem 3.1 does not only apply to the GA but also any CGA with elitism selection. The key reason is that the child solutions generated through selection, crossover and mutation operators always remain in the search space for unconstrained optimization or search problems. Accordingly, instead of the ones mentioned in Section 2, Theorem 3.1 still holds for any other selection, crossover and mutation operations (e.g., rank-based or tournament selection (Shukla et al., 2015) and the newly proposed crossover and mutation operations developed in Hassanat and Alkafaween (2018) and Hassanat et al. (2018), respectively).

In contrast to the asymptotic result in Theorem 3.1 (b) as $t\to\infty$ , it is also of practical relevance to construct a $100(1-\alpha){\%}$ MCS that covers the best model $u^{*}$ after a finite number of generations. A particularly appealing feature is that every model in this set is sparse. This is conceptually different from the MCS constructed based on the debiased principle (van de Geer et al., 2014; Zhang and Zhang, 2014; Javanmard and Montanari, 2014b), which mostly contains dense models.

Theorem 3.2.

Let $\Psi(t)$ denote the $t$ -th population of a CGA with elitism selection, $K\geq 2$ and $\pi_{m}\in(0,1)$ . Then for any $\alpha\in(0,1)$ there exists a positive integer $T_{\alpha}$ such that

[TABLE]

for any $t\geq T_{\alpha}$ .

The proof of Theorem 3.2 implies the global convergence property described in Theorem 3.1 (b) by letting $\alpha=0$ and thus $T_{0}=\infty$ . From the proof of Theorem 3.2, we note that obtaining the value of $T_{\alpha}$ requires the knowledge of the constant $\xi$ as defined in (S.1.4), which is often unknown. By definition, $\xi$ can be obtained by estimating the submatrix $\bm{R}$ in the transition matrix $\bm{P}$ . That is, $\xi$ is the smallest of the row sums of $\bm{R}$ . Since $\bm{R}$ has $\big{|}\mathcal{M}^{K}\big{|}-\big{|}\mathcal{M}_{\max}\big{|}=\binom{K+2^{d}-1}{K}-\binom{K+2^{d}-2}{K-1}$ rows and $\big{|}\mathcal{M}_{\max}\big{|}=\binom{K+2^{d}-2}{K}$ columns, the size of $\bm{R}$ is massive. For instance, when $(K,d)=(10,5)$ , there are about $3\times 10^{17}$ elements in $\bm{R}$ . Albeit Vose (1993) provides a useful formula to compute the elements in the $\bm{P}$ , the computational cost is too large to be carried out in practice. Hence, we leave an accurate estimation or approximation of $\xi$ to future study.

3.2 Evolvability Analysis

In this section, we establish a schema theorem to study the evolution process of the GA. Specifically, it is proven that the average fitness gets improved over generations. To the best of our knowledge, we are the first to develop a schema theorem for GAs with proportional selection, uniform crossover and uniform mutation at the same time in the GA literature. The most closely related schema theorems are provided by Poli (2001b, a) for GAs with proportional selection and one-point crossover, and by Ming et al. (2004) for GAs with uniform crossover alone.

In the following we give the definition of a schema with general GA terminology (i.e., using “solutions” instead of “models”), followed by an example as illustration.

Definition 3.2.

A schema $H=(H_{1},\ldots,H_{d})\in\{0,1,\ast\}^{d}$ is a ternary sequence of length $d$ , where the “ $\ast$ ” is a wildcard symbol, meaning that we do not care whether it is [math] or $1$ . The indices where the schema has a [math] or $1$ are called the fixed positions. We say a solution $u=(u_{1},\ldots,u_{d})$ matches $H$ if all fixed positions of $H$ are the same as the corresponding positions in $u$ . The order of a schema $H$ , denoted by $\operatorname{ord}(H)$ , is defined by the number of fixed positions in $H$ . Moreover, by adopting the notations used in the order theory (e.g., Fletcher and Wennekers (2017)), for any schema $H$ we define the expansion operator ${\uparrow}(H)$ to map $H$ to the set of all possible solutions that match $H$ , i.e.,

[TABLE]

Example 3.1.

Suppose a schema $H=(1,0,\ast,0,\ast)$ . In this case, $\operatorname{ord}(H)=3$ , and ${\uparrow}(H)=\big{\{}(1,0,0,0,0),(1,0,0,0,1),(1,0,1,0,0),(1,0,1,0,1)\big{\}}$ .

Let $m(H,t)$ denote the number of solutions that match a schema $H$ in the $t$ -th generation, and $\alpha(H,t)$ the probability that the schema $H$ survives or is created after the $t$ -th generation. Poli et al. (1998) noted that $m(H,t+1)$ follows a binomial distribution with the number of trials $K$ and success probability $\alpha(H,t)$ , i.e., ( $K$ is the population size)

[TABLE]

Hence, we have $\mathsf{E}\big{[}m(H,t+1)\big{]}=K\alpha(H,t)$ . Accordingly, higher $\alpha(H,t)$ leads to higher $\mathsf{E}\big{[}m(H,t+1)\big{]}$ and thus tends to result in more solutions of $H$ in the next generation. Since the population size is fixed, more solutions of a fitter schema imply higher average fitness in the subsequent generation. Hence, we will show that $\alpha(H^{1},t)$ is larger than $\alpha(H^{2},t)$ if the average fitness of $H^{1}$ is larger than that of $H^{2}$ .

To prove the above result, we need to define the following different notions of Hamming distance. The first concerns two models $u=(u_{1},\ldots,u_{p})$ and $v=(v_{1},\ldots,v_{p})$ , i.e., $\delta(u,v)=\sum_{j=1}^{p}\mathds{1}(u_{j}\not=v_{j})$ , while the second type of Hamming distance is between a model and a schema $H$ on the fixed positions: $\delta(u,H)=\sum_{j:H_{j}\not=\ast}\mathds{1}(u_{j}\not=H_{j})$ . The last one is Hamming distance between models $u$ and $v$ with respect to the fixed positions of any schema $H$ : $\delta_{H}(u,v)=\sum_{j:H_{j}\not=\ast}\mathds{1}(u_{j}\not=v_{j})$ .

We are now ready to characterize $\alpha(H,t)$ explicitly for the GA with uniform mutation. Recall from (2.4) that $w_{k}$ denotes the probability that model $u^{k}$ is selected as a parent model.

Theorem 3.3.

Given the $t$ -th generation $\Psi(t)=\{u^{1},\ldots,u^{K}\}$ and a schema $H$ , define the probability that a solution matching $H$ is selected by the proportional selection operator as

[TABLE]

For the GA with uniform mutation, we have

[TABLE]

where $h_{kl}=\big{|}\{j:H_{j}\not=\ast,u^{k}_{j}=u^{l}_{j}\not=H_{j}\}\big{|}$ .

The general result in Theorem 3.3 provides an exact form of $\alpha(H,t)$ , which is quite difficult to interpret and analyze. Accordingly, we derive a simple-to-analyze lower bound for $\alpha(H,t)$ .

Corollary 3.1.

Suppose conditions in Theorem 3.3 hold. For $\pi_{m}\leq 0.5$ , we have

[TABLE]

It can be seen from (3.6) that the lower bound of $\alpha(H,t)$ gets larger when the schema selection probability $\alpha_{sel}(H,t)$ increases or the schema $H$ has lower order (i.e., $\operatorname{ord}(H)$ is small). By definition, fitter schema $H$ leads to larger $\alpha_{sel}(H,t)$ and therefore higher $\alpha(H,t)$ and $\mathsf{E}\big{[}m(H,t+1)\big{]}$ . Since an expansion of the fitter schema $H$ is expected in a fixed-size population, fitter models matching $H$ are more likely to be generated in place of weaker models; see Section 5.2 for a numerical verification. Accordingly, the subsequent generation is anticipated to have higher average fitness. This entitles the “survival of the fittest” phenomenon of the natural selection and acknowledges the evolvability of the GA.

4 GA-assisted Multi-model Inference

In this section, we describe how the GA helps multi-model inferences. Note that existing information-criteria based variable selection (e.g., Chen and Chen, 2008; Kim et al., 2012; Wang and Zhu, 2011) and MCS procedures (e.g., Hansen et al., 2011; Ferrari and Yang, 2015; Zheng et al., 2018+) typically concern the true model rather than the globally best model, which is the target of the GA. To bridge this gap, we first present a lemma suggesting that the true model indeed possess the lowest GIC value and therefore become the globally best model in large samples.

The following regularity condition is needed.

Assumption 4.1.

(A1)

There exists a positive constant $C_{1}$ such that $\lambda_{\min}\big{(}\bm{X}_{u^{0}}^{\top}\bm{X}_{u^{0}}/n\big{)}>C_{1}$ for all $n$ , where $u^{0}$ denotes the true model; 2. (A2)

There is a positive constant $C_{2}$ such that

[TABLE]

where $\bm{\mu}=\bm{X}\bm{\beta}^{0}$ and $\bm{H}_{u}=\bm{X}_{u}(\bm{X}_{u}^{\top}\bm{X}_{u})^{-1}\bm{X}_{u}^{\top}$ denotes the hat matrix of the model $u$ , for some positive integer $\tilde{s}$ with $s\leq\tilde{s}<n$ .

Condition (A1) ensures the design matrix of the true model is well-posed and Condition (A2) is the asymptotic identifiability condition used in Chen and Chen (2008), indicating that the model is identifiable if no model with comparable size can predict as well as the true model.

Recall that $\kappa_{n}$ is defined in the GIC formulation (2.3).

Lemma 4.1.

Suppose Assumption 4.1 holds, $\bm{\varepsilon}\sim\mathcal{N}(\bm{0},\sigma^{2}\bm{I})$ , $\log d=O(n^{\tau})$ , $\kappa_{n}=O(n^{\tau})$ and $\kappa_{n}\to\infty$ for some positive constant $\tau<1$ . Then for any positive integer $\tilde{s}$ satisfying $\tilde{s}\geq s$ and $\tilde{s}\log d=o(n)$ , we have, as $n\to\infty$ ,

[TABLE]

where $\mathcal{M}_{\tilde{s}}=\big{\{}u\in\mathcal{M}:|u|\leq\tilde{s}\big{\}}$ .

4.1 Variable Selection

The GA offers a practical way to perform variable selection by only searching the models generated by the GA instead of the whole model space. The existing information-criterion based selection methods search a constrained model space $\mathcal{M}_{\tilde{s}}$ for some $s\leq\tilde{s}\ll n$ . However, by using the GA, we only need to evaluate at most $K\times T$ models (recall that $K$ and $T$ are the population size and the number of generations to convergence, respectively). For example, under the simulation Case 1 with $(n,d,s,\rho)=(200,400,6,0.5)$ (see Section 5.1), it is nearly impossible to go through $\binom{400}{6}\approx 5.5\times 10^{12}$ models with the true size $6$ , not to mention to compare all the models with sizes at most $\tilde{s}$ for some $6\leq\tilde{s}\ll n$ . On the other hand, the GA searches for the true model in all $500$ simulation runs, each with less than $1{,}750$ models evaluated ( $K=92$ and $T\leq 19$ generations to convergence).

By combining Theorem 3.1 (b) and Lemma 4.1, Proposition 4.1 shows that the true model becomes the best model in large samples and is eventually captured by the GA. Let $\Psi_{\tilde{s}}(t)$ denote the $t$ -th generation of a GA population on the constrained model space $\mathcal{M}_{\tilde{s}}$ . The fitness function (2.2) makes models of sizes at least $n$ nearly impossible to be generated. Accordingly, it is equivalent to setting $\tilde{s}=n-1$ .

Proposition 4.1.

Suppose conditions in Lemma 4.1 hold and $\Psi_{\tilde{s}}(t)$ satisfies the conditions in Theorem 3.1. Define

[TABLE]

Then we have

[TABLE]

4.2 Model Confidence Set

In this section, we construct practically feasible model confidence sets with the aid of the GA, in comparison with the one based on Theorem 3.2. The main idea is to employ the two-step testing procedure of Vuong (1989), given that the candidate models are produced by GA.

Given a candidate model set $\Psi=\big{\{}u^{1},\ldots,u^{K}\big{\}}$ , let $u^{\#}=\operatorname*{arg\,min}_{u\in\Psi}\operatorname{GIC}(u)$ denote the best candidate model in $\Psi$ . Collect

[TABLE]

by performing the hypothesis testing

[TABLE]

for every $u\in\Psi-\{u^{\#}\}$ at significance level $\alpha$ . We name the model confidence set $\mathcal{A}_{\alpha}$ as survival model set (SMS) since the models therein survive the elimination testing (4.4). Recall from Section 3.2 that the GA models, even after the globally best model is found, keep being improved until convergence. Accordingly, a manageable number of good (and sparse) models are included in the SMS when the GA is used to provide candidate models. Later, we use the relative size $|\mathcal{A}_{\alpha}|/|\Psi|$ to measure the quality of the candidate model set in Section 5.3.3.

To perform the hypothesis testing (4.4) where $u$ and $u^{\#}$ may not be nested, we employ the two-step procedure of Vuong (1989) by decomposing (4.4) as first model distinguishability test

[TABLE]

and if $H^{dis}_{0,u}$ is rejected, then a superiority test

[TABLE]

The rejection of $H_{0,u}$ at significance level $\alpha$ is equivalent to that $H^{dis}_{0,u}$ and $H^{sup}_{0,u}$ are both rejected at significance level $\alpha$ . We note that the original superiority test in Vuong (1989) is based on likelihood ratio, and therefore certain adjustment is needed for our case; see Section S.3.1 for detailed description. The R package nonnest2 (Merkle and You, 2018) is used to test (4.5) and extract necessary quantities for the GIC-based superiority test (4.6).

The following proposition justifies the asymptotic validity of the constructed SMS.

Proposition 4.2.

Suppose conditions in Proposition 4.1 hold and $\Psi_{\tilde{s}}(t)$ satisfies the conditions in Theorem 3.1. Let $\mathcal{A}_{\alpha}(t)$ denote a $100(1-\alpha){\%}$ SMS with $\Psi_{\tilde{s}}(t)$ serving as the candidate model set. Then we have

[TABLE]

for all $u\in\Psi_{\tilde{s}}(t)-\{u^{0}\}$ such that $H_{0,u}$ is not rejected at significance level $\alpha$ .

5 Simulation Studies

In this section, we conduct extensive simulation studies to provide numerical support for the new schema theory supplied in Section 3.2 and show that the GA outperforms the RP method and the SA algorithm of Nevo and Ritov (2017). The simulated data were generated based on the linear model (2.1) with $\bm{\varepsilon}\sim\mathcal{N}_{n}(\bm{0},\bm{I})$ . Each row of the design matrix $\bm{X}$ was generated independently from $\mathcal{N}_{d}(\bm{0},\bm{\Sigma})$ , where $\bm{\Sigma}$ is a Toeplitz matrix with the $(k,l)$ -th entry $\Sigma_{kl}=\rho^{|k-l|}$ for $\rho=0,0.5$ and $0.9$ . Results were obtained based on $500$ simulation replicates.

5.1 Simulation Settings

We consider six simulation cases below with both high-dimensional (i.e., $d\geq n$ ; Cases 1–4) and low-dimensional (Cases 5 and 6) settings. Cases 1 and 2 with $\rho=0$ refers to the first two simulation cases used in Ye et al. (2018). Case 3 is inspired from the simulations settings used in Nevo and Ritov (2017), but our settings ensure $X_{s+1}$ and $X_{s+2}$ are always marginally distributed as $\mathcal{N}(0,1)$ for any $\rho\in[0,1)$ . Case 4 is similar to Case 3 but with weak signals. Cases 5 and 6 refers to the simulation example 2 of Wang et al. (2009), with weak signals in Case 6.

Let $\bm{1}_{p}$ and $\bm{0}_{p}$ denote the $p$ -dimensional vectors of $1$ ’s and [math]’s, respectively.

Case 1:

$\bm{\beta}^{0}=(4\bm{1}_{s-2}^{\top},-6\sqrt{2},\frac{4}{3},\bm{0}_{d-s}^{\top})^{\top}$ .

Case 2:

$\bm{\beta}^{0}$ as in Case 1. Re-define $X_{s+1}=0.5X_{1}+2X_{s-2}+\eta_{1}$ , where $\eta_{1}\sim\mathcal{N}(0,0.01)$ .

Case 3:

$\bm{\beta}^{0}=(3\bm{1}_{s}^{\top},\bm{0}_{d-s}^{\top})^{\top}$ . Re-define $X_{s+1}=\frac{2}{3\sqrt{(1+\rho)}}(X_{1}+X_{2})+\eta_{2}$ and $X_{s+2}=\frac{2}{3\sqrt{(1+\rho)}}(X_{3}+X_{4})+\eta_{3}$ , where $\eta_{1},\eta_{2}\overset{iid}{\sim}\mathcal{N}(0,1/9)$ .

Case 4:

$\bm{\beta}^{0}=(3\log(n)/\sqrt{n}\bm{1}_{s}^{\top},\bm{0}_{d-s}^{\top})^{\top}$ . Re-define $X_{s+1}$ and $X_{s+2}$ as in Case 3.

Two cases are set up for moderate dimensional (i.e., $d<n$ ) scenarios:

Case 5:

$\beta^{0}_{1}\geq\cdots\geq\beta^{0}_{s}$ are iid $\textit{Uniform}(0.5,1.5)$ , sorted decreasingly, and $\beta_{j}=0$ for $j>s$ .

Case 6:

$\bm{\beta}^{0}=(3\log(n)/\sqrt{n}\bm{1}_{s}^{\top},\bm{0}_{d-s}^{\top})^{\top}$ .

For the GA implementation, we use

[TABLE]

to evaluate models. This choice of $\kappa_{n}=3.5\log d$ makes the GIC coincide with the pseudo-likelihood information criterion (Gao and Carroll, 2017) and the high-dimensional BIC (Wang and Zhu, 2011). The penalization constant $3.5$ is specifically used due to the superior performance shown in the simulation studies in Wang and Zhu (2011). It should be mentioned that (5.1) works well regardless the relationship between $n$ and $d$ (e.g., Gao and Carroll, 2017; Wang and Zhu, 2011). Our Python implementation of the GA, the RP and the SA is publicly available in the Github repository https://github.com/aks43725/cand.

5.2 Schema Evolution

The discussion followed by Corollary 3.1 concludes that fitter schema $H$ leads to larger $\alpha_{sel}(H,t)$ (the probability of selecting a model matching $H$ in the $t$ -th generation) and hence larger $\mathsf{E}\big{[}m(H,t+1)\big{]}$ (the expected number of models matching $H$ in the $(t+1)$ -th generation). In the following we provide empirical evidence by observing that $m(H,t+1)$ aligns with $\alpha_{sel}(H,t)$ for three schemata:

[TABLE]

which represent good, fair and bad performing schemata, respectively. In particular, $H^{1}$ is expected to perform the best by covering good models such as the true model. The $2s$ [math]’s are placed to deteriorate its overall performance through ruling out some models that are too good to observe the evolution of $m(\cdot,t)$ and $\alpha_{sel}(\cdot,t)$ . $H^{2}$ is expected to be slightly worse than $H^{1}$ because models matching it are all overfitting by having at least two false discoveries. We anticipate $H^{3}$ to have the worst performance due to missing one true signal. Note that ${\uparrow}(H^{1})\cap{\uparrow}(H^{2})\cap{\uparrow}(H^{3})$ does not cover the whole model space $\mathcal{M}$ . For implementation, we used uniform mutation as needed in the theoretical conditions. Moreover, since the GA with initial population provided by the RP is too good to observe the evolution process, we used an approach proposed in Section S.3.3 to randomly generate an initial population.

Figure 6 is obtained under Case 3 with $(n,d,s,\rho)=(200,400,6,0.5)$ and serves as a representative example since other cases (included in supplementary, Section S.4.2) exhibits similar patterns. The upper panel confirms our performance assertion on the overall schema performance, i.e., $H^{1}$ is slightly better than $H^{2}$ and $H^{3}$ is the worst. From the lower panel, it is evident that the pattern of $m(H^{1},t+1)$ aligns with that of $\alpha_{sel}(H^{1},t)$ in all cases. In addition, the strong schema $H^{1}$ evolves to take over the whole population eventually even it is a minority at the beginning, and vice versa for the weaker schema $H^{2}$ . On the other hand, the evolution process of $H^{3}$ illustrates a typical example that a particularly weak schema extincts soon and never rises again. In summary, a good schema expands and a weak one diminishes over generations, resulting in an improved average fitness until convergence.

5.3 Comparison with Existing Methods

In this section, we compare the GA with the RP and the SA in terms of computation time, quality of candidate model sets, and performance of multi-model inference applications such as variable importance, model confidence set and model averaging. For the RP, we collect the unique models on the regularization paths of Lasso, SCAD and MCP using the Python package pycasso. Recall that the GA takes the RP for the initial population. The SA is implemented to search for models of sizes appeared in the last GA generation, and the best $K$ models are kept as the final candidate model set. Other tuning parameters are settled according to the simulation settings in Nevo and Ritov (2017).

In the following, we show that the GA evidently improve the models generated by the RP in reasonable computation time, and that the SA takes a long time to implement but produces at most comparable results to those of the GA. In particular, the GA exhibits the best performance in all cases in terms of variable selection and quality of candidate model set. In terms of model averaging and variable importance, the GA performs at least comparably to the RP and the SA in high-dimensional cases, while just comparably under low-dimensional settings.

5.3.1 Computation Time

The averaged computation time for the three methods are depicted in Figure S.1. It is obvious that the GA is a bit slower than the RP but way much (like more than $10$ times) faster than the SA.

5.3.2 Variable Selection

To evaluate the performance of variable selection, the boxplots of the positive selection rate (PSR, the proportion of true signals that are active in the best model) and the false discovery rate (FDR, the proportion of false signals that are active in the best model) are drawn in Figure 7 and Figure 8, respectively. We see that the GA-best model gives fairly high PSR and low FDR in all cases, demonstrating excellent variable selection performance. Under high-dimensional settings (Cases 1–4), the RP produces high PSR but also high FDR, while the SA results in the opposite (PSR and FDR are both low). For moderate dimensional cases (Cases 5 and 6), both of the RP and the SA give low PSR and FDR. In summary, the GA-best model possesses much better variable selection performance than those from the RP and the SA.

5.3.3 Quality of Candidate Models

We evaluate the quality of candidate model sets using two criteria: (i) the average fitness and (ii) the relative size of $95{\%}$ SMSs (see Section 4.2 for the SMS construction) to the original candidate model set. Figure 9 exhibits the boxplots of average fitness and suggests that the GAs produce the fittest models in all cases. The SA takes the second place in high-dimensional cases (Cases 1–4), yet is outperformed by the RP in moderate dimensional cases (Cases 5 and 6) with $\rho=0$ and $0.5$ , where the covariates are not strongly correlated. To conclude, the candidate model set produced by the GA possesses the best quality among the three approaches.

Figure 10 displays the boxplots of the relative size of $95{\%}$ SMSs $\mathcal{A}_{0.05}$ against the original candidate model set $\Psi$ , i.e., $|\mathcal{A}_{0.05}|/|\Psi|$ , where larger values indicate better quality of $\Psi$ . We see that the relative sizes for the GA are typically higher than those from the RP and SA in all cases, and are close to $1$ in high-dimensional settings (e.g., Cases 1–4). This supports the conclusion we made about the quality of candidate models in the previous paragraph.

5.3.4 Model Averaging

Model averaging, especially in high-dimensional predictive analysis, is a prominent application of multi-model inference. The GA does not perform significantly better than the RP and the SA in model averaging, but exhibits better applicability than the RP, and greater robustness than the SA.

Given a candidate model set $\Psi=\big{\{}u^{1},\ldots,u^{K}\big{\}}$ , the model averaging predictor is defined by

[TABLE]

where $\widehat{\bm{Y}}_{u^{k}}=\bm{X}_{u}(\bm{X}_{u}^{\top}\bm{X}_{u})^{-1}\bm{X}_{u}^{\top}\bm{Y}$ are the least-squares predictors and $w_{k}$ with $0\leq w_{k}\leq 1$ denote the model weights of $u^{k}$ for $k=1,\ldots,K$ . We use the root mean squared error (RMSE) defined by

[TABLE]

to assess the performance of model averaging.

Two model weighting schemes are considered to obtain the model weights $w_{k}$ : (i) GIC-based weights as in (2.4) with $f_{k}$ replaced by $-\operatorname{GIC}(u^{k})$ , and (ii) the weighting approach proposed by Ando and Li (2014), which we called it the “AL weighting” hereafter (see Section S.3.2 for detailed construction). We note that (i) is the the most commonly used model weighting scheme in multi-model inference (e.g., Akaike weights (Akaike, 1978; Bozdogan, 1987; Burnham and Anderson, 2004; Wagenmakers and Farrell, 2004) and Bayesian model averaging (Hoeting et al., 1999)), and (ii) is developed for optimal predictive performance in high-dimensional model averaging.

Figure 11 displays the boxplots of the RMSE using the GIC-based model weighting, showing that the GA exhibits good and robust (in contrast to the wildly high RMSE by SA in Case 4 with $(n,d,s,\rho)=(400,1000,50,0.9)$ ; see Remark 5.1 for more details) results over all cases. On the other hand, the RP is obviously worse than the GA in Case 2, and the SA’s performance is just comparable to that of the GA. The three methods perform similarly in the rest cases (i.e., Cases 1, 3, 5 and 6).

The RMSEs obtained by the AL weighting are shown in Figure 12. Different from the results using the GIC-based model weights, the GA behaves slightly better than SA in some cases (e.g., Case 1 with $\rho=0.0$ and $0.5$ and Case 3 with $\rho=0.0$ ) and comparably in the rest. Yet similarly, the GA performs robustly and the SA has wildly high RMSE in Case 4 with $(n,d,s,\rho)=(400,1000,50,0.9)$ . On the other hand, the results for the RP are omitted due to the computational infeasibility (inverting a singular matrix) in generating the AL weights. Accordingly, the GA is shown to possess better applicability in optimal high-dimensional model averaging.

5.3.5 Variable Importance

To evaluate the performance of high-dimensional variable importance, we employ the sparsity oriented importance learning (SOIL; Ye et al., 2018) defined by

[TABLE]

with the GIC-based model weights $w_{k}$ given in (2.4). It can well separate the variables in the true model from the rest in the sense that $\mathrm{SOIL}_{j}$ rarely gives [math] ( $1$ ) if the variable $j$ is (not) in the true model. Moreover, it rarely gives variables not in the true model significantly higher values than those in the true model even if the signal is weak. In the original work (Ye et al., 2018), the candidate models were generated using the RP method. Our results indicate that the GA performs at least comparably to the SA and the RP in separating the true signals from the rest.

Figure 13 and Figure 14 depict the averaged SOIL values for the first $2s$ variables for Cases 2 and 4, respectively, where the active ones are before the vertical gray line and the rest are not shown due to $\mathrm{SOIL}_{j}\approx 0$ for $j>2s$ no matter which method was used for candidate model preparation. Results for Cases 1, 3, 5 and 6 are presented in supplementary (Section S.4.3) due to high similarity among the three methods. the GA exhibits the best performance that separate the true signals from the rest. Specifically, the resulting SOIL values are by no means close to [math] and $1$ for truly active and inactive variables, respectively. On the other hand, in Case 2 the RP results in $\mathrm{SOIL}_{s-2}\equiv 1$ and $\mathrm{SOIL}_{s+1}=0$ , where $X_{s-2}$ is a true signal and $X_{s+1}$ is not. Moreover, in Case 4 with $(n,d,s,\rho)=(200,1000,50,0.9)$ , since the SA results in $\mathrm{SOIL}_{j}\leq 0.03$ for $j=38,\ldots,50$ , these $13$ true signals may easily be regarded as not important.

Remark 5.1.

From Case 4 with $(n,d,s,\rho)=(200,1000,50,0.9)$ , we note that the SA’s performance in model averaging and variable importance critically depends on the model size specification. Recall that the SA only searches for the models with sizes resulting from the GA candidate models. For this simulation case, the GA model sizes are around a half of the number of strong signals, i.e., $s/2$ . Such model size misspecification causes the SA to perform poorly in model averaging and variable importance. On the other hand, the GA still behaves well even when all of its resulting candidate models miss certain number of true signals.

6 Real Data Example

In this section, we present two real data examples to exhibit the usefulness of the proposed GA. Additionally, hypothesis testing (4.4) was conducted to compare models in terms of the GIC.

6.1 The Riboflavin Dataset

We first introduce the riboflavin (vitamin B) production dataset that was widely studied in high-dimensional variable selection literature (e.g., Bühlmann et al., 2014; Javanmard and Montanari, 2014a; Lederer and Muller, 2015; Chichignoud et al., 2016; Hilafu and Yin, 2017). The response variable is the logarithm of the riboflavin production rate in Bacillus subtilis for $n=71$ samples and the covariates are the logarithm of the expression level of $d=4{,}088$ genes. Please see more details in Supplementary Section A.1 of Bühlmann et al. (2014).

The proposed GA delivers new insights by yielding better variable selection results than the existing works. From Table 1, the GA-best model contains only one active gene XHLA-at which was not identified by previous approaches. However, it turns out to be the fittest model (with all $p$ -values $<0.0001$ ) among those listed. Moreover, the importance of the gene XHLA-at is confirmed by having $\mathrm{SOIL}_{\textsf{XHLA-at}}=1$ and all other SOIL values less than $0.01$ . Accordingly, we suggest a further investigation on the gene XHLA-at is needed from scientists.

Table 2 summarizes the results of $95{\%}$ SMSs (see Section 4.2), and shows the GA outperforms the RP and the SA in terms of the quality of candidate model set and model averaging. For the former, besides the much fittest (i.e., lowest GIC) model, the GA also gives the highest relative size of $95{\%}$ SMSs of $56/67=83.58{\%}$ (compared to $1/54=1.85{\%}$ for the RP and $11/16=68.75{\%}$ for the SA). For model averaging, the GA results in the smallest RMSE using the GIC-based weighting. Moreover, as the only method leading to successful AL weighting (see Section 5.3.4) computation, the GA is shown to possess better applicability in optimal high-dimensional model averaging.

6.2 Residential Building Dataset

The second dataset was used to study $n=372$ residential condominiums from as many 3- to 9-story buildings constructed between 1993 and 2008 in Tehran, Iran (Rafiei and Adeli, 2016, 2018). Construction cost, sale price, $8$ project physical and financial (PF) variables and $19$ economic variables and indices (EVI) with up to $5$ time lags before the construction were collected on the quarterly basis. Similar to the analysis in Rafiei and Adeli (2018), we study how construction cost is influenced by the PF and delayed EVI factors, but exclude the only categorical PF variable, project locality. Accordingly, we have $d=7+19\times 5=102$ covariates. We define the variable coding in Table S.1 for the ease of presentation.

Table 3 and Table 4 respectively summarize the variable selection and variable importance results of the GA, the RP and the SA. From the former, we see that the GA-best model gives the best performance (i.e., lowest GIC), and its variable structure agrees with the findings by Rafiei and Adeli (2018), which suggest that PF and EVI factors (especially 4-quarter delayed ones) be informative. Moreover, the second column in Table 4 confirms the relevance of PF-5, PF-7, $1$ -quarter delayed EVI-05, $4$ -quarter delayed EVI-07 and EVI-13, and $5$ -quarter delayed EVI-12. We also note that the RP- and SA-best models do not consist of sensible variable structures and are significantly worse than the GA-best model ( $p$ -values $<0.0001$ ).

Figure 15 and Table 5 respectively display the boxplots of the fitness values of the candidate models and the multi-model analysis results to evaluate the quality of candidate model sets and model averaging. The former (Figure 15) suggests that the GA models generally possess higher fitness (i.e., lower GIC) values. Again, the GA is shown to produce the best candidate model set by having the fittest best model (all $p$ -values $<0.0001$ ) and the highest relative size of $95{\%}$ SMS of $41/48=85.41{\%}$ (compared to approximately $14{\%}$ for the RP and the SA). In addition to generating the best candidate model set, the GA also results in the lowest RMSE of model averaging using both the GIC-based and AL weighting methods. These results suggest that good candidate models be helpful in enhancing the performance of multi-model inference.

To further investigate the predictive performance via model averaging with the AL weighting, we randomly split the dataset using five ratios of validation to training (RVTs) of $10{\%},20{\%},30{\%},40{\%}$ and $50{\%}$ . For each RVT, $100$ randomly selected validation and training datasets were generated by splitting the original dataset, and the boxplots of RMSE are drawn in Figure 16. In summary, the GA generally results in lower RMSE, suggesting its superior predictive performance.

7 Discussion

In the end, we propose three future directions. Firstly, we are interested in developing more implementable algorithms for Theorem 3.2 to construct the proposed MCS procedure. Secondly, we believe that incorporating GAs into modern computational tools such as neural networks may produce more powerful statistical inference procedures. For instance, the deep neuroevolution developed by the Uber AI Labs uses GAs to train deep reinforcement learning (DRL) models and demonstrates amazing performance on hard DRL benchmarks such as Atari and Humanoid Locomotion (e.g., Petroski Such et al., 2018; Zhang et al., 2017; Conti et al., 2018); see https://eng.uber.com/deep-neuroevolution/ for a comprehensive introduction. Lastly, we want to investigate more advanced GA variants (e.g., adaptive GAs (e.g., Tang, 2012; Song and Xiao, 2013; Rajakumar and George, 2013; LaPorte et al., 2015), the immune GAs (e.g., Jiao and Wang, 2000; Yu and Zhou, 2008; Zhang et al., 2014) or the hybrid GAs (e.g., Chan et al., 2005; Kao and Zahara, 2008; Chen and Shahandashti, 2009; Zhu et al., 2011)) from statistical and machine learning perspectives.

S.1 Proofs for Section 3

S.1.1 Proof of Theorem 3.1

To prove (a), first note that given $u^{*}\in\Psi(t)$ the subsequent generations cannot travel to any state that does not contain $u^{*}$ due to elitism selection. This means $\bm{P}$ is reducible, and $\mathcal{M}_{\max}$ is closed (in the sense that $P\big{(}u^{*}\not\in\Psi(t^{\prime})\big{|}u^{*}\in\Psi(t)\big{)}=0$ for all $t^{\prime}>t$ ).

Without loss of generality, there exists square matrices $\bm{A}$ and $\bm{T}$ , and a matrix $\bm{R}$ with suitable dimensions such that

[TABLE]

where $\bm{A}$ is a $|\mathcal{M}_{\max}|\times|\mathcal{M}_{\max}|$ transition probability submatrix corresponding to the states in $\mathcal{M}_{\max}$ . According to Lemma S.6.1 (Theorem 2 of Rudolph (1994)), it suffices to show that $\bm{A}=[a_{\bm{u}\bm{v}}]_{\bm{u},\bm{v}\in\mathcal{M}_{\max}}$ is stochastic and primitive, and $\bm{R}$ and $\bm{T}$ are not zero matrices.

To show $\bm{A}$ is stochastic and primitive, first note that $\bm{A}$ corresponds to the transition probability matrix for the states $\bm{u}\in\mathcal{M}_{\max}$ . Since any $P\big{(}\Psi(t+1)\not\in\mathcal{M}_{\max}\big{|}\Psi(t)\in\mathcal{M}_{\max}\big{)}=0$ for any $t\geq 0$ , we must have $\sum_{\bm{v}\in\mathcal{M}_{\max}}a_{\bm{u}\bm{v}}=1$ . This indicates that $\bm{A}$ is stochastic.

For any fixed-size population $\bm{u}$ , the child models generated by selection and crossover operations still belong to $\mathcal{M}$ , and they can be transformed to any other models through the mutation operator with $\pi_{m}\in(0,1)$ . In other words, any model $u\in\bm{u}$ with $u\not=u^{*}$ can be mapped to any $v\in\mathcal{M}$ . This implies any state in $\mathcal{M}_{\max}$ can travel to any other state in $\mathcal{M}_{\max}$ with positive probability. Accordingly, $\bm{A}$ is positive and thus primitive.

Similar argument yields that $P_{\bm{u}\bm{u}}=P\big{(}\Psi(t+1)=\bm{u}\big{|}\Psi(t)=\bm{u}\big{)}>0$ for all $\bm{u}\in\mathcal{M}^{K}$ , and therefore $\bm{T}$ , the transition probability matrix corresponding to the states not in $\mathcal{M}_{\max}$ , is not zero. Moreover, since the generational best model can only be improved, any model $u$ can be transformed to $u^{*}$ with positive probability due to the mutation operator with $p_{m}\in(0,1)$ . Hence for any $t\geq 0$ we have

[TABLE]

Note that the entries of $\bm{R}$ collects all such transition probabilities. Consequently, it is a positive, and thus nonzero matrix.

The result of (b) is a straightforward consequence of (a). That is, since $\bm{\pi}$ is a distribution over $\mathcal{M}^{K}$ and $\pi(\bm{u})=0$ for all $\bm{u}\not\in\mathcal{M}_{\max}$ , we have $\sum_{\bm{u}\in\mathcal{M}_{\max}}\pi(\bm{u})=1$ . By the definition of $\mathcal{M}_{\max}$ , it further implies the asymptotic inclusion of the best model as $t\to\infty$ .

S.1.2 Proof of Theorem 3.2

It suffices to show that

[TABLE]

Since the GA with elitism selection satisfies

[TABLE]

it suffices to show that there exists a positive integer $T_{\alpha}$ such that

[TABLE]

Let

[TABLE]

denotes the total probability that a population $\bm{u}$ is transmitted into any population with the best solution in one iteration. According to (S.1.1), define

[TABLE]

Note that, for all $\bm{u}\in\mathcal{M}^{K}$ and positive integer $t$ ,

[TABLE]

It then holds, for any $\bm{u}\in\mathcal{M}^{K}$ and positive integer $T$ ,

[TABLE]

where the third equality is due to the Markov property. By keeping doing this we obtain

[TABLE]

Since $\xi\in(0,1)$ , there exists a positive integer $T_{\alpha}$ such that $(1-\xi)^{T_{\alpha}}\leq\alpha<(1-\xi)^{T_{\alpha}-1}$ . Accordingly, the desired confidence statement (S.1.3) follows.

S.1.3 Proof of Theorem 3.3

We first characterize the individual probabilities caused by the selection, crossover and mutation operations. Firstly, it is obvious that the probability that models $u^{k}$ and $u^{l}$ are selected is $w_{k}w_{l}$ . Secondly, the probability that the uniform mutation operation transforms a given model $v$ into a solution that matches $H$ is $\pi_{m}^{\delta(v,H)}(1-\pi_{m})^{\operatorname{ord}(H)-\delta(v,H)}$ .

Finally, we discuss the effect of the uniform crossover operation, given two parent models $u^{k}$ and $u^{l}$ are selected. Due to the mechanism of the uniform crossover, all possible child models has equal probabilities to be generated. This allows us to focus on the fixed positions of $H$ . Note that it is possible that $u^{k}$ and $u^{l}$ can never generate a child model that is a solution that matches $H$ . Therefore, we define

[TABLE]

as the minimum $\delta(v,H)$ among all the child models $v$ produced by the uniform crossover with parent models $u^{k}$ and $u^{l}$ . Now, suppose $v$ is a model generated through uniform crossover with $u^{k}$ and $u^{l}$ , we have

[TABLE]

Accordingly, a general form of $\alpha(H,t)$ can be written by

[TABLE]

Note that on the right hand side of (S.1.5), the summation can be tore apart to three cases based on whether the parents are solutions that match $H$ . That is,

[TABLE]

where Cases 1, 2 and 3 refer to the events that the final child model after crossover and mutation is a solution that matches $H$ given that

both parents match $H$ (i.e., $k,$ such that $u^{k},u^{l}\in{\uparrow}(H)$ ), 2. 2.

only one of the parents matches $H$ (i.e., $k$ such that $u^{k}\in{\uparrow}(H)$ and $l:u^{l}\not\in{\uparrow}(H)$ ), and 3. 3.

neither of the parents matches $H$ (i.e., $k,l$ such that $u^{k},u^{l}\not\in{\uparrow}(H)$ ),

respectively.

For Case 1, since both parents belong to ${\uparrow}(H)$ , it follows that $\delta_{H}(u^{k},u^{l})=0$ and $h_{kl}=0$ , and hence

[TABLE]

For Case 2, since one of the parents matches $H$ , it holds $h_{kl}=0$ and $\delta_{H}(u^{k},u^{l})=\delta(u^{l},H)$ . It then holds that

[TABLE]

For Case 3, there seems no simplification available, and therefore we have

[TABLE]

The proof is then complete by plugging (S.1.7), (S.1.8) and (S.1.9) into (S.1.6).

S.1.4 Proof of Corollary 3.1

First note that $2(1-\pi_{m})>1$ since $\pi_{m}\leq 0.5$ . Since $\delta(u^{l},H)\leq\operatorname{ord}(H)$ for all $u^{l}\not\in{\uparrow}(H)$ , it follows that

[TABLE]

Similarly, since $2\pi_{m}<1$ , $h_{kl}\leq\operatorname{ord}(H)$ and $\delta_{H}(u^{k},u^{l})\leq\operatorname{ord}(H)$ for all $u^{k},u^{l}\not\in{\uparrow}(H)$ , we have

[TABLE]

Accordingly, we have the desired result (3.6).

S.2 Proof for Section 4

S.2.1 Proof of Lemma 4.1

Without loss of generality, let $u^{0}$ denote the binary sequence with first $s$ genes active and the rest inactive and $\sigma^{2}=1$ . Recall that $\bm{X}_{u}$ denotes the submatrix of $\bm{X}$ subject to the active variable indices in $u$ . Let $\bm{H}_{u}=\bm{X}_{u}(\bm{X}_{u}^{\top}\bm{X}_{u})^{-1}\bm{X}_{u}^{\top}$ the projection matrix of the submatrix $\bm{X}_{u}$ .

We first consider the case $u\not\supseteq u^{0}$ , i.e., model $u$ misses at least one relevant variable. We can write

[TABLE]

Note that

[TABLE]

where the $Z_{i}$ are independent $\mathcal{N}(0,1)$ variables, and

[TABLE]

where $\bm{\mu}=\bm{X}_{u^{0}}\bm{\beta}^{0}_{u^{0}}$ . By Condition (A2), uniformly over $u$ with $|u|\leq\tilde{s}$ , it holds

[TABLE]

Write

[TABLE]

Note that for any model $u$ with $|u|\leq\tilde{s}$ , there exists a positive constant $C_{3}$ such that

[TABLE]

By the union bound, it follows that

[TABLE]

Let $t=\sqrt{2s\log d}$ , we arrive at

[TABLE]

as $n\to\infty$ . Accordingly,

[TABLE]

and therefore we have

[TABLE]

Now we deal with the last two terms in (S.2.2). Note that we can write

[TABLE]

where $Z_{i}$ are some independent $\mathcal{N}(0,1)$ variables. By the union bound, it then holds

[TABLE]

It is east to see that (see, for example, Yang (1999))

[TABLE]

Let $t=3s\log d$ , we arrive at

[TABLE]

as $n\to\infty$ . Consequently, we have

[TABLE]

Similarly,

[TABLE]

By (S.2.3), (S.2.4), (S.2.5) and (S.2.6), it is easy to see that $\mathrm{RSS}(u)-\mathrm{RSS}(u^{0})$ is dominated by $\bm{\mu}^{\top}(\bm{I}-\bm{H}_{u^{0}})\bm{\mu}$ . Coupled with (S.2.1), there is a positive constant $C_{4}$ such that

[TABLE]

in probability. Since $\kappa_{n}=o(n)$ , we conclude that

[TABLE]

as $n\to\infty$ .

Now we consider the case $u\supseteq u^{0}$ but $u\not=u^{0}$ . Since $(\bm{I}-\bm{H}_{u})\bm{X}_{u^{0}}=\bm{O}$ , we have $\bm{Y}^{\top}(\bm{I}-\bm{H}_{u})\bm{Y}=\bm{\varepsilon}^{\top}(\bm{I}-\bm{H}_{u})\bm{\varepsilon}$ and

[TABLE]

where $Z_{u,i}$ are some independent $\mathcal{N}(0,1)$ variables depending on $u$ . By the union bound we have

[TABLE]

where the second inequality follows from the sharp deviation bound on the $\chi^{2}$ distribution (see Lemma 3 of Fan and Lv (2008)). Hence we have

[TABLE]

with probability tending to $1$ . Accordingly, the desired result (4.1) follows from (S.2.7) and (S.2.8).

S.2.2 Proof of Proposition 4.1

From Lemma 4.1 we know that the true model $u^{0}$ is the best model in the model space $\mathcal{M}_{\tilde{s}}$ with probability tending to $1$ . Along with Theorem 3.1 (b) we have

[TABLE]

By the definition of $\widehat{u}(t)$ , we arrive at

[TABLE]

This completes the proof.

S.2.3 Proof of Proposition 4.2

By the construction of the $\mathcal{A}_{\alpha}(t)$ , we have

[TABLE]

for all $u\in\Psi_{\tilde{s}}(t)-\{\widehat{u}(t)\}$ with $H_{0,u}$ not rejected and any $t\geq 0$ . Along with Proposition 4.1, which ensures that $\lim_{t\to\infty}\lim_{n\to\infty}P\left(\widehat{u}(t)=u^{0}\right)=1$ , the desired result then holds.

S.3 Details of the Auxiliary Methods

S.3.1 GIC-Based Superiority Test

A natural test statistic for the GIC-based superiority test (4.6) can be derived based on the difference of the GIC values of models $u$ and $u^{\#}$ . Note that the first term in the GIC (2.3) comes from simplifying the log likelihood with Gaussian noise. That is, the general form for GIC can be written as

[TABLE]

where $L(\bm{\beta}_{u};\bm{X},\bm{Y})$ is the likelihood function of model $u$ evaluated at $\bm{\beta}_{u}$ given data $(\bm{X},\bm{Y})$ , and $\widehat{\bm{\beta}}_{u}=\big{(}\bm{X}_{u}^{\top}\bm{X}_{u}\big{)}^{-1}\bm{X}_{u}\bm{Y}$ for any model $u\in\mathcal{M}$ with $|u|<n$ . As a result, we write

[TABLE]

Note that the first term on the R.H.S. is merely a constant and the sampling variation comes only from the second term. When $u$ and $u^{\#}$ are distinguishable (i.e., $H^{dis}_{0,u}$ in (4.5) is rejected), Vuong (1989) showed that the normalized log likelihood ratio

[TABLE]

where

[TABLE]

denotes the population variance of the log likelihood ratio of $u$ and $u^{\#}$ , $\bm{\beta}^{0}_{u}$ is the true regression coefficient under model $u$ , and $\bm{\mathsf{X}}$ and $\mathsf{Y}$ are the population counterparts of the design vector and the response scalar, respectively. Accordingly, under $H^{sup}_{0,u}$ , the result (S.3.1) can be used to show that

[TABLE]

In practice, we plug-in a consistent estimate of $\omega_{u}^{2}$ , denoted by $\widehat{\omega}_{u}^{2}$ (see Vuong (1989) for the formula), into (S.3.2) to perform the test. Accordingly, we reject $H^{\sup}_{0,u}$ if

[TABLE]

where $z_{1-\alpha}$ is the $(1-\alpha)$ -quantile of standard normal distribution, and the value of $\widehat{\omega}_{u}^{2}$ can be extracted from the R package nonnest2 (Merkle and You, 2018) when implementing the distinguishability test (4.5).

S.3.2 Model Averaging Approach of Ando and Li (2014)

Given a candidate model set $\Psi=\{u^{1},\ldots,u^{K}\}$ , let $\bm{D}_{k}$ be a $n\times n$ diagonal matrix with the $l$ -th element being $(1-h_{kl})^{-1}$ , where $h_{kl}$ is the $l$ -th diagonal element of the hat matrix $\bm{H}_{u^{k}}=\bm{X}_{u^{k}}\big{(}\bm{X}_{u^{k}}^{\top}\bm{X}_{u^{k}}\big{)}^{-1}\bm{X}_{u^{k}}^{\top}$ , and $\widetilde{\bm{H}}_{k}=\bm{D}_{k}(\bm{H}_{u^{k}}-\bm{I})+\bm{I}$ . Following Ando and Li (2014), the $K$ -dimensional weight vector $\bm{w}=(w_{1},\ldots,w_{K})^{\top}$ can be computed by

[TABLE]

where $\bm{a}=(a_{1},\ldots,a_{K})^{\top}$ with $a_{k}=\bm{Y}^{\top}\widetilde{\bm{H}}_{k}\bm{Y}$ , and $\bm{B}$ is a $K\times K$ matrix with the $(k,j)$ -th element $B_{kl}=\bm{Y}^{\top}\widetilde{\bm{H}}_{k}^{\top}\widetilde{\bm{H}}_{l}\bm{Y}$ . Note that the common constraint $\sum_{k=1}^{K}w_{k}=1$ for model weights does not necessarily to be imposed. In fact, Ando and Li (2014) show that their weighting approach leads to the smallest possible estimation error of the model averaging predictor (5.2) without the constraint.

S.3.3 A Variable Association Measure Assisted Approach for Generating the Initial Population

Given variable association measures $\gamma_{j},j=1,\ldots,d$ (e.g., the marginal correlation learning $\big{|}\widehat{\mathsf{Cor}}(\bm{X}_{j},\bm{Y})\big{|}$ (Fan and Lv, 2008) or the HOLP $\big{|}\bm{X}_{j}(\bm{X}\bm{X}^{\top})^{-1}\bm{Y}\big{|}$ (Wang and Leng, 2016, available only for $d\geq n$ )), we introduce an approach to randomly generate the initial population $\{u^{0},\ldots,u^{0}_{K}\}$ for the GA as follows.

Step 1:

Assign the model sizes $|u^{0}_{k}|,k=1,\ldots,K$ , by generating $K$ independent

$\textit{HyperGeom}\big{(}6\min(n,d),2\min(n,d),\min(n,d)\big{)}$ random variables, where $\textit{HyperGeom}(N,M,n)$ denotes the hypergeometric distribution with the probability mass function

[TABLE]

Step 2:

For $k=1,\ldots,K$ , the active positions of $u^{0}_{k}$ are determined by randomly selecting $|u^{0}_{k}|$ numbers from $[d]$ without replacement according to the probability distribution $\big{\{}\gamma_{j}/\sum_{l=1}^{d}\gamma_{l}\big{\}}_{j=1,\ldots,d}$ .

This approach ensures the model sizes are around $\min(n,d)/3$ and never exceed $\min(n,d)$ . Moreover, by making use of the variable association measures $\gamma_{j}$ , the resulting models are likely to contains the true signals so that their performance are by no means poor.

S.4 Supplementary Simulation Results

S.4.1 Computation Time

Figure S.1 displays the bar graph for the averaged computation time for implementing the three methods.

S.4.2 Schema Evolution

Figure S.2–S.7 present the additional results of schema evolution. The conclusions we draw in Section 5.2 still applies for these results, even though the patterns for high-dimensional (Cases 1–4) and low-dimensional (Cases 5 and 6) results are clearly different.

S.4.3 Variable Importance

Figure S.8–S.11 present additional simulation results of variable importance under Cases 1, 3, 5 and 6. In Cases 1 and 3 (Figure S.8 and Figure S.9), we see the results of GA and the SA are comparable, and slightly better than the RP in separating the true signals from the rest in some cases (e.g., $X_{s}$ under Case 1 with $\rho=0.9$ and $X_{s+1}$ and $X_{s+2}$ under Case 3 with $\rho=0$ ). However, under Cases 5 and 6 (Figure S.10 and Figure S.11) the three methods are just comparable.

S.5 Variable Coding for the Residential Building Dataset

The variable coding with descriptions and units for the residential building dataset used in Section 6.2 is listed in Table S.1. Detailed explanations are omitted and can be found in Table 1 of Rafiei and Adeli (2018).

S.6 Auxiliary Lemmas

In this section we provide technical lemmas, with a bit abuse of notations.

Lemma S.6.1 (Theorem 2 of Rudolph (1994)).

Let $\bm{P}$ be a $n\times n$ reducible stochastic matrix that can be decomposed into

[TABLE]

where $\bm{C}$ is an $m\times m$ primitive stochastic matrix with $m\leq n$ and $\bm{R}$ and $\bm{T}$ are two non-zero matrices with suitable dimensions. Then there exists an $(n-m)\times n$ positive matrix $\bm{R}_{\infty}$ such that

[TABLE]

is a stable stochastic matrix with $\bm{P}^{\infty}=\bm{1}\bm{\pi}^{\top}$ , where $\bm{1}=(1,\ldots,1)^{\top}$ is the vector of $1$ ’s with suitable length, $\bm{\pi}=(\pi_{1},\ldots,\pi_{n})^{\top}=\bm{\pi}_{0}^{\top}\bm{P}^{\infty}$ is unique regardless of the initial distribution $\bm{\pi}_{0}$ , and $\bm{\pi}$ satisfies $\pi_{i}>0$ for $i=1,\ldots,m$ and $\pi_{i}=0$ for $i=m+1,\ldots,n$ .

Bibliography87

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Agapie (1998) Agapie, A. (1998), “Genetic algorithms: Minimal conditions for convergence,” in Artificial Evolution: Third European Conference AE ’97 Nîmes, France, October 22–24, 1997 Selected Papers , eds. Hao, J.-K., Lutton, E., Ronald, E., Schoenauer, M., and Snyers, D., Springer Berlin Heidelberg, pp. 181–193.
2Akaike (1973) Akaike, H. (1973), “Information theory and an extension of the maximum likelihood principle,” in 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, September 2–8, 1971 , ed. Nikolaevich Petrov, F. C., Akadémiai Kiadó, Budapest, pp. 267–281.
3Akaike (1978) — (1978), “On the likelihood of a time series model,” Journal of the Royal Statistical Society. Series D (The Statistician) , 27, 217–235.
4Anderson (2008) Anderson, D. R. (2008), Model Based Inference in the Life Sciences: A Primer on Evidence , Springer-Verlag New York.
5Ando and Li (2014) Ando, T. and Li, K.-C. (2014), “A model-averaging approach for high-dimensional regression,” Journal of the American Statistical Association , 109, 254–265.
6Aue et al. (2014) Aue, A., Cheung, R. C. Y., Lee, T. C. M., and Zhong, M. (2014), “Segmented model selection in quantile regression using the minimum description length principle,” Journal of the American Statistical Association , 109, 1241–1256.
7Bozdogan (1987) Bozdogan, H. (1987), “Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions,” Psychometrika , 52, 345–370.
8Bühlmann (2013) Bühlmann, P. (2013), “Statistical significance in high-dimensional linear models,” Bernoulli , 19, 1212–1242.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Enhancing Multi-model Inference with Natural Selection

Abstract

1 Introduction

2 Methodology

Genetic representation for variable selection.

Fitness function.

2.1 A Genetic Algorithm for Candidate Model Search

Initialization:

Remark 2.1**.**

2.2 Computational Considerations

2.2.1 Population Sizing

2.2.2 Adaptive Termination

Remark 2.2**.**

3 Theoretical Properties

3.1 Convergence Analysis

Definition 3.1**.**

Theorem 3.1**.**

Remark 3.1**.**

Theorem 3.2**.**

3.2 Evolvability Analysis

Definition 3.2**.**

Example 3.1**.**

Theorem 3.3**.**

Corollary 3.1**.**

4 GA-assisted Multi-model Inference

Assumption 4.1**.**

Lemma 4.1**.**

4.1 Variable Selection

Proposition 4.1**.**

4.2 Model Confidence Set

Proposition 4.2**.**

5 Simulation Studies

5.1 Simulation Settings

5.2 Schema Evolution

5.3 Comparison with Existing Methods

5.3.1 Computation Time

5.3.2 Variable Selection

5.3.3 Quality of Candidate Models

5.3.4 Model Averaging

5.3.5 Variable Importance

Remark 5.1**.**

6 Real Data Example

6.1 The Riboflavin Dataset

6.2 Residential Building Dataset

7 Discussion

S.1 Proofs for Section 3

S.1.1 Proof of Theorem 3.1

S.1.2 Proof of Theorem 3.2

S.1.3 Proof of Theorem 3.3

S.1.4 Proof of Corollary 3.1

S.2 Proof for Section 4

S.2.1 Proof of Lemma 4.1

S.2.2 Proof of Proposition 4.1

S.2.3 Proof of Proposition 4.2

S.3 Details of the Auxiliary Methods

S.3.1 GIC-Based Superiority Test

S.3.2 Model Averaging Approach of Ando and Li (2014)

S.3.3 A Variable Association Measure Assisted Approach for Generating the Initial Population

S.4 Supplementary Simulation Results

S.4.1 Computation Time

S.4.2 Schema Evolution

S.4.3 Variable Importance

S.5 Variable Coding for the Residential Building Dataset

S.6 Auxiliary Lemmas

Lemma S.6.1** (Theorem 2 of Rudolph (1994)).**

Remark 2.1.

Remark 2.2.

Definition 3.1.

Theorem 3.1.

Remark 3.1.

Theorem 3.2.

Definition 3.2.

Example 3.1.

Theorem 3.3.

Corollary 3.1.

Assumption 4.1.

Lemma 4.1.

Proposition 4.1.

Proposition 4.2.

Remark 5.1.

Lemma S.6.1 (Theorem 2 of Rudolph (1994)).