Bayesian Automatic Relevance Determination for Utility Function Specification in Discrete Choice Models

Filipe Rodrigues; Nicola Ortelli; Michel Bierlaire; Francisco Pereira

arXiv:1906.03855·stat.ML·June 11, 2019

TL;DR

This paper introduces a Bayesian method with automatic relevance determination to efficiently identify optimal utility functions in discrete choice models, improving accuracy and scalability for large datasets.

Contribution

It develops a scalable Bayesian framework with variational inference for automatic utility function specification in discrete choice models, handling high-dimensional data.

Findings

1. Accurately recovers true utility functions in semi-artificial data

2. Discovers high-quality utility specifications in real data

3. Outperforms previous methods on multiple criteria

Abstract

Specifying utility functions is a key step towards applying the discrete choice framework for understanding the behaviour processes that govern user choices. However, identifying the utility function specifications that best model and explain the observed choices can be a very challenging and time-consuming task. This paper seeks to help modellers by leveraging the Bayesian framework and the concept of automatic relevance determination (ARD), in order to automatically determine an optimal utility function specification from an exponentially large set of possible specifications in a purely data-driven manner. Based on recent advances in approximate Bayesian inference, a doubly stochastic variational inference is developed, which allows the proposed DCM-ARD model to scale to very large and high-dimensional datasets. Using semi-artificial choice data, the proposed approach is shown to very…

Tables (11)

Table 1: Manually-defined utility function specifications used to generate the semi-artificial choice data.

	Artificial specification
Spec	Variables in $V_{train}$	Variables in $V_{sm}$	Variables in $V_{car}$
S1	ASC, TT, CO	ASC, TT, CO	TT, CO
S2	ASC, TT, TT x age, CO	ASC, TT, CO, CO x ga	TT, TT x age, CO
S3	ASC, TT, TT x age, CO, CO x ga, HE	ASC, TT, CO, CO x ga, log(HE)	TT, TT x age, CO
S4	ASC, ASC x ga, TT, CO	ASC, ASC x ga, TT, CO	TT, CO, CO x purpose
S5	ASC, log(TT), HE	ASC, log(TT), HE	TT, CO
S6	ASC, log(TT), log(TT) x ga, CO	ASC, log(TT)	TT, CO
S7	ASC, box(TT), box(TT) x ga, CO	ASC, TT	TT, CO
S8	ASC, ASC x ga, TT, CO, CO x who	ASC, ASC x ga, TT, CO, CO x who	TT, CO, CO x luggage
S9	ASC, TT, CO, CO x ga	ASC, TT, TT x age, CO, CO x ga	TT, CO, CO x income

Table 2: Results of DCM-ARD for medium-sized search space (part 1).

	Train		Swiss Metro		Car
Spec	Variable	$λ$	Variable	$λ$	Variable	$λ$
S1	ASC	1.814	TT	0.513	TT	0.744
	TT	1.174	ASC	0.126	CO	0.011
	CO	0.393	CO	0.066	log(TT) x pur1	0.000
	CO x age1	0.000	log(HE) x age1	0.000	log(TT) x pur2	0.000
	…		…		…
S2	ASC	2.353	TT	0.495	TT	0.389
	TT x age1	0.524	CO x ga	0.195	CO	0.070
	TT x age2	0.524	ASC	0.120	TT x age1	0.060
	TT x age3	0.524	CO	0.030	TT x age2	0.060
	TT x age4	0.524	ASC x pur1	0.000	TT x age3	0.060
	TT	0.468	ASC x pur2	0.000	TT x age4	0.060
	CO	0.416	ASC x pur3	0.000	log(TT) x pur1	0.000
	TT x pur1	0.000	ASC x pur4	0.000	log(TT) x pur2	0.000
	…		…		…
S3	ASC	2.536	TT	0.522	TT	0.478
	CO x ga	0.633	CO x ga	0.426	CO	0.120
	TT x age1	0.510	ASC	0.133	TT x age1	0.061
	TT x age2	0.510	CO	0.023	TT x age2	0.061
	TT x age3	0.510	log(HE)	0.005	TT x age3	0.061
	TT x age4	0.510	HE x age1	0.000	TT x age4	0.061
	TT	0.300	HE x age2	0.000	log(TT) x ga	0.000
	CO	0.202	HE x age3	0.000	log(CO) x pur1	0.000
	HE	0.056	HE x age4	0.000	log(CO) x pur2	0.000
	HE x pur1	0.000	log(CO) x pur1	0.000	log(CO) x pur3	0.000
	…		…		…

Table 3: Results of DCM-ARD for medium-sized search space (part 2).

	Train		Swiss Metro		Car
Spec	Variable	$λ$	Variable	$λ$	Variable	$λ$
S4	ASC x ga	6.836	ASC x ga	3.401	TT	0.855
	CO	2.323	CO	1.354	CO x pur1	0.100
	ASC	1.338	TT	0.462	CO x pur2	0.100
	TT	0.885	ASC	0.361	CO x pur3	0.100
	CO x ga	0.001	log(HE) x age1	0.001	CO x pur4	0.100
	ASC x pur1	0.000	log(HE) x age2	0.001	CO x pur5	0.100
	ASC x pur2	0.000	log(HE) x age3	0.001	CO x pur6	0.100
	ASC x pur3	0.000	log(HE) x age4	0.001	CO x pur7	0.100
	ASC x pur4	0.000	CO x pur1	0.000	CO x pur8	0.100
	ASC x pur5	0.000	CO x pur2	0.000	log(TT) x ga	0.000
	…		…		…
S5	ASC	1.775	log(TT)	0.557	TT	0.722
	log(TT)	1.405	ASC	0.087	CO	0.042
	HE	0.035	CO	0.002	log(TT) x pur1	0.000
	TT x age1	0.000	HE	0.001	log(TT) x pur2	0.000
	TT x age2	0.000	HE x age1	0.000	log(TT) x pur3	0.000
	…		…		…
S6	ASC	2.071	log(TT)	0.664	TT	0.809
	log(TT) x ga	1.600	ASC	0.106	CO	0.042
	log(TT)	0.611	log(TT) x age1	0.000	log(TT) x pur1	0.000
	CO	0.394	log(TT) x age2	0.000	log(TT) x pur2	0.000
	TT x age1	0.000	log(TT) x age3	0.000	log(TT) x pur3	0.000
	…		…		…

Table 4: Results of DCM-ARD for large search space (part 1).

	Train		Swiss Metro		Car
Spec	Variable	$λ$	Variable	$λ$	Variable	$λ$
S1	ASC	1.879	TT	0.537	TT	0.677
	TT	1.196	ASC	0.137	CO	0.011
	CO	0.513	CO	0.096	CO x pur1	0.000
	log(CO) x inc1	0.001	TT x lugg1	0.000	CO x pur2	0.000
	…		…		…
S2	ASC	2.391	TT	0.568	TT	0.411
	TT	0.606	CO x ga	0.179	TT x age1	0.059
	CO	0.477	ASC	0.130	TT x age2	0.059
	TT x age1	0.352	CO	0.031	TT x age3	0.059
	TT x age2	0.352	HE x inc1	0.000	TT x age4	0.059
	TT x age3	0.352	HE x inc2	0.000	CO	0.036
	TT x age4	0.352	HE x inc3	0.000	TT x lugg1	0.000
	log(CO) x pur1	0.001	HE x inc4	0.000	TT x lugg2	0.000
	…		…		…
S3	ASC	2.599	TT	0.594	TT	0.446
	CO x ga	0.717	CO x ga	0.477	CO	0.107
	TT	0.432	ASC	0.127	TT x age1	0.074
	TT x age1	0.364	CO	0.017	TT x age2	0.074
	TT x age2	0.364	log(HE)	0.003	TT x age3	0.074
	TT x age3	0.364	HE x inc1	0.000	TT x age4	0.074
	TT x age4	0.364	HE x inc2	0.000	CO x who1	0.001
	CO	0.194	HE x inc3	0.000	CO x who2	0.001
	HE	0.057	HE x inc4	0.000	CO x who3	0.001
	ASC x who1	0.002	log(HE) x inc1	0.000	TT x pur1	0.000
	…		…		…

Table 5: Results of DCM-ARD for large search space (part 2).

	Train		Swiss Metro		Car
Spec	Variable	$λ$	Variable	$λ$	Variable	$λ$
S7	ASC	2.246	TT	0.574	TT	0.553
	box(TT) x ga	1.787	ASC	0.120	CO	0.019
	CO	0.360	CO x pur1	0.000	TT x lugg1	0.000
	log(TT)	0.220	CO x pur2	0.000	TT x lugg2	0.000
	log(CO) x inc1	0.001	CO x pur3	0.000	seg(CO,4)	0.000
	…		…		…
S8	ASC x ga	7.448	ASC x ga	4.805	TT	0.828
	CO	2.840	CO	1.695	CO	0.018
	ASC	1.611	TT	0.559	seg(CO,4)	0.000
	TT	1.120	ASC	0.336	seg(CO,4)	0.000
	CO x who1	0.057	CO x who1	0.025	seg(CO,4)	0.000
	CO x who2	0.057	CO x who2	0.025	CO x pur1	0.000
	CO x who3	0.057	CO x who3	0.025	CO x pur2	0.000
	CO x inc1	0.001	seg(TT,8)	0.000	CO x pur3	0.000
	…		…		…
S9	ASC	2.255	TT	1.367	TT	1.118
	TT	1.197	CO x ga	0.501	CO	0.098
	CO x ga	0.805	TT x age1	0.134	CO x inc1	0.004
	CO	0.187	TT x age2	0.134	CO x inc2	0.004
	ASC x who1	0.001	TT x age3	0.134	CO x inc3	0.004
	ASC x who2	0.001	TT x age4	0.134	CO x inc4	0.004
	ASC x who3	0.001	ASC	0.110	CO x who1	0.001
	HE x age1	0.000	CO	0.015	CO x who2	0.001
	HE x age2	0.000	seg(TT,8)	0.000	CO x who3	0.001
	…		…		…

Table 6: Prediction accuracy and log-likelihood on held-out data

		DCM		DCM-ARD		DCM-TRUE
Search Space	Spec	Acc.	LogLik	Acc.	LogLik	Acc.	LogLik
Moderate	S1	0.615	-2733.9	0.628	-2569.0	0.627	-2567.4
Moderate	S2	0.627	-2697.0	0.638	-2498.2	0.636	-2496.8
Moderate	S3	0.639	-2662.5	0.645	-2452.9	0.646	-2450.4
Moderate	S4	0.627	-2597.3	0.647	-2454.9	0.648	-2452.7
Moderate	S5	0.607	-2788.8	0.623	-2621.9	0.623	-2619.2
Moderate	S6	0.624	-2621.3	0.632	-2530.3	0.633	-2527.1
Large	S1	0.589	-2798.2	0.628	-2569.0	0.627	-2567.4
Large	S2	0.602	-2773.7	0.638	-2498.2	0.636	-2496.8
Large	S3	0.612	-2924.0	0.645	-2452.9	0.646	-2450.4
Large	S7	0.603	-2746.6	0.606	-2675.5	0.617	-2551.1
Large	S8	0.598	-2858.0	0.642	-2489.8	0.646	-2421.7
Large	S9	0.614	-2823.9	0.653	-2466.7	0.660	-2400.3

Table 7: Results for real SM data

Train		Swiss Metro		Car
Variable	$λ$	Variable	$λ$	Variable	$λ$
log(TT) x ga	9.506	log(CO) x ga	5.570	log(CO)	4.479
ASC	4.002	log(CO) x pur1	2.251	TT x ga	1.378
log(CO)	3.262	log(CO) x pur2	2.251	log(TT) x pur1	0.477
log(CO) x pur1	2.469	log(CO)	1.184	log(TT) x pur2	0.477
log(CO) x pur2	2.469	log(TT)	0.506	log(CO) x age1	0.213
log(CO) x ga	1.235	CO	0.349	log(CO) x age2	0.213
CO	0.556	ASC x age1	0.250	log(CO) x age3	0.213
log(CO) x age1	0.269	ASC x age2	0.250	log(CO) x age4	0.213
log(CO) x age2	0.269	ASC x age3	0.250	CO x pur1	0.156
log(CO) x age3	0.269	ASC x age4	0.250	CO x pur2	0.156
log(CO) x age4	0.269	CO x pur1	0.236	TT x age1	0.107
CO x pur1	0.228	CO x pur2	0.236	TT x age2	0.107
CO x pur2	0.228	ASC x ga	0.146	TT x age3	0.107
log(TT)	0.175	CO x ga	0.099	TT x age4	0.107
log(HE)	0.075	TT x age1	0.027	CO	0.037
CO x ga	0.068	TT x age2	0.027	log(TT)	0.000
CO x age1	0.034	TT x age3	0.027	log(TT) x ga	0.000
CO x age2	0.034	TT x age4	0.027	log(CO) x ga	0.000
CO x age3	0.034	TT x pur1	0.005	TT	0.000
CO x age4	0.034	TT x pur2	0.005	TT x pur1	0.000
…		…		…

Table 8: Utility function specifications for true SM data

	Specification
S#	Variables in $V_{train}$	Variables in $V_{sm}$	Variables in $V_{car}$
R1	ASC, TT, CO	ASC, TT, CO	TT, CO
R2	ASC, log(TT), log(TT) x ga, log(CO)	ASC, log(TT), log(CO)	TT, log(CO)
R3	ASC, log(TT), log(TT) x ga, log(CO), log(CO) x pur	ASC, log(TT), log(CO), log(CO) x ga	TT, log(CO)
R4	ASC, log(TT), log(TT) x ga, log(CO), log(CO) x ga, log(CO) x pur	ASC, log(TT), log(CO), log(CO) x ga, log(CO) x pur	TT, TT x ga, log(CO)
R5	ASC, log(TT), log(TT) x ga, log(CO), log(CO) x ga, log(CO) x pur, log(CO) x age	ASC, ASC x age, log(TT), log(CO), log(CO) x ga, log(CO) x pur	TT, TT x ga, TT x pur, log(CO)
R6	ASC, log(TT), log(TT) x ga, log(CO), log(CO) x ga, log(CO) x pur, log(CO) x age, log(HE)	ASC, ASC x age, log(TT), log(CO), log(CO) x ga, log(CO) x pur	TT, TT x ga, TT x pur, log(CO)
R7	ASC, log(TT), log(TT) x ga, log(CO), log(CO) x ga, log(CO) x pur, log(CO) x age, log(HE)	ASC, ASC x ga, ASC x age, log(TT), log(CO), log(CO) x ga, log(CO) x pur	TT, TT x ga, TT x pur, log(CO), log(CO) x age, log(CO) x pur

Table 9: Results for true SM data

	Specification
	R1	R2	R3	R4	R5	R6	R7
Log-like full	-8,625	-8,368	-8,064	-7,836	-7,679	-7,645	-7,617
AIC	17,267	16,755	16,152	15,704	15,410	15,345	15,301
BIC	17,326	16,821	16,239	15,820	15,599	15,542	15,549
Pseudo- $R^{2}$	0.221	0.244	0.272	0.292	0.306	0.309	0.312
Pseudo- ${\bar{R}}^{2}$	0.220	0.243	0.271	0.291	0.304	0.307	0.309
Log-lik train	-6,032	-5,822	-5,619	-5,429	-5,297	-5,271	-5,247
Log-lik test	-2,603	-2,558	-2,457	-2,437	-2,428	-2,421	-2,430
Train acc.	0.616	0.636	0.661	0.676	0.689	0.690	0.692
Test acc.	0.615	0.638	0.662	0.670	0.675	0.677	0.679

Table 10: Results for true SM data vs. baseline from state of the art

	Specification
	Bierlaire et al. [2001]	PyLogit Example	R6	R7
Log-lik full	-8,483	-8,061	-7,645	-7,617
AIC	16,984	16,150	15,345	15,301
BIC	17,050	16,252	15,542	15,549
Pseudo- $R^{2}$	0.234	0.272	0.309	0.312
Pseudo- ${\bar{R}}^{2}$	0.233	0.271	0.307	0.309
Log-lik train	-5,960	-5,633	-5,271	-5,247
Log-lik test	-2,535	-2,450	-2,421	-2,430
Train acc.	0.646	0.667	0.690	0.692
Test acc.	0.644	0.650	0.677	0.679

Table 11: Results for true SM data, spec 6

	Coef	StdErr	$z$	$p > \| z \|$	[0.025	0.975]
ASC (Train)	3.036	0.196	15.478	0.000	2.652	3.421
ASC (SM)	0.900	0.134	6.725	0.000	0.638	1.163
ASC x age1 (SM)	0.575	0.156	3.699	0.000	0.271	0.880
ASC x age2 (SM)	0.784	0.103	7.585	0.000	0.582	0.987
ASC x age3 (SM)	0.704	0.102	6.909	0.000	0.505	0.904
ASC x age4 (SM)	0.479	0.107	4.478	0.000	0.270	0.689
log(TT) (Train)	-0.964	0.261	-3.697	0.000	-1.477	-0.453
log(TT) (SM)	-2.570	0.110	-23.465	0.000	-2.785	-2.355
TT (Car)	-0.865	0.218	-3.974	0.000	-1.293	-0.439
log(TT) x ga (Train)	-2.995	0.275	-10.880	0.000	-3.535	-2.455
TT x ga (Car)	-0.176	0.210	-0.841	0.400	-0.589	0.235
TT x pur1 (Car)	0.273	0.064	4.285	0.000	0.148	0.398
TT x pur2 (Car)	0.289	0.088	3.289	0.001	0.117	0.463
log(CO) (Train)	-2.637	0.318	-8.297	0.000	-3.261	-2.015
log(CO) (SM)	-1.984	0.247	-8.023	0.000	-2.470	-1.500
log(CO) (Car)	-1.875	0.175	-10.714	0.000	-2.218	-1.532
log(CO) x ga (Train)	-1.997	0.195	-10.248	0.000	-2.379	-1.615
log(CO) x ga (SM)	-2.249	0.132	-17.024	0.000	-2.509	-1.991
CO x age1 (Train)	-0.317	0.090	-3.539	0.000	-0.493	-0.141
CO x age2 (Train)	-0.578	0.079	-7.336	0.000	-0.733	-0.424
CO x age3 (Train)	-0.647	0.080	-8.134	0.000	-0.804	-0.492
CO x age4 (Train)	-0.525	0.083	-6.301	0.000	-0.690	-0.362
log(CO) x pur1 (Train)	2.521	0.294	8.574	0.000	1.945	3.098
log(CO) x pur1 (SM)	1.963	0.231	8.510	0.000	1.511	2.415
log(CO) x pur2 (Train)	3.282	0.308	10.641	0.000	2.678	3.887
log(CO) x pur2 (SM)	2.589	0.244	10.606	0.000	2.111	3.068
HE, (Train)	-0.948	0.118	-8.059	0.000	-1.179	-0.718

Equations

$$U_{in}=V_{in}+\epsilon_{in},$$

$$V_{in}=\boldsymbol{\beta}_{i}^{\mbox{T}}\textbf{x}_{in}=\sum_{d=1}^{D_{i}}\beta_{di}x_{din},$$

$$P_{n}(i)=\frac{e^{V_{in}}}{\sum_{j\in\mathcal{C}_{n}}e^{V_{jn}}}.$$

$$\boldsymbol{\beta}^{*}=\arg\max_{\boldsymbol{\beta}}\sum_{n=1}^{N}\sum_{i\in\mathcal{C}_{n}}y_{in}\log P_{n}(i),$$

$$\boldsymbol{\beta}_{i}\sim\mathcal{N}(\boldsymbol{\beta}_{i}|\textbf{0},\lambda\textbf{I}),$$

$$p(\textbf{y},\boldsymbol{\beta}|\lambda)=\Bigg(\prod_{i\in\mathcal{C}}\mathcal{N}(\boldsymbol{\beta}_{i}|\textbf{0},\lambda\textbf{I})\Bigg)\prod_{n=1}^{N}\prod_{i\in\mathcal{C}_{n}}(P_{n}(i))^{y_{in}},$$

$$p(\boldsymbol{\beta}|\textbf{y},\lambda)=\frac{p(\boldsymbol{\beta}|\lambda)\prod_{n=1}^{N}\prod_{i\in\mathcal{C}_{n}}(P_{n}(i))^{y_{in}}}{\int p(\boldsymbol{\beta}|\lambda)\prod_{n=1}^{N}\prod_{i\in\mathcal{C}_{n}}(P_{n}(i))^{y_{in}}\,d\boldsymbol{\beta}}.$$

$$V_{in}=\sum_{d=1}^{D_{i}}\sum_{k=1}^{K_{d}}\beta_{kdi}\,\delta_{k}(s_{n})\,h(x_{din}),$$

$$\beta_{kdi}\sim\mathcal{N}(\beta_{kdi}|0,\lambda_{di}).$$

$$p(\boldsymbol{\beta}|\boldsymbol{\lambda})=\Bigg(\prod_{i\in\mathcal{C}}\prod_{d=1}^{D_{i}}\prod_{k=1}^{K_{d}}\mathcal{N}(\beta_{kdi}|0,\lambda_{di})\Bigg),$$

$$p(\textbf{y}|\boldsymbol{\lambda})=\int\Bigg(\prod_{i\in\mathcal{C}}\prod_{d=1}^{D_{i}}\prod_{k=1}^{K_{d}}\mathcal{N}(\beta_{kdi}|0,\lambda_{di})\Bigg)\prod_{n=1}^{N}\prod_{i\in\mathcal{C}_{n}}(P_{n}(i))^{y_{in}}\,d\boldsymbol{\beta}.$$

$$q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})=\prod_{i\in\mathcal{C}}\prod_{d=1}^{D_{i}}\prod_{k=1}^{K_{d}}\mathcal{N}(\beta_{kdi}|\mu_{kdi},c_{kdi}),$$

$$\mbox{KL}(q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})\,||\,p(\boldsymbol{\beta}|\textbf{y}))=\int q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})\log\frac{q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})}{p(\boldsymbol{\beta}|\textbf{y})}\,d\boldsymbol{\beta}.$$

$$\log p(\textbf{y}|\boldsymbol{\lambda})\geq\int q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})\log\frac{p(\textbf{y},\boldsymbol{\beta}|\boldsymbol{\lambda})}{q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})}\,d\boldsymbol{\beta}=\mathbb{E}_{q(\boldsymbol{\beta})}[\log p(\textbf{y},\boldsymbol{\beta}|\boldsymbol{\lambda})]-\mathbb{E}_{q(\boldsymbol{\beta})}[\log q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})]=\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda}),$$

$$\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})=\dots-\sum_{i\in\mathcal{C}}\sum_{d=1}^{D_{i}}\sum_{k=1}^{K_{d}}\mathbb{E}_{q(\boldsymbol{\beta})}[\log\mathcal{N}(\beta_{kdi}|\mu_{kdi},c_{kdi})]$$

$$q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})$$

$$\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})=\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}[\log p(\textbf{y}|\textbf{c}\circ\textbf{z}+\boldsymbol{\mu})]+\sum_{i\in\mathcal{C}}\sum_{d=1}^{D_{i}}\sum_{k=1}^{K_{d}}\log c_{kdi}+\sum_{i\in\mathcal{C}}\sum_{d=1}^{D_{i}}\sum_{k=1}^{K_{d}}\mathbb{E}_{\mathcal{N}(z_{kdi}|0,1)}[\log\mathcal{N}(c_{kdi}z_{kdi}+\mu_{kdi}|0,\lambda_{di})]+\mbox{const.},$$

$$\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})=\dots-\frac{1}{2}\sum_{i\in\mathcal{C}}\sum_{d=1}^{D_{i}}K_{d}\log\lambda_{di}-\frac{1}{2}\sum_{i\in\mathcal{C}}\sum_{d=1}^{D_{i}}\sum_{k=1}^{K_{d}}\frac{c_{kdi}^{2}+\mu_{kdi}^{2}}{\lambda_{di}}+\mbox{const.}$$

$$\lambda_{di}^{*}=\frac{1}{K_{d}}\sum_{k=1}^{K_{d}}(c_{kdi}^{2}+\mu_{kdi}^{2}).$$

$$\mathcal{L}(\boldsymbol{\mu},\textbf{c})$$

$$\nabla_{\mu_{kdi}}\mathcal{L}(\boldsymbol{\mu},\textbf{c})$$

$$\nabla_{c_{kdi}}\mathcal{L}(\boldsymbol{\mu},\textbf{c})$$

$$\nabla_{\beta_{kdi}}\log p(\textbf{y}|\boldsymbol{\beta})$$

$$\boldsymbol{\mu}^{(t)}=\boldsymbol{\mu}^{(t-1)}+\rho_{t}\nabla_{\boldsymbol{\mu}}\mathcal{L}(\boldsymbol{\mu},\textbf{c})$$

$$\textbf{c}^{(t)}=\textbf{c}^{(t-1)}+\rho_{t}\nabla_{\textbf{c}}\mathcal{L}(\boldsymbol{\mu},\textbf{c})$$

Full text

Bayesian Automatic Relevance Determination for Utility Function Specification in Discrete Choice Models

Filipe Rodrigues

Nicola Ortelli

Michel Bierlaire

Francisco C. Pereira

Technical University of Denmark (DTU), Bygning 116B, 2800 Kgs. Lyngby, Denmark

École Polytechnique Fédérale de Lausanne (EPFL),

Abstract

Specifying utility functions is a key step towards applying the discrete choice framework for understanding the behaviour processes that govern user choices. However, identifying the utility function specifications that best model and explain the observed choices can be a very challenging and time-consuming task. This paper seeks to help modellers by leveraging the Bayesian framework and the concept of automatic relevance determination (ARD), in order to automatically determine an optimal utility function specification from an exponentially large set of possible specifications in a purely data-driven manner. Based on recent advances in approximate Bayesian inference, a doubly stochastic variational inference algorithm is developed, which allows the proposed DCM-ARD model to scale to very large and high-dimensional datasets. Using semi-artificial choice data, the proposed approach is shown to very accurately recover the true utility function specifications that govern the observed choices. Moreover, when applied to real choice data, DCM-ARD is shown to be able to discover high-quality specifications that can outperform previous ones from the literature according to multiple criteria, thereby demonstrating its practical applicability.

Keywords: discrete choice models, automatic relevance determination, automatic utility specification, doubly stochastic variational inference

URL: http://fprodrigues.com

1 Introduction

Discrete choice models (DCMs) provide a powerful framework for understanding user behaviour. By modelling user choices as functions of the alternative-specific characteristics and user attributes, DCMs allow researchers to predict users’ future choices given a set of discrete alternatives and to understand the behavioural process that governs their choices. It is therefore no surprise that DCMs have become a widely adopted framework in domains ranging from psychology to economics, making them one of the main workhorses for understanding travel behaviour, consumer behaviour, and many other kinds of user choices.

In practice, a fundamental part of applying the DCM framework consists in specifying the utility function of each alternative in the choice set; these functions are generally assumed to be known a priori. For the sake of interpretability, they are typically assumed to be linear functions of a set of explanatory variables. Although limiting at first sight, this linear framework can be made rather powerful by exploring variable transformations (e.g. log-transformations, Box-Cox transformations), one-hot encodings, piecewise linear representations, discretizations, interactions between variables, etc. However, all these modelling choices quickly raise the number of possible utility function specifications beyond what a modeller can manage. At the same time, given the central role of the utility functions in DCMs, it is essential to determine good specifications; otherwise the modeller risks obtaining misspecified models and biased parameter estimates [Torres et al., 2011]. As a consequence, a modeller often spends large portions of time seeking the “best” specification according to different criteria (e.g. convergence, log-likelihood, p-values), typically through a combination of trial-and-error and domain knowledge (e.g. economic theories).
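To make the growth of the search space concrete, the following minimal sketch enumerates candidate representations for a handful of variables: each continuous variable is offered as-is, log-transformed and Box-Cox-transformed, and each of those is also interacted with the one-hot levels of every categorical variable. The function names, the variable labels (`TT`, `CO`, `ga`) and the Box-Cox parameter are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def box_cox(x, lam=0.5):
    """Box-Cox transform for strictly positive x (lam=0.5 is an assumed value)."""
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def build_candidates(continuous, categorical):
    """Return a dict mapping candidate name -> array of shape (N, K).

    continuous:  dict name -> strictly positive float array of shape (N,)
    categorical: dict name -> integer-coded array of shape (N,)
    K is 1 for a plain transform and the number of levels for an interaction.
    """
    transforms = {"{}": lambda x: x, "log({})": np.log, "box({})": box_cox}
    candidates = {}
    for name, x in continuous.items():
        for pattern, fn in transforms.items():
            label = pattern.format(name)
            hx = fn(x)
            candidates[label] = hx[:, None]
            for cat_name, s in categorical.items():
                levels = np.unique(s)
                # one indicator column per level, multiplied by the transformed variable
                one_hot = (s[:, None] == levels[None, :]).astype(float)
                candidates[f"{label} x {cat_name}"] = one_hot * hx[:, None]
    return candidates

# Toy data (hypothetical): travel time, cost, and a binary season-ticket flag.
rng = np.random.default_rng(0)
cont = {"TT": rng.uniform(10, 120, size=6), "CO": rng.uniform(1, 50, size=6)}
cat = {"ga": np.array([0, 1, 0, 1, 1, 0])}
cands = build_candidates(cont, cat)
print(len(cands), sorted(cands)[:3])
```

With only two continuous variables, three transforms and one categorical variable there are already 12 candidate representations per utility function, and every subset of them is a possible specification, which is why manual search quickly becomes impractical.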

In this paper, we propose leveraging the Bayesian framework in order to automatically determine an optimal utility function specification from an exponentially large set of possible specifications in a purely data-driven manner. Although the proposed approach is not meant to be a complete replacement for expert intuition and domain knowledge, it is shown to provide key insights about the data that can help the modeller determine the utility function specification that best represents the observed choice data, which can ultimately lead to new understandings about the way people make choices in certain contexts.

Based on the principle of Automatic Relevance Determination (ARD), as developed by Tipping [2001] in the context of the Relevance Vector Machine and as widely used in the Gaussian Processes literature [Rasmussen, 2003], we propose the use of a hierarchical prior on the preference parameters of each utility function in order to automatically determine their relevance for explaining the observed choice data. The key idea consists in jointly estimating the posterior distribution over the preference parameters, as well as the optimal values for the variances of the Gaussian priors over each possible explanatory variable to be included in each utility function specification. In order to ensure consistency among the selected variables, i.e. that either all or none of the dimensions corresponding to the representation of a given explanatory variable are selected, we propose tying the variance parameters of the Gaussian priors over the parameters that correspond to the same representation of a given choice attribute. Given the estimated optimal values for the variances of the Gaussian priors for a very large set of possible variable representations, a modeller can easily determine the most relevant attributes and corresponding representations for explaining a dataset of observed choices by simply selecting the variables for which the estimated prior variances are non-zero.
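As a rough illustration of this selection step, the sketch below computes one tied prior variance per variable representation from a variational mean `mu` and scale `c`, using the closed-form expression $\lambda_{di}^{*}=\frac{1}{K_{d}}\sum_{k}(c_{kdi}^{2}+\mu_{kdi}^{2})$ that appears among the paper's equations, and then keeps the representations whose variance is clearly above zero. The numerical threshold, the helper names and the example numbers are assumptions made for illustration only.

```python
import numpy as np

def tied_variances(mu, c, groups):
    """lambda*_d = mean over the K_d coefficients in group d of (c_k^2 + mu_k^2)."""
    return {name: float(np.mean(c[idx] ** 2 + mu[idx] ** 2))
            for name, idx in groups.items()}

def select_relevant(lambdas, threshold=1e-3):
    """Names whose tied variance exceeds the (assumed) threshold, largest first."""
    kept = [(name, lam) for name, lam in lambdas.items() if lam > threshold]
    return [name for name, _ in sorted(kept, key=lambda t: -t[1])]

# Hypothetical variational parameters for one utility function:
# TT (K=1), TT x age (K=4 levels), CO x pur (K=2 levels).
mu = np.array([-0.70, -0.24, -0.25, -0.23, -0.26, 0.001, -0.002])
c = np.array([0.05, 0.04, 0.04, 0.04, 0.04, 0.01, 0.01])
groups = {
    "TT": np.array([0]),
    "TT x age": np.array([1, 2, 3, 4]),
    "CO x pur": np.array([5, 6]),
}
lambdas = tied_variances(mu, c, groups)
selected = select_relevant(lambdas)
print(lambdas, selected)
```

Because the variance is shared by all coefficients of a representation, the levels of `TT x age` are kept or dropped together, which is the consistency property described above.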

Since exact Bayesian inference in the proposed DCM-ARD model is intractable, we propose the use of the variational inference framework. Namely, we develop an efficient approximate inference algorithm using doubly stochastic variational inference [Titsias & Lázaro-Gredilla, 2014]. By combining the theory of variational inference with the theory of stochastic optimization, the proposed inference algorithm is able to approximate the true posterior distribution over the preference parameters with a tractable distribution and jointly estimate the optimal Gaussian prior hyper-parameters, while being able to scale to very large datasets with a very high number of dimensions. Although we focus on Multinomial Logit (MNL) models, the proposed approach can be extended to more complex models such as Mixed and Latent Class Logit models.
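The two sources of stochasticity can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single coefficient vector shared by all alternatives (rather than alternative-specific parameters), a factorized Gaussian $q$ with mean `mu` and scale `c`, the closed-form tied variance substituted at every step, and ad hoc choices for the step-size schedule and the lower clamps on `c` and `lam`. All function and variable names are hypothetical.

```python
import numpy as np

def mnl_grad(beta, X, y):
    """Gradient of sum_n log P_n(y_n) w.r.t. beta, with utilities V = X @ beta."""
    V = X @ beta                                   # (B, J)
    V = V - V.max(axis=1, keepdims=True)           # numerical stability
    P = np.exp(V)
    P /= P.sum(axis=1, keepdims=True)
    chosen = X[np.arange(len(y)), y]               # (B, D) features of chosen alternative
    expected = np.einsum("nj,njd->nd", P, X)       # (B, D) probability-weighted features
    return (chosen - expected).sum(axis=0)

def dsvi_ard(X, y, groups, n_iter=3000, batch=500, rho0=5e-4, seed=0):
    """Doubly stochastic VI with ARD; groups[k] is the prior-variance group of coefficient k."""
    rng = np.random.default_rng(seed)
    N, _, D = X.shape
    mu, c = np.zeros(D), np.full(D, 0.1)
    counts = np.bincount(groups)
    for t in range(n_iter):
        rho = rho0 / (1.0 + t / 500.0)                     # assumed decaying step size
        idx = rng.choice(N, size=batch, replace=False)     # stochasticity 1: mini-batch
        z = rng.standard_normal(D)                         # stochasticity 2: z ~ N(0, I)
        g = (N / batch) * mnl_grad(mu + c * z, X[idx], y[idx])
        # closed-form tied variance per group, broadcast back to coefficients
        lam_g = np.bincount(groups, weights=c ** 2 + mu ** 2) / counts
        lam = np.maximum(lam_g[groups], 1e-3)              # assumed floor for stability
        mu = mu + rho * (g - mu / lam)
        c = np.maximum(c + rho * (g * z + 1.0 / c - c / lam), 1e-2)
    lam_g = np.bincount(groups, weights=c ** 2 + mu ** 2) / counts
    return mu, c, lam_g

# Synthetic check: only the first two coefficients matter.
rng = np.random.default_rng(1)
N, J = 1000, 3
beta_true = np.array([-1.0, 0.8, 0.0, 0.0, 0.0])
groups = np.array([0, 1, 2, 2, 3])     # coefficients 2 and 3 share one prior variance
X = rng.standard_normal((N, J, beta_true.size))
y = np.argmax(X @ beta_true + rng.gumbel(size=(N, J)), axis=1)
mu, c, lam = dsvi_ard(X, y, groups)
print(np.round(lam, 3))
```

The gradient terms `-mu / lam` and `1/c - c/lam` come from differentiating the bound after replacing each prior variance by its closed-form optimum; the clamps are a pragmatic guard against the variances of irrelevant groups collapsing to exactly zero during optimization.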

The validity of the proposed automatic utility function specification framework is empirically demonstrated using both semi-artificial and real choice data. We begin by empirically demonstrating the ability of the proposed approach to discover the correct utility function specifications through an extensive series of experiments on simulated choice data based on the Swissmetro dataset [Bierlaire et al., 2001]. In particular, we manually specify a series of “artificial” (but realistic) utility function specifications of increasing complexity and, based on the Swissmetro dataset, we sample new artificial choices according to the manually-specified utility functions. Our empirical results show that the proposed DCM-ARD model is able to very accurately recover the “true” specifications that were used to generate the artificial choices, even in settings where the number of variable representations and transformations considered for each utility function is on the order of thousands. Lastly, our empirical results on the real choices from the Swissmetro dataset demonstrate the potential of the proposed framework for discovering novel utility function specifications that can outperform previous ones from the state of the art in terms of explanatory power and generalization to unobserved data.

In summary, the main contributions of this paper are the following:

1. We adapt the theory of ARD to the domain of DCMs, making the necessary modifications that are required from a choice modelling perspective (e.g. multiple utility functions with alternative-specific attributes, variable number of dimensions and tied parameters in the hierarchical priors);

2. We develop a new variational inference algorithm for performing fast approximate inference in the proposed DCM-ARD model based on the DSVI framework proposed by Titsias & Lázaro-Gredilla [2014];

3. We empirically show (i) the ability of the proposed approach to recover the true utility function specifications on semi-artificial choice data, (ii) that DCM-ARD can discover new specifications that outperform previous ones from the literature, and (iii) that the developed DSVI algorithm is able to scale to very large datasets and search spaces.

The remainder of this paper is organized as follows. In the next section, we review the relevant literature for this work. Section 3 presents the proposed DCM-ARD model and derives a scalable doubly-stochastic variational inference algorithm for performing fast approximate Bayesian inference on it. The corresponding experimental results are presented in Section 4. The paper ends with the conclusions (Section 5).

2 Literature review

The problem of automatically determining the relevant variables for inclusion in a model has been studied extensively in the supervised machine learning literature under the common title of “feature selection”. The main premise of feature selection techniques is that the data contain redundant or irrelevant variables, which can therefore be removed without a resulting loss of information [Dash & Liu, 1997]. The numerous existing approaches are generally classified as wrapper, filter and embedded methods, according to the strategy they employ to search for subsets of variables [Guyon & Elisseeff, 2003]. Wrappers use the model of interest to score subsets according to the predictive power they achieve. Despite being computationally intensive, wrappers offer a simple way of addressing the problem: a plethora of methods based on simulated annealing [Lin et al., 2008, Brusco, 2014], tabu search [Fouskakis & Draper, 2008, Pacheco et al., 2009], evolutionary algorithms [Pal et al., 1998, Vinterbo & Ohno-Machado, 1999, Soufan et al., 2015] and other combinatorial optimization algorithms have already been applied successfully, both for linear and logistic regressions. In comparison, filter methods are independent of the model under consideration; they use “proxy” measures such as correlation or mutual information [Xing et al., 2001, Peng et al., , Vergara & Estévez, ] to evaluate single features or subsets. While less computationally intensive than wrappers, filters usually achieve worse results in terms of predictive power. Finally, embedded methods are characterized by the fact that the selection of variables and the estimation of the model are performed simultaneously, in a single process. A good example of this class of methods is the LASSO, initially proposed by Tibshirani [1996] and successfully applied to both linear [Zhang & Huang, 2008] and logistic [Huttunen et al., , Hossain et al., 2014] regressions. Other existing embedded methods make use of mixed integer optimization [Sato et al., 2016] or decision trees [Muni et al., , Deng & Runger, 2012] to effectively incorporate feature selection as part of the training process.

In the field of discrete choice analysis, interest has recently emerged in methods that are able to “mitigate” the need for presumptive structural assumptions. Two main directions of research are explored in the existing literature: the first substitutes DCMs with machine learning classifiers that do not require any prior knowledge concerning the domain [Paredes et al., 2017, Brathwaite et al., , Lhéritier et al., , Sifringer et al., ], while the second focuses on automating the utility specification of DCMs by means of data-driven feature selection algorithms [Tutz et al., 2015, Paz et al., 2019, Ortelli et al., 2019].

A particularly elegant class of methods for performing automatic feature selection in the statistics and machine learning literature relies on the concept of automatic relevance determination (ARD) [Tipping, 2001, MacKay, 1996, Bishop, 2006]. The idea behind this class of approaches consists in specifying the a-priori uncertainty and inferring the a-posteriori uncertainty about regression coefficients explicitly and hierarchically in a Bayesian framework. Unfortunately, Bayesian inference in such hierarchical models quickly becomes intractable, and effective and scalable methods are required in order to perform approximate inference. To that end, Bishop [2006] presents a type-II maximum likelihood approach based on variational inference in a linear regression context, where the hyper-parameters of the hierarchical priors are tuned by maximizing the marginal likelihood of the data. This approach was later extended by Drugowitsch [2013] to a fully Bayesian approach by further considering a normal inverse-gamma prior over the hyper-parameters of the hierarchical priors, and then performing variational inference to determine the corresponding posterior distributions. The same author also considers ARD in a binary logistic regression context. The difficulty in the latter stems from the non-conjugacy of the sigmoid, which requires an additional model-specific parametric lower bound on the sigmoid, as proposed by Jaakkola & Jordan [2000]; this can raise the computational cost and compromise accuracy. Recently, highly efficient general-purpose black-box variational inference methods have been proposed in the literature [Ranganath et al., 2014, Titsias & Lázaro-Gredilla, 2014], which allow the required expectations to be approximated using inexpensive Monte Carlo estimates. In particular, Titsias & Lázaro-Gredilla [2014] proposed a doubly stochastic variational inference algorithm for performing ARD in binary logistic regression. The approach proposed in this paper builds on the work of Titsias & Lázaro-Gredilla [2014] to propose an ARD framework for discrete choice models and to develop a corresponding efficient variational inference algorithm.

3 Approach

3.1 Discrete choice models

Following the Random Utility Maximization (RUM) theory, discrete choice models are based on the assumption that each individual $n\in\{1,\dots,N\}$ is a rational decision-maker who aims at maximizing some utility with respect to the choice set $\mathcal{C}_{n}$ that is presented to her. A key step in discrete choice modeling is then to specify a function $U_{in}$ that is able to capture the utility of each alternative $i$ for each individual $n$. The utility function is further assumed to be partitioned into two components: a systematic (or deterministic) utility $V_{in}$ and a random component $\epsilon_{in}$:

$$U_{in}=V_{in}+\epsilon_{in}, \tag{1}$$

where $\epsilon_{in}$ is an i.i.d. term that captures the uncertainty stemming from the impossibility of $V_{in}$ to fully capture the choice context. As for the systematic component $V_{in}$ , it is typically assumed to be a linear function of the observable explanatory variables $\textbf{x}_{in}=\{x_{din}\}_{d=1}^{D_{i}}$ of the utility of alternative $i$ for each individual $n$ (e.g. alternative characteristics, individual’s socio-demographic attributes, etc.):

$$V_{in}=\boldsymbol{\beta}_{i}^{\top}\textbf{x}_{in}=\sum_{d=1}^{D_{i}}\beta_{di}\,x_{din}, \tag{2}$$

where $\boldsymbol{\beta}_{i}$ is a vector of alternative-specific preference parameters. This accounts for the more general setting where preference parameters may vary between different alternatives. Following the same reasoning, our specification further allows for a variable number of explanatory variables $D_{i}$ per alternative $i$ .

Under the standard multinomial logit assumption that $\epsilon_{in}\sim\mbox{EV}(0,1)$ , the probability of individual $n$ selecting alternative $i$ is given by

$$P_{n}(i)=\frac{\exp(V_{in})}{\sum_{j\in\mathcal{C}_{n}}\exp(V_{jn})}. \tag{3}$$

Given a dataset of observed choices and corresponding explanatory variables for a population of size $N$ , the modeler’s objective is to determine the preference parameters $\boldsymbol{\beta}$ , which are typically estimated by maximizing the log-likelihood function:

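$$\log p(\textbf{y}|\boldsymbol{\beta})=\sum_{n=1}^{N}\sum_{i\in\mathcal{C}_{n}}y_{in}\log P_{n}(i), \tag{4}$$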

where $y_{in}$ is a one-hot encoding of the observed choice for the $n^{th}$ individual (i.e. $y_{in}$ takes the value 1 if the individual $n$ chose the alternative $i$, and 0 otherwise), and $\textbf{y}$ and $\boldsymbol{\beta}$ are used to denote the set of all observed choices and preference parameters, respectively.
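To make the logit probabilities in (3) and the log-likelihood in (4) concrete, the following minimal Python sketch evaluates both on a toy example. The nested-list data layout, the function names and the numerical values are assumptions made purely for illustration; they are not taken from the implementation used in this paper.

```python
import math

def choice_probabilities(utilities):
    """Logit (softmax) probabilities for the systematic utilities V_in."""
    m = max(utilities)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(betas, X, choices):
    """betas[i]: parameters of alternative i.
    X[n][i]: explanatory variables of alternative i for individual n.
    choices[n]: index of the alternative chosen by individual n."""
    ll = 0.0
    for x_n, y_n in zip(X, choices):
        V = [sum(b * x for b, x in zip(betas[i], x_n[i])) for i in range(len(betas))]
        ll += math.log(choice_probabilities(V)[y_n])
    return ll

# Toy data (made-up values): 2 individuals, 2 alternatives, 2 variables each.
betas = [[-1.0, 0.5], [-0.5, 0.2]]
X = [[[1.0, 2.0], [2.0, 1.0]],
     [[0.5, 1.0], [1.5, 3.0]]]
choices = [0, 1]
print(log_likelihood(betas, X, choices))
```

Maximum likelihood estimation would search for the `betas` that maximize this quantity; the Bayesian treatment that follows instead places a prior over them.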

Despite the appealing simplicity of maximum likelihood estimation methods, in this paper we shall follow a Bayesian approach. The latter not only allows us to infer full posterior distributions for the preference parameters $\boldsymbol{\beta}$, which provide a principled way of performing hypothesis testing [Song et al., 2017] and uncertainty quantification, but also enables online learning approaches in which the posterior over the parameters is continuously updated as more data becomes available [Danaf, 2017]. Most importantly, it will support the development of the automatic utility function specification approach based on ARD proposed in Section 3.2.

We begin by introducing the standard Bayesian framework for the discrete choice model specified above, which will serve as the starting point for the proposed approach in Section 3.2. To enable the Bayesian treatment of the model above, we start by placing a prior distribution over the preference parameters for each of the alternatives:

$$\boldsymbol{\beta}_{i}\sim\mathcal{N}(\boldsymbol{\beta}_{i}|\textbf{0},\lambda\textbf{I}), \tag{5}$$

where $\textbf{I}$ denotes the identity matrix, thus making $\lambda\textbf{I}$ a diagonal covariance matrix parametrized by $\lambda$.

In order to summarize the entire model, we present below its generative process - a compact description of the model’s assumptions regarding how the observed data was generated.

1. For each alternative $i$ in the entire choice set $\mathcal{C}$

   (a) Draw preference parameters $\boldsymbol{\beta}_{i}\sim\mathcal{N}(\boldsymbol{\beta}_{i}|\textbf{0},\lambda\textbf{I})$

2. For each individual $n\in\{1,\dots,N\}$

   (a) Draw observed choice variable $y_{n}\sim\mbox{Categorical}(y_{n}|P_{n})$

The joint probability distribution is then given by

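$$p(\textbf{y},\boldsymbol{\beta}|\lambda)=\left(\prod_{i\in\mathcal{C}}\mathcal{N}(\boldsymbol{\beta}_{i}|\textbf{0},\lambda\textbf{I})\right)\prod_{n=1}^{N}\mbox{Categorical}(y_{n}|P_{n}), \tag{6}$$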

where we purposely omitted the explicit dependency on the explanatory variables $\textbf{x}$ to avoid cluttering the notation. Making use of Bayes’ theorem, the posterior distribution over the preference parameters $\boldsymbol{\beta}$ is

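$$p(\boldsymbol{\beta}|\textbf{y})=\frac{p(\textbf{y},\boldsymbol{\beta}|\lambda)}{\int p(\textbf{y},\boldsymbol{\beta}|\lambda)\,d\boldsymbol{\beta}}. \tag{7}$$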

However, the non-conjugacy between the prior (5) and the softmax likelihood in (3) renders the integral in the denominator intractable, thus making exact inference infeasible. Fortunately, recent years have seen very significant improvements in the accuracy and scalability of approximate Bayesian inference methods, which we shall exploit in Section 3.3.

3.2 Automatic utility function specification

The main focus of this paper is on leveraging the Bayesian framework and the concept of automatic relevance determination (ARD) [Tipping, 2001] to lift from the modeler the burden of manually searching for an optimal utility function specification for a given discrete choice problem. Namely, we wish to automatically determine the relevant variables for the utility function of each alternative $i$, while also considering different non-linear transformations (e.g. log-transforms, Box-Cox transforms), different continuous variable discretizations, interactions between variables, etc. In order to allow for some of these modeling options and, in particular, variable interactions, let us begin by considering a more flexible parameterization of the utility function in (2). Letting $s_{n}$ be a categorical socio-economic variable with $K$ categories associated with individual $n$ (e.g. age, income, education or profession), we can allow for interactions with the remaining variables by introducing an unknown parameter per category $\beta_{1},\dots,\beta_{K}$ and defining the utility function for an alternative $i$ as

$$V_{in}=\sum_{d=1}^{D_{i}}\sum_{k=1}^{K_{d}}\beta_{kdi}\,\delta_{k}(s_{n})\,h(x_{din}), \tag{8}$$

where $\delta_{k}(s_{n})$ is an indicator function, which takes the value 1 if the $n^{th}$ individual belongs to category $k$ and 0 otherwise, and $h(\cdot)$ is an arbitrary function (e.g. logarithm for a log-transform). Note that the utility specification in (2) is a special case of (8), obtained when $K_{d}=1$ and $h(\cdot)$ is the identity function. Similarly, this specification also contains one-hot encodings and discretizations of a variable $d$ as special cases by adapting the functions $\delta_{k}(\cdot)$ and $h(\cdot)$ accordingly.
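As an illustration of the flexible parameterization in (8), the sketch below evaluates the systematic utility of a single alternative for a single individual. It assumes, for illustration only, that each variable $d$ carries its own transformation `transforms[d]` and its own category index `category[d]` (equal to 0 when $K_{d}=1$); the function name and the numbers are hypothetical.

```python
import math

def systematic_utility(beta, x, category, transforms):
    """beta[d][k]: parameter of variable d for category k (K_d = len(beta[d])).
    x[d]: value of variable d; category[d]: 0-based category of the individual
    for the socio-economic variable interacted with d; transforms[d]: h(.)."""
    V = 0.0
    for d, x_d in enumerate(x):
        for k, beta_kd in enumerate(beta[d]):
            delta = 1.0 if category[d] == k else 0.0   # indicator delta_k(s_n)
            V += beta_kd * delta * transforms[d](x_d)
    return V

# Variable 0: log(cost) interacted with a 2-category variable (K_0 = 2).
# Variable 1: travel time, untransformed and not interacted (K_1 = 1).
beta = [[-0.8, -0.3], [-1.2]]
x = [20.0, 2.0]
category = [1, 0]
transforms = [math.log, lambda v: v]
print(systematic_utility(beta, x, category, transforms))
```

With $K_{d}=1$ and identity transforms for every variable, the function reduces to the linear-in-parameters utility of (2).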

Based on (8), the problem of automatic utility function specification can then be defined as determining the subset of input dimensions $\mathcal{S}_{i}\subseteq\{1,\dots,D_{i}\}$ that best models the observed choices according to a dataset of observed choices, where $\{1,\dots,D_{i}\}$ is a very large set of possible variable transformations and representations whose usefulness to the model we wish to test. For example, for a cost variable, a modeler may consider including in $\{1,\dots,D_{i}\}$ the variable itself, its log-transformed value, cost interacted with gender, cost interacted with age, cost interacted with both gender and age, a piecewise linear transformation, etc. The goal is then to determine which subset $\mathcal{S}_{i}$ of these should be included in the utility function specification $V_{i}$ .

The starting point for our proposed approach is the concept of automatic relevance determination (ARD), as used for instance in the statistical machine learning literature for the relevance vector machine [Tipping, 2001]. The key idea lies in realizing that preference parameters of irrelevant dimensions $d$ should be pushed towards zero. However, the standard prior specification in (5) is too restrictive to allow for some parameters to be pushed arbitrarily close to zero, while others retain their actual values. This restriction stems from the fact that, in (5), the parameters are assumed to have independent univariate Gaussian priors that share the same prior variance $\lambda$. Therefore, we can make progress towards ARD in discrete choice models by constructing a flexible hierarchical prior, in which each parameter is assigned an independent Gaussian prior with its own variance, but parameters belonging to the representation of the same variable share the same variance. Mathematically, this corresponds to

$$\beta_{kdi}\sim\mathcal{N}(\beta_{kdi}|0,\lambda_{di}),\quad\forall k\in\{1,\dots,K_{d}\}. \tag{9}$$

Please note that the constraint of sharing the same variance over the index $k$ is crucial in order to ensure that the entire group is treated as a whole, i.e. either all $k$ “sub-dimensions” of a variable $d$ are deemed relevant by the model, or none is and their corresponding parameters are all pushed towards zero. The prior over all the preference parameters is then given by

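$$p(\boldsymbol{\beta}|\boldsymbol{\lambda})=\prod_{i\in\mathcal{C}}\prod_{d=1}^{D_{i}}\prod_{k=1}^{K_{d}}\mathcal{N}(\beta_{kdi}|0,\lambda_{di}), \tag{10}$$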

where $\boldsymbol{\lambda}$ is used to denote the set of all $\lambda_{di}$ . While one could further place a Gamma prior over the precisions $\lambda_{di}^{-1}$ , we refrain from doing so because (i) it would introduce a new set of hyper-parameters to specify and (ii), as we shall see in Section 3.3, it is possible to optimize over the variance parameters $\lambda$ analytically. Hence, we shall continue by treating the latter as point parameters rather than random variables in a fully Bayesian setting. The generative process of the proposed model can then be summarized as follows:

1. For each alternative $i$ in the entire choice set $\mathcal{C}$

   (a) For each variable $d\in\{1,\dots,D_{i}\}$

       i. Set preference parameter variance $\lambda_{di}$

       ii. For each category $k\in\{1,\dots,K_{d}\}$

           A. Draw preference parameter $\beta_{kdi}\sim\mathcal{N}(\beta_{kdi}|0,\lambda_{di})$

2. For each individual $n\in\{1,\dots,N\}$

   (a) Draw observed choice variable $y_{n}\sim\mbox{Categorical}(y_{n}|P_{n})$
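The first step of this generative process can be sketched in a few lines of Python for a single alternative. The function name and the variance values below are hypothetical; the point is only to show how a shared, near-zero variance $\lambda_{di}$ forces all $K_{d}$ parameters of a variable towards zero together.

```python
import math
import random

def sample_ard_prior(lambdas, n_categories, rng):
    """Draw beta[d][k] ~ N(0, lambdas[d]) for k = 1..K_d, so that all
    categories of variable d share the same prior variance."""
    return [[rng.gauss(0.0, math.sqrt(lam)) for _ in range(K_d)]
            for lam, K_d in zip(lambdas, n_categories)]

rng = random.Random(0)
lambdas = [1.0, 1e-8, 0.5]     # hypothetical variances; variable 1 is "switched off"
n_categories = [1, 3, 2]       # K_d for each variable
beta = sample_ard_prior(lambdas, n_categories, rng)
for d, beta_d in enumerate(beta):
    print(d, beta_d)
```

The choices $y_{n}$ would then be drawn from the logit probabilities implied by these parameters.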

In order to place further emphasis on the hierarchical structure of the proposed model, Figure 1 shows a graphical model representation, which highlights the dependencies between the different variables.

Based on the model specification above, our goal is to be able to jointly infer the preference parameters $\boldsymbol{\beta}$ and estimate the variance parameters $\lambda_{di}$ for each explanatory variable, in order to assess which ones should be included in each utility function $V_{i}$ . As for the “standard” discrete choice model in Section 3.1, performing exact Bayesian inference in the proposed model is intractable. Therefore, we shall proceed by developing an approximate Bayesian inference algorithm using doubly stochastic variational inference [Titsias & Lázaro-Gredilla, 2014].

3.3 Doubly stochastic variational inference

The intractability of exact inference for the proposed model stems from the impossibility of obtaining an analytical expression for the marginal likelihood in the denominator of (7), which for the proposed ARD model takes the form

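$$p(\textbf{y}|\boldsymbol{\lambda})=\int p(\boldsymbol{\beta}|\boldsymbol{\lambda})\prod_{n=1}^{N}p(y_{n}|\boldsymbol{\beta})\,d\boldsymbol{\beta}. \tag{11}$$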

In order to obtain an efficient and scalable approximate inference algorithm that is able to cope with large datasets and with very high dimensionalities $D_{i}$ , we propose the use of the variational inference framework [Jordan et al., 1999].

Variational inference, or variational Bayes, constructs an approximation to the true posterior distribution $p(\boldsymbol{\beta}|\textbf{y})$ by considering a family of tractable distributions $q(\boldsymbol{\beta})$ , which can be obtained by relaxing some constraints in the true distribution. In this case, we shall assume the variational distribution $q(\boldsymbol{\beta})$ to be a fully-factorized (mean-field) approximation to the true posterior:

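$$q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})=\prod_{i\in\mathcal{C}}\prod_{d=1}^{D_{i}}\prod_{k=1}^{K_{d}}\mathcal{N}(\beta_{kdi}|\mu_{kdi},c_{kdi}^{2}), \tag{12}$$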

with variational parameters $\boldsymbol{\mu}$ and $\textbf{c}$. The inference problem is then to find the parameters of the variational distribution so that the approximation becomes as close as possible to the true posterior, thereby reducing inference to an optimization problem.

The closeness between the approximate posterior $q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})$ and the true posterior $p(\boldsymbol{\beta}|\textbf{y})$ can be measured by the Kullback-Leibler (KL) divergence [MacKay, 2003] given by

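$$\mbox{KL}(q\,\|\,p)=\int q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})\log\frac{q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})}{p(\boldsymbol{\beta}|\textbf{y})}\,d\boldsymbol{\beta}. \tag{13}$$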

Although the KL cannot be minimized directly, following the theory on variational inference [Jordan et al., 1999, MacKay, 2003], the KL minimization can be equivalently formulated as maximizing the following lower bound on the log marginal likelihood (or log evidence) in (11):

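$$\log p(\textbf{y}|\boldsymbol{\lambda})=\log\int q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})\,\frac{p(\textbf{y},\boldsymbol{\beta}|\boldsymbol{\lambda})}{q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})}\,d\boldsymbol{\beta}\;\geq\;\int q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})\log\frac{p(\textbf{y},\boldsymbol{\beta}|\boldsymbol{\lambda})}{q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})}\,d\boldsymbol{\beta}=\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda}), \tag{14}$$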

where we made use of Jensen’s inequality. We can further write the evidence lower bound, $\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})$ , as a function of simpler terms by exploiting the factorization of the joint and prior distributions, yielding

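$$\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})=\sum_{n=1}^{N}\sum_{i\in\mathcal{C}_{n}}y_{in}\,\mathbb{E}_{q(\boldsymbol{\beta})}[\log P_{n}(i)]+\mathbb{E}_{q(\boldsymbol{\beta})}[\log p(\boldsymbol{\beta}|\boldsymbol{\lambda})]-\mathbb{E}_{q(\boldsymbol{\beta})}[\log q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})]. \tag{15}$$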
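As a brief aside on why maximizing the lower bound in (14) is equivalent to minimizing the KL divergence in (13), the standard decomposition of the log evidence can be written as follows (stated here in the notation of this section, with the posterior conditioned on $\boldsymbol{\lambda}$ made explicit):

$$\log p(\textbf{y}|\boldsymbol{\lambda})=\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})+\mbox{KL}\big(q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})\,\|\,p(\boldsymbol{\beta}|\textbf{y},\boldsymbol{\lambda})\big).$$

Since the left-hand side does not depend on $\boldsymbol{\mu}$ or $\textbf{c}$ and the KL divergence is non-negative, any increase of $\mathcal{L}$ with respect to the variational parameters corresponds to an equal decrease of the KL divergence.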

Our goal is then to find the variational parameters $\{\boldsymbol{\mu},\textbf{c}\}$ and the hyper-parameters $\boldsymbol{\lambda}$ that maximize $\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})$ . However, due to the log-sum-exp term resultant from the denominator of the softmax, the expectation $\mathbb{E}_{q(\boldsymbol{\beta})}[\log P_{n}(i)]$ in (15) is still intractable. While some authors proposed the use of computationally expensive approximations to further bound this term [Blei et al., 2007, Knowles & Minka, 2011], we shall rely on a more efficient and scalable approximation based on the theory of stochastic optimization. In order to enable it, we begin by reparameterizing our approximate distribution in (12).

Consider a random variable $z\sim\mathcal{N}(z|0,1)$. We can change the mean and variance by applying an invertible transformation $\beta=cz+\mu$ and making use of the change of variables formula for a random vector, which states that, for a random vector $x$ with density $f_{x}$ and an invertible transformation $y=h(x)$, the density of $y$ is $f_{y}(y)=f_{x}(h^{-1}(y))\,|J_{h^{-1}}|$, where $|J_{h^{-1}}|$ denotes the absolute value of the determinant of the Jacobian matrix of the inverse transformation $h^{-1}$. Hence, given the transformation $\beta=cz+\mu$ and its inverse $z=c^{-1}(\beta-\mu)$, we can rewrite the approximate distribution in (12) as

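$$q(\boldsymbol{\beta}|\boldsymbol{\mu},\textbf{c})=\prod_{i,d,k}\frac{1}{|c_{kdi}|}\,\mathcal{N}\big(c_{kdi}^{-1}(\beta_{kdi}-\mu_{kdi})\,\big|\,0,1\big). \tag{16}$$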
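The effect of the transformation $\beta=cz+\mu$ can be checked empirically with a short sketch: draws of $z$ from a standard normal are shifted and scaled, and the resulting sample should have mean close to $\mu$ and standard deviation close to $c$. The values of `mu` and `c` and the sample size are arbitrary choices for illustration.

```python
import random
import statistics

def sample_beta(mu, c, rng):
    """Reparameterized draw: z ~ N(0, 1), then beta = c * z + mu."""
    z = rng.gauss(0.0, 1.0)
    return c * z + mu

rng = random.Random(42)
mu, c = 1.5, 0.5
draws = [sample_beta(mu, c, rng) for _ in range(20000)]
print(statistics.mean(draws), statistics.stdev(draws))
```

Because the randomness now enters only through $z$, the sampled $\beta$ is a deterministic and differentiable function of $\mu$ and $c$.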

By plugging (16) into (14) and changing variables according to $z=c^{-1}(\beta-\mu)$ , we can rewrite $\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})$ as follows:

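$$\begin{aligned}\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})&=\int\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})\log\frac{p(\textbf{y},\textbf{c}\circ\textbf{z}+\boldsymbol{\mu}|\boldsymbol{\lambda})\prod_{i,d,k}c_{kdi}}{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}\,d\textbf{z}\\&=\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}\big[\log p(\textbf{y},\textbf{c}\circ\textbf{z}+\boldsymbol{\mu}|\boldsymbol{\lambda})\big]+\sum_{i,d,k}\log c_{kdi}+\mbox{const}\\&=\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}\Big[\sum_{n=1}^{N}\log p(y_{n}|\textbf{c}\circ\textbf{z}+\boldsymbol{\mu})\Big]+\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}\big[\log p(\textbf{c}\circ\textbf{z}+\boldsymbol{\mu}|\boldsymbol{\lambda})\big]+\sum_{i,d,k}\log c_{kdi}+\mbox{const},\end{aligned} \tag{17}$$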

where $\circ$ is used to denote the element-wise product and we used the factorization of the joint distribution $p(\textbf{y},\textbf{c}\circ\textbf{z}+\boldsymbol{\mu}|\boldsymbol{\lambda})$ in the last step. The term $-\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}[\log\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})]$ was ignored because it is constant w.r.t. the variational parameters. Making use of the Gaussian pdf and linearity of expectation leads to the final evidence lower bound

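$$\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})=\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}\Big[\sum_{n=1}^{N}\log p(y_{n}|\textbf{c}\circ\textbf{z}+\boldsymbol{\mu})\Big]+\sum_{i,d,k}\Big(\log c_{kdi}-\frac{1}{2}\log\lambda_{di}-\frac{c_{kdi}^{2}+\mu_{kdi}^{2}}{2\lambda_{di}}\Big)+\mbox{const}. \tag{18}$$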

The key insight is that, through the change of variables, the variational parameters have been transferred inside the log likelihood, thus enabling stochastic optimization by sampling gradients from it.

Regarding the variance hyper-parameters $\boldsymbol{\lambda}$ , as it turns out, it is possible to optimize them analytically. This contrasts with other applications of ARD, where the prior variances are estimated using Expectation-Maximization (EM) - a procedure that can exhibit slow convergence due to the strong dependency between the variational parameters $\{\boldsymbol{\mu},\textbf{c}\}$ and the hyper-parameters $\boldsymbol{\lambda}$ [Titsias & Lázaro-Gredilla, 2014]. Taking derivatives of (18) w.r.t. $\lambda_{di}$ and setting them to zero yields the following optimum:

$$\lambda_{di}^{*}=\frac{1}{K_{d}}\sum_{k=1}^{K_{d}}\big(\mu_{kdi}^{2}+c_{kdi}^{2}\big). \tag{19}$$

Substituting back these optimal values in $\mathcal{L}(\boldsymbol{\mu},\textbf{c},\boldsymbol{\lambda})$ gives the optimized evidence lower bound

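$$\mathcal{L}(\boldsymbol{\mu},\textbf{c})=\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}\Big[\sum_{n=1}^{N}\log p(y_{n}|\textbf{c}\circ\textbf{z}+\boldsymbol{\mu})\Big]+\sum_{i,d}\Big(\sum_{k=1}^{K_{d}}\log c_{kdi}-\frac{K_{d}}{2}\log\sum_{k=1}^{K_{d}}\big(\mu_{kdi}^{2}+c_{kdi}^{2}\big)\Big)+\mbox{const}. \tag{20}$$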
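The analytic update of the prior variances can be sketched as follows for a single alternative. The sketch assumes that the $\lambda$-dependent part of the bound is the expected Gaussian log-prior, $\sum_{k}[-\frac{1}{2}\log\lambda_{d}-(\mu_{kd}^{2}+c_{kd}^{2})/(2\lambda_{d})]$, whose maximizer is the group average of $\mu_{kd}^{2}+c_{kd}^{2}$; function names and numbers are hypothetical.

```python
import math

def optimal_prior_variances(mu, c):
    """mu[d][k], c[d][k]: variational means and standard deviations of the
    K_d parameters of variable d. Returns one variance per variable."""
    return [sum(m * m + s * s for m, s in zip(mu_d, c_d)) / len(mu_d)
            for mu_d, c_d in zip(mu, c)]

def prior_terms(mu, c, lambdas):
    """lambda-dependent part of the bound (expected Gaussian log-prior,
    up to an additive constant)."""
    total = 0.0
    for mu_d, c_d, lam in zip(mu, c, lambdas):
        for m, s in zip(mu_d, c_d):
            total += -0.5 * math.log(lam) - (m * m + s * s) / (2.0 * lam)
    return total

mu = [[1.2, -0.7], [0.01]]     # variable 0 has K_0 = 2, variable 1 has K_1 = 1
c = [[0.3, 0.4], [0.02]]
lam_star = optimal_prior_variances(mu, c)
print(lam_star)
```

A variable whose means and standard deviations are all close to zero receives a variance close to zero, which is how irrelevant variables are identified at convergence.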

In order to fit the variational distribution to the true posterior, we must optimize the lower bound in (20) w.r.t. $\boldsymbol{\mu}$ and $\textbf{c}$. Taking derivatives gives:

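$$\begin{aligned}\frac{\partial\mathcal{L}}{\partial\mu_{kdi}}&=\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}\Big[\frac{\partial}{\partial\mu_{kdi}}\sum_{n=1}^{N}\log p(y_{n}|\textbf{c}\circ\textbf{z}+\boldsymbol{\mu})\Big]-\frac{\mu_{kdi}}{\lambda_{di}^{*}},\\\frac{\partial\mathcal{L}}{\partial c_{kdi}}&=\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}\Big[\frac{\partial}{\partial c_{kdi}}\sum_{n=1}^{N}\log p(y_{n}|\textbf{c}\circ\textbf{z}+\boldsymbol{\mu})\Big]+\frac{1}{c_{kdi}}-\frac{c_{kdi}}{\lambda_{di}^{*}},\end{aligned} \tag{21}$$

where $\lambda_{di}^{*}$ is given by (19).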

We can further rewrite these derivatives by changing variables in the reverse direction, $\beta=cz+\mu$ , and making use of the chain rule, thus leading to the final gradients:

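$$\begin{aligned}\frac{\partial\mathcal{L}}{\partial\mu_{kdi}}&=\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}\Big[\frac{\partial}{\partial\beta_{kdi}}\sum_{n=1}^{N}\log p(y_{n}|\boldsymbol{\beta})\Big]-\frac{\mu_{kdi}}{\lambda_{di}^{*}},\\\frac{\partial\mathcal{L}}{\partial c_{kdi}}&=\mathbb{E}_{\mathcal{N}(\textbf{z}|\textbf{0},\textbf{I})}\Big[z_{kdi}\,\frac{\partial}{\partial\beta_{kdi}}\sum_{n=1}^{N}\log p(y_{n}|\boldsymbol{\beta})\Big]+\frac{1}{c_{kdi}}-\frac{c_{kdi}}{\lambda_{di}^{*}},\end{aligned}\qquad\boldsymbol{\beta}=\textbf{c}\circ\textbf{z}+\boldsymbol{\mu}. \tag{22}$$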

As for the gradients of the log likelihood of the discrete choice model specified in Section 3.2, they are given by

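$$\frac{\partial}{\partial\beta_{kdi}}\log p(y_{n}|\boldsymbol{\beta})=\big(y_{in}-P_{n}(i)\big)\,\delta_{k}(s_{n})\,h(x_{din}). \tag{23}$$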
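The gradient of the logit log-likelihood takes the familiar "observed minus predicted" form, $(y_{in}-P_{n}(i))$ times the corresponding input. The sketch below checks this against central finite differences for a single individual, using the linear utility of (2) for simplicity; the data layout, function names and numbers are illustrative assumptions.

```python
import math

def choice_probs(beta, x_n):
    V = [sum(b * x for b, x in zip(beta[i], x_n[i])) for i in range(len(beta))]
    m = max(V)
    e = [math.exp(v - m) for v in V]
    s = sum(e)
    return [v / s for v in e]

def log_prob(beta, x_n, y_n):
    return math.log(choice_probs(beta, x_n)[y_n])

def analytic_gradient(beta, x_n, y_n):
    """d log P_n(y_n) / d beta[i][d] = (y_in - P_n(i)) * x_n[i][d]."""
    P = choice_probs(beta, x_n)
    return [[((1.0 if i == y_n else 0.0) - P[i]) * x for x in x_n[i]]
            for i in range(len(beta))]

def numeric_gradient(beta, x_n, y_n, eps=1e-6):
    grad = []
    for i in range(len(beta)):
        row = []
        for d in range(len(beta[i])):
            up = [list(b) for b in beta]
            dn = [list(b) for b in beta]
            up[i][d] += eps
            dn[i][d] -= eps
            row.append((log_prob(up, x_n, y_n) - log_prob(dn, x_n, y_n)) / (2 * eps))
        grad.append(row)
    return grad

beta = [[-1.0, 0.5], [-0.5, 0.2], [0.3, -0.4]]      # 3 alternatives, 2 variables
x_n = [[1.0, 2.0], [2.0, 1.0], [0.5, 1.5]]
g_a = analytic_gradient(beta, x_n, 1)
g_n = numeric_gradient(beta, x_n, 1)
print(max(abs(a - b) for ra, rb in zip(g_a, g_n) for a, b in zip(ra, rb)))
```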

The lower bound $\mathcal{L}(\boldsymbol{\mu},\textbf{c})$ can then be optimized by first sampling a set of preference parameters $\boldsymbol{\beta}=\textbf{c}\circ\textbf{z}+\boldsymbol{\mu}$, $\textbf{z}\sim\mathcal{N}(\textbf{0},\textbf{I})$, and using the stochastic gradients above to update all the variational parameters $\boldsymbol{\mu}$ and $\textbf{c}$ in parallel:

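$$\boldsymbol{\mu}^{(t+1)}=\boldsymbol{\mu}^{(t)}+\rho_{t}\,\widehat{\nabla}_{\boldsymbol{\mu}}\mathcal{L}(\boldsymbol{\mu}^{(t)},\textbf{c}^{(t)}),\qquad\textbf{c}^{(t+1)}=\textbf{c}^{(t)}+\rho_{t}\,\widehat{\nabla}_{\textbf{c}}\mathcal{L}(\boldsymbol{\mu}^{(t)},\textbf{c}^{(t)}), \tag{24}$$

where $\widehat{\nabla}$ denotes the single-sample stochastic estimate of the gradients in (22).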

Following the theory of stochastic optimization [Robbins & Monro, 1985], using a schedule of the learning rates $\{\rho_{t}\}$ such that $\sum\rho_{t}=\infty$, $\sum\rho_{t}^{2}<\infty$, the iteration in Algorithm 1 will converge to a local maximum of the bound in (20), or to the global maximum when this bound is concave. At convergence, we can assess the relevancy of each explanatory variable $d$ in the utility function for alternative $i$ by evaluating the magnitude of the estimated variance parameter $\lambda_{di}$ using (19).

Lastly, we can further scale up the variational inference algorithm described above by introducing a second type of stochasticity as proposed by Hoffman et al. [2013]. This second type of stochasticity stems from using “mini-batches” of data to compute the stochastic gradients rather than the entire dataset at once, hence resulting in a doubly stochastic variational inference algorithm. The final procedure is summarized in Algorithm 1. As we shall see in our experimental results (Section 4), the proposed inference algorithm is able to scale to very large datasets and perform automatic utility function specification considering a very high number of possible explanatory variables $D_{i}$.
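To give a feel for the overall procedure, the following is a minimal, self-contained sketch of a doubly stochastic loop on synthetic data. It is not the implementation used in this paper and makes several simplifying assumptions that are labeled in the code: a single coefficient vector shared by all alternatives, $K_{d}=1$ for every variable, a log-parameterization $c=\exp(\omega)$ to keep $c$ positive, a small floor on $\lambda$ for numerical stability, and hand-picked step sizes.

```python
import math
import random

def softmax(v):
    m = max(v)
    e = [math.exp(u - m) for u in v]
    s = sum(e)
    return [u / s for u in e]

def make_data(n, true_beta, rng, n_alts=2):
    """Synthetic data: X[n][i][d] ~ N(0, 1); choices drawn from the logit model."""
    X, y = [], []
    for _ in range(n):
        x_n = [[rng.gauss(0, 1) for _ in true_beta] for _ in range(n_alts)]
        p = softmax([sum(b * x for b, x in zip(true_beta, xi)) for xi in x_n])
        y.append(rng.choices(range(n_alts), weights=p)[0])
        X.append(x_n)
    return X, y

def loglik_grad(beta, X, y, idx):
    """Gradient of sum_{n in idx} log P_n(y_n): sum_i (y_in - P_n(i)) * x_in."""
    g = [0.0] * len(beta)
    for n in idx:
        p = softmax([sum(b * x for b, x in zip(beta, xi)) for xi in X[n]])
        for i, xi in enumerate(X[n]):
            r = (1.0 if i == y[n] else 0.0) - p[i]
            for d, x in enumerate(xi):
                g[d] += r * x
    return g

def dsvi_ard(X, y, n_iter=2000, batch=50, rho0=1e-3, lam_floor=1e-3, seed=0):
    rng = random.Random(seed)
    N, D = len(X), len(X[0][0])
    mu, omega = [0.0] * D, [math.log(0.1)] * D     # assumption: c = exp(omega) > 0
    for t in range(n_iter):
        rho = rho0 / (1.0 + t / 500.0)             # sum rho = inf, sum rho^2 < inf
        idx = rng.sample(range(N), batch)          # stochasticity 1: mini-batch
        z = [rng.gauss(0, 1) for _ in range(D)]    # stochasticity 2: z ~ N(0, I)
        c = [math.exp(w) for w in omega]
        beta = [m + ci * zi for m, ci, zi in zip(mu, c, z)]
        g = [gi * N / batch for gi in loglik_grad(beta, X, y, idx)]
        for d in range(D):
            lam = max(mu[d] ** 2 + c[d] ** 2, lam_floor)   # analytic optimum, floored
            grad_mu = g[d] - mu[d] / lam
            grad_c = g[d] * z[d] + 1.0 / c[d] - c[d] / lam
            mu[d] += rho * grad_mu
            omega[d] += rho * grad_c * c[d]        # chain rule for c = exp(omega)
    return mu, [m * m + math.exp(w) ** 2 for m, w in zip(mu, omega)]

rng = random.Random(1)
X, y = make_data(500, [2.0, 0.0, 0.0], rng)        # only variable 0 is relevant
mu, lam = dsvi_ard(X, y)
print([round(v, 3) for v in mu], [round(v, 4) for v in lam])
```

In this toy setting only the first variable enters the data-generating utilities, so one would expect its learned variance to be much larger than those of the two noise variables, mirroring how relevance is read off the $\lambda$ values.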

4 Experiments

In this section, an empirical evaluation of the proposed DCM-ARD for automatic utility function specification is performed based on both semi-artificial and real choice data. For both sets of experiments, the dataset used is the Swissmetro (SM) dataset described in [Bierlaire et al., 2001]. This dataset consists of survey data collected on the trains between St. Gallen and Geneva, in which the respondents provided information in order to analyze the impact of the construction of the Swissmetro. The alternatives offered to each respondent were: train, Swissmetro and car (only for car owners). After discarding respondents for which some variables were not available (e.g. age, purpose), a total of 10692 responses from 1188 individuals were used for the experiments.

The proposed DCM-ARD model and its corresponding doubly-stochastic variational inference (DSVI) algorithm were implemented in Matlab. Source code for the implementation and for reproducing all experiments in this paper is publicly available at: http://fprodrigues.com/DCM-ARD/.

4.1 Semi-artificial choice data

In order to empirically demonstrate the ability of the proposed approach to discover the correct utility function specifications, we began by conducting an extensive series of experiments on semi-artificial choice data based on the Swissmetro dataset. We manually specified a set of “artificial” (but realistic) utility function specifications of varying complexity and, based on the Swissmetro dataset, we sampled new artificial choices for the respondents according to the manually-specified utility functions. This was done by fitting a standard DCM with the manually-specified utility function to the original data using maximum-likelihood estimation and, based on the learned parameters $\boldsymbol{\beta}^{*}$ , we then sampled new choices $y_{n}\sim\mbox{Categorical}(y_{n}|P_{n})$ .

We consider two experimental settings for the application of DCM-ARD:

1. an experimental setting with a medium-sized utility function search space, in which the number of possible variables to be included in the utility functions is 252; these include the original variables (e.g. intercept “ASC”, travel-time “TT”, cost “CO” and headway “HE”), their log-transformations, and interactions of both the original variables and their logarithms with trip purpose (“pur”, 9 categories), respondent age (“age”, 5 groups) and annual season ticket availability (“ga”, binary). Note that, although this results in 252 variables that can be included in the specification, the dimensionality of the utility function search-space includes all combinations of possible utility functions that can be generated using these variables and therefore grows exponentially with this number. For example, considering just the subset of all utility functions with only 10 variables results in $\binom{252}{10}\approx 2.4\times 10^{17}$ possible utility functions to be considered;

2. an experimental setting with a large utility function search space; besides the variables in the medium-sized search space, this search space also considers Box-Cox transformations, variable segmentations based on K-means clustering, and interactions of the original variables with respondent income (“inc”, 5 groups), luggage (“lug”: none, one piece or multiple pieces) and who pays for the trip (“who”: unknown, self, employer or half-half). This results in a total of 602 possible variables to be included in the utility function specifications.
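The size of these search spaces can be checked with a couple of lines of Python. The snippet only counts subsets of candidate variables; it assumes nothing about how the candidates themselves were generated.

```python
import math

n_specs_10 = math.comb(252, 10)      # specifications using exactly 10 of the 252 variables
n_specs_all = 2 ** 252               # all subsets of the medium-sized search space
print(f"{n_specs_10:.2e}", f"{float(n_specs_all):.2e}")
```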

Based on these two search spaces, we manually defined 9 artificial utility function specifications as shown in Table 1. Specifications S1-S6 are based on the medium-sized search space, while specifications S7-S9 are based on the large search space. However, in order to verify that DCM-ARD is able to discover the true utility function specification used to generate the choice data regardless of how large the search space considered is, we also test specifications S1-S3 with the large search space. (Footnote 1: We further tested other specifications, but omitted their results for conciseness, since they lead to similar conclusions. They are available at: http://fprodrigues.com/DCM-ARD/)

Given the semi-artificial choice data generated based on the manually-defined utility function specifications from Table 1, our goal is to test the ability of DCM-ARD to recover the correct utility function specifications in a purely data-driven way. Tables 2 and 3 show the top-K variables selected by DCM-ARD for the medium-sized search space (i.e. specifications S1-S6) ranked according to their respective learned $\lambda$ values. In order to simplify the analysis of the results, the variables deemed relevant by DCM-ARD are highlighted in bold. Irrelevant variables are expected to have $\lambda\approx 0$. As these results demonstrate, the proposed DCM-ARD is able to discover the true specifications almost perfectly, with all the truly “irrelevant” variables being assigned a $\lambda$ value of approximately zero. The only minor exceptions can be found in specifications S4 and S5. In the learned utility function for S4, we can observe that cost (“CO”) is assigned a $\lambda$ value of zero for the utility of car despite the fact that it was part of the true specification that was used to generate the semi-artificial data. We believe this to be a consequence of the inclusion of the interaction between “CO” and purpose (“pur”) in the true specification for car. Since there is a total of 9 different purposes and some of them have an extremely low number of observations, the effect of “CO” alone can be captured by the baseline and therefore its presence in the specification is essentially not required from a pure data perspective. As for S5, the headway variable (“HE”) in the SM utility was assigned a rather low value of $\lambda$ ($\lambda=0.001$), despite the fact that it should be clearly identified by DCM-ARD as a relevant variable, since it was part of the true specification of S5.

In order to provide a deeper understanding of the proposed approach for automatic utility function specification, Figure 2a shows the convergence of the derived DSVI algorithm when applied for specification S2 and Figure 2b gives a broader perspective on the sparsity induced by the hierarchical prior that DCM-ARD uses. While Figure 2a demonstrates that the proposed DSVI algorithm is able to converge within a few thousand iterations (mini-batches), Figure 2b illustrates that the learned optimal prior variances $\lambda$ for the S2 semi-artificial choice data are zero for the majority of the input dimensions, except for the few dimensions that correspond to variables that actually belong to the true utility function specification (S2) that was used to generate the data. Furthermore, one can observe two non-zero “plateaus” (one blue and one red) that correspond to the $\lambda$ values of the interacted variables in S2, which are enforced by the DCM-ARD model to be considered jointly through the tying of the variance parameters of the Gaussian priors (see Eq. 9).

Let us now consider the large search space. Table 4 shows the top-K variables with the highest $\lambda$ values according to DCM-ARD for S1, S2 and S3. As the obtained results show, DCM-ARD is still able to recover the true specifications that were used to generate the data regardless of the significantly larger search space (602 variables considered, instead of 252 for Table 2). However, since the number of variables considered is substantially larger, the execution time of the proposed DSVI algorithm naturally increased from approximately 10 minutes to close to 1 hour on a standard 2.3 GHz dual-core laptop with 16 GB of RAM.

Lastly, Table 5 shows the top-K variables deemed relevant by DCM-ARD for inclusion in the utility function specifications for S7, S8 and S9. By comparing these results with the true specifications from Table 1, one can again observe that DCM-ARD is able to discover the true specifications almost exactly. The only differences are the fact that DCM-ARD selected “log(TT)” instead of “box(TT)” in the utility function of train in S7, and the fact that it missed the interaction between “CO” and “luggage” in the utility function of car in S8. While we could not find an obvious explanation for the latter, the former can be easily explained by an analysis of the results of the Box-Cox transform, which uses a maximum likelihood approach to fit the parameters of the transformation. In the particular case of train travel time, we could immediately observe that the transformed values produced by the Box-Cox transformation are almost perfectly correlated with the ones produced by the log-transformation (correlation coefficient of 0.998), thus leading us to conclude that both lead to equivalent utility function specifications for the train alternative.
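The near-equivalence of the two transformations is easy to reproduce on synthetic data. The sketch below uses made-up travel times and a fixed, hypothetical Box-Cox parameter of 0.1 (in the experiments the parameter is fitted by maximum likelihood), and computes the Pearson correlation between the log-transformed and Box-Cox-transformed values.

```python
import math
import random

def box_cox(x, lam):
    """Box-Cox transform of a positive value; equals log(x) when lam == 0."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

rng = random.Random(0)
travel_time = [rng.uniform(30.0, 300.0) for _ in range(1000)]   # synthetic minutes
lam = 0.1                                                       # hypothetical parameter
r = pearson([math.log(t) for t in travel_time],
            [box_cox(t, lam) for t in travel_time])
print(r)
```

When the fitted parameter is close to zero, the two candidate variables carry essentially the same information, so a data-driven procedure has little basis for preferring one over the other.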

As a further test of scalability and robustness of the proposed approach, we also considered an extremely large search space, which was obtained by expanding the large search space described above with variables that consist of Gaussian random noise, until a total of 1000 variables per alternative was reached (i.e., a total of 3000 variables). Using the semi-artificial choice data corresponding to specification S2 we were able to verify that, despite the expected increased computational run time (approximately 5 hours), the proposed DCM-ARD was still able to perfectly recover the true specification of S2.

So far we have only been considering the ability of DCM-ARD to infer the correct utility function specifications. However, one can also evaluate DCM-ARD in terms of its prediction accuracy on held-out data. Table 6 shows the prediction accuracy of DCM-ARD when trained only on 70% of the dataset and tested on 30% held-out data for the different semi-artificial specifications considered (S1-S9). By comparing these results with the accuracy of a standard DCM that considers all the variables from the search space as input (“DCM”), one can verify that, thanks to the additional flexibility of the proposed hierarchical prior and its sparsity-inducing properties, DCM-ARD is able to generalize better to held-out data, thus resulting in significantly higher prediction accuracies. In fact, in most cases, DCM-ARD achieves almost as good prediction performance as a DCM estimated using the true specifications that were used to generate the semi-artificial choices (“DCM-TRUE”). On the other hand, a DCM fitted with maximum likelihood estimation with such a high number of input variables is very likely to severely overfit.

4.2 Real choice data

We will now consider the application of DCM-ARD to perform automatic utility function specification on the real choice data from the Swissmetro dataset. Table 7 shows the top-20 variables selected by DCM-ARD for inclusion in the utility functions using the medium-sized search space. Since in this case the correct specification is unknown, we instead evaluate the quality of the DCM models that the specifications inferred by DCM-ARD produce. With this purpose, we developed a series of specifications of increasing complexity based on the results of Table 7. We begin by considering a rather simplistic specification based only on travel time and cost (R1). We then start adding variables to it according to the results of DCM-ARD in descending order of importance according to the learned values of $\lambda$. The complete set of specifications considered is shown in Table 8. Note that the last specification (R7) already includes almost all the variables in the top-20 ranking shown in Table 7, and that other additional variables were assigned a $\lambda$ value of zero (or very close to zero), thus being deemed irrelevant by DCM-ARD. Also, since including both a variable and its log-transform could compromise the interpretability of the DCM models, we decided to include only the version with the higher value of $\lambda$ in the cases where DCM-ARD selected both variants. (Footnote 2: We note that, according to our empirical evidence, including both variants does tend to lead to models that fit the data better, including the held-out data.) Also, due to the fact that the purpose variable has 9 categories, with some of them having only a couple of observations, we further grouped the trip purposes into: commuting, shopping and leisure.

Based on the specifications that were generated according to the results of DCM-ARD (Table 7), we then fitted standard DCM models using the PyLogit package [Brathwaite & Walker, 2018] in Python. Table 9 shows the results obtained for the different specifications considered. As expected, one can verify that, as we increase the complexity of the specification according to the results of DCM-ARD, the fit of the DCM model improves in terms of log-likelihood. However, the quality of the DCM model also improves in terms of AIC, BIC and pseudo- $\bar{R}^{2}$. In order to further assess the quality of the DCM-ARD specifications in terms of generalization ability to held-out data, we also performed a random 70/30% train/test split of the dataset, and computed the likelihood and accuracies in both sets. As the results in Table 9 evidence, as we move towards the full specification inferred by DCM-ARD, the accuracy and held-out data likelihood of the DCM model also improve. Interestingly, it can be observed that only when we include essentially all the variables deemed relevant by DCM-ARD do we start noticing some signs of overfitting in the standard DCM model: BIC and test-set likelihood do not improve when going from specification R6 to R7. However, indicators such as AIC and pseudo- $\bar{R}^{2}$ still improve. Furthermore, it should be noted that the variables added when going from R6 to R7 are ones to which DCM-ARD assigned a relatively low relevance (i.e. low value of $\lambda$ when compared to the others).
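For readers less familiar with these goodness-of-fit indicators, the sketch below computes them from a fitted log-likelihood using their usual textbook definitions: AIC $=2k-2\log L$, BIC $=k\log N-2\log L$, and the adjusted likelihood ratio index $\bar{R}^{2}=1-(\log L-k)/\log L_{0}$, where $k$ is the number of estimated parameters and $\log L_{0}$ the log-likelihood of a null model. These definitions are assumed here rather than taken from the text, and all numbers in the example are made up.

```python
import math

def information_criteria(loglik, null_loglik, n_params, n_obs):
    """Return (AIC, BIC, adjusted pseudo R-squared) for a fitted choice model."""
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n_obs) - 2 * loglik
    rho_bar_sq = 1.0 - (loglik - n_params) / null_loglik
    return aic, bic, rho_bar_sq

# Hypothetical model: 10 parameters, 1000 observations.
aic, bic, rho_bar_sq = information_criteria(-800.0, -1100.0, 10, 1000)
print(aic, bic, rho_bar_sq)
```

Because BIC penalizes each additional parameter by $\log N$ rather than 2, it can stop improving before AIC does as a specification grows, which is the pattern described above for R6 and R7.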

Comparing the results of specifications R6 and R7 with other proposed DCM specifications from the literature for the same dataset (Table 10), it is possible to obtain a better perspective of how good the specifications inferred by DCM-ARD are. For example, the DCM specification proposed in PyLogit for the Swissmetro dataset includes variables such as travel time, cost, headway, seat configuration, luggage and first class. However, it only achieves a log-likelihood of $-8,061$, a BIC of $16,252$ and a pseudo- $\bar{R}^{2}$ of $0.271$. Similarly, the original specification proposed by Bierlaire et al. [2001] achieves a log-likelihood of just $-8,483$, a BIC of $17,050$ and a pseudo- $\bar{R}^{2}$ of $0.233$. Moreover, if we consider generalization to held-out data, Table 10 also demonstrates that both R6 and R7 obtain better results than both baseline approaches, thereby highlighting how DCM-ARD can be easily used to enable the automatic search of utility function specifications.

Lastly, Table 11 shows the coefficients estimated by a DCM with the specification R6 using PyLogit, and their corresponding p-values and other statistics. The full set of results for the other specifications was omitted for brevity but is available at http://fprodrigues.com/DCM-ARD/, together with the source code. As the results in Table 11 demonstrate, the specification learned by DCM-ARD leads to a stable DCM in which the coefficients for all variables except “TT x ga (Car)” have p-values smaller than $0.001$. It should however be noted that, in two cases, the parameter estimates are not entirely behaviourally realistic: for both Train and SM alternatives, the sum of the parameter related to “log(CO) x pur2” and the corresponding baseline (“log(CO) (Train)” and “log(CO) (SM)”) is positive, implying that, all else being equal, increasing the travel cost of shopping trips improves their attractiveness. Such a result is obviously wrong; it indicates that the involved parameters are erroneously capturing or omitting some effects, most probably because the travel cost of the two affected modes is interacted with “ga” and “pur”, but not with both simultaneously. However, since such interactions were not considered in the search-space, DCM-ARD is unable to identify them as relevant. This example thus highlights an important limitation of DCM-ARD: its results depend on the search-space considered, and it has no knowledge of behavioural theories. However, we reiterate that its purpose is to assist modellers in specifying utility functions according to data-driven knowledge, rather than serving as a replacement for expert modellers and domain knowledge.

5 Conclusion

This paper proposed a Bayesian framework for automatically performing utility function specification in discrete choice models based on the idea of automatic relevance determination (ARD). An efficient doubly stochastic variational inference algorithm was derived in order to perform approximate Bayesian inference in the proposed DCM-ARD model. As our empirical results using both semi-artificial and real choice data showed, the proposed approach is able to automatically discover good utility function specifications in a purely data-driven manner, even in situations where the number of possible variables considered for inclusion in the utility functions is very large. The practical advantages and overall feasibility of the proposed approach were demonstrated through an application to the popular Swissmetro dataset [Bierlaire et al., 2001], where DCM-ARD was shown to be capable of generating specifications that outperform others from the state of the art according to multiple criteria.

Despite the importance of the standard formulation of the multinomial logit in discrete choice theory, it only corresponds to a subset of the models that are used in practice, with modelling approaches like mixed logits and latent class choice models providing important ways of capturing the heterogeneity in preferences among decision makers. Therefore, our future work focuses on extending the proposed DCM-ARD formulation to these types of models, and on dealing with the challenges associated with performing approximate Bayesian inference in those settings in a scalable manner.

References

Bierlaire et al. [2001]

Bierlaire, M., Axhausen, K., & Abay, G. (2001).

The acceptance of modal innovation: The case of Swissmetro.

In Proceedings of the 1st Swiss Transportation Research Conference.

Bishop [2006]

Bishop, C. M. (2006).

Pattern recognition and machine learning.

Springer.

Blei et al. [2007]

Blei, D. M., Lafferty, J. D. et al. (2007).

A correlated topic model of science.

The Annals of Applied Statistics, 1, 17–35.

[4]

Brathwaite, T., Vij, A., & Walker, J. L. ().

Machine learning meets microeconomics: The case of decision trees and discrete choice.

Unpublished manuscript.

Brathwaite & Walker [2018]

Brathwaite, T., & Walker, J. L. (2018).

Asymmetric, closed-form, finite-parameter models of multinomial choice.

Journal of Choice Modelling, 29, 78–112.

Brusco [2014]

Brusco, M. J. (2014).

A comparison of simulated annealing algorithms for variable selection in principal component analysis and discriminant analysis.

Computational Statistics & Data Analysis, 77, 38–53.

Danaf [2017]

Danaf, M. (2017).

Personalized recommendations using discrete choice models with inter-and intra-consumer heterogeneity.

In International Choice Modelling Conference 2017.

Dash & Liu [1997]

Dash, M., & Liu, H. (1997).

Feature selection for classification.

Intelligent Data Analysis, 1, 131–156.

Deng & Runger [2012]

Deng, H., & Runger, G. (2012).

Feature selection via regularized trees.

In Neural Networks (IJCNN), The 2012 International Joint Conference on (pp. 1–8).

IEEE.

Drugowitsch [2013]

Drugowitsch, J. (2013).

Variational Bayesian inference for linear and logistic regression.

arXiv preprint arXiv:1310.5438.

Fouskakis & Draper [2008]

Fouskakis, D., & Draper, D. (2008).

Comparing stochastic optimization methods for variable selection in binary outcome prediction, with application to health policy.

Journal of the American Statistical Association, 103, 1367–1381.

Guyon & Elisseeff [2003]

Guyon, I., & Elisseeff, A. (2003).

An introduction to variable and feature selection.

Journal of machine learning research, 3, 1157–1182.

Hoffman et al. [2013]

Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013).

Stochastic variational inference.

The Journal of Machine Learning Research, 14, 1303–1347.

Hossain et al. [2014]

Hossain, S., Ahmed, S. E., & Howlader, H. A. (2014).

Model selection and parameter estimation of a multinomial logistic regression model.

Journal of Statistical Computation and Simulation, 84, 1412–1426.

[15]

Huttunen, H., Manninen, T., Kauppi, J.-P., & Tohka, J. ().

Mind reading with regularized multinomial logistic regression.

Machine Vision and Applications, 24, 1311–1325.

Jaakkola & Jordan [2000]

Jaakkola, T. S., & Jordan, M. I. (2000).

Bayesian parameter estimation via variational methods.

Statistics and Computing, 10, 25–37.

Jordan et al. [1999]

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999).

An introduction to variational methods for graphical models.

Machine learning, 37, 183–233.

Knowles & Minka [2011]

Knowles, D. A., & Minka, T. (2011).

Non-conjugate variational message passing for multinomial and binary regression.

In Advances in Neural Information Processing Systems (pp. 1701–1709).

[19]

Lhéritier, A., Bocamazo, M., Delahaye, T., & Acuna-Agost, R. ().

Airline itinerary choice modeling using machine learning.

Journal of Choice Modelling.

Lin et al. [2008]

Lin, S.-W., Lee, Z.-J., Chen, S.-C., & Tseng, T.-Y. (2008).

Parameter determination of support vector machine and feature selection using simulated annealing approach.

Applied soft computing, 8, 1505–1512.

MacKay [1996]

MacKay, D. J. (1996).

Bayesian non-linear modeling for the prediction competition.

In Maximum Entropy and Bayesian Methods (pp. 221–234).

Springer.

MacKay [2003]

MacKay, D. J. (2003).

Information theory, inference and learning algorithms.

Cambridge university press.

[23]

Muni, D., Pal, N., & Das, J. ().

Genetic programming for simultaneous feature selection and classifier design.

IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 36, 106–117.

Ortelli et al. [2019]

Ortelli, N., Pereira, F. C., Rodrigues, F., & Bierlaire, M. (2019).

Assisted utility specification in discrete choice models.

Unpublished manuscript.

Pacheco et al. [2009]

Pacheco, J., Casado, S., & Núñez, L. (2009).

A variable selection method based on tabu search for logistic regression models.

European Journal of Operational Research, 199, 506–511.

Pal et al. [1998]

Pal, N. R., Nandi, S., & Kundu, M. K. (1998).

Self-crossover-a new genetic operator and its application to feature selection.

International Journal of Systems Science, 29, 207–212.

Paredes et al. [2017]

Paredes, M., Hemberg, E., O’Reilly, U.-M., & Zegras, C. (2017).

Machine learning or discrete choice models for car ownership demand estimation and prediction?

In 2017 5th IEEE International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS) (pp. 780–785).

Paz et al. [2019]

Paz, A., Arteaga, C., & Cobos, C. (2019).

Specification of mixed logit models assisted by an optimization framework.

Journal of Choice Modelling, 30, 50–60.

[29]

Peng, H., Long, F., & Ding, C. ().

Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1226–1238.

Ranganath et al. [2014]

Ranganath, R., Gerrish, S., & Blei, D. (2014).

Black box variational inference.

In Artificial Intelligence and Statistics (pp. 814–822).

Rasmussen [2003]

Rasmussen, C. E. (2003).

Gaussian processes in machine learning.

In Summer School on Machine Learning (pp. 63–71).

Springer.

Robbins & Monro [1985]

Robbins, H., & Monro, S. (1985).

A stochastic approximation method.

In Herbert Robbins Selected Papers (pp. 102–109).

Springer.

Sato et al. [2016]

Sato, T., Takano, Y., Miyashiro, R., & Yoshise, A. (2016).

Feature subset selection for logistic regression via mixed integer optimization.

Computational Optimization and Applications, 64, 865–880.

[34]

Sifringer, B., Lurkin, V., & Alahi, A. ().

Enhancing discrete choice models with neural networks.

In Proceedings of the 18th Swiss Transportation Research Conference.

Song et al. [2017]

Song, Y., Nathoo, F. S., & Masson, M. E. (2017).

A Bayesian approach to the mixed-effects analysis of accuracy data in repeated-measures designs.

Journal of Memory and Language, 96, 78–92.

Soufan et al. [2015]

Soufan, O., Kleftogiannis, D., Kalnis, P., & Bajic, V. B. (2015).

DWFS: A wrapper feature selection tool based on a parallel genetic algorithm.

PLOS ONE, 10.

Tibshirani [1996]

Tibshirani, R. (1996).

Regression shrinkage and selection via the lasso.

Journal of the Royal Statistical Society. Series B (Methodological), (pp. 267–288).

Tipping [2001]

Tipping, M. E. (2001).

Sparse Bayesian learning and the relevance vector machine.

Journal of machine learning research, 1, 211–244.

Titsias & Lázaro-Gredilla [2014]

Titsias, M., & Lázaro-Gredilla, M. (2014).

Doubly stochastic variational Bayes for non-conjugate inference.

In International Conference on Machine Learning (pp. 1971–1979).

Torres et al. [2011]

Torres, C., Hanley, N., & Riera, A. (2011).

How wrong can you be? implications of incorrect utility function specification for welfare measurement in choice experiments.

Journal of Environmental Economics and Management, 62, 111–121.

Tutz et al. [2015]

Tutz, G., Pößnecker, W., & Uhlmann, L. (2015).

Variable selection in general multinomial logit models.

Computational Statistics & Data Analysis, 82, 207–222.

[42]

Vergara, J. R., & Estévez, P. A. ().

A review of feature selection methods based on mutual information.

Neural Computing and Applications, 24, 175–186.

Vinterbo & Ohno-Machado [1999]

Vinterbo, S., & Ohno-Machado, L. (1999).

A genetic algorithm to select variables in logistic regression: example in the domain of myocardial infarction.

In Proceedings of the AMIA Symposium (p. 984).

American Medical Informatics Association.

Xing et al. [2001]

Xing, E. P., Jordan, M. I., & Karp, R. M. (2001).

Feature selection for high-dimensional genomic microarray data.

In ICML (pp. 601–608).

Citeseer volume 1.

Zhang & Huang [2008]

Zhang, C.-H., & Huang, J. (2008).

The sparsity and bias of the lasso selection in high-dimensional linear regression.

The Annals of Statistics, 36, 1567–1594.
