This paper introduces a penalized likelihood approach for multinomial logistic regression that automatically combines response categories, improving model interpretability and prediction accuracy.
Contribution
It develops a novel non-differentiable penalty and an ADMM algorithm for efficient estimation and category combination in multinomial logistic regression.
Findings
01
Effective category combination achieved in simulations
02
Algorithm converges reliably in practice
03
Improved prediction performance over traditional methods
Abstract
We propose a penalized likelihood method that simultaneously fits the multinomial logistic regression model and combines subsets of the response categories. The penalty is non differentiable when pairs of columns in the optimization variable are equal. This encourages pairwise equality of these columns in the estimator, which corresponds to response category combination. We use an alternating direction method of multipliers algorithm to compute the estimator and we discuss the algorithm's convergence. Prediction and model selection are also addressed.
Tables10
Table 1. Table 1: Results of simulation using setting 1 when ℒ = 𝒮 ℒ 𝒮 \operatorname{\mathcal{L}}=\operatorname{\mathcal{S}} . Standard errors are presented in parenthesis. The columns Oracle, VL, EN, and GMR report to the average KL divergence for oracle tuned GFMR, GFMR using validation likelihood, EN, and GMR methods respectively. The column VL-Oracle reports the average difference
for GFMR using validation likelihood and oracle tuned GFMR. Similar notation is used for comparisons between VL and EN, and VL and GMR. The columns labeled Oracle Groups, and VL Groups, contain the average number of unique regression coefficient vectors when using oracle tuned GFMR, and GFMR using validation likelihood.
Oracle
VL
VL-Oracle
Oracle
Groups
VL
Groups
EN
VL-EN
GMR
VL-GMR
0.10
0.011
(0.001)
0.023
(0.000)
0.011
(0.002)
1.95
2.98
0.050
(0.003)
-0.027
(0.003)
0.049
(0.003)
-0.027
(0.003)
0.25
0.057
(0.002)
0.076
(0.002)
0.019
(0.002)
2.98
3.91
0.094
(0.003)
-0.017
(0.004)
0.091
(0.003)
-0.015
(0.003)
0.50
0.137
(0.003)
0.164
(0.005)
0.027
(0.003)
3.51
4
0.191
(0.005)
-0.027
(0.004)
0.188
(0.004)
-0.024
(0.004)
1.00
0.259
(0.005)
0.292
(0.005)
0.033
(0.004)
4
4
0.399
(0.010)
-0.107
(0.009)
0.399
(0.010)
-0.106
(0.008)
0.10
0.011
(0.001)
0.019
(0.001)
0.008
(0.001)
2.27
3.36
0.033
(0.002)
-0.013
(0.001)
0.032
(0.002)
-0.012
(0.001)
0.25
0.050
(0.001)
0.061
(0.001)
0.008
(0.001)
2.27
3.36
0.033
(0.002)
-0.009
(0.001)
0.069
(0.002)
-0.012
(0.001)
0.50
0.098
(0.002)
0.114
(0.004)
0.016
(0.003)
3.86
4
0.141
(0.005)
-0.027
(0.003)
0.142
(0.004)
-0.027
(0.003)
1.00
0.150
(0.004)
0.164
(0.004)
0.015
(0.002)
4
4
0.222
(0.006)
-0.058
(0.005)
0.225
(0.006)
-0.061
(0.005)
Table 2. Table 2: Results of simulation using setting 1 when ℒ = { ( 1 , 2 ) , ( 2 , 3 ) , ( 3 , 4 ) } ℒ 1 2 2 3 3 4 \operatorname{\mathcal{L}}=\{(1,2),(2,3),(3,4)\} . The columns Oracle, and VL report to the average KL divergence for oracle tuned GFMR and GFMR using validation likelihood methods respectively. The column VL-Oracle reports the average difference
for GFMR using validation likelihood and oracle tuned GFMR. Similar notation is used for comparison VL and GMR. The columns labeled Oracle Groups, and VL Groups, contain the average number of unique regression coefficient vectors when using oracle tuned GFMR, and GFMR using validation likelihood.
Oracle
VL
VL-Oracle
Oracle Groups
VL Groups
VL-GMR
0.10
0.0121
(0.000)
.025
(0.003)
0.013
(0.002)
3.1
3.1
-0.024
(0.003)
0.25
0.065
(0.001)
0.0.086
(0.003)
0.028
(0.003)
3.4
3.25
-0.006
(0.003)
0.50
0.156
(0.004)
0.198
(0.005)
0.040
(0.005)
3.95
3.62
0.009
(0.004)
1.00
0.189
(0.003)
0.245
(0.003)
0.054
(0.001)
4
3.77
-0.017
(0.010)
0.10
0.012
(0.000)
0.020
(0.002)
0.008
(0.001)
2.91
3.15
-0.011
(0.001)
0.25
0.058
(0.001)
0.069
(0.001)
0.011
(0.002)
3.65
3.41
0.000
(0.002)
0.50
0.107
(0.003)
0.127
(0.005)
0.020
(0.004)
3.99
3.86
-0.014
(0.003)
1.00
0.156
(0.004)
0.191
(0.005)
0.035
(0.004)
4
3.99
-0.034
(0.003)
Table 3. Table 3: Number of replications out of 100 that produced the number of unique regression coefficient vectors in setting 1. Ordered penalty set references that the penalty set used is 𝒮 𝒮 \operatorname{\mathcal{S}} , while unordered penalty set references ℒ = { ( 1 , 2 ) , ( 2 , 3 ) , ( 3 , 4 ) } ℒ 1 2 2 3 3 4 \operatorname{\mathcal{L}}=\{(1,2),(2,3),(3,4)\} .
Penalty Set
1 Vector
2 Vectors
3 Vectors
4 Vectors
0.10
Ordered
0
6
78
16
Unordered
64
6
1
29
0.25
Ordered
0
5
65
30
Unordered
40
3
3
54
0.50
Ordered
0
1
36
63
Unordered
16
0
1
83
1.00
Ordered
0
0
12
88
Unordered
0
0
0
100
0.10
Ordered
0
8
68
24
Unordered
49
12
2
37
0.25
Ordered
0
1
57
42
Unordered
24
3
4
69
0.50
Ordered
0
0
14
86
Unordered
4
1
0
95
1.00
Ordered
0
0
1
99
Unordered
0
0
0
100
Table 4. Table 4: Results of simulation using setting 2 when ℒ = 𝒮 ℒ 𝒮 \operatorname{\mathcal{L}}=\operatorname{\mathcal{S}} . Standard errors are presented in parenthesis. The columns
Oracle, VL, EN, and GMR report to the average KL divergence for oracle tuned GFMR, GFMR using validation likelihood, EN, and GMR
methods respectively. The column VL-Oracle reports the average difference
for GFMR using validation likelihood and oracle tuned GFMR. Similar notation is used for comparisons between VL and EN, and VL
and GMR. The columns labeled Oracle Groups, and VL Groups, contain the average number of unique regression coefficient
vectors when using oracle tuned GFMR, and GFMR using validation likelihood.
Oracle
VL
VL-Oracle
Oracle
Groups
VL
Groups
EN
VL-EN
GMR
VL-GMR
0.10
0.008
(0.000)
0.023
(0.003)
0.014
(0.003)
1.99
2.81
0.044
(0.003)
-0.022
(0.002)
0.043
(0.003)
-0.023
(0.003)
0.25
0.043
(0.001)
0.056
(0.002)
0.013
(0.002)
2.45
3.75
0.083
(0.003)
-0.026
(0.003)
0.081
(0.003)
-0.025
(0.003)
0.50
0.118
(0.003)
0.146
(0.003)
0.026
(0.002)
3.07
4
0.169
(0.003)
-0.023
(0.004)
0.167
(0.003)
-0.021
(0.003)
1.00
0.225
(0.006)
0.243
(0.007)
0.017
(0.004)
4
4
0.364
(0.008)
-0.120
(0.008)
0.372
(0.009)
-0.120
(0.008)
0.10
0.008
(0.000)
0.014
(0.001)
0.006
(0.001)
2.04
3.39
0.027
(0.001)
-0.012
(0.001)
0.027
(0.002)
-0.012
(0.001)
0.25
0.037
(0.001)
0.048
(0.001)
0.010
(0.001)
2.90
3.91
0.060
(0.002)
-0.012
(0.002)
0.059
(0.002)
-0.012
(0.002)
0.50
0.076
(0.002)
0.089
(0.003)
0.012
(0.002)
3.85
4
0.116
(0.002)
-0.027
(0.002)
0.115
(0.003)
-0.026
(0.002)
1.00
0.124
(0.004)
0.131
(0.004)
0.006
(0.001)
4
4
0.201
(0.007)
-0.070
(0.005)
0.204
(0.006)
-0.073
(0.005)
Table 5. Table 5: Results of simulation using setting 2 when ℒ = { ( 1 , 2 ) , ( 2 , 3 ) , ( 3 , 4 ) } ℒ 1 2 2 3 3 4 \operatorname{\mathcal{L}}=\{(1,2),(2,3),(3,4)\} . The columns Oracle, and VL report to the
average KL divergence for oracle tuned GFMR and GFMR using validation likelihood methods respectively. The column VL-Oracle
reports the average difference
for GFMR using validation likelihood and oracle tuned GFMR. Similar notation is used for comparison VL and GMR. The columns
labeled Oracle Groups, and VL Groups, contain the average number of unique regression coefficient vectors when using oracle
tuned GFMR, and GFMR using validation likelihood.
Oracle
VL
VL-Oracle
Oracle Groups
VL Groups
VL-GMR
0.10
0.008
(0.003)
0.023
(0.004)
0.015
(0.003)
2.87
2.11
-0.025
(0.003)
0.25
0.040
(0.000)
0.057
(0.002)
0.016
(0.001)
3.17
3.56
-0.024
(0.003)
0.50
0.106
(0.003)
0.135
(0.004)
0.029
(0.003)
3.79
3.56
-0.032
(0.004)
1.00
0.189
(0.003)
0.245
(0.003)
0.054
(0.001)
4
3.77
-0.128
(0.010)
0.10
0.008
(0.000)
0.015
(0.001)
0.007
(0.001)
2.91
3.15
-0.011
(0.001)
0.25
0.035
(0.001)
0.046
(0.002)
0.011
(0.002)
3.43
3.35
-0.013
(0.002)
0.50
0.064
(0.003)
0.073
(0.003)
0.009
(0.001)
3.89
3.75
-0.042
(0.002)
1
0.101
(0.003)
0.119
(0.004)
0.018
(0.002)
4
3.89
-0.085
(0.006)
Table 6. Table 6: Number of replications out of 100 that produced the number of unique regression coefficient vectors in setting 1.
Ordered penalty set references that the penalty set used is 𝒮 𝒮 \operatorname{\mathcal{S}} , while unordered penalty set references ℒ = { ( 1 , 2 ) , ( 2 , 3 ) , ( 3 , 4 ) } ℒ 1 2 2 3 3 4 \operatorname{\mathcal{L}}=\{(1,2),(2,3),(3,4)\} .
Penalty Set
1 Vector
2 Vectors
3 Vectors
4 Vectors
0.10
Ordered
0
2
83
14
Unordered
63
4
4
29
0.25
Ordered
0
4
69
27
Unordered
49
2
4
45
0.50
Ordered
0
2
40
58
Unordered
29
3
0
68
1.00
Ordered
0
0
22
78
Unordered
0
0
0
100
0.10
Ordered
0
7
71
22
Unordered
55
14
3
28
0.25
Ordered
0
3
59
38
Unordered
33
4
3
60
0.50
Ordered
0
0
25
75
Unordered
4
1
1
94
1.00
Ordered
0
0
11
89
Unordered
0
0
0
100
Table 7. Table 7: The fraction of the 100 replications specific group structures are selected for each N 𝑁 N , δ 𝛿 \delta combination using the two-step method . The label
One-Step indicates that the correct group structure is still a possibility if the correct fusion was done with an additional
combination.
1 Group
19/100
0/100
3/100
0/100
2 Groups (Correct)
58/100
71/100
80/100
83/100
2 Groups (Incorrect)
0/100
0/100
0/100
0/100
3 Groups (One-Step)
20/100
21/100
16/100
15/100
3 Groups (Incorrect)
0/100
0/100
0/100
0/100
4 Groups
3/100
8/100
1/100
2/100
Table 8. Table 8: The fraction of the 100 replications specific group structures are selected for each N 𝑁 N , δ 𝛿 \delta combination when all
possible response category combinations were used as candidate models.
1 Group
0/100
0/100
0/100
0/100
2 Groups (Correct)
46/100
59/100
62/100
82/100
2 Groups (Incorrect)
25/100
4/100
14/100
0/100
3 Groups (One-Step)
28/100
29/100
24/100
17/100
3 Groups (Incorrect)
0/100
0/100
0/100
0/100
4 Groups
1/100
8/100
0/100
1/100
Table 9. Table 9: The response categories found by GFMR with tuning parameter selection using validation likelihood on the 1996 United States election data.
Group
1
Strong Republican
Strong Republican
2
Weak Republican
Weak Republican
3
Independent Republican
Independent Republican
Independent
Independent Democrat
4
Independent
Weak Democrat
5
Independent Democrat
Strong Democrat
6
Weak Democrat
7
Strong Democrat
Table 10. Table 10: The response categories found by GFMR using the two step method for model selection on the 1996 United States
election data.
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Statistical Methods and Models · Face and Expression Recognition · Statistical Methods and Inference
MethodsLogistic Regression
Full text
Automatic Response Category Combination in Multinomial Logistic Regression
Bradley S. Price, Charles J. Geyer,
and Adam J. Rothman
††footnotetext: Bradley S. Price, Management Information Systems Department, West Virginia University
(E-Mail: [email protected]).
Charles J. Geyer, School of Statistics, University of Minnesota
(E-mail: [email protected]).
Adam J. Rothman, School of Statistics, University of Minnesota
(E-mail: [email protected]).
Abstract
We propose a penalized likelihood method that simultaneously
fits the multinomial logistic regression
model and combines subsets of the response categories.
The penalty is nondifferentiable when pairs of columns in
the optimization variable are equal. This encourages
pairwise equality of these columns in the estimator, which corresponds to response category
combination. We use an alternating direction method of multipliers algorithm
to compute the estimator and we discuss the algorithm’s convergence.
Prediction and model selection are also addressed.
We propose a new way to fit the multinomial logistic regression model.
Let xi=(1,xi2,…,xip)′∈Rp
be the non-random values of the predictors for the ith subject
and let yi=(yi1,…,yiC)′∈RC be the
observed response category counts for the ith subject (i=1,…,n). The model assumes that yi is a realization of the random vector
[TABLE]
where ni is the index of the multinomial experiment for the ith subject
and πij∗ is the unknown probability
that the response is category j for the ith subject
(i,j)∈{1,…,n}×C, where C={1,2,…,C}.
The model also assumes that Y1,…,Yn are independent.
Using the baseline category parameterization,
[TABLE]
where β1∗,…,βC−1∗ are unknown regression
coefficient vectors and βC∗=0.
Other constraints could be used to
make β1∗,…,βC∗ identifiable, e.g.
Zhu and Hastie (2004) used ∑j=1Cβj∗=0.
Let ℓ:Rp×⋯×Rp→R be the log likelihood with additive terms
that do not contain parameters dropped:
[TABLE]
There are a total of p(C−1) unknown parameters.
For more information on the model see Agresti (2012, Chapter 8).
Shrinkage or regularized estimation is natural in this setting when p(C−1) is large.
Zhu and Hastie (2004) proposed ridge-penalized likelihood estimation,
and Vincent and Hansen (2014) proposed sparse group-lasso penalized likelihood estimation that encourages
estimates of the matrix (β1,…,βC−1)∈Rp×C−1
that have 0’s in all entries of some rows.
These rows correspond to explanatory variables
that are estimated to be irrelevant. A similar procedure was studied by Simon et al. (2013).
Interpreting the regression coefficients when the response has three or more
categories is difficult because the coefficient
values depend on the baseline category choice (or depend on another
constraint employed to make the parameters identifiable).
Interpretation in the special case of binomial logistic regression (C=2)
is much simpler. In some applications, we can create a two-category response
by subjectively grouping the original response categories, e.g. if the jth
response category was of interest, one could create two groups of categories {j}
and C∖{j}, and perform binomial logistic regression. Without
subjective information, we can view response category grouping as a model selection
problem, where the selected submodel with grouped categories
fits the counts in our data nearly as well as the full model.
We propose a penalized likelihood procedure that uses fusion
penalties to encourage fitted models
with grouped response categories. Model selection is addressed
using both K-fold cross validation and AIC. Through simulation, we show that in settings where the true response categories are grouped, our method
using cross validation for tuning
parameter selection performs better than competitors at
predicting the true response category probabilities. We also show that
in certain settings, our method using AIC for model selection
excels at selecting the true grouped response categories.
2 Method
We propose the penalized likelihood estimates defined by
[TABLE]
where L is a user-selected subset of S={(a,b)∈C×C:a>b},
λ≥0 is a tuning parameter, and
∣⋅∣2 is the vector 2-norm.
Since the penalty only depends on differences of β vectors,
the estimated response category probabilities
are invariant to the choice of the baseline category for a given L.
To exploit a natural ordering (if any) of
the response categories, one could set L={(1,2),(2,3),…,(C−1,C)}. Without prior information about the
similarity between response categories, one could set L=S.
Suppose that L=S. The objective function in (4) is non-differentiable
(β1,…,βC) has
at least one pair of equal components, βm=βj, m=j. This
encourages an increasing number of
pairs of equal element vectors in (β^1,…,β^C)
as λ increases. If β^m=β^j, then the
corresponding estimated
response category probabilities are also equal.
In effect, response categories j and m are combined.
If λ is
sufficiently large, then
β^1=⋯=β^C=0. In this uninteresting edge-case,
all π^ij’s are equal to 1/C.
We select λ using K-fold cross validation maximizing the validation log-likelihood. Let Vk be the set of indices in
the k-th fold.
Specifically we maximize the function Q, defined by
[TABLE]
where yil is the observed
count for response category l for validation observation i, and
where πl(−Vq)(xi,λ) is the estimated probability
validation observation i, with predictor values xi, has response
category l using estimates produced from the training set that omits data
with indices in Vq using tuning parameter λ.
We propose an alternative method of tuning parameter selection using AIC
in Section 5.
Our penalty builds on
the ideas behind the group lasso penalty (Bakin, 1999; Yuan and Lin, 2006) and the fused lasso
penalty (Tibshirani et al., 2005). Alaiz et al. (2013) developed the grouped fused lasso
penalty, but their work only fused adjacent groups
and was designed for single-response penalized least-squares regression. Both Hocking et al. (2011) and Chen et al. (2015)
investigate similar penalties in the context of clustering.
3 Algorithm
We propose to solve (4) using an alternating direction method of multipliers algorithm (ADMM).
Boyd et al. (2010) provide an introduction and description of the ADMM algorithm. The proposed ADMM algorithm solves the
optimization in (4) by solving the
equivalent constrained optimization:
[TABLE]
The optimization in (6) can be written as a special case of equation 3.1 of Boyd et al. (2010). Define β=(β1,…,βC−1), Z={Zj,m}(j,m)∈L and W={Wj,m}(j,m)∈L, where Wj,m is the Lagrange
multiplier associated with the constraint on Zj,m. The augmented Lagrangian is
[TABLE]
where ρ>0 is the augmented Lagrangian parameter.
We now apply equations 3.2, 3.3, and 3.4 of Boyd et al. (2010) to obtain the ADMM update equations for solving (6):
[TABLE]
where a superscript of (k) denotes the kth iterate.
We propose to solve (7) with blockwise coordinate descent, where
β1,…,βC−1 are the blocks. In each block update,
we use Newton’s method, which converges because these subproblems have
strongly convex objective functions.
Solving (7) is the most computationally expensive step.
The optimization in (8) decouples into card(L) optimization problems that can be solved
in parallel with the following closed form solutions:
[TABLE]
where (a)+=max(a,0). This solution is a special case of solving the group lasso penalized least squares problem
(Yuan and Lin, 2006).
One could show convergence of the proposed ADMM algorithm by using the theory developed by Mota et al. (2011). This proof
relies on the assumption that the undirected graph defined by the vertex set C and the edge set L, is connected. In
practice the convergence tolerance and step size ρ can be determined
by equation 3.12 and 3.13 respectively in Boyd et al. (2010).
The final iterate β(K) will not have pairs of equal component vectors, but subsets of its pairs of component vectors will
be very similar. We define our final estimates by
[TABLE]
and β^C=0.
We set β^1,…,β^C−1 equal to [math] if the 2-norm of the vector is less than 10−8. A
similar approach was taken by Danaher et al. (2013).
4 Prediction Performance Simulations
4.1 Competitors
We present simulation studies that compare the prediction performance of our proposed method, which we call
group fused multinomial logistic regression (GFMR),
elastic net penalized multinomial logistic regression (EN), and group penalized multinomial logistic regression (GMR)
(Friedman et al., 2008; Simon et al., 2013).
Both methods used for comparison are implemented in the glmnet package in R. An R package implementing the GFMR methods
is currently under development.
4.2 Data Generating Models
In
Sections 4.3 and 4.4, we present simulation
results where the data was generated such that xi=(1,x~i)′, where x~1,…,x~n are drawn
independently from
N9(0,I). We set ni=1 for i=1,…,n and C=4. We consider two settings for β∗.
In setting 1, β1∗=β4∗=0 and β2∗=β3∗=δ. In setting 2, β1∗=β2∗=β3∗=δ, and β4∗=0. We consider (n,δ)∈{50,100}×{0.1,0.25,0.5,1}. The observed responses yi,…,yn are a realization of Y1,…,Yn defined by
(1).
Results are based on 100 replications. We consider two choices for L: L=S and L={(1,2),(2,3),(3,4)}. The second choice attempts to exploit a natural ordering of the
response categories. In each replication, we measure the prediction performance using the
Kullback-Leibler (KL) divergence between the estimated response category probabilities and the
true response category probabilities, based on 1000 test observations generated from the same
distribution as the training data. The KL Divergence is defined by
[TABLE]
where πk∗(xi), and πk(xi,λ) are the true and
estimated response category probabilities that testing observation i has response category k using tuning parameter λ. For both GFMR methods we report the average number of unique regression coefficient vectors observed in the estimator for the 100 replications. The number of unique regression coefficient vectors estimated by the GFMR methods is defined as the unique vectors in (β^1,…,β^C).
Tuning parameters for GFMR are selected from a subset of
{10−10,10−9.98,…,109.98,1010}. We also computed the best overall KL divergence we could obtain using the candidate values for the tuning parameter. We call this
the oracle tuned value. The tuning parameters for the competing methods were selected using 5-fold cross validation minimizing the validation deviance. For the both EN and GMR, the first tuning parameter was selected by default methods in the glmnet package in R and the second
tuning parameter was selected from the set {0,0.01,…,0.99,1}.
4.3 Results for Setting 1
In Table 1, we present the average KL divergence and the average number of unique regression coefficient vectors estimated by the GFMR method for the simulation using setting 1 when L=S. On average GFMR using validation likelihood had a lower KL divergence than its competitors. Since there should be 2 unique regression coefficient vectors in the estimator, the results show GFMR using oracle tuning on average is overselecting the number of unique coefficient
vectors when δ=0.25,0.5,
and 1 for both values of n. A similar result occurred for GFMR using validation likelihood for all values of δ and n. This may indicate that we need
a different method to select the tuning parameter for GFMR if our interest is in model selection, when L=S, rather than prediction of the true response category probabilities where this method performs better than EN and GMR.
The average KL divergence and the average number of unique regression coefficient vectors estimated by GFMR for setting 1 when L={(1,2),(2,3),(3,4)} are
reported in Table 2. These results show the same pattern that the results in Table 1 showed. While prediction accuracy decreased in this setting, we again see the GFMR using validation likelihood tuning is competitive with GMR. This decrease in prediction performance is expected because response categories are not naturally ordered in this data generating model. We also saw an increase in
the average number of unique regression coefficient vectors estimated using GFMR with oracle tuning and GFMR with validation likelihood tuning when compared to the setting L=S.
To further investigate the number of unique regression coefficient vectors estimated by GFMR, we present a comparison between this quantity for the case when L=S and L={(1,2),(2,3),(3,4)} in Table 3. We see that GFMR using validation likelihood performed poorly at
selecting the true number of unique coefficient vectors estimated for both choices of L.
4.4 Results of Setting 2
In Table 4, we present the average KL divergence and the average
number of unique regression coefficient vectors estimated by the GFMR method for the simulation using setting 2 when
L=S. Similar to the patterns shown in setting 1 when L=S, on average the GFMR using validation likelihood tuning
has a lower KL divergence than both EN and the GMR. For all values of n and δ, with the exception of (δ,n)=(0.1,50), both the oracle tuned GFMR and GFMR using validation likelihood tuning overselects the number
of unique coefficient vectors. Again this may indicate the need for a different method for tuning parameter selection if our
interest is model selection.
In Table 5, we present the average KL divergence and the average number of unique regression coefficients
estimated by GFMR for the simulation using setting 2 when L={(1,2),(2,3),(3,4)}.
We see a similar result to the pattern observed Table 5, but the prediction accuracy here was better. This is
expected because the response categories have a natural ordering in this data generating model.
To further investigate this comparison of unique regression coefficient vectors, Table 6 presents a
comparison between the number of unique regression coefficient vectors estimated by the GFMR method in each of the 100
replications for the case when L=S and the case when L={(1,2),(2,3),(3,4)}. Just as in Section 4.3,
these results show that GFMR using validation likelihood performs poorly at selecting the true number of unique coefficient
vectors for both penalty sets, but again on average predicts the true response category probabilities better than EN and GMR.
5 Two-Step Method for Reducing Response Categories
The simulations presented in Section 4 show that our method using validation likelihood to select the tuning parameter
performs well at predicting the true response category probabilities
when compared to EN and GMR. The same simulations show that
our method using validation likelihood does not perform well at model
selection. To improve
model selection performance, we propose an alternative
approach for low dimensional settings. This is a two-step method
that solves (4)
for an increasing sequence of λ’s until there are two
unique vectors in (β^1,…,βC^) .
This sequence provides a set of candidate models with different
response categories from which we will select the best by
refitting using unpenalized maximum likelihood and computing the
AIC. The selected model is the
candidate model with the minimum AIC. We also compute the AIC for the edge case where all probabilities are equal to 1/C.
We present a simulation to show the merits of this two-step
approach to select the
correct combined category model in multinomial logistic regression.
The explanatory variables values are generated using the same
procedure described in Section 4.2. In this simulation,
β1∗=−δ and
β2∗=β3∗=β4∗=0, and observed responses y1,…,yn are a realization of Y1,…,Yn
defined by (1). We consider (n,δ)∈{50,75}×{1,3}.
To evaluate the proposed two-step method, we investigate its ability to detect the correct response category grouping.
It is possible to detect the correct number of groups but the incorrect structure: we call this incorrect. It is also possible
that the method selects more groups than it should, and if two of
the groups were combined it would result in the correct structure: we call this one-step.
Turning parameters were selected from a subset of {10−10,10−9.98,…109.98,1010} and we set L=S. Since this simulation study has only 4 response categories, it was
computationally feasible to compute the best AIC by searching over all possible response category combinations, which makes
this exhaustive search a natural competitor.
In Table 7, we report the proportion of replications that the group structures of interest are selected for each (n,δ)
combination for the two-step method. Table 8 presents the results for the exhaustive search. These tables show
that for every of n and δ, the two-step method correctly picks the true response categories groups
more than any other group structure. In particular, the proposed two-step method performs as well or better than exhaustive
search for each (n,δ). These results show improvement in model selection performance when compared to
the simulation results from Section 4, suggesting
that if model selection is the interest the two step approach should be used.
6 Election Data Example
We analyze the dataset nes96 found in the CRAN package
faraway (Faraway, 2014). The response variable,
self-identified political
affiliation of voters, has 7 levels: strong Democrat, weak Democrat, independent Democrat, independent, independent
Republican, weak Republican, strong Republican. The explanatory variables are voter education level (categorical with 7 levels),
voter income (categorical with 24 levels), and voter age (numerical). An investigation
into both model selection based on validation likelihood tuning parameter selection and model selection based on using the two-step method was performed using L=S and L={(1,2),…,(6,7)}.
Model selection for GFMR using validation likelihood tuning parameter selection was performed using 5-fold cross validation
selecting the tuning parameter from the set
{10−10,10−9.98,…,109.98,1010}. In Table 9, we present
the response category combinations recommended by the GFMR regression coefficient estimates when L=S and L={(1,2),…,(6,7)}. The results show for the case when L=S, GFMR does not combine any response categories. When L={(1,2),…,(6,7)}, the results show that independent Democrats, independent Republicans, and independents have the
same estimated regression coefficient vectors.
We also show the results from the two-step approach proposed in Section 5. Both choices for L resulted in a
selected model with three response categories. These response category groups are shown in Table 10.
Bibliography15
The reference list from the paper itself. Each links out to its DOI / PubMed record.
1Agresti (2012) Agresti, A. (2012), Categorical Data Analysis , Hoboken, New Jersey: Wiley, third edition.
2Alaiz et al. (2013) Alaiz, C. M., Barbero, A., and Dorronsoro, J. R. (2013), “Group Fused Lasso,” in Artificial Neural Networks and Machine Learning ICANN 2013 , Berlin: Springer, 66–73.
3Bakin (1999) Bakin, S. (1999), “Adaptive Regression and Model Selection in Data Mining Problems,” Ph.D. thesis, Canberra: Australian National University.
4Boyd et al. (2010) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2010), “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers,” Foundations and Trends in Machine Learning , 3, 1–122.
5Chen et al. (2015) Chen, G. K., Chi, E. C., Ranola, J. M., and Lange, K. (2015), “Convex Clustering: An Attractive Alternative to Hierarchical Clustering,” P Lo S Computational Biology , 11.
6Danaher et al. (2013) Danaher, P., Wang, P., and Witten, D. (2013), “The Joint Graphical Lasso for Inverse Covariance Estimation Across Multiple Classes,” The Journal of Royal Statistical Society, Series B , 76, 373–397.
7Faraway (2014) Faraway, J. (2014), faraway: Functions and Datasets for Books by Julian Faraway. , URL http://CRAN.R-project.org/package=faraway , R package version 1.0.6.
8Friedman et al. (2008) Friedman, J., Hastie, T., and Tibshirani, R. (2008), “Regularized Paths for Generalized Linear Models via Coordinate Descent,” Journal of Statistical Software , 33.