# A powerful penalized multinomial logistic regression approach

**Authors:** Cornelia Fuetterer, Malte Nalenz, Thomas Augustin, Ruth M. Pfeiffer

PMC · DOI: 10.1007/s00180-025-01635-0 · 2025-05-25

## TL;DR

This paper introduces a new penalized regression method called DP-lasso for categorical outcomes, which improves variable selection in high-dimensional data.

## Contribution

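The novel DP-lasso method uses an adaptive L1-type penalty whose weights are based on each predictor's within- and between-outcome-category distances: predictors that separate the outcome categories well are penalized less.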
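As a rough orientation, a weighted L1 penalty for a multinomial logistic model is commonly written in the generic adaptive-lasso form below. This is a sketch under assumptions, not the paper's exact objective: the symbols $\ell$ (multinomial log-likelihood), $\beta_{jk}$ (coefficient of predictor $j$ for outcome category $k$), $\lambda$ (tuning parameter) and $w_j$ (predictor-specific weight) are notation introduced here for illustration.

$$
\hat{\beta} = \arg\min_{\beta} \left\{ -\ell(\beta) + \lambda \sum_{j=1}^{p} w_j \sum_{k=1}^{K} \lvert \beta_{jk} \rvert \right\}
$$

In this form, a small $w_j$ for a predictor with high discriminative power means weaker shrinkage and a higher chance of selection.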

## Key findings

- DP-lasso with ANOVA-based weights (DPan) produced sparser models with high true positive rates in high-dimensional settings.
- When p < N, DPan had the lowest false positive rates of all methods studied; when p was much larger than N, it had the highest true positive rate while keeping false positive rates low.
- All approaches were illustrated on several ultra high-dimensional single-cell RNA-sequencing datasets.

## Abstract

Penalized regression methods that shrink model coefficients are popular approaches to improve prediction and for variable selection in high-dimensional settings. We present a penalized (or regularized) regression approach for multinomial logistic models for categorical outcomes with a novel adaptive L1-type penalty term, that incorporates weights based on intra- and inter-outcome category distances of each predictor. A predictor that has large between- and small within-outcome category distances is penalized less and has a higher likelihood to be selected for the final model. We propose and study three measures for weight calculation: an analysis of variance (ANOVA)-based measure and two indices used in clustering approaches. Our novel approach, that we term the discriminative power lasso (DP-lasso), thus combines elements of marginal screening with regularized regression methods. We studied the performance of DP-lasso and other published methods in simulations with varying numbers of outcome categories, numbers of predictors, strengths of associations and predictor correlation structures. For correlated predictors, the DP-lasso approach with ANOVA based weights (DPan) resulted in much sparser models than other regularization approaches, especially in high-dimensional settings. When the number p of (correlated) predictors was much larger than the available sample size N, DPan had the highest true positive rate while maintaining low false positive rates for all simulation settings. Similarly, when $p < N$, DPan had high true positive rates and the lowest false positive rates of all methods studied. Thus we recommend DPan for analysing categorical outcomes in relation to high-dimensional predictors. We further illustrate all approaches in ultra high-dimensional settings, using several single-cell RNA-sequencing datasets.

The online version contains supplementary material available at 10.1007/s00180-025-01635-0.
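The weighting idea in the abstract (large between-category and small within-category spread leads to a smaller penalty) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the penalty weight of a predictor is the reciprocal of its one-way ANOVA F-statistic; the abstract does not state the paper's exact DPan formula, and the function names `anova_f_statistics` and `penalty_weights`, the `eps` guard, and the simulated data are all hypothetical.

```python
import numpy as np


def anova_f_statistics(X, y):
    """One-way ANOVA F-statistic of each column of X across the classes in y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        # Between-category spread: class mean versus grand mean.
        ss_between += Xc.shape[0] * (mean_c - grand_mean) ** 2
        # Within-category spread: observations versus their class mean.
        ss_within += ((Xc - mean_c) ** 2).sum(axis=0)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def penalty_weights(X, y, eps=1e-12):
    """Assumed weighting: reciprocal of the F-statistic (not the paper's formula)."""
    return 1.0 / (anova_f_statistics(X, y) + eps)


# Simulated data: three outcome categories, five predictors,
# and only predictor 0 shifts with the category.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 5))
X[:, 0] += 2.0 * y

w = penalty_weights(X, y)
print(np.round(w, 4))
```

Under this assumed weighting, predictor 0 receives the smallest weight, so a weighted L1 penalty would shrink it least. Such weights could then be passed to any solver that supports per-predictor penalty factors.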

## Full-text entities

- **Genes:** Nktcn1 (natural killer T cell numbers 1) [NCBI Gene 493025] {aka Nkt1}
- **Chemicals:** LPS (MESH:D008070), DP (-)
- **Species:** Mus musculus (house mouse, species) [taxon 10090]

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12552268/full.md

---
Source: https://tomesphere.com/paper/PMC12552268