LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models
Kai Hu, Haoqi Hu, Matt Fredrikson

TL;DR
LipNeXt introduces a scalable, constraint-free, convolution-free 1-Lipschitz architecture that achieves state-of-the-art certified robustness on large models and datasets, demonstrating the potential of Lipschitz-based certification for modern deep learning.
Contribution
The paper presents LipNeXt, a novel 1-Lipschitz architecture that scales to billion-parameter models using manifold optimization and spatial shift modules, without convolutions or constraints.
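To illustrate what updating parameters directly on the orthogonal manifold can look like, here is a minimal NumPy sketch that uses a Cayley-transform retraction. This is a standard textbook technique swapped in for illustration, not necessarily the procedure LipNeXt uses; the function name `cayley_step`, the matrix size, and the step size are all hypothetical.

```python
import numpy as np

def cayley_step(W, G, lr):
    """One descent step that keeps a square matrix W orthogonal.

    W  : (n, n) orthogonal matrix
    G  : (n, n) Euclidean gradient of the loss with respect to W
    lr : step size (hypothetical; chosen by the caller)
    """
    n = W.shape[0]
    # A is skew-symmetric (A.T == -A), so the Cayley transform of A is orthogonal.
    A = G @ W.T - W @ G.T
    I = np.eye(n)
    # W_new = (I + lr/2 * A)^{-1} (I - lr/2 * A) W, computed with a linear solve.
    return np.linalg.solve(I + 0.5 * lr * A, (I - 0.5 * lr * A) @ W)

# Illustrative data only: a random orthogonal matrix and a random "gradient".
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((8, 8)))
G = rng.standard_normal((8, 8))

W_new = cayley_step(W, G, lr=0.1)
orth_error = np.abs(W_new.T @ W_new - np.eye(8)).max()
print("max deviation from orthogonality:", orth_error)
```

Because the update is a product of orthogonal matrices, no projection or penalty term is needed after the step, which is the sense in which such training can be called constraint-free.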
Findings
Achieves state-of-the-art certified robustness on CIFAR-10/100 and Tiny-ImageNet.
Scales to 1-2 billion parameters on ImageNet, improving robustness over prior Lipschitz models.
Maintains efficient, stable low-precision training while providing deterministic robustness guarantees.
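As a rough illustration of how a deterministic Lipschitz-based certificate is typically computed, the sketch below uses the standard margin bound: for a network that is L-Lipschitz in the L2 norm, a prediction cannot change within radius margin / (sqrt(2) * L). The function names, the example logits, and the budget `eps=0.5` are hypothetical and are not taken from the paper.

```python
import numpy as np

def certified_radius(logits, lipschitz_const=1.0):
    """L2 radius within which the top-1 prediction provably cannot change."""
    logits = np.asarray(logits, dtype=float)
    second, first = np.sort(logits)[-2:]  # two largest logits, ascending
    margin = first - second
    return margin / (np.sqrt(2.0) * lipschitz_const)

def certified_robust_accuracy(logits_batch, labels, eps):
    """Fraction of samples that are correct AND certified at radius eps."""
    logits_batch = np.asarray(logits_batch, dtype=float)
    labels = np.asarray(labels)
    correct = logits_batch.argmax(axis=1) == labels
    radii = np.array([certified_radius(row) for row in logits_batch])
    return float(np.mean(correct & (radii >= eps)))

# Hypothetical logits for three samples and three classes.
logits_batch = [[2.0, 0.0, 1.0],   # correct, large margin
                [0.1, 0.0, 0.0],   # correct, margin too small to certify
                [0.0, 3.0, 0.0]]   # misclassified
labels = [0, 0, 2]
cra = certified_robust_accuracy(logits_batch, labels, eps=0.5)
print("certified robust accuracy:", cra)
```

The certificate costs one forward pass per input, with no sampling or search, which is why this style of certification is described as efficient and deterministic.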
Abstract
Lipschitz-based certification offers efficient, deterministic robustness guarantees but has struggled to scale in model size, training efficiency, and ImageNet performance. We introduce LipNeXt, the first constraint-free and convolution-free 1-Lipschitz architecture for certified robustness. LipNeXt is built using two techniques: (1) a manifold optimization procedure that updates parameters directly on the orthogonal manifold and (2) a Spatial Shift Module to model spatial patterns without convolutions. The full network uses orthogonal projections, spatial shifts, a simple 1-Lipschitz -Abs nonlinearity, and spatial pooling to maintain tight Lipschitz control while enabling expressive feature mixing. Across CIFAR-10/100 and Tiny-ImageNet, LipNeXt achieves state-of-the-art clean and certified robust accuracy (CRA), and on ImageNet it scales to 1-2B…
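One way to see why a parameter-free spatial shift can mix spatial information without loosening the Lipschitz bound is the sketch below. It assumes circular (wrap-around) shifts and a split of the channels into five groups; both choices, along with the name `spatial_shift`, are illustrative assumptions and may differ from the paper's Spatial Shift Module.

```python
import numpy as np

# Hypothetical per-group offsets (rows, cols): down, up, right, left, identity.
OFFSETS = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)]

def spatial_shift(x):
    """Circularly shift channel groups of x, an array of shape (C, H, W).

    Each group is rolled by one pixel in a different direction. A circular
    roll only permutes entries, so the L2 norm of x is preserved exactly.
    """
    groups = np.array_split(x, len(OFFSETS), axis=0)
    shifted = [np.roll(g, shift=off, axis=(1, 2)) for g, off in zip(groups, OFFSETS)]
    return np.concatenate(shifted, axis=0)

# Illustrative inputs only.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 6, 6))
y = rng.standard_normal((10, 6, 6))

input_dist = np.linalg.norm(x - y)
output_dist = np.linalg.norm(spatial_shift(x) - spatial_shift(y))
print("input distance:", input_dist, "output distance:", output_dist)
```

Under these assumptions the shift is a fixed permutation of the input entries, so it is an isometry; composing it with an orthogonal channel-mixing matrix keeps the whole block 1-Lipschitz.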
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper is well-organized and clearly written. 2. The proposed manifold optimization and spatial shift techniques are interesting and technically sound.
1. Some important baselines are not discussed in the related work section, such as Sandwich and BRONet. 2. The method integrates several known techniques, making it somewhat difficult to assess the individual effectiveness of each component.
The paper provides solid empirical evidence, along with an ablation study, to support the main claims.
Table 2 presents results with additional data; however, I noticed that the total number of parameters for the proposed model is 256M, which is significantly larger than the competitors. Could the authors provide results for a smaller model configuration, such as L32W1024?
The proposed method exhibits strong rigor and novelty. Almost all designs are well motivated and supported by strong theoretical analysis. The final overall performance demonstrates the effectiveness of the general algorithm. Detailed ablation studies are provided in the appendix to demonstrate the effectiveness of individual modules.
The overall algorithm seems costly. On the memory side, the optimizer requires a copy of the full parameters, which doubles the memory cost; this is especially concerning for a model with billions of parameters. On the computation side, the main results are obtained on 8xH100 GPUs, which seems hard for academic labs to reproduce and not scalable to harder tasks. I will not attack the main contribution due to the costs, though. The parameter efficiency is in question. All comparisons, although meaningf
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
