A Constrained Optimization Perspective of Unrolled Transformers
Javier Porras-Valenzuela, Samar Hadou, Alejandro Ribeiro

TL;DR
This paper proposes a constrained optimization framework for training transformers that ensures layerwise descent, leading to models with improved robustness and generalization across tasks like video denoising and text classification.
Contribution
It introduces a primal-dual training scheme enforcing layerwise descent constraints, a novel approach for training transformers with enhanced robustness.
Findings
Constrained transformers exhibit stronger robustness to perturbations.
They maintain higher out-of-distribution generalization.
In-distribution performance is preserved.
Abstract
We introduce a constrained optimization framework for training transformers that behave like optimization descent algorithms. Specifically, we enforce layerwise descent constraints on the objective function and replace standard empirical risk minimization (ERM) with a primal-dual training scheme. This approach yields models whose intermediate representations decrease the loss monotonically in expectation across layers. We apply our method to both unrolled transformer architectures and conventional pretrained transformers on tasks of video denoising and text classification. Across these settings, we observe constrained transformers achieve stronger robustness to perturbations and maintain higher out-of-distribution generalization, while preserving in-distribution performance.
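As a rough illustration of the primal-dual idea described in the abstract, the sketch below computes the layerwise descent constraint slacks, the resulting Lagrangian, and one projected dual-ascent step from a table of per-layer losses. The constraint form (mean loss after layer l at most (1 - eps) times the mean loss after layer l - 1), the function names, and the parameters `eps` and `dual_lr` are assumptions made for this example, not the authors' exact formulation. The primal step, a gradient step on the model parameters against the Lagrangian, is omitted because it depends on the model and autodiff framework.

```python
import numpy as np


def descent_slacks(layer_losses, eps=0.05):
    """Batch-averaged slack of each (assumed) layerwise descent constraint.

    layer_losses has shape (L + 1, batch): row 0 holds the loss at the input,
    row l the loss after layer l. Constraint l is assumed to be
    mean(loss_l) <= (1 - eps) * mean(loss_{l-1}); a positive slack means
    the constraint is violated on this batch.
    """
    mean_losses = np.asarray(layer_losses, dtype=float).mean(axis=1)
    return mean_losses[1:] - (1.0 - eps) * mean_losses[:-1]


def lagrangian(layer_losses, lambdas, eps=0.05):
    """Final-layer mean loss plus multiplier-weighted constraint slacks."""
    final_mean = np.asarray(layer_losses, dtype=float)[-1].mean()
    return final_mean + float(np.dot(lambdas, descent_slacks(layer_losses, eps)))


def dual_step(lambdas, layer_losses, eps=0.05, dual_lr=0.1):
    """One projected dual-ascent step: grow multipliers of violated
    constraints, shrink the others, and clip at zero."""
    slacks = descent_slacks(layer_losses, eps)
    return np.maximum(0.0, np.asarray(lambdas, dtype=float) + dual_lr * slacks)
```

For instance, with mean per-layer losses of 5, 4, 4.5, and 1.5 and `eps=0.0`, only the second constraint is violated (the loss rises from 4 to 4.5), so a dual step starting from zero multipliers raises only the second multiplier. In a full training loop this dual step would alternate with primal updates of the network weights.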
Peer Reviews
Decision: Submitted to ICLR 2026
1. This work focuses on decreasing the expected loss value along the layers of transformer models. This is novel compared to prior work, most of which implements each layer as a gradient descent step on the optimization objective. As a result, the proposed method works across different transformer architectures.
2. A principled and theoretically motivated approach that provides certain performance guarantees. As a bonus, the algorithm itself is simple and generally applicable.
1. While the authors rightfully state that “the behavior of these networks [from previous works] is non-monotonic along the iterates”, the proposed constrained optimization algorithm applies only at the expectation level rather than the sample level. Hence, there is no guarantee that a network trained with the proposed algorithm will behave monotonically in a real-world setting of finite, streaming samples.
2. Another weakness concerns the experimental results. In the video denoising setting, only 5 out o
- The paper introduces a constrained optimization view of transformer training, in which each layer must monotonically reduce the expected loss, a property inspired by iterative optimization algorithms.
- It formalizes this idea rigorously using a primal–dual training framework, backed by proven results such as:
  - Convergence guarantees (Theorem 2)
  - Out-of-Distribution (OOD) generalization bounds (Theorem 4)
- The inclusion of expressivity and sample complexity terms (ν, ζ(M, δ)) provides
**1. Sacrificing in-domain performance**
Figure 2 indicates that the proposed constrained-optimization transformer underperforms compared to the vanilla ERM baseline on in-domain (ID) evaluation, while providing advantages mainly in out-of-domain (OOD) settings. This gap suggests that the imposed per-layer descent constraints may introduce an inductive bias that prioritizes generalization robustness at the expense of ID accuracy. While this trade-off can be acceptable in robustness-critical re
1. **Novelty:** Integrating a constrained training objective into transformers is an interesting touch, motivated by the success of traditional unrolled neural models.
2. **Theoretical foundation:** Although not a brand new contribution, the framework builds on the fairly well-established constrained learning framework of [1] and applies it to the Transformer architecture.
3. **Empirical support:** The effectiveness of the framework is supported with positive empirical results in video denoising
1. **Scope:** The constrained learning framework seems to be architecture-agnostic in most ways. This would mean that most of the theoretical and empirical results should ideally hold for any deep neural network. I think this needs to be discussed in the main text.
2. **Applicability:** Would the OOD gains achieved with the method scale with model size? Since the experiments are mostly small scale, it is not evident whether simply scaling the model size would overshadow the OOD benefits of the method.
3. *
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Image Enhancement Techniques
