A Constrained Optimization Perspective of Unrolled Transformers

Javier Porras-Valenzuela; Samar Hadou; Alejandro Ribeiro

arXiv:2601.17257·cs.LG·January 27, 2026


Open Access · 3 Reviews

TL;DR

This paper proposes a constrained optimization framework for training transformers that ensures layerwise descent, leading to models with improved robustness and generalization across tasks like video denoising and text classification.

Contribution

It introduces a primal-dual training scheme enforcing layerwise descent constraints, a novel approach for training transformers with enhanced robustness.
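As a rough sketch of what a layerwise descent constraint with primal-dual training can look like, one plausible formulation is written out here. The symbols ($\theta$ for the weights, $x_l$ for the layer-$l$ output, $f$ for the objective, $\epsilon$ for a required relative decrease, $\lambda_l$ for the dual variables, $\eta_\theta$ and $\eta_\lambda$ for step sizes) and the $(1-\epsilon)$ factor are assumptions made for illustration; the paper's exact constraint may differ.

```latex
% Constrained training problem (assumed form): minimize the final-layer loss
% subject to an expected decrease at every layer.
\begin{aligned}
\min_{\theta}\quad & \mathbb{E}\big[f(x_L;\theta)\big] \\
\text{s.t.}\quad & \mathbb{E}\big[f(x_l;\theta) - (1-\epsilon)\,f(x_{l-1};\theta)\big] \le 0,
\qquad l = 1,\dots,L .
\end{aligned}

% Lagrangian with one multiplier per layer.
\mathcal{L}(\theta,\lambda) = \mathbb{E}\big[f(x_L;\theta)\big]
  + \sum_{l=1}^{L} \lambda_l\,
    \mathbb{E}\big[f(x_l;\theta) - (1-\epsilon)\,f(x_{l-1};\theta)\big]

% Alternating updates: gradient descent on theta, projected gradient ascent on lambda.
\theta \leftarrow \theta - \eta_\theta \nabla_\theta \mathcal{L}(\theta,\lambda),
\qquad
\lambda_l \leftarrow \Big[\lambda_l + \eta_\lambda\,
  \mathbb{E}\big[f(x_l;\theta) - (1-\epsilon)\,f(x_{l-1};\theta)\big]\Big]_+
```

In this form a multiplier grows only while its layer violates the descent condition on average and shrinks back toward zero once the condition holds, so the constraints are imposed in expectation rather than per sample.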

Findings

01

Constrained transformers exhibit stronger robustness to perturbations.

02

They maintain higher out-of-distribution generalization.

03

In-distribution performance is preserved.

Abstract

We introduce a constrained optimization framework for training transformers that behave like optimization descent algorithms. Specifically, we enforce layerwise descent constraints on the objective function and replace standard empirical risk minimization (ERM) with a primal-dual training scheme. This approach yields models whose intermediate representations decrease the loss monotonically in expectation across layers. We apply our method to both unrolled transformer architectures and conventional pretrained transformers on tasks of video denoising and text classification. Across these settings, we observe constrained transformers achieve stronger robustness to perturbations and maintain higher out-of-distribution generalization, while preserving in-distribution performance.
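To make the primal-dual idea concrete, the following is a minimal NumPy sketch on a toy problem, not the authors' implementation. It assumes a least-squares objective f(x) = 0.5 * ||A x - b||^2, a four-layer "unrolled" model whose only trainable parameters are per-layer step sizes, and a descent constraint of the form mean[f(x_l) - (1 - EPS) * f(x_{l-1})] <= 0. All names (`layer_losses`, `constraint_slacks`, `dual_update`, `train`), the operator `A`, and the hyperparameters are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 0.5])        # toy forward operator (assumed)
B = rng.normal(size=(64, 2))   # 64 toy measurement vectors b, one per row
NUM_LAYERS, EPS = 4, 0.1       # depth and required relative decrease (assumed)

def layer_losses(theta):
    """Mean of f(x_l) = 0.5 * ||A x_l - b||^2 for l = 0..L, starting at x_0 = 0."""
    x = np.zeros_like(B)
    losses = [0.5 * np.mean(np.sum((x @ A.T - B) ** 2, axis=1))]
    for step in theta:
        # One unrolled layer: x <- x - step * A^T (A x - b), applied row-wise.
        x = x - step * (x @ A.T - B) @ A
        losses.append(0.5 * np.mean(np.sum((x @ A.T - B) ** 2, axis=1)))
    return np.array(losses)

def constraint_slacks(theta):
    """g_l = mean f(x_l) - (1 - EPS) * mean f(x_{l-1}); feasible when g_l <= 0."""
    f = layer_losses(theta)
    return f[1:] - (1.0 - EPS) * f[:-1]

def lagrangian(theta, lam):
    """Final-layer loss plus multiplier-weighted constraint slacks."""
    return layer_losses(theta)[-1] + lam @ constraint_slacks(theta)

def numerical_grad(fun, theta, h=1e-5):
    """Central finite differences; stands in for autodiff in this sketch."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        grad[i] = (fun(theta + e) - fun(theta - e)) / (2 * h)
    return grad

def dual_update(lam, slacks, lr):
    """Projected gradient ascent on the multipliers (kept non-negative)."""
    return np.maximum(0.0, lam + lr * slacks)

def train(num_steps=300, primal_lr=0.05, dual_lr=0.1):
    theta = np.zeros(NUM_LAYERS)   # per-layer step sizes (primal variables)
    lam = np.zeros(NUM_LAYERS)     # one multiplier per layer (dual variables)
    for _ in range(num_steps):
        grad = numerical_grad(lambda t: lagrangian(t, lam), theta)
        theta = theta - primal_lr * grad                           # primal descent
        lam = dual_update(lam, constraint_slacks(theta), dual_lr)  # dual ascent
    return theta, lam

theta, lam = train()
print("mean loss per layer:", layer_losses(theta))
print("constraint slacks:  ", constraint_slacks(theta))
print("multipliers:        ", lam)
```

The constraints here act on the mean over the 64 samples, mirroring the "in expectation" wording of the abstract; nothing in this sketch forces every individual sample to decrease at every layer.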

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. This work focuses on decreasing the expected loss value along the layers of the transformer models. This is novel compared to prior work, where most implement each layer as a gradient descent step of the optimization objective. This makes the proposed method work across different transformer architectures.
2. A rather principled and theoretically motivated approach that provides certain performance guarantees. As a bonus, the algorithm itself is also simple enough and is generally applicable.

Weaknesses

1. While the authors rightfully state that “the behavior of these networks [from previous works] is non-monotonic along the iterates”, the proposed constrained optimization algorithm only applies to the expectation level rather than sample level. Hence, there is no guarantee that the network from the proposed algorithm will behave monotonically in a real-world setting of finite, streaming samples.
2. Another weakness concerns the experimental results. In the video denoising setting, only 5 out o

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper introduces a constrained optimization view of transformer training, in which each layer must monotonically reduce the expected loss—a property inspired by iterative optimization algorithms.
- It formalizes this idea rigorously using a primal–dual training framework, backed by proven results such as:
  - Convergence guarantees (Theorem 2)
  - Out-of-Distribution (OOD) generalization bounds (Theorem 4)
- The inclusion of expressivity and sample complexity terms (ν, ζ(M, δ)) provides

Weaknesses

**1. Sacrificing in-domain performance** Figure 2 indicates that the proposed constrained‑optimization transformer underperforms compared to the vanilla ERM baseline on in‑domain (ID) evaluation, while providing advantages mainly in out‑of‑domain (OOD) settings. This gap suggests that the imposed per‑layer descent constraints may introduce an inductive bias that prioritizes generalization robustness at the expense of ID accuracy. While this trade‑off can be acceptable in robustness‑critical re

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. **Novelty:** Integrating a constrained training objective into transformers is an interesting touch, motivated by the success of traditional unrolled neural models.
2. **Theoretical foundation:** Although not a brand new contribution, the framework is based on the fairly well-established constrained learning framework of [1] and applies it to the Transformer architecture.
3. **Empirical support:** The effectiveness of the framework is supported with positive empirical results in video denoising

Weaknesses

1. **Scope:** The constrained learning framework seems to be architecture-agnostic in most ways. This would mean most theoretical as well as empirical results should ideally hold across any deep neural network. I think this needs to be discussed in the main text.
2. **Applicability:** Would the OOD gains achieved with the method scale with model size? Since the experiments are mostly small scale, it is not evident whether simply scaling the model size would overshadow the OOD benefits of the method.
3. *


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Image Enhancement Techniques