Learning Randomized Algorithms with Transformers
Johannes von Oswald, Seijin Kobayashi, Yassir Akram, Angelika Steger

TL;DR
This paper introduces a novel approach to integrating randomization into transformer models through learning, enhancing their robustness and performance on various tasks by leveraging data-driven randomized algorithms.
Contribution
It demonstrates for the first time that randomized algorithms can be learned within transformer models using standard optimization methods, improving robustness and effectiveness.
Findings
Transformers can effectively incorporate learned randomization.
Randomized transformers show increased robustness against adversaries.
Performance improvements are observed across multiple tasks.
Abstract
Randomization is a powerful tool that endows algorithms with remarkable properties. For instance, randomized algorithms excel in adversarial settings, often surpassing the worst-case performance of deterministic algorithms with large margins. Furthermore, their success probability can be amplified by simple strategies such as repetition and majority voting. In this paper, we enhance deep neural networks, in particular transformer models, with randomization. We demonstrate for the first time that randomized algorithms can be instilled in transformers through learning, in a purely data- and objective-driven manner. First, we analyze known adversarial objectives for which randomized algorithms offer a distinct advantage over deterministic ones. We then show that common optimization techniques, such as gradient descent or evolutionary strategies, can effectively learn transformer parameters…
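The amplification claim in the abstract (repetition plus majority voting) can be illustrated with a short calculation. The sketch below is not taken from the paper. It assumes a binary answer, fully independent runs, and a hypothetical per-run success probability `p`; the function names are illustrative only.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most frequent answer among repeated runs."""
    return Counter(answers).most_common(1)[0][0]


def majority_success_probability(p: float, k: int) -> float:
    """Probability that a strict majority of k independent runs is correct.

    Assumes each run is correct independently with probability p and that
    the answer is binary, so a wrong majority is the only failure mode.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number to avoid ties")
    need = k // 2 + 1  # smallest number of correct runs forming a majority
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


if __name__ == "__main__":
    # Hypothetical per-run success probability of 0.6.
    for k in (1, 3, 11, 101):
        print(k, round(majority_success_probability(0.6, k), 4))
```

Under these assumptions, a per-run success probability of 0.6 rises to 0.648 with three runs and keeps increasing as more runs are added. The gain only appears when `p` exceeds 0.5; below that threshold, majority voting makes the result worse.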
Peer Reviews
Decision: ICLR 2025 Oral
Very well-written paper, easy to follow, with extensive examples and related work material both in the main paper and in the appendix. Well-motivated and executed study for the possibility to learn randomized algorithms. 1. Convincing results of superior worst-case performance in the considered tasks when compared to training alternatives, ablating some aspect of the loss, the multi-seed version trained on the relaxed adversarial loss is the best. 2. Experiments cover well important hyperparame
1.a. I would personally have enjoyed seeing experiments with actual adversarial robustness training, using a threat model on human preference alignment data, applied to generative LLMs. 1.b. Lines 216-218 mention that adversarial robustness training is difficult; however, recent literature has provided a framework for that [1]. While it does not diminish the contributions of this paper, a comparison with the methodology mentioned in the paper would have been great! [1] Xhonneux et al., “Effi
1) Innovative integration of randomization: Although randomization is a concept that has been used with Transformers over the years and at different levels (e.g. positional encodings, attention weights), the paper introduces the original concept of defining randomized algorithms within neural networks through learning: by simply employing repetitions and strategies such as majority voting, this approach outperforms the deterministic approach. 2) Comprehensive experimental design: The paper provides
1) Scalability challenges: The most important concern about the proposed methodology is the computational cost, particularly given the reliance on multiple seeds and adversarial loss training. The authors acknowledge this in the "Summary and Limitations" paragraph, noting that scaling the approach to larger settings may require significant computational resources, which limits the practicality and broader applicability of the approach. I consider that such a problem should have been addressed i
- The authors provide a strong theoretical foundation for why randomization is advantageous in certain adversarial contexts, referencing game theory and established concepts like Yao’s Minimax Principle. - The study highlights that randomization can increase resilience against adversarial attacks.
- While the paper presents conceptual tasks to validate the approach, it does not provide empirical results on large-scale or real-world datasets. This limitation raises questions about how well the method would scale and perform in more complex, realistic environments. - The approach requires sampling multiple seeds during training (controlled by the hyperparameter m), which can increase computational overhead significantly. The authors note that this limitation affects memory, training time,
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Face and Expression Recognition
