Polynomial, trigonometric, and tropical activations

Ismail Khalfaoui-Hassani; Stefan Kesselheim

arXiv:2502.01247 · cs.LG · March 3, 2026

TL;DR

This paper investigates activation functions built from orthonormal bases (Hermite polynomial and Fourier trigonometric) and from a tropicalized polynomial basis. It shows that these activations can train deep neural networks and offers insight into network structure and approximation capabilities.

Contribution

The study introduces new activation functions based on orthonormal bases, shows that they can train deep models with a simple variance-preserving initialization and no additional clamping mechanisms, and provides interpretability by viewing the resulting networks as polynomial mappings.

Findings

01. The proposed activations can train deep models such as GPT-2 and ConvNeXt.
02. The approach addresses exploding and vanishing activations and gradients, which are particularly prevalent with polynomial activations.
03. The activations can approximate classical functions via Hermite interpolation.
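The scale-stability issue behind finding 02 can be illustrated with a small numerical check. The sketch below is not taken from the paper or from the torchortho repository. It assumes a standard Gaussian pre-activation x ~ N(0, 1) and uses the normalized probabilists' Hermite polynomials He_k(x)/sqrt(k!), which are orthonormal under that distribution. Under these assumptions the second moment of f(x) = sum_k a_k He_k(x)/sqrt(k!) equals sum_k a_k^2, so coefficients with unit squared norm keep the output scale at 1. The function names and the equal-coefficient choice are illustrative only.

```python
import math

import numpy as np
from numpy.polynomial import hermite_e as He


def hermite_activation(x, coeffs):
    """Evaluate f(x) = sum_k coeffs[k] * He_k(x) / sqrt(k!).

    He_k are the probabilists' Hermite polynomials; dividing by sqrt(k!)
    makes them orthonormal with respect to the standard Gaussian.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    norms = np.array([math.sqrt(math.factorial(k)) for k in range(len(coeffs))])
    return He.hermeval(x, coeffs / norms)


def gaussian_second_moment(coeffs, n_nodes=40):
    """Compute E[f(x)^2] for x ~ N(0, 1) with Gauss-Hermite quadrature."""
    # hermegauss uses the weight exp(-x^2 / 2); its weights sum to sqrt(2*pi).
    nodes, weights = He.hermegauss(n_nodes)
    values = hermite_activation(nodes, coeffs)
    return float(np.sum(weights * values**2) / math.sqrt(2.0 * math.pi))


# Illustrative choice (an assumption, not the paper's scheme):
# equal coefficients scaled so that sum_k a_k^2 = 1.
degree = 3
coeffs = np.full(degree + 1, 1.0 / math.sqrt(degree + 1))
second_moment = gaussian_second_moment(coeffs)
print(f"E[f(x)^2] under N(0, 1): {second_moment:.6f}")
```

Because the second moment depends only on the squared norm of the coefficient vector, this kind of normalization does not need to change as the polynomial degree grows. The same is not true of a raw monomial basis.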

Abstract

Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks…
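To make "tropicalization of a polynomial basis" concrete, here is a minimal sketch of a tropical (max-plus) polynomial. In the max-plus semiring, addition becomes max and multiplication becomes ordinary addition, so the polynomial sum_k a_k x^k turns into max_k (a_k + k*x), a convex piecewise-linear function. The paper's exact parametrization and initialization are not given in this summary, so the function name and coefficient layout below are assumptions.

```python
import numpy as np


def tropical_polynomial(x, coeffs):
    """Evaluate max_k (coeffs[k] + k * x), the max-plus analogue of sum_k a_k x^k."""
    x = np.asarray(x, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    degrees = np.arange(len(coeffs))
    # Broadcast to shape (..., n_terms): one affine piece per degree.
    terms = coeffs + degrees * x[..., None]
    return terms.max(axis=-1)


# With coefficients [0, 0] the tropical polynomial is max(0, x), i.e. ReLU.
xs = np.linspace(-2.0, 2.0, 9)
relu_like = tropical_polynomial(xs, [0.0, 0.0])
print(relu_like)
```

Each coefficient shifts one affine piece up or down, so learning the coefficients moves the breakpoints of the piecewise-linear function.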

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper is well written and presented with a clear structure.
2. The theorem-proof logic is clear and rigorous.
3. The visualization helps explain the conclusion.

Weaknesses

Despite the strengths, here are some weaknesses:
1. The motivation is not clear: is it just an exploration of activations? I am not sure whether the proposed activation functions solve any existing problems. (I do know that not every innovative thought must solve something discrete, but I suggest the authors refine this part.)
2. Since the paper is not the first to design a new kind of activation function, nor even the first to use orthogonal polynomials, I am not sure what is the c

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper provides a rigorous variance-preserving initialization framework that unifies different activation families under an orthogonal function perspective. This is both mathematically elegant and practically meaningful.
2. By addressing Hermite, Fourier, and tropical bases, the study gives a broad view of orthogonal and piecewise-linear activations, including insightful links to classical activations (ReLU, GELU).
3. Experiments on ImageNet (ConvNeXt) and OpenWebText (GPT-2) convincingly

Weaknesses

1. The reported 30–90% slower training speed (Section 6) is significant. The paper would benefit from more detailed timing analyses and GPU utilization comparisons to quantify the trade-off between performance and efficiency.
2. The experiments focus on classification and next-token prediction tasks. Additional ablations (e.g., fine-tuning, transfer learning, adversarial robustness) could help demonstrate broader applicability.

Reviewer 03 · Rating 8 · Confidence 2

Strengths

- The main idea is novel and well-motivated.
- The thorough theoretical support on the initialization methods is a valuable contribution to the community.
- I appreciate the benchmarking of the method across both text and vision tasks.
- The latency analysis is an important addition.

Weaknesses

- I am missing an ablation over different backbones for both vision and language benchmarks.
- Although not a major weakness, additional experimental support on challenging benchmarks would increase the impact of the paper, e.g., on COCO for vision-related tasks.
- A discussion on the application of the proposed activation functions to generative models (e.g., diffusion-based models) would be interesting.

Minor:
- The last sentence in ln. 485 seems to end abruptly.

Code & Models

Repositories

K-H-Ismail/torchortho
PyTorch · Official

Taxonomy

Topics: Advanced Topics in Algebra

Methods: Attention Is All You Need · Cosine Annealing · Adam · Softmax · Dropout · Weight Decay · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Attention Dropout