Optimal Neural Network Approximation for High-Dimensional Continuous Functions

Ayan Maiti, Michelle Michelle, and Haizhao Yang

arXiv:2409.02363 · cs.LG · June 17, 2025

Open Access · 3 Reviews

TL;DR

This paper demonstrates that neural networks with a specific activation function can optimally approximate high-dimensional continuous functions using a number of parameters that grows linearly with the input dimension, improving efficiency over previous methods.

Contribution

Using a variant of the Kolmogorov Superposition Theorem, the authors show that a composition of neural networks can achieve super approximation with a number of parameters that is linear in the input dimension $d$.
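For orientation, the classical form of the Kolmogorov Superposition Theorem is recalled here. This is the textbook statement, not necessarily the variant used in the paper, and the symbols $\Phi_q$ and $\phi_{q,p}$ are notation chosen for this note. Every continuous $f \in C([0,1]^{d})$ can be written as

$$
f(x_1, \dots, x_d) = \sum_{q=0}^{2d} \Phi_q\!\left( \sum_{p=1}^{d} \phi_{q,p}(x_p) \right),
$$

where the inner functions $\phi_{q,p} : [0,1] \to \mathbb{R}$ are continuous and independent of $f$, and the outer functions $\Phi_q : \mathbb{R} \to \mathbb{R}$ are continuous and depend on $f$. Counting univariate functions in this classical form gives $d(2d+1)$ inner functions and $2d+1$ outer functions; how the paper's variant brings the total down to a linear count is not stated in this summary.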

Findings

01 · Achieves super approximation with $O(d)$ parameters

02 · Constructs networks for high-dimensional functions whose width and depth are fixed independently of the target accuracy

03 · Proves linear lower bounds on the number of parameters needed for approximation

Abstract

Recently, the authors of \cite{SYZ22} developed a neural network with width $36d(2d+1)$ and depth $11$, which utilizes a special activation function called the elementary universal activation function, to achieve the super approximation property for functions in $C([a,b]^{d})$. That is, the constructed network only requires a fixed number of neurons (and thus parameters) to approximate a $d$-variate continuous function on a $d$-dimensional hypercube with arbitrary accuracy. More specifically, only $O(d^{2})$ neurons or parameters are used. One natural question is whether we can reduce the number of these neurons or parameters in such a network. By leveraging a variant of the Kolmogorov Superposition Theorem, we show that there is a composition of networks generated by the elementary universal activation function with at most $10889d + 10887$ nonzero parameters…
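The two counts quoted in the abstract grow at different rates, which a few lines of arithmetic make concrete. The sketch below is illustrative only: the function names are hypothetical, and the two quantities are not the same kind of object (one is a layer width, the other a bound on nonzero parameters), so only their growth in $d$ is compared.

```python
def prior_width(d: int) -> int:
    """Width 36 d (2 d + 1) of the depth-11 network attributed to SYZ22."""
    return 36 * d * (2 * d + 1)


def new_param_bound(d: int) -> int:
    """Bound 10889 d + 10887 on nonzero parameters quoted in the abstract."""
    return 10889 * d + 10887


def doubling_ratio(count, d: int) -> float:
    """How much a count grows when d doubles: about 4 if quadratic, 2 if linear."""
    return count(2 * d) / count(d)


if __name__ == "__main__":
    # Hypothetical sample dimensions, chosen only to show the trend.
    for d in (1, 10, 100, 1000):
        print(
            f"d={d:>4}  prior width={prior_width(d):>10}  "
            f"new parameter bound={new_param_bound(d):>10}"
        )
    print("doubling ratio, prior width:", doubling_ratio(prior_width, 1000))
    print("doubling ratio, new bound:  ", doubling_ratio(new_param_bound, 1000))
```

For large $d$ the doubling ratio of the earlier width approaches 4, while that of the new parameter bound approaches 2, which is the quadratic-versus-linear distinction the abstract draws.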

Peer Reviews

Decision · ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 1 (strong reject) · Confidence 2

Strengths

- The construction is simple to follow and the statements of the theorems are somewhat easy to understand.

Weaknesses

- It seems that the theorems are obtained by combining previously obtained results by Shen et al. and Yarotsky, and there are not many innovations. This is not necessarily bad in itself, but combined with the other weaknesses, it makes this paper much less appealing.
- Is a minimum width of $d$ not trivial? If we only have access to $k<d$ ridge functions $\sigma(\langle w_k, x \rangle)$, then the neural network only depends on a $k$-dimensional projection of the data and none of section 3 seems to be necess
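The projection argument in the reviewer's second point can be illustrated numerically. The sketch below is not from the paper or the review: it assumes $d = 3$, $k = 2$, `tanh` as a stand-in for $\sigma$, and two hypothetical weight vectors. It shows that moving the input along a direction orthogonal to every $w_k$ leaves all ridge features unchanged.

```python
import math


def dot(u, v):
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    """Cross product in R^3; the result is orthogonal to both u and v."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def ridge_features(weights, x, sigma=math.tanh):
    """First-layer outputs sigma(<w_k, x>) for each weight vector w_k."""
    return [sigma(dot(w, x)) for w in weights]


# Hypothetical weights: k = 2 ridge directions in dimension d = 3.
W = [(1.0, 2.0, 3.0), (0.0, 1.0, -1.0)]

# A direction the first layer cannot see: orthogonal to both rows of W.
v = cross(W[0], W[1])

x = (0.5, -1.0, 2.0)
x_shifted = tuple(xi + 3.0 * vi for xi, vi in zip(x, v))

if __name__ == "__main__":
    print("features at x:        ", ridge_features(W, x))
    print("features at x + 3 * v:", ridge_features(W, x_shifted))
```

Any network built on top of these two features therefore takes the same value at `x` and `x_shifted`, so it cannot separate inputs that differ only along `v`.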

Reviewer 02 · Rating 1 (strong reject) · Confidence 3

Strengths

- Improvement of the width required to approximate the true function from $O(d^2)$ to $O(d)$ is significant.

Weaknesses

* There is significant room for improvement in the organization and writing of the paper.
* Only a part of the claim in Theorem 4 is proved. Specifically, for general $d$, the proof is given only for functions of the form $\sum_i \sin(x_i)$.
* I have a question about the proof of Theorem 4, as I have a counterexample for it. Let $d=2$, $w_{11} = w_{12} = w$, and $F_1 = F_2 = \sin$. Then, the solution of the system $[w_{11} w_{12}][x_1 x_2]^\top = 0$ is $(x, -x)$. However, when we have $\sin(x) - \

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

Originality: The related works are adequately cited. The authors construct neural networks of $O(d)$ neurons to approximate any continuous function with input domain $[0,1]^{d}$ and show that the bound $O(d)$ is optimal. This is an interesting result, which will certainly help us better understand the universal approximation property of deep neural networks from a theoretical perspective. I have checked the technical parts and found that the proofs appear sound.

Quality: This paper is techn

Weaknesses

Although the bound $O(d)$ is optimal and the construction on achieving $O(d)$ neurons is given, I found that the setting of activation functions is artificial in some sense. For example, in the main Theorem 3, the authors require a combination of super-expressive activation functions and EUAF, which makes the DNN not very practical. It would be interesting to derive the results for one fixed activation function and for more architectures used in practice.

Taxonomy

Topics: Neural Networks and Applications