Multiscale Training of Convolutional Neural Networks
Shadab Ahamed, Niloufar Zakariaei, Eldad Haber, Moshe Eliasof

TL;DR
This paper introduces a multiscale training method for CNNs that combines a multilevel gradient estimator with coarse-to-fine training, significantly reducing the computational cost of training on high-resolution images while maintaining accuracy.
Contribution
The authors propose Multiscale Gradient Estimation (MGE) and a Full-Multiscale training algorithm that together enable efficient CNN training on high-resolution data, reducing computation by up to 16 times.
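The coarse-to-fine idea behind Full-Multiscale training can be illustrated on a toy problem. The following is a minimal sketch, not the authors' algorithm: the 3x3 periodic convolution, average-pool coarsening, least-squares loss, plain gradient descent, and the `steps_per_level` schedule are all assumptions chosen for illustration. It fits the kernel on the coarsest images first and carries the weights over as the starting point for each finer level.

```python
import numpy as np

# Offsets of a 3x3 stencil; theta holds one weight per offset (toy model, assumed).
SHIFTS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def coarsen(images, level):
    """Average-pool a batch of shape (N, H, W) by 2**level along each axis."""
    f = 2 ** level
    n, h, w = images.shape
    return images.reshape(n, h // f, f, w // f, f).mean(axis=(2, 4))

def loss_and_grad(theta, x, y):
    """Least-squares loss of a periodic 3x3 convolution and its exact gradient."""
    shifted = np.stack([np.roll(x, s, axis=(1, 2)) for s in SHIFTS])  # (9, N, H, W)
    residual = np.tensordot(theta, shifted, axes=1) - y               # (N, H, W)
    loss = 0.5 * np.mean(residual ** 2)
    grad = np.mean(residual * shifted, axis=(1, 2, 3))                # (9,)
    return loss, grad

def full_multiscale_fit(images, targets, steps_per_level, lr=0.5):
    """steps_per_level[l] is the number of gradient steps at level l (0 = finest)."""
    theta = np.zeros(len(SHIFTS))
    for level in reversed(range(len(steps_per_level))):  # coarsest level first
        x, y = coarsen(images, level), coarsen(targets, level)
        for _ in range(steps_per_level[level]):
            _, grad = loss_and_grad(theta, x, y)
            theta = theta - lr * grad
        # theta is kept as-is: it is the hot start for the next finer level
    return theta

# Synthetic data: targets come from a hidden "true" kernel (hypothetical setup).
rng = np.random.default_rng(0)
images = rng.normal(size=(8, 32, 32))
true_theta = rng.normal(size=9)
shifted = np.stack([np.roll(images, s, axis=(1, 2)) for s in SHIFTS])
targets = np.tensordot(true_theta, shifted, axes=1)

# Most steps are spent on the cheap coarse levels, few on the finest one.
theta = full_multiscale_fit(images, targets, steps_per_level=[10, 40, 160])
initial_loss, _ = loss_and_grad(np.zeros(9), images, targets)
final_loss, _ = loss_and_grad(theta, images, targets)
```

The kernel has the same shape at every level, so the hot start here is simply weight reuse; how the paper transfers parameters between meshes is not stated in this summary.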
Findings
Reduces CNN training costs by 4-16x on high-res images.
Maintains performance with no significant accuracy loss.
Applicable across various CNN architectures and tasks.
Abstract
Training convolutional neural networks (CNNs) on high-resolution images is often bottlenecked by the cost of evaluating gradients of the loss on the finest spatial mesh. To address this, we propose Multiscale Gradient Estimation (MGE), a Multilevel Monte Carlo-inspired estimator that expresses the expected gradient on the finest mesh as a telescopic sum of gradients computed on progressively coarser meshes. By assigning larger batches to the cheaper coarse levels, MGE achieves the same variance as single-scale stochastic gradient estimation while reducing the number of fine mesh convolutions by a factor of 4 with each downsampling. We further embed MGE within a Full-Multiscale training algorithm that solves the learning problem on coarse meshes first and "hot-starts" the next finer level, cutting the required fine mesh iterations by an additional order of magnitude. Extensive…
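The telescopic-sum construction described above can be sketched on a toy problem. This is a hedged illustration, not the paper's implementation: the 3x3 periodic convolution, the least-squares loss, average-pool coarsening, and the batch sizes are assumptions. With level 0 the finest mesh and level L the coarsest, the estimator uses the identity E[g_0] = E[g_L] + sum over l < L of E[g_l - g_(l+1)], evaluating each correction term on its own batch and giving the coarsest term the largest batch.

```python
import numpy as np

# Offsets of a 3x3 stencil; theta holds one weight per offset (toy model, assumed).
SHIFTS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def coarsen(images, level):
    """Average-pool a batch of shape (N, H, W) by 2**level along each axis."""
    f = 2 ** level
    n, h, w = images.shape
    return images.reshape(n, h // f, f, w // f, f).mean(axis=(2, 4))

def conv3x3(images, theta):
    """Periodic 3x3 convolution written as a weighted sum of shifted copies."""
    return sum(t * np.roll(images, s, axis=(1, 2)) for t, s in zip(theta, SHIFTS))

def grad_at_level(theta, images, targets, level):
    """Gradient of 0.5 * mean((conv(x) - y)**2) with x, y coarsened to `level`."""
    x, y = coarsen(images, level), coarsen(targets, level)
    residual = conv3x3(x, theta) - y
    return np.array([np.mean(residual * np.roll(x, s, axis=(1, 2))) for s in SHIFTS])

def mge_gradient(theta, batches):
    """Telescoping estimate of the finest-level gradient.

    batches[l] is an (images, targets) pair used for level l;
    batches[-1] feeds the coarsest level and should be the largest.
    """
    coarsest = len(batches) - 1
    g = grad_at_level(theta, *batches[coarsest], coarsest)
    for level in range(coarsest):
        imgs, tgts = batches[level]
        # The same samples are used at both resolutions of each correction term.
        g += grad_at_level(theta, imgs, tgts, level) - grad_at_level(theta, imgs, tgts, level + 1)
    return g

# Synthetic data generated from a hidden kernel (hypothetical setup).
rng = np.random.default_rng(0)
true_theta = rng.normal(size=9)

def sample(n, size=32):
    x = rng.normal(size=(n, size, size))
    return x, conv3x3(x, true_theta)

theta = np.zeros(9)
# Small batch on the fine mesh, larger batches on the cheaper coarse meshes.
g = mge_gradient(theta, [sample(4), sample(16), sample(64)])
```

If every level is given the same batch, the correction terms cancel and `mge_gradient` returns exactly the fine-mesh gradient of that batch, which is a convenient sanity check on the telescoping sum.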
Peer Reviews
Decision: Submitted to ICLR 2025
This paper introduces an innovative approach to multiscale CNN training through Multiscale-SGD and Mesh-Free Convolutions (MFCs). All the developments are backed by mathematical reasoning, and the whole paper is logically organized.
1. The experimental results are really limited, and you should also compare performance under a fixed computational budget. 2. An experimental comparison with existing multiscale or Fourier-based CNN methods is not presented. 3. The mathematical foundation for Mesh-Free Convolutions, particularly the differential operator theory, could be challenging for readers less familiar with this domain. Adding a visual explanation or intuitive analogies could make the theory more accessible. 4. In th
I believe this paper provides several important contributions and extends the existing literature in different significant ways: __S1.__ The authors propose new techniques for training CNNs within a multiscale resolution framework, which improve computational efficiency while maintaining test performance. This alone can be a valuable contribution, as shown by the “non-MFC” empirical results in Tables 3, 4, and 5. __S2.__ As far as I can tell, MFCs are an original and promising alternative to
Some clarity aspects could be addressed to strengthen the paper. In more detail, I currently see the following weaknesses: __W1.__ Section 3.1 on “Mesh-Free Convolutions” is relatively dense and challenging to follow. For instance, some notations typically used with parabolic PDEs, like indices denoting partial derivatives, should be more clearly introduced. More importantly, the connection between $u$, $v$, $\tilde{v}$, and $\mathcal{C}$ is not very clear and should be made more explicit. __W
1) The primary research questions in the paper are well-motivated. 2) The technical analysis seems to be sound, and the authors are able to propose interesting, non-obvious training algorithms from them (i.e., this goes far beyond "multiscale training is useful for learning convolutional kernels"). I also found the further step of mesh-free convolutions to be very interesting from a theoretical perspective.
My major problems with the paper pertain to the experimental results, and fall into two main categories: 1) The experiments done are on extremely small networks and datasets. There are very standard collections of computer vision experiments that could be used to show the benefits of the proposed multiscale-SGD training algorithms (see, for example, the benchmarks used for experiments in any landmark CV papers from the past few years, like CLIP or Segment Anything). I understand that compute ac
Taxonomy
Topics: Neural Networks and Applications
Methods: Convolution
