Multiscale Training of Convolutional Neural Networks

Shadab Ahamed; Niloufar Zakariaei; Eldad Haber; Moshe Eliasof

arXiv:2501.12739·cs.LG·March 3, 2026·2 cites

Open Access · 3 Reviews

TL;DR

This paper introduces a multiscale training method for CNNs that cuts the computational cost of training on high-resolution images by combining a multilevel gradient estimator with a coarse-to-fine training schedule, without significant loss of accuracy.

Contribution

The authors propose Multiscale Gradient Estimation (MGE) and a Full-Multiscale training algorithm that together enable efficient CNN training on high-resolution data, reducing computation by up to 16 times.
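The coarse-to-fine idea can be illustrated with a toy problem. The sketch below is not the authors' algorithm: it assumes a one-parameter model, a 1D signal in place of an image, pairwise averaging as the coarsening operator, and plain gradient descent. The names `coarsen`, `loss_grad`, `full_multiscale`, and `iters_per_level` are hypothetical. What it shows is the hot-start pattern: most iterations run on the cheapest mesh, and each finer mesh starts from the parameter found on the previous one.

```python
def coarsen(signal):
    """Halve a 1D signal's resolution by averaging neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]


def loss_grad(theta, x, y):
    """Gradient w.r.t. theta of the mean of 0.5 * (theta * x_i - y_i)**2."""
    return sum(a * (theta * a - b) for a, b in zip(x, y)) / len(x)


def full_multiscale(x, y, iters_per_level, lr=0.1, theta=0.0):
    """Train coarsest-first, carrying theta over to each finer mesh (hot start).

    iters_per_level[0] is the iteration count on the coarsest mesh and
    iters_per_level[-1] the count on the finest (original) mesh.
    """
    # Build the mesh pyramid, finest first, then reverse it to coarsest first.
    pyramid = [(x, y)]
    for _ in range(len(iters_per_level) - 1):
        px, py = pyramid[-1]
        pyramid.append((coarsen(px), coarsen(py)))
    pyramid.reverse()

    history = []
    for (px, py), iters in zip(pyramid, iters_per_level):
        for _ in range(iters):
            theta -= lr * loss_grad(theta, px, py)
        history.append((len(px), theta))  # (mesh size, theta after this level)
    return theta, history


# Toy data: y = 2 * x, so the best theta is 2 on every mesh.
x = [float(i % 4 + 1) for i in range(16)]
y = [2.0 * v for v in x]
theta, history = full_multiscale(x, y, iters_per_level=[20, 5, 2])
for mesh_size, value in history:
    print(mesh_size, round(value, 6))
```

The scalar `theta` stands in for convolution kernels, whose shape does not depend on the input resolution; that is what allows a parameter learned on a coarse mesh to be reused unchanged on a finer one.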

Findings

01 Reduces CNN training costs by 4-16x on high-resolution images.

02 Maintains performance with no significant accuracy loss.

03 Applicable across various CNN architectures and tasks.
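A back-of-the-envelope check on the 4-16x figure, under two assumptions that are not stated on this page: the cost of a convolution is proportional to the number of pixels, and each downsampling halves both spatial dimensions of a 2D image. The helper `relative_conv_cost` is hypothetical.

```python
def relative_conv_cost(level, dims=2, factor=2):
    """Cost of a convolution on a mesh coarsened `level` times, relative to the finest mesh.

    Assumes cost is proportional to pixel count and that each coarsening
    divides every spatial dimension by `factor`.
    """
    return 1.0 / factor ** (dims * level)


for level in range(3):
    print(level, relative_conv_cost(level))
```

Under these assumptions one downsampling gives a quarter of the fine-mesh cost and two give a sixteenth, which is consistent with the factor of 4 per downsampling quoted in the abstract.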

Abstract

Training convolutional neural networks (CNNs) on high-resolution images is often bottlenecked by the cost of evaluating gradients of the loss on the finest spatial mesh. To address this, we propose Multiscale Gradient Estimation (MGE), a Multilevel Monte Carlo-inspired estimator that expresses the expected gradient on the finest mesh as a telescopic sum of gradients computed on progressively coarser meshes. By assigning larger batches to the cheaper coarse levels, MGE achieves the same variance as single-scale stochastic gradient estimation while reducing the number of fine mesh convolutions by a factor of 4 with each downsampling. We further embed MGE within a Full-Multiscale training algorithm that solves the learning problem on coarse meshes first and "hot-starts" the next finer level, cutting the required fine mesh iterations by an additional order of magnitude. Extensive…
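The telescopic-sum construction can be sketched on a toy problem. This is a generic Multilevel Monte Carlo-style estimator, not the paper's implementation: the one-parameter model, the 1D signals, the pairwise-averaging restriction, the batch sizes, and the names `coarsen`, `restrict`, `grad`, `mge`, and `make_sample` are all assumptions made for illustration. The estimate is the coarsest-mesh gradient on a large batch plus correction terms, each being the difference between gradients on two adjacent meshes evaluated on the same, smaller batch.

```python
import random


def coarsen(signal):
    """Halve a 1D signal's resolution by averaging neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]


def restrict(signal, times):
    """Apply `coarsen` repeatedly; times=0 returns the finest mesh."""
    for _ in range(times):
        signal = coarsen(signal)
    return signal


def grad(theta, batch, times):
    """Batch-mean gradient of 0.5*(theta*x - y)^2 on a mesh coarsened `times` times."""
    total = 0.0
    for x, y in batch:
        xc, yc = restrict(x, times), restrict(y, times)
        total += sum(a * (theta * a - b) for a, b in zip(xc, yc)) / len(xc)
    return total / len(batch)


def mge(theta, batches):
    """Telescoping estimate of the finest-mesh gradient.

    batches[0] feeds the coarsest-mesh term; batches[k] for k >= 1 feeds the
    correction between two adjacent meshes, ending at the finest one.
    """
    coarsest = len(batches) - 1
    estimate = grad(theta, batches[0], coarsest)
    for k in range(1, len(batches)):
        finer = coarsest - k
        # Same batch on both meshes, so only the mesh difference is sampled.
        estimate += grad(theta, batches[k], finer) - grad(theta, batches[k], finer + 1)
    return estimate


def make_sample(rng, n=16):
    """Hypothetical data: a noisy linear relation y = 2x on a length-n mesh."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [2.0 * v + rng.gauss(0.0, 0.1) for v in x]
    return x, y


rng = random.Random(0)
# Larger batches on cheaper (coarser) terms, smaller ones on the fine correction.
batches = [[make_sample(rng) for _ in range(size)] for size in (64, 16, 4)]
estimate = mge(0.5, batches)
print(estimate)
```

When every term is given the same batch, the sum collapses exactly to the finest-mesh gradient on that batch, which is a convenient sanity check. With independent batches the estimate is unbiased for the expected fine-mesh gradient, while only the last and smallest batch is ever evaluated on the finest mesh.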

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

This paper introduces an innovative approach to multiscale CNN training through Multiscale-SGD and Mesh-Free Convolutions (MFCs). All the developments are backed by mathematical reasoning, and the whole paper is logically organized.

Weaknesses

1. The experimental results are really limited, and you should also compare performance under a fixed computational budget.

2. The experimental comparison with existing multiscale or Fourier-based CNN methods is not presented.

3. The mathematical foundation for Mesh-Free Convolutions, particularly the differential operator theory, could be challenging for readers less familiar with this domain. Adding a visual explanation or intuitive analogies could make the theory more accessible.

4. In th

Reviewer 02 · Rating 8 · Confidence 3

Strengths

I believe this paper provides several important contributions and extends the existing literature in different significant ways:

__S1.__ The authors propose new techniques for training CNNs within a multiscale resolution framework, which improve computational efficiency while maintaining test performance. This alone can be a valuable contribution, as shown by the “non-MFC” empirical results in Tables 3, 4, and 5.

__S2.__ As far as I can tell, MFCs are an original and promising alternative to

Weaknesses

Some clarity aspects could be addressed to strengthen the paper. In more detail, I currently see the following weaknesses:

__W1.__ Section 3.1 on “Mesh-Free Convolutions” is relatively dense and challenging to follow. For instance, some notations typically used with parabolic PDEs, like indices denoting partial derivatives, should be more clearly introduced. More importantly, the connection between $u$, $v$, $\tilde{v}$, and $\mathcal{C}$ is not very clear and should be made more explicit.

__W

Reviewer 03 · Rating 3 · Confidence 3

Strengths

1) The primary research questions in the paper are well-motivated. 2) The technical analysis seems to be sound, and the authors are able to propose interesting, non-obvious training algorithms from them (i.e., this goes far beyond "multiscale training is useful for learning convolutional kernels"). I also found the further step of mesh-free convolutions to be very interesting from a theoretical perspective.

Weaknesses

My major problems with the paper pertain to the experimental results, and fall into two main categories: 1) The experiments done are on extremely small networks and datasets. There are very standard collections of computer vision experiments that could be used to show the benefits of the proposed multiscale-SGD training algorithms (see, for example, the benchmarks used for experiments in any landmark CV papers from the past few years, like CLIP or Segment Anything). I understand that compute ac

Taxonomy

Topics: Neural Networks and Applications

Methods: Convolution