Less is More: Efficient Weight Farcasting with 1-Layer Neural Network
Xiao Shou, Debarun Bhattacharjya, Yanna Ding, Chen Zhao, Rui Li, and Jianxi Gao

TL;DR
This paper presents a novel, efficient weight forecasting framework for deep neural networks that uses only initial and final weights, improving accuracy and reducing computational costs.
Contribution
The study introduces a new long-term time series forecasting approach for weights, along with a tailored regularizer, diverging from traditional training efficiency techniques.
Findings
Outperforms existing methods in forecasting accuracy
Reduces computational overhead during training
Effective on synthetic weight sequences and real-world models such as DistilBERT
Abstract
Addressing the computational challenges inherent in training large-scale deep neural networks remains a critical endeavor in contemporary machine learning research. While previous efforts have focused on enhancing training efficiency through techniques such as gradient descent with momentum, learning rate scheduling, and weight regularization, the demand for further innovation continues to burgeon as model sizes keep expanding. In this study, we introduce a novel framework which diverges from conventional approaches by leveraging long-term time series forecasting techniques. Our method capitalizes solely on initial and final weight values, offering a streamlined alternative for complex model architectures. We also introduce a novel regularizer that is tailored to enhance the forecasting performance of our approach. Empirical evaluations conducted on synthetic weight sequences and…
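To make the idea of forecasting final weights from initial ones concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the toy gradient-descent trajectories, the choice of the first k = 2 steps as the "initial" weights, the least-squares fit, and the helper names (simulate_trajectories, fit_one_layer, farcast) are all assumptions made for illustration. The paper's tailored regularizer is not described in the text above, so it is omitted here.

```python
import numpy as np


def simulate_trajectories(n_params, n_steps, lr=0.05, curvature=1.0, seed=0):
    """Toy weight sequences: gradient descent on independent 1-D quadratics
    0.5 * curvature * (w - w_star)**2. This is an assumed stand-in for real
    training runs, not the paper's data."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=n_params)   # hypothetical optimum per parameter
    w = rng.normal(size=n_params)        # random initial weights
    traj = np.empty((n_steps + 1, n_params))
    traj[0] = w
    for t in range(n_steps):
        w = w - lr * curvature * (w - w_star)
        traj[t + 1] = w
    return traj                          # shape: (n_steps + 1, n_params)


def _with_bias(initial):
    """Append a constant column so the linear layer has a bias term."""
    return np.hstack([initial, np.ones((initial.shape[0], 1))])


def fit_one_layer(initial, final):
    """Fit final ~= initial @ coef + bias, i.e. a single linear layer,
    using ordinary least squares instead of gradient-based training."""
    theta, *_ = np.linalg.lstsq(_with_bias(initial), final, rcond=None)
    return theta


def farcast(theta, initial):
    """Predict far-future (final) weights from the initial weights."""
    return _with_bias(initial) @ theta


k, T = 2, 200                                    # assumed window and horizon
train = simulate_trajectories(500, T, seed=0)
test = simulate_trajectories(100, T, seed=1)

theta = fit_one_layer(train[:k].T, train[-1])    # inputs: first k steps
pred = farcast(theta, test[:k].T)
mse = float(np.mean((pred - test[-1]) ** 2))
print(f"held-out MSE: {mse:.3e}")
```

In this toy setting every trajectory decays geometrically toward its optimum at the same rate, so the final weight is an exact linear function of the first two steps and a single linear layer suffices. Real training dynamics are not this simple, which is presumably where the paper's regularizer and evaluation on models such as DistilBERT come in.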
Taxonomy
Topics: Vehicle License Plate Recognition
Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Attention Dropout · Softmax · Residual Connection · WordPiece · Linear Layer
