Effect of Architectures and Training Methods on the Performance of Learned Video Frame Prediction
M. Akin Yilmaz, A. Murat Tekalp

TL;DR
This paper compares different neural network architectures and training methods for video frame prediction, highlighting the trade-offs between accuracy and computational efficiency.
Contribution
It provides a comprehensive analysis of feedforward and recurrent architectures, introducing effective training strategies and evaluating their performance.
Findings
The residual FCNN achieves the highest PSNR, at the cost of higher training and inference complexity.
The CRNN can be trained stably and efficiently with stateful truncated backpropagation through time (BPTT) and offers near real-time inference.
Recurrent networks can be trained stably and are more computationally efficient during inference.
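The difference between stateless and stateful handling of a recurrent network can be illustrated with a minimal sketch. It assumes the common reading of "stateful": the hidden state at the end of one truncated chunk is carried into the next chunk rather than reset to zero. The scalar `tanh` recurrence, the weights `w_x` and `w_h`, and all function names are hypothetical stand-ins, not the paper's CRNN. Only the forward state handling is shown; in an autograd framework the carried state would additionally be cut from the gradient graph at each chunk boundary.

```python
import math


def run_chunk(frames, h, w_x=0.5, w_h=0.9):
    """Run a toy scalar recurrence over one chunk (hypothetical stand-in for an RNN cell)."""
    outputs = []
    for x in frames:
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    return outputs, h


def split_into_chunks(sequence, chunk_len):
    """Cut a long sequence into consecutive truncated chunks."""
    return [sequence[i:i + chunk_len] for i in range(0, len(sequence), chunk_len)]


def forward(sequence, chunk_len, stateful):
    """Process a sequence chunk by chunk, carrying or resetting the hidden state."""
    h = 0.0
    outputs = []
    for chunk in split_into_chunks(sequence, chunk_len):
        if not stateful:
            h = 0.0  # stateless: every chunk starts from a zero state
        out, h = run_chunk(chunk, h)
        # Truncation point: with autograd, gradients would stop here,
        # while the numerical value of h is still passed on when stateful.
        outputs.extend(out)
    return outputs


sequence = [0.1 * t for t in range(8)]
full = forward(sequence, chunk_len=8, stateful=True)       # one untruncated pass
stateful = forward(sequence, chunk_len=4, stateful=True)   # chunks of 4, state carried
stateless = forward(sequence, chunk_len=4, stateful=False) # chunks of 4, state reset
```

In this toy setting the stateful chunked pass reproduces the untruncated forward pass, while the stateless pass diverges from it at the first frame of the second chunk, because the context from earlier frames has been discarded.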
Abstract
We analyze the performance of feedforward vs. recurrent neural network (RNN) architectures and associated training methods for learned frame prediction. To this effect, we trained a residual fully convolutional neural network (FCNN), a convolutional RNN (CRNN), and a convolutional long short-term memory (CLSTM) network for next frame prediction using the mean squared error loss. We performed both stateless and stateful training for recurrent networks. Experimental results show that the residual FCNN architecture performs the best in terms of peak signal-to-noise ratio (PSNR) at the expense of higher training and test (inference) computational complexity. The CRNN can be trained stably and very efficiently using the stateful truncated backpropagation through time procedure, and it requires an order of magnitude less inference runtime to achieve near real-time frame prediction with an acceptable…
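For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how PSNR relates to the mean squared error between a predicted frame and the ground-truth frame. It assumes 8-bit pixel values (peak value 255) and frames flattened to lists of numbers; the function names and the `max_value` parameter are illustrative and are not taken from the paper.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized, flattened frames."""
    if len(pred) != len(target) or not pred:
        raise ValueError("frames must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value=255 assumes 8-bit pixels."""
    err = mse(pred, target)
    if err == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / err)


# Hypothetical two-pixel frames, used only to exercise the functions.
predicted = [10.0, 20.0]
actual = [12.0, 16.0]
frame_mse = mse(predicted, actual)
frame_psnr = psnr(predicted, actual)
```

Because PSNR is a monotonically decreasing function of MSE, a network trained to minimize the squared error is directly optimizing the metric used for comparison.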
