Leveraging Video Coding Knowledge for Deep Video Enhancement
Thong Bach, Thuong Nguyen Canh, Van-Quang Nguyen

TL;DR
This paper introduces a novel deep learning framework that leverages video coding knowledge, specifically the hierarchical structure and motion characteristics, to significantly improve compressed video quality.
Contribution
It proposes a new method that incorporates video coding insights into deep video enhancement, outperforming existing techniques in benchmarks.
Findings
Achieved higher quantitative metrics in NTIRE22 challenge.
Improved visual quality of compressed videos.
Enhanced state-of-the-art method with coding knowledge integration.
Abstract
Recent advancements in deep learning techniques have significantly improved the quality of compressed videos. However, previous approaches have not fully exploited the motion characteristics of compressed videos, such as the drastic change in motion between video contents and the hierarchical coding structure of the compressed video. This study proposes a novel framework that leverages the low-delay configuration of video compression to enhance the existing state-of-the-art method, BasicVSR++. We incorporate a context-adaptive video fusion method to enhance the final quality of compressed videos. The proposed approach has been evaluated in the NTIRE22 challenge, a benchmark for video restoration and enhancement, and achieved improvements in both quantitative metrics and visual quality compared to the previous method.
| Method \Video ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BasicVSR++ | 33.73 | 32.42 | 31.22 | 36.75 | 32.16 | 30.57 | 30.41 | 33.99 | 31.68 | 36.20 | 27.06 | 23.93 | 30.24 | 32.94 | 31.20 | 31.63 |
| Ours (OCL-VCE) | 33.81 | 32.54 | 31.30 | 36.82 | 32.28 | 30.59 | 30.47 | 34.07 | 31.75 | 36.28 | 27.08 | 24.00 | 30.37 | 33.01 | 31.25 | 31.71 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Image Processing Techniques · Image and Signal Denoising Methods · Video Coding and Compression Technologies
Leveraging Video Coding Knowledge for Deep Video Enhancement
Thong Bach
Thuong Nguyen Canh
Van-Quang Nguyen
Abstract
Recent advancements in deep learning techniques have significantly improved the quality of compressed videos. However, previous approaches have not fully exploited the motion characteristics of compressed videos, such as the drastic change in motion between video contents and the hierarchical coding structure of the compressed video. This study proposes a novel framework that leverages the low-delay configuration of video compression to enhance the existing state-of-the-art method, BasicVSR++. We incorporate a context-adaptive video fusion method to enhance the final quality of compressed videos. The proposed approach has been evaluated in the NTIRE22 challenge, a benchmark for video restoration and enhancement, and achieved improvements in both quantitative metrics and visual quality compared to the previous method.
1 Introduction
With the increasing demand for high-quality video transmission over the Internet, video compression has become essential to efficiently transmit videos over limited bandwidth. It has driven the development of video compression standards such as H.265/HEVC [1], and beyond. However, compressed videos suffer unavoidable compression artifacts. As a result, there is a growing interest in the research community in enhancing the quality of compressed videos.
Several studies have proposed methods to improve the quality of individual frames in videos [2, 3] as well as leveraging temporal information between frames [4, 5, 6, 7, 8, 9]. Most existing methods focus on the architecture design of the model, which typically involves (i) designing the backbone extraction module using CNNs or Transformers, (ii) designing the propagation module to effectively capture information flow between frames, and (iii) designing the enhancement module as a post-processing step to improve the quality of the output video. However, there is often a little emphasis on incorporating prior knowledge from video content such as motion information as well as the compression algorithm. This represents an untapped potential for improving the overall quality of the video compression process.
In this study, we propose several methods to enhance the performance of BasicVSR++, a state-of-the-art video super-resolution method [10]. Our approach begins by examining BasicVSR++’s performance with varying numbers of input frames, taking into account the motion information of the content. As compressed video uses the HEVC low-delay configuration, the first frame (also known as the Intra frame) has significantly higher quality than the others. To take advantage of this, we train a separate network called Intra frame BasicVSR++ to improve the quality of the first frame. Finally, we introduce an adaptive mechanism that combines multiple reconstructed instances with different input sequence lengths to obtain the final enhanced output.
The experiments demonstrate that the proposed framework not only leverages the low-delay configuration of video compression but also incorporates context-adaptive video fusion to enhance the final quality of compressed videos. These results demonstrate the potential of incorporating domain-specific knowledge into deep learning models for advancing the state-of-the-art in compressed video quality enhancement.
2 Performance Analysis of BasicVSR++
BasicVSR++ [10] is a state-of-the-art video super-resolution method that enhances video quality through a combination of frame propagation and alignment techniques. While BasicVSR++ has shown impressive results in enhancing video quality, it does not take into account the unique characteristics of compressed video.
In a compressed video [1], a frame can be an intra or inter frame. Intra frame compression only uses information from the current image, while inter-frame compression utilizes information from previously encoded frames to reduce redundancy. The NTIRE22 challenge encoded video using a low-delay configuration [1], as shown in Figure 1, with a group of pictures of 4. This results in the compressed video having only one intra frame at a significantly higher quality than other inter-frames. As a result, the quality of frames in the compressed video is highly varied, but BasicVSR++ does not take this into account. In practice, the entire video is fed as input to the model during testing, which may not be the optimal choice.
We investigate the effect of varying input video frame lengths on the performance of BasicVSR++. As in Fig. 2-a, the pre-trained network’s performance varies significantly depending on the frame length, with shorter frame length demonstrating higher performance for the first few frames. Interestingly, we also noticed that this performance phenomenon did not occur when the input of the trimmed video started from the 32nd frame, i.e., when the intra frame is not included. This finding suggests that the network is better able to exploit the high performance of the intra frame with a smaller frame length. For later frames, our experiments demonstrated that using the full frame length performed better thanks to the temporal dependency in both backward and forward directions. In contrast, trimmed video inputs have limited backward and forward dependency, resulting in lower performance for later frames.
This observation suggests that the optimal choice of input frame length may vary depending on the temporal characteristics of the video content. The use of shorter frame lengths may be more effective for the early frames, while longer frame lengths may be more suitable for later frames that rely on both backward and forward dependencies. Additionally, the effectiveness of the backward and forward dependencies may be limited to a certain temporal range, as seen in Video 208 at shift 32. Examining 40 test sequences 201-240 in the NTIRE22 dataset, we observe that the first 64 frames of enhanced outputs with the video inputs trimmed at the length of 186 yields the best performance.
3 Proposed Method
3.1 Intra frame BasicVSR++
To further leverage the superior quality of the intra frame, we introduce a new network called Intra rame BasicVSR++. In order to do so, we created a new intra frame video dataset by cutting the original videos into multiple non-overlapping 30-frame segments and encoding them using HEVC with the same low-delay configuration as the NTIRE22 dataset. The resulting dataset contains videos with only one intra frame and multiple inter-frames. We then divided the compressed videos into training and testing datasets for network training. By utilizing this configuration, we ensured that the intra frame is of higher quality than the inter-frames, reflecting the reality of compressed videos.
During training, we fine-tuned the Intra frame BasicVSR++ network by using the intra frame as the first frame of each segment. This allowed the network to learn to enhance the high-quality intra frame more effectively, resulting in a more accurate and efficient network. In this way, the Intra frame BasicVSR++ network is designed to specifically target the improvement of the first frame’s quality, while BasicVSR++ improves the quality of the entire video.
As shown in Fig.3, we observed that the Intra frame BasicVSR++ network improves the performance of the intra frame in most cases, such as videos 201 and 202. However, this improvement is not universal and may not always hold for sequences with high frame rates and slow motion, such as video 226. This may be due to the limited amount of information in the intra frame in such cases, and the network may struggle to extract and propagate useful features. Nonetheless, these results suggest that the intra frame BasicVSR++ network can be an effective tool in improving the performance of compressed video enhancement for most types of content.
3.2 Adaptive Context-Aware Fusion
From our analysis, it is evident that BasicVSR++ benefits from adaptive input frame length and can be further improved by Intra frame BasicVSR++. However, it is also important to note that this improvement is not applicable to all types of video content. In order to address this issue, we propose a heuristic that separates cases where the frame rate is high and there is slow motion. This is achieved by comparing the gradient of the average frame with a given threshold. By using this threshold, we can determine if the video contains a high frame rate and slow motion, and take appropriate measures to optimize the final performance. Firstly, the average frame is obtained as follows:
[TABLE]
where is a scaling factor that is proportional to the input video frame rate. Specifically, we set for videos with a frame rate less than 30fps and for those with a frame rate greater than 30fps. Next, we compare the gradient of the average frame to a given threshold:
[TABLE]
where and denote the gradient in the horizontal and vertical directions of a given frame , and is a threshold of value 2300. The value of can be normalized based on the number of pixels, as shown in Fig. 4.
We propose a novel video fusion mechanism called Adaptive Context-Aware Fusion, as shown in Fig. 4. The method involves enhancing an input video using three different approaches: full BasicVSR++, short BasicVSR++ with the first 154 frames, and the first 122 frames by Intra frame BasicVSR++. Depending on the current content of the video, an adaptive fusion technique is performed to select either the first frame from short BasicVSR++ or Intra frame BasicVSR++. For the subsequent 63 frames, we select frames from the short BasicVSR++ set, and the remaining frames are extracted from the full BasicVSR++ set.
3.3 Loss Function
In order to fine-tune the BasicVSR++ and Intra frame BasicVSR++ networks, we used a weighted sum of three loss components: (i) Charbonnier loss [11], (ii) total variation (TV) loss, and (iii) temporal gradient (TG) loss. The temporal gradient loss captures the difference between the ground truth and the output temporal sequence by calculating the loss between two consecutive frames in each sequence. To calculate the temporal gradient loss, we subtract two consecutive frames in each sequence to obtain the temporal gradient sequence.
The hyperparameters of the loss weights were optimized using grid search to find the best values. The final loss function is given by:
[TABLE]
where is the Charbonnier loss, is the total variation loss, is the temporal gradient loss.
4 Experiments
4.1 Datasets
For the NTIRE 2022 Challenge, we used the original LDV dataset [12] that consists of 240 videos as our primary training set. To increase our training data, we also utilized the LDV 2.0 dataset, which contains an additional 90 videos. We split the LDV 2.0 videos into six sets, each containing 15 videos. Two of these sets were used as our validation and test sets, respectively. In splitting the videos, we aimed to maintain the diversity of the videos in each set, in terms of content, frame rate, and other factors, as similar as possible. All videos in the LDV and LDV 2.0 datasets, as well as the splits for the NTIRE 2021 and NTIRE 2022 Challenges, are publicly available at https://github.com/RenYang-home/LDV_dataset.
4.2 Training Details
We employed the Adam optimizer [13] with a learning rate of and utilized the Cosine Restart scheduler [14] with a period of 10,000 iterations. To ensure a stable optimization process, we linearly increased the learning rate for the first 10% of iterations.
4.2.1 Fine-tuning BasicVSR++
Due to computational limitations, we fine-tuned only the upsample layer of the pre-trained BasicVSR++ network. We found that increasing the input frame length from 30 to 60 frames led to a 0.03 dB improvement in model performance. The network was fine-tuned for 50,000 iterations.
4.2.2 Fine-tuning Intra frame BasicVSR++
We create a new dataset is created by trimming the original videos into multiple segments, each of which consists of 30 frames without overlapping frames. The video segments are encoded using HEVC at a low-delay profile resulting in approximately 16,000 training samples. For the Intra frame BasicVSR++ network, only the segments from the first 200 videos are used to train the model, and the last 40 videos are used for testing.
4.3 Ensembling with Test Time Augmentation
In our study, we perform Ensembling with test time augmentation (TTA) by generating eight input variations by flipping and rotating the input sequences in the spatial dimension. We then use our proposed framework shown in Fig. 4 to enhance each variation, and post-process the corresponding output by flipping and rotating it back to its original location. Finally, we obtain the final output by averaging the outputs of all variations. This approach helps to reduce the impact of input variability and improve the overall performance of the model.
4.4 Experimental Results
We participated in the NTIRE22 challenge as team OCL-VCE and submitted our results on the test set. In Table 1, we present the performance of our method and BasicVSR++ on the test set. Our method achieved a higher PSNR score of 31.71 dB compared to 31.63 dB obtained by BasicVSR++.
We further evaluated our framework on the validation set, which consists of 10 videos. Using the pre-trained BasicVSR++ and Intra frame BasicVSR++ models and applying test-time augmentation (TTA), our framework achieved a PSNR score of 32.12 dB, surpassing the 31.84 dB score achieved without TTA.
In addition, without finetuning, with and without TTA, our framework achieved a PSNR score of 32.02 dB and 31.86 dB, respectively, improving the baseline by 0.06 dB and 0.02 dB, respectively. Our results demonstrate the effectiveness of our proposed Adaptive Context-Aware Fusion framework in enhancing the quality of low-delay compressed videos.
5 Conclusion
In conclusion, this paper proposes a novel method that leverages the unique characteristics of low-delay video compression algorithms to improve the quality of compressed videos using deep learning techniques. By incorporating this prior knowledge into the state-of-the-art method, BasicVSR++, we achieve a significant improvement in performance over existing methods. Our experimental results on the NTIRE22 challenge validate the effectiveness of our proposed method. This work underscores the importance of incorporating video compression knowledge into deep learning models to further enhance their performance and enable real-world applications.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (hevc) standard,” IEEE Transactions on Circuits and Systems for Video Technology , vol. 22, no. 12, pp. 1649–1668, 2012.
- 2[2] R. Yang, M. Xu, T. Liu, Z. Wang, and Z. Guan, “Enhancing quality for hevc compressed videos,” IEEE Transactions on Circuits and Systems for Video Technology , vol. 29, no. 7, pp. 2039–2054, 2018.
- 3[3] T. Wang, M. Chen, and H. Chao, “A novel deep learning-based method of improving coding efficiency from the decoder-end for hevc,” in 2017 Data Compression Conference (DCC) . IEEE, 2017, pp. 410–419.
- 4[4] R. Yang, X. Sun, M. Xu, and W. Zeng, “Quality-gated convolutional lstm for enhancing compressed video,” in 2019 IEEE International Conference on Multimedia and Expo (ICME) . IEEE, 2019, pp. 532–537.
- 5[5] Z. Guan, Q. Xing, M. Xu, R. Yang, T. Liu, and Z. Wang, “Mfqe 2.0: A new approach for multi-frame quality enhancement on compressed video,” IEEE transactions on pattern analysis and machine intelligence , vol. 43, no. 3, pp. 949–963, 2019.
- 6[6] J. Deng, L. Wang, S. Pu, and C. Zhuo, “Spatio-temporal deformable convolution for compressed video quality enhancement,” in Proceedings of the AAAI conference on artificial intelligence , vol. 34, no. 07, 2020, pp. 10 696–10 703.
- 7[7] R. Yang, F. Mentzer, L. V. Gool, and R. Timofte, “Learning for video compression with hierarchical quality and recurrent enhancement,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2020, pp. 6628–6637.
- 8[8] J. Wang, X. Deng, M. Xu, C. Chen, and Y. Song, “Multi-level wavelet-based generative adversarial network for perceptual quality enhancement of compressed video,” in European Conference on Computer Vision . Springer, 2020, pp. 405–421.
