MFQE 2.0: A New Approach for Multi-frame Quality Enhancement on Compressed Video

Qunliang Xing; Zhenyu Guan; Mai Xu; Ren Yang; Tie Liu; Zulin Wang

arXiv:1902.09707·cs.CV·March 12, 2024

TL;DR

This paper introduces a novel multi-frame quality enhancement method for compressed videos that leverages frame similarity and deep learning to improve video quality, outperforming existing single-frame approaches.

Contribution

It proposes the first multi-frame approach using BiLSTM for PQF detection and a specialized CNN for quality enhancement, utilizing neighboring high-quality frames for better results.

Findings

01 Effective in enhancing compressed video quality
02 Outperforms state-of-the-art single-frame methods
03 Demonstrates strong generalization across videos

Abstract

The past few years have witnessed great success in applying deep learning to enhance the quality of compressed image/video. The existing approaches mainly focus on enhancing the quality of a single frame, not considering the similarity between consecutive frames. Since heavy fluctuation exists across compressed video frames as investigated in this paper, frame similarity can be utilized for quality enhancement of low-quality frames given their neighboring high-quality frames. This task is Multi-Frame Quality Enhancement (MFQE). Accordingly, this paper proposes an MFQE approach for compressed video, as the first attempt in this direction. In our approach, we firstly develop a Bidirectional Long Short-Term Memory (BiLSTM) based detector to locate Peak Quality Frames (PQFs) in compressed video. Then, a novel Multi-Frame Convolutional Neural Network (MF-CNN) is designed to enhance the…

Tables

Table 1. TABLE I: Averaged SD, PVD and PS values of our database.

PSNR (dB)
Metrics	MPEG-1	MPEG-2	MPEG-4	H.264	HEVC
SD	2.2175	2.2273	2.1261	1.6899	0.8788
PVD	1.1553	1.1665	1.0842	0.4732	1.1734
SSIM
SD	0.0717	0.0726	0.0735	0.0552	0.0105
PVD	0.0387	0.0391	0.0298	0.0102	0.0132
Separation (frames)
PS	5.3646	5.4713	5.4123	2.0529	2.6641

Table 2. TABLE II: Convolutional layers for pixel-wise motion estimation.

Layers	Conv 1	Conv 2	Conv 3	Conv 4	Conv 5
Filter size	$3 \times 3$	$3 \times 3$	$3 \times 3$	$3 \times 3$	$3 \times 3$
Filter number	24	24	24	24	2
Stride	1	1	1	1	1
Function	PReLU	PReLU	PReLU	PReLU	Tanh

Table 3. TABLE III: Performance of our PQF detector on test sequences.

Approach	QP	Precision (%)	Recall (%)	$F_{1}$-score (%)
MFQE 2.0	22	100.0	95.9	97.8
	27	98.2	94.1	96.1
	32	100.0	84.3	90.7
	37	100.0	96.5	98.2
	42	100.0	97.3	98.6
MFQE 1.0	37	90.7	92.1	91.1
MFQE 1.0	42	94.0	90.9	92.2

Table 4. TABLE IV: Performance of our PQF detector on test sequences at QP = 37.

Sequence		Precision (%)	Recall (%)	$F_{1}$-score (%)
A	Traffic	100.0	97.4	98.7
A	PeopleOnStreet	100.0	97.4	98.7
B	Kimono	100.0	98.4	99.2
	ParkScene	100.0	98.4	99.2
	Cactus	100.0	99.2	99.6
	BQTerrace	100.0	96.2	98.0
	BasketballDrive	100.0	97.4	98.7
C	RaceHorses	100.0	93.8	96.8
	BQMall	100.0	98.7	99.3
	PartyScene	100.0	98.4	99.2
	BasketballDrill	100.0	91.9	95.8
D	RaceHorses	100.0	94.9	97.4
	BQSquare	100.0	86.2	92.6
	BlowingBubbles	100.0	98.4	99.2
	BasketballPass	100.0	94.0	96.9
E	FourPeople	100.0	99.3	99.7
	Johnny	100.0	98.0	99.0
	KristenAndSara	100.0	99.3	99.7
Average		100.0	96.5	98.2

Table 5. TABLE V: Overall comparison for $\Delta$PSNR (dB) and $\Delta$SSIM ($\times 10^{-4}$) over test sequences at five QPs.

QP	Approach		AR-CNN [17]*		DnCNN [20]		Li et al. [21]		DCAD [35]		DS-CNN [25]		MFQE 1.0		MFQE 2.0
37	Metrics		PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
	A	Traffic	0.239	47	0.238	57	0.293	60	0.308	67	0.286	60	0.497	90	0.585	102
	A	PeopleOnStreet	0.346	75	0.414	82	0.481	92	0.500	95	0.416	85	0.802	137	0.920	157
	B	Kimono	0.219	65	0.244	75	0.279	78	0.276	78	0.249	75	0.495	113	0.550	118
		ParkScene	0.136	38	0.141	50	0.150	48	0.160	50	0.153	50	0.391	103	0.457	123
		Cactus	0.190	38	0.195	48	0.232	58	0.263	58	0.239	58	0.439	88	0.501	100
		BQTerrace	0.195	28	0.201	38	0.249	48	0.279	50	0.257	48	0.270	48	0.403	67
		BasketballDrive	0.229	55	0.251	58	0.296	68	0.305	68	0.282	65	0.406	80	0.465	83
	C	RaceHorses	0.219	43	0.253	65	0.276	65	0.282	65	0.267	63	0.340	55	0.394	80
		BQMall	0.275	68	0.281	68	0.325	88	0.340	88	0.330	80	0.507	103	0.618	120
		PartyScene	0.107	38	0.131	48	0.131	45	0.164	48	0.174	58	0.217	73	0.363	118
		BasketballDrill	0.247	58	0.331	68	0.376	88	0.386	78	0.352	68	0.477	90	0.579	120
	D	RaceHorses	0.268	55	0.311	73	0.328	83	0.338	83	0.318	75	0.507	113	0.594	143
		BQSquare	0.080	8	0.129	18	0.086	25	0.197	38	0.201	38	-0.010	15	0.337	65
		BlowingBubbles	0.164	35	0.184	58	0.207	68	0.215	65	0.228	68	0.386	120	0.533	170
		BasketballPass	0.259	58	0.307	75	0.343	85	0.352	85	0.335	78	0.628	138	0.728	155
	E	FourPeople	0.373	50	0.388	60	0.449	70	0.506	78	0.459	70	0.664	85	0.734	95
		Johnny	0.247	10	0.315	40	0.398	60	0.410	50	0.378	40	0.548	55	0.604	68
		KristenAndSara	0.409	50	0.421	60	0.485	68	0.524	70	0.481	60	0.655	75	0.754	85
	Average		0.233	45	0.263	58	0.299	66	0.322	67	0.300	63	0.455	88	0.562	109
42	Average		0.285	96	0.221	77	0.318	105	0.324	109	0.310	101	0.444	130	0.589	165
32	Average		0.176	19	0.256	35	0.275	37	0.316	44	0.273	38	0.431	58	0.516	68
27	Average		0.177	14	0.272	24	0.295	28	0.316	30	0.267	23	0.399	34	0.486	42
22	Average		0.142	8	0.287	18	0.300	19	0.313	19	0.254	15	0.307	19	0.458	27

Table 6. TABLE VI: Overall BD-BR reduction (%) of test sequences with the HEVC baseline as an anchor. Calculated at QP = 22, 27, 32, 37 and 42.

Sequence		AR-CNN	DnCNN	Li et al.	DCAD	DS-CNN	MFQE 1.0	MFQE 2.0
A	Traffic	7.40	8.54	10.08	9.97	9.18	14.56	16.98
A	PeopleOnStreet	6.99	8.28	9.64	9.68	8.67	13.71	15.08
B	Kimono	6.07	7.33	8.51	8.44	7.81	12.60	13.34
	ParkScene	4.47	5.04	5.35	5.68	5.42	12.04	13.66
	Cactus	6.16	6.80	8.23	8.69	8.78	12.78	14.84
	BQTerrace	6.86	7.62	8.79	9.98	8.67	10.95	14.72
	BasketballDrive	5.83	7.33	8.61	8.94	7.89	10.54	11.85
C	RaceHorses	5.07	6.77	7.10	7.62	7.48	8.83	9.61
	BQMall	5.60	7.01	7.79	8.65	7.64	11.11	13.50
	PartyScene	1.88	4.02	3.78	4.88	4.08	6.67	11.28
	BasketballDrill	4.67	8.02	8.66	9.80	8.22	10.47	12.63
D	RaceHorses	5.61	7.22	7.68	8.16	7.35	10.41	11.55
	BQSquare	0.68	4.59	3.59	6.11	3.94	2.72	11.00
	BlowingBubbles	3.19	5.10	5.41	6.13	5.55	10.73	15.20
	BasketballPass	5.11	7.03	7.78	8.35	7.49	11.70	13.43
E	FourPeople	8.42	10.12	11.46	12.21	11.13	14.89	17.50
	Johnny	7.66	10.91	13.05	13.71	12.19	15.94	18.57
	KristenAndSara	8.94	10.65	12.04	12.93	11.49	15.06	18.34
Average		5.59	7.36	8.20	8.89	7.85	11.41	14.06

Table 7. TABLE VII: Test speed (fps) and parameters.

MFQE		Test speed					Parameters
MFQE		WQXGA	1080p	480p	240p	720p	Parameters
1.0	DS-CNN¹	0.57	1.12	5.92	19.38	2.54	1,344,449
1.0	MF-CNN²	0.36	0.73	3.83	12.55	1.63	1,787,547
2.0	MF-CNN³	0.79	1.61	8.35	25.29	3.66	255,422

Table 8. TABLE VIII: Overall $\Delta$PSNR (dB) of 10 test sequences at QP = 37.

Seq.	AR-CNN	DnCNN	Li et al.	DCAD	DS-CNN	MFQE 1.0	MFQE 2.0
1	0.280	0.359	0.459	0.510	0.415	0.655	0.775
2	0.266	0.303	0.387	0.399	0.339	0.492	0.579
3	0.315	0.365	0.422	0.439	0.394	0.629	0.735
4	0.321	0.312	0.401	0.421	0.388	0.599	0.719
5	0.237	0.229	0.287	0.311	0.290	0.414	0.476
6	0.261	0.312	0.392	0.373	0.343	0.659	0.723
7	0.346	0.414	0.482	0.481	0.465	0.772	0.920
8	0.219	0.244	0.187	0.279	0.280	0.472	0.550
9	0.267	0.311	0.328	0.317	0.358	0.394	0.594
10	0.259	0.307	0.343	0.332	0.375	0.484	0.728
Ave.	0.277	0.316	0.369	0.386	0.365	0.557	0.680
1: TunnelFlag 2: BarScene 3: Vidyo1 4: Vidyo3 5: Vidyo4 6: MaD
7: PeopleOnStreet 8: Kimono 9: RaceHorses 10: BasketballPass

Equations

$$\{l_{n+i}\}_{i=0}^{j}=1 \ \text{and}\ l_{n-1}=l_{n+j+1}=0,\quad j\geq 1,$$

$$l_{n+i}=0,\ \text{where}\ i\neq \arg\max_{0\leq k\leq j}(p_{n+k}),$$

$$\{l_{n+i}\}_{i=0}^{d}=0 \ \text{and}\ l_{n-1}=l_{n+d+1}=1,\quad d>D,$$

$$l_{n+i}=1,\ \text{where}\ i=\arg\max_{0<k<d}(p_{n+k}).$$

$$F^{\prime}_{p}(x,y)=\mathcal{I}\{F_{p}(x+\mathbf{M}_{x}(x,y),\ y+\mathbf{M}_{y}(x,y))\},$$

$$L_{\text{MC}}(\theta_{mc})=\big\|F^{\prime R}_{p}(\theta_{mc})-F^{R}_{np}\big\|_{2}^{2},$$

$$\begin{aligned}
x_{11}&=H_{11}([x_{10}]),\\
x_{12}&=H_{12}([x_{10},x_{11}]),\\
x_{13}&=H_{13}([x_{10},x_{11},x_{12}]),\\
x_{14}&=H_{14}([x_{10},x_{11},x_{12},x_{13}]),
\end{aligned}$$

$$F_{en}=F_{np}+R_{np}(\theta_{qe}),$$

$$L_{\text{MF}}(\theta_{mc},\theta_{qe})=a\cdot\underbrace{\sum_{i=1}^{2}\big\|F^{\prime R}_{pi}(\theta_{mc})-F^{R}_{np}\big\|_{2}^{2}}_{L_{\text{MC}}:\ \text{loss of MC-subnet}}+b\cdot\underbrace{\big\|\big(F_{np}+R_{np}(\theta_{qe})\big)-F^{R}_{np}\big\|_{2}^{2}}_{L_{\text{QE}}:\ \text{loss of QE-subnet}}.$$

Code & Models

Repositories

RyanXingQL/MFQEv2.0
tfOfficial

Datasets

ryanxingql/MFQEv2
dataset· 21 dl

Full text

MFQE 2.0: A New Approach for Multi-frame Quality Enhancement on Compressed Video

Qunliang Xing*, Zhenyu Guan*, Mai Xu, Ren Yang, Tie Liu and Zulin Wang

Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence. DOI: 10.1109/TPAMI.2019.2944806. Q. Xing and Z. Guan contribute equally to this paper. Corresponding author: Mai Xu.

Abstract

The past few years have witnessed great success in applying deep learning to enhance the quality of compressed image/video. The existing approaches mainly focus on enhancing the quality of a single frame, not considering the similarity between consecutive frames. Since heavy fluctuation exists across compressed video frames as investigated in this paper, frame similarity can be utilized for quality enhancement of low-quality frames given their neighboring high-quality frames. This task is Multi-Frame Quality Enhancement (MFQE). Accordingly, this paper proposes an MFQE approach for compressed video, as the first attempt in this direction. In our approach, we firstly develop a Bidirectional Long Short-Term Memory (BiLSTM) based detector to locate Peak Quality Frames (PQFs) in compressed video. Then, a novel Multi-Frame Convolutional Neural Network (MF-CNN) is designed to enhance the quality of compressed video, in which the non-PQF and its nearest two PQFs are the input. In MF-CNN, motion between the non-PQF and PQFs is compensated by a motion compensation subnet. Subsequently, a quality enhancement subnet fuses the non-PQF and compensated PQFs, and then reduces the compression artifacts of the non-PQF. Also, PQF quality is enhanced in the same way. Finally, experiments validate the effectiveness and generalization ability of our MFQE approach in advancing the state-of-the-art quality enhancement of compressed video. The code is available at https://github.com/RyanXingQL/MFQEv2.0.git.

Index Terms:

Quality enhancement, compressed video, deep learning.

1 Introduction

During the past decades, there has been a considerable increase in the popularity of video over the Internet. According to the Cisco Data Traffic Forecast [1], video generated $60\%$ of Internet traffic in 2016, and this figure is predicted to reach $78\%$ by 2020. When transmitting video over the bandwidth-limited Internet, video compression has to be applied to significantly reduce the coding bit-rate. However, compressed video inevitably suffers from compression artifacts, which severely degrade the Quality of Experience (QoE) [2, 3, 4, 5, 6]. Besides, such artifacts may reduce the accuracy of classification and recognition tasks. It is verified in [7, 8, 9, 10] that compression quality enhancement can improve the performance of classification and recognition. Therefore, there is a pressing need to study quality enhancement for compressed video.

Recently, extensive works were conducted for enhancing the visual quality of compressed image and video [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]. For example, Dong et al. [17] designed a four-layer Convolutional Neural Network (CNN) [26], named AR-CNN, which considerably improves the quality of JPEG images. Then, Denoising CNN (DnCNN) [20], which applies a residual learning strategy, was proposed for image denoising, image super-resolution and JPEG quality enhancement. Later, Yang et al. [24, 25] designed a Decoder-side Scalable CNN (DS-CNN) for video quality enhancement. The DS-CNN structure is composed of two subnets, aiming at reducing intra- and inter-coding distortion, respectively. However, when processing a single frame, none of the existing quality enhancement approaches takes advantage of the information provided by neighboring frames, and thus their performance is severely limited. As Fig. 1 shows, the quality of compressed video dramatically fluctuates across frames. Therefore, it is possible to use the high-quality frames (i.e., Peak Quality Frames, called PQFs; a PQF is defined as the frame whose quality is higher than that of both its previous frame and its subsequent frame) to enhance the quality of their neighboring low-quality frames (non-PQFs). This can be seen as Multi-Frame Quality Enhancement (MFQE), similar to multi-frame super-resolution [27, 28, 29].

This paper proposes an MFQE approach for compressed video. Specifically, we find that large quality fluctuation exists across consecutive frames, for video sequences compressed by almost all compression standards. Thus, it is possible to improve the quality of a non-PQF with the help of its neighboring PQFs. To this end, we first train a Bidirectional Long Short-Term Memory (BiLSTM) based model as a no-reference method to detect PQFs. Then, a novel Multi-Frame CNN (MF-CNN) architecture is proposed for non-PQF quality enhancement, which takes both the current non-PQF and its adjacent PQFs as input. Our MF-CNN includes two components, i.e., Motion Compensation subnet (MC-subnet) and Quality Enhancement subnet (QE-subnet). The MC-subnet is developed to compensate motion between the current non-PQF and its adjacent PQFs. The QE-subnet, with a spatio-temporal architecture, is designed to extract and merge the features of the current non-PQF and the compensated PQFs. Finally, the quality of the current non-PQF can be enhanced by the QE-subnet, which takes advantage of the higher-quality information provided by its adjacent PQFs. For example, as shown in Fig. 1, the current non-PQF (frame 95) and its nearest two PQFs (frames 92 and 96) are all fed into MF-CNN in our MFQE approach. As a result, the low-quality content (basketball) in the non-PQF (frame 95) can be enhanced using essentially the same but qualitatively better content in the neighboring PQFs (frames 92 and 96). Moreover, Fig. 1 shows that our MFQE approach also mitigates the quality fluctuation, due to the considerable quality improvement of non-PQFs. Note that our MFQE approach is also used for reducing compression artifacts of PQFs, by using neighboring PQFs to enhance the quality of the currently processed PQF.

This work is an extended version of our conference paper [30] (called MFQE 1.0 in this paper) with additional works and substantial improvements, thus called MFQE 2.0 (called MFQE in this paper for simplicity). The extension is as follows. (1) We enlarge our database in [30] from 70 to 160 uncompressed videos. On this basis, more thorough analyses of the compressed video are conducted. (2) We develop a new PQF-detector, which is based on BiLSTM instead of the support vector machine (SVM) in [30]. Our new detector is capable of extracting both spatial and temporal information of PQFs, leading to a boost in $F_{1}$ -score of PQF detection from $91.1\%$ to $98.2\%$ . (3) We advance our QE-subnet by introducing the multi-scale strategy, batch normalization [31] and dense connection [32], rather than the conventional design of CNN in [30]. Besides, we develop a lightweight structure for the QE-subnet to accelerate the speed of video quality enhancement. Experiments show that the average Peak Signal-to-Noise Ratio (PSNR) improvement on 18 sequences selected by [33] largely increases from 0.455 dB to 0.562 dB (i.e., $23.5\%$ improvement), while the number of parameters substantially reduces from 1,787,547 to 255,422 (i.e., $85.7\%$ saving), resulting in at least 2 times acceleration of quality enhancement. (4) More extensive experiments are provided to validate the performance and generalization ability of our MFQE approach.

2 Related works

2.1 Related works on quality enhancement

Recently, extensive works [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] have focused on enhancing the visual quality of compressed image. Specifically, Foi et al. [12] applied point-wise Shape-Adaptive DCT (SA-DCT) to reduce the blocking and ringing effects caused by JPEG compression. Later, Jancsary et al. [14] proposed reducing JPEG image blocking effects by adopting Regression Tree Fields (RTF). Moreover, sparse coding was utilized to remove the JPEG artifacts, such as [15] and [16]. Recently, deep learning has also been successfully applied to improve the visual quality of compressed images. Particularly, Dong et al. [17] proposed a four-layer AR-CNN to reduce the JPEG artifacts of images. Afterward, $\text{{D}}^{3}$ [19] and Deep Dual-domain Convolutional Network (DDCN) [18] were proposed as advanced deep networks for the quality enhancement of JPEG image, utilizing the prior knowledge of JPEG compression. Later, DnCNN was proposed in [20] for several tasks of image restoration, including quality enhancement. Li et al. [21] proposed a 20-layer CNN for enhancing image quality. Most recently, the memory network (MemNet) [23] has been proposed for image restoration tasks, including quality enhancement. In the MemNet, the memory block was introduced to generate the long-term memory across CNN layers, which successfully compensates the middle- and high-frequency signals distorted during compression. It achieves the state-of-the-art quality enhancement performance for compressed images.

There are also some other works [34, 24, 35] proposed for the quality enhancement of compressed video. For example, the Variable-filter-size Residue-learning CNN (VRCNN) [34] was proposed to replace the in-loop filters for HEVC intra-coding. However, the CNN in [34] was designed as a component of the video encoder, so that it is not practical for already compressed video. Most recently, a Deep CNN-based Auto Decoder (DCAD), which contains 10 CNN layers, was proposed in [35] to reduce the distortion of compressed video. Moreover, Yang et al. [24] proposed the DS-CNN approach for video quality enhancement. In [24], DS-CNN-I and DS-CNN-B, as two subnetworks of DS-CNN, are used to reduce the artifacts of intra- and inter-coding, respectively. All the above approaches can be seen as single-frame quality enhancement approaches, as they do not take any advantage of neighboring frames with high similarity. Consequently, their performance on video quality enhancement is severely limited.

2.2 Related works on multi-frame super-resolution

To the best of our knowledge, there exists no MFQE work for compressed video. The closest area is multi-frame video super-resolution. In the early years, Brandi et al. [36] and Song et al. [37] proposed to enlarge video resolution by taking advantage of high-resolution key frames. Recently, many multi-frame super-resolution approaches have employed deep neural networks. For example, Huang et al. [38] developed a Bidirectional Recurrent Convolutional Network (BRCN), which improves the super-resolution performance over traditional single-frame approaches. Kappeler et al. proposed a Video Super-Resolution network (VSRnet) [27], in which the neighboring frames are warped according to the estimated motion, and then both the current and warped neighboring frames are fed into a super-resolution CNN to enlarge the resolution of the current frame. Later, Li et al. [28] proposed replacing VSRnet by a deeper network with a residual learning strategy. All these multi-frame methods overcome the limitation of single-frame super-resolution approaches (e.g., SR-CNN [39]), which only utilize the spatial information within one single frame.

Recently, the CNN-based FlowNet [40, 41] has been applied in [42] to estimate the motion across frames for super-resolution, which jointly trains the networks of FlowNet and super-resolution. Then, Caballero et al. [29] designed a spatial transformer motion compensation network to detect the optical flow for warping neighboring frames. The current and warped neighboring frames were then fed into the Efficient Sub-Pixel Convolution Network (ESPCN) [43] for super-resolution. Most recently, the Sub-Pixel Motion Compensation (SPMC) layer has been proposed in [44] for video super-resolution. Besides, [44] utilized Convolutional Long Short-Term Memory (ConvLSTM) to achieve the state-of-the-art performance on video super-resolution.

The aforementioned multi-frame super-resolution approaches are motivated by the fact that different observations of the same object or scene are highly likely to exist in consecutive frames of video. As a result, the neighboring frames may contain the content missed when down-sampling the current frame. Similarly, for compressed video, the low-quality frames can be enhanced by taking advantage of their adjacent frames with higher quality, because heavy quality fluctuation exists across compressed frames. Consequently, the quality of compressed videos may be effectively improved by leveraging the multi-frame information. To the best of our knowledge, our MFQE approach proposed in this paper is the first attempt in this direction.

3 Analysis of compressed video

In this section, we first establish a large-scale database of raw and compressed video sequences (Section 3.1) for training the deep neural networks in our MFQE approach. We further analyze our database to investigate the frame-level quality fluctuation (Section 3.2) and the similarity between consecutive compressed frames (Section 3.3). The analysis results can be seen as the motivation of our work.

3.1 Database

First, we establish a database including 160 uncompressed video sequences. These sequences are selected from the datasets of Xiph.org [45], VQEG [46] and Joint Collaborative Team on Video Coding (JCT-VC) [47]. The video sequences contained in our database cover a wide range of resolutions: SIF (352 $\times$ 240), CIF (352 $\times$ 288), NTSC (720 $\times$ 486), 4CIF (704 $\times$ 576), 240p (416 $\times$ 240), 360p (640 $\times$ 360), 480p (832 $\times$ 480), 720p (1280 $\times$ 720), 1080p (1920 $\times$ 1080), and WQXGA (2560 $\times$ 1600). Moreover, Fig. 2 shows some typical examples of the sequences in our database, demonstrating the diversity of video content. Then, all video sequences are compressed by MPEG-1 [48], MPEG-2 [49], MPEG-4 [50], H.264/AVC [51] and HEVC [52] at different quantization parameters (QPs), to generate the corresponding video streams in our database. FFmpeg is used for MPEG-1, MPEG-2, MPEG-4 and H.264/AVC compression, and HM16.5 is used for HEVC compression.

3.2 Frame-level quality fluctuation

Fig. 3 shows the PSNR curves of 6 video sequences, which are compressed by different compression standards. It can be seen that PSNR significantly fluctuates along with the compressed frames. This indicates that there exists considerable quality fluctuation in compressed video sequences for MPEG-1, MPEG-2, MPEG-4, H.264/AVC and HEVC. In addition, Fig. 4 visualizes the subjective results of some frames in one video sequence, which is compressed by the latest HEVC standard. We can see that visual quality varies across compressed frames, also implying the frame-level quality fluctuation.

Moreover, we measure the Standard Deviation (SD) of frame-level PSNR and Structural Similarity (SSIM) for each compressed video sequence, to quantify the quality fluctuation throughout the frames. Besides, the Peak-Valley Difference (PVD), which calculates the average difference between peak values and their nearest valley values, is also measured for both PSNR and SSIM curves of each compressed sequence. Note that the PVD reflects the quality difference between frames within a short period. The results of SD and PVD are reported in Table I, which are averaged over all 160 video sequences in our database. Table I shows that the average SD values of PSNR are above $0.87$ dB for all five compression standards. This implies that heavy quality fluctuation exists across the frames of compressed video sequences. In addition, we can see from Table I that the average PVD results of PSNR are above 1 dB for MPEG-1, MPEG-2, MPEG-4 and HEVC; the only exception is H.264 (0.4732 dB). Therefore, the visual quality is dramatically different between PQFs and Valley Quality Frames (VQFs), such that it is possible to significantly improve the visual quality of VQFs given their neighboring PQFs. Note that similar results can be found for SSIM, as shown in Table I. In summary, we can conclude that significant frame-level quality fluctuation exists for various video compression standards in terms of both PSNR and SSIM.
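The two fluctuation metrics above can be illustrated with a minimal sketch. It assumes that SD is the population standard deviation, that peaks and valleys are strict local maxima and minima of the curve, and that "nearest" means nearest in frame index with ties going to the earlier valley; the text does not fix these details, and the function name `quality_fluctuation` is hypothetical.

```python
import statistics


def quality_fluctuation(curve):
    """Return (SD, PVD) of a frame-level quality curve such as per-frame PSNR."""
    # Assumption: population standard deviation over all frames.
    sd = statistics.pstdev(curve)
    interior = range(1, len(curve) - 1)
    # Strict local maxima (peaks) and minima (valleys); endpoints are ignored.
    peaks = [n for n in interior if curve[n - 1] < curve[n] > curve[n + 1]]
    valleys = [n for n in interior if curve[n - 1] > curve[n] < curve[n + 1]]
    if not peaks or not valleys:
        return sd, 0.0
    diffs = []
    for p in peaks:
        # Nearest valley by frame distance; min() keeps the earlier one on ties.
        nearest = min(valleys, key=lambda v: abs(v - p))
        diffs.append(curve[p] - curve[nearest])
    return sd, sum(diffs) / len(diffs)


# Toy PSNR curve (dB), not taken from the database.
sd, pvd = quality_fluctuation([1.0, 3.0, 1.0, 3.0, 1.0])
print(sd, pvd)
```

Averaging the two returned values over all sequences compressed by one standard would give numbers comparable in kind to the SD and PVD rows of Table I.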

3.3 Similarity between neighboring frames

It is intuitive that frames within a short time period are highly similar. We thus evaluate the Correlation Coefficient (CC) values between each compressed frame and its previous/subsequent 10 frames, for all 160 sequences in our database. The mean and SD of the CC values are shown in Fig. 5, which are obtained from all sequences compressed by HEVC. We can see that the average CC values are larger than 0.75 and the SD values of CC are less than 0.20, when the two frames are at most 10 frames apart. Similar results can be found for the other four video compression standards. This validates the high correlation of neighboring video frames.
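As an illustration of the similarity measurement, the sketch below computes a CC value between two frames. It assumes the CC is the Pearson correlation over pixel values and that a frame is given as a list of rows of luminance values; neither detail is stated in the text, and `correlation_coefficient` is a hypothetical helper name.

```python
import math


def correlation_coefficient(frame_a, frame_b):
    """Pearson correlation between two equally sized frames (lists of pixel rows)."""
    a = [v for row in frame_a for v in row]
    b = [v for row in frame_b for v in row]
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and have the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("CC is undefined for a constant frame")
    return cov / math.sqrt(var_a * var_b)


# Toy 2x2 frames: the second is a brighter copy of the first.
print(correlation_coefficient([[1, 2], [3, 4]], [[2, 4], [6, 8]]))
```

Evaluating this for each frame against its 10 previous and 10 subsequent frames, then taking the mean and SD per frame distance, would yield curves of the kind described for Fig. 5.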

In addition, it is necessary to investigate the number of non-PQFs between two neighboring PQFs, denoted by the Peak Separation (PS), since the quality enhancement of each non-PQF is based on two neighboring PQFs. Table I also reports the results of PS, which are averaged over all 160 video sequences in our database. We can see from this table that the PS values are considerably smaller than 10 frames, especially for the latest H.264 (PS = 2.0529) and HEVC (PS = 2.6641) standards. (Note that this paper only defines PS according to PSNR rather than SSIM, but similar results can be found for SSIM.) Such a short distance, together with the similarity results in Fig. 5, indicates the high similarity between two neighboring PQFs. Therefore, the PQFs probably contain some useful content that is distorted in their neighboring non-PQFs. Motivated by this, our MFQE approach is proposed to enhance the quality of non-PQFs through the advantageous information of the nearest PQFs.
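A minimal sketch of the PS measurement follows, assuming access to the per-frame PSNR (i.e., the raw video is available, as in the database analysis). It marks a frame as a PQF when its PSNR is strictly higher than that of both neighbors, per the definition of PQF, and counts the non-PQFs between consecutive PQFs. The name `peak_separation` and the handling of the first and last frames (never counted as PQFs) are assumptions.

```python
def peak_separation(psnr):
    """Return (PQF indices, average number of non-PQFs between neighboring PQFs)."""
    # A PQF has higher quality than both its previous and subsequent frame.
    pqfs = [n for n in range(1, len(psnr) - 1)
            if psnr[n - 1] < psnr[n] > psnr[n + 1]]
    # Number of frames strictly between each pair of consecutive PQFs.
    gaps = [b - a - 1 for a, b in zip(pqfs, pqfs[1:])]
    ps = sum(gaps) / len(gaps) if gaps else None
    return pqfs, ps


# Toy PSNR curve (dB), not taken from the database.
pqfs, ps = peak_separation([30.0, 33.0, 31.0, 30.0, 29.0, 32.0, 28.0])
print(pqfs, ps)
```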

4 The proposed MFQE approach

4.1 Framework

The framework of our MFQE approach is shown in Fig. 6. As seen in this figure, our MFQE approach first detects PQFs that are used for quality enhancement of non-PQFs. In practical application, raw sequences are not available in video quality enhancement, and thus PQFs and non-PQFs cannot be distinguished through comparison with raw sequences. Therefore, we develop a no-reference PQF detector for our MFQE approach, which is detailed in Section 4.2. Then, we propose a novel MF-CNN architecture to enhance the quality of non-PQFs, which takes advantage of the nearest PQFs, i.e., both previous and subsequent PQFs. As shown in Fig. 6, the MF-CNN architecture is composed of the MC-subnet and the QE-subnet. The MC-subnet (introduced in Section 4.3) is developed to compensate the temporal motion between neighboring frames. To be specific, the MC-subnet firstly predicts the temporal motion between the current non-PQF and its nearest PQFs. Then, the two nearest PQFs are warped with the spatial transformer according to the estimated motion. As such, the temporal motion between non-PQF and PQFs can be compensated. Finally, the QE-subnet (introduced in Section 4.4), which has a spatio-temporal architecture, is proposed for quality enhancement. In the QE-subnet, both the current non-PQF and compensated PQFs are the inputs, and then the quality of the non-PQF can be enhanced with the help of the adjacent compensated PQFs. Note that, in the proposed MF-CNN, the MC-subnet and QE-subnet are trained jointly in an end-to-end manner. Similarly, each PQF is also enhanced by MF-CNN with the help of its nearest PQFs.
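The frame-selection step of this framework can be sketched as follows: given the PQF labels, each frame is paired with its nearest previous and subsequent PQFs, which would then be passed together with the frame to MF-CNN. The function name `nearest_pqfs` is hypothetical, and the fallback at sequence boundaries (reusing the PQF available on the other side) is an assumption, since the text does not describe boundary handling.

```python
def nearest_pqfs(labels):
    """For each frame, return (previous PQF index, subsequent PQF index).

    labels[n] is 1 if frame n is a PQF and 0 otherwise.
    """
    pqf_idx = [n for n, label in enumerate(labels) if label == 1]
    pairs = []
    for n in range(len(labels)):
        before = [p for p in pqf_idx if p < n]
        after = [p for p in pqf_idx if p > n]
        prev_pqf = before[-1] if before else None
        next_pqf = after[0] if after else None
        # Assumed boundary rule: reuse the PQF from the side that exists.
        if prev_pqf is None:
            prev_pqf = next_pqf
        if next_pqf is None:
            next_pqf = prev_pqf
        pairs.append((prev_pqf, next_pqf))
    return pairs


# Frames 1 and 4 are PQFs; frames 2 and 3 are enhanced from both of them.
print(nearest_pqfs([0, 1, 0, 0, 1, 0]))
```

Applied to the example of Fig. 1, a non-PQF at frame 95 with PQFs at frames 92 and 96 would be paired with (92, 96).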

4.2 BiLSTM-based PQF detector

In our MFQE approach, the no-reference PQF detector is based on a BiLSTM network. Recall that a PQF is the frame with higher quality than its adjacent frames. Thus, the features of the current and neighboring frames in both forward and backward directions are used together to detect PQFs. As revealed in Section 3.2, the PQF frequently appears in compressed video, leading to the quality fluctuation. Due to this, we apply the BiLSTM network [53] as the PQF detector, in which the long- and short-term correlation between PQF and non-PQF can be extracted and modeled.

**Notations.** We first introduce the notations for our PQF detector. The consecutive frames in a compressed video are denoted by $\{f_{n}\}_{n=1}^{N}$ , where $n$ indicates the frame order and $N$ is the total number of frames. Then, the corresponding output from BiLSTM is denoted by $\{p_{n}\}_{n=1}^{N}$ , in which $p_{n}$ is the probability of $f_{n}$ being a PQF. Given $\{p_{n}\}_{n=1}^{N}$ , the labels of PQFs for each frame can be determined and denoted by $\{l_{n}\}_{n=1}^{N}$ . If $f_{n}$ is a PQF, then we have $l_{n}=1$ ; otherwise, we have $l_{n}=0$ .

**Feature Extraction.** Before training, we extract 38 features for each $f_{n}$ . Specifically, 2 compressed domain features, i.e., the number of assigned bits and quantization parameters, are extracted at each frame for detecting the PQF, since they are strongly related to visual quality and can be directly obtained from the bitstream. In addition, we follow the no-reference quality assessment method [2] to extract 36 features in the pixel domain. Finally, the extracted features form a 38-dimension vector, which is the input to BiLSTM.

**Architecture.** The architecture of the BiLSTM is shown in Fig. 7. As seen in this figure, the LSTM is bidirectional, in order to extract and model the dependencies from both forward and backward directions. First, the input 38-dimension feature vector is fed into 2 LSTM cells, one for the forward direction and one for the backward direction. Each LSTM cell is composed of 128 units at one time step (corresponding to one video frame). Then, the outputs of the bidirectional LSTM cells are fused and sent to the fully connected layer with a sigmoid activation. Consequently, the fully connected layer outputs $p_{n}$ , the probability of $f_{n}$ being a PQF. Finally, the PQF label $l_{n}$ is obtained from $p_{n}$ .
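To make the data flow of this detector concrete, here is a small forward-pass sketch in plain Python. It is not the authors' implementation: the weights are random rather than trained, the fusion of the two directions is assumed to be concatenation, the gate ordering is a common convention rather than something stated in the text, and all function names are hypothetical. The paper's setting would use an input size of 38 and a hidden size of 128; the usage example uses a smaller hidden size to keep it fast.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Random (untrained) weights for one LSTM direction, gates ordered i, f, g, o."""
    cols = input_dim + hidden_dim
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
               for _ in range(4 * hidden_dim)]
    return {"W": weights, "b": [0.0] * (4 * hidden_dim), "H": hidden_dim}


def lstm_pass(params, features):
    """Run one LSTM direction over a list of per-frame feature vectors."""
    H = params["H"]
    h, c = [0.0] * H, [0.0] * H
    outputs = []
    for x in features:
        z = list(x) + h  # concatenate the input with the previous hidden state
        pre = [sum(w * v for w, v in zip(row, z)) + b
               for row, b in zip(params["W"], params["b"])]
        i = [sigmoid(v) for v in pre[0:H]]            # input gate
        f = [sigmoid(v) for v in pre[H:2 * H]]        # forget gate
        g = [math.tanh(v) for v in pre[2 * H:3 * H]]  # candidate cell state
        o = [sigmoid(v) for v in pre[3 * H:4 * H]]    # output gate
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(h)
    return outputs


def bilstm_pqf_probabilities(features, fwd, bwd, fc_w, fc_b):
    """Return p_n for every frame from forward and backward LSTM states."""
    h_fwd = lstm_pass(fwd, features)
    h_bwd = lstm_pass(bwd, features[::-1])[::-1]  # re-align to frame order
    probs = []
    for hf, hb in zip(h_fwd, h_bwd):
        fused = hf + hb  # assumption: fusion by concatenation
        probs.append(sigmoid(sum(w * v for w, v in zip(fc_w, fused)) + fc_b))
    return probs


rng = random.Random(0)
input_dim, hidden_dim, num_frames = 38, 8, 6
frames = [[rng.random() for _ in range(input_dim)] for _ in range(num_frames)]
fwd = init_lstm(input_dim, hidden_dim, rng)
bwd = init_lstm(input_dim, hidden_dim, rng)
fc_w = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_dim)]
probs = bilstm_pqf_probabilities(frames, fwd, bwd, fc_w, 0.0)
labels = [1 if p > 0.5 else 0 for p in probs]  # assumed threshold of 0.5
```

The threshold used to turn $p_{n}$ into $l_{n}$ is also an assumption here; the text only says that the label is derived from the probability.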

**Postprocessing.** In our PQF detector, we further refine the results of the BiLSTM according to prior knowledge of PQFs. Specifically, the following two strategies are developed to refine the labels $\{l_{n}\}_{n=1}^{N}$ of the PQF detector, where $N$ is the total number of frames.

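Strategy I: Remove the consecutive PQFs. According to the definition of a PQF, PQFs cannot appear consecutively. Hence, if consecutive PQFs exist:

$$\{l_{n+i}\}_{i=0}^{j}=1 \quad \text{and} \quad l_{n-1}=l_{n+j+1}=0,\quad j\geq 1, \tag{1}$$

we refine the PQF labels according to their probabilities:

$$l_{n+i}=0, \quad \text{where } i\neq\arg\max_{0\leq k\leq j}\,(p_{n+k}), \tag{2}$$

so that only one PQF is left.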

Strategy II: Break the continuity of non-PQFs. According to the analysis in Section 3, PQFs frequently appear within a limited separation. For example, the average value of PS is 2.66 frames for HEVC compressed sequences. Here, we assume that $D$ is the maximal separation between two PQFs. Given this assumption, if the results of $\{l_{n}\}_{n=1}^{N}$ yield more than $D$ consecutive zeros (non-PQFs):

$$\{l_{n+i}\}_{i=0}^{d}=0 \quad \text{and} \quad l_{n-1}=l_{n+d+1}=1,\quad d>D, \tag{3}$$

then one of their corresponding frames $\{f_{n+i}\}_{i=0}^{d}$ needs to act as a PQF. Accordingly, we set:

$$l_{n+i}=1, \quad \text{where } i=\arg\max_{0<k<d}\,(p_{n+k}). \tag{4}$$

After refining $\{l_{n}\}_{n=1}^{N}$ as discussed above, our PQF detector can locate PQFs and non-PQFs in the compressed video.
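The two refinement strategies can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `_runs` and `refine_pqf_labels` and the parameter `max_sep` (standing for $D$) are hypothetical. It also assumes that Strategy II only promotes a frame in the interior of a zero run bounded by PQFs on both sides, and that each long run is processed once.

```python
from typing import Iterator, List, Tuple


def _runs(labels: List[int], value: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs of maximal runs equal to `value`."""
    start = None
    for idx, lab in enumerate(labels):
        if lab == value and start is None:
            start = idx
        elif lab != value and start is not None:
            yield start, idx - 1
            start = None
    if start is not None:
        yield start, len(labels) - 1


def refine_pqf_labels(labels: List[int], probs: List[float], max_sep: int) -> List[int]:
    """Refine binary PQF labels with Strategy I followed by Strategy II."""
    refined = list(labels)  # work on a copy so the input stays untouched

    # Strategy I: in every run of consecutive PQFs, keep only the frame
    # with the highest predicted probability.
    for start, end in list(_runs(refined, 1)):
        if end > start:
            best = max(range(start, end + 1), key=lambda k: probs[k])
            for k in range(start, end + 1):
                refined[k] = 1 if k == best else 0

    # Strategy II: in every run of more than `max_sep` non-PQFs that is
    # bounded by PQFs on both sides, promote the most probable interior frame
    # (restricting the choice to interior frames is an assumption).
    last = len(refined) - 1
    for start, end in list(_runs(refined, 0)):
        bounded = start > 0 and end < last
        interior = range(start + 1, end)
        if bounded and (end - start + 1) > max_sep and len(interior) > 0:
            best = max(interior, key=lambda k: probs[k])
            refined[best] = 1
    return refined


labels = [1, 1, 0, 0, 0, 0, 0, 1, 0]
probs = [0.9, 0.6, 0.1, 0.2, 0.7, 0.3, 0.1, 0.8, 0.2]
refined = refine_pqf_labels(labels, probs, max_sep=3)
```

In this toy input, the first two frames are both labeled as PQFs, so only the more probable one survives; the following run of six non-PQFs exceeds `max_sep = 3`, so its most probable interior frame is promoted.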

4.3 MC-subnet

After detecting PQFs, our MFQE approach can enhance the quality of non-PQFs by taking advantage of their neighboring PQFs. Unfortunately, there exists considerable temporal motion between PQFs and non-PQFs. Hence, we develop the MC-subnet to compensate the temporal motion across frames, which is based on the CNN method of Spatial Transformer Motion Compensation (STMC) [29].

**Architecture.** The architecture of STMC is shown in Fig. 8. Additionally, the convolutional layers for pixel-wise motion estimation are described in Table II. As in [29], our MC-subnet adopts convolutional layers to estimate the $\times 4$ and $\times 2$ down-scaled Motion Vector (MV) maps, denoted by $\mathbf{M}^{\times 4}$ and $\mathbf{M}^{\times 2}$. Down-scaled motion estimation is effective at handling large-scale motion. However, down-scaling reduces the accuracy of MV estimation. Therefore, in addition to STMC, we develop additional convolutional layers for pixel-wise motion estimation in our MC-subnet, which do not contain any down-scaling process. The output of STMC includes the $\times 2$ down-scaled MV map $\mathbf{M}^{\times 2}$ and the corresponding compensated PQF $F^{\prime\times 2}_{p}$. They are concatenated with the original PQF and non-PQF as the input to the convolutional layers for pixel-wise motion estimation. Consequently, the pixel-wise MV map, denoted by $\mathbf{M}$, can be generated. Note that the MV map $\mathbf{M}$ contains two channels, i.e., the horizontal MV map $\mathbf{M}_{x}$ and the vertical MV map $\mathbf{M}_{y}$. Here, $x$ and $y$ are the horizontal and vertical indices of each pixel. Given $\mathbf{M}_{x}$ and $\mathbf{M}_{y}$, the PQF is warped to compensate the temporal motion. Let the compressed PQF and non-PQF be $F_{p}$ and $F_{np}$, respectively. The compensated PQF $F^{\prime}_{p}$ can be expressed as

$$F^{\prime}_{p}(x,y)=\mathcal{I}\{F_{p}(x+\mathbf{M}_{x}(x,y),\,y+\mathbf{M}_{y}(x,y))\},$$

where $\mathcal{I}\{\cdot\}$ denotes bilinear interpolation. The reason for interpolation is that $\mathbf{M}_{x}(x,y)$ and $\mathbf{M}_{y}(x,y)$ may be non-integer values.
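The warping step can be illustrated with a small pure-Python sketch. It assumes single-channel frames stored as nested lists and clamps sampling coordinates to the frame border; both choices, as well as the names `bilinear_sample` and `warp`, are illustrative assumptions and not details given in the paper.

```python
import math
from typing import List

Frame = List[List[float]]


def bilinear_sample(frame: Frame, x: float, y: float) -> float:
    """Bilinearly interpolate `frame` at the (possibly non-integer) position (x, y)."""
    h, w = len(frame), len(frame[0])
    # Clamp to the frame area (border handling is an assumption of this sketch).
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * frame[y0][x0] + ax * frame[y0][x1]
    bottom = (1 - ax) * frame[y1][x0] + ax * frame[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp(frame: Frame, mv_x: Frame, mv_y: Frame) -> Frame:
    """Compensate `frame` with per-pixel motion vectors (mv_x, mv_y)."""
    h, w = len(frame), len(frame[0])
    return [
        [bilinear_sample(frame, x + mv_x[y][x], y + mv_y[y][x]) for x in range(w)]
        for y in range(h)
    ]


pqf = [[0, 1, 2], [3, 4, 5]]
half_pixel_right = [[0.5] * 3 for _ in range(2)]
no_vertical_motion = [[0.0] * 3 for _ in range(2)]
compensated = warp(pqf, half_pixel_right, no_vertical_motion)
```

With a uniform horizontal motion of half a pixel, each output sample is the average of two horizontally adjacent input pixels, except at the right border, where the clamped position returns the border pixel itself.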

**Training strategy.** Since it is hard to obtain the ground truth of MVs, the parameters of the convolutional layers for motion estimation cannot be trained directly. Instead, we can train the parameters by minimizing the MSE between the compensated adjacent frame and the current frame. Note that a similar training strategy is adopted in [29] for motion compensation in video super-resolution tasks. However, in our MC-subnet, both inputs $F_{p}$ and $F_{np}$ are compressed frames with quality distortion. Hence, when minimizing the MSE between $F^{\prime}_{p}$ and $F_{np}$, the MC-subnet learns to estimate distorted MVs, resulting in inaccurate motion estimation. Therefore, the MC-subnet is trained under the supervision of the raw frames. That is, we warp the raw frame of the PQF (denoted by $F^{R}_{p}$) using the MV map output by the convolutional layers for motion estimation, and minimize the MSE between the compensated raw PQF (denoted by $F^{\prime R}_{p}$) and the raw non-PQF (denoted by $F^{R}_{np}$). Mathematically, the loss function of the MC-subnet can be written as

$$L_{\text{MC}}(\theta_{mc})=\|F^{\prime R}_{p}(\theta_{mc})-F^{R}_{np}\|_{2}^{2},$$

where $\theta_{mc}$ represents the trainable parameters of our MC-subnet. Note that the raw frames $F^{R}_{p}$ and $F^{R}_{np}$ are not required when compensating motion in test and practical use.

4.4 QE-subnet

Given the compensated PQFs, the quality of non-PQFs can be enhanced through the QE-subnet. To be specific, the non-PQF $F_{np}$ , together with the compensated previous and subsequent PQFs ( $F^{\prime}_{p1}$ and $F^{\prime}_{p2}$ ), are fed into the QE-subnet. This way, both the spatial and temporal features of these three frames are extracted and fused, such that the advantageous information in the adjacent PQFs can be used to enhance the quality of the non-PQF. It differs from the conventional CNN-based single-frame quality enhancement approaches, which can only handle the spatial information within one single frame.

**Architecture.** The architecture of the QE-subnet is shown in Fig. 9. The QE-subnet consists of two key lightweight components: multi-scale feature extraction (denoted by C1-9) and densely connected mapping construction (denoted by C10-14).

• Multi-scale feature extraction. The input to the QE-subnet is the non-PQF $F_{np}$ and its neighboring compensated PQFs $F^{\prime}_{p1}$ and $F^{\prime}_{p2}$. Then, the spatial features of $F_{np}$, $F^{\prime}_{p1}$ and $F^{\prime}_{p2}$ are extracted by multi-scale convolutional filters, denoted by C1-9. Specifically, the filter size of C1,4,7 is $3\times 3$, while the filter sizes of C2,5,8 and C3,6,9 are $5\times 5$ and $7\times 7$, respectively. The filter numbers of C1-9 are all 32. After feature extraction, 288 feature maps filtered at different scales are obtained. Subsequently, all feature maps from $F_{np}$, $F^{\prime}_{p1}$ and $F^{\prime}_{p2}$ are concatenated, and then flow into the dense connection component.

• Densely connected mapping construction. After obtaining the feature maps from $F_{np}$, $F^{\prime}_{p1}$ and $F^{\prime}_{p2}$, a densely connected architecture is applied to construct the non-linear mapping from the feature maps to the enhancement residual. Note that the enhancement residual refers to the difference between the original and enhanced frames. To be specific, there are 5 convolutional layers in the non-linear mapping of the densely connected architecture. Each of them has 32 convolutional filters of size $3\times 3$. In addition, dense connection [32] is adopted to encourage feature reuse, strengthen feature propagation and mitigate the vanishing-gradient problem. Moreover, Batch Normalization (BN) [31] is applied to all 5 layers after PReLU activation to reduce internal covariate shift, thus accelerating the training process. We denote the composite non-linear mapping as $H_{l}(\cdot)$, including Convolution (Conv), PReLU and BN. We further denote the output of the $l$-th layer as $x_{l}$, such that each layer can be formulated as follows,

[TABLE]

where $[x_{10},x_{11},...,x_{14}]$ refers to the concatenation of the feature maps produced in layers C10-C14. Finally, the enhanced non-PQF $F_{en}$ is generated by the pixel-wise summation of the learned enhancement residual $R_{np}(\theta_{qe})$ and the input non-PQF $F_{np}$:

$$F_{en}=F_{np}+R_{np}(\theta_{qe}),$$

where $\theta_{qe}$ is defined as the trainable parameters of the QE-subnet.
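As a quick check of the feature-map count in the multi-scale extraction component, and of the final residual summation, consider the following Python sketch. The assignment of layers C1-3, C4-6 and C7-9 to the three input frames is an assumption made for illustration, and `add_residual` is a hypothetical helper operating on nested lists.

```python
from typing import Dict, List, Tuple

FRAMES = ("F_np", "F'_p1", "F'_p2")   # non-PQF and its two compensated PQFs
KERNEL_SIZES = (3, 5, 7)              # 3x3, 5x5 and 7x7 filters
FILTERS_PER_LAYER = 32

# Enumerate C1..C9: one layer per (frame, kernel size) pair.
layers: Dict[str, Tuple[str, int, int]] = {}
index = 1
for frame in FRAMES:
    for kernel in KERNEL_SIZES:
        layers[f"C{index}"] = (frame, kernel, FILTERS_PER_LAYER)
        index += 1

total_feature_maps = sum(count for _, _, count in layers.values())


def add_residual(frame: List[List[float]], residual: List[List[float]]) -> List[List[float]]:
    """Pixel-wise sum of the input non-PQF and the learned enhancement residual."""
    return [[p + r for p, r in zip(row_f, row_r)] for row_f, row_r in zip(frame, residual)]


enhanced = add_residual([[10.0, 20.0]], [[0.5, -1.0]])
```

The enumeration reproduces the stated layout (C1,4,7 use $3\times 3$ filters, C2,5,8 use $5\times 5$, C3,6,9 use $7\times 7$) and yields $9\times 32=288$ feature maps.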

**Training strategy.** The MC-subnet and QE-subnet in our MF-CNN are trained jointly in an end-to-end manner. Recall that $F^{\prime R}_{p1}$ and $F^{\prime R}_{p2}$ are defined as the compensated raw frames of the previous and incoming PQFs, respectively. The loss function of our MF-CNN can be formulated as

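$$L_{\text{MF}}(\theta_{mc},\theta_{qe})=a\cdot\underbrace{\sum_{i=1}^{2}\|F^{\prime R}_{pi}(\theta_{mc})-F^{R}_{np}\|_{2}^{2}}_{L_{\text{MC}}}+b\cdot\underbrace{\|(F_{np}+R_{np}(\theta_{qe}))-F^{R}_{np}\|_{2}^{2}}_{L_{\text{QE}}}.$$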

As (4.4) indicates, the loss function of the MF-CNN is the weighted sum of $L_{\text{MC}}$ and $L_{\text{QE}}$ , which are the $\ell_{2}$ -norm training losses of MC-subnet and QE-subnet, respectively. We divide the training into 2 steps. In the first step, we set $a\gg b$ , considering that $F^{\prime}_{p1}$ and $F^{\prime}_{p2}$ generated by MC-subnet are the basis of the following QE-subnet, and thus the convergence of MC-subnet is the primary target. After the convergence of $L_{\text{MC}}$ is observed, we set $a\ll b$ to minimize the MSE between $F_{np}+R_{np}$ and $F^{R}_{np}$ . Finally, the MF-CNN model can be trained for video quality enhancement.
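The two-step weighting can be sketched as follows. This is a minimal pure-Python illustration on flattened frames, not the training code of the paper: `squared_l2`, `mf_cnn_loss` and `loss_weights` are hypothetical names, and the concrete weights (1 and 0.01) are the values reported in Section 5.1.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared l2 distance between two flattened frames."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mf_cnn_loss(
    compensated_raw_pqfs: List[Sequence[float]],  # warped raw PQFs (previous, incoming)
    raw_non_pqf: Sequence[float],                 # raw (uncompressed) non-PQF
    enhanced_non_pqf: Sequence[float],            # F_np + R_np
    a: float,
    b: float,
) -> float:
    """Weighted sum a * L_MC + b * L_QE."""
    l_mc = sum(squared_l2(f, raw_non_pqf) for f in compensated_raw_pqfs)
    l_qe = squared_l2(enhanced_non_pqf, raw_non_pqf)
    return a * l_mc + b * l_qe


def loss_weights(mc_converged: bool) -> Tuple[float, float]:
    """Return (a, b): favor L_MC first, then L_QE once the MC-subnet has converged."""
    return (0.01, 1.0) if mc_converged else (1.0, 0.01)


raw_np = [1.0, 1.0]
warped = [[1.0, 2.0], [0.0, 2.0]]   # L_MC = 1 + 2 = 3
enhanced = [1.0, 0.5]               # L_QE = 0.25
step1 = mf_cnn_loss(warped, raw_np, enhanced, *loss_weights(mc_converged=False))
step2 = mf_cnn_loss(warped, raw_np, enhanced, *loss_weights(mc_converged=True))
```

In the first step the motion-compensation term dominates the total loss, while after the switch the enhancement term dominates, which mirrors the schedule described above.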

5 Experiments

5.1 Settings

In this section, the experimental results are presented to validate the effectiveness of our MFQE 2.0 approach. Note that our MFQE 2.0 approach is called MFQE in this paper, while the MFQE approach of our conference paper [30] is referred to as MFQE 1.0 for comparison. In our database, apart from the 18 standard test sequences of the Joint Collaborative Team on Video Coding (JCT-VC) [33], the other 142 sequences are randomly divided into a non-overlapping training set (106 sequences) and validation set (36 sequences). We compress all 160 sequences with HM16.5 under the Low-Delay configuration, setting the Quantization Parameter (QP) to 22, 27, 32, 37 and 42, respectively.

For the BiLSTM-based PQF detector, the hyper-parameter $D$ of (3) is set to 3 in post-processing, because the average value of PS is 2.66 frames for HEVC compressed sequences ($D$ should be adjusted according to the compression standard and configuration). In addition, the LSTM length is set to 8. Before training the MF-CNN, the raw and compressed sequences are segmented into $64\times 64$ patches as the training samples. The batch size is set to 128. We apply the Adam algorithm [54] with an initial learning rate of $10^{-4}$ to minimize the loss function (4.4). It is worth mentioning that the MC-subnet may fail to converge if the initial learning rate is too large, e.g., $10^{-3}$. For the QE-subnet, we first set $a=1$ and $b=0.01$ in (4.4) so that the MC-subnet converges. After the convergence, we set $a=0.01$ and $b=1$, so that the QE-subnet can converge faster.

5.2 Performance of the PQF detector

The performance of PQF detection is critical, since it is the first process of our MFQE approach. Thus, we evaluate the performance of our BiLSTM-based approach in PQF detection. For evaluation, we measure precision, recall and $F_{1}$ -score of PQF detection over all 18 test sequences compressed at five QPs (= 22, 27, 32, 37 and 42). The average results are shown in Table III. In this table, we also list the results of PQF detection by the SVM-based approach of MFQE 1.0 as reported in [30]. Note that the results of only two QPs (= 37 and 42) are reported in [30].

We can see from Table III that the proposed BiLSTM-based PQF detector in MFQE 2.0 performs well in terms of precision, recall and $F_{1}$ -score. For example, at QP = 37, the average precision, recall and $F_{1}$ -score of our BiLSTM-based PQF detector are $100.0\%$ , $96.5\%$ and $98.2\%$ , considerably higher than those of the SVM-based approach in MFQE 1.0. More importantly, the PQF detection of our approach is robust to all 5 QPs, since the average values of $F_{1}$ -score are all above $90\%$ . In addition, Table IV shows the performance of our BiLSTM-based PQF detector over each of 18 test sequences compressed at QP = 37. As seen in this table, the high performance is achieved by our PQF detector for almost all sequences, as only the recall of sequence BQSquare is below $90\%$ . In conclusion, the effectiveness of our BiLSTM-based PQF detector is validated, laying a firm foundation for our MFQE approach.
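For reference, the three detection metrics can be computed from binary PQF labels as in the sketch below. The helper names are hypothetical. The last line plugs in the average precision and recall quoted above for QP = 37; since the reported $F_{1}$-score is an average over sequences, it need not equal the $F_{1}$ of the averaged precision and recall in general, although the two agree to one decimal place here.

```python
from typing import Sequence, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def detection_scores(true_labels: Sequence[int], pred_labels: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for binary PQF labels (1 = PQF)."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy example: one of three true PQFs is missed, no false alarms.
toy = detection_scores([1, 0, 0, 1, 0, 1], [1, 0, 0, 0, 0, 1])

# Average precision and recall at QP = 37, as quoted in the text.
f1_at_qp37 = round(100 * f1_score(1.000, 0.965), 1)
```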

5.3 Performance of our MFQE approach

In this section, we evaluate the quality enhancement performance of our MFQE approach in terms of $\Delta$ PSNR, which measures the PSNR gap between the enhanced and original compressed sequences. In addition, the structural similarity (SSIM) index is also evaluated. Then, the performance of our MFQE approach is compared with those of AR-CNN [17], DnCNN [20], Li et al. [21], DCAD [35] and DS-CNN [25]. Among them, AR-CNN, DnCNN and Li et al. are the latest quality enhancement approaches for compressed images, while DCAD and DS-CNN are the state-of-the-art video quality enhancement approaches. For fair comparison, all compared approaches are retrained over our training set, the same as our MFQE approach.
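The $\Delta$PSNR metric can be made concrete with a short sketch. It assumes 8-bit content (peak value 255) and flattened frames given as lists; `psnr` and `delta_psnr` are illustrative names, and PSNR is computed against the raw frame in both cases.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], test: Sequence[float], peak: float = 255.0) -> float:
    """PSNR in dB of `test` with respect to `reference` (peak of 255 assumes 8-bit video)."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)


def delta_psnr(raw: Sequence[float], compressed: Sequence[float], enhanced: Sequence[float]) -> float:
    """PSNR gain of the enhanced frame over the compressed frame."""
    return psnr(raw, enhanced) - psnr(raw, compressed)


raw = [100, 100, 100, 100]
compressed = [104, 96, 104, 96]   # MSE = 16
enhanced = [102, 98, 102, 98]     # MSE = 4
gain_db = delta_psnr(raw, compressed, enhanced)
```

Reducing the MSE by a factor of 4 corresponds to a gain of $10\log_{10}4\approx 6.02$ dB in this toy case; the gains reported in this section are well below 1 dB, i.e., MSE reductions of roughly 10 to 20 percent.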

**Quality enhancement on non-PQFs.** Our MFQE approach mainly focuses on enhancing the quality of non-PQFs using the neighboring multi-frame information. Therefore, we first assess the quality enhancement of non-PQFs. Fig. 10 shows the $\Delta$PSNR and $\Delta$SSIM results averaged over PQFs and non-PQFs of all 18 test sequences compressed at 4 different QPs. As shown, our MFQE approach significantly outperforms other approaches on non-PQF enhancement. The average improvement of non-PQF quality is 0.614 dB in PSNR and 0.012 in SSIM, while that of the second-best approach is 0.317 dB in PSNR and 0.007 in SSIM. We can further see from Fig. 10 that our MFQE approach has a considerably larger PSNR improvement for non-PQFs than for PQFs. By contrast, for the compared approaches, the PSNR improvement of non-PQFs is similar to or even less than that of PQFs. In short, the above results validate the effectiveness of our MFQE approach in enhancing the quality of non-PQFs.

**Overall quality enhancement.** Table V presents the results of $\Delta$PSNR and $\Delta$SSIM, averaged over all frames of each test sequence. As shown in this table, our MFQE approach consistently outperforms all compared approaches. To be specific, at QP = 37, the highest $\Delta$PSNR of our MFQE approach reaches 0.920 dB, i.e., for sequence PeopleOnStreet. The averaged $\Delta$PSNR of our MFQE approach is 0.562 dB, which is $23.5\%$ higher than that of MFQE 1.0 (0.455 dB), $88.0\%$ higher than that of Li et al. (0.299 dB), $74.5\%$ higher than that of DCAD (0.322 dB), and $87.3\%$ higher than that of DS-CNN (0.300 dB). Even higher $\Delta$PSNR improvement can be observed when compared with AR-CNN and DnCNN. At other QPs (= 22, 27, 32 and 42), our MFQE approach consistently outperforms other state-of-the-art video quality enhancement approaches. Similar improvement can be found for SSIM in Table V. This demonstrates the robustness of our MFQE approach in enhancing video quality. This is mainly attributed to the significant improvement in the quality of non-PQFs, which make up the majority of compressed video frames.
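The relative improvements quoted above follow directly from the averaged $\Delta$PSNR values at QP = 37, as the following short computation shows. All numbers are taken from the text; the variable names are illustrative.

```python
# Averaged delta-PSNR values (dB) at QP = 37, as quoted in the text.
OURS_DB = 0.562
BASELINES_DB = {
    "MFQE 1.0": 0.455,
    "Li et al.": 0.299,
    "DCAD": 0.322,
    "DS-CNN": 0.300,
}

# Relative gain of MFQE 2.0 over each baseline, in percent (one decimal place).
relative_gain = {
    name: round(100 * (OURS_DB - value) / value, 1)
    for name, value in BASELINES_DB.items()
}
```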

**Rate-distortion performance.** We further evaluate the rate-distortion performance of our MFQE approach by comparing with other approaches. First, Fig. 11 shows the rate-distortion curves of our and other state-of-the-art approaches over four selected sequences. Note that the results of the DCAD and DS-CNN approaches are plotted in this figure, since they perform better than other compared approaches. We can see from Fig. 11 that our MFQE approach performs better than other approaches in rate-distortion performance. Then, we quantify the rate-distortion performance by evaluating the BD-bitrate (BD-BR) reduction, which is calculated over the PSNR results of five QPs (= 22, 27, 32, 37 and 42). The results are presented in Table VI. As can be seen, the BD-BR reduction of our MFQE approach is $14.06\%$ on average, while that of the second-best approach DCAD is only $8.89\%$ on average. In general, the quality enhancement of our MFQE approach is equivalent to improving rate-distortion performance.

**Quality fluctuation.** Apart from the compression artifacts, the quality fluctuation in compressed videos may also lead to degradation of QoE [55, 56, 57]. Fortunately, our MFQE approach helps mitigate the quality fluctuation, because of its significant quality improvement on non-PQFs as found in Fig. 10. We evaluate the fluctuation of video quality in terms of the SD and PVD results of PSNR curves, which are introduced in Section 3. Fig. 12 shows the SD and PVD values averaged over all 18 test sequences, which are obtained from the quality enhancement approaches and the HEVC baseline. As shown in this figure, our MFQE approach succeeds in reducing the SD and PVD, while the other five compared approaches enlarge the SD and PVD values over the HEVC baseline. The reason is that our MFQE approach has a considerably larger PSNR improvement for non-PQFs than for PQFs, thus reducing the quality gap between PQFs and non-PQFs. In addition, Fig. 13 shows the PSNR curves of two selected test sequences, for our MFQE approach and the HEVC baseline. It can be seen that the PSNR fluctuation of our MFQE approach is significantly smaller than that of the HEVC baseline. In summary, our approach is also capable of reducing the quality fluctuation of video compression.
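The effect on SD can be illustrated with a toy PSNR curve. The sketch below uses the population standard deviation from Python's `statistics` module, which is an assumption, since the exact SD definition is given in Section 3; the PSNR values are invented for illustration only and are not measurements from the paper.

```python
import statistics
from typing import Sequence


def psnr_sd(psnr_curve: Sequence[float]) -> float:
    """Standard deviation (dB) of a per-frame PSNR curve (population SD assumed)."""
    return statistics.pstdev(psnr_curve)


# Hypothetical curve alternating between PQFs (34 dB) and non-PQFs (31 dB).
baseline = [34.0, 31.0, 34.0, 31.0]
# Enhancement that helps non-PQFs (+1.4 dB) more than PQFs (+0.2 dB).
enhanced = [34.2, 32.4, 34.2, 32.4]

sd_before = psnr_sd(baseline)
sd_after = psnr_sd(enhanced)
```

Because the non-PQFs gain more than the PQFs, the gap between peaks and valleys narrows and the SD drops (from 1.5 dB to about 0.9 dB in this toy case). An enhancement that favored PQFs instead would widen the gap and increase the SD.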

**Subjective quality performance.** Fig. 14 shows the subjective quality performance on the sequences Fourpeople at QP = 37, BasketballPass at QP = 37 and RaceHorses at QP = 42. It can be observed that our MFQE approach reduces the compression artifacts much more effectively than the other five compared approaches. Specifically, the severely distorted content, e.g., the cheek in Fourpeople, the ball in BasketballPass and the horse’s feet in RaceHorses, can be finely restored by our MFQE approach with its multi-frame strategy. By contrast, such compression distortion can hardly be restored by the compared approaches, as they only use a single low-quality frame. Therefore, our MFQE approach also performs well in subjective quality enhancement.

**Test speed.** We evaluate the test speed of quality enhancement using a computer equipped with an Intel i7-8700 3.20 GHz CPU and a GeForce GTX 1080 Ti GPU. Specifically, we measure the average frames per second (fps) when testing video sequences at different resolutions. Note that the test set has been divided into 5 classes at different resolutions in [33]. The results averaged over sequences at different resolutions are reported in Table VII. As shown in this table, when enhancing non-PQFs, MFQE 2.0 achieves at least a 2-fold speed-up over MFQE 1.0. For PQFs, MFQE 2.0 is also considerably faster than MFQE 1.0. The reason is that the MF-CNN architecture in MFQE 2.0 has significantly fewer parameters than that in MFQE 1.0.

Furthermore, we calculate the number of operations for the MFQE approach. For MFQE 1.0, there are 99,561 additions and 215,150,624 multiplications needed for enhancing a 64 $\times$ 64 patch, while those for MFQE 2.0 are 150,276 and 5,942,640. The reason for the dramatic reduction of operations is that we decrease the number of filters in the mapping structure of MF-CNN from 64 to 32, and relieve the burden of feature extraction by cutting the number of output feature maps from 128 to 32. At the same time, we deepen the mapping structure and introduce the dense strategy, batch normalization and residual learning. This way, the nonlinearity of MF-CNN is largely improved, while the number of parameters is effectively reduced. In short, MFQE 2.0 is efficient in video quality enhancement, and its efficiency is mainly due to its lightweight structure.

5.4 Ablation study

**PQF detector.** In this section, we validate the necessity and effectiveness of utilizing PQFs to enhance the quality of non-PQFs. To this end, we retrain the MF-CNN model of our MFQE approach to enhance non-PQFs with the help of adjacent frames, instead of PQFs. The MF-CNN network and experiment settings are all consistent with those in Sections 4.3 and 5.1. The retrained model is represented by MFQE_NF (i.e., MFQE with Neighboring Frames), and the experimental results are shown in Fig. 15, which are obtained by averaging over all 18 test sequences compressed at QP = 37. We can see that our approach without PQFs achieves a $\Delta$PSNR of only 0.274 dB. By contrast, as aforementioned, our approach with PQFs achieves a $\Delta$PSNR of 0.562 dB. Moreover, as validated in Section 5.3, our MFQE approach obtains considerably higher enhancement on non-PQFs than the single-frame approaches. In short, the above ablation study demonstrates the necessity and effectiveness of utilizing PQFs in the video quality enhancement task.

Besides, we test the MF-CNN model with ground truth PQFs. Specifically, the ground truth PQF labels are obtained according to the PSNR curves and the definition of PQFs. The experimental results (denoted by MFQE_GT, i.e., MFQE with Ground Truth PQFs) are shown in Fig. 15. As we can see, the average $\Delta$ PSNR is 0.563 dB. This indicates an upper bound on the performance with respect to PQF estimation.
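Ground-truth labels of this kind can be derived from a per-frame PSNR curve as sketched below. The sketch applies the definition of a PQF (a frame with higher quality than both adjacent frames) with strict inequalities and labels the first and last frames as non-PQFs; both conventions, as well as the function name, are assumptions, since boundary handling is not specified in the text.

```python
from typing import List, Sequence


def ground_truth_pqf_labels(psnr_curve: Sequence[float]) -> List[int]:
    """Label frame n as a PQF (1) if its PSNR exceeds that of both neighbors."""
    labels = [0] * len(psnr_curve)
    # Boundary frames have only one neighbor; they are left as non-PQFs here.
    for n in range(1, len(psnr_curve) - 1):
        if psnr_curve[n] > psnr_curve[n - 1] and psnr_curve[n] > psnr_curve[n + 1]:
            labels[n] = 1
    return labels


# Hypothetical PSNR values (dB) for seven consecutive frames.
curve = [33.0, 35.1, 32.8, 32.9, 35.0, 33.2, 33.1]
labels = ground_truth_pqf_labels(curve)
```

By construction, such labels never contain two consecutive PQFs, which is the prior knowledge exploited by Strategy I of the post-processing.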

Also, we test the impact of the post-processing of the PQF detector, i.e., removing neighboring PQFs and inserting PQFs between two PQFs that are far apart. Specifically, we test the $F_{1}$-score of the PQF detector without post-processing, and further evaluate its performance on quality enhancement (denoted by MFQE_NP, i.e., MFQE with No Post-processing) in terms of $\Delta$PSNR. With post-processing, the average $F_{1}$-score slightly increases from $98.15\%$ to $98.21\%$. Additionally, the average $\Delta$PSNR decreases by 0.001 dB after removing post-processing. Although the $\Delta$PSNR improvement from post-processing is minor, post-processing is still necessary in some extreme cases, where it protects the MFQE approach from inaccurate motion compensation and inferior quality enhancement. Take sequence KristenAndSara as an example. The non-PQF labels of frames 273 and 277 are corrected to PQFs. Consequently, the average $\Delta$PSNR of frames 270 to 280 increases from 0.659 dB to 0.724 dB after using the post-processed labels.

Finally, we conduct an experiment to validate the improvement of quality enhancement after replacing SVM with BiLSTM both in training and evaluation, which is an advancement of MFQE 2.0 over MFQE 1.0. Specifically, we first replace the BiLSTM detector of MFQE 2.0 with SVM in the detection stage. Then, we retrain and test the model (denoted by MFQE_SVM) which consists of the SVM based detector and MF-CNN. The average $\Delta$ PSNR decreases from 0.562 dB to 0.528 dB (i.e., 6.0% degradation). This validates the contribution of the improved PQF detector.

**Multi-scale and dense connection strategy.** We further validate the effectiveness of the multi-scale feature extraction strategy and the densely connected structure in enhancing video quality. First, we ablate all dense connections in the QE-subnet of our MFQE approach. In addition, we increase the filter number of C11 from 32 to 50, so that the number of trainable parameters is maintained for fair comparison. The corresponding retrained model is denoted by MFQE_ND (i.e., MFQE with No Dense connection). Second, we ablate the multi-scale structure in the QE-subnet. Based on the dense-ablated network above, we fix all kernel sizes of the feature extraction component to $5\times 5$. Other parts of the MFQE approach and experiment settings are all the same as those in Sections 4 and 5.1. Accordingly, the retrained model is represented as MFQE_GC (i.e., MFQE with General CNN). Fig. 15 shows the ablation results, which are also averaged over all 18 test sequences at QP = 37. As seen in this figure, the PSNR improvement decreases from 0.562 dB to 0.299 dB (i.e., $46.8\%$ degradation) when disabling the dense connections, and further reduces to 0.278 dB (i.e., $50.5\%$ degradation) when additionally ablating the multi-scale structure. This indicates the effectiveness of our multi-scale strategy and the densely connected structure.

**Enlarged database.** One of the contributions of this paper is that we enlarge our database from 70 to 160 uncompressed video sequences. Here, we verify the effectiveness of the enlarged database over our previous database [30]. Specifically, we train our MFQE approach over the database in [30]. Then, the performance is evaluated on all 18 test sequences at QP = 37. The retrained model with its corresponding test result is represented by MFQE_PD (i.e., MFQE with the Previous Database) in Fig. 15. We can see that MFQE 2.0 achieves a clear improvement in quality enhancement compared with MFQE_PD. In particular, the average $\Delta$PSNR improves from 0.533 dB to 0.562 dB. Hence, our enlarged database is effective in improving video quality enhancement performance.

5.5 Generalization ability of our MFQE approach

**Transfer to H.264.** We verify the generalization ability of our MFQE approach for video sequences compressed by another standard. To this end, we test our MFQE approach on the 18 test sequences compressed by H.264 at QP = 37. Note that the test model is the same as that in Section 5.3, which is trained over the training set compressed by HEVC at QP = 37. The resulting average PSNR improvement is 0.422 dB. Also, we test the performance of the MFQE model retrained over the H.264 dataset. The average PSNR improvement is 0.464 dB. In short, the MFQE model trained over the HEVC dataset performs well on H.264 videos, and the MFQE model retrained on H.264 can slightly improve the performance of quality enhancement. This implies the high generalization ability of our MFQE approach across different compression standards.

**Performance on other sequences.** It is worth mentioning that the test set in [30] is different from that in this paper. In our previous work [30], 10 test sequences are randomly selected from the previous database including 70 videos. In this paper, our 18 test sequences are selected by the Joint Collaborative Team on Video Coding (JCT-VC) [33], which is a standard test set for video compression. For fair comparison, we test the performance of our MFQE 2.0 and all compared approaches over the previous test set. The experimental results are presented in Table VIII. Note that 4 of the 10 test sequences overlap with the 18 test sequences of the above experiments. We can see from Table VIII that our approach has 0.680 dB improvement in $\Delta$PSNR and again outperforms other approaches. In this table, the results of the compared approaches are also better than those reported in [30] and in their own papers. This is because of retraining over the enlarged database. In conclusion, our MFQE approach has high generalization ability over different test sequences.

6 Conclusion

In this paper, we have proposed a CNN-based MFQE approach to enhance the quality of compressed video by reducing compression artifacts. Differing from the conventional single-frame quality enhancement approaches, our MFQE approach improves the quality of one frame by utilizing its nearest PQFs that have higher quality. To this end, we developed a BiLSTM-based PQF detector to classify PQFs and non-PQFs in compressed video. Then, we proposed a novel CNN framework, called MF-CNN, to enhance the quality of non-PQFs. Specifically, our MF-CNN framework consists of two subnets, i.e., the MC-subnet and QE-subnet. First, the MC-subnet compensates motion between PQFs and non-PQFs. Subsequently, the QE-subnet enhances the quality of each non-PQF by feeding the current non-PQF and the nearest compensated PQFs. In addition, PQF quality is enhanced in the same way. Finally, extensive experimental results showed that our MFQE approach significantly improves the quality of compressed video, superior to other state-of-the-art approaches. Consequently, the overall quality can be significantly enhanced, with considerably higher quality and less quality fluctuation than other approaches.

There may exist two research directions for future work. (1) Our work in this paper only takes PSNR and SSIM as the objective metrics to be enhanced. The potential future work may further embrace perceptual quality metrics in our approach to improve the Quality of Experience (QoE) in video quality enhancement. (2) Our work mainly focuses on the quality enhancement at the decoder side. To further improve the performance of quality enhancement, information from the encoder, such as the partition of coding units, can be utilized. This is a promising future work.

7 Acknowledgment

This work was supported by the NSFC projects 61876013, 61922009 and 61573037.

Bibliography

[1] Cisco Systems, Inc., “Cisco visual networking index: Global mobile data traffic forecast update,” https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.html.
[2] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, “Study of subjective and objective quality assessment of video,” IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427–1441, 2010.
[3] S. Li, M. Xu, X. Deng, and Z. Wang, “Weight-based R-$\lambda$ rate control for perceptual HEVC coding on conversational videos,” Signal Processing: Image Communication, vol. 38, pp. 127–140, 2015.
[4] T. K. Tan, R. Weerakkody, M. Mrak, N. Ramzan, V. Baroncini, J.-R. Ohm, and G. J. Sullivan, “Video quality evaluation methodology and verification testing of HEVC compression performance,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 1, pp. 76–90, 2016.
[5] C. G. Bampis, Z. Li, A. K. Moorthy, I. Katsavounidis, A. Aaron, and A. C. Bovik, “Study of temporal effects on subjective video quality of experience,” IEEE Transactions on Image Processing, vol. 26, no. 11, pp. 5217–5231, 2017.
[6] R. Yang, M. Xu, Z. Wang, Y. Duan, and X. Tao, “Saliency-guided complexity control for HEVC decoding,” IEEE Transactions on Broadcasting, 2018.
[7] M. D. Gupta, S. Rajaram, N. Petrovic, and T. S. Huang, “Restoration and recognition in a loop,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 638–644.
[8] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar, “Simultaneous super-resolution and feature extraction for recognition of low-resolution faces,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.