Mask Reference Image Quality Assessment

Pengxiang Xiao; Shuai He; Limin Liu; Anlong Ming

arXiv:2302.13770·cs.CV·March 21, 2023

TL;DR

This paper introduces a Mask Reference IQA method that masks and reconstructs image patches to improve quality assessment, achieving state-of-the-art results and better generalization across datasets.

Contribution

The proposed MR-IQA method innovatively uses patch masking and reference patch supplementation to enhance image quality assessment accuracy.

Findings

1. Achieves state-of-the-art performance on the KADID-10k, LIVE, and CSIQ datasets.
2. Provides better generalization across different datasets.
3. Reduces overfitting through data augmentation with masked patches.

Tables

Table 1: Summary of four IQA databases: LIVE [32], CSIQ [19], TID2013 [30] and KADID-10K [22].

Dataset	Ref. images	Dist. images	Dist. types	Ratings	Rating type	Score range
LIVE[32]	29	779	5	25k	DMOS	[0,100]
CSIQ[19]	30	866	6	5k	DMOS	[0,1]
TID2013[30]	25	3000	24	524k	MOS	[0,9]
KADID-10k[22]	81	10125	25	30.4k	MOS	[1,5]

Table 2: Performance evaluation on LIVE [32], CSIQ [19], TID2013 [30] and KADID-10K [22]. Our proposed MR-IQA achieves SOTA performance on three datasets.

Type	Method	LIVE SRCC	LIVE PLCC	CSIQ SRCC	CSIQ PLCC	TID2013 SRCC	TID2013 PLCC	KADID-10k SRCC	KADID-10k PLCC
NR	BRISQUE[27]	0.939	0.935	0.746	0.829	0.604	0.694	0.528	0.567
NR	FRIQUEE[9]	0.940	0.944	0.835	0.874	0.680	0.753	-	-
NR	CORNIA[44]	0.947	0.950	0.678	0.776	0.678	0.768	0.541	0.580
NR	M3[43]	0.951	0.950	0.795	0.839	0.689	0.771	-	-
NR	HOSA[42]	0.946	0.947	0.741	0.823	0.735	0.815	0.609	0.653
NR	BIECON[14]	0.961	0.962	0.815	0.823	0.717	0.762	0.623	0.648
NR	WaDIQaM[4]	0.954	0.963	0.844	0.852	0.761	0.787	0.739	0.752
NR	ResNet-ft[17]	0.950	0.954	0.876	0.905	0.712	0.756	-	-
NR	IW-CNN[17]	0.963	0.964	0.812	0.791	0.800	0.802	-	-
NR	CaHDC[40]	0.965	0.964	0.903	0.914	0.862	0.878	-	-
NR	HyperIQA[36]	0.962	0.966	0.923	0.942	0.729	0.775	0.852	0.845
FR	PSNR	0.873	0.865	0.810	0.819	0.687	0.677	0.676	0.675
FR	SSIM[38]	0.948	0.937	0.865	0.852	0.727	0.777	0.724	0.717
FR	MS-SSIM[39]	0.951	0.940	0.906	0.889	0.786	0.830	0.826	0.820
FR	VSI[45]	0.952	0.948	0.942	0.928	0.897	0.900	0.879	0.877
FR	FSIMc[46]	0.965	0.961	0.931	0.919	0.851	0.877	0.854	0.850
FR	MAD[19]	0.967	0.968	0.947	0.950	0.781	0.827	0.799	0.799
FR	VIF[33]	0.964	0.960	0.911	0.913	0.677	0.771	0.679	0.687
FR	WaDIQaM[4]	0.970	0.980	0.901	0.909	0.940	0.946	0.896	0.889
FR	DISTS[7]	0.955	0.955	0.946	0.946	0.830	0.855	0.887	0.886
FR	PieAPP[31]	0.918	0.909	0.890	0.873	0.876	0.859	0.836	0.836
FR	LPIPS[48]	0.932	0.934	0.903	0.927	0.670	0.749	0.843	0.839
	Ours (Random)	0.976	0.979	0.943	0.949	0.904	0.925	0.949	0.952
	Ours (Diff)	0.977	0.980	0.947	0.952	0.912	0.930	0.952	0.955

Table 3: SRCC results of cross-database evaluation on the CSIQ [19], LIVE [32] and TID2013 [30] datasets.

Trained on:	CSIQ	CSIQ	LIVE	LIVE	TID2013	TID2013
Tested on:	LIVE	TID2013	CSIQ	TID2013	LIVE	CSIQ
BRISQUE[27]	0.847	0.454	0.562	0.358	0.790	0.590
FRIQUEE[9]	0.879	0.463	0.722	0.461	0.755	0.635
M3[43]	0.797	0.328	0.621	0.344	0.873	0.605
CORNIA[44]	0.853	0.312	0.649	0.360	0.846	0.672
HOSA[42]	0.594	0.361	0.594	0.361	0.846	0.612
Ours	0.882	0.592	0.757	0.618	0.926	0.910

Table 4: SRCC/PLCC results of ablation studies on the TID2013 [30] and KADID-10K [22] datasets.

AGCS	MG	FMM	TID2013 (SRCC / PLCC)	KADID-10K (SRCC / PLCC)
✗	✗	✗	0.873 / 0.895	0.932 / 0.936
✓	✗	✗	0.880 / 0.901	0.935 / 0.940
✓	✓	✗	0.903 / 0.927	0.945 / 0.947
✗	✓	✓	0.909 / 0.925	0.947 / 0.952
✓	✓	✓	0.912 / 0.930	0.952 / 0.955

Equations

$$W_{g}=\frac{W_{p}}{G_{w}},\quad H_{g}=\frac{H_{p}}{G_{h}}$$

$$g^{i,j}=P\left[i\times W_{g}:(i+1)\times W_{g},\; j\times H_{g}:(j+1)\times H_{g}\right],\quad 0\leq i<G_{w},\; 0\leq j<G_{h}$$

$$W_{g^{\prime}}=\frac{W_{input}}{G_{w}},\quad H_{g^{\prime}}=\frac{H_{input}}{G_{h}}$$

$$g^{\prime\,i,j}=g^{i,j}\left[m_{ran}:m_{ran}+W_{g^{\prime}},\; n_{ran}:n_{ran}+H_{g^{\prime}}\right],\quad 0\leq m_{ran}<W_{g}-W_{g^{\prime}},\; 0\leq n_{ran}<H_{g}-H_{g^{\prime}}$$

$$Diff^{i,j}=MAE\left(g_{dst}^{\prime\,i,j},\,g_{ref}^{\prime\,i,j}\right),\quad 0\leq i<G_{w},\; 0\leq j<G_{h}$$

$$Mask_{a}^{i,j}=\begin{cases}1, & \text{if } Diff^{i,j}>mid\\ 0, & \text{else}\end{cases}$$

$$Mask_{b}^{i,j}=Mask_{a}^{i,j}\cap Mask_{ran}^{i,j}$$

$$g_{mask}^{i,j}=\begin{cases}g_{ref}^{\prime\,i,j}, & \text{if } Mask_{b}^{i,j}=1\\ g_{dst}^{\prime\,i,j}, & \text{else}\end{cases}$$

$$F_{1}^{i,j}=\neg Mask_{b}^{i,j}\times F_{1}^{i,j}\quad(0\leq i,j<64)$$

$$F_{2}^{i,j}=\neg Mask_{a}^{i,j}\times F_{2}^{i,j}\quad(0\leq i,j<32)$$

Taxonomy

Topics: Image and Video Quality Assessment · Advanced Image Processing Techniques · Visual Attention and Saliency Detection

Full text

Mask Reference Image Quality Assessment

Pengxiang Xiao

Anlong Ming

Shuai He

Limin Liu

Beijing University of Posts and Telecommunications, Beijing 100876

Abstract

Understanding semantic information is an essential step in knowing what is being learned in both full-reference (FR) and no-reference (NR) image quality assessment (IQA) methods. However, especially for many severely distorted images, even if there is an undistorted image as a reference (FR-IQA), it is difficult to perceive the lost semantic and texture information of distorted images directly. In this paper, we propose a Mask Reference IQA (MR-IQA) method that masks specific patches of a distorted image and supplements missing patches with the reference image patches. In this way, our model only needs to input the reconstructed image for quality assessment. First, we design a mask generator to select the best candidate patches from reference images and supplement the lost semantic information in distorted images, thus providing more reference for quality assessment; in addition, the different masked patches imply different data augmentations, which favors model training and reduces overfitting. Second, we provide a Mask Reference Network (MRNet): the dedicated modules can prevent disturbances due to masked patches and help eliminate the patch discontinuity in the reconstructed image. Our method achieves state-of-the-art performances on the benchmark KADID-10k, LIVE and CSIQ datasets and has better generalization performance across datasets.

1 Introduction

Digital images are subject to a wide variety of distortions during acquisition, processing, compression, storage, transmission and reproduction, and each of these processes may result in a degradation in visual quality [27] . The ability to assess and improve the quality of images has become highly desirable in many image processing and computer vision applications. However, asking human observers to assess the visual quality requires expensive subjective experiments. Hence, it is urgent to develop effective methods to automatically evaluate the perceptual quality that are consistent with human subjective perception, and this is known as image quality assessment (IQA).

Based on how much information is available from the reference image, IQA methods can be divided into three categories: full-reference IQA (FR-IQA), reduced-reference IQA (RR-IQA) and no-reference IQA (NR-IQA). Among them, FR-IQA methods make full use of the difference between the reference image and the distorted image and can therefore better judge the distortion of the image. Existing FR-IQA methods usually extract features from the distorted and reference images separately, but these approaches usually fail to directly recover the missing semantic content and texture in distorted images, which are regarded as crucial information for IQA tasks [17, 12]. Moreover, data-driven FR-IQA methods are more complex and computationally more expensive than data-driven NR-IQA methods, and they have to be trained on pairwise data labeled with the mean opinion score (MOS). As constructing such data is extremely labor-intensive and costly, available IQA datasets with reference images are too small to train a robust model without overfitting.

In this paper, we incorporate some patches of the reference image into the distorted image for quality assessment. Compared with previous full-reference methods, this idea brings two benefits: (1) The introduced reference image patches can be used to directly recover the necessary but lost semantic and texture information of distorted images to better evaluate image quality (Fig. 1). (2) The incompleteness of the reference image reduces the dependence of the model on the complete reference image, and the randomness of the selected patches acts as data augmentation and improves the generalization ability of the model.

However, two key challenges remain in designing such a method: (1) Which patches of the reference image should be chosen? The selected reference image patches must provide sufficient semantic and reference information for the quality assessment of the distorted image without discarding the distortion information in the distorted image. (2) How should the supplementary patches of the reference image be utilized? The model must prevent the reference image patches from interfering with the assessed image quality, while extracting the necessary semantic and detailed information to help the quality assessment of distorted images.

Driven by this analysis, we present a simple and efficient Mask Reference method for IQA tasks (MR-IQA). We design a mask generator to select reference image patches and incorporate them into heavily distorted regions to obtain semantic information gain while maintaining enough distortion information. In this way, the semantic information in the distorted image is recovered and reference information is provided for its quality assessment. Meanwhile, the random selection of patches can be regarded as a natural data augmentation strategy.

Then, we refine the Swin Transformer [26] with a dedicated Feature Mask Module (FMM) that shields the perturbation of masked patches in the shallow features; its patch-wise operations are inherently suitable for exploiting the different information in patches. Furthermore, our MR-IQA incorporates a data augmentation strategy that augments datasets while producing fixed-size inputs. In general, our main contributions are threefold:

• To address the semantic information loss in distorted images, we propose an MR-IQA method. With the dedicated masking strategy, semantic information is recovered in a more direct manner, while avoiding overfitting on limited datasets.

• We introduce a Mask Reference Network (MRNet) with some novel modules that embodies the MR-IQA method in an end-to-end model, while also shielding the perturbation of masked patches and augmenting datasets.

• Experimental results demonstrate that the proposed MR-IQA not only significantly reduces the computational complexity of the model, but also achieves SOTA performance on multiple datasets. Cross-dataset validation experiments also demonstrate the strong generalization ability of our method. Moreover, our method is general and can be independently embedded in existing methods or training processes to address possible stumbling blocks in IQA tasks.

2 Related Work

In this section, we briefly introduce some NR-IQA methods, and then review the traditional methods and recent deep learning methods for FR-IQA.

2.1 No-Reference IQA

Generally, NR-IQA methods can be divided into natural scene statistics (NSS) based methods and learning-based methods. By modeling scene statistics, the traditional NSS-based models are sensitive to the appearance of distortion and can detect and quantify the degradation level [27, 41]. In recent years, deep learning-based NR-IQA methods have demonstrated superior prediction performance over traditional methods. Kang et al. [13] proposed a shallow CNN model and divided the images into several patches to estimate image quality. Hallucinated-IQA [23] estimated the perceptual quality of images with reference images generated by a generative network. Bosse et al. [4] modified VGG-Net [35] to learn a local weight for each image patch to measure the importance of its local quality, and weighted average patch aggregation was adopted as the pooling method. Considering that the limited size of existing IQA databases may lead to generalization problems, [2] and [15] used FR-IQA methods to generate patch labels, and Pan et al. [28] utilized intermediate similarity maps for auxiliary training. Meta-IQA [49] and RankIQA [25] trained the network with separate distortion types to learn prior knowledge.

Since IQA relies on the human visual perception of high-level semantics [10], models are often pre-trained on ImageNet [6] to extract semantic features from images [21, 3, 17]. However, deep semantic features extracted at the global scale only represent global information, while local details and texture information are ignored. To solve these issues, Sun et al. [37] proposed an NR-IQA framework that combines global high-level semantics with local low-level characteristics. Kim et al. [16] proposed a Multiple-level Feature-based Image Quality Assessor (MFIQA), which considers multiple levels of features simultaneously.

Although NR-IQA methods remove the dependence on reference images and obtain partial semantic information, they still cannot match the performance of FR-IQA methods due to the lack of reference information.

2.2 Full-Reference IQA

FR-IQA methods compare the distorted image against its pristine-quality reference and can be further divided into traditional evaluation metrics and learning-based models. Traditional evaluation metrics correlate image quality with some hand-crafted definitions of perceptual differences between the inputs. The most widely used are mean squared error (MSE) and peak signal-to-noise ratio (PSNR). MSE is a signal-based metric that represents the cumulative squared error between the distorted and the reference image. PSNR is the most popular pixel-based metric and represents a measure of the peak error. The information fidelity criterion (IFC) [34] and visual information fidelity (VIF) [33] are natural scene statistics (NSS) based metrics that model natural images in the wavelet domain. The structural similarity (SSIM) [38] and MS-SSIM [39] methods consider the human vision system (HVS) and use local structure similarity to evaluate image quality.

As visual perception is a complicated process, it is difficult to simulate the HVS with limited hand-crafted features. To solve this problem, learning-based FR-IQA models use deep networks to extract features from training data without expert knowledge. Amirshahi et al. [1] used the pre-trained model to capture the feature maps of the test and reference images at multiple layers and compare their feature similarity at each layer. Zhang et al. [47] proposed the Learned Perceptual Image Patch Similarity (LPIPS) to obtain the perceptual similarity judgment by calculating the Euclidean distance between the reference and distorted deep feature representations. Similarly, Ding et al. [7] presented a Deep Image Structure and Texture Similarity metric (DISTS) to measure the similarities between the VGG-based deep features. Hammou et al. [11] proposed an ensemble of gradient boosting (EGB) metrics based on selected feature similarity and ensemble learning. Peng et al. designed a Multi-Metric Fusion Network (MMFN) [29] for aggregating the quality scores predicted by diverse metrics to generate more accurate results.

In recent years, some researchers have tried to apply Transformers to IQA tasks to extract more powerful features. IQT [5] extracted features with CNNs and then fed the feature maps into a Transformer for the quality prediction task. SwinIQA [24] used the Swin Transformer [26] to extract features and measured the perceptual quality of compressed images in a learned Swin distance space.

However, the above FR-IQA methods all extract features from the reference image and the distorted image separately, which causes two problems. On the one hand, the model cannot directly obtain the lost semantic and texture information when extracting features from distorted images (Fig. 1); e.g., neglecting high-level semantic information may result in a clear blue sky being predicted as bad quality, which is inconsistent with human perception [20]. On the other hand, compared with no-reference IQA methods, processing a pair of images with the model costs twice the amount of computation. In this paper, we present a simple and general MR-IQA method to solve the above problems.

3 Method

In this section, we describe the full pipeline of our proposed Mask-Reference IQA method. As shown in Fig. 2, reference images and distorted images are first cropped and sampled into fragments via the Adaptive Grid Cropping and Sampling (AGCS) module to obtain fixed-size images. Then, the preprocessed images are fed into the Mask Generator (MG) module to generate the masked images as input. Finally, the masked images are sent to the Mask Reference Network (MRNet) to get final predictions of image quality. We describe details as follows.

3.1 Adaptive Grid Cropping and Sampling

In this paper, we do not resize or use a smaller crop size to meet the model input size like other FR-IQA methods, since resizing corrupts local textures and cropping with small size causes mismatched global quality with local regions. Instead, we directly use a larger size to crop patches from images and design the AGCS module (Fig. 3) to well preserve the original image quality and get output with a fixed size. The core design is dividing the patch obtained by random crop into uniform grids with the same size, and then obtaining fixed-size outputs by cropping and sampling grids adaptively.

Given the distorted images $I_{dst}$ and the corresponding reference images $I_{ref}$ , we first use a random crop operation to get $P_{dst}$ and $P_{ref}$ with the same position and size $W_{p}\times H_{p}$ . Since $P_{dst}$ and $P_{ref}$ follow the same processing flow, we use $P$ to represent them. It should be noted that the crop operation uses larger crop sizes with a random aspect ratio, where the height and width are both integer multiples of 64 to facilitate subsequent processing. In this way, larger patches can capture a wider range of semantic information, and direct processing of the raw-resolution image results in no loss of the local textures, which are vital in IQA.

Then, we divide $P$ into $G_{w}\times G_{h}$ grids of the same size, denoted as $G=\left\{g^{0,0},...,g^{i,j},...,g^{G_{w}-1,G_{h}-1}\right\}$ , which can be formalized as follows:

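$$W_{g}=\frac{W_{p}}{G_{w}},\quad H_{g}=\frac{H_{p}}{G_{h}},$$

$$g^{i,j}=P\left[i\times W_{g}:(i+1)\times W_{g},\; j\times H_{g}:(j+1)\times H_{g}\right],\quad 0\leq i<G_{w},\; 0\leq j<G_{h},$$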

where $W_{g}$ and $H_{g}$ denote the width and height of each grid, and $g^{i,j}$ denotes the area covered by the grid in the $i$ -th row and $j$ -th column of $P$ . In this step, we obtain the grid as the minimum operating unit for adaptive cropping and sampling.

Finally, we randomly crop each $g^{i,j}$ at the same position, to avoid disrupting the original semantic information due to the different distances between grids. The cropping process in a grid is formalized as follows:

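$$W_{g^{\prime}}=\frac{W_{input}}{G_{w}},\quad H_{g^{\prime}}=\frac{H_{input}}{G_{h}},$$

$$g^{\prime\,i,j}=g^{i,j}\left[m_{ran}:m_{ran}+W_{g^{\prime}},\; n_{ran}:n_{ran}+H_{g^{\prime}}\right],\quad 0\leq m_{ran}<W_{g}-W_{g^{\prime}},\; 0\leq n_{ran}<H_{g}-H_{g^{\prime}},$$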

where $m_{ran}$ and $n_{ran}$ are the random positions at which each grid is cropped. In this way, grids of patches with different sizes and regions in the image are randomly selected, which prevents overfitting caused by the small dataset. $W_{g^{\prime}}$ and $H_{g^{\prime}}$ denote the width and height of $g^{\prime}$ , which are calculated from the model's desired input size $W_{input}\times H_{input}$ and the number of grids $G_{w}\times G_{h}$ . In our method, the number of grids $G_{w}\times G_{h}$ is set to $64\times 64$ , and the size of $g^{\prime}$ is adaptively changed based on the patch size, so as to ensure that the size of $G^{\prime}=\left\{g^{\prime\,0,0},...,g^{\prime\,i,j},...,g^{\prime\,G_{w}-1,G_{h}-1}\right\}$ is fixed.

After the above pipeline, fixed-size $G^{\prime}_{dst}$ and $G^{\prime}_{ref}$ are obtained from $P_{dst},P_{ref}$ for further mask processing in the Mask Generator (MG) module.
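The grid division and per-grid cropping described above can be sketched in NumPy as follows. This is an illustrative reading of the text, not the authors' implementation: the function name `agcs_sample`, the `256x256` output size and the inclusive upper bound on the random offset (so that a patch already at the output size passes through unchanged) are all assumptions.

```python
import numpy as np

def agcs_sample(patch_dst, patch_ref, grid=(64, 64), out_size=(256, 256), rng=None):
    """Crop the same random window from every grid cell of both patches.

    patch_dst, patch_ref: arrays of shape (H_p, W_p, C) cropped at the same position.
    Returns two arrays of shape (out_size[0], out_size[1], C).
    """
    rng = np.random.default_rng() if rng is None else rng
    h_p, w_p = patch_dst.shape[:2]
    g_h, g_w = grid
    if h_p % g_h or w_p % g_w:
        raise ValueError("patch size must be a multiple of the grid count")
    h_g, w_g = h_p // g_h, w_p // g_w                   # size of each grid cell
    h_c, w_c = out_size[0] // g_h, out_size[1] // g_w   # size of each cropped cell
    if h_c > h_g or w_c > w_g:
        raise ValueError("patch is smaller than the requested output size")

    # One random offset shared by all cells (assumption: upper bound inclusive).
    n = int(rng.integers(0, h_g - h_c + 1))
    m = int(rng.integers(0, w_g - w_c + 1))

    def sample(p):
        cells = p.reshape(g_h, h_g, g_w, w_g, -1)        # split into grid cells
        cropped = cells[:, n:n + h_c, :, m:m + w_c, :]   # same window in every cell
        return cropped.reshape(g_h * h_c, g_w * w_c, -1) # stitch cells back together

    return sample(patch_dst), sample(patch_ref)
```

With a `512x512` patch and these settings, each grid cell is `8x8` pixels and a `4x4` window is kept from every cell, so the output is always `256x256` regardless of the crop size.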

3.2 Mask Generator Module

To generate a mask image for quality assessment, we mask severely distorted patches and supplement patches of reference images at the same position (Fig. 4). First, we use Mean Absolute Error (MAE) to estimate the difference $Diff$ between the $G^{\prime}_{dst}$ and $G^{\prime}_{ref}$ at each location:

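$$Diff^{i,j}=MAE\left(g_{dst}^{\prime\,i,j},\,g_{ref}^{\prime\,i,j}\right),\quad 0\leq i<G_{w},\; 0\leq j<G_{h},$$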

where $Diff^{i,j}$ represents the distortion degree of $G^{\prime}_{dst}$ at position $(i,j)$ : the larger the value, the greater the distortion of $G^{\prime}_{dst}$ at this position. We select masked patches according to the following principle: choose regions with higher perceptual differences between the reference and distorted images, since patches that are similar carry little information about image quality differences and can only supplement limited semantic and texture information. To do this, we calculate the median $mid$ of all $Diff^{i,j}$ values; values greater than the median are set to 1, and all others are set to 0. The resulting binary map, which only contains 0 or 1 values, is denoted $Mask_{a}$ . The process is formalized as follows:

$$Mask_{a}^{i,j}=\begin{cases}1, & \text{if } Diff^{i,j}>mid\\ 0, & \text{else}\end{cases}$$

The positions where $Mask_{a}$ equals 1 are those where the loss of semantic and detail information is more serious than at other positions in the distorted image.

Second, although supplementing the lost semantic information at the mask-selected locations in the distorted image is necessary for quality assessment, these regions also contain distortion information, which must not be lost. To resolve this conflict, we randomly select among the positions where $Mask_{a}^{i,j}=1$ so that half of each selected mask is reserved for judging distortion, and the remaining half is replaced by reference patches. In this way, each time an image is processed, a new and unseen random set of patches of the reference image is masked into the distorted image. This provides the hidden benefit of data augmentation, which aids in model training and reduces overfitting.

Specifically, we randomly generate a binary map $Mask_{ran}$ in which 0 and 1 values occur in equal proportion. $Mask_{ran}$ is twice the size of $Mask_{a}$ in each dimension, so $Mask_{a}$ needs to be upsampled to the same size as $Mask_{ran}$ by copying. After that, we compute the intersection of these masks to obtain $Mask_{b}$ :

$$Mask_{b}^{i,j}=Mask_{a}^{i,j}\cap Mask_{ran}^{i,j},$$

according to the probability calculation, nearly 25% of the values in $Mask_{b}$ are 1, and the rest are 0.
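The 25% figure can be checked with a short calculation. It assumes that $Mask_{ran}$ is drawn independently of $Mask_{a}$ and that thresholding at the median marks about half of the positions in $Mask_{a}$ as 1; both are readings of the text rather than statements taken from it:

$$P\left(Mask_{b}^{i,j}=1\right)=P\left(Mask_{a}^{i,j}=1\right)\cdot P\left(Mask_{ran}^{i,j}=1\right)\approx\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}.$$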

Finally, $G_{dst}^{\prime}$ and $G_{ref}^{\prime}$ are merged according to $Mask_{b}$ to obtain $G_{mask}$ . Each grid $g_{mask}^{i,j}$ of $G_{mask}$ is obtained as follows:

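$$g_{mask}^{i,j}=\begin{cases}g_{ref}^{\prime\,i,j}, & \text{if } Mask_{b}^{i,j}=1\\ g_{dst}^{\prime\,i,j}, & \text{else}\end{cases}$$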

Existing works [10] have shown that the global semantic information affects the quality predictions. Therefore, we splice each grid in $G_{mask}$ into their original positions to generate masked images $I_{mask}$ , which are fed into MRNet for final quality assessment.
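The mask generation steps of this section can be sketched end to end in NumPy. This is a hedged illustration: the function name `generate_masked_image`, the `32x32` resolution used for $Mask_{a}$ (taken from the sizes quoted in Section 3.3) and the use of a shuffled half-ones array for $Mask_{ran}$ are assumptions, not the authors' code.

```python
import numpy as np

def generate_masked_image(img_dst, img_ref, coarse=32, rng=None):
    """Replace about a quarter of the most distorted cells with reference pixels.

    img_dst, img_ref: float arrays of shape (H, W, C), H and W divisible by 2 * coarse.
    Returns (masked image, mask_a of shape (coarse, coarse), mask_b of twice that size).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img_dst.shape[:2]
    ch, cw = h // coarse, w // coarse

    # Diff: mean absolute error between distorted and reference, per coarse cell.
    err = np.abs(img_dst.astype(np.float64) - img_ref.astype(np.float64))
    diff = err.reshape(coarse, ch, coarse, cw, -1).mean(axis=(1, 3, 4))

    # Mask_a: 1 where the difference exceeds the median.
    mask_a = diff > np.median(diff)

    # Upsample Mask_a by copying so it matches Mask_ran.
    mask_a_up = np.repeat(np.repeat(mask_a, 2, axis=0), 2, axis=1)

    # Mask_ran: random binary map with equal numbers of 0s and 1s.
    fine = 2 * coarse
    flat = np.zeros(fine * fine, dtype=bool)
    flat[: flat.size // 2] = True
    mask_ran = rng.permutation(flat).reshape(fine, fine)

    # Mask_b: intersection of the two masks.
    mask_b = mask_a_up & mask_ran

    # Expand Mask_b to pixel resolution and take reference pixels where it is 1.
    pix = np.repeat(np.repeat(mask_b, h // fine, axis=0), w // fine, axis=1)
    img_mask = np.where(pix[..., None], img_ref, img_dst)
    return img_mask, mask_a, mask_b
```

Because `mask_ran` is redrawn on every call, the same image pair yields a different masked image each time, which is the augmentation effect described in the text.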

3.3 Mask Reference Network

As shown in Fig. 2, our proposed MRNet consists of a backbone to extract multilayer features, a Feature Mask Module (FMM) to process features and a Scoring Module to predict the quality score after fusing the features. When the MG module merges the reference patches into the distorted images to restore the semantic information and reference information, these patches also cause a spatial discontinuity in the masked images. CNN-based models use convolutional layers to extract features from patches in different locations, which may make it difficult for the model to distinguish the artificial discontinuity between patches from quality degradation.

As a hierarchical structure based on patch-wise operations, the Swin Transformer [26] is more suitable for processing patch-based input than its CNN counterparts. For the task of this paper, on the one hand, a Transformer (pre-trained on ImageNet [6]) has a global receptive field that can help restore the lost semantic information from limited reference patches. On the other hand, the non-local self-attention structure of the Transformer can compare the local reference patches with other regions, and thus better evaluate the specific distortion degree of the distorted region. Therefore, we use the Swin Transformer pre-trained on ImageNet as the backbone and extract multi-scale features $F_{k}$ (k $\in$ 1, 2, 3, 4) from the four stages to capture both global semantic information and local detail information (Fig. 2). To further prevent the reference image patches in the masked image from interfering with the quality assessment of the distorted image, we design the Feature Mask Module to erase the information from the reference image in the outputs of stage-1 and stage-2. Specifically, we process $F_{1}\in\mathbb{R}^{64\times 64\times C}$ and $F_{2}\in\mathbb{R}^{32\times 32\times C}$ based on $Mask_{b}\in\mathbb{R}^{64\times 64}$ and $Mask_{a}\in\mathbb{R}^{32\times 32}$ , respectively, as follows:

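$$F_{1}^{i,j}=\neg Mask_{b}^{i,j}\times F_{1}^{i,j}\quad(0\leq i,j<64),$$

$$F_{2}^{i,j}=\neg Mask_{a}^{i,j}\times F_{2}^{i,j}\quad(0\leq i,j<32).$$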

Here $F_{k}^{i,j}$ (k $\in$ 1, 2) denotes the feature vector at position $(i,j)$ . ${Mask_{a}^{i,j}}$ and ${Mask_{b}^{i,j}}$ are output by the MG module; after bit inversion, they indicate whether the distorted image patch at $(i,j)$ is kept or masked. Considering that the deep features output by stage-3 and stage-4 of the Swin Transformer contain abstract semantic information and distortion features, no processing is done on these features.
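The feature masking step amounts to zeroing feature vectors at masked positions. The NumPy sketch below illustrates it under assumptions: the function name `feature_mask` is hypothetical, features are stored as `(height, width, channels)` arrays, and masks are boolean arrays in which `True` marks a masked position.

```python
import numpy as np

def feature_mask(f1, f2, mask_a, mask_b):
    """Zero out stage-1 and stage-2 features at masked positions.

    f1: (64, 64, C1) stage-1 features, masked with mask_b of shape (64, 64).
    f2: (32, 32, C2) stage-2 features, masked with mask_a of shape (32, 32).
    """
    # Inverting the mask keeps positions that are NOT masked (factor 1)
    # and erases positions that are masked (factor 0).
    f1_out = f1 * (~mask_b)[..., None]
    f2_out = f2 * (~mask_a)[..., None]
    return f1_out, f2_out
```

The trailing `None` adds a channel axis so that one mask value is broadcast across every channel of the feature vector at that position.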

Finally, we use the Scoring Module to map the multilayer features to quality scores. As shown in Fig. 2, adaptive average pooling (GAP) is first used to align the sizes of the four multiscale features. We then use a $1\times 1$ convolution block to boost the channel interaction among the extracted features and concatenate the features along the channel dimension. After these features are globally average pooled into vectors, an MLP consisting of two fully connected layers, with a sigmoid function as the activation function, regresses the vectors to the quality score.

4 Experiments

In this section, we first introduce experimental settings including IQA datasets, evaluation criteria and implementation details of our method. Then, we compare it with state-of-the-art FR and NR IQA methods on four benchmark datasets, and evaluate the generalization ability of our method. In addition, we also compare the impact of different mask ratios on performance. Finally, we conduct ablation studies to analyze the proposed method.

4.1 Experimental Settings

Evaluation Datasets. The main experiments are conducted on four singly distorted synthetic IQA databases: LIVE [32], CSIQ [19], TID2013 [30] and KADID-10K [22], whose configurations are presented in Table 1. LIVE, CSIQ and TID2013 are three relatively small-scale IQA datasets, where distorted images only contain traditional distortion types (e.g., noise, downsampling, JPEG compression). KADID-10K [22] further incorporates the recovered results of a denoising algorithm into the distorted images, resulting in a medium-sized IQA dataset. Since explicit training and testing splits are not provided for these four datasets, we randomly use 80% of the distorted images for training and the remaining 20% for testing, as in previous works. It should be emphasized that our split is based on reference images to avoid content overlap. To reduce the bias caused by a random split, we repeat the random train-test split ten times and report the average results over the ten runs.
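A split based on reference images can be sketched as follows. The sample layout `(ref_id, distorted_path, score)` and the function name `split_by_reference` are hypothetical; the point is only that all distorted versions of one reference land on the same side of the split.

```python
import random

def split_by_reference(samples, train_ratio=0.8, seed=0):
    """Split samples so that no reference image appears in both sets.

    samples: list of (ref_id, distorted_path, score) tuples.
    """
    refs = sorted({s[0] for s in samples})   # unique reference identifiers
    rng = random.Random(seed)
    rng.shuffle(refs)
    n_train = round(train_ratio * len(refs))
    train_refs = set(refs[:n_train])
    train = [s for s in samples if s[0] in train_refs]
    test = [s for s in samples if s[0] not in train_refs]
    return train, test
```

Calling the function with ten different `seed` values and averaging the resulting scores would correspond to the ten repeated splits described above.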

Evaluation Metrics. To evaluate the performance of the proposed method, Spearman's Rank Ordered Correlation Coefficient (SRCC) and Pearson's Linear Correlation Coefficient (PLCC) are computed between the subjective DMOS/MOS from human opinion and the predicted scores. Both coefficients are bounded by 1 in magnitude, and a higher value indicates better performance.
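For reference, both criteria can be computed with the standard library alone. The sketch below uses the textbook definitions (PLCC as the Pearson correlation of the raw values, SRCC as the Pearson correlation of tie-averaged ranks); it does not include any nonlinear score mapping before PLCC, since the text does not mention one.

```python
import math

def plcc(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))
```

SRCC only measures whether predictions are ordered like the subjective scores, while PLCC also reflects how linear the relationship is.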

4.2 Implementation Details

As reported in [8], Vision Transformers strongly benefit from pre-training on larger datasets prior to transfer to downstream tasks. To avoid training the Transformer from scratch, we use the Swin Transformer (Tiny) pre-trained on the ImageNet-1k [6] classification dataset for all our experiments. We use the Adam optimizer [18] with an initial learning rate of $1\times 10^{-6}$ , decayed by a factor of 0.5 every 20 epochs, and a mini-batch size of 16. Our experiments are implemented on an NVIDIA GeForce RTX 3090 with PyTorch 1.8.0 and CUDA 11.2 for training and testing.
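The learning-rate schedule can be written as a one-line step decay. This sketch assumes that the decay is applied by multiplying the rate by 0.5 once every 20 epochs; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-6, decay=0.5, step=20):
    """Step decay: multiply the base rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Under this reading, epochs 0 to 19 use the initial rate, epochs 20 to 39 use half of it, and so on.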

Thanks to the adaptive sampling of images by the AGCS module, we do not limit the size of the input images through resizing or cropping during training and testing. However, to ensure that the model performance is not affected, we design a set of possible crop sizes (256, 320, 384, 448, 512) for random selection, subject to not exceeding the size of the input image. After cropping, 8 patches are randomly sampled from each test image and flipped (horizontally/vertically) with probability 0.5; their prediction scores are averaged to obtain the final quality score. We train our model with both MAE and MSE loss functions and find that the MAE loss yields more consistent results.
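The test-time procedure can be sketched as below. Several details are assumptions made for illustration: `predict` is a hypothetical callable standing in for the full model pipeline, height and width are drawn independently from the allowed sizes, and horizontal and vertical flips are each applied with probability 0.5.

```python
import numpy as np

CROP_SIZES = (256, 320, 384, 448, 512)

def predict_quality(img_dst, img_ref, predict, n_patches=8, rng=None):
    """Average the predicted score over randomly cropped and flipped patch pairs.

    predict: hypothetical callable (patch_dst, patch_ref) -> float.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img_dst.shape[:2]
    # Only crop sizes that fit inside the image are allowed.
    valid_h = [s for s in CROP_SIZES if s <= h]
    valid_w = [s for s in CROP_SIZES if s <= w]
    if not valid_h or not valid_w:
        raise ValueError("image is smaller than the smallest crop size")

    scores = []
    for _ in range(n_patches):
        ph, pw = int(rng.choice(valid_h)), int(rng.choice(valid_w))
        top = int(rng.integers(0, h - ph + 1))
        left = int(rng.integers(0, w - pw + 1))
        d = img_dst[top:top + ph, left:left + pw]
        r = img_ref[top:top + ph, left:left + pw]
        if rng.random() < 0.5:          # horizontal flip
            d, r = d[:, ::-1], r[:, ::-1]
        if rng.random() < 0.5:          # vertical flip
            d, r = d[::-1], r[::-1]
        scores.append(predict(d, r))
    return float(np.mean(scores))
```

The distorted and reference crops always share the same window and the same flips, so they stay aligned for the later masking step.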

4.3 Comparison with the State-of-the-art Methods

As shown in Table 2, some representative IQA methods are selected for performance comparison; to ensure fairness, methods that train their models with external datasets are excluded. The compared methods include hand-crafted approaches, deep learning-based NR-IQA approaches and deep learning-based FR-IQA approaches. Our methods are compared with these competitors on the four traditional IQA datasets: LIVE [32], CSIQ [19], TID2013 [30] and KADID-10K [22]. We can observe that the FR-IQA models achieve higher performance than the NR-IQA models, since the pristine-quality reference image provides more accurate reference information for quality assessment.

On the LIVE, CSIQ and KADID-10k datasets, our MR-IQA achieves SOTA performance on all metrics. Although WaDIQaM-FR achieves slightly better performance than our method on the TID2013 dataset, it is inferior to ours on the large KADID-10k dataset, indicating its limited generalization ability. To verify the potential of quality assessment with masked images, we also replace the difference-based selection of mask positions (Diff) with random selection of positions (Random), generating masked images with the same proportion of reference image patches. Although Random performs slightly worse than Diff, it still has significant advantages over other methods. By adopting the mask strategy, our MR-IQA method achieves the best performance on all four datasets. On the larger KADID-10k database in particular, we observe a solid improvement over previous work. In addition, conventional IQA methods perform well on the smaller LIVE and CSIQ datasets but fall short on more complex datasets such as TID2013 and KADID-10k. Furthermore, MR-IQA does not require the complete reference image as input and is directly applicable to the simple FR-IQA problem involving a pair of images. As such, we achieve both excellent performance and reduced input requirements for the FR-IQA method.
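
To make the Diff strategy concrete, the sketch below ranks non-overlapping patches by how much the distorted image deviates from the reference and keeps the most different ones. The criterion (mean absolute difference per patch) and the function names are assumptions made for illustration; this section does not specify the exact rule used by the mask generator. Images are represented as 2D lists of grayscale values.

```python
def patch_differences(ref, dist, patch):
    """Mean absolute difference of each non-overlapping patch, keyed by (row, col)."""
    rows, cols = len(ref) // patch, len(ref[0]) // patch
    diffs = {}
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for y in range(r * patch, (r + 1) * patch):
                for x in range(c * patch, (c + 1) * patch):
                    total += abs(ref[y][x] - dist[y][x])
            diffs[(r, c)] = total / (patch * patch)
    return diffs


def select_by_difference(ref, dist, patch, ratio):
    """Return the patch positions with the largest reference/distorted difference.

    Assumption: the number of selected patches is the rounded share `ratio`
    of all patches in the grid.
    """
    diffs = patch_differences(ref, dist, patch)
    k = round(ratio * len(diffs))
    ranked = sorted(diffs, key=lambda pos: diffs[pos], reverse=True)
    return set(ranked[:k])
```

For the Random variant, the ranking step would simply be replaced by a uniform draw of the same number of positions.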

4.4 Patch Masking Strategies

To explore how the proportion of masked patches in the masked image influences the evaluation results, we generate masked images with different proportions of reference image patches and test their effect. Note that, to adjust the mask ratio flexibly, we do not generate the masked image from the difference between the reference image and the distorted image; instead, we randomly select locations at which the distorted image is masked with reference image patches. As shown in Fig. 5, two images suffering from severe JPEG2000 compression distortion are masked by reference image patches at different proportions (30%, 60%, 90%).
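
The random masking used in this experiment can be sketched as below: a fixed share of patch positions is drawn uniformly, and the reference pixels are copied into the distorted image at those positions. The patch size, the rounding of the patch count, and the function names are assumptions for illustration; images are 2D lists of grayscale values.

```python
import random


def random_mask_positions(rows, cols, ratio, seed=0):
    """Draw a share `ratio` of positions from a rows x cols patch grid."""
    rng = random.Random(seed)
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    k = round(ratio * len(positions))
    return set(rng.sample(positions, k))


def apply_mask(ref, dist, patch, positions):
    """Copy reference pixels into a copy of the distorted image at the given patches."""
    masked = [row[:] for row in dist]  # leave the distorted input untouched
    for r, c in positions:
        for y in range(r * patch, (r + 1) * patch):
            masked[y][c * patch:(c + 1) * patch] = ref[y][c * patch:(c + 1) * patch]
    return masked
```

Calling `random_mask_positions` with ratios of 0.3, 0.6 and 0.9 would correspond to the three mask proportions shown in Fig. 5.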

We keep the rest of the model unchanged and evaluate it on masked images with different mask ratios on the four datasets. The SRCC and PLCC results are shown in Fig. 6. On the LIVE dataset, which has simpler distortion types, performance does not change significantly across mask ratios. On the other datasets, adding a small number of reference image patches improves the performance of the model, but performance drops significantly when the mask ratio is too large.

4.5 Cross-Database Performance Evaluation

To verify the generalization ability of our proposed MR-IQA method, we perform cross-dataset tests and compare with BRISQUE [27], FRIQUEE [9], M3 [43], CORNIA [44] and HOSA [42]. In each cross-database experiment, the model is trained on one entire dataset and tested on the other two datasets. Since scores from different databases have different scales and meanings, we apply a linear operation to the subjective scores before evaluating SRCC. The SRCC results are shown in Table 3, where the best results are in bold. Our proposed method significantly outperforms the other methods on all databases, especially when training on a small database with limited distortion types (LIVE, CSIQ) and testing on TID2013.
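
The cross-database evaluation can be illustrated with a small SRCC computation. The text does not specify the linear operation, so the sketch assumes a min-max rescale to [0, 1] with an optional inversion that aligns DMOS-style scores (lower is better) with MOS-style scores (higher is better); the score ranges would come from Table 1. All function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def linear_rescale(scores, low, high, invert=False):
    """Map scores from [low, high] to [0, 1]; `invert` flips the direction."""
    scaled = [(s - low) / (high - low) for s in scores]
    return [1.0 - s for s in scaled] if invert else scaled
```

A rescale with positive slope leaves SRCC unchanged because it preserves ranks, so under this assumption only the inversion affects the reported value, by flipping its sign.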

4.6 Ablation Study

We conduct ablation experiments to verify the effectiveness of AGCS, MG, and FMM in our method. Since TID2013 [30] and KADID-10k [22] contain more distorted images and more complex distortion types, all ablation experiments are performed on them. As a baseline, we remove AGCS, MG, and FMM, use only the backbone to extract multi-layer features, and then use the scoring module to predict the score. The performance contribution of each individual component is shown in Table 4.

To verify the contribution of the MG module, which is the core design of our method, we remove it. Since the FMM module depends on the addition of reference image patches, it is removed as well, so that only the AGCS module is used. Adding the AGCS module to the baseline improves performance slightly, whereas, compared with our complete method, removing MG and FMM results in the greatest performance degradation on the two datasets.

Then, we remove the FMM and the AGCS module in turn from our complete method; in both cases, SRCC and PLCC drop slightly on TID2013 and KADID-10k. Finally, when all three modules designed in this paper are added to the baseline, the model achieves better performance than all other ablation variants.

5 Conclusion

In this paper, we propose a Mask Reference IQA (MR-IQA) method to address the loss of semantic and texture information in distorted images. To this end, we introduce a Mask Generator (MG) that masks specific patches of a distorted image and supplements the missing patches with reference image patches. We then use the Mask Reference Network (MRNet) to shield the perturbation of masked patches and complete the quality assessment of the masked image. Compared with FR-IQA and NR-IQA methods, our method achieves top performance on four representative datasets and reduces computational complexity in structural design by almost 50% compared with learning-based FR-IQA methods. We hope that MR-IQA will motivate the community to rethink IQA and stimulate research with a broader perspective.

Bibliography

[1] Seyed Ali Amirshahi, Marius Pedersen, and Stella X Yu. Image quality assessment by comparing cnn features between images. Journal of Imaging Science and Technology, 60(6):60410–1, 2016.
[2] Bahetiyaer Bare, Ke Li, and Bo Yan. An accurate deep convolutional neural networks model for no-reference image quality assessment. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 1356–1361, July 2017.
[3] Simone Bianco, Luigi Celona, Paolo Napoletano, and Raimondo Schettini. On the use of deep learning for blind image quality assessment. Signal, Image and Video Processing, 12(2):355–362, 2018.
[4] Sebastian Bosse, Dominique Maniry, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing, 27(1):206–219, 2017.
[5] Manri Cheon, Sung-Jun Yoon, Byungyeon Kang, and Junwoo Lee. Perceptual image quality assessment with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 433–442, 2021.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009.
[7] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.