TL;DR
This paper proposes the multiscale structural-similarity score (MS-SSIM) as a perceptually aligned loss function for training image-synthesis networks. Networks trained this way produce images that human observers prefer over those from networks trained with a traditional pixel-wise loss.
Contribution
It demonstrates that MS-SSIM, being differentiable, can serve directly as a training loss for gradient descent, and that it yields higher image quality in synthesis tasks and closer agreement with human perception than pixel-wise loss functions.
Findings
Human observers prefer images from networks trained with an MS-SSIM loss over images from networks trained with a pixel-wise loss.
Models optimized for MS-SSIM outperform models trained with a pixel-wise loss in image reconstruction quality.
Perceptually optimized representations enhance performance in image classification and super-resolution.
Abstract
Deep networks are increasingly being applied to problems involving image synthesis, e.g., generating images from textual descriptions and reconstructing an input image from a compact representation. Supervised training of image-synthesis networks typically uses a pixel-wise loss (PL) to indicate the mismatch between a generated image and its corresponding target image. We propose instead to use a loss function that is better calibrated to human perceptual judgments of image quality: the multiscale structural-similarity score (MS-SSIM). Because MS-SSIM is differentiable, it is easily incorporated into gradient-descent learning. We compare the consequences of using MS-SSIM versus PL loss on training deterministic and stochastic autoencoders. For three different architectures, we collected human judgments of the quality of image reconstructions. Observers reliably prefer images synthesized…
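To make the idea of an MS-SSIM loss concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and rests on several simplifying assumptions: statistics are computed over the whole image rather than over local Gaussian windows, all scale exponents are set to 1, downsampling is 2x2 average pooling, and pixel values lie in [0, 1]. The function names (`ssim_components`, `ms_ssim`, `ms_ssim_loss`) are hypothetical. For actual training, the same arithmetic would be written in an automatic-differentiation framework so that gradients flow through it.

```python
def _mean(values):
    return sum(values) / len(values)

def _flatten(img):
    return [p for row in img for p in row]

def ssim_components(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Luminance and contrast-structure terms from GLOBAL statistics.

    Assumption: x and y are flat lists of pixel values in [0, 1]; the
    standard formulation uses local windows instead of whole-image statistics.
    """
    mx, my = _mean(x), _mean(y)
    vx = _mean([(a - mx) * (a - mx) for a in x])
    vy = _mean([(b - my) * (b - my) for b in y])
    cov = _mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    luminance = (2 * mx * my + c1) / (mx * mx + my * my + c1)
    contrast_structure = (2 * cov + c2) / (vx + vy + c2)
    return luminance, contrast_structure

def downsample(img):
    """2x2 average pooling on a 2D list with even height and width."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * i][2 * j] + img[2 * i][2 * j + 1]
              + img[2 * i + 1][2 * j] + img[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)] for i in range(h)]

def ms_ssim(x, y, scales=3):
    """Product of contrast-structure terms across scales, with luminance
    included only at the coarsest scale (unit exponents assumed)."""
    score = 1.0
    for s in range(scales):
        lum, cs = ssim_components(_flatten(x), _flatten(y))
        cs = max(cs, 0.0)  # clamp so negative correlations cannot cancel out
        if s == scales - 1:
            score *= lum * cs
        else:
            score *= cs
            x, y = downsample(x), downsample(y)
    return score

def ms_ssim_loss(x, y, scales=3):
    """Loss to minimize: 0 for identical images, larger for dissimilar ones."""
    return 1.0 - ms_ssim(x, y, scales)
```

With three scales, the input height and width must be divisible by 4 so that each pooling step sees even dimensions. A generated image and its target are passed as 2D lists, and the returned loss falls as the two become structurally more similar.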
