# Analysis of Deep Image Quality Models

**Authors:** Pablo Hernández-Cámara, Jorge Vila-Tomás, Valero Laparra, Jesús Malo

arXiv: 2302.13345 · 2023-02-28

## TL;DR

This paper thoroughly analyzes pre-trained deep neural networks to understand their ability to predict image quality, revealing that many models outperform traditional metrics and align closely with human perception, with simpler architectures often performing better.

## Contribution

The paper provides a comprehensive analysis of how the training goal, the training data, the architecture, and the readout influence deep networks' perceptual properties and their correlation with human image quality judgments.

## Key findings

- All models outperform SSIM in correlating with human opinion.
- Some models are on par with state-of-the-art specialized methods.
- Simpler architectures often perform better than deeper ones.

## Abstract

Subjective image quality measures based on deep neural networks are closely related to models of visual neuroscience. This connection benefits engineering but, more interestingly, the freedom to optimize deep networks in different ways makes them an excellent tool to explore the principles behind visual perception (both human and artificial). Recently, a myriad of networks have been successfully optimized for many interesting visual tasks. Although these nets were not specifically designed to predict image quality or other psychophysics, they have shown surprising human-like behavior. The reasons for this remain unclear. In this work, we perform a thorough analysis of the perceptual properties of pre-trained nets (particularly their ability to predict image quality) by isolating different factors: the goal (the function), the data (learning environment), the architecture, and the readout (selected layer(s), fine-tuning of channel relevance, and use of statistical descriptors as opposed to plain readout of responses). Several conclusions can be drawn. All the models correlate better with human opinion than SSIM. More importantly, some of the nets are on par with the state of the art with no extra refinement or perceptual information. Nets trained for supervised tasks such as classification correlate substantially better with humans than LPIPS (a net specifically tuned for image quality). Interestingly, self-supervised tasks such as jigsaw also perform better than LPIPS. Simpler architectures are better than very deep nets. In simpler nets, correlation with humans increases with depth, as if deeper layers were closer to human judgement. This is not true in very deep nets. Consistent with reports on illusions and contrast sensitivity, small changes in the image environment do not make a big difference. Finally, the explored statistical descriptors and concatenations had no major impact.
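
To make the evaluation idea in the abstract concrete, the following is a minimal sketch of how a layer readout could be turned into a quality score and compared with human opinion. It is not the authors' pipeline: the abstract does not state which distance or correlation measure was used, so the Euclidean feature distance, the optional per-channel `weights` (standing in for "fine-tuning of channel relevance"), and the Spearman rank correlation are all assumptions. The feature vectors and opinion scores are invented toy numbers, not data from the paper.

```python
import math


def feature_distance(ref, dist, weights=None):
    """Weighted Euclidean distance between two flat feature vectors.

    `weights` is a hypothetical per-channel relevance vector; with no
    weights this is a plain readout of the layer responses.
    """
    if weights is None:
        weights = [1.0] * len(ref)
    return math.sqrt(sum(w * (r - d) ** 2 for w, r, d in zip(weights, ref, dist)))


def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example (made-up numbers): one reference and three distorted versions,
# each represented by the responses of some chosen layer.
reference = [1.0, 2.0, 3.0]
distorted = [[1.1, 2.0, 3.0], [1.5, 2.5, 3.0], [3.0, 0.0, 5.0]]
toy_mos = [4.5, 3.0, 1.2]  # hypothetical opinion scores, higher = better quality

distances = [feature_distance(reference, d) for d in distorted]
rho = spearman(distances, toy_mos)
print(f"Spearman correlation between distance and opinion: {rho:.3f}")
```

In this toy case larger feature distances go with lower opinion scores, so the correlation is strongly negative. In a real comparison the feature vectors would come from a chosen layer of a pre-trained network, and the magnitude of the correlation would be what is compared across models, layers, and readout choices.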

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.13345/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/2302.13345/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/2302.13345/full.md

---
Source: https://tomesphere.com/paper/2302.13345