# Ultrasound Image Representation Learning by Modeling Sonographer Visual Attention

**Authors:** Richard Droste, Yifan Cai, Harshita Sharma, Pierre Chatelain, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble

arXiv: 1903.02974 · 2019-05-28

## TL;DR

This paper shows that learning ultrasound image representations from sonographer visual attention, via gaze prediction, improves transferability to standard plane detection, reducing the need for manual annotations.

## Contribution

The paper introduces modeling of sonographer visual attention, learned from gaze tracking recorded during routine clinical fetal anomaly screenings, as a way to learn transferable ultrasound image representations without manual annotations.

## Key findings

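- Fine-tuning the saliency predictor with a limited number of labeled standard plane images beats training from random initialization, improving the average F1-score by 9.6% overall and by 15.3% for the cardiac planes.
- Gaze prediction, which requires no manual labels, therefore serves as effective pre-training for standard plane detection.
- A softmax regression trained on the activations of each CNN layer shows the attention models approaching the precision of a fully-supervised baseline for all but the last layer.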
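
The layer-wise evaluation described in the abstract trains a simple softmax regression on the activations of each CNN layer. The sketch below illustrates that idea under stated assumptions: the toy feature vectors stand in for frozen CNN activations, and the function names, learning rate, and epoch count are hypothetical choices, not values from the paper. It reports training-set accuracy only, to keep the example short.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(W, b, x):
    """Linear class scores W x + b for one feature vector."""
    return [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(W))]


def train_softmax_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit softmax regression on fixed features with batch gradient descent.

    Only W and b are learned; the features are never updated, which mirrors
    probing a frozen layer. lr and epochs are illustrative assumptions.
    """
    dim, n = len(features[0]), len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(num_classes)]
        gb = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(scores(W, b, x))
            for k in range(num_classes):
                # Gradient of cross-entropy w.r.t. the logit of class k.
                err = probs[k] - (1.0 if k == y else 0.0)
                gb[k] += err
                for j in range(dim):
                    gW[k][j] += err * x[j]
        for k in range(num_classes):
            b[k] -= lr * gb[k] / n
            for j in range(dim):
                W[k][j] -= lr * gW[k][j] / n
    return W, b


def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    s = scores(W, b, x)
    return max(range(len(s)), key=s.__getitem__)


# Toy stand-ins for activations of two layers (hypothetical data).
random.seed(0)
labels = [i % 2 for i in range(40)]
layer_feats = {
    "informative_layer": [
        [2.0 * y - 1.0 + random.gauss(0, 0.3), random.gauss(0, 1)] for y in labels
    ],
    "uninformative_layer": [
        [random.gauss(0, 1), random.gauss(0, 1)] for _ in labels
    ],
}

accuracy = {}
for name, feats in layer_feats.items():
    W, b = train_softmax_probe(feats, labels, num_classes=2)
    correct = sum(predict(W, b, x) == y for x, y in zip(feats, labels))
    accuracy[name] = correct / len(labels)
print(accuracy)
```

Because the classifier is linear and the features are frozen, differences in probe accuracy reflect the features themselves rather than fine-tuning hyper-parameters, which is the motivation the abstract gives for this evaluation.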

## Abstract

Image representations are commonly learned from class labels, which are a simplistic approximation of human image understanding. In this paper we demonstrate that transferable representations of images can be learned without manual annotations by modeling human visual attention. The basis of our analyses is a unique gaze tracking dataset of sonographers performing routine clinical fetal anomaly screenings. Models of sonographer visual attention are learned by training a convolutional neural network (CNN) to predict gaze on ultrasound video frames through visual saliency prediction or gaze-point regression. We evaluate the transferability of the learned representations to the task of ultrasound standard plane detection in two contexts. Firstly, we perform transfer learning by fine-tuning the CNN with a limited number of labeled standard plane images. We find that fine-tuning the saliency predictor is superior to training from random initialization, with an average F1-score improvement of 9.6% overall and 15.3% for the cardiac planes. Secondly, we train a simple softmax regression on the feature activations of each CNN layer in order to evaluate the representations independently of transfer learning hyper-parameters. We find that the attention models derive strong representations, approaching the precision of a fully-supervised baseline model for all but the last layer.
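
To make the saliency prediction task concrete, the following is a minimal sketch of how recorded gaze points could be turned into a training target and compared with a predicted map. The abstract does not state how the targets are built or which loss is used, so the Gaussian smoothing, the `sigma` value, the KL divergence, and the function names are all assumptions made for illustration.

```python
import math


def gaze_to_saliency(gaze_points, height, width, sigma=2.0):
    """Turn (row, col) gaze points into a normalized saliency map.

    Each gaze point contributes an isotropic Gaussian. The map is then
    normalized to sum to 1 so it can be read as a probability distribution.
    sigma (in pixels) is an illustrative assumption, not a value from the paper.
    """
    if not gaze_points:
        raise ValueError("at least one gaze point is required")
    sal = [[0.0] * width for _ in range(height)]
    for gr, gc in gaze_points:
        for r in range(height):
            for c in range(width):
                d2 = (r - gr) ** 2 + (c - gc) ** 2
                sal[r][c] += math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in sal)
    return [[v / total for v in row] for row in sal]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) between two maps that each sum to 1."""
    kl = 0.0
    for t_row, p_row in zip(target, predicted):
        for t, p in zip(t_row, p_row):
            if t > 0.0:
                kl += t * math.log(t / (p + eps))
    return kl


# Two hypothetical gaze points on a 10x10 frame.
target = gaze_to_saliency([(4, 4), (5, 6)], height=10, width=10)

# A uniform map plays the role of an uninformed prediction.
uniform = [[1.0 / 100] * 10 for _ in range(10)]

print(kl_divergence(target, target))   # close to zero for a perfect prediction
print(kl_divergence(target, uniform))  # positive for the uninformed prediction
```

Gaze-point regression, the alternative named in the abstract, would instead predict the gaze coordinates directly rather than a full map.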

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.02974/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1903.02974/full.md

## References

19 references — full list in the complete paper: https://tomesphere.com/paper/1903.02974/full.md

---
Source: https://tomesphere.com/paper/1903.02974