AVRA: Automatic Visual Ratings of Atrophy from MRI images using Recurrent Convolutional Neural Networks

Gustav Mårtensson; Daniel Ferreira; Lena Cavallin; J-Sebastian Muehlboeck; Lars-Olof Wahlund; Chunliang Wang; Eric Westman

arXiv:1901.00418·physics.med-ph·August 28, 2019

TL;DR

AVRA is a machine learning model that automatically rates brain atrophy from MRI images, matching expert radiologist assessments and offering a reliable, fast, and accessible tool for clinical and research use.

Contribution

This work introduces AVRA, a novel recurrent convolutional neural network model trained on extensive radiologist ratings to automate atrophy assessment from MRI scans.

Findings

01

AVRA achieves substantial agreement with expert ratings (Cohen's kappa 0.62-0.74).

02

The model provides rapid, automatic ratings for multiple atrophy scales.

03

AVRA is freely available for clinical and scientific applications.

Full text

Gustav Mårtensson

Daniel Ferreira

Lena Cavallin

J-Sebastian Muehlboeck

Lars-Olof Wahlund

Chunliang Wang

Eric Westman

for the Alzheimer’s Disease Neuroimaging Initiative (Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.)

Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institutet, Stockholm, Sweden.

Department of Clinical Neuroscience, Karolinska Institutet, Stockholm, Sweden.

Department of Radiology, Karolinska University Hospital, Stockholm, Sweden.

School of Technology and Health, KTH Royal Institute of Technology, Stockholm, Sweden.

Department of Neuroimaging, Centre for Neuroimaging Sciences, Institute of Psychiatry, Psychology and Neuroscience, King’s College London, London, UK.

Abstract

Quantifying the degree of atrophy is done clinically by neuroradiologists following established visual rating scales. For these assessments to be reliable the rater requires substantial training and experience, and even then the rating agreement between two radiologists is not perfect. We have developed a model we call AVRA (Automatic Visual Ratings of Atrophy) based on machine learning methods and trained on 2350 visual ratings made by an experienced neuroradiologist. It provides fast and automatic ratings for Scheltens’ scale of medial temporal atrophy (MTA), the frontal subscale of Pasquier’s Global Cortical Atrophy (GCA-F) scale, and Koedam’s scale of Posterior Atrophy (PA). We demonstrate substantial inter-rater agreement between AVRA’s and a neuroradiologist’s ratings with Cohen’s weighted kappa values of $\kappa_{w}$ = 0.74/0.72 (MTA left/right), $\kappa_{w}$ = 0.62 (GCA-F) and $\kappa_{w}$ = 0.74 (PA), with an inherent intra-rater agreement of $\kappa_{w}$ = 1. We conclude that automatic visual ratings of atrophy can potentially have great clinical and scientific value, and aim to present AVRA as a freely available toolbox.

Journal: arXiv Preprint

1 Introduction

The assessment of structural changes in the brain is made clinically by visual ratings of brain atrophy according to established visual rating scales. They offer an efficient and inexpensive method of quantifying the degree of atrophy and can help to improve the specificity and sensitivity of dementia diagnoses [1, 2]. However, there are limitations associated with visual ratings of atrophy, which may explain why they are still not widely used in the clinical routine. First, the ratings are inherently subjective, which means that the agreement between two radiologists might be low if they have not had sufficient training [1]. Second, to achieve adequate reliability the radiologist needs to be experienced and to perform ratings regularly so that reproducibility does not drop [3]. Third, the ratings are relatively time-consuming and tedious: a rating takes a few minutes per image [4], depending on the rating scale and the rater’s level of experience. While this amount of time may be feasible in most clinical settings, it does not easily allow studying large imaging cohorts of potentially thousands of images. An automatic method would remove the inter- and intra-rater variability and eliminate the time-consuming process of rating.

1.1 Visual rating scales

Amongst the most commonly used rating scales—both in research and in clinical routine—are Scheltens’ Medial Temporal Atrophy (MTA) scale [5], Koedam’s scale for Posterior Atrophy (PA) [6] and Pasquier’s scale for Global Cortical Atrophy (GCA) [7, 8] (see Fig. 1 for examples). These scales have previously been validated by quantitative neuroimaging techniques [9, 10, 11, 12].

The MTA scale was developed by Scheltens et al. (1992) [5]. A rating is given for each hemisphere ranging from 0 (no atrophy) to 4 (severe atrophy) and focuses on three structures: the width of the choroid fissure, the width of the temporal horn and the height of the hippocampus. The assessment is made in a single or few coronal slices on a high quality CT or ideally a T1-weighted MRI. Different cut-offs have been suggested where the most common is that an average MTA score $\geq 2$ is considered pathological if the patient is younger than 75 years old, and an average MTA $\geq 3$ for patients older than 75 years [5, 13, 14].
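The age-dependent cut-off described above can be written as a small rule. This is only a sketch: the function name `mta_is_pathological` is hypothetical, and the handling of a patient who is exactly 75 years old is an assumption, since the text only distinguishes "younger than 75" from "older than 75".

```python
def mta_is_pathological(mta_left, mta_right, age_years):
    """Apply the most common age-dependent cut-off to the average MTA score."""
    average = (mta_left + mta_right) / 2.0
    # Assumption: age exactly 75 is treated like the older group,
    # because the text does not say which side of the cut-off it falls on.
    threshold = 2.0 if age_years < 75 else 3.0
    return average >= threshold


# A 70-year-old with MTA 2/2 reaches the cut-off; an 80-year-old does not.
younger_case = mta_is_pathological(2, 2, age_years=70)
older_case = mta_is_pathological(2, 2, age_years=80)
```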

The PA scale assesses atrophy of the parietal lobe of the brain and was proposed by Koedam et al. (2011) [6]. A rating from 0 (no atrophy) to 3 (severe atrophy) is given that specifically assesses the degree of atrophy of the precuneus, the posterior cingulate sulcus, the parieto-occipital sulcus and the parietal cortex.

Pasquier et al. (1996) developed a visual rating system of cerebral atrophy in 13 different brain regions that assesses the level of dilatation of sulci and the ventricles [8]. For each of these regions a score ranging from 0 (no atrophy) to 3 (severe atrophy) is given by the radiologist. These measures have been simplified into a global assessment of cortical atrophy rated from 0 to 3 called the GCA scale. The original paper by Pasquier and colleagues used T2-weighted images [8] but several studies have also assessed GCA in T1-weighted images [13, 12, 15, 7]. A frontal subscale of GCA (GCA-F) is of particular interest since frontal atrophy has been shown to be associated with executive dysfunction [16] and can offer improved diagnosis of frontotemporal dementia (FTD) [12].

1.2 Related work

A few automatic (or semi-automatic) methods to quantify medial temporal atrophy, besides volumetrics, have previously been proposed. Two of them involve planimetrics based on manual delineation of the hippocampus and surrounding structures that are combined into a single score of medial temporal atrophy [17, 18]. While these methods assess almost the same structures as Scheltens’ MTA scale, the different scales are not interchangeable and do not necessarily reflect the same atrophy patterns. Another study recently reported an automatic method, trained on radiologist ratings, that predicts MTA scores based on volumetric measures extracted from the MRI image [19]. Volumetric measures of brain regions cannot be extracted from most CT images, nor do they retain any information regarding the shape of the structures. It is reasonable to assume that the shapes are important since the visual MTA rating is done on a single slice, from which it is not possible to estimate the hippocampal volume.

Deep learning, a branch of machine learning, has recently generated impressive results in several fields, such as speech recognition, text semantics, image recognition and genomics [20]. Convolutional neural networks (CNNs) have already been substantially applied in medical image analysis (for recent reviews, see [21, 22]). For instance, studies using CNNs have achieved similar levels of accuracy as medical experts in classifying skin cancer [23], mammographic skin lesion detection [24], and diabetic retinopathy diagnosis [25]. Focusing on applications in neuroimaging, deep neural networks have been used successfully for automatic methods of skull stripping [26, 27], brain age prediction [28], brain segmentation [29], PET image enhancement [30] and brain tumor segmentation [31, 32], to name a few. In dementia research, several studies have investigated brains of patients with Alzheimer’s disease (AD) using deep learning and shown impressive diagnostic abilities [33, 34, 35, 36]. A Recurrent Neural Network (RNN) is an artificial neural network that has an internal state (or “memory”) and is useful when processing sequential data, such as words in a sentence or frames in a video [20, 37]. RNNs have successfully been combined with CNNs to segment MRI images, where the addition of an RNN module helped to leverage adjacent slice dependencies [38, 39].

1.3 Our approach

In this study, we aimed to develop an automatic algorithm based on convolutional and recurrent neural networks that provides fast, reliable, and systematic predictions of established visual rating scales of atrophy of brain regions often affected in dementia: the MTA, GCA-F and PA scales. The models are trained on a large set of MRI images that have been rated by an experienced neuroradiologist. This method is atlas-free and requires a minimal amount of setup and third-party software. We plan to present the proposed algorithm as freely available software targeted towards neuroimaging researchers.

2 Material and methods

2.1 MRI data and protocols

Two different dementia cohorts of MRI images were included in this project: Alzheimer’s Disease Neuroimaging Initiative (ADNI) and a clinical cohort with images from the memory clinic at Karolinska University Hospital (referred to as MemClin from here on). Informed consent was obtained from all participants or from their authorized representatives.

The MemClin cohort mainly consisted of patients clinically diagnosed with dementia according to the ICD-10 criteria between 2003 and 2011. All participants underwent a T1-weighted MRI scan at the Radiology Department of Karolinska University Hospital in Stockholm, Sweden. Exclusion criteria were other types of dementia, a history of traumatic brain injury, or insufficient quality of the MRI scan [40, 41].

Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). For up-to-date information, see www.adni-info.org. A majority of the participants in the ADNI cohort were scanned multiple times within a few weeks, often on the same day. A subset of participants were scanned in both 1.5T and 3T machines.

All available images with an associated visual atrophy rating performed by a neuroradiologist were used in this study. Images that did not pass the initial automatic AC/PC-alignment (the anterior and posterior commissures) were excluded from the training and evaluation process (144 out of 5355 images in total).

The algorithm was developed using theHiveDB database system [42] and will become part of its automated activity system.

2.2 Human ratings

An experienced neuroradiologist, Lena Cavallin (L.C.), visually rated 2350 T1-weighted MRI images over the course of 16 months with no prior knowledge of age, sex, or diagnosis. For ADNI subjects scanned more than once, only one of the images was rated by the radiologist and the additional image(s) were labeled with the same rating. The distributions of L.C.’s MTA, PA and GCA-F ratings are shown in Table 1. Many of the ADNI ratings have been analyzed and reported in previous studies [40, 13, 12, 15, 14]. All visual ratings of MTA, PA and GCA-F were based on T1-weighted MRI images, and illustrative examples of the ratings can be seen in Fig. 1. The images were aligned with AC/PC by the radiologist if the protocol allowed for it [3]. The MTA ratings were made in a single coronal slice, just behind the amygdala and mammillary bodies. The GCA-F ratings were based on multiple sagittal slices, whereas the PA score was based on slices in all three planes.

2.3 Computer ratings

The motivation behind the proposed model architecture was to mimic how a neuroradiologist would process an MRI image: to scroll through the brain volume slice-by-slice looking for the “correct” slice(s) to base the rating on. A human rater assesses images acquired using different scanners, vendors and protocols without any need for substantial preprocessing such as segmentation, intensity normalization, non-linear registrations or skull-stripping. To better mimic the clinical situation (and to keep to a minimum the number of time-consuming preprocessing steps that can potentially fail) we trained AVRA to rate images with as little preprocessing as possible. The main difference between AVRA’s ratings and a human rater’s is that AVRA’s ratings are continuous instead of discrete.

All code in this project was developed in Python 3.4.3 using the deep learning framework PyTorch 1.0 [43].

Preprocessing

The only preprocessing included in our method is the registration of all brains to the MNI standard brain using FSL FLIRT 6.0 (FMRIB’s Linear Image Registration Tool) [44, 45, 46]. This rigid transform is computed with 6 degrees of freedom (i.e. rotation and translation only) and is used to automatically AC-PC align each brain and conform all images to the same voxel size (1 x 1 x 1 mm$^{3}$) and input dimension (182 x 218 x 182). The AC-PC aligned images are cropped to remove excess space outside the brain and redundant slices not part of the rating scale (as indicated in Figs. 1 and 2). The center voxel of the cropped images depended on the rating scale. For the MTA ratings, 22 coronal slices of dimension 128 mm x 128 mm are input to the model, enough to ensure that the “correct” rating slice is included. The GCA-F ratings are done on multiple axial slices, so each volume is cropped to 160 mm x 192 mm x 40 slices, with 2 mm slice thickness. The PA model requires slices from all three anatomical planes. From each MRI image a smaller volume of 128 mm x 128 mm x 128 mm was extracted from the parietal lobe, sufficiently large to include all relevant structures in the parietal cortex. From this cropped volume 37 axial, 28 coronal and 34 sagittal slices with 2 mm slice thickness (i.e. 99 slices in total) were used as input to the model. Since the distribution of raw voxel values differed greatly between images, particularly between 1.5T and 3T images, all cropped volumetric images were normalized to have zero mean and unit variance.
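As an illustration of the intensity normalization step, the sketch below rescales a cropped volume, represented here as a flat list of voxel values, to zero mean and unit variance. It assumes this is the intended normalization; the helper name `zscore_normalize` and the constant `PA_SLICES` are hypothetical, and the handling of a constant image is our own choice. The last lines simply check the PA slice count quoted in the text.

```python
import math


def zscore_normalize(voxels):
    """Rescale a flat list of voxel intensities to zero mean and unit variance."""
    n = len(voxels)
    mean = sum(voxels) / n
    variance = sum((v - mean) ** 2 for v in voxels) / n
    std = math.sqrt(variance)
    if std == 0.0:
        # Assumption: a constant image carries no contrast, so map it to zeros.
        return [0.0 for _ in voxels]
    return [(v - mean) / std for v in voxels]


# Slices fed to the PA model, per anatomical plane (numbers from the text).
PA_SLICES = {"axial": 37, "coronal": 28, "sagittal": 34}
total_pa_slices = sum(PA_SLICES.values())

normalized = zscore_normalize([100.0, 250.0, 400.0, 550.0])
```

Normalizing each cropped volume separately means that a 1.5T and a 3T scan of the same brain end up on a comparable intensity scale, even though their raw voxel values differ.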

Model architectures

The overall structure of the models can be seen in Fig. 2 and can be split into three parts. First, relevant features from a single slice are extracted using a Residual Attention Network [47], detailed in Fig. 3. It combines the abilities of residual learning [48], which can allow for even deeper models, and attention models that can “focus” spatially on images, which is particularly useful for visual ratings since they are based on regional atrophy [49, 50]. Our implementation is a slimmed version of the original, with the same depth but a smaller number of filters in each layer to reduce memory usage and computation time. Initial experiments showed no noticeable performance reduction on the validation set compared to using a larger network. Second, the features are reshaped to a 1D vector and fed to an RNN, which consists of a two-layer Long Short-Term Memory (LSTM) network with 256 hidden nodes [51, 52]. The LSTM modules are expected to “remember” relevant features seen in previous slices and update their state (“memory”) when they are exposed to a slice containing useful information for the rating. Finally, when slices 0, 1, …, $(n-1)$ have been propagated through the network, the final output from the second LSTM module $h_{n}^{(2)}$ is used to make a linear prediction of the visual rating. All three models share the same network architecture except for the size of the input vector fed to the LSTM network, as that is dependent on the input size of the MRI slices.

For comparison, we train a VGG16 network [53] without the RNN part, where the 3D volumes are treated as multi-channel 2D images. That is, for the MTA model we input one “22-channel” image to the CNN once instead of 22 single-slice images.

Training

For training and evaluation, the dataset was randomly split into a training and a hold-out test set, where 20% of all subjects were assigned to the test set. On the remaining images in the training set we applied 5-fold cross validation for hyper-parameter tuning for each rating scale. The five trained models were used together as an ensemble classifier evaluated on the test set, where the average prediction was considered the final rating.
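The split and ensembling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the fixed seed, and the way subjects are dealt into folds are assumptions. Splitting by subject keeps all scans of one person on the same side of the split.

```python
import random


def split_subjects(subject_ids, test_fraction=0.2, n_folds=5, seed=0):
    """Hold out a fraction of subjects, then deal the rest into CV folds."""
    ids = sorted(set(subject_ids))
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test_ids = ids[:n_test]
    train_ids = ids[n_test:]
    # Assumption: folds are formed per subject by simple round-robin dealing.
    folds = [train_ids[i::n_folds] for i in range(n_folds)]
    return train_ids, test_ids, folds


def ensemble_rating(fold_predictions):
    """Average the continuous ratings predicted by the cross-validation models."""
    return sum(fold_predictions) / len(fold_predictions)


train_ids, test_ids, folds = split_subjects(range(100))
final_rating = ensemble_rating([2.0, 2.5, 3.0, 2.5, 2.5])
```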

The models were trained for 200 epochs using backpropagation and optimized through stochastic gradient descent (SGD) with cyclic learning rate to maximize the probability of predicting the radiologist’s rating [54, 55]. The training set was randomly split into minibatches, each containing 20 MRI images, and the weights were updated to minimize the mean-squared error between the automatic ratings and the integer ratings by L.C. We employed data augmentation in the training process of the network to reduce the risk of overfitting to the training set. This included random cropping (within $\pm$ 10mm off the center voxel), scaling, left/right mirroring, and randomly selecting N4ITK inhomogeneity-corrected images instead of the original file [56]. Due to the imbalance of ratings in the dataset we employed random oversampling of images with less frequent ratings, which has been shown to improve the prediction performance of CNNs [57]. For ADNI subjects that had multiple scans for a single timepoint, a scan was selected randomly for each minibatch.
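Two ingredients of the training procedure, random oversampling and the mean-squared-error objective, are sketched below. The exact oversampling scheme is not specified in the text; drawing extra copies until every rating is as frequent as the most common one is an assumption, and both function names are hypothetical.

```python
import random
from collections import defaultdict


def oversample(samples, seed=0):
    """Balance (image_id, rating) pairs by re-drawing images of rarer ratings."""
    rng = random.Random(seed)
    by_rating = defaultdict(list)
    for image_id, rating in samples:
        by_rating[rating].append((image_id, rating))
    # Assumption: every rating class is topped up to the size of the largest one.
    target = max(len(group) for group in by_rating.values())
    balanced = []
    for group in by_rating.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


def mean_squared_error(predicted, rated):
    """Mean-squared error between continuous predictions and integer ratings."""
    return sum((p - r) ** 2 for p, r in zip(predicted, rated)) / len(rated)


samples = [(i, 0) for i in range(6)] + [(i, 1) for i in range(6, 9)] + [(9, 2)]
balanced = oversample(samples)
loss = mean_squared_error([2.4, 3.0], [2, 3])
```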

2.4 Analyses metrics

The visual rating scales are subjective measures by definition. Consequently, there are no objective ground truth ratings available. In most studies, the performance of a rater is reported in kappa statistics, a group of measures that can quantify the level of agreement between two sets of discrete ratings, but there is no single metric that is always reported. To make our results comparable to previous findings, we present our results with Cohen’s weighted kappa ( $\kappa_{w}$ ), which has been used in several previous rating studies [6, 14, 3, 58, 12, 15, 59], as well as accuracy and the Pearson correlation coefficient ( $\rho$ ). The agreement between two sets of ratings is referred to as inter-rater agreement if the sets were assessed by different raters, and intra-rater agreement if a single radiologist rated the set twice.
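For readers unfamiliar with the metric, Cohen's weighted kappa can be computed as in the sketch below. The text does not state which weighting scheme was used, so the linear default (with a quadratic option) is an assumption, and the function name is ours.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories, weights="linear"):
    """Cohen's weighted kappa for two lists of integer ratings in 0..n_categories-1."""
    n = len(ratings_a)
    k = n_categories
    # Observed joint distribution of the two raters' scores.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    power = 1 if weights == "linear" else 2  # anything else means quadratic
    disagreement_observed = 0.0
    disagreement_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # 0 on the diagonal, 1 at the extremes
            disagreement_observed += w * observed[i][j]
            disagreement_expected += w * row[i] * col[j]
    if disagreement_expected == 0.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - disagreement_observed / disagreement_expected


perfect = weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], n_categories=5)
chance = weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], n_categories=2)
```

Unlike plain accuracy, the weighting penalizes a disagreement of two scale steps more than a disagreement of one step, which suits ordinal scales such as MTA.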

3 Results

3.1 Intra-rater agreements

To get an idea of the variability in the human ratings used for training in this project, we studied the intra-rater agreement in a subset of 244 images that had been rated 2-4 times with at most 16 months from the first to the last rating session. To be consistent with the computer training and evaluation procedure, we compared the latest rating to a previous one. If there were more than two ratings, the previous rating was chosen randomly. This yielded $\kappa_{w}$ agreements and accuracies for MTA (left): $\kappa_{w}$ = 0.83, acc = 76%; MTA (right): $\kappa_{w}$ = 0.79, acc = 70%; GCA-F: $\kappa_{w}$ = 0.46, acc = 71%; PA: $\kappa_{w}$ = 0.65, acc = 72%. Ratings made only 1 week apart showed substantially better intra-rater agreement (see Ferreira et al. (2017) entry in Table 2). These results provide an estimate of the “human-level agreement”, i.e. approximate levels of agreement our models should be able to achieve by training on the available cohort, given the rating inconsistencies over 16 months.

Since there are no random elements in the evaluation process of a brain image, the “intra-rater” agreement of AVRA is inherently $\kappa_{w}=1$ .

3.2 Inter-rater agreements

Our models predicted continuous rating scores of an image, based on training from discrete ratings by L.C. We rounded AVRA’s ratings to the nearest integer to be able to compare the rating consensus in terms of accuracy and kappa statistics. The agreements between the radiologist’s and AVRA’s (as well as the VGG networks’) ratings on the hold-out test set are summarized in Table 2 together with previously reported $\kappa_{w}$ values of inter- and intra-rater agreements. The inter-rater agreement $\kappa_{w}$ , Pearson correlation $\rho$ , and accuracy on the test set were, for MTA (left): $\kappa_{w}$ = 0.74, acc = 70%; MTA (right): $\kappa_{w}$ = 0.72, $\rho$ = 0.88, acc = 70%; GCA-F: $\kappa_{w}$ = 0.62, $\rho$ = 0.71, acc = 84%; PA: $\kappa_{w}$ = 0.74, $\rho$ = 0.85, acc = 83%. These agreement levels were similar to those previously reported in other studies, see Table 2. The naive VGG16 implementations showed lower inter-rater agreements with the radiologist compared to AVRA.
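Converting the continuous scores to discrete ratings and computing accuracy can be sketched as below. The tie-breaking rule (a score of exactly x.5 rounds up) and the clipping to the valid range of the scale are assumptions; note that Python's built-in `round` would instead round 2.5 down to 2.

```python
import math


def to_discrete(score, max_rating):
    """Round a continuous rating to the nearest integer on a 0..max_rating scale."""
    rounded = math.floor(score + 0.5)  # assumption: ties are rounded up
    return max(0, min(max_rating, rounded))  # assumption: clip to the scale


def accuracy(predicted, rated):
    """Fraction of images where the rounded prediction equals the human rating."""
    hits = sum(1 for p, r in zip(predicted, rated) if p == r)
    return hits / len(rated)


# Hypothetical continuous MTA scores (scale 0-4) and radiologist ratings.
scores = [2.4, 2.6, 3.1, 0.8]
discrete = [to_discrete(s, max_rating=4) for s in scores]
acc = accuracy(discrete, [2, 3, 2, 1])
```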

To increase interpretability and understanding of the models, we computed gradient-based sensitivity maps of images in the test set based on the SmoothGrad method [60]. These indicated how influential individual voxels were in the rating prediction, which we can use to verify that the network identified the correct features. Examples of AVRA’s rating predictions for each scale are shown in Fig. 4. As can be observed, the MTA sensitivity maps were generally focused only around the area of the hippocampus and the inferior lateral ventricle in $\sim\pm$ 3 slices from the “correct” rating slice. The sensitivity maps in other more posterior and anterior slices were close to zero. The GCA-F maps were more diffuse, but the greatest magnitudes were primarily seen in the sulci of the frontal lobe. The PA maps were mainly visible in the parietal lobe and in the sagittal plane, with the greatest magnitudes appearing in the parieto-occipital sulcus and precuneus.
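The idea behind SmoothGrad is to average the gradient of the output with respect to the input over several noisy copies of that input. The sketch below shows this on a toy model with an analytic gradient; it is not AVRA. The toy weights, the noise level, the number of samples, and taking the absolute value after averaging are all assumptions made for illustration.

```python
import random


def smoothgrad(grad_fn, x, n_samples=50, noise_std=0.1, seed=0):
    """Average grad_fn over noisy copies of x and return absolute sensitivities."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        grad = grad_fn(noisy)
        total = [t + g for t, g in zip(total, grad)]
    # Assumption: the map shows the magnitude of the averaged gradient.
    return [abs(t / n_samples) for t in total]


# Toy "model": f(v) = sum_i w_i * v_i**2, so df/dv_i = 2 * w_i * v_i.
# "Voxels" with zero weight cannot influence the output at all.
toy_weights = [0.0, 2.0, 0.0, 1.0]


def toy_gradient(v):
    return [2.0 * w * vi for w, vi in zip(toy_weights, v)]


sensitivity = smoothgrad(toy_gradient, [1.0, 1.0, 1.0, 1.0])
```

In this toy case the map is zero for the inputs the model ignores and largest for the most heavily weighted input, which is the same kind of check that is applied to the real maps in Fig. 4.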

4 Discussion

We have developed a tool for automatic visual ratings of atrophy (AVRA) that is fast, systematic and robust. AVRA is trained on a large set of images rated by an expert neuroradiologist using the established clinical assessment measures of Scheltens’ MTA scale, Pasquier’s GCA-F scale and Koedam’s PA scale, and reaches agreement levels similar to those between two experienced radiologists. The tool runs in under 1 minute on a regular laptop, which enables automatic rating of thousands of images in a couple of hours. Rating an MRI image of the brain requires a minimal amount of preprocessing and the models were built to potentially work in a clinical setting. The main advantage of an automatic model is the absence of randomness, which can ensure rating consistency between different clinics, research groups and cohorts. Thus, AVRA has the potential to function as a clinical aid, and to increase the use of visual ratings in research.

4.1 Agreement levels

The rating agreements between AVRA’s and the radiologist’s ratings were considered substantial (i.e. between 0.6 and 0.8) according to the often-cited paper by Landis and Koch (1977) [61]. The agreements were close to the “human-level agreements” in this study (i.e. the agreement between the multiple L.C. ratings of the same image). This was reasonable since a model trained on labels that are imperfect due to rating inconsistency can never achieve perfect agreement. A previous study investigated the reliability of MTA ratings over time and showed that the intra-rater agreement is typically higher when a set is rated twice closer in time, especially when the radiologist does not rate images on a daily basis [3]. The time between ratings is often not reported, but in Pasquier’s introduction of the GCA scale the second rating was performed 24 h after the first [8]. Thus it is reasonable that if all images in a study were rated twice 16 months apart, the intra-rater agreement would generally be lower than the actual reported values. Our analysis of the subset of images rated more than once suggests this to be the case. Those values may not necessarily reflect the “true” rating consistency either, since the multiple-rating subset does not follow the same distribution as the whole cohort. Limiting the time span between the first and last set of ratings meant having to discard a large part of the images in the training set, and initial investigations of this showed decreasing agreement in our study. This suggested that a large number of images for training was more important than the potential inclusion of noisy labels.
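The Landis and Koch (1977) interpretation bands can be expressed as a simple lookup. The band labels follow that paper; the function name is hypothetical.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) strength-of-agreement label."""
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1")
    if kappa < 0.0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


# Inter-rater values reported for AVRA on the hold-out test set.
labels = {name: landis_koch_label(k) for name, k in
          [("MTA left", 0.74), ("MTA right", 0.72), ("GCA-F", 0.62), ("PA", 0.74)]}
```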

AVRA’s ratings agreed more with the radiologist’s ratings than the VGG16 models’ did. A recurrent CNN architecture might thus be particularly suitable for visual rating predictions, but we cannot say from these results whether it was the residual modules, the attention components, or the LSTM cells (all used in AVRA but not in the VGG16 models) that had the greatest positive impact on the performance. Another contributing factor may be the large difference in the number of trainable parameters between AVRA (1.5M) and VGG16 (65M), which makes AVRA less prone to overfit on the training data. However, it should be noted that we spent more time tuning and optimizing AVRA than the VGG16 networks, which biases the results in favor of AVRA.

The automatic model presented by Lötjönen and colleagues (2017) is, to our knowledge, the only software that also attempts to predict scores based on clinical visual rating scales [19]. It is based on volume measures of hippocampus and surrounding structures, whereas AVRA predicts the ratings directly from the voxel intensity values. This makes our proposed method promising to also work on MRI images with large slice thickness and CT images, from which volumes generally cannot be computed. The fact that CT is a cheaper and more commonly used imaging modality than MRI in the clinics speaks in favor of using convolutional neural networks over volumetrics for automatic ratings of atrophy [62]. No $\kappa_{w}$ values are reported in [19], but they provided correlation coefficients between radiologist and computer ratings for the MTA scale as 0.86 (left) and 0.85 (right). AVRA showed a similar magnitude of correlation for the MTA scale on the hold-out test set: $\rho=0.88$ .

4.2 Reliability of AVRA

One of the main motivations for having a computer rate brain atrophy instead of humans is its inherent perfect intra-rater agreement: the same image will be rated exactly the same regardless of when (and where) it is rated. A relevant question to ask is: why not let a computer segment and calculate e.g. hippocampal volumes instead of an MTA rating? We see three main motivations for this: 1) CT, and some MRI protocols, have a slice thickness that is too large to allow reliable volumetric information to be extracted from the images. 2) Segmentation methods will, just as AVRA, fail in processing some cases, and for a clinician to manually intervene and delineate structures would be neither feasible nor practical. If an automatic visual rating failed, the radiologist would be able to quickly perform their own visual rating, as is done today. 3) It is unclear how to clinically interpret volumetric data, e.g. the hippocampal volumes. In contrast, extensive research has been done on cut-offs for visual rating scales, even considering modulating factors such as age [13].

The sensitivity maps shown in Fig. 4 suggested that the models were able to correctly identify relevant structures to base their ratings on. In particular, the sensitivity maps of the MTA model were typically not visible more than $\pm 3$ mm from the “correct” rating slice, indicating that the recurrent CNN architecture was able to correctly identify relevant slices and disregard redundant ones. The diffuse sensitivity maps seen for the GCA-F scale were consistent with the quantitative validation study by Ferreira et al. (2015), which showed that frontal atrophy is also associated with temporal and posterior atrophy, at least in the ADNI cohort [12]. Möller and colleagues (2014) found, using VBM analysis, significant differences between PA ratings not only in the parietal lobe, but also in parts of the cerebellum, temporal lobe and the occipital lobe [11]. Their study was also performed on a cohort with individuals with probable AD and subjective memory complaints, concluding that atrophy solely in the posterior cortex is an exception. The sensitivity maps from our PA model indicate that AVRA based the PA ratings on mainly the same regions. AVRA learns how to predict a GCA-F or a PA score from an MRI image based only on previous human ratings. Thus, if e.g. frontal atrophy is strongly associated with atrophy in the temporal lobe, the model is likely to find it difficult to learn to only assess the frontal lobe in the GCA-F scale. Since the sensitivity maps are based on the absolute values of the calculated gradients in the backward propagation, their magnitude decreases every time the gradients propagate through the LSTM cell due to the point-wise multiplication in the forget gate [52]. The PA model inputs 99 slices. As the sagittal slices are the last to be fed to the model, it is reasonable to assume that they dominate the sensitivity maps as opposed to early axial slices, which have propagated through the LSTM cell almost 100 times.
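The attenuation argument can be made concrete with a small calculation. Along the LSTM cell-state path, the gradient reaching an earlier slice is scaled by the product of the forget-gate activations of all later steps. The sketch below is a simplification: it ignores the gradient paths through the hidden state, and the constant forget-gate value of 0.95 is a made-up number, not one measured in AVRA.

```python
def cell_state_attenuation(forget_gates):
    """Product of forget-gate activations a gradient passes on the cell-state path."""
    factor = 1.0
    for f in forget_gates:
        factor *= f
    return factor


HYPOTHETICAL_FORGET_GATE = 0.95  # assumed constant gate value, for illustration only

# The first axial slice is followed by 98 further slices (99 in total),
# whereas the first sagittal slice is followed by only 33 (34 sagittal slices).
first_axial = cell_state_attenuation([HYPOTHETICAL_FORGET_GATE] * 98)
first_sagittal = cell_state_attenuation([HYPOTHETICAL_FORGET_GATE] * 33)
```

Even with a gate value this close to 1, the factor for the first axial slice is far smaller than for the first sagittal slice, in line with the sagittal slices dominating the PA sensitivity maps.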

The performance of AVRA was validated in a test set that was randomly sampled from the same cohorts as the training data set. This means that the data distribution in the test set was similar to the image samples that the models were trained on. This is a simpler test set than if the test set was from a different cohort with images acquired using other scanning parameters. We are currently in the process of validating how the models would handle data from a different image distribution (cohort), and the effect it would have on the rating agreement.

Frequently, it is difficult for a radiologist to decide between two scores, and in a clinical situation the level of atrophy is often described as, for instance, “the left MTA is between 2 and 3”. This nuance might be important information for the physician diagnosing dementia, but in research single integer scores have typically been used, following the original definitions of the rating scales. Previous attempts at (semi-)automatic atrophy measures have output a continuous measure [17, 18, 63, 19]. The main advantages of using a continuous measure of atrophy are 1) atrophy evolves continuously and thus it is reasonable to describe its degree through a continuous measure, and 2) it provides more detailed information about the severity of the atrophy. The latter point is particularly useful for tracking disease progression and could allow us to establish more sensitive cut-off values for different diagnoses. It is also easy to convert the continuous measures of the rating scales to their discrete, original versions by rounding to the nearest integer.
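The conversion from a continuous score back to the original discrete scale can be sketched as below. This is a minimal illustration and not AVRA's code: the helper name `to_discrete`, the half-up tie rule, and the clamping to the scale ranges (0 to 4 for MTA, 0 to 3 for GCA-F and PA) are assumptions made for the example.

```python
import math

# Assumed maximum score of each rating scale (minimum is 0 for all three).
SCALE_MAX = {"MTA": 4, "GCA-F": 3, "PA": 3}

def to_discrete(score, scale):
    """Round a continuous rating to the nearest integer (ties round up)
    and clamp the result to the valid range of the given scale."""
    # math.floor(x + 0.5) avoids Python's round(), which rounds ties to even.
    rounded = math.floor(score + 0.5)
    return max(0, min(SCALE_MAX[scale], rounded))

for s in (2.0, 2.4, 2.6, 3.0):
    print(s, "->", to_discrete(s, "MTA"))
```

With this rule a continuous MTA score of 2.4 maps to 2 and a score of 2.6 maps to 3, so the information that both cases lie "between 2 and 3" is only available in the continuous output.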

In Fig. 5 we show some examples comparing AVRA’s continuous ratings with the radiologist’s discrete ratings in the diagnostically important interval between MTA=2 and MTA=3. When re-examining these images after seeing AVRA’s ratings, the radiologist judged only the images originally rated MTA=2 with associated AVRA scores of 2.6-3.0 to have been wrongly rated. They would be re-rated as MTA=3, i.e. closer to AVRA’s score. The image scored MTA=2 (radiologist) and MTA=2.4 (AVRA) was described as a case between 2 and 3, which may illustrate the usefulness of continuous ratings. However, we noticed that in two of the ratings with the largest disagreement (L.C.: MTA=3, AVRA: MTA={2.0, 2.2}) the individuals had an adhesion between the hippocampus and the cerebral white matter. These cases are not frequent, and the rating disagreements in Fig. 5 indicate that AVRA did not learn to correctly adjust the score for the presence of adhesions.

We aimed to design AVRA to function on images with the least amount of preprocessing possible, to demonstrate that it could work in a clinical setting. A few concessions were made to facilitate the training process, mainly the AC-PC alignment performed through rigid registration to the MNI brain using FSL FLIRT. This helped to center all images and allowed for tighter cropping around the structures of interest. However, this automatic preprocessing step failed in around 2.5% of all images, which were excluded from further training and evaluation although the quality of most of these images was good enough for a radiologist to rate them visually. Since the MRI image input to the CNN has not been intensity normalized, skull stripped or motion corrected, it is possible to perform a manual AC-PC alignment for the failed cases and then input them to the model. More extensive data augmentation and training, or using reinforcement learning to find the correct slice, could potentially be used to avoid the AC-PC alignment step and input the raw MRI image directly. This was, however, not explored in the current study.

4.3 Limitations

There are some limitations of the proposed algorithm. First, the models are based solely on the ratings of a single radiologist and thus assume that the ratings we trained the model on are “ground truth” labels. A model trained on these labels can therefore never be “better” than the rater. If the ratings have systematic errors, the model will incorporate these. For instance, a rater might systematically look at the left medial temporal lobe when rating the MTA of the right hemisphere, which could influence (bias) the right hemisphere MTA score. If we train a model on these ratings, this bias would be learned by the model as well. Another approach would be to have multiple expert radiologists rate a set of images together or separately and use these labels as ground truth. However, it is not feasible to have multiple radiologists visually assess the large number of images necessary for training a deep neural network. Nor would such ratings necessarily be “closer” to the ground truth. If future studies want to use a neural network based on their own set of ratings, it should be possible to start from the pre-trained networks of AVRA and fine-tune the final classification layer(s) on the new ratings. This would require substantially fewer ratings, since the convolutional part would already have learned to extract relevant features from the images.
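The fine-tuning idea can be sketched with a toy example that uses only the Python standard library. It is not AVRA's implementation: the feature vectors stand in for the output of a frozen, pre-trained convolutional part, the final layer is reduced to a single linear unit trained with a squared-error loss, and all names and numbers (`fine_tune_head`, `frozen_features`, `new_ratings`, the starting weights) are hypothetical.

```python
def predict(x, w, b):
    """Output of the final linear layer for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mse(features, ratings, w, b):
    """Mean squared error between predictions and the new rater's scores."""
    return sum((predict(x, w, b) - y) ** 2
               for x, y in zip(features, ratings)) / len(features)

def fine_tune_head(features, ratings, w, b, lr=0.05, epochs=500):
    """Gradient descent on the final layer only; the features stay fixed,
    which mimics freezing the pre-trained feature extractor."""
    n = len(features)
    w = list(w)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(features, ratings):
            err = predict(x, w, b) - y
            for j, xj in enumerate(x):
                grad_w[j] += 2 * err * xj / n
            grad_b += 2 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical frozen features for four images and a new rater's scores.
frozen_features = [[0.1, 0.9], [0.4, 0.6], [0.7, 0.3], [0.9, 0.1]]
new_ratings = [0, 1, 2, 3]

w0, b0 = [2.0, -1.0], 1.0          # stand-in for the pre-trained final layer
w1, b1 = fine_tune_head(frozen_features, new_ratings, w0, b0)

print("MSE before:", mse(frozen_features, new_ratings, w0, b0))
print("MSE after: ", mse(frozen_features, new_ratings, w1, b1))
```

Only the few parameters of the final layer are updated here, which is why a small set of new ratings can suffice. In a deep learning framework the same effect is typically achieved by disabling gradient updates for the convolutional layers.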

The second limitation of the study is the small number of images with the highest GCA-F and PA ratings, which may increase the risk of a “true” score of 3 being misclassified. Based on the results in Fig. 6 this seems to be the case. As the diagnostic cut-off values for these rating scales in AD diagnosis have been suggested as PA $\geq$ 1 and GCA-F $\geq$ 1 [13], the clinical implications of this may be minor even in the cases where the atrophy is rated as a 2 instead of a 3. These severe ratings are also rare in previous studies on dementia cohorts [13, 64], so this will likely be an issue for any computerized method trained on radiologist ratings.

5 Conclusion

In this study, we have proposed an automatic method (AVRA) to provide visual ratings of atrophy according to Scheltens’ MTA scale, Koedam’s PA scale, and Pasquier’s frontal GCA scale. AVRA mimics the neuroradiologist’s rating procedure and achieves similar levels of agreement to that between two experienced neuroradiologists—without any prior preprocessing of the MRI images. We plan to make AVRA freely available as a user-friendly software aimed towards neuroscientists and neuroradiologists.

Acknowledgements

We would like to thank the Swedish Foundation for Strategic Research (SSF), The Swedish Research Council (VR), the Strategic Research Programme in Neuroscience at Karolinska Institutet (StratNeuro), Swedish Brain Power, the regional agreement on medical training and clinical research (ALF) between Stockholm County Council and Karolinska Institutet, Hjärnfonden, Alzheimerfonden, the Åke Wiberg Foundation and Birgitta och Sten Westerberg for additional financial support.

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

Bibliography

[1] Lorna Harper, Frederik Barkhof, Nick C. Fox, and Jonathan M. Schott. Using visual rating to diagnose dementia: A critical evaluation of MRI atrophy scales. Journal of Neurology, Neurosurgery and Psychiatry, 86(11):1225–1233, 2015.
[2] Lars Olof Wahlund, Eric Westman, Danielle van Westen, Anders Wallin, Sara Shams, Lena Cavallin, and Elna Marie Larsson. Imaging biomarkers of dementia: recommended visual rating scales with teaching cases. Insights into Imaging, 8(1):79–90, 2017.
[3] Lena Cavallin, Kirsti Løken, Knut Engedal, Anne Rita Øksengård, Lars Olof Wahlund, Lena Bronge, and Rimma Axelsson. Overtime reliability of medial temporal lobe atrophy rating in a clinical setting. Acta Radiologica, 53(3):318–323, 2012.
[4] Lars-Olof Wahlund, Per Julin, Johan Lindqvist, and Philip Scheltens. Visual assessment of medial temporal lobe atrophy in demented and healthy control subjects: correlation with volumetry. Psychiatry Research: Neuroimaging, 90(3):193–199, 1999.
[5] Philip Scheltens, D Leys, F Barkhof, D Huglo, H C Weinstein, P Vermersch, M Kuiper, M Steinling, E Ch Wolters, and J Valk. Atrophy of medial temporal lobes on MRI in “probable” Alzheimer’s disease and normal ageing: diagnostic value and neuropsychological correlates. Journal of Neurology, Neurosurgery, and Psychiatry, 55:967–972, 1992.
[6] Esther L.G.E. Koedam, Manja Lehmann, Wiesje M. Van Der Flier, Philip Scheltens, Yolande A.L. Pijnenburg, Nick Fox, Frederik Barkhof, and Mike P. Wattjes. Visual assessment of posterior atrophy development of a MRI rating scale. European Radiology, 21(12):2618–2625, 2011.
[7] Philip Scheltens, Florence Pasquier, Jan G.E. Weerts, Frederik Barkhof, and Didier Leys. Qualitative assessment of cerebral atrophy on MRI: inter- and intra-observer reproducibility in dementia and normal aging. European Neurology, 37(2):95–99, 1997.
[8] Florence Pasquier, Didier Leys, Jan G.E. Weerts, Francois Mounier-Vehier, Frederik Barkhof, and Philip Scheltens. Inter- and intraobserver reproducibility of cerebral atrophy assessment on MRI scans with hemispheric infarcts. European Neurology, 36(5):268–272, 1996.