An ordinal measure of interrater absolute agreement
Giuseppe Bove, Pier Luigi Conti, Daniela Marella

TL;DR
This paper introduces a new measure for interrater agreement on ordinal scales that overcomes variance restriction issues, with proven statistical properties and demonstrated effectiveness through simulations and real data application.
Contribution
A novel ordinal agreement measure based on dispersion index, with unbiased estimation, confidence interval construction, and validation through simulations and real case study.
Findings
The new measure avoids variance restriction problems.
Simulation studies confirm the measure's accuracy.
Application to real data demonstrates practical utility.
Abstract
A measure of interrater absolute agreement for ordinal scales is proposed, capitalizing on the dispersion index for ordinal variables proposed by Giuseppe Leti. The procedure avoids the problem of restriction of variance that sometimes affects traditional measures of interrater agreement in different fields of application. An unbiased estimator of the proposed measure is introduced and its sampling properties are investigated. In order to construct confidence intervals for interrater absolute agreement, both asymptotic results and bootstrapping methods are used and their performance is evaluated. Simulated data are employed to demonstrate the accuracy and practical utility of the new procedure for assessing agreement. Finally, an application to a real case is provided.
Table 1: Estimated coverage probability (CP), left-tail (LE) and right-tail (RE) errors, in per cent, and average length (AL) of the confidence intervals, by bootstrap approach.

| Method | Indicator | Nonparametric | Parametric | Pseudo-nonparametric |
|---|---|---|---|---|
| Normal | CP | 99.4 | 99.4 | 99.4 |
| | LE | 0.6 | 0.6 | 0.6 |
| | RE | 0 | 0 | 0 |
| | AL | 0.16 | 0.16 | 0.16 |
| T-int | CP | 26.2 | 72.4 | 28.8 |
| | LE | 73.8 | 26.2 | 71.2 |
| | RE | 0 | 1.4 | 0 |
| | AL | 0.18 | 0.08 | 0.15 |
| Perc | CP | 92.8 | 91.2 | 92.8 |
| | LE | 0 | 8.8 | 0 |
| | RE | 7.2 | 0 | 7.2 |
| | AL | 0.23 | 0.10 | 0.18 |
| Pivot | CP | 27 | 79.2 | 30 |
| | LE | 73 | 19.6 | 70 |
| | RE | 0 | 1.2 | 0 |
| | AL | 0.23 | 0.10 | 0.18 |
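Table 2: True values and minimum, maximum, mean and standard deviation of the estimated multinomial probabilities over the 1000 original samples.

| Parameter | True value | Min | Max | Mean | Sd |
|---|---|---|---|---|---|
| $p_1$ | 0.10 | 0.05 | 0.15 | 0.10 | 0.01 |
| $p_2$ | 0.20 | 0.14 | 0.25 | 0.19 | 0.02 |
| $p_3$ | 0.35 | 0.29 | 0.41 | 0.35 | 0.02 |
| $p_4$ | 0.25 | 0.20 | 0.33 | 0.26 | 0.02 |
| $p_5$ | 0.10 | 0.06 | 0.16 | 0.10 | 0.02 |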
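Table 3: Mean of the estimates over the original samples and the bootstrap replications, by bootstrap approach.

| Approach | Mean of estimates ($d = 0.61$) | Mean of estimates ($d = 0.41$) |
|---|---|---|
| Nonparametric | 0.53 | 0.36 |
| Parametric | 0.61 | 0.41 |
| Pseudo-nonparametric | 0.55 | 0.37 |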
Table 4: Estimated coverage probability (CP), left-tail (LE) and right-tail (RE) errors, in per cent, and average length (AL) of the confidence intervals for the repeated simulation, by bootstrap approach.

| Method | Indicator | Nonparametric | Parametric | Pseudo-nonparametric |
|---|---|---|---|---|
| Normal | CP | 98.2 | 98.2 | 98.2 |
| | LE | 1.8 | 1.8 | 1.8 |
| | RE | 0 | 0 | 0 |
| | AL | 0.13 | 0.13 | 0.13 |
| T-int | CP | 60.2 | 83.2 | 61.2 |
| | LE | 39.8 | 14.8 | 38.8 |
| | RE | 0 | 2 | 0 |
| | AL | 0.18 | 0.10 | 0.14 |
| Perc | CP | 93.2 | 93.8 | 93.2 |
| | LE | 0 | 5.8 | 0 |
| | RE | 6.8 | 0.4 | 6.3 |
| | AL | 0.19 | 0.10 | 0.15 |
| Pivot | CP | 64.8 | 84.6 | 65.4 |
| | LE | 35.2 | 12.6 | 34.6 |
| | RE | 0 | 2.8 | 0 |
| | AL | 0.19 | 0.10 | 0.15 |
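Table 5: Interrater agreement results for the comprehensibility subscale (narrative task), by group.

| Group | Writers | Intraclass correlation | | CV | | |
|---|---|---|---|---|---|---|
| L1 | 20 | 0.14 | 0.90 | 8.12 | 0.17 | 0.19 |
| L2 | 20 | 0.63 | 0.84 | 16.20 | 0.28 | 0.32 |
| Total | 40 | 0.67 | 0.87 | 12.16 | 0.22 | 0.25 |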
Giuseppe Bove∗, Pier Luigi Conti∗∗, Daniela Marella∗
Corresponding author: Giuseppe Bove, Dipartimento di Scienze della Formazione, Università “Roma Tre”
∗ Dipartimento di Scienze della Formazione, Università “Roma Tre”
∗∗ Dipartimento di Scienze Statistiche, Università “La Sapienza”
Key words: ordinal data, interrater agreement, resampling.
1 Introduction
Ordinal rating scales are frequently developed in study designs where several ‘raters’ (or ‘judges’) evaluate a group of ‘targets’. For instance, in language studies new rating scales are tested before their routine application by a group of raters, who assess the language proficiency of a corpus of argumentative (written or oral) texts produced by a group of writers. Similar situations can be found in organizational, educational, biomedical, social, and behavioural research, where raters can be counsellors, teachers, clinicians, evaluators, or consumers and targets can be organization members, students, patients, subjects, or objects. When each rater evaluates the targets, the raters provide comparable categorizations of the targets. The more the raters’ categorizations coincide, the more confidently the rating scale can be used without worrying about which raters produced those categorizations. Hence, the main interest here lies in analysing the extent to which raters assign the same (or very similar) values on the rating scale (interrater absolute agreement), that is, in establishing to what extent the raters’ evaluations are close to an equality relationship (e.g., in the case of only two raters, if the two sets of ratings are represented by $X$ and $Y$, the relation of interest is $X = Y$). Measures of interrater absolute agreement such as Cohen’s Kappa (and extensions that take into account three or more raters, e.g., [19]) and intraclass correlations ([18]; [14]) are usually applied when dealing with ratings on ordinal scales. A first problem with these procedures is that they were not originally defined for ordinal scales, and so they have to be adapted. For instance, indices based on Cohen’s Kappa require numerical values to be assigned to the ordinal levels of the scale, while intraclass correlation indices are based on a repeated-measures ANOVA approach for interval data.
Another limitation of the above-mentioned measures is that they are affected by the restriction of variance problem (e.g., [9]), which consists in an attenuation of the estimates of rating similarity caused by an artefactual reduction of the between-subjects variance in ratings. For instance, this happens in language studies when the same task is defined for native (L1) and non-native (L2) writers, and the analysis compares rater agreement in the two groups separately. Even in the presence of very good absolute agreement, Cohen’s Kappa coefficient and intraclass correlations can take low values, especially for the L1 group, because the ratings provided by the raters are concentrated on one or two very high levels of the scale (a range restriction that determines a between-target variance restriction).
In order to overcome the restriction of variance problem, measures of absolute agreement (or consensus) have been proposed; see [10] for a review. The main underlying idea is to measure the within-target variance of ratings (i.e., the between-rater variance) separately for each target, and to summarize the results in a final average index (usually normalized in the interval $[0,1]$). In this approach, the influence of a low between-target variance is removed by analysing the ratings of each target separately. One of the most popular indices in this group was proposed by [5] and [6]. For a scale $X$ it can be expressed as
$$r_{WG} = 1 - \frac{S_X^2}{\sigma_E^2} \tag{1}$$
where $S_X^2$ is the observed between-rater variance of the ratings and $\sigma_E^2$ is the between-rater variance obtained from a theoretical null distribution representing a complete lack of agreement among raters. In other words, to calculate $r_{WG}$ one directly compares the observed variance in the raters’ ratings with the variance one would expect if there were no agreement among raters. Higher values indicate greater agreement.
For raters in perfect agreement we have $S_X^2 = 0$, with a corresponding value $r_{WG} = 1$. In applications, values greater than (possibly ) are considered associated with a high level of interrater absolute agreement (see [10], p. 836, Table 3). Researchers often define the absence of agreement, that is the null distribution, in terms of a uniform distribution. When the null distribution is assumed to be uniform, the corresponding variance is
$$\sigma_{EU}^2 = \frac{A^2 - 1}{12} \tag{2}$$
where $A$ is the total number of levels of the scale $X$.
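As a minimal sketch of the comparison described above, the following Python snippet computes a single-item index of the form $1 - S_X^2/\sigma_{EU}^2$ for one target under a uniform null. The function name `r_wg` and the use of the sample variance with divisor $n-1$ are assumptions made for illustration, not taken from the paper. The second call shows how polarized ratings can push the index below zero.

```python
import numpy as np

def r_wg(ratings, n_levels):
    """Single-item agreement index for one target under a uniform null.

    Assumption: the observed variance uses the n-1 divisor (ddof=1).
    """
    ratings = np.asarray(ratings, dtype=float)
    s2 = ratings.var(ddof=1)                   # observed between-rater variance
    sigma2_eu = (n_levels ** 2 - 1) / 12.0     # variance of a discrete uniform on A levels
    return 1.0 - s2 / sigma2_eu

print(r_wg([5, 5, 5, 5], 6))   # identical ratings: index equals 1
print(r_wg([1, 6, 1, 6], 6))   # ratings split between the extremes: negative index
```
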
The index $r_{WG}$ and other indices reviewed in [10] (e.g., standard and average deviation indices) avoid the problem of variance restriction but, like traditional measures of interrater agreement, they are defined only for interval data. Besides, depending on the choice of the null distribution, negative values can be obtained. For these reasons, in this contribution we propose a new procedure to measure absolute agreement for ordinal rating scales by using the dispersion index proposed by [11] (pp. 290-297) for ordinal variables. In this way, we take into consideration the ordinal level of the measurement scales. The new measure is not affected by restriction of variance problems and does not depend on the choice of a particular null distribution. In this paper we assume a two-way random sampling design, which involves a sample of raters as well as a sample of targets, all of which are rated by each sampled rater.
The paper is organized as follows. In Section 2 the dispersion index proposed by [11] (pp. 290-297) for ordinal variables is introduced, and its sampling properties are analyzed in Section 3. Section 4 contains the results of a simulation experiment used to illustrate the theoretical results; there, confidence intervals for the proposed interrater agreement index are constructed using both the asymptotic results described in Section 3 (Proposition 4) and bootstrapping procedures. Finally, in Section 5 the method is applied to real data.
2 Leti index as a measure of interrater absolute agreement for ordinal scales
The dispersion of an ordinal categorical variable can be measured by the index proposed in [11] (pp. 290-297), which is given by
$$D = 2\sum_{k=1}^{K-1} F_k\,(1 - F_k) \tag{3}$$
where $K$ is the number of categories of the variable and $F_k$ is the cumulative proportion associated with category $k$, for $k = 1, \dots, K-1$. The index $D$ is nonnegative and it is easy to prove that $D = 0$ if and only if all observed categories are equal (absence of dispersion). The maximum value of the index ($D_{\max}$) is obtained when all observations are concentrated in the two extreme categories of the variable (maximum dispersion), and it is
$$D_{\max} = \frac{K-1}{2} \tag{4}$$
when $n$ is even, and
$$D_{\max} = \frac{K-1}{2}\left(1 - \frac{1}{n^2}\right) \tag{5}$$
when $n$ is odd, $n$ being the total number of observations. For $n$ moderately large, the maximum of the index can be taken equal to $(K-1)/2$. Hence, it is possible to define a measure of dispersion normalized in the interval $[0,1]$ given by
$$d = \frac{D}{D_{\max}} = \frac{2D}{K-1} \tag{6}$$
Two advantages of this proposal with respect to measures of absolute agreement such as $r_{WG}$, discussed in Section 1, are:
- (i) $d$ does not depend on the formulation of a null distribution for normalization;
- (ii) $d$ can never fall outside the range $[0,1]$.
It is interesting to note that $D$ admits a within and between dispersion decomposition analogous to the well-known variance decomposition [3].
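The index and its normalized version can be sketched in a few lines of Python. This is an illustrative sketch, assuming Leti's index takes the form $D = 2\sum_{k=1}^{K-1} F_k(1-F_k)$ with ratings coded $1, \dots, K$ and normalization by $(K-1)/2$; the function names `leti_index` and `normalized_leti` are hypothetical.

```python
import numpy as np

def leti_index(ratings, n_categories):
    """Dispersion index D = 2 * sum_k F_k (1 - F_k) for ratings coded 1..K."""
    ratings = np.asarray(ratings)
    levels = np.arange(1, n_categories)                    # k = 1, ..., K-1
    # F[k-1] is the proportion of ratings less than or equal to k
    F = (ratings[:, None] <= levels[None, :]).mean(axis=0)
    return 2.0 * float(np.sum(F * (1.0 - F)))

def normalized_leti(ratings, n_categories):
    """Normalized index d = D / ((K-1)/2), which lies in [0, 1]."""
    return leti_index(ratings, n_categories) / ((n_categories - 1) / 2.0)

print(normalized_leti([5, 5, 5, 5, 5, 5], 6))   # all raters agree
print(normalized_leti([1, 1, 1, 6, 6, 6], 6))   # raters split between the two extremes
print(normalized_leti([5, 5, 6, 6, 5, 6], 6))   # ratings confined to two adjacent levels
```

Low values correspond to high agreement: ratings confined to two adjacent high levels still give a small $d$, which is the situation in which variance-based coefficients tend to suffer from range restriction.
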
3 Sampling Properties of index
A sample of $n_R$ raters and a sample of $n_T$ targets are drawn by simple random sampling without replacement from a finite population of $N_R$ raters and $N_T$ targets, respectively. Let $X_{ij}$ denote the score given by the $j$th rater to the $i$th target on a $K$-point scale, for $i = 1, \dots, n_T$ and $j = 1, \dots, n_R$. Formally, the $X_{ij}$s are independent categorical random variables having $K$ categories with probabilities $p_{ijk} = P(X_{ij} = k)$, for $k = 1, \dots, K$, and $\sum_{k=1}^{K} p_{ijk} = 1$. In the sequel we assume that both the targets and the raters are homogeneous (targets-raters homogeneity assumption), which implies that the probability $p_{ijk} = p_k$ does not depend on rater $j$ or target $i$, for $i = 1, \dots, n_T$, $j = 1, \dots, n_R$, $k = 1, \dots, K$.
As a consequence of the homogeneity assumptions, the variables $X_{ij}$ are independent and identically distributed (i.i.d.). As previously stressed, the dispersion of an ordinal categorical variable can be measured by the index (3). With regard to the $i$th target, let us denote with $\hat F_{ik}$ the empirical cumulative distribution function defined as
$$\hat F_{ik} = \frac{1}{n_R}\sum_{j=1}^{n_R} I(X_{ij} \le k) \tag{7}$$
where the numerator represents the number of raters giving a score less than or equal to $k$ to the $i$th target. It is known that $E(\hat F_{ik}) = F_{ik} = F_k$, where the last equality comes from the targets homogeneity assumption. Furthermore, under the i.i.d. assumption, $V(\hat F_{ik}) = F_k(1 - F_k)/n_R$.
In order to estimate $D$, for each target $i$ the following estimator can be defined
$$\hat D_i = 2\sum_{k=1}^{K-1} \hat F_{ik}\,(1 - \hat F_{ik}) \tag{8}$$
As stressed in [16], (8) can be alternatively expressed as
[TABLE]
where
[TABLE]
is an unbiased estimator of .
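Proposition 1.
The random vector $(n_{i1}, \dots, n_{iK})$, with $n_{ik}$ the number of the $n_R$ sampled raters giving score $k$ to the $i$th target, for $k = 1, \dots, K$, follows a multinomial distribution with parameters $n_R$ and $(p_1, \dots, p_K)$.
Expression (9) makes it easy to compute the expectation and the variance of the per-target estimator, as shown in Proposition 2; see [12] for details.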
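Proposition 2.
The per-target estimator $\hat D_i$ of the index $D$ in (3), based on $n_R$ sampled raters, has expectation
$$E(\hat D_i) = \frac{n_R - 1}{n_R}\,D \tag{11}$$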
and variance given by
[TABLE]
where
[TABLE]
Proof.
Both (11) and (12) come from the results in [12]. With regard to (11), we have
[TABLE]
for the variance (12) we obtain
[TABLE]
∎
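The expectation result can also be checked directly. The short derivation below is a sketch under stated assumptions: the per-target estimator is taken to be the plug-in form $\hat D_i = 2\sum_{k}\hat F_{ik}(1-\hat F_{ik})$, where $\hat F_{ik}$ is the proportion of the $n_R$ raters scoring target $i$ at or below $k$, and the ratings are i.i.d. The notation is illustrative and may differ from the original display equations.

```latex
\begin{aligned}
E\big[\hat F_{ik}(1-\hat F_{ik})\big]
  &= E[\hat F_{ik}] - \Big(V[\hat F_{ik}] + E[\hat F_{ik}]^2\Big) \\
  &= F_k - \frac{F_k(1-F_k)}{n_R} - F_k^2 \\
  &= \frac{n_R-1}{n_R}\,F_k(1-F_k), \\[4pt]
E[\hat D_i]
  &= 2\sum_{k=1}^{K-1} E\big[\hat F_{ik}(1-\hat F_{ik})\big]
   = \frac{n_R-1}{n_R}\,D .
\end{aligned}
```

Under these assumptions the plug-in estimator is biased downwards by the factor $(n_R-1)/n_R$, so multiplying by $n_R/(n_R-1)$ removes the bias.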
Remark 1.
For sufficiently large, we have
[TABLE]
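As an estimator of index (6) we consider
$$\hat d = \frac{2\,\bar D}{K-1} \tag{18}$$
where $\bar D = \frac{1}{n_T}\sum_{i=1}^{n_T} \hat D_i$ is an estimator of $D$ obtained by averaging the estimates $\hat D_i$ over the $n_T$ sampled targets.
In Proposition 3 both the sampling properties and the asymptotic distribution of $\hat d$ are analyzed for a large number of targets $n_T$ and a moderate number of raters $n_R$.
Proposition 3.
The estimator $\hat d$ has expectation
$$E(\hat d) = \frac{n_R - 1}{n_R}\,d \tag{19}$$
and variance
$$V(\hat d) = \frac{4}{(K-1)^2}\,\frac{V(\hat D_i)}{n_T} \tag{20}$$
where $V(\hat D_i)$ is given in (12). Furthermore, since the $\hat D_i$ are i.i.d., by the central limit theorem, as $n_T$ goes to infinity the distribution of $\hat d$ tends to a normal distribution with mean and variance given by (19) and (20), respectively.
In Proposition 4 an unbiased estimator of $d$ is proposed and its asymptotic distribution is evaluated.
Proposition 4.
From (19), an unbiased estimator of $d$ can be defined as follows
$$\tilde d = \frac{n_R}{n_R - 1}\,\hat d \tag{21}$$
As a consequence of Proposition 3, the distribution of $\tilde d$ is approximately normal with mean $d$ and variance
$$V(\tilde d) = \left(\frac{n_R}{n_R - 1}\right)^2 V(\hat d) \tag{22}$$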
The proof of Proposition 4 follows from Proposition 3. The above results are useful for constructing point and interval estimates of $d$. They are also useful for testing both the statistical significance of the index (that is, the null hypothesis $H_0: d = 0$) and null hypotheses such as $H_0: d = d_0$, where $d_0$ is a real number in $[0,1]$. Consider the hypothesis problem
[TABLE]
As a consequence of Proposition 4, a test with an asymptotic significance level consists in accepting whenever
[TABLE]
where is the quantile of the standard normal distribution and is an estimate of variance (22).
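The point estimators discussed in this section can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes a ratings matrix with targets in rows and raters in columns, scores coded $1, \dots, K$, the plug-in per-target index $2\sum_k \hat F_{ik}(1-\hat F_{ik})$, and a bias correction by the factor $n_R/(n_R-1)$. The names `d_hat` and `d_tilde` and the matrix sizes are hypothetical; the category probabilities are the true values listed in Table 2.

```python
import numpy as np

def d_hat(X, K):
    """Plug-in estimate: mean of the per-target indices, normalized by (K-1)/2."""
    X = np.asarray(X)
    levels = np.arange(1, K)                        # k = 1, ..., K-1
    F = (X[:, :, None] <= levels).mean(axis=1)      # per-target cumulative proportions, shape (n_T, K-1)
    D_i = 2.0 * (F * (1.0 - F)).sum(axis=1)         # one dispersion value per target
    return 2.0 * D_i.mean() / (K - 1)

def d_tilde(X, K):
    """Bias-corrected estimate, assuming the bias factor (n_R - 1) / n_R."""
    n_R = np.asarray(X).shape[1]
    return n_R / (n_R - 1) * d_hat(X, K)

rng = np.random.default_rng(0)
probs = [0.10, 0.20, 0.35, 0.25, 0.10]              # true values from Table 2
X = rng.choice(np.arange(1, 6), size=(5000, 6), p=probs)   # hypothetical 5000 targets x 6 raters
print(d_hat(X, 5), d_tilde(X, 5))                   # the corrected value should sit near 0.6175
```

With these probabilities the cumulative proportions are 0.10, 0.30, 0.65 and 0.90, so the population value is $d = 0.6175$, in line with the value 0.61 reported for the simulated population.
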
4 Simulation Study
In this section, a simulation study comparing the performance of different confidence intervals for the index $d$ is presented. We focus on methods for constructing confidence intervals for the index because confidence intervals indicate the range within which the population parameter (the interrater agreement in the population) is likely to fall, as well as the precision of this estimate (i.e., the size of the range).
A finite population of $N_T$ targets and $N_R$ raters was generated from a multinomial model with category probabilities $(0.10, 0.20, 0.35, 0.25, 0.10)$, the true values reported in Table 2. The finite population therefore consists of a matrix of size $N_T \times N_R$. The value of index (6) is $d = 0.61$.
From the population, 1000 samples were drawn by simple random sampling without replacement on the basis of the following two-step procedure. First, a simple random sample of $n_R$ raters was selected. This is equivalent to selecting a simple random sample without replacement of $n_R$ columns of the finite population matrix; the result is a matrix of size $N_T \times n_R$. Secondly, a simple random sample of $n_T$ targets was drawn, that is, a simple random sample of $n_T$ rows of the matrix obtained in the first step.
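The two-step selection can be sketched as below. The population and sample sizes used here are hypothetical placeholders, since they are not recoverable from the text, and the function name `draw_two_way_sample` is an assumption made for illustration.

```python
import numpy as np

def draw_two_way_sample(population, n_targets, n_raters, rng):
    """Select raters (columns) and then targets (rows), both without replacement."""
    N_T, N_R = population.shape
    cols = rng.choice(N_R, size=n_raters, replace=False)    # step 1: sample of raters
    rows = rng.choice(N_T, size=n_targets, replace=False)   # step 2: sample of targets
    return population[np.ix_(rows, cols)]

rng = np.random.default_rng(1)
probs = [0.10, 0.20, 0.35, 0.25, 0.10]                      # true values from Table 2
# Hypothetical population of 500 targets and 40 raters on a 5-point scale
population = rng.choice(np.arange(1, 6), size=(500, 40), p=probs)
sample = draw_two_way_sample(population, n_targets=20, n_raters=8, rng=rng)
print(sample.shape)
```
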
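In order to construct confidence intervals for the index $d$, both the asymptotic result in Proposition 4 and bootstrapping procedures are used. The bootstrap methods are described in points (2)-(4) below, where we assume that $B$ bootstrap samples are drawn from each initial sample $s$. Formally, confidence intervals for $d$ of level $1-\alpha$ have been constructed using the following methods:
- (1) Normal approximation. For the initial sample $s$ (for $s = 1, \dots, 1000$), the confidence interval based on the asymptotic normal approximation is given by
$$\left[\tilde d_s - z_{1-\alpha/2}\,\hat\sigma_s,\ \tilde d_s + z_{1-\alpha/2}\,\hat\sigma_s\right] \tag{25}$$
where $\tilde d_s$ and $\hat\sigma_s$ are the estimates of $d$ and of the standard deviation of the estimator, respectively.
- (2) Percentile method. For the initial sample $s$ (for $s = 1, \dots, 1000$), the confidence interval is obtained by taking the $\alpha/2$ and $1-\alpha/2$ quantiles of the bootstrap estimates. Formally
$$\left[\tilde d^{*}_{s,(\alpha/2)},\ \tilde d^{*}_{s,(1-\alpha/2)}\right] \tag{26}$$
- (3) Bootstrap-t interval. For the initial sample $s$ (for $s = 1, \dots, 1000$), the confidence interval is computed as follows
$$\left[\tilde d_s - t^{*}_{1-\alpha/2}\,\hat\sigma_s,\ \tilde d_s - t^{*}_{\alpha/2}\,\hat\sigma_s\right] \tag{27}$$
where $t^{*}_{\gamma}$ is the $100\gamma$th percentile of the distribution of $t^{*}_b$ (for $b = 1, \dots, B$) with
$$t^{*}_b = \frac{\tilde d^{*}_b - \tilde d_s}{\hat\sigma^{*}_b} \tag{28}$$
In (28) $\tilde d^{*}_b$ is the estimate of $d$ based on the $b$th bootstrap sample and $\hat\sigma^{*}_b$ is the standard error based on the data in the $b$th bootstrap sample.
- (4) Pivotal method. For the initial sample $s$ (for $s = 1, \dots, 1000$), the confidence interval is computed as follows
$$\left[2\tilde d_s - q^{*}_{1-\alpha/2},\ 2\tilde d_s - q^{*}_{\alpha/2}\right] \tag{29}$$
where $q^{*}_{\alpha/2}$ and $q^{*}_{1-\alpha/2}$ are the $\alpha/2$ and $1-\alpha/2$ quantiles of the bootstrap estimates $\tilde d^{*}_b$, for $b = 1, \dots, B$.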
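The four interval constructions listed above can be sketched with standard textbook formulas. The snippet assumes the usual forms of the normal, percentile, bootstrap-t and pivotal (basic) intervals; the function names are hypothetical, and the standard errors are taken as inputs because the variance formula is not reproduced here.

```python
import numpy as np
from statistics import NormalDist

def normal_ci(estimate, se, alpha=0.05):
    """Normal approximation: estimate -/+ z * se."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * se, estimate + z * se

def percentile_ci(boot, alpha=0.05):
    """Percentile method: quantiles of the bootstrap estimates."""
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def pivotal_ci(estimate, boot, alpha=0.05):
    """Pivotal method: bootstrap quantiles reflected around the estimate."""
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(2 * estimate - hi), float(2 * estimate - lo)

def bootstrap_t_ci(estimate, se, boot, boot_se, alpha=0.05):
    """Bootstrap-t: quantiles of the studentized bootstrap statistics."""
    t = (np.asarray(boot) - estimate) / np.asarray(boot_se)
    t_lo, t_hi = np.quantile(t, [alpha / 2, 1 - alpha / 2])
    return float(estimate - t_hi * se), float(estimate - t_lo * se)

boot = np.linspace(0.40, 0.60, 101)          # stand-in for B bootstrap estimates
print(percentile_ci(boot))
print(pivotal_ci(0.52, boot))
print(normal_ci(0.52, 0.05))
```

When the bootstrap estimates are centred below the original estimate, the percentile interval is shifted downwards while the pivotal interval is shifted upwards, which is one way unbalanced left-tail and right-tail error rates can arise.
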
As far as the methods described in points (2)-(4) are concerned, from each of the initial samples the bootstrap samples were selected according to the following methods:
- 1 Nonparametric bootstrap. From each initial sample $s$, the $b$th bootstrap sample is selected as follows: (i) a simple random sample with replacement of $n_R$ raters is selected from the original sample of raters; (ii) a simple random sample with replacement of $n_T$ targets is drawn from the original sample of targets.
- 2 Parametric bootstrap. From each initial sample $s$, the $b$th bootstrap sample is generated according to the multinomial model specified in Proposition 1.
- 3 Pseudo-nonparametric bootstrap. The nonparametric bootstrap described in point 1 is based on the assumption that the data are i.i.d., see [7]. Since survey data are not necessarily i.i.d., many bootstrap resampling methods have been proposed in the context of survey sampling. These methods are obtained by modifying the classical i.i.d. bootstrap in order to adapt it to survey data. For a review of bootstrap methods in the context of survey data, see [13]. The class of pseudo-population bootstrap methods consists in creating a pseudo-population by repeating the units of the initial sample and drawing from such a pseudo-population bootstrap samples with the same design as the initial one. In order to illustrate how a pseudo-population is constructed, let us assume that a simple random sample without replacement $s$ of size $n$ has been selected from a finite population of size $N$. A pseudo-population of size $N$ can be created by repeating the selected sample $N/n$ times. This method was first introduced by [4]. In practice $N/n$ is rarely an integer; for this case a method to build a pseudo-population of size $N$ was proposed by [1]. In this method, a pseudo-population is first constructed by replicating each unit of the original sample $s$ $k = \lfloor N/n \rfloor$ times. Then, the pseudo-population is completed by taking a simple random sample of size $N - nk$ without replacement from $s$. Taking into account the two-way sampling design of both targets and raters, the pseudo-population has been generated according to the following two-step procedure:
- Step 1: the ratings of the raters are reconstructed by replicating the columns of the original sample $N_R/n_R$ times. As a consequence, this first step generates a matrix with $n_T$ rows and $N_R$ columns;
- Step 2: the ratings of the targets are reconstructed by replicating the rows of the matrix obtained in Step 1 $N_T/n_T$ times.
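The two-step pseudo-population construction can be sketched as follows, assuming the completion rule attributed to [1]: each unit is replicated $\lfloor N/n \rfloor$ times and the remaining $N - n\lfloor N/n \rfloor$ units are drawn without replacement from the sample. The function names and the matrix sizes are hypothetical.

```python
import numpy as np

def pseudo_indices(n, N, rng):
    """Indices of the n sample units that make up a pseudo-population of size N."""
    k = N // n                                              # whole replications of every unit
    base = np.repeat(np.arange(n), k)
    extra = rng.choice(n, size=N - n * k, replace=False)    # completion, drawn without replacement
    return np.concatenate([base, extra])

def build_pseudo_population(sample, N_T, N_R, rng):
    """Replicate raters (columns) and then targets (rows) up to the population sizes."""
    n_T, n_R = sample.shape
    cols = pseudo_indices(n_R, N_R, rng)                    # Step 1: raters
    rows = pseudo_indices(n_T, N_T, rng)                    # Step 2: targets
    return sample[np.ix_(rows, cols)]

rng = np.random.default_rng(3)
sample = rng.integers(1, 6, size=(4, 3))                    # hypothetical 4 targets x 3 raters
pseudo = build_pseudo_population(sample, N_T=10, N_R=8, rng=rng)
print(pseudo.shape)
```

Bootstrap samples would then be drawn from `pseudo` with the same two-step design without replacement used for the initial sample.
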
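The accuracy of the confidence intervals has been evaluated by the following indicators, where $[L_s, U_s]$ denotes the interval computed from the $s$th sample.
- (1) Estimated coverage probability, in per cent, for the interval
$$CP = \frac{100}{1000}\sum_{s=1}^{1000} I(L_s \le d \le U_s) \tag{30}$$
- (2) Estimated left-tail and right-tail errors (lower and upper error rates), in per cent
$$LE = \frac{100}{1000}\sum_{s=1}^{1000} I(d < L_s), \qquad RE = \frac{100}{1000}\sum_{s=1}^{1000} I(d > U_s) \tag{31}$$
- (3) Estimated average length (AL) of all 1000 simulated intervals, given by
$$AL = \frac{1}{1000}\sum_{s=1}^{1000} (U_s - L_s) \tag{32}$$
where $I(A) = 1$ if $A$ is true and $0$ elsewhere.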
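The indicators can be computed from the simulated interval limits as in the sketch below. It assumes that a left-tail error means the true value falls below the lower limit and a right-tail error means it falls above the upper limit; the function name and the toy limits are hypothetical.

```python
import numpy as np

def interval_indicators(lower, upper, d_true):
    """Coverage, tail errors (in per cent) and average length of simulated intervals."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    cp = 100.0 * np.mean((lower <= d_true) & (d_true <= upper))   # interval covers d
    le = 100.0 * np.mean(d_true < lower)                          # d below the lower limit
    re = 100.0 * np.mean(d_true > upper)                          # d above the upper limit
    al = float(np.mean(upper - lower))
    return float(cp), float(le), float(re), al

# Four toy intervals evaluated against a true value of 0.61
cp, le, re, al = interval_indicators(
    lower=[0.50, 0.62, 0.30, 0.55],
    upper=[0.70, 0.80, 0.60, 0.65],
    d_true=0.61,
)
print(cp, le, re, al)
```

By construction the three percentages sum to 100, which is a quick consistency check on tables of simulation results.
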
4.1 Simulation results
Table 1 presents the outcomes of the simulation study. More specifically, the estimated coverage probabilities of the confidence intervals (CP), the estimated left-tail (LE) and right-tail (RE) errors (the nominal value is $\alpha/2$ for both) and the average length (AL) for the index $d$, when (), are reported. The value is equal to .
As reported in Table 1, the confidence intervals obtained with the normal approximation perform very well. The coverage probability (99.4%) is larger than the nominal value, with an average length of 0.16. Furthermore, the normal confidence intervals are simple to construct, as they do not require resampling from the initial sample. Figure 1 shows the kernel density of the index estimated from the 1000 original samples. The bandwidth selection rule is the one proposed by [17].
The percentile method performs well, with coverage probabilities between 91.2% and 92.8%. The worst methods are the bootstrap-t and pivotal methods. The lower and upper error rates, which give an idea of how skewed the distribution of the estimator is, are not well balanced. With regard to the methods used to generate the bootstrap samples, the performance of the parametric approach is strictly related to the estimation of the multinomial probabilities. As previously stressed, each row in the initial sample provides an estimate of the multinomial probabilities, and the mean of such estimates defines the estimated probabilities of the multinomial distribution used to generate the bootstrap samples as specified in Proposition 1. In Table 2, the minimum, the maximum, the mean and the standard deviation of the distribution of the estimated probabilities, computed from the original 1000 samples, are reported.
As Table 1 shows, the pseudo-nonparametric approach, which takes into account the sample selection effects, performs slightly better than the nonparametric approach both in terms of coverage probabilities and average lengths for all methods (bootstrap-t, percentile and pivotal).
Finally, note that in the nonparametric approach the resampling with replacement of raters replicates columns in the bootstrap sample, introducing a false agreement between raters and, as a consequence, an underestimation of $d$. This is shown in Table 3, where the means of the estimates over the original samples and over the bootstrap replications are reported.
Such means have been computed both for the original population with $d = 0.61$ and for a population with $d = 0.41$, showing that the magnitude of the bias also depends on the original degree of agreement between raters. That is, the higher the raters’ agreement (low values of $d$), the smaller the bias in the estimator introduced by the resampling with replacement. Clearly, such a bias is also present in the pseudo-nonparametric approach but with a smaller magnitude, thanks to the construction of the pseudo-population, which mitigates the phenomenon. As Table 3 shows, the parametric approach produces estimates with no bias.
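The downward bias caused by resampling raters with replacement can be reproduced with a small experiment. This is an illustrative sketch only: the sample sizes (20 targets, 8 raters) and the number of replications are hypothetical, and the estimator is assumed to be the bias-corrected plug-in form with factor $n_R/(n_R-1)$.

```python
import numpy as np

def d_tilde(X, K):
    """Bias-corrected normalized dispersion for a targets x raters matrix (assumed form)."""
    levels = np.arange(1, K)
    F = (X[:, :, None] <= levels).mean(axis=1)       # per-target cumulative proportions
    D_i = 2.0 * (F * (1.0 - F)).sum(axis=1)
    n_R = X.shape[1]
    return n_R / (n_R - 1) * 2.0 * D_i.mean() / (K - 1)

rng = np.random.default_rng(2)
probs = [0.10, 0.20, 0.35, 0.25, 0.10]               # true values from Table 2
X = rng.choice(np.arange(1, 6), size=(20, 8), p=probs)
estimate = d_tilde(X, 5)

boot = []
for _ in range(2000):
    cols = rng.integers(0, X.shape[1], size=X.shape[1])   # raters, with replacement
    rows = rng.integers(0, X.shape[0], size=X.shape[0])   # targets, with replacement
    boot.append(d_tilde(X[np.ix_(rows, cols)], 5))
boot_mean = float(np.mean(boot))

print(estimate, boot_mean, boot_mean / estimate)
```

Duplicated rater columns agree with themselves perfectly, so under these assumptions the bootstrap mean is expected to shrink by roughly the factor $(n_R-1)/n_R$, here $7/8$. The shrinkage is multiplicative, which is consistent with a smaller absolute bias when $d$ itself is small.
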
The simulation in Table 1 has been repeated for a population with $d = 0.41$. The results are reported in Table 4.
In conclusion, the most competitive method in terms of performance and computational time seems to be the normal approximation. Among the bootstrapping procedures, the percentile method in the parametric approach seems to perform best.
5 An application on real data: the assessment of language proficiency
The aim of this section is to apply the methodology illustrated in the previous sections to an empirical data set. We have analysed ratings obtained in a study conducted at Roma Tre University (see [15] for a detailed description). The main aim of the study was to investigate the applicability of a six-point Likert scale for functional adequacy (an aspect of language proficiency) developed by [8] to texts produced by native and non-native writers, and to different task types (narrative, instruction, and decision-making tasks). The scale comprises four subscales, corresponding to the four dimensions of functional adequacy identified by the authors of the scale: content, task requirements, comprehensibility, and coherence and cohesion (the reader is referred to [8] for a detailed presentation of scales and descriptors). Twenty native speakers of Italian (L1) and 20 non-native speakers of Italian (L2) participated in the study as writers. All the texts produced by the L1 and L2 writers (120 texts in total for the three tasks) were assessed by 7 native speakers of Italian on Kuiken and Vedder’s six-point Likert scale. The raters did not have any specific experience in judging written texts, and can therefore be categorized as non-expert. For our purposes, we have selected the ratings concerning only the narrative task and the comprehensibility subscale. To give a general idea of the subscale, the definitions of levels 1 and 6 are reported below:
- Level 1: The text is not at all comprehensible. Ideas and purposes are unclearly stated and the efforts of the reader to understand the text are ineffective.
- Level 6: The text is very easily comprehensible and highly readable. The ideas and the purpose are clearly stated.
The results of the interrater agreement analysis for the subscale are summarized in Table 5, where the intraclass correlation and the average values of , as defined in [10], the coefficient of variation $CV$, and are shown for the L1, L2 and total groups. The intraclass correlation indicates a low-moderate level of agreement for the total group (0.67). The results for the average values of (), () and () seem in accord with , while the average value of (0.87) highlights a higher level of agreement. As observed in [2], when the analysis focuses separately on the two subgroups of L1 and L2 writers, the results regarding the L1 group deserve particular attention. Interrater agreement measured by the intraclass correlation is very low in the L1 group (0.14). Analysing the dispersion of the ratings given to this subgroup, it turns out that most of the raters used almost exclusively levels 5 and 6 of the scale. Such a range restriction caused the very low value of the intraclass correlation, despite the substantial agreement among the raters, who scored all the L1 texts in the same high levels. This problem does not affect the results for the other indices of Table 5 (; ; ; ), which show a very good level of absolute agreement. Finally, the standard deviation of computed on the basis of formula (22) is equal to . As a consequence, the confidence interval using the normal approximation for the total group is and the error is at most .
6 Conclusions
In this paper a measure of interrater absolute agreement for ordinal scales is proposed. Such a measure is not affected by restriction of variance problems and does not depend on the choice of a particular null distribution. An unbiased estimator of the proposed measure is introduced and its sampling properties are investigated. In the simulation study, confidence intervals for the proposed interrater agreement index are constructed using the normal approximation and the parametric and nonparametric bootstrap. Furthermore, a pseudo-nonparametric bootstrap taking into account the sampling design is also implemented. As previously stressed, the resampling involves both the rater and the target samples. Confidence intervals obtained with the normal approximation seem to perform very well both in terms of coverage probability and computational cost.
References
- [1] Booth, J. G., Butler, R. W., Hall, P. (1994). Bootstrap methods for finite populations. Journal of the American Statistical Association, 89 (428), 1282–1289.
- [2] Bove, G., Nuzzo, E., Serafini, A. (2018). Measurement of interrater agreement for the assessment of language proficiency. In: Capecchi, S., Di Iorio, F., Simone, R. (eds.), ASMOD 2018: Proceedings of the Advanced Statistical Modelling for Ordinal Data Conference, Università Federico II di Napoli, 24-26 October 2018. Napoli: FedOA Press, 61–68.
- [3] Grilli, L., Rampichini, C. (2002). Scomposizione della dispersione per variabili statistiche ordinali [Dispersion decomposition for ordinal variables]. Statistica, 62, 111–116.
- [4] Gross, S. (1980). Median estimation in sample surveys. In Proceedings of the Section on Survey Research Methods, American Statistical Association, 181–184.
- [5] James, L. J., Demaree, R. G., Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.
- [6] James, L. J., Demaree, R. G., Wolf, G. (1993). rwg: An assessment of within-group interrater agreement. Journal of Applied Psychology, 78, 306–309.
- [7] Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7 (1), 1–26.
- [8] Kuiken, F., Vedder, I. (2017). Functional adequacy in L2 writing. Towards a new rating scale. Language Testing, 34, 321–336.
