Mitigating Observation Biases in Crowdsourced Label Aggregation
Ryosuke Ueda, Koh Takeuchi, Hisashi Kashima

TL;DR
This paper introduces statistical methods to reduce observation biases in crowdsourced labels, improving data quality by addressing response variability, spam, and collusion.
Contribution
It proposes novel bias removal techniques integrated with aggregation methods, enhancing accuracy and robustness in crowdsourced labeling tasks.
Findings
Improved aggregation accuracy under strong observation biases
Enhanced robustness against spam and colluding workers
Validated effectiveness on synthetic and real datasets
Table I: Number of classes, workers, instances, and labels per instance for each real dataset.
| Dataset | # classes | # workers | # instances | # labels per instance |
|---|---|---|---|---|
| (a) RTE | 2 | 164 | 800 | 10 |
| (b) TEMP | 2 | 76 | 462 | 10 |
| (c) WSD | 3 | 34 | 177 | 10 |
| (d) SP | 2 | 143 | 500 | 20 |
Table II: Correlation coefficients between worker propensity and accuracy.
| Dataset | (a) RTE | (b) TEMP | (c) WSD | (d) SP |
|---|---|---|---|---|
| Correlation | -0.384 | -0.377 | 0.062 | 0.097 |
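Table III: Aggregation accuracy on the real datasets when the number of labels per task is set to 2, 5, and 8.
| Dataset | RTE | RTE | RTE | TEMP | TEMP | TEMP | WSD | WSD | WSD | SP | SP | SP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of labels per task | 2 | 5 | 8 | 2 | 5 | 8 | 2 | 5 | 8 | 2 | 5 | 8 |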
| MV | 0.769 | 0.845 | 0.896 | 0.789 | 0.894 | 0.939 | 0.973 | 0.992 | 0.882 | 0.933 | 0.938 | |
| IPS-MV () | 0.845 | 0.902 | 0.825 | 0.894 | 0.939 | 0.979 | 0.880 | 0.933 | 0.937 | |||
| IPS-MV () | 0.867 | 0.908 | 0.825 | 0.905 | 0.937 | 0.979 | 0.992 | 0.993 | 0.880 | 0.933 | 0.938 | |
| IPS-MV () | 0.808 | 0.871 | 0.902 | 0.824 | 0.893 | 0.933 | 0.977 | 0.992 | 0.880 | 0.924 | 0.928 | |
| D&S | 0.757 | 0.899 | 0.925 | 0.842 | 0.988 | 0.989 | 0.993 | 0.900 | ||||
| IPS-D&S () | 0.767 | 0.835 | 0.941 | 0.984 | 0.988 | 0.991 | 0.902 | 0.937 | ||||
| IPS-D&S () | 0.781 | 0.898 | 0.926 | 0.844 | 0.926 | 0.937 | 0.980 | 0.986 | 0.989 | 0.902 | 0.935 | |
| IPS-D&S () | 0.798 | 0.889 | 0.922 | 0.925 | 0.939 | 0.988 | 0.989 | 0.993 | 0.901 | 0.928 | 0.938 | |
| GLAD | 0.788 | 0.894 | 0.921 | 0.835 | 0.925 | 0.940 | 0.934 | |||||
| IPS-GLAD () | 0.786 | 0.895 | 0.920 | 0.836 | 0.926 | 0.939 | 0.934 | |||||
| IPS-GLAD () | 0.890 | 0.911 | 0.846 | 0.923 | 0.935 | 0.982 | 0.993 | 0.900 | 0.934 | 0.941 | ||
| IPS-GLAD () | 0.884 | 0.910 | 0.843 | 0.921 | 0.936 | 0.988 | 0.992 | 0.891 | 0.924 | 0.928 | ||
Mitigating Observation Biases in
Crowdsourced Label Aggregation
Ryosuke Ueda
Kyoto University
Kyoto, Japan
Koh Takeuchi
Kyoto University
Kyoto, Japan
Hisashi Kashima
Kyoto University
Kyoto, Japan
Abstract
Crowdsourcing has been widely used to efficiently obtain labeled datasets for supervised learning from large numbers of human resources at low cost. However, one of the technical challenges in obtaining high-quality results from crowdsourcing is dealing with the variability and bias caused by the fact that humans perform the work, and various studies have addressed this issue by integrating redundantly collected responses to improve quality. In this study, we focus on the observation bias in crowdsourcing. The frequency of worker responses and the complexity of tasks vary, and these variations may affect the aggregation results when they are correlated with the quality of the responses. We also propose statistical aggregation methods for crowdsourcing responses that are combined with an observational data bias removal method used in causal inference. Through experiments using both synthetic and real datasets, with and without artificially injected spam and colluding workers, we verify that the proposed method improves the aggregation accuracy in the presence of strong observation biases and is robust to both spam and colluding workers.
© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
I Introduction
Owing to the rapid development of machine learning technologies, there has been growing demand for data-driven human decision support in various fields. In particular, prediction through supervised learning is one of the key techniques; however, its use requires accurate labels to train the prediction model. Because these labels often need to be provided by humans, data collection can be extremely expensive. Crowdsourcing is quite effective for obtaining a large amount of training data, for example, in entity resolution [1], text classification [2], and image recognition [3]. This is because crowdsourcing platforms such as Amazon Mechanical Turk can make large amounts of human resources available at a relatively low cost on the Internet.
One of the major challenges faced in training data collection using crowdsourcing is the variability in the reliability of the responses provided by the workers [4]. This is attributed to the significant variations in the abilities and motivations of the human workers, as well as the anonymity of crowdsourcing workers. In addition, there are spam workers who provide random answers without looking at the tasks [5, 6], as well as colluding workers who share their answers with other workers [7, 8, 9], which also contribute to the variability in the reliability. To reduce the impact of incorrect responses, several studies have attempted to improve the quality by aggregating multiple redundantly collected responses from different workers. This can be achieved using simple majority voting or statistical response aggregation methods that consider worker ability and the task difficulty [10, 11].
In crowdsourcing, workers generally do not need to answer all tasks. Depending on the knowledge and preferences of the workers, or the type and difficulty of the tasks, the questions they answer may be biased. In fact, we examined several public datasets and found a large variation in worker response frequency, as well as a biased relationship between this frequency and worker ability (Figure 3, Table II). In the aforementioned response aggregation methods, workers with a high response frequency have a large influence on the aggregation results; therefore, the aggregation results can also be biased toward such high-frequency workers. We investigate the effects of such biases on the performance of the aggregation methods, and attempt to improve the aggregation by removing bias when it has a negative influence. We propose methods combining inverse propensity scoring [12, 13], which is an observation bias removal method used in causal inference, with simple majority voting, the Dawid-and-Skene model (D&S) [10], and GLAD (Generative model of Labels, Abilities, and Difficulties) [11]. Our method estimates, from biased observed labels, the aggregation results that would be obtained under uniformly random observations.
Experiments using synthetic data suggest that bias removal improves the aggregation accuracy when there is a negative correlation between the observation rates of the worker responses and their agreement rates with the true labels; however, the opposite occurs when there is a positive correlation. In addition, an examination of some real datasets indicates weak negative correlations between the number of worker responses and the percentage of correct responses. For datasets with such a negative correlation, the proposed method outperforms the baseline when the number of labels is small and the effect of the observation bias is therefore larger. Because an analysis of real datasets suggests the existence of spam workers who provide a significant number of random answers, we created semi-synthetic datasets with enhanced observation biases by hypothetically adding such inappropriate workers. The results of the additional experiments indicate that the proposed methods are robust to the presence of both spam and malicious colluding workers.
Our contributions are summarized as follows:
- We investigate and control the effect of observation bias on the results of crowdsourced label aggregation.
- We propose an EM algorithm-based method for label aggregation using a novel lower bound that mitigates the observation bias.
II Related Work
Quality control is one of the major challenges of crowdsourced label aggregation, and how to handle the uncertainty brought by workers is an important aspect of the problem. For example, the Dawid-and-Skene model [10] models a worker as a confusion matrix, where an EM algorithm simultaneously estimates the confusion matrices and the ground truth instance labels. GLAD [11] considers instance difficulty as well as worker ability based on item response theory, and uses an EM algorithm to estimate the parameters and ground truth labels. Learning from crowds (LFC) [5] is a problem setting that directly learns a classifier from task instance features and crowdsourced labels. The ground truth estimates are also obtained as byproducts. LFC can be considered an extension of D&S to cases where task instance features are available. Additionally, various label aggregation algorithms based on Bayesian inference have been proposed that aim to deal with a small number of worker labels. Bayesian Classifier Combination (BCC) [14] is a Bayesian extension of D&S and uses MCMC for inference. Community BCC [15] further extends BCC to consider group structures within workers. In recent years, Enhanced BCC [16], which models correlation between workers, has shown better performance on many datasets. In addition, Li et al. [17] propose a Bayesian model without worker confusion matrices. Our aim in the present study is to investigate and control the effects of observational biases in crowdsourced label aggregation. To test the promise of the idea, we restrict our focus to basic aggregation methods, namely majority voting, D&S, and GLAD, as our base models. Although more modern Bayesian models such as BCC are the state-of-the-art methods, implementing the proposed idea on them is not necessarily straightforward, and is a subject for future research.
Recent work explores the effect of biases in crowdsourced label aggregation induced by crowdworkers being humans. Eickhoff [18] investigates the impact of several cognitive biases, such as the bandwagon effect, in crowdsourced experiments, and shows that inappropriate task design leads to poor accuracy. Zhuang and Young [19] show that when multiple tasks are annotated by a worker as a single batch, the combination of tasks in a batch can affect the response. In addition, both crowdsourcing [20] and psychological [21] experiments showed that previous tasks affect the responses when tasks are performed sequentially. Biswas et al. [22] focus on worker race in a defendant recidivism prediction task, reporting that classifiers are fair when trained with data from a racially balanced worker distribution. In addition, several studies show the existence of a “confirmation bias,” in which workers favor choices that fit their beliefs [23, 24, 25]. In contrast, we focus on the observation bias and investigate and control its effects in this study.
In crowdsourcing, it is desirable to avoid spammers who respond randomly to many instances for reward, and the presence of spammers reinforces the observational bias. Kittur et al. [6] show the effectiveness of introducing a rating task at the beginning of the entire task. Raykar and Yu [5] propose scores for spammer detection.
Several studies explore issues related to observation bias. Han et al. [26] investigate “task abandonment,” in which a worker starts a task but never submits it. Difallah et al. [7] use the propensity score to estimate the number of workers on Amazon Mechanical Turk. Schnabel et al. [27] mitigate observation bias in recommendation systems using weighting by the inverse of the propensity score.
III Problem Setting
In this study, we consider a standard problem setting for crowdsourced label aggregation. Suppose we have $N$ task instances, such as a set of images or texts. Each task instance belongs to one of $K$ different classes $\{1, \dots, K\}$; the ground truth label of the $i$-th instance is denoted by $y_i$. We assume that the ground truth labels $\{y_i\}_{i=1}^{N}$ of the task instances are unknown.
We ask $M$ crowdsourcing workers to give labels to the task instances; $l_{ij}$ denotes the label given by worker $j$ to task instance $i$. The workers do not necessarily have to give labels to all task instances, i.e., $l_{ij}$ is occasionally missing. We introduce $o_{ij}$ as a variable to indicate whether the label $l_{ij}$ is obtained; we set $o_{ij} = 1$ when $l_{ij}$ is observed; otherwise, we set $o_{ij} = 0$.
Our goal is to estimate the ground truth labels $\{y_i\}_{i=1}^{N}$ by aggregating the set of crowdsourced labels $\{l_{ij} \mid o_{ij} = 1\}$. Although the simplest way to aggregate the crowdsourced labels is majority voting, in recent years more sophisticated probabilistic generative models have been used. A number of models have also been proposed that consider various factors such as worker ability and task difficulty, and that estimate these parameters along with the ground truth answers [28, 29]. However, such models do not account for observation biases, and workers who respond more frequently have a larger impact on the results. The accuracy may also decrease when the response frequency is correlated with the reliability.
IV Proposed Method
We propose the use of response aggregation methods to reduce the effect of observation bias discussed in the previous section. We combine inverse propensity scoring (IPS) [12, 13], which is used to remove observation bias in causal inference, with three aggregation methods: simple majority voting, D&S and GLAD.
IV-A Majority-Voting-based Method: IPS-MV
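First, we propose IPS-MV, which combines the simple majority voting (MV) method with IPS. With the simple MV method, we obtain the aggregate label $\hat{y}_i$ for task instance $i$ as
$$\hat{y}_i = \operatorname*{arg\,max}_{k \in \{1, \dots, K\}} \sum_{j=1}^{M} o_{ij}\, \mathbb{1}[l_{ij} = k], \tag{1}$$
where $\mathbb{1}[\cdot]$ represents the indicator function, $l_{ij}$ is the label given by worker $j$ to instance $i$, and $o_{ij} \in \{0, 1\}$ indicates whether that label is observed.
When we adopt IPS, instead of the equally weighted aggregation formula (1), we apply its weighted version:
$$\hat{y}_i = \operatorname*{arg\,max}_{k \in \{1, \dots, K\}} \sum_{j=1}^{M} \frac{o_{ij}}{p_{ij}}\, \mathbb{1}[l_{ij} = k],$$
where $p_{ij}$ is the (estimated) probability that worker $j$ answers instance $i$, which is called the propensity score.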
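The difference between plain and propensity-weighted voting can be illustrated with a minimal Python sketch. The function name `aggregate`, the dictionary layout, the tie-breaking rule, and the toy propensity values are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def aggregate(labels, propensity=None):
    """Weighted majority vote over observed labels.

    labels: dict mapping (instance, worker) -> observed label.
    propensity: optional dict mapping (instance, worker) -> estimated
        probability that the worker answers the instance. When omitted,
        every vote has weight 1, which is plain majority voting.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for (instance, worker), label in labels.items():
        if propensity is None:
            weight = 1.0
        else:
            weight = 1.0 / propensity[(instance, worker)]
        scores[instance][label] += weight
    # Ties are broken toward the smallest label (an arbitrary choice).
    return {
        instance: min(votes, key=lambda k: (-votes[k], k))
        for instance, votes in scores.items()
    }


# Hypothetical toy data: workers "a" and "b" answer often, "c" rarely.
labels = {(0, "a"): 1, (0, "b"): 1, (0, "c"): 0}
propensity = {(0, "a"): 0.9, (0, "b"): 0.9, (0, "c"): 0.2}

mv = aggregate(labels)
ips_mv = aggregate(labels, propensity)
```

In this toy case the two frequent workers outvote the rare worker under plain MV, whereas the inverse weights 1/0.9 and 1/0.2 reverse the outcome under the weighted vote.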
IV-B EM-Algorithm-based Method: IPS-D&S and IPS-GLAD
Next, we propose IPS-D&S and IPS-GLAD, which combine IPS with D&S and GLAD, well-known label aggregation methods using the EM algorithm, in order to mitigate observation bias. D&S [10] is one of the early representative approaches to this problem, and shows high accuracy in decision-making tasks [30]. GLAD [11] is an algorithm that extends D&S to simultaneously handle worker ability and task difficulty. Although D&S and GLAD have different label generation assumptions, they can be estimated with the same EM algorithm framework.
Figure 1a shows the graphical model of D&S; $\pi^{(j)}_{kk'}$ indicates the conditional probability that worker $j$ will respond with label $k'$ given true label $k$. Hence, the likelihood of the label $l_{ij}$ is given as
$$p(l_{ij} = k' \mid y_i = k, \pi^{(j)}) = \pi^{(j)}_{kk'}.$$
In D&S, we optimize the label prior $\rho_k = p(y_i = k)$ as a parameter.
The graphical model of GLAD is shown in Figure 1b. In the GLAD model, the probability that worker $j$ provides the correct answer to instance $i$ depends on the worker’s ability $\alpha_j$ and the task difficulty $\beta_i$, and is specifically defined as
$$p(l_{ij} = y_i \mid \alpha_j, \beta_i) = \sigma(\alpha_j \beta_i),$$
where $\sigma$ denotes the logistic function. When the answer is incorrect, it is sampled from a uniform distribution over incorrect labels. With $K$ classes, the probability that a particular wrong answer $k \neq y_i$ is given is
$$p(l_{ij} = k \mid y_i, \alpha_j, \beta_i) = \frac{1 - \sigma(\alpha_j \beta_i)}{K - 1}.$$
We set the label prior to a uniform distribution in this study.
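As a small illustration of the GLAD response model described above, the following Python sketch evaluates the probability of each possible response. The function and argument names are hypothetical, and the numeric values are arbitrary.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def glad_label_prob(label, true_label, ability, difficulty, num_classes):
    """Probability that a worker responds `label` under the GLAD model.

    The correct label has probability sigmoid(ability * difficulty);
    the remaining mass is spread uniformly over the incorrect labels.
    """
    p_correct = sigmoid(ability * difficulty)
    if label == true_label:
        return p_correct
    return (1.0 - p_correct) / (num_classes - 1)


# Response distribution for a three-class task whose true label is 0.
probs = [
    glad_label_prob(k, true_label=0, ability=1.5, difficulty=2.0, num_classes=3)
    for k in range(3)
]
```

The three probabilities sum to one, and a worker with zero ability answers a binary task correctly with probability 0.5 regardless of the difficulty parameter.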
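Let $\theta$ be the unobserved parameters other than $y$ and $\rho$ (i.e., $\pi$ in D&S, and $\alpha$ and $\beta$ in GLAD). The maximum likelihood estimation is used to estimate the distribution of the true label $q(y_i)$, the model parameter $\theta$, and (for D&S only) the prior distribution $\rho$. Instead of maximizing the log likelihood $\log p(L \mid \theta, \rho)$ of the observed labels $L = \{l_{ij} \mid o_{ij} = 1\}$, its lower bound:
$$\mathcal{L}(q, \theta, \rho) = \sum_{i=1}^{N} \sum_{k=1}^{K} q(y_i = k) \log \frac{\rho_k \prod_{j : o_{ij} = 1} p(l_{ij} \mid y_i = k, \theta)}{q(y_i = k)}$$
is maximized using the EM algorithm.
The lower bound is rewritten as
$$\mathcal{L}(q, \theta, \rho) = \sum_{i=1}^{N} \sum_{k=1}^{K} q(y_i = k) \left[ \log \frac{\rho_k}{q(y_i = k)} + \sum_{j=1}^{M} o_{ij} \log p(l_{ij} \mid y_i = k, \theta) \right],$$
where $o_{ij} \in \{0, 1\}$ indicates whether worker $j$ labeled instance $i$. Because $\mathbb{E}[o_{ij}] = p_{ij}$, this indicates that the propensity score $p_{ij}$ implicitly weights the lower bound $\mathcal{L}$, and the lower bound is biased when the propensity scores are biased. Instead, we use the following unbiased lower bound $\tilde{\mathcal{L}}$:
$$\tilde{\mathcal{L}}(q, \theta, \rho) = \sum_{i=1}^{N} \sum_{k=1}^{K} q(y_i = k) \left[ \log \frac{\rho_k}{q(y_i = k)} + \sum_{j=1}^{M} \frac{o_{ij}}{p_{ij}} \log p(l_{ij} \mid y_i = k, \theta) \right]. \tag{2}$$
Because the expected value for $\tilde{\mathcal{L}}$ over $O = \{o_{ij}\}$ is
$$\mathbb{E}_{O}\left[\tilde{\mathcal{L}}(q, \theta, \rho)\right] = \sum_{i=1}^{N} \sum_{k=1}^{K} q(y_i = k) \left[ \log \frac{\rho_k}{q(y_i = k)} + \sum_{j=1}^{M} \log p(l_{ij} \mid y_i = k, \theta) \right],$$
the new lower bound (2) is unbiased w.r.t. the uniform distribution. The new lower bound can also be maximized using the EM algorithm.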
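One way to read the weighted bound is that each observed label enters both EM steps with weight equal to the inverse of its propensity score. The Python sketch below applies this idea to a D&S-style model. It is an illustration under stated assumptions rather than the authors' implementation: the function name, the additive smoothing constant, the weighted-vote initialization, and the fixed iteration count are all choices made here.

```python
import math


def ips_dawid_skene(labels, propensity, num_classes, n_iter=50, smoothing=0.01):
    """EM for a Dawid-Skene-style model with inverse-propensity weights.

    labels: dict (instance, worker) -> label in range(num_classes).
    propensity: dict (instance, worker) -> probability of observation.
    Returns (predicted labels, posterior q) as dicts keyed by instance.
    """
    K = num_classes
    instances = sorted({i for i, _ in labels})
    workers = sorted({j for _, j in labels})
    weight = {key: 1.0 / propensity[key] for key in labels}

    # Initialise the posterior q with smoothed, weighted vote shares.
    q = {i: [smoothing] * K for i in instances}
    for (i, j), l in labels.items():
        q[i][l] += weight[(i, j)]
    for i in instances:
        total = sum(q[i])
        q[i] = [v / total for v in q[i]]

    for _ in range(n_iter):
        # M-step: smoothed class prior and weighted confusion matrices.
        denom = K * smoothing + len(instances)
        rho = [(smoothing + sum(q[i][k] for i in instances)) / denom
               for k in range(K)]
        counts = {j: [[smoothing] * K for _ in range(K)] for j in workers}
        for (i, j), l in labels.items():
            for k in range(K):
                counts[j][k][l] += weight[(i, j)] * q[i][k]
        pi = {j: [[c / sum(row) for c in row] for row in counts[j]]
              for j in workers}

        # E-step: posterior from weighted log-likelihood terms.
        for i in instances:
            q[i] = [math.log(rho[k]) for k in range(K)]
        for (i, j), l in labels.items():
            for k in range(K):
                q[i][k] += weight[(i, j)] * math.log(pi[j][k][l])
        for i in instances:
            top = max(q[i])
            unnorm = [math.exp(v - top) for v in q[i]]
            total = sum(unnorm)
            q[i] = [v / total for v in unnorm]

    predicted = {i: max(range(K), key=lambda k: q[i][k]) for i in instances}
    return predicted, q


# Hypothetical toy data: "a" and "b" are always right but answer rarely
# (propensity 0.5); "s" always answers 0 and answers everything.
truth = [0, 1, 0, 1]
labels = {}
for i, y in enumerate(truth):
    labels[(i, "a")] = y
    labels[(i, "b")] = y
    labels[(i, "s")] = 0
propensity = {key: (1.0 if key[1] == "s" else 0.5) for key in labels}
predicted, posterior = ips_dawid_skene(labels, propensity, num_classes=2)
```

Setting every propensity to 1 recovers an unweighted EM loop of the same form, which makes the effect of the weights easy to compare on the same data.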
IV-C Propensity Score Estimation
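Because true propensity scores are not always available in practice, we estimate them using a 1-bit matrix completion (1-bit MC) [31], which applies a matrix completion under a nuclear-norm ($\|\cdot\|_*$) constraint. The 1-bit MC approximates $p_{ij}$ as $\sigma(A_{ij})$ using a matrix $A \in \mathbb{R}^{N \times M}$, where $\sigma$ is the logistic function. The optimization problem w.r.t. $A$ is given as
$$\hat{A} = \operatorname*{arg\,max}_{A} \sum_{i=1}^{N} \sum_{j=1}^{M} \left[ o_{ij} \log \sigma(A_{ij}) + (1 - o_{ij}) \log\left(1 - \sigma(A_{ij})\right) \right] \quad \text{s.t.} \quad \|A\|_* \le \tau,$$
where $\tau$ is a hyperparameter and $o_{ij} \in \{0, 1\}$ indicates whether worker $j$ labeled instance $i$. It is known that the nuclear norm is a convex relaxation of the rank constraint. As $\tau$ is decreased, $\hat{A}$ approaches a low-rank matrix and eventually approaches a zero matrix; on the other hand, as $\tau$ is increased, the constraint is relaxed and $\sigma(\hat{A})$ approaches the observation matrix $O = \{o_{ij}\}$.
We use the estimated $\hat{p}_{ij} = \sigma(\hat{A}_{ij})$ as an approximation of the propensity score $p_{ij}$.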
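A nuclear-norm-constrained logistic fit of this kind can be sketched with projected gradient ascent in NumPy. The solver, step size, iteration count, and the exact form of the constraint (nuclear norm at most `tau`) are assumptions made here for illustration; they are not necessarily those of [31] or of the experiments in this paper.

```python
import numpy as np


def project_l1_ball(v, radius):
    """Project a nonnegative vector onto {x >= 0, sum(x) <= radius}."""
    if radius <= 0:
        return np.zeros_like(v)
    if v.sum() <= radius:
        return v
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (css - radius) / ks > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)
    return np.maximum(v - theta, 0.0)


def one_bit_mc(observed, tau, n_iter=200, step=1.0):
    """Estimate observation probabilities from a 0/1 observation matrix.

    Maximises the Bernoulli log-likelihood of `observed` under
    p = sigmoid(A), keeping the nuclear norm of A at most `tau`.
    Returns (estimated probabilities, fitted matrix A).
    """
    A = np.zeros_like(observed, dtype=float)
    for _ in range(n_iter):
        # Gradient of the log-likelihood with respect to A.
        grad = observed - 1.0 / (1.0 + np.exp(-A))
        A = A + step * grad
        # Project onto the nuclear-norm ball by shrinking singular values.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        A = (U * project_l1_ball(s, tau)) @ Vt
    return 1.0 / (1.0 + np.exp(-A)), A


# Hypothetical observation pattern: 3 instances by 4 workers.
observed = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 1, 1]], dtype=float)
p_zero, A_zero = one_bit_mc(observed, tau=0.0)
p_mid, A_mid = one_bit_mc(observed, tau=2.0)
p_big, A_big = one_bit_mc(observed, tau=1e6)
```

With `tau=0.0` every estimate collapses to 0.5, and with a very large `tau` the estimates move toward the observed 0/1 pattern, which mirrors the two limiting cases described in the text.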
V Experiments
The proposed method estimates unbiased results by giving smaller weights to responses that are more likely to be observed. Depending on the relationship between the propensity and percentage of correct answers, the proposed model is expected to give different aggregation results from the base model. Therefore, we first investigate the relationship between the aggregation accuracy and correlations of propensity and the percentage of correct answers using synthetic datasets. We further compare the results of different methods on real datasets. Finally, we conduct experiments using semi-synthetic data with both virtual spam and colluding workers to investigate the robustness against such harmful workers.
V-A Experiments Using Synthetic Data
First, we investigate the impact of the relationship between the propensity and accuracy on the aggregation performance of the proposed method using synthetic datasets. A synthetic dataset includes 20 workers and 100 task instances. The ground truth labels are uniformly sampled at random over binary classes (). For each -pair, we sample the observation probability and the correct answer probability from the two-dimensional Gaussian distribution with the mean and variance so that the average number of labels per instance is 3. The covariances are determined according to the correlation coefficient between -1 and 1. The sampled parameters are clipped to within .
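The generation procedure can be sketched in Python as follows. Several specifics are assumptions made for illustration because they are not recoverable from the text: the mean correct-answer probability (0.7), the common standard deviation (0.1), and the clipping range [0.01, 0.99]. The mean observation probability of 0.15 follows from 3 labels per instance over 20 workers.

```python
import math
import random


def make_synthetic(corr, n_workers=20, n_instances=100, mean_obs=0.15,
                   mean_acc=0.7, std=0.1, seed=0):
    """Sample binary labels whose observation and correctness
    probabilities are correlated with coefficient `corr`.

    Returns (truth, labels, propensity), where labels holds only the
    observed (instance, worker) pairs and propensity holds all pairs.
    """
    rng = random.Random(seed)

    def clip(x):
        return min(max(x, 0.01), 0.99)

    truth = [rng.randrange(2) for _ in range(n_instances)]
    labels, propensity = {}, {}
    for i in range(n_instances):
        for j in range(n_workers):
            # Two standard normals combined to reach the target correlation.
            z1 = rng.gauss(0.0, 1.0)
            z2 = rng.gauss(0.0, 1.0)
            p_obs = clip(mean_obs + std * z1)
            p_acc = clip(mean_acc
                         + std * (corr * z1 + math.sqrt(1.0 - corr ** 2) * z2))
            propensity[(i, j)] = p_obs
            if rng.random() < p_obs:
                correct = rng.random() < p_acc
                labels[(i, j)] = truth[i] if correct else 1 - truth[i]
    return truth, labels, propensity


truth, labels, propensity = make_synthetic(corr=-0.8)
```

Clipping slightly distorts the target correlation near the boundaries, so the realized correlation is only approximately `corr`.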
We compare the accuracy of the simple MV and the proposed bias-mitigating MV (IPS-MV) using the true observation rate as the propensity score. The data generation and estimation are repeated 1000 times, and their average accuracy is compared. In this experiment, we investigate the dependency between the correlation and the accuracy, and do not consider worker ability. Hence, D&S and GLAD are not used. Figure 2 shows the average accuracy when the correlation between the observation probability and correct answer probability is varied. The stronger the negative correlation is, the more the proposed IPS-MV method outperforms MV; by contrast, MV is more accurate for non-negative correlations. The experimental results indicate that the relationship between the propensity and ability has a significant impact on the aggregation accuracy. Note that since the correlation estimation requires ground-truth labels, it is usually impossible to determine in advance whether the proposed method will be effective.
V-B Experiments Using Real Data
The previous experiments on artificial data suggested that the removal of observation bias leads to an improvement when there is a negative correlation between the observation probability and the correct answer probability; therefore, we tested this hypothesis using the four real datasets: (a) Recognizing Textual Entailment (RTE) [32], (b) Temporal Ordering (TEMP) [32], (c) Word Sense Disambiguation (WSD) [32], and (d) Sentiment Popularity (SP; https://eprints.soton.ac.uk/376544/). Table I shows the number of classes, workers, instances, and the number of labels per instance for each dataset. Figure 3 shows a plot of worker propensity versus accuracy, and Table II shows their correlation coefficients. Among the four datasets, (c) WSD and (d) SP do not show much correlation, whereas (a) RTE and (b) TEMP show negative correlations, thus suggesting the occurrence of observation biases. In addition, Figure 3 shows the presence of spammers answering many tasks with correct answer rates around 50% for binary questions (that is, the chance level) in the (a) RTE and (b) TEMP datasets.
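A correlation of this kind can be computed from a labeled dataset as in the following Python sketch. It uses each worker's number of responses as a stand-in for propensity and the Pearson coefficient as the correlation measure; both choices, along with the function names and the toy data, are assumptions made for illustration.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def response_accuracy_correlation(labels, truth):
    """Correlate per-worker response counts with per-worker accuracy.

    labels: dict (instance, worker) -> label; truth: list of true labels.
    """
    n_answers = defaultdict(int)
    n_correct = defaultdict(int)
    for (i, j), label in labels.items():
        n_answers[j] += 1
        n_correct[j] += int(label == truth[i])
    workers = sorted(n_answers)
    counts = [n_answers[j] for j in workers]
    accuracies = [n_correct[j] / n_answers[j] for j in workers]
    return pearson(counts, accuracies)


# Hypothetical toy data: the most active worker "a" is the least accurate.
truth = [0, 1, 0, 1]
labels = {
    (0, "a"): 0, (1, "a"): 1, (2, "a"): 1, (3, "a"): 0,
    (0, "b"): 0, (1, "b"): 1,
    (2, "c"): 0,
}
r = response_accuracy_correlation(labels, truth)
```

Note that this computation needs ground-truth labels, so it is only available for benchmark datasets or after the fact.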
Even though crowdsourcing is relatively low cost, we want to obtain high aggregation accuracy with as few labels as possible; however, aggregation with fewer labels is more sensitive to observation bias. To investigate this situation, we conduct experiments using randomly sampled subsets of the original datasets. The random sampling is conducted five times and the accuracy is averaged. For propensity score estimation, we use the 1-bit MC. Table III shows the results when the number of labels per task is set to 2, 5, and 8. For (a) RTE and (b) TEMP, which show a weakly negative correlation in Table II, the proposed method shows a higher accuracy than the baseline. The difference in accuracy is particularly large when the number of labels is small. With two labels per task in RTE, IPS-MV outperforms MV by up to 4.0 percentage points, IPS-D&S outperforms D&S by up to 4.1 percentage points, and IPS-GLAD outperforms GLAD by up to 2.1 percentage points. By contrast, in (c) WSD and (d) SP, the accuracy of the proposed method is equal to or slightly lower than the baseline.
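The subsampling step might look like the following Python sketch, which keeps a fixed number of randomly chosen labels for each task. Sampling independently per task without replacement is an assumption, as are the function name and the toy data.

```python
import random
from collections import defaultdict


def subsample_labels(labels, labels_per_task, seed=0):
    """Keep at most `labels_per_task` randomly chosen labels per task.

    labels: dict (instance, worker) -> label.
    """
    rng = random.Random(seed)
    workers_by_instance = defaultdict(list)
    for instance, worker in sorted(labels):
        workers_by_instance[instance].append(worker)
    sampled = {}
    for instance, workers in workers_by_instance.items():
        kept = rng.sample(workers, min(labels_per_task, len(workers)))
        for worker in kept:
            sampled[(instance, worker)] = labels[(instance, worker)]
    return sampled


# Hypothetical dataset: 3 tasks, each labeled by 10 workers.
full = {(i, f"w{j}"): (i + j) % 2 for i in range(3) for j in range(10)}
subset = subsample_labels(full, labels_per_task=2)
```

Repeating the call with different seeds and averaging the resulting accuracies corresponds to the repeated random sampling described above.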
V-C Robustness against Harmful Workers
The previous analyses of the real datasets suggest the existence of spammers in the RTE and TEMP datasets, which can be one of the factors causing observation biases. To investigate the effect of such observation biases, we conducted experiments using semi-synthetic data with two types of harmful workers: spam workers and colluding workers. The synthetic spam workers and colluding workers respond to all tasks. The spam worker labels are sampled from a uniform distribution over the class labels. The colluding workers are more malicious, and they collude and try to guide the outcome of the majority vote [7, 8, 9]. In our experiment, a label is sampled from a uniform distribution over the class labels, and all the colluding workers respond with the same label. We continue adding spam and colluding workers until the malicious worker labels make up 50% of all labels.
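The two kinds of injected workers can be sketched in Python as follows. The function names and worker identifiers are hypothetical, and drawing the shared colluding label separately for each task is an assumption about a detail the description leaves open.

```python
import random


def add_spam_workers(labels, n_instances, num_classes, n_spam, seed=0):
    """Add workers who answer every task uniformly at random."""
    rng = random.Random(seed)
    out = dict(labels)
    for s in range(n_spam):
        for i in range(n_instances):
            out[(i, f"spam{s}")] = rng.randrange(num_classes)
    return out


def add_colluding_workers(labels, n_instances, num_classes, n_colluders,
                          seed=0):
    """Add workers who answer every task with one shared random label."""
    rng = random.Random(seed)
    out = dict(labels)
    shared = [rng.randrange(num_classes) for _ in range(n_instances)]
    for c in range(n_colluders):
        for i in range(n_instances):
            out[(i, f"colluder{c}")] = shared[i]
    return out


# Hypothetical base dataset: 5 binary tasks, each labeled by one worker.
base = {(i, "w0"): 0 for i in range(5)}
with_spam = add_spam_workers(base, n_instances=5, num_classes=2, n_spam=3)
with_colluders = add_colluding_workers(base, n_instances=5, num_classes=2,
                                       n_colluders=3)
```

Because the injected workers answer every task, their observation rate is 1, which is what makes inverse propensity weighting reduce their influence relative to workers who answer only a few tasks.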
Figures 4 and 5 show the accuracy when the numbers of added spam workers and colluding workers are varied, respectively. The accuracy is obtained as the average of five trials of random data generation. In Figures 4a and 4b, as the number of spam workers increases, the performance of the simple MV degrades, whereas IPS-MV () remains robust against them. In Figures 4c and 4d, IPS-MV and IPS-D&S show unstable results. This is probably due to the extremely high accuracy in (c) WSD and the large number of original labels in (d) SP. Figure 4 also shows that D&S, IPS-D&S, GLAD, and IPS-GLAD are consistently robust against spammers since all of them consider worker ability.
The robustness against harmful workers is more significant for colluding workers. Figure 5 shows the accuracy when colluding workers exist. In contrast to the previous experiment, not only MV but also D&S and GLAD decrease significantly in accuracy across all the datasets as the number of colluding workers increases. On the other hand, combining the proposed method with MV, D&S, and GLAD consistently improves performance, especially at .
VI Conclusion
We investigated the effect of observation bias, as well as how to deal with such bias, in crowdsourcing response aggregation. By introducing IPS into the response aggregation methods (majority voting, D&S, and GLAD), we proposed response aggregation methods that mitigate observation bias. Experiments on synthetic and real data show that the proposed method is effective when a negative correlation exists between the correct answer rate and the observation rate. By adding spam and colluding workers to the real datasets, we also demonstrated that the proposed method is robust against such harmful workers. Since the main focus of this study is to investigate and mitigate the observation bias, we restricted ourselves to rather classical label aggregation methods. In the future, we will study more modern and sophisticated aggregation methods.
Acknowledgment
This work was supported by JST CREST Grant Number JPMJCR21D1.
References
- [1] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang, “CDAS: a crowdsourcing data analytics system,” Proceedings of the International Conference on Very Large Data Bases (VLDB), vol. 5, no. 10, pp. 1040–1051, 2012.
- [2] J. Wang, T. Kraska, M. J. Franklin, and J. Feng, “Crowd ER: crowdsourcing entity resolution,” Proceedings of the International Conference on Very Large Data Bases (VLDB), vol. 5, no. 11, pp. 1483–1494, 2012.
- [3] H. Su, J. Deng, and L. Fei-Fei, “Crowdsourcing annotations for visual object detection,” in Workshops at the AAAI Conference on Artificial Intelligence, 2012.
- [4] G. Li, J. Wang, Y. Zheng, and M. J. Franklin, “Crowdsourced data management: A survey,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 9, pp. 2296–2319, 2016.
- [5] V. C. Raykar and S. Yu, “Eliminating spammers and ranking annotators for crowdsourced labeling tasks,” Journal of Machine Learning Research, vol. 13, pp. 491–518, 2012.
- [6] A. Kittur, E. H. Chi, and B. Suh, “Crowdsourcing user studies with mechanical turk,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI), 2008, pp. 453–456.
- [7] D. E. Difallah, G. Demartini, and P. Cudré-Mauroux, “Mechanical cheat: Spamming schemes and adversarial techniques on crowdsourcing platforms,” in Proceedings of the International Workshop on Crowdsourcing Web Search (Crowd Search), 2012, pp. 26–30.
- [8] P.-P. Chen, H.-L. Sun, Y.-L. Fang, and J.-P. Huai, “Collusion-proof result inference in crowdsourcing,” Journal of Computer Science and Technology, vol. 33, no. 2, pp. 351–365, 2018.
