TL;DR
This paper introduces a probabilistic model for yes/no crowdsourcing queries in multi-class classification, enabling effective label estimation with shorter, less demanding questions, and demonstrates its effectiveness on real datasets.
Contribution
The work presents a novel probabilistic framework for yes/no queries in crowdsourcing, including an approximate inference method and validation on real-world scenarios.
Findings
Model achieves comparable accuracy to full query methods.
Model effectively estimates true classes by accounting for labeler failures.
Provides publicly available datasets and code for further research.
| MACHO class | Objects | The Catalina Surveys class | Objects | Animals class | Objects |
|---|---|---|---|---|---|
| EB | 104 | CEP | 119 | Mammal | 232 |
| BE | 57 | RRLYR | 99 | Bird | 73 |
| LPB | 49 | EB | 80 | Amphibian | 31 |
| CEP | 40 | LPV | 60 | Reptile | 23 |
A Full Probabilistic Model for Yes/No Type Crowdsourcing
in Multi-Class Classification
Belen Saldias-Fuentes Computer Science Department, Pontificia Universidad Católica de Chile, Santiago, Chile. MIT Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA (present affiliation, [email protected]).
Pavlos Protopapas Institute for Applied Computational Science, Harvard University, Cambridge, MA, USA.
Karim Pichara B.
Abstract
Crowdsourcing has become widely used in supervised scenarios where training sets are scarce and difficult to obtain. Most crowdsourcing models in the literature assume labelers can provide answers to full questions. In classification contexts, full questions require a labeler to discern among all possible classes. Unfortunately, discernment is not always easy in realistic scenarios. Labelers may not be experts in differentiating all classes. In this work, we provide a full probabilistic model for a shorter type of queries. Our shorter queries only require “yes” or “no” responses. Our model estimates a joint posterior distribution of matrices related to labelers’ confusions and the posterior probability of the class of every object. We developed an approximate inference approach, using Monte Carlo Sampling and Black Box Variational Inference, which provides the derivation of the necessary gradients. We built two realistic crowdsourcing scenarios to test our model. The first scenario queries for irregular astronomical time-series. The second scenario relies on the image classification of animals. We achieved results that are comparable with those of full query crowdsourcing. Furthermore, we show that modeling labelers’ failures plays an important role in estimating true classes. Finally, we provide the community with two real datasets obtained from our crowdsourcing experiments. All our code is publicly available at https://github.com/bcsaldias/yes-no-crowdsourcing.
1 Introduction.
Labeled data is the very first requirement for training classifiers. Moreover, the availability of data has stimulated great breakthroughs in AI. For example, convolutional neural networks (CNNs) were first proposed by [16], but only when ImageNet [5] achieved a corpus of 1.5 million labeled images could Google’s GoogLeNet [15] perform object classification almost as well as humans by using CNNs. This encouraged us to create new mechanisms for producing labels. Nevertheless, labeling means getting ground truths, which are often difficult, expensive, or impossible to obtain.
To increase the amount of labeled data, we can use crowdsourcing [4, 23, 27, 28] to gather a large amount of labels. A major challenge is to combine unreliable crowd information: this is not entirely accurate, but cheaper [32]. A typical case is to take the majority of votes for each object. For this to work, we must assume everyone has equal knowledge about the topic, which is in many cases a wrong assumption. In addition, we can use active learning (AL) [34, 31], a semi-supervised scenario in which a learning model iteratively selects the best instances (for example, those that most confuse the model) to be tagged by an expert. We can also mix these strategies [34, 17, 32] to select candidates by considering labelers’ expertise. Nevertheless, here we propose a model to make the labeling task even easier.
Instead of selecting the best instances as candidates for training the model, we propose a novel approach to query type (see figure 1). Typically in a four-class scenario, a labeler is asked the class of an object with possible responses “A” or “B” or “C” or “D”. We refer to that type of full question as an ABCD question. Our model generates low-cost queries in which each response gives partial information. This method iteratively selects, per labeler, a random object along with a class’ label, then asks if that object belongs to that class: “yes” or “no” (proposed YN question).
The proposed method has many advantages over traditional approaches. First, the YN model focuses on the importance of learning an estimation of how labelers fail. Our strategy probabilistically learns initial parameters from the data for the labeling stage. Second, the labelers do not need to know all the classes. Third, it captures partial information with fewer errors because the labelers do not need to know the ground truth to accurately respond to some YN questions. Finally, the method is independent of the kind of data, given that we only need to include labelers’ votes, without worrying about representation of the objects to be classified.
This work makes the following main contributions:
1. Crowdsourcing query type: We propose a new crowdsourcing framework to obtain labeled data focused on the query type. This method costs less than other models because it reconstructs ground truth labels by only using partial information. We show that the aggregation of partial information allows the YN model to ask fewer questions than others, while achieving similar accuracy.
2. New data released: We developed two real-world experiments with humans and published the data.
The rest of this paper is organized as follows: Section 2 presents some related work. In section 3 we explain the proposed model, and in section 4 we show how we solved it. Then, section 5 describes our implementations of the model. Section 6 describes the datasets for comparison. Then, section 7 shows experiments and analysis. Finally, in section 8 we discuss and conclude with the main results of our work.
2 Related Work
2.1 Creating Training Sets
To acquire labels, we can manually label as many objects as possible. Furthermore, others have used crowdsourcing and/or active learning [34, 17, 32]. From another point of view, [22] proposes using data programming, in which labelers give functions that return the asked labels. Another option to create labels is co-training [1], in which data is labeled from two independent views. Closer to our approach is boosting [25], which combines several “weak” classifiers to create a “strong” one. We considered the weaknesses by modeling the labelers’ (many views) errors to infer the true labels probabilistically.
2.2 Crowdsourcing Scenarios
Several efforts have been made on estimating labelers’ expertise [33, 31, 17, 27] and maximizing labelers’ accuracy by giving them the right incentives [26]. Some researchers have proposed new query types on active learning scenarios [21, 12]. Additionally, there are strategies to optimize the trade-off between redundancy and reliability in multi-class scenarios [13]. The closest research to the YN query type [19] involved assuming that each instance could belong to more than one class. However, these works did not involve a crowdsourcing context to improve the scenario. They mostly maintained a perfect oracle assumption.
Until now, no research has been presented to integrate query type, partial information asked to labelers, and the power of the crowd. We propose a mechanism that outperforms other methods and handles many difficulties, as we outlined in section 1 and throughout this paper.
2.3 Variational Inference Approaches
Several inference schemes have been used to solve the YN model. Following a probabilistic perspective, EM or MAP algorithms make the YN model very likely to converge to a local optimum [23, 32]. This can be handled using the Gibbs sampler [9, 17]. Previous research on labeling has always involved methods for full questions.
We used the No-U-Turn Hamiltonian sampler (NUTS) [11] to converge more quickly than the random walk that MCMC [10, 7] uses. Additionally, we tested Black Box variational inference (BBVI) [20] because it tends to be faster than NUTS [27]. BBVI is inexpensive and easy to implement because it only requires estimating the ELBO gradient.
3 The Model
Consider a dataset with objects; each object has only one true class , among possible classes, where and . Each labeler is then presented with a series of binary “yes” or “no” (YN) questions, where .
Formally, we define a YN question as the question asked to labeler about whether belongs (“yes” or “no”) to the class , . We define as the set of queries asked to labeler for the object . Let be the response (or vote) assigned by to the question , and the set of all responses . Note that a labeler is not asked twice for the same class for the same object.
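The query selection described above (a random object paired with a class, never repeating the same pair for the same labeler) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `next_question` and the bookkeeping set `asked` are hypothetical.

```python
import random

def next_question(num_objects, num_classes, asked, rng):
    """Pick a random (object, class) pair this labeler has not seen yet.

    `asked` is a hypothetical per-labeler set of (object, class) pairs
    already shown; it is updated in place. Returns None when exhausted.
    """
    remaining = [
        (i, k)
        for i in range(num_objects)
        for k in range(num_classes)
        if (i, k) not in asked
    ]
    if not remaining:
        return None
    choice = rng.choice(remaining)
    asked.add(choice)  # guarantees the pair is never asked twice
    return choice

rng = random.Random(0)
asked_by_labeler = set()
first = next_question(num_objects=3, num_classes=4, asked=asked_by_labeler, rng=rng)
```

Rebuilding the list of remaining pairs on every call is quadratic over a session; for large datasets a shuffled queue per labeler would be a cheaper design.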
We propose a probabilistic graphical model [14, 29] (shown in figure 4) to infer the true labels . The Labeling area represents the joint distribution of and the other variables involved in their prediction.
3.1 Responses
For object , labeler , and question , it is convenient to encode the response as a two dimensional vector: , where [YES, NO]. Figure 2 shows an example of votes for object given by labeler . Note that means that question was not asked.
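A small sketch of the two-dimensional [YES, NO] encoding may help. The helper name `encode_votes` is hypothetical, and using (0, 0) for a question that was not asked is an assumption, since the exact symbol was lost in this copy of the text.

```python
YES = (1, 0)        # labeler answered "yes"
NO = (0, 1)         # labeler answered "no"
NOT_ASKED = (0, 0)  # assumed convention for a question that was never asked

def encode_votes(answers, num_classes):
    """Encode one labeler's answers for one object.

    `answers` maps a class index to True ("yes") or False ("no").
    Classes missing from `answers` were not asked.
    """
    rows = []
    for k in range(num_classes):
        if k not in answers:
            rows.append(NOT_ASKED)
        else:
            rows.append(YES if answers[k] else NO)
    return rows

# The labeler said "no" to class 0 and "yes" to class 2; classes 1 and 3 were not asked.
votes = encode_votes({0: False, 2: True}, num_classes=4)
```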
3.2 Credibility Matrices
Common approaches involve the use of the confusion matrix of each labeler to represent their errors, due to the nature of the full question. We represented the YN error per labeler as a credibility matrix. We needed to find the probability per labeler of giving the right answer when the class asked is , and the true class is . Figure 3 shows the credibility matrix of a specific labeler, where is the probability of labeler saying “yes” to question when . We assumed that the labelers were not random voters so that we could find patterns in their behaviors.
Our main goal was to find the most likely class for each object, given the votes and credibility matrices . A side goal was to estimate . In particular, we considered conjugate priors. Given that each “yes” or “no” response can be modeled as a Bernoulli distribution, the prior for follows a Beta distribution , where and are the estimated prior initial parameters from the first stage. Finally, the likelihood is:
[TABLE]
Modeling the prior of as a Beta distribution, which lives in the 0 to 1 space, allowed us to model the probability of a response. The Beta distribution is also conjugate to the Bernoulli likelihood and can model any expertise due to its flexibility.
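Conjugacy means the posterior of a single credibility entry has a closed form: a Beta prior with parameters (alpha, beta), updated with observed "yes"/"no" responses, stays a Beta with the counts added. The sketch below shows that standard update for one entry; the function names and the uniform Beta(1, 1) starting point are illustrative assumptions, not values from the paper.

```python
def beta_posterior(alpha, beta, responses):
    """Conjugate Beta-Bernoulli update for one credibility entry.

    `responses` is a list of booleans (True = "yes") observed for a fixed
    (true class, asked class) pair of one labeler.
    """
    yes = sum(1 for r in responses if r)
    no = len(responses) - yes
    return alpha + yes, beta + no

def beta_mean(alpha, beta):
    """Expected probability of a "yes" under Beta(alpha, beta)."""
    return alpha / (alpha + beta)

# Assumed uniform prior Beta(1, 1), then three "yes" and one "no".
posterior = beta_posterior(1.0, 1.0, [True, True, False, True])
expected_yes = beta_mean(*posterior)
```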
3.3 Joint Distribution
Each YN vote depends on the real, but unknown, label . Furthermore, the vote also depends on the credibility of labeler . The conditioning to allows the labeler to be more accurate in subsets of classes. The dependency on allowed us to model the labeler’s biases and errors for all classes. These dependencies are represented by the conditional distribution [17].
From prior information, we could estimate the initial class proportions and define a global Dirichlet variable in charge of this unknown distribution of vector . Finally, this gave:
[TABLE]
[TABLE]
Likelihood
We started from a single labeler, one object, and one question. For labeler and question , the likelihood is found in (3.1), where we encoded the response as a two-dimensional vector: , where [YES, NO]. For all responses , all labelers , and all data , the likelihood is found in (3.2).
[TABLE]
[TABLE]
4 Inference Schema
We separated the inference into two intuitive stages: first, to estimate the labelers’ reliability by asking them for known objects (Training Set), and second to ask them for unknown objects labels. We could unify these stages in a single inference model with an identical result. In the scenario where are observed values, the model estimates beforehand and converges faster (see section 7). The likelihood for all responses , all labelers , and all data is found in (4.3).
[TABLE]
The prior distribution of each was chosen to be uninformative, but flexible enough to represent labelers with both high and low expertise. We selected with an expected value equivalent to (see section 7). As stated before, this inference scheme works in two stages (that can also be done analytically):
1. Credibility stage: estimating . Because we assumed the labelers would behave similarly in the Labeling stage, as they do here, we obtained the and parameters from each .
2. Labeling stage: predicting Z and via posterior inference.
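To make the Labeling stage concrete, the following simplified sketch computes the class posterior for one object when the credibility matrices are fixed to point estimates and votes are independent given the true class. This is a deliberate simplification of the full model, which infers a joint posterior; the function name and the array layout `credibility[j][z][k]` are assumptions made for illustration.

```python
import math

def class_posterior(prior, credibility, votes):
    """Posterior over the true class of one object.

    prior: list of K class probabilities.
    credibility[j][z][k]: assumed point estimate of
        P(labeler j says "yes" | true class z, asked class k).
    votes: list of (labeler j, asked class k, answered_yes) tuples.
    """
    log_post = []
    for z, p_z in enumerate(prior):
        total = math.log(p_z)
        for j, k, answered_yes in votes:
            p_yes = credibility[j][z][k]
            total += math.log(p_yes if answered_yes else 1.0 - p_yes)
        log_post.append(total)
    # Normalize in log space for numerical stability.
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]

# One fairly reliable labeler over three classes (illustrative numbers only).
reliable = [[0.9 if z == k else 0.1 for k in range(3)] for z in range(3)]
posterior = class_posterior(
    prior=[1 / 3, 1 / 3, 1 / 3],
    credibility=[reliable],
    votes=[(0, 0, False), (0, 1, True)],  # "no" to class 0, "yes" to class 1
)
```

Even though neither answer names the true class outright, the two partial answers together concentrate most of the posterior mass on class 1.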
5 Implementation
Due to the convergence time of NUTS, we also used BBVI [20]; both were implemented in Python 3.5. Each one works as follows: First, it estimates the latent variables . Second, it estimates , , and . All the experiments presented in section 7 used NUTS [24], except when indicated otherwise. BBVI tries to find an approximating probability distribution that is closest (in KL divergence) to the true posterior distribution. The supplementary material provides the derivation of the needed gradients to solve the model, which can be easily extended to any model with similar variable types (based on [2]).
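For reference, the generic BBVI gradient has the score-function form below. This is the standard estimator from the black box variational inference literature, not the model-specific derivation in the supplementary material; the symbols are assumptions introduced here: $x$ is the observed data, $z$ the latent variables, $q(z \mid \lambda)$ the variational family with parameters $\lambda$, and $S$ the number of Monte Carlo samples.

```latex
\nabla_{\lambda}\,\mathcal{L}(\lambda)
  = \mathbb{E}_{q(z \mid \lambda)}\!\left[
      \nabla_{\lambda} \log q(z \mid \lambda)\,
      \bigl(\log p(x, z) - \log q(z \mid \lambda)\bigr)
    \right]
  \approx \frac{1}{S} \sum_{s=1}^{S}
      \nabla_{\lambda} \log q(z_s \mid \lambda)\,
      \bigl(\log p(x, z_s) - \log q(z_s \mid \lambda)\bigr),
  \qquad z_s \sim q(z \mid \lambda).
```

Only the log joint $\log p(x, z)$ needs to be evaluated, which is why the method requires no model-specific gradient of the joint distribution.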
6 Data
We used simulated and real-world datasets. First, we simulated data to understand the YN model’s behavior. Then, we trained classifiers with real-world data to produce responses and evaluate the YN model performance. Finally, we tested the model in two human scenarios. These three sources of labels are described in the following subsections.
6.1 Synthetic Votes for Synthetic Data.
To simulate labelers and their votes, we proceeded as follows: First, we created labels ( and ). Then, for each labeler, we sampled a credibility matrix. Each row was simulated using a distribution. Labelers have high expertise in at most half of the classes; expertises were sampled from a distribution (because its expected value is close to 1). Finally, we simulated the votes using the labelers and true labels. When the labeler was presented with object of class and is asked , we consulted its credibility matrix to obtain the response for . We took by flipping a coin with the probability given by .
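The simulation steps above can be sketched with the standard library alone. The Beta parameters below are placeholders, since the original values were lost in this copy of the text, and using one accuracy per row is a simplifying assumption; `sample_credibility` and `simulate_vote` are hypothetical names.

```python
import random

def sample_credibility(num_classes, expert_classes, rng):
    """Sample one labeler's credibility matrix.

    Entry [z][k] is P("yes" | true class z, asked class k).
    Beta(10, 1) and Beta(2, 2) are assumed placeholder priors for
    expert and non-expert rows; they are not the paper's values.
    """
    matrix = []
    for z in range(num_classes):
        if z in expert_classes:
            accuracy = rng.betavariate(10, 1)  # mean about 0.91
        else:
            accuracy = rng.betavariate(2, 2)   # mean 0.5
        # "yes" is likely when the asked class is the true one, unlikely otherwise.
        matrix.append([accuracy if k == z else 1.0 - accuracy
                       for k in range(num_classes)])
    return matrix

def simulate_vote(matrix, true_class, asked_class, rng):
    """Flip a coin with the probability stored in the credibility matrix."""
    return rng.random() < matrix[true_class][asked_class]

rng = random.Random(0)
labeler = sample_credibility(num_classes=4, expert_classes={0, 1}, rng=rng)
said_yes = simulate_vote(labeler, true_class=0, asked_class=0, rng=rng)
```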
6.2 Synthetic Votes for Real-World Data.
We used a subset of MACHO data [3] (250 objects). We trained six different classifiers as labelers, each with a different training set but equally sized (2 Random Forest classifiers, 2 Logistic Regressions, and 2 Support Vector Machines). We proceeded as follows: First, we split the data into three different sets; one to train classifiers, another to infer , and the last to test the model. Each labeler was composed of a pool of one-vs-all classifiers. When a labeler was asked for , we consulted its one-vs-all binary classifier for the class to get the probability of the object belonging to the class . Then, we flipped a coin with that probability to obtain and .
- MACHO data: Irregularly-sampled time series. Several works aim to classify astronomical irregular time series [18]. Table 1 shows the data distribution that we used.
6.3 Real Votes for Real-World Data.
Two websites were set up to acquire data from human crowds. Each of them presented a contest to people related to a specific dataset domain (see table 1):
1. Astronomical irregular time series: We aimed to classify irregular time series of the Catalina Surveys [6]. The labelers, 8 in total, were astronomers and engineers familiar with the field. From the human experiments, we showed that our model can assist astronomers’ work.
2. Animal classes: The objective of classifying animals was to compare the model in different fields. The labelers selected were 11 university students. (The full dataset is available at https://a-z-animals.com/animals/pictures/. We reduced the number of mammals to avoid an extremely unbalanced dataset. The class fish was removed to work with only four classes and to increase the difficulty.)
Each dataset contains 4 classes and 318 unknown objects, for about 15 people. Each user was presented with 1 to 4 random YN questions per instance. Also, the sets have (i) 40 and (ii) 41 known objects, respectively. For those known objects and 80 of the 318 unknown ones, the users were asked the ABCD question as well. The following results are based only on those labelers who finished at least 70% of the questions.
7 Results
The experiments are divided into eleven parts: Two full experiments with synthetic data (7.1 and 7.2); four using classifiers on MACHO data (7.3, 7.4, 7.5, and 7.6); finally, we set up the websites to get real crowds’ results, which we present in five experiments (7.7, 7.8, 7.9, 7.10, and 7.11). We used NUTS for all experiments, except for the benchmark against BBVI presented in experiment 7.7. We always used ten sampling chains and burned the first 1500 samples.
7.1 Convergence Simulations - Synthetic Data.
We created votes, as explained in subsection 6.1. For synthetic and classifiers’ votes, we used six labelers and four classes. We asked each labeler between 1 and 4 questions (Random(1,4)) for about 250 objects. Between 25 and 40 objects were used to approximate ; the rest were used for testing.
For all experiments we performed, the classification accuracy scores became completely stable after 3000 iterations. Similar results for convergence were obtained from both classifiers’ scenarios and the two set-up contests with real-world data. The convergence of each variable (, , Z, and ) was diagnosed based on the Gelman-Rubin statistic [8]. They all converged.
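As a reminder of what the Gelman-Rubin diagnostic measures, here is a minimal sketch of the classic statistic for a single scalar variable, comparing within-chain and between-chain variance. It is not the routine used in the experiments (probabilistic programming libraries ship their own, often refined, versions); the function name is hypothetical.

```python
import math
import statistics

def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains.

    `chains` is a list of lists of samples of one scalar variable.
    Values near 1 suggest the chains agree; values well above 1 suggest
    they have not converged to the same distribution.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

# Two chains stuck in different regions give a large statistic.
r_hat_bad = gelman_rubin([[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]])
```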
7.2 Modeling the Crowd Expertise - Synthetic Data.
To prove that our model can effectively differentiate between accurate and inaccurate labelers, we compared it with the baselines used in [33]. Here, we worked with 7 synthetic labelers with higher expertise for at most two of four classes (as explained in section 6). Figure 5 shows the performance of each method after convergence. This shows that our method outperforms all the baselines when the labelers do not have equal knowledge about all classes. Since we only have YN responses, an ABCD model would not be appropriately trained.
- YN query: We predicted via posterior inference.
- Each labeler’s ABCD simulated votes: We asked one per object to each labeler, where . This means we asked if belongs, “yes” or “no”, to what we know is the true label . We considered these answers as ABCD votes. We obtained the classification accuracy score as the proportion of right answers.
- Majority vote: As a prediction, we took the majority of the labelers’ ABCD simulated votes.
- Average vote: Represents the average of the accuracy scores of each labeler’s ABCD simulated votes.
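The majority-vote and average-vote baselines listed above reduce to a few lines. The sketch below works on generic per-labeler class votes with made-up toy values; it is an illustration of the two aggregation rules, not the evaluation code from the experiments.

```python
from collections import Counter

def accuracy(predictions, truth):
    """Proportion of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

def majority_vote(votes_by_labeler):
    """Most common vote per object (ties go to the first vote encountered)."""
    per_object = zip(*votes_by_labeler)
    return [Counter(votes).most_common(1)[0][0] for votes in per_object]

def average_vote_accuracy(votes_by_labeler, truth):
    """Mean of the individual labelers' accuracy scores."""
    scores = [accuracy(votes, truth) for votes in votes_by_labeler]
    return sum(scores) / len(scores)

# Toy example: three labelers, three objects, each labeler wrong once.
truth = [0, 1, 2]
votes_by_labeler = [[0, 1, 1], [0, 2, 2], [1, 1, 2]]
majority_accuracy = accuracy(majority_vote(votes_by_labeler), truth)
average_accuracy = average_vote_accuracy(votes_by_labeler, truth)
```

In this toy case the majority is right on every object even though each labeler alone scores two out of three, which is the effect the baselines are meant to expose.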
7.3 Performance Depending on the Training Set Size - MACHO Data.
First, we evaluated how many objects we would need to converge the estimation quickly. Second, we checked the model’s sensitivity to the hyperparameters and . Figure 6 shows that the learning rate grows logarithmically with the training set size. This means that by only asking about a few known objects , the model can quickly converge to a good estimation of and , almost independently of . It also shows that this model can achieve equal results with different initial hyperparameter values.
7.4 Recovery of Credibility Matrices - MACHO Data.
The accuracy classification score and the training set size are closely related, as shown in figure 6. Figure 7 shows that the convergence of also depends on the training set size. Hence, if we estimate a good , we can reach a higher accuracy score. Finally, the accuracy score depends on the convergence of .
7.5 Performance Simulations Depending on Convergence - MACHO Data.
Figure 8 shows that the better the model estimates the labelers’ credibilities , the better the classification accuracy score.
7.6 Performance Simulations - MACHO Data.
In a four-class scenario, our method reaches the performance of the ABCD method (see figure 18) when we asked YN queries per object per labeler. The implemented baseline is a Bayesian ABCD model, a Hybrid Confusion Matrix [17] based on DawidSkene [4] plus the prior estimation stage of confusion matrices.
In a five-class scenario, six labelers outperformed the ABCDE model when giving responses for only four classes. This means that the labelers were not required to discern among the five classes to reach high accuracy scores. However, we found that three labelers are not enough for this scenario, since they need to respond for all five classes to reach the full question model.
Scenarios with four and five classes showed that the YN model outperforms the ABCD method when we ask a YN question for every possible class for every object . This indicates that each YN response is more precise or confident than each ABCD response. The difference relies on the fact that in the YN model we can ask for enough explicit information to estimate each row of the credibility matrices, while in the ABCD scenario, we cannot ask queries to evaluate specific errors between pairs of classes.
7.7 Performance Real-World Votes MCMC vs. BBVI - Websites.
We ran all previous simulations using the PyMC3 implementation mainly for two reasons. First, even though we used the AdaGrad [20] algorithm to set the learning rate in BBVI, that setting requires more parameter tuning than the MCMC parametrization does. Second, the PyMC3 implementation usually slightly outperformed the BBVI results. Even though we also evaluated time and memory complexity, here we present only time until complete convergence.
Time Until Complete Convergence
Running one chain of the full model took 10 minutes (PyMC3) versus 5 minutes (BBVI) for The Catalina Surveys, and 14 minutes (PyMC3) versus 7 minutes (BBVI) for the Animals dataset. Since both datasets were equal in size, those times depend only on the number of labelers, 8 and 11 respectively for each dataset. The time spent is linear in the number of chains for both models.
Given that the experiments took minutes to converge, these implementations cannot support active learning, as each step would require converging a model to estimate the next question and labeler.
The results for The Catalina Surveys are shown in figure 10. The figure shows that for this data, the MCMC model outperforms the BBVI implementation. For the Animals Data, both implementations have a 99.7% accuracy score. The BBVI implementations are both parametrized equally. We found that the BBVI approach can get higher accuracy if we fine-tune each learning rate of the latent variables.
7.8 Performance Crowd Versus Each Labeler - Websites.
To evaluate the individual performance of each labeler versus the mixture of them, we trained one YN model per labeler. Figure 11 shows the three best individual performances in The Catalina Surveys contest. The figure shows that our strategy effectively modeled and integrated the unreliable crowd knowledge.
The YN strategy can control unreliable labelers mainly for two reasons. First, the Credibility stage allows the model to discover how each labeler makes mistakes and interprets the labelers’ responses. Second, the mixture of labelers helps the model to converge to a correct posterior distribution of the classes by weighting them according to their credibility matrices.
The labelers’ behavior for the Animals dataset is quite similar; many of them are unreliable, but the full model is more accurate than all the labelers.
7.9 Performance Real-World Votes YN vs. ABCD - Websites.
As we explained in section 6, each labeler was presented with a series of full ABCD questions for 80 objects, for which the labelers were asked for YN queries as well. For these objects, the animals contest achieved 100% accuracy with both strategies. For The Catalina Surveys, the YN query reached 91.2% and the ABCD 90.0%.
7.10 Performance Analysis YN Question vs. ABCD Question - Websites.
Finally, we analyzed the cost and performance of the number of YN queries versus the number of ABCD queries needed for convergence of the classification accuracy score. Although the YN query requires less expertise than the full ABCD question, the time spent on selecting an ABCD response is not proportional to the number of possible classes . This is shown in the websites’ time records, where answering an ABCD question required less than twice the time of answering a YN question. To measure the cost, we compared how many YN queries versus how many ABCD queries are needed for the model to converge. We could assume that each ABCD query is equivalent to give YN votes [30], because each ABCD response requires the labeler to recognize the YN response for all possible classes. Figure 12 shows that if 4 YN queries require as much effort as 1 ABCD question, the YN model converges faster and to a higher classification accuracy score. This occurs because the YN model can better differentiate among the possible errors, since the YN query gives specific information to estimate all the rows within the credibility matrices. As figure 8 shows, the better the model estimates the credibility matrices, the better the classification accuracy score.
Despite assuming that 4 YN queries are equivalent to 1 ABCD query, figure 13 presents an analysis of different ABCD equivalences. All ABCD predictions were obtained from the Bayesian model described in section 7.6, which was also used in figure 18.
The analysis in figure 13 corresponds to how much difference exists between the classification accuracy score of the YN scenario and that of the ABCD scenario. The “1 ABCD = 4 YN” lines represent the differences in figure 12, where the YN surpasses the ABCD strategy. We compared this error (axis-Y) to the number of equivalent ABCD questions asked during the labeling stage (axis-X). Figure 13 illustrates that the YN strategy outperforms the ABCD strategy when we assumed that each ABCD query is equivalent to at least 3 YN queries. In addition, we can see that when asking an average of 2.5 questions per object and labeler, the YN model reached the ABCD’s performance quickly. Furthermore, when we assume that each YN question is equivalent in cost to one ABCD question, at some point the YN reaches or outperforms the ABCD’s performance.
7.11 Cognitive Cost Analysis YN Question vs. ABCD Question - Websites.
The amount of cognitive effort made by annotators depends on factors like the information available or the number of classes. Since we cannot evaluate all possible scenarios objectively, we show the assessment of different costs in a four-class scenario in figure 14. Figure 14 illustrates that assuming that each ABCD query is equivalent to one YN query, the model is not convenient regarding time spent. However, when the cognitive cost of a YN query is less than half that of an ABCD query, the effort made by annotators to converge the model is less than the effort required when they are asked for ABCD queries. Overall, we can see that if the cognitive cost for a YN query is less than 0.6 times that for an ABCD query, the YN strategy reduces the total effort.
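The trade-off in this analysis is simple arithmetic: total effort is the number of queries times the cost per query, so the YN strategy saves effort whenever the cost ratio (YN cost divided by ABCD cost) falls below the ratio of ABCD queries to YN queries needed for convergence. The sketch below uses made-up query counts purely for illustration; they are not measurements from the experiments.

```python
def total_effort(num_queries, cost_per_query):
    """Total annotator effort in arbitrary cost units."""
    return num_queries * cost_per_query

def break_even_ratio(num_yn_queries, num_abcd_queries):
    """YN/ABCD cost ratio below which the YN strategy needs less total effort."""
    return num_abcd_queries / num_yn_queries

# Hypothetical counts: 400 YN queries versus 200 ABCD queries to converge.
ratio = break_even_ratio(num_yn_queries=400, num_abcd_queries=200)
yn_effort = total_effort(400, cost_per_query=0.4)    # YN assumed to cost 0.4 of an ABCD
abcd_effort = total_effort(200, cost_per_query=1.0)
```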
8 Conclusion
We developed a new model for crowdsourcing with “yes” or “no” type queries that can be applied to any context. The YN model obtains comparable results with models that ask full questions to labelers. The reduction of labelers’ efforts depends on how much cognitively easier it is to respond to a YN versus an ABCD question. Furthermore, our model converges more quickly without sacrificing accuracy. We could also see that in cases where most labelers are unreliable, the YN model was able to capture the right posterior of the classes by taking advantage of crowds.
As future work, the model could capture variations in expertise over time. Also, here we randomly selected an object along with a class; this selection could be optimized using an active learning approach or by understanding the biases produced by the order in which the pairs of objects and questions are presented to the labelers.
Acknowledgements
Our work was supported in part by the CSS survey, which is funded by the National Aeronautics and Space Administration under Grant No. NNG05GF22G issued through the Science Mission Directorate Near-Earth Objects Observations Program. We would also like to thank the anonymous reviewers whose comments greatly improved this manuscript.
Supplementary Material
9 Background Theory
This section describes the main theory behind this work. We based this discussion mainly on [2] and [6].
9.1 Probabilistic Graphical Models
We represented the joint distribution of the proposed method with a probabilistic graphical model (PGM) [14, 29]. A PGM is a graph-based representation for compactly encoding a complex distribution over a high-dimensional space. For example, figure 15 illustrates the elemental DawidSkene [4] distribution for a crowdsourcing classification scenario. The circles represent random variables, observed variables are gray circles, and the points represent hyperparameters. When a set of variables shares the same probability distribution, we can use the “plate” notation, which stacks identical objects in a rectangle. In that case, the plates’ dimensions are written in capital letters within the rectangles.
In the PGM shown in figure 15, is the number of instances to be labeled and is the number of labelers, where and . In the DawidSkene model, is the initial parameter for the distribution over the hidden labels , where is the predicted label for object . In that scenario, represents the class given by labeler to object , whose confusion matrix is . In this case, if each is a random variable instead of a hyperparameter, and will be conditionally dependent given all the labelers’ votes due to the graph structure. Following the notation from [17], in that model definition the variable distributions are:
[TABLE]
[TABLE]
This structure allows inferring a compact representation of the explicit joint distribution. To get the posterior distribution, we can either use sampling-based methods or variational inference. In this work, we address the proposed probabilistic model solution with approximate inference. In the following subsections, we explain two approaches to infer the posterior target distribution by approximating a distribution: Markov chain Monte Carlo (MCMC) and Variational Inference (VI).
9.2 Markov Chain Monte Carlo
MCMC [10, 7] is the most popular method for sampling when simple Monte Carlo methods do not work well in high-dimensional spaces. The key idea is to build a Markov chain on the state space where the stationary distribution is the target, for instance, a posterior distribution , where is observed data. MCMC performs a random sampling walk on the space, where the time spent in each state is proportional to the target distribution. The samples allow approximating .
MCMC is applied to Bayesian inference through developments such as the Gibbs sampler [9]. The key idea behind Gibbs sampling is to sample the variables in turns. In each turn, the sampler draws a new value for one variable conditioned on the most recent values of the remaining variables in the model. Suppose we want to infer . In each iteration, we would alternate the samples: and .
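As a minimal sketch of this alternation, the following example runs a Gibbs sampler on a toy target, a standard bivariate normal with correlation `rho`, whose two conditionals are known in closed form. The target and all names are assumptions chosen for illustration; they are not the model of this paper.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=20000, burn_in=1000, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho.

    Each conditional is a ~ N(rho * b, 1 - rho^2), and symmetrically for b.
    """
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho ** 2)
    a, b = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    for t in range(burn_in + n_samples):
        # Turn 1: sample a given the most recent value of b.
        a = rng.normal(rho * b, cond_sd)
        # Turn 2: sample b given the value of a just drawn.
        b = rng.normal(rho * a, cond_sd)
        if t >= burn_in:
            samples[t - burn_in] = (a, b)
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
estimated_rho = np.corrcoef(samples.T)[0, 1]
```

After the burn-in period, the empirical correlation of the draws should be close to the `rho` used to define the target.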
No-U-Turn Hamiltonian Monte Carlo (NUTS)
To avoid random-walk behavior and converge more quickly than with simple MCMC, we used NUTS [9], an MCMC algorithm based on the Hamiltonian Monte Carlo (HMC) sampler. NUTS makes an informed walk: a recursive algorithm builds a set of candidate points spread widely over the target distribution, and the recursion stops as soon as the trajectory starts to double back and retrace its steps. Nevertheless, HMC requires computing the gradient of the log-posterior to inform the walk, which can be difficult.
Unlike a simple MCMC or HMC sampler, NUTS does not require setting the step size and the number of steps by hand. Setting those parameters would require preliminary runs and some expertise. The sampling stops when drawing more samples no longer increases the distance between the proposal and the initial values of .
Even though MCMC algorithms can be very slow when working with large datasets or very complex models, they asymptotically draw exact samples from the target density [8]. In these computationally heavy settings, we can use variational inference (VI) to approximate the target distribution. VI does not guarantee finding the exact target density; it only finds a close distribution. However, it is usually faster than simple MCMC.
9.3 Variational Inference
Variational inference (VI) [4] proposes a solution to the problem of posterior inference. VI selects an approximation from some tractable family and then tries to make this as close as possible to the true posterior . The VI approach reduces this approximation to an optimization problem: the minimization of the KL divergence [5] from to .
The KL divergence is a measure of the dissimilarity of two probability distributions, p and q. Given that the forward KL divergence includes taking expectations over the intractable , a natural alternative is the reverse KL divergence , defined in (9.6).
[TABLE]
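The asymmetry between the forward and the reverse KL divergence is easy to check numerically. The sketch below uses two small discrete distributions, chosen arbitrarily for illustration; the helper name `kl_divergence` is an assumption of this example.

```python
import numpy as np

def kl_divergence(a, b):
    """KL(a || b) = sum_k a_k * log(a_k / b_k) for discrete distributions."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mask = a > 0  # terms with a_k = 0 contribute nothing
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

p = np.array([0.70, 0.20, 0.10])  # stands in for the true posterior
q = np.array([0.40, 0.40, 0.20])  # stands in for the approximation

forward_kl = kl_divergence(p, q)  # expectation taken under p
reverse_kl = kl_divergence(q, p)  # expectation taken under q
```

Both values are non-negative, but they differ. The reverse form only requires expectations under q, which is what makes it usable when p is intractable.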
9.4 The Evidence Lower Bound
Variational inference minimizes the KL divergence from to . This can be shown to be equivalent to maximizing a lower bound on the log-evidence , known as the evidence lower bound (ELBO). The ELBO is equal to the negative KL divergence plus a constant, as we show in the following definitions.
Assume is the observations, the latent variables, and the free parameters of . We want to approximate by setting such that the KL divergence is minimum. In this case, we can rewrite (9.6) and expand the conditional in (9.7).
[TABLE]
Therefore, the minimization of the KL in (9.8) is equivalent to maximizing the ELBO:
[TABLE]
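The relation between the ELBO, the reverse KL divergence, and the log-evidence can be verified on a toy model with a single discrete latent variable. The joint probabilities below are made-up numbers and the names `joint`, `elbo`, and `reverse_kl` are assumptions of this sketch.

```python
import numpy as np

# p(x, z = k) for one fixed observation x and a latent z with three states.
joint = np.array([0.10, 0.25, 0.05])
evidence = joint.sum()          # p(x)
posterior = joint / evidence    # p(z | x)

def elbo(q):
    """E_q[log p(x, z)] - E_q[log q(z)]."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

def reverse_kl(q):
    """KL(q || p(z | x))."""
    return float(np.sum(q * (np.log(q) - np.log(posterior))))

q = np.array([0.3, 0.5, 0.2])   # an arbitrary approximating distribution
gap = np.log(evidence) - elbo(q)
```

For any q, `gap` equals `reverse_kl(q)`. Since the log-evidence does not depend on q, raising the ELBO lowers the KL divergence by the same amount, and the bound is tight when q is the exact posterior.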
9.5 Mean Field Inference
The difficulty of optimizing over a given family of distributions is determined by the complexity of the family, and the optimization can become hard when a complex family is used. To keep the variational inference approach simple, [7] proposes to use the mean field approximation. This approach assumes that the posterior can be approximated by a fully factorized , where each factor is an independent mean field variational distribution, as defined in (9.9).
[TABLE]
The goal is to solve the optimization in (9.10) over the parameters of each marginal distribution q.
[TABLE]
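A standard textbook case shows what the factorization assumption costs. For a correlated Gaussian target, the mean field solution under the reverse KL divergence is known in closed form: each factor is a Gaussian with the true mean and a variance equal to the inverse of the corresponding diagonal entry of the precision matrix. The sketch below evaluates that result for an arbitrary two-dimensional target; it is an illustration, not part of the proposed model.

```python
import numpy as np

# Correlated two-dimensional Gaussian target (illustrative values).
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
precision = np.linalg.inv(cov)

# Variances of the optimal fully factorized Gaussian under reverse KL.
mean_field_var = 1.0 / np.diag(precision)
# Variances of the true marginals, for comparison.
true_marginal_var = np.diag(cov)
```

Here each mean field variance is 1 - 0.9^2 = 0.19, while the true marginal variances are 1. The factorized approximation matches the means but is more concentrated than the target, because it cannot represent the correlation.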
9.6 Stochastic Variational Inference
Common posterior inference algorithms do not scale easily to large amounts of data. Furthermore, several algorithms are computationally very expensive because they require a pass through the full dataset in each iteration. In these settings, stochastic variational inference (SVI) [3] approximates the posterior distribution by computing and following a noisy estimate of the ELBO gradient in each iteration over subsamples of the data. SVI iteratively takes samples from the full data, computes their optimal local parameters, and finally updates the global parameters.
SVI solves the ELBO optimization by using the natural gradient [1] in a stochastic optimization algorithm. This optimization consists of estimating a noisy but cheap-to-compute gradient to reach the target distribution.
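The SVI update can be illustrated on a conjugate Beta-Bernoulli model, which has a single global parameter and no local latent variables. In this setting the natural-gradient step reduces to a weighted average between the current estimate and an intermediate estimate computed as if the minibatch were replicated to the full data size. The model, the step-size schedule constants `tau` and `kappa`, and all names are assumptions chosen for this sketch.

```python
import numpy as np

def svi_beta_bernoulli(data, batch_size=50, n_iters=2000,
                       tau=1.0, kappa=0.7, prior=(1.0, 1.0), seed=0):
    """SVI for the Beta parameters of q in a Beta-Bernoulli model."""
    rng = np.random.default_rng(seed)
    n = len(data)
    prior = np.asarray(prior, dtype=float)
    lam = prior.copy()  # current global variational parameters (a, b)
    for t in range(n_iters):
        # Subsample the data instead of passing through all of it.
        batch = rng.choice(data, size=batch_size, replace=False)
        ones = batch.sum()
        # Intermediate estimate: minibatch counts rescaled to n points.
        lam_hat = prior + (n / batch_size) * np.array([ones, batch_size - ones])
        # Decreasing step size; kappa in (0.5, 1] satisfies Robbins-Monro.
        rho = (t + 1 + tau) ** (-kappa)
        lam = (1.0 - rho) * lam + rho * lam_hat
    return lam

data = (np.random.default_rng(1).random(1000) < 0.7).astype(float)
lam = svi_beta_bernoulli(data)
exact = np.array([1.0 + data.sum(), 1.0 + len(data) - data.sum()])
```

Because the model is conjugate, `exact` holds the true posterior parameters, so the noisy estimate `lam` can be compared against it directly.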
9.7 Black Box Variational Inference
BBVI [20] avoids any model-specific derivations. It stochastically maximizes the ELBO using noisy estimates of its gradient, computed from samples of the variational posterior. This requires writing the gradient of the ELBO (9.8) as in (9.11).
[TABLE]
Using this equation, we can compute a noisy unbiased estimate of the gradient of the ELBO by sampling the variational distribution with Monte Carlo, as shown in equation (9.12), where is the number of samples we take from each distribution to be estimated.
[TABLE]
where,
[TABLE]
For estimating the approximating q distribution, in BBVI the variational distributions are mean field factors with free variational parameters , for each index (see (9.9)). In appendix 12, we show how to apply this method to the proposed model.
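The following sketch applies the score-function gradient estimator to a toy conjugate model, so that the result can be compared with the exact posterior. It is not the YN model: the model (a standard normal prior on z and unit-variance normal observations), the Gaussian variational family, and all names are assumptions. Two further choices are additions for numerical stability rather than part of the plain estimator: a leave-one-out baseline is subtracted from the sampled ELBO terms, which keeps the estimator unbiased while reducing its variance, and AdaGrad sets the per-parameter step size.

```python
import numpy as np

def log_joint(z, x):
    """log p(x, z) with z ~ N(0, 1) and x_i | z ~ N(z, 1), for a vector of z."""
    log_prior = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
    diff = x[None, :] - z[:, None]
    log_lik = np.sum(-0.5 * diff ** 2 - 0.5 * np.log(2 * np.pi), axis=1)
    return log_prior + log_lik

def bbvi(x, n_iters=2000, n_samples=200, lr=0.1, seed=0):
    """Fit q(z) = N(mu, sigma^2) using only samples and log-density values."""
    rng = np.random.default_rng(seed)
    lam = np.zeros(2)      # variational parameters [mu, log_sigma]
    grad_sq = np.zeros(2)  # AdaGrad accumulator
    for _ in range(n_iters):
        mu, sigma = lam[0], np.exp(lam[1])
        z = rng.normal(mu, sigma, size=n_samples)
        std = (z - mu) / sigma
        log_q = -0.5 * std ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
        # Score function: gradient of log q with respect to [mu, log_sigma].
        score = np.stack([std / sigma, std ** 2 - 1.0], axis=1)
        f = log_joint(z, x) - log_q
        # Dividing the centered sum by (n_samples - 1) is the leave-one-out baseline.
        grad = score.T @ (f - f.mean()) / (n_samples - 1)
        grad_sq += grad ** 2
        lam += lr * grad / (np.sqrt(grad_sq) + 1e-8)
    return lam[0], np.exp(lam[1])

x = np.random.default_rng(1).normal(1.5, 1.0, size=10)
mu, sigma = bbvi(x)
posterior_mean = x.sum() / (len(x) + 1)
posterior_sd = (len(x) + 1) ** -0.5
```

The only model-specific piece is `log_joint`; the rest of the loop would stay the same for a different model, which is the sense in which the method is a black box.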
10 Inference Schema
As stated in the paper, the proposed inference scheme works in two stages. The PGM in figure 16 shows both stages.
11 Complementary Results
11.1 Convergence Simulations - Synthetic Data.
To check for convergence of the full model, we analyzed the convergence of each variable. The convergence diagnostics for our random variables were based on the Gelman-Rubin statistic [7]. To apply this diagnostic, we needed multiple chains in order to compare them with each other. Our experiments were based on 10 chains each. When the Gelman-Rubin ratio (potential scale reduction factor) is less than 1.1, it is possible to conclude that the estimation has converged. Figure 17 presents the potential scale reduction factors for all the estimated variables. According to this figure, there is no disagreement on whether each converges.
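For reference, the potential scale reduction factor can be computed from the chains with a few lines of NumPy. This is a sketch of the classic (non-split) form of the statistic for a single scalar variable; the function name and the synthetic chains are assumptions of the example, and probabilistic programming libraries ship their own implementations.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for an array of shape (m chains, n draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()   # W: mean within-chain variance
    between = n * chain_means.var(ddof=1)        # B: between-chain variance
    pooled = (n - 1) / n * within + between / n  # pooled variance estimate
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(10, 1000))              # 10 chains, same target
stuck = mixed + np.arange(10)[:, None]           # chains centered at different values

r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
```

Chains that explore the same distribution give a ratio close to 1, while chains centered at different values give a ratio well above the 1.1 threshold.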
11.2 Performance Simulations - MACHO Data.
Figure 18 shows the results for the experiment in subsection 7.6 in a five-class scenario. We can see that when all classes were asked per object per labeler, the YN model outperformed the ABCD strategy. However, three labelers are not enough for this scenario because the only way they reached the ABCDE performance (five classes implies ABCDE) was when we asked them about all five classes. In this five-class scenario, six labelers outperformed the ABCDE model when giving responses for only four classes. This means that the labelers were not required to discern among the five classes to reach a high accuracy score.
11.3 Performance Real-World Votes MCMC vs. BBVI on Websites Results
We developed all the previous simulations using the PyMC3 implementation mainly for two reasons. First, even though we used the AdaGrad [20] algorithm to set the learning rate, BBVI requires more parameter tuning than the MCMC parametrization. Second, the BBVI results were usually slightly outperformed by NUTS.
Iterations Until Convergence
As we said before, PyMC3 needs about 3000 iterations until convergence when running one chain. BBVI needs only 4 iterations, but each iteration implies estimating the gradient of each latent variable, which means taking samples from the variational approximation distribution of every variable. This estimation converges at 3072 total samples.
Time and Memory Complexity
The model has and parameters to estimate, respectively, in each stage. If we assume that always , this model is . Both implementations require samples. The memory complexity for the PyMC3 model is and for the BBVI is . When the number of samples remains constant, as in this work, the complexity is . Both time complexities are equivalent.
12 Derivation of the Black Box Inference Equations.
BBVI minimizes the KL divergence from an approximating distribution q to the true posterior p. Let be the observations, the latent variables, and the free parameters of . We want to approximate by setting . This optimization is equivalent to maximizing the ELBO in (9.8):
[TABLE]
BBVI proposes stochastically maximizing the ELBO using noisy estimates of its gradient. The estimator of this gradient is computed using samples from the variational posterior. This requires writing the gradient of the ELBO as in (9.11):
[TABLE]
Using (9.11), we can compute a noisy unbiased estimate of the gradient of the ELBO by sampling the variational distribution with Monte Carlo, as shown in (9.12), where is the number of samples taken from each distribution to be estimated:
[TABLE]
[TABLE]
Then is set at each iteration as:
[TABLE]
Where the learning rate can be fine-tuned as a global rate for all s or as a unique rate per .
To estimate the approximating q distribution, BBVI uses the mean field approximation. We therefore define the approximating distribution q as in (9.9):
[TABLE]
The variational mean field distributions q from (9.9) in the Credibility Estimation (first stage) of the YN model are found in (12.14). Their free variational parameters to estimate are in (12.15).
[TABLE]
[TABLE]
For the Labeling part (second stage) of the proposed model, the mean field distributions q from (9.9) are defined in (12.16), (12.17), and (12.18).
[TABLE]
[TABLE]
[TABLE]
Their free variational parameters to estimate are defined in (12.19), (12.20), and (12.21) respectively.
[TABLE]
[TABLE]
[TABLE]
As shown in (9.11), to maximize the ELBO, we need the expectations under q. Given that we prefer to avoid the derivation for the YN model joint distribution, we used the black box method by approximating the gradient of the ELBO as defined in (9.12).
To apply this method to our model, we had to write the required functions for both the Credibility stage and the Labeling stage. In this appendix, we show only the derivation for the second stage (the gradients for the training part are a simplification of the presented derivations).
12.1 Labeling Parameters Estimation
The joint distribution to be inferred is:
[TABLE]
First, for each variable, we defined the log probability of all distributions containing the free parameters in order to obtain the mean field . The priors are:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Then, we wrote those log probabilities to estimate the gradient with respect to the variational parameters:
[TABLE]
[TABLE]
[TABLE]
Finally, we wrote the gradients for each parameter to be estimated, where :
[TABLE]
[TABLE]
[TABLE]
[TABLE]
12.2 Constrained Parameters
All the estimated parameters must be positive to remain in their distribution domain. In addition, each vector and the vector must sum to one. We used the soft-plus function and a normalized soft-plus function to deal with these constraints.
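A minimal sketch of the two transformations is shown below. The function names and the choice to normalize along the last axis are assumptions of this example; the exact form used in the released code may differ.

```python
import numpy as np

def softplus(u):
    """log(1 + exp(u)), computed stably; maps any real value to a positive one."""
    return np.logaddexp(0.0, u)

def normalized_softplus(u):
    """Soft-plus followed by normalization, so each vector sums to one."""
    s = softplus(u)
    return s / s.sum(axis=-1, keepdims=True)

unconstrained = np.array([[-2.0, 0.0, 3.0],
                          [1.0, 1.0, 1.0]])
positive = softplus(unconstrained)
simplex = normalized_softplus(unconstrained)
```

The optimizer can then update the unconstrained values freely, while the transformed values always stay positive (first function) or on the probability simplex (second function).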
References
- [1] Amari SI (1998) Natural gradient works efficiently in learning. Neural computation 10(2):251–276
- [2] Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: A review for statisticians. Journal of the American Statistical Association (just-accepted)
- [3] Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic variational inference. The Journal of Machine Learning Research 14(1):1303–1347
- [4] Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK (1999) An introduction to variational methods for graphical models. Machine learning 37(2):183–233
- [5] Kullback S, Leibler RA (1951) On information and sufficiency. The annals of mathematical statistics 22(1):79–86
- [6] Murphy KP (2012) Machine learning: a probabilistic perspective. MIT press
- [7] Opper M, Saad D (2001) Advanced mean field methods: Theory and practice. MIT press
- [8] Robert CP (2004) Monte carlo methods. Wiley Online Library
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- [1] Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of the eleventh annual conference on Computational learning theory, ACM, pp 92–100
- [2] Chaney AJ (2015) A guide to black box variational inference for gamma distributions
- [3] Cook KH, Alcock C, Allsman R, Axelrod T, Freeman K, Peterson B, Quinn P, Rodgers A, Bennett D, Reimann J, et al (1995) Variable stars in the macho collaboration 1 database. In: International Astronomical Union Colloquium, Cambridge University Press, vol 155, pp 221–231
- [4] Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the em algorithm. Applied statistics pp 20–28
- [5] Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, pp 248–255
- [6] Drake A, Djorgovski S, Mahabal A, Beshore E, Larson S, Graham M, Williams R, Christensen E, Catelan M, Boattini A, et al (2009) First results from the catalina real-time transient survey. The Astrophysical Journal 696(1):870
- [7] Gelfand AE, Smith AF (1990) Sampling-based approaches to calculating marginal densities. Journal of the American statistical association 85(410):398–409
- [8] Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Statistical science pp 457–472
