Generative Models for Learning from Crowds

Chi Hong

arXiv:1706.03930·cs.AI·October 4, 2017

Generative Models for Learning from Crowds

Chi Hong

PDF

Open Access

TL;DR

This paper introduces generative probabilistic models for aggregating labels from crowds, utilizing Gibbs sampling and a new variational inference algorithm, demonstrating superior performance over existing methods.

Contribution

The paper presents a novel generative modeling approach with a new inference algorithm for better label aggregation from crowds.

Findings

01

Our methods outperform state-of-the-art label aggregation techniques.

02

The proposed variational inference algorithm is efficient and effective.

03

Empirical results validate the superiority of our models.

Abstract

In this paper, we propose generative probabilistic models for label aggregation. We use Gibbs sampling and a novel variational inference algorithm to perform the posterior inference. Empirical results show that our methods consistently outperform state-of-the-art methods.

Tables3

Table 1. Table 1. Datasets Overview

Dataset	Workers	Items	Labels	Classes
Heartdisease	12	237	952	2
Web Search	76	2653	14638	5
RTE	164	800	8000	2
Synthetic	100	1000	11615	5

Table 2. Table 2. Error-rates (%) of Methods

Method	Heartdisease	Web Search	RTE	Synthetic
Majority Voting	24.89	24.76	9.88	19.80
DS-EM	18.99	18.58	7.50	13.50
BCC	18.82	21.57	7.15	11.26
IDBLA	16.03	16.66	7.13	9.50
Fixed-IDBLA	15.61	18.77	7.50	9.50

Table 3. Table 3. Error-rates of Methods on the Heartdisease Dataset

Model	Initialization	Without Initialization
IDBLA	0.160	0.409
F-IDBLA	0.156	0.430

Equations100

T_{i} = ar g max_{l \in {1, ..., C}} k = 1 \sum K \mathbbm I (L_{i, k} = l)

T_{i} = ar g max_{l \in {1, ..., C}} k = 1 \sum K \mathbbm I (L_{i, k} = l)

p (T_{i} = t ∣ L, ϕ) \propto p (T_{i} = t) p (L_{i} ∣ T_{i} = t, ϕ) \propto p_{t} k = 1 \prod K l = 1 \prod C (ϕ_{t, l}^{(k)})^{\mathbbm I (L_{i, k} = l)}

p (T_{i} = t ∣ L, ϕ) \propto p (T_{i} = t) p (L_{i} ∣ T_{i} = t, ϕ) \propto p_{t} k = 1 \prod K l = 1 \prod C (ϕ_{t, l}^{(k)})^{\mathbbm I (L_{i, k} = l)}

M (ϕ) = i = 1 \prod I t = 1 \prod C {p_{t} k = 1 \prod K l = 1 \prod C (ϕ_{t, l}^{(k)})^{\mathbbm I (L_{i, k} = l)}}^{\mathbbm I (T_{i} = t)}

M (ϕ) = i = 1 \prod I t = 1 \prod C {p_{t} k = 1 \prod K l = 1 \prod C (ϕ_{t, l}^{(k)})^{\mathbbm I (L_{i, k} = l)}}^{\mathbbm I (T_{i} = t)}

p (T, Q, π, α, β ∣ L, μ) \propto p (T ∣ α) p (α ∣ γ_{α}) p (Q ∣ β) p (β ∣ γ_{β}) p (L ∣ π, T, Q) p (π ∣ ω)

p (T, Q, π, α, β ∣ L, μ) \propto p (T ∣ α) p (α ∣ γ_{α}) p (Q ∣ β) p (β ∣ γ_{β}) p (L ∣ π, T, Q) p (π ∣ ω)

N_{l} (k, h, t, c) = i = 1 \sum I \mathbbm I (L_{i, k} = c, T_{i} = t, Q_{i} = h)

N_{l} (k, h, t, c) = i = 1 \sum I \mathbbm I (L_{i, k} = c, T_{i} = t, Q_{i} = h)

p (T, Q, π, α, β ∣ L, μ) \propto {c = 1 \prod C α_{c}^{N_{t} (c) + γ_{α} - 1}} {h = 1 \prod H β_{h}^{N_{q} (h) + γ_{β} - 1}} {k = 1 \prod K h = 1 \prod H t = 1 \prod C c = 1 \prod C (π_{t, c}^{(k, h)})^{N_{l} (k, h, t, c) + ω - 1}}

p (T, Q, π, α, β ∣ L, μ) \propto {c = 1 \prod C α_{c}^{N_{t} (c) + γ_{α} - 1}} {h = 1 \prod H β_{h}^{N_{q} (h) + γ_{β} - 1}} {k = 1 \prod K h = 1 \prod H t = 1 \prod C c = 1 \prod C (π_{t, c}^{(k, h)})^{N_{l} (k, h, t, c) + ω - 1}}

p (T_{i} = t ∣ r es t) \propto α_{t} k = 1 \prod K c = 1 \prod C (π_{t, c}^{(k, Q_{i})})^{\mathbbm I (L_{i, k} = c)}

p (T_{i} = t ∣ r es t) \propto α_{t} k = 1 \prod K c = 1 \prod C (π_{t, c}^{(k, Q_{i})})^{\mathbbm I (L_{i, k} = c)}

p (Q_{i} = h ∣ r es t) \propto β_{h} k = 1 \prod K c = 1 \prod C (π_{T_{i}, c}^{(k, h)})^{\mathbbm I (L_{i, k} = c)}

p (Q_{i} = h ∣ r es t) \propto β_{h} k = 1 \prod K c = 1 \prod C (π_{T_{i}, c}^{(k, h)})^{\mathbbm I (L_{i, k} = c)}

p (π_{t}^{(k, h)} ∣ r es t) \propto c = 1 \prod C (π_{t, c}^{(k, h)})^{N_{l} (k, h, t, c) + ω - 1}

p (π_{t}^{(k, h)} ∣ r es t) \propto c = 1 \prod C (π_{t, c}^{(k, h)})^{N_{l} (k, h, t, c) + ω - 1}

p (α ∣ r es t) \propto c = 1 \prod C α_{c}^{N_{t} (c) + γ_{α} - 1}

p (α ∣ r es t) \propto c = 1 \prod C α_{c}^{N_{t} (c) + γ_{α} - 1}

p (β ∣ r es t) \propto h = 1 \prod H β_{h}^{N_{q} (h) + γ_{β} - 1}

p (β ∣ r es t) \propto h = 1 \prod H β_{h}^{N_{q} (h) + γ_{β} - 1}

π^{(k, H - 1)} = 1 - ν \frac{ν}{C - 1} ⋮ \frac{ν}{C - 1} \frac{ν}{C - 1} 1 - ν ⋮ \frac{ν}{C - 1} \dots \dots ⋱ \dots \frac{ν}{C - 1} \frac{ν}{C - 1} ⋮ 1 - ν

π^{(k, H - 1)} = 1 - ν \frac{ν}{C - 1} ⋮ \frac{ν}{C - 1} \frac{ν}{C - 1} 1 - ν ⋮ \frac{ν}{C - 1} \dots \dots ⋱ \dots \frac{ν}{C - 1} \frac{ν}{C - 1} ⋮ 1 - ν

π^{(k, H)} = 1 - δ \frac{δ}{C - 1} ⋮ \frac{δ}{C - 1} \frac{δ}{C - 1} 1 - δ ⋮ \frac{δ}{C - 1} \dots \dots ⋱ \dots \frac{δ}{C - 1} \frac{δ}{C - 1} ⋮ 1 - δ

π^{(k, H)} = 1 - δ \frac{δ}{C - 1} ⋮ \frac{δ}{C - 1} \frac{δ}{C - 1} 1 - δ ⋮ \frac{δ}{C - 1} \dots \dots ⋱ \dots \frac{δ}{C - 1} \frac{δ}{C - 1} ⋮ 1 - δ

p (L ∣ π, T, Q) p (π ∣ ψ) \propto {k = 1 \prod K h = 1 \prod H - 2 t = 1 \prod C c = 1 \prod C (π_{t, c}^{(k, h)})^{N_{l} (k, h, t, c) + ψ - 1}} {k = 1 \prod K h = H - 1 \prod H t = 1 \prod C c = 1 \prod C (π_{t, c}^{(k, h)})^{N_{l} (k, h, t, c)}}

p (L ∣ π, T, Q) p (π ∣ ψ) \propto {k = 1 \prod K h = 1 \prod H - 2 t = 1 \prod C c = 1 \prod C (π_{t, c}^{(k, h)})^{N_{l} (k, h, t, c) + ψ - 1}} {k = 1 \prod K h = H - 1 \prod H t = 1 \prod C c = 1 \prod C (π_{t, c}^{(k, h)})^{N_{l} (k, h, t, c)}}

p (π_{t}^{(k, h)} ∣ r es t) \propto c = 1 \prod C (π_{t, c}^{(k, h)})^{N_{l} (k, h, t, c) + ψ - 1}

p (π_{t}^{(k, h)} ∣ r es t) \propto c = 1 \prod C (π_{t, c}^{(k, h)})^{N_{l} (k, h, t, c) + ψ - 1}

q (T, Q, π, α, β) = q (π, α, β ∣ T, Q) i = 1 \prod I q (T_{i} ∣ λ_{i}) q (Q_{i} ∣ ρ_{i})

q (T, Q, π, α, β) = q (π, α, β ∣ T, Q) i = 1 \prod I q (T_{i} ∣ λ_{i}) q (Q_{i} ∣ ρ_{i})

lo g p (L ∣ μ) = K L (q (T, Q, π, α, β) ∣ p (T, Q, π, α, β ∣ L, μ)) + L (q (T, Q, π, α, β)) \geq L (q (T, Q, π, α, β))

lo g p (L ∣ μ) = K L (q (T, Q, π, α, β) ∣ p (T, Q, π, α, β ∣ L, μ)) + L (q (T, Q, π, α, β)) \geq L (q (T, Q, π, α, β))

L (q (T, Q, π, α, β)) = \mathbbm E_{q (π, α, β ∣ T, Q) q (T, Q)} [lo g p (T, Q, π, α, β, L ∣ μ)] - \mathbbm E_{q (π, α, β ∣ T, Q) q (T, Q)} [lo g q (π, α, β ∣ T, Q) q (T, Q)] = \iint q (π, α, β ∣ T, Q) q (T, Q) lo g \frac{p ( T , Q , π , α , β , L ∣ μ )}{q ( π , α , β ∣ T , Q ) q ( T , Q )} d_{π, α, β} d_{T, Q}

L (q (T, Q, π, α, β)) = \mathbbm E_{q (π, α, β ∣ T, Q) q (T, Q)} [lo g p (T, Q, π, α, β, L ∣ μ)] - \mathbbm E_{q (π, α, β ∣ T, Q) q (T, Q)} [lo g q (π, α, β ∣ T, Q) q (T, Q)] = \iint q (π, α, β ∣ T, Q) q (T, Q) lo g \frac{p ( T , Q , π , α , β , L ∣ μ )}{q ( π , α , β ∣ T , Q ) q ( T , Q )} d_{π, α, β} d_{T, Q}

L (q (T, Q)) = max_{q (π, α, β ∣ T, Q)} L (q (T, Q, π, α, β)) = \mathbbm E_{q (T, Q)} [lo g p (T, Q, L ∣ μ)] - \mathbbm E_{q (T, Q)} [lo g q (T, Q)]

L (q (T, Q)) = max_{q (π, α, β ∣ T, Q)} L (q (T, Q, π, α, β)) = \mathbbm E_{q (T, Q)} [lo g p (T, Q, L ∣ μ)] - \mathbbm E_{q (T, Q)} [lo g q (T, Q)]

p (T, Q ∣ L, μ) = ∭ p (T, Q, π, α, β ∣ L, μ) d_{π} d_{α} d_{β} \propto \frac{\prod _{c = 1}^{C} Γ ( N _{t} ( c ) + γ _{α} )}{Γ ( I + C γ _{α} )} \frac{\prod _{h = 1}^{H} Γ ( N _{q} ( h ) + γ _{β} )}{Γ ( I + H γ _{β} )} k = 1 \prod K h = 1 \prod H t = 1 \prod C \frac{\prod _{c = 1}^{C} Γ ( N _{l} ( k , h , t , c ) + ω )}{Γ ( N _{l} ( k , h , t , \cdot ) + C ω )}

p (T, Q ∣ L, μ) = ∭ p (T, Q, π, α, β ∣ L, μ) d_{π} d_{α} d_{β} \propto \frac{\prod _{c = 1}^{C} Γ ( N _{t} ( c ) + γ _{α} )}{Γ ( I + C γ _{α} )} \frac{\prod _{h = 1}^{H} Γ ( N _{q} ( h ) + γ _{β} )}{Γ ( I + H γ _{β} )} k = 1 \prod K h = 1 \prod H t = 1 \prod C \frac{\prod _{c = 1}^{C} Γ ( N _{l} ( k , h , t , c ) + ω )}{Γ ( N _{l} ( k , h , t , \cdot ) + C ω )}

p (T_{i} = t ∣ r es t) \propto (N_{t}^{\neg i} (t) + γ_{α}) k \in S_{i} \prod \frac{N _{l}^{\neg i} ( k , Q _{i} , t , L _{i, k} ) + ω}{N _{l}^{\neg i} ( k , Q _{i} , t , \cdot ) + C ω}

p (T_{i} = t ∣ r es t) \propto (N_{t}^{\neg i} (t) + γ_{α}) k \in S_{i} \prod \frac{N _{l}^{\neg i} ( k , Q _{i} , t , L _{i, k} ) + ω}{N _{l}^{\neg i} ( k , Q _{i} , t , \cdot ) + C ω}

p (Q_{i} = h ∣ r es t) \propto (N_{q}^{\neg i} (h) + γ_{β}) k \in S_{i} \prod \frac{N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} ) + ω}{N _{l}^{\neg i} ( k , h , T _{i} , \cdot ) + C ω}

p (Q_{i} = h ∣ r es t) \propto (N_{q}^{\neg i} (h) + γ_{β}) k \in S_{i} \prod \frac{N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} ) + ω}{N _{l}^{\neg i} ( k , h , T _{i} , \cdot ) + C ω}

λ_{i, t} = q (T_{i} = t) \propto exp (\mathbbm E_{q (T^{\neg i}, Q)} [lo g p (T_{i} = t ∣ r es t)])

λ_{i, t} = q (T_{i} = t) \propto exp (\mathbbm E_{q (T^{\neg i}, Q)} [lo g p (T_{i} = t ∣ r es t)])

ρ_{i, h} = q (Q_{i} = h) \propto exp (\mathbbm E_{q (T, Q^{\neg i})} [lo g p (Q_{i} = h ∣ r es t)])

ρ_{i, h} = q (Q_{i} = h) \propto exp (\mathbbm E_{q (T, Q^{\neg i})} [lo g p (Q_{i} = h ∣ r es t)])

\rho_{i,h}\propto\exp\Big{(}\mathbbm{E}_{q(\bm{T},\bm{Q}^{\lnot i})}\Big{[}\log(\mathcal{N}_{q}^{\lnot i}(h)+\gamma_{\beta})+\\ \sum_{k\in S_{i}}\big{(}\log(\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},L_{i,k})+\omega)-\log(\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},\cdot)+C\omega)\big{)}\Big{]}\Big{)}

\rho_{i,h}\propto\exp\Big{(}\mathbbm{E}_{q(\bm{T},\bm{Q}^{\lnot i})}\Big{[}\log(\mathcal{N}_{q}^{\lnot i}(h)+\gamma_{\beta})+\\ \sum_{k\in S_{i}}\big{(}\log(\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},L_{i,k})+\omega)-\log(\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},\cdot)+C\omega)\big{)}\Big{]}\Big{)}

N_{l}^{\neg i} (k, h, T_{i}, L_{i, k}) = j \neq = i, L_{j, k} = L_{i, k} \sum \mathbbm I (Q_{j} = h, T_{j} = T_{i})

N_{l}^{\neg i} (k, h, T_{i}, L_{i, k}) = j \neq = i, L_{j, k} = L_{i, k} \sum \mathbbm I (Q_{j} = h, T_{j} = T_{i})

\mathbbm E_{q^{\neg i}} [N_{l}^{\neg i} (k, h, T_{i}, L_{i, k})] = j \neq = i, L_{j, k} = L_{i, k} \sum (λ_{j}^{T} λ_{i}) ρ_{j, h}

\mathbbm E_{q^{\neg i}} [N_{l}^{\neg i} (k, h, T_{i}, L_{i, k})] = j \neq = i, L_{j, k} = L_{i, k} \sum (λ_{j}^{T} λ_{i}) ρ_{j, h}

\mathrm{Var}_{q^{\lnot i}}[\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},L_{i,k})]=\sum_{j\neq i,L_{j,k}=L_{i,k}}(\bm{\lambda}_{j}^{\rm T}\bm{\lambda}_{i})\rho_{j,h}\big{(}1-(\bm{\lambda}_{j}^{\rm T}\bm{\lambda}_{i})\rho_{j,h}\big{)}

\mathrm{Var}_{q^{\lnot i}}[\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},L_{i,k})]=\sum_{j\neq i,L_{j,k}=L_{i,k}}(\bm{\lambda}_{j}^{\rm T}\bm{\lambda}_{i})\rho_{j,h}\big{(}1-(\bm{\lambda}_{j}^{\rm T}\bm{\lambda}_{i})\rho_{j,h}\big{)}

lo g (N_{l}^{\neg i} (k, h, T_{i}, L_{i, k}) + ω) \approx lo g (\mathbbm E_{q^{\neg i}} [N_{l}^{\neg i} (k, h, T_{i}, L_{i, k})] + ω) + \frac{N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} ) - \mathbbm E _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )]}{\mathbbm E _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )] + ω} - \frac{( N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} ) - \mathbbm E _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )] ) ^{2}}{2 ( \mathbbm E _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )] + ω ) ^{2}}

lo g (N_{l}^{\neg i} (k, h, T_{i}, L_{i, k}) + ω) \approx lo g (\mathbbm E_{q^{\neg i}} [N_{l}^{\neg i} (k, h, T_{i}, L_{i, k})] + ω) + \frac{N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} ) - \mathbbm E _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )]}{\mathbbm E _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )] + ω} - \frac{( N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} ) - \mathbbm E _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )] ) ^{2}}{2 ( \mathbbm E _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )] + ω ) ^{2}}

\mathbbm E_{q^{\neg i}} [lo g (N_{l}^{\neg i} (k, h, T_{i}, L_{i, k}) + ω)] \approx lo g (\mathbbm E_{q^{\neg i}} [N_{l}^{\neg i} (k, h, T_{i}, L_{i, k})] + ω) - \frac{Var _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )]}{2 ( \mathbbm E _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )] + ω ) ^{2}}

\mathbbm E_{q^{\neg i}} [lo g (N_{l}^{\neg i} (k, h, T_{i}, L_{i, k}) + ω)] \approx lo g (\mathbbm E_{q^{\neg i}} [N_{l}^{\neg i} (k, h, T_{i}, L_{i, k})] + ω) - \frac{Var _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )]}{2 ( \mathbbm E _{q^{\neg i}} [ N _{l}^{\neg i} ( k , h , T _{i} , L _{i, k} )] + ω ) ^{2}}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMusic and Audio Processing · Mobile Crowdsensing and Crowdsourcing · Data Stream Mining Techniques

Full text

Generative Models for Learning from Crowds

Chi Hong

Tsinghua National Laboratory for Information Science and TechnologyDepartment of Computer Science and Technology,Tsinghua UniversityBeijingChina

[email protected]

( )

Abstract.

In this paper, we propose generative probabilistic models for label aggregation. We use Gibbs sampling and a novel variational inference algorithm to perform the posterior inference. Empirical results show that our methods consistently outperform state-of-the-art methods.

††copyright: rightsretained††doi: / ††isbn: / / ††conference: ; ; ††journalyear:

1. Introduction

Nowadays, huge amounts of data are produced from various sources which allows people to build models for all kinds of applications. Usually, these data cannot be directly used for model building, they should be labeled at first. It is expensive and time-consuming to employ domain experts to manually label data. Recently, crowdsourcing has become a cheap and efficient way to make large training datasets (eydaapproximating, ; wang2016blind, ; kamar2012combining, ; welinder2010multidimensional, ; li2017reliable, ; chen2013pairwise, ; sung2017classification, ). Online platforms such as Crowdflower111http://crowdflower.com/ and Amazon Mechanical Turk222https://www.mturk.com/mturk can split large dataset into small parts and distribute these small labeling tasks to workers (whitehill2009whose, ; li2016crowdsourcing, ; kazai2016quality, ; zhuang2015leveraging, ). However, these non-professional workers may provide noisy labels. The accuracy of each worker may be lower than expected. To improve the accuracy, it is important to aggregate the noisy labels and infer the true labels.

Note that each item may be labeled multiple times by different workers. The most straightforward way to infer the true label for each item is to simply choose the most frequent label. This approach is called as majority voting. In most cases, it has significantly better performance than single workers (zhou2012learning, ; snow2008cheap, ). However, majority voting implicitly assumes that all workers are equally good and treats each item independently. For a specific case, majority voting will make mistake if there are few true labels given by experienced workers and lots of identical error labels generated by novices. To overcome this problem, some methods take into account the worker reliability in label aggregation. Dawid and Skene (dawid1979maximum, ) associated each worker with a confusion matrix which can evaluate the reliabilities and potential biases of the workers. For a given worker, each element of the confusion matrix is a probability that the worker labels items in one class as another.

In recent years, there are many works which also use confusion matrices to build label aggregation models (raykar2010learning, ; kim2012bayesian, ; kamar2015identifying, ; raykar2010learning, ; kim2012bayesian, ). However, in these methods, each worker only uses one confusion matrix to generate labels for items. There is a one-to-one correspondence between an worker and an confusion matrix. They don’t consider the characteristics of items, such as difficulty. It is somewhat unreasonable. Actually, different items may have different difficulty levels. For example, there are two face photos of the same person, one is very clear, the other is blurred. A worker is more likely to make mistakes when recognizing the blurred one. He or she should have different confusion in the face of items (face photos) with different difficulties (blurred or clear).

Different from these methods, in this paper, we set a confusion matrix for each worker-difficulty level pair. The confusion matrices is not only correlate to workers, but also correlate to item difficulty levels. We haven’t seen any previous work taking this into account. We assume that the collected labels are generated by a distribution over the confusion matrices, item difficulties, true labels and labels. Based on this assumption, we propose models which utilize item difficulty information in the process of label aggregation. In this paper, our main contributions are as follows:

•

We define a generative probabilistic model that considers item difficulty in label aggregation. Its model parameters can be simply inferred by Gibbs sampling. We call this model as Item Difficulty-Based Label Aggregation (IDBLA) model.

•

We propose a variation of the IDBLA model, considering the existence of particularly simple items and particularly difficult items.

•

Gibbs sampling is hard to diagnose convergence. So we derive a novel variational inference algorithm for the IDBLA model. The proposed algorithm can converge to a good solution within a few iterations.

•

We also design a method to preliminarily predict the true label and difficulty of each item. The predicted results are used to initialize our models. The initialization is important for the performance

We conduct our experiments on three real datasets and one synthetic dataset. The experimental results show that our methods have better performance than state-of-the-art label aggregation methods.

The structure of the paper is as follows. Section 2 introduces the related work. Section 3 introduces the preliminary notation, the majority voting algorithm and the DS-EM algorithm. Section 4 illustrates our methods in detail. Section 5 introduces the experiments and the results. Section 6 concludes the paper.

2. Related Work

Raykar et al. (raykar2010learning, ) proposed the two two-coin model for binary labeling tasks, which can be seen as a variation of the confusion matrix. In their method, the labels generated by workers are directly used for learning. Estimating the reliability of the workers and the true labels of the items is a byproduct of the method. In Minimax Conditional Entropy (MMCE) (zhou2012learning, ), each worker-item pair is related to a independent distribution. The authors used a minimax entropy method to estimate the distributions and infer the true labels. Kim and Ghahramani (kim2012bayesian, ) used the confusion matrix to define their Bayesian Classifier Combination (BCC) model which is a Bayesian extension of Dawid and Skene’s method. Venanzi et al. (venanzi2014community, ) extended the BCC model and proposed the CommunityBCC model, which groups workers into communities. The workers that belong to the same community have the similar confusion matrices. CommunityBCC is particularly suit for sparse dataset which doesn’t have enough labels to learn a large number of parameters. Nguyen et al. (nguyen2016correlated, ) also proposed a graphical model based approach for crowdsourcing. This model can capture the correlations between entries in worker confusion matrices and use the correlations to improve the estimation of the confusion matrices.

There are many other methods besides confusion matrix-based methods. Zhou and He (zhou2016crowdsourcing, ) proposed a approach which based on tensor augmentation and completion for crowdsourcing. The collected labels are represented as tensor. The GLAD model (whitehill2009whose, ) is based on parameters which represent the expertise of workers and difficulties of items. The authors used the Expectation-Maximization algorithm as the learning and inference algorithm. GLAD is a label aggregation model for binary labeling tasks. It simultaneously considers the expertise of each worker and the difficulty of each item. Liu et al. (liu2012variational, ) also used a single parameter to describe the reliability of a worker. Gaunt et al. (gaunt2016training, ) trained a deep neural network for label aggregation. Unlike Bayesian network based methods, this method doesn’t need to make assumptions for the model definition. However, it is not adapt to the case of incomplete data where some workers labeled only few items. Karger et al. (karger2011iterative, ) put different weights on workers in their model. This work extended the majority voting algorithm. In recent years, Multi-Armed Bandits based method (tran2014efficient, ) has been proposed for crowdsourcing. There are also many works predict the truth from a set of conflicting views (galland2010corroborating, ; yu2014wisdom, ).

Markov chain Monte Carlo methods (george1993variable, ; griffiths2004finding, ; kim1999state, ) and variational inference methods (wang2011online, ; wainwright2008graphical, ; kurihara2007collapsed, ; teh2007collapsed, ; bishop2006pattern, ; he2015hawkestopic, ) have been proposed for many models. These works provided us with references. However, it is hard to find a general algorithm, a concrete model needs a concrete analysis.

3. Preliminary

3.1. Notation

In the aggregation problem, we consider that there are $K$ workers and $I$ items. In each task, the workers are required to label one item. $L_{i,k}$ is the label of item $i$ that labeled by worker $k$ , where $L_{i,k}\in\left\{1,...,C\right\}$ . $C$ is the number of classes. $\bm{L}$ is a collection of all the collected labels. Note that each worker may only labels a part of the dataset, if worker $k$ hasn’t labeled item $i$ , then $L_{i,k}=None$ . $T_{i}$ is the true label of item $i$ and $\bm{T}=\{T_{1},...,T_{I}\}$ includes all the true labels.

3.2. Majority Voting

Majority voting is a simple and straightforward way to aggregate labels. For each item $i$ , it selects the most frequent answer as the true label $T_{i}$ . This can be represented as:

[TABLE]

where $\mathbbm{I}(\cdot)$ is the indicator function taking the value $1$ when the predicate is true, and 0 otherwise. Majority voting is easy to implement and normally can generate high quality aggregation result although it only considers each item independently. Recent years, there are several works have extended this method (tian2015max, ; li2014error, ).

3.3. DS-EM

DS-EM (dawid1979maximum, ) is a classic generative approach proposed by Dawid and Skene. It assumes that the worker has consistent performances in different labeling tasks and let $\phi^{(k)}_{t,c}$ be the probability of worker $k$ labels an item as class $c$ when the true label of the item is $t$ . It also assumes that $T_{i}$ has a multinomial distribution, $p(T_{i}=t)=p_{t}$ , where $i\in\{1,...,I\}$ and $t\in\{1,...,C\}$ . DS-EM uses an expectation-maximization (EM) algorithm to obtain maximum likelihood estimates of $\bm{\phi}$ and $\bm{T}$ .

E-step: Estimating $\bm{T}$ by using:

[TABLE]

where $\bm{L}_{i}=\{L_{i,1},...,L_{i,k}\}$ . The value of $p_{t}$ can be estimated by $\hat{p}_{t}=\sum_{i=1}^{I}\mathbbm{I}(T_{i}=t)/I$ .

M-step: The likelihood for the full data is:

[TABLE]

where the values of $\bm{T}$ and $\{p_{1},...,p_{C}\}$ are known, then $\bm{\phi}$ can be estimated by the maximum-likelihood estimation: $\bm{\hat{\phi}}=\mathop{\arg\max}_{\bm{\phi}}M(\bm{\phi})$ . Note that this optimisation problem can be analytically solved.

4. Methods

As described in Section 3.3, DS-EM associates each worker $k$ with a probabilistic confusion matrix $\bm{\phi}^{(k)}$ . Recently, there are many works (zhang2014spectral, ; liu2012variational, ; raykar2010learning, ; kim2012bayesian, ) use this kind of confusion matrix to evaluate the ability of a worker, and build their own models.

As mentioned in Section 1, these models don’t consider the characteristics of the labeling tasks, such as the difficulties of items. In these models, the confusion matrix is only correlate to worker $k$ . It is somewhat unreasonable. Therefore, we try predicting and utilizing the item difficulty information in label aggregation. In our models, we use two indexes for the confusion matrix, one is worker $k$ and the other is difficulty level $h$ . Next, we will describe our methods detailedly, including model representation, inference and parameter initialization.

4.1. Item Difficulty-Based Label Aggregation Model

In our model, each item $i$ correlates to a difficulty level $Q_{i}\in\left\{1,...,H\right\}$ , where $H$ is the number of difficulty levels. We set an independent probabilistic confusion matrix $\bm{\pi}^{(k,h)}$ for each worker-difficulty level pair $(k,h)$ . Consider item $i$ with true label $T_{i}\in\left\{1,...,C\right\}$ and difficulty level $Q_{i}$ , we let $p(L_{i,k}|T_{i},Q_{i})=\pi_{T_{i},L_{i,k}}^{(k,Q_{i})}$ . That is to say the label $L_{i,k}$ that generated by worker $k$ for item $i$ has a multinomial distribution with parameters $\bm{\pi}_{T_{i}}^{(k,Q_{i})}$ . We assume that $\bm{\pi}_{t}^{(k,h)}\sim Dirichlet(\omega)$ . The Dirichlet distribution is the conjugate prior of the multinomial distribution. The true label $T_{i}$ and difficulty level $Q_{i}$ of item $i$ have multinomial distributions.

The generative process is

Draw all the confusion matrices $\bm{\pi}_{t}^{(k,h)}\sim Dirichlet(\omega)$ .
Draw the true label proportion $\bm{\alpha}\sim Dirichlet(\gamma_{\alpha})$ .
Draw the difficulty level proportion $\bm{\beta}\sim Dirichlet(\gamma_{\beta})$ .
For each item i,

(a) Draw a true label $T_{i}\sim Multinomial(\bm{\alpha})$ .

(b) Draw a difficulty level $Q_{i}\sim Multinomial(\bm{\beta})$ .

(c) For each label $L_{i,k}$ ,

$(c_{1})$ Draw the label $L_{i,k}\sim Multinomial(\bm{\pi}_{T_{i}}^{(k,Q_{i})})$ .

Each worker generates her own labels independently. Labels for different items are independent and identically distributed. Let $\bm{\mu}$ represents the hyperparameters $\omega$ , $\gamma_{\alpha}$ and $\gamma_{\beta}$ . Then the posterior distribution over the model parameters is:

[TABLE]

Now, we introduce several notations that are needed in the following. $\mathcal{N}_{l}(k,h,t,c)$ is the number of times worker $k$ labels $c$ when the item difficulty level is $h$ and the true label is $t$ .

[TABLE]

$\mathcal{N}_{t}(c)$ is the number of items with true label $c$ . $\mathcal{N}_{q}(h)$ is the number of items with difficulty level $h$ . $\mathcal{N}_{t}(c)=\sum_{i=1}^{I}\mathbbm{I}(T_{i}=c)$ and $\mathcal{N}_{q}(h)=\sum_{i=1}^{I}\mathbbm{I}(Q_{i}=h)$ .

According to the generative process, equation (2) can be rewritten as:

[TABLE]

The parameters $\bm{T}$ , $\bm{Q}$ , $\bm{\pi}$ , $\bm{\alpha}$ and $\bm{\beta}$ can be learned with a Gibbs sampler. For each iteration, we update each parameter by sampling from its conditional distribution given the rest parameters. Here are the conditional distributions for Gibbs sampling which are derived from equation (3):

[TABLE]

We see that the posterior distributions of $\bm{\pi}_{t}^{(k,h)}$ , $\bm{\alpha}$ and $\bm{\beta}$ take the form of Dirichlet distributions. The new values of $T_{i}$ and $Q_{i}$ are generated by multinomial distributions. So model parameters $\bm{T}$ , $\bm{Q}$ , $\bm{\pi}$ , $\bm{\alpha}$ and $\bm{\beta}$ all can be easily sampled.

4.2. A Variation of the IDBLA Model

In the dataset, there may be some very easy items and some very difficult items. We assume that every worker labels the easy items with a high correct rate and labels the difficult items with a very low correct rate. Based on this assumption, we propose the Fixed-IDBLA model which is a simple variation of the IDBLA model.

In Fixed-IDBLA, we set $\bm{\pi}^{(k,H-1)}$ for easy items and set $\bm{\pi}^{(k,H)}$ for difficult items, where $k\in\left\{1,...,K\right\}$ . $\bm{\pi}^{(k,H-1)}$ and $\bm{\pi}^{(k,H)}$ are fixed as:

[TABLE]

where $\nu$ and $\delta$ are constants. We assume that $\bm{\pi}_{t}^{(k,h)}\sim Dirichlet(\psi)$ . The distributions of $\bm{L}$ , $\bm{T}$ , $\bm{\alpha}$ , $\bm{Q}$ and $\bm{\beta}$ are the same as described in 4.1. With the above definitions, we have:

[TABLE]

We again use Gibbs sampling to sample all the unknown model parameters. The latent parameters $\bm{T}$ , $\bm{Q}$ , $\bm{\alpha}$ and $\bm{\beta}$ are still sampled by (4), (5), (7) and (8). $\bm{\pi}_{t}^{(k,h)}$ is sampled by conditional distribution:

[TABLE]

where $h=1,...,H-2$ . Note that the posterior distribution of $\bm{\pi}_{t}^{(k,h)}$ is a Dirichlet distribution.

4.3. Collapsed Variational Inference

We have introduced the Gibbs sampling algorithm for the IDBLA and Fixed-IDBLA model. However, Gibbs sampling is hard to diagnose convergence. It also needs large numbers of samples to reduce sampling noise. So we would like to use variational inference algorithm to perform the posterior inference and parameter estimation. The most common variational inference algorithms make the mean-field assumption (wainwright2008graphical, ), which may be too strict in practice. Model parameters can be strongly dependent in the true posterior, the mean-field assumption ignores this dependence and may leading to inaccurate estimates of the posterior (teh2007collapsed, ).

In this Section, we derive a collapsed variational inference for the IDBLA model. We don’t assume independence between $\bm{\pi}$ , $\bm{\alpha}$ and $\bm{\beta}$ , but we still assume that variables $\bm{T}$ and $\bm{Q}$ are mutually independent. We use the variational distribution

[TABLE]

to approximate the posterior $p(\bm{T},\bm{Q},\bm{\pi},\bm{\alpha},\bm{\beta}|\bm{L},\bm{\mu})$ , where $q(\bm{T},\bm{Q})=\prod_{i=1}^{I}q(T_{i}|\bm{\lambda}_{i})q(Q_{i}|\bm{\rho}_{i})$ , the distributions $q(T_{i}|\bm{\lambda}_{i})$ and $q(Q_{i}|\bm{\rho}_{i})$ are multinomial distributions. The data log likelihood can be bounded by

[TABLE]

where $KL(\cdot)$ is the Kullback-Leibler divergence, $\mathcal{L}(q(\bm{T},\bm{Q},\bm{\pi},\bm{\alpha},\bm{\beta}))$ is called evidence lower bound (ELBO).

[TABLE]

where $\mathbbm{E}$ means expectation. We maximize the ELBO with respect to $q(\bm{\pi},\bm{\alpha},\bm{\beta}|\bm{T},\bm{Q})$ . The maximum is achieved at $q(\bm{\pi},\bm{\alpha},\bm{\beta}|\bm{T},\bm{Q})=p(\bm{\pi},\bm{\alpha},\bm{\beta}|\bm{L},\bm{T},\bm{Q},\bm{\mu})$ that means the approximation distribution equals the true distribution. Plugging this equation into equation (12), we get

[TABLE]

where we have used $\int q(\bm{\pi},\bm{\alpha},\bm{\beta}|\bm{T},\bm{Q})d_{\bm{\pi},\bm{\alpha},\bm{\beta}}=1$ . Marginalizing out $\bm{\pi}$ , $\bm{\alpha}$ and $\bm{\beta}$ from the posterior, we have

[TABLE]

where $\mathcal{N}_{l}(k,h,t,\cdot)=\sum_{c=1}^{C}\mathcal{N}_{l}(k,h,t,c)$ , $\Gamma(\cdot)$ is the gamma function. According to equation (14), we derive:

[TABLE]

where ” $\lnot i$ ” means that the frequency is calculated over items except item $i$ , $S_{i}$ is a set of workers who have labeled the item $i$ .

Then we can further maximize $\mathcal{L}(q(\bm{T},\bm{Q}))$ with respect to $q(\bm{T},\bm{Q})$ which is factorized. The following maximization process is just like the variational Bayes algorithm (bishop2006pattern, ). We optimize the variational parameters $\bm{\lambda}$ and $\bm{\rho}$ to maximize $\mathcal{L}(q(\bm{T},\bm{Q}))$ by the coordinate ascent algorithm. $\bm{\lambda}$ and $\bm{\rho}$ are updated iteratively. For brevity, the derivations of equations (17) and (18) are shown in Appendix A.

[TABLE]

$\bm{T}^{\lnot i}$ means excluding $T_{i}$ and $\bm{Q}^{\lnot i}$ means excluding $Q_{i}$ .

Next, we introduce how to compute $\rho_{i,h}$ . Note that the computation of $\lambda_{i,t}$ and $\rho_{i,h}$ are similar. Plugging equation (16) into equation (18), we get

[TABLE]

We use the Gaussian approximation to calculate the expectation terms $\mathbbm{E}_{q(\bm{T},\bm{Q}^{\lnot i})}[\log(\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},L_{i,k})+\omega)]$ , $\mathbbm{E}_{q(\bm{T},\bm{Q}^{\lnot i})}[\log(\mathcal{N}_{q}^{\lnot i}(h)+\gamma_{\beta})]$ and $\mathbbm{E}_{q(\bm{T},\bm{Q}^{\lnot i})}[\log(\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},\cdot)+C\omega)]$ . For brevity, in the following we abbreviate $q(\bm{T},\bm{Q}^{\lnot i})$ as $q^{\lnot i}$ .

Firstly, we apply Gaussian approximation to the expectation $\mathbbm{E}_{q(\bm{T},\bm{Q}^{\lnot i})}[\log(\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},L_{i,k})+\omega)]$ . Note that the variable

[TABLE]

is a sum of many independent Bernoulli variables $\mathbbm{I}(Q_{j}=h,T_{j}=T_{i})$ each with mean $(\bm{\lambda}_{j}^{\rm T}\bm{\lambda}_{i})\rho_{j,h}$ and variance $(\bm{\lambda}_{j}^{\rm T}\bm{\lambda}_{i})\rho_{j,h}\big{(}1-(\bm{\lambda}_{j}^{\rm T}\bm{\lambda}_{i})\rho_{j,h}\big{)}$ . So $\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},L_{i,k})$ can be seen as a Gaussian variable, its mean and variance are:

[TABLE]

The function $\log(\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},L_{i,k})+\omega)$ is approximated by its second-order Taylor expansion about $\mathbbm{E}_{q^{\lnot i}}[\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},L_{i,k})]$ .

[TABLE]

Using equation (20), we have

[TABLE]

$\mathcal{N}_{q}^{\lnot i}(h)=\sum_{j\neq i}\mathbbm{I}(Q_{j}=h)$ is a sum of many independent Bernoulli variables $\mathbbm{I}(Q_{j}=h)$ each with mean $\rho_{j,h}$ and variance $\rho_{j,h}(1-\rho_{j,h})$ . $\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},\cdot)=\sum_{j\neq i,L_{j,k}\neq None}\mathbbm{I}(Q_{j}=h,T_{j}=T_{i})$ is a sum of many independent Bernoulli variables $\mathbbm{I}(Q_{j}=h,T_{j}=T_{i})$ . The Gaussian approximations applied to $\mathbbm{E}_{q(\bm{T},\bm{Q}^{\lnot i})}[\log(\mathcal{N}_{q}^{\lnot i}(h)+\gamma_{\beta})]$ and $\mathbbm{E}_{q(\bm{T},\bm{Q}^{\lnot i})}[\log(\mathcal{N}_{l}^{\lnot i}(k,h,T_{i},\cdot)+C\omega)]$ are similarly computed as above. According to (19) and (21) we have the update equation:

[TABLE]

where

[TABLE]

For completeness, we also show the update equation of $\lambda_{i,t}$ . We abbreviate $q(\bm{T}^{\lnot i},\bm{Q})$ as $\hat{q}^{\lnot i}$ .

[TABLE]

where

[TABLE]

4.4. Parameter Initialization

In many crowdsourcing methods, such as BCC (kim2012bayesian, ), DS-EM (dawid1979maximum, ), CrowdSVM (tian2015max, ), and DiagCov (nguyen2016correlated, ), the unknown true labels are initialized by majority voting to avoid bad local optima.

In our models, we have introduced the concept of item difficulty. Both parameters $T$ and $Q$ need to be initialized. We manage to design a low overhead method to initialize them. Note that instead of making a precise prediction, we just make a preliminarily prediction of true labels $T$ and difficulty levels $Q$ to avoid bad local optima in the non-convex optimization.

In actually, the ground truth labels are unknown. So we use majority voting to initialize $\bm{T}$ , and approximate $\bm{T}$ as a collection of the ground truth labels. Based on $\bm{T}$ , we calculate the correct rate $R_{k}$ for each worker $k$ :

[TABLE]

Then, let $\lambda_{k}=xR_{k}$ be the ability of worker $k$ , where $x$ is a constant. We assume that:

[TABLE]

where $1/\epsilon_{i}\in[0,+\infty)$ is the difficulty of item $i$ .

We see that as the item difficulty $1/\epsilon_{i}$ increases, the probability of labeling the item correctly decreases toward $1/C$ . That means for the most difficult item the worker just arbitrarily chooses a label. When $1/\epsilon_{i}$ decreases toward [math], the probability of labeling the item correctly increases toward $1$ . Using (24), we have the likelihood of the observed labels:

[TABLE]

where $\bm{\lambda}=\{\lambda_{1},...,\lambda_{K}\}$ , $\bm{\epsilon}=\{\epsilon_{1},...,\epsilon_{I}\}$ . We use gradient ascent to locally maximize the log-likelihood $F(\bm{\epsilon})=\ln p(\bm{L}|\bm{T},\bm{\lambda},\bm{\epsilon})$ and get the corresponding value of $\bm{\epsilon}$ . Further, according to $\bm{\epsilon}$ , $\bm{Q}$ is initialized by dividing the item into $H$ groups which belong to different difficulty levels.

5. Experiments

IDBLA and Fixed-IDBLA are compared with three state-of-the-art algorithms: majority voting, DS-EM and BCC(kim2012bayesian, ). We use three real crowdsourcing datasets and one synthetic dataset in our experiments. In the following, we will introduce our experiments in detail, and evaluate the effectiveness of our models.

5.1. Datasets

The four datasets are shown in Table 1. They have different sizes and features. We introduce them in the following sections.

5.1.1. Heart Disease Diagnosis

We got the heart disease instances (Heart1988, ) and the corresponding ground truth labels from the UC Irvine machine learning repository website. Each instance include attributes such as age, sex, maximum heart rate achieved, etc. We removed the ground truth labels and requested $12$ medical students to label the instances in their spare time. These students volunteer to offer the labels without pay. They have different expertise levels about heart disease diagnosis. For each instance we don’t care about the type of heart disease, we only consider whether the patient has heart disease. Each student doesn’t have to label all the instances. The number of the instances for heart disease is $87$ and there are other $150$ instances for health. There are $237$ instances in total. We finally got $952$ labels. The average accuracy of the students is $68.59\%$ . The student who labeled the most labeled $213$ instances with accuracy of $87.79\%$ . Each student labeled at least 43 instances. For the sake of simplicity, in the following we refer to this data set as Heartdisease.

5.1.2. Web Search Relevance Judgment

The Web Search (zhou2012learning, ) dataset is about web search relevance judgment. In the original dataset, workers are asked to rate a set of 2665 query-URL pairs on a relevance rating scale from 1 to 5. It is difficult to evaluate the worker who only labeled few items. Therefore, we removed the workers who labeled less than 30 items. We also removed the items that haven’t ground truth labels. Then we got $2653$ items and $14638$ labels which are offered by $76$ workers. The average accuracy of these workers is $41.08\%$ . The accuracy of the best worker is $76.73\%$ and this worker generated 1225 labels. All the labels are offered by workers in MTurk website. The accuracies of the workers are shown in Figure 1, each bar represents the accuracy of one worker.

5.1.3. RTE Dataset

In this dataset, the tasks are about recognizing textual entailment. RTE (snow2008cheap, ) contains $8000$ binary labels for 800 documents. The numbers of the documents for each class are all equal to $400$ . A total of $164$ labelers participated in the labeling task. The average accuracy of the labelers is $83.70\%$ . The labeler who labeled the most generated 800 labels and had a accuracy of $50.63\%$ .

5.1.4. Synthetic Dataset

In addition to the above three real data sets, we also made a Synthetic dataset. We generated 1000 items and 100 workers. The true label of each item is sampled from $\{1,2,3,4,5\}$ with probabilities $\{0.18,0.27,0.45,0.05,0.05\}$ , the classes are imbalanced. In order to simulate the real situation, each worker $k$ labels a item with probability $\rho_{k}$ . The maximum value in $\bm{\rho}=\{\rho_{1},...,\rho_{100}\}$ is $0.74$ , most elements in $\bm{\rho}$ get values in the interval $[0.03,0.2]$ . The average accuracy of the workers is $41.92\%$ . The best worker has a accuracy of $83.49\%$ . Finally, we got a total of $11615$ labels.

5.2. Setups

Majority voting and DS-EM are implemented according to the introduction in Section 3. In a specific task, if there are multiple most frequent classes for the item, the majority voting algorithm will randomly choose one among them.

In order to avoid bad local optima, DS-EM and BCC use majority voting to initialize the unknown true labels. For BCC, the hyperparameters are set as described in Kim’s paper(kim2012bayesian, ). The model parameters are also initialized according to the paper’s description.

For IDBLA model, $\omega$ , $\gamma_{\alpha}$ and $\gamma_{\beta}$ are all set as $1.0$ . Which means the distributions of $\bm{\pi}_{t}^{(k,h)}$ , $\bm{\alpha}$ and $\bm{\beta}$ have uninformed priors. $\bm{T}$ and $\bm{Q}$ are are initialized as described in Section 4.4. $\bm{\alpha}$ is initialized by the result of counting $\bm{T}$ , $\alpha_{c}=\frac{\sum_{i=1}^{I}\mathbbm{I}(T_{i}=c)}{I}.$ $\bm{\beta}$ is initialized by the result of counting $\bm{Q}$ , $\beta_{h}=\frac{\sum_{i=1}^{I}\mathbbm{I}(Q_{i}=h)}{I}.$ According to $\bm{T}$ , $\bm{Q}$ and the known $\bm{L}$ , $\pi_{t,c}^{(k,h)}$ can be initialized by:

[TABLE]

For Fixed-IDBLA model, $\nu$ is set as $0.1$ , $\delta$ is set as $0.8$ . The other model parameters and hyperparameters are set the same as in IDBLA.

$\bm{H}$ ** selection**:For IDBLA and Fixed-IDBLA models, the number of difficulty levels $H$ should be determined. Given a value $\hat{H}$ and a dataset, the corresponding likelihood $p(\bm{L}|\hat{\bm{\pi}},\hat{\bm{T}},\hat{\bm{Q}})$ could be computed. So, for each dataset, we can try several values of $H$ and select the value which generates the maximum likelihood. Note that the likelihood $p(\bm{L}|\hat{\bm{\pi}},\hat{\bm{T}},\hat{\bm{Q}})$ and the accuracy are not directly related.

5.3. Experimental Results

We used a PC with Intel Core i5 2.6GHz CPU and 8GB RAM for our experiments. In order to guarantee the fairness of the experiments. We use uninformed priors in all methods. We use majority voting to initialize parameters in all methods. Every method have the same input data format.

5.3.1. Error-rates of Methods

Firstly, we use error rate to evaluate the performance of each method. For BCC, IDBLA and Fixed-IDBLA, we sampled 500 samples in each run. On each dataset, we executed every method independently 10 times and averaged the error rates.

The error rates of the methods are shown in Table 2. Compared with Figure 1, we can see that these label aggregation methods generate results obviously better than single workers. The best result for each dataset is highlighted in bold. In our experiments, IDBLA consistently outperforms majority voting, DS-EM and BCC across all datasets. On the Heartdisease dataset, Fixed-IDBLA has the best performance. IDBLA achieves a lower error rate than Fixed-IDBLA on the Web Search dataset and RTE dataset. On the Synthetic dataset, IDBLA and Fixed-IDBLA have the same error rate. DS-EM, BCC, IDBLA and Fixed-IDBLA are substantially better than majority voting. They are methods based on confusion matrix. The experimental results show that our methods outperform the baselines.

5.3.2. Collapsed Variational Inference Iteration Process

In Section 4.3, we derive a collapsed variational inference algorithm for the IDBLA model. Now, we show the iteration processes of the algorithm in Figure 2. We can see that the algorithm converges after only a few iterations. Note that the collapsed variation inference attains quite similar results with Gibbs sampling in our experiments.

5.3.3. Negative Log Likelihood

Then, we use negative log likelihood (NLL) to evaluate the methods. It provides a more comprehensive measure. NLL can simultaneously evaluate the true labels and confusion matrices found by the model. The likelihoods of IDBLA and Fixed-IDBLA are computed with:

[TABLE]

The likelihoods of DS-EM and BCC are computed with:

[TABLE]

where $\bm{\phi}^{(k)}$ is the confusion matrix for worker $k$ . The NLLs of DS-EM, BCC and our methods on the Heartdisease dataset are shown in Figure 3(a). Figure 3(b) shows the NLLs of the methods on the Synthetic dataset. The NLLs of the four methods are close to each other which means that our methods can also find high quality confusion matrices to evaluate the reliabilities and potential biases of workers.

5.3.4. Model Analysis

Finally, we further investigate the effectiveness of the parameter initialization and the quality of the difficulty level prediction of our models.

Parameter Initialization Effectiveness: We introduce the parameter initialization method in Section 4.4. In order to prove the effectiveness of the initialization, we remove it from our methods and then conduct the experiments again. For simplicity, we show the results on the Heartdisease dataset in Table 3. For IDBLA model, the error rate increases from $0.160$ to $0.409$ . For Fixed-IDBLA model, the error rate increases from $0.156$ to $0.430$ .

Difficulty Level Prediction Quality: Compare to the ground truth labels, we can get the labeling error rate $E_{i}$ of a item $i$ .

[TABLE]

where $TRUTH_{i}$ is the ground truth labels of item $i$ . Average the labeling error rates of items in the same difficulty level, we can get the average labeling error rate of the difficulty level. We illustrate the results on the Heartdisease dataset. For IDBLA model, the average labeling error rate of the predicted easiest difficulty level is 0.280 and the average labeling error rate of the predicted hardest difficulty level is 0.523. For Fixed-IDBLA model, the average labeling error rate of the predicted easiest difficulty level is 0.213, the average labeling error rate of the predicted hardest difficulty level is 0.694. All the prediction results are consistent with the truth. That means the prediction of item difficulty is effective.

6. Conclusions

We propose the IDBLA model to aggregate labels collected from non-professional workers. In this model, the item difficulties are taken into consideration. Each worker-difficulty level pair is associated with a confusion matrix. The model finds the values of the confusion matrices and other latent variables in the learning process, and further uses them to infer the true labels. We derive a collapsed variational inference algorithm for the IDBLA model. The algorithm converges within only a few iterations and attains good solutions. We define a variation of the IDBLA model which assumes that there exists some very easy items and some very difficult items. We also design a method to preliminarily predict the true label and difficulty of each item. The prediction results are used to initialize the latent parameters. The experiments are conducted on three real datasets and one synthetic dataset. The empirical results show that our methods are effective.

Appendix A Appendix

Actually, we derive the equations (17) and (18) according to the variational Bayes algorithm (bishop2006pattern, ). To be self-contained we show the derivation of equation (18). Note that the derivation of equation (17) is similar. According to equation (13) we have:

[TABLE]

where $\int\log p(\bm{T},\bm{Q},\bm{L}|\bm{\mu})\big{(}\prod_{j=1}^{I}q(T_{j}|\bm{\lambda}_{j})d_{T_{j}}\big{)}\big{(}\prod_{j\neq i}q(Q_{j}|\bm{\rho}_{j})d_{Q_{j}}\big{)}=\\ \mathbbm{E}_{q(\bm{T},\bm{Q}^{\lnot i})}[\log p(\bm{T},\bm{Q},\bm{L}|\bm{\mu})]$ .

Defining a new distribution $\tilde{p}(\bm{L},Q_{i})$ by the relation $\log\tilde{p}(\bm{L},Q_{i})=\mathbbm{E}_{q(\bm{T},\bm{Q}^{\lnot i})}[\log p(\bm{T},\bm{Q},\bm{L}|\bm{\mu})]+const$ . Then

[TABLE]

We maximize $\mathcal{L}(q(\bm{T},\bm{Q}))$ with respect to $q(Q_{i}|\bm{\rho}_{i})$ . The maximum is achieved at $q(Q_{i}|\bm{\rho}_{i})=\tilde{p}(\bm{L},Q_{i})$ , then we have:

[TABLE]

where $rest$ means $\bm{T},\bm{Q}^{\lnot i},\bm{L}$ and $\bm{\mu}$ . Plugging in $Q_{i}=h$ , we have equation (18).

Acknowledgements.

Bibliography42

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1) S. Ertekin, H. Hirsh, and C. Rudin, “Approximating the wisdom of the crowd,” 2011.
2(2) W. Bi, L. Wang, J. T. Kwok, and Z. Tu, “Learning to predict from crowdsourced data.” in UAI, 2014, pp. 82–91.
3(3) E. Kamar, S. Hacker, and E. Horvitz, “Combining human and machine intelligence in large-scale crowdsourcing,” in Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 2012, pp. 467–474.
4(4) P. Welinder, S. Branson, S. J. Belongie, and P. Perona, “The multidimensional wisdom of crowds.” in NIPS, vol. 23, 2010, pp. 2424–2432.
5(5) J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo, “Whose vote should count more: Optimal integration of labels from labelers of unknown expertise,” in Advances in neural information processing systems, 2009, pp. 2035–2043.
6(6) D. Zhou, S. Basu, Y. Mao, and J. C. Platt, “Learning from the wisdom of crowds by minimax entropy,” in Advances in Neural Information Processing Systems, 2012, pp. 2195–2203.
7(7) R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng, “Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks,” in Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2008, pp. 254–263.
8(8) A. P. Dawid and A. M. Skene, “Maximum likelihood estimation of observer error-rates using the em algorithm,” Applied statistics, pp. 20–28, 1979.