Quality assessment in crowdsourced classification tasks

Qiong Bu (School of Electronics and Computer Science, University of Southampton, Southampton, UK)
Elena Simperl (School of Electronics and Computer Science, University of Southampton, Southampton, UK)
Adriane Chapman (School of Electronics and Computer Science, University of Southampton, Southampton, UK)
Eddy Maddalena (School of Electronics and Computer Science, University of Southampton, Southampton, UK)

International Journal of Crowd Science

ISSN: 2398-7294

Article publication date: 17 October 2019

Issue publication date: 9 December 2019


Abstract

Purpose

Ensuring quality is one of the most significant challenges in microtask crowdsourcing. Aggregating the data collected from the crowd is an important step in inferring the correct answer, but existing studies seem limited to single-step tasks. This study aims to look at multiple-step classification tasks and understand aggregation in such cases; it is therefore useful for assessing classification quality.

Design/methodology/approach

The authors present a model to capture the information of the workflow, questions and answers for both single- and multiple-question classification tasks. They propose an adapted approach on top of the classic algorithms so that the model can handle tasks with several multiple-choice questions in general, rather than a specific domain or a specific hierarchical classification. They evaluate their approach with three representative tasks from existing citizen science projects for which a gold standard created by experts is available.

Findings

The results show that the approach can provide significant improvements to the overall classification accuracy. The authors’ analysis also demonstrates that all algorithms can achieve higher accuracy for the volunteer- versus paid-generated data sets for the same task. Furthermore, the authors observed interesting patterns in the relationship between the performance of different algorithms and workflow-specific factors including the number of steps and the number of available options in each step.

Originality/value

Due to the nature of crowdsourcing, aggregating the collected data is an important process to understand the quality of crowdsourcing results. Different inference algorithms have been studied for simple microtasks consisting of single questions with two or more answers. However, as classification tasks typically contain many questions, the proposed method can be applied to a wide range of tasks including both single- and multiple-question classification tasks.

Citation

Bu, Q., Simperl, E., Chapman, A. and Maddalena, E. (2019), "Quality assessment in crowdsourced classification tasks", International Journal of Crowd Science, Vol. 3 No. 3, pp. 222-248. https://doi.org/10.1108/IJCS-06-2019-0017

Publisher

Emerald Publishing Limited

Copyright © 2019, Qiong Bu, Elena Simperl, Adriane Chapman and Eddy Maddalena.

License

Published in International Journal of Crowd Science. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode


1. Introduction

Microtask crowdsourcing has attracted interest from researchers, businesses and government as a means to leverage human computation in their activities in a fast, accurate and affordable way. In the last ten years, we have seen it applied to anything from spotting sarcasm on social media to discovering new galaxies and helping digitise large cultural heritage collections. The underlying model is relatively straightforward: a problem is decomposed into smaller chunks that can be tackled independently by several people. Their individual outputs are then compared and consolidated into a final solution (Shahaf and Horvitz, 2010). However, none of these steps is actually easy: some problems are less amenable to microtasking and need to be turned into bespoke microtask workflows (Bernstein et al., 2010; Kulkarni et al., 2011; Kittur et al., 2011); the performance of the crowd varies across tasks (Mao et al., 2013; Redi and Povoa, 2014); and determining which answers are the most useful ones can be both complex and computationally expensive (Kittur et al., 2008; Snow et al., 2008; Vickrey et al., 2008; Demartini et al., 2012; Wiggins et al., 2011). It is on this last aspect, determining the correct answers, that we focus in this paper. The aggregation method proposed in this paper can infer the correct answer for a range of tasks involving either single-step or multiple-step classifications when gold answers are not available. It also serves as a proxy to help task requesters assess the quality of crowdsourced results when they already have some gold answers, for example when piloting a specific multiple-step task design before putting it online at a larger scale.

Quality assessment in microtask crowdsourcing refers to the evaluation of the quality of the workers' output. First, quality can be assessed against different criteria, as it has many dimensions (Kahn et al., 2002; Batini et al., 2009). In the crowdsourcing context, it depends on the type of data, which in turn is determined by the task type (Malone et al., 2010; Gadiraju et al., 2014, 2015). The most common quality metric we have seen is accuracy (Bernstein et al., 2010; Gelas et al., 2011; Hung et al., 2013; Zhang et al., 2017a, 2017b), calculated against available gold standards. However, in many cases the gold standard is not available. This is where different inference algorithms come into the picture, as they help infer or predict the correct (gold) answer. Second, quality assessment can be done either on the fly (Ipeirotis et al., 2014) while the task is running, which can be used to optimise task assignment and hence reduce cost, or as post-hoc aggregation (Whitehill et al., 2009; Ipeirotis et al., 2010; Bachrach et al., 2012; Difallah et al., 2015a) to assess the overall quality of the classification. This work focuses on aggregating the results after the crowdsourcing task has been completed, so that accuracy can be calculated based on the gold standards we have.

There are many different types of tasks to which microtask crowdsourcing is applied (Eickhoff and de Vries, 2011; Difallah et al., 2015b; Yang et al., 2016; Zheng et al., 2017a). We focus on inferring the correct answer for classification tasks, one of the most popular types of crowdsourcing task. We are by no means the first to do so; previous research has proposed a range of methods to infer and predict the quality of crowd answers (Bachrach et al., 2012; Dawid and Skene, 1979; Difallah et al., 2015a; Hare et al., 2013; Ipeirotis et al., 2010; Karger et al., 2011; Loni et al., 2014; Paulheim and Bizer, 2014; Hung et al., 2013; Rosenthal and Dey, 2010; Simpson et al., 2013; Whitehill et al., 2009). Whilst all these methods have their benefits, they work on relatively simple task models that consist of single questions with one or more answers (Sheshadri and Lease, 2013; Hung et al., 2013; Zhang et al., 2017a; Zheng et al., 2017b). The scenario we are targeting is different. We take a close look at existing classification tasks from Zooniverse and notice that a large percentage of these tasks are multiple-step tasks, as shown in Figure 1. In fact, in a random sample of 20 tasks, only 20 per cent have a single question. Consider the example in Figure 2, which is taken from a citizen science project in which pictures taken in the Serengeti national park in Tanzania are analysed online by thousands of volunteers[1]. The crowd is asked to answer a series of related, independent questions about what they see in the image, including the types and number of animals.

Our work is motivated by a range of online crowd science classification projects. Each of them uses a slightly different type of task to classify an object, for example an image, according to a number of criteria. A relatively complex task is split into several steps, typically each in the form of a multiple-choice question. Sometimes there are dependencies between steps, as the answer chosen for one question prompts other questions to be displayed. For instance, in the Cities at Night project, which uses microtask crowdsourcing to analyse night-time photographs taken by astronauts onboard the ISS[2], seven different options are provided for the first question to identify what the given image contains (a city, stars, aurora, astronaut, black image, no photo or none of these), and only when "city" is identified are two more independent questions asked to classify cloudiness (three options: cloudy, some clouds, clear) and sharpness (two options: sharp, blurry). In the GalaxyZoo[3] project, several different questions are asked in sequence depending on the answers to previous questions, and questions and answers are arranged in a decision tree. It has a more complex workflow in which more questions are involved, and the questions vary based on what has been chosen in the previous classification step. For instance, the first question is "Is the galaxy simply smooth and rounded, with no sign of a disk?" and three options are provided: "Smooth", "Features or disk" and "Star or artifact". When choosing "Smooth", a new question is asked, "How rounded is it?", with the options "Completely round", "In between" and "Cigar shaped". If "Features or disk" is chosen as the answer to the first question, a different set of subsequent questions is asked. Other times, workflows are rather sequences of independent, though related, questions, such as what we see in Snapshot Serengeti[1] (Figure 2). Determining the correct answer for such complex classification tasks can be tricky and has not been fully studied yet. Existing research also does not investigate how inference methods could affect classification accuracy when different crowd types are used for complex classification tasks. As a result, there is a need to understand whether different algorithms and aggregation strategies are required for different crowd contexts.

To tackle the issue of determining the correct answer from crowd-produced annotations for classification tasks with multiple questions, we model the problem of complex classification tasks that span multiple, related questions as a graph. To the best of our knowledge, we are the first to propose using the structure of a microtask crowdsourcing workflow as an additional feature to support inference algorithms in making decisions about correct labels, using output data produced by the crowd. We look at three inference algorithms (majority voting [MV] [Paulheim and Bizer, 2014; Hung et al., 2013], message passing [MP] [Karger et al., 2011] and expectation maximisation [EM] [Dawid and Skene, 1979; Whitehill et al., 2009]), which have commonly been used for answer inference in microtask crowdsourcing. We adapt these algorithms to work on the graph modelled from crowdsourcing tasks with multiple steps. We perform a large-scale evaluation of the performance of these algorithms on six data sets across two crowd contexts from three image classification tasks: Darkskies[2], GalaxyZoo[3] and Snapshot Serengeti[1]. The rationale behind choosing data sets from both the volunteer and the paid crowd context is that algorithms may perform differently in these contexts. The experiments show that our aggregation strategy achieves significantly better performance than the current approach of naively applying individual algorithms at each node level. The results also indicate that MV, despite its simplicity, compares well with more sophisticated approaches that consider additional factors such as user performance and hence need more computation time. Sophisticated algorithms such as expectation maximisation, however, can complement MV for relatively complex tasks. We also show that each algorithm obtains better inference accuracy in the volunteer context compared to the paid crowdsourcing context.

The rest of this paper is structured as follows: Section 2 provides the foundations of the existing algorithms that we have adapted to handle answer inference in classification tasks with multiple questions, and illustrates how this aggregation fits into the quality assessment process. In Section 3, we explain our graph model and the notations used in the graph, formalise the classification problem and elaborate on our aggregation approach. In Section 4, we perform a large-scale evaluation and demonstrate the performance of the different algorithms. Section 5 discusses our findings. Section 6 reviews existing work that has inspired our research, and Section 7 summarises our results and future work.

2. Foundations

A classification task generally has one single question and a few options to choose from, such as the one shown in Figure 3. It resembles a simple tree structure in which the classification starts with a root node, which refers to the object to be classified, with a few branches which represent the available options. In this section, we present three existing algorithms, MV, MP and EM, that have been used to infer the true label for a single-step multiple-choice classification task. These are the foundations for understanding our proposed adapted approach. The notations used in describing these algorithms are defined in Table I and are used throughout this paper.

2.1 Majority voting

Due to its simplicity, MV has been used in many microtask projects (Hung et al., 2015; Liu et al., 2012) and is the standard aggregation method in some existing crowdsourcing platforms[4]. Given the list of options for a labelling task and an object, the MV algorithm chooses the option with the highest number of votes from the crowd. Formally, it takes as input an object o and the crowd labels L_o, and outputs the candidate label l̃_o that received the most votes from the users.

Algorithm 1 MV

1: procedure findUniqueLabel(L_o)
2:   L_unique ← {l_o^u}, where L_unique ⊆ A, l_o^u ∈ A and u ∈ U_o;
3:   l̃_o ← "";
4:   num_max ← 0;
5:   for i ∈ {1, …, |L_unique|} do
6:     if count(l_unique(i)) ≥ num_max then
7:       num_max ← count(l_unique(i));
8:       l̃_o ← l_unique(i);
9:   return l̃_o;
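For illustration, the following is a minimal Python sketch of the majority-voting step described above; the function name, the input format and the animal labels in the example are our own and not taken from any project implementation.

```python
from collections import Counter

def majority_vote(labels_for_object):
    """Return the label with the most votes for a single object.

    `labels_for_object` is a list of labels submitted by different users,
    e.g. ["zebra", "zebra", "wildebeest"]. Ties are broken arbitrarily,
    mirroring the >= comparison in Algorithm 1.
    """
    if not labels_for_object:
        return None
    counts = Counter(labels_for_object)
    label, _ = counts.most_common(1)[0]
    return label

# Example: three workers label the same photo
print(majority_vote(["zebra", "zebra", "wildebeest"]))  # -> "zebra"
```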

2.2 Expectation maximisation

EM is another widely used algorithm; it involves two alternating steps to infer the true label for a given object. In the first step, the true label for the current object is estimated using simple MV, where the input of all users is weighted equally. In the next step, the error rate of each user is estimated based on this result and is used in turn to update the estimation of the first step. The two steps alternate iteratively until the algorithm converges and a maximum is found. The algorithm takes as input an object o and all labels L. It starts by estimating the true label for each object and each user's error rate by comparing their answers (using an indicator function I(·) to check whether the user classifies an object into a certain category/class) over all objects they have looked at. The error rate is subsequently used to update the confusion matrix for each user. The output is the set of candidate labels for o, together with the probability (denoted p) that the corresponding candidate label is correct.

Algorithm 2 EM

1: procedure Initialise(p_l)
2:   p_l ← count(l) ÷ |L_o|   ⊳ probability of l being the true label for object o (l ∈ A);
3:   while not converged do
4:     Estimate error rate for user u:
5:       θ_ll′^u ← λ_ll′^u + Σ_{o ∈ O} p_lo × I(l_o^u = l′)
6:     Estimate confusion matrix:
7:       e_ll′^u ← θ_ll′^u ÷ Σ_q θ_lq^u   ⊳ e^u is the accuracy of user u
8:     Estimate class priors:
9:       pr_l ← Σ_{o ∈ O} p_lo ÷ |O|
10:    Calculate class probability for object o:
11:      p_l ← pr_l Π_{u ∈ U_o} Π_m (e_lm^u)^I(l_o^u = m) ÷ Σ_q pr_q Π_{u ∈ U_o} Π_m (e_qm^u)^I(l_o^u = m)
12:   l̃_o ← "";
13:   p_max ← 0;
14:   for l ∈ A do
15:     if p_l ≥ p_max then
16:       p_max ← p_l;
17:       l̃_o ← l;
18:   return l̃_o;
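The EM loop above follows the spirit of Dawid and Skene (1979). Below is a compact Python sketch of such a loop; the smoothing constant, the fixed iteration count standing in for a convergence test and all variable names are our own simplifications rather than the authors' implementation.

```python
import numpy as np

def dawid_skene(labels, n_objects, n_users, n_classes, iters=50, smooth=1e-2):
    """labels: list of (object_id, user_id, class_id) tuples.

    Returns an (n_objects, n_classes) matrix of class probabilities.
    Simplified sketch: initialisation uses per-object vote fractions
    (majority voting) and assumes every object has at least one label.
    """
    # Initialisation: vote fractions per object
    p = np.zeros((n_objects, n_classes))
    for o, u, c in labels:
        p[o, c] += 1
    p /= p.sum(axis=1, keepdims=True)

    for _ in range(iters):
        # M-step: per-user confusion matrices and class priors
        theta = np.full((n_users, n_classes, n_classes), smooth)
        for o, u, c in labels:
            theta[u, :, c] += p[o]                      # soft error-rate counts
        conf = theta / theta.sum(axis=2, keepdims=True)  # normalise rows
        prior = (p.sum(axis=0) + smooth) / (n_objects + n_classes * smooth)

        # E-step: recompute class probabilities for each object
        logp = np.tile(np.log(prior), (n_objects, 1))
        for o, u, c in labels:
            logp[o] += np.log(conf[u, :, c])
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
    return p
```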

2.3 Message passing

MP is an algorithm that takes into account both the labels and the performance of the users. MP constructs object- and user-specific messages to represent the reliability of a particular user, and iteratively updates the object and user messages. More specifically, at each object update it gives more weight to labels that come from more trustworthy parts of the crowd, and at each user update it adds more trust (a confidence value) to a user if the labels they give for other objects are in line with the current estimates of the object labels. The iterative updates continue until the algorithm converges or a specified threshold is hit. The threshold for the stopping condition is a parameter that has to be determined empirically. MP takes as input an object o, a label a ∈ A, all labels received from the crowd L and a threshold k_max. It computes the object message by first iterating over all previous labels from the users who have been assigned the object o and then checking whether each label is the same as the given one. In a next step, it uses the object message x_o→u to update the user message y_u→o, which is computed by iterating over the labels that user has submitted. Upon convergence, the object message for object o is aggregated by weighing the user messages (confidence) for that object with the answers stored in E_ou, and the sign of the result is computed. MP outputs the candidate label l for o and the sign indicating whether the label applies or not. A detailed description of the algorithm can be found in Karger et al.'s (2011) study. Whilst providing accurate estimations, MP is also known for its high computational cost as the number of labels and users increases.

Algorithm 3 MP

1: procedure Initialisation(y_u→o)
2:   for (o, u) ∈ L do
3:     Initialise y_u→o ∼ N(1, 1);
4: procedure Iteration(k_max)
5:   for k ∈ {1, …, k_max} do
6:     for (o, u) ∈ L do
7:       x_o→u^k ← Σ_{u′ ∈ U_o, u′ ≠ u} E_ou′ × y_u′→o^(k−1);
8:     for (o, u) ∈ L do
9:       y_u→o^k ← Σ_{o′ ∈ O, o′ ≠ o} E_o′u × x_o′→u^k;
10:  x_o ← Σ_{u ∈ U_o} E_ou × y_u→o^(k_max−1);
11:  if sign(x_o) == 1 then
12:    l̃_o ← x_o;
13: return l̃_o;
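A minimal Python sketch of this kind of iterative message passing, in the spirit of Karger et al. (2011) for binary (+1/−1) answers, is shown below; the data structures, the Gaussian initialisation and the fixed iteration count are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def message_passing(E, k_max=10, seed=0):
    """E: dict mapping (object, user) -> +1/-1 answer sign.

    Returns an estimated sign per object by alternating object and user
    message updates. Initialisation with N(1, 1) draws and a fixed number
    of iterations are simplifications of the original algorithm.
    """
    rng = random.Random(seed)
    users_of, objects_of = defaultdict(list), defaultdict(list)
    for (o, u) in E:
        users_of[o].append(u)
        objects_of[u].append(o)

    y = {(u, o): rng.gauss(1, 1) for (o, u) in E}   # user -> object messages
    x = {}                                          # object -> user messages
    for _ in range(k_max):
        for (o, u) in E:                            # object update (uses previous y)
            x[(o, u)] = sum(E[(o, v)] * y[(v, o)] for v in users_of[o] if v != u)
        for (o, u) in E:                            # user update (uses fresh x)
            y[(u, o)] = sum(E[(q, u)] * x[(q, u)] for q in objects_of[u] if q != o)

    # Final decision per object: sign of the weighted sum of user messages
    return {o: 1 if sum(E[(o, u)] * y[(u, o)] for u in users_of[o]) >= 0 else -1
            for o in users_of}
```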

2.4 Quality assessment

In the microtask crowdsourcing context, achieving a good-quality result is one of the major goals, and when we talk about quality, we generally mean the quality of the data collected from the crowd. For classification microtasks, existing work on quality assessment mostly uses the accuracy metric (Khattak and Salleb-Aouissi, 2011; Hung et al., 2013; Zhang et al., 2017a). Some research also uses precision/recall (Hung et al., 2015; Zhang et al., 2017) or the F1 score (Zheng et al., 2017a), while other work uses ROC (Zheng et al., 2017b) or RMSE (Bachrach et al., 2012). For classification, the quality of the result refers to how good the overall collected classifications are, which is a data-value-centric dimension reflecting how accurate the classifications are. In this work, unless otherwise specified, when referring to the quality of the input/answer/data/result, we mean accuracy: "The degree to which data values correctly represent the real-world facts" (Zaveri et al., 2013), or, as defined in metrology (JCGM, 2008), the "closeness of agreement between a measured quantity value and a true quantity value of a measurand". We can look at an individual crowd worker's contributions to evaluate whether their work is of good quality, or we can look at the overall result from all the workers to see how accurately they classify the given objects. The latter, which involves aggregating the input from different crowd workers in a multiple-step classification task, is the focus of this paper.

In the crowdsourcing context, the ground truth is not usually available. To assess the quality of the result, we need to understand what algorithms or mechanisms can be used to infer or predict the correct answer based on all the input from the crowd workers. Correspondingly, the different existing algorithms have been studied by researchers and their performance evaluated in various contexts (Section 6.2). This work mainly looks at the three popular existing algorithms described above and investigates how an adaptation of these algorithms can be used to aggregate the crowdsourced data and help assess the quality of the classification result. The whole process, in a nutshell, includes three major phases: data collection (microtask design and task execution) from the crowd, which is already available to this study; aggregation to infer the correct answer/label; and evaluation of the quality (in this work, the accuracy metric) by comparing the inferred result with the gold standards we have. This research focuses on the aggregation and evaluates the accuracy accordingly.

3. Our approach

In this section, we first illustrate the range of classification tasks we address via a set of examples: classification tasks with a single question and with multiple questions. We then introduce a set of notations and formalise the classification problem as a path-searching problem in a graph. Following that, we present our aggregation method by illustrating how existing, established algorithms can be adapted to handle more complex cases.

3.1 Multi-level workflow model and problem formalisation

A classification task, as shown in Figure 3, is generally considered a simple task as it contains only one question. A relatively complex task normally involves more than one question and hence more options. It looks more like a tree whose branches have further branches and leaves.

If we draw such a 'tree' for the three tasks we explore in this paper, we can see that each of them uses a different type of workflow consisting of several independent or interdependent steps. Each step in the workflow is associated with a question to classify an object according to a criterion. To answer the question, the crowd needs to choose among a set of options. The workflow in Figure 4 involves a minimum of one and a maximum of three steps for the classification task. The workflow in Figure 5 has a fixed two steps to complete a classification, and each step has more than ten options. The GalaxyZoo[5] task can involve a minimum of one and a maximum of nine steps to complete a classification, as shown in Figure 6. It is notable that these different tasks do present a tree-like structure, each with a number of questions and a varying number of available options; however, there are cases where some nodes have more than one parent node, which means the structure cannot be considered a tree.

As a result, the workflow can be modelled as a directed acyclic graph (DAG), where the root node is the object under consideration and all other nodes are classification options. Each node can be reached via multiple paths from the root, which prompts the first question of the workflow[6]. For a given object o, the crowd is asked to carry out a labelling task, which implies answering a series of (independent or dependent) classification questions with a set of labels which identify the outstanding features of the object being classified. We define this task as a path search problem in a workflow Wf modelled as a DAG with a root entry point and levels (similar to tree levels, representing the number of questions in the task), each corresponding to a set of options, as depicted in Figure 7. Each node in such a graph represents a particular labelling option. The labelling finishes when a leaf in the graph is reached, that is, a label that does not lead to any further questions. In our definition, a level corresponds to classification question(s), and the level of a node is serialised and counted at the lowest level. We use level interchangeably with the depth of a node, which is indicated by the number of edges from the node to the root node. A directed edge represents a label chosen for the question associated with that node level. Table II summarises the definitions we use.
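As a simple illustration of this graph view, the toy sketch below encodes a workflow loosely modelled on the Cities at Night example from Section 1 as an adjacency structure and checks whether a label path follows its edges; the option names and the helper function are ours, chosen only to make the DAG idea concrete.

```python
# A toy workflow graph in the style of Figure 7: the root is the object,
# node levels correspond to questions, edges to choosable options.
# Option names below are illustrative only.
workflow = {
    "root":        ["city", "stars", "black image"],   # level 1 options
    "city":        ["cloudy", "some clouds", "clear"],  # level 2 (only after "city")
    "cloudy":      ["sharp", "blurry"],                 # level 3
    "some clouds": ["sharp", "blurry"],
    "clear":       ["sharp", "blurry"],
}

def is_valid_path(workflow, path):
    """Check that a label path follows the edges of the workflow DAG."""
    current = "root"
    for label in path:
        if label not in workflow.get(current, []):
            return False
        current = label
    return True

print(is_valid_path(workflow, ["city", "clear", "sharp"]))   # True
print(is_valid_path(workflow, ["stars", "clear", "sharp"]))  # False: "stars" is a leaf
```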

On top of the notations we defined in Section 2, we also define the notations which are specific to our workflow graph model in Table III. The problem we are solving in the paper can be defined as follows:

Definition 3.1 The Correct Labelling Problem:

Given a particular object o, a workflow-based graph Wf, a set of labels L_o for object o, and (optionally) a set of previous labels L from all users on all objects, our aim is to infer the correct label path L̃_o in Wf for object o.

3.2 Adapted aggregation

The classic approaches do not consider the dependency between node levels; hence, naively putting the inferred results from each node level together does not guarantee a valid result. Clearly, producing a valid path from the possible choices should improve the accuracy of the inferred answers. As such, a basic adaptation of the classic algorithms should show some improvement over multiple-level workflows. We show such a basic adaptation in Algorithm 4.

Algorithm 4 Our Adapted Approach

1: procedure Predict_By_NodeLevel(L_o)
2:   num_levels ← n;
3:   for level ∈ range(n) do
4:     if method == mv then
5:       procedure findUniqueLabel(L_o)
6:         L_unique ← {l_o^u}, where L_unique ⊆ A, l_o^u ∈ A and u ∈ U_o;
7:         for l ∈ L_unique do
8:           p_l ← count(l) ÷ |L_i|   ⊳ percentage of l being voted as the label for object o;
9:         return LC_n ← {(l, p_l)}   ⊳ list of candidate labels and their percentages for o;
10:    if method == em then
11:      procedure Initialise(p_l)
12:        p_l ← count(l) ÷ |L_o|   ⊳ percentage of l being the true label for object o (l ∈ A);
13:        while not converged do
14:          Estimate error rate for user u:
15:            θ_ll′^u ← λ_ll′^u + Σ_{o ∈ O} p_lo × I(l_o^u = l′)
16:          Estimate confusion matrix:
17:            e_ll′^u ← θ_ll′^u ÷ Σ_q θ_lq^u   ⊳ e^u is the accuracy of user u
18:          Estimate class priors:
19:            pr_l ← Σ_{o ∈ O} p_lo ÷ |O|
20:          Calculate class probability for object o:
21:            p_l ← pr_l Π_{u ∈ U_o} Π_m (e_lm^u)^I(l_o^u = m) ÷ Σ_q pr_q Π_{u ∈ U_o} Π_m (e_qm^u)^I(l_o^u = m)
22:        return LC_n ← {(l, p_l)}   ⊳ list of label candidates and their corresponding probabilities for o;
23:    if method == mp then
24:      procedure Initialisation(y_u→o)
25:        for (o, u) ∈ L do
26:          Initialise y_u→o ∼ N(1, 1);
27:      procedure Iteration(k_max)
28:        for k ∈ {1, …, k_max} do
29:          for (o, u) ∈ L do
30:            x_o→u^k ← Σ_{u′ ∈ U_o, u′ ≠ u} E_ou′ × y_u′→o^(k−1);
31:          for (o, u) ∈ L do
32:            y_u→o^k ← Σ_{o′ ∈ O, o′ ≠ o} E_o′u × x_o′→u^k;
33:        x_o ← Σ_{u ∈ U_o} E_ou × y_u→o^(k_max−1);
34:        if sign(x_o) == 1 then
35:          LC_n.append((x_o, 1.0))
36: procedure Assemble_MostPossiblePath(L_o)
37:   num_levels ← n;
38:   LC ← {};
39:   for z_1 ∈ LC_1 do
40:     for z_2 ∈ LC_2 do
41:       …
42:         for z_n ∈ LC_n do
43:           LC.append((z_1, z_2, …, z_n), (p_z1 × p_z2 × … × p_zn));
44:   L̃_o ← ∅;
45:   p_max ← 0;
46:   for Z ∈ LC do
47:     if p_Z ≥ p_max then
48:       p_max ← p_Z;
49:       L̃_o ← Z;
50:   return L̃_o;

Our adapted approach assumes that labels at different levels in the workflow are independent and then assembles the label path from each node level based on the workflow graph. In the adapted approach, we not only reward partially correct answers from the crowd by applying each of the algorithms at each node level in the graph and computing scores for the individual labels, but also consider only valid paths when inferring the correct path. We also deliberately chose two algorithms that take the performance of the crowd into account in their computations, EM and MP. The EM algorithm sums up the node probabilities along each path to determine the ranking score. The MP algorithm returns true if a particular label at a node level is relevant and false otherwise; this means we assign the score for the corresponding candidate paths as either 1.0 or 0.0. By studying this, we want to allow MP and EM to better identify those users who, while not doing so well overall, are very skilled at a particular sub-task (question) in the workflow.
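To make the assembly step concrete, here is a small Python sketch of how per-level candidate scores (e.g. MV vote fractions or EM probabilities) could be combined into the most probable valid path, mirroring the product-of-scores ranking in Algorithm 4; the function name, input format and the optional valid-path filter are our own illustrative choices.

```python
from itertools import product

def assemble_most_probable_path(level_candidates, valid_paths=None):
    """level_candidates: one dict per node level, mapping label -> score.
    valid_paths: optional set of label tuples permitted by the workflow
    graph; combinations not in it are skipped.

    Scores a path by the product of its per-level scores and returns the
    highest-scoring one, as in Algorithm 4.
    """
    best_path, best_score = None, 0.0
    for combo in product(*(lc.items() for lc in level_candidates)):
        labels = tuple(label for label, _ in combo)
        if valid_paths is not None and labels not in valid_paths:
            continue
        score = 1.0
        for _, s in combo:
            score *= s
        if score >= best_score:
            best_path, best_score = labels, score
    return best_path, best_score

# Example: two levels, workflow allowing only two of the four combinations
candidates = [{"city": 0.7, "stars": 0.3}, {"clear": 0.6, "cloudy": 0.4}]
valid = {("city", "clear"), ("city", "cloudy")}
print(assemble_most_probable_path(candidates, valid))  # ~(('city', 'clear'), 0.42)
```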

4. Evaluation

To evaluate the three algorithms and our adapted approach, we compare the classic approach, in which the algorithms are applied at each node level and the results simply put together (we call it the "naive approach" here), with our "adapted approach", which uses the classic algorithms while striving to infer a valid correct path by considering the workflow graph. Thus, we have six different approaches: mv_adapted, mv_naive, mp_adapted, mp_naive, em_adapted and em_naive. Each inference algorithm was applied to six data sets with different microtask crowdsourcing workflows. We start with the evaluation setup of the data in Section 4.1 and the evaluation metrics in Section 4.2. Then we present the evaluation of the inferred results in Section 4.3.

4.1 Data

First, we used three existing data sets. The first one is from the Snapshot Serengeti[1] project and consists of all crowd classifications in the time span from 10 December 2012 until 17 July 2013. It contains 7,800,896 labels from 890,280 volunteers for a total of 66,892 objects. For our evaluation, we used a gold standard with curated labels for 4,149 objects, which was created by professional scientists working on the Snapshot Serengeti project. To evaluate our approach, we took all labels received from the crowd for these 4,149 objects, which amounts to 112,027 labels submitted by 8,304 volunteers. The second data set is from the Dark Skies app within the Cities at Night[2] project. It consists of 1,275,354 classifications by 19,818 volunteers submitted in the time span from 27 April 2014 until 5 December 2016. The gold standard consists of 200 objects whose labels were manually validated by the science team of Cities at Night. These 200 objects received 1,341 labels from 692 users on CrowdCrafting[7]. The third one is from the GalaxyZoo[3] project, where we randomly chose 500 objects with classifications from 16 February 2009 to 21 May 2009. The workflows for the three data sets are depicted in Figures 4, 5 and 6, respectively. To explore the effect of the volunteer/paid context on the results, the tasks were also set up on a paid crowdsourcing platform to mimic the tasks done by volunteers.

4.2 Metric

To measure the performance of our aggregation approach, we employ the accuracy metric, which has been commonly used in classification evaluation in previous work (Khattak and Salleb-Aouissi, 2011; Kamar et al., 2012; Sheshadri and Lease, 2013; Hung et al., 2013; Zhang et al., 2017a; Zheng et al., 2017b). Accuracy allows us to understand the percentage of correct answers inferred by an algorithm; it is defined as the percentage of objects that have been correctly inferred, and higher accuracy indicates better performance.

Accuracy = Σ_{o ∈ O} Bernoulli(L_gold^o == L̃_o) ÷ |O|

The above equation is used by default to calculate the accuracy of the inferred label path. Bernoulli(L_gold^o == L̃_o) denotes the outcome (either 0 or 1) of comparing the gold category with the category predicted by a given predictor. As we use the adapted node-level based implementation, it also makes sense to evaluate how accurate the inferred label is at each node level. In that context, L_gold^o[n] represents the ground truth for object o at node level n and L̃_o[n] represents the inferred true label at node level n. Hence, the accuracy at node level n for the top answer can be calculated by:

Accuracy_level_n = Σ_{o ∈ O} Bernoulli(L_gold^o[n] == L̃_o[n]) ÷ |O|
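As an illustration, the following Python sketch computes both the path-level and the node-level accuracy defined by the two equations above; the dictionary-based data structures are an assumption made for the example.

```python
def path_accuracy(gold, inferred):
    """gold, inferred: dicts mapping object id -> label path (tuple).
    An object counts as correct only if the whole path matches."""
    return sum(gold[o] == inferred.get(o) for o in gold) / len(gold)

def level_accuracy(gold, inferred, n):
    """Accuracy at node level n (1-indexed): compare only position n
    of the gold and inferred label paths."""
    correct = sum(
        len(gold[o]) >= n
        and len(inferred.get(o, ())) >= n
        and gold[o][n - 1] == inferred[o][n - 1]
        for o in gold
    )
    return correct / len(gold)
```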

To understand whether our adapted approach is significantly better, we also run significance testing for all chosen algorithms, using the standard 5 per cent significance level. For each data set, we randomly select 100 objects and repeat the selection 50 times. The accuracy for each selection is calculated for MV, MP and EM for both the naive and the adapted approaches. We use the function scipy.stats.ttest_ind from Python[8] to perform a two-sided test on the naive and adapted samples in all six cases (three workflows, each with two contexts: volunteer and paid), as sketched below.
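A sketch of how such a test could be run with scipy is shown below; the accuracy values are placeholders, not results from our experiments.

```python
import numpy as np
from scipy import stats

# Accuracy of repeated random samples of 100 objects each, for the naive and
# the adapted variants of one algorithm (placeholder values; 50 per array in practice).
acc_naive = np.array([0.58, 0.61, 0.55, 0.60, 0.57])
acc_adapted = np.array([0.75, 0.78, 0.74, 0.77, 0.76])

t_stat, p_value = stats.ttest_ind(acc_naive, acc_adapted)
print(p_value < 0.05)  # significant at the 5 per cent level if True
```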

4.3 Results

Table IV shows the accuracy of each algorithm on each data set for the inferred answer. Considering the overall classification accuracy (by path), our adapted methods perform better than the naive approach in both the volunteer and the paid crowd context; at the same time, each algorithm generally has higher accuracy for the volunteer context than for the paid crowd. Note that the best accuracy achieved increases as the depth of the workflow increases for the paid crowd context: Serengeti with two questions achieves 45.9 per cent, darkskies with three questions achieves 53.0 per cent and galaxyzoo with a maximum of nine questions achieves 57.9 per cent. A similar pattern is not observed for the volunteer context. Looking at the accuracy breakdown by node level (Figures 8, 9 and 10), it is notable that for multiple-question tasks with more steps, the adapted versions of MP and EM generally show better accuracy at most of the node levels. For data sets from a task with fewer steps in its workflow (fewer levels in the graph), such as the Serengeti task in Figure 8, MV performs better.

Meanwhile, from Table IV we can see that MV shows acceptable accuracy for most of the volunteer data sets (mostly over 75 per cent, except for the GalaxyZoo data set), but has poor accuracy (less than 60 per cent) in the paid crowd context, even though it performs better there than the other individual algorithms we tested. This suggests that it needs to be complemented by other methods that might do well on the specific objects where MV cannot. Looking at the accuracy-by-level results, there is no indication that, as the depth of the task (number of levels) increases, accuracy tends to consistently increase or decrease. The accuracy at each level is more related to its intrinsic character (e.g. the number of options at that level, and the ambiguity or subjectivity of the corresponding object). For instance, the darkskies task asks the user to evaluate the sharpness and cloudiness of the image, which can be subjective to some degree. This is also why the results by node level paint an interesting picture: at different node levels of different workflows, sometimes em has the best result (such as levels 4 and 5 of GalaxyZoo), sometimes mp has the best result (such as level 1 of Serengeti in the volunteer case) and other times mv has the best result (levels 1, 2 and 3 of Darkskies in both the volunteer and the paid context).

Notice that MP in the darkskies paid crowd context is the only case where we observe the naive approach having higher overall accuracy (by path) than the adapted one. This is due to the fact that both level 2 and level 3 of the darkskies workflow (determining the cloudiness and sharpness of the image) are in essence independent of the first node level (whether the image shows a city, stars or anything else), even though the task workflow makes them subsequent questions shown only when "city" is chosen as the label at the first node level. Similarly, the accuracy-by-level result of mp_adapted is lower than that of mp_naive on a few other occasions at different node levels, but on those occasions there is always one node level at which mp_naive has considerably poor accuracy, such as node level 2 of Galaxyzoo, which subsequently leads to a very low overall accuracy when considering the whole path. The reason the mp_adapted approach can have lower accuracy at a certain level is that the mp approach only returns 1.0 or 0.0 to indicate whether a label is the predicted one, whereas our adapted approach tries to assemble the most probable valid label path (as shown in Algorithm 4) from the candidate labels predicted at the individual node levels. Therefore, in the mp case, the randomness in ranking the combinations might not do well for the corresponding node level; however, the overall accuracy has been shown to be better than that of the naive approach, which completely neglects the validity of a label path.

Notice that, although our adapted approaches achieve higher accuracy for the first node level in most cases, mv_adapted has slightly lower accuracy than mv_naive for the GalaxyZoo workflow in the volunteer context. This is because we assemble the result based on the overall probability of a path (the percentages of votes at each node level multiplied together) instead of assuming the top-voted label at node level 1 is correct (and then traversing subsequent nodes based on that assumption). Our main purpose is to obtain the most probable valid label path, which has been shown to be effective in Table IV. We have run significance testing for all chosen algorithms. The result is statistically significant for all our adapted approaches, as the p-value is smaller than the pre-defined significance level (5 per cent) in all cases.

5. Discussion

In this section, we expand on the key findings of the evaluation results introduced earlier.

5.1 Crowd context matters

We deliberately chose three representative tasks, each with two data sets produced by volunteers and by a paid crowd. Based on our results, there is a distinctive difference in performance for the same algorithm applied in these two different contexts. For all algorithms, without exception, the accuracy achieved in the volunteer context is evidently higher than in the paid crowd context. For the same workflow, the overall accuracy (by path) achieved in the volunteer context is normally around 30 per cent higher than in the paid crowd context for workflows with two to three questions. However, this does not seem to be the case when the workflow involves more questions, such as in the galaxyzoo case, where the best accuracy the algorithms can achieve is only around 5 per cent higher in the volunteer context compared to the paid crowd context.

5.2 Workflow counts

From the representative tasks we have shown so far, there are two main factors that need to be taken into account when designing a classification crowdsourcing workflow, especially when classification steps are interdependent: the number of questions (determining the depth of the graph) and the number of answer options per question (the width of the corresponding node level, affecting the cognitive effort required to pass that node level with the correct options chosen). In our evaluation, we found evidence that both depth and width impact the overall performance of the inference algorithms. One visible pattern concerns the paid crowd data sets. In this setting, overall accuracy (by path) increases as the depth of the graph increases (for both mv_adapted and mp_adapted), which suggests that it might be a good idea to have more classification questions each with fewer options rather than fewer questions with many options to choose from, particularly when the crowd's skill level is uncertain. The other notable aspect is that, in the volunteer context, the mp algorithm has comparable performance to mv in the Serengeti workflow, but not in the other two workflows with more levels.

5.3 Heuristics-based aggregation as an addition

Observing the results in Section 4.3, it seems promising to combine the output of these algorithms using a heuristic strategy to perform better inference. We want to use the results from mv_adapted, em_adapted and mp_adapted in combination, to exploit their respective strengths and weaknesses for complex classification tasks. To do so, we could build an aggregator based on the following intuitions. First, the number of unique classifications of an object (denoted u) shows the degree to which the crowd workers agree or disagree on the classification; a higher number indicates a higher degree of disagreement and normally implies the object is either somewhat difficult or ambiguous to classify. Second, the ratio (denoted r) between the number of unique classifications/answers collected from the crowd and the total number of classifications/judgements similarly indicates how diverse the answers are for the corresponding object. Third, the three-sigma rule (Pukelsheim, 1994) in the empirical sciences suggests that almost all values lie within three standard deviations of the mean in a normal distribution; theoretically, the mean plus or minus one, two or three standard deviations covers 68, 95 and 99.7 per cent of the data, respectively. In the cases where MV might potentially fail (where workers tend to disagree), the number of unique classifications, or the ratio of unique classifications to the total number of classifications for an object, falls within the higher range of the distribution. Thus, a heuristic aggregation strategy we could consider is as follows: look at the intrinsic characteristics of the collected classifications for each object, such as the number of unique classifications and its ratio against the total number of classifications; then, based on the third intuition above, use the skewness (denoted s) of the distributions of unique counts (U ∼ N(u_μ, u_σ)) and ratios (R ∼ N(r_μ, r_σ)) to heuristically choose a bound beyond which MV is complemented by other approaches, as sketched below. However, choosing an optimal threshold is not straightforward and needs to be explored in future work.
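Purely as an illustration of this idea, the sketch below flags objects whose disagreement lies unusually high in the distribution, using a mean-plus-k-standard-deviations cut-off; the threshold value and the function name are placeholders, not choices made in this paper.

```python
import numpy as np

def needs_fallback(unique_counts, ratios, n_sigma=1.0):
    """unique_counts[i]: number of distinct answers collected for object i;
    ratios[i]: unique_counts[i] divided by the total number of answers.

    Returns a boolean array marking objects whose disagreement lies above
    mean + n_sigma * std on either measure; these are candidates where MV
    may fail and another algorithm (e.g. em_adapted) could be used instead.
    The n_sigma threshold is a placeholder, not a value from the paper.
    """
    u = np.asarray(unique_counts, dtype=float)
    r = np.asarray(ratios, dtype=float)
    high_u = u > u.mean() + n_sigma * u.std()
    high_r = r > r.mean() + n_sigma * r.std()
    return high_u | high_r
```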

6. Related work

Our approach is informed by existing work on microtask crowdsourcing and quality assurance in crowdsourcing, which we review in this section.

6.1 Microtask crowdsourcing and workflows

In crowdsourcing, a problem sometimes needs to be decomposed into smaller, fine-grained microtasks which are then arranged in a workflow for more effective processing. In general, a workflow consists of a set of microtasks; the microtasks are sometimes of different types and can be dependent on or independent of each other. For instance, the find-fix-verify workflow proposed by Bernstein et al. (2010) uses microtask crowdsourcing to proofread and shorten text in three steps: finding areas of improvement in the text; fixing or improving them; and verifying the quality of the changes. In each step, the crowd is asked to carry out the same type of microtask, sometimes iteratively. In Kittur et al. (2008, 2013) and Acosta et al.'s (2013) studies, researchers have proposed grouping the same or similar microtasks into batches as a means to facilitate learning effects. Previous studies have also shown that task performance can be improved as a function of several factors, including the design of tasks and workflows, motivation and incentives, and training (Bernstein et al., 2010; Demartini et al., 2012; Kittur et al., 2008; Wiggins et al., 2011).

On citizen science platforms such as Zooniverse[9], most classification projects are not simple one-question tasks; instead, multiple questions are chained together. Zooniverse uses a workflow to "group a collection of tasks into a logic unit"[10], which, in essence, refers to a multiple-question task that needs to be finished in several steps. In Snapshot Serengeti[1], classifying an image means answering a set of independent questions, sometimes several times when more than one animal is present in the image. In Cities at Night[2] and Galaxy Zoo[3], questions are inter-related and the answers given in one step determine the questions in the subsequent steps. In the context of such classification tasks, a workflow refers to the logical organisation of the classification questions and their corresponding options.

Most previous studies of crowdsourcing workflows have focussed on the design of the workflows and have shown that a particular type of workflow can be crowdsourced effectively (in terms of the accuracy of outputs, budget, time, etc.) (Little et al., 2009; Bernstein et al., 2010; Tran-Thanh et al., 2015). In some cases, researchers have proposed bespoke quality assurance methods for their workflows (Lintott et al., 2011; Willett et al., 2013). Our work proposes a strategy which can be applied to determine the correct label path for a whole range of classification tasks spanning several steps with independent or dependent multiple-choice questions, which differs from existing research that mainly focuses on the result of the final step (no matter how many previous steps exist in the workflow).

6.2 Inference algorithms

Researchers have proposed inference algorithms, mathematical models that can automatically infer the correct solution to a given problem from a solution space defined by the crowd. For example, Ipeirotis et al. presented an algorithm that assesses the performance of crowd workers and exploits this information to estimate the quality of answers on Mechanical Turk (Ipeirotis et al., 2010). Karger et al. proposed using MP to infer correct answers from workers' answers (Karger et al., 2011). Bachrach et al. (2012) used a Bayesian graphical model to grade test answers in scenarios where the ground truth cannot be made available. Whitehill et al. (2009) followed an expectation maximisation approach to identify correct classifications, depending on the expertise of the workers and the level of difficulty of the task. In the citizen science project Galaxy Zoo Supernovae, crowd answers were analysed using a Bayesian generalisation of the same expectation maximisation idea (Simpson et al., 2011). More recently, Difallah et al. (2015b) compiled a set of features that can be used to predict answer quality, based on an analysis of Mechanical Turk logs. Several studies have shown that it is possible to combine automatic prediction methods (such as Bayesian or generative probabilistic models) with additional input from the crowd to further improve the accuracy of the predictions (dos Reis et al., 2015; Hare et al., 2013; Ipeirotis et al., 2010; Loni et al., 2014; Simpson et al., 2013). Other studies have analysed and compared different algorithms (Zheng et al., 2017a; et al., 2015; Sheshadri and Lease, 2013), emphasising the need for more research to understand the interplay among different sets of design parameters on the overall performance.

All these existing methods have considerably advanced the state of the art. However, they cannot be applied to every type of microtask crowdsourcing workflow without restrictions. Moreover, most of the research carried out so far in this space has looked at rather simple binary or multiple-choice classification tasks with the aim of identifying a single, correct answer. This class of microtasks, albeit important and widely used, is not always the norm. As we have seen in the examples from the previous section, there are cases where a problem cannot easily be decomposed into independent microtasks, or where different, related microtasks should be grouped into more complex workflows for efficiency reasons. Although there are a few recent works looking into relatively complex multiple-step classification tasks, each of them has a domain-specific or problem-specific focus (Parameswaran et al., 2011; Kim et al., 2002; Wu et al., 2012; Bragg et al., 2013; Kamar and Horvitz, 2015; Otani et al., 2016). Bragg et al. (2013) and Otani et al. (2016) both research entity classification, which normally involves categorising a given entity into parent-child classes in different steps, but from very different perspectives. Bragg et al. (2013) focus on improving the workflow for generating a taxonomy, as well as on inference methods to induce the parent-child relationship, while Otani et al. (2016) focus on tasks where a parent-child relationship exists between two adjacent classification steps and propose label aggregation methods that adapt the existing GLAD method (Whitehill et al., 2009) by considering the hierarchical class-subclass structure. In addition, Wu et al. (2012) investigate the sequential data labelling scenario and present Sembler, which ensembles crowd sequential labellings by leveraging the statistical correlation and dependency among multiple instances/sentences; this is domain specific and not applicable to other multiple-step classifications where no such statistics can be exploited. Parameswaran et al. (2011) and Kamar and Horvitz (2015) look particularly at multiple-step image classification tasks, but both take approaches that are not easily generalised to other multiple-step classifications. Parameswaran et al. (2011) explicitly formulate the classification task as a human-assisted graph search problem, presenting the dimensions characterising the different types of classification and developing algorithms, evaluated with simulations, to optimise the questions to be asked at the different nodes. Kamar and Horvitz (2015), on the other hand, focus on optimising worker allocation in hierarchical classification tasks (HCT) and develop answer models and evidence models for HCT consensus; both models are constructed with supervised learning, assisted by Sloan Digital Sky Survey (SDSS) features identified by machine vision and available for the GalaxyZoo data set. There is also some research particularly dedicated to automatic hierarchical classification, where a taxonomy is given and a parent-child relationship among classes exists, but it is all bound to a certain domain. For instance, Dumais (2000) investigates automatic hierarchical classification using support vector machines, with existing web pages whose categories are known used as training data. Su et al. (2006) present an automatic method to classify structured web databases by leveraging probing queries, the returned counts of query results and the SVM classifier. Such automatic hierarchical classification not only needs existing labelled data for training but also focuses on classifications where the answer to a further classification step down the line (child classes) is always a sufficient condition to confirm the answer to the previous classification step (parent classes).

Our approach differs from existing work mainly in that it is not restricted to a specific type of multiple-step classification and does not need additional information, such as machine-identified features of the image or frequency/correlation statistics of word usage, nor does it rely on parent-child relationships between classification steps. Our method is general and intuitively easy to apply to any multi-step classification. We discussed the three main individual algorithms in Section 2 and noted that, whilst all three can be used to infer the correct answer for a multiple-choice question, they differ in terms of their inputs and outputs. In our approach, we devised a new strategy for using the existing algorithms to achieve higher classification accuracy.

7. Conclusion

Ensuring quality is one of the grand challenges of microtask crowdsourcing. While previous research has looked at inferring correct answers for microtasks consisting of single binary or multiple-choice questions, our research proposes a model that can be applied to both single-question and multiple-question scenarios, filling the gap in understanding how to aggregate in the multiple-question scenario. We propose a graph model and an "adapted" aggregation method that can improve the accuracy of inferring the true label path in complex workflows with several interdependent questions. Although a few previous works have tried to address similar multiple-step classifications, they either limit themselves to hierarchical classification scenarios where a parent-child relationship exists between classification steps, or restrict the method by requiring additional information. We propose using the graph to model a microtask crowdsourcing workflow and to support inference algorithms in making decisions about correct labels for classification tasks with multiple questions, where the answer to one question does not have to be a sufficient condition for, or imply the correctness of, the answer to the previous question. We believe this is the first work that investigates aggregation in multiple-step classification tasks with interdependent questions to infer the correct label path and assess the classification accuracy accordingly.

To this end, we explored three inference algorithms, MV, MP and EM, each with proven benefits for quality assurance in crowdsourcing. We compared the performance of our adapted approach with the existing naive approach, using six representative data sets. We evaluated the performance of the individual algorithms both for overall accuracy, where a full labelling path is considered as an atomic, correct answer, and with a more refined measure which looks at accuracy at the individual node levels of the workflow graph. The results show that our adapted approach significantly improves accuracy compared with the naive approach. The results also demonstrate that, while MV does well in overall accuracy, a deeper analysis of the accuracy at each node level reveals a more interesting picture. Hence, a heuristic-based aggregation approach that combines results from multiple algorithms, leveraging their respective strengths, might be a better solution. This suggests the need for more dynamic inference approaches that can adapt to the complexity of the crowdsourcing workflow.

In future work, we plan to devise inference methods that take other, more workflow-specific factors into account. Our current method assumes independence between labels from different levels when inferring the answer for each level. It could potentially be improved to consider possible correlations between labels at different node levels, for instance, by giving different weights to labels based on the inferred result from the previous level. Such a method requires a top-down traversal process, which might bring side effects since it relies heavily on the inferred result from the previous level and carries its effect (weight) on to subsequent levels even when the choice at a previous level is incorrect. As the correlation between labels at different node levels is complicated, the feasibility of incorporating such correlation information into the aggregation process needs further investigation. Meanwhile, the number of options and the length of possible paths in a workflow deserve more in-depth experiments. One promising direction will be to employ other machine learning approaches for truth inference, for instance, using the workflow properties along with the crowdsourced data to learn and explore features automatically (Huynh et al., 2013) and producing a decision tree to help choose the proper inference algorithm. Alternatively, certain properties of the crowd-collected data could be further exploited to train machine learning algorithms with selective labels to directly infer the true label path.

Figures

Figure 1. Classification tasks from Zooniverse

Figure 2. Example classification paths collected from 20 workers for a given photo

Figure 3. Representation of a task with a single question

Figure 4. Representation of the Dark Skies workflow from Cities at Night

Figure 5. Representation of the Snapshot Serengeti workflow from Zooniverse

Figure 6. Representation of the GalaxyZoo workflow from Zooniverse

Figure 7. Graph representation of an example classification workflow Wf vs the corresponding classic way of looking at the classification with multiple questions

Figure 8. Accuracy by node level (Serengeti)

Figure 9. Accuracy by node level (Darkskies)

Figure 10. Accuracy by node level (Galaxyzoo)

Table I. Notations

Notation Definition
o The current object being classified
O The set of all objects in a data set
A All available options
u User u
U The set of all users who contributed to the current data set
U_o All users who have classified object o
L All labels received from the crowd, with L ⊆ A
L_o The set of all labels from the crowd for object o
L_u The set of all labels from user u
l_o^u The label for object o from user u
l̃_o The inferred label for object o

Table II. Definitions

Term Definition
Task A general term referring to an action or a series of actions that need to be executed
Classification task A task classifying objects into given categories; it could be a simple task (one question) or a relatively complex task (more than one question)
Microtask A task decomposed into smaller units, making it easier for the crowd. One microtask is equivalent to one question in a classification task
Workflow The arrangement/chaining of microtasks so that together they complete the task
Question A classification question asked of the user to elicit/assign a label to an attribute of the object being classified
Option The set of possible labels
Chosen option The option a user chooses for a question
Correct label The correct label for a question
Chosen path The set of labels a user chooses for the entire workflow
Correct path The correct set of labels for the entire workflow
Workflow graph The workflow modelled as a directed acyclic graph (DAG), in which the root node represents the object under consideration and all other nodes are classification options
Node A representation of an option in our model
Node level The position in the sequence in which a question is presented to the user within a workflow

Notations specific to our model

Notation Definition
$W_f$ The graph based on the workflow for classifying object $o$; it has node levels indicating the questions used to classify the corresponding attributes of the given object, and nodes representing the options available for each attribute
$A(n)$ The available options at node level $n$
$a_n(j)$ The individual option at node level $n$, where $j \in \{1, \ldots, |A(n)|\}$
$l_{o(n)}^u$ The label chosen by user $u$ at node level $n$ for object $o$. Thus, the labelling result $(l_{o(1)}^1, l_{o(2)}^1, \ldots, l_{o(n)}^1)$ represents the ordered list of nodes (the traversal path) visited by user 1 when classifying $o$, which is called a label path
$L_o^u$ The label path chosen by user $u$ for object $o$
$L_{o(n)}$ All labels for object $o$ at node level $n$
$L_{o(n)}^{unique}$ The unique labels for object $o$ at node level $n$, with $L_{o(n)}^{unique} \subseteq A(n)$
$\tilde{L}_o$ The inferred label path for object $o$; it is the set of inferred labels for each node level, written as $(\tilde{l}_{o(1)}, \ldots, \tilde{l}_{o(n)})$
$L_o^{gold}$ The true (gold) label path for object $o$
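For readers who prefer a concrete data structure, the sketch below shows one way the workflow graph $W_f$, the option sets $A(n)$ and a label path could be represented in code. The class and field names are our own illustrative assumptions and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class WorkflowGraph:
    # Available options A(n) per node level n (1-indexed), e.g. {1: [...], 2: [...]}.
    options_per_level: Dict[int, List[str]] = field(default_factory=dict)

    def is_valid_path(self, label_path: List[str]) -> bool:
        """Check that a label path picks one available option at every node level."""
        if len(label_path) != len(self.options_per_level):
            return False
        return all(
            label in self.options_per_level.get(level + 1, [])
            for level, label in enumerate(label_path)
        )

# Example: a workflow shaped like Dark Skies, with 8, 3 and 2 options at its three levels.
wf = WorkflowGraph({
    1: [f"l1_opt{j}" for j in range(8)],
    2: [f"l2_opt{j}" for j in range(3)],
    3: [f"l3_opt{j}" for j in range(2)],
})
print(wf.is_valid_path(["l1_opt4", "l2_opt0", "l3_opt1"]))  # True
```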

Accuracy (by path) of each algorithm

Data set Graph depth/size Crowd type Algorithm Accuracy
serengeti 54-11 volunteer mv_naive 0.590
serengeti 54-11 volunteer mv_adapted 0.776
serengeti 54-11 volunteer em_naive 0.572
serengeti 54-11 volunteer em_adapted 0.655
serengeti 54-11 volunteer mp_naive 0.755
serengeti 54-11 volunteer mp_adapted 0.755
serengeti 54-11 paid mv_naive 0.299
serengeti 54-11 paid mv_adapted 0.459
serengeti 54-11 paid em_naive 0.244
serengeti 54-11 paid em_adapted 0.337
serengeti 54-11 paid mp_naive 0.083
serengeti 54-11 paid mp_adapted 0.207
darkskies 8-3-2 volunteer mv_naive 0.690
darkskies 8-3-2 volunteer mv_adapted 0.785
darkskies 8-3-2 volunteer em_naive 0.040
darkskies 8-3-2 volunteer em_adapted 0.450
darkskies 8-3-2 volunteer mp_naive 0.340
darkskies 8-3-2 volunteer mp_adapted 0.495
darkskies 8-3-2 paid mv_naive 0.405
darkskies 8-3-2 paid mv_adapted 0.530
darkskies 8-3-2 paid em_naive 0.020
darkskies 8-3-2 paid em_adapted 0.385
darkskies 8-3-2 paid mp_naive 0.335
darkskies 8-3-2 paid mp_adapted 0.305
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 volunteer mv_naive 0.554
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 volunteer mv_adapted 0.631
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 volunteer em_naive 0.470
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 volunteer em_adapted 0.564
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 volunteer mp_naive 0.002
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 volunteer mp_adapted 0.562
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 paid mv_naive 0.371
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 paid mv_adapted 0.579
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 paid em_naive 0.000
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 paid em_adapted 0.331
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 paid mp_naive 0.002
galaxyzoo 3-3-2-3-2-2-3-6-4-2-7 paid mp_adapted 0.367
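If, as the table's title suggests, accuracy by path means the fraction of objects whose entire inferred label path exactly matches the gold path, the computation can be sketched as below; the exact-match reading and the function name are our assumptions rather than a description of the paper's evaluation code.

```python
def path_accuracy(inferred_paths, gold_paths):
    """Fraction of objects whose inferred label path matches the gold path exactly."""
    assert len(inferred_paths) == len(gold_paths)
    if not gold_paths:
        return 0.0
    exact = sum(1 for inf, gold in zip(inferred_paths, gold_paths) if list(inf) == list(gold))
    return exact / len(gold_paths)

# Example with two objects: one path fully correct, one wrong at the second level.
print(path_accuracy([("animal", "zebra"), ("animal", "zebra")],
                    [("animal", "zebra"), ("animal", "wildebeest")]))  # 0.5
```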

References

Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Auer, S. and Lehmann, J. (2013), “Crowdsourcing linked data quality assessment”, The Semantic Web – ISWC 2013, pp. 260-276.

Bachrach, Y., Minka, T. and Guiver, J. (2012), “How to grade a test without knowing the answers – a Bayesian graphical model for adaptive crowdsourcing and aptitude testing”, Proceedings of the 29th International Conference on Machine Learning (ICML 2012).

Batini, C., Cappiello, C., Francalanci, C. and Maurino, A. (2009), “Methodologies for data quality assessment and improvement”, ACM Computing Surveys, Vol. 41 No. 3, pp. 1-52.

Bernstein, M.S., Little, G., Miller, R.C., Hartmann, B., Ackerman, M.S., Karger, D.R., Crowell, D. and Panovich, K. (2010), “Soylent: a word processor with a crowd inside”, Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, ACM, pp. 313-322.

Bragg, J., Mausam and Weld, D.S. (2013), “Crowdsourcing multi-label classification for taxonomy creation”, in HCOMP 2013, First AAAI Conference on Human Computation and Crowdsourcing.

Dawid, A.P. and Skene, A.M. (1979), “Maximum likelihood estimation of observer error-rates using the EM algorithm”, Applied Statistics, Vol. 28 No. 1, pp. 20-28.

Demartini, G., Difallah, D.E. and Cudré-Mauroux, P. (2012), “Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking”, Proceedings of the 21st international conference on World Wide Web, ACM, pp. 469-478.

Difallah, D.E., Catasta, M., Demartini, G., Ipeirotis, P.G. and Cudré-Mauroux, P. (2015a), “The dynamics of micro-task crowdsourcing: the case of Amazon MTurk”, Proceedings of the 24th International Conference on World Wide Web (WWW ’15), pp. 238-247.

dos Reis, F.J.C., Lynn, S., Ali, H.R., Eccles, D., Hanby, A., Provenzano, E., Caldas, C., Howat, W.J., McDuffus, L.-A. and Liu, B. (2015), “Crowdsourcing the general public for large scale molecular pathology studies in cancer”, EBioMedicine, Vol. 2 No. 7, pp. 679-687.

Dumais, S. and Chen, H. (2000), “Hierarchical classification of web content”, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 256-263.

Eickhoff, C. and de Vries, A. (2011), “How crowdsourcable is your task”, in Proceedings of the Workshop on Crowdsourcing for Search and Data Mining (CSDM) at the Fourth ACM International Conference on Web Search and Data Mining (WSDM), pp. 11-14.

Gadiraju, U., Demartini, G., Kawase, R. and Dietze, S. (2015), “Human beyond the machine: challenges and opportunities of microtask crowdsourcing”, IEEE Intelligent Systems, Vol. 30 No. 4, pp. 81-85.

Gadiraju, U., Kawase, R. and Dietze, S. (2014), “A taxonomy of microtasks on the web”, Proceedings of the 25th ACM conference on Hypertext and social media, ACM, pp. 218-223.

Gelas, H., Abate, S.T. and Besacier, L. (2011), “Quality assessment of crowdsourcing transcriptions for African languages”, Proceedings of Interspeech 2011, pp. 3065-3068.

Hare, J.S., Acosta, M., Weston, A., Simperl, E., Samangooei, S., Dupplaw, D. and Lewis, P.H. (2013), “An investigation of techniques that aim to improve the quality of labels provided by the crowd”, in Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, October 18-19, 2013., vol. 1043 of CEUR Workshop Proceedings, available at: CEUR-WS.org

Hung, Q.V.N., Tam, N.T., Tran, L.N. and Aberer, K. (2013), “An evaluation of aggregation techniques in crowdsourcing”, Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 8181 LNCS, no. PART 2, pp. 1-15.

Hung, N.Q.V., Thang, D.C., Weidlich, M. and Aberer, K. (2015), “Minimizing efforts in validating crowd answers”, Proceedings of the ACM SIGMOD International Conference on Management of Data, Vol. 2015-May, pp. 999-1014.

Huynh, T.D., Ebden, M., Venanzi, M., Ramchurn, S., Roberts, S. and Moreau, L. (2013), “Interpretation of crowdsourced activities using provenance network analysis”, The First AAAI Conference on Human Computation and Crowdsourcing, pp. 78-85.

Ipeirotis, P.G., Provost, F., Sheng, V.S. and Wang, J. (2014), “Repeated labeling using multiple noisy labelers”, Data Mining and Knowledge Discovery, Vol. 28 No. 2, pp. 402-441.

Ipeirotis, P.G., Provost, F. and Wang, J. (2010), “Quality management on amazon mechanical Turk”, Proceedings of the ACM SIGKDD Workshop on Human Computation – HCOMP ’10, p. 64.

JCGM (2008), “JCGM 200:2008 International vocabulary of metrology – basic and general concepts and associated terms (VIM)”, 3rd ed., International Organization for Standardization, Geneva.

Kahn, B.K., Strong, D.M. and Wang, R.Y. (2002), “Information quality benchmarks: product and service performance”, Communications of the ACM, Vol. 45 No. 4, pp. 184-192.

Kamar, E., Hacker, S. and Horvitz, E. (2012), “Combining human and machine intelligence in large-scale crowdsourcing”, Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, International Foundation for Autonomous Agents and Multiagent Systems, Vol. 1, pp. 467-474.

Kamar, E. and Horvitz, E. (2015), “Planning for crowdsourcing hierarchical tasks”, Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, p. 2030.

Karger, D.R., Oh, S. and Shah, D. (2011), “Iterative learning for reliable crowdsourcing systems”, Advances in Neural Information Processing Systems, pp. 1953-1961.

Khattak, F.K. and Salleb-Aouissi, A. (2011), “Quality control of crowd labeling through expert evaluation”, Second Workshop on Computational Social Science and the Wisdom of Crowds (NIPS 2011), pp. 1-5.

Kim, J.-H., Kang, I.-H. and Choi, K.-S. (2002), “Unsupervised named entity classification models and their ensembles”, Proceedings of the 19th International Conference on Computational Linguistics, Vol. 1, pp. 1-7.

Kittur, A., Chi, E.H. and Suh, B. (2008), “Crowdsourcing user studies with mechanical Turk”, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, pp. 453-456.

Kittur, A., Nickerson, J.V., Bernstein, M., Gerber, E., Shaw, A., Zimmerman, J., Lease, M. and Horton, J. (2013), “The future of crowd work”, Proceedings of the 2013 Conference on Computer Supported Cooperative Work – CSCW ’13, ACM Press, New York, NY, p. 1301.

Kittur, A., Smus, B., Khamkar, S. and Kraut, R.E. (2011), “CrowdForge: Crowdsourcing complex work”, Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology – UIST ’11, pp. 43-52.

Kulkarni, A.P., Can, M. and Hartmann, B. (2011), “Turkomatic”, Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems – CHI EA ’11, p. 2053.

Lintott, C., Schawinski, K., Bamford, S., Slosar, A., Land, K., Thomas, D., Edmondson, E., Masters, K., Nichol, R.C. and Raddick, M.J. (2011), “Galaxy zoo 1: data release of morphological classifications for nearly 900 000 galaxies”, Monthly Notices of the Royal Astronomical Society, Vol. 410 No. 1, pp. 166-178.

Little, G., Chilton, L.B., Goldman, M. and Miller, R.C. (2009), “Turkit: tools for iterative tasks on mechanical Turk”, in Proceedings of the ACM SIGKDD Workshop on Human Computation, ACM, pp. 29-30.

Liu, X., Lu, M., Ooi, C., Shen, Y., Wu, S. and Zhang, M. (2012), “CDAS: a crowdsourcing data analytics system”, Proceedings of the VLDB Endowment, Vol. 5 No. 10, pp. 1040-1051.

Loni, B., Hare, J., Georgescu, M., Riegler, M., Zhu, X., Morchid, M., Dufour, R. and Larson, M. (2014), “Getting by with a little help from the crowd: practical approaches to social image labeling”, Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia, pp. 69-74.

Malone, T.W., Laubacher, R. and Dellarocas, C. (2010), “The collective intelligence genome”, IEEE Engineering Management Review, Vol. 38 No. 3.

Mao, A., Kamar, E., Chen, Y., Horvitz, E., Schwamb, M.E., Lintott, C.J. and Smith, A.M. (2013), “Volunteering versus work for pay: incentives and tradeoffs in crowdsourcing”, First AAAI Conference on Human Computation and Crowdsourcing, pp. 94-102.

Otani, N., Baba, Y. and Kashima, H. (2016), “Quality control for crowdsourced hierarchical classification”, Proceedings of the IEEE International Conference on Data Mining (ICDM 2016), pp. 937-942.

Parameswaran, A., Sarma, A.D., Garcia-Molina, H., Polyzotis, N. and Widom, J. (2011), “Human-Assisted graph search: it’s okay to ask questions”, Proceedings of the VLDB Endowment, Vol. 4 No. 5, pp. 267-278.

Paulheim, H. and Bizer, C. (2014), “Improving the quality of linked data using statistical distributions”, International Journal on Semantic Web and Information Systems, Vol. 10 No. 2, pp. 63-86.

Pukelsheim, F. (1994), “The three sigma rule”, The American Statistician, Vol. 48 No. 2, pp. 88-91.

Redi, J. and Povoa, I. (2014), “Crowdsourcing for rating image aesthetic appeal: better a paid or a volunteer crowd?”, Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia – CrowdMM ’14, pp. 25-30.

Rosenthal, S.L. and Dey, A.K. (2010), “Towards maximizing the accuracy of human-labeled sensor data”, in Proceedings of the 15th International Conference on Intelligent User Interfaces – IUI ’10, ACM Press, New York, NY, p. 259.

Shahaf, D. and Horvitz, E. (2010), “Generalized task markets for human and machine computation”, in AAAI.

Sheshadri, A. and Lease, M. (2013), “SQUARE: a benchmark for research on computing crowd consensus”, First AAAI Conference on Human Computation and Crowdsourcing, pp. 156-164.

Simpson, E., Roberts, S., Psorakis, I. and Smith, A. (2013), “Dynamic Bayesian combination of multiple imperfect classifiers”, Studies in Computational Intelligence, Vol. 474, pp. 1-35.

Simpson, E., Roberts, S.J., Smith, A. and Lintott, C. (2011), “Bayesian combination of multiple, imperfect classifiers”, in Proceedings of the 25th Conference on Neural Information Processing Systems, Granada.

Snow, R., O’Connor, B., Jurafsky, D. and Ng, A.Y. (2008), “Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks”, Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 254-263.

Su, W., Wang, J. and Lochovsky, F. (2006), “Automatic hierarchical classification of structured deep web databases”, Web Information Systems – WISE 2006, Springer, Berlin Heidelberg, pp. 210-221.

Tran-Thanh, L., Huynh, T.D., Rosenfeld, A., Ramchurn, S.D. and Jennings, N.R. (2015), “Crowdsourcing complex workflows under budget constraints”, Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), pp. 1298-1304.

Vickrey, D., Bronzan, A., Choi, W., Kumar, A., Turner-Maier, J., Wang, A. and Koller, D. (2008), “Online word games for semantic data collection”, Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 533-542.

Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J. and Movellan, J. (2009), “Whose vote should count more: optimal integration of labels from labelers of unknown expertise”, Advances in Neural Information Processing Systems, Vol. 22 No. 1, pp. 1-9.

Wiggins, A., Newman, G., Stevenson, R.D. and Crowston, K. (2011), “Mechanisms for data quality and validation in citizen science”, 2011 IEEE Seventh International Conference on e-Science Workshops (eScienceW), IEEE, pp. 14-19.

Willett, K.W., Lintott, C.J., Bamford, S.P., Masters, K.L., Simmons, B.D., Casteels, K.R.V., Edmondson, E.M., Fortson, L.F., Kaviraj, S., Keel, W.C., Melvin, T., Nichol, R.C., Raddick, M.J., Schawinski, K., Simpson, R.J., Skibba, R.A., Smith, A.M. and Thomas, D. (2013), “Galaxy zoo 2: detailed morphological classifications for 304 122 galaxies from the sloan digital sky survey”, Monthly Notices of the Royal Astronomical Society, Vol. 435 No. 4, pp. 2835-2860.

Wu, X., Fan, W. and Yu, Y. (2012), “Sembler: ensembling crowd sequential labeling for improved quality”, Proceedings of the National Conference on Artificial Intelligence, vol. 2, pp. 1713-1719.

Yang, J., Redi, J., Demartini, G. and Bozzon, A. (2016), “Modeling task complexity in crowdsourcing”, Proceedings of the Fourth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2016).

Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S. and Hitzler, P. (2013), “Quality assessment methodologies for linked open data”, Semantic Web.

Zhang, J., Sheng, V.S., Li, Q., Wu, J. and Wu, X. (2017a), “Consensus algorithms for biased labeling in crowdsourcing”, Information Sciences, Vol. 382-383, pp. 254-273.

Zheng, Y., Li, G., Li, Y., Shan, C. and Cheng, R. (2017b), “Truth inference in crowdsourcing: is the problem solved?”, Proceedings of the VLDB Endowment, Vol. 10 No. 5.

Further reading

Wang, J., Ipeirotis, P.G. and Provost, F. (2015), “Cost-effective quality assurance in crowd labeling”.

Corresponding author

Qiong Bu can be contacted at: qb1g13@soton.ac.uk
