Mitigating Skewed Bidding for Conference Paper Assignment
Inbal Rozencweig, Reshef Meir, Nick Mattei, Ofra Amir

TL;DR
This paper explores modifications to the paper bidding process in conference peer review to reduce bid skew and orphan papers, using a new platform and experiments to identify effective simple adaptations.
Contribution
It introduces a flexible bidding platform and provides experimental evidence on effective bidding process adaptations to improve paper-reviewer matching.
Findings
Simple bidding process adaptations can reduce bid skew.
Order of paper presentation influences bidding behavior.
Information on paper demand affects bidding patterns.
| Code | Condition | Participants | Non-spammers | Games | StReward | StOrder | StDemand |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B | Base | 50 | 29 | 29 | # | | |
| P | iPrices | 124 | 80 | 80 | | | |
| H | Highlight | 43 | 21 | 39 | | | |
| PS | P+Sort | 34 | 17 | 17 | | | |
| PHS | P+H+Sort | 33 | 28 | 59 | # | | |
| IR | Imp. Req. | 54 | 36 | 36 | | | |
| Total (controlled exp.) | | 338 | 211 | 260 | | | |
| FC | Control | 14 | 14 | – | – | | |
| FT | Treatment | 28 | 28 | – | – | | |
Proc. of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), London, UK, 2023.
Inbal Rozencweig: Technion - Israel Institute of Technology, Haifa, Israel
Reshef Meir: Technion - Israel Institute of Technology, Haifa, Israel
Nicholas Mattei: Tulane University, New Orleans, LA, USA
Ofra Amir: Technion - Israel Institute of Technology, Haifa, Israel
Abstract.
The explosion of conference paper submissions in AI and related fields has underscored the need to improve many aspects of the peer review process, especially the matching of papers and reviewers. Recent work argues that the key to improving this matching is to modify aspects of the bidding phase itself, to ensure that the set of bids over papers is balanced, and in particular to avoid orphan papers, i.e., those papers that receive no bids. In an attempt to understand and mitigate this problem, we have developed a flexible bidding platform to test adaptations to the bidding process. Using this platform, we performed a field experiment during the bidding phase of a medium-size international workshop that compared two bidding methods. We further examined via controlled experiments on Amazon Mechanical Turk various factors that affect bidding, in particular the order in which papers are presented Cabanac and Preuss (2013); Fiez et al. (2020a), and information on paper demand Meir et al. (2021). Our results suggest that several simple adaptations that can be added to any existing platform may significantly reduce the skew in bids, thereby improving the allocation for both reviewers and conference organizers.
Key words and phrases:
Peer Review, Bidding, Allocation
1. Introduction
Academic peer review of papers and grants sits at the heart of academic work and is the cornerstone of modern scientific enterprise Bohannon (2013). In some areas of computer science (mainly AI/ML), where most papers are submitted to large conferences, the fate of a paper is very much in the hands of automated assignment algorithms that help program chairs distribute thousands of papers among a similar number of committee members that serve as reviewers Meir et al. (2021). For this matching to happen, the committee members must first submit their preferences over papers. These preferences are supposed to reflect both the competence and the interest of the reviewer in reviewing those particular papers, using a designated platform—a process typically referred to as bidding, and that many of the readers probably know well from their own experience. Cabanac and Preuss (2013) provide a detailed account of conference bidding and review flow. After the bidding process, one of the many algorithms for matching under preferences Manlove (2013); Klaus et al. (2016) can be used to find an assignment satisfying various notions of optimality, fairness, stability, etc. Chen et al. (2020); Aziz et al. (2015, 2019); Benabbou et al. (2021).
Crucially, the current design of the bidding process falls far short of eliciting the full preferences and capabilities of reviewers. First, in some widely used platforms (e.g. EasyChair) there are only three levels of preference: ‘no’ / ‘maybe’ / ‘yes’. Other platforms provide a finer scale for reviewers to express their preferences. However, it is not clear to what extent reviewers use this flexibility, as extreme responding or scale-end bias is a well-known phenomenon in many social sciences Furnham (1986). Additionally, it is not yet clear whether a finer-grained scale of responses would actually lead to more desirable matchings between reviewers and papers. Second and more importantly, going over the entire list of submissions to determine the fit of every paper would take hours, whereas most reviewers would not invest that much time in bidding. Given that modern computer science conferences may have thousands of papers submitted to them, automated systems are increasingly used to impute the bids of reviewers over papers, an example being the Toronto Paper Matching System (TPMS) Charlin and Zemel (2013).
Hence, for these reasons and many others, it has been claimed that skewed bidding, i.e., where a few papers get many bids and some papers get no bids, is one of the main reasons for poor paper assignment Fiez et al. (2020a); Meir et al. (2021); Shah (2021); Shah and Lipton (2020); Lian et al. (2018); Leyton-Brown et al. (2022). The argument is that some papers get insufficient (or no) bids and have to be assigned randomly or manually by the program chair, often ending up at unqualified reviewers. For example, Cabanac and Preuss (2013) analyzed data from nearly 20,000 reviews in dozens of conferences managed on ConfMaster, and showed that more than 8,000 (42%) were done by reviewers who did not bid on the paper at all! [Footnote 1: Indeed, ConfMaster also allows reviewers to express negative preference on a paper by bidding ‘no’, but this is not very helpful when facing thousands of papers.] A poor assignment, in turn, may affect review quality Stelmakh et al. (2019); Peng et al. (2017); Rodriguez et al. (2007), and increase the overhead on conference chairs, who need to handle these orphan papers that receive no bids via manual (re)assignments. Skewed bidding is also likely to put obstacles in the way of achieving alternative goals such as fairness Payan and Zick (2021); Lian et al. (2018), as creating a fair assignment crucially depends on actually knowing the preferences of the reviewers.
Skewed Bidding at AI Conferences
At AAMAS, where we have data from PrefLib Mattei and Walsh (2013, 2017), there is also a high number of orphan papers. [Footnote 2: Note that in the AAMAS data reviewers were able to opt out of being included in the public dataset, hence some papers and bids are missing from this dataset. The AAMAS 2015 dataset contains 9,817 bids of 201 reviewers over 613 papers; this represents about 40% of the actual 22,360 bids of 281 reviewers over 670 papers. The 2016 data contains 161 out of 393 reviewers with bids over 442 out of 550 papers.] Within this, for AAMAS 2015 papers had 6.9 bids on average, yet there are 30 papers that have no bids at all (5%) and 95 papers that have fewer than 3 bids (15.4%), while for AAMAS 2016 papers had 6.5 bids on average, but there are 8 papers that have no bids at all (1.8%) and 54 papers with fewer than 3 bids (12.2%).
Simply increasing the bidding requirement, which increases the burden on reviewers during the bidding process, may still not be sufficient to deal with the issue of orphan papers. For example, at IJCAI 2018 each paper received almost 40 bids on average (!), and yet 140 papers (4%) had only two or fewer bids Meir et al. (2021).
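The statistics quoted above (average bids per paper, orphan papers, and papers with fewer than 3 bids) can be computed from a raw bid list along the following lines. This is a minimal sketch: the function name `bid_skew_summary`, the `(reviewer, paper)` pair format, and the toy data are illustrative assumptions, not the PrefLib data format.

```python
from collections import Counter

def bid_skew_summary(bids, papers, min_bids=3):
    """Summarize skew in a set of positive bids.

    bids: iterable of (reviewer_id, paper_id) pairs, one per positive bid
          (assumed to contain no duplicate pairs).
    papers: all paper ids, including papers that nobody bid on.
    min_bids: a paper with fewer bids than this counts as under-demanded.
    """
    papers = list(papers)
    counts = Counter(paper for _, paper in bids)
    per_paper = [counts.get(p, 0) for p in papers]
    return {
        "mean_bids": sum(per_paper) / len(papers),
        "orphans": sum(1 for c in per_paper if c == 0),
        "under_demanded": sum(1 for c in per_paper if c < min_bids),
    }

# Hypothetical toy data: 4 papers, 3 reviewers.
toy_bids = [("r1", "p1"), ("r2", "p1"), ("r3", "p1"), ("r1", "p2")]
summary = bid_skew_summary(toy_bids, ["p1", "p2", "p3", "p4"])
```

Note that orphan papers are included in the under-demanded count, which matches how the figures above are reported (e.g. 30 orphans among the 95 papers with fewer than 3 bids).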
1.1. Proposed Solutions
Given the strong skew in bidding, two suggestions to alleviate the problem have recently been put forward in the literature:
- (1) Presenting low-demand papers higher on the list Cabanac and Preuss (2013); Fiez et al. (2020a);
- (2) Providing information regarding paper demand Meir et al. (2021).
Interestingly, the first suggestion builds on reviewers’ cognitive biases, while the latter exploits their (bounded) rational behavior.
In more detail, Fiez et al. (2020a) proposed an algorithm to determine the order in which papers are presented to each reviewer during bidding. This suggestion rests on the primacy effect: items that appear earlier on a list are more likely to be selected Murphy et al. (2006). Primacy effects have been empirically shown to occur in conference bidding data on ConfMaster Cabanac and Preuss (2013). The underlying idea is that demand can be smoothed by taking advantage of well-known cognitive biases rather than by providing more information to bidders.
The other suggestion, by Meir et al. (2021), considers a model where the demand over papers is known (or revealed) to the bidders. They showed that as long as reviewers are individually rational and interpret their probability of being assigned a paper as inversely proportional to demand, a simple market-based scheme induces an incentive to follow the recommended instructions, and thereby reduces the skew in bids and leads to an improved assignment. Drawing inspiration from the Trading Post Mechanism Shapley and Shubik (1977), they suggest tagging papers with their inverse price rather than actual demand, and assign a budget the bidder is encouraged to use. Interestingly, rational bidders then have an incentive to exhaust their budget, but some bias in favor of high-price (low-demand) papers is necessary to obtain more balanced bids. Thus the model predicts bounded rationality would lead to the best results.
In both the work of Meir et al. (2021) and Fiez et al. (2020a), the actual behavior of the individual bidder (i.e. how their bid is affected by order or demand) is assumed, and the theoretical and empirical results are contingent on these assumptions. However, bidding behavior with prices has never been tried or empirically validated, and while the primacy effect has been shown to exist on average, it is not well understood how substantial it is compared to other factors.
1.2. Contribution
The goal of this paper is to explore how different components of the bidding platform affect the probability that a participant will select a particular paper. The main motivation, following Fiez et al. (2020a); Meir et al. (2021), is to promote the selection of papers with few bids, thereby reducing the skew and indirectly improving the paper assignment.
Since previous work has suggested controlling either the order of papers Cabanac and Preuss (2013); Fiez et al. (2020a), or the information given to users on the demand Meir et al. (2021), these are the main parameters we considered.
**Hypothesis 1 (Order Effect):**
Subjects tend to select papers appearing earlier on the list.
**Hypothesis 2 (Demand Effect):**
Subjects tend to select papers that are indicated as low-demand.
In addition we are interested in how these tendencies, if they exist, are distributed in the population, as well as in various factors affecting them. Hence, we designed and executed two types of experiments. The first is a field experiment on a medium-size workshop, and the second is a large scale experiment on Amazon Mechanical Turk where we control all the variables. In both experiments only some of the subjects were exposed to information on the demand, so their behavior can be compared to the control group.
Our main findings support both hypotheses, as we show that both paper order and information on demand can be used to shift reviewers towards low-demand papers. However, at the individual level there is a substantial difference. The order of papers has an effect on most subjects, but a rather weak one. In contrast, we identify in both experiments a small group of people that are highly sensitive to the demand, and results from the field experiment suggest that their effect on the bid distribution is substantial. We further study via controlled experiments the relative and cumulative effect of exposing the subjects to different forms of information on the demand, and simple factors affecting compliance with the bidding instructions.
We conclude with a list of simple, practical suggestions to improve the use of bidding platforms so as to reduce the prevalent skew in paper bidding, thereby improving paper matching.
1.3. Related Work
Ordering effects are well studied in economic and psychological models of choice. Typically, decision makers attend to the first few and last few items in a list more than the rest, increasing response rates for these items Krosnick and Alwin (1987). In an academic context, papers appearing earlier on an email digest are more likely to be downloaded and cited Feenberg et al. (2017). Cabanac and Preuss (2013) were the first to show that ordering effects occur in paper bidding. Later, Fiez et al. (2020a) suggested a sophisticated sorting algorithm that takes into account both dynamic demand and estimated reviewers’ preferences.
Rodriguez et al. (2007) aimed at uncovering the factors underlying bidders’ behavior in the JCDL’05 conference. Their starting point was that bids are expected to reflect the (objective) expertise of the reviewer w.r.t. the domain of the submission. They evaluate this expertise through alternative means, e.g., co-author network or keyword occurrence. The authors find very low correlation between reviewers’ areas and their bids, and conjecture that reviewer fatigue may be responsible. Our work does not get into whether reviewers’ preferences are indeed based on expertise (as opposed to, say, curiosity and interest in the title). It does however shed light on the other, more consistent factors that affect bidding behavior.
A major challenge in behavioral studies is having subjects with real-world preferences and comparing behavior against true preferences, which are private. Ideally, we would combine these in a single experiment that cleverly elicits the real preferences, as in the work of Budish and Kessler (2017) on course allocation, or by performing individual exit polls Blais et al. (2000) on voters. Since there is no conference, let alone a large one, that uses a similar mechanism for paper bidding, we resorted to using a combination of field and controlled experiments.
Assignment Algorithms
The assignment of papers to reviewers is formally a version of the multi-agent resource allocation problem with capacities Bouveret et al. (2016) and has been well studied in a number of areas of computer science Goldsmith and Sloan (2007); Lian et al. (2018), economics Budish and Cantillon (2012), and beyond Dickerson et al. (2014). Garg et al. (2010) provide a comprehensive discussion of assignment algorithms, their application to the review process, and different methods for evaluating the quality of an assignment from both the conference and reviewer standpoint. Two popular ways to evaluate assignments are maximizing either the egalitarian welfare Demko and Hill (1988), i.e., making sure the worst-off reviewer is as happy as can be, or the utilitarian welfare, i.e., maximizing the sum of reported utilities for assigned papers across all reviewers. There are other refinements of these solution concepts Garg et al. (2010); Lian et al. (2018), and a large literature on calibrating feedback across reviewers for better assignment Wang and Shah (2019). While the workshop in which we ran our field experiment used the utilitarian maximal assignment (an assignment maximizing social welfare), the results we report are independent of the assignment algorithm in use. Note that while assignment of heterogeneous tasks is also common in other domains such as crowdsourcing Assadi et al. (2015), the ‘workers’ in paper bidding have some unique features. They are volunteers (which is also true in some crowdsourcing tasks), they often participate repeatedly every year, and they expect a roughly fixed workload.
Some modern platforms use TPMS or other systems that infer the interests of the reviewer from her list of publications or other sources Charlin and Zemel (2013). However it does not seem that implicit preferences induced from TPMS are less skewed than explicit bids. As Fiez et al. (2020a) find in their study, TPMS scores result in a very skewed and sub-optimal bid distribution, where many papers receive very low scores. For example, in the TPMS dataset from ICLR 2018, out of the 911 papers, 85 of them (9.3%) have a maximum similarity score 0.1 (on a scale), meaning that these papers are very unlikely to get bids from reviewers.
2. Experimental Design
We implemented a platform that resembles common paper bidding platforms, mainly EasyChair and ConfMaster. [Footnote 3: See https://easychair.org/ and https://confmaster.net/.] An example of the interface is shown in Figure 1.
2.1. The Basic Platform
In all experiments, the participant is presented with a table containing all papers. For each paper, the table specifies the title and keywords, and the user may click a paper to expand and read the abstract. The user can bid on each paper using a radio button whose states are No/Maybe/Yes, where No is the default option. As is common in bidding platforms, we implemented basic search and filtering capabilities. The user may type a string in order to see only the papers containing this exact string anywhere in the title, keywords, or abstract. At the top, the user also sees how many papers have been marked as Yes and as Maybe so far, and may alter their selection of the papers at any time. Subjects could sort papers according to any column and the initial order depends on the experiment condition.
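The exact-string filter described above could look roughly as follows. This is only an illustrative sketch: the paper records are assumed to be dictionaries with `title`, `keywords`, and `abstract` fields, and the match is assumed to be case-sensitive, neither of which is specified for the actual platform.

```python
def matches_query(paper, query):
    """True if the exact query string occurs in the title, any keyword, or the abstract."""
    fields = [paper["title"], paper["abstract"], *paper["keywords"]]
    return any(query in field for field in fields)

def filter_papers(papers, query):
    """Return the papers to display; an empty query shows the full table."""
    if not query:
        return list(papers)
    return [p for p in papers if matches_query(p, query)]

# Hypothetical records for illustration only.
demo_papers = [
    {"title": "Fair division of goods", "keywords": ["fairness"], "abstract": "We study envy."},
    {"title": "Voting rules", "keywords": ["social choice"], "abstract": "An auction-free model."},
]
shown = filter_papers(demo_papers, "auction")
```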
In some conditions additional information or options were provided in the interface including the inverse price of papers (called ‘iPrice’ in Meir et al. (2021) and ‘bidding points’ on the platform) or the total bidding requirement, marked in Fig. 1(Right). We discuss each of these design modifications in their respective section. Following Shapley and Shubik (1977); Meir et al. (2021), we define the inverse Price (iPrice) of a paper as , where is the current number of bids on the paper, and is the number of copies of the paper that need to be assigned (throughout this paper ). Thus a high iPrice indicates current low demand.
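To make the relation between demand and bidding points concrete, the sketch below uses one form that is consistent with the description above: the iPrice is taken to be proportional to min(1, r/d), where d is the current number of bids and r the number of required copies, scaled to bidding points between 0 and 100. The formula, the default r = 3, and the scale are assumptions made for illustration and are not necessarily the exact definition used on the platform.

```python
def iprice(num_bids, copies_needed=3, scale=100):
    """Assumed inverse price in bidding points: scale * min(1, copies_needed / num_bids).

    A paper with at most `copies_needed` bids gets the maximal iPrice (low demand);
    the iPrice shrinks towards 0 as the number of bids grows (high demand).
    """
    if num_bids <= 0:
        return scale  # no bids yet: treat as minimal demand
    return round(scale * min(1.0, copies_needed / num_bids))

def budget_used(selected_bid_counts, copies_needed=3):
    """Total bidding points collected by selecting papers with the given bid counts."""
    return sum(iprice(d, copies_needed) for d in selected_bid_counts)
```

Under these assumptions, a bidder who selects papers with 1, 6, and 12 current bids collects 100 + 50 + 25 = 175 bidding points, so exhausting a fixed budget takes fewer selections when low-demand papers are chosen.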
2.2. Field Experiment
For our field experiment, we used the bidding phase of the COMSOC-2021 international workshop. [Footnote 4: https://comsoc2021.net.technion.ac.il/] We partitioned the set of 42 reviewers randomly into a field treatment (FT) group consisting of 28 people that saw papers’ iPrices during bidding, and a smaller field control (FC) group of 14 people that saw no iPrices. There were 93 submissions in total.
Bidding Process
Both groups used our platform for bidding, where all 93 submissions were available along with the search and bidding interface shown in Figure 1. Reviewers could also use the platform to report a conflict of interest on papers, but this was scarcely used. The control group had no extra information on demand and were asked to bid positively on at least 12 papers, of which 5-7 would be assigned, as in Fig. 1(Left). The treatment group saw the iPrices as in Fig. 1(Right), and had a budget of 800 bidding points. These bidding minimums for both groups were purely instructive and were not actively enforced in any way: reviewers could bid on any number of papers. The iPrices were set as explained above and updated on every new login, hence iPrices were static during a session but could change between sessions if an individual reviewer logged back in. We implemented the two caveats recommended in Meir et al. (2021): (a) the current bidder is always counted as a positive bid on all papers, to prevent price change during the bid; and (b) demands were initialized as uniform rather than empty to prevent a cold start. In practice only three reviewers logged in more than once to update their bids. Papers in both groups were initially presented according to their order of submission. [Footnote 5: In hindsight it would have been better to present them in random order, as in the controlled experiments.]
Assignment
The workshop used the bids entered by the committee members as input to a standard utilitarian maximal assignment algorithm with demands and conflicts of interest Lian et al. (2018); Garg et al. (2010). The implementation was the same as that of Lian et al. (2018), which uses Gurobi to solve the assignment ILP and allows for a range of paper and reviewer capacities; each reviewer was assigned 6 or 7 papers. Using the bids as a proxy for reviewer utility, the utility of the overall assignment was 520.0. While there may be multiple assignments with the same utility, we took the first one that Gurobi provided.
2.3. Controlled Experiments
In the controlled experiment we had a Base (B) group (same interface as the control group in the field experiment), and several different treatment groups. The main treatments we used were: revealing papers’ iPrices to subjects in the Price (P) group; and visually highlighting low-demand papers in the Highlight (H) group. Additional conditions designed to study specific questions will be explained below. All treatments are between subjects. All subjects faced the same set of 550 papers from AAAI’15, which are publicly available. [Footnote 6: http://www.aaai.org/Library/AAAI/aaai15contents.php.] Subjects in group B were requested to bid on 30 or on 40 papers, of which 8 will be assigned.
Setting Paper Demand
As subjects are participating independently of one another, we needed to generate the demand (i.e. the iPrices) for each paper. Rather than generating artificial demand, we sampled the iPrice directly from a uniform distribution on , and truncated to the range . This is to guarantee we cover the entire range and also have a substantial number of papers with extreme iPrices. Although in reality no paper could have an iPrice of 0 (as it indicates infinite demand), we still wanted to see how this will affect behavior.
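Sampling from a uniform distribution and then truncating piles probability mass onto the two endpoints, which is how a substantial number of extreme iPrices is obtained. The sketch below illustrates the idea; the sampling interval (-20 to 120), the truncation range (0 to 100), and the rounding to integers are hypothetical values chosen for illustration, not the parameters used in the experiment.

```python
import random

def sample_iprices(num_papers, low=-20.0, high=120.0, lo_clip=0, hi_clip=100, seed=None):
    """Draw one iPrice per paper: uniform on [low, high], truncated to [lo_clip, hi_clip].

    All numeric defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    prices = []
    for _ in range(num_papers):
        raw = rng.uniform(low, high)
        clipped = min(max(raw, lo_clip), hi_clip)  # truncate to the displayed range
        prices.append(int(round(clipped)))
    return prices

# One fixed draw per subject, e.g. for the 550 papers of the controlled experiment.
prices = sample_iprices(550, seed=0)
```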
Assignment
While the assignment in the controlled experiment plays no role in our analysis, we describe it in Appendix A for completeness. Participants were not aware of the exact allocation algorithm, but were told that papers with positive bids were more likely to be assigned, and that the chance also depends on the demand for the paper (to which they may or may not be exposed according to the condition they are in). The final assignment was displayed to the participant immediately after they submitted their bid, together with the breakdown of the reward.
Incentives
In our controlled experiment, participants were not actually reviewing any paper and thus a-priori had no incentive to prefer one paper over another. To mimic the situation of a reviewer trying to select ‘relevant’ papers, we assigned to each participant a set of six ‘personal keywords’ that supposedly reflect her interests. Subjects earned ‘coins’ for each of the 8 papers that were eventually assigned, according to how many of these personal keywords the paper contained (either in the title or in the paper keywords or in the abstract). Each coin increased the bonus by 0.25, thereby creating an incentive to bid on relevant papers, as is common in MTurk experiments Mason and Suri (2012). An important remark is that in real conferences reviewers’ interests are often positively correlated. Using common keywords leads to a similar situation in our controlled experiment, with a correlation of $0.7 \pm 0.16$ in paper relevance among participants.
The personal keywords were selected at random for each participant from the pool of all papers’ keywords, with constraints to make sure all participants had a similar amount of relevant papers. These personal keywords were displayed in a separate box on the screen.
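The incentive structure can be illustrated with a simplified reward function that counts personal keywords in each assigned paper. This is a sketch only: the actual reward scheme differed in its details (for instance, papers with no personal keyword carried a negative reward), and the case-insensitive substring matching and the record fields used here are assumptions.

```python
def keyword_hits(paper, personal_keywords):
    """Number of personal keywords found in the title, keywords, or abstract."""
    text = " ".join([paper["title"], " ".join(paper["keywords"]), paper["abstract"]]).lower()
    return sum(1 for kw in personal_keywords if kw.lower() in text)

def bonus(assigned_papers, personal_keywords, coin_value=0.25):
    """Simplified bonus: one coin per keyword hit in each assigned paper."""
    coins = sum(keyword_hits(p, personal_keywords) for p in assigned_papers)
    return coins * coin_value

# Hypothetical participant and assignment.
my_keywords = ["planning", "auctions"]
assigned = [
    {"title": "Planning under uncertainty", "keywords": ["MDP"], "abstract": "We study planning."},
    {"title": "Graph coloring", "keywords": ["graphs"], "abstract": "A note on colorings."},
]
demo_bonus = bonus(assigned, my_keywords)
```

Because the payment depends only on the assigned papers, a bid matters only through its effect on which papers end up assigned.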
Instructions and Demo
To make sure that the (rather complex) instructions of our experiment were understood, we provided: detailed instructions; an online quiz (see appendix); and a demo game. We also informed participants up front that failure to reach the minimal required reward may result in rejection of the job, which is standard for conducting online behavioral research Mason and Suri (2012). The instructions and quiz focused on explaining that the payment depends only on the assigned papers (8 in total) and not directly on the bid. The demo was very similar to the game except that it only contained 50 papers and 3 personal keywords. Participants that did not reach the minimal required reward in the demo could not continue to the game but could try the demo up to 3 times. All participants expressed informed consent. The study was approved by the IRB of the authors’ institution.
2.4. Measuring Behavior
Since bidding behavior can be complex and depends on many variables, we develop simple measures that we can compare across subjects and groups of subjects.
For a set of presented papers $P$, we denote by $S(P) \subseteq P$ the subset of papers that were selected by subjects. Note that each paper is presented to multiple subjects, and is counted as a separate ‘presented paper’ for each subject. Also note that we treat any positive bid (‘Maybe’/‘Yes’) as a selection. In particular, $S_i$ is the set of papers selected by subject $i$.
We denote by $\bar{S}(P) = P \setminus S(P)$ the set of papers from $P$ that were not selected; similarly, $\bar{S}_i$ are the papers not selected by subject $i$.
We denote by $p_j$ the iPrice of paper $j$. In the field experiment, the iPrice was derived from the actual demand as explained above, and was updated with every new login; whereas in the controlled experiment it was generated once per subject and remained fixed.
Measuring Individual Behavior
For each of the features $X \in \{R, O, D\}$ we used (for (R)eward or Relevance, (O)rder, and (D)emand, respectively) and each paper $j$, we denote by $X(j)$ the relevant feature of the displayed paper.
In the example in Fig. 2 paper #3 has , , and , as the maximal reward in this example is 2. [Footnote 7: The reward scheme we actually used was a bit different, see instructions in appendix. In particular, the reward for papers with 0 personal keywords, which are most papers, was negative, so there is a strong incentive to avoid them.]
For a subset of samples , we used the average: . E.g. for in our example, we have .
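For every subject $i$ and feature $X$, we defined the ‘sensitivity-to-X’ as the difference between the average value of the feature in selected and unselected papers. Formally:
$$\mathrm{StX}_i = \frac{1}{|S_i|}\sum_{j \in S_i} X(j) \;-\; \frac{1}{|\bar{S}_i|}\sum_{j \in \bar{S}_i} X(j),$$
where $S_i$ and $\bar{S}_i$ are the sets of papers selected and not selected by subject $i$, and $X(j)$ is the value of feature $X$ for paper $j$.
$\mathrm{StX}_i$ is always in $[-1, 1]$, and its expected value is 0 if $i$ is completely insensitive to feature $X$ (e.g. selects papers at random). For the subject in our example, where the selected papers are and , we have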
- StR, indicating a moderate sensitivity;
- StO, meaning the subject tends to select earlier papers; and
- StD, meaning sensitivity towards papers with low demand (=high iPrice).
Note that StR cannot be evaluated in the field experiment since we have no direct access to the reviewers’ real preferences and expertise.
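The sensitivity measure defined above, the difference between the average feature value over selected papers and over unselected papers, can be computed as in the following sketch. The function name, the dictionary-based input, and the assumption that feature values are already normalized to [0, 1] are illustrative choices; the toy values are not taken from the example in Fig. 2.

```python
from statistics import mean

def sensitivity(feature, selected):
    """Sensitivity-to-X of one subject.

    feature: dict mapping each presented paper id to its feature value
             (assumed normalized to [0, 1]).
    selected: set of paper ids on which the subject placed a positive bid.
    """
    sel = [v for p, v in feature.items() if p in selected]
    unsel = [v for p, v in feature.items() if p not in selected]
    if not sel or not unsel:
        raise ValueError("need at least one selected and one unselected paper")
    return mean(sel) - mean(unsel)

# Hypothetical demand feature (normalized iPrice) for four presented papers.
demand_feature = {1: 0.25, 2: 0.75, 3: 1.0, 4: 0.0}
st_d = sensitivity(demand_feature, selected={2, 3})
```

In this toy case the selected papers average 0.875 and the unselected ones 0.125, giving a sensitivity of 0.75, i.e. a strong tendency towards high-iPrice (low-demand) papers. A subject who selects at random has an expected sensitivity of 0.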
Measuring Group Behavior
One way to measure the group behavior is considering the average StX values of group members (denoted ). When we want to condition on other attributes, we measure the probability of selecting a paper as a function of the relevant feature (e.g. initial position in the table), while controlling for relevance. Formally, given a set of samples (say, ‘all papers in the second quantile of positions that are highly relevant to their respective subject’), the probability of selection is . We can then test if the behavior in two conditions is different by comparing to or to , checking if the difference is significant using an unpaired t-test.
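The group-level comparison can be sketched with the standard library alone. The selection probability is the fraction of presented samples that were selected, and two groups are compared through an unpaired t statistic. The text does not say which variant of the unpaired test was used; the sketch assumes Welch's version (unequal variances) and computes only the statistic, not the p-value.

```python
from math import sqrt
from statistics import mean, variance

def selection_probability(samples):
    """samples: list of booleans, True if the presented paper was selected."""
    return sum(samples) / len(samples)

def welch_t(a, b):
    """Unpaired t statistic for two independent samples (Welch's variant, an assumption)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Hypothetical per-subject StD values in two conditions.
group_p = [2.0, 4.0, 6.0]
group_b = [1.0, 2.0, 3.0]
t_stat = welch_t(group_p, group_b)
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value.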
3. Results from the Field Experiment
3.1. Distribution of Bids
The empirical distribution of bids is shown in Figure 3 (Left). In the control group there were a total of 267 bids, 19.1 bids per user, while for the treatment group there were 547 bids, which are 19.5 bids per user.
To see if the induced bids in both conditions are drawn from different distributions, we used a two sample Kolmogorov-Smirnov test with the null hypothesis that the treatment distribution was less than the control distribution Hodges (1958). This resulted in a test statistic of 0.001429 and a $p$-value of 0.67, so we cannot reject the null hypothesis that average bid amounts are the same.
However, what we really want to know is whether bidders were affected by the other factors, in particular order and demand.
Sensitivity to Order
According to Table 1, only the control group (FC) demonstrated sensitivity to order; however, the effect is barely statistically significant (presumably due to the small number of reviewers in that group).
Sensitivity to Demand
The first observation from Table 1 regarding demand sensitivity is that it is negative in both groups, on average. This may seem surprising but actually makes intuitive sense as in a real conference there is positive correlation in bids, i.e., you are more likely to bid on a popular paper. Hence, only the difference between the groups matters.
The average of StD is slightly higher in the treatment group, but this is not statistically significant. It is more instructive to look at the distribution of StD values (Fig. 4(Left)): we can see clearly that in the treatment group there are several subjects that are highly sensitive to high iPrices (i.e. to low demand), whereas the distribution of the others is similar to the control group.
3.2. Skewed Bids
We compared the number of papers that were under-demanded in each group, that is, received fewer than the 3 bids necessary to find a good assignment. In the control group there were 47 papers that received fewer than three bids, with 6 of these being papers that received no bids at all. For the treatment group there were only 6 papers that were under-demanded and only a single paper that received no bids. However, this must be partially due to the difference in the size of the two groups.
To address this we looked both at the number of orphan papers and at the number of missing bids (minimal additional bids required so that every paper has at least 3 bids) that would appear under a bootstrap sampling paradigm Bruce et al. (2020). To do this we took the set of bids and sampled, 1000 times, a “small committee” of 14 reviewers from each group. As we can see in Fig. 4(Right), although the average number of bids remains unchanged, the number of missing bids and orphans drops significantly as we replace FC bidders with FT bidders, indicating that even the small number of demand-sensitive bidders has a substantial effect on the bid skew.
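The bootstrap comparison can be sketched as follows: repeatedly draw a committee of fixed size from one group's reviewers and record the orphan papers and missing bids induced by their bid sets. Sampling with replacement, the data layout (one set of paper ids per reviewer), and the function names are assumptions made for illustration.

```python
import random

def missing_bids(bid_sets, papers, required=3):
    """Return (orphans, missing) for one committee.

    bid_sets: list of sets of paper ids, one per committee member
              (every id is assumed to appear in `papers`).
    missing: minimal number of extra bids so that every paper has `required` bids.
    """
    counts = {p: 0 for p in papers}
    for bids in bid_sets:
        for p in bids:
            counts[p] += 1
    orphans = sum(1 for c in counts.values() if c == 0)
    missing = sum(max(0, required - c) for c in counts.values())
    return orphans, missing

def bootstrap_committees(reviewer_bids, papers, size=14, rounds=1000, seed=0):
    """Average orphans and missing bids over resampled committees (with replacement)."""
    rng = random.Random(seed)
    pool = list(reviewer_bids.values())
    results = [missing_bids(rng.choices(pool, k=size), papers) for _ in range(rounds)]
    avg_orphans = sum(o for o, _ in results) / rounds
    avg_missing = sum(m for _, m in results) / rounds
    return avg_orphans, avg_missing
```

Running `bootstrap_committees` once on the control group's bids and once on the treatment group's bids gives committees of equal size, which removes the group-size difference noted above from the comparison.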
3.3. Discussion of the Field Experiment
The initial results from our field experiment suggest that: (1) there seems to be a weak order effect; (2a) there is some fraction of reviewers that are highly sensitive to the demand when given via bidding points and budgets; (2b) this increased sensitivity to demand reduces the number of missing bids and orphan papers; (3) subjects who had budgets were more compliant, possibly due to differences in the UI.
However, the small number of reviewers makes it difficult to draw any strong conclusion. In addition, some parameters cannot be controlled (such as inherent demand for papers) or were not controlled in our design (such as paper order or displaying the bidding requirement). We therefore turn to controlled experiments to better understand these effects.
4. Controlled Experiments
Conditions
Our base group (B) was similar to the control group in the field experiment, except that papers were displayed in a random order, and we added the bidding requirement to the UI in order to rule this out as a potential source of differences between groups. See Fig. 1.
In addition to the base group, we had the following treatments.
**iPrices (P):**
In this condition (similarly to the FT group in the field experiment) subjects had an additional column titled ’Bidding points’ showing papers’ iPrices as integers in the range . The bidding requirement was set as a ’budget’ of 1000 points.
**Highlight (H):**
In this condition we did not show the iPrice, but instead highlighted low-demand papers in green (when the iPrice is 100) or yellow (when the iPrice is in [70,99]).
**iPrices + Sort (PS):**
Similar to Condition P, except papers were initially sorted by increasing demand (decreasing iPrice).
**iPrices + Highlight + Sort (PHS):**
Similar to PS, but also highlighting low-demand papers as in Condition H.
**Implicit Request (IR):**
This condition was identical to the base condition, except that the bidding requirement did not appear on the screen during bidding.
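As a rough illustration of the display rules in the conditions above, the highlighting thresholds and the initial sort could be expressed as below. The function names and the `(paper_id, iprice)` tuple representation are hypothetical; only the thresholds (100 for green, [70,99] for yellow) and the sort direction come from the text.

```python
def highlight_color(iprice):
    """Highlight color for a paper in Condition H, or None if not highlighted."""
    if iprice == 100:
        return "green"
    if 70 <= iprice <= 99:
        return "yellow"
    return None

def initial_order(papers, sort_by_demand):
    """Initial display order for a list of (paper_id, iprice) pairs.

    With sort_by_demand=True (Conditions PS and PHS), papers are sorted by
    decreasing iPrice, i.e. increasing demand. Otherwise the given order is
    kept; in the experiment that order was random.
    """
    if sort_by_demand:
        return sorted(papers, key=lambda p: p[1], reverse=True)
    return list(papers)
```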
Data Collection
We collected data from 338 participants on Amazon Mechanical Turk. Subjects were allowed to play up to three times. Participants were randomly assigned to the base group or to one of the treatment groups. The total number of participants of each group appears in the second column in Table 1. The threshold for rejection was set at 12 coins.
Spammers and Sensitivity to Relevance
There was a distinctive group of subjects who did not respond to paper relevance (‘spammers’) and were not included in the rest of the analysis. We explain this in detail in Appendix B.
To better understand the isolated effect of each factor, we start by analysing the Base condition and conditions (H)ighlight and i(P)rices. For example, the StR column in Table 1 shows that in all groups the mean sensitivity (of non-spammers) is about 0.2, and is significantly higher than 0.
4.1. Paper Order
We can see that in all three conditions, there is similar average sensitivity to order, of about -0.13, i.e. there is a statistically significant bias towards papers that appear earlier. However, reward still plays a more important role in selection. (For many spammer subjects, the StO was even more negative, which is not surprising or interesting.) But are all subjects slightly biased or is it a small number of highly biased subjects? For this, we look at the distribution of individual StO values in our controlled experiment (Fig. 5, top left).
From the figure, it seems that most subjects are prone to some bias (sensitivity is most often negative but not below ); yet there is a non-negligible number of subjects with a very strong sensitivity, which essentially marked papers at the very top. Some subjects had high positive StO values, meaning they deliberately marked papers at the bottom of the list.
Another question we can ask is whether all papers are equally likely to be promoted when appearing earlier. As we can see in Fig. 5 (top right), primacy affects irrelevant and relevant papers alike, where selection probability drops sharply for papers that are not at the top, and then continues to decrease moderately.
Our findings regarding order effect are largely consistent with those of Cabanac and Preuss (2013) from real conferences, and thus support our Hypothesis 1. The added value of our controlled experiments is two-fold: they show how order effects are distributed across the population, and how order sensitivity (or lack thereof) depends on the relevance of the paper.
Consistency
A third question we may ask is whether the bias towards early papers is consistent. We analyzed the behavior of subjects who played two or three times (all from Condition H), comparing their StO measure each time.
The between-subject variance of StO is —slightly higher than the average within-subject variance of . This indicates that participants maintain some consistency in their sensitivity to order.
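One simple way to carry out such a comparison is sketched below. The exact estimator used in the paper is not specified here, so this is an assumption: between-subject variance is taken as the sample variance of per-subject means, and within-subject variance as the average of per-subject sample variances. The function name and input layout are hypothetical.

```python
from statistics import mean, variance

def variance_components(scores_by_subject):
    """Return (between_subject_variance, mean_within_subject_variance).

    scores_by_subject: dict mapping a subject id to the list of that
    subject's per-game sensitivity values (at least two games each).
    """
    per_subject = list(scores_by_subject.values())
    subject_means = [mean(games) for games in per_subject]
    between = variance(subject_means)
    within = mean([variance(games) for games in per_subject])
    return between, within
```

A between-subject variance that exceeds the within-subject variance suggests that a subject's repeated games resemble each other more than they resemble other subjects' games.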
4.2. Paper Demand
We considered two ways to communicate papers’ demand to subjects. The first was adopting the market scheme of Meir et al. (2021) where low-demand papers have high iPrices (condition P). In condition H we simply highlighted the low-demand papers visually.
Sensitivity to Demand
The right column in Table 1 shows that the Base group is completely insensitive to the demand (as expected, since they have no information about it); the iPrice scheme is moderately effective; and highlighting alone has a small effect (barely statistically significant). Looking at the distribution of sensitivity to demand in Fig. 5 (bottom left), we can see that in contrast to the primacy effect, most subjects in conditions P and H are not sensitive to the demand. The effect we see is due to a relatively small number of highly sensitive subjects. This corroborates our initial finding from the field experiment, and supports our Hypothesis 2.
Price Scheme More Effective than Highlighting
We can see in Table 1 that the effect of highlighting papers by itself is borderline significant (only 3 of the 21 subjects demonstrated significant bias towards highlighted papers in Fig. 5). In contrast, about a third of the subjects who were exposed to iPrices were significantly affected, and the overall bias doubled.
Which Papers are Affected?
We can see that the effect of high iPrices is mainly on papers that are already relevant (Fig. 5, bottom right). This is another difference from the effect of paper order. It is also further evidence of rational decision making (in the economic sense), as the iPrice indicates the probability of getting the paper.
Consistency
Similarly to order effects, subjects who played 2 or 3 games exhibit some consistency in their sensitivity to demand, with a between-subject variance of vs. within-subject.
4.3. Using All Treatments?
Since paper order, iPrice and highlighting all have some positive effect, it might make sense to combine them in order to influence people to spread their bids even more. In particular, we can adopt the suggestion of Cabanac and Preuss (2013) to actively sort papers by increasing demand and combine this with the iPrice scheme of Meir et al. (2021). Further, we can highlight the most underdemanded papers in the table. We therefore ran another experiment with two more groups: in group P+Sort we displayed iPrices and budget as in condition P and sorted the papers initially by decreasing iPrice (so underdemanded papers are on top); in group P+H+Sort we did the same and highlighted the underdemanded (high-iPrice) papers as in condition H. Note that in these two conditions StO is the same as (negative) StD.
By looking back at the StD results in Table 1, we can see that in group P+Sort there is more sensitivity to demand, as expected. It is also higher than the mean StO of subjects in the previous conditions, indicating some synergy between the price scheme and the initial order. However, highlighting papers in addition did not help at all (see condition P+H+Sort), and in fact only added noise. We analyse possible reasons for this phenomenon in Appendix C, showing it might be an artifact of a single AMT participant (or a few) connecting from multiple accounts.
4.4. Compliance with Bidding Instructions
A very visible difference between the groups in the field experiment (Fig. 1(Left)) is that while the average number of bids per reviewer is similar, the bids in the treatment group seem more concentrated around the mean.
To more accurately measure this difference we define a new measure called Compliance Ratio. The full details are in Appendix D, but intuitively a compliance ratio of 1 means exactly exhausting the budget (in conditions with iPrices), or bidding on exactly the required number of papers (when there are no iPrices).
A number below 1 indicates underbidding and a greater number indicates overbidding.
Indeed the compliance ratio was concentrated around 1 only in condition FT and not FC, but we suspected that the difference was only because in condition FT the budget was displayed throughout (in FC the requirement only appeared in the instructions, as in most conferences).
Our controlled experiment confirmed this hypothesis: the IR condition is identical to the Base condition, except for the fact that the bidding requirement was not displayed (as in FC). Indeed, the result was that while in both cases the average compliance was close to 1, values were much more concentrated in Condition B, whereas many participants over- and under-bid in Condition IR.
5. Discussion
Our combined experiments in bidding behavior show that:
- (1) Bidding likelihood increases uniformly for papers appearing higher in the list (corroborating previous empirical findings);
- (2) Presenting papers’ demand in the form of iPrices positively influences a small but non-negligible subset of people to shift their selection to low-demand papers;
- (3) Presenting the bidding requirement during bidding (rather than just including it in the instructions beforehand) results in much higher compliance.
Our field experiment further showed that shifting the demand of even a few bidders towards low-demand papers reduces the skew in bids and ensures that more papers get the minimal required number of bids.
Critique on experimental results
There are two main concerns about the validity of our results. First, there is an internal validity issue: One can ask whether the behavior we see is consistent or sporadic. This is important as consistency also means predictability. Our preliminary analysis shows that subjects exhibit at least some level of consistency but this should be studied more in-depth over longer time periods and with diverse input.
Another concern is external validity: will the behavior of researchers bidding on real papers be similar to that of AMT workers who play a game for recreation and/or money? (The analysis in Appendix C raises one such concern.)
We argue that the answer is yes. While it is clear that the preferences of actual reviewers over real paper assignments are very different from those of AMT participants in our controlled experiment, it is much more likely that both groups demonstrate the same behavioral biases and tendencies in trying to obtain their preferred outcome. (Note that we restricted our AMT participants to similar demographics by requiring a university degree.)
In that respect, our use of AMT is similar to its use in consumer behavior research, where controlled experiments with simulated (rather than actual) purchases are used to complement field studies and deepen understanding Ghose et al. (2014).
More generally, results from AMT experiments are considered reliable despite some differences in personality traits Goodman et al. (2013), especially if subjects are filtered based on their comprehension of the task (as we do).
Critique on paper bidding with iPrices
There are several concerns raised by the suggested bidding scheme in Meir et al. (2021), mostly regarding fair treatment of papers and strategic considerations of bidders (e.g. is it better to bid earlier or later). Meir et al. (2021) directly address most of these concerns in the original paper, where their main point is that bidders are free to ignore instructions and behave as they would without demand information, but any bidder that does take this information into account improves the outcome both for herself and for the others.
We can also add that we did not encounter any adverse effects in our field experiment. However, we should keep in mind that it was on a small scale.
Another possible objection is that automated matching enabled by systems like TPMS makes bidding redundant altogether, or at least less important. That may be true in the future but as shown in Fiez et al. (2020b) (see our Introduction), current automated fit-scores are also highly skewed, and may therefore exacerbate the problem rather than solve it.
Practical Recommendations
We believe that adopting the simple market scheme of Meir et al. (2021) can have a positive influence on the distribution of bids during the bidding phase. This influence can be increased by combining other UI factors such as highlighting and/or using the current demand as a factor in sorting presented papers Cabanac and Preuss (2013); Fiez et al. (2020a). Regardless of the bidding scheme, we recommend that the bidding requirement (in terms of number of positive bids or budget) be displayed during bidding. These changes can be easily implemented in existing platforms such as EasyChair and ConfMaster, and be offered to conference organizers as optional features.
We recommend making these changes carefully:
- Consult UX/UI experts regarding the best way to highlight papers so as to avoid confusion, choosing the best terms to describe iPrices and budgets, etc.;
- Explain to reviewers/committee members that they can bid as they wish (even ignore all additional information), but will be more likely to get their desired papers by following the bidding instructions;
- As for paper order, we should keep in mind that most platforms offer the user flexibility in how to sort the papers, so users should have the option to choose whether demand should be a factor in this order;
- Test suggested changes on a subset of conference participants and/or in smaller workshops before full adoption.
We hope these suggestions will contribute to improving the review process for all.
Acknowledgments
Nicholas Mattei was supported by NSF Awards IIS-RI-2007955, IIS-III-2107505, and IIS-RI-2134857, as well as an IBM Faculty Award and a Google Research Scholar Award.
Appendix A Interface and Assignment Algorithm in the Controlled Experiment
The interface is shown in Fig. 6, where we can see the personal keywords on the left.
Recall that in the controlled experiment, there is only one player at a time, but every paper has a fixed iPrice (see Section 2.4) that reflects its demand.
After selection, we determined the assignment based on subject’s bids using a simple randomized allocation: The iPrice of each paper is used as an approximate assignment probability (see Meir et al. (2021)).
The algorithm first considers all ’Yes’ bids in random order and assigns each paper with probability , until papers are assigned or until all of ‘Yes’ bids have been considered.
It then repeats the process for ‘Maybe’ bids, and then for all other papers.
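The randomized allocation described above can be sketched as follows. This is a hedged illustration, not the platform's implementation: the function and parameter names are hypothetical, `capacity` stands in for the number of papers to assign, and converting an iPrice to a probability by dividing by a maximal iPrice of 100 is an assumption.

```python
import random

def allocate(bids, iprices, capacity, max_iprice=100, rng=None):
    """Randomized allocation over 'yes' bids, then 'maybe' bids, then the rest.

    bids: dict paper -> 'yes' or 'maybe' (papers without a bid are omitted).
    iprices: dict paper -> iPrice, covering every paper.
    capacity: number of papers to assign (hypothetical parameter name).
    Assumption: assignment probability is iprice / max_iprice.
    """
    rng = rng or random.Random()
    tiers = [
        [p for p in iprices if bids.get(p) == "yes"],
        [p for p in iprices if bids.get(p) == "maybe"],
        [p for p in iprices if p not in bids],
    ]
    assigned = []
    for tier in tiers:
        rng.shuffle(tier)  # consider the papers of each tier in random order
        for paper in tier:
            if len(assigned) >= capacity:
                return assigned
            if rng.random() < iprices[paper] / max_iprice:
                assigned.append(paper)
    return assigned
```

Because each tier is exhausted before the next one is considered, a subject who bids on low-demand (high-iPrice) papers is more likely to fill her assignment from her own bids.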
Appendix B Spammers and Sensitivity to relevance
The first thing we had to check was whether subjects are responsive to incentives at all, as otherwise there is not much value in measuring their behavior.
We had an internal criterion used to filter out participants that had consistently failed to select relevant papers. For this, we calculated for each subject her sensitivity to relevance (StR).
Note that a completely random selection should result in an StR that is close to 0, whereas even a very weak tendency to select relevant papers (e.g. avoiding papers that have no personal keywords) should result in a strictly positive StR.
The empirical distribution of StR is visible in Fig. 7. We can see that there is a bimodal distribution, indicating two types of subjects: a large minority of subjects whose StR is distributed around 0 (left of the dashed line in the figure), and the majority of subjects, whose StR is well separated from 0.
We treat the first type as ‘spammers’, and do not include them in our behavioral analyses. (A participant may be a ‘spammer’ either if she plays without effort for a quick gain, or if she failed to understand and follow the reward structure. Some evidence suggests that most spammers are of the first type, as they spent much less time on bidding and provided unreasonable demographic information.) The right column in Table 1 shows the number of usable subjects from each group. Note that our criterion for spammers is conservative, and that there may still be some spammers within the low-StR subjects we analyze.
Payment Rejection
We used a more conservative criterion to reject AMT subjects.
We only rejected subjects who failed the announced criterion (at least 12 coins) and our internal ‘spammer’ criterion explained above. Thus spammers who (by chance or due to false classification) reached 12 coins were paid, but their data was not included in the analysis.
Appendix C Strange Behavior in the P+H+Sort Condition
Recall that the mean sensitivity to demand in Condition PHS was substantially lower than in PS alone, even though the conditions are very similar, and Highlighting should only increase the sensitivity.
We can see a possible reason for this in the distribution of personal StD values in Fig. 9(Right): while almost all subjects in group P+H+Sort were influenced, it seems that some of them deliberately selected high-demand papers! Moreover, for those subjects who played more than one game, this behavior was consistent.
While it is possible that our instructions in this condition only were not clear and led to confusion, there is also an alternative explanation.
An ‘exact bidder’ is a subject who bids exactly the budget. While this behavior was rare in general (occurred in about 10% of the games in the other conditions), in the PHS condition more than half the games were exact. Moreover, there were 11 (out of 28) users who were exact in all of their games.
While exact bidding is not a problem on its own, such an excessive rate of two rare behaviors (exact bidding and negative StD) in a single condition may indicate that there were a few persons (or perhaps even one) behind many of the accounts.
While we cannot verify this conjecture, it is a reminder about how careful we should be when designing and analyzing online experiments, and of the need for independent corroborations.
When ignoring games with suspicious exact bidding, we get mean StD rate of and in the PS and PHS conditions, respectively. This indicates that the PHS condition is indeed the most effective in making bidders demand-sensitive and reducing bid skew.
Appendix D Compliance with Bidding Instructions
A very visible difference between the groups in the field experiment is that in the treatment group the total bid amounts of each reviewer are concentrated (with a few outliers), whereas in the control group the distribution is scattered. This is even though the control group was instructed to bid on a given number of papers (12), and the instructions to the treatment group were in terms of bidding points. To more accurately measure this difference we define a new measure.
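The compliance ratio of a subject with instructions to bid on a given number of papers is computed as the number of papers she bid on divided by the required number of papers. For subjects that have prices and budgets we compute the compliance as the total bidding points spent divided by the budget.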
In either case, a number below 1 indicates underbidding and a greater number indicates overbidding.
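A minimal sketch of the two compliance ratios, consistent with the intuition given in Section 4.4 (a ratio of 1 means bidding on exactly the required number of papers, or exactly exhausting the budget). The function names are hypothetical, and treating the points spent as the sum of the iPrices of the papers bid on is an assumption.

```python
def compliance_by_count(num_bids, required):
    """Compliance ratio when the instruction is a required number of papers."""
    return num_bids / required

def compliance_by_budget(bid_iprices, budget):
    """Compliance ratio when the instruction is a budget of bidding points.

    bid_iprices: iPrices of the papers the subject bid on (assumed to be
    the points spent on each bid).
    """
    return sum(bid_iprices) / budget
```

For example, a control-group reviewer asked for 12 bids who placed 18 has a ratio of 1.5 (overbidding), while a subject who spent 200 of a 1000-point budget has a ratio of 0.2 (underbidding).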
Indeed, for condition FC (control) the average compliance ratio was 1.59 (median 1.55) with a standard deviation of 0.75; whereas in condition FT most subjects were highly compliant, with average, median, and standard deviation of 1.15, 0.97, and 0.52 respectively.
However we suspected that the difference was not due to the use of bidding points: Only the subjects in the treatment group can easily track their progress, through the budget that is shown throughout bidding (see Fig. 1(bottom)).
We therefore formulated a third hypothesis:
**Hypothesis 3:** Displaying the bidding requirement (regardless of other information) makes reviewers more compliant.
Note that by ‘more compliant’ we mean the compliance ratio of more subjects is closer to 1, not necessarily higher.
Controlled experiment
To test this hypothesis, we contrasted the Base condition with the ‘Implicit Request’ (IR) condition. In the IR condition (as in most conferences and in our FC group) the bidding requirement was not part of the interface, but still appeared in the instructions that were accessible throughout the experiment.
We can see in Table 1 that groups B and IR behaved similarly in terms of our three sensitivity measures.
Results
Comparing groups IR and B allows us to directly test Hypothesis 3. Indeed, we can see in the two leftmost columns of Fig. 9 that merely presenting the bidding requirement on the screen substantially boosts compliance: while the average compliance in both is close to 1 (1.14 and 1.07, respectively), the variance of IR is much higher (p-value is 0.001 in an F-test). This finding supports the hypothesis. Recall that subjects could view the instructions at any time, so the strength of the effect is rather surprising.
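The statistic underlying such a variance comparison can be sketched as below. This only computes the variance ratio and its degrees of freedom; obtaining a p-value additionally requires the F distribution (for instance from a statistics library), which is left out here. The function name is hypothetical.

```python
from statistics import variance

def variance_ratio(sample_a, sample_b):
    """Return (F, df_a, df_b), where F = var(sample_a) / var(sample_b).

    Each sample is a list of per-subject compliance ratios from one group;
    sample variances use the usual n - 1 denominator.
    """
    f_stat = variance(sample_a) / variance(sample_b)
    return f_stat, len(sample_a) - 1, len(sample_b) - 1
```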
There was no substantial difference between groups P and B in terms of concentration, although there was less underbidding and more overbidding in group P. This is in contrast to what we saw in the field experiment, but recall that group FC was more similar to IR than to B in this respect.
Appendix E Experiment Instructions and Quiz
The reader is encouraged to try the interface at the following link (this link may not be stable; we will provide a stable link in the final version):
https://paperbid.herokuapp.com/registerA/
References
- Assadi et al. (2015) Sepehr Assadi, Justin Hsu, and Shahin Jabbari. 2015. Online assignment of heterogeneous tasks in crowdsourcing markets. In Third AAAI Conference on Human Computation and Crowdsourcing.
- Aziz et al. (2015) Haris Aziz, Serge Gaspers, Simon Mackenzie, and Toby Walsh. 2015. Fair assignment of indivisible objects under ordinal preferences. Artificial Intelligence 227 (2015), 71–92.
- Aziz et al. (2019) Haris Aziz, Xin Huang, Nicholas Mattei, and Erel Segal-Halevi. 2019. The Constrained Round Robin Algorithm for Fair and Efficient Allocation. arXiv preprint arXiv:1908.00161 (2019).
- Benabbou et al. (2021) Nawal Benabbou, Mithun Chakraborty, Ayumi Igarashi, and Yair Zick. 2021. Finding fair and efficient allocations for matroid rank valuations. ACM Transactions on Economics and Computation 9, 4 (2021), 1–41.
- Blais et al. (2000) André Blais, Robert Young, and Miriam Lapp. 2000. The calculus of voting: An empirical test. European Journal of Political Research 37, 2 (2000), 181–201.
- Bohannon (2013) J Bohannon. 2013. Who’s Afraid of Peer Review? Science 342, 6154 (2013), 60–65. https://doi.org/10.1126/science.342.6154.60
- Bouveret et al. (2016) S. Bouveret, Y. Chevaleyre, and J. Lang. 2016. Fair Allocation of Indivisible Goods. In Handbook of Computational Social Choice, F. Brandt, V. Conitzer, U. Endriss, J. Lang, and A. D. Procaccia (Eds.). Cambridge University Press, Chapter 12, 284–311.
