Learning in teams: peer evaluation for fair assessment of individual contributions
Fedor Duzhin

Fedor Duzhin, Nanyang Technological University
Abstract
The “free rider” problem has long plagued pedagogies based on collaborative learning. The most common solution to the free rider problem is peer evaluation. However, existing methods of peer evaluation either include self-evaluation, and hence are prone to grade inflation, or, as we show here, are inaccurate in that they do not fairly reward the most hard-working student. Another common concern with existing methods of peer evaluation is that students often do not have the necessary skills to evaluate the work of their peers objectively.
In this paper, we introduce a new mechanism for peer evaluation that does not rely on self-evaluation, and yet remains accurate, i.e., if all students are completely truthful in their evaluations, then the output of our mechanism becomes an objective truth. At the same time, our mechanism integrates the instructor’s judgment with respect to the credibility of students’ evaluations. For example, the instructor gives scores to students for writing credible reviews and, in turn, subsequent students’ evaluations are weighted according to these instructor scores.
Keywords:
peer evaluation, collaborative learning, cooperative learning
MSC 2010
Primary: 97D60, Secondary 91A80
1 Introduction
A vast body of literature exists on methods of assessment in tertiary education — see, for example, [10]. In practice, however, written final exams prevail, even though most students will never take an exam in their life after graduation and therefore exam grades are hardly able to capture the true potential of a student to thrive in a complex work environment.
Even though most people never take formal exams after leaving school, working in teams and writing reports are typical job tasks of modern homo sapiens. Team work and report writing are taught at universities, but grading every individual student fairly based on a team’s report is a challenge. For instance, if all the team members get the same grade, then a free-rider problem may occur (see [7], [5], [1]).
The most obvious solution to the free-rider problem is peer evaluation (see [3]). The simplest and yet popular approach to peer evaluation is letting each student grade the contribution of each team member in absolute terms, i.e., out of 10, out of 100, as A, B, C, or in a similar way (see, for example, [4], [8], or [9]). In our experience, peer evaluation in absolute terms results in almost all students giving maximal scores to each other so as not to jeopardize their friends’ final grades. Thavikulwat and Chang criticize the whole idea of peer evaluation in [12] and propose to replace it with a different procedure based on students choosing their preferred group size.
Note that while the free rider problem may plague a variety of collaborative learning activities, such as team-based learning (see [11]), the main scenario that we have in mind is a team of students collaborating on a well-defined list of tasks. The group size then should equal, roughly, the number of individual tasks within the project and therefore varying the group size is not an option.
More sophisticated peer evaluation systems also exist. The system described in [1] is based on each student allocating a certain number of points among their teammates. Kaufman et al. introduce in [6] a mixed system where students give each other ratings from a list of nine terms such as “excellent”, “very good”, etc., and these ratings are then converted into numeric values by dividing each student’s rating by the team’s average.
A rigorous mathematical theory of peer evaluation is outlined in [2]. However, the theoretical work [2] is very broad and aimed at mathematicians who are experts in game theory. In this paper, we narrow the scope of the theory and simplify it so that it becomes accessible to educators and education researchers.
A system of peer evaluation is a procedure for calculating the “true” contribution (at least, as it is perceived by team members) of each team member to the common task based on mutual evaluations reported by team members. A system of peer evaluation may or may not have certain desired qualities. One such quality is resistance to grade inflation. A system of peer evaluation that includes self-evaluation as part of the process is prone to grade inflation because students have an incentive to overestimate their own contribution. The second desired quality is accuracy. A system of peer evaluation is accurate if it outputs the true contributions of each of the team members whenever they all report the truth. As we show below, there exist systems of peer evaluation that are accurate but prone to grade inflation and systems that do not rely on self-evaluation but are not accurate. The third quality that may also be desired of a peer evaluation system is integration of the instructor’s judgment into the system. The reason for integrating the instructor’s judgment is that students either may not have the expertise to evaluate their peers or may not have an incentive to do it fairly. The instructor then serves as a moderator.
In this paper, we develop a system of peer evaluation that is accurate, does not rely on self-evaluation, and integrates the instructor’s judgment.
2 Mathematical Theory
We assume that $n$ students collaborate on a common goal of completing a set of well-defined tasks and that there exists an objective truth: the share of total work that each student has accomplished. If the true contribution (share) of the $i$-th student is $t_i$, then the objective truth is the vector
$$t = (t_1, \dots, t_n), \qquad t_i \ge 0, \qquad \sum_{i=1}^{n} t_i = 1.$$
The vector $t$ is known to students but cannot be observed by the course instructor directly, so the system of peer evaluation should motivate students to reveal the truth to the instructor. What students report to the instructor is an $n \times n$ matrix $A$ of evaluations of each student by each student. Let the entries of this matrix be denoted $a_{ij}$, the evaluation of student $i$ by student $j$. We assume that $a_{ij} \ge 0$ (even though systems of peer evaluation with negative scores exist, one can convert such a system to a system with non-negative scores simply by taking the exponential function of each score).
For each $i$, the vector $(a_{i1}, \dots, a_{in})$ is the vector of evaluations received by student $i$ and, for each $j$, the vector $(a_{1j}, \dots, a_{nj})$ is the vector of evaluations reported by student $j$.
2.1 Mechanism
A mechanism (term adopted from [2]) is an algorithm for calculating the vector of perceived students’ contributions (shares of workload)
$$p = (p_1, \dots, p_n), \qquad p_i \ge 0, \qquad \sum_{i=1}^{n} p_i = 1,$$
from the matrix $A$ of evaluations, where $a_{ij}$ is the evaluation of student $i$ by student $j$. Note that the output of the mechanism, i.e., the vector $p$ of perceived contributions, may or may not be equal to the vector $t$ of true contributions.
A mechanism may or may not rely on self-evaluation. If students are not required to report a self-evaluation, we let $a_{ii} = 0$ for all $i$. A mechanism that relies on self-evaluation will be prone to grade inflation: a student will be tempted to give an unfairly high evaluation to himself.
A mechanism is accurate if it outputs the true contributions of each team member whenever all students are truthful. In other words, if, for each $j$, the vector $(a_{1j}, \dots, a_{nj})$ reported by student $j$ is proportional to the vector of true contributions $t$, then we must get $p = t$.
If the course instructor’s purpose is to fairly evaluate each student’s individual contribution to the team effort, the mechanism should be accurate and should not rely on self-evaluation. However, as we show below, mechanisms that are widely used in practice usually have one of these two qualities, but not both. Currently, mechanisms used in practice are mostly variations of one of the following two.
Pie-to-all mechanism
The pie-to-all mechanism works as follows. Each student gets one pie and then distributes her pie among all team members, including herself, in proportion to their contribution to the team effort. The final perceived contribution of a team member is the average piece of pie that he received from all the team members. It means that, writing $a_{ij}$ for the evaluation of student $i$ by student $j$ in a team of $n$ students, we have
$$p_i = \frac{1}{n} \sum_{j=1}^{n} \frac{a_{ij}}{\sum_{k=1}^{n} a_{kj}}. \tag{1}$$
It is easy to show that pie-to-all is an accurate mechanism. However, it relies on self-evaluation and hence is prone to grade inflation. Clearly, the best strategy for student $j$ is to report $a_{jj} = 1$ and $a_{ij} = 0$ for $i \ne j$.
Still, the pie-to-all mechanism has been used in practice — see, for example, [6].
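The pie-to-all computation described above can be sketched in a few lines of Python. This is only an illustration: the function name `pie_to_all`, the 3:2:1 contribution split, and both matrices are assumptions chosen for the sketch, with `a[i][j]` holding the evaluation of student `i` by student `j`.

```python
from fractions import Fraction

def pie_to_all(a):
    """Pie-to-all scores; a[i][j] is the evaluation of student i by student j."""
    n = len(a)
    # Size of the pie handed out by each reporter j (column sum, self-evaluation included).
    pies = [sum(a[k][j] for k in range(n)) for j in range(n)]
    # Average normalized piece received by each student i.
    return [sum(Fraction(a[i][j], pies[j]) for j in range(n)) / n for i in range(n)]

# Hypothetical team whose true contributions are in the ratio 3:2:1.
truthful = [[3, 3, 3],
            [2, 2, 2],
            [1, 1, 1]]
# Same team, but student 3 (last column) keeps the whole pie.
inflated = [[3, 3, 0],
            [2, 2, 0],
            [1, 1, 1]]

honest_scores = pie_to_all(truthful)
inflated_scores = pie_to_all(inflated)
print(honest_scores[2], inflated_scores[2])  # student 3: 1/6 when truthful, 4/9 after keeping the pie
```

With truthful reports the output equals the assumed true shares, while the inflated self-evaluation raises the third student's share at the expense of both teammates.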
Pie-to-others mechanism
The pie-to-others mechanism works as follows. Each student gets one pie and then distributes her pie among all her teammates, not including herself, in proportion to their contribution to the team effort. The final perceived contribution of a team member is the average piece of pie that he received from all the team members. It means that, writing $a_{ij}$ for the evaluation of student $i$ by student $j$ in a team of $n$ students, we have
$$p_i = \frac{1}{n} \sum_{j=1}^{n} \frac{a_{ij}}{\sum_{k=1}^{n} a_{kj}}, \qquad a_{jj} = 0 \text{ for all } j. \tag{2}$$
The only difference with the pie-to-all mechanism (1) is the absence of self-evaluation, which is expressed by $a_{jj} = 0$.
Pie-to-others is a very popular mechanism, probably the most popular one. Its clear advantage is that it does not rely on self-evaluation. However, a huge issue with the pie-to-others mechanism is that it is not accurate, as the following example shows.
Example 1
Suppose that we have a team of three students and the vector of true contributions is
[TABLE]
The matrix of peer-evaluations and the vector of perceived contributions are
[TABLE]
□
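The inaccuracy of pie-to-others can also be checked numerically. The sketch below is not the data of Example 1: the 3:2:1 contribution split, the function name `pie_to_others`, and the choice to average over all $n$ team members (so that the scores sum to one) are assumptions made for illustration, with `a[i][j]` holding the evaluation of student `i` by student `j`.

```python
from fractions import Fraction

def pie_to_others(a):
    """Pie-to-others scores; a[i][j] is the evaluation of student i by student j."""
    n = len(a)
    scores = []
    for i in range(n):
        total = Fraction(0)
        for j in range(n):
            if j == i:
                continue  # self-evaluations are not used
            # Pie handed out by reporter j to teammates only.
            pie = sum(a[k][j] for k in range(n) if k != j)
            total += Fraction(a[i][j], pie)
        scores.append(total / n)
    return scores

# Hypothetical truthful reports for true shares 1/2, 1/3, 1/6:
# each column is proportional to 3:2:1 with the reporter's own entry set to 0.
truthful = [[0, 3, 3],
            [2, 0, 2],
            [1, 1, 0]]

perceived = pie_to_others(truthful)
print(perceived)  # shares 9/20, 16/45, 7/36 instead of 1/2, 1/3, 1/6
```

Even though every report is truthful, the strongest student in this hypothetical team receives 9/20 rather than 1/2, and the weakest receives 7/36 rather than 1/6.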
In our experience, real students, at least those majoring in mathematics, understand very well that the pie-to-others mechanism is not accurate. Students will not be happy if a large portion of their grade comes from this mechanism.
However, inaccuracy is not the only problem of this mechanism. After all, pie-to-others still gives the highest score to the most hard-working student in a team, even though that score may not accurately reflect the true contribution. Let us demonstrate the other issue by an example.
Example 2
Consider a hypothetical team of three students whose vector of true contributions is
[TABLE]
The matrix of peer evaluations and the vector of perceived contributions are
[TABLE]
which means that now all the students are fairly rewarded. □
Thus it is profitable for the strongest students to do all the work by themselves without letting their teammates do anything. It is then (and only then) that they will be fairly rewarded for their hard work. This behavior is very common in practice, and a serious weakness of the pie-to-others mechanism is that it incentivizes such behavior.
3 Results
In this section, we outline a new mechanism that is accurate but does not rely on self-evaluation.
3.1 Auxiliary matrix
Let $a_{ij}$ be the raw evaluation of student $i$ by student $j$ for $i \ne j$. We are going to construct an auxiliary matrix $B$ whose entries $b_{ij}$ show the relative contributions of students $i$ and $j$, that is, the ratio of $i$’s contribution to $j$’s contribution according to the other team members. If $k$ is any student other than $i$ and $j$, then
$$\frac{a_{ik}}{a_{ik} + a_{jk}} \tag{3}$$
is the share of $i$’s contribution in the combined contribution of $i$ and $j$ according to $k$. Therefore, in a team of $n$ students,
$$\frac{1}{n-2} \sum_{k \ne i, j} \frac{a_{ik}}{a_{ik} + a_{jk}}$$
is the average share of $i$’s contribution in the combined contribution of $i$ and $j$ according to their teammates. Thus,
$$b_{ij} = \frac{\sum_{k \ne i, j} \dfrac{a_{ik}}{a_{ik} + a_{jk}}}{\sum_{k \ne i, j} \dfrac{a_{jk}}{a_{ik} + a_{jk}}} \tag{4}$$
is the average ratio of $i$’s contribution to $j$’s contribution according to their teammates.
Note that (3) does not change if we multiply all evaluations reported by a student by a constant. It means that it is not necessary to normalize the matrix $A$ of peer-evaluations to compute the auxiliary matrix $B$.
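The construction of the auxiliary matrix can be sketched as follows: for every pair of students, sum over the remaining teammates the share attributed to the first student, and divide by the corresponding sum of shares attributed to the second. The function name `auxiliary_matrix`, the sample matrices, and the convention of setting diagonal entries to 1 are assumptions of this sketch, and the code assumes that all off-diagonal evaluations are positive.

```python
from fractions import Fraction

def auxiliary_matrix(a):
    """b[i][j]: ratio of i's contribution to j's contribution according to teammates.

    a[i][j] is the evaluation of student i by student j; diagonal entries are ignored.
    """
    n = len(a)
    b = [[Fraction(1)] * n for _ in range(n)]  # a student relative to himself: ratio 1
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            others = [k for k in range(n) if k not in (i, j)]
            share_i = sum(Fraction(a[i][k], a[i][k] + a[j][k]) for k in others)
            share_j = sum(Fraction(a[j][k], a[i][k] + a[j][k]) for k in others)
            b[i][j] = share_i / share_j
    return b

# Hypothetical truthful reports for contributions in the ratio 3:2:1.
truthful = [[0, 3, 3],
            [2, 0, 2],
            [1, 1, 0]]
# The same reports with the third student's column multiplied by 10.
rescaled = [[0, 3, 30],
            [2, 0, 20],
            [1, 1, 0]]

b = auxiliary_matrix(truthful)
print(b[0][1], b[0][2], b[1][2])  # 3/2 3 2
```

Rescaling one reporter's column leaves the result unchanged, which is why the raw evaluations need no normalization before this step.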
Example 3
Consider a team of 3 students with true contributions . The matrix of peer-evaluations and the auxiliary matrix are
[TABLE]
Note that each column of the auxiliary matrix $B$ is proportional to the vector $t$ of true contributions. □
3.2 Instructor’s judgment
It is not realistic to expect that actual students would be completely truthful and absolutely precise in their evaluations. In our experience, students who contribute least of all usually tend not to put too much thought into evaluations that they submit and often simply give equal scores to everyone. It is therefore important to have some procedure of discrediting evaluations that are not trustworthy.
In our mechanism, students not only give numeric evaluations to each other but also provide justifications, i.e., write short reports on each of their teammates. The instructor then reads these reports and grades students on how trustworthy their reports are. Let $w_k$ be the grade given to student $k$ by the instructor for writing reports.
We will now modify the construction of the auxiliary matrix $B$ in order to integrate the trustworthiness of students’ reports into it. Specifically, (4) is replaced with
$$b_{ij} = \frac{\sum_{k \ne i, j} w_k \dfrac{a_{ik}}{a_{ik} + a_{jk}}}{\sum_{k \ne i, j} w_k \dfrac{a_{jk}}{a_{ik} + a_{jk}}}, \tag{5}$$
i.e., the average is replaced with a weighted average.
Example 4
Consider four students and true contributions
[TABLE]
Suppose also that the vector of grades for writing reports is . Probably, student 1 wrote nice trustworthy reports, student 2 did not write anything, student 3 wrote something short and not very meaningful, and student 4 wrote meaningful reports but missed some points. At the same time, students 1 and 4 are truthful in their evaluations, student 2 gave same scores to everyone and student 3 was not completely truthful so that the matrix of raw evaluations is
[TABLE]
Then the auxiliary matrix is
[TABLE]
For instance,
[TABLE]
□
3.3 Missing values
In practice, some students fail to submit evaluations of their teammates, i.e., a real matrix of evaluations may have missing values. In this case, the instructor gives a zero score for reports to students who did not write any reports and imputes their missing numeric evaluations with equal scores. Because these imputed evaluations carry zero weight in (5), they are automatically discarded.
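A minimal sketch of this imputation step is given below. The function name `impute_missing`, the use of `None` for a missing evaluation, and the decision to treat a reporter with any missing entry as not having submitted at all are assumptions of the sketch; `a[i][j]` is the evaluation of student `i` by student `j` and `w[j]` is the instructor's grade for the reports of student `j`.

```python
def impute_missing(a, w):
    """Return copies of (a, w) in which non-submitting reporters are neutralized.

    A reporter j with any missing evaluation gets equal scores for all
    teammates and a report grade of zero; the inputs are left unchanged.
    """
    n = len(a)
    a = [row[:] for row in a]
    w = list(w)
    for j in range(n):
        if any(a[i][j] is None for i in range(n) if i != j):
            for i in range(n):
                if i != j:
                    a[i][j] = 1  # equal scores for every teammate
            w[j] = 0  # zero weight: the imputed column is ignored later
    return a, w

# Hypothetical team of three in which the third student submitted nothing.
raw = [[0, 3, None],
       [2, 0, None],
       [1, 1, 0]]
grades = [8, 9, 5]

filled, adjusted = impute_missing(raw, grades)
print(filled, adjusted)
```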
3.4 Main mechanism
To calculate the “true” individual contributions of each team member from a matrix of peer evaluations, we let the instructor grade the reports written by each student on their teammates. Let $w_k$ be the grade given to student $k$ by the instructor. We calculate the auxiliary matrix $B$ by (5). Then the entry $b_{ij}$ of the auxiliary matrix is the contribution of student $i$ relative to that of student $j$ according to their teammates. We also have $b_{ij} b_{ji} = 1$ for all $i$ and $j$.
Note that if everyone were completely truthful, then each column of the matrix $B$ would be proportional to the vector $t$ of true contributions of the team members. In practice, however, columns of the matrix $B$ may not be proportional to each other due to differences in students’ opinions. Another practical issue is that there may be students who do not contribute at all. If $j$ is such a student, i.e., $t_j = 0$, then we would have
$$b_{ij} = \infty \quad \text{for every } i \text{ with } t_i > 0.$$
The final scores are computed from the matrix $B$ by
$$p_i = \frac{1}{m} \sum_{j} \frac{b_{ij}}{\sum_{k=1}^{n} b_{kj}}, \tag{6}$$
where the average is taken over all $m$ columns $j$ that do not have infinite or missing values. Note that we normalize each column by dividing it by the sum of its entries to ensure that the sum of the entries of the vector $p$ is $1$.
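The whole computation (weighted auxiliary matrix, column normalization, averaging over usable columns) can be sketched as below. The function names, the 4:3:2:1 contribution split, the evaluation matrix, and the report grades are hypothetical values chosen for illustration, and the representation of an infinite or undefined entry by `None` is an assumption of this sketch. As before, `a[i][j]` is the evaluation of student `i` by student `j` and `w[k]` is the instructor's grade for the reports of student `k`.

```python
from fractions import Fraction

def weighted_auxiliary(a, w):
    """Weighted ratio of i's to j's contribution; None marks an infinite or undefined entry."""
    n = len(a)
    b = [[Fraction(1)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            num = den = Fraction(0)
            for k in range(n):
                pair = a[i][k] + a[j][k]
                if k in (i, j) or w[k] == 0 or pair == 0:
                    continue  # not a third party, discredited, or uninformative
                num += Fraction(w[k] * a[i][k], pair)
                den += Fraction(w[k] * a[j][k], pair)
            b[i][j] = num / den if den != 0 else None
    return b

def main_mechanism(a, w):
    """Average of the normalized columns of the auxiliary matrix, skipping unusable columns."""
    n = len(a)
    b = weighted_auxiliary(a, w)
    usable = [j for j in range(n) if all(b[i][j] is not None for i in range(n))]
    totals = {j: sum(b[k][j] for k in range(n)) for j in usable}
    return [sum(b[i][j] / totals[j] for j in usable) / len(usable) for i in range(n)]

# Hypothetical team with true shares 2/5, 3/10, 1/5, 1/10.
# Students 1, 3 and 4 report truthfully; student 2 gives everyone the same score.
a = [[0, 1, 4, 4],
     [3, 0, 3, 3],
     [2, 1, 0, 2],
     [1, 1, 1, 0]]

graded = main_mechanism(a, [10, 0, 6, 8])   # student 2's reports graded zero
ungraded = main_mechanism(a, [1, 1, 1, 1])  # every report trusted equally
print(graded)
```

In this hypothetical data set, grading the uninformative reports of student 2 at zero recovers the assumed true shares exactly, whereas trusting all reports equally pulls the scores toward equality.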
Theorem 1
Suppose that at least two students made a positive contribution to the team’s work and that there are at least three students who evaluated each of their teammates and got positive grades from the instructor for the reports that they wrote. Then the mechanism given by (6) is accurate. □
Proof
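The assumptions are needed to ensure that the auxiliary matrix $B$ is well-defined and has at least two columns with finite entries. The rest is straightforward: to show that the mechanism is accurate, we need to prove that its output is the objective truth assuming that all evaluations are truthful. If all the students report the objective truth, then we have
$$a_{ik} = c_k t_i \quad \text{for all } i \ne k,$$
where $t_i$ is the true contribution of student $i$ and $c_k > 0$ is a constant chosen by student $k$. From (5), we get
$$b_{ij} = \frac{\sum_{k \ne i, j} w_k \dfrac{t_i}{t_i + t_j}}{\sum_{k \ne i, j} w_k \dfrac{t_j}{t_i + t_j}} = \frac{t_i}{t_j}, \qquad \frac{b_{ij}}{\sum_{k=1}^{n} b_{kj}} = \frac{t_i / t_j}{\sum_{k=1}^{n} t_k / t_j} = t_i,$$
so every usable column of $B$, once normalized, equals the vector of true contributions, and so does the average (6) of these columns, which completes the proof. ■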
3.5 Evaluating truthfulness
Although our mechanism is accurate and does not rely on self-evaluation, there is still a practical consideration that it does not address. Sometimes, two friends may cheat the system by giving unfairly high evaluations to each other. If they write convincing reports, the course instructor will overlook the deceit.
The following method was proposed in [2] to discourage such behavior. We actually require students to report self-evaluations, i.e., the diagonal entries of the matrix of evaluations are no longer set to zero. Even though self-evaluations are discarded when the mechanism output is computed according to (6), all evaluations reported by a student, including the self-evaluation, are used to measure to what extent the evaluations reported by that student deviate from the final scores.
Let us introduce the notation
[TABLE]
for the contribution of student $i$ according to student $j$. Then
[TABLE]
is the relative error of the evaluation of student $i$ by student $j$ and
[TABLE]
is the average relative error of student $j$’s evaluations.
3.6 Summary of our mechanism
In practice, the main score computed by the mechanism according to (6) constitutes 90% of the final score, with 5% given for writing reports and 5% given for consistency of the evaluations submitted by a student with the scores calculated by the main mechanism. Thus the final score given to student $i$ is
[TABLE]
where the first term comes from the output of the main mechanism computed according to (6), the second from the score given by the instructor to student $i$ for the reports that $i$ wrote, and the third from the average relative error of the $i$-th student’s evaluations found according to (7).
Note that the average of scores given by (8) is usually smaller than and equals if and only if all evaluations submitted by students are completely truthful. Another remark is that the weights 90%, 5%, and 5% for the output of the main mechanism, reports, and consistency of evaluations with the output of the main mechanism are not justified by data or theory, and finding the “right” weights may be a subject of future research.
4 Discussion
Our mechanism is accurate. However, even though it does not rely on self-evaluation, it may be possible to manipulate it.
Example 5
Suppose that we have a team of three students and the vector of true contributions is
[TABLE]
Suppose that the third student decided to manipulate the mechanism and reported incorrect evaluations by claiming that the first and the second student contributed equally. Assume also that the course instructor did not notice it and graded all reports as being equally trustworthy. The matrix of peer-evaluations and the auxiliary matrix are
[TABLE]
The vector of perceived contributions is then
[TABLE]
Then the perceived contribution of the third student is . It means that by changing his evaluations of other students, he changed his own score. □
Note that although manipulating our mechanism is not as straightforward as manipulating the pie-to-all mechanism, it is still possible.
A mechanism is called incentive compatible if lying does not improve one’s own score given that others tell the truth. In other words, a mechanism is incentive compatible if the collective truth-telling is a Nash equilibrium. A mechanism is not manipulable if changing evaluations of other students one reports does not change one’s own score. Note that a mechanism that is not manipulable is automatically incentive compatible, but not vice versa.
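Manipulability in this sense can be checked numerically. The sketch below does not reproduce the numbers of Example 5: the 3:2:1 contribution split, both matrices, and the function name `scores` are assumptions for illustration, all reports are treated as equally trustworthy, and all off-diagonal evaluations are assumed positive. Here `a[i][j]` is the evaluation of student `i` by student `j`.

```python
from fractions import Fraction

def scores(a):
    """Output of the mechanism when all reports carry equal weight."""
    n = len(a)
    b = [[Fraction(1)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                others = [k for k in range(n) if k not in (i, j)]
                num = sum(Fraction(a[i][k], a[i][k] + a[j][k]) for k in others)
                den = sum(Fraction(a[j][k], a[i][k] + a[j][k]) for k in others)
                b[i][j] = num / den
    # Normalize every column of the auxiliary matrix and average across columns.
    totals = [sum(b[k][j] for k in range(n)) for j in range(n)]
    return [sum(b[i][j] / totals[j] for j in range(n)) / n for i in range(n)]

# Hypothetical truthful reports for true shares 1/2, 1/3, 1/6.
truthful = [[0, 3, 3],
            [2, 0, 2],
            [1, 1, 0]]
# The third student instead claims that students 1 and 2 contributed equally.
manipulated = [[0, 3, 1],
               [2, 0, 1],
               [1, 1, 0]]

honest = scores(truthful)
cheated = scores(manipulated)
print(honest[2], cheated[2])  # 1/6 107/630
```

In this hypothetical team the third student's own score rises from 1/6 to 107/630 merely by misreporting the split between the other two students, so the mechanism is manipulable in the sense defined above; the gain is small, which is consistent with the suggestion that a modest consistency component could offset it.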
Our mechanism is manipulable, but we believe that it will become incentive compatible if a small portion of the final score is given for consistency of the scores reported by a student with the team’s perceived contributions. However, determining that small portion (in practice, we use 5%, but we do not have a justification for that number) is still an open problem. We do not know if there exists an accurate mechanism that is not manipulable. We believe that it does not, but proving it is another open problem.
[1] Charles M Brooks and Janice L Ammons. Free riding in group projects and the effects of timing, frequency, and specificity of criteria in peer assessments. Journal of Education for Business, 78(5):268–272, 2003.
[2] Arthur Carvalho and Kate Larson. Sharing rewards among strangers based on peer evaluations. Decision Analysis, 9(3):253–273, 2012.
[3] Robert Conway, David Kember, Atara Sivan, and May Wu. Peer assessment of an individual’s contribution to a group project. Assessment & Evaluation in Higher Education, 18(1):45–56, 1993.
[4] Norma C Holter. Team assignments can be effective cooperative learning techniques. Journal of Education for Business, 70(2):73–76, 1994.
[5] William B Joyce. On the free-rider problem in cooperative learning. Journal of Education for Business, 74(5):271–274, 1999.
[6] Deborah B Kaufman, Richard M Felder, and Hugh Fuller. Accounting for individual effort in cooperative learning teams. Journal of Engineering Education, 89(2):133–140, 2000.
[7] Jane H Leuthold. A free rider experiment for the large class. The Journal of Economic Education, 24(4):353–363, 1993.
[8] S Dolly Malik and Daniel R Strang. An exploration of the emergence of process prototypes in a management course utilizing a total enterprise simulation. Developments in Business Simulation and Experiential Learning, 25, 1998.
