There must be an error here! Experimental evidence on coding errors' biases

Bruno Ferman; Lucas Finamor

arXiv:2508.20069·econ.GN·September 26, 2025

TL;DR

This study provides experimental evidence that researchers are more likely to detect coding errors when these errors produce unexpected results, indicating potential biases in scientific findings due to error detection asymmetries.

Contribution

It demonstrates experimentally that coding errors are more likely to be identified when they lead to unexpected outcomes, highlighting a bias in error detection that can affect research validity.

Findings

01

Individuals are almost 20% more likely to detect coding errors that produce unexpected results.

02

Coding errors can bias scientific research outcomes.

03

Error detection is asymmetrical based on result expectations.

Abstract

Quantitative research relies heavily on coding, and coding errors are relatively common even in published research. In this paper, we examine whether individuals are more or less likely to check their code depending on the results they obtain. We test this hypothesis in a randomized experiment embedded in the recruitment process for research positions at a large international economic organization. In a coding task designed to assess candidates' programming abilities, we randomize whether participants obtain an expected or unexpected result if they commit a simple coding error. We find that individuals are almost 20% more likely to detect coding errors when they lead to unexpected results. This asymmetry in error detection depending on the results they generate suggests that coding errors may lead to biased findings in scientific research.

Tables

Table 1: Experimental Design

	Treated	Control
Panel A. Keeping missing values (99)
Question 3	$-0.1602^{*}$	$0.1488^{*}$
Question 4	$0.1488^{*}$	$-0.1602^{*}$
Panel B. Removing missing values (99)
Question 3	-0.0042	0.0068
Question 4	0.0068	-0.0042

Table 2: Descriptive statistics and balance

Variable	Overall Mean	Control Mean	Treated Mean	Diff	P-value Diff	N Obs
Female	0.377	0.352	0.402	0.051	0.095	1025
	(0.485)	(0.478)	(0.491)	[0.030]
Master	0.937	0.927	0.947	0.019	0.203	1031
	(0.243)	(0.260)	(0.225)	[0.015]
Econometrics	0.864	0.859	0.868	0.009	0.663	1034
	(0.343)	(0.348)	(0.338)	[0.021]
Stata	0.508	0.524	0.491	-0.033	0.295	1036
	(0.500)	(0.500)	(0.500)	[0.031]
R	0.295	0.287	0.305	0.018	0.526	1036
	(0.456)	(0.453)	(0.461)	[0.028]
Python	0.159	0.150	0.169	0.019	0.403	1036
	(0.366)	(0.357)	(0.375)	[0.023]
Score Q1-Q2	4.446	4.438	4.454	0.016	0.891	991
	(1.828)	(1.850)	(1.807)	[0.116]
Coding Score	0.055	0.000	0.110	0.110	0.109	886
	(1.025)	(0.998)	(1.049)	[0.069]
Impact Evaluation Score	0.032	0.000	0.064	0.064	0.328	949
	(1.009)	(1.000)	(1.018)	[0.066]
Prior effect	0.126	0.127	0.125	-0.002	0.466	947
	(0.046)	(0.046)	(0.047)	[0.003]
P-value joint test					0.463

Table 3: Main results

Estimator	${\hat{β}}_{1}$	${\hat{β}}_{2}$	${\hat{β}}_{combined}$	${\hat{β}}_{GMM}$
	(1)	(2)	(3)	(4)
Panel A - Controls for wave
Treat	$0.0103$	$0.0146$	$0.0141$	$0.0141$
(s.e.)	( $0.0199$ )	( $0.0077$ )	( $0.0073$ )	( $0.0073$ )
[p-value]	[ $0.6053$ ]	[ $0.0596$ ]	[ $0.0514$ ]	[ $0.0546$ ]
{p-value2}	{ $0.3026$ }	{ $0.0298$ }	{ $0.0278$ }	{ $0.0273$ }

Panel B - Adding demographics controls
Treat	$0.0100$	$0.0157$	$0.0150$	$0.0149$
(s.e.)	( $0.0197$ )	( $0.0080$ )	( $0.0075$ )	( $0.0077$ )
[p-value]	[ $0.6126$ ]	[ $0.0508$ ]	[ $0.0400$ ]	[ $0.0525$ ]
{p-value2}	{ $0.3063$ }	{ $0.0254$ }	{ $0.0213$ }	{ $0.0262$ }

Panel C - Adding data test variables
Treat	$0.0111$	$0.0145$	$0.0141$	$0.0141$
(s.e.)	( $0.0198$ )	( $0.0079$ )	( $0.0075$ )	( $0.0077$ )
[p-value]	[ $0.5738$ ]	[ $0.0674$ ]	[ $0.0564$ ]	[ $0.0664$ ]
{p-value2}	{ $0.2869$ }	{ $0.0337$ }	{ $0.0308$ }	{ $0.0332$ }

Panel D - Adding screening test variables
Treat	$0.0079$	$0.0130$	$0.0124$	$0.0124$
(s.e.)	( $0.0198$ )	( $0.0076$ )	( $0.0072$ )	( $0.0074$ )
[p-value]	[ $0.6878$ ]	[ $0.0864$ ]	[ $0.0792$ ]	[ $0.0949$ ]
{p-value2}	{ $0.3439$ }	{ $0.0432$ }	{ $0.0380$ }	{ $0.0475$ }

N Obs	$788$	$788$	$788$	$788$
Intercept	$0.078$	$0.005$	-	-
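
The two p-value rows in Table 3 can be related with a short calculation. The sketch below assumes that the bracketed values are two-sided p-values and the braced values are one-sided p-values for a positive effect, and it uses a normal approximation to the test statistic; both are assumptions made for illustration, not details confirmed by the table itself. The function name `p_values` is hypothetical.

```python
from statistics import NormalDist


def p_values(estimate: float, std_error: float) -> tuple[float, float]:
    """Return (two-sided, one-sided upper-tail) p-values under a normal approximation."""
    z = estimate / std_error
    std_normal = NormalDist()  # mean 0, standard deviation 1
    two_sided = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    one_sided = 1.0 - std_normal.cdf(z)  # alternative hypothesis: effect > 0
    return two_sided, one_sided


# Estimate and standard error from Table 3, Panel A, column (1).
two_sided, one_sided = p_values(0.0103, 0.0199)
print(f"two-sided: {two_sided:.4f}, one-sided: {one_sided:.4f}")
```

For a positive estimate, the one-sided value is half the two-sided one under this approximation. Small differences from the reported numbers are to be expected if the paper uses a different reference distribution.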

Table 4: Alternative Samples

Sample	Qualified	All	Non-negative Prior	Correct Prior
	(1)	(2)	(3)	(4)
Effect	$0.0141$	$0.0097$	$0.0141$	$0.0154$
(s.e.)	( $0.0073$ )	( $0.0065$ )	( $0.0073$ )	( $0.0082$ )
[p-value]	[ $0.0514$ ]	[ $0.1237$ ]	[ $0.0498$ ]	[ $0.0541$ ]
{p-value2}	{ $0.0278$ }	{ $0.0615$ }	{ $0.0289$ }	{ $0.0299$ }
N Obs	$788$	$944$	$785$	$697$
Spotted First	$0.078$	$0.068$	$0.078$	$0.085$
Spotted Second	$0.005$	$0.007$	$0.005$	$0.006$

Table 5: Heterogeneities

	Effect	Std Error	P-value	N Obs	Spotted First	Diff P-value
Benchmarks
Entire Sample	0.0097	(0.0065)	[0.1237]	944	0.068	-
Qualified Sample	0.0141	(0.0073)	[0.0514]	788	0.078	-

1. Clustered SE?
No	-0.0079	(0.0109)	[0.3510]	155	0.014	0.0981
Yes	0.0138	(0.0073)	[0.0552]	789	0.077

2. Score in Q1–Q2
Below Median	0.0060	(0.0088)	[0.4852]	452	0.039	0.5384
Above Median	0.0137	(0.0089)	[0.0987]	489	0.096

3. Coding Score
Below Median	-0.0021	(0.0100)	[0.8312]	405	0.062	0.3205
Above Median	0.0112	(0.0089)	[0.2089]	432	0.097

4. Impact Evaluation Score
Below Median	0.0043	(0.0107)	[0.6709]	423	0.087	0.6365
Above Median	0.0107	(0.0083)	[0.1982]	454	0.063

5. Master’s degree
Yes	0.0109	(0.0069)	[0.1094]	888	0.069	-

6. Econometrics course
Yes	0.0111	(0.0074)	[0.1290]	830	0.076	-

7. Gender
Men	0.0083	(0.0071)	[0.1267]	578	0.074	0.6713
Women	0.0144	(0.0125)	[0.2453]	356	0.058

Table A.1: Descriptive statistics and balance (qualified sample)

Variable	Overall Mean	Control Mean	Treated Mean	Diff	p-value Diff	N Obs
Female	0.389	0.366	0.413	0.047	0.176	800
	(0.488)	(0.482)	(0.493)	[0.035]
Master	0.953	0.933	0.974	0.041	0.005	806
	(0.212)	(0.250)	(0.159)	[0.015]
Econometrics	0.907	0.897	0.918	0.020	0.318	806
	(0.291)	(0.304)	(0.275)	[0.020]
Stata	0.537	0.542	0.531	-0.011	0.758	807
	(0.499)	(0.499)	(0.500)	[0.035]
R	0.307	0.310	0.304	-0.006	0.850	807
	(0.462)	(0.463)	(0.461)	[0.033]
Python	0.143	0.131	0.155	0.023	0.344	807
	(0.350)	(0.338)	(0.362)	[0.025]
Score Q1-Q2	5.133	5.121	5.145	0.024	0.744	807
	(1.034)	(1.019)	(1.050)	[0.073]
Coding Score	0.150	0.100	0.201	0.101	0.179	733
	(1.019)	(0.979)	(1.057)	[0.075]
Impact Evaluation Score	0.158	0.140	0.176	0.036	0.610	751
	(0.963)	(0.936)	(0.992)	[0.070]
Prior effect	0.130	0.131	0.128	-0.003	0.229	767
	(0.034)	(0.030)	(0.037)	[0.002]
P-value joint test					0.606

Table A.2: Non-response rates

Variable	Control Mean	Treated Mean	Diff	P-value Diff	N Obs
Female	0.008	0.014	0.006	0.336	1036
			[0.006]
Master	0.006	0.004	-0.002	0.682	1036
			[0.004]
Econometrics	0.004	0.000	-0.004	0.157	1036
			[0.003]
Coding Score	0.161	0.128	-0.034	0.124	1036
			[0.022]
Impact Evaluation Score	0.095	0.073	-0.022	0.197	1036
			[0.017]
Prior effect	0.085	0.086	0.001	0.952	1036
			[0.017]

Table A.3: Results separately by wave

Sample	All	Wave 2024	Wave 2025
	(1)	(2)
Effect	$0.0141$	$0.0172$	$0.0093$
(s.e.)	( $0.0073$ )	( $0.0077$ )	( $0.0135$ )
[p-value]	[ $0.0514$ ]	[ $0.0273$ ]	[ $0.4652$ ]
{p-value2}	{ $0.0278$ }	{ $0.0224$ }	{ $0.2285$ }
N Obs	$788$	$450$	$338$

Table A.4: Heterogeneities (qualified sample)

	Effect	Std Error	P-value	N Obs	Spotted First	Diff P-value
Benchmarks
Entire Sample	0.0097	(0.0065)	[0.1237]	944	0.068	-
Qualified Sample	0.0141	(0.0073)	[0.0514]	788	0.078	-

1. Clustered SE?
Yes	0.0145	(0.0077)	[0.052]	747	0.081

2. Score in Q1–Q2
Below Median	0.0131	(0.0093)	[0.1763]	390	0.043	0.9441
Above Median	0.0141	(0.0108)	[0.1461]	398	0.113

3. Coding Score
Below Median	0.0035	(0.0100)	[0.7003]	362	0.074	0.4690
Above Median	0.014	(0.0105)	[0.2005]	362	0.102

4. Impact Evaluation Score
Below Median	0.0154	(0.0121)	[0.1637]	365	0.087	1.3871
Above Median	0.0078	(0.0089)	[0.4067]	371	0.063

5. Master’s degree
Yes	0.0154	(0.0077)	[0.0454]	750	0.078	-

6. Econometrics course
Yes	0.0156	(0.0081)	[0.0476]	714	0.084	-

7. Gender
Men	0.0101	(0.0085)	[0.2128]	477	0.089	0.4519
Women	0.0221	(0.0135)	[0.1055]	304	0.061

Equations

Y_{i}^{Q3} = α_{1} + β_{1} T_{i} + γ Z + ε_{i},

\tilde{Y}_{i}^{Q4}

\hat{β}_{combined} = ω \hat{β}_{1} + (1 - ω) \hat{β}_{2} .

ω = \frac{Var(\hat{β}_{2}) - Cov(\hat{β}_{1}, \hat{β}_{2})}{Var(\hat{β}_{1}) + Var(\hat{β}_{2}) - 2 Cov(\hat{β}_{1}, \hat{β}_{2})} .

Δ_{2} = CI + (CII-B - CII-C) .
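
The combined estimator is a weighted average of the two estimators. A minimal sketch of the textbook minimum-variance weight for combining two possibly correlated estimators follows. The function names are hypothetical, and setting the covariance to zero is an illustrative assumption, since the covariance between the two estimators is not reported in the tables above.

```python
import math


def min_variance_weight(var1: float, var2: float, cov: float = 0.0) -> float:
    """Weight on estimator 1 that minimizes Var(w * b1 + (1 - w) * b2)."""
    return (var2 - cov) / (var1 + var2 - 2.0 * cov)


def combine(b1: float, se1: float, b2: float, se2: float, cov: float = 0.0):
    """Return (combined estimate, its standard error, weight on b1)."""
    var1, var2 = se1 ** 2, se2 ** 2
    w = min_variance_weight(var1, var2, cov)
    est = w * b1 + (1.0 - w) * b2
    var = w ** 2 * var1 + (1.0 - w) ** 2 * var2 + 2.0 * w * (1.0 - w) * cov
    return est, math.sqrt(var), w


# Point estimates and standard errors from Table 3, Panel A, columns (1)-(2).
# cov=0.0 is an illustrative assumption, not a value reported in the paper.
est, se, w = combine(0.0103, 0.0199, 0.0146, 0.0077, cov=0.0)
print(f"weight on beta_1: {w:.3f}, combined: {est:.4f}, s.e.: {se:.4f}")
```

Because the first estimator is much noisier than the second, most of the weight goes to the second one, and the combined standard error is smaller than either individual standard error.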

Full text

There must be an error here! Experimental evidence on coding errors’ biases (Footnote 1: We would like to thank Maria Ruth Jones and staff from the World Bank for valuable insights and cooperation with the project. We are also indebted to Dalila Figueiredo and Eduardo Ferraz for their insightful comments.)

The pre-analysis plan for this paper was pre-registered and it is available at the AEA Registry — AEARCTR-0008312.

Bruno Ferman (Sao Paulo School of Economics - FGV)

Lucas Finamor (Sao Paulo School of Economics - FGV)

(First draft: August 2025

This draft: September 2025)

JEL Codes: C81, C80, C93

1 Introduction

There is growing concern about the lack of reproducibility in scientific research, both in economics and other fields (Ankel-Peters et al. (2023); Brodeur et al. (2024e, c, 2025); Camerer et al. (2016, 2019); Campbell et al. (2024); Chang and Li (2022); Christensen and Miguel (2018); Dewald et al. (1986); Drazen et al. (2021); Gertler et al. (2018); Hamermesh (2007); Lang (2025); McCullough et al. (2006); Open Science Collaboration (2015); Pérignon et al. (2024); Vilhuber (2019); Wood et al. (2018)). Several factors contribute to this problem, including the unavailability or poor documentation of data and code (Anderson et al. (2008); Chang and Li (2022); Dewald et al. (1986); Gertler et al. (2018); McCullough (2007); Vilhuber (2019)), p-hacking and publication bias (Adda et al. (2020); Andrews and Kasy (2019); Ashenfelter et al. (1999); Ashenfelter and Greenstone (2004); Begg and Mazumdar (1994); Blanco-Perez and Brodeur (2020); Brodeur et al. (2016, 2020, 2022, 2023, 2024d, 2024a, 2024b); Bruns et al. (2019); Camerer et al. (2016); Campbell et al. (2024); Card and Krueger (1995); Chopra et al. (2024); DellaVigna and Linos (2022); De Long and Lang (1992); Doucouliagos and Stanley (2013); Dreber et al. (2024); Elliott et al. (2022); Franco et al. (2014); Gerber and Malhotra (2008); Gerber et al. (2008, 2010); Havránek (2015); Henry (2009); Ioannidis (2005); Ioannidis et al. (2017); Kepes et al. (2022); Leamer and Leonard (1983); McCloskey (1985); O’Boyle Jr et al. (2017); Olsen et al. (2019); Rosenthal (1979); Stanley (2005, 2008); Vivalt (2019)), excessive researcher degrees of freedom in data construction and analysis (Huntington-Klein et al. (2021, 2025); Landy et al. (2020); Menkveld et al. (2024); Silberzahn et al. (2018); Simmons et al. (2011)), and coding errors that can alter empirical conclusions (Anderson et al. (2008); Brodeur et al. (2024e); McCullough et al. (2006); McCullough (2007)). 
In some cases, coding errors have led to widely cited but incorrect findings—for example, the paper by Reinhart and Rogoff (2010) on debt and growth, which omitted data due to a spreadsheet error, or replications of published articles that uncovered computational mistakes affecting key results (Herndon et al. (2014)). Beyond these anecdotal examples, in a mass replication study, Brodeur et al. (2024e) find that one quarter of studies published after 2022 in nine leading economics journals and three leading political science journals had some coding errors, showing that coding errors are highly prevalent.

If coding errors are independent of the results they generate, they would not lead to systematic bias, although they would still contribute to the excess dispersion of estimates observed in empirical research. In this case, it would just be part of what some have called “nonstandard errors” (Huntington-Klein et al. (2021, 2025); Menkveld et al. (2024); Silberzahn et al. (2018)). This variation goes beyond what is typically captured by standard model- or design-based measures of uncertainty and may arise from researcher practices and flexibility in analytical choices, where the possibility of coding errors would be one of the reasons why different groups of researchers may end up with different results. In contrast, if the likelihood of detecting coding errors depends on the results those errors produce, then, in addition to increasing dispersion, even well-intentioned researchers may unknowingly introduce systematic bias into their estimates due to coding errors.

In this paper, we test the hypothesis that the probability of detecting a coding error depends on the outcome the error generates. We do so through a randomized experiment embedded in the recruitment process for research positions at a large international economic organization. As part of a coding task designed to assess candidates’ programming abilities, we randomize whether a simple coding mistake leads to an expected or unexpected result. The coding mistake consists of failing to take into account that the value 99 in the outcome variable codes missing values. Ignoring the missing-value code leads to wrong results that can be expected or unexpected, depending on the group to which candidates were randomly allocated. This design allows us to estimate whether individuals are less likely to detect and correct coding errors when the resulting outcome aligns with their expectations. We find that individuals are almost 20% more likely to detect coding errors when they generate unexpected results. This asymmetry in error detection suggests that coding mistakes may lead not only to an increase in dispersion that is not captured by usual standard errors but also to bias in empirical research.

Our findings contribute to several strands of the literature. First, we add to the growing body of work on the reproducibility crisis in economics by highlighting a novel behavioral mechanism, selective error detection, that can undermine the reliability of empirical findings even in the absence of intentional misconduct. Second, we speak to the literature on nonstandard errors by showing that undetected coding errors may be an underappreciated source of excess variation in empirical estimates, one that can not only increase dispersion but also introduce bias into estimators. Finally, our results echo themes from the literature on confirmation bias (Kunda (1990); Nickerson (1998)), suggesting that researchers may be more likely to overlook errors that produce results aligned with prior expectations.

2 Experimental Design

2.1 Setting and sample

The study takes place in a recruitment process for research assistants and for a research-oriented fellowship program within the Development Economics (DEC) Vice Presidency of the World Bank in two separate waves in 2024 and 2025. As part of the recruitment process, candidates are asked to perform a simple data task to evaluate their coding abilities. The experiment takes place within this data task.

The data task was the last component of the first screening performed by the recruiters. Completion of the task was encouraged, but it was not a requirement. (Footnote 4: Some positions did not require coding; therefore, the coding task was not mandatory.) At the start of the test, individuals could decide whether to share their data from the test for research purposes. The decision had no impact on how the data task was used for the recruitment process. Pooling the two waves, we have results for 1,036 task takers who started the data task and agreed to share their data for research purposes. (Footnote 5: In total, 1,171 task takers started the data task and answered the initial questions. Among them, 135 (11.5%) opted not to share their data.)

2.2 Data task

The data task presented candidates with a scenario in which they had to analyze data from a hypothetical RCT intervention that tailored educational content to students’ appropriate level, an intervention inspired by studies such as Banerjee et al. (2007); Cabezas et al. (2011); Duflo et al. (2011) and Banerjee et al. (2016). The main objective was to manipulate whether a coding error would lead to an expected or unexpected result, allowing us to evaluate whether participants are more likely to debug their code when they observe an unexpected outcome. To do so, we created datasets in which missing values for the outcome variable (test scores) were coded as 99, a fact disclosed in the data dictionary. We then experimentally varied whether including students with missing outcomes in the regressions would produce expected or unexpected results. We describe the data task in detail below, highlighting the features relevant to our research design. In Appendix C, we reproduce the data task.

The data task began with a set of initial demographic questions, including gender, education level, whether the candidate had taken an econometrics course, and the language in which they intended to complete the task. These variables were used as individual-level controls and to conduct heterogeneity analysis.

Next, candidates were presented with results from six randomized controlled trials that evaluated the impact of programs tailoring educational content to students’ appropriate level. (Footnote 6: The results were drawn from the following papers: Banerjee et al. (2007); Cabezas et al. (2011); Duflo et al. (2011), and Banerjee et al. (2016). All candidates observed the results from all articles.) The estimated effects ranged from 0.08 to 0.16 standard deviations, each positive and statistically significant at the 5% level. These results served to anchor participants’ beliefs, reinforcing the expectation of positive effects from interventions of this type.

We then presented a hypothetical RCT of an intervention that, inspired by the literature, tailored educational content to students’ appropriate level. We explained that 5th-grade teachers in treated schools received materials and training to implement the tailored program, while instruction in control schools remained unchanged. At this stage, participants were asked to report their best guess of the approximate effect of such an RCT on language proficiency, measured 12 months after the start of the intervention. Using a slider, they could select values between –0.30 and 0.30 standard deviations. Consistent with the anchoring provided by existing evidence, more than 87% of task-takers reported priors between 0.08 and 0.16 standard deviations.

After collecting participants’ priors, each candidate received a dataset corresponding to the hypothetical experiment. Each task taker was randomly assigned a version of the dataset, which included three files: student-level data from 480 schools participating in the RCT across two distinct states (one file per state), and a data dictionary. Candidates were informed that they would answer four questions based on these datasets. Each question appeared on a separate page, and once they proceeded to the next question, they could not return to change previous answers. However, they were told in advance that they would have the opportunity to review and revise their responses after completing all questions. While only the final answers were used for the screening process, we rely on the initial responses for the purposes of our experiment. (Footnote 7: As we explain in detail in Section 2.4, this design ensures the experiment is fair to participants both ex-ante and ex-post.) The questions required either a numerical answer (e.g., a point estimate with three decimal places) or a written response involving interpretation or reasoning.

The first question (Q1) requires task takers to manipulate the data and answer questions related to counts, means, and conditional means based on the provided dataset. The second question (Q2) asks participants to run an OLS regression to assess the balance of the hypothetical RCT. Importantly, the variables used in these two questions do not contain missing values, as the outcome variable (test scores), which contains the missing values coded as 99, is not used. We refer to the scores obtained in these initial questions (Q1–Q2) as a measure of initial coding ability, as they evaluate whether individuals can perform basic data manipulations and run a standard OLS regression. As specified in our pre-analysis plan, our main analyses are restricted to individuals who demonstrate basic coding proficiency. (Footnote 8: The main reason for this restriction is that we depend on individuals knowing how to run an OLS regression in order to correctly identify whether they have spotted the coding error.)

The third question (Q3) asks participants to estimate the effect of the hypothetical intervention on language scores in one of the two states (e.g., State 1). Task takers are instructed to run a specific OLS regression, using the standardized language score as the outcome and regressing it on a constant and a treatment indicator. We record their submitted point estimate, standard error, and p-value for the treatment effect, as well as their interpretation of the result. The fourth question (Q4) mirrors this task for the other state (e.g., State 2). That is, if a participant answered Q3 using data from State 1, Q4 asks about State 2, and vice versa. In these two questions, participants are exposed to a common coding error: failing to account for the fact that missing test scores are recorded as 99. The data dictionary explicitly informed them that a value of 99 corresponds to missing outcomes.

The key experimental variation lies in the construction of the datasets provided to candidates: they differ in the results participants would obtain if they do not account for the coding of missing values. In the treatment group, failing to drop the 99s leads to a significant negative estimated effect of the program in Q3 (an unexpected result given their priors), followed by a significant positive effect in Q4 (an expected result). In contrast, the control group encounters the reverse: a significant positive result in Q3 and a significant negative result in Q4, if the 99 values are mistakenly included in the regressions. If the missing values are appropriately taken into account, then the estimated effects are approximately zero in both groups. Table 1 shows the average point estimate candidates would obtain in both questions if they include the 99 values or not.
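
To make the mechanism concrete, here is a minimal sketch of how leaving the 99 codes in the outcome can flip the sign of an estimate. It relies on the fact that the OLS coefficient from a regression on a constant and a binary treatment indicator equals the treated-minus-control difference in means. The data-generating choices (sample size, number of missing values per arm, a true effect of zero) and the function names are hypothetical and are not the authors' actual datasets.

```python
import random
from statistics import mean

MISSING_CODE = 99  # per the data dictionary, 99 marks a missing test score


def diff_in_means(rows):
    """OLS slope of score on a constant and a binary treatment indicator,
    which equals the treated-minus-control difference in means."""
    treated = [score for treat, score in rows if treat == 1]
    control = [score for treat, score in rows if treat == 0]
    return mean(treated) - mean(control)


def simulate_state(n_per_arm, n_missing_treated, n_missing_control, seed=0):
    """Hypothetical data: no true effect; the count of 99s differs by arm."""
    rng = random.Random(seed)
    rows = []
    for treat, n_missing in ((1, n_missing_treated), (0, n_missing_control)):
        for i in range(n_per_arm):
            score = MISSING_CODE if i < n_missing else rng.gauss(0.0, 1.0)
            rows.append((treat, score))
    return rows


# More 99s among controls inflates the control mean, so the naive
# estimate is spuriously negative even though the true effect is zero.
rows = simulate_state(n_per_arm=10000, n_missing_treated=100, n_missing_control=120)
naive = diff_in_means(rows)
clean = diff_in_means([(t, s) for t, s in rows if s != MISSING_CODE])
print(f"keeping 99s: {naive:+.3f}   dropping 99s: {clean:+.3f}")
```

Swapping the two missing-value counts reverses the sign of the naive estimate, which is one way a design like this could deliver the opposite pattern in the other state's file.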

On the final page of the data task, candidates are shown all their previous answers and are given the opportunity to revise them before submission. For instance, a candidate who notices the issue with the 99s only in Q4 could still revise their answer to Q3 before submitting. Importantly, for the purposes of our experiment, we analyze the sequential responses prior to these final adjustments. In contrast, the screening process considers only the final submitted answers.

2.3 Identification

The main goal of the experiment is to identify the proportion of individuals who only spot the coding error when it leads to an unexpected result. Our experimental design, which varies whether the unexpected result (in case the coding error is not spotted) appears in Q3 or Q4, allows us to identify this proportion in two different ways.

We classify the individuals according to four latent types:

1. Always-spot (AS): those who always spot the error, irrespective of the result;

2. Never-spot (NS): those who never spot the error, irrespective of the result;

3. Complier I (CI): those who spot the coding error if it leads to unexpected results;

4. Complier II (CII): those who spot the coding error if they find conflicting results between the two answers (that is, Q3 had a positive effect while Q4 had a negative one, or vice versa).

Figure 1 presents the observed outcomes for questions Q3 and Q4 for each of these latent types, depending on whether they are in the treated or in the control group. (Footnote 9: This classification does not allow individuals to detect the problem in Q3 but fail to detect it in Q4. Reassuringly, we find that only 2 out of 1,036 subjects presented this pattern.) Given that, we have two ways of identifying the proportion of CI types.

The first approach uses only data from Q3. In the control group, only the AS type would spot the error at this stage, while in the treated group both the AS and the CI types would spot the error. Therefore, we can identify the proportion of CI types by comparing the proportion of participants who answered Q3 correctly between the treated and the control groups.

The second approach considers participants who did not spot the error in Q3 but did spot it in Q4. As presented in Figure 1, in the treated group only type CII would present this pattern, while in the control group it would be observed for both CI and CII types. Therefore, comparing the proportion of participants who exhibit this pattern between the control and the treated groups provides another way of identifying the proportion of individuals who debug when they observe an unexpected result. (Footnote 10: Here we assume that the order in which individuals obtain the wrong results, when including the 99s, does not matter. In Appendix B we expand the analysis to cases where this might not hold.) While we did not pre-specify the use of this variation for identifying the proportion of CI types, we observed afterwards that it provides a clean and more precise identification of the parameter of interest.
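
The two identification arguments can be checked with simple arithmetic. The sketch below maps a set of latent-type shares to the observable rates described in the text and then recovers the CI share in both ways. The shares used are hypothetical and chosen only for illustration, and the function names are not from the paper.

```python
def expected_rates(shares):
    """Map latent-type shares to the observable rates described in the text."""
    a, ci, cii = shares["AS"], shares["CI"], shares["CII"]
    return {
        # Treated: unexpected (negative) result in Q3, so AS and CI spot it there;
        # only CII first spots the error in Q4.
        "treated": {"spot_q3": a + ci, "spot_q4_only": cii},
        # Control: expected (positive) result in Q3, so only AS spot it there;
        # CI and CII first spot the error in Q4.
        "control": {"spot_q3": a, "spot_q4_only": ci + cii},
    }


def ci_share_two_ways(rates):
    """Recover the CI share from Q3 rates and from the Q4-only pattern."""
    from_q3 = rates["treated"]["spot_q3"] - rates["control"]["spot_q3"]
    from_q4 = rates["control"]["spot_q4_only"] - rates["treated"]["spot_q4_only"]
    return from_q3, from_q4


# Hypothetical shares of each latent type; they sum to one.
shares = {"AS": 0.08, "NS": 0.90, "CI": 0.015, "CII": 0.005}
from_q3, from_q4 = ci_share_two_ways(expected_rates(shares))
print(f"CI share from Q3: {from_q3:.3f}, from Q4-only pattern: {from_q4:.3f}")
```

Both differences return the same CI share in expectation, which is why the two comparisons can be treated as two estimators of one parameter.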

2.4 Fairness and ethical concerns

In addition to helping with the identification, the use of questions Q3 and Q4 plays a crucial role in guaranteeing that the experiment is not only ex-ante fair for all candidates (which is accomplished given that all candidates had the same probability of being assigned to the treatment or the control group), but also ex-post fair: we wanted to place candidates in analogous situations, irrespective of the results of the randomization. Suppose we had only one question. The main concern would be that, if CI types are prevalent in the pool of candidates, then those assigned to the treatment group would have an advantage, as they would be more likely to answer this question correctly.

With questions Q3 and Q4, we still have the issue that CI types assigned to the treatment group would notice the 99 issue in Q3, so they would be able to answer both Q3 and Q4 correctly, whereas those assigned to the control group would only answer Q4 correctly. However, candidates are able to edit their answers before submitting them, which eliminates this problem. More specifically, CI types assigned to the control group would only realize that 99 values represent missing values when answering Q4, and they would then be able to edit their answer to Q3 before submitting it for the screening process. The fact that only final answers were used for the screening process, while we rely on initial responses for the experiment, allows the experiment to be ex-post fair while still providing the relevant information for identification of the proportion of participants who only spot the error when it leads to an unexpected result.

We also piloted the experiment in four different recruitment processes with a different partner and a similar design. Results did not show any statistical difference in final recruitment scores between treated and control individuals. The pre-analysis plan presents the results for these pilots. In the experiment we also obtain the same result that there is no statistically significant difference in the final score across the two groups. It is also worth mentioning that the recruitment process already measured coding proficiency with randomized questions. The only manipulation for this RCT was on the structure of the dataset and questions.

2.5 Descriptive Statistics

Our total sample comprises 1,036 task takers. Among them, 633 (61.1%) completed the data task in the 2024 wave, and 406 (39.9%) in the 2025 wave. Every time a candidate started the task, they would be randomized into the control or treatment group. A total of 527 (50.9%) task takers were randomized into the control group, and 509 (49.1%) were randomized into treatment. In order to correctly identify whether individuals saw the 99 as coding missing values or not, we depend on individuals knowing how to run a regression. Therefore, we define a subset of our sample, the qualified sample, which correctly estimated an OLS regression in question 2, the question preceding the two questions we use in the experiment. A total of 807 (77.9%) candidates are in the qualified sample. Our pre-analysis plan specified at least 800 observations in the qualified sample.

Table 2 shows descriptive statistics of the sample. The first three columns show the overall mean value, mean for the control group, and mean for the treated group for each variable. The next two columns show the estimated difference and p-value for the regression-adjusted balance test of equality of means. The last column shows the number of observations with non-missing information. We can see that 37.7% of the sample are female, 93.7% have a master’s degree or are enrolled in a master’s program, and 86.4% have already taken an Econometrics course. In terms of computational language, 50.8% use Stata, 29.5% R, and 15.9% Python, and the remaining 3.8% use other software. The average score in the initial two coding questions is 4.44 out of 6 possible points. In the recruitment process of the partner institute, they also recorded coding proficiency and scored individuals for their knowledge on impact evaluation. Coding score reflects answers to multiple choice questions on the language of their choice (options were Stata, R, or Python). We assigned each individual the corresponding score from the language they choose to complete our data task. We standardized this variable separately for each software language to have mean zero and unitary standard deviation in the control group. Impact evaluation questions measured individuals familiarity and knowledge on impact evaluation and econometric questions. We also standardized this variable to have mean zero and unitary standard deviation in the control group. In terms of beliefs about the effects of the hypothetical RCT, the average value is 0.126 standard deviations, in the middle of the interval of the presented papers. Figure A.1 shows the histogram for these priors. Consistent with the randomization protocol, we do not find significant differences between treated and control groups in terms of these covariates. 
The p-value of a joint test that means of all these variables are the same between these two groups is 0.463. Table A.1 reproduces these results for the qualified sample. The p-value of a joint test that means of all covariates are the same between treated and control groups is 0.606 for the qualified sample. We do find, however, statistically significant differences in the proportions of individuals with master’s degree or higher. As pre-specified in the PAP, we include this (and the other covariates) as controls in our main specifications. Reassuringly, all results remain similar upon the inclusion of covariates. Additionally, Tables 5 and A.4 show results conditioning on these variables.

3 Empirical Strategy

Our empirical strategy directly exploits the random assignment of the ordering of the negative–positive results in questions 3 and 4. Let $Y^{Q3}_{i}$ be the indicator for whether candidate $i$ spotted the error in the first question used to estimate the causal effects (question 3). Candidate $i$ is treated, $T_{i}=1$ , if she receives the negative estimate in the first question. We estimate our treatment effects in a specification that interacts the treatment indicator with the demeaned control variables and wave indicators, as suggested by Lin (2013),

[TABLE]

where $\tilde{X}_{i}$ are all demeaned covariates: gender, whether the candidate took an econometrics course, whether the candidate has a master’s degree or above, and the initial score in the screening questions. In addition to the above covariates, we include a control for the (demeaned) wave indicator ( $\tilde{W}_{i}$ ) and the interaction of all covariates and the wave indicator, to account for the potentially different sets of candidates in each wave. The second line uses the variable $Z$ as shorthand for all these variables. $\alpha_{1}$ measures the proportion of individuals in the control group who spot the error. $\beta_{1}$ is our coefficient of interest, measuring the differential probability of detecting the coding error for individuals observing the negative effect in the first question. We estimate Equation 1 using OLS with robust standard errors. This is the equation and estimation method pre-specified in our pre-analysis plan. In the pre-analysis plan, we also specified that we would conduct inference using a unilateral hypothesis test. For completeness, we present p-values for both unilateral and bilateral tests.
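A minimal sketch of a fully interacted regression in the spirit of Lin (2013) is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function name `lin_estimator` is hypothetical, the wave indicator is treated as just another column of `X`, the robust variance uses the HC1 variant, p-values rely on a normal approximation, and the one-sided test takes the alternative to be a positive treatment coefficient.

```python
import math

import numpy as np


def lin_estimator(y, t, X):
    """OLS of y on treatment, demeaned covariates, and their interactions.

    y: outcome (n,), t: treatment indicator (n,), X: covariates (n, k).
    Returns the intercept, the treatment coefficient, its robust SE,
    and two-sided / one-sided p-values (normal approximation).
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    X = np.asarray(X, dtype=float)

    # Demean covariates so the coefficient on t is the average effect.
    Xc = X - X.mean(axis=0)
    D = np.column_stack([np.ones_like(t), t, Xc, t[:, None] * Xc])

    # OLS coefficients.
    bread = np.linalg.inv(D.T @ D)
    coef = bread @ D.T @ y
    resid = y - D @ coef

    # Heteroskedasticity-robust (HC1) covariance; the variant is an assumption.
    n, k = D.shape
    meat = (D * resid[:, None] ** 2).T @ D
    cov = bread @ meat @ bread * n / (n - k)
    se_beta = math.sqrt(cov[1, 1])

    z = coef[1] / se_beta
    return {
        "alpha": float(coef[0]),
        "beta": float(coef[1]),
        "se_beta": se_beta,
        "p_two_sided": math.erfc(abs(z) / math.sqrt(2)),
        "p_one_sided": 0.5 * math.erfc(z / math.sqrt(2)),  # H1: beta > 0
    }
```

Because the covariates are demeaned before being interacted with the treatment indicator, the coefficient on the treatment indicator remains interpretable as the average difference in detection probabilities.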

As discussed in Section 2.3, it is also possible to identify the proportion of individuals who only spot the coding error when this leads to an unexpected result by comparing the proportion of individuals who only spot the error in question 4 in the control and treated groups (importantly, in this case it is the proportion of controls minus the proportion of treated). In practice, we can implement this identification strategy using the following regression:

[TABLE]

where we define $\tilde{Y}^{Q4}_{i}$ as one if individual $i$ spotted the error for the first time in Q4 (thus, not in Q3). The coefficient $\alpha_{2}$ measures the proportion of individuals in the treated sample who spot the error because they saw the flipped results, that is, the CII type, while $\beta_{2}$ captures the same parameter of interest as $\beta_{1}$: the proportion of individuals who spot the error only because they see the negative result, that is, the CI type. We also estimate Equation 2 by OLS with robust standard errors.
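To make the construction of the Q4 outcome concrete, the sketch below builds the "first spotted in Q4" indicator and computes the no-covariate analogue of the two coefficients. The function names are hypothetical, and dropping the covariates is a simplification for illustration only.

```python
def first_spot_in_q4(spot_q3, spot_q4):
    """Indicator equal to 1 if the error was spotted in Q4 but not in Q3."""
    return [int(bool(q4) and not bool(q3)) for q3, q4 in zip(spot_q3, spot_q4)]


def control_minus_treated(outcome, treated):
    """No-covariate analogue of the Q4 regression.

    Returns (alpha_2, beta_2): the treated-group mean of the outcome and
    the control-group mean minus the treated-group mean.
    """
    treated_vals = [y for y, t in zip(outcome, treated) if t == 1]
    control_vals = [y for y, t in zip(outcome, treated) if t == 0]
    alpha_2 = sum(treated_vals) / len(treated_vals)
    beta_2 = sum(control_vals) / len(control_vals) - alpha_2
    return alpha_2, beta_2
```

Note the sign convention: the contrast is controls minus treated, because in Q4 it is the control group that observes the unexpected (negative) result.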

As both $\hat{\beta_{1}}$ and $\hat{\beta_{2}}$ are different estimators of the same parameter of interest, we can combine their estimations to achieve a more efficient estimator. We do it in two ways. The first estimator combines both estimates, choosing the weights that minimize the variance. That is,

[TABLE]

where $\omega$ is chosen to minimize the variance of $\hat{\beta}_{\text{combined}}$ . That is,

[TABLE]

To take into account that $\omega$ uses estimated variances and covariances, we use bootstrap at the individual level to conduct inference for $\hat{\beta}_{\text{combined}}$ . At every bootstrap iteration, we compute $\hat{\beta_{1}}$ and $\hat{\beta_{2}}$ , their variance-covariance matrix, and then the combined estimator. The resulting bootstrap p-values are very close to the analytical ones obtained ignoring how $\omega$ uses estimated variances and covariances.
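The combination step can be sketched as follows, under several assumptions that are not in the original: the two estimators are replaced by simple differences in means (no covariates), the weight $\omega$ is placed on the Q3 estimator, and the bootstrap returns a standard error rather than the p-values reported in the paper. All function names are hypothetical.

```python
import numpy as np


def arm_estimates(y3, y4, t):
    """Difference-in-means estimates of beta_1 and beta_2 with their (co)variances.

    y3: spotted in Q3; y4: first spotted in Q4; t: treatment indicator.
    """
    y3, y4, t = (np.asarray(a, dtype=float) for a in (y3, y4, t))
    tr, co = t == 1, t == 0
    b1 = y3[tr].mean() - y3[co].mean()   # treated minus control (Q3)
    b2 = y4[co].mean() - y4[tr].mean()   # control minus treated (Q4)
    n1, n0 = tr.sum(), co.sum()
    cov1 = np.cov(y3[tr], y4[tr])        # 2x2 within-arm covariance, treated
    cov0 = np.cov(y3[co], y4[co])        # 2x2 within-arm covariance, control
    v1 = cov1[0, 0] / n1 + cov0[0, 0] / n0
    v2 = cov1[1, 1] / n1 + cov0[1, 1] / n0
    c12 = -(cov1[0, 1] / n1 + cov0[0, 1] / n0)  # signs flip between b1 and b2
    return b1, b2, v1, v2, c12


def min_variance_weight(v1, v2, c12):
    """Weight on b1 minimizing Var(w*b1 + (1-w)*b2)."""
    return (v2 - c12) / (v1 + v2 - 2.0 * c12)


def combined_estimate(y3, y4, t):
    b1, b2, v1, v2, c12 = arm_estimates(y3, y4, t)
    w = min_variance_weight(v1, v2, c12)
    return w * b1 + (1.0 - w) * b2, w


def bootstrap_se(y3, y4, t, reps=500, seed=0):
    """Individual-level bootstrap: re-estimate the weight in every resample."""
    rng = np.random.default_rng(seed)
    y3, y4, t = (np.asarray(a, dtype=float) for a in (y3, y4, t))
    n = len(t)
    draws = []
    for _ in range(reps):
        idx = rng.integers(0, n, size=n)
        draws.append(combined_estimate(y3[idx], y4[idx], t[idx])[0])
    return float(np.std(draws, ddof=1))
```

Recomputing the weight inside each bootstrap iteration is what lets the resampling distribution reflect the sampling noise in the estimated variances and covariance.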

Another approach to obtain a more efficient estimator is to jointly estimate Equations 1 and 2, imposing the same coefficient $\beta$ . We do so using a GMM estimator that stacks the moments of the two equations together. As in the OLS estimator, the moments are that the covariances of residuals and all variables are zero. Estimators $\hat{\beta_{2}}$ , $\hat{\beta}_{\text{combined}}$ and $\hat{\beta}_{\text{GMM}}$ were not in our pre-analysis plan, as we had not planned to use the data from Q4. After the implementation, we saw that Q4 provides variation as useful and clean as that from the first question, at the cost of a minimal additional assumption.

4 Results

4.1 Main Result

Table 3 presents the main results. In the first column, we present the estimates using the first estimator (presented in Equation 1). In the first panel, we only add the wave fixed effects. We can see that only 7.8% of the control group identifies the 99 in Q3 (the intercept from Equation 1). For the treatment group, the proportion spotting the error is 1.03pp higher, increasing it to 8.8%, although this estimate is not statistically significant. The next three panels add, sequentially, all the control variables: all demographic controls (Panel B), data test measures (computational language used, and initial proficiency scores) in Panel C, and screening variables (coding and impact evaluation scores) in Panel D. The results are all very similar; the point estimate ranges between 0.79pp and 1.11pp, with relatively large standard errors.

In the second column, we implement the OLS estimator using the data from question 4. The intercept from this regression reveals that only 0.5% of individuals in the treated group spot the error only in this question (that is, those who detect the error when they see the flipped result). In the control group, the proportion that only spots the error in Q4 is higher, yielding a $\hat{\beta}_{2}$ equal to 1.46pp. It is remarkable how close this estimate is to the estimate using only the first question (1.03pp). Moreover, this estimator is considerably more precise than the previous one. This happens because the proportion of treated individuals who only spot in the second question is very close to zero. Even without additional controls, this estimate is marginally significant with a p-value of 0.0596 for a bilateral test (and statistically significant at 5% when we consider a unilateral test, with a p-value of 0.0298). When we add controls in the next panels, the estimate is very stable, ranging from 1.30pp to 1.57pp, with p-values for the bilateral test in the interval 5.1%–8.6%.

Results in the third column optimally combine the two estimators, given their variances and covariances. The estimate without additional controls is 1.41pp. This is an 18.1% increase over the baseline detection probability. Including all controls does not change the result. The estimates are between 1.24pp and 1.50pp, with p-values ranging between 4.0% and 7.9% for the bilateral test and between 2.1% and 3.8% for the unilateral ones. The results using the GMM estimator, which also combines both sources of identification, are very similar (column 4).
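As a quick check of the relative effect, using the rounded figures reported in the text (a baseline detection rate of 7.8% in the control group and a combined estimate of 1.41pp):

$$\frac{\hat{\beta}_{\text{combined}}}{\hat{\alpha}_{1}}\approx\frac{1.41}{7.8}\approx 0.181,$$

that is, roughly an 18% relative increase in the probability of detecting the error.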

The results so far used the qualified sample, that is, the sample of individuals who know how to run a regression. Table 4 shows our main estimates for three alternative samples. Column 2 shows the results for the sample of all individuals. As expected, the proportion of control individuals spotting the error in the first question is smaller (6.8% compared to 7.8% in the qualified sample). The point estimate is also smaller (0.97pp versus 1.41pp). That is also expected: we cannot identify whether these additional individuals correctly spot the 99 or not, because they likely do not know how to run OLS regressions, as they failed or did not answer question 2. Indeed, we find that only 27.6% of the unqualified sample provided a correct OLS point estimate (whether taking the 99 values into account or not), compared to 86.8% for the qualified sample. In the third column, we drop from the qualified sample individuals who report negative values for the prior effect, as pre-specified in the PAP. The results look very similar, as only 3 individuals in the sample reported negative values. Lastly, in the fourth column, we restrict the qualified sample to the individuals who had priors close to the presented studies. Here, we consider those who reported priors in the interval 0.08–0.16 standard deviations. The point estimate is slightly larger, 1.54pp versus 1.41pp in the baseline estimation. We would expect this number to be larger, as these individuals have priors aligned with the literature and were therefore expecting positive results of the same magnitude as those we presented.

4.2 Heterogeneity

In this section, we investigate whether there is evidence of heterogeneous effects along some dimensions we observe in the data. For this exercise, we use the entire sample for two main reasons. First, it gives us larger groups when we split the analysis into sub-samples. Second, some exercises aim to compare individuals with lower or higher coding ability and skills, and therefore it would be inconsistent to select only those who answered some screening questions correctly, as we do in the qualified sample. Nevertheless, Table A.4 shows the same results for the qualified sample. The first two rows of Table 5 show the benchmark results for the entire and qualified samples. As the subsequent panels use the entire sample, the first row should be used as a comparison.

The first three panels split the individuals according to different ways of assessing their coding abilities: whether they correctly clustered the standard errors in question 2 (panel 1), their score in the two initial screening questions (panel 2), and their coding score in the partner institute assessment (panel 3). Across the three analyses, the results are very similar. Groups we expect to perform better (those who know how to cluster standard errors and those with higher initial scores) have higher probabilities of correctly handling the 99 missing values regardless of the results the coding error would generate. However, these advantages do not make them less vulnerable to the bias. On the contrary, the point estimates are larger for the more trained subgroups, although the differences are only significant at 10% for the comparison between those who did and did not correctly cluster the standard errors. The next three panels analyze the results by training and knowledge of econometrics and impact evaluation. We see similar results: those with more training and knowledge (higher impact evaluation scores, a master’s degree, or a completed econometrics course) have larger point estimates, although the differences are not statistically significant. In panels 5 and 6, we do not estimate the results for those without a master’s degree or who did not take an econometrics course because the sample is so small that we do not have enough variation to estimate the combined estimator.

The last panel shows heterogeneity by gender, where we observe that women have a slightly lower baseline detection probability and larger point estimates for the bias. However, we cannot reject that the results are the same for both samples. In addition to these results, Appendix Table A.3 shows results separately for each wave.

5 Discussion

Our main result shows that individuals are significantly more likely to detect coding errors when those errors lead to unexpected results. This suggests that error detection is not a neutral process: it depends on whether the output aligns with prior expectations. In our setting, where the same coding error leads to either an expected or unexpected result depending on random assignment, we find that the unexpected result prompts greater debugging effort.

While our experimental design focuses on whether coding errors lead to expected or unexpected results, a natural conjecture is that this mechanism may also extend to favorable results—that is, results that researchers view as more likely to be published or that support their hypotheses. If researchers are less inclined to scrutinize favorable outcomes, then coding errors that generate such results may be less likely to be detected, potentially introducing systematic bias into the published literature.

Our findings are particularly relevant for placebo tests, where researchers typically expect to find no significant effects—and where expected results are often also seen as favorable. In such cases, if a coding error leads to a statistically insignificant placebo result, researchers may interpret this as confirmation that the test “worked” and may forgo further scrutiny. As a consequence, coding errors may lead to an excess of false-negative findings, masking potential violations of identifying assumptions even when researchers act in good faith.

More broadly, our results indicate that debugging is a costly activity, and thus the amount of effort researchers invest in it may depend on both their expectations and their incentives. Institutional practices—such as requiring code disclosure or pre-publication code review—may therefore significantly influence researchers’ debugging efforts by altering their incentives. By increasing anticipated scrutiny, such practices could encourage more thorough error detection, potentially reducing biases arising from undetected mistakes.

6 Conclusion

In recent years, the Economics profession has seen increased concerns about the reproducibility and replicability of research findings. Many journals have set up policies requiring data and code availability to increase research transparency. In this paper, we experimentally test whether individuals are more likely to find coding errors when they lead to unexpected results. We find that the probability of spotting a simple and common coding error increases by almost 20% when the error leads to an unexpected result. This indicates that coding errors may not only increase the dispersion of results observed in empirical research but may also bias scientific inquiry. The results reinforce the necessity of policies that increase transparency in empirical science.

Online Appendices

Appendix A Additional figures and tables

Appendix B Latent types expanded

In the main text, we present our identification strategy, classifying individuals into four latent types. In this appendix, we expand this classification to relax the hypothesis that the Complier II type (CII) detects the error irrespective of the order of results. To do so, we subdivide this type into three exhaustive cases, so that there are six latent types:

1. Always-spot (AS): those who always spot the error, irrespective of the result;

2. Never-spot (NS): those who never spot the error, irrespective of the result;

3. Complier I (CI): those who spot the coding error if it leads to unexpected results;

4. Complier II (CII): those who spot the coding error if they find conflicting results between the two answers:

• CII-A: spot the error if they find conflicting results between the two answers, irrespective of their signs.

• CII-B: spot the error if they find conflicting results between the two answers when the first is the positive and the second is the negative one, but not the other way around.

• CII-C: spot the error if they find conflicting results between the two answers when the first is the negative and the second is the positive one, but not the other way around.

Figure A.2 shows all the latent types by their respective results in Q3 and Q4 if they are in the treatment or control group. First, note that our first source of identification — which compares the proportion of treated and controls who spotted the error in Q3 — is not affected by the presence of the sub-categories of CII types.

For the second identification approach, which contrasts those spotting only in Q4 between treatment and control, we would identify the following quantity $\Delta_{2}$ :

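$$\Delta_{2}=\underbrace{\left(\text{CI}+\text{CII-A}+\text{CII-B}\right)}_{\text{spot only in Q4, control}}-\underbrace{\left(\text{CII-A}+\text{CII-C}\right)}_{\text{spot only in Q4, treated}}=\text{CI}+\text{CII-B}-\text{CII-C}$$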

That is, we would identify the proportion of CI types, plus the difference between CII-B and CII-C types in the population. Therefore, this approach recovers the proportion of CI if these two types have the same proportion (or do not exist). If $\text{CII-B}<\text{CII-C}$ , that is, if individuals are more likely to spot the error after seeing negative-positive than positive-negative results, then we would underestimate our target parameter. If $\text{CII-B}>\text{CII-C}$ , we would overestimate the proportion of CI types. Note, however, that this difference also reflects individuals whose debugging probabilities differ with the results they face, which is exactly what we want to test with this RCT. Therefore, we do not see it as necessarily a bias, but as evidence that debugging probabilities depend on the observed outcome when there is a coding error. Additionally, it is worth mentioning that the sum of the proportions of CII-A and CII-C types is identified by the proportion of individuals spotting only in Q4 in the treatment group. Since this proportion is very small (0.5%), this implies that the proportion of CII-C is also very small.
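The accounting behind both identification approaches can be checked by enumerating the six latent types. The sketch below encodes each type's behavior as described in this appendix. The population shares in `SHARES` are purely hypothetical numbers chosen for illustration, and the function names are not from the paper.

```python
# Sign of the estimate shown in (Q3, Q4): controls see the expected (positive)
# result first; treated individuals see the unexpected (negative) result first.
ARMS = {"control": ("+", "-"), "treated": ("-", "+")}
TYPES = ["AS", "NS", "CI", "CII-A", "CII-B", "CII-C"]


def first_spot(latent_type, q3, q4):
    """Return 'Q3', 'Q4', or None: where this type first spots the error."""
    if latent_type == "AS":
        return "Q3"
    if latent_type == "NS":
        return None
    if latent_type == "CI":  # reacts to the unexpected (negative) result
        if q3 == "-":
            return "Q3"
        return "Q4" if q4 == "-" else None
    if latent_type == "CII-A":  # reacts to any conflict between Q3 and Q4
        return "Q4" if q3 != q4 else None
    if latent_type == "CII-B":  # reacts only to positive-then-negative
        return "Q4" if (q3, q4) == ("+", "-") else None
    if latent_type == "CII-C":  # reacts only to negative-then-positive
        return "Q4" if (q3, q4) == ("-", "+") else None
    raise ValueError(f"unknown type: {latent_type}")


def types_first_spotting(arm, question):
    q3, q4 = ARMS[arm]
    return {k for k in TYPES if first_spot(k, q3, q4) == question}


def share(types, shares):
    return sum(shares[k] for k in types)


# Hypothetical population shares (illustrative only; they sum to one).
SHARES = {"AS": 0.07, "NS": 0.90, "CI": 0.015,
          "CII-A": 0.005, "CII-B": 0.006, "CII-C": 0.004}

# First approach: treated minus control, spotting in Q3.
delta_1 = (share(types_first_spotting("treated", "Q3"), SHARES)
           - share(types_first_spotting("control", "Q3"), SHARES))

# Second approach: control minus treated, spotting only in Q4.
delta_2 = (share(types_first_spotting("control", "Q4"), SHARES)
           - share(types_first_spotting("treated", "Q4"), SHARES))
```

With these illustrative shares, the first contrast isolates the CI share, while the second equals the CI share plus the gap between the CII-B and CII-C shares, matching the discussion above.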

Appendix C Data Task

Figures A.3–A.14 below reproduce the six parts of the data task as seen by the task takers.

Bibliography


1. Adda et al. (2020). Jérôme Adda, Christian Decker, and Marco Ottaviani. P-hacking in clinical trials and how incentives shape the distribution of results across phases. Proceedings of the National Academy of Sciences, 117(24):13386–13392, 2020.
2. Anderson et al. (2008). Richard Anderson, William Greene, B. D. McCullough, and H. D. Vinod. The role of data/code archives in the future of economic research. Journal of Economic Methodology, 15(1):99–119, 2008. doi: 10.1080/13501780801915574.
3. Andrews and Kasy (2019). Isaiah Andrews and Maximilian Kasy. Identification of and correction for publication bias. American Economic Review, 109(8):2766–2794, 2019.
4. Ankel-Peters et al. (2023). Jörg Ankel-Peters, Nathan Fiala, and Florian Neubauer. Do economists replicate? Journal of Economic Behavior & Organization, 212:219–232, 2023.
5. Ashenfelter and Greenstone (2004). Orley Ashenfelter and Michael Greenstone. Estimating the value of a statistical life: The importance of omitted variables and publication bias. American Economic Review, 94(2):454–460, 2004. doi: 10.1257/0002828041301955.
6. Ashenfelter et al. (1999). Orley Ashenfelter, Colm Harmon, and Hessel Oosterbeek. A review of estimates of the schooling/earnings relationship, with tests for publication bias. Labour Economics, 6(4):453–470, 1999.
7. Banerjee et al. (2016). Abhijit Banerjee, Rukmini Banerji, James Berry, Esther Duflo, Harini Kannan, Shobhini Mukherji, Marc Shotland, and Michael Walton. Mainstreaming an effective intervention: Evidence from randomized evaluations of “Teaching at the Right Level” in India. Technical report, National Bureau of Economic Research, 2016.
8. Banerjee et al. (2007). Abhijit V. Banerjee, Shawn Cole, Esther Duflo, and Leigh Linden. Remedying education: Evidence from two randomized experiments in India. The Quarterly Journal of Economics, 122(3):1235–1264, 2007.