TBar: Revisiting Template-based Automated Program Repair

Kui Liu; Anil Koyuncu; Dongsun Kim; Tegawendé F. Bissyandé

arXiv:1903.08409·cs.SE·June 7, 2019


TL;DR

This paper evaluates the effectiveness of template-based automated program repair (APR) by building TBar, a tool that systematically applies fix patterns, demonstrating unprecedented success on the Defects4J benchmark.

Contribution

The paper introduces TBar, a simple yet effective APR tool that leverages a comprehensive set of fix patterns, highlighting the importance of fix pattern diversity and the role of fault localization.

Findings

01. With perfect fault localization, TBar correctly fixes 74 bugs and plausibly fixes 101.

02. With a standard fault localization pipeline, TBar correctly fixes 43 Defects4J bugs, surpassing previous approaches.

03. Fix pattern diversity significantly impacts APR effectiveness.

Abstract

We revisit the performance of template-based APR to build comprehensive knowledge about the effectiveness of fix patterns, and to highlight the importance of complementary steps such as fault localization or donor code retrieval. To that end, we first investigate the literature to collect, summarize and label recurrently-used fix patterns. Based on the investigation, we build TBar, a straightforward APR tool that systematically attempts to apply these fix patterns to program bugs. We thoroughly evaluate TBar on the Defects4J benchmark. In particular, we assess the actual qualitative and quantitative diversity of fix patterns, as well as their effectiveness in yielding plausible or correct patches. Eventually, we find that, assuming a perfect fault localization, TBar correctly/plausibly fixes 74/101 bugs. Replicating a standard and practical pipeline of APR assessment, we demonstrate…


Tables

Table 1. Literature review on fix patterns for Java programs.

Authors	APR tool name	# of fix patterns	Publication Venue	Publication Year
Pan et al. (Pan et al., 2009)	-	27	EMSE	2009
Kim et al. (Kim et al., 2013)	PAR	10 (16^∗)	ICSE	2013
Martinez et al. (Martinez and Monperrus, 2016)	jMutRepair	2	ISSTA	2016
Durieux et al. (Durieux et al., 2017)	NPEfix	9	SANER	2017
Long et al. (Long et al., 2017)	Genesis	3 (108^∗)	FSE	2017
D. Le et al. (Le et al., 2017)	S3	4	FSE	2017
Saha et al. (Saha et al., 2017)	ELIXIR	8 (11^∗)	ASE	2017
Hua et al. (Hua et al., 2018)	SketchFix	6	ICSE	2018
Liu and Zhong (Liu and Zhong, 2018)	SOFix	12	SANER	2018
Koyuncu et al. (Koyuncu et al., 2018)	FixMiner	28	UL Tech Report	2018
Liu et al. (Liu et al., 2018b)	-	174	TSE	2018
Rolim et al. (Rolim et al., 2018)	REVISAR	9	UFERSA Tech Report	2018
Liu et al. (Liu et al., 2019c)	AVATAR	13	SANER	2019
D. Le et al. (Le et al., 2016b)	HDRepair^†	11	SANER	2016
Xin and Reiss (Xin and Reiss, 2017)	ssFix^†	34	ASE	2017
Wen et al. (Wen et al., 2018)	CapGen^†	30	ICSE	2018
Jiang et al. (Jiang et al., 2018)	SimFix^†	16	ISSTA	2018

Table 2. Change properties of fix patterns.

Fix Pattern	Change Action	Change Granularity	Bug Context	Change Spread
FP1	Insert	statement	cast expression	single
FP2.1	Insert	statement	a variable or an expression returning non-primitive-type data	single
FP2.(2,3,4,5)	Insert	statement	a variable or an expression returning non-primitive-type data	dual
FP3	Insert	statement	element access of array or collection variable	single
FP4.(1,2,3,4)	Insert	statement	any statement	single
FP5	Update	expression	class instance creation expression and clone method	single
FP6.1	Update	expression	conditional expression	single
FP6.2	Delete	expression	conditional expression	single
FP6.3	Insert	expression	conditional expression	single
FP7.1	Update	expression	variable declaration expression	single
FP7.2	Update	expression	cast expression	single
FP8.(1,2,3)	Update	expression	integral division expression	single
FP9.(1,2)	Update	expression	literal expression	single
FP10.1	Update	expression, or statement	method invocation, class instance creation, constructor, or super constructor	single
FP10.2	Update	expression, or statement	method invocation, class instance creation, constructor, or super constructor	single
FP10.3	Delete	expression, or statement	method invocation, class instance creation, constructor, or super constructor	single
FP10.4	Insert	expression, or statement	method invocation, class instance creation, constructor, or super constructor	single
FP11.1	Update	expression	assignment or infix-expression	single
FP11.2	Update	expression	arithmetic infix-expression	single
FP11.3	Update	expression	instanceof expression	single
FP12	Update	expression	return statement	single
FP13.(1,2)	Update	expression	variable expression	single
FP14	Move	statement	any statement	single or multiple
FP15.1	Delete	statement	any statement	single or multiple
FP15.2	Delete	method	any statement	multiple

Table 3. Diversity of fix patterns w.r.t. change properties.

Action Type	# fix patterns	Granularity	# fix patterns	Spread	# fix patterns
Update	17	Expression	21	Single-Statement	30
Delete	4	Statement	17	Multiple-Statements	7
Insert	13	Method	1		
Move	1				

Table 4. Defects4J dataset information.

Project	Chart (C)	Closure (Cl)	Lang (L)	Math (M)	Mockito (Mc)	Time (T)	Total
# bugs	26	133	65	106	38	27	395
# test cases	2,205	7,927	2,245	3,602	1,457	4,130	21,566
# fixed bugs by all APR tools (cf. (Liu et al., 2019b; Liu et al., 2019c))	13	16	28	37	3	4	101

Table 5. Number of bugs fixed by fix patterns with $\texttt{TBar}_{p}$.

Fixed Bugs	C	Cl	L	M	Mc	T	Total
# of Fully Fixed Bugs	12/13	20/26	13/18	23/35	3/3	3/6	74/101
# of Partially Fixed Bugs	2/4	3/6	1/4	0/4	0/0	1/1	7/20

Table 6. Defects4J bugs fixed by fix patterns.

Fix pattern columns, left to right: FP1; FP2.1 to FP2.5; FP3; FP4.1 to FP4.4; FP5; FP6.1 to FP6.3; FP7.1, FP7.2; FP8.1 to FP8.3; FP9.1, FP9.2; FP10.1 to FP10.4; FP11.1 to FP11.3; FP12; FP13.1, FP13.2; FP14; FP15.1, FP15.2. The column position of each mark was lost in extraction, so the marks of each row are listed in source order, followed by the last cell of the row.

Bug ID	Marks	Last cell
C-1	❍ ⚫ ❍	1/5
C-4	⚫ ❍ ⚫	2/3
C-7	❍ ◐	1/2
C-8	⚫	1/1
C-9	⚫ ❍	1/2
C-11	⚫	1/1
C-12	⚫	1/1
C-14	⚫ ❍	2/3
C-18	❍ ⚫	1/5
C-19	⚫	1/1
C-20	⚫	1/1
C-24	⚫	1/1
C-25	⚫ ❍	1/3
C-26	⚫ ❍ ⚫	2/3
Cl-2	⚫ ❍	1/3
Cl-4	⚫	1/1
Cl-6	❍ ⚫	1/6
Cl-10	⚫	1/1
Cl-11	❍ ⚫	1/5
Cl-13	⚫	1/1
Cl-18	⚫ ❍	1/2
Cl-21	❍ ⚫	1/5
Cl-22	❍ ⚫	1/5
Cl-31	⚫ ❍	1/2
Cl-38	❍ ⚫	1/3
Cl-40	⚫	1/1
Cl-46	⚫	1/1
Cl-62	❍ ◐ ❍	1/5
Cl-63	❍ ◐ ❍	1/5
Cl-70	⚫	1/1
Cl-73	⚫	1/1
Cl-85	⚫	1/1
Cl-86	⚫	1/1
Cl-102	⚫	2/2
Cl-106	❍ ⚫	1/2
Cl-115	❍ ⚫	1/5
Cl-126	❍ ⚫	1/6
L-6	⚫	1/1
L-7	❍ ⚫	1/4
L-10	◐ ⚫	2/2
L-15	❍ ⚫ ❍	1/5
L-22	❍ ◐ ❍	1/5
L-24	⚫	1/1
L-26	⚫	1/1
L-33	⚫	1/1
L-39	⚫ ❍	1/3
L-47	⚫	1/1
L-51	⚫	1/1
L-57	❍ ⚫	2/5
L-59	⚫	1/1
L-63	❍ ⚫	1/7
M-4	⚫	1/1
M-5	⚫ ❍	1/2
M-11	⚫	4/4
M-15	◐	1/1
M-22	⚫ ❍	1/2
M-30	◐	1/1
M-33	❍ ⚫	1/3
M-34	⚫	1/1
M-35	⚫	1/1
M-50	❍ ⚫	1/9
M-57	⚫	1/1
M-58	⚫	1/1
M-59	⚫ ❍	1/2
M-65	⚫	1/1
M-70	⚫	1/1
M-75	⚫	1/1
M-77	❍ ⚫ ❍ ⚫	2/4
M-79	◐	1/1
M-80	❍ ⚫ ❍	1/4
M-82	❍ ⚫ ❍	1/5
M-85	❍ ◐ ❍ ⚫ ❍	3/8
M-89	⚫	1/1
M-98	⚫	1/1
Mc-26	⚫	1/1
Mc-29	⚫	2/2
Mc-38	⚫	2/2
T-3	◐	1/1
T-7	⚫ ❍	1/2
T-19	❍ ⚫	1/2
T-26	⚫	1/1

Summary rows (values in source order; empty cells were dropped in extraction, so column alignment is not recoverable):

# 1	1, 6, 5, 4, 1, 0, 3, 1, 0, 1, 0, 1, 3, 5, 3, 0, 1, 6, 0, 3, 1, 3, 11, 1, 0, 12, 2, 13, 2
# 2	1, 7, 10, 6, 1, 0, 4, 1, 0, 14, 0, 15, 12, 32, 3, 0, 1, 6, 7, 4, 2, 3, 24, 2, 0, 1, 43, 19, 6, 25, 4

Table 7. Comparing TBar against the state-of-the-art APR tools.

Project	jGenProg	jKali	jMutRepair	HDRepair	Nopol	ACS	ELIXIR	JAID	ssFix	CapGen	SketchFix	FixMiner	LSRepair	SimFix	kPAR	AVATAR	TBar (fully fixed)	TBar (partially fixed)
Chart	0/7	0/6	1/4	0/2	1/6	2/2	4/7	2/4	3/7	4/4	6/8	5/8	3/8	4/8	3/10	5/12	9/14	0/4
Math	5/18	1/14	2/11	4/7	1/21	12/16	12/19	1/8	10/26	12/16	7/8	12/14	7/14	14/26	7/18	6/13	19/36	0/4
Total	5/27	1/22	3/17	6/23	5/35	18/23	26/41	9/31	20/60	21/25	19/26	25/31	19/37	34/56	18/49	27/53	43/81	2/18
P(%)	18.5	4.5	17.6	26.1	14.3	78.3	63.4	29.0	33.3	84.0	73.1	80.6	51.4	60.7	36.7	50.9	53.1	11.1

Rows whose empty cells were dropped in extraction (values in source order; column alignment is not recoverable):

Closure: 0/0, 0/7, 0/0, 5/11, 2/11, 0/0, 3/5, 5/5, 0/0, 6/8, 5/9, 8/12, 1/5
Lang: 0/0, 0/1, 2/6, 3/7, 3/4, 8/12, 1/8, 5/12, 5/5, 3/4, 2/3, 8/14, 9/13, 1/8, 5/11, 5/14, 0/3
Mockito: 0/0, 1/1, 0/0, 1/2, 2/2, 1/2, 0/0
Time: 0/2, 0/1, 1/1, 2/3, 0/0, 0/4, 0/0, 0/1, 1/1, 0/0, 1/1, 1/2, 1/3, 1/2

Table 8. Per-pattern repair performance.

	FP1	FP2.1	FP2.2	FP2.3	FP2.4	FP2.5	FP3	FP4.1	FP4.2	FP4.3	FP4.4	FP5	FP6.1	FP6.2	FP6.3	FP7.1	FP7.2	FP8.1	FP8.2	FP8.3	FP9.1	FP9.2	FP10.1	FP10.2	FP10.3	FP10.4	FP11.1	FP11.2	FP11.3	FP12	FP13.1	FP13.2	FP14	FP15.1	FP15.2
Correct	1	4	2	1	0	1	0	1	0	0	0	0	0	0	3	3	0	0	0	1	2	0	1	1	1	1	7	1	0	0	9	1	0	2	2
Avg position*	(1)	(16)	(1)	(5)	-	(5)	-	(5)	-	-	-	-	-	-	(23)	(16)	-	-	-	(9)	(1)	-	(2)	(62)	(6)	(1)	(12)	(18)	-	-	(5)	(1)	-	(2)	(1)
Plausible (all)	1	7	4	1	0	1	0	3	0	0	0	0	1	0	11	4	0	0	0	1	4	0	2	2	1	1	12	1	0	0	25	4	1	7	5
Avg position*	(1)	(12)^†	(191)	(5)	-	(5)	-	(20)	-	-	-	-	(8)	-	(27)^†	(15)	-	-	-	(9)	(18)	-	(4)	(49)	(6)	(1)	(15)^†	(18)	-	-	(8)^†	(20)	(15)	(26)	(16)

Equations

$$
\texttt{DEFAULT\_VALUE} =
\begin{cases}
\texttt{false}, & \text{if } RT = \text{boolean};\\
0, & \text{if } RT = \text{primitive type};\\
\texttt{new String()}, & \text{if } RT = \text{String};\\
\text{``return;''}, & \text{if } RT = \text{void};\\
\texttt{null}, & \text{otherwise.}
\end{cases}
$$


Full text

TBar: Revisiting Template-Based Automated Program Repair

Kui Liu, Anil Koyuncu, Dongsun Kim, Tegawendé F. Bissyandé

University of Luxembourg, Luxembourg

kui.liu, anil.koyuncu, dongsun.kim, [email protected]

(2019)

Abstract.

We revisit the performance of template-based APR to build comprehensive knowledge about the effectiveness of fix patterns, and to highlight the importance of complementary steps such as fault localization or donor code retrieval. To that end, we first investigate the literature to collect, summarize and label recurrently-used fix patterns. Based on the investigation, we build TBar, a straightforward APR tool that systematically attempts to apply these fix patterns to program bugs. We thoroughly evaluate TBar on the Defects4J benchmark. In particular, we assess the actual qualitative and quantitative diversity of fix patterns, as well as their effectiveness in yielding plausible or correct patches. Eventually, we find that, assuming a perfect fault localization, TBar correctly/plausibly fixes 74/101 bugs. Replicating a standard and practical pipeline of APR assessment, we demonstrate that TBar correctly fixes 43 bugs from Defects4J, an unprecedented performance in the literature (including all approaches, i.e., template-based, stochastic mutation-based or synthesis-based APR).

Automated program repair, fix pattern, empirical assessment.

Published in: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’19), July 15–19, 2019, Beijing, China. DOI: 10.1145/3293882.3330577. ISBN: 978-1-4503-6224-5/19/07.

CCS Concepts: Software and its engineering → Software verification and validation; Software defect analysis; Software testing and debugging.

1. Introduction

Automated Program Repair (APR) has progressively become an essential research field. APR research promises to improve modern software development by reducing the time and costs associated with program debugging tasks. In particular, given that faults in software cause substantial financial losses to the software industry (NIST, 2019; Britton et al., 2013), there is momentum toward minimizing time-to-fix intervals with APR. Recently, various APR approaches (Nguyen et al., 2013; Weimer et al., 2009; Le Goues et al., 2012b; Kim et al., 2013; Coker and Hafiz, 2013; Ke et al., 2015; Mechtaev et al., 2015; Long and Rinard, 2015; Le et al., 2016a, b; Long and Rinard, 2016b; Chen et al., 2017; Le et al., 2017; Long et al., 2017; Xuan et al., 2017; Xiong et al., 2017; Jiang et al., 2018; Wen et al., 2018; Hua et al., 2018; Liu et al., 2019c; Liu et al., 2019b, a) have been proposed, aiming at reducing manual debugging efforts through automatically generating patches.

An early strategy of APR is to generate concrete patches based on fix patterns (Kim et al., 2013) (also referred to as fix templates (Liu and Zhong, 2018) or program transformation schemas (Hua et al., 2018)). This strategy is now common in the literature and has been implemented in several APR systems (Kim et al., 2013; Saha et al., 2017; Durieux et al., 2017; Liu and Zhong, 2018; Hua et al., 2018; Koyuncu et al., 2018; Martinez and Monperrus, 2018; Liu et al., 2019c; Liu et al., 2019b). Kim et al. (Kim et al., 2013) showed the usefulness of fix patterns with PAR. Saha et al. (Saha et al., 2017) later proposed ELIXIR by adding three new patterns on top of PAR (Kim et al., 2013). Durieux et al. (Durieux et al., 2017) proposed NPEfix to repair null pointer exception bugs, using nine pre-defined fix patterns. Long et al. designed Genesis (Long et al., 2017) to infer fix patterns for three specific classes of defects. Liu and Zhong (Liu and Zhong, 2018) explored posts from Stack Overflow to mine fix patterns for APR. Hua et al. proposed SketchFix (Hua et al., 2018), a runtime on-demand APR tool with six pre-defined fix patterns. Recently, Liu et al. (Liu et al., 2019c) used the fix patterns of FindBugs static violations (Liu et al., 2018b) to fix semantic bugs. Concurrently, Ghanbari and Zhang (Ghanbari and Zhang, 2018) showed that straightforward application of fix patterns (i.e., mutators) on Java bytecode is effective for repair. They do not, however, provide a comprehensive assessment of the repair performance yielded by each implemented mutator.

Although the literature has reported promising results with fix pattern-based APR, to the best of our knowledge, no extensive assessment of the effectiveness of the various patterns has been performed. A few recent approaches (Liu and Zhong, 2018; Hua et al., 2018; Liu et al., 2019c) reported which benchmark bugs are fixed by each of their patterns. Nevertheless, many relevant questions on the effectiveness of fix patterns remain unanswered.

This paper. Our work thoroughly investigates to what extent fix patterns are effective for program repair. In particular, given the recurrence of some patterns in APR, we dissect their actual contribution to repair performance. Specifically, we explore three aspects of fix patterns:

• Diversity: How diverse are the fix patterns used by the state-of-the-art? We survey the literature to identify and summarize the available patterns with a clear taxonomy.

• Repair performance: How effective are the different patterns? In particular, we investigate the variety of real-world bugs that can be fixed, the dissection of repair results, and their tendency to yield plausible or correct patches.

• Sensitivity to fault localization noise: Are all fix patterns similarly sensitive to the false positives yielded by fault localization tools? We investigate sensitivity by assessing plausible patches as well as the suspiciousness rank of correctly-fixed bug locations.

Towards realizing this study, we implement an automated patch generation system, TBar (Template-Based automated program repair), with a super-set of fix patterns that are collected, summarized, curated and labeled from the literature data. We evaluate TBar on the Defects4J (Just et al., 2014) benchmark, and provide the replication package in a public repository: https://github.com/SerVal-DTF/TBar.

Overall, our investigations have yielded the following findings:

(1) Record performance: TBar creates a new higher baseline of repair performance: 74/101 bugs are correctly/plausibly fixed with perfect fault localization information and 43/81 bugs are correctly/plausibly fixed with realistic fault localization output.

(2) Fix pattern selection: Most bugs are correctly fixed by only a single fix pattern, while other patterns generate plausible patches. This implies that appropriate pattern prioritization can prevent plausible but incorrect patches. Otherwise, APR tools might overfit to plausible but incorrect patches.

(3) Fix ingredient retrieval: It is challenging for template-based APR to select appropriate donor code, which is an ingredient of patch generation when using fix patterns. Inappropriate donor code may cause plausible but incorrect patch generation. This motivates a new research direction: donor code prioritization.

(4) Fault localization noise: Fault localization accuracy has a large impact on repair performance when using fix patterns in APR (e.g., applying a fix pattern to an incorrect location yields plausible but incorrect patches).

2. Fix Patterns

For this study, we systematically review the APR literature to identify approaches that leverage fix patterns. (For conferences and journals, we consider ICSE, FSE, ASE, ISSTA, ICSME, SANER, TSE, TOSEM, and EMSE. The search keywords are ‘program’+‘repair’ and ‘bug’+‘fix’.) Concretely, we consider the program repair website (pro, 2019), a bibliography survey of APR (Monperrus, 2018), and proceedings of software engineering conference venues and journals as the source of relevant literature. We focus on approaches dealing with Java program bugs, and manually collect, from the paper descriptions as well as the associated artifacts, all pattern instances that are explicitly mentioned. Table 1 summarizes the identified relevant literature and the quantity of identified fix patterns targeting Java programs. Note that the techniques described in the last four papers (i.e., HDRepair, ssFix, CapGen, and SimFix papers) do not directly use fix patterns: they leverage code change operators or rules, which we consider similar to using fix patterns.

2.1. Fix Patterns Inference

Fix patterns have been explored in the following four ways:

(1) Manual Summarization: Pan et al. (Pan et al., 2009) identified 27 fix patterns from patches of five Java projects to characterize the fix ingredients of patches. They do not, however, apply the identified patterns to fix actual bugs. Motivated by this work, Kim et al. (Kim et al., 2013) summarized 10 fix patterns manually extracted from 62,656 human-written patches collected from Eclipse JDT.

(2) Mining: Long et al. (Long et al., 2017) proposed Genesis to infer fix patterns for three kinds of defects from existing patches. Liu and Zhong (Liu and Zhong, 2018) explored fix patterns from Q&A posts in Stack Overflow. Koyuncu et al. (Koyuncu et al., 2018) mined fix patterns at the AST level from patches by using a code change differencing tool (Falleri et al., 2014). Liu et al. (Liu et al., 2018b) and Rolim et al. (Rolim et al., 2018) proposed to mine fix patterns for static analysis violations. In general, mining approaches yield a large number of fix patterns, which are not always about addressing deviations in program behavior. For example, many patterns are about code style (Liu et al., 2019c). Recently, with AVATAR (Liu et al., 2019c), we proposed an APR tool that considers static analysis violation fix patterns to fix semantic bugs.

(3) Pre-definition: Durieux et al. (Durieux et al., 2017) pre-defined 9 repair actions for null pointer exceptions by unifying the related fix patterns proposed in previous studies (Dobolyi and Weimer, 2008; Kent, 2008; Long et al., 2014). On top of PAR (Kim et al., 2013), Saha et al. (Saha et al., 2017) further defined 3 new fix patterns to improve the repair performance. Hua et al. (Hua et al., 2018) proposed an APR tool with six pre-defined so-called code transformation schemas. We also consider operator mutations (Martinez and Monperrus, 2016) as pre-defined fix patterns, as the number of operators and mutation possibilities is limited and pre-set. Xin and Reiss (Xin and Reiss, 2017) proposed an approach to fixing bugs with 34 predefined code change rules at the AST level. Ten of the rules are not for transforming the buggy code but for the simple replacement of multi-statement code fragments. We discard these rules from our study to limit bias.

(4) Statistics: Besides formatted fix patterns, researchers (Wen et al., 2018; Jiang et al., 2018) also explored automating program repair with code change instructions (at the abstract syntax tree level) that are statistically recurrent in existing patches (Zhong and Su, 2015; Martinez and Monperrus, 2015; Liu et al., 2018c; Wen et al., 2017; Jiang et al., 2018). The strategy is then to select the top-n most frequent code change instructions as fix ingredients to synthesize patches.

2.2. Fix Patterns Taxonomy

After manually assessing all fix patterns presented in the literature (cf. Table 1), we identified 15 categories of patterns labeled based on the code context (e.g., a cast expression), the code change actions (e.g., insert an “if” statement with “instanceof” check) as well as the targets (e.g., ensure the program will not throw a ClassCastException). A given category may include one or several specialized sub-categories. Below, we present the labeled categories and provide the associated 35 Code Change Patterns, described in a simplified GNU diff format for easy understanding.

FP1. Insert Cast Checker. Inserting an instanceof check before one buggy statement if this statement contains at least one unchecked cast expression. Implemented in: PAR, Genesis, AVATAR, SOFix†, HDRepair†, SketchFix†, CapGen†, and SimFix†.

    + if (exp instanceof T) {
          var = (T) exp; ......
    + }

where exp is an expression (e.g., a variable expression) and T is the casting type, while “ $\ldots\ldots$ ” means the subsequent statements dependent on the variable var. Note that, “ $\dagger$ ” denotes that the fix pattern is not specifically illustrated in the corresponding APR tools since the tools have some abstract fix patterns that can cover the fix pattern. The same notation applies to the following descriptions.
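As a minimal sketch of what FP1 looks like on concrete Java code (the class name, method names, and the fallback value -1 are hypothetical and are not taken from TBar or from any benchmark bug), the patched method wraps the cast and its dependent statement in an instanceof check:

```java
// Hypothetical illustration of FP1 (Insert Cast Checker); not code from TBar.
final class CastCheckerExample {

    // Buggy version: the cast is unchecked and fails for non-String input.
    static int lengthBuggy(Object exp) {
        String var = (String) exp;   // may throw ClassCastException
        return var.length();
    }

    // Patched version: the cast and the statement that depends on var
    // are guarded by an instanceof check, following the FP1 template.
    static int lengthPatched(Object exp) {
        int result = -1;             // assumed fallback when the check fails
        if (exp instanceof String) {
            String var = (String) exp;
            result = var.length();
        }
        return result;
    }
}
```

The guard only removes the exception; whether the fallback behaviour is the intended one still has to be decided by the test suite or a developer.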

FP2. Insert Null Pointer Checker. Inserting a null check before a buggy statement if, in this statement, a field or an expression (of non-primitive data type) is accessed without a null pointer check. Implemented in: PAR, ELIXIR, NPEfix, Genesis, FixMiner, AVATAR, HDRepair†, SOFix†, SketchFix†, CapGen†, and SimFix†.

    FP2.1: + if (exp != null) {
               ...exp...; ......
           + }
    FP2.2: + if (exp == null) return DEFAULT_VALUE;
             ...exp...;
    FP2.3: + if (exp == null) exp = exp1;
             ...exp...;
    FP2.4: + if (exp == null) continue;
             ...exp...;
    FP2.5: + if (exp == null)
           +     throw new IllegalArgumentException(...);
             ...exp...;

where DEFAULT_VALUE is set based on the return type (RT) of the encompassing method as below:

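$$
\texttt{DEFAULT\_VALUE} =
\begin{cases}
\texttt{false}, & \text{if } RT = \text{boolean};\\
0, & \text{if } RT = \text{primitive type};\\
\texttt{new String()}, & \text{if } RT = \text{String};\\
\text{``return;''}, & \text{if } RT = \text{void};\\
\texttt{null}, & \text{otherwise.}
\end{cases}
$$

exp1 is a compatible expression in the buggy program (i.e., one that has the same data type as exp). FP2.4 is specific to the case of a buggy statement within a loop (i.e., for or while).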
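The DEFAULT_VALUE case analysis used by FP2.2 can be sketched as a small lookup. This is an illustrative helper only: the class name, the method name, and the choice to represent return types and inserted statements as strings are assumptions, not TBar's actual implementation.

```java
// Hypothetical helper mirroring the DEFAULT_VALUE case analysis of FP2.2;
// the string-based interface is an assumption made for illustration.
final class DefaultValueExample {

    // Returns the source text of the statement placed after "if (exp == null)"
    // for a method whose declared return type is returnType.
    static String nullGuardReturn(String returnType) {
        switch (returnType) {
            case "boolean":
                return "return false;";
            case "byte": case "short": case "int": case "long":
            case "float": case "double": case "char":
                return "return 0;";            // remaining primitive types
            case "String":
                return "return new String();";
            case "void":
                return "return;";
            default:
                return "return null;";         // any other reference type
        }
    }
}
```

For example, `nullGuardReturn("int")` yields `return 0;`, so the FP2.2 patch for an int-returning method would read `if (exp == null) return 0;`.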

FP3. Insert Range Checker. Inserting a range checker for the access of an array or collection if it is unchecked. Implemented in: PAR, ELIXIR, Genesis, SketchFix, AVATAR, SOFix†, and SimFix†.

    + if (index < exp.length) {
          ...exp[index]...; ......
    + }
    OR
    + if (index < exp.size()) {
          ...exp.get(index)...; ......
    + }

where exp is an expression representing an array or collection.
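A minimal sketch of FP3 on an array access, with hypothetical names and an assumed fallback parameter. Note that the template as written only inserts an upper-bound check, so a negative index would still fail:

```java
// Hypothetical illustration of FP3 (Insert Range Checker); not code from TBar.
final class RangeCheckerExample {

    // Patched access: the array read is guarded by the inserted upper-bound check.
    static int elementOrFallback(int[] exp, int index, int fallback) {
        int value = fallback;          // assumed value when the check fails
        if (index < exp.length) {      // check inserted by FP3
            value = exp[index];
        }
        return value;
    }
}
```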

FP4. Insert Missed Statement. Inserting a missing statement before, after, or around a buggy statement. The statement is either an expression statement with a method invocation, or a return/try-catch/if statement. Implemented in: ELIXIR, HDRepair, SOFix, SketchFix, CapGen, FixMiner, and SimFix.

    FP4.1: + method(exp);
    FP4.2: + return DEFAULT_VALUE;
    FP4.3: + try {
               statement; ......
           + } catch (Exception e) { ... }
    FP4.4: + if (conditional_exp) {
               statement; ......
           + }

where exp is an expression from a buggy statement. It may be empty if the method does not take any argument. FP4.4 excludes three fix patterns (FP1, FP2, and FP3) that are used with specific contexts.

FP5. Mutate Class Instance Creation. Replacing a class instance creation expression with a cast super.clone() method invocation if the class instance creation is in an overridden clone method. Implemented in: AVATAR.

      public Object clone() {
    -     ... new T();
    +     ... (T) super.clone();
      }

where T is the class name of the current class containing the buggy statement.
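One way to see why FP5 matters is a subclass that inherits clone(): with `new T()` the clone would always have the parent's runtime type, whereas `super.clone()` preserves the runtime type of the receiver. The sketch below uses hypothetical class names and shows only the patched form:

```java
// Hypothetical illustration of FP5 (Mutate Class Instance Creation); not code from TBar.
class Fp5Shape implements Cloneable {
    @Override
    public Object clone() {
        try {
            // Patched form: "new Fp5Shape()" replaced by a cast super.clone() call.
            return (Fp5Shape) super.clone();
        } catch (CloneNotSupportedException e) {
            // Cannot happen here because the class implements Cloneable.
            throw new AssertionError(e);
        }
    }
}

// Subclass that inherits clone() without overriding it.
final class Fp5Circle extends Fp5Shape {
}
```

With the patched method, `new Fp5Circle().clone()` is an `Fp5Circle`; with `new Fp5Shape()` in its place it would be a plain `Fp5Shape`.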

FP6. Mutate Conditional Expression. Mutating a conditional expression that returns a boolean value (i.e., true or false) by either updating it, or removing a sub conditional expression, or inserting a new conditional expression into it. Implemented in: PAR, ssFix, S3, HDRepair, ELIXIR, SketchFix, CapGen, SimFix, and AVATAR.

    FP6.1: - ...condExp1...
           + ...condExp2...
    FP6.2: - ...condExp1 Op condExp2...
           + ...condExp1...
    FP6.3: - ...condExp1...
           + ...condExp1 Op condExp2...

where condExp1 and condExp2 are conditional expressions. Op is the logical operator ‘||’ or ‘&&’. The mutation of operators in conditional expressions is not summarized in this fix pattern but in FP11.
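The three FP6 variants can be illustrated on a single boolean condition. The conditions and names below are made up for illustration; in an actual repair the replacement or inserted sub-condition would come from donor code in the buggy program:

```java
// Hypothetical illustration of FP6 (Mutate Conditional Expression); not code from TBar.
final class ConditionalMutationExample {

    // Original condition of the form "condExp1 Op condExp2".
    static boolean original(int x, int y) {
        return x > 0 || y > 0;
    }

    // FP6.1: one conditional expression is updated (x > 0 becomes x >= 0).
    static boolean fp61(int x, int y) {
        return x >= 0 || y > 0;
    }

    // FP6.2: the sub-condition "|| y > 0" is removed.
    static boolean fp62(int x, int y) {
        return x > 0;
    }

    // FP6.3: a new sub-condition "&& x != y" is inserted.
    static boolean fp63(int x, int y) {
        return (x > 0 || y > 0) && x != y;
    }
}
```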

FP7. Mutate Data Type. Replacing the data type in a variable declaration or a cast expression with another data type. Implemented in: PAR, ELIXIR, FixMiner, SOFix, CapGen, SimFix, AVATAR, and HDRepair†.

    FP7.1: - T1 var ...;
           + T2 var ...;
    FP7.2: - ...(T1) exp...;
           + ...(T2) exp...;

where T1 and T2 denote two different data types, and exp denotes the expression (including a variable) being cast.

FP8. Mutate Integer Division Operation. Mutating an integer division expression so that it returns a floating-point value, by mutating its dividend or divisor to be of type float or double. Released by Liu et al. (Liu et al., 2018b), it is not yet implemented in any APR tool.

    FP8.1: - ...dividend / divisor...
           + ...dividend / (double or float) divisor...
    FP8.2: - ...dividend / divisor...
           + ...(double or float) dividend / divisor...
    FP8.3: - ...dividend / divisor...
           + ...(1.0 / divisor) * dividend...

where dividend and divisor are integer number literals or integer-returned expressions (including variables).
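The effect of the three FP8 variants is easy to check on concrete values. The following sketch (class and method names are hypothetical) contrasts truncating integer division with each mutated form, using double as the floating-point type:

```java
// Hypothetical illustration of FP8 (Mutate Integer Division Operation); not code from TBar.
final class IntegerDivisionExample {

    // Buggy form: both operands are int, so the quotient is truncated
    // before it is widened to double.
    static double ratioBuggy(int dividend, int divisor) {
        return dividend / divisor;
    }

    // FP8.1: cast the divisor.
    static double ratioFp81(int dividend, int divisor) {
        return dividend / (double) divisor;
    }

    // FP8.2: cast the dividend.
    static double ratioFp82(int dividend, int divisor) {
        return (double) dividend / divisor;
    }

    // FP8.3: multiply the dividend by the reciprocal of the divisor.
    static double ratioFp83(int dividend, int divisor) {
        return (1.0 / divisor) * dividend;
    }
}
```

For the inputs 1 and 2, the buggy form evaluates to 0.0 while all three mutated forms evaluate to 0.5. FP8.3 can differ from the other two in the last bits for some inputs because it rounds twice.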

FP9. Mutate Literal Expression. Mutating boolean, number, or String literals in a buggy statement with other relevant literals, or correspondingly-typed expressions. Implemented in: HDRepair, S3, FixMiner, SketchFix, CapGen, SimFix, and ssFix†.

    FP9.1: - ...literal1...
           + ...literal2...
    FP9.2: - ...literal1...
           + ...exp...

where literal1 and literal2 are literals of the same type but with different values (e.g., literal1 is true, literal2 is false). exp denotes any expression of the same type as literal1.

FP10. Mutate Method Invocation Expression. Mutating the buggy method invocation expression by adapting its method name or arguments. This pattern consists of four sub fix patterns:

(1) Replacing the method name with another one which has a compatible return type and same parameter type(s) as the buggy method that was invoked.

(2) Replacing at least one argument with another expression which has a compatible data type. Replacing a literal or variable is not included in this fix pattern, but rather in FP9 and FP13 respectively.

(3) Removing argument(s) if the method invocation has the suitable overridden methods.

(4) Inserting argument(s) if the method invocation has the suitable overridden methods.

Implemented in: PAR, HDRepair, ssFix, ELIXIR, FixMiner, SOFix, SketchFix, CapGen, and SimFix.

    FP10.1: - ...method1(args)...
            + ...method2(args)...
    FP10.2: - ...method1(arg1, arg2, ...)...
            + ...method1(arg1, arg3, ...)...
    FP10.3: - ...method1(arg1, arg2, ...)...
            + ...method1(arg1, ...)...
    FP10.4: - ...method1(arg1, ...)...
            + ...method1(arg1, arg2, ...)...

where method1 and method2 are the names of invoked methods. args, arg1, arg2 and arg3 denote the argument expressions in the method invocation. Note that, code changes on class instance creation, constructor and super constructor expressions are also included in these four fix patterns.
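The four FP10 sub-patterns can be sketched with standard `java.lang` methods. The wrapper class and method names are hypothetical; the library calls themselves (`Math.max`, `Math.min`, `String.substring`, `String.indexOf`) are standard. For FP10.3 and FP10.4, the "suitable" alternative methods are, in Java terms, overloads of the same method name with fewer or more parameters.

```java
// Hypothetical illustration of FP10 (Mutate Method Invocation Expression); not code from TBar.
final class MethodInvocationMutationExample {

    // FP10.1: method name replaced by one with the same parameter and return types.
    static int before101(int a, int b) { return Math.max(a, b); }
    static int after101(int a, int b)  { return Math.min(a, b); }

    // FP10.2: the second argument is replaced by another int-typed expression.
    static String before102(String s, int from, int to) { return s.substring(from, to); }
    static String after102(String s, int from, int to)  { return s.substring(from, s.length()); }

    // FP10.3: an argument is removed; substring(int) is an existing overload.
    static String after103(String s, int from, int to)  { return s.substring(from); }

    // FP10.4: an argument is inserted; indexOf(int, int) is an existing overload.
    static int before104(String s, char ch, int from) { return s.indexOf(ch); }
    static int after104(String s, char ch, int from)  { return s.indexOf(ch, from); }
}
```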

FP11. Mutate Operators. Mutating an operation expression by mutating its operator(s). We divide this fix pattern into three sub-fix patterns following the operator types and mutation actions.

(1) Replacing one operator with another operator from the same operator class (e.g., relational or arithmetic).

(2) Changing the priority of arithmetic operators.

(3) Replacing instanceof operator with (in)equality operators.

Implemented in: HDRepair, ssFix, ELIXIR, S3, jMutRepair, SOFix, FixMiner, SketchFix, CapGen, SimFix, AVATAR, and PAR†.

    FP11.1: - ...exp1 Op1 exp2...
            + ...exp1 Op2 exp2...
    FP11.2: - ...(exp1 Op1 exp2) Op2 exp3...
            + ...exp1 Op1 (exp2 Op2 exp3)...
    FP11.3: - ...exp instanceof T...
            + ...exp != null...

where exp denotes the expressions in the operation and Op is the associated operator.
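A small sketch of the three FP11 sub-patterns on made-up expressions (all names are hypothetical) shows how each mutation changes the computed value:

```java
// Hypothetical illustration of FP11 (Mutate Operators); not code from TBar.
final class OperatorMutationExample {

    // FP11.1: relational operator "<=" replaced by "<" (same operator class).
    static boolean before111(int i, int n) { return i <= n; }
    static boolean after111(int i, int n)  { return i < n; }

    // FP11.2: the priority of the arithmetic operators is changed by moving parentheses.
    static int before112(int a, int b, int c) { return (a - b) * c; }
    static int after112(int a, int b, int c)  { return a - (b * c); }

    // FP11.3: "instanceof" replaced by an inequality check against null.
    static boolean before113(Object exp) { return exp instanceof String; }
    static boolean after113(Object exp)  { return exp != null; }
}
```

For a = 10, b = 2, c = 3, the original FP11.2 expression gives 24 and the mutated one gives 4.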

FP12. Mutate Return Statement. Replacing the expression (excluding literals, variables, and conditional expressions) in a return statement with a compatible expression. Implemented in: ELIXIR, SketchFix, and HDRepair†.

    - return exp1;
    + return exp2;

where exp1 and exp2 represent the returned expressions.

FP13. Mutate Variable. Replacing a variable in a buggy statement with a compatible expression (including variables and literals). Implemented in: S3, SOFix, FixMiner, SketchFix, CapGen, SimFix, AVATAR, and ssFix†.

    FP13.1: - ...var1...
            + ...var2...
    FP13.2: - ...var1...
            + ...exp...

where var1 denotes a variable in the buggy statement. var2 and exp represent respectively a compatible variable and expression of the same type as var1.

FP14. Move Statement. Moving a buggy statement to a new position. Implemented in: PAR.

    - statement;
      ......
    + statement;

where statement represents the buggy statement.

FP15. Remove Buggy Statement. Deleting entirely the buggy statement from the program. Implemented in: HDRepair, SOFix, FixMiner, CapGen, and AVATAR.

    FP15.1:   ......
            - statement;
              ......
    FP15.2: - methodDeclaration(Arguments) {
            -     ......; statement; ......
            - }

where statement denotes any identified buggy statement, and methodDeclaration represents the encompassing method.

2.3. Analysis of Collected Patterns

We provide a study of the collected fix patterns following quantitative (overall set) and qualitative (per fix pattern) aspects. Table 2 assesses the fix patterns in terms of four qualitative dimensions:

(1) Change Action: what high-level operations are applied on a buggy code entity? On the one hand, Update operations replace the buggy code entity with retrieved donor code, while Delete operations just remove the buggy code entity from the program. On the other hand, Insert operations insert an otherwise missing code entity into the program, and Move operations change the position of the buggy code entity to a more suitable location in the program.

(2) Change Granularity: what kinds of code entities are directly impacted by the change actions? This entity can be an entire Method, a whole Statement, or specifically an Expression within a statement.

(3) Bug Context: what specific AST nodes of code entities are used to match fix patterns?

(4) Change Spread: how many statements are impacted by each fix pattern?

Quantitatively, as summarized in Table 3, 17 fix patterns are related to Update change actions, 4 fix patterns implement Delete actions, 13 fix patterns Insert extra code, and only 1 fix pattern is associated with a Move change action.

In terms of change granularity, 21 and 17 fix patterns are applied respectively at the expression and statement code entity levels (footnote 2: among these, four sub-fix patterns (FP10) can be applied to either expressions or statements, given that constructor and super-constructor code entities in Java programs are grouped at the statement level in the abstract syntax tree by Eclipse JDT). Only 1 fix pattern is suitable at the method level.

Overall, we note that 30 fix patterns are applicable to a single statement, while 7 fix patterns can mutate multiple statements at the same time. Among these patterns, FP14 and FP15.1 can both mutate single and multiple statements.

3. Setup for Repair Experiments

In order to assess the effectiveness of fix patterns in the taxonomy presented in Section 2, we design program repair experiments using the fix patterns as the main ingredients. The produced APR system is then assessed on a widely-used benchmark in the repair community to allow reliable comparison against the state-of-the-art.

3.1. TBar: a Baseline APR System

Based on the investigations of recurrently-used fix patterns, we build TBar, a template-based APR tool which integrates the 35 fix patterns presented in Section 2. We expect the APR community to consider TBar as a baseline APR tool: new approaches must come up with novel techniques for solving auxiliary issues (e.g., repair precision, search space optimization, fault locations re-prioritization, etc.) to boost automated program repair beyond the performance that a straightforward application of common fix patterns can offer. Figure 1 overviews the workflow that we have implemented in TBar. We describe in the following subsections the role and operation of each process as well as all necessary implementation details.

3.1.1. Fault Localization

Fault localization is necessary for template-based APR as it identifies a list of suspicious code locations (i.e., buggy statements) on which to apply the fix patterns. TBar leverages the GZoltar (Campos et al., 2012) framework to automate the execution of test cases for each buggy program. In this framework, we use the Ochiai (Abreu et al., 2007) ranking metric to compute the suspiciousness scores of statements that are likely to be the faulty code locations. This ranking metric has been demonstrated in several empirical studies (Steimann et al., 2013; Xie et al., 2013; Xuan and Monperrus, 2014a; Pearson et al., 2017) to be effective for localizing faults in object-oriented programs. The GZoltar framework for fault localization is also widely used in the literature of APR (Martinez and Monperrus, 2016; Xiong et al., 2017; Xuan et al., 2017; Xin and Reiss, 2017; Wen et al., 2018; Koyuncu et al., 2018; Liu et al., 2018a; Jiang et al., 2018; Liu et al., 2019b; Liu et al., 2019c), allowing for a fair assessment of TBar’s performance against the state-of-the-art.
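To make the ranking step concrete, the following sketch computes Ochiai suspiciousness scores from a toy coverage matrix. It is not GZoltar's API: the data layout (a mapping from test name to covered statement ids) and all function names are assumptions for illustration. The formula itself is the standard Ochiai definition, $e_f / \sqrt{(e_f + n_f)(e_f + e_p)}$, where $e_f$ and $e_p$ count failing and passing tests that execute the statement and $e_f + n_f$ is the total number of failing tests.

```python
import math
from typing import Dict, List, Set, Tuple


def ochiai(failed_cover: int, passed_cover: int, total_failed: int) -> float:
    """Ochiai score: ef / sqrt(total_failed * (ef + ep)); 0.0 when undefined."""
    denominator = math.sqrt(total_failed * (failed_cover + passed_cover))
    return failed_cover / denominator if denominator else 0.0


def rank_statements(coverage: Dict[str, Set[str]],
                    failing: Set[str]) -> List[Tuple[str, float]]:
    """Rank statement ids by decreasing suspiciousness (ties broken by id)."""
    statements: Set[str] = set()
    for covered in coverage.values():
        statements |= covered
    scores = {}
    for stmt in statements:
        ef = sum(1 for test, cov in coverage.items() if test in failing and stmt in cov)
        ep = sum(1 for test, cov in coverage.items() if test not in failing and stmt in cov)
        scores[stmt] = ochiai(ef, ep, len(failing))
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical coverage data: t2 is the only failing test.
coverage = {"t1": {"s1", "s2"}, "t2": {"s2", "s3"}, "t3": {"s2"}}
ranking = rank_statements(coverage, failing={"t2"})
```

In this toy example, `s3` is executed only by the failing test and therefore ranks first, while `s1` is never executed by a failing test and receives a score of zero.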

3.1.2. Fix Pattern Selection

In the execution of the repair pipeline, once the fault localization process yields a list of suspicious code locations, TBar sequentially attempts to select the encoded fix patterns from its database of fix patterns for each statement in the locations list. The selection of fix patterns is conducted in a naïve way based on the AST context information of each suspicious statement. Specifically, TBar sequentially traverses each node of the suspicious statement AST from its first child node to its last leaf node and tries to match each node against the context AST of the fix pattern. If a node matches any bug context presented in Table 2, the related fix pattern is selected to generate patch candidates with the corresponding code change pattern. If the node is not a leaf node, TBar keeps traversing its child nodes. For example, if the first child node of a suspicious statement is a method invocation expression, it will first be matched with the FP10 (Mutate Method Invocation Expression) fix pattern. If the child nodes of the method invocation start with a variable reference, it will be matched with the FP13 (Mutate Variable) fix pattern as well. Other fix patterns are matched in the same manner. After all expression nodes of a suspicious statement are matched with fix patterns, TBar further matches fix patterns at the statement and method levels respectively.
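The traversal described above can be sketched as a pre-order walk over a simplified AST. The `Node` class, the node kind names, and the two small lookup tables are hypothetical stand-ins: only the FP10 and FP13 contexts come from the example in the text, and treating FP14 and FP15.1 as the statement-level patterns is an assumption made to keep the example short (method-level matching is omitted).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    kind: str    # simplified AST node type (hypothetical names)
    label: str   # source text of the node
    children: List["Node"] = field(default_factory=list)


# Hypothetical excerpt of the bug-context table: node kind -> fix patterns.
EXPRESSION_CONTEXTS: Dict[str, List[str]] = {
    "MethodInvocation": ["FP10"],
    "SimpleName": ["FP13"],
}
# Assumption: patterns tried on the whole statement after all expression nodes.
STATEMENT_PATTERNS = ["FP14", "FP15.1"]


def match_patterns(statement: Node) -> List[Tuple[str, str]]:
    """Return (pattern, matched code) pairs in the order they are encountered."""
    matches: List[Tuple[str, str]] = []

    def visit(node: Node) -> None:
        for pattern in EXPRESSION_CONTEXTS.get(node.kind, []):
            matches.append((pattern, node.label))
        for child in node.children:  # non-leaf nodes: keep descending
            visit(child)

    for child in statement.children:  # expression level first
        visit(child)
    for pattern in STATEMENT_PATTERNS:  # then statement level
        matches.append((pattern, statement.label))
    return matches


call = Node("MethodInvocation", "foo.bar(x)",
            [Node("SimpleName", "foo"), Node("SimpleName", "x")])
stmt = Node("ExpressionStatement", "foo.bar(x);", [call])
matches = match_patterns(stmt)
```

For the statement `foo.bar(x);`, the walk first matches FP10 on the invocation, then FP13 on each variable reference, and only afterwards the statement-level patterns.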

3.1.3. Patch Generation and Validation

When a matching fix pattern is found (i.e., a pattern is selected for a suspicious statement), a patch is generated by mutating the statement, then the patched program is run against the test suite. If the patched program passes all tests successfully, the patch candidate is considered as a plausible patch (Qi et al., 2015). Once such a plausible patch is identified, TBar stops generating other patch candidates for this bug, following a standard and practical program repair workflow (Martinez and Monperrus, 2016; Xiong et al., 2017; Xuan et al., 2017; Liu et al., 2019b; Liu et al., 2019c); unlike PraPR (Ghanbari and Zhang, 2018), it does not generate all plausible patches for each bug. Otherwise, the pattern selection and patch generation process is resumed until all AST nodes of buggy code are traversed. When several fix pattern contexts match one node, their actions are used for ordering: TBar prioritizes Update over Insert, Insert over Delete, and Delete over Move. In case of multiple donor code options for a given fix pattern, the candidate patches (each generated with a specific donor code) are ordered based on the distances between donor code node and buggy code node in the AST of the buggy code file: priority is given to smaller distances. Due to space limitation, detailed steps, illustrated in an algorithmic pseudo-code, are released in the replication package.
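The two ordering rules above (action priority across patterns, AST distance across donor code options of one pattern) can be combined into a single sort key, as in the following sketch. The `Candidate` record and its fields are hypothetical, and the numeric distance is assumed to be precomputed; the actual algorithm is the one released in the replication package.

```python
from dataclasses import dataclass
from typing import Dict, List

# Priority stated in the text: Update, then Insert, then Delete, then Move.
ACTION_PRIORITY: Dict[str, int] = {"Update": 0, "Insert": 1, "Delete": 2, "Move": 3}


@dataclass(frozen=True)
class Candidate:
    pattern: str         # fix pattern identifier
    action: str          # change action of the pattern
    donor_distance: int  # AST distance to the donor code (0 when no donor is needed)


def order_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Order by action priority, then keep patterns grouped, then by donor distance."""
    first_seen: Dict[str, int] = {}
    for candidate in candidates:
        first_seen.setdefault(candidate.pattern, len(first_seen))
    return sorted(
        candidates,
        key=lambda c: (ACTION_PRIORITY[c.action], first_seen[c.pattern], c.donor_distance),
    )


pool = [
    Candidate("FP14", "Move", 0),
    Candidate("FP15.1", "Delete", 0),
    Candidate("FP13", "Update", 7),
    Candidate("FP13", "Update", 2),
]
ordered = order_candidates(pool)
```

Here the two FP13 candidates come first (closest donor code before the farther one), followed by the Delete candidate and finally the Move candidate.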

Considering that some buggy programs have several buggy locations, if a patch candidate can make a buggy program pass a subset of previously failing test cases without failing any previously passing test cases, this patch is considered as a plausible sub-patch of this buggy program. TBar will further validate other patch candidates, until either a plausible patch is generated, or all patch candidates are validated, or TBar exhausts the time limit (i.e., three hours) set for repair attempts.
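The distinction between a plausible patch, a plausible sub-patch, and a rejected candidate depends only on which tests fail before and after patching. A minimal sketch, assuming test outcomes are available as sets of failing test names (the function name and the return labels are illustrative):

```python
from typing import Set


def classify_patch(failing_before: Set[str], failing_after: Set[str]) -> str:
    """Classify a patch candidate from the failing tests before and after patching."""
    if not failing_after:
        return "plausible"  # every test passes
    newly_broken = failing_after - failing_before
    fixed = failing_before - failing_after
    if fixed and not newly_broken:
        return "sub-patch"  # some failing tests now pass, nothing regressed
    return "rejected"


full = classify_patch({"t1", "t2"}, set())
partial = classify_patch({"t1", "t2"}, {"t2"})
regression = classify_patch({"t1", "t2"}, {"t2", "t9"})
no_effect = classify_patch({"t1", "t2"}, {"t1", "t2"})
```

A candidate that fixes one failing test but breaks a previously passing one (`regression` above) is rejected, as is a candidate that changes no test outcome.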

If a plausible patch is generated, we further manually check the equivalence between this patch and the ground-truth patch provided by developers and available in the Defects4J benchmark. If the plausible patch is semantically equivalent to the ground-truth patch, the plausible patch is considered as correct. Otherwise, it is only considered as plausible. We offer a replication package with extensive details on pattern implementation within TBar. Source code is publicly available in the aforementioned GitHub repository.

3.2. Assessment Benchmark

For our empirical assessments, we selected the Defects4J (Just et al., 2014) dataset as the evaluation benchmark of TBar. This benchmark includes test cases for buggy Java programs with the associated developer fixes. Defects4J is an ideal benchmark for the objective of this study, since it has been widely used by most recent state-of-the-art APR systems targeting Java program bugs. Table 4 provides summary statistics on the bugs and test cases available in version 1.2.0 (def, 2019) of Defects4J, which we use in this study.

Overall, we note that, to date, 101 Defects4J bugs have been correctly fixed by at least one APR tool published in the literature. Nevertheless, we recall that SimFix (Jiang et al., 2018) currently holds the record number of bugs fixed by a single tool, which is 34.

4. Assessment

This section presents and discusses the results of repair experiments with TBar. In particular, we conduct two experiments for:

• Experiment #1: Assessing the effectiveness of the various fix patterns implemented in TBar. To avoid the bias that fault localization can introduce with its false positives (cf. (Liu et al., 2019b)), we directly provide perfect localization information to TBar.

• Experiment #2: Evaluating TBar in a normal program repair scenario. We investigate in particular the tendency of fix patterns to produce more or less incorrect patches.

4.1. Repair Suitability of Fix Patterns

Our first experiment focuses on assessing the patch generation performance of fix patterns for real bugs. In particular, we investigate three research questions in Experiment #1.

Research Questions for Experiment #1

RQ1. How many real bugs from Defects4J can be correctly fixed by fix patterns from our taxonomy?

RQ2. Can each Defects4J bug be fixed by different fix patterns?

RQ3. What are the properties of fix patterns that are successfully used to fix bugs?

In a recent study, Liu et al. (Liu et al., 2019b) reported how fault localization techniques substantially affect the repair performance of APR tools. Given that, in this experiment, the APR tool (namely TBar) is only used as a means to apply the fix patterns in order to assess their effectiveness, we must eliminate the fault localization bias. Therefore, we assume that the bug positions at statement level are known, and we directly provide them to the patch generation step of TBar, without running any fault localization tool (which is part of the normal APR workflow, see Figure 1). To ensure readability across our experiments, we denote this version of the APR system as $\texttt{TBar}_{p}$ (where $p$ stands for perfect localization). Table 5 summarizes the experimental results of $\texttt{TBar}_{p}$.

Among 395 bugs in the Defects4J benchmark, $\texttt{TBar}_{p}$ can generate plausible patches for 101 bugs. 74 of these bugs are fixed with correct patches. We also note that $\texttt{TBar}_{p}$ can partially fix 20 bugs with plausible patches, and 8 of them are correct (footnote 3: a partial fix is a patch that makes the buggy program pass a part of the previously failed test cases without causing any new failed test cases (Liu et al., 2019b)). In a previous study, the kPAR (Liu et al., 2019b) baseline tool (i.e., a Java implementation of the PAR (Kim et al., 2013) seminal template-based APR tool) was correctly/plausibly fixing 36/55 Defects4J bugs when assuming perfect localization.

While the results of $\texttt{TBar}_{p}$ are promising, $\sim$79% (= 314/395) of bugs cannot be correctly fixed with the available fix patterns. We manually investigated these unfixed bugs and make the following observations as research directions for improving the fix rates:

(1) Insufficient fix patterns. Many bugs are not fixed by $\texttt{TBar}_{p}$ simply due to the absence of matching fix patterns. This suggests that the fix patterns collected in the literature are far from being representative of real-world bugs. The community must thus keep contributing effective techniques for mining fix patterns from existing patches.

(2) Ineffective search of fix ingredients. Template-based APR is a kind of search-based APR (Wen et al., 2018): some fix patterns require donor code (i.e., fix ingredients) to generate actual patches. For example, as shown in Figure 2, to apply the relevant fix pattern FP9.2, one needs to identify the fix ingredient “ImageMapUtilities.htmlEscape” as necessary for generating the patch. The current implementation of TBar limits its search space for donor code to the “local” file where the bug is localized. This limits the chances of finding the correct donor code, but it reduces the risk of search space explosion. In addition, TBar leverages the context of buggy code to prune away irrelevant fix ingredients. Therefore, some bugs cannot be fixed by TBar although its fix patterns match the required code change actions. With more effective search strategies (e.g., a larger search space such as fix ingredients from other projects as in (Liu et al., 2018a)), there might be more chances to fix more bugs.

RQ1: The collected fix patterns can be used to correctly fix 74 real bugs from the Defects4J dataset. A larger portion of the dataset remains however unfixed by $\texttt{TBar}_{p}$, notably due to (1) the limitations of the fix patterns set and to (2) the naïve search strategy for finding relevant fix ingredients to build concrete patches from patterns.

Figure 3 summarizes the statistics on the number of bugs that can be fixed by one or several fix patterns. The Y-axis denotes the number of fix patterns (i.e., $n=$ 1, 2, 3, 4, 5, and >5) that can generate plausible patches for a number of bugs (X-axis). The legend indicates that “P” represents the number of plausible patches generated by $\texttt{TBar}_{p}$ (i.e., those that are not found to be correct). “# $k$ ”, where $k\in[1,4]$ , indicates that a bug can be correctly fixed by only $k$ fix patterns (although it may be plausibly fixed by more fix patterns).

Consider the bottom-most bar in Figure 3: 66 (=28+38) bugs can be plausibly fixed by a single pattern (Y-axis value is 1); it turns out that only 38 of them are correctly fixed. Note that several patterns can generate (plausible) patches for a bug, but not all patches are necessarily correct. For example, in the case of the top-most bar in Figure 3, 5 bugs are each plausibly fixed by over 5 fix patterns. However, only 1 bug is correctly fixed by 3 fix patterns.

In summary, 86% (= $\frac{38+10+5+3+10+4}{74+7}$ ) of correctly fixed bugs (74 fully and 7 partially fixed bugs) are exclusively fixed correctly by single patterns. In other words, generally, several fix patterns can generate patches that can pass all test cases but, in most cases, the bug is correctly fixed by only one pattern. This finding suggests that it is necessary to carefully select an appropriate fix pattern when attempting to fix a bug, in order to avoid plausible patches which may prevent the discovery of correct patches by halting the repair process (given that all tests are passing on the plausible patch).

RQ2: Some bugs can be plausibly fixed by different fix patterns. However, in most cases, only one fix pattern is adequate for generating a correct patch. This finding suggests a need for new research on fix pattern prioritization.

Table 6 details which bug is fixed by which fix pattern(s). We note that five fix patterns (i.e., FP3, FP4.3, FP5, FP7.2 and FP11.3) cannot be used to generate a plausible patch for any Defects4J bug. Two fix patterns (i.e., FP9.2 and FP12) lead to plausible patches for some bugs, but none of them is correct. It does not necessarily suggest that the aforementioned fix patterns are useless (or ineffective) in APR. Instead, two reasons can explain their performance:

• The search for donor code may be inefficient for finding relevant ingredients for applying these patterns.

• The Defects4J dataset does not contain the types of bugs that can be addressed by these fix patterns.

In addition, twenty (20) fix patterns lead to the generation of correct patches for some bugs. Most of these fix patterns are also involved in the generation of plausible patches (which turn out to be incorrect). Interestingly, we found six (6) fix patterns which can generate several patch candidates, some being correct and others being only plausible, for the same 10 bugs (as indicated in Table 6 with ‘◐’) (footnote 4: in this experiment, $\texttt{TBar}_{p}$ generates and assesses all possible patch candidates for a given pair "bug location - fix pattern" with varying ingredients). This observation further highlights the importance of selecting relevant donor code for synthesizing patches: selecting inappropriate donor code can lead to the generation of a plausible (but incorrect) patch, which will impede the generation of correct patches in a typical repair pipeline.

Aside from fix patterns, the fix ingredients collected from donor code must be selected properly to avoid patches that are plausible but incorrect.

We further inspect properties of fix patterns, such as change actions, granularity, and the number of changed statements in patches. The statistics are shown in Figure 4, highlighting the number of plausible (but incorrect) and correct patches for the different property dimensions through which fix patterns can be categorized.

More bugs are fixed by Update change actions than by any other action. Similarly, fix patterns targeting expressions fix more bugs correctly than patterns targeting statements and methods. However, fix patterns mutating whole statements have a higher rate of correct patches among their plausible generated patches. Finally, fix patterns changing only single statements can correctly fix more bugs than those touching multiple statements. Fix patterns targeting multiple statements have however a higher rate of correctness.

RQ3: There are noticeable differences in repair success among fix patterns depending on their properties related to implemented change actions, change granularity and change spread.

4.2. Repair Performance Comparison: TBar vs State-of-the-art APR tools

Our second experiment evaluates TBar in a realistic setting for patch generation, allowing for reliable comparison against the state-of-the-art in the literature. Concretely, we investigate two research questions in Experiment #2.

Research Questions for Experiment #2

RQ4. What performance can be achieved by TBar in a standard and practical repair scenario?

RQ5. To what extent are the different fix patterns sensitive to noise in fault localization (i.e., spotting buggy code locations)?

In this experiment we implement a realistic scenario, using a normal fault localization (i.e., no assumption of perfect localization as for $\texttt{TBar}_{p}$ ) on Defects4J bugs. To enable a fair comparison with performance results recorded in the literature, TBar leverages a standard configuration in the literature (Liu et al., 2019b) with GZoltar (Campos et al., 2012) and Ochiai (Abreu et al., 2007). Furthermore, TBar does not utilize any additional technique to improve the accuracy of fault localization, such as crashed stack trace (used by ssFix (Xin and Reiss, 2017)), predicate switching (Zhang et al., 2006) (used by ACS (Xiong et al., 2017)), or test case purification (Xuan and Monperrus, 2014b) (used by SimFix (Jiang et al., 2018)).

With respect to the patch generation step, contrary to the experiment with $\texttt{TBar}_{p}$ where all positions of multi-location bugs were known (cf. Section 4.1), TBar adopts a “first-generated and first-selected” strategy to progressively apply fix patterns, one at a time, in various suspicious code locations: TBar generates a patch $p_{i}$ , using a fix pattern that matches a given bug. If $p_{i}$ passes a subset of previously-failing test cases without failing any previously-passing test case, TBar selects $p_{i}$ as a plausible patch for the bug. Then, TBar continues to validate another patch $p_{i+1}$ (which can be generated by the same fix pattern on the same code entity with other ingredients, or on another code location). When $p_{i+1}$ passes the same subset of test cases as $p_{i}$ , if $p_{i+1}$ is generated for the same buggy code entity as $p_{i}$ , $p_{i+1}$ will be abandoned; otherwise, TBar takes $p_{i+1}$ as another plausible patch as well. Through this process, TBar creates a patch set $P$ = { $p_{i}$ , $p_{i+1}$ , …} of plausible patches. Here, as soon as any patch can pass all the given test cases for a given bug, TBar takes it as a plausible patch for the given bug, which is regarded as a fully-fixed bug, and all $p_{i}\in P$ will be abandoned. Otherwise, our tool yields $P$ , a set of plausible patches that can each partially fix the given bug.
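One possible reading of the “first-generated and first-selected” strategy is sketched below. Candidates are (patch id, code entity) pairs, and `run_tests` is a hypothetical callback returning the set of tests that still fail once the patch is applied. Treating "the same subset of test cases" as an exact match on the set of newly passing tests is an interpretation made for this example, as are all names.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple


def select_patches(
    candidates: Iterable[Tuple[str, str]],
    failing_before: Set[str],
    run_tests: Callable[[str], Set[str]],
) -> Tuple[str, List[str]]:
    """Return ("full", [patch]) for the first patch passing all tests,
    otherwise ("partial", patches) or ("unfixed", [])."""
    partial: List[Tuple[str, str, FrozenSet[str]]] = []
    for patch_id, entity in candidates:
        failing_after = run_tests(patch_id)
        if not failing_after:
            return "full", [patch_id]  # fully fixed: partial patches are abandoned
        if failing_after - failing_before:
            continue  # breaks a previously passing test
        fixed = frozenset(failing_before - failing_after)
        if not fixed:
            continue  # no previously failing test passes
        # Abandon a patch on the same entity that fixes the same tests.
        if any(e == entity and f == fixed for _, e, f in partial):
            continue
        partial.append((patch_id, entity, fixed))
    if partial:
        return "partial", [patch_id for patch_id, _, _ in partial]
    return "unfixed", []


# Hypothetical test outcomes for four candidates of one bug.
outcomes = {"p1": {"t2"}, "p2": {"t2"}, "p3": {"t1"}, "p4": {"t1", "t2", "t9"}}
queue = [("p1", "e1"), ("p2", "e1"), ("p3", "e2"), ("p4", "e3")]
status, selected = select_patches(queue, {"t1", "t2"}, lambda p: outcomes[p])
```

In this toy run, `p2` is abandoned because it fixes the same test as `p1` on the same code entity, and `p4` is discarded because it breaks a previously passing test, so the result is a set of two partial patches.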

We run the TBar APR system against the buggy programs of the Defects4J dataset. Table 7 presents the performance of TBar in comparison with recent state-of-the-art APR tools from the literature. TBar can fix 81 bugs with plausible patches, 43 of which are correctly fixed. No other APR tool had reached this number of fixed bugs. Nevertheless, its precision (ratio of correct vs. plausible patches) is lower than that of some recent tools such as CapGen and SimFix, which employ sophisticated techniques to select fix ingredients. Nonetheless, it is noteworthy that, despite using fix patterns catalogued in the literature, we can fix three bugs (namely Cl-86, L-47, M-11) which had never been fixed by any APR system: M-11 is fixed by a pattern found by a standalone fix pattern mining tool (Liu et al., 2018b) but which was not encoded by any APR system yet. Cl-86 and L-47 are fixed by patterns that were not applied to Defects4J.

RQ4: TBar outperforms all recent state-of-the-art APR tools that were evaluated on the Defects4J dataset. It correctly fixes 43 bugs, while the runner-up (SimFix) is reported to correctly fix 34 bugs.
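As a small worked example of the precision metric defined above (ratio of correct vs. plausible patches), the snippet below applies it to the counts reported in the text for TBar and for $\texttt{TBar}_{p}$. The helper name is illustrative; only the four counts come from the paper.

```python
def precision(correct: int, plausible: int) -> float:
    """Share of bugs with a plausible patch whose patch is also correct."""
    return correct / plausible


tbar = precision(43, 81)      # normal fault localization: roughly 53%
tbar_p = precision(74, 101)   # perfect fault localization: roughly 73%
```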

It is noteworthy that TBar performs significantly worse than $\texttt{TBar}_{p}$ (43 vs. 74 correctly fixed bugs). This result is in line with a recent study (Liu et al., 2019b), which demonstrated that fault localization imprecision is detrimental to APR repair performance. Table 6 summarizes information about the number of bugs each fix pattern contributed to fixing with $\texttt{TBar}_{p}$ . While only 4 fix patterns did not lead to the generation of any plausible patch when assuming perfect localization, this is the case for 13 fix patterns with TBar (see Table 8). This observation further confirms the impact of fault localization noise.

We propose to examine the locations where TBar applied fix patterns to generate plausible but incorrect patches. As shown in Figure 5, TBar has made changes on incorrect positions (i.e., non-buggy locations) for 24 out of the 38 fully-fixed and 15 out of the 16 partially-fixed bugs.

Even when TBar applies a fix pattern to the precise buggy location, the generated patch may be incorrect. As shown in Figure 5, 14 patches that fully fix Defects4J bugs mutate the correct locations: in 3 cases, the fix patterns were inappropriate; in 2 other cases, TBar failed to locate relevant donor code; for the remaining, TBar does not support the required fix patterns.

Finally, Figure 6 illustrates the impact of fault localization performance: unfixed bugs (but correctly fixed by $\texttt{TBar}_{p}$ ) are generally more poorly localized than correctly fixed bugs. Similarly, we note that many plausible but incorrect patches are generated for bugs which are not well localized (i.e., several false positive buggy locations are mutated leading to plausible but incorrect patches).

The average positions of bugs (in the fault localization suspicious list) are also provided in Table 8. It appears that some fix patterns (e.g., FP2.1, FP6.3, FP10.2) can correctly fix bugs that are poorly localized, showing less sensitivity to fault localization noise than others.

RQ5: Fault localization noise has a significant impact on the performance of TBar. Fix patterns are diversely sensitive to the false positive locations that are recommended as buggy positions.

5. Discussion

Overall, our investigations reveal that a large catalogue of fix patterns can help improve APR performance. However, at the same time, there are other challenges that must be dealt with: more accurate fault localization, effective search of relevant donor code, fix pattern prioritization. While we will work on some of these research directions in future work, we discuss in this section some threats to validity of the study and practical limitations of TBar.

5.1. Threats to Validity

Threats to external validity include the target language of this study, i.e., Java. Fix patterns studied in this paper only cover the fix patterns targeting Java program bugs released by the state-of-the-art pattern-based APR systems. However, we believe that most fix patterns presented in this study could be applied to other languages since fix patterns are illustrated at the abstract syntax tree level. Another threat to external validity could be the fix pattern diversity. Our study may not consider all available fix patterns so far in the literature. To reduce this threat, we systematically reviewed the research on pattern-based program repair in the literature. Nevertheless, we acknowledge that integrating more fix patterns may not necessarily lead to an increased number of bugs that are correctly fixed. With too many fix patterns, the search space of fix patterns and patch candidates will explode. Eventually, the APR tool will produce a huge number of plausible patches, many of which might be validated before the correct ones (Wen et al., 2018). A future research direction could be the construction and curation of a fix pattern database for APR.

Our strategy of fix pattern selection can be a threat to internal validity: it naïvely matches patterns based on the AST context around buggy locations. More advanced strategies would give a higher probability to select appropriate patterns to fix more bugs. Our approach to searching for donor code also carries some threats to validity: TBar focuses on the local buggy file, while previous works have shown that the adequate donor code, for some bugs, is available in other files (Wen et al., 2018; Jiang et al., 2018). In future work, we will investigate the search of donor code beyond local files, while using heuristics to cope with the potential search space explosion. Finally, the selected benchmark for evaluation constitutes another threat to external validity for assessment. The performance achieved by TBar on Defects4J may not be reached on a bigger, more diverse and more representative dataset. To address this threat, new benchmarks such as Bugs.jar (Saha et al., 2018) and Bears (Madeiral et al., 2019) should be investigated.

5.2. Limitations

TBar selects fix patterns in a naïve way; it would thus be necessary to design a more sophisticated strategy (e.g., one based on bug symptoms, bug types, or other information from bug reports) for fix pattern selection to reduce the noise from inappropriate fix patterns. Searching for donor code to synthesize patches is another limitation of TBar, as the correct donor code for fixing some bugs is located in code files that do not contain the bug (Wen et al., 2018; Jiang et al., 2018). If TBar extended the donor code search to other non-buggy code files, it would risk search space explosion.

6. Related Work

Fault Localization. In general, most APR pipelines start with fault localization (FL), as shown in Figure 1. Once the buggy position is localized, APR tools can mutate the buggy code entity to generate patches. To identify defect locations in a program, several automated FL techniques have been proposed (Wong et al., 2016): slice-based (Wong et al., 2010; Mao et al., 2014), spectrum-based (Abreu et al., 2009a; Perez et al., 2017), statistics-based (Liblit et al., 2005; Liu et al., 2006), etc.

Spectrum-based FL is widely adopted in APR systems since it identifies bug positions at the statement level. It relies on ranking metrics (e.g., Tarantula (Jones and Harrold, 2005), Ochiai (Abreu et al., 2009b)) to calculate the suspiciousness of each statement. GZoltar (Campos et al., 2012) and Ochiai have been widely integrated into APR systems since their effectiveness has been demonstrated in several empirical studies (Steimann et al., 2013; Xie et al., 2013; Xuan and Monperrus, 2014a; Pearson et al., 2017). As reported by Liu et al. (Liu et al., 2019b) and studied in this paper, this FL configuration is still limited in localizing bug positions. Therefore, researchers have tried to enhance FL with new techniques, such as predicate switching (Zhang et al., 2006; Xiong et al., 2017) and test case purification (Xuan and Monperrus, 2014b; Jiang et al., 2018).

Patch Generation. Another key process of APR pipelines is searching for another shape of a program (i.e., a patch) in the space of all possible programs (Le Goues et al., 2012a; Long and Rinard, 2016a). If the search space is too small, it might not include the correct patches (Wen et al., 2018). To reduce this threat, a straightforward strategy is to expand the search space, which could however lead to two other problems: (1) at worst, there still is no correct patch in it; and (2) the expanded search space includes more plausible patches, which increases the possibility of generating plausible patches before correct ones (Wen et al., 2018; Liu et al., 2018a).

To improve repair performance, many APR systems have been explored to address the search space problem. Synthesis-based APR systems (Long and Rinard, 2015; Xuan et al., 2017; Xiong et al., 2017) limit the search space to conditional bug fixes by synthesizing new conditional expressions with variables identified from the buggy code. Pattern-based APR tools (Kim et al., 2013; Le et al., 2016b; Saha et al., 2017; Long et al., 2017; Durieux et al., 2017; Le et al., 2017; Liu and Zhong, 2018; Hua et al., 2018; Jiang et al., 2018; Liu et al., 2019c) are designed to purify the search space by following fix patterns to mutate buggy code entities with retrieved donor code. Other APR pipelines focus on specific search methods for donor code or patch synthesizing strategies to address the search space problem, such as contract-based (Wei et al., 2010; Chen et al., 2017), symbolic execution based (Nguyen et al., 2013), learning based (Long and Rinard, 2016b; Gupta et al., 2017; Rolim et al., 2017; Soto and Le Goues, 2018; Bhatia et al., 2018; White et al., 2019), and donor code searching (Mechtaev et al., 2015; Ke et al., 2015) APR tools. Various existing APR tools have achieved promising results on fixing real bugs, but there is still an opportunity to improve the performance; for example, mining more fix patterns, improving pattern selection and donor code retrieval strategies, exploring new strategies for patch generation, and prioritizing bug positions.

Patch Correctness. The ultimate goal of APR systems is to automatically generate a correct patch that can resolve the program defects. In the beginning, patch correctness was evaluated by passing all test cases (Weimer et al., 2009; Kim et al., 2013; Le et al., 2016b). However, these patches could be overfitting (Qi et al., 2015; Le et al., 2018) and even worse than the bug (Smith et al., 2015). Since then, APR systems have been evaluated with the precision of generating correct patches (Xiong et al., 2017; Wen et al., 2018; Jiang et al., 2018; Liu et al., 2019c). Recently, researchers have started to explore automated frameworks that can identify patch correctness for APR systems automatically (Xiong et al., 2018; Le et al., 2019).

7. Conclusion

Fix patterns have been studied in various scenarios to understand bug fixes in the wild. They are further implemented in different APR pipelines to generate patches automatically. Although template-based APR tools have achieved promising results, no extensive investigation on the effectiveness of fix patterns was conducted. We fill this gap in this work by revisiting the repair performance of fix patterns via a systematic study assessing the effectiveness of a variety of fix patterns summarized from the literature. In particular, we build a straightforward template-based APR tool, TBar, which we evaluate on the Defects4J benchmark. On the one hand, assuming a perfect fault localization, TBar fixes 74/101 bugs correctly/plausibly. On the other hand, in a normal/practical APR pipeline, TBar correctly fixes 43 bugs despite the noise of fault localization false positives. This constitutes a record performance in the literature on Java program repair. We expect TBar to be established as the new baseline APR system, leading researchers to propose better techniques for substantial improvement of the state-of-the-art.

Acknowledgements.

This work is supported by the Fonds National de la Recherche (FNR), Luxembourg, through RECOMMEND 15/IS/10449467 and FIXPATTERN C15/IS/9964569.

Bibliography

2. def (2019) Last Accessed: May 2019. Defects4J. https://github.com/rjust/defects4j/releases/tag/v1.2.0
3. par (2019) Last Accessed: May 2019. PAR Fix Templates. https://sites.google.com/site/autofixhkust/home/fix-templates
4. pro (2019) Last Accessed: May 2019. Program Repair. http://program-repair.org
5. Abreu et al. (2007) Rui Abreu, Arjan JC Van Gemund, and Peter Zoeteweij. 2007. On the accuracy of spectrum-based fault localization. In Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION. IEEE, 89–98.
6. Abreu et al. (2009b) Rui Abreu, Peter Zoeteweij, Rob Golsteijn, and Arjan JC Van Gemund. 2009b. A practical evaluation of spectrum-based fault localization. Journal of Systems and Software 82, 11 (2009), 1780–1792.
7. Abreu et al. (2009a) Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2009a. Spectrum-based multiple fault localization. In Proceedings of the 24th International Conference on Automated Software Engineering. IEEE, 88–99.
8. Bhatia et al. (2018) Sahil Bhatia, Pushmeet Kohli, and Rishabh Singh. 2018. Neuro-symbolic program corrector for introductory programming assignments. In Proceedings of the 40th International Conference on Software Engineering. ACM, 60–70.