Turaco: Complexity-Guided Data Sampling for Training Neural Surrogates of Programs
Alex Renda, Yi Ding, Michael Carbin

TL;DR
This paper introduces a complexity-guided data sampling method for training neural surrogates of programs, improving accuracy by selecting training data based on path complexity analysis.
Contribution
It proposes a novel sampling methodology that leverages program path complexity to enhance neural surrogate training accuracy.
Findings
Complexity-guided sampling improves surrogate accuracy.
Path complexity analysis informs better data selection.
The method was evaluated on real-world programs, where it improved surrogate accuracy.
Abstract
Programmers and researchers are increasingly developing surrogates of programs, models of a subset of the observable behavior of a given program, to solve a variety of software development challenges. Programmers train surrogates from measurements of the behavior of a program on a dataset of input examples. A key challenge of surrogate construction is determining what training data to use to train a surrogate of a given program. We present a methodology for sampling datasets to train neural-network-based surrogates of programs. We first characterize the proportion of data to sample from each region of a program's input space (corresponding to different execution paths of the program) based on the complexity of learning a surrogate of the corresponding execution path. We next provide a program analysis to determine the complexity of different paths in a program. We evaluate these…
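As a rough illustration of the sampling idea in the abstract, the sketch below splits a fixed training budget across execution paths according to a per-path complexity score, then draws inputs from each path's region. It is not the paper's actual allocation rule or program analysis: the names `allocate_samples`, `sample_dataset`, the exponent `alpha`, the toy `program`, and the complexity scores are all hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple


def program(x: float) -> float:
    """Toy program with two execution paths (hypothetical example)."""
    if x < 0:
        return 0.0        # path "neg": constant output, easy to learn
    return x * x * x      # path "pos": cubic output, harder to learn


def allocate_samples(complexity: Dict[str, float], total: int,
                     alpha: float = 1.0) -> Dict[str, int]:
    """Split `total` samples across paths in proportion to complexity**alpha.

    The proportional rule and `alpha` are assumptions of this sketch.
    Leftover samples from rounding go to the largest fractional remainders,
    so the counts always sum to `total`.
    """
    weights = {p: c ** alpha for p, c in complexity.items()}
    norm = sum(weights.values())
    exact = {p: total * w / norm for p, w in weights.items()}
    counts = {p: int(e) for p, e in exact.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda p: exact[p] - counts[p], reverse=True)
    for p in by_remainder[:leftover]:
        counts[p] += 1
    return counts


def sample_dataset(counts: Dict[str, int],
                   samplers: Dict[str, Callable[[], float]]) -> List[Tuple[float, float]]:
    """Draw the allotted number of inputs per path and label them by running the program."""
    data = []
    for path, n in counts.items():
        for _ in range(n):
            x = samplers[path]()
            data.append((x, program(x)))
    return data


# Made-up complexity scores; in the paper these come from a program analysis.
complexity = {"neg": 1.0, "pos": 3.0}
samplers = {
    "neg": lambda: random.uniform(-1.0, -1e-9),  # inputs that take the x < 0 path
    "pos": lambda: random.uniform(0.0, 1.0),     # inputs that take the x >= 0 path
}

counts = allocate_samples(complexity, total=100)
dataset = sample_dataset(counts, samplers)
```

With these made-up scores, the more complex path receives three quarters of the budget. A uniform sampler over the input space would instead split the budget by how often each path is taken, regardless of how hard that path is to learn.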
