Mobile-App Analysis and Instrumentation Techniques Reimagined with DECREE
Yixue Zhao

TL;DR
DECREE is an infrastructure that enhances the reproducibility, reusability, and comparability of mobile app analysis and instrumentation techniques, aiming to improve research practices and practical adoption.
Contribution
It introduces DECREE, a modular infrastructure that standardizes evaluation, facilitates discovery, and simplifies replication of mobile app analysis tools.
Findings
DECREE enables reproducible evaluation of techniques.
DECREE supports easy discovery and reuse of solutions.
DECREE facilitates replication studies and comparison.
Abstract
A large number of mobile-app analysis and instrumentation techniques have emerged in the past decade. However, those techniques' components are difficult to extract and reuse outside their original tools, their evaluation results are hard to reproduce, and the tools themselves are hard to compare. This paper introduces DECREE, an infrastructure intended to guide such techniques to be reproducible, practical, reusable, and easy to adopt in practice. DECREE allows researchers and developers to easily discover existing solutions to their needs, enables unbiased and reproducible evaluation, and supports easy construction and execution of replication studies. The paper describes DECREE's three modules and its potential to fundamentally alter how research is conducted in this area.
Advisor: Nenad Medvidovic
University of Southern California
I Introduction
Current mobile computing research has focused extensively on three threads:
1. static analysis techniques that analyze the apps’ implementation artifacts statically to extract information of interest (e.g., security vulnerabilities [1, 2]);
2. instrumentation techniques that improve targeted aspects (e.g., performance [3, 4, 5]) of an app by directly modifying the app’s implementation; and
3. auxiliary techniques that analyze external information associated with mobile apps to learn useful lessons (e.g., our recent work that assessed prefetching and caching opportunities [6]).
However, there is a pronounced gap between the emergence of static analysis and instrumentation techniques in research and their adoption in practice, for four reasons:
1. There is no established communication channel between researchers and app developers, so the techniques may not meet the exact needs in practice and may violate real-world assumptions.
2. Research techniques often have steep learning curves, making them difficult to adopt.
3. Research techniques are often evaluated in limited settings, rendering any claims insufficiently convincing for app developers to adopt.
4. Existing techniques are usually designed as one-off solutions, making them hard to reproduce, reuse, or customize.
We have faced this gap in our prior research [2, 3, 6, 5]. The research community at large is also beginning to recognize this gap and the wasted opportunities it causes [1, 7, 8, 9]. Inspired in part by these early efforts, but going beyond them, our proposed work, DECREE, aims to transform how research in the mobile arena is conducted in order to produce reusable, practical, and reproducible research that is easier to adopt in practice. While the concepts behind DECREE are independent of the specific technology used to develop mobile apps, we focus on Android due to its dominant market share.
DECREE is an infrastructure that provides a comprehensive baseline for Developing, Evaluating, Composing, Reusing, Evolving, and Exploring research techniques in the mobile computing domain, with three research threads:
1. A microservice-based reference architecture for static analysis and instrumentation techniques, intended to be comprehensive in scope but simple enough to adopt and tailor. We will evaluate its reusability support and correctness by migrating existing techniques, and comparing the migrated and original techniques.
2. A corresponding testbed to rigorously evaluate and compare static analysis and instrumentation techniques with standard baselines. We will evaluate its correctness and effectiveness by comparing the obtained measurements of the original and migrated techniques.
3. A cloud-based open repository that contains DECREE-compatible techniques, allowing both researchers and app developers to easily discover what they need, and enabling unbiased comparison and replication studies of DECREE-compatible techniques in an automatic manner. We will evaluate its correctness and performance by reproducing the evaluations conducted in the second research thread and comparing their results.
DECREE makes the following contributions:
1. a reference architecture to guide the design of mobile computing techniques, so that they can be readily reused by other researchers and adopted by app developers;
2. a testbed with standard baselines to allow competing techniques to be evaluated fairly and thoroughly;
3. an open repository to bridge the gap between researchers and developers and allow them to leverage each other’s knowledge; and
4. reproduced evaluation results of existing techniques to benefit future research and enable replication studies.
The rest of the paper is structured as follows. Section II details the three research threads. Section III presents our progress to date and our evaluation plan. Section IV overviews related work, and Section V concludes the paper.
II Proposed Approach
This section describes DECREE’s three research threads.
II-A The DECREE Reference Architecture—DECREE-RA
We design DECREE-RA based on existing static analysis and instrumentation techniques and on our own experience in the mobile computing domain [6, 3, 2, 1, 10, 11, 4]. Our aim is to decompose mobile computing techniques into reusable, modular components at a proper granularity, so that the architecture can both serve as a roadmap for future techniques and improve the reusability of existing techniques.
DECREE-RA’s design is based on the microservice architectural style for three reasons.
1. The microservice style helps to decouple potentially complex functionality into lightweight, “black-box” microservices, which are easy to understand and adopt.
2. Existing analysis and instrumentation techniques tend to comprise clearly separable components, and the microservice style would make it easier to reuse such components across techniques.
3. The microservice style allows components (i.e., microservices) to be implemented in different programming languages with different technologies, which suits the heterogeneity of the mobile computing domain.
As Fig. 1 shows, DECREE-RA consists of six components. An individual static analysis or instrumentation technique can consist of one or more of the reference components.
Intermediate Representer takes an app or the OS (e.g., the Android framework) as its input and produces an Intermediate Representation (IR) for Static Analyzer to analyze. An IR can be used by other Intermediate Representers to build new IRs; e.g., a tool-specific IR is usually built on top of foundational IRs, such as the control-flow graph (CFG) of an app.
Static Analyzer analyzes the IR to extract information that can be used in other components, such as an app’s or OS’s program point to be instrumented. For instance, PerfChecker [4] has a Static Analyzer to detect performance bugs.
App Instrumenter transforms the original app, usually based on the information extracted by the Static Analyzer. An App Instrumenter is either an Automatic App Instrumenter or a Manual App Instrumenter (e.g., one used via APIs that leverage annotations), and it usually needs to be configured so that the instrumented app can interact with other specific components at runtime, such as Backend Service.
OS Instrumenter is similar to App Instrumenter, but it instruments the OS (e.g., Android). OS Instrumenters (OSIs) can likewise be categorized as Automatic and Manual OSIs. For instance, the Interceptor in our prior work SEALANT [2] is a Manual OSI that extends the Android framework to block malicious intents at runtime.
Device Monitor observes the device-level conditions at app runtime, typically to balance the quality-of-service (QoS) trade-offs. Similarly to the App Instrumenter, it also needs to be configured in order to interact with other components at runtime, such as the Backend Service.
Backend Service contains the ancillary functionalities that are triggered at app runtime. It will interact with the instrumented app and the Device Monitor via a lightweight protocol, such as REST. The ancillary functionalities are usually triggered by specific information sent from the instrumented app or the Device Monitor, such as prefetching HTTP requests aggressively when battery power is sufficient.
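As a minimal sketch of how three of these components could compose, the Python fragment below chains an Intermediate Representer, a Static Analyzer, and an App Instrumenter. It is an illustration only, not DECREE-RA's actual interface: the App and IR data classes, the method names, and the toy implementations (CopyRepresenter, HttpCallFinder, PrefetchInstrumenter, the "http_get" and "notify_backend()" strings) are all hypothetical, and in-process calls stand in for what would be microservice requests.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class App:
    """Hypothetical stand-in for an app: callback name -> list of statements."""
    name: str
    callbacks: dict[str, list[str]] = field(default_factory=dict)


@dataclass
class IR:
    """Hypothetical intermediate representation with the same toy shape."""
    callbacks: dict[str, list[str]]


class IntermediateRepresenter(Protocol):
    def represent(self, app: App) -> IR: ...


class StaticAnalyzer(Protocol):
    def analyze(self, ir: IR) -> list[str]: ...


class AppInstrumenter(Protocol):
    def instrument(self, app: App, points: list[str]) -> App: ...


class CopyRepresenter:
    """Toy Intermediate Representer: copies the callbacks into an IR."""
    def represent(self, app: App) -> IR:
        return IR({name: list(stmts) for name, stmts in app.callbacks.items()})


class HttpCallFinder:
    """Toy Static Analyzer: reports callbacks that contain an HTTP call."""
    def analyze(self, ir: IR) -> list[str]:
        return sorted(
            name for name, stmts in ir.callbacks.items()
            if any("http_get" in stmt for stmt in stmts)
        )


class PrefetchInstrumenter:
    """Toy App Instrumenter: prepends a backend notification at each point."""
    def instrument(self, app: App, points: list[str]) -> App:
        patched = {name: list(stmts) for name, stmts in app.callbacks.items()}
        for point in points:
            patched[point].insert(0, "notify_backend()")
        return App(app.name + "-instrumented", patched)


def run_pipeline(app: App,
                 representer: IntermediateRepresenter,
                 analyzer: StaticAnalyzer,
                 instrumenter: AppInstrumenter) -> App:
    """Chain the three components; the original app is left unmodified."""
    ir = representer.represent(app)
    points = analyzer.analyze(ir)
    return instrumenter.instrument(app, points)
```

Because each stage depends only on the interface of the previous one, any stage could be swapped for another implementation, which is the kind of reuse the reference architecture aims for.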
II-B The DECREE Testbed—DECREE-TB
DECREE-TB is a testbed for evaluating both static analysis techniques and instrumentation techniques in a reproducible and unbiased manner. It is intended to support the testing of techniques that follow DECREE-RA’s design, and of apps produced by instrumentation techniques.
II-B1 Testing of DECREE-Compatible Techniques
A technique can be evaluated at the level of a microservice API with unit test cases provided by the technique’s original developers. Each test case is executed in DECREE-TB’s controlled environment with a built-in monitoring system to record the relevant non-functional properties (NFPs).
DECREE-TB will store the raw test results of each unit test. These results will be useful for researchers to calculate coarser-grained evaluation metrics and compare different techniques. For instance, the accuracy of a given technique can be calculated from the number of relevant passing tests. Additionally, DECREE-TB’s built-in controlled testing environment and the NFP monitoring system make it possible to compare different techniques fairly, under identical conditions.
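The following Python sketch illustrates the idea of storing raw per-test results and deriving a coarser metric from them. It is not DECREE-TB's implementation: the names UnitTestResult, run_test, and accuracy are hypothetical, execution time is the only NFP recorded, and accuracy is assumed here to be the plain fraction of passing tests.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class UnitTestResult:
    name: str
    passed: bool
    seconds: float  # execution time, the only NFP recorded in this sketch


def run_test(name: str, test: Callable[[], bool]) -> UnitTestResult:
    """Run one unit test, recording its outcome and wall-clock time."""
    start = time.perf_counter()
    try:
        passed = bool(test())
    except Exception:
        passed = False  # a crashing test is counted as a failure
    return UnitTestResult(name, passed, time.perf_counter() - start)


def accuracy(results: list[UnitTestResult]) -> float:
    """Fraction of passing tests; 0.0 when there are no results."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)
```

Keeping the raw UnitTestResult records, rather than only the aggregate, lets other metrics be computed later from the same runs.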
II-B2 Automated Differential Testing
DECREE-TB also supports testing an instrumented app, to verify that its functional behavior is identical to the original app’s, that it has no unwanted side effects, and that it exhibits the desired NFPs (e.g., acceptable performance overhead). This is critical but often neglected in the evaluation of existing instrumentation techniques. Differential testing has three automated phases:
1. In the differential test generation phase, the challenge is to efficiently achieve high coverage of the different parts of the apps with confidence. To address the challenge, we propose a novel path-sensitive automatic test generation technique at the granularity of callbacks. Callbacks are the essential representation of user interactions [10]. For example, the onClick callback represents a user’s click on a GUI widget. Our insight is that the instrumented app should have the same functionalities visible to the users (in addition to the desired NFPs), compared to the original app after each user interaction (i.e., callback). Thus, our test cases aim to cover every possible execution path in the callbacks that contain the difference.
2. In the comparative testing phase, DECREE-TB will automatically identify the “checkpoints” for each test case generated in the previous phase and will run each test case on the original and instrumented apps to get the results at the checkpoint. To render the scope of this research feasible, we will specifically focus on one type of functional and one type of non-functional checkpoints. The functional checkpoint will be inserted at each UI update point, and will be used to verify the instrumented app’s correctness. The non-functional checkpoint will be inserted before and after each modified callback, and will be used to verify performance.
3. The pair comparison phase takes pair-wise results generated by comparative testing as input, and compares the results of the original app and the instrumented app for researchers to see if their instrumentation technique works as expected.
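The pair comparison phase can be sketched as follows. This is a hedged illustration rather than DECREE-TB's design: CheckpointResult, PairVerdict, and compare_pair are hypothetical names, UI snapshots are reduced to strings, and the non-functional checkpoint is assumed to yield one elapsed time in milliseconds per modified callback.

```python
from dataclasses import dataclass


@dataclass
class CheckpointResult:
    # Functional checkpoints: one snapshot per UI update, in order.
    ui_states: list[str]
    # Non-functional checkpoints: elapsed time per modified callback (ms).
    callback_ms: dict[str, float]


@dataclass
class PairVerdict:
    functionally_equal: bool
    # Instrumented minus original time per callback; negative means faster.
    delta_ms: dict[str, float]


def compare_pair(original: CheckpointResult,
                 instrumented: CheckpointResult) -> PairVerdict:
    """Compare one test case's results on the original and instrumented app."""
    same_ui = original.ui_states == instrumented.ui_states
    shared = original.callback_ms.keys() & instrumented.callback_ms.keys()
    delta = {
        cb: instrumented.callback_ms[cb] - original.callback_ms[cb]
        for cb in sorted(shared)
    }
    return PairVerdict(same_ui, delta)
```

A verdict with functionally_equal set to False flags a behavioral difference at some UI update point, while delta_ms reports the performance change for each callback measured in both runs.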
II-C The DECREE Repository—DECREE-RP
The goal of DECREE-RP is to improve the availability, reusability, and reproducibility of analysis and instrumentation techniques by providing a cloud-based open repository readily accessible to both app developers and researchers, and a built-in testing engine integrated from DECREE-TB, to enable unbiased, reproducible evaluation among DECREE-compatible techniques in an automatic and customizable manner.
Fig. 2 shows the overview of DECREE-RP.
Its open repository consists of four databases.
1. Microservice pool contains the microservice-based DECREE-compatible techniques uploaded by researchers, along with their API documentation and, if they have been evaluated by DECREE-TB, their test results.
2. Service request pool contains requests from app developers for specific capabilities they need. Developers can submit test scripts with their service requests to describe their expected results, which can serve as the “ground truth” in the evaluation of research techniques.
3. Test script pool stores these test scripts, where they can be reused by researchers to determine if their techniques meet the developers’ needs, and to compare their techniques with competing techniques under the same baselines. A test script can specify benchmark apps to be tested.
4. Benchmark pool stores these benchmark apps; developers or researchers can upload new apps to benefit others.
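To make the relationship between the four pools concrete, here is a small in-memory sketch. It rests on assumptions not stated in the paper: the Repository, Technique, and ServiceRequest classes are hypothetical, test scripts are identified by name only, and a technique is treated as a candidate for a request when it offers the requested capability and has passed every test script attached to that request.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRequest:
    capability: str
    test_scripts: list[str]  # names of scripts describing expected results


@dataclass
class Technique:
    name: str
    capabilities: set[str]
    passed_scripts: set[str] = field(default_factory=set)


@dataclass
class Repository:
    microservice_pool: dict[str, Technique] = field(default_factory=dict)
    service_request_pool: list[ServiceRequest] = field(default_factory=list)
    test_script_pool: set[str] = field(default_factory=set)
    benchmark_pool: set[str] = field(default_factory=set)

    def upload_technique(self, technique: Technique) -> None:
        self.microservice_pool[technique.name] = technique

    def submit_request(self, request: ServiceRequest) -> None:
        """Store the request and file its test scripts in the script pool."""
        self.service_request_pool.append(request)
        self.test_script_pool.update(request.test_scripts)

    def candidates(self, request: ServiceRequest) -> list[str]:
        """Names of techniques offering the capability and passing all scripts."""
        needed = set(request.test_scripts)
        return sorted(
            t.name for t in self.microservice_pool.values()
            if request.capability in t.capabilities
            and needed <= t.passed_scripts
        )
```

Sharing the test script pool between requests and techniques is what allows competing techniques to be compared against the same developer-provided baseline.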
DECREE-RP’s testing engine integrates DECREE-TB and defines a test scripting language to configure the evaluation of DECREE-compatible techniques, as well as differential testing of instrumented apps in DECREE-TB’s controlled testing environment (Section II-B).
The test scripting language specifies
1. the test cases to be executed;
2. subject apps to be evaluated from the benchmark pool;
3. the testing environment (e.g., configurations of the desktop environment for running the DECREE-compatible techniques or versions of the Android device/OS for the apps); and
4. standard NFP metrics whose monitoring should be enabled (e.g., execution time).
With the standard evaluation protocol and a controlled cloud-based testing environment, DECREE-RP’s testing engine has the potential to alter how the evaluation of research techniques is conducted currently, to enable reproducible and unbiased evaluation and to bypass the often unavoidable time-consuming engineering work (downloading subject apps, setting up controlled testing environments, executing tests and recording their results under varying conditions, etc.).
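The four elements that a test script specifies could be modeled as below. This is a sketch only, since the paper does not define the scripting language's syntax: EvaluationScript, validate, the KNOWN_METRICS set, and all field names are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed set of supported NFP metrics; the real set is not given in the paper.
KNOWN_METRICS = {"execution_time", "energy_consumption"}


@dataclass
class EvaluationScript:
    test_cases: list[str]         # (1) test cases to be executed
    subject_apps: list[str]       # (2) apps drawn from the benchmark pool
    environment: dict[str, str]   # (3) e.g. desktop or Android OS settings
    nfp_metrics: list[str] = field(
        default_factory=lambda: ["execution_time"]
    )                             # (4) NFP metrics to monitor


def validate(script: EvaluationScript, benchmark_pool: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the script is usable."""
    problems = []
    if not script.test_cases:
        problems.append("no test cases listed")
    for app in script.subject_apps:
        if app not in benchmark_pool:
            problems.append(f"unknown benchmark app: {app}")
    for metric in script.nfp_metrics:
        if metric not in KNOWN_METRICS:
            problems.append(f"unknown NFP metric: {metric}")
    return problems
```

Validating a script before execution means a missing benchmark app or an unsupported metric is reported up front rather than partway through an evaluation run.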
III Preliminary Results and Evaluation Plan
To date, we have developed two mobile computing techniques [3, 6] and designed the DECREE infrastructure. We are in the process of migrating our two techniques to DECREE-RA and of developing DECREE-TB and DECREE-RP. The rest of the section describes our progress and evaluation plan in detail.
Our prior work PALOMA [3] is an instrumentation technique, with underlying static analyses, that reduces app latency by prefetching HTTP requests via four major components: (1) String Analyzer identifies suitable HTTP requests for prefetching by interpreting their URL values; (2) Callback Analyzer detects the program points to issue prefetching requests; (3) Instrumenter uses the above information to produce a prefetching-enabled app; (4) at app runtime, the instrumented app triggers PALOMA’s Proxy to issue prefetching requests and cache prefetched responses. Following DECREE-RA, we are implementing PALOMA’s String Analyzer and Callback Analyzer as two Static Analyzer microservices and one reusable Jimple [11] Intermediate Representer. PALOMA’s Instrumenter is being implemented as an App Instrumenter, while its Proxy is being implemented as a Backend Service that interacts with the instrumented app at runtime.
Another recent study [6] resulted in an auxiliary technique with underlying app instrumentation. It focused on the prefetching and caching opportunities in mobile apps in order to reduce app latency. It has an Instrumenter that instruments the original app to gather needed information regarding HTTP requests and responses at app runtime, which is used to calculate different statistics for answering research questions, such as “Are Expires headers trustworthy?”. With DECREE-RA, the Instrumenter is being implemented as one App Instrumenter microservice, and will reuse the Jimple Intermediate Representer microservice developed in PALOMA [3].
We will evaluate DECREE-RA’s support for reusability by measuring the common portions in the reimplemented DECREE-compatible techniques compared to the original techniques. We will evaluate the correctness by verifying that the functionalities of the original techniques remain unchanged in their DECREE-compatible counterparts.
We will reuse our prior work [3, 6] to develop DECREE-TB, with the focus on performance (i.e., execution time). Adding other metrics to evaluate (e.g., energy consumption) will be straightforward following the same protocol as performance. DECREE-TB’s differential testing will leverage our prior experience with program analysis [3] and will be first evaluated on the benchmark apps used originally in our work. We will evaluate its effectiveness by comparing the generated test cases with the “ground truth” obtained manually from the apps’ implementations. We will assess the accuracy of the applied test cases using the tests reported in our original technique as the baseline. Note that, while DECREE-TB’s differential testing is motivated by and targeted at instrumented apps, this technique is applicable to any app. Thus, we will further evaluate DECREE-TB’s effectiveness and accuracy on a broad cross-section of real Android apps.
To evaluate DECREE-RP, we will add the re-implemented DECREE-compatible techniques, benchmark apps, and test scripts to DECREE-RP’s open repository. Then we will reproduce the same tests conducted in DECREE-TB, but this time with the help of test scripts supported by DECREE-RP’s testing engine in order to evaluate its correctness and performance. All results will be recorded and made public throughout, to benefit future reproducibility studies.
IV Related Work
To the best of our knowledge, we are the first to propose a comprehensive infrastructure that provides baselines for mobile computing techniques across their development lifecycle. ReproDroid [1] proposes a framework for comparing taint analysis tools and reports a reproducibility study on six existing tools. However, ReproDroid is limited to taint analysis techniques (one particular type of static analysis techniques) and does not attempt to provide a way of redesigning and reimplementing those techniques for their future improved adaptation and reuse. The remainder of this section focuses on software testing, as it is related to DECREE-TB (Section II-B).
GUI Testing is widely adopted in testing mobile apps, for example in the form of model-based testing [12] and random testing [13]. These techniques target testing the functionalities of an app with high coverage to identify bugs, while DECREE-TB targets apps that are instrumented, often optimized from the original apps. Thus, DECREE-TB’s goal is complementary to existing work: instead of achieving high test coverage, it focuses on whether the instrumentation performs as expected with desired NFPs.
Regression Testing is a rich research area that focuses on changes to a program to ensure the changes do not break previous functionalities. It usually assumes that the previous test cases are known, and aims to prioritize or select from the previous test space [14], and to adapt or augment the previous test cases to the new changes [15]. DECREE-TB’s proposed differential testing technique can generate test cases automatically, without requiring any previous test cases, which would be challenging to obtain for researchers who are not the developers of the apps to be tested.
V Expected Contributions
DECREE takes the first step toward open science in the mobile computing domain, with infrastructure support and comprehensive baselines. It has the potential to fundamentally change how app analysis and instrumentation techniques are developed and to yield reusable, reproducible, practical techniques that benefit both future research and their adoption. An added advantage is that its microservices will be deployed on the cloud and will not introduce significant overhead on the apps deployed on resource-constrained mobile devices. Researchers will also be able to dynamically update their microservices without requiring modifications to the app code. In addition, DECREE’s test cases have the potential to serve as baselines for comparing different techniques in the same domain (e.g., app optimization for energy efficiency). Once the microservices are adopted by developers, the underlying research techniques will be “organically” evaluated in the real world with real users, providing further insights and incentives for researchers to improve their techniques.
[1] F. Pauck, E. Bodden, and H. Wehrheim, “Do android taint analysis tools keep their promises?” in Proceedings of the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), November 2018.
[2] Y. K. Lee, J. Y. Bang, G. Safi, A. Shahbazian, Y. Zhao, and N. Medvidovic, “A sealant for inter-app security holes in android,” in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 2017, pp. 312–323.
[3] Y. Zhao, M. S. Laser, Y. Lyu, and N. Medvidovic, “Leveraging program analysis to reduce user-perceived latency in mobile applications,” in Proceedings of the International Conference on Software Engineering (ICSE), May 2018.
[4] Y. Liu, C. Xu, and S.-C. Cheung, “Characterizing and detecting performance bugs for smartphone applications,” in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 1013–1024.
[5] Y. Zhao, “Toward client-centric approaches for latency minimization in mobile applications,” in Mobile Software Engineering and Systems (MOBILESoft), 2017 IEEE/ACM 4th International Conference on. IEEE, 2017, pp. 203–204.
[6] Y. Zhao, P. Wat, M. S. Laser, and N. Medvidović, “Empirically assessing opportunities for prefetching and caching in mobile apps,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 2018, pp. 554–564.
[7] M. Harman and P. O’Hearn, “From start-ups to scale-ups: Opportunities and open problems for static and dynamic program analysis.”
[8] T. Z. Robert Feldt, Tim Menzies, “Rose festival, FSE 2018. recognizing and rewarding open science in software engineering.” [Online]. Available: https://github.com/researchart/rose-fse 18
