BenchPress: Analyzing Android App Vulnerability Benchmark Suites
Joydeep Mitra, Venkatesh-Prasad Ranganath, Aditya Narkar

TL;DR
This paper empirically evaluates four Android vulnerability benchmark suites by analyzing API usage in real-world apps and on Stack Overflow, providing insights to improve benchmark selection and development.
Contribution
It offers a systematic comparison of benchmark suites based on API coverage and identifies gaps for extending these benchmarks.
Findings
Coverage analysis of benchmark APIs in real apps
Pairwise comparison of benchmark suites
Identification of security APIs not covered by benchmarks
[Figures (two with the same layout). Panels: DroidBench, Ghera, ICCBench, UBCBench. x-axis: relevant APIs in decreasing order of use percentage for API levels 23-27. Plotted series: % of real-world apps using a relevant API; % of Android-related Stack Overflow posts discussing a relevant API.]
BenchPress: Analyzing Android App Vulnerability Benchmark Suites
Joydeep Mitra, Venkatesh-Prasad Ranganath, Aditya Narkar
Kansas State University, USA
{joydeep,rvprasad,avnarkar}@ksu.edu
(Created: January 29, 2019. Revised: September 19, 2019.)
Abstract
In recent years, various benchmark suites have been developed to evaluate the efficacy of Android security analysis tools. The choice of such benchmark suites used in tool evaluations is often based on the availability and popularity of suites and not on their characteristics and relevance. One of the reasons for such choices is the lack of information about the characteristics and relevance of benchmark suites.
In this context, we empirically evaluated four Android-specific benchmark suites: DroidBench, Ghera, ICCBench, and UBCBench. For each benchmark suite, we identified the APIs used by the suite that were discussed on Stack Overflow in the context of Android app development and measured the usage of these APIs in a sample of 227K real-world apps (coverage). We also compared each pair of benchmark suites to identify the differences between them in terms of API usage. Finally, we identified security-related APIs used in real-world apps but not in any of the above benchmark suites to assess the opportunities to extend benchmark suites (gaps).
The findings in this paper can help 1) Android security analysis tool developers choose benchmark suites that are best suited to evaluate their tools (informed by coverage and pairwise comparison) and 2) Android app vulnerability benchmark creators develop and extend benchmark suites (informed by gaps).
1 Introduction
1.1 Motivation
The effectiveness of Android security analysis tools is evaluated with benchmarks and real-world apps. The effectiveness of static taint analysis tools like AmanDroid [20], FlowDroid [2], HornDroid [4], and IccTA [9] has been evaluated by applying them to benchmarks from the DroidBench, ICCBench, and UBCBench [14] benchmark suites and comparing tool verdicts with benchmark labels that indicate the presence/absence of a specific vulnerability or malicious behavior.
Such tool evaluations have used benchmarks without evaluating the authenticity and the representativeness of the benchmarks. Authenticity is the truthfulness of the claim about the presence/absence of a vulnerability or malicious behavior in a benchmark (Section 2.2.2 in [11]). Representativeness is the similarity between the manifestation/occurrence of a vulnerability in a benchmark and in real-world apps (Section 3 in [15]). Consequently, the usefulness of findings from these evaluations is diminished in terms of the ability of tools and techniques to detect vulnerabilities or malicious behaviors (due to authenticity) and the general applicability of tools and techniques (due to representativeness).
Recently, there have been two efforts focused on the authenticity of benchmarks. Mitra and Ranganath [11] created Ghera, a suite of demonstrably authentic Android app vulnerability benchmarks, to address the issue of authenticity. They also established the representativeness of Ghera benchmarks (in terms of API usage) [15]. Pauck et al. [13] developed ReproDroid, a tool to help verify the authenticity of Android app vulnerability benchmarks. They found that not all claims about the presence/absence of vulnerabilities in benchmarks in the DIALDroid, DroidBench, and ICCBench benchmark suites were true.
It is common in other communities to study and characterize benchmarks. In the program analysis community, Blackburn et al. [3] developed and used metrics based on static and dynamic properties of programs to characterize and compare the DaCapo benchmarks with SPEC Java benchmarks [17]. Isen et al. [8] measured several properties of embedded Java benchmarks and how well they represent real-world mobile apps. In the systems community, Pallister et al. [12] characterized benchmarks based on the energy consumption properties of embedded platforms. In the database community, such assessments have been around since the 1990s [6]. However, such close scrutiny of benchmark suites has not occurred in the Android security community.
Motivated by the aforementioned efforts to analyze and characterize benchmarks, we undertook an effort to assess the representativeness of multiple Android app vulnerability benchmark suites, i.e., how well does a benchmark suite represent real-world apps?
1.2 Research Questions
The objective of our effort is to answer the following research questions:
- RQ1: In general, do Android app vulnerability benchmark suites use APIs that are used by real-world apps and discussed by Android app developers? The question is aimed at understanding the representativeness of benchmark suites and the relevance of the APIs used to capture vulnerabilities in benchmark suites. The answer to this question can help associate an element of confidence to benchmarks and, consequently, to tool evaluations that use such benchmarks.
- RQ2: In the context of security, do Android app vulnerability benchmark suites use APIs that are used by real-world apps and discussed by Android app developers? Similar to RQ1, this question is intended to understand the representativeness and relevance of the security-related APIs used by benchmark suites but in the specific context of security.
- RQ3: How do the considered benchmark suites differ in terms of API usage? The purpose of this question is to identify the common and unique APIs between benchmark suite pairs. The answer to this question can help tool developers choose appropriate benchmark suites to test/evaluate their tools.
- RQ4: Do real-world apps use security-related APIs not used by any benchmark suite? The purpose of this question is to identify gaps between existing benchmark suites and the real-world apps in terms of security-related APIs. The answer to this question can steer security analysis efforts towards unexplored Android APIs to possibly uncover new vulnerabilities and enhance existing benchmark suites.
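The comparisons behind RQ3 and RQ4 can be read as set operations over API names. The following is a minimal sketch under that reading, not the procedure used in the paper; the function names, the suite names `SuiteA` and `SuiteB`, and the toy API sets are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def pairwise_comparison(
    suites: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], Dict[str, Set[str]]]:
    """For every pair of suites, split their APIs into common and unique sets."""
    result = {}
    for first, second in combinations(sorted(suites), 2):
        result[(first, second)] = {
            "common": suites[first] & suites[second],
            "only_first": suites[first] - suites[second],
            "only_second": suites[second] - suites[first],
        }
    return result


def uncovered_security_apis(
    suites: Dict[str, Set[str]],
    real_world_security_apis: Set[str],
) -> Set[str]:
    """Security-related APIs seen in real-world apps but in no suite (gaps)."""
    covered = set().union(*suites.values())
    return real_world_security_apis - covered


# Toy data (hypothetical; not taken from the paper's measurements).
suites = {
    "SuiteA": {
        "android.content.Context.startActivity",
        "android.content.SharedPreferences.getString",
    },
    "SuiteB": {
        "android.content.Context.startActivity",
        "android.telephony.TelephonyManager.getDeviceId",
    },
}
real_world_security_apis = {
    "javax.crypto.Cipher.getInstance",
    "android.telephony.TelephonyManager.getDeviceId",
}

comparison = pairwise_comparison(suites)
gaps = uncovered_security_apis(suites, real_world_security_apis)
print(comparison[("SuiteA", "SuiteB")]["common"])
print(gaps)
```

Keying the result by the sorted pair of suite names keeps each pair unique, so a lookup does not depend on the order in which suites were supplied.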
1.3 Contributions
In this paper, we make the following contributions:
- Provide empirical evidence about the representativeness of four Android app vulnerability benchmark suites.
- Identify gaps between the evaluated benchmark suites and real-world apps in terms of APIs that are used in real-world apps but not in the benchmark suites.
- Extend and improve the framework for empirical evaluation of Android app vulnerability benchmarks introduced by Ranganath and Mitra [15]. We believe this framework can be used by other researchers to conduct similar studies in other domains as well.
In addition to these contributions, we hope this effort will spark the interest of the empirical software engineering community to study Android app vulnerability benchmarks and help improve Android app security.
The remainder of the paper is structured as follows. Section 2 outlines the metric of representativeness along with the benchmark suites and the real-world app sample used in the study. Section 3 describes the experiment to measure representativeness. Section 4 and the sections that follow it discuss the answers to the posed research questions. Section 7 describes the threats to the validity of the experiment. Section 8 describes related work. Section 9 provides information about the artefacts used in this effort. Section 11 summarizes the findings from this effort.
2 Concepts and Subjects
2.1 API usage as a measure of representativeness
Representative vulnerability benchmarks should have two aspects. First, they should capture vulnerabilities that occur in the real world. Second, the manifestation of vulnerabilities in representative benchmarks should be similar (if not identical) to that in real-world apps.
Ranganath and Mitra [15] observed this challenge while establishing the representativeness of Ghera benchmarks. So, they introduced the notion of using API usage as a weak but general measure of representativeness of benchmarks. They reasoned “the likelihood of a vulnerability occurring in real-world apps is directly proportional to the number of real-world apps using the Android APIs involved in the vulnerability”. Consequently, to measure the representativeness of benchmarks, they measured how often APIs used in benchmarks were used in real-world apps.
In this evaluation, we use the above notion and a similar approach to measure the representativeness of benchmarks.
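To make the measure concrete, the sketch below computes, for each API used by a benchmark suite, the percentage of real-world apps that use it. It assumes the set of APIs used by each app has already been extracted; the function name `api_usage_percentage`, the app names, and the toy data are hypothetical and do not come from the paper.

```python
from typing import Dict, Iterable, Set


def api_usage_percentage(
    benchmark_apis: Iterable[str],
    app_apis: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Return, for each benchmark API, the percentage of apps that use it."""
    total_apps = len(app_apis)
    if total_apps == 0:
        raise ValueError("need at least one real-world app")
    usage = {}
    for api in benchmark_apis:
        users = sum(1 for apis in app_apis.values() if api in apis)
        usage[api] = 100.0 * users / total_apps
    return usage


# Toy data (hypothetical): the APIs used by four apps.
apps = {
    "app_a": {
        "android.content.Context.startActivity",
        "android.webkit.WebView.loadUrl",
    },
    "app_b": {"android.content.Context.startActivity"},
    "app_c": {"android.content.Context.sendBroadcast"},
    "app_d": {
        "android.content.Context.startActivity",
        "android.content.Context.sendBroadcast",
    },
}
suite_apis = [
    "android.content.Context.startActivity",
    "android.content.Context.sendBroadcast",
    "android.webkit.WebView.loadUrl",
]

usage = api_usage_percentage(suite_apis, apps)
# List APIs in decreasing order of use percentage.
for api, pct in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pct:5.1f}%  {api}")
```

Under the quoted reasoning, an API with a high percentage suggests that a vulnerability involving it is more likely to occur in real-world apps; the measure says nothing about whether the API is used in a vulnerable way, which is why it is described as weak.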
2.2 Benchmarks
For this study, we considered four benchmark suites related to Android app vulnerabilities: DroidBench, Ghera, ICCBench, and UBCBench.
DroidBench [13] contains 211 benchmarks. Each benchmark is an Android app that captures zero or more information leak vulnerabilities. The vulnerabilities captured in DroidBench primarily stem from the Inter-Component Communication (ICC) feature of Android and general features of Java.
Ghera [11] contains 60 benchmarks that capture mostly known Android app vulnerabilities along with a few unknown Android app vulnerabilities. Each benchmark includes 3 Android apps: 1) a benign app that contains vulnerability x, 2) a malicious app that exploits vulnerability x in the benign app, and 3) a secure app that does not contain vulnerability x and thus cannot be exploited by the malicious app. In this evaluation, we considered only the benign apps from each benchmark and we will refer to them as Ghera benchmarks in the rest of this paper. Unlike in DroidBench, the vulnerabilities in Ghera stem from different Android features including ICC.
ICCBench [13] contains 24 benchmarks. Each benchmark is an Android app that captures zero or more information leak vulnerabilities. ICCBench focuses on capturing vulnerabilities that stem from communication between apps via ICC.
UBCBench [14] contains 16 benchmarks. Each benchmark is an Android app that captures at most one information leak vulnerability. UBCBench captures information flow vulnerabilities primarily stemming from the ICC and SharedPreferences features of Android and general features of Java. (A SharedPreference is a file that stores key-value pairs and can be private to an app or shared.)
2.3 Real World Apps
We collected 700K apps from AndroZoo [1] in March 2019. From this set of 700K apps, we curated a set of 473K apps that target API levels 19 through 27. An API level uniquely identifies the framework API revision offered by a version of the Android platform. In an Android app, the minimum API level is the least framework API version required by the app and the target API level is the framework API version targeted by the app. For this evaluation, we initially picked target API levels 19 through 27 because most benchmarks targeted these API levels. However, we later discovered that Android currently does not support API levels 19 through 22. Therefore, to make the evaluation current, from the set of 473K apps, we retained only apps that target API levels 23 through 27. Hence, we ended up with a sample of 226K real-world Android apps. The accompanying table (label tab:real_world_app_sample) provides the distribution of this sample across the considered target API levels.
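The selection step described above can be sketched as a filter on target API level followed by a per-level count. This is an illustration only: it assumes each app's target API level is already known (for example, read from its manifest), and the function names, file names, and levels below are hypothetical toy data.

```python
from collections import Counter
from typing import Dict


def select_by_target_level(
    target_levels: Dict[str, int], low: int = 23, high: int = 27
) -> Dict[str, int]:
    """Keep only apps whose target API level lies in [low, high]."""
    return {
        app: level
        for app, level in target_levels.items()
        if low <= level <= high
    }


def level_distribution(target_levels: Dict[str, int]) -> Dict[int, int]:
    """Count the selected apps per target API level, in ascending level order."""
    return dict(sorted(Counter(target_levels.values()).items()))


# Toy data (hypothetical): app file name -> target API level.
collected = {
    "a.apk": 19,
    "b.apk": 22,
    "c.apk": 23,
    "d.apk": 25,
    "e.apk": 25,
    "f.apk": 27,
    "g.apk": 28,
}

sample = select_by_target_level(collected)
distribution = level_distribution(sample)
print(sorted(sample))
print(distribution)
```

The default bounds mirror the range of target API levels retained in the text (23 through 27); passing `low=19` would reproduce the wider initial selection.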
References
- [1] Kevin Allix, Tegawéndé F. Bissyande, Jacques Klein, and Yves Le Traon. Androzoo: Collecting millions of android apps for the research community. In Proceedings of the 13th International Conference on Mining Software Repositories, pages 468–471. ACM, 2016.
- [2] Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 259–269. ACM, 2014. https://github.com/secure-software-engineering/FlowDroid, Accessed: 21-Nov-2017.
- [3] Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. The dacapo benchmarks: Java benchmarking development and analysis. SIGPLAN Not., pages 169–190, 2006.
- [4] Stefano Calzavara, Ilya Grishchenko, and Matteo Maffei. Horndroid: Practical and sound static analysis of android applications by SMT solving. In 2016 IEEE European Symposium on Security and Privacy, pages 47–62, 2016. https://github.com/ylya/horndroid, Accessed: 05-May-2018.
- [5] Alessandra Gorla, Ilaria Tavecchia, Florian Gross, and Andreas Zeller. Checking app behavior against app descriptions. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pages 1025–1035. ACM, 2014.
- [6] Jim Gray, editor. The Benchmark Handbook for Database and Transaction Systems (2nd Edition). Morgan Kaufmann, 1993.
- [7] Google Inc. Shrink code and resources with ProGuard. https://developer.android.com/studio/build/shrink-code, 2017. Accessed: 17-Jan-2019.
- [8] C. Isen, L. John, Jung Pil Choi, and Hyo Jung Song. On the representativeness of embedded java benchmarks. In 2008 IEEE International Symposium on Workload Characterization, pages 153–162, 2008.
