Beyond the TESSERACT: Trustworthy Dataset Curation for Sound Evaluations of Android Malware Classifiers

Theo Chow, Mario D'Onghia, Lorenz Linhardt, Zeliang Kan, Daniel Arp, Lorenzo Cavallaro, and Fabio Pierazzi

arXiv:2506.23814 · cs.CR · March 24, 2026

Open Access

TL;DR

This paper emphasizes the importance of trustworthy dataset curation for evaluating Android malware classifiers, identifying overlooked factors affecting evaluation reliability and proposing a methodology to improve dataset quality.

Contribution

It identifies five novel factors that drive evaluation discrepancies and offers a methodology for curating trustworthy datasets for Android malware detection.

Findings

01. Performance discrepancies persist despite realistic data sources.

02. Five overlooked factors significantly impact evaluation outcomes.

03. Recommendations improve the reliability of malware classifier assessments.

Abstract

The reliability of machine learning critically depends on dataset quality. While machine learning applied to computer vision and natural language processing benefits from high-quality benchmark datasets, cyber security often falls behind, because quality depends on the ability to access hard-to-obtain realistic data that may evolve over time. Android is, however, uniquely positioned in this ecosystem thanks to AndroZoo and other sources, which provide large-scale, continuously updated, and timestamped repositories of benign and malicious apps. Since their release, such data sources have provided access to populations of Android apps that researchers can sample from to evaluate learning-based methods in realistic settings, i.e., over temporal frames to account for app evolution (natural distribution shift) and with test datasets that reflect in-the-wild class ratios. Surprisingly, we observe that…
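
The two constraints named in the abstract (evaluating over temporal frames, and using test sets that reflect in-the-wild class ratios) can be illustrated with a minimal sketch. This is not the authors' code or curation methodology. The `App` record, the functions `temporal_split` and `enforce_malware_ratio`, and their parameters are hypothetical names chosen for illustration, and downsampling is only one assumed way to reach a target ratio.

```python
from dataclasses import dataclass
from datetime import date
import random


@dataclass(frozen=True)
class App:
    """Hypothetical app record; field names are assumptions, not a real schema."""
    sha256: str
    timestamp: date
    is_malware: bool


def temporal_split(apps, split_date):
    """Put apps dated before split_date in train and all later apps in test.

    Training only on older apps keeps future knowledge out of the training set.
    """
    train = [a for a in apps if a.timestamp < split_date]
    test = [a for a in apps if a.timestamp >= split_date]
    return train, test


def enforce_malware_ratio(apps, ratio, seed=0):
    """Downsample one class so malware makes up roughly `ratio` of the result.

    `ratio` is the assumed in-the-wild malware fraction, strictly between 0 and 1.
    """
    if not 0 < ratio < 1:
        raise ValueError("ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    malware = [a for a in apps if a.is_malware]
    goodware = [a for a in apps if not a.is_malware]

    # Largest malware count compatible with keeping every goodware sample:
    # m / (m + g) = ratio  =>  m = ratio * g / (1 - ratio)
    max_malware = int(ratio * len(goodware) / (1 - ratio))
    if len(malware) > max_malware:
        # Too much malware: keep all goodware and drop surplus malware.
        malware = rng.sample(malware, max_malware)
    else:
        # Too little malware: keep all malware and drop surplus goodware.
        max_goodware = int(len(malware) * (1 - ratio) / ratio)
        goodware = rng.sample(goodware, min(max_goodware, len(goodware)))
    return malware + goodware
```

A typical use would call `temporal_split` once on the sampled population and then apply `enforce_malware_ratio` to the test portion, so that reported metrics are computed under the assumed deployment-time class balance.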

Taxonomy

Topics: Advanced Malware Detection Techniques · Data Stream Mining Techniques · Imbalanced Data Classification Techniques