DexBERT: Effective, Task-Agnostic and Fine-grained Representation Learning of Android Bytecode
Tiezhu Sun (1), Kevin Allix (1), Kisub Kim (2), Xin Zhou (2), Dongsun Kim (3), David Lo (2), Tegawendé F. Bissyandé (1), Jacques Klein (1) ((1) University of Luxembourg, (2) Singapore Management University, (3) Kyungpook National University)

TL;DR
DexBERT is a novel BERT-like model designed for Android bytecode, providing effective, task-agnostic, and fine-grained representations to improve various software engineering tasks.
Contribution
The paper introduces DexBERT, a universal, task-agnostic language model for Android bytecode that captures fine-grained information at the class level.
Findings
DexBERT effectively models DEX bytecode.
It improves performance on class-level tasks.
Strategies for handling diverse app sizes are demonstrated.
Abstract
The automation of a large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). Central to applying ML to software artifacts (like source or executable code) is converting them into forms suitable for learning. Traditionally, researchers have relied on manually selected features based on expert knowledge, which is sometimes imprecise and generally incomplete. Representation learning has allowed ML to automatically choose suitable representations and relevant features. Yet, for Android-related tasks, existing models either operate at the whole-app level, like apk2vec, or target specific tasks, like smali2vec, which limits their applicability. Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both limitations. Such representations aim to capture…
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software System Performance and Reliability
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Softmax · WordPiece · Linear Warmup With Linear Decay · Layer Normalization
