Bridging the PLC Binary Analysis Gap: A Cross-Compiler Dataset and Neural Framework for Industrial Control Systems
Yonatan Gizachew Achamyeleh, Shih-Yuan Yu, Gustavo Quirós Araya, Mohammad Abdullah Al Faruque

TL;DR
This paper introduces PLC-BEAD, a dataset of PLC binaries paired with source code and functionality labels, and presents PLCEmbed, a transformer-based model that reaches 93% accuracy in compiler provenance identification and 42% accuracy in fine-grained functionality classification for industrial control systems.
Contribution
The paper provides the first comprehensive dataset pairing PLC binaries with source code and labels, along with a transformer-based framework for binary analysis in industrial control systems.
Findings
PLCEmbed achieves 93% accuracy in compiler provenance identification.
PLCEmbed achieves 42% accuracy in functionality classification.
The dataset enables reproducible research in PLC security and reverse engineering.
Abstract
Industrial Control Systems (ICS) rely heavily on Programmable Logic Controllers (PLCs) to manage critical infrastructure, yet analyzing PLC executables remains challenging due to diverse proprietary compilers and limited access to source code. To bridge this gap, we introduce PLC-BEAD, a comprehensive dataset containing 2431 compiled binaries from 700+ PLC programs across four major industrial compilers (CoDeSys, GEB, OpenPLC-V2, OpenPLC-V3). This novel dataset uniquely pairs each binary with its original Structured Text source code and standardized functionality labels, enabling both binary-level and source-level analysis. We demonstrate the dataset's utility through PLCEmbed, a transformer-based framework for binary code analysis that achieves 93% accuracy in compiler provenance identification and 42% accuracy in fine-grained functionality classification across 22 industrial control…
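As a rough illustration of the pairing the abstract describes, the sketch below models one dataset entry (binary, Structured Text source, compiler, functionality label) and computes plain classification accuracy. The class name `PLCSample`, its field names, the file paths, and the label string are hypothetical and are not taken from the PLC-BEAD release; only the four compiler names come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass

# Compiler names as listed in the abstract.
COMPILERS = ("CoDeSys", "GEB", "OpenPLC-V2", "OpenPLC-V3")


@dataclass(frozen=True)
class PLCSample:
    """One hypothetical dataset entry: a binary paired with its source and labels."""
    binary_path: str      # assumed field: path to the compiled binary
    source_path: str      # assumed field: path to the Structured Text source
    compiler: str         # one of COMPILERS
    functionality: str    # assumed field: standardized functionality label

    def __post_init__(self):
        if self.compiler not in COMPILERS:
            raise ValueError(f"unknown compiler: {self.compiler!r}")


def samples_per_compiler(samples):
    """Count how many samples each compiler contributed."""
    return Counter(s.compiler for s in samples)


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Hypothetical paths and label, for illustration only.
    samples = [
        PLCSample("bin/a.codesys", "src/a.st", "CoDeSys", "motor_control"),
        PLCSample("bin/a.geb", "src/a.st", "GEB", "motor_control"),
    ]
    print(samples_per_compiler(samples))
    true_compilers = [s.compiler for s in samples]
    print(accuracy(["CoDeSys", "CoDeSys"], true_compilers))
```

Because the same source program can be compiled by several compilers, two entries may share a `source_path` while differing in `binary_path` and `compiler`, which is what makes a provenance task possible under this assumed layout.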
Taxonomy
Topics: Software System Performance and Reliability · Software Testing and Debugging Techniques · Scientific Computing and Data Management
