Applying Action Masking and Curriculum Learning Techniques to Improve Data Efficiency and Overall Performance in Operational Technology Cyber Security using Reinforcement Learning

Alec Wilson; William Holmes; Ryan Menzies; Kez Smithson Whitehead

arXiv:2409.10563·cs.CR·September 18, 2024

TL;DR

This paper demonstrates that combining curriculum learning and action masking significantly improves data efficiency and performance of reinforcement learning agents in operational technology cyber security scenarios, with faster training and higher rewards.

Contribution

It extends the IPMSRL environment to include more realistic dynamics and shows that curriculum learning and action masking enhance RL performance and data efficiency in cyber security tasks.

Findings

01

Curriculum learning increased episode rewards from -2.791 to -0.569.

02

Action masking increased episode rewards from -2.791 to -0.743.

03

Combined methods achieved a reward of 0.137 in less than 1 million steps.
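Action masking, as referenced in the findings above, is commonly implemented by removing invalid actions from the policy's distribution before sampling. The following is a minimal sketch of that general idea, not the authors' implementation: the function name `masked_softmax`, the example logits, and the example mask are all hypothetical, and the rule for deciding which actions are valid in IPMSRL is not given in this summary.

```python
import math

def masked_softmax(logits, mask):
    """Return action probabilities with invalid actions forced to zero.

    logits: raw policy scores, one per action (illustrative values only).
    mask:   booleans, True where the action is currently valid.
    """
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Replace the logits of invalid actions with -inf so exp() yields exactly 0.
    masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
    top = max(masked)  # subtract the largest valid logit for numerical stability
    exps = [math.exp(l - top) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: action 1 is invalid in the current state.
logits = [1.0, 2.0, 0.5, 2.0]
mask = [True, False, True, True]
probs = masked_softmax(logits, mask)
```

Because the masked action receives zero probability, the agent never spends environment steps exploring it, which is one plausible reason masking can improve data efficiency.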

Abstract

In previous work, the IPMSRL environment (Integrated Platform Management System Reinforcement Learning environment) was developed to train defensive RL agents in a simulator representing a subset of an IPMS on a maritime vessel under cyber-attack. This paper extends IPMSRL to enhance realism by adding the dynamics of false positive alerts and alert delay. In the most difficult environment tested, applying curriculum learning increased the episode reward mean from a baseline of -2.791 to -0.569, and applying action masking increased it from the same baseline of -2.791 to -0.743. Importantly, this level of performance was reached in less than 1 million timesteps, which was far more data-efficient than vanilla PPO, which reached a lower level of performance…
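Curriculum learning, as described in the abstract, trains the agent on easier settings before harder ones. The sketch below shows one simple way to schedule difficulty by training timestep. It is an illustration under stated assumptions: the stage boundaries, the `false_positive_rate` and `alert_delay` values, and the names `Stage`, `CURRICULUM`, and `stage_for` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    until_step: int             # stage applies while timestep < until_step
    false_positive_rate: float  # hypothetical environment setting
    alert_delay: int            # hypothetical environment setting, in steps

# Hypothetical curriculum: difficulty increases as training progresses.
CURRICULUM = [
    Stage(until_step=200_000, false_positive_rate=0.0, alert_delay=0),
    Stage(until_step=500_000, false_positive_rate=0.05, alert_delay=1),
    Stage(until_step=1_000_000, false_positive_rate=0.10, alert_delay=3),
]

def stage_for(timestep, curriculum=CURRICULUM):
    """Return the curriculum stage that applies at the given timestep."""
    for stage in curriculum:
        if timestep < stage.until_step:
            return stage
    # Past the last boundary, stay on the hardest stage.
    return curriculum[-1]
```

A training loop could call `stage_for(total_timesteps)` before each rollout and reconfigure the environment with the returned settings. Switching on a reward threshold rather than a step count is an equally common design choice.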

Taxonomy

Topics: Information and Cyber Security

Methods: Entropy Regularization · Proximal Policy Optimization