# Toward a Hybrid Intrusion Detection Framework for IIoT Using a Large Language Model

**Authors:** Musaad Algarni, Mohamed Y. Dahab, Abdulaziz A. Alsulami, Badraddin Alturki, Raed Alsini

PMC · DOI: 10.3390/s26041231 · Sensors (Basel, Switzerland) · 2026-02-13

## TL;DR

This paper introduces a hybrid intrusion detection framework for IIoT that combines frozen BERT embeddings of text-rendered network flows with standardized numerical features, reporting 98.19% accuracy on Edge-IIoTset and 99.15% on ToN_IoT.

## Contribution

A leakage-safe hybrid intrusion detection framework for IIoT that combines text-based and numerical flow features, adds cosine-similarity scores to class prototypes computed in PCA space, and applies SMOTE to the training data only.

## Key findings

- The framework achieves 98.19% accuracy on the Edge-IIoTset dataset.
- It reaches 99.15% accuracy on the ToN_IoT dataset.
- Cosine-similarity scores to class prototypes computed in PCA space are added to the feature set to improve class separation, while SMOTE, applied to the training data only, handles class imbalance.
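
The prototype-similarity idea in the findings can be illustrated with a minimal sketch. It assumes the input vectors have already been projected into PCA space, takes each class prototype to be the mean of that class's training vectors, and uses toy two-dimensional data; the function names and the mean-based prototype are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict


def class_prototypes(vectors, labels):
    """Mean vector per class (assumed prototype definition), from training rows only."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def add_prototype_scores(vectors, prototypes):
    """Append one cosine score per class (in sorted label order) to each vector."""
    order = sorted(prototypes)
    return [list(vec) + [cosine(vec, prototypes[c]) for c in order] for vec in vectors]


# Toy vectors standing in for PCA-projected training features (hypothetical data).
train_vecs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["normal", "normal", "ddos", "ddos"]

# Prototypes come from the training split only, then are reused for unseen rows.
protos = class_prototypes(train_vecs, train_labels)
test_row = add_prototype_scores([[0.9, 0.1]], protos)[0]
```

Because the prototypes are fitted on the training split alone, scoring a test row this way does not let test labels influence the features.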

## Abstract

The widespread connectivity of the Industrial Internet of Things (IIoT) improves the efficiency and functionality of connected devices. However, it also raises serious concerns about cybersecurity threats. Implementing an effective intrusion detection system (IDS) for IIoT is challenging due to heterogeneous data, high feature dimensionality, class imbalance, and the risk of data leakage during evaluation. This paper presents a leakage-safe hybrid intrusion detection framework that combines text-based and numerical network flow features in an IIoT environment. Each network flow is converted into a short text description and encoded using a frozen Large Language Model (LLM) called the Bidirectional Encoder Representations from Transformers (BERT) model to obtain fixed semantic embeddings, while numerical traffic features are standardized in parallel. To improve class separation, class prototypes are computed in Principal Component Analysis (PCA) space, and cosine similarity scores for these prototypes are added to the feature set. Class imbalance is handled only in the training data using the Synthetic Minority Over-sampling Technique (SMOTE). A Random Forest (RF) is used to select the top features, followed by a Histogram-based Gradient Boosting (HGB) classifier for final prediction. The proposed framework is evaluated on the Edge-IIoTset and ToN_IoT datasets and achieves promising results. Empirically, the framework attains 98.19% accuracy on Edge-IIoTset and 99.15% accuracy on ToN_IoT, indicating robust, leakage-safe performance.
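
As a rough illustration of the two parallel input paths described in the abstract, the sketch below renders a flow record as a short sentence and standardizes its numerical fields using statistics fitted on the training split only. The field names, the sentence template, and the helper functions are hypothetical; the paper's actual template and feature set are not given in this summary, and the BERT encoding step is omitted.

```python
from statistics import mean, pstdev

NUMERIC_FIELDS = ("packets", "bytes")  # hypothetical numeric features


def flow_to_text(flow):
    """Render a flow record as a short sentence (hypothetical template)."""
    return (f"{flow['proto']} flow to port {flow['dst_port']} "
            f"with {flow['packets']} packets and {flow['bytes']} bytes")


def numeric_row(flow):
    """Extract the numeric fields of a flow as floats."""
    return [float(flow[k]) for k in NUMERIC_FIELDS]


def fit_scaler(rows):
    """Per-column (mean, std) from training rows; a zero std falls back to 1.0."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]


def transform(rows, params):
    """Standardize rows with previously fitted (mean, std) pairs."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]


# Toy flows (hypothetical data), already split into train and test.
train_flows = [
    {"proto": "TCP", "dst_port": 502, "packets": 10, "bytes": 1200},
    {"proto": "UDP", "dst_port": 53, "packets": 2, "bytes": 160},
    {"proto": "TCP", "dst_port": 1883, "packets": 6, "bytes": 800},
]
test_flows = [{"proto": "TCP", "dst_port": 502, "packets": 50, "bytes": 9000}]

# Text path: these sentences would be passed to a frozen encoder.
train_texts = [flow_to_text(f) for f in train_flows]

# Numeric path: fit on train only, then apply the same parameters to test.
params = fit_scaler([numeric_row(f) for f in train_flows])
train_scaled = transform([numeric_row(f) for f in train_flows], params)
test_scaled = transform([numeric_row(f) for f in test_flows], params)
```

Fitting the scaler before the split, or on the combined data, would leak test statistics into training; keeping every fitted step on the training side is the sense of "leakage-safe" illustrated here.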

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12944543/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12944543/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/PMC12944543/full.md

---
Source: https://tomesphere.com/paper/PMC12944543