Where Do Flow Semantics Reside? A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification
Sizhe Huang, Zitong Li, Shujie Yang

TL;DR
This paper introduces FlowSem-MAE, a protocol-native tabular pretraining paradigm for encrypted traffic classification that preserves protocol semantics and significantly improves accuracy with limited labeled data.
Contribution
It proposes a novel paradigm that treats protocol-defined fields as architectural priors, reformulating traffic classification as a tabular learning task rather than sequence modeling.
Findings
FlowSem-MAE outperforms state-of-the-art methods across datasets.
With only half of the labeled data, it surpasses most existing methods trained on the full data.
Addressing inductive bias mismatch improves classification accuracy.
Abstract
Self-supervised masked modeling shows promise for encrypted traffic classification by masking and reconstructing raw bytes. Yet recent work reveals that these methods fail to reduce reliance on labeled data despite costly pretraining: under frozen-encoder evaluation, accuracy drops from greater than 0.9 to less than 0.47. We argue the root cause is an inductive bias mismatch: flattening traffic into byte sequences destroys protocol-defined semantics. We identify three specific issues: 1) field unpredictability: random fields like ip.id are unlearnable yet treated as reconstruction targets; 2) embedding confusion: semantically distinct fields collapse into a unified embedding space; 3) metadata loss: capture-time metadata essential for temporal analysis is discarded. To address this, we propose a protocol-native paradigm that treats protocol-defined field semantics as architectural priors,…
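To make the three issues concrete, here is a minimal sketch of what a tabular, field-aware masking step could look like. It is not the paper's implementation: the `SCHEMA` contents, the `maskable` flag, the `mask_flow` helper, and the masking ratio are all assumptions chosen for illustration. Only `ip.id` is named in the abstract; the other column names are Wireshark-style field names used as examples.

```python
import random

# Illustrative column schema (an assumption, not the paper's field list).
# Every protocol field is its own column, so values of different fields
# are never merged into one shared byte vocabulary.
SCHEMA = {
    "ip.ttl": {"maskable": True},
    "ip.len": {"maskable": True},
    "tcp.flags": {"maskable": True},
    "ip.id": {"maskable": False},            # random per packet: never a target
    "frame.time_delta": {"maskable": True},  # capture-time metadata kept as a column
}

MASK = None  # placeholder value for a masked cell


def mask_flow(rows, mask_ratio=0.5, seed=0):
    """Mask cells of a flow table (one dict per packet).

    Returns (masked_rows, targets), where targets maps
    (packet_index, field_name) to the original value to reconstruct.
    Fields flagged as not maskable are left visible and never become targets.
    """
    rng = random.Random(seed)
    maskable = [name for name, spec in SCHEMA.items() if spec["maskable"]]
    n_mask = max(1, round(mask_ratio * len(maskable)))
    masked_rows, targets = [], {}
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy so the input flow is not modified
        for name in rng.sample(maskable, n_mask):
            targets[(i, name)] = row[name]
            new_row[name] = MASK
        masked_rows.append(new_row)
    return masked_rows, targets


# Hypothetical two-packet flow with made-up values.
flow = [
    {"ip.ttl": 64, "ip.len": 60, "tcp.flags": 0x02, "ip.id": 54321, "frame.time_delta": 0.0},
    {"ip.ttl": 64, "ip.len": 52, "tcp.flags": 0x10, "ip.id": 1337, "frame.time_delta": 0.012},
]
masked, targets = mask_flow(flow)
```

In this sketch the unpredictable field stays visible as input but is excluded from the reconstruction loss, each column could later receive its own embedding table, and timing metadata survives as an ordinary column rather than being dropped with the raw bytes.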
