Label Unification for Cross-Dataset Generalization in Cybersecurity NER

Maciej Jalocha; Johan Hausted Schmidt; William Michelseen

arXiv:2507.13870·cs.CL·September 3, 2025

Open Access

TL;DR

This paper unifies labels across four cybersecurity NER datasets to test cross-dataset generalization. Models trained on the unified data generalize poorly, and the proposed multihead and graph-based alternatives yield at most marginal gains.

Contribution

The paper performs a coarse-grained label unification across four cybersecurity NER datasets and proposes two alternative architectures, a multihead model and a graph-based transfer model, to address the limitations of unification.

Findings

01 Models trained on unified datasets generalize poorly across datasets.

02 The multihead model with weight sharing offers only marginal gains over unified training.

03 The graph-based transfer model built on BERT-base-NER shows no significant performance gains over BERT-base-NER itself.

Abstract

The field of cybersecurity NER lacks standardized labels, making it challenging to combine datasets. We investigate label unification across four cybersecurity datasets to increase data resource usability. We perform a coarse-grained label unification and conduct pairwise cross-dataset evaluations using BiLSTM models. Qualitative analysis of predictions reveals errors, limitations, and dataset differences. To address unification limitations, we propose alternative architectures including a multihead model and a graph-based transfer model. Results show that models trained on unified datasets generalize poorly across datasets. The multihead model with weight sharing provides only marginal improvements over unified training, while our graph-based transfer model built on BERT-base-NER shows no significant performance gains compared to BERT-base-NER.
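
As a rough illustration of what a coarse-grained label unification step might look like, the sketch below maps dataset-specific BIO tags onto a shared label set. The label names and the mapping are hypothetical placeholders chosen for illustration; they are not the scheme used in the paper, and the helper names `unify_tag` and `unify_sequence` are likewise assumptions.

```python
# Hypothetical mapping from dataset-specific entity types to coarse unified types.
# These names are illustrative assumptions, not the paper's actual label scheme.
HYPOTHETICAL_MAPPING = {
    "MAL": "MALWARE",
    "Malware": "MALWARE",
    "Org": "ORGANIZATION",
    "SecTeam": "ORGANIZATION",
}


def unify_tag(tag, mapping):
    """Map one BIO tag to the unified scheme; unmapped entity types become 'O'."""
    if tag == "O":
        return "O"
    # Split only at the first hyphen so types such as 'Threat-Actor' stay intact.
    prefix, _, label = tag.partition("-")
    unified = mapping.get(label)
    return f"{prefix}-{unified}" if unified else "O"


def unify_sequence(tags, mapping):
    """Apply the unification to a whole sentence of BIO tags."""
    return [unify_tag(tag, mapping) for tag in tags]


example = ["B-MAL", "I-MAL", "O", "B-Org", "B-Foo"]
unified_example = unify_sequence(example, HYPOTHETICAL_MAPPING)
```

Mapping unmapped types to `O` is one possible design choice; it keeps the unified label set closed, at the cost of discarding annotations that have no counterpart in the other datasets.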

Taxonomy

Topics: Machine Learning and Data Classification · Adversarial Robustness in Machine Learning · Imbalanced Data Classification Techniques