A Grammatical Inference Approach to Language-Based Anomaly Detection in XML
Harald Lampesberger

TL;DR
This paper presents a grammatical inference method to learn automata from XML documents, enabling syntax-based anomaly detection without needing explicit schemas, thus improving intrusion detection in XML-based systems.
Contribution
It introduces an XML schema-compatible lexical datatype system and an algorithm to learn visibly pushdown automata directly from examples, facilitating scalable stream validation.
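To make the idea of a lexical datatype system concrete, the following minimal sketch maps element text to a coarse datatype label, so that an automaton can be learned over a small alphabet of datatypes rather than over raw strings. The label set, the patterns, and the function name `lexical_datatype` are assumptions for illustration; they are simplified from XML Schema lexical forms and are not the datatype system defined in the paper.

```python
import re

# Hypothetical datatype table, ordered from most to least specific.
# Patterns are simplified versions of XML Schema lexical forms.
_DATATYPES = [
    ("boolean", re.compile(r"true|false|0|1")),
    ("integer", re.compile(r"[+-]?\d+")),
    ("decimal", re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")),
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}")),
]


def lexical_datatype(text):
    """Return the first datatype label whose pattern matches the whole text."""
    value = text.strip()
    for name, pattern in _DATATYPES:
        if pattern.fullmatch(value):
            return name
    # Anything unrecognized falls back to the most general type.
    return "string"
```

Under this sketch, an element that only ever contained `integer` content in the training examples would stand out if it later carried `string` content such as injected markup.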
Findings
The learned automaton can detect structural and datatype anomalies in XML documents.
The approach does not require tree representation, enabling processing of large documents or streams.
The learned automaton is intended to reduce false positives when detecting anomalies in XML security applications.
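The stream-processing finding can be illustrated with a small stack-based checker driven by SAX events, which never builds a document tree. This is a deliberately simplified stand-in, not the paper's inference algorithm: it only records which child elements were seen under which parent, whereas a visibly pushdown automaton also captures ordering and content. The names `learn`, `anomalies`, and `_Events` are hypothetical.

```python
import xml.sax


class _Events(xml.sax.ContentHandler):
    """Reports each (parent, child) pair, tracking nesting with a stack."""

    def __init__(self, on_open):
        super().__init__()
        self.stack = ["#root"]  # synthetic parent of the document element
        self.on_open = on_open

    def startElement(self, name, attrs):
        # An opening tag acts like a "call": observe it, then push.
        self.on_open(self.stack[-1], name)
        self.stack.append(name)

    def endElement(self, name):
        # A closing tag acts like a "return": pop the matching element.
        self.stack.pop()


def learn(examples):
    """Collect the (parent, child) pairs seen in the example documents."""
    allowed = set()
    for doc in examples:
        xml.sax.parseString(doc, _Events(lambda p, c: allowed.add((p, c))))
    return allowed


def anomalies(doc, allowed):
    """Stream a document and return the (parent, child) pairs never seen in training."""
    found = []

    def check(parent, child):
        if (parent, child) not in allowed:
            found.append((parent, child))

    xml.sax.parseString(doc, _Events(check))
    return found
```

A short usage example, assuming documents are passed as bytes:

```python
allowed = learn([b"<order><item>1</item><item>2</item></order>"])
print(anomalies(b"<order><item>3</item></order>", allowed))
print(anomalies(b"<order><item><order/></item></order>", allowed))
```

Memory use in this sketch grows with nesting depth rather than document size, which is the property that makes stream validation of large documents feasible.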
Abstract
False positives are a problem in anomaly-based intrusion detection systems. To counter this issue, we discuss anomaly detection for the eXtensible Markup Language (XML) in a language-theoretic view. We argue that many XML-based attacks target the syntactic level, i.e., the tree structure or element content, and that syntax validation of XML documents reduces the attack surface. XML offers so-called schemas for validation, but in the real world, schemas are often unavailable, ignored, or too general. In this work-in-progress paper, we describe a grammatical inference approach to learn an automaton from example XML documents for detecting documents with anomalous syntax. We discuss properties and expressiveness of XML to understand limits of learnability. Our contributions are an XML Schema compatible lexical datatype system to abstract content in XML and an algorithm to learn visibly pushdown…
