Statically Detecting Vulnerabilities by Processing Programming Languages as Natural Languages

Ibéria Medeiros (1), Nuno Neves (1), Miguel Correia (2) ((1) LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal; (2) INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal)

arXiv:1910.06826·cs.CR·January 19, 2022

TL;DR

This paper introduces a novel AI-based static analysis approach that uses natural language processing techniques to automatically detect vulnerabilities in web application source code, demonstrated through the DEKANT tool on PHP and WordPress plugins.

Contribution

It presents a new method employing NLP sequence models for vulnerability detection, reducing the need for manual programming of detection rules.

Findings

1. Detected several hundred vulnerabilities, including 62 zero-day flaws.

2. Successfully applied to a large set of PHP applications and WordPress plugins.

3. Identified 12 classes of input validation vulnerabilities.

Abstract

Web applications continue to be a favorite target for hackers due to a combination of wide adoption and rapid deployment cycles, which often lead to the introduction of high-impact vulnerabilities. Static analysis tools are important for automatically searching for bugs in program source code and supporting developers in removing them. However, building these tools requires hand-coding the knowledge of how to discover the vulnerabilities. This paper presents an alternative approach in which tools learn to detect flaws automatically by resorting to artificial intelligence concepts, more concretely to natural language processing. The approach employs a sequence model to learn to characterize vulnerabilities based on an annotated corpus. Afterwards, the model is used to discover and identify vulnerabilities in the source code. It was implemented in the DEKANT tool and evaluated…
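
To make the idea of "learning to characterize vulnerabilities from an annotated corpus" more concrete, here is a minimal sketch. The abstract only says "a sequence model", so this example assumes a first-order hidden Markov model, one common choice in NLP tagging, trained by counting and decoded with the Viterbi algorithm. The token names (`input`, `var`, `sanitize`, `const`, `sink`), the two labels (`taint`, `safe`), the tiny corpus, and all function names are hypothetical illustrations; none of this is DEKANT's actual representation or code.

```python
from collections import Counter, defaultdict
import math

# Hypothetical annotated corpus: each sample is a list of (token, label) pairs.
# Tokens stand in for abstracted source-code elements; names are assumptions.
CORPUS = [
    [("input", "taint"), ("var", "taint"), ("sink", "taint")],
    [("input", "taint"), ("sanitize", "safe"), ("var", "safe"), ("sink", "safe")],
    [("const", "safe"), ("var", "safe"), ("sink", "safe")],
    [("input", "taint"), ("var", "taint"), ("var", "taint"), ("sink", "taint")],
]


def train(corpus, alpha=0.1):
    """Estimate HMM log-probabilities by counting, with add-alpha smoothing."""
    states = sorted({lab for seq in corpus for _, lab in seq})
    vocab = sorted({tok for seq in corpus for tok, _ in seq})
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in corpus:
        start[seq[0][1]] += 1
        for tok, lab in seq:
            emit[lab][tok] += 1
        for (_, a), (_, b) in zip(seq, seq[1:]):
            trans[a][b] += 1

    def logp(counter, key, n_outcomes):
        total = sum(counter.values())
        return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))

    symbols = vocab + ["<unk>"]  # one extra outcome reserved for unseen tokens
    return {
        "states": states,
        "start": {s: logp(start, s, len(states)) for s in states},
        "trans": {a: {b: logp(trans[a], b, len(states)) for b in states}
                  for a in states},
        "emit": {s: {t: logp(emit[s], t, len(symbols)) for t in symbols}
                 for s in states},
    }


def viterbi(model, tokens):
    """Return the most likely label sequence for a token sequence."""
    if not tokens:
        return []
    states = model["states"]

    def emission(state, token):
        row = model["emit"][state]
        return row.get(token, row["<unk>"])

    score = {s: model["start"][s] + emission(s, tokens[0]) for s in states}
    backpointers = []
    for tok in tokens[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + model["trans"][p][s])
            score[s] = prev[best] + model["trans"][best][s] + emission(s, tok)
            ptr[s] = best
        backpointers.append(ptr)

    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


def is_flagged(model, tokens):
    """Flag a sequence when its final token (assumed to be a sink) is tainted."""
    labels = viterbi(model, tokens)
    return bool(labels) and labels[-1] == "taint"


MODEL = train(CORPUS)
print(viterbi(MODEL, ["input", "var", "sink"]))
print(viterbi(MODEL, ["input", "sanitize", "var", "sink"]))
```

The point of the sketch is that nothing in `viterbi` or `is_flagged` encodes a detection rule by hand: the notion that a sanitization step makes later tokens safe comes entirely from the transition and emission counts in the annotated samples. A real system would need a much larger corpus and a principled mapping from source code to tokens.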
