An Empirical Study of Information Flows in Real-World JavaScript
Cristian-Alexandru Staicu, Daniel Schoepe, Musard Balliu, Michael Pradel, Andrei Sabelfeld

TL;DR
This empirical study evaluates the prevalence and importance of different information flow types in real-world JavaScript programs, highlighting that lightweight taint analysis suffices for most security issues, while implicit flows are costly and often unnecessary.
Contribution
The paper provides an empirical analysis of information flows in JavaScript, demonstrating that implicit flows are costly and generally not critical for detecting security problems, guiding analysis design choices.
Findings
Implicit flows are expensive to track and often unnecessary.
Lightweight taint analysis suffices for most security issues.
Tracking hidden implicit flows does not reveal additional security problems.
Abstract
Information flow analysis prevents secret or untrusted data from flowing into public or trusted sinks. Existing mechanisms cover a wide array of options, ranging from lightweight taint analysis to heavyweight information flow control that also considers implicit flows. Dynamic analysis, which is particularly popular for languages such as JavaScript, faces the question whether to invest in analyzing flows caused by not executing a particular branch, so-called hidden implicit flows. This paper addresses the questions how common different kinds of flows are in real-world programs, how important these flows are to enforce security policies, and how costly it is to consider these flows. We address these questions in an empirical study that analyzes 56 real-world JavaScript programs that suffer from various security problems, such as code injection vulnerabilities, denial of service…
| ID | Library | Policy | LoC | SBC | Upgs |
|---|---|---|---|---|---|
| 1 | fish | module eval and exec | 69 | 1 | 0 |
| 2 | growl | module eval and exec | 270 | 1 | 0 |
| 3 | gm | module eval and exec | 1,614 | 1 | 0 |
| 4 | libnotify | module eval and exec | 54 | 1 | 0 |
| 5 | mixin-pro | module eval and exec | 168 | 1 | 0 |
| 6 | modulify | module eval and exec | 2,410 | 1 | 0 |
| 7 | mol-proto | module eval and exec | 1,696 | 1 | 0 |
| 8 | mongoosify | module eval and exec | 160 | 0 | 1 |
| 9 | m-log | module eval and exec | 243 | 1 | 0 |
| 10 | mobile-icon-resizer | file system API eval and exec | 410 | 1 | 0 |
| 11 | mongo-parse | module eval and exec | 506 | 1 | 0 |
| 12 | mongoosemask | module eval and exec | 12,750 | 0.78 | 28 |
| 13 | mongui | HTTP API eval and exec | 1,539 | 0.44 | 0 |
| 14 | mongo-edit | HTTP API eval and exec | 577 | 0 | 0 |
| 15 | mock2easy | HTTP API eval and exec | 1,217 | 0.07 | 3 |
| 16 | chook-growl-reporter | module eval and exec | 243 | 1 | 0 |
| 17 | git2json | module eval and exec | 434 | 1 | 0 |
| 18 | kerb_request | module eval and exec | 67 | 1 | 0 |
| 19 | printer | module eval and exec | 139 | 1 | 0 |
| 20 | debug | module regex matching | 360 | 1 | 0 |
| 21 | mime | module regex matching | 108 | 1 | 0 |
| 22 | tough-cookie | module regex matching | 1,145 | 1 | 0 |
| 23 | fresh | module regex matching | 59 | 0.5 | 0 |
| 24 | forwarded | module regex matching | 30 | 0 | 0 |
| 25 | underscore.string | module regex matching | 1,779 | 1 | 0 |
| 26 | ua-parser-js | module regex matching | 584 | 0.50 | 6 |
| 27 | parsejson | module regex matching | 46 | 1 | 0 |
| 28 | useragent | module regex matching | 6,827 | 1 | 0 |
| 29 | no-case | module regex matching | 33 | 1 | 0 |
| 30 | content-type-parser | module regex matching | 221 | 1 | 0 |
| 31 | timespan | module regex matching | 577 | 0.20 | 4 |
| 32 | string | module regex matching | 2,001 | 1 | 0 |
| 33 | content | module regex matching | 125 | 0.42 | 0 |
| 34 | slug | module regex matching | 375 | 0.5 | 2 |
| 35 | htmlparser | module regex matching | 2,155 | 0.65 | 5 |
| 36 | charset | module regex matching | 49 | 0.5 | 0 |
| 37 | mobile-detect | module regex matching | 612 | 1 | 0 |
| 38 | ismobilejs | module regex matching | 935 | 0.33 | 1 |
| 39 | dns-sync | module regex matching | 76 | 1 | 0 |
| 40 | ip | buffer reading module | 325 | 0.76 | 0 |
| 41 | concat-stream | buffer reading module | 132 | 1 | 0 |
| 42 | bl | buffer reading module | 206 | 0.72 | 4 |
| 43 | request | buffer reading HTTP | 2,217 | 0.52 | 0 |
| 44 | ws | buffer reading HTTP API | 2,449 | 0.07 | 1 |
| 45 | floody | buffer reading HTTP API | 94 | 0.8 | 0 |
| 46 | tunnel-agent | buffer reading HTTP API | 225 | 1 | 0 |
| 47 | History sniffing (Jang et al., 2010a) | HTMLElement.color img.src | 42 | 0 | 3 |
| 48 | Font fingerpr. (Acar et al., 2013) | HTMLElement.offsetWidth img.src | 145 | 0.5 | 1 |
| 49 | Font fingerpr. (https://www.privacytool.org/AnonymityChecker/) | HTMLElement.offsetWidth img.src | 44 | 0.02 | 3 |
| 50 | Font fingerpr. (http://www.lalit.org/lab/javascript-css-font-detect/) | HTMLElement.offsetWidth img.src | 134 | 1 | 0 |
| 51 | Browser ext. fingerpr. (Sjösten et al., 2017) | HTMLElement.offsetWidth request.open | 1,451 | 1 | 1 |
| 52 | DoNotTrack leakage (https://browserleaks.com/js/donottrack.js) | navigator_doNotTrack HTMLElement.html | 20 | 0 | 1 |
| 53 | Login state leakage (https://robinlinus.github.io/socialmedia-leak/) | onload event document.innerHTML | 191 | 1 | 0 |
| 54 | Engine fingerpr. (https://www.privacytool.org/AnonymityChecker/) | HTMLElement.type console.log | 129 | 0 | 1 |
| 55 | Browser ext. fingerpr. (https://popmyads.com/) | onload event HTMLElement.innerHTML | 37 | 0 | 0 |
| 56 | Resource fingerpr. (https://browserleaks.com/firefox#more) | onload event console.log | 43 | 0 | 0 |
| Strategy | Sec. condition | Tracked flows: Expl. | Tracked flows: Obs. | Tracked flows: Hid. | Permissiveness |
|---|---|---|---|---|---|
| Taint tracking | Explicit secrecy | ✓ | | | Stop when H-labeled value reaches sink. |
| Observable tracking | Observable secrecy | ✓ | ✓ | | Stop when H-labeled value reaches sink. |
| No Sensitive Upgrade | Non-interference | ✓ | ✓ | ✓ | Stop when L-labeled variable is written in sensitive context. |
| Permissive Upgrade | Non-interference | ✓ | ✓ | ✓ | Stop when partially leaked value is used. |
| | Explicit Secrecy (Min) | Explicit Secrecy (Avg) | Explicit Secrecy (Max) | Observable Secrecy (Min) | Observable Secrecy (Avg) | Observable Secrecy (Max) | Non-Interference (Min) | Non-Interference (Avg) | Non-Interference (Max) |
|---|---|---|---|---|---|---|---|---|---|
| Command injection | 10 | 59,339 | 1,118,862 | 10 | 59,383 | 1,118,910 | 10 | 59,540 | 1,118,941 |
| ReDoS vuln. | 3 | 210 | 2,064 | 3 | 540 | 6,152 | 3 | 633 | 7,073 |
| Buffer vuln. | 98 | 5,740 | 24,690 | 98 | 6,007 | 24,748 | 98 | 6,084 | 24,843 |
| Client-side progr. | 4 | 5,919 | 40,364 | 14 | 19,555 | 134,765 | 16 | 20,890 | 136,502 |
| Work | Analysis | Explicit | Obs. | Hidden |
|---|---|---|---|---|
| Vogt et al. (Vogt et al., 2007) | dynamic | ✓ | ✓ | - |
| Jang et al. (Jang et al., 2010a) | hybrid | ✓ | ✓ | - |
| Chugh et al. (Chugh et al., 2009) | hybrid | ✓ | ✓ | ✓ |
| Tripp et al. (Tripp et al., 2014) | hybrid | ✓ | - | - |
| Chudnov & Naumann (Chudnov and Naumann, 2015) | dynamic | ✓ | ✓ | NSU |
| Hedin et al. (Hedin et al., 2014) | dynamic | ✓ | ✓ | NSU |
| Bichhawat et al. (Bichhawat et al., 2017) | dynamic | ✓ | ✓ | PU |
| Kerschbaumer et al. (Kerschbaumer et al., 2013) | dynamic | ✓ | ✓ | - |
| Bauer et al. (Bauer et al., 2015) | dynamic | ✓ | - | - |
| De Groef et al. (De Groef et al., 2012) | dynamic | MOD | MOD | MOD |
| Austin & Flanagan (Austin and Flanagan, 2012) | dynamic | MOD | MOD | MOD |
An Empirical Study of Information Flows in Real-World JavaScript
Cristian-Alexandru Staicu (TU Darmstadt), Daniel Schoepe (Chalmers University of Technology), Musard Balliu (KTH Royal Institute of Technology), Michael Pradel (TU Darmstadt), and Andrei Sabelfeld (Chalmers University of Technology)
Abstract.
Information flow analysis prevents secret or untrusted data from flowing into public or trusted sinks. Existing mechanisms cover a wide array of options, ranging from lightweight taint analysis to heavyweight information flow control that also considers implicit flows. Dynamic analysis, which is particularly popular for languages such as JavaScript, faces the question whether to invest in analyzing flows caused by not executing a particular branch, so-called hidden implicit flows. This paper addresses the questions how common different kinds of flows are in real-world programs, how important these flows are to enforce security policies, and how costly it is to consider these flows. We address these questions in an empirical study that analyzes 56 real-world JavaScript programs that suffer from various security problems, such as code injection vulnerabilities, denial of service vulnerabilities, memory leaks, and privacy leaks. The study is based on a state-of-the-art dynamic information flow analysis and a formalization of its core. We find that implicit flows are expensive to track in terms of permissiveness, label creep, and runtime overhead. We find a lightweight taint analysis to be sufficient for most of the studied security problems, while for some privacy-related code, observable tracking is sometimes required. In contrast, we do not find any evidence that tracking hidden implicit flows reveals otherwise missed security problems. Our results help security analysts and analysis designers to understand the cost-benefit tradeoffs of information flow analysis and provide empirical evidence that analyzing implicit flows in a cost-effective way is a relevant problem.
1. Introduction
JavaScript is at the heart of the modern web, empowering rich client-side applications and, more recently, also server-side applications. While some language features, such as dynamism and flexibility, explain this popularity, the lack of other features, such as language-level protection and isolation mechanisms, opens up a wide range of integrity, availability, and confidentiality vulnerabilities (Johns, 2008). As a result, securing JavaScript applications has become a key challenge for web application security. Unfortunately, existing browser-level mechanisms, such as the same-origin policy or the content security policy, are coarse-grained and fail to distinguish between secure and insecure manipulation of data by scripts. Furthermore, server-side applications lack such isolation mechanisms completely, allowing an attacker, e.g., to inject and execute arbitrary code that interacts with the operating system through powerful APIs (Staicu et al., 2018).
An appealing approach to securing JavaScript applications is information flow analysis. This approach tracks the flow of information from sources to sinks in order to enforce application-level security policies. It can ensure both integrity, by preventing information from untrusted sources from reaching trusted sinks, and confidentiality, by preventing information from secret sources from reaching public sinks. For example, information flow analysis can check that no attacker-controlled data is evaluated as executable code or that secret user data is not sent to the network. Because the dynamic nature of JavaScript hinders precise static analysis, dynamic information flow analysis has received significant attention from researchers (Vogt et al., 2007; Jang et al., 2010a; Chudnov and Naumann, 2015; Hedin et al., 2014; Bichhawat et al., 2017; Bauer et al., 2015; De Groef et al., 2012; Austin and Flanagan, 2012). The basic idea of dynamic information flow analysis is to attach security labels, e.g., secret (untrusted) and public (trusted), to runtime values and to propagate these labels during program execution. To simplify the presentation, we assume two security labels, and we say that a value is sensitive if its label is secret or untrusted; otherwise, we say that a value is insensitive.
At the language level, a program may propagate information via two kinds of information flows. (There are other kinds of flows, such as timing and cache side-channels, which we ignore here.) Explicit flows (Denning and Denning, 1977) occur whenever sensitive information is passed by an assignment statement or into a sink. Implicit flows (Denning and Denning, 1977) arise via control-flow structures of programs, e.g., conditionals and loops, when the flow of control depends on a sensitive value. For a dynamic information flow analysis, implicit flows can be further classified into flows that happen because a particular branch is executed, so-called observable implicit flows (Balliu et al., 2017), and flows that happen because a particular branch is not executed, so-called hidden implicit flows (Balliu et al., 2017).
Figure 1 illustrates the different kinds of flows with a simple JavaScript-like program that leaks sensitive information. The program has a variable passwd, which is initially marked as a sensitive source. Using this variable in an operation that creates a new value is an explicit flow. Consider the case where the password is “topSecret”, i.e., the conditional that checks the password evaluates to true, and the branch sets gotIt to true. The gotIt variable is then sent to the network through the function sink(), which is considered to be an insensitive sink. The flow from the password to gotIt is an observable implicit flow because a sensitive value determines that gotIt gets written. Now, consider the case where passwd is “abc”. The branch of the conditional is not taken and the gotIt variable remains false. Sending this information to the network reveals that the password is different from “topSecret”. This flow is a hidden implicit flow because a sensitive value determines that gotIt does not get written.
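A program of the shape described above can be sketched as follows. This is a hypothetical reconstruction for illustration, not the original figure: the names run, paddedPasswd, and sent are assumptions, and sink() is a stand-in that records values instead of sending them to the network.

```javascript
// Hypothetical sketch of a program with all three kinds of flows.
function run(passwd) {              // passwd plays the role of the sensitive source
  const sent = [];
  const sink = (value) => sent.push(value); // stand-in for an insensitive network sink

  // Explicit flow: a new value is computed directly from the sensitive input.
  const paddedPasswd = "xx" + passwd;

  let gotIt = false;
  if (paddedPasswd === "xxtopSecret") {
    // Observable implicit flow: gotIt is written because the branch executes.
    gotIt = true;
  }
  // Hidden implicit flow: if the branch is skipped, gotIt stays false,
  // which still reveals that the password is not "topSecret".
  sink(gotIt);
  return sent;
}

console.log(run("topSecret"), run("abc"));
```

In both executions the value that reaches sink() depends on passwd, although no password data is ever copied into gotIt.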
Ideally, an information flow analysis should consider all three kinds of flows. In fact, there exists a large body of work on static, dynamic, hybrid, and multi-execution techniques to prevent explicit and implicit flows. However, so far these tools have seen little use in practice, despite the strong security guarantees that they provide. In contrast, a lightweight form of information flow analysis called taint analysis is widely used in computer security (Schwartz et al., 2010). Taint analysis is a pure data dependency analysis that only tracks explicit flows, ignoring any control flow dependencies.
The question which kinds of flows to consider is a tradeoff between costs and benefits. On the cost side, considering more flows increases false positives (King et al., 2008). A false positive here means that a secure execution is conservatively blocked by an overly restrictive enforcement mechanism. A common reason is that a value gets labeled as sensitive even though it does not actually contain information that is security-relevant in practice. This problem, sometimes referred to as label creep (Denning, 1982; Sabelfeld and Myers, 2003), reduces the permissiveness of information flow monitoring, because the monitor will prematurely stop a program to prevent a value with an overly sensitive label from reaching a sink. Another cost of considering more kinds of flows is an increase in runtime overhead. On the benefit side, considering more flows increases the ability to find security vulnerabilities and data leakages, i.e., the level of trust one obtains from the analysis. For example, an analysis that considers only explicit flows will miss any leakage of sensitive data that involves an implicit flow. Unfortunately, despite the large volume of research on information flow analysis, there is very little empirical evidence on the importance of the different kinds of flows in real applications. Because of this lack of knowledge, potential users of information flow analyses cannot make an informed decision about what kind of analysis to use.
To better understand the tradeoff between costs and benefits of using a dynamic information flow analysis, this paper presents an empirical study of information flows in real-world JavaScript code. Our overall goal is to better understand the costs and benefits of dynamically analyzing explicit, observable implicit, and hidden implicit flows. Specifically, we are interested in how prevalent different kinds of flows are, what kinds of security problems can(not) be detected when considering subsets of flows, and what costs considering all flows imposes. To address these questions, we study 56 real-world JavaScript programs in various application domains with a diverse set of security policies. The study considers integrity problems, specifically code injection vulnerabilities and denial of service vulnerabilities caused by an algorithmic complexity problem, and confidentiality problems, specifically leakages of uninitialized memory, browser fingerprinting and history sniffing. Each studied program has at least one real-world security problem that information flow analysis can detect.
Our study is enabled by a novel methodology that combines state-of-the-art dynamic information flow analysis (Hedin and Sabelfeld, 2012; Hedin et al., 2014; Austin and Flanagan, 2010) and program rewriting (Birgisson et al., 2012) with a set of novel security metrics. We implement the methodology in a dynamic information flow analysis built on top of Jalangi (Sen et al., 2013). The implementation draws on a sound analysis for a simple core of JavaScript. The formalization relates the security metrics to semantic security conditions for taint tracking (Schoepe et al., 2016), observable tracking (Balliu et al., 2017) and information flow monitoring (Goguen and Meseguer, 1982).
The findings of our study include:
1. All three kinds of flows occur locally in real-life applications, i.e., an analysis that ignores some of them risks missing violations of the information flow policy. Explicit flows are by far the most prevalent, and only five benchmarks contain hidden implicit flows (Section 4.1).
2. An analysis that considers explicit and observable implicit flows, but ignores hidden implicit flows, detects all vulnerabilities in our benchmarks. For most applications it is even sufficient to track explicit flows only, while for some client-side, privacy-related applications one must also consider observable implicit flows (Section 4.2).
3. Tracking hidden implicit flows causes an analysis to prematurely terminate various executions. Furthermore, we find that different monitoring strategies proposed in the literature vary significantly in their permissiveness (Section 4.3).
4. The amount of data labeled as sensitive steadily increases during the execution of most benchmarks, confirming the label creep problem. An analysis that considers implicit flows increases the label creep by over 40% compared to an analysis that considers only explicit flows (Section 4.4).
5. The analysis overhead caused by considering implicit flows is significant: ignoring implicit flows reduces the effort of tracking runtime operations by a factor of 2.5 (Section 4.5).
Prior work (discussed in Section 5) studies false positives caused by static analysis of implicit flows (King et al., 2008; Russo et al., 2009) and the semantic strength of flows (Masri and Podgurski, 2009). Jang et al. (Jang et al., 2010b) conduct a large-scale empirical study showing that several popular web sites use information flows to exfiltrate data about users’ behavior. Kang et al. (Kang et al., 2011) combine dynamic taint analysis with targeted implicit flow analysis, demonstrating the importance of tracking implicit flows for trusted programs. However, to the best of our knowledge, no existing work addresses the above questions.
In summary, this paper contributes the following:
- We are the first to empirically study the prevalence of explicit, observable implicit, and hidden implicit flows in real-world applications against integrity, availability, and confidentiality policies.
- We present a methodology and its implementation, which enables the study, and we provide a formal basis for empirically studying information flows (Section 3).
- We show the soundness of the analysis for a core of JavaScript with respect to semantic security conditions (Appendix).
- Through realistic case studies and security policies, we provide empirical evidence that sheds light on the cost-benefit tradeoff of information flow analysis and that outlines directions for future work (Section 4).
We share our implementation, as well as all benchmarks and policies used for the study, to support future evaluations of information flow tools for JavaScript (https://new-iflow.herokuapp.com/download-iflow.html).
2. Benchmarks and Security Policies
Our study is based on 56 client-side and server-side JavaScript applications, which suffer from four classes of vulnerabilities. These applications are subject to attacks that have been independently discovered by existing work, including integrity, availability, and confidentiality attacks. For every application, we define realistic security policies expressed as information flow policies. Table 1 shows the applications, along with their security policies, and size measured in lines of code. The benchmarks vary in size from tens of lines of code to tens of thousands. We further explain the policies below. For each application we either create or reuse a set of inputs that trigger the attack and other inputs to increase the coverage of different behaviors.
Our goal is an in-depth study of the different kinds of information flows for a range of security policies; we do not claim to study a representative sample of JavaScript applications. Existing in-breadth empirical studies, which analyze hundreds of thousands of web pages against fixed policies, provide clear evidence for security and privacy risks in JavaScript code (Jang et al., 2010a; Lekies et al., 2013; Melicher et al., 2018). In contrast to these large-scale studies, our effort consists in identifying vulnerable scripts from different domains and analyzing the flows therein.
Injection vulnerabilities on Node.js
The Node.js ecosystem has enabled a proliferation of server and desktop applications written in JavaScript. Injection vulnerabilities are programming errors that enable an attacker to inject and execute malicious code. Recent work (Staicu et al., 2018) has demonstrated the devastating impact of injection vulnerabilities on server-side programs, e.g., when an attacker-controlled string reaches powerful APIs such as exec or eval. Such attacks can severely compromise integrity, e.g., deleting all files in a directory or completely controlling the attacked machine. We study 19 Node.js modules that contain injection vulnerabilities (IDs 1 to 19 in Table 1). As security policies, we consider the interface of a module as an untrusted source and the APIs that interpret strings as code, such as exec or eval, as trusted sinks.
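To make the source and sink of this policy concrete, here is a minimal sketch of an injection-prone module. Everything in it is hypothetical: buildCommand and recordedCommands are illustrative names, the module is not one of the studied benchmarks, and the sink is a stub that records the command instead of calling a real exec.

```javascript
// Hypothetical module interface (untrusted source) that builds a shell command.
const recordedCommands = [];
const execStub = (command) => recordedCommands.push(command); // stand-in for the trusted sink

function buildCommand(message) {
  // The untrusted argument is concatenated into the command without escaping.
  return 'echo "' + message + '"';
}

execStub(buildCommand("build finished"));
// The attacker closes the quote and appends a second command.
execStub(buildCommand('"; touch /tmp/injected; echo "'));

console.log(recordedCommands);
```

An explicit flow connects the module argument to the string that reaches the sink, which is the kind of flow a taint analysis tracks.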
ReDoS vulnerabilities
Regular expression Denial of Service, or ReDoS, is a form of algorithmic complexity attack that exploits the possibly long time of matching a regular expression against an attacker-crafted input. The single-threaded execution model of JavaScript makes JavaScript-based web servers particularly susceptible to ReDoS attacks (Staicu and Pradel, 2018). We analyze 20 web server applications that are subject to ReDoS attacks (IDs 20 to 39 in Table 1). As a security policy, we consider data received via HTTP requests as untrusted sources and regular expressions known to be vulnerable as trusted sinks.
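The following sketch illustrates the effect with a textbook pattern that backtracks heavily. The pattern and the helper timeMatch are assumptions for illustration and are not taken from the studied applications; the measured times depend on the JavaScript engine and machine.

```javascript
// Textbook example of a pattern with nested quantifiers (not from the benchmarks).
const vulnerable = /^(a+)+$/;

function timeMatch(n) {
  // A run of "a" followed by a non-matching character forces backtracking.
  const input = "a".repeat(n) + "!";
  const start = process.hrtime.bigint();
  const matched = vulnerable.test(input);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { n, matched, elapsedMs };
}

// In a backtracking engine, the matching time typically grows quickly with n.
for (const n of [14, 18, 22]) {
  console.log(timeMatch(n));
}
```

Because the match runs on the single event-loop thread, a long-running match delays every other request handled by the same process.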
Buffer vulnerabilities
Buffer vulnerabilities expose memory content filled with previously used data, e.g., cryptographic keys, source code, or system information. In Node.js, such vulnerabilities occur when using the Buffer constructor without explicit initialization. Buffer vulnerabilities are similar to the infamous Heartbleed flaw in OpenSSL (Durumeric et al., 2014), as both allow an attacker to read more memory than intended. We analyze 7 applications subject to buffer vulnerabilities (IDs 40 to 46 in Table 1). The security policy requires that no information flows from the buffer allocation constructor to HTTP requests without initialization.
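As a rough illustration of this policy, the sketch below contrasts an uninitialized and an initialized allocation in current Node.js. It uses Buffer.allocUnsafe as the explicit way to obtain uninitialized memory, which is a substitution for the constructor mentioned above; the helper names makeReplyUnsafe and makeReplySafe are hypothetical.

```javascript
// Returns memory whose contents are unspecified and may hold old data.
function makeReplyUnsafe(size) {
  return Buffer.allocUnsafe(size);
}

// Returns zero-filled memory, so no previously used data can leak.
function makeReplySafe(size) {
  return Buffer.alloc(size);
}

const unsafeReply = makeReplyUnsafe(16);
const safeReply = makeReplySafe(16);
console.log(unsafeReply.length, safeReply.every((byte) => byte === 0));
```

Under the policy above, a flow from the first allocation to an HTTP response would be reported, whereas the second allocation is initialized before use.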
Device fingerprinting and history sniffing
Web-based fingerprinting collects device-specific information, e.g., installed fonts or browser extensions, to identify users (Acar et al., 2014). History sniffing attacks use the fact that browsers display links differently depending on whether the target has been visited (Jang et al., 2010a; Weinberg et al., 2011). We analyze 10 client-side JavaScript applications that are subject to various forms of fingerprinting and history sniffing attacks (IDs 47 to 56 in Table 1). The security policies label as secret the sources that provide sensitive information, e.g., the font height and width, and as public sinks the APIs that enable external communication, e.g., image tags. We adapt these programs to our Node.js-based infrastructure by introducing minimal changes that emulate DOM interactions. We carefully cross-checked these adaptations in a pair-programming fashion, ensuring that all flows in the original program are preserved. The policies are application-specific and mark certain nodes in the emulated DOM as sources and sinks. In contrast to the other benchmarks, these programs can potentially be malicious (Nikiforakis et al., 2013; Jang et al., 2010a). That is, the assumption that the analyzed code is trusted no longer holds.
3. Methodology
To address the research questions from Section 1, we present a methodology that combines a set of novel metrics with a dynamic information flow analysis (Hedin and Sabelfeld, 2012; Hedin et al., 2014), a monitoring strategy (Austin and Flanagan, 2010), and an automated mechanism to insert upgrade statements (Birgisson et al., 2012). The metrics summarize the flows observed during the program execution. This section provides the necessary background on information flow analysis, an informal description of our methodology, and definitions of the metrics. It also presents a formalization of the core of our methodology.
3.1. Setting: Information Flow Analysis
Security labels
An information flow analysis associates each value with a security label that indicates how sensitive the value is. Labels are typically arranged in a lattice (Denning, 1976). To ease the presentation, we focus on two labels: H (for high or sensitive) and L (for low or insensitive), where H is more sensitive than L. Given a label ℓ, we write v^ℓ to denote that a value v has security label ℓ. If a value does not have a label, we assume it is implicitly labeled as L.
Information flow policy
The analysis checks whether data from a sensitive source influences data that arrives at an insensitive sink. The sources and sinks for a program are specified in an information flow policy, or policy for short. For integrity, the policy specifies that no information from untrusted sources (H) reaches trusted sinks (L). For confidentiality, the policy stipulates that no information from secret sources (H) reaches public sinks (L). We model sources by variables and object fields, and their security label corresponds to the label of the value that they contain initially. We denote sinks by a function sink(), which is implicitly labeled as L.
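The two-label setting and the sink check can be sketched in a few lines. This is an illustrative model under stated assumptions, not the paper's implementation: the helpers labeled, join, canFlowTo, and concat are hypothetical names, and only explicit flows through string concatenation are modeled.

```javascript
// Two-point lattice: L (insensitive) is below H (sensitive).
const L = "L";
const H = "H";
const join = (a, b) => (a === H || b === H ? H : L);   // least upper bound
const canFlowTo = (from, to) => from === L || to === H; // L may flow anywhere, H only to H

// Values without an explicit label are treated as L.
const labeled = (value, label = L) => ({ value, label });

// An operation propagates the join of its operands' labels (explicit flow).
const concat = (a, b) => labeled(a.value + b.value, join(a.label, b.label));

// The sink is implicitly labeled L, so only L-labeled values may reach it.
function sink(v) {
  if (!canFlowTo(v.label, L)) {
    throw new Error("policy violation: H-labeled value reached sink");
  }
  return v.value;
}

console.log(sink(concat(labeled("hello "), labeled("world"))));
```

Calling sink() on a value derived from an H-labeled source, e.g., concat(labeled("id="), labeled("secret", H)), throws instead of returning.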
Monitoring strategies
Different monitoring strategies for dynamic information flow analysis address the problem of checking whether an execution violates a policy. In this work, we focus on flow-sensitive dynamic monitors, where variables can be assigned different security labels during the execution. Table 2 gives an overview of the monitoring strategies studied in this paper. Taint analysis tracks only explicit flows and stops the program only if an H-labeled value reaches a sink.
In contrast to taint tracking, the other strategies also track implicit flows. The monitors identify implicit flows by maintaining a security stack that contains all sensitive labels of expressions in conditionals that influence the control flow. When the stack is non-empty, the program executes in a sensitive context. Observable Tracking (Balliu et al., 2017) tracks only explicit and observable implicit flows, but ignores hidden implicit flows. Whenever an L-labeled variable is updated in a sensitive context, observable tracking updates the label to sensitive and continues with the execution. For example, consider the following program, which is trivially secure because there is no call to sink():

    var location; var y; var z;
    if (10 < location && location < 20) {  // branch on the sensitive value
      y = "Home";                          // write in a sensitive context
    }
    //upgrade(y);                          // optional upgrade statement
    z = "You are at " + y;                 // use of y outside the branch

Consider now an execution where location is the sensitive value 15, written 15^H. Observable tracking updates the labels of y and z to sensitive and does not stop the execution.
The strictest monitoring strategies also try to prevent hidden implicit flows. We consider two variants of such a strategy. They both terminate the execution of the program whenever an observable implicit flow may lead to a hidden implicit flow in another execution. The No Sensitive Upgrade strategy (NSU) (Zdancewic, 2002; Austin and Flanagan, 2009) disallows updating the security label of a variable in a sensitive context. In particular, it terminates the execution whenever such an update happens. For example, consider the execution of the above program with location = 15^H. The NSU strategy terminates the program at the assignment y = "Home" due to the update of the L-labeled variable y in a sensitive context.
Permissive Upgrade (PU) (Austin and Flanagan, 2010) is a refinement of the NSU strategy. It labels a value as partially leaked if an L-labeled variable is updated in a sensitive context, and terminates the program if the updated variable is further used outside the sensitive context. Consider again the same execution of the above program. The PU strategy labels y as partially leaked at the assignment y = "Home" because the program writes to the L-labeled variable in a sensitive context, and then terminates the program at the assignment to z because the value is used. In our work, we use the PU strategy to study the prevalence of different kinds of flows.
Upgrade statements
Naively applying the PU strategy to real-world programs can be very restrictive and risks increasing the number of false positives, i.e., terminating many secure executions. To address this problem, Austin and Flanagan propose the upgrade statement (Austin and Flanagan, 2009) and the privatization statement (Austin and Flanagan, 2010). These statements explicitly change the label of a variable to H, to signal a potential hidden implicit flow to the monitor. For example, we can insert an upgrade statement before the assignment to z in the above example to mark y as sensitive even if the branch is not taken. As a result, the program does not terminate immediately when the value is read. If the program later called sink(y), the monitor would terminate the program and report a policy violation.
Permissiveness
The above example illustrates the permissiveness issues of different monitoring strategies, i.e., that they terminate the program unnecessarily even though no policy violation occurs. Neither taint tracking nor observable tracking terminates the program. In contrast, both NSU and PU terminate the program unnecessarily. This overapproximation of policy violations is necessary to avoid potential hidden implicit flows. Adding upgrade statements avoids such premature termination of the program by assigning an H-label to y, independently of which branch of the conditional statement is executed. If we uncomment the upgrade(y) statement, the execution proceeds without terminating the program unnecessarily. That is, upgrade statements may increase the permissiveness, but come at the cost of adding them to the program.
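The behavior of the four strategies on the example program can be replayed with a small hand-written monitor. This is a simplified sketch under stated assumptions, not the Jalangi-based implementation used in the study: runExample is a hypothetical helper that hard-codes the example with location = 15^H, and the label "P" is an assumed encoding of "partially leaked".

```javascript
// Hypothetical mini-monitor for the example program; labels are "L", "H", or "P".
function runExample(strategy, withUpgrade = false) {
  const label = { location: "H", y: "L", z: "L" };
  const location = 15;   // the sensitive input 15^H
  let y, z;
  const stack = [];      // security stack of sensitive branch conditions

  // if (10 < location && location < 20) { y = "Home"; }
  const tracksImplicit = strategy !== "taint";
  if (tracksImplicit && label.location === "H") stack.push("H");
  if (10 < location && location < 20) {
    const sensitiveContext = stack.length > 0;
    if (sensitiveContext && label.y === "L") {
      if (strategy === "NSU") return 'stopped at y = "Home"';
      // PU marks y as partially leaked; observable tracking upgrades it to H.
      label.y = strategy === "PU" ? "P" : "H";
    }
    y = "Home";
  }
  stack.pop();

  // upgrade(y): explicitly mark y as sensitive, whichever branch ran.
  if (withUpgrade) label.y = "H";

  // z = "You are at " + y;
  if (label.y === "P") return "stopped at use of y";
  z = "You are at " + y;
  label.z = label.y;
  return `finished with y:${label.y} z:${label.z} (${z})`;
}

for (const strategy of ["taint", "observable", "NSU", "PU"]) {
  console.log(strategy, "->", runExample(strategy));
}
console.log("PU + upgrade ->", runExample("PU", true));
```

In this sketch the upgrade is placed after the conditional, as in the example, so it only helps the PU strategy; NSU still stops at the write inside the branch.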
3.2. Security Metrics
Our approach uses program testing to measure the prevalence of different kinds of information flows. The basic idea is to test a program with an information flow monitor that implements the PU strategy, while incrementing counters that represent the number of explicit, observable implicit, and hidden implicit flows. These counters then allow us to reason about the prevalence of the different kinds of flows and about the policy violations that different monitoring strategies would detect. In contrast to the PU monitor that terminates the program when it encounters a policy violation, our monitor continues the execution to measure flows in the remainder of the execution. We refer to Section 3.3 for the formal definition of the monitor.
We consider information flows at two levels of granularity. On the one hand, we consider flows induced by a single operation in the program (Section 3.2.1). We call such flows micro flows or simply flows. Studying flows at the micro flow level is worthwhile because it provides a detailed understanding of the operations that contribute to higher-level flows. In particular, flows provide a quantitative answer to the permissiveness challenges faced by state-of-the-art dynamic monitors that implement the NSU or the PU strategy. On the other hand, we consider transitive flows of information between a source and a sink, called source-to-sink flows (Section 3.2.3). Studying flows at this coarse-grained level is worthwhile because source-to-sink flows are what security analysts are interested in when using an information flow analysis.
The metrics presented in this section measure the prevalence of flows quantitatively, and do not attempt to judge the importance of flows. To ensure that our flows represent relevant problems, our study uses real-world security problems and policies that capture these issues.
3.2.1. Micro Flows
To measure how many explicit, observable implicit, and hidden implicit flows exist, our monitor increments the counters for these micro flows as follows.
Explicit flows
The approach counts an explicit flow for every assignment event where the written value is sensitive but the value that gets overwritten (if any) is not sensitive. The rationale is to capture program behavior where sensitive information flows to a memory location that stores insensitive information. In contrast, overwriting a sensitive value with another (in)sensitive value does not leak any new information, and therefore does not count as an explicit flow.
For example, consider this code:
var x = 3^H; var y = 5^H; var z;
x = y; // no explicit flow
z = x; // explicit flow
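The counting rule for explicit flows can be sketched as follows. This is a minimal illustration under assumed names (`makeExplicitCounter`, `declare`, `assign`), with sensitivity modelled as a boolean flag per variable; it is not the monitor's actual data structure.

```javascript
// Hypothetical store: variable name -> { value, sensitive }.
function makeExplicitCounter() {
  const store = new Map();
  let explicitFlows = 0;
  return {
    declare(name, value, sensitive = false) {
      store.set(name, { value, sensitive });
    },
    // Models `target = source`. An explicit flow is counted only when a
    // sensitive value overwrites a value that is not sensitive.
    assign(target, source) {
      const written = store.get(source);
      const overwritten = store.get(target);
      if (written.sensitive && !(overwritten && overwritten.sensitive)) {
        explicitFlows += 1;
      }
      store.set(target, { value: written.value, sensitive: written.sensitive });
    },
    count() { return explicitFlows; },
  };
}

const explicitDemo = makeExplicitCounter();
explicitDemo.declare('x', 3, true);   // x holds a sensitive value
explicitDemo.declare('y', 5, true);   // y holds a sensitive value
explicitDemo.declare('z', undefined); // z is insensitive
explicitDemo.assign('x', 'y');        // sensitive over sensitive: not counted
explicitDemo.assign('z', 'x');        // sensitive over insensitive: counted
```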
Observable implicit flows
The approach counts an observable implicit flow for every assignment event that happens in a sensitive context and that overwrites an insensitive value. Similar to explicit flows, the rationale is to capture program behavior that writes sensitive information to a memory location that stores insensitive information. The main difference is that the assignment happens because of a control flow decision made based on a sensitive context. Note that it is irrelevant whether the written value is sensitive because the fact that a write happens leaks sensitive information.
For example, consider this code:
var x = true^H; var y = 3; var z;
if (x)
  y = 5; // observable implicit flow
z = 7; // no flow
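A sketch of the corresponding counting rule is given below. It assumes a stack that records whether each enclosing branch condition is sensitive; the names `makeObservableCounter`, `enterBranch`, and `assignConstant` are hypothetical and chosen for illustration only.

```javascript
function makeObservableCounter() {
  const sensitive = new Map(); // variable name -> boolean
  const contextStack = [];     // sensitivity of the enclosing branch conditions
  let observableFlows = 0;
  return {
    declare(name, isSensitive = false) { sensitive.set(name, isSensitive); },
    enterBranch(conditionName) { contextStack.push(sensitive.get(conditionName)); },
    exitBranch() { contextStack.pop(); },
    // Models the assignment of a constant to `name`. The written value is
    // irrelevant: the write itself leaks information in a sensitive context.
    assignConstant(name) {
      const inSensitiveContext = contextStack.some(Boolean);
      if (inSensitiveContext && !sensitive.get(name)) observableFlows += 1;
      sensitive.set(name, inSensitiveContext);
    },
    isSensitive(name) { return sensitive.get(name); },
    count() { return observableFlows; },
  };
}

const observableDemo = makeObservableCounter();
observableDemo.declare('x', true);  // x holds a sensitive boolean
observableDemo.declare('y');        // y is insensitive
observableDemo.declare('z');        // z is insensitive
observableDemo.enterBranch('x');    // if (x)
observableDemo.assignConstant('y'); //   y = 5;  counted
observableDemo.exitBranch();
observableDemo.assignConstant('z'); // z = 7;    not counted
```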
Hidden implicit flows
The approach counts a hidden implicit flow for every execution of an upgrade statement of a variable containing insensitive information. The rationale is to capture assignment events that did not happen, but that could have happened during the execution if a control flow decision that depends on a sensitive value had been different.
For example, consider this code:
var x = false^H; var y; var z;
if (x)
  y = 5; // not executed, no flow
upgrade(y); // hidden implicit flow
z = y; // explicit flow
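The rule for hidden implicit flows only concerns upgrade statements, which the following sketch isolates. The helper name `makeHiddenCounter` and the boolean sensitivity flag are assumptions for illustration.

```javascript
function makeHiddenCounter() {
  const sensitive = new Map(); // variable name -> boolean
  let hiddenFlows = 0;
  return {
    declare(name, isSensitive = false) { sensitive.set(name, isSensitive); },
    // Models `upgrade(name)`: counted only if the variable currently
    // contains insensitive information.
    upgrade(name) {
      if (!sensitive.get(name)) hiddenFlows += 1;
      sensitive.set(name, true);
    },
    isSensitive(name) { return sensitive.get(name); },
    count() { return hiddenFlows; },
  };
}

const hiddenDemo = makeHiddenCounter();
hiddenDemo.declare('x', true); // sensitive condition, false at runtime
hiddenDemo.declare('y');       // the branch that writes y is not executed
hiddenDemo.upgrade('y');       // counted: y was insensitive
hiddenDemo.upgrade('y');       // not counted: y is already sensitive
```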
3.2.2. Label Creep
As mentioned earlier, a common reason for false positives is label creep. Since measuring false positives would be subject to a given source-to-sink policy, we focus on measuring the prevalence of the more general phenomenon of label creep in micro flows. Recall that this concept refers to the fact that information flow analysis may quickly label a large portion of all values handled in a program as sensitive. In most cases, this leads to an explosion in false positives that in turn reduces the usefulness of the analysis. We propose a novel metric called Label Creep Ratio (LCR) to assess how many variables and object fields in memory are labeled as sensitive.
\[ \mathit{LCR} = \frac{\text{number of assignments of } H\text{-labeled values}}{\text{total number of assignments}} \]
For a given monitoring strategy, the Label Creep Ratio is the ratio between the number of assignments of H-labeled values and the total number of assignments. Intuitively, measuring the LCR throughout an execution estimates the speed at which the memory locations get assigned sensitive labels.
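As a minimal sketch, the ratio can be computed from a log of assignment events. The trace format (one record per assignment with a `sensitive` flag) and the cumulative reading of the ratio over prefixes of the execution are assumptions made here for illustration.

```javascript
// Ratio of assignments of sensitive values to all assignments in the trace.
function labelCreepRatio(assignments) {
  if (assignments.length === 0) return 0; // assumption: empty trace has ratio 0
  const sensitiveCount = assignments.filter((a) => a.sensitive).length;
  return sensitiveCount / assignments.length;
}

// Ratio after each prefix of the trace, to observe how it evolves over time.
function labelCreepOverTime(assignments) {
  return assignments.map((_, i) => labelCreepRatio(assignments.slice(0, i + 1)));
}

const lcrTrace = [
  { target: 'a', sensitive: false },
  { target: 'b', sensitive: true },
  { target: 'c', sensitive: true },
  { target: 'a', sensitive: false },
];
const lcrFinal = labelCreepRatio(lcrTrace);     // 2 of 4 assignments are sensitive
const lcrSeries = labelCreepOverTime(lcrTrace); // one ratio per prefix
```

The series need not grow monotonically: an assignment of an insensitive value lowers the ratio.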
3.2.3. Source-to-sink Flows
To what degree do different kinds of flows contribute to policy violations? To address this question, we consider transitive flows from a source of sensitive information to a sink of insensitive information. For instance, none of the flows in the examples above correspond to a source-to-sink flow, since no sink statement is present.
Now, consider the code:
var x = false^H; var y; var z;
if (x)
  y = 5;
upgrade(y); // hidden micro flow
z = x; // explicit micro flow
sink(y); // source-to-sink flow
The program contains two micro flows and one source-to-sink flow. However, if the execution is analyzed with taint tracking or observable tracking, the source-to-sink flow is missed, because it occurs only due to the upgrade statement.
As another example, consider the following code:
var x = true^H; var y; var z;
if (x)
  y = 5; // observable flow
z = x; // explicit flow
sink(y+z); // source-to-sink flow
The source-to-sink flow will be detected by all three kinds of monitoring strategies, because the variable z gets labeled via an explicit micro flow and then gets passed to the sink.
As illustrated by these two examples, we measure how many source-to-sink flows different monitoring strategies detect by tracking what micro flows contribute to a source-to-sink flow. Furthermore, to count the number of unique source-to-sink flows that a monitor detects, we compute the set of source code locations involved in each source-to-sink flow. If the code locations of two source-to-sink flows are the same, we count them as only one unique flow. This corresponds to the way a human security analyst would inspect warnings produced by an analysis.
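The de-duplication step can be sketched as follows. The report format, with a `locations` array of `file:line` strings per detected flow, is a hypothetical representation chosen for this example.

```javascript
// Keeps one report per distinct set of source code locations.
function uniqueFlows(reports) {
  const seen = new Map();
  for (const report of reports) {
    // Sort and de-duplicate so that the key depends only on the set of locations.
    const key = [...new Set(report.locations)].sort().join('|');
    if (!seen.has(key)) seen.set(key, report);
  }
  return [...seen.values()];
}

const flowReports = [
  { sink: 'eval', locations: ['app.js:3', 'app.js:7'] },
  { sink: 'eval', locations: ['app.js:7', 'app.js:3'] }, // same set, different order
  { sink: 'eval', locations: ['app.js:3', 'app.js:9'] },
];
const distinctFlows = uniqueFlows(flowReports);
```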
3.2.4. Inference of Upgrade Statements
The approach described so far requires a program that indicates hidden implicit flows through upgrade statements. To obtain such a program, we adapt a testing-based technique for automatically inserting upgrade statements (Birgisson et al., 2012). The basic idea is to repeatedly execute the program with a particular policy, to monitor the execution for potentially missed hidden implicit flows (using the PU strategy (Austin and Flanagan, 2010), see Section 3.1), and to insert upgrade statements that signal them to the monitor when counting micro flows. Whenever the monitor terminates the program because it detects an access to a value that is marked as partially leaked, the approach modifies the program by inserting an upgrade statement at the code location where the value is next used; this upgrade statement in the modified program will then be executed whenever the value is used there again, regardless of whether the same branch that led to the insertion of the upgrade statement is taken. The process continues until it reaches a fixed point, i.e., until the program has enough upgrade statements for the given tests.
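The iteration to a fixed point can be sketched as a simple loop. Here `runTest` is a hypothetical callback standing in for a monitored test run: it receives the set of upgrade locations and returns the location where the monitor stopped, or `null` if the run completed. The toy run below is likewise an assumption used only to exercise the loop.

```javascript
function inferUpgrades(runTest, maxIterations = 1000) {
  const upgrades = new Set();
  for (let i = 0; i < maxIterations; i++) {
    const stoppedAt = runTest(upgrades);
    if (stoppedAt === null) return upgrades; // fixed point: no further terminations
    if (upgrades.has(stoppedAt)) {
      throw new Error(`no progress at ${stoppedAt}`);
    }
    upgrades.add(stoppedAt); // record an upgrade at the offending use
  }
  throw new Error('iteration bound exceeded');
}

// Toy stand-in for a monitored run: partially leaked values are used at three
// locations, and the monitor stops at the first one that is not yet upgraded.
const partialUses = ['lib.js:12', 'lib.js:30', 'lib.js:41'];
function toyRun(upgrades) {
  const next = partialUses.find((location) => !upgrades.has(location));
  return next === undefined ? null : next;
}

const inferredUpgrades = inferUpgrades(toyRun);
```

Storing locations in a set rather than rewriting the source mirrors the idea of performing upgrades at runtime, although the details here are illustrative.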
The ability of our analysis to observe hidden implicit flows depends on the completeness of the inferred upgrade statements, since missing upgrade statements may result in false negatives for hidden implicit flows. How often this occurs depends on how well the analyzed executions cover the branches of the programs. One way to assess this ability would be to measure traditional branch coverage, i.e., the percentage of all branches that are covered by the given test inputs. However, traditional branch coverage is only of limited use because inserting upgrade statements does not rely on covering all branches in the code, but only on a subset. Specifically, the ability to insert upgrade statements depends on the branch coverage for conditionals that depend on sensitive values. We present a metric called Sensitive Branch Coverage (SBC) that captures this idea:
\[ \mathit{SBC} = \frac{\text{number of covered branches of conditionals in } C}{\text{total number of branches of conditionals in } C} \]
where C is the set of conditionals that depend on a sensitive value. For example, consider executing the following program with x = false^H:
var x; var y;
if (x)
  y = 5;
The set C consists of the conditional if (x) at line 2, but since the execution covers only the false branch, the SBC is 50%.
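A sketch of how such a coverage value could be computed is shown below. It assumes that every sensitive conditional has exactly two branches and that coverage is recorded per branch; the record format and the convention of returning 1 when there are no sensitive conditionals are assumptions of this example.

```javascript
// Fraction of branches covered among conditionals that depend on sensitive values.
function sensitiveBranchCoverage(conditionals) {
  if (conditionals.length === 0) return 1; // assumption: nothing to cover
  let covered = 0;
  for (const conditional of conditionals) {
    if (conditional.trueCovered) covered += 1;
    if (conditional.falseCovered) covered += 1;
  }
  return covered / (2 * conditionals.length);
}

// One sensitive conditional whose false branch is the only one executed.
const sbcExample = sensitiveBranchCoverage([
  { location: 'example.js:2', trueCovered: false, falseCovered: true },
]);
```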
3.3. Formalization of Flows and Conditions
We define the syntax and semantics of NanoJS, a simplified core of JavaScript to illustrate the flow counting performed by our implementation.
**Notation: ** We denote empty sequences by . Concatenating two sequences and is denoted by . Slightly abusing notation, we also use the same notation to prepend a single element to a sequence by writing . Similarly, we write to denote that occurs in sequence .
NanoJS syntax: NanoJS statements:
[TABLE]
A terminated execution is denoted by . All function calls to sinks with expression are modeled by ; other function calls are not considered in NanoJS.
**Semantics: ** Operationally, the constructs in NanoJS behave as in standard imperative languages. To count micro flows, we associate each primitive value with a tuple of flow counts, where . A tuple denotes explicit flows, observable flows, and hidden flows. A value is either a primitive value annotated with a flow count, or an address on the heap. We assume that there is a set of primitive base types, such as boolean, numbers, and strings. A heap object maps a finite set of names to values. We write tt for boolean value true and ff for boolean value false.
We use flow counts to track how information is propagated by a program, analogous to labels in other information flow monitors. We define a join-semilattice structure for flow counts as follows. Intuitively, a non-zero flow count indicates a sensitive value, whereas if all flow counts are zero, the value is insensitive: The join of two flow counts is defined as , where denotes the pointwise addition of the two flow counts. Two flow counts satisfy if whenever then .
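The flow-count structure can be illustrated with a small sketch. The field names `explicit`, `observable`, and `hidden` and the helper names are assumptions; only the pointwise addition and the zero test follow the description above.

```javascript
// A flow count is modelled as a triple of non-negative integers.
const zeroCount = Object.freeze({ explicit: 0, observable: 0, hidden: 0 });

// Join of two flow counts as pointwise addition.
function joinCounts(a, b) {
  return {
    explicit: a.explicit + b.explicit,
    observable: a.observable + b.observable,
    hidden: a.hidden + b.hidden,
  };
}

// A value is treated as sensitive if and only if some component is non-zero.
function hasFlows(count) {
  return count.explicit + count.observable + count.hidden > 0;
}

const oneExplicit = { explicit: 1, observable: 0, hidden: 0 };
const joinedCount = joinCounts(oneExplicit, oneExplicit);
```

Joining two counts with one explicit flow each yields a count with two explicit flows, so the counts record how many flows contributed to a value rather than only whether it is sensitive.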
A configuration consists of a statement , an environment mapping variable names to values, a heap , a stack of security levels , and a sink counter counting flows reaching sink statements; we denote the set of configurations by . An execution of a NanoJS program yields a trace indicating outputs produced by the execution.
We now define the small-step semantics of NanoJS. A step denotes a single evaluation step producing trace . We write for a terminated execution. Slightly abusing notation, we define as the join of all labels occurring in value with heap . For simplicity, we assume that there are no cyclical references on the heap.
The function denotes the pair , where the hidden flow count of all components of the value of a variable is incremented by 1. To update flow counts, we use an auxiliary function . Intuitively, increments the explicit and observable flow counters for assigning a value with flow count to a location with label while the security stack is . We define where and
To define observations based on references passed to sinks, we use a helper function that, given a value, returns all references to heap objects reachable from the value. We denote evaluating an expression in environment and heap by . The rules propagate flow counts into the result values; for example, adding two values with one explicit flow each will result in two explicit flows in the result. We assume, contrary to real-world JavaScript, that expressions do not have side effects.
Finally, Figure 2 gives the rules of small-step operational semantics for NanoJS with flow counting. The way the rules modify the environment and heap is standard. Some standard rules are omitted and provided in the appendix. In addition to the standard execution of a program, the semantics also track flow counts for each value. For example, an assignment statement propagates the flow counts of the assigned expression and additionally increments the explicit flow count if the expression has non-zero flows and the observable flow count if the control-flow path is determined by sensitive data. A sink statement increments global counts representing source-to-sink flows. Since all sink statements model writes to insensitive sinks, any write of an expression with non-zero flow counts will result in incrementing the global counters.
**Security conditions:** We also adapt existing security conditions for tracking only explicit or observable flows to NanoJS (Balliu et al., 2017). To capture only explicit flows, we use the notion of explicit secrecy; intuitively, a run of a program satisfies explicit secrecy if and only if the program obtained by sequentially composing all non-control-flow commands executed during that run does not leak information. For example, the program would produce the extracted programs or depending on the value of in a given run. In both cases, the extracted program contains no prohibited information flows, since the source program only leaks information through an implicit flow.
To track only explicit and observable implicit flows, we keep branching constructs in the extracted program, but replace not taken branches by skip. If the extracted program does not leak sensitive information, then the run satisfies observable secrecy. For example, in the program , observable secrecy would extract either or . This matches the intuition that an observable flow only occurs in the run where is tt, where the assignment is executed, but not in a run where is ff, since this run only leaks information through a hidden implicit flow; i.e. the extracted program when leaks information, but the extracted program for does not. Appendix A gives formal definitions of the two notions.
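The two extraction schemes can be sketched on a toy representation. The mini-AST below (constant assignments and conditionals on a named variable), the `extract` function, and the example program `if (h) l = 1` are all hypothetical; the formal definitions are those referenced in Appendix A.

```javascript
// Statements: { type: 'assign', target, value } assigns a constant;
// { type: 'if', cond, then, else } branches on the variable named `cond`.
function extract(statements, env, mode) {
  const out = [];
  for (const s of statements) {
    if (s.type === 'assign') {
      env[s.target] = s.value;
      out.push(s);
    } else if (s.type === 'if') {
      const conditionHolds = Boolean(env[s.cond]);
      const body = extract(conditionHolds ? s.then : s.else, env, mode);
      if (mode === 'explicit') {
        out.push(...body); // keep only the executed non-control-flow commands
      } else {
        const skip = [{ type: 'skip' }]; // replaces the branch that was not taken
        out.push(conditionHolds
          ? { type: 'if', cond: s.cond, then: body, else: skip }
          : { type: 'if', cond: s.cond, then: skip, else: body });
      }
    }
  }
  return out;
}

// Example program: if (h) l = 1;
const secrecyProgram = [
  { type: 'if', cond: 'h', then: [{ type: 'assign', target: 'l', value: 1 }], else: [] },
];
const explicitWhenTrue = extract(secrecyProgram, { h: true }, 'explicit');
const explicitWhenFalse = extract(secrecyProgram, { h: false }, 'explicit');
const observableWhenTrue = extract(secrecyProgram, { h: true }, 'observable');
const observableWhenFalse = extract(secrecyProgram, { h: false }, 'observable');
```

In the explicit mode the extracted program never mentions `h`, whereas in the observable mode the dependence on `h` survives only in the run where the assignment is executed.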
**Soundness:** To establish soundness of our counting scheme, we show that if all explicit flow counts for all sinks for a given run are zero, then that run satisfies explicit secrecy. Similarly, we show that if all explicit and observable flow counts are zero, the run satisfies observable secrecy. The formal theorem statements and proofs can be found in Appendices B and C.
3.4. Implementation
To implement our methodology, we develop a tool for dynamic information flow analysis following Hedin et al. (Hedin and Sabelfeld, 2012; Hedin et al., 2014). The implementation builds on Jalangi (Sen et al., 2013), a dynamic analysis framework for JavaScript that uses source-to-source transformation. Since Jalangi supports ECMAScript 5 only, we down-compile programs written in newer versions of the language with Babel (bab, 2 08). Building on top of Jalangi allows us to focus on the important parts of the analysis and let the framework handle otherwise challenging aspects of implementing a dynamic information flow analysis, e.g., on-the-fly instrumentation of code produced by eval, exceptional termination of functions, boxing and unboxing of primitive values (Chudnov and Naumann, 2015). We handle higher-order functions and track dynamic modification of object properties as described by Hedin et al. (Hedin and Sabelfeld, 2012). Our policy language is expressive, allowing the security analyst to mark both functions and arguments of callbacks as sources.
To approximate the effects of native calls, we model them by transferring the labels from all parameters to the return value. Moreover, if one of the parameters is an object, we propagate labels from all its properties to the return value. For a set of frequently used native functions, such as Array.push, Array.forEach, Object.call, and Object.defineProperty, we create richer models that propagate labels more precisely. To increase the confidence in our implementation, we created more than 100 validation tests that assert the correctness of label propagation in typical usage scenarios. When inserting upgrades, the implementation does not modify the actual source code but stores the source code locations of upgrades, and then performs the upgrades at runtime.
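The default model for native calls can be sketched as follows. The labelled-value wrapper `{ value, sensitive }`, the restriction to plain objects with one level of properties, and the function names are assumptions of this example; it does not use the Jalangi API.

```javascript
// For objects, `value` maps property names to labelled values.
function isObjectValue(labelled) {
  return labelled.value !== null && typeof labelled.value === 'object';
}

// True if the argument itself or any of its properties is sensitive.
function anySensitive(labelled) {
  if (labelled.sensitive) return true;
  if (isObjectValue(labelled)) {
    return Object.values(labelled.value).some((property) => property.sensitive);
  }
  return false;
}

// Strips labels so that the native function sees ordinary values.
function unwrap(labelled) {
  if (!isObjectValue(labelled)) return labelled.value;
  const plain = {};
  for (const [key, property] of Object.entries(labelled.value)) {
    plain[key] = property.value;
  }
  return plain;
}

// Default model: the result is sensitive if any parameter, or any property
// of an object parameter, is sensitive.
function callNative(nativeFunction, ...args) {
  const result = nativeFunction(...args.map((arg) => unwrap(arg)));
  return { value: result, sensitive: args.some((arg) => anySensitive(arg)) };
}

const maxResult = callNative(
  Math.max,
  { value: 1, sensitive: false },
  { value: 2, sensitive: true },
);
const jsonResult = callNative(JSON.stringify, {
  value: {
    a: { value: 1, sensitive: false },
    b: { value: 's', sensitive: true },
  },
  sensitive: false,
});
```

Such a model is conservative: the result of `Math.max` is labelled sensitive even in runs where the sensitive argument does not determine the result.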
4. Empirical Study
This section presents the results of our empirical study that assesses the costs and benefits of tracking different kinds of flows.
The last two columns of Table 1 show the sensitive branch coverage (SBC) and the number of upgrades inserted while executing the benchmarks. Overall, the tests used for the study reach a high SBC, for 54% of the programs even 100%, enabling the analysis to insert upgrade statements. For each of the considered benchmarks, our tool can detect source-to-sink flows. This is hardly surprising, since we already know that the programs contain such flows, but it shows that our tool can handle complex, real-life JavaScript code.
4.1. Prevalence of Micro Flows
First, we address the question of how prevalent explicit, observable implicit, and hidden implicit micro flows are among all operations that induce an information flow. Figure 3 shows the distribution of micro flows for our benchmarks. The majority of benchmarks contain both implicit and explicit micro flows. Benchmark 39 is a special case where reaching the sink is the first operation performed on the untrusted data, and hence the data flows directly from source to sink without producing any micro flow. Explicit flows are by far the most prevalent, appearing in all but one benchmark. Five benchmarks also contain hidden implicit flows, but we can safely conclude that these cases are rare.
4.2. Source-to-sink Flows
We now evaluate source-to-sink flows, which are the ultimate measure of success for an information flow analysis. Source-to-sink flows are what a security analyst ultimately cares about: how information from a sensitive source reaches an insensitive sink. Information flow analysis has no way to show that such a flow is security-relevant; it is the analyst’s job to further inspect the flows and decide. In this section, however, we have a different goal and setup: we start with a set of known security problems that produce a source-to-sink flow and proceed by showing what type of analysis is needed to detect these problems.
Our tool can enforce different security conditions (cf. Section 3.3). For example, if we are interested only in explicit and observable implicit flows, we can run the tool in observable tracking mode and enforce observable secrecy. Figure 4 presents the number of source-to-sink flows detected by different monitoring strategies. All the integrity vulnerabilities can be detected by taint tracking alone, and all the security violations in our data set can be detected through observable tracking. Moreover, all the Node.js vulnerabilities can be detected by taint tracking alone, independently of whether they are confidentiality or integrity vulnerabilities. We argue that this is because our Node.js programs are expected to be trusted. That is, a security issue may arise from a programming error, but not from malicious intent. This assumption does not hold, however, for the fingerprinting and history sniffing benchmarks, where only observable implicit flows contribute to the source-to-sink flows. A second explanation for why implicit flows are prevalent in the browser environment is that the browser already has a set of security mechanisms that prevent certain types of dangerous behavior. For example, when fingerprinting the login state using images, an attacker cannot directly read the bytes of the image due to the same-origin policy, and hence relies on measuring its width.
We analyzed in detail the additional source-to-sink flows detected by observable tracking for benchmarks 12, 26, 34, 43, and 44, and by PU for benchmark 34. In all these cases the reported flows are false positives, since they do not allow an attacker to exploit the respective vulnerability. In Section 4.4, we discuss in detail why these false positives occur when data is propagated through implicit flows.
Our results indicate that observable tracking is enough to tackle all the real-life security problems we consider and that taint tracking suffices for all the trusted code. We do not claim that there are no real-life security problems beyond observable secrecy; we just do not see any in our data set. Moreover, we believe that when strong controls are in place, attackers will be motivated to use more sophisticated attacks, possibly through the use of hidden implicit flows. However, tracking these flows is expensive, as we will see in the remainder of this section.
4.3. Permissiveness
A potential problem for adopting information flow analysis in practice is its limited permissiveness, i.e., the fact that a monitor may terminate the program even though no data flows from a source to a sink. Our metrics allow us to quantify this effect both for the NSU and the PU monitoring strategies. Specifically, we measure how many code locations a user would have to inspect because a monitor terminates the program. The NSU monitor terminates the program when an update of an insensitive variable is performed in a sensitive context. This condition corresponds to observable implicit micro flows and we count the number of code locations where such a flow occurs. The PU monitor terminates the program when an insensitive variable that was updated in a sensitive context is read. This termination condition corresponds to the locations where our tool inserts an upgrade statement. Figure 5 shows the number of code locations affected by the lack of permissiveness for NSU and PU. We exclude benchmarks for which neither of the monitoring strategies raises an alarm. On average, NSU throws 5.46 times more alarms than PU, that is, PU is much more practical than NSU. However, when comparing the PU violations to the number of source-to-sink flows that require PU (Figure 4), we observe that most of the PU alarms do not translate to actual source-to-sink flows and should be considered false positives.
4.4. Label Creep Ratio
As a second metric for the cost of different kinds of flows, we use the Label Creep Ratio (LCR) defined in Section 3.2.2. For each benchmark and monitoring strategy, we measure how the LCR changes during the execution. Figure 6(a) shows the ratio for PU monitoring. The metric is not monotonically increasing because the analysis is flow-sensitive, i.e., the security label of a variable may change over time. Nevertheless, the LCR steadily increases for most benchmarks, which confirms the label creep problem. Because our policies are targeted at detecting known security problems in the benchmarks, the maximum LCR reached is relatively low (20%, on average).
A comparison of different monitoring strategies shows that stricter monitoring causes more label creep. On average, observable tracking has a 0.3% smaller LCR than PU; a taint tracking analysis has a 45.4% smaller LCR than observable tracking. Figure 6(b) illustrates this effect with a representative benchmark (number 11). The graph shows how label creep increases for observable tracking compared to taint tracking.
We illustrate with the same benchmark 11 how label creep may translate to false positives. By revisiting Figure 4, we observe that the implicit flows do not contribute additional source-to-sink violations compared to a taint analysis. Figure 7 shows an excerpt of the source code of the benchmark. The code is vulnerable to code injection, where query is the source and eval is the sink. The source-to-sink flow is trivial since the sensitive data is directly passed to the sink at line 3, which a taint tracker easily detects. In addition, observable tracking pushes the query on the security stack at line 2, which causes implicit flows at lines 10 and 11 where two constants are written to memory. For detecting code injections, these flows are irrelevant. For example, suppose we have a statement eval(this.operator) at line 12, for which observable tracking would report a source-to-sink flow. This source-to-sink flow would be a false positive because the attacker can only control whether the call to eval happens, not what value flows into it.
4.5. Runtime Overhead
The last cost metric we use is a proxy measure for the runtime overhead imposed by different monitors. For each benchmark we count the number of operations that propagate a label or that modify the security stack. Table 3 shows how the number of events depends on the kind of monitor, aggregated by the different types of vulnerabilities we consider. As expected, raising the security bar translates into larger runtime overhead. Interestingly, this increase is not uniform across the different types of benchmarks. For injections, the cost increase is relatively small, while for ReDoS and client-side programs the increase between explicit and observable secrecy is more than 2.5-fold. We hypothesize that this is due to the structure of the programs: when comparing these findings with the micro flows in Figure 3, we see that implicit flows are more common in ReDoS and client-side programs than in injections. The price paid to track implicit flows in the client-side benchmarks translates to detected source-to-sink flows, as we have seen in Section 4.2, while this is not the case for ReDoS vulnerabilities.
4.6. Threats to Validity
The validity of the conclusions drawn from our study is subject to several threats. First, our dynamic information flow analysis uses a simple model for native functions (Section 3.4), which may not accurately capture all effects of these functions. To minimize the influence of this limitation, we focus the study on subject programs that have relatively few native calls. We also wrote a set of precise models for some of the array and string native functions. Second, our results are limited to the programs we consider and may not generalize to other programs or classes of programs. In particular, we mostly consider non-malicious programs, where difficult-to-analyze flows may be less prevalent than in malicious code. Our methodology is generic enough to be easily applied to other programs. Finally, the hidden implicit flows that our methodology can observe are bounded by the upgrade statements inserted into the programs, which in turn depend on the tests we use to exercise the programs. To mitigate this threat, we constructed tests in a way that increases the sensitive branch coverage. However, multiple paths cannot be covered for a variety of reasons, e.g., error cases that cannot be easily triggered or infeasible execution paths. Despite these limitations, our study produces interesting insights about the kinds of flows that appear in real-world JavaScript programs and the cost-benefit tradeoff of information flow analysis.
5. Related Work
Denning and Denning pioneered the development and formal description of static information flow analyses (Denning and Denning, 1977; Denning, 1976). Fenton studies purely dynamic information flow monitors (Fenton, 1974). A large body of work has since refined the Dennings’ and Fenton’s ideas and adapted them to various programming languages. Table 4 presents some of the more recent tools and shows what kinds of flows they consider. Many analyses consider only explicit flows (Schwartz et al., 2010). Among the analyses that consider implicit flows, the majority stop or modify the program as soon as a hidden flow occurs.
Information Flow Analysis for JavaScript
Chugh et al. propose a static-dynamic analysis that reports flows from code given to eval() to sensitive locations, such as the location bar of a site (Chugh et al., 2009). Austin and Flanagan address the problem of hidden implicit flows (Austin and Flanagan, 2009, 2010), as discussed in detail in Section 2. Hedin et al. propose a dynamic analysis that implements the NSU strategy for a subset of JavaScript (Hedin and Sabelfeld, 2012). They develop JSFlow, which supports the full JavaScript language, but it requires inserting upgrade statements manually (Hedin et al., 2014). Birgisson et al. propose to automatically insert upgrade statements (Birgisson et al., 2012) by iteratively executing tests under the NSU monitor. Their approach is implemented for a JavaScript-like language, whereas we support the full JavaScript language. Our monitor implements the PU strategy to insert upgrade statements, which reduces the number of upgrade statements and increases permissiveness. Bichhawat et al. propose a variant of PU, where the program is terminated whenever a partially leaked value may flow into the heap (Bichhawat et al., 2014). A WebKit-based browser by Kerschbaumer et al. (Kerschbaumer et al., 2013) balances performance and permissiveness by probabilistically switching between taint tracking and observable tracking and deploys crowdsourcing techniques to discover information flow violations by Alexa Top 500 pages.
Other Work on Information Flow Analysis
Balliu et al. study a family of information flow trackers for different kinds of flows and propose security conditions to evaluate their soundness (Balliu et al., 2017). We borrow their conditions to prove the soundness of our monitor for NanoJS. Bao et al. show that considering implicit flows can cause a significant amount of false positives and propose a criterion to determine a subset of all conditionals to consider (Bao et al., 2010). Chandra and Franz propose a VM-based analysis for Java that combines a conservative static analysis with a dynamic analysis to track all three kinds of flows considered in this paper (Chandra and Franz, 2007). Dytan is a dynamic information flow analysis for binaries that supports both explicit and observable implicit flows (Clause et al., 2007). Myers and Liskov introduce Jif, a language for specifying and statically enforcing security policies for Java programs (Myers and Liskov, 2000). A survey by Sabelfeld and Myers provides an overview of further static approaches (Sabelfeld and Myers, 2003).
Applications of Information Flow Analysis
Information flow analysis is widely used to discover potential vulnerabilities. All approaches we are aware of consider only a subset of the three kinds of flows. Flax uses taint analysis to find incomplete or missing input validation and generates attacks that try to exploit the potential vulnerabilities (Saxena et al., 2010). Lekies et al. (Lekies et al., 2013) and Melicher et al. (Melicher et al., 2018) propose a similar approach to detect DOM-based XSS vulnerabilities. Jang et al. analyze various web sites with information flow policies targeted at common privacy leaks and attack vectors, such as cookie stealing and history sniffing (Jang et al., 2010a). Their analysis considers observable implicit flows but not hidden implicit flows. Sabre analyzes flows inside browser extensions to discover malicious extensions (Dhawan and Ganapathy, 2009). Their analysis considers only explicit flows.
Studies of Information Flow
King et al. (King et al., 2008) share our goal of understanding practical trade-offs between explicit and implicit flows. They empirically study implicit flows detected by a static analysis in six Java-based implementations of authentication and cryptographic functions. They report that most of the reported policy violations are false positives, mostly due to conservative handling of exceptions. Our work focuses on dynamic analysis for JavaScript-based implementations, which gives rise to a class of observable secrecy monitors that is not relevant in a static setting. Another empirical study of information flows is by Masri and Podgurski (Masri and Podgurski, 2009). Their work studies how the length of flows (measured as the length of the static dependence chain), the strength of flows (measured based on entropy and correlations), and different kinds of information flows (explicit and observable implicit) relate to each other. Similar to our methodology, Masri and Podgurski target dynamic analysis. Our work differs by addressing different research questions, a different language, and by considering hidden implicit flows.
6. Conclusions
This paper presents an empirical study of information flows in real-world programs. Based on novel metrics to capture the prevalence of explicit, observable implicit, and hidden implicit flows, as well as the costs they involve, we study 56 JavaScript programs that suffer from real-world security problems. Our results show that implicit flows are expensive to track in terms of permissiveness, label creep, and runtime overhead. We find taint tracking to be sufficient for most of the studied security problems, while for some privacy scenarios observable tracking is needed. Our work helps security analysts and analysis developers to better understand the cost-benefit trade-offs of information flow analysis. Furthermore, our findings highlight the need for future research on cost-effective ways to analyze hidden implicit information flows.
Acknowledgments
Parts of this work were supported by the German Federal Ministry of Education and Research and by the Hessian Ministry of Science and the Arts with the National Research Center for Applied Cybersecurity, and by the German Research Foundation within the ConcSys and Perf4JS projects.
Appendix A Security Definitions
The previous section has formally defined the flow counting that is at the heart of our empirical study. We now relate the flow counting to three previously described (Balliu et al., 2017) security conditions: explicit secrecy, which requires the absence of explicit flows; observable secrecy, which requires the absence of both explicit flows and observable implicit flows; and non-interference, which requires the absence of all three kinds of flows, i.e., explicit flows, observable implicit flows, and hidden implicit flows. To describe these security conditions in our formalization, we define an instrumented version of the semantics that, along with counting flows, extracts another program. Intuitively, the extracted program preserves the semantics of the original program execution but exposes only a subset of all flows.
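To make the three kinds of flows concrete, the following is a minimal sketch rather than code from the studied programs. The names `observed`, `sink`, `explicitFlow`, and `implicitFlow` are hypothetical. The same branching function yields an observable implicit flow when the secret-dependent assignment executes and a hidden implicit flow when it is skipped.

```javascript
// Illustrative only: `observed`, `sink`, and both functions are hypothetical names.
const observed = [];
const sink = (value) => { observed.push(value); };

// Explicit flow: the sensitive value is copied directly into the sink.
function explicitFlow(secret) {
  const copy = secret;
  sink(copy);
}

// Implicit flow: the sink value depends on `secret` only through control flow.
function implicitFlow(secret) {
  let flag = false;
  if (secret) {
    flag = true; // executed only when `secret` is truthy
  }
  sink(flag);
}

explicitFlow(42);    // explicit flow: 42 reaches the sink
implicitFlow(true);  // observable implicit flow: the assignment runs
implicitFlow(false); // hidden implicit flow: the assignment is skipped
```

In the last call no statement that touches `flag` inside the branch is executed, yet the value reaching the sink still reveals that `secret` was falsy. A purely dynamic taint tracker sees nothing to propagate in that run.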
To formalize non-interference, we first refer to low-equivalence on environments and heaps. Two environments and heaps are low-equivalent if they are equal on all insensitive values. For example, when considering integrity, the two states are equal on all non-attacker-controlled variables. Dually, for confidentiality, this means that the attacker cannot observe any difference between the two states. Non-interference is defined in terms of low-equivalence of initial environments and heaps. Intuitively, an execution satisfies non-interference iff the same trace can be produced for any indistinguishable starting environment and heap.
Definition 1 (Non-interference).
A program satisfies non-interference for environment and heap , iff whenever , then for all , , where , it holds that for some .
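As an informal illustration of low-equivalence and non-interference, the sketch below compares the traces of a program on one pair of low-equivalent environments. It assumes a simplified model in which an environment is a plain object, a program is a function from an environment to a trace, and heaps are ignored. All names (`lowEquivalent`, `agreesOnPair`, `publicOnly`, `branchLeak`) are hypothetical. Agreement on a single pair is only a necessary condition, not a proof of non-interference.

```javascript
// Two environments are low-equivalent if they agree on every low (insensitive) name.
function lowEquivalent(envA, envB, lowNames) {
  return lowNames.every((name) => envA[name] === envB[name]);
}

// Runs `program` on both environments and reports whether the traces coincide.
function agreesOnPair(program, envA, envB, lowNames) {
  if (!lowEquivalent(envA, envB, lowNames)) {
    throw new Error("inputs must be low-equivalent");
  }
  const traceA = program({ ...envA });
  const traceB = program({ ...envB });
  return JSON.stringify(traceA) === JSON.stringify(traceB);
}

// Trace depends only on the low variable `pub`.
const publicOnly = (env) => [env.pub + 1];

// Trace depends on `secret` through a branch (an implicit flow).
const branchLeak = (env) => {
  let flag = false;
  if (env.secret) flag = true;
  return [flag];
};

const first = { pub: 1, secret: true };
const second = { pub: 1, secret: false };
const publicOnlyAgrees = agreesOnPair(publicOnly, first, second, ["pub"]);
const branchLeakAgrees = agreesOnPair(branchLeak, first, second, ["pub"]);
```

Here `publicOnlyAgrees` is `true` and `branchLeakAgrees` is `false`: the second program produces different traces from indistinguishable starting states.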
Explicit secrecy and observable secrecy are both defined by extracting a simpler program during the execution of a program which eliminates information flows not considered by the security condition: Programs extracted by explicit secrecy contain no control-flow information, whereas programs extracted by observable secrecy discard statements in untaken branches, thus removing leaks through not executed statements in other branches in the program.
The following describes the program extraction formally. We extend each configuration with an extracted statement , written , where and with referring to the set of extracted statements for a given security condition. For each security condition, we define below an extraction function and then use in the execution steps of the instrumented semantics: .
**Explicit secrecy: ** For explicit secrecy, we disregard control-flow-related statements by defining an extraction function . Intuitively, the extracted program discards all control-flow decisions that influenced the current execution and extracts only the straight-line portion of the current execution. As a result, the extracted program no longer contains any implicit flows. The extraction function for explicit secrecy is defined as follows:
[TABLE]
Based on this extraction function, we can define explicit secrecy:
Definition 2.
A program satisfies explicit secrecy for and iff whenever , then is non-interfering for environment and heap .
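The idea of extracting only the straight-line portion of an execution can be sketched as follows. This is not the extraction function from the table above; it is a simplified stand-in over a hypothetical statement representation (`assign`, `sink`, `if`) in which expressions are JavaScript functions over the environment.

```javascript
// Executes `stmts` in `env` and returns the executed assignments and sinks,
// dropping every control-flow construct (hypothetical, simplified AST).
function extractStraightLine(stmts, env, extracted = []) {
  for (const stmt of stmts) {
    switch (stmt.type) {
      case "assign":
        env[stmt.name] = stmt.expr(env);
        extracted.push(stmt);
        break;
      case "sink":
        extracted.push(stmt);
        break;
      case "if":
        // Follow the taken branch, but do not record the branch itself.
        extractStraightLine(stmt.cond(env) ? stmt.then : stmt.else, env, extracted);
        break;
      default:
        throw new Error(`unknown statement type: ${stmt.type}`);
    }
  }
  return extracted;
}

const flagProgram = [
  { type: "assign", name: "flag", expr: () => false },
  {
    type: "if",
    cond: (env) => env.secret,
    then: [{ type: "assign", name: "flag", expr: () => true }],
    else: [],
  },
  { type: "sink", expr: (env) => env.flag },
];

const straightWhenTrue = extractStraightLine(flagProgram, { secret: true });
const straightWhenFalse = extractStraightLine(flagProgram, { secret: false });
```

Neither extracted statement list mentions `secret`, so each extracted program is non-interfering on its own. This matches the intuition that a run of this program contains an implicit flow but no explicit one.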
**Observable secrecy: ** To define observable secrecy we first define evaluation contexts to keep track of where in a partially extracted program the next statement should be placed. The set of evaluation contexts is defined by the following grammar:
[TABLE]
Note that with the exception of , symbolizing a hole in the context, evaluation contexts are a subset of statements. We denote replacing by a statement or context in a context by . Note that if , then .
We then define an extraction function to define observable secrecy:
[TABLE]
where denotes shifting the hole in the context outside of the branch of the surrounding if. Note that in programs not initially containing pop statements, any pop encountered during execution delimits a control-flow construct.
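A rough sketch of the observable-secrecy extraction, under strong simplifying assumptions: instead of evaluation contexts and pop statements, it uses recursion over a hypothetical statement representation, keeps each conditional, and replaces the untaken branch by an empty block. The names `extractTakenBranches` and `guardedProgram` are illustrative only.

```javascript
// Executes `stmts` in `env` and returns a program that keeps conditionals
// but discards the statements of untaken branches (hypothetical AST).
function extractTakenBranches(stmts, env) {
  const extracted = [];
  for (const stmt of stmts) {
    if (stmt.type === "assign") {
      env[stmt.name] = stmt.expr(env);
      extracted.push(stmt);
    } else if (stmt.type === "sink") {
      extracted.push(stmt);
    } else if (stmt.type === "if") {
      const taken = Boolean(stmt.cond(env));
      const body = extractTakenBranches(taken ? stmt.then : stmt.else, env);
      extracted.push({
        type: "if",
        cond: stmt.cond,
        then: taken ? body : [], // untaken branch becomes empty
        else: taken ? [] : body,
      });
    } else {
      throw new Error(`unknown statement type: ${stmt.type}`);
    }
  }
  return extracted;
}

const guardedProgram = [
  { type: "assign", name: "flag", expr: () => false },
  {
    type: "if",
    cond: (env) => env.secret,
    then: [{ type: "assign", name: "flag", expr: () => true }],
    else: [],
  },
  { type: "sink", expr: (env) => env.flag },
];

const keptWhenTrue = extractTakenBranches(guardedProgram, { secret: true });
const keptWhenFalse = extractTakenBranches(guardedProgram, { secret: false });
```

For `secret: false` the extracted conditional has two empty branches, so the extracted program sends `false` to the sink regardless of the secret. The leak through the unexecuted assignment has been removed, which is what distinguishes observable secrecy from non-interference.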
Appendix B Soundness
In this section, we show that if a particular execution results in zero explicit flows, this execution satisfies explicit secrecy. Similarly, if both observable and explicit flow counts are zero, the run satisfies observable secrecy.
Theorem B.1.
If and , then satisfies explicit secrecy for and .
Proofs for the two theorems are provided in Appendix C.
Theorem B.2.
If and , then satisfies observable secrecy for and .
We omit a similar soundness statement for non-interference, as the monitor follows the same approach as a traditional permissive-upgrade-based information flow monitor, under the additional assumption that all required upgrade statements were inserted during the testing phase.
Appendix C Proofs and Additional Definitions
We formally define values and objects as follows. The sets are defined (mutually) inductively:
[TABLE]
The formal definition of joining the labels of values is given by if and if . The helper function is defined as follows: if , and
The remaining rules for the operational semantics are the following:
E-Skip
$$\frac{\ }{\langle \textbf{skip}, \rho, h, t, \kappa \rangle \xrightarrow{} \langle \varepsilon, \rho, h, t, \kappa \rangle}$$
E-AssignField
$$\frac{\begin{array}{c} \llbracket x \rrbracket(\rho,h) = a_{x} \quad a_{x} \in \mathit{Addr} \quad o = h(a_{x}) \quad y \in \text{dom}(o) \quad \llbracket x.y \rrbracket(\rho,h) = v_{xy} \\ \kappa_{xy} = \bigsqcup(h, v_{xy}) \quad \llbracket e \rrbracket(\rho,h) = v^{\kappa} \quad \kappa^{\prime} = \kappa + \Delta(\kappa_{xy}, \kappa, t) \quad o^{\prime} = o[y \mapsto v^{\kappa^{\prime}}] \quad h^{\prime} = h[a_{x} \mapsto o^{\prime}] \end{array}}{\langle x.y = e, \rho, h, t, \kappa \rangle \xrightarrow{} \langle \varepsilon, \rho, h^{\prime}, t, \kappa \rangle}$$
E-While
$$\frac{\ }{\langle \textbf{while}\ e\ \textbf{do}\ c, \rho, h, t, \kappa \rangle \xrightarrow{} \langle \textbf{if}\ (e)\ \{\ c \mathrel{;} \textbf{while}\ e\ \textbf{do}\ c\ \}\ \textbf{else}\ \{\ \varepsilon\ \}, \rho, h, t, \kappa \rangle}$$
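The E-While rule rewrites a loop into a conditional whose then-branch is the loop body followed by the same loop. A small sketch over a hypothetical statement representation (the object shapes and the name `stepWhile` are assumptions, not part of the formalization):

```javascript
// One E-While step: `while e do c` becomes `if (e) { c ; while e do c } else { }`.
function stepWhile(loopStmt) {
  return {
    type: "if",
    cond: loopStmt.cond,
    then: [...loopStmt.body, loopStmt], // body, then the original loop again
    else: [],                           // empty program when the guard is false
  };
}

const countingLoop = {
  type: "while",
  cond: (env) => env.i < 3,
  body: [{ type: "assign", name: "i", expr: (env) => env.i + 1 }],
};

const unrolledOnce = stepWhile(countingLoop);
```

The step itself neither evaluates the guard nor changes the environment, heap, or counters; all flow counting happens when the resulting conditional is executed.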
E-UpgradeH
$$\frac{\llbracket x \rrbracket(\rho,h) = v^{\kappa} \quad \kappa \neq \mathbf{0}}{\langle \textbf{upgrade}(x), \rho, h, t, \kappa \rangle \xrightarrow{} \langle \varepsilon, \rho, h, t, \kappa \rangle}$$
E-Seq
$$\frac{\langle c_{1}, \rho, h, t, \kappa \rangle \xrightarrow{\tau} \langle c_{1}^{\prime}, \rho^{\prime}, h^{\prime}, t^{\prime}, \kappa^{\prime} \rangle}{\langle c_{1} \mathrel{;} c_{2}, \rho, h, t, \kappa \rangle \xrightarrow{\tau} \langle c_{1}^{\prime} \mathrel{;} c_{2}, \rho^{\prime}, h^{\prime}, t^{\prime}, \kappa^{\prime} \rangle}$$
E-SeqEmpty
$$\frac{\ }{\langle \varepsilon \mathrel{;} c, \rho, h, t, \kappa \rangle \xrightarrow{} \langle c, \rho, h, t, \kappa \rangle}$$
Two environments and heaps and are low-equivalent, written iff
, , .
In the interest of brevity, we elide lemmas about standard properties of the evaluation relation in the following proofs.
Proof of Theorem B.1.
We define . We define a safety property on extracted programs as follows: . Moreover, we note that only straight-line programs are extracted for explicit secrecy and such programs trivially preserve low equivalence of environments and heaps.
Additionally, we define the predicate where iff whenever , , and , then , where iff the counter of each value in and is related by to the corresponding counter in and . Formally, iff , , and .
For the induction to succeed we show the stronger statement that whenever , then and .
We prove this by induction on . The reflexive case is trivial. For the transitive case, assume . Per the induction hypothesis we have that and . We show that and by induction on . The main interesting cases are E-Assign, E-AssignField, and E-Sink.
Case E-Assign: We have that . We have that follows from the fact that any sink in is also reachable in , hence the claim follows from the induction hypothesis.
To show , we note that if then and , where is the result of the assigned expression with incremented counters. We show that the label of is still bounded by the corresponding label of in . From the induction hypothesis we have that . If the new label of in is not , then the claim follows trivially. If it is low, it follows from the previous fact that also receives in .
Case E-AssignField is analogous.
Case E-Sink: In this case follows easily from the induction hypothesis as the sink statement does not change the environment and heap. follows trivially for sink statements already reachable in . Since , we need to show that this also holds when reaching . Since the explicit flow count is still [math] in , we have that the label of in is . Since , the label of in is therefore also , as desired.
Clearly whenever , then satisfies per-run non-interference wrt. and since low equivalence between memories is preserved and only low expressions reach sinks without increasing the counter. ∎
In the following proof sketch, we define and overload to be any -tuple of [math]s. The relation is generalized similarly to arbitrary tuples.
Proof of Theorem B.2.
For the induction to go through, we show the stronger property that whenever , , and , then:
- •
, and () and
- •
There exist , , , and such that .
- •
Moreover, , and .
where holds iff all values labeled low in both environments are equal in value and lists of levels and satisfy if there exists a suffix of length of such that all levels are pairwise related by . We write for copies of composed with sequential composition.
We show this by induction on . The reflexive case is trivial. For the transitive case, assume . From part of the induction hypothesis we have that ; from part we obtain for which the corresponding disjunction also holds ; from we have , , and . We show by induction on ; the interesting cases are E-Assign, E-AssignField, E-IfTrue/False, E-Sink, and E-Pop. We refer to the proof obligations as , , and .
Case E-Assign: follows trivially from the semantics. We have , , where is the result of evaluating in and incrementing the counts appropriately.
We proceed by case distinction on . If , then we have that where , where is the result of evaluating in . follows from this case in and the fact that assignments do not modify the label stack.
In the case where , then we have that . Therefore, we have that the counter of is not , therefore and follows trivially.
The other statements follow from the induction hypothesis: If is not reached, then matches the evaluation of for any .
Case E-AssignField: Analogous to E-Assign.
Case E-If: Without loss of generality, we only discuss the case where the then branch is taken. In this case, we first note that , and where ; in particular note that . For note that from it follows that as desired.
We proceed by case distinction on . If and , then we have that . Note that for and we have that .
To show we proceed by case distinction on . If , we have that where . Since , we have that ; we also have that trivially.
In the case where , we have that where is a prefix of and hence, .
Note that since and , we have that as required for this case.
where is the result of evaluating in . follows from this case in and the fact that assignments do not modify the label stack.
In the case where and , we have that then since is not reached and hence replacing it with does not affect the execution. Since , we trivially have that then also . In both cases, the rest of follows trivially.
Case E-Sink: We have that , where . Environments and the heap are unchanged. Moreover, we have that , since the flow counts are assumed to be [math]. follows easily.
For , note the second alternative of the disjunction of leads to a contradiction, since then and this would imply that , violating the assumption that .
We can therefore assume that and . Hence we also have that . Since , we also have that the label of is ; from we have that is also labeled . With this yields that , concluding . follows easily since heaps, environments, and the label stack are not modified by executing a sink statement.
Case E-Pop: By this case we have . follows easily. For we again proceed by case distinction on the disjunction in . In the first case, the conclusion follows easily, since we reach the same state as the execution in . Assume now that and . We proceed by case distinction on this execution reaching the branch surrounding in . We denote this branch by ; WLOG assume that and . Then, we show : Since terminated without reaching , we have that for any , and, since , we reach trivially, concluding the induction. This stronger property then trivially implies non-interference of . ∎
References
- Babel JavaScript compiler. https://babeljs.io. Accessed: 2018-02-08.
- Acar et al. (2014) Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juárez, Arvind Narayanan, and Claudia Díaz. 2014. The Web Never Forgets: Persistent Tracking Mechanisms in the Wild. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, November 3-7, 2014. 674–689.
- Acar et al. (2013) Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. 2013. FP Detective: dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 1129–1140.
- Austin and Flanagan (2009) Thomas H. Austin and Cormac Flanagan. 2009. Efficient purely-dynamic information flow analysis. In Workshop on Programming Languages and Analysis for Security (PLAS 2009). ACM, 113–124.
- Austin and Flanagan (2010) Thomas H. Austin and Cormac Flanagan. 2010. Permissive dynamic information flow analysis. In Workshop on Programming Languages and Analysis for Security (PLAS 2010). ACM, 3.
- Austin and Flanagan (2012) Thomas H. Austin and Cormac Flanagan. 2012. Multiple Facets for Dynamic Information Flow. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’12). 165–178.
- Balliu et al. (2017) Musard Balliu, Daniel Schoepe, and Andrei Sabelfeld. 2017. We Are Family: Relating Information-Flow Trackers. In Computer Security - ESORICS 2017 - 22nd European Symposium on Research in Computer Security, Oslo, Norway, September 11-15, 2017, Proceedings, Part I. 124–145.
