A Review on Flight Delay Prediction
Alice Sternberg, Jorge Soares, Diego Carvalho, Eduardo Ogasawara

TL;DR
This paper provides a comprehensive review of flight delay prediction models, emphasizing the evolution of methods, especially machine learning, and offering a taxonomy and timeline of key research developments in the field.
Contribution
It introduces a taxonomy for flight delay prediction approaches and presents a timeline highlighting research trends and the increasing use of machine learning techniques.
Findings
Machine learning methods are increasingly used in flight delay prediction.
The review categorizes approaches based on scope, data, and computational methods.
A timeline of significant research illustrates evolving trends in the field.
A Review on Flight Delay Prediction
Alice Sternberg
CEFET/RJ
Jorge Soares
CEFET/RJ
Diego Carvalho
CEFET/RJ
Eduardo Ogasawara
CEFET/RJ
Abstract
Flight delays hurt airlines, airports, and passengers. Their prediction is crucial during the decision-making process for all players of commercial aviation. Moreover, the development of accurate prediction models for flight delays became cumbersome due to the complexity of air transportation system, the number of methods for prediction, and the deluge of flight data. In this context, this paper presents a thorough literature review of approaches used to build flight delay prediction models from the Data Science perspective. We propose a taxonomy and summarize the initiatives used to address the flight delay prediction problem, according to scope, data, and computational methods, giving particular attention to an increased usage of machine learning methods. Besides, we also present a timeline of significant works that depicts relationships between flight delay prediction problems and research trends to address them.
The published version of this paper is made available at https://doi.org/10.1080/01441647.2020.1861123.
Please cite as:
L. Carvalho, A. Sternberg, L. Maia Gonçalves, A. Beatriz Cruz, J.A. Soares, D. Brandão, D. Carvalho, and E. Ogasawara, 2020, On the relevance of data science for flight delay research: a systematic review, Transport Reviews
Keywords: Flight delays, Commercial aviation, Brazilian system
1 Introduction
Delay is one of the most remembered performance indicators of any transportation system. Notably, commercial aviation players understand delay as the period by which a flight is late or postponed. Thus, a delay may be represented by the difference between scheduled and actual times of departure or arrival of a plane [117]. National regulatory authorities have a multitude of indicators related to tolerance thresholds for flight delays. Indeed, flight delay is an essential subject in the context of air transportation systems. In 2013, 36% of flights were delayed by more than five minutes in Europe, 31.1% of flights were delayed by more than 15 minutes in the United States, and 16.3% of flights were canceled or suffered delays greater than 30 minutes in Brazil [45, 5]. These figures show how relevant this indicator is and that delays affect airline networks regardless of their scale.
Flight delays have negative impacts, mainly economic, for passengers, airlines, and airports. Given the uncertainty of their occurrence, passengers usually plan to travel many hours earlier for their appointments, increasing their trip costs, to ensure their arrival on time [11, 55]. On the other hand, airlines suffer penalties, fines and additional operation costs, such as crew and aircraft retention in airports [25, 112, 51, 62]. Furthermore, from the sustainability point of view, delays may also cause environmental damage by increasing fuel consumption and gas emissions [95, 105, 102, 75, 8, 125].
Delays also jeopardize airlines’ marketing strategies, since carriers rely on customers’ loyalty to support their frequent-flyer programs and the consumer’s choice is also affected by reliable performance. There is an identified relationship between levels of delays and fares, aircraft sizes, flight frequency and complaints about airline service [39, 83, 21, 93, 133]. The estimation of flight delays can improve the tactical and operational decisions of airport and airline managers and warn passengers so that they can rearrange their plans [40].
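As a minimal sketch of this definition, the delay of a single flight can be computed as the difference between actual and scheduled times and compared against a tolerance threshold. The function names and sample timestamps are hypothetical; the 15-minute default mirrors the United States threshold cited above, and other regulators use other values.

```python
from datetime import datetime

def delay_minutes(scheduled: datetime, actual: datetime) -> float:
    """Delay as actual minus scheduled time, in minutes (negative means early)."""
    return (actual - scheduled).total_seconds() / 60.0

def is_delayed(scheduled: datetime, actual: datetime, threshold_min: float = 15.0) -> bool:
    """True when the delay exceeds the chosen tolerance threshold (an assumption)."""
    return delay_minutes(scheduled, actual) > threshold_min

# Hypothetical departure: scheduled 10:00, actual 10:22.
scheduled = datetime(2013, 5, 1, 10, 0)
actual = datetime(2013, 5, 1, 10, 22)
print(delay_minutes(scheduled, actual))                 # 22.0
print(is_delayed(scheduled, actual))                    # True
print(is_delayed(scheduled, actual, threshold_min=30))  # False
```

The same flight counts as delayed under a 5- or 15-minute threshold but not under a 30-minute one, which is why delay rates from different regions are not directly comparable.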
To better understand the entire flight ecosystems, vast volumes of data from commercial aviation are collected every moment and stored in databases. Submerged in this massive amount of data produced by sensors and IoT [86, 29, 90], analysts and data scientists are intensifying their computational and data management skills to extract useful information from each datum. In this context, the procedure of comprehending the domain, managing data and applying a model is known as Data Science, a trend in solving challenging problems related to Big Data.
Under this data deluge scenario, this paper contributes by presenting an analysis of the available literature on flight delay prediction from a Data Science perspective. It seeks to summarize the most researched trends in this field, describing how this problem is addressed and comparing methods that have been used to build prediction models. This becomes more relevant as we observe an increasing presence of machine learning methods to model flight delay predictions. This analysis is conducted by establishing a flight delay research taxonomy, which organizes approaches according to the type of problem, scope, data issues, and computational methods. The paper also contributes by presenting a timeline of major works grouped by the kind of flight delay prediction problem.
Besides this introduction, the rest of this paper is structured as follows. Section 2 introduces the flight delay scenario, describing a typical operation of a commercial flight, kinds of delays and their impacts. It also structures three different ways of treating the prediction problem. In Section 3, a taxonomic analysis of the prediction is presented, showing the most researched topics, the scope of application, data and methods that authors are using to predict flight delays. Section 4 discusses the main results based on a timeline of publications grouped by the types of problems and their intersections. Finally, Section 5 concludes our analysis by presenting major highlights and trends about the delay prediction problem.
2 The flight delay scenario
Commercial aviation is a complex distributed transportation system. It deals with valuable resources, demand fluctuations, and a sophisticated origin-destination matrix that need orchestration to provide smooth and safe operations. Furthermore, individual passengers follow their itineraries while airlines plan various schedules for aircraft, pilots and flight attendants. Figure 1 illustrates a typical operation of a commercial flight. Stages can take place at terminal boundaries, airports, runways, and airspace, being susceptible to different kinds of delays. Some examples include mechanical problems, weather conditions, ground delays, air traffic control, runway queues and capacity constraints [103, 63, 3].
This scheme is repeated several times throughout the day for each flight in the system. Pilots, flight attendants and aircraft may have different schedules due to legal rests, duties, and maintenance plans for airplanes. So, any disruption in the system can impact the subsequent flights of the same airline [2]. Moreover, disturbances may cause congestion in the airspace or at other airports, creating queues and delaying some flights from other carriers [106, 123]. In this way, the prediction of flight delays is an essential subject for airlines, airports, Air Navigation Service Providers (ANSP), and network managers, like FAA [52] and Eurocontrol [46].
The flight delay prediction problem can be treated from different points of view: (i) delay propagation, (ii) root delay, and (iii) cancellation. In delay propagation, one studies how delay propagates through the network of the transportation system. On the other hand, considering that new problems may eventually happen, it is also important to predict further delays and understand their causes. Such occurrences are named in this paper as the root delay problem. Finally, under specific situations, delays can lead to cancellations, forcing airlines and passengers to reschedule their itineraries. So, researchers focused on cancellation analysis try to figure out which conditions lead to cancellations. Moreover, they explore the airlines’ decision-making process for choosing the flights to be canceled.
3 Taxonomy
The main problems related to flight delay prediction are identified and organized in a taxonomy. It includes scopes, models, and ways of handling the flight delay prediction problem. It considers flight domain features, such as problem and scope, and Data Science perspectives, such as data and methods. Figure 2 depicts the entire taxonomy, while the next subsections describe each component of the taxonomy and related work.
Regarding the available literature on flight delay prediction, we have conducted a systematic mapping study. The search string (“airport delay” “flight delay”) (“predict” “forecast” “propagate”) was used to query Scopus in October 2017. The query returned 310 references. Additionally, 29 works were added using snowballing search.
We have selected 134 works to build this review due to their relevance and direct link with the flight delay prediction problem. The main inclusion criteria are that the word “delay” appears in the abstract and that the paper has at least one citation at Google Scholar per year before 2016. This means that, to be included, a paper from 2015 must have at least one citation, and so on.
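One possible reading of the citation criterion is that a paper published in year y needs at least 2016 - y citations. The sketch below encodes that reading; the interpretation, the function names, and the sample values are assumptions made for illustration, not the authors' screening procedure.

```python
def min_citations(year: int, reference_year: int = 2016) -> int:
    """Citations required under the assumed rule of one citation per year before 2016."""
    return max(0, reference_year - year)

def is_included(abstract: str, year: int, citations: int) -> bool:
    """Apply both criteria: 'delay' in the abstract and enough citations."""
    return "delay" in abstract.lower() and citations >= min_citations(year)

# Hypothetical candidate papers.
print(is_included("A model to predict flight delay at hubs", 2015, 1))  # True
print(is_included("A model to predict flight delay at hubs", 2012, 3))  # False, needs 4
print(is_included("Airport capacity planning", 2015, 10))               # False, no 'delay'
```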
From this study, we were able to present a taxonomy that drives the organization of the following sections.
3.1 Problem
Problem is the core feature in domain taxonomy. As seen in Section 2, there are three major concerns regarding the flight delay prediction problem: delay propagation, root delay and cancellation. Depending on the emphasis of the research, authors select one of these lines to develop their models.
3.1.1 Root delay and cancellation
New delays (root delays) may eventually happen, and they impair the performance of the transportation network. Researchers create prediction models to tackle root delay, predicting when and where a delay will occur and what its reasons and sources are. This includes models that seek to estimate the number of minutes, probability or level of delay for a specific flight, airline or airport.
A relevant number of works focused on predicting and estimating delay duration [102]. Some approaches considered probabilistic models and innovation distribution [90, 112], whereas others find conditions for the occurrence of a root delay, such as passenger demand, fares, flight frequency, aircraft size, and taxi-out time [11, 131].
Particular circumstances, such as weather conditions, acts of God, or aircraft problems, may lead airlines to cancel flights. Besides, airlines may directly cancel a flight when factors like seat occupancy or cost savings are taken into consideration [80, 122].
3.1.2 Delay propagation
In delay propagation, the primary objective is to understand how delay propagates through airlines and airports based on the assumption that an initial delay has already occurred in the transportation system. A particular scenario happens when delays spread to other flights of the same airline as chain reactions [24, 16, 2, 118]. In these situations, it is important to measure how stable and reliable carriers can be to recover from delay propagation [119, 41]. Also, a delay may continue to propagate due to the scheduling of critical resources or retentions in other airports [59].
When the scheduled time for take-off or landing is not fulfilled, flights need new slots that may be unavailable. In this scenario, it is important to understand the effects that a root delay in a flight may produce at both departure and arrival airports [123, 100, 61]. Such a phenomenon may increase the number of flights in some periods, generating capacity problems and queues.
3.2 Scope
Delays can be induced by different sources and affect airports, airlines, en route airspace or an ensemble of them. For analysis purposes, one may assume a simplified system where only one of these actors or any combination of them is considered. It should be noted that any scope of application can be combined with any problem mentioned in Section 3.1.
Some works focused on airports to predict delays for all departures, considering all airlines and en route airspace without distinction [106, 102]. Airports are also the focus when the objective is to investigate their efficiency based on delays of all carriers [94, 72, 100, 71]. On the other hand, only airlines are considered when comparing the performance of two airlines under the same conditions [3].
An ensemble of airport and en route airspace was studied to understand the relationship between congestion and delays [63, 88]. Others considered airports and airlines together to evaluate capacity problems and airlines’ decisions [112]. There are many possibilities for combining scopes. This becomes important when studying the dynamics of air transportation systems, mainly when targeting root delay.
3.3 Data
Three fundamental questions about data are: Where to find flight data? Which attributes should be considered? Is it possible to handle each datum to obtain better results? To answer these questions, the data problem is divided into three classes: (i) data sources, (ii) dimensions, and (iii) data management.
3.3.1 Data Sources
The datasets from the air transportation system are mainly related to airlines, airports, or an ensemble of both. Since airlines and airports commonly do not share their databases with the entire community, such databases are often used by collaborators of those institutions. Ensemble datasets may include carriers, airports, and additional information provided by governmental agencies, regulatory authorities, and service providers. Table 1 displays the types of datasets by region. It presents the number of publications and the top three most cited papers in each category. Governmental agencies usually provide public access to their databases with different granularity. It is noticed that data from the United States Department of Transportation [44], primarily through the Federal Aviation Administration [52] and the Bureau of Transportation Statistics databases [26], are widely used to obtain information about flights. The Eurocontrol [46] database is provided by an intergovernmental organization in Europe. This dataset is also used intensively in flight delay studies [103].
Other related datasets, such as weather, may be obtained from governmental databases or service providers. This includes, for example, the National Oceanic and Atmospheric Administration of the United States [92]. In fact, authors may use more than one source to develop their models. Datasets from the United States Department of Transportation [44], the National Oceanic and Atmospheric Administration [92], and the Weather Company [113] are commonly used to build delay prediction models.
Additionally, some researchers [130, 131] create synthetic datasets to evaluate their models instead of using real data. For example, Zou et al. [131] developed a market scenario, considering airport capacity, links, frequency, and characteristics of flights and passenger demand.
3.3.2 Dimensions
Considering the main public datasets and the papers analyzed, we have organized the most commonly used attributes into seven classes depicted in the data model of Figure 3. They abstract the main input attributes for delay prediction models. Beyond scheduled and actual times of departure and arrival, several characteristics may be considered depending on the focus of research.
Spatial dimension is related to the positions taken by the aircraft, such as departure and arrival airports, their cities, regions, and countries [61, 102]. The temporal dimension is often used to capture seasonality or periodic patterns of data. These elements contain both date (season, month, and day of the week) and time (the day or time of the day) characteristics [90, 1, 112]. The weather dimension expresses external and environmental conditions at a particular moment [50]. It may represent specific features, such as ceiling and visibility [103], which define, for example, whether take-off or landing is going to happen under visual or instrument conditions. Additionally, the en route airspace weather situation (known as convective weather) and the airport weather situation (known as surface weather) contain several instantaneous parameters [63].
Planning describes what airlines, airports, and air traffic controllers intend to do with critical resources involved in their operations. This dimension includes (i) airline schedules, (ii) airport schedules and (iii) flight plans. Airline schedules define all origin and destination points, their frequency and sequence, and aircraft and crew allocations for each flight [24, 16, 119, 3, 41]. Airport schedules indicate the time each flight takes off and lands, while flight plans indicate all en route parameters, such as distance, route, speed, and altitude [59].
Features represent characteristics of airlines, airports or aircraft. Airline status may indicate if a carrier is a major or an affiliate one or if it is a traditional hub-and-spoke or a low-cost point-to-point carrier. Aircraft characteristics show their size, their number of seats and occupancy, which may be a constraint to some operations because they affect market decisions. Finally, airport infrastructure may represent the number of runways, gates and service providers in an airport facility [94, 109, 122].
The state of the system indicates in which conditions airlines, airports or en route airspace are operating at a specific moment. Some examples correspond to prior levels of delay or airport closures [130]. The information about the state of the system is used to predict its behavior. Finally, operations are related to capacity and demand of airports and en route airspace. When demand exceeds capacity, a congestion scenario is formed, which enables the occurrence of delays [88].
3.3.3 Data Management
Since the use of databases to store massive amounts of data has been increasing over the last years, data management techniques are becoming more and more crucial to provide convenient and efficient query processing. Data management tasks comprise the design of a database structure to enable data integration from different sources, elimination of inconsistencies, and data transformation. The development of a data warehouse supported by online analytical processing (OLAP) and data management techniques may be useful for this purpose. As mentioned in Section 3.3.1, multiple sources of data may be used. Thus, data warehouses combined with Extract, Transform and Load (ETL) procedures are commonly used to link the datasets of different sources [126].
There are many data management preprocessing procedures that can be applied to flight delay prediction datasets. They include data cleaning, feature selection, data transformation, and clustering. One of the main tasks of data cleaning is outlier removal. Extreme conditions may result in outliers that are not interesting if one is concerned about regular operations [112]. Feature selection is the process of identifying attributes that are less correlated. Correlated and irrelevant attributes may cause model over-fitting or decrease prediction performance [118]. These preprocessing procedures are essential since the better the preprocessing of the input data, the better the prediction models developed from it may be.
Data transformation is also an important activity to empower prediction models. Some examples of transformations include normalization and discretization. Normalization reduces the range of possible values to a particular interval, such as -1 to 1 or 0 to 1. It gives equal strength to different variables and lets machine learning methods identify which are the most relevant ones. Discretization consists of replacing numerical values by representative labels. It includes the transformation of time periods into bins of a fixed duration [11, 72], and the binning of values to cope with limitations in computational packages [24, 123] or to better train prediction models [16], especially when using machine learning models.
Clustering means grouping elements of the dataset in a way that similar observations stay together in the same group and dissimilar items stay in different groups. Many works apply clustering techniques, such as k-means or agglomerative hierarchical clustering, to support preliminary steps for further prediction models [102].
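The reviewed works do not share a single outlier rule, so the following sketch assumes the common interquartile-range (IQR) rule purely as an illustration of outlier removal. The function name, the factor k = 1.5, and the sample delays are hypothetical.

```python
import statistics

def remove_outliers_iqr(delays, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(delays, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [d for d in delays if low <= d <= high]

# Hypothetical departure delays in minutes; 300 comes from an extreme event.
delays_min = [5, 7, 8, 10, 12, 13, 15, 300]
print(remove_outliers_iqr(delays_min))  # [5, 7, 8, 10, 12, 13, 15]
```

Whether such extreme values should be dropped depends on the study: they are noise for a model of regular operations but may be the main signal in a study of severe disruptions.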
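The two transformations described above can be sketched in a few lines. This is a minimal illustration assuming min-max scaling for normalization and fixed-width bins for discretization; the function names, the 15-minute bin width, and the sample values are hypothetical.

```python
def min_max_normalize(values, low=0.0, high=1.0):
    """Rescale values linearly so that they span the interval [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [low for _ in values]  # constant input: map everything to low
    return [low + (v - v_min) / (v_max - v_min) * (high - low) for v in values]

def discretize_delay(minutes, bin_width=15):
    """Replace a delay in minutes by the label of its fixed-width bin."""
    start = int(minutes // bin_width) * bin_width
    return f"[{start}, {start + bin_width})"

print(min_max_normalize([0, 30, 60]))             # [0.0, 0.5, 1.0]
print(min_max_normalize([0, 30, 60], -1.0, 1.0))  # [-1.0, 0.0, 1.0]
print(discretize_delay(22))                       # [15, 30)
```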
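As an illustration of the clustering idea, the sketch below runs a basic k-means on one-dimensional data. It is a simplified version with fixed initial centers and a fixed number of iterations; the function name and the sample average delays are hypothetical and do not come from any reviewed paper.

```python
def kmeans_1d(values, initial_centers, iterations=10):
    """Basic k-means on scalars: assign to nearest center, then recompute means."""
    centers = list(initial_centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

# Hypothetical average delays (minutes) of six airports: three calm, three congested.
avg_delay = [2, 4, 6, 58, 60, 62]
print(kmeans_1d(avg_delay, initial_centers=[0, 100]))  # [4.0, 60.0]
```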
3.4 Method
The flight delay prediction problem may be modeled in many ways, depending on the objectives of the research. Methods were divided into five groups, according to Figure 4. The numbers next to each category represent the number of related papers.
3.4.1 Statistical Analysis
Statistical analysis usually encompasses the use of regression models, correlation analysis, econometric models, parametric tests, non-parametric tests, and multivariate analysis (MVA). When it comes to regression models, both delay multiplier and recursive models can help airlines to understand delay propagation effects through the network and to estimate the costs of delays [16, 115, 84, 124, 127].
Many econometric models are also built to evaluate the efficiency of flight systems, such as the analysis of the investments made by a governmental agency [88], or to evaluate the equilibrium point considering the relationship between delays and passenger demand, fares, frequency and size of the aircraft [131]. Xiong et al. [122] built an econometric model based on pre-existing delays, potential delay savings, distance, characteristics of the destination airport and airline, frequency, aircraft size, occupancy rate and fare to understand which reasons lead airlines to cancel their flights. Qin et al. [101] studied the periodicity of flight delay rate, whereas Mofokeng et al. [87] studied the impact of aircraft turnaround time during maintenance check. Finally, Hao et al. [61] built a model to quantify how delays originating in New York are propagated to other airports.
Some works focus on statistical inference. Pathomsiri et al. [94] used a non-parametric function to evaluate the efficiency of airports of the United States regarding delays. Reynolds et al. [103] computed the correlation between levels of delays and capacities of the European airports. They also suggested different approaches to deal with the congestion problem, describing their advantages and disadvantages. Finally, Abdel-Aty et al. [1] calculated the daily average of delays and detected correlations to understand the principal causes of delays at Orlando International Airport.
3.4.2 Probabilistic Models
Probabilistic Models encompass analysis tools that estimate the probability of an event based on historical data. Tu et al. [112] developed a probabilistic model based on expectation-maximization combined with genetic algorithms to predict the distribution of departure delay at Denver International Airport.
Boswell et al. [24] expressed delay classes by a probability mass function and used a transition matrix to verify delay propagation to subsequent flights. They made a cancellation analysis computing the conditional probability of canceling a flight given that its previous flight was delayed. Mueller et al. [90] modeled departure, en route and arrival delays using density functions. The authors verified that the Normal distribution fitted departure delays better, while en route and arrival delays were better described by the Poisson distribution. Concerned about the total duration of a root delay, Wong et al. [118] studied delay propagation through a survival model.
Evans et al. [49] built a theoretical routing network that integrated a flight routing and scheduling model. Kotegawa et al. [74] built a series of algorithms that forecast restructuring of the US commercial airline network to reduce both flight delay and total delay. Pfeil et al. [98] developed probabilistic forecasts of whether or not a terminal area route will be blocked based on raw convective weather forecasts. Finally, Zhong et al. [129] built Monte Carlo simulations to estimate airports’ runway capacity.
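The idea of a transition matrix between delay classes can be illustrated by estimating conditional probabilities from pairs of consecutive flights. This is a generic sketch under stated assumptions, not the model of Boswell et al.: the class labels, the function name, and the observations are hypothetical.

```python
from collections import Counter, defaultdict

def transition_matrix(pairs):
    """Estimate P(next class | previous class) from (previous, next) observations."""
    counts = defaultdict(Counter)
    for prev_cls, next_cls in pairs:
        counts[prev_cls][next_cls] += 1
    return {
        prev_cls: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for prev_cls, row in counts.items()
    }

# Hypothetical (previous flight, next flight) outcomes for the same aircraft.
observed = [
    ("on_time", "on_time"), ("on_time", "on_time"), ("on_time", "late"),
    ("late", "late"), ("late", "late"), ("late", "cancelled"), ("late", "on_time"),
]
matrix = transition_matrix(observed)
print(matrix["late"]["cancelled"])  # 0.25
```

Here the entry `matrix["late"]["cancelled"]` plays the role of the conditional probability of a cancellation given that the previous flight was delayed.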
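To illustrate how a Monte Carlo simulation can estimate runway capacity, the toy sketch below draws random separation times between consecutive landings and counts how many fit in one hour. It is not the model of Zhong et al.; the uniform separation distribution, its bounds, and all names are assumptions chosen only to keep the example short.

```python
import random

def simulate_hourly_capacity(min_sep_s=60.0, max_sep_s=120.0, runs=1000, seed=0):
    """Average number of landings per hour over many simulated hours."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    totals = []
    for _ in range(runs):
        elapsed, landings = 0.0, 0
        while True:
            # Assumed: separations are uniform between the two bounds.
            elapsed += rng.uniform(min_sep_s, max_sep_s)
            if elapsed > 3600.0:
                break
            landings += 1
        totals.append(landings)
    return sum(totals) / runs

print(simulate_hourly_capacity())
```

A more realistic study would replace the uniform draw with separations that depend on aircraft mix and weather, which is where the simulation approach pays off over a closed-form estimate.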
3.4.3 Network Representation
Network representation encompasses the study of flight systems according to graph theory. Abdelghany et al. [2] built directed acyclic graphs to model the schedule of an airline (including flight times and resource availability) to detect disruptions and their impacts on the rest of the network. They used the classical shortest path algorithm to evaluate propagation effects.
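A simplified way to see how a directed acyclic graph supports propagation analysis is to visit flights in topological order and let each connection absorb part of the upstream delay through its slack. The sketch below uses this slack-absorption rule rather than the shortest path formulation of Abdelghany et al.; the rule, the additive treatment of root delays, and the flight names are assumptions.

```python
from graphlib import TopologicalSorter

def propagate_delays(connections, root_delays):
    """connections: {flight: {upstream_flight: slack_minutes}} describing a DAG."""
    graph = {flight: set(ups) for flight, ups in connections.items()}
    delay = {}
    for flight in TopologicalSorter(graph).static_order():
        # Delay inherited from each upstream flight is reduced by the slack.
        inherited = max(
            (max(0, delay[up] - slack)
             for up, slack in connections.get(flight, {}).items()),
            default=0,
        )
        # Assumed: a flight's own root delay adds to the inherited delay.
        delay[flight] = inherited + root_delays.get(flight, 0)
    return delay

# Hypothetical schedule: F1 feeds F2 and F4, F2 feeds F3, F3 feeds F4.
connections = {"F2": {"F1": 20}, "F3": {"F2": 10}, "F4": {"F1": 50, "F3": 5}}
print(propagate_delays(connections, {"F1": 45}))
# {'F1': 45, 'F2': 25, 'F3': 15, 'F4': 10}
```

The 45-minute root delay on F1 shrinks at every connection, which shows how schedule slack limits the reach of a disruption.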
Ahmadbeygi et al. [3] built propagation trees to compare two different airlines, one operating in a conventional hub-and-spoke scheme and the other in a low-cost point-to-point system. Xu et al. [123] and Wu et al. [120] built a Bayesian network to model delay propagation. Baspinar [14] built a network-epidemic process using historical flight-track data of Europe to create a novel delay propagation model.
3.4.4 Operational Research
Operational Research includes advanced analytical methods (such as optimization, simulations, and queueing theory) to help key players make better decisions. Simulations may analyze airport capacity data, considering departure and arrival delays under different weather conditions [106, 63]. They may also evaluate the cost of each delayed flight of an airline schedule [109]. Moreover, simulations through queuing models were applied by Wieland [117] to predict root delay, by Kim and Hansen [72] to study the effects of capacity and demand on delay levels at the airports of the New York area, and by Pyrgiotis et al. [100] to study delay propagation between some airports.
Other simulations were done to analyze delay propagation concerning schedule stability [41] and reliability [119]. Through simulations, different scenarios were commonly explored, such as reliability or flexibility of airports under external conditions. Hansen et al. [59] considered the congestion problem and designed a simple deterministic queuing model to analyze propagation effects for subsequent flights of an airline and at Los Angeles International Airport.
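A deterministic queuing model of the kind mentioned above can be reduced to one recurrence: the queue at the end of a period equals the previous queue plus demand minus capacity, floored at zero. The sketch below is a generic illustration of that recurrence, not the model of Hansen et al.; the function name and the hourly figures are hypothetical.

```python
def deterministic_queue(demand, capacity):
    """Queue length after each period: max(0, previous queue + demand - capacity)."""
    queue, history = 0, []
    for d, c in zip(demand, capacity):
        queue = max(0, queue + d - c)
        history.append(queue)
    return history

# Hypothetical hourly arrivals against a constant capacity of 10 flights per hour.
demand = [8, 12, 15, 9, 5]
capacity = [10, 10, 10, 10, 10]
print(deterministic_queue(demand, capacity))  # [0, 2, 7, 6, 1]
```

The queue keeps delaying flights in hours four and five even though demand is already below capacity, which is the propagation effect such models are meant to expose.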
3.4.5 Machine Learning
Machine learning is the field of research that explores the development of algorithms that can learn from data and provide predictions based on it. Works that study flight systems are increasingly using machine learning methods. The methods commonly used include k-Nearest Neighbor, neural networks, SVM, fuzzy logic, and random forests. They were mainly used for classification and prediction.
Rebollo et al. [102] applied random forests to predict root delay. They compared their approach with regression models to predict root delay in airports of the United States considering time horizons of 2, 4, 6 and 24 hours. Their test errors grew as the forecast horizon increased.
Khanmohammadi et al. [69] created an adaptive network based on fuzzy inference system to predict root delay. The predictions were used as an input for a fuzzy decision-making method to sequence arrivals at JFK International Airport in New York.
Balakrishna et al. [10, 11] used a reinforcement learning algorithm to predict taxi-out delays. The problem was modeled through a Markov decision process and solved by a machine learning algorithm. When running their model 15 minutes before the scheduled time of departure, the authors achieved good performance at JFK International Airport in New York and Tampa Bay International Airport.
Lu et al. [130] built a recommendation system to forecast delays at some airports due to propagation effects. The prediction was based on the k-Nearest Neighbor algorithm and used historical data to recognize similar situations in the past. The authors noted fast response time and easy, logical comprehension as the main advantages of their method.
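The k-Nearest Neighbor idea of recognizing similar past situations can be sketched as follows: find the k historical records closest to the current situation and average their observed delays. This is a generic illustration, not the system of Lu et al.; the two features (flights queued, share of flights already delayed) and all values are hypothetical.

```python
import math

def knn_predict(history, query, k=3):
    """history: list of (feature_vector, delay_minutes); average the k nearest delays."""
    nearest = sorted(history, key=lambda item: math.dist(item[0], query))[:k]
    return sum(delay for _, delay in nearest) / len(nearest)

# Hypothetical past situations: (flights queued, share already delayed) -> delay.
history = [
    ((10, 0.90), 40),
    ((12, 0.80), 30),
    ((3, 0.20), 5),
    ((2, 0.10), 0),
    ((11, 0.85), 35),
]
print(knn_predict(history, query=(11, 0.9), k=3))  # 35.0
```

Because the distance mixes features on different scales, a real application would normalize them first; otherwise the queue length dominates the comparison.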
4 Results and discussion
Since flight delays cause economic consequences to passengers and airlines, recognizing them through prediction may improve marketing decisions. Due to that, several forecast models have been built over the last twenty years. These models have sought to understand how delays propagate through the network of flights or airports, to predict root delay in the system or to comprehend the cancellation process. Beyond these three points of view for treating the flight delay prediction problem, models could also differ by their scope of application, data issues and methods.
The number of papers has increased in the late 2000s since 87.5% of the works had been published between 2007 and 2017. Regarding only the documents considered in this analysis, Figure 5.a displays the number of publications grouped by methods. It can be observed a significant growth in machine learning [6] and data mining [17, 77] in the last decade. Also, Figure 6 depicts the complete timeline of papers, showing most cited authors per period and categories of methods. Pondering the way for tackling the delay problem, it was seen a balance between the number of papers that consider delay propagation and root delay, while few works deemed sole the cancellation analysis. Also, Figure 5.b indicates the foremost journals in which flight delay material was published.
From Figures 6 and 7, it is possible to observe the leading authors in the field. Figure 7 displays the main collaboration graph from authors in our systematic review that had three or more publications. The radius of each vertex indicates the number of papers published by each author, whereas the strength of the edge indicates the degree of collaboration among the pair of authors. Some authors do not contain connected edges, meaning that none of their collaborators achieved three publications in our review.
According to data perspective, we divided our analysis into three parts: data sources, dimensions and data management. From our review analysis, the adoption of data sources depends mostly on the country or region where the study has been taken place. For example, in China, most works were based on airport data, while in the United States the primary source was The United States Department of Transportation [44].
Dimensions were not directly related to the type of problem, but to the scope of application. This characteristic is notable in this case. Attributes such as weather, capacity, and demand were characteristics of airport or en route airspace scopes. On the other hand, airlines schedules indicated scopes that considered airlines elements. It was also observed several ensembles of different dimensions, showing that prediction models may be improved through the selection of different attributes.
Data management was not specific to any problem or scope of application, and its use is steadily growing. In fact, it is present in most of the machine learning models adopted, primarily through data transformation. Most of the probabilistic models also applied outlier removal and data transformation techniques. A small percentage of the statistical analysis, network representation, and operational research methods applied general data management techniques as well.
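As an illustration of the two data management steps named above, the sketch below removes outliers and then applies a data transformation. The specific techniques are assumptions chosen for the example (an interquartile-range rule and min-max scaling); the reviewed papers use a variety of methods, and the delay values are invented.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max_scale(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: map everything to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical arrival delays in minutes, not data from the review.
delays = [5, 7, 8, 10, 12, 15, 300]
cleaned = remove_outliers_iqr(delays)
scaled = min_max_scale(cleaned)
print(cleaned)
print(scaled)
```

The order matters in this sketch: scaling before removing the 300-minute value would compress all other delays into a narrow band near zero.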
Regarding the methods used to develop the prediction models, statistical analysis and operational research were the most applied in the past. These approaches were evenly spread across the three ways of treating the prediction problem. The same balance was also observed for probabilistic models. On the other hand, network representation was mostly employed for delay propagation.
It is worth mentioning that machine learning approaches experienced notable growth in the late 2000s, especially for root delay. Machine learning and data management are positively correlated: the more machine learning is used, the more data management is required. This is reinforced by a trend in which extensive data is collected from sensors and IoT devices [68, 42, 97, 122, 128]. The relationship can be seen in Figure 8, which presents the word cloud built from papers published between 2015 and 2017 related to flight delays and machine learning. Terms such as algorithm [12], big data [38, 33], data model [37], learn [57], and train-test [64] are becoming more frequent, and this terminology is likely to remain a trend in the coming years.
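A word cloud such as the one in Figure 8 is driven by term frequencies. The sketch below shows one simple way to count them, under stated assumptions: the tokenizer (lowercase alphabetic runs), the stopword list, and the sample texts are all hypothetical and are not taken from the review's corpus or tooling.

```python
import re
from collections import Counter


def term_frequencies(texts, stopwords=frozenset()):
    """Count how often each term appears across a collection of texts."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: lowercase runs of ASCII letters.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Hypothetical sample texts, not abstracts from the reviewed papers.
abstracts = [
    "Big data and machine learning for flight delay prediction",
    "A learning algorithm trained on flight data",
]
freq = term_frequencies(abstracts, stopwords={"a", "and", "for", "on"})
print(freq.most_common(3))
```

The resulting counts would then set the font size of each term in the rendered cloud.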
5 Conclusion
Flight delays are an important subject in the literature due to their economic and environmental impacts. They may increase costs to customers and operational costs to airlines. Apart from outcomes directly related to passengers, delay prediction is crucial during the decision-making process for every player in the air transportation system.
In this context, researchers have created flight delay prediction models over recent years, and this work contributes an analysis of these models from a Data Science perspective. We developed a taxonomy scheme and classified the models with respect to their detailed components.
The taxonomy includes two main branches: domain and Data Science. The former categorizes the problem (flight delay prediction) and the scope. The latter groups methods and data handling. The flight delay prediction problem is classified into two main categories: delay propagation, and root delay and cancellation. In addition, the scope takes one of three specific extents (airline, airport, or en-route airspace) or an ensemble of them.
Additionally, within the Data Science branch, we focused on the data by categorizing data sources, the dimensions that can be used in the models, and the data management techniques used to preprocess data and improve the efficiency of prediction models. We also studied the main methods and divided them into five categories: statistical analysis, probabilistic models, network representation, operational research, and machine learning. These categories were grouped according to their use in specific forecast models for flight delays.
Besides the taxonomic scheme, we also presented a timeline of all articles to spot trends and relationships among the main elements of the taxonomy. In light of the domain-problem classification, this timeline showed a dominance of delay propagation and root delay over cancellation analysis. In the past, researchers focused on statistical analysis and operational research approaches. However, as data volume grows, the use of machine learning and data management is increasing significantly. This clearly characterizes a Data Science trend.
Researchers from airlines, airports, and academia will require a combination of skills of both domain specialists and data scientists to enable knowledge discovery from flight Big Data.
Acknowledgments
The authors thank CNPq, CAPES (finance code 001), FAPERJ, and CEFET/RJ for partially funding this research.
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
References
- [1] M. Abdel-Aty, C. Lee, Y. Bai, X. Li, and M. Michalak. Detecting periodic patterns of arrival delay. Journal of Air Transport Management, 13(6):355–361, Nov. 2007. ISSN 0969-6997.
- [2] K. F. Abdelghany, S. S. Shah, S. Raina, and A. F. Abdelghany. A model for projecting flight delays during irregular operation conditions. Journal of Air Transport Management, 10(6):385–394, Nov. 2004. ISSN 0969-6997.
- [3] S. Ahmad Beygi, A. Cohn, Y. Guan, and P. Belobaba. Analysis of the potential for delay propagation in passenger airline networks. Journal of Air Transport Management, 14(5):221–236, Sept. 2008. ISSN 0969-6997.
- [4] S. Ahmadbeygi, A. Cohn, and M. Lapp. Decreasing airline delay propagation by re-allocating scheduled slack. IIE Transactions (Institute of Industrial Engineers), 42(7):478–489, 2010.
- [5] ANAC. Agência Nacional de Aviação Civil. Technical report, http://www.anac.gov.br/, 2017.
- [6] C. Ariyawansa and A. Aponso. Review on state of art data mining and machine learning techniques for intelligent Airport systems. In Proceedings of 2016 International Conference on Information Management, ICIM 2016, pages 134–138, 2016.
- [7] F. Azadian, A. E. Murat, and R. B. Chinnam. Dynamic routing of time-sensitive air cargo using real-time information. Transportation Research Part E: Logistics and Transportation Review, 48(1):355–372, Jan. 2012. ISSN 1366-5545.
- [8] E. Balaban, I. Roychoudhury, L. Spirkovska, S. Sankararaman, C. Kulkarni, and T. Arnon. Dynamic routing of aircraft in the presence of adverse weather using a POMDP framework. In 17th AIAA Aviation Technology, Integration, and Operations Conference, 2017.
