Semantic-Fused Multi-Granularity Cross-City Traffic Prediction
Kehua Chen, Yuxuan Liang, Jindong Han, Siyuan Feng, Meixin Zhu, Hai Yang

TL;DR
This paper introduces a novel transfer learning model that fuses semantics at multiple granularities to improve cross-city traffic prediction, especially in data-scarce regions, by leveraging domain-invariant features and hierarchical graph structures.
Contribution
It proposes a Semantic-Fused Multi-Granularity Transfer Learning (SFMGTL) model that jointly addresses semantic fusion and multi-granularity in transfer learning for traffic prediction.
Findings
Outperforms state-of-the-art baselines on six real-world datasets.
Requires fewer parameters than baseline models.
Enhances demand prediction accuracy during peak hours.
Abstract
Accurate traffic prediction is essential for effective urban management and the improvement of transportation efficiency. Recently, data-driven traffic prediction methods have been widely adopted, with better performance than traditional approaches. However, they often require large amounts of data for effective training, which becomes challenging given the prevalence of data scarcity in regions with inadequate sensing infrastructures. To address this issue, we propose a Semantic-Fused Multi-Granularity Transfer Learning (SFMGTL) model to achieve knowledge transfer across cities with fused semantics at different granularities. In detail, we design a semantic fusion module to fuse various semantics while conserving static spatial dependencies via reconstruction losses. Then, a fused graph is constructed based on node features through graph structure learning. Afterwards, we implement…
| Dataset | NY-Taxi | CHI-Taxi | DC-Taxi | HZMetro | SHMetro |
| # Nodes | 460 | 476 | 420 | 80 | 288 |
| # Physical Edges | 3,886 | 4,018 | 3,538 | 248 | 958 |
| Interval | 60 min | 60 min | 60 min | 15 min | 15 min |
| Time Span | 1/1/2016 - 12/31/2016 | 1/1/2016 - 12/31/2016 | 1/1/2016 - 12/31/2016 | 1/1/2019 - 1/25/2019 | 7/1/2016 - 9/30/2016 |
| Mean | Pick:32.913, Drop:32.791 | Pick:5.861, Drop:5.950 | Pick:2.732, Drop:2.731 | In:213.930, Out:216.277 | In:223.690, Out:229.271 |
| Std | Pick:131.080, Drop:119.994 | Pick:39.043, Drop:34.236 | Pick:16.317, Drop:14.051 | In:358.880, Out:340.822 | In:358.880, Out:340.821 |
| Baseline | NY-Taxi → DC-Taxi | | | | | | CHI-Taxi → DC-Taxi | | | | | |
| | RMSE | | | MAE | | | RMSE | | | MAE | | |
| | 1 hour | 3 hour | 5 hour | 1 hour | 3 hour | 5 hour | 1 hour | 3 hour | 5 hour | 1 hour | 3 hour | 5 hour |
| ARIMA | 6.464 | 12.670 | 15.590 | 1.968 | 3.490 | 4.277 | 6.464 | 12.670 | 15.590 | 1.968 | 3.490 | 4.277 |
| GRU | 5.604 | 9.461 | 11.223 | 1.872 | 2.869 | 3.411 | 5.614 | 9.461 | 11.231 | 1.844 | 2.858 | 3.466 |
| TGCN | 4.960 | 6.993 | 7.860 | 1.841 | 2.624 | 2.938 | 4.862 | 7.093 | 7.696 | 1.841 | 2.517 | 2.723 |
| Fine-tuned | 4.986 | 7.073 | 7.851 | 1.737 | 2.491 | 2.713 | 4.680 | 6.762 | 7.682 | 1.665 | 2.316 | 2.675 |
| MAML | 4.990 | 7.071 | 7.807 | 1.721 | 2.544 | 2.548 | 4.759 | 6.849 | 7.640 | 1.795 | 2.353 | 2.667 |
| ST-GFSL | 4.858 | 7.063 | 7.847 | 1.613 | 2.284 | 2.613 | 4.720 | 6.725 | 7.505 | 1.677 | 2.276 | 2.466 |
| CrossTReS | 4.820 | 6.954 | 7.733 | 1.678 | 2.258 | 2.518 | 4.670 | 6.703 | 7.370 | 1.614 | 2.245 | 2.364 |
| SF-HGTL | 4.966 | 6.618 | 7.271 | 1.626 | 2.041 | 2.390 | 4.707 | 6.667 | 7.302 | 1.658 | 2.068 | 2.266 |
| Baseline | HZMetro → SHMetro | | | | | | SHMetro → HZMetro | | | | | |
| | RMSE | | | MAE | | | RMSE | | | MAE | | |
| | 15min | 30min | 60min | 15min | 30min | 60min | 15min | 30min | 60min | 15min | 30min | 60min |
| ARIMA | 117.034 | 133.054 | 125.461 | 52.351 | 59.130 | 61.755 | 101.400 | 113.407 | 109.088 | 52.416 | 57.517 | 57.835 |
| GRU | 69.921 | 88.188 | 143.522 | 32.146 | 38.309 | 54.856 | 62.723 | 69.639 | 100.848 | 33.177 | 35.978 | 48.669 |
| TGCN | 65.438 | 65.723 | 80.925 | 33.467 | 33.826 | 40.022 | 60.093 | 63.850 | 77.728 | 32.535 | 34.510 | 41.912 |
| Fine-tuned | 64.040 | 64.276 | 77.753 | 32.568 | 32.642 | 38.281 | 60.163 | 62.481 | 73.111 | 31.183 | 32.641 | 38.260 |
| MAML | 63.668 | 64.380 | 77.734 | 32.505 | 32.763 | 38.418 | 58.617 | 61.199 | 72.674 | 31.776 | 33.212 | 39.080 |
| ST-GFSL | 60.512 | 60.971 | 69.347 | 32.200 | 32.302 | 36.690 | 57.580 | 60.326 | 70.837 | 32.879 | 33.614 | 38.179 |
| CrossTReS | 60.242 | 60.793 | 68.608 | 32.188 | 32.309 | 36.401 | 55.983 | 60.807 | 71.886 | 31.677 | 32.264 | 38.587 |
| SF-HGTL | 60.440 | 61.197 | 69.210 | 30.017 | 31.033 | 35.400 | 54.390 | 57.567 | 71.243 | 29.317 | 30.647 | 36.823 |
Cross-City Traffic Prediction via Semantic-Fused Hierarchical Graph Transfer Learning
Kehua Chen, Jindong Han, Siyuan Feng*, and Hai Yang
Kehua Chen and Jindong Han are with the Division of Emerging Interdisciplinary Areas (EMIA), Interdisciplinary Programs Office, The Hong Kong University of Science and Technology, Hong Kong, China. Siyuan Feng and Hai Yang are with the Civil and Environmental Engineering Department, The Hong Kong University of Science and Technology. H. Yang is also with the Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China. * S. Feng is the corresponding author, E-mail: [email protected]
Abstract
Accurate traffic prediction benefits urban management and improves transportation efficiency. Recently, data-driven methods have been widely applied in traffic prediction and have outperformed traditional methods. However, data-driven methods normally require massive data for training, while data scarcity is ubiquitous in less-developed or newly constructed regions. To tackle this problem, we can transfer meta-knowledge extracted from data-rich cities to data-scarce cities via transfer learning. Besides, relations among urban regions can be organized into various semantic graphs, e.g. proximity and POI similarity, which are rarely considered in previous studies. In this paper, we propose the Semantic-Fused Hierarchical Graph Transfer Learning (SF-HGTL) model to achieve knowledge transfer across cities with fused semantics. In detail, we employ hierarchical graph transformation followed by meta-knowledge retrieval to achieve knowledge transfer at various granularities. In addition, we introduce meta semantic nodes to reduce the number of parameters as well as share information across semantics. Afterwards, the parameters of the base model are generated from fused semantic embeddings to predict traffic status under task heterogeneity. We conduct experiments on five real-world datasets and verify the effectiveness of our SF-HGTL model by comparing it with other baselines.
Index Terms:
Few-shot learning, Traffic prediction, Graph neural network.
I Introduction
As a vital problem in Intelligent Transportation System (ITS), traffic prediction aims to forecast future transportation status, e.g. traffic flow [1], traffic speed [2], origin-destination demand [3]. Accurate traffic prediction benefits urban management and improves transportation efficiency. Due to the explosion of data volume, various machine learning methods have been used in traffic prediction and outperformed traditional approaches [4].
Nonetheless, deep learning methods normally require massive data to achieve satisfactory performance, and limited data leads to over-fitting. Since data scarcity is ubiquitous in less-developed cities and new districts, training a powerful model there is non-trivial. Meanwhile, traffic patterns share common features across different cities. For instance, traffic speed in commercial areas decreases during peak hours, and crowd flow from residential areas to commercial areas tends to surge in the morning. Therefore, if we can extract traffic patterns from data-rich cities and adapt the learned patterns to data-scarce cities, we can still achieve good performance with limited data. Such knowledge transfer among cities is called cross-city knowledge transfer and has attracted much attention in recent years. Generally, there are two paradigms in cross-city knowledge transfer. The first is the Divide-Match-Transfer principle [5, 6]: both source and target cities are first divided into several regions, and the target regions absorb knowledge from the most similar source regions after matching. The second utilizes meta-learning to extract meta-knowledge from source cities and adapt the knowledge to target cities [7, 8]. In traffic prediction, meta-knowledge can be regarded as various traffic patterns.
In recent research, roads or stations in cities are treated as nodes in graphs, and their distance or POI similarity as edges. As such topological structure determines traffic patterns, knowledge transfer among graphs is more informative than grid-based transfer. However, there are still several challenges in cross-city knowledge transfer. First, most current studies achieve knowledge transfer only at the local level, without considering coarse-scale knowledge. Here we define “local” as the node level in graphs, i.e. a region or grid in previous studies, and define “zone” as a relatively coarse level that consists of more than one node and encompasses general characteristics (Fig. 1). For instance, several subway stations can belong to one commercial zone from a physical proximity perspective, and can also belong to one functional category from a Point of Interest (POI) perspective. Local-level transfer loses such zonal information. Second, although various semantics have been employed in traffic prediction, since urban data can normally be organized as multiple semantic graphs, most current knowledge-transfer models do not consider different semantic information; based on current studies, we can merely concatenate various semantic representations for prediction. Fig. 2 presents an example of the construction of various semantic graphs: although each semantic graph has identical nodes, the nodes can be connected in different ways to encode different semantic meanings, e.g. physical proximity and POI similarity in the example. Third, compared to the Divide-Match-Transfer principle, meta-learning methods directly extract meta-knowledge and hence are more generalizable. However, previous studies implicitly extract spatial-temporal patterns along with the model parameters, which does not explicitly integrate information among patterns, let alone fuse knowledge from multiple semantics.
To tackle the aforementioned challenges, we propose a novel framework for spatio-temporal transfer learning named Semantic-Fused Hierarchical Graph Transfer Learning (SF-HGTL). SF-HGTL takes graphs as inputs and is able to handle various semantic graphs. Considering the first challenge, we utilize hierarchical graph transformation for each semantic graph to extract multi-level information. To extract meta-knowledge for knowledge transfer and tackle the third challenge, we design several meta-knowledge graphs to extract meta-knowledge from different levels. Graph structure is more informative compared to discrete memory vectors. To solve the second challenge as well as reduce the number of parameters, different semantics share the same meta-knowledge graphs, and their parameters are adjusted by corresponding meta semantic nodes. At last, we employ a modulating function to generate task-specific parameters for the base learner after the fusion of semantics. Note that the base learner can be any deep learning models, e.g. Recurrent Neural Network and Graph Neural Network, and we use the base learner to predict traffic status.
The main contributions of this paper are as follows:
- We design a novel cross-city knowledge transfer framework based on semantic-fused hierarchical graphs. The framework uses hierarchical graph transformation to achieve local, zonal and global feature representation, and we introduce meta-knowledge graphs to benefit information retrieval in the meta-testing phase.
- We introduce meta semantic nodes that not only adapt meta-knowledge to various semantics, but also achieve information sharing among semantics since they utilize the same meta-knowledge graphs. A semantic discriminator is further applied to avoid trivial solutions.
- We conduct several experiments on five real-world datasets to verify the effectiveness of the proposed framework. The results demonstrate the superiority of our SF-HGTL model over baseline models.
The remainder of the paper is organized as follows. We briefly review related work in Section II, and introduce definitions and the problem formulation in Section III. The proposed model is presented in Section IV. Section V describes the experiments, results and discussions. Finally, we summarize the paper in Section VI.
II RELATED WORK
This paper is relevant to traffic prediction, transfer learning, and cross-city knowledge transfer. In this section, we briefly review related works for the above topics.
II-A Traffic Prediction
Traffic prediction is a classical problem that researchers have studied for several decades. Traditional statistical methods include Historical Average (HA) [9], Autoregressive Integrated Moving Average (ARIMA) [10] and its variants. Nonetheless, these traditional methods merely capture linear relations and cannot fully depict data patterns. Hence, recent studies tend to use deep learning methods due to their expressiveness. Researchers have adopted various techniques to tackle traffic prediction problems, ranging from Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) to Graph Neural Networks (GNN).
Long Short-Term Memory (LSTM) was applied in [11] to predict traffic speed on a single road, and subsequent studies normally use RNN as a model component owing to its powerful capacity of capturing non-linear dynamics. Within a city, traffic status in one region is influenced by proximal regions. To model such relations, [12] formed traffic speed as various matrices, and then transformed matrices into image channels. Then CNN was applied to predict future traffic status. Yet, CNN can only capture Euclidean relations. As an emerging technique, GNN has been widely used in current models for traffic prediction. For example, [13] modeled the traffic flow as a diffusion process on a directed graph and proposed DCRNN to forecast traffic speed. T-GCN [14] combined Graph Convolutional Network (GCN) and Gated Recurrent Unit (GRU) to predict traffic status. In addition, [15] came up with ST-MetaNet to capture traffic spatial and temporal correlations. The authors designed the model based on deep meta-learning and chose RNN and GNN as meta-learners.
Although deep learning methods achieve accurate traffic prediction, they typically require massive data. Hence, transfer learning provides an effective method to alleviate data scarcity.
II-B Transfer Learning
Traditional machine learning approaches have achieved great success in various practical scenarios, such as object detection, and language processing. However, machine learning normally requires massive labeled data for training, and the collection of labeled data is time-consuming and expensive.
Transfer learning transfers knowledge from source domains to improve learning performance in target domains. Hence, transfer learning is suitable for few-shot problems, i.e. rich data in source domains and scarce data in target domains. In the data-based interpretation [16], the main objective of transfer learning is to minimize the distribution difference between source and target domains, e.g. through instance weighting [17] and feature transformation [18] strategies. In the model-based interpretation [16], the aim is to accurately predict results on the target domain by utilizing source knowledge. As an emerging method, meta-learning [19] has been used as an effective way to achieve transfer learning. In meta-learning, a number of tasks are drawn from a task distribution; the model learns a generalized meta-learner in the meta-training phase and achieves fast adaptation in the meta-testing phase when new tasks arrive. As one of the most famous methods, MAML [20] treats the meta-learner as a parameter initialization learned by bi-level optimization, and we use MAML as the basic framework in this paper. Besides, [21] showed that exploiting task discrepancy benefits model performance, and various modulating methods have been applied in recent studies, such as automated relational meta-graph [22], multimodal modulation network [23], and hierarchical prototype graph [24].
While most previous transfer learning studies focused on image- and text-related tasks, spatial-temporal transfer learning is still at an early stage and is not yet well understood.
II-C Cross-City Knowledge Transfer
To the best of our knowledge, [5] first proposed the Divide-Match-Transfer principle. The authors partitioned the cities into equal-size grids, then matched target regions with the most correlated source regions. Afterwards, ConvLSTM was used as the backbone model to learn regional representations for prediction. At the fine-tuning stage, the model minimized the squared error between the regional representations of target regions and those of the matched source regions.
MetaST [7] designed a memory module to extract long-term patterns from source cities. The regions in source cities were clustered into several categories as memory. Then, an attention mechanism was utilized to gather useful information from the memory during both the meta-training and meta-testing phases. ST-DANN [6] first mapped source and target data to a common embedding space, and minimized the Maximum Mean Discrepancy (MMD) between the two embeddings. To further capture spatial dependencies, the model introduced a global attention mechanism, which can also be deemed a matching process. [8] designed a model for graph-structured data. The model extracted meta-knowledge through GRU and Graph Attention Network (GAT). To express the structural information, a graph construction loss was introduced. Last, the node-level meta-knowledge was fed into a parameter generation module to produce non-shared feature extractor parameters. To alleviate negative transfer, [25] proposed selective cross-city transfer learning to filter harmful source knowledge. The authors employed edge-level and node-level adaptation to train the feature network, and designed a weighting network for loss calculation.
As mentioned in the introduction, there are still challenges and limitations in current studies, and this paper aims to solve them via semantic-fused hierarchical graph transfer learning.
III DEFINITION AND PROBLEM FORMULATION
In this section, we introduce several definitions used in this paper and formally illustrate the cross-city knowledge transfer problem.
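Definition 1 (Graph): A graph is denoted as $G = (V, E, X, A)$. $V$ is the node set, and $|V| = N$; $E$ is the edge set; $X \in \mathbb{R}^{N \times d}$ is the node feature matrix, and the feature dimension of each node is $d$; $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix, and each entry $A_{ij}$ indicates the weight of edge $e_{ij}$.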
Definition 2 (Few-Shot Learning): Given a dataset , where is the training set that only has a few samples, and is the corresponding testing set. The aim of few-shot learning is to train a model on the training set that minimizes the prediction error on the test set. Here, we treat traffic prediction problem in the data-scarce cities as a few-shot learning problem.
In this paper, we follow Model-Agnostic Meta-Learning (MAML) [20] framework to handle the few-shot learning problem. As Fig. 3 shows, we use a source dataset to train the meta learner, is organized as various tasks, denoted as . Each task includes a support set and a query set . MAML employs and to achieve domain adaption via bi-level optimization. In detail, assuming the initial parameters of the meta learner as , we first utilize to update model parameters as . Then is employed to calculate loss based on , and the parameters are updated again to acquire , i.e. outer updating. We use for the following training. At the meta-testing stage, the target dataset is also formed as tasks, we employ as the initial parameters, fine-tune parameters based on and test the model performance in .
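The bi-level optimization described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a linear regression model with MSE loss, for which the second-order term of the MAML gradient has a closed form. The function names, the toy task generator, and the step sizes `alpha` and `beta` are all hypothetical.

```python
import numpy as np

def mse_grad(theta, X, y):
    """Gradient of mean((X @ theta - y) ** 2) with respect to theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def inner_update(theta, X_s, y_s, alpha):
    """One inner step on the support set: theta' = theta - alpha * grad."""
    return theta - alpha * mse_grad(theta, X_s, y_s)

def maml_outer_grad(theta, task, alpha):
    """Gradient of the query loss at theta' with respect to the initial theta.

    For a linear model, d theta' / d theta = I - alpha * H_s, where
    H_s = 2 X_s^T X_s / n is the Hessian of the support loss.
    """
    X_s, y_s, X_q, y_q = task
    theta_prime = inner_update(theta, X_s, y_s, alpha)
    hessian_s = 2.0 * X_s.T @ X_s / len(y_s)
    jacobian = np.eye(len(theta)) - alpha * hessian_s
    return jacobian.T @ mse_grad(theta_prime, X_q, y_q)

def make_task(rng, w, n_support=5, n_query=20):
    """Toy task: a small support set and a larger query set sharing weights w."""
    X_s = rng.normal(size=(n_support, 2))
    X_q = rng.normal(size=(n_query, 2))
    return X_s, X_s @ w, X_q, X_q @ w

def adapted_query_loss(theta, task, alpha):
    """Query MSE after one inner step from theta."""
    X_s, y_s, X_q, y_q = task
    theta_prime = inner_update(theta, X_s, y_s, alpha)
    return float(np.mean((X_q @ theta_prime - y_q) ** 2))

rng = np.random.default_rng(0)
# Toy "source city" tasks whose true weights cluster around (2, -1).
tasks = [make_task(rng, np.array([2.0, -1.0]) + 0.1 * rng.normal(size=2))
         for _ in range(8)]
alpha, beta = 0.05, 0.05
theta = np.zeros(2)
loss_before = np.mean([adapted_query_loss(theta, t, alpha) for t in tasks])
for _ in range(200):  # meta-training: outer update averaged over tasks
    grads = [maml_outer_grad(theta, t, alpha) for t in tasks]
    theta = theta - beta * np.mean(grads, axis=0)
loss_after = np.mean([adapted_query_loss(theta, t, alpha) for t in tasks])
```

At the meta-testing stage, the learned `theta` would serve as the initialization, with `inner_update` applied on the target city's support set before evaluating on its query set.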
Problem Definition 1 (Traffic Prediction): Consider there are sensors/recorders in a given city, sensor/recorder measures traffic status on a specific node (e.g. road link, or subway station) at time , denoted as . Given past records, the sequential traffic status can be naturally deemed as node features, then traffic prediction aims to find a function that predicts the next traffic status:
[TABLE]
The above past traffic status of nodes and its corresponding traffic status are deemed as one sample. Various samples consist of one task .
Problem Definition 2 (Cross-City Traffic Prediction): Traffic prediction based on a few training samples can be deemed as a few-shot learning problem, and we solve it with meta learning. Here we assume only one source and one target city. Given a source city denoted as , a target city denoted as , where indicates the number of different semantics and each semantic shares the same node features , i.e. past traffic status. The source city is data-rich while the target city is data-scarce. We aim to find a function that leverages knowledge of the source city to predict traffic status in the target city as accurately as possible.
IV METHODOLOGY
Fig. 4 presents the proposed model architecture. For each sample in task with multiple semantic graphs, we generate a sample embedding via a semantic-fusion cell. In detail, we first feed the node features into a shared GRU, then treat hidden states as the new node features. Then, hierarchical graph transformation is applied to each semantic graph. At each transformation layer, the nodes retrieve meta-knowledge (MK) from several meta-knowledge graphs. To model the relations among various semantics, we further introduce several meta semantic nodes to adapt the parameters of meta-knowledge graphs in terms of semantics. Afterwards, each semantic graph is transformed as a semantic embedding, and we concatenate semantic embeddings as sample embedding. We treat the average of sample representations as task embedding. The task embedding then interacts with a task meta-knowledge graph. By feeding the task embedding into a task modulating function, we acquire the adapted parameters for the meta learner.
IV-A Hierarchical Graph Transformation
Assume there are samples in support set , and each sample has semantics. Although various semantics are available in spatio-temporal prediction problems, the node features are normally fixed. Here we employ a shared GRU to generate node embeddings based on historical traffic status:
[TABLE]
Afterwards, we apply Graph Neural Networks (GNN) to extract non-Euclidean information. General GNNs follow the message passing mechanism [26], which includes two processes: aggregation and updating. At the $l$-th layer, a GNN updates the embedding $h_v^{(l)}$ of node $v$ as follows:
$$h_v^{(l)} = \mathrm{UPDATE}\Big(h_v^{(l-1)}, \mathrm{AGG}\big(\{h_u^{(l-1)} : u \in \mathcal{N}(v)\}\big)\Big)$$
where $\mathcal{N}(v)$ indicates the neighborhood of node $v$; UPDATE can be a neural network; AGG can be mean or max pooling.
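As a concrete illustration of aggregation and updating, the sketch below implements one message passing layer with mean aggregation. It is only one possible instance: the choice of a linear map followed by ReLU for UPDATE, and the names `message_passing_layer`, `W_self` and `W_neigh`, are assumptions made for this example.

```python
import numpy as np

def message_passing_layer(H, A, W_self, W_neigh):
    """One GNN layer: AGG = mean over neighbours, UPDATE = linear map + ReLU."""
    degree = A.sum(axis=1, keepdims=True)
    # Mean of neighbour embeddings; isolated nodes receive a zero message.
    messages = (A @ H) / np.maximum(degree, 1.0)
    return np.maximum(H @ W_self + messages @ W_neigh, 0.0)

# Path graph 0 - 1 - 2 with 2-dimensional node embeddings.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
H_next = message_passing_layer(H, A, np.eye(2), np.eye(2))
# H_next is [[1, 1], [1, 1.5], [1, 2]]: each node adds the mean of its neighbours.
```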
However, GNNs are most effective at capturing local information, since deep GNNs suffer from the over-smoothing problem [27]. A hierarchical structure provides a larger view while conserving node differences. [28] designed a hierarchical graph representation model to acquire graph embeddings. Basically, hierarchical graph transformation assigns nodes to clusters at each layer in order to reduce the number of nodes. Denote the learned cluster assignment matrix at layer $l$ as $S^{(l)}$, the node embeddings at layer $l$ as $Z^{(l)}$, the new node/cluster embedding matrix at layer $l+1$ as $X^{(l+1)}$, and the adjacency matrix at layer $l+1$ as $A^{(l+1)}$. The main process is first to generate node embeddings through an embedding GNN, and a cluster assignment matrix through a pooling GNN at layer $l$. Afterwards, the nodes at layer $l$ are assigned to clusters with new embeddings and a new adjacency matrix at layer $l+1$:
$$Z^{(l)} = \mathrm{GNN}_{\text{embed}}^{(l)}\big(A^{(l)}, X^{(l)}\big), \qquad S^{(l)} = \mathrm{softmax}\Big(\mathrm{GNN}_{\text{pool}}^{(l)}\big(A^{(l)}, X^{(l)}\big)\Big)$$
$$X^{(l+1)} = {S^{(l)}}^{\top} Z^{(l)}, \qquad A^{(l+1)} = {S^{(l)}}^{\top} A^{(l)} S^{(l)}$$
In short, we denote -th hierarchical graph transformation as . We utilize two transformation layers: the nodes after the first transformation represent zonal information, and the second transformation aggregates the overall graph into semantic information. Hierarchical graph transformation effectively clusters similar nodes together, so each node at deeper levels can be deemed a coarse category indicating more general information, e.g. urban functional zones.
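One coarsening step of this kind can be sketched as follows. This is a minimal illustration in the style of differentiable pooling, not the authors' code: the single-layer `gnn` stand-in, the random weights, and all names are assumptions, and only the clustering ratio of 0.3 is taken from the implementation details reported later.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gnn(A, X, W):
    """Single mean-aggregation graph convolution with self-loops (assumed stand-in)."""
    A_hat = A + np.eye(len(A))
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)
    return np.tanh(A_norm @ X @ W)

def hierarchical_transform(A, X, W_embed, W_pool):
    """Coarsen a graph: returns cluster embeddings, cluster adjacency, assignment."""
    Z = gnn(A, X, W_embed)                   # node embeddings from the embedding GNN
    S = softmax(gnn(A, X, W_pool), axis=1)   # soft assignment, each row sums to 1
    X_coarse = S.T @ Z                       # cluster embeddings
    A_coarse = S.T @ A @ S                   # cluster-level connectivity
    return X_coarse, A_coarse, S

rng = np.random.default_rng(0)
n_nodes, d_in, d_hidden = 10, 4, 8
n_clusters = max(1, round(0.3 * n_nodes))    # clustering ratio 0.3
A = (rng.random((n_nodes, n_nodes)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # symmetric, no self-loops
X = rng.normal(size=(n_nodes, d_in))
X1, A1, S1 = hierarchical_transform(
    A, X,
    rng.normal(size=(d_in, d_hidden)),
    rng.normal(size=(d_in, n_clusters)),
)
```

Applying `hierarchical_transform` again to `(A1, X1)` with a single output cluster would give the second, graph-level transformation described in the text.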
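Besides, we add a hierarchical link prediction loss to encode nearby nodes closely, and a hierarchical entropy loss to force exclusive assignment:
$$\mathcal{L}_{LP} = \sum_{l} \big\| A^{(l)} - S^{(l)} {S^{(l)}}^{\top} \big\|_F, \qquad \mathcal{L}_{E} = \sum_{l} \frac{1}{n_l} \sum_{i=1}^{n_l} H\big(S_i^{(l)}\big)$$
where $A^{(l)}$ and $S^{(l)}$ are the adjacency matrix and cluster assignment matrix at layer $l$; $\|\cdot\|_F$ is the Frobenius norm; $H(\cdot)$ means the entropy function; $S_i^{(l)}$ is the $i$-th row of $S^{(l)}$; and $n_l$ is the number of nodes at layer $l$.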
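The effect of the two auxiliary losses can be checked on a toy graph. The sketch below assumes the standard forms used in differentiable pooling (Frobenius norm of the gap between the adjacency matrix and the assignment Gram matrix, and mean row entropy); the function names and the toy matrices are hypothetical.

```python
import numpy as np

def link_prediction_loss(A, S):
    """Frobenius norm of A - S S^T: connected nodes should share clusters."""
    return float(np.linalg.norm(A - S @ S.T, ord="fro"))

def entropy_loss(S, eps=1e-12):
    """Mean row entropy of the assignment matrix; near zero for one-hot rows."""
    return float(np.mean(-np.sum(S * np.log(S + eps), axis=1)))

# Two disconnected pairs of nodes, {0, 1} and {2, 3}; self-loops are included
# so that a perfect hard assignment reproduces A exactly in this toy case.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
S_hard = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
S_soft = np.full((4, 2), 0.5)
```

The hard assignment `S_hard` drives both losses to (numerically) zero, while the uninformative `S_soft` is penalized by both, which is the behaviour the two terms are meant to encourage.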
IV-B Meta-knowledge Graph Construction
Inspired by [22], we design multiple meta-knowledge graphs to extract knowledge at different levels, i.e. node MK graph, zone MK graph and semantic MK graph. Here we take the node MK graph as an example for demonstration. Specifically, we denote the node MK graph as with nodes. Both node features and adjacency matrix are learned during training:
[TABLE]
where and are node features in the node MK graph; is sigmoid function; and are learnable parameters; is a scaling factor.
Afterwards, the node-level graph of each semantic is supposed to retrieve meta-knowledge from the node MK graph. Here we construct a super-graph connecting sample nodes with nodes in . For the weights among node and meta-node , we calculate them via Euclidean distances:
[TABLE]
As Fig. 5 presents, the node features and adjacency matrix of super-graph are , and respectively, where indicates concatenation operation and means matrix transpose.
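The super-graph construction can be sketched as a block matrix. The exact weighting formula is not recoverable from the text, so the example below assumes a softmax over negative squared Euclidean distances for the cross weights; `build_super_graph` and all variable names are hypothetical.

```python
import numpy as np

def build_super_graph(H, A, H_mk, A_mk):
    """Join a sample graph and a meta-knowledge (MK) graph into one super-graph."""
    # Pairwise squared Euclidean distances between sample nodes and MK nodes.
    dist2 = ((H[:, None, :] - H_mk[None, :, :]) ** 2).sum(axis=-1)
    # Assumed similarity: softmax over negative distances, one row per sample node.
    logits = -dist2
    logits -= logits.max(axis=1, keepdims=True)
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)
    features = np.concatenate([H, H_mk], axis=0)   # stacked node features
    adjacency = np.block([[A, W], [W.T, A_mk]])    # cross weights off the diagonal
    return features, adjacency

rng = np.random.default_rng(0)
n_nodes, n_mk, d = 4, 5, 3          # 5 MK nodes, matching the reported number of centers
H = rng.normal(size=(n_nodes, d))
A = np.ones((n_nodes, n_nodes)) - np.eye(n_nodes)
H_mk = rng.normal(size=(n_mk, d))
A_mk = np.ones((n_mk, n_mk)) - np.eye(n_mk)
features, adjacency = build_super_graph(H, A, H_mk, A_mk)
```

Feeding `features` and `adjacency` to any GNN layer then lets each sample node aggregate information from the MK nodes it is closest to.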
Then we feed the super-graph into a GNN and integrate node features with meta-knowledge. In our model, meta-knowledge retrieval is implemented for each layer of hierarchical graph transformation, i.e. . Therefore, we finally acquire a semantic embedding with the combination of various meta-knowledge:
[TABLE]
where MLP indicates a fully-connected neural network; means semantic embeddings before MK retrieval; and indicates semantic embeddings after MK retrieval; Eq. 12, 14 and 17 represent meta-knowledge retrieval process, and is mean pooling.
Subsequently, we acquire graph embeddings for each semantic graph and fuse the semantic graph embeddings to generate a sample embedding, e.g. via concatenation or an attention mechanism. Here we simply concatenate semantic graph embeddings to generate sample ’s embedding. We name hierarchical graph transformation together with the semantic fusion process the semantic fusion cell:
[TABLE]
Within a support set , we apply mean pooling to sample embeddings to acquire a task embedding:
[TABLE]
Afterwards, the task embedding further retrieves knowledge from a task MK graph to achieve task adaptation. The retrieval process is similar to the aforementioned super-graph construction method. At last, we apply a modulating function to the initialized parameter to generate task-specific parameters:
[TABLE]
where and are learnable parameters; indicates element-wise multiplication. is then used as model parameters for prediction.
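The path from semantic embeddings to task-specific parameters can be sketched as below. This is an illustration under stated assumptions: the modulating function is taken to be a sigmoid gate applied element-wise to the initial parameters, the task MK graph retrieval step is omitted, and every name and dimension is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_embedding(semantic_embeddings):
    """Fuse the per-semantic graph embeddings of one sample by concatenation."""
    return np.concatenate(semantic_embeddings, axis=-1)

def task_embedding(sample_embeddings):
    """Mean-pool the sample embeddings of a support set."""
    return np.mean(sample_embeddings, axis=0)

def modulate(theta_0, task_emb, W, b):
    """Assumed gate: theta_task = theta_0 * sigmoid(W @ task_emb + b), element-wise."""
    return theta_0 * sigmoid(W @ task_emb + b)

rng = np.random.default_rng(0)
n_samples, n_semantics, d = 4, 3, 8
samples = np.stack([
    sample_embedding([rng.normal(size=d) for _ in range(n_semantics)])
    for _ in range(n_samples)
])                                    # shape (4, 24)
e_task = task_embedding(samples)      # shape (24,)
theta_0 = rng.normal(size=16)         # flattened base-learner parameters
W = rng.normal(size=(16, n_semantics * d))
b = np.zeros(16)
theta_task = modulate(theta_0, e_task, W, b)
```

Because the gate lies in (0, 1), each task can only scale down entries of the shared initialization, which is one simple way to express task heterogeneity without learning a separate parameter set per task.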
IV-C Semantic Adjustment for Meta-knowledge Graph
As introduced above, the proposed model conducts meta-knowledge retrieval for each semantic, which increases the number of parameters significantly. To alleviate this problem, we design a semantic adjustment module to achieve parameter adaption in terms of various semantics.
In detail, various semantic graphs share the same MK graphs, and we introduce meta semantic nodes to adjust MK graphs according to different semantics. The utilization of meta semantic nodes can not only reduce the number of parameters, but also establish connections among semantics. Similar to Eq. 11, we construct a super-graph based on MK graphs and meta semantic nodes, then implement GNN to update node features of MK graphs in terms of meta semantic nodes. Then, MK graphs interact with hierarchical graphs as mentioned above. Hence, the meaning of Hierarchical in our SF-HGTL model is twofold: (i) we use hierarchical graph transformation to aggregate meta-knowledge from node, zone and semantic levels; (ii) we use meta semantic nodes to hierarchically adjust parameters of MK graphs.
The features of meta semantic nodes are learned during the training phase to represent different semantics. However, the current framework merely captures the discrepancy among semantics from their adjacency matrices, and this signal can vanish after the meta-knowledge retrieval process. As a result, the nodes on MK graphs rather than the meta semantic nodes may end up representing semantic information, which is a trivial solution.
To solve this problem, we further introduce a semantic discriminator to guide the learning of meta semantic nodes. Specifically, we treat meta semantic nodes as prototypes, i.e. forcing different semantic graph embeddings to approach corresponding meta semantic nodes and keeping away from other semantics:
[TABLE]
where is the number of semantics; is the temperature hyperparameter. We use an to increase the expressiveness.
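A prototype-style discriminator of this kind can be sketched as a temperature-scaled softmax cross-entropy. The similarity measure is an assumption here (cosine similarity), as are the function name and the default temperature; the original formula may differ.

```python
import numpy as np

def semantic_discriminator_loss(embeddings, prototypes, tau=0.5):
    """Cross-entropy over temperature-scaled cosine similarities.

    embeddings[k] is the graph embedding of semantic k and prototypes[k] is
    the matching meta semantic node; both arrays have shape (K, d).
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (e @ p.T) / tau                       # (K, K) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))      # matching pairs lie on the diagonal

prototypes = np.eye(3)
aligned = np.eye(3) + 0.01          # each embedding sits next to its own prototype
shuffled = aligned[[1, 2, 0]]       # each embedding sits next to a wrong prototype
loss_aligned = semantic_discriminator_loss(aligned, prototypes)
loss_shuffled = semantic_discriminator_loss(shuffled, prototypes)
```

The loss is small when every semantic embedding is closest to its own meta semantic node and large otherwise, which pushes the meta semantic nodes to carry the semantic identity.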
IV-D Meta-optimization
As introduced in [20], we split source-city datasets into massive training tasks in meta-training phase. A task consists of a support set and a query set , and . Recall that we use past steps to predict next traffic status for one sample, each support set and query set consist of several samples. Afterwards, batches of tasks are sampled from to train our model by bi-level optimization. For task , the model first acquires adjusted parameters as mentioned above, then leverages to inner-update :
[TABLE]
where is the stepsize; is the task-specific loss function, e.g. Mean Square Error (MSE) or Mean Absolute Error (MAE) for our traffic prediction.
Afterwards, the meta-objective is the combination of task loss based on , hierarchical link prediction loss, hierarchical entropy loss and semantic loss:
[TABLE]
where is the overall parameters; is the stepsize; are regularization factors. Algorithm 1 presents the above training process.
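For orientation, the two update steps can be written in the generic MAML form below. This is a hedged reconstruction, not the paper's exact equations: the symbols $\theta_{0i}$ (task-adjusted initial parameters), $\theta_i'$, $\Theta$, $\alpha$, $\beta$, the loss names, and the use of three regularization factors $\lambda_1, \lambda_2, \lambda_3$ are all assumptions chosen to match the verbal description.

```latex
% Inner update on the support set S_i of task T_i (step size \alpha)
\theta_i' = \theta_{0i} - \alpha \nabla_{\theta_{0i}}
    \mathcal{L}_{\text{task}}\big(f_{\theta_{0i}}; \mathcal{S}_i\big)

% Outer update of all parameters \Theta on the query sets Q_i (step size \beta)
\Theta \leftarrow \Theta - \beta \nabla_{\Theta} \sum_{i} \Big[
    \mathcal{L}_{\text{task}}\big(f_{\theta_i'}; \mathcal{Q}_i\big)
    + \lambda_1 \mathcal{L}_{LP}
    + \lambda_2 \mathcal{L}_{E}
    + \lambda_3 \mathcal{L}_{\text{sem}} \Big]
```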
V EXPERIMENT
To evaluate the effectiveness of our framework, we implement diverse experiments with five real-world datasets and compare the model performance with several baselines. The experiments are related to taxi pickup/dropoff and metro passenger flow prediction, and our framework can be easily transferred to other traffic prediction tasks such as traffic speed and flow prediction. Moreover, an ablation study is conducted to indicate the importance of each component of our model. At last, we conduct sensitivity analysis to show the impact of various hyperparameters.
V-A Experiment Setup and Datasets
We choose taxi pickup/dropoff and metro passenger flow prediction tasks in this study. The taxi pickup/dropoff prediction involves three datasets: NY-Taxi, CHI-Taxi and DC-Taxi [25]. Following [25], we use proximity, road connectivity and POI graphs in the experiment. The metro passenger flow prediction involves two datasets: HZMetro and SHMetro [29]. Following [29], we utilize physical, similarity and correlation graphs. The statistics of the datasets are listed in Table I.
NY-Taxi dataset: The dataset is collected from New York taxis and covers 74.059°W to 73.863°W in longitude and 40.645°N to 40.848°N in latitude. There are 133 million records in total. The data ranges from January 1st 2016 to December 31st 2016, and the recording interval is 1 hour.
CHI-Taxi dataset: The dataset is collected from Chicago taxis and covers 87.740°W to 87.576°W in longitude and 41.766°N to 42.013°N in latitude. There are 24.5 million records in total. The data ranges from January 1st 2016 to December 31st 2016, and the recording interval is 1 hour.
DC-Taxi dataset: The dataset is collected from Washington taxis and covers 77.127°W to 76.926°W in longitude and 38.798°N to 38.969°N in latitude. There are 10 million records in total. The data ranges from January 1st 2016 to December 31st 2016, and the recording interval is 1 hour.
HZMetro dataset: The dataset is generated based on transaction records of the Hangzhou metro system, the system has 2.35 million ridership each day. The data ranges from January 1st 2019 to January 25th 2019, and the recording interval is 15 min.
SHMetro dataset: The dataset is generated from transaction records of the Shanghai metro system, with 811.8 million records in total. The data ranges from July 1st 2016 to September 30th 2016, and the recording interval is 15 min.
We conduct four experiments to evaluate the model performance, i.e. NY-Taxi to DC-Taxi, CHI-Taxi to DC-Taxi, HZMetro to SHMetro, and SHMetro to HZMetro. For taxi prediction, we employ the past 6 time steps to predict the next 5 time steps, including pickup and dropoff volumes. For metro prediction, we use the past 4 time steps to predict the next 4 time steps, including inflow and outflow volumes. Min-Max normalization and Z-score normalization are applied for taxi and metro passenger flow pre-processing, respectively.
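The windowing and normalization described above can be sketched as follows. This is an illustrative version only: the helper names and the `(time, nodes, channels)` array layout are assumptions, and the source does not state which data split the normalization statistics are computed on.

```python
import numpy as np

def min_max_normalize(x):
    """Scale values to [0, 1]; returns the scaled array and (min, max)."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo), (lo, hi)

def z_score_normalize(x):
    """Standardize to zero mean and unit variance; returns array and (mean, std)."""
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma, (mu, sigma)

def make_windows(series, n_in, n_out):
    """Slice a (T, N, C) series into input/target pairs.

    Returns X with shape (S, n_in, N, C) and Y with shape (S, n_out, N, C),
    where S = T - n_in - n_out + 1.
    """
    n_samples = series.shape[0] - n_in - n_out + 1
    X = np.stack([series[t:t + n_in] for t in range(n_samples)])
    Y = np.stack([series[t + n_in:t + n_in + n_out] for t in range(n_samples)])
    return X, Y

# Taxi setting from the text: 6 past steps in, 5 future steps out,
# two channels (pickup, dropoff). The data here is synthetic.
taxi = np.random.default_rng(0).poisson(5.0, size=(48, 10, 2)).astype(float)
taxi_scaled, _ = min_max_normalize(taxi)
X_taxi, Y_taxi = make_windows(taxi_scaled, n_in=6, n_out=5)
```

For the metro setting, the same helper would be called with `n_in=4, n_out=4` after `z_score_normalize`.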
Evaluation metrics: Since the prediction task is a regression problem, we utilize two widely applied metrics for evaluation: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
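$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}$$
where $\hat{y}_i$ and $y_i$ are the predicted and true values of instance $i$ respectively, and $n$ is the number of instances.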
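As a small illustration, both metrics can be computed directly from arrays of predictions and ground truth. The function names `mae` and `rmse` are chosen here for the sketch and are not taken from the authors' code.

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean Absolute Error over all elements."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))

def rmse(y_pred, y_true):
    """Root Mean Square Error over all elements."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Because RMSE squares each error before averaging, a few large errors (for example during peak hours) raise it more than they raise MAE, so the two metrics can rank models differently.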
Implementation details: The experiments are conducted on a Linux server with 2 NVIDIA Tesla V100 GPUs. We use TGCN [14] as the base learner to capture both temporal and spatial relations. The base learner has three TGCN layers with hidden sizes (16, 32, 32), followed by three MLP layers with hidden sizes (256, 128, 128). In addition, we choose k-GNN [30] to process graph data. The hidden size of SF-HGTL is 32, the number of centers for MK graphs is 5, the clustering ratio of hierarchical graph transformation is 0.3, and we let . The learning rates are 1e-3 and 5e-3 for inner and outer updating respectively, and Adam [31] is employed as the optimizer. The code is available on github.com/ckjzsa/SF-HGTL.
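For reference, the hyperparameters listed above could be collected in a single configuration object. The class and field names below are hypothetical and are not taken from the released code; only the values come from the text.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ExperimentConfig:
    # Field names are illustrative; values are those reported in the text.
    tgcn_hidden_sizes: Tuple[int, ...] = (16, 32, 32)   # three TGCN layers
    mlp_hidden_sizes: Tuple[int, ...] = (256, 128, 128)  # three MLP layers
    hidden_size: int = 32          # hidden size of SF-HGTL
    num_mk_centers: int = 5        # centers per MK graph
    clustering_ratio: float = 0.3  # hierarchical graph transformation
    inner_lr: float = 1e-3         # inner-loop learning rate
    outer_lr: float = 5e-3         # outer-loop learning rate
    optimizer: str = "adam"

config = ExperimentConfig()
```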
V-B Baselines
We first select several traditional statistical models and state-of-the-art deep learning methods for comparison:
- ARIMA [32]: Autoregressive Integrated Moving Average is a classic method for time-series prediction and merely considers linear relations among data.
- GRU [33]: Gated Recurrent Unit is a popular deep learning method to capture sequential dependency of data.
- TGCN [14]: TGCN employs GCN to capture the spatial features, followed by GRU to learn temporal dependency.
The above methods use only target data for model learning and do not involve any knowledge transfer. Next, we pick several transfer learning models for comparison:
- Fine-tuned Model: As a natural idea, we first train the base learner through source cities, and then fine-tune the model based on target cities.
- MAML [20]: As mentioned before, MAML uses bi-level optimization for parameter initialization, and the base learner is TGCN.
- ST-GFSL [8]: ST-GFSL generates non-shared parameters based on node-level meta-knowledge. Parameter matching is implemented to retrieve similar features from source cities.
- CrossTReS [25]: CrossTReS proposed a selective transfer learning framework to pick beneficial knowledge from the source domain by introducing a weighting network. Besides, it is suitable to handle multi-graph spatio-temporal prediction problems.
Note that the last two transfer learning methods focus on spatial-temporal knowledge transfer on graphs. Moreover, since most methods do not consider multiple graphs, we run multiple models in parallel, one per semantic graph, and then concatenate their outputs and feed them to a prediction head to predict future traffic status.
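The multi-graph adaptation of the baselines described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: each branch is reduced to a single graph-convolution layer standing in for a full baseline model, the prediction head is a single linear map, and all function names are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_branch(A, X, W):
    """One graph-convolution layer with ReLU (stand-in for a baseline model)."""
    return np.maximum(normalize_adjacency(A) @ X @ W, 0.0)

def multi_graph_predict(adjs, X, branch_weights, head_weight):
    """Run one branch per semantic graph, concatenate, apply a linear head.

    adjs: list of (N, N) non-negative adjacency matrices, one per graph.
    X: (N, F) node features shared by all branches.
    branch_weights: list of (F, H) matrices, one per graph.
    head_weight: (len(adjs) * H, out_dim) matrix of the prediction head.
    """
    outputs = [graph_branch(A, X, W) for A, W in zip(adjs, branch_weights)]
    fused = np.concatenate(outputs, axis=1)  # (N, len(adjs) * H)
    return fused @ head_weight               # (N, out_dim)
```

With three semantic graphs and hidden size `H`, the head therefore sees `3 * H` features per node, which is the only place where information from the different graphs is combined in this setup.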
V-C Experimental Results
Tables II and III show the experimental results of taxi pickup/dropoff and metro passenger flow prediction. The best result is in bold, and the second-best result is underlined. Generally, deep learning methods outperform traditional methods, and transfer-based methods further improve model performance. Although the ST-GFSL model considers spatial-temporal patterns, it does not utilize semantic information and hence performs worse than the CrossTReS and SF-HGTL models.
For taxi volume prediction, CrossTReS outperforms SF-HGTL for short-term prediction, but has worse results for long-term prediction. One possible reason is that CrossTReS applies a selection mechanism to avoid negative transfer: the re-weighting method works well when the prediction period is short but suffers from greater uncertainty in long-term prediction. In contrast, our SF-HGTL stores meta-knowledge in the MK graphs, and is better able to capture long-term traffic patterns. For metro passenger flow prediction, CrossTReS has relatively lower RMSE for the transfer from HZMetro to SHMetro, while our SF-HGTL model has lower MAE. Since RMSE penalizes large errors more heavily than MAE, and the HZMetro dataset is smaller than SHMetro, this suggests that CrossTReS captures peak traffic patterns better than SF-HGTL, while SF-HGTL is better at fitting general traffic patterns when the source domain is small. However, when the transfer is from SHMetro to HZMetro, i.e. the source domain has sufficient data, SF-HGTL outperforms other methods on almost every metric.
V-D Ablation Study
We further conduct several experiments to demonstrate the efficacy of each component: (i) Remove hierarchical graph transformation, denoted as SF-HGTL*-HGT*; (ii) Remove meta semantic nodes, denoted as SF-HGTL*-MS*; (iii) Only keep the task MK graph and remove other MK graphs, denoted as SF-HGTL*-MK*. We run five parallel experiments based on metro passenger flow prediction transferring from Hangzhou to Shanghai. Fig. 6 presents the results of the five experiments. The removal of hierarchical graph transformation significantly decreases the model performance, which indicates that information at different granularities is crucial to knowledge transfer. Moreover, the models without meta semantic nodes and MK graphs also perform worse than our SF-HGTL model, as these components store meta-knowledge extracted from various tasks.
V-E Sensitivity Analysis
In this section, we examine the influence of several vital hyperparameters, i.e. hidden dimension, node number in MK graphs, and clustering ratio. Five experiments are conducted for each hyperparameter based on metro passenger flow prediction, and we use the MAE of three periods to measure the performance. Fig. 7 presents the results. It can be found that: (i) Hidden size 32 has the lowest MAE, while hidden size 128 has the highest MAE, which we attribute to overfitting; (ii) The number of centers determines the capability of MK graphs, and the analysis shows that 5 and 15 centers achieve the lowest MAE. Since increasing the number of centers leads to high computational complexity, we choose 5 centers in the study; (iii) As a crucial hyperparameter, the hierarchical ratio reflects the clustering granularity: a high ratio means few clusters and vice versa. The results demonstrate that the model performance peaks when the hierarchical ratio is 0.3.
VI Conclusion
We propose a novel framework called SF-HGTL to achieve cross-city traffic prediction. SF-HGTL is based on the MAML framework with a modulating function to dynamically adjust parameters of the base learner in terms of task heterogeneity. To generate task embeddings, we employ hierarchical graph transformation on various semantic graphs, and fuse semantic embeddings via concatenation. In addition, we utilize meta semantic nodes and meta-knowledge graphs to implement knowledge transfer at different levels. We conduct experiments on taxi pickup/dropoff volume and metro passenger flow prediction, and compare the model performance with several state-of-the-art methods. The results show that our SF-HGTL outperforms the baseline models in most settings. Besides, the ablation study and sensitivity analysis support the effectiveness of our model. Nonetheless, the current framework requires complete graphs in both source and target cities, while the graphs may change with the introduction of new nodes, e.g. construction of new stations or roads. Hence, how to transfer knowledge from old nodes to new nodes under dynamic graphs is a non-trivial task. We leave this for future work.
Acknowledgment
We would like to acknowledge a grant from RGC Theme-based Research Scheme (TRS) T41-603/20R, and a research grant (project N_HKUST627/18) from the Hong Kong Research Grants Council under the NSFC/RGC Joint Research Scheme.
References
- [1] Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, “Traffic flow prediction with big data: a deep learning approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 865–873, 2014.
- [2] M. T. Asif, J. Dauwels, C. Y. Goh, A. Oran, E. Fathi, M. Xu, M. M. Dhanya, N. Mitrovic, and P. Jaillet, “Spatiotemporal patterns in large-scale traffic speed prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 2, pp. 794–804, 2013.
- [3] D. Zhang, F. Xiao, M. Shen, and S. Zhong, “Dneat: A novel dynamic node-edge attention network for origin-destination demand prediction,” Transportation Research Part C: Emerging Technologies, vol. 122, p. 102851, 2021.
- [4] D. A. Tedjopurnomo, Z. Bao, B. Zheng, F. Choudhury, and A. K. Qin, “A survey on modern deep neural network for traffic prediction: Trends, methods and challenges,” IEEE Transactions on Knowledge and Data Engineering, 2020.
- [5] L. Wang, X. Geng, X. Ma, F. Liu, and Q. Yang, “Cross-city transfer learning for deep spatio-temporal prediction,” arXiv preprint arXiv:1802.00386, 2018.
- [6] S. Wang, H. Miao, J. Li, and J. Cao, “Spatio-temporal knowledge transfer for urban crowd flow prediction via deep attentive adaptation networks,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 5, pp. 4695–4705, 2021.
- [7] H. Yao, Y. Liu, Y. Wei, X. Tang, and Z. Li, “Learning from multiple cities: A meta-learning approach for spatial-temporal prediction,” in The World Wide Web Conference, 2019, pp. 2181–2191.
- [8] B. Lu, X. Gan, W. Zhang, H. Yao, L. Fu, and X. Wang, “Spatio-temporal graph few-shot learning with cross-city knowledge transfer,” arXiv preprint arXiv:2205.13947, 2022.
