Semantic-Fused Multi-Granularity Cross-City Traffic Prediction

Kehua Chen; Yuxuan Liang; Jindong Han; Siyuan Feng; Meixin Zhu; Hai Yang

arXiv:2302.11774·cs.LG·April 2, 2024

TL;DR

This paper introduces a novel transfer learning model that fuses semantics at multiple granularities to improve cross-city traffic prediction, especially in data-scarce regions, by leveraging domain-invariant features and hierarchical graph structures.

Contribution

It proposes a Semantic-Fused Multi-Granularity Transfer Learning (SFMGTL) model that jointly addresses semantic fusion and multi-granularity in transfer learning for traffic prediction.

Findings

01 Outperforms state-of-the-art baselines on six real-world datasets.

02 Requires fewer parameters than baseline models.

03 Enhances demand prediction accuracy during peak hours.

Abstract

Accurate traffic prediction is essential for effective urban management and the improvement of transportation efficiency. Recently, data-driven traffic prediction methods have been widely adopted, with better performance than traditional approaches. However, they often require large amounts of data for effective training, which becomes challenging given the prevalence of data scarcity in regions with inadequate sensing infrastructures. To address this issue, we propose a Semantic-Fused Multi-Granularity Transfer Learning (SFMGTL) model to achieve knowledge transfer across cities with fused semantics at different granularities. In detail, we design a semantic fusion module to fuse various semantics while conserving static spatial dependencies via reconstruction losses. Then, a fused graph is constructed based on node features through graph structure learning. Afterwards, we implement…

Tables

Table 1. Description and statistics of the five datasets

Dataset	NY-Taxi	CHI-Taxi	DC-Taxi	HZMetro	SHMetro
# Nodes	460	476	420	80	288
# Physical Edges	3,886	4,018	3,538	248	958
Interval	60 min	60 min	60 min	15 min	15 min
Time Span	1/1/2016 - 12/31/2016	1/1/2016 - 12/31/2016	1/1/2016 - 12/31/2016	1/1/2019 - 1/25/2019	7/1/2016 - 9/30/2016
Mean	Pick:32.913, Drop:32.791	Pick:5.861, Drop:5.950	Pick:2.732, Drop:2.731	In:213.930, Out:216.277	In:223.690, Out:229.271
Std	Pick:131.080, Drop:119.994	Pick:39.043, Drop:34.236	Pick:16.317, Drop:14.051	In:358.880, Out:340.822	In:358.880, Out:340.821

Table 2. Experimental results for taxi pickup/dropoff prediction

Method	NY-Taxi $\to$ DC-Taxi						CHI-Taxi $\to$ DC-Taxi
	RMSE			MAE			RMSE			MAE
	1 hour	3 hour	5 hour	1 hour	3 hour	5 hour	1 hour	3 hour	5 hour	1 hour	3 hour	5 hour
ARIMA	6.464	12.670	15.590	1.968	3.490	4.277	6.464	12.670	15.590	1.968	3.490	4.277
GRU	5.604	9.461	11.223	1.872	2.869	3.411	5.614	9.461	11.231	1.844	2.858	3.466
TGCN	4.960	6.993	7.860	1.841	2.624	2.938	4.862	7.093	7.696	1.841	2.517	2.723
Fine-tuned	4.986	7.073	7.851	1.737	2.491	2.713	4.680	6.762	7.682	1.665	2.316	2.675
MAML	4.990	7.071	7.807	1.721	2.544	2.548	4.759	6.849	7.640	1.795	2.353	2.667
ST-GFSL	4.858	7.063	7.847	1.613	2.284	2.613	4.720	6.725	7.505	1.677	2.276	2.466
CrossTReS	4.820	6.954	7.733	1.678	2.258	2.518	4.670	6.703	7.370	1.614	2.245	2.364
SF-HGTL	4.966	6.618	7.271	1.626	2.041	2.390	4.707	6.667	7.302	1.658	2.068	2.266

Table 3. Experimental results for metro passenger flow prediction

Method	HZMetro $\to$ SHMetro						SHMetro $\to$ HZMetro
	RMSE			MAE			RMSE			MAE
	15min	30min	60min	15min	30min	60min	15min	30min	60min	15min	30min	60min
ARIMA	117.034	133.054	125.461	52.351	59.130	61.755	101.400	113.407	109.088	52.416	57.517	57.835
GRU	69.921	88.188	143.522	32.146	38.309	54.856	62.723	69.639	100.848	33.177	35.978	48.669
TGCN	65.438	65.723	80.925	33.467	33.826	40.022	60.093	63.850	77.728	32.535	34.510	41.912
Fine-tuned	64.040	64.276	77.753	32.568	32.642	38.281	60.163	62.481	73.111	31.183	32.641	38.260
MAML	63.668	64.380	77.734	32.505	32.763	38.418	58.617	61.199	72.674	31.776	33.212	39.080
ST-GFSL	60.512	60.971	69.347	32.200	32.302	36.690	57.580	60.326	70.837	32.879	33.614	38.179
CrossTReS	60.242	60.793	68.608	32.188	32.309	36.401	55.983	60.807	71.886	31.677	32.264	38.587
SF-HGTL	60.440	61.197	69.210	30.017	31.033	35.400	54.390	57.567	71.243	29.317	30.647	36.823

Equations

$$\begin{bmatrix} X_{1}^{t-T+1} & X_{2}^{t-T+1} & \cdots & X_{N}^{t-T+1} \\ \vdots & \vdots & \ddots & \vdots \\ X_{1}^{t} & X_{2}^{t} & \cdots & X_{N}^{t} \end{bmatrix} \xrightarrow{g(\cdot)} \begin{bmatrix} X_{1}^{t+1} & X_{2}^{t+1} & \cdots & X_{N}^{t+1} \\ \vdots & \vdots & \ddots & \vdots \\ X_{1}^{t+M} & X_{2}^{t+M} & \cdots & X_{N}^{t+M} \end{bmatrix}$$

$$h_{v_{i}}^{(0)} = \text{GRU}(\{X_{i}^{t-T+1}, \ldots, X_{i}^{t}\})$$

$$h_{v_{i}}^{(l)} = \text{UPDATE}\Big(h_{v_{i}}^{(l-1)},\ \text{AGG}^{(l)}\big(\{(h_{v_{i}}^{(l-1)}, h_{v_{j}}^{(l-1)}, A) \mid v_{j} \in N(v_{i})\}\big)\Big)$$

$$Z^{(l)} = \text{GNN}_{l,\text{embed}}(A^{(l)}, X^{(l)})$$

$$S^{(l)} = \text{softmax}\big(\text{GNN}_{l,\text{pool}}(A^{(l)}, X^{(l)})\big)$$

$$X^{(l+1)} = {S^{(l)}}^{T} Z^{(l)}$$

$$A^{(l+1)} = {S^{(l)}}^{T} A^{(l)} S^{(l)}$$

$$L_{l} = \| A^{(l)} - S^{(l)} {S^{(l)}}^{T} \|_{F}$$

$$L_{e} = \frac{1}{n_{l}} \sum_{i=1}^{n_{l}} H(S_{i})$$

$$A_{M}(m_{i}, m_{j}) = \sigma\big(W (|m_{i} - m_{j}| / \gamma) + b\big)$$

$$A_{C}(h_{v_{i}}, m_{i}) = \frac{\exp\big(-\|(h_{v_{i}} - m_{i})/\gamma\|_{2}^{2}/2\big)}{\sum_{n=1}^{N} \exp\big(-\|(h_{v_{i}} - m_{n})/\gamma\|_{2}^{2}/2\big)}$$

$$\hat{X}^{(0)} = \text{GNN}^{(0)}(A_{S}^{(0)}, X_{S}^{(0)})$$

$$X^{(1)}, A^{(1)} = \text{HGT}(A^{(0)}, \hat{X}^{(0)})$$

$$\hat{X}^{(1)} = \text{GNN}^{(1)}(A_{S}^{(1)}, X_{S}^{(1)})$$

$$X^{(2)}, A^{(2)} = \text{HGT}(A^{(1)}, \hat{X}^{(1)})$$

$$X_{c} = \text{Mean}(\text{MLP}(X^{(2)}))$$

$$g_{c} = \text{GNN}^{(2)}(A_{S}^{(2)}, X_{S}^{(2)})$$

$$z_{k} = g_{1} \oplus g_{2} \oplus \cdots \oplus g_{J}$$

$$t_{i} = \text{Mean}(\{z_{k}\}_{k=1}^{K})$$

$$\lambda_{i} = \sigma(W \cdot t_{i} + b)$$

$$\theta_{i} = \lambda_{i} \circ \theta_{0}$$

$$\hat{g}_{c} = \text{MLP}(g_{c}), \quad \forall c = 1, \ldots, J$$

$$L_{d} = -\sum_{c=1}^{J} \log \frac{\exp(s_{c} \cdot \hat{g}_{c} / \tau)}{\sum_{j=1}^{J} \exp(s_{c} \cdot \hat{g}_{j} / \tau)}$$

$$\theta_{i} \leftarrow \theta_{i} - \alpha \nabla_{\theta} L_{T_{i}}(f_{\theta}, S_{i})$$

$$\Phi \leftarrow \Phi - \beta \nabla_{\Phi} \sum_{i}^{I} \big(L_{T_{i}}(f_{\theta_{i}}, Q_{i}) + \mu_{1} L_{l} + \mu_{2} L_{e} + \mu_{3} L_{d}\big)$$

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}$$

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_{i} - \hat{y}_{i}|$$

Code & Models

Repositories

zeonchen/sfmgtl (PyTorch, official)

Taxonomy

Topics: Traffic Prediction and Management Techniques · Human Mobility and Location-Based Analysis

Methods: Balanced Selection

Full text

Cross-City Traffic Prediction via Semantic-Fused Hierarchical Graph Transfer Learning

Kehua Chen, Jindong Han, Siyuan Feng*, and Hai Yang

Kehua Chen and Jindong Han are with the Division of Emerging Interdisciplinary Areas (EMIA), Interdisciplinary Programs Office, The Hong Kong University of Science and Technology, Hong Kong, China. Siyuan Feng and Hai Yang are with the Civil and Environmental Engineering Department, The Hong Kong University of Science and Technology. H. Yang is also with the Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China. * S. Feng is the corresponding author, E-mail: [email protected]

Abstract

Accurate traffic prediction benefits urban management and improves transportation efficiency. Recently, data-driven methods have been widely applied to traffic prediction and have outperformed traditional methods. However, data-driven methods normally require massive data for training, while data scarcity is ubiquitous in less developed or newly constructed regions. To tackle this problem, we can transfer meta-knowledge extracted from data-rich cities to data-scarce cities via transfer learning. Besides, relations among urban regions can be organized into various semantic graphs, e.g. proximity and POI similarity, which are rarely considered in previous studies. In this paper, we propose the Semantic-Fused Hierarchical Graph Transfer Learning (SF-HGTL) model to achieve knowledge transfer across cities with fused semantics. In detail, we employ hierarchical graph transformation followed by meta-knowledge retrieval to achieve knowledge transfer at various granularities. In addition, we introduce meta semantic nodes to reduce the number of parameters and to share information across semantics. Afterwards, the parameters of the base model are generated from the fused semantic embeddings, so that traffic status is predicted in a way that accounts for task heterogeneity. We conduct experiments on five real-world datasets and verify the effectiveness of our SF-HGTL model by comparing it with other baselines.

Index Terms:

Few-shot learning, Traffic prediction, Graph neural network.

I INTRODUCTION

As a vital problem in Intelligent Transportation Systems (ITS), traffic prediction aims to forecast future transportation status, e.g. traffic flow [1], traffic speed [2], and origin-destination demand [3]. Accurate traffic prediction benefits urban management and improves transportation efficiency. Owing to the explosion of data volume, various machine learning methods have been applied to traffic prediction and have outperformed traditional approaches [4].

Nonetheless, deep learning methods normally require massive data to achieve satisfactory performance, and a limited data size leads to over-fitting. Because data scarcity is ubiquitous in less developed cities and in new districts, training a powerful model there is non-trivial. Fortunately, traffic patterns share common features across different cities. For instance, traffic speed in commercial areas decreases during peak hours, and crowd flow from residential areas to commercial areas tends to surge in the morning. Therefore, if we can extract traffic patterns from data-rich cities and adapt the learned patterns to data-scarce cities, we can still achieve good performance with limited data. Such knowledge transfer among cities is called cross-city knowledge transfer and has attracted much attention in recent years. Generally, there are two paradigms in cross-city knowledge transfer. The first is the Divide-Match-Transfer principle [5, 6]: both source and target cities are first divided into several regions, and after matching, the target regions absorb knowledge from the most similar source regions. The second uses meta-learning to extract meta-knowledge from source cities and adapt that knowledge to target cities [7, 8]. In traffic prediction, meta-knowledge can be regarded as various traffic patterns.

In recent research, roads or stations in cities are treated as nodes in graphs, and their distance or POI similarity is regarded as edges. Since such topological structure determines traffic patterns, knowledge transfer among graphs is more informative than grid-based transfer. However, there are still several challenges in cross-city knowledge transfer. First, most current studies achieve knowledge transfer only at the local level, without considering coarse-scale knowledge. Here we define “local” as the node level in graphs, i.e. a region or grid in previous studies, and define “zone” as a relatively coarse level that consists of more than one node and encompasses general characteristics (Fig. 1). For instance, several subway stations can belong to one commercial zone from a physical proximity perspective, and can also belong to one functional category from a Point of Interest (POI) perspective. Local-level transfer loses such zonal information. Second, although various semantics have been employed in traffic prediction, since urban data can normally be organized as multiple semantic graphs, most current knowledge-transfer models do not consider different semantic information; with current approaches, we can merely concatenate the various semantic representations for prediction. Fig. 2 presents an example of the construction of various semantic graphs. Although each semantic graph has identical nodes, the nodes can be connected in different ways to encode different semantic meanings, e.g. physical proximity and POI similarity in the example. Third, compared to the Divide-Match-Transfer principle, meta-learning methods directly extract meta-knowledge and hence are more generalizable. However, previous studies extract spatial-temporal patterns only implicitly, along with the model parameters; they do not explicitly integrate information among patterns, let alone fuse knowledge from multiple semantics.

To tackle the aforementioned challenges, we propose a novel framework for spatio-temporal transfer learning named Semantic-Fused Hierarchical Graph Transfer Learning (SF-HGTL). SF-HGTL takes graphs as inputs and is able to handle various semantic graphs. To address the first challenge, we apply hierarchical graph transformation to each semantic graph to extract multi-level information. To extract meta-knowledge for knowledge transfer and address the third challenge, we design several meta-knowledge graphs that extract meta-knowledge at different levels; a graph structure is more informative than discrete memory vectors. To address the second challenge and reduce the number of parameters, different semantics share the same meta-knowledge graphs, whose parameters are adjusted by the corresponding meta semantic nodes. Finally, after the fusion of semantics, we employ a modulating function to generate task-specific parameters for the base learner. Note that the base learner can be any deep learning model, e.g. a Recurrent Neural Network or a Graph Neural Network, and we use the base learner to predict traffic status.

The main contributions of this paper are as follows:

• We design a novel cross-city knowledge transfer framework based on semantic-fused hierarchical graphs. The framework uses hierarchical graph transformation to achieve local, zonal and global feature representation, and we introduce meta-knowledge graphs to benefit information retrieval in the meta-testing phase.

• We introduce meta semantic nodes that not only adapt meta-knowledge to the various semantics, but also achieve information sharing among semantics, since all semantics use the same meta-knowledge graphs. A semantic discriminator is further applied to avoid trivial solutions.

• We conduct several experiments on five real-world datasets to verify the effectiveness of the proposed framework. The results demonstrate the superiority of our SF-HGTL model over baseline models.

The remainder of the paper is organized as follows. We briefly review related work in Section II, and introduce definitions and the problem formulation in Section III. The proposed model is presented in Section IV. Section V describes the experiments, results and discussions. Finally, we summarize the paper in Section VI.

II RELATED WORK

This paper is relevant to traffic prediction, transfer learning, and cross-city knowledge transfer. In this section, we briefly review related works for the above topics.

II-A Traffic Prediction

Traffic prediction is a classical problem that researchers have studied for several decades. Traditional statistical methods include Historical Average (HA) [9], Autoregressive Integrated Moving Average (ARIMA) [10] and its variants. Nonetheless, these traditional methods capture only linear relations and cannot adequately describe data patterns. Hence, recent studies tend to use deep learning methods owing to their expressiveness. Researchers have adopted various techniques to tackle traffic prediction problems, ranging from Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) to Graph Neural Networks (GNN).

Long Short-Term Memory (LSTM) was applied in [11] to predict traffic speed on a single road, and subsequent studies commonly use an RNN as a model component owing to its capacity to capture non-linear dynamics. Within a city, traffic status in one region is influenced by nearby regions. To model such relations, [12] arranged traffic speed into matrices and treated the matrices as image channels; a CNN was then applied to predict future traffic status. Yet, a CNN can only capture Euclidean relations. As an emerging technique, GNNs have been widely used in current models for traffic prediction. For example, [13] modeled traffic flow as a diffusion process on a directed graph and proposed DCRNN to forecast traffic speed. T-GCN [14] combined a Graph Convolutional Network (GCN) and a Gated Recurrent Unit (GRU) to predict traffic status. In addition, [15] proposed ST-MetaNet to capture spatial and temporal traffic correlations. The authors designed the model based on deep meta-learning and chose an RNN and a GNN as meta-learners.

Although deep learning methods achieve accurate traffic prediction, they typically require massive data. Hence, transfer learning provides an effective method to alleviate data scarcity.

II-B Transfer Learning

Traditional machine learning approaches have achieved great success in various practical scenarios, such as object detection and language processing. However, machine learning normally requires massive labeled data for training, and collecting labeled data is time-consuming and expensive.

Transfer learning transfers knowledge from source domains to improve learning performance in target domains. Hence, transfer learning is suitable for few-shot problems, i.e. rich data in source domains and scarce data in target domains. In the data-based interpretation [16], the main objective of transfer learning is to minimize the distribution difference between source and target domains, using strategies such as instance weighting [17] and feature transformation [18]. In the model-based interpretation [16], the aim is to predict accurately on the target domain by utilizing source knowledge. As an emerging method, meta-learning [19] has proved effective for transfer learning. In meta-learning, a number of tasks are drawn from a task distribution; the model learns a generalized meta-learner in the meta-training phase and adapts quickly in the meta-testing phase when new tasks arrive. As one of the best-known methods, MAML [20] treats the meta-learner as a parameter initialization learned by bi-level optimization; we use MAML as the basic framework in this paper. Besides, [21] argued that exploiting task discrepancy benefits model performance, and various modulating methods have been applied in recent studies, such as the automated relational meta-graph [22], the multimodal modulation network [23], and the hierarchical prototype graph [24].

While most previous transfer learning studies have focused on image- and text-related tasks, spatial-temporal transfer learning is at an early stage and remains poorly understood.

II-C Cross-City Knowledge Transfer

To the best of our knowledge, [5] first proposed the Divide-Match-Transfer principle. The authors partitioned the cities into equal-size grids, then matched target regions with the most correlated source regions. Afterwards, ConvLSTM was used as the backbone model to learn regional representations for prediction. At the fine-tuning stage, the model minimized the squared error between the regional representations of target regions and those of the matched source regions.

MetaST [7] designed a memory module to extract long-term patterns from source cities. The regions in source cities were clustered into several categories that serve as memory. Then, an attention mechanism was used to gather useful information from the memory during both the meta-training and meta-testing phases. ST-DANN [6] first mapped source and target data to a common embedding space and minimized the Maximum Mean Discrepancy (MMD) between the two embeddings. To further capture spatial dependencies, the model introduced a global attention mechanism, which can also be regarded as a matching process. [8] designed a model for graph-structured data. The model extracted meta-knowledge through a GRU and a Graph Attention Network (GAT). To express structural information, a graph construction loss was introduced. Finally, the node-level meta-knowledge was fed into a parameter generation module to produce non-shared feature extractor parameters. To alleviate negative transfer, [25] proposed selective cross-city transfer learning to filter harmful source knowledge. The authors employed edge-level and node-level adaptation to train the feature network, and designed a weighting network for loss calculation.

As mentioned in the introduction, there are still challenges and limitations in current studies, and this paper aims to solve them via semantic-fused hierarchical graph transfer learning.

III DEFINITION AND PROBLEM FORMULATION

In this section, we introduce several definitions used in this paper and formally illustrate the cross-city knowledge transfer problem.

Definition 1 (Graph): A graph is denoted as $\mathcal{G}=(\mathcal{V},\mathcal{E},\textbf{X},\textbf{A})$. $\mathcal{V}=\{v_{1},v_{2},\ldots,v_{N}\}$ is the node set, and $N=|\mathcal{V}|$; $\mathcal{E}=\{e_{ij}=(v_{i},v_{j})\}$ is the edge set; $\textbf{X}\in\mathds{R}^{N\times D}$ is the node feature matrix, where $D$ is the feature dimension of each node; $\textbf{A}$ is the adjacency matrix, in which each entry $a_{ij}$ indicates the weight of edge $e_{ij}$.

Definition 2 (Few-Shot Learning): Given a dataset $\mathcal{D}$, $\mathcal{D}^{tr}=\{X^{tr},Y^{tr}\}$ is the training set, which has only a few samples, and $\mathcal{D}^{ts}$ is the corresponding test set. The aim of few-shot learning is to train a model on the training set that minimizes the prediction error on the test set. Here, we treat the traffic prediction problem in data-scarce cities as a few-shot learning problem.

In this paper, we follow the Model-Agnostic Meta-Learning (MAML) [20] framework to handle the few-shot learning problem. As Fig. 3 shows, we use a source dataset $\mathcal{D}_{\mathcal{S}}$ to train the meta learner. $\mathcal{D}_{\mathcal{S}}$ is organized as a set of tasks, denoted as $\{\mathcal{T}_{i}\}_{i=1}^{I}$. Each task $\mathcal{T}_{i}$ includes a support set $\mathcal{S}_{i}$ and a query set $\mathcal{Q}_{i}$. MAML employs $\mathcal{S}_{i}$ and $\mathcal{Q}_{i}$ to achieve domain adaptation via bi-level optimization. In detail, let the initial parameters of the meta learner be $\theta_{0}$. In the inner update, we use $\mathcal{S}_{i}$ to update the model parameters to $\theta_{i}^{\prime}$. Then $\theta_{i}^{\prime}$ is used to calculate the loss on $\mathcal{Q}_{i}$, and the parameters are updated again to obtain $\theta_{i+1}$, i.e. the outer update. We use $\theta_{i+1}$ for the subsequent training. At the meta-testing stage, the target dataset $\mathcal{D}_{\mathcal{T}}$ is also organized as tasks: we take $\theta_{i+1}$ as the initial parameters, fine-tune them on $\mathcal{S}_{k}$, and test the model performance on $\mathcal{Q}_{k}$.
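
The inner and outer updates described above can be illustrated with a toy sketch. It assumes a scalar model y = w * x, hand-made tasks, and a first-order approximation of MAML (second derivatives are ignored), so it is a simplification rather than the training procedure used in the paper. The names `mse_grad` and `maml_step` and the learning rates are hypothetical.

```python
def mse_grad(w, data):
    """Gradient of the mean squared error for the scalar model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def maml_step(w0, tasks, alpha=0.1, beta=0.05):
    """One meta-update, first-order approximation (an assumed simplification)."""
    outer_grad = 0.0
    for support, query in tasks:
        # Inner update: adapt the shared initialization on the support set.
        w_task = w0 - alpha * mse_grad(w0, support)
        # Outer gradient: query-set loss evaluated at the adapted weight.
        outer_grad += mse_grad(w_task, query)
    # Outer update: move the initialization using the averaged query gradient.
    return w0 - beta * outer_grad / len(tasks)


# Two hypothetical tasks with true slopes 2 and 4: (support set, query set).
tasks = [
    ([(1.0, 2.0)], [(2.0, 4.0)]),
    ([(1.0, 4.0)], [(2.0, 8.0)]),
]
w = 0.0
for _ in range(200):
    w = maml_step(w, tasks)
```

In this toy setting the learned initialization settles between the two task slopes, which is the behavior one would expect from an initialization meant to adapt quickly to either task.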

Problem Definition 1 (Traffic Prediction): Consider $N$ sensors/recorders in a given city, where sensor/recorder $n$ measures the traffic status on a specific node (e.g. a road link or a subway station) at time $t$, denoted as $X_{n}^{t}$. Given the past $T$ records, the sequential traffic status can naturally be regarded as node features, and traffic prediction aims to find a function $g(\cdot)$ that predicts the next $M$ traffic statuses:

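$$\begin{bmatrix} X_{1}^{t-T+1} & X_{2}^{t-T+1} & \cdots & X_{N}^{t-T+1} \\ \vdots & \vdots & \ddots & \vdots \\ X_{1}^{t} & X_{2}^{t} & \cdots & X_{N}^{t} \end{bmatrix} \xrightarrow{g(\cdot)} \begin{bmatrix} X_{1}^{t+1} & X_{2}^{t+1} & \cdots & X_{N}^{t+1} \\ \vdots & \vdots & \ddots & \vdots \\ X_{1}^{t+M} & X_{2}^{t+M} & \cdots & X_{N}^{t+M} \end{bmatrix} \tag{1}$$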

The past $T$ traffic statuses of the $N$ nodes, together with the corresponding next $M$ traffic statuses, are regarded as one sample. Multiple samples constitute one task $\mathcal{T}_{i}$.
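
As a minimal sketch of how such samples could be assembled, the following assumes the raw records are stored as a list of per-time-step rows with one value per node. The function name `make_samples` and the toy numbers are illustrative and do not come from the paper.

```python
def make_samples(series, T, M):
    """Split a [time][node] series into (past T steps, next M steps) pairs."""
    samples = []
    # The last valid window must leave M future steps after its T past steps.
    for start in range(len(series) - T - M + 1):
        past = series[start:start + T]
        future = series[start + T:start + T + M]
        samples.append((past, future))
    return samples


# Toy series: 6 time steps, 2 nodes (hypothetical values).
series = [[t, 10 + t] for t in range(6)]
samples = make_samples(series, T=3, M=2)
```

Each element of `samples` pairs a T x N input window with its M x N target window; a task would then group several such pairs.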

Problem Definition 2 (Cross-City Traffic Prediction): Traffic prediction based on a few training samples can be regarded as a few-shot learning problem, and we solve it with meta-learning. Here we assume only one source city and one target city. The source city is denoted as $\mathcal{C}_{\mathcal{S}}=\{\mathcal{G}_{\mathcal{S}}^{c}\}_{c=1}^{J}$ and the target city as $\mathcal{C}_{\mathcal{T}}=\{\mathcal{G}_{\mathcal{T}}^{c}\}_{c=1}^{J}$, where $J$ is the number of different semantics and all semantics share the same node features $\mathbb{X}$, i.e. the past $T$ traffic statuses. The source city is data-rich while the target city is data-scarce. We aim to find a function $f(\cdot)$ that leverages knowledge of the source city to predict traffic status in the target city as accurately as possible.

IV METHODOLOGY

Fig. 4 presents the proposed model architecture. For each sample in task $\mathcal{T}_{i}$ with multiple semantic graphs, we generate a sample embedding via a semantic-fusion cell. In detail, we first feed the node features into a shared GRU and treat the hidden states as the new node features. Then, hierarchical graph transformation is applied to each semantic graph. At each transformation layer, the nodes retrieve meta-knowledge (MK) from several meta-knowledge graphs. To model the relations among various semantics, we further introduce several meta semantic nodes that adapt the parameters of the meta-knowledge graphs to each semantic. Afterwards, each semantic graph is transformed into a semantic embedding, and we concatenate the semantic embeddings into a sample embedding. We treat the average of the sample embeddings as the task embedding. The task embedding then interacts with a task meta-knowledge graph. By feeding the task embedding into a task modulating function, we obtain the adapted parameters for the meta learner.

IV-A Hierarchical Graph Transformation

Assume there are $k=1,...,K$ samples in support set $\mathcal{S}_{i}$, and each sample has $c=1,...,J$ semantics. Although various semantics are available in spatio-temporal prediction problems, the node features are normally fixed. Here we employ a shared GRU to generate node embeddings from the historical traffic status:

$$\mathbb{h}_{v_{i}}^{(0)}=\text{GRU}(\{X_{i}^{t-T+1},\ldots,X_{i}^{t}\}) \tag{2}$$

Afterwards, we apply Graph Neural Networks (GNN) to extract non-Euclidean information. A general GNN follows the message passing mechanism [26], which includes two processes: aggregation and updating. At the $l$-th layer, the GNN updates the node embedding $\mathbb{h}_{v_{i}}^{(l)}$ as follows:

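$$\mathbb{h}_{v_{i}}^{(l)}=\text{UPDATE}\Big(\mathbb{h}_{v_{i}}^{(l-1)},\ \text{AGG}^{(l)}\big(\{(\mathbb{h}_{v_{i}}^{(l-1)},\mathbb{h}_{v_{j}}^{(l-1)},\mathbb{A})\mid v_{j}\in\mathcal{N}(v_{i})\}\big)\Big) \tag{3}$$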

where $\mathcal{N}(v_{i})$ indicates the neighborhood of node $v_{i}$; UPDATE can be a neural network; and AGG can be mean or max pooling.
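
A minimal sketch of one message-passing layer follows, assuming mean pooling for AGG and a fixed 50/50 mix in place of a learned UPDATE network. The function name `mean_aggregate_layer`, the mixing weights, and the toy graph are assumptions for illustration only.

```python
def mean_aggregate_layer(h, neighbors):
    """One message-passing layer: AGG = neighbor mean, UPDATE = average with self."""
    new_h = []
    for i, h_i in enumerate(h):
        nbrs = neighbors[i]
        dim = len(h_i)
        if nbrs:
            agg = [sum(h[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        else:
            agg = [0.0] * dim  # isolated node: nothing to aggregate
        # UPDATE: a fixed 50/50 mix stands in for a learned neural network.
        new_h.append([0.5 * h_i[d] + 0.5 * agg[d] for d in range(dim)])
    return new_h


# Path graph 0 - 1 - 2 with 1-dimensional embeddings (hypothetical values).
h0 = [[0.0], [2.0], [4.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h1 = mean_aggregate_layer(h0, neighbors)
```

After one layer each embedding moves toward its neighborhood mean; stacking many such layers pulls all embeddings together, which is the over-smoothing effect that motivates a hierarchical design.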

However, GNNs are most effective at capturing local information, since deep GNNs suffer from the over-smoothing problem [27]. A hierarchical structure can provide a larger view while conserving node differences. [28] designed a hierarchical graph representation model to obtain graph embeddings. Basically, hierarchical graph transformation assigns nodes to clusters at each layer in order to reduce the number of nodes. Denote the learned cluster assignment matrix at layer $l$ as $\mathbb{S}^{(l)}\in\mathds{R}^{n_{l}\times n_{l+1}}$, the node embeddings at layer $l$ as $\mathbb{Z}^{(l)}$, the new node/cluster embedding matrix at layer $l+1$ as $\mathbb{X}^{(l+1)}$, and the adjacency matrix at layer $l$ as $\mathbb{A}^{(l)}$. The main process first generates node embeddings $\mathbb{Z}^{(l)}$ through an embedding GNN and a cluster assignment matrix $\mathbb{S}^{(l)}$ through a pooling GNN at layer $l$. Afterwards, the nodes at layer $l$ are assigned to clusters with new embeddings $\mathbb{X}^{(l+1)}$ and a new adjacency matrix $\mathbb{A}^{(l+1)}$ at layer $l+1$:

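$$\mathbb{Z}^{(l)}=\text{GNN}_{l,\text{embed}}(\mathbb{A}^{(l)},\mathbb{X}^{(l)}) \tag{4}$$

$$\mathbb{S}^{(l)}=\text{softmax}\big(\text{GNN}_{l,\text{pool}}(\mathbb{A}^{(l)},\mathbb{X}^{(l)})\big) \tag{5}$$

$$\mathbb{X}^{(l+1)}={\mathbb{S}^{(l)}}^{T}\mathbb{Z}^{(l)} \tag{6}$$

$$\mathbb{A}^{(l+1)}={\mathbb{S}^{(l)}}^{T}\mathbb{A}^{(l)}\mathbb{S}^{(l)} \tag{7}$$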

In short, we denote the $l$-th hierarchical graph transformation as $\text{HGT}(\mathbb{A}^{(l)},\mathbb{X}^{(l)})$. We use a two-layer transformation: the nodes after the first transformation represent zonal information, and the second transformation aggregates the whole graph into semantic information. Hierarchical graph transformation effectively clusters similar nodes together, so each node at a deeper level can be regarded as a coarse category carrying more general information, e.g. an urban functional zone.
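
One transformation step can be sketched in plain Python as follows. The embedding and pooling GNNs are not implemented here: `z` and `pool_scores` are hypothetical stand-ins for their outputs, and the helper names are illustrative. The sketch only shows the coarsening arithmetic, i.e. a row-wise softmax for the assignment matrix followed by the two matrix products.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax_rows(scores):
    out = []
    for row in scores:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def coarsen(adj, z, pool_scores):
    """Return (X_next, A_next, S) for one hierarchical transformation."""
    s = softmax_rows(pool_scores)          # soft assignment, n_l x n_{l+1}
    s_t = transpose(s)
    x_next = matmul(s_t, z)                # cluster embeddings: S^T Z
    a_next = matmul(matmul(s_t, adj), s)   # cluster adjacency: S^T A S
    return x_next, a_next, s


# 4 nodes on a path, pooled into 2 clusters (all values are hypothetical).
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pool_scores = [[5.0, -5.0], [5.0, -5.0], [-5.0, 5.0], [-5.0, 5.0]]
x_next, a_next, s = coarsen(adj, z, pool_scores)
```

With these scores the first two nodes fall almost entirely into one cluster and the last two into the other, so the coarsened graph has two "zone" nodes connected by the edge that crossed the cluster boundary.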

Besides, we add a hierarchical link prediction loss to encode nearby nodes closely, and a hierarchical entropy loss to force exclusive assignment:

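$$L_{l}=\|\mathbb{A}^{(l)}-\mathbb{S}^{(l)}{\mathbb{S}^{(l)}}^{T}\|_{F} \tag{8}$$

$$L_{e}=\frac{1}{n_{l}}\sum_{i=1}^{n_{l}}H(\mathbb{S}_{i}) \tag{9}$$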

where $||\cdot||_{F}$ is the Frobenius norm; $H(\cdot)$ is the entropy function; and $\mathbb{S}_{i}$ is the $i$-th row of $\mathbb{S}$.
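
The two auxiliary losses can be sketched as below. The sketch reads the link prediction loss as the Frobenius norm of the difference between the adjacency matrix and the product of the assignment matrix with its transpose, which is an interpretation rather than something stated explicitly in the text. The function names, the small `eps` inside the logarithm, and the toy matrices are assumptions.

```python
import math


def link_prediction_loss(adj, s):
    """Frobenius norm of A - S S^T (assumed reading of the link prediction loss)."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            ss_t = sum(s[i][k] * s[j][k] for k in range(len(s[0])))
            total += (adj[i][j] - ss_t) ** 2
    return math.sqrt(total)


def entropy_loss(s, eps=1e-12):
    """Mean row entropy of the assignment matrix S; eps avoids log(0)."""
    row_entropies = [-sum(p * math.log(p + eps) for p in row) for row in s]
    return sum(row_entropies) / len(s)


# Hypothetical 3-node graph with self-loops and two natural clusters.
block_adj = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # exclusive assignment
soft = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # maximally uncertain assignment
```

On this toy graph the exclusive assignment `hard` gives both losses a value of (numerically) zero, while the uniform assignment `soft` is penalized by both, which matches the stated purpose of the two terms.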

IV-B Meta-knowledge Graph Construction

Inspired by [22], we design multiple meta-knowledge graphs to extract knowledge at different levels, i.e. a node MK graph, a zone MK graph and a semantic MK graph. Here we take the node MK graph as an example. Specifically, we denote the node MK graph as $\mathcal{G}_{node}^{mk}(\mathbb{X}_{meta},\mathbb{A}_{M})$ with $N$ nodes. Both the node features $\mathbb{m}_{i}\in\mathbb{X}_{meta}$ and the adjacency matrix $\mathbb{A}_{M}(\mathbb{m}_{i},\mathbb{m}_{j})$ are learned during training:

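$$\mathbb{A}_{M}(\mathbb{m}_{i},\mathbb{m}_{j})=\sigma\big(\mathbb{W}(|\mathbb{m}_{i}-\mathbb{m}_{j}|/\gamma)+\mathbb{b}\big) \tag{10}$$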

where $\mathbb{m}_{i}$ and $\mathbb{m}_{j}$ are node features in the node MK graph; $\sigma$ is the sigmoid function; $\mathbb{W}$ and $\mathbb{b}$ are learnable parameters; and $\gamma$ is a scaling factor.
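
A minimal sketch of this learned adjacency follows, assuming that the weight is a vector mapping the scaled element-wise absolute difference of two meta-node features to a single edge score. The weight values here are fixed by hand, whereas in the model they would be learned; the function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mk_adjacency(meta_nodes, w, b, gamma):
    """A_M[i][j] = sigmoid(w . (|m_i - m_j| / gamma) + b)."""
    n = len(meta_nodes)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Element-wise absolute difference, scaled by gamma.
            diff = [abs(a - c) / gamma for a, c in zip(meta_nodes[i], meta_nodes[j])]
            adj[i][j] = sigmoid(sum(wk * dk for wk, dk in zip(w, diff)) + b)
    return adj


# Three hypothetical meta nodes; negative weights make distant nodes weakly linked.
meta_nodes = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
a_m = mk_adjacency(meta_nodes, w=[-1.0, -1.0], b=0.0, gamma=1.0)
```

Because the absolute difference is symmetric in its two arguments, the resulting adjacency matrix is symmetric for any choice of weights.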

Afterwards, the node-level graph of each semantic retrieves meta-knowledge from the node MK graph. Here we construct a super-graph $\mathcal{G}^{S}_{node}$ connecting sample nodes with nodes in $\mathcal{G}_{node}^{mk}$. The weight between node $\mathbb{h}_{v_{i}}$ and meta-node $\mathbb{m}_{i}$ is calculated from their Euclidean distance, normalized over all $N$ meta-nodes:

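$$\mathbb{A}_{C}(\mathbb{h}_{v_{i}},\mathbb{m}_{i})=\frac{\exp\big(-\|(\mathbb{h}_{v_{i}}-\mathbb{m}_{i})/\gamma\|_{2}^{2}/2\big)}{\sum_{n=1}^{N}\exp\big(-\|(\mathbb{h}_{v_{i}}-\mathbb{m}_{n})/\gamma\|_{2}^{2}/2\big)} \tag{11}$$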

As Fig. 5 presents, the node features and adjacency matrix of the super-graph $\mathcal{G}^{S}_{\text{node}}$ are $\mathbb{X}_{S}=\mathbb{X}_{node}\oplus\mathbb{X}_{meta}$ and $\mathbb{A}_{S}=\mathbb{A}_{node}\oplus\mathbb{A}_{C}\oplus\mathbb{A}_{C}^{T}\oplus\mathbb{A}_{M}$, respectively, where $\oplus$ indicates the concatenation operation and $T$ denotes the matrix transpose.
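
The super-graph assembly can be sketched as follows. Two assumptions are made: the cross weights are a softmax over meta-nodes of the negative scaled squared Euclidean distance, and the four adjacency pieces are concatenated in a 2 x 2 block layout with the node graph in the top-left and the MK graph in the bottom-right. The block layout is one plausible reading of the concatenation, and the function names and toy values are hypothetical.

```python
import math


def cross_adjacency(node_feats, meta_feats, gamma=1.0):
    """A_C: per-node softmax of -||(h - m) / gamma||^2 / 2 over the meta nodes."""
    a_c = []
    for h in node_feats:
        logits = [-sum(((hd - md) / gamma) ** 2 for hd, md in zip(h, m)) / 2.0
                  for m in meta_feats]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        a_c.append([e / total for e in exps])
    return a_c


def super_adjacency(a_node, a_c, a_m):
    """Assumed block layout: [[A_node, A_C], [A_C^T, A_M]]."""
    a_c_t = [list(col) for col in zip(*a_c)]
    top = [list(r1) + list(r2) for r1, r2 in zip(a_node, a_c)]
    bottom = [list(r1) + list(r2) for r1, r2 in zip(a_c_t, a_m)]
    return top + bottom


# 3 sample nodes and 2 meta nodes (all values are hypothetical).
node_feats = [[0.0, 0.0], [2.0, 2.0], [0.1, 0.0]]
meta_feats = [[0.0, 0.0], [2.0, 2.0]]
a_node = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
a_m = [[0.5, 0.2], [0.2, 0.5]]
a_c = cross_adjacency(node_feats, meta_feats)
a_s = super_adjacency(a_node, a_c, a_m)
```

The resulting matrix is square with one row per sample node followed by one row per meta-node, so a single GNN pass over it lets sample nodes and meta-nodes exchange messages.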

Then we feed the super-graph into a GNN to integrate node features with meta-knowledge. In our model, meta-knowledge retrieval is performed at each layer of the hierarchical graph transformation, i.e. $\hat{\mathbb{X}}^{(l)}=\text{GNN}^{(l)}(\mathbb{A}_{S}^{(l)},\mathbb{X}_{S}^{(l)})$. Therefore, we finally obtain a semantic embedding that combines the various levels of meta-knowledge:

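$$\hat{\mathbb{X}}^{(0)}=\text{GNN}^{(0)}(\mathbb{A}_{S}^{(0)},\mathbb{X}_{S}^{(0)}) \tag{12}$$

$$\mathbb{X}^{(1)},\mathbb{A}^{(1)}=\text{HGT}(\mathbb{A}^{(0)},\hat{\mathbb{X}}^{(0)}) \tag{13}$$

$$\hat{\mathbb{X}}^{(1)}=\text{GNN}^{(1)}(\mathbb{A}_{S}^{(1)},\mathbb{X}_{S}^{(1)}) \tag{14}$$

$$\mathbb{X}^{(2)},\mathbb{A}^{(2)}=\text{HGT}(\mathbb{A}^{(1)},\hat{\mathbb{X}}^{(1)}) \tag{15}$$

$$\mathbb{X}_{c}=\text{Mean}(\text{MLP}(\mathbb{X}^{(2)})) \tag{16}$$

$$\mathbb{g}_{c}=\text{GNN}^{(2)}(\mathbb{A}_{S}^{(2)},\mathbb{X}_{S}^{(2)}) \tag{17}$$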

where MLP indicates a fully-connected neural network; $\mathbb{X}_{c}$ denotes the semantic embedding before MK retrieval; $\mathbb{g}_{c}$ denotes the semantic embedding after MK retrieval; Eqs. 12, 14 and 17 represent the meta-knowledge retrieval process; and $\text{Mean}(\cdot)$ is mean pooling.

Subsequently, we obtain a graph embedding $\mathbb{g}_{c}$ for each semantic graph and fuse the semantic graph embeddings to generate a sample embedding; possible fusion operators include concatenation and attention mechanisms. Here we simply concatenate the semantic graph embeddings to generate the embedding of sample $k$. We refer to hierarchical graph transformation together with the semantic fusion process as the semantic fusion cell:

[TABLE]

Within a support set $\mathcal{S}_{i}$ , we apply mean pooling to sample embeddings to acquire a task embedding:

[TABLE]
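The two pooling steps, concatenation across semantics and mean pooling across the support set, can be sketched as follows. This is an illustrative sketch only; the function names and the toy embedding sizes are hypothetical.

```python
def sample_embedding(semantic_embeddings):
    """Concatenate the per-semantic graph embeddings g_c of one sample."""
    return [v for g in semantic_embeddings for v in g]

def task_embedding(sample_embeddings):
    """Mean-pool the sample embeddings of a support set, dimension by dimension."""
    n = len(sample_embeddings)
    return [sum(col) / n for col in zip(*sample_embeddings)]

# Toy support set: 2 samples, each with 2 semantic embeddings of size 2.
support = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 2.0], [5.0, 0.0]],
]
samples = [sample_embedding(s) for s in support]
task = task_embedding(samples)
```

Because the task embedding is a plain average, it has the same dimension as a sample embedding regardless of how many samples the support set contains.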

Afterwards, the task embedding further retrieves knowledge from a task MK graph to achieve task adaptation. The retrieval process is similar to the aforementioned super-graph construction method. Finally, we apply a modulating function to the initialized parameter $\theta_{0}$ to generate task-specific parameters:

[TABLE]

where $\mathbb{W}$ and $\mathbb{b}$ are learnable parameters; $\circ$ indicates element-wise multiplication. $\theta_{i}$ is then used as model parameters for prediction.
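One common form of such a modulating function is an element-wise gate computed from the task embedding. The sketch below assumes a sigmoid gate, $\theta_{i}=\sigma(\mathbb{W}\mathbb{t}_{i}+\mathbb{b})\circ\theta_{0}$ with $\mathbb{t}_{i}$ the task embedding; the choice of sigmoid, the symbol $\mathbb{t}_{i}$, and the name `modulate` are assumptions made for illustration, not details confirmed by the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def modulate(theta_0, task_emb, weight, bias):
    """Scale each initial parameter by a gate sigmoid(W t + b) (assumed form)."""
    gates = [
        sigmoid(sum(w * t for w, t in zip(row, task_emb)) + b)
        for row, b in zip(weight, bias)
    ]
    return [g * p for g, p in zip(gates, theta_0)]

theta_0 = [2.0, -4.0]              # shared initialization
task_emb = [1.0, -1.0]             # task embedding after MK retrieval
weight = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical 2x2 projection W
bias = [0.0, 0.0]
theta_i = modulate(theta_0, task_emb, weight, bias)
```

In this toy case both pre-activations are zero, so every gate equals 0.5 and each parameter is halved.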

IV-C Semantic Adjustment for Meta-knowledge Graph

As introduced above, the proposed model conducts meta-knowledge retrieval for each semantic, which increases the number of parameters significantly. To alleviate this problem, we design a semantic adjustment module that adapts parameters to the different semantics.

In detail, the semantic graphs share the same MK graphs, and we introduce meta semantic nodes $\mathbb{s}_{c}$ to adjust the MK graphs according to each semantic. Meta semantic nodes not only reduce the number of parameters but also establish connections among semantics. Similar to Eq. 11, we construct a super-graph from the MK graphs and meta semantic nodes, then apply a GNN to update the node features of the MK graphs with respect to the meta semantic nodes. The MK graphs then interact with the hierarchical graphs as described above. Hence, the meaning of Hierarchical in our SF-HGTL model is twofold: (i) we use hierarchical graph transformation to aggregate meta-knowledge from the node, zone and semantic levels; (ii) we use meta semantic nodes to hierarchically adjust the parameters of the MK graphs.

The features of meta semantic nodes are learned during the training phase to represent different semantics. However, the current framework captures only the discrepancy among the semantic adjacency matrices, so the signal information can vanish after the meta-knowledge retrieval process. As a result, the nodes on the MK graphs, rather than the meta semantic nodes, may end up representing the semantic information, which leads to a trivial solution.

To solve this problem, we further introduce a semantic discriminator to guide the learning of meta semantic nodes. Specifically, we treat meta semantic nodes as prototypes, i.e. we force each semantic graph embedding to approach its corresponding meta semantic node while staying away from those of other semantics:

[TABLE]

where $J$ is the number of semantics; $\tau$ is the temperature hyperparameter. We use an $\text{MLP}(\cdot)$ to increase the expressiveness.
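A prototype objective of this kind is often written as a temperature-scaled softmax cross-entropy. The sketch below assumes dot-product similarity between each graph embedding and the $J$ prototypes and omits the MLP projection; both the similarity measure and the name `prototype_loss` are assumptions, so the code shows the general technique rather than the exact loss used here.

```python
import math

def prototype_loss(graph_embs, prototypes, tau=0.5):
    """Mean cross-entropy of each embedding against its own prototype.

    Assumptions: dot-product similarity, softmax over all J prototypes,
    and graph_embs[c] belongs to prototypes[c].
    """
    total = 0.0
    for c, g in enumerate(graph_embs):
        logits = [sum(a * b for a, b in zip(g, s)) / tau for s in prototypes]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[c]
    return total / len(graph_embs)

prototypes = [[1.0, 0.0], [0.0, 1.0]]  # J = 2 meta semantic nodes
aligned = prototype_loss([[1.0, 0.0], [0.0, 1.0]], prototypes)
swapped = prototype_loss([[0.0, 1.0], [1.0, 0.0]], prototypes)
```

Embeddings that sit on their own prototype (`aligned`) receive a lower loss than embeddings that sit on the wrong one (`swapped`). A smaller `tau` sharpens this contrast.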

IV-D Meta-optimization

Following [20], we split the source-city datasets into a large number of training tasks $\mathcal{T}_{S}$ in the meta-training phase. A task $\mathcal{T}_{i}\in\mathcal{T}_{S}$ consists of a support set $\mathcal{S}_{i}$ and a query set $\mathcal{Q}_{i}$ , with $\mathcal{S}_{i}\cap\mathcal{Q}_{i}=\varnothing$ . Recall that one sample uses the past $T$ steps to predict the next $M$ steps of traffic status; each support set and query set consists of several such samples. Batches of tasks are then sampled from $\mathcal{T}_{S}$ to train our model by bi-level optimization. For task $\mathcal{T}_{i}$ , the model first acquires the adjusted parameters $\theta_{i}$ as described above, then uses $\mathcal{S}_{i}$ to inner-update $\theta_{i}$ :

[TABLE]

where $\alpha$ is the stepsize and $\mathcal{L}_{\mathcal{T}_{i}}$ is the task-specific loss function, e.g. Mean Square Error (MSE) or Mean Absolute Error (MAE) for our traffic prediction.

Afterwards, the meta-objective is the combination of task loss based on $\mathcal{Q}_{i}$ , hierarchical link prediction loss, hierarchical entropy loss and semantic loss:

[TABLE]

where $\Phi$ denotes all model parameters; $\beta$ is the stepsize; and $\mu_{1},\mu_{2},\mu_{3}$ are regularization factors. Algorithm 1 presents the above training process.
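The bi-level update can be illustrated on a deliberately tiny problem. The sketch below uses a scalar model $y=\theta x$ with an MSE task loss, for which the gradient through the inner update is available in closed form. It omits the task-specific modulation and the three auxiliary loss terms weighted by $\mu_{1},\mu_{2},\mu_{3}$, so it shows only the MAML-style inner and outer steps; all names and the toy tasks are hypothetical.

```python
def mse_grad(theta, data):
    """Gradient of mean((theta * x - y)^2) with respect to theta."""
    return sum(2 * x * (theta * x - y) for x, y in data) / len(data)

def meta_step(theta, tasks, alpha=0.1, beta=0.05):
    """One outer update of theta from a batch of (support, query) tasks."""
    meta_grad = 0.0
    for support, query in tasks:
        # Inner update on the support set with stepsize alpha.
        adapted = theta - alpha * mse_grad(theta, support)
        # Chain rule through the inner update: d(adapted) / d(theta).
        d_adapted = 1.0 - alpha * sum(2 * x * x for x, _ in support) / len(support)
        # Query-set gradient, propagated back to the initialization.
        meta_grad += mse_grad(adapted, query) * d_adapted
    # Outer update with stepsize beta.
    return theta - beta * meta_grad / len(tasks)

tasks = [
    ([(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]),  # task with slope 2
    ([(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]),  # task with slope 3
]
theta = 0.0
for _ in range(200):
    theta = meta_step(theta, tasks)
```

With these two symmetric tasks the learned initialization settles midway between the task slopes, from which one inner step moves toward either task. In a deep-learning framework the `d_adapted` factor is obtained by differentiating through the inner update automatically.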

V EXPERIMENT

To evaluate the effectiveness of our framework, we conduct experiments on five real-world datasets and compare the model performance with several baselines. The experiments cover taxi pickup/dropoff and metro passenger flow prediction, and our framework can be easily transferred to other traffic prediction tasks such as traffic speed and flow prediction. Moreover, an ablation study is conducted to indicate the importance of each component of our model. Finally, we conduct a sensitivity analysis to show the impact of various hyperparameters.

V-A Experiment Setup and Datasets

We choose taxi pickup/dropoff and metro passenger flow prediction tasks in this study. The taxi pickup/dropoff prediction involves three datasets: NY-Taxi, CHI-Taxi and DC-Taxi [25]. Following [25], we use proximity, road connectivity and POI graphs in the experiment. The metro passenger flow prediction involves two datasets: HZMetro and SHMetro [29]. Following [29], we use physical, similarity and correlation graphs. The statistics of the datasets are listed in Table I.

NY-Taxi dataset: The dataset is collected from New York taxis and covers the area between 74.059° and 73.863° in longitude and between 40.645° and 40.848° in latitude. There are 133 million records in total. The data ranges from January 1st 2016 to December 31st 2016, and the recording interval is 1 hour.

CHI-Taxi dataset: The dataset is collected from Chicago taxis and covers the area between 87.740° and 87.576° in longitude and between 41.766° and 42.013° in latitude. There are 24.5 million records in total. The data ranges from January 1st 2016 to December 31st 2016, and the recording interval is 1 hour.

DC-Taxi dataset: The dataset is collected from Washington taxis and covers the area between 77.127° and 76.926° in longitude and between 38.798° and 38.969° in latitude. There are 10 million records in total. The data ranges from January 1st 2016 to December 31st 2016, and the recording interval is 1 hour.

HZMetro dataset: The dataset is generated from transaction records of the Hangzhou metro system, which has a ridership of 2.35 million per day. The data ranges from January 1st 2019 to January 25th 2019, and the recording interval is 15 min.

SHMetro dataset: The dataset is generated from transaction records of the Shanghai metro system and contains 811.8 million records. The data ranges from July 1st 2016 to September 30th 2016, and the recording interval is 15 min.

We conduct four experiments to evaluate the model performance, i.e. NY-Taxi to DC-Taxi, CHI-Taxi to DC-Taxi, HZMetro to SHMetro, and SHMetro to HZMetro. For taxi prediction, we use the past 6 time steps to predict the next 5 time steps, including pickup and dropoff volumes. For metro prediction, we use the past 4 time steps to predict the next 4 time steps, including inflow and outflow volumes. Min-Max normalization and Z-score normalization are applied to pre-process the taxi and metro passenger flow data, respectively.
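The sample construction and the two normalization schemes can be sketched as follows. This is a minimal illustration on a single toy series. The function names are hypothetical, the Z-score uses the population standard deviation as an assumption, and the statistics are computed on the whole toy series for brevity, whereas a real pipeline would normally fit them on the training portion only.

```python
import statistics

def min_max(series):
    """Rescale a series to the range [0, 1]."""
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]

def z_score(series):
    """Standardise a series to zero mean and unit (population) variance."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    return [(v - mean) / std for v in series]

def make_samples(series, n_in, n_out):
    """Slide a window over the series; each sample is (past n_in, next n_out)."""
    last_start = len(series) - n_in - n_out
    return [
        (series[i:i + n_in], series[i + n_in:i + n_in + n_out])
        for i in range(last_start + 1)
    ]

volumes = [float(v) for v in range(12)]  # toy counts for one region
taxi_samples = make_samples(min_max(volumes), n_in=6, n_out=5)
metro_samples = make_samples(z_score(volumes), n_in=4, n_out=4)
```

A series of 12 steps yields 2 taxi-style samples (6 in, 5 out) and 5 metro-style samples (4 in, 4 out), since consecutive windows overlap and advance by one step.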

Evaluation metrics: Since the prediction problem is a regression problem, we use two widely applied metrics for evaluation: Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).

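$$\text{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_{i}-y_{i}\right|,\qquad\text{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i}-y_{i}\right)^{2}}$$

where $\hat{y}_{i}$ and $y_{i}$ are the predicted and true values of instance $i$ , respectively, and $n$ is the number of instances.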
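For reference, the two metrics can be computed as in the short sketch below; the toy values are made up for illustration.

```python
import math

def mae(y_true, y_pred):
    """Mean of the absolute prediction errors."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [3.0, 5.0, 2.0, 8.0]
y_pred = [2.0, 5.0, 4.0, 4.0]
mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
```

RMSE is never smaller than MAE on the same data and grows faster with large individual errors, which is why the two metrics can rank models differently when peak-hour errors dominate.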

Implementation details: The experiments are conducted on a Linux server with 2 NVIDIA Tesla V100 GPUs. We use TGCN [14] as the base learner to capture both temporal and spatial relations. The base learner has three TGCN layers with hidden sizes (16, 32, 32), followed by three MLP layers with hidden sizes (256, 128, 128). In addition, we choose k-GNN [30] to process graph data. The hidden size of SF-HGTL is 32, the number of centers for MK graphs is 5, the clustering ratio of hierarchical graph transformation is 0.3, and we let $\mu_{1}=\mu_{2}=\mu_{3}=0.1$ . The learning rates are 1e-3 and 5e-3 for inner and outer updating respectively, and Adam [31] is employed as the optimizer. The code is available at github.com/ckjzsa/SF-HGTL.

V-B Baselines

We first select several traditional statistical models and state-of-the-art deep learning methods for comparison:

• ARIMA [32]: Autoregressive Integrated Moving Average is a classic method for time-series prediction and merely considers linear relations among data.

• GRU [33]: Gated Recurrent Unit is a popular deep learning method to capture sequential dependency of data.

• TGCN [14]: TGCN employs GCN to capture the spatial features, followed by GRU to learn temporal dependency.

The above methods use only target-city data for model learning and do not involve any knowledge transfer. Next, we pick several transfer learning models for comparison:

• Fine-tuned Model: As a natural idea, we first train the base learner through source cities, and then fine-tune the model based on target cities.

• MAML [20]: As mentioned before, MAML uses bi-level optimization for parameter initialization, and the base learner is TGCN.

• ST-GFSL [8]: ST-GFSL generates non-shared parameters based on node-level meta-knowledge. Parameter matching is implemented to retrieve similar features from source cities.

• CrossTReS [25]: CrossTReS proposed a selective transfer learning framework to pick beneficial knowledge from the source domain by introducing a weighting network. Besides, it is suitable to handle multi-graph spatio-temporal prediction problems.

Note that the last two transfer learning methods focus on spatial-temporal knowledge transfer on graphs. Moreover, since most methods do not consider multiple graphs, we run multiple models in parallel, one per semantic graph, and feed their concatenated outputs to a prediction head to predict future traffic status.

V-C Experimental Results

Tables II and III show the experimental results of taxi pickup/dropoff and metro passenger flow prediction. The best result is shown in bold, and the second-best result is underlined. Generally, deep learning methods outperform traditional methods, and transfer-based methods further improve model performance. Although the ST-GFSL model considers spatial-temporal patterns, it does not utilize semantic information and hence performs worse than the CrossTReS and SF-HGTL models.

For taxi volume prediction, CrossTReS outperforms SF-HGTL for short-term prediction but has worse results for long-term prediction. One possible reason is that CrossTReS applies a selection mechanism to avoid negative transfer: the re-weighting method works well when the prediction period is short but suffers from greater uncertainty for long-term prediction. In contrast, our SF-HGTL stores meta-knowledge in the MK graphs and is better able to capture long-term traffic patterns. For metro passenger flow prediction, CrossTReS has relatively lower RMSE for the transfer from HZMetro to SHMetro, while our SF-HGTL model has lower MAE. Given that the HZMetro dataset is smaller than the SHMetro dataset, this suggests that CrossTReS captures peak traffic patterns better than SF-HGTL, whereas SF-HGTL is better at fitting general traffic patterns when the source domain is small. However, when the transfer is from SHMetro to HZMetro, i.e. the source domain has sufficient data, SF-HGTL outperforms other methods on almost every metric.

V-D Ablation Study

We further conduct several experiments to verify the efficacy of each component: (i) remove hierarchical graph transformation, denoted as $\text{SF-HGTL}_{\text{-HGT}}$ ; (ii) remove meta semantic nodes, denoted as $\text{SF-HGTL}_{\text{-MS}}$ ; (iii) keep only the task MK graph and remove the other MK graphs, denoted as $\text{SF-HGTL}_{\text{-MK}}$ . We run five parallel experiments on metro passenger flow prediction, transferring from Hangzhou to Shanghai. Fig. 6 presents the results of the five experiments. Removing hierarchical graph transformation significantly decreases the model performance, which indicates that information at different granularities is crucial to knowledge transfer. Moreover, the models without meta semantic nodes and MK graphs also perform worse than our SF-HGTL model, as these components store meta-knowledge extracted from various tasks.

V-E Sensitivity Analysis

In this section, we examine the influence of several key hyperparameters, i.e. hidden dimension, node number in MK graphs, and clustering ratio. Five experiments are conducted for each hyperparameter on metro passenger flow prediction, and we use the MAE of three periods to measure the performance. Fig. 7 presents the results. We find that: (i) hidden size 32 has the lowest MAE, while hidden size 128 has the highest MAE due to overfitting; (ii) the number of centers determines the capacity of MK graphs, and the analysis shows that 5 and 15 centers give the best MAE. Since more centers increase computational complexity, we choose 5 centers in this study; (iii) as a crucial hyperparameter, the hierarchical ratio reflects the clustering granularity: a high ratio means few clusters and vice versa. The results demonstrate that the model performance peaks when the hierarchical ratio is 0.3.

VI Conclusion

We propose a novel framework called SF-HGTL to achieve cross-city traffic prediction. SF-HGTL is based on the MAML framework with a modulating function that dynamically adjusts the parameters of the base learner according to task heterogeneity. To generate task embeddings, we employ hierarchical graph transformation on various semantic graphs and fuse semantic embeddings via concatenation. In addition, we utilize meta semantic nodes and meta-knowledge graphs to implement knowledge transfer at different levels. We conduct experiments on taxi pickup/dropoff volume and metro passenger flow prediction, and compare the model performance with several state-of-the-art methods. The results show that our SF-HGTL outperforms other baseline models. Besides, the ablation study and sensitivity analysis support the effectiveness of our model. Nonetheless, the current framework requires complete graphs in both source and target cities, while the graphs may change with the introduction of new nodes, e.g. the construction of new stations or roads. Hence, how to transfer knowledge from old nodes to new nodes under dynamic graphs is a non-trivial task. We leave this for future work.

Acknowledgment

We would like to acknowledge a grant from RGC Theme-based Research Scheme (TRS) T41-603/20R, and a research grant (project N_HKUST627/18) from the Hong Kong Research Grants Council under the NSFC/RGC Joint Research Scheme.

References


[1] Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, “Traffic flow prediction with big data: a deep learning approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 865–873, 2014.
[2] M. T. Asif, J. Dauwels, C. Y. Goh, A. Oran, E. Fathi, M. Xu, M. M. Dhanya, N. Mitrovic, and P. Jaillet, “Spatiotemporal patterns in large-scale traffic speed prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 2, pp. 794–804, 2013.
[3] D. Zhang, F. Xiao, M. Shen, and S. Zhong, “Dneat: A novel dynamic node-edge attention network for origin-destination demand prediction,” Transportation Research Part C: Emerging Technologies, vol. 122, p. 102851, 2021.
[4] D. A. Tedjopurnomo, Z. Bao, B. Zheng, F. Choudhury, and A. K. Qin, “A survey on modern deep neural network for traffic prediction: Trends, methods and challenges,” IEEE Transactions on Knowledge and Data Engineering, 2020.
[5] L. Wang, X. Geng, X. Ma, F. Liu, and Q. Yang, “Cross-city transfer learning for deep spatio-temporal prediction,” arXiv preprint arXiv:1802.00386, 2018.
[6] S. Wang, H. Miao, J. Li, and J. Cao, “Spatio-temporal knowledge transfer for urban crowd flow prediction via deep attentive adaptation networks,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 5, pp. 4695–4705, 2021.
[7] H. Yao, Y. Liu, Y. Wei, X. Tang, and Z. Li, “Learning from multiple cities: A meta-learning approach for spatial-temporal prediction,” in The World Wide Web Conference, 2019, pp. 2181–2191.
[8] B. Lu, X. Gan, W. Zhang, H. Yao, L. Fu, and X. Wang, “Spatio-temporal graph few-shot learning with cross-city knowledge transfer,” arXiv preprint arXiv:2205.13947, 2022.