Large-Scale Traffic Signal Control Using a Novel Multi-Agent Reinforcement Learning

Xiaoqiang Wang, Liangjun Ke, Zhimin Qiao, and Xinghua Chai

arXiv:1908.03761 · cs.LG · September 14, 2021

TL;DR

This paper introduces Co-DQL, a novel multi-agent reinforcement learning algorithm for large-scale traffic signal control, improving cooperation, stability, and efficiency in traffic management systems.

Contribution

The paper proposes Co-DQL, a scalable MARL method with enhanced cooperation and stability features, specifically designed for large-scale traffic signal control problems.

Findings

01 Outperforms state-of-the-art MARL algorithms in traffic scenarios

02 Reduces average vehicle waiting time significantly

03 Demonstrates stable and robust learning process

Abstract

Finding the optimal signal timing strategy is a difficult task for the problem of large-scale traffic signal control (TSC). Multi-Agent Reinforcement Learning (MARL) is a promising method to solve this problem. However, there is still room for improvement in extending to large-scale problems and modeling the behaviors of other agents for each individual agent. In this paper, a new MARL, called Cooperative double Q-learning (Co-DQL), is proposed, which has several prominent features. It uses a highly scalable independent double Q-learning method based on double estimators and the UCB policy, which can eliminate the over-estimation problem existing in traditional independent Q-learning while ensuring exploration. It uses mean field approximation to model the interaction among agents, thereby making agents learn a better cooperative strategy. In order to improve the stability and…

Tables

Table I: Parameter Settings for Simulator

Parameter Type	Value [unit of measure]
Normal driving time between two nodes	5 [t]
Initial vehicles in simulator	100 [veh]
New vehicles added	5;4;3 [veh/t]
Shortest route length	2 [n]
Longest route length	20 [n]
Signal agent action time interval	4 [t]
Initial random seed number	10

Table II: Model Performance in Global Random Traffic Flow Scenario

Method	Average Delay Time [t]	Mean Episode Reward
IQL	$148.500 (\pm 8.963)$	$- 11.602 (\pm 0.700)$
IDQL	$131.854 (\pm 7.534)$	$- 10.301 (\pm 0.589)$
DDPG	$111.057 (\pm 0.606)$	$- 8.676 (\pm 0.047)$
MA2C	$71.553 (\pm 0.5812)$	$- 5.590 (\pm 0.045)$
Co-DQL	$36.981 (\pm 0.509)$	$- 2.889 (\pm 0.040)$

Table III: Model Performance in Double-Ring Traffic Flow Scenario

Method	Average Delay Time [t]	Mean Episode Reward
IQL	$89.838 (\pm 5.645)$	$- 5.615 (\pm 0.353)$
IDQL	$83.921 (\pm 2.273)$	$- 5.245 (\pm 0.142)$
DDPG	$86.581 (\pm 1.182)$	$- 5.411 (\pm 0.074)$
MA2C	$58.857 (\pm 0.779)$	$- 3.679 (\pm 0.049)$
Co-DQL	$26.046 (\pm 0.751)$	$- 1.628 (\pm 0.047)$

Table IV: Model Performance in Four-Ring Traffic Flow Scenario

Method	Average Delay Time [t]	Mean Episode Reward
IQL	$168.526 (\pm 2.673)$	$- 7.900 (\pm 0.125)$
IDQL	$143.986 (\pm 3.761)$	$- 6.749 (\pm 0.176)$
DDPG	$116.823 (\pm 1.610)$	$- 5.476 (\pm 0.075)$
MA2C	$77.633 (\pm 0.660)$	$- 3.639 (\pm 0.031)$
Co-DQL	$37.174 (\pm 0.937)$	$- 1.743 (\pm 0.044)$

Table V: Model performance in real road network with asymmetric geometry

Metrics	IQL	IDQL	DDPG	MA2C	Co-DQL
Mean Episode Reward	$- 1160.52 (\pm 190.62)$	$- 1076.34 (\pm 193.53)$	$- 1296.68 (\pm 140.87)$	$- 1108.52 (\pm 83.41)$	$- 930.38 (\pm 87.45)$
Avg. Vehicle Speed [m/s]	$4.33 (\pm 0.49)$	$4.53 (\pm 0.45)$	$3.81 (\pm 0.35)$	$4.65 (\pm 0.23)$	$5.35 (\pm 0.26)$
Avg. Intersection Delay [s/veh]	$28.52 (\pm 5.55)$	$27.17 (\pm 5.42)$	$33.01 (\pm 4.50)$	$27.98 (\pm 2.57)$	$20.31 (\pm 2.55)$
Avg. Queue Length [veh]	$10.03 (\pm 2.05)$	$10.01 (\pm 2.12)$	$12.53 (\pm 1.87)$	$9.80 (\pm 1.21)$	$7.51 (\pm 1.29)$
Trip Delay [s]	$278.38 (\pm 35.35)$	$254.20 (\pm 46.04)$	$311.34 (\pm 30.01)$	$253.23 (\pm 14.01)$	$177.73 (\pm 16.70)$
Trip Arrived Rate	$0.74 (\pm 0.08)$	$0.80 (\pm 0.07)$	$0.57 (\pm 0.05)$	$0.79 (\pm 0.03)$	$0.91 (\pm 0.03)$

Equations

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_{t} + \alpha \left(Y_{t}^{Q} - Q(s_{t}, a_{t}; \boldsymbol{\theta}_{t})\right) \nabla_{\boldsymbol{\theta}_{t}} Q(s_{t}, a_{t}; \boldsymbol{\theta}_{t}),$$

$$Y_{t}^{Q} \equiv r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \boldsymbol{\theta}_{t}),$$

$$V_{k}^{\boldsymbol{\pi}}(s) = E^{\boldsymbol{\pi}}\left\{\sum_{t=0}^{\infty} \gamma^{t} r_{k}(t+1) \,\middle|\, s(0) = s\right\},$$

$$J_{k}^{\boldsymbol{\pi}}(s) = \lim_{T \to \infty} \frac{1}{T} E^{\boldsymbol{\pi}}\left\{\sum_{t=0}^{T} r_{k}(t+1) \,\middle|\, s(0) = s\right\}.$$

$$Q_{k}^{\boldsymbol{\pi}}(s, \boldsymbol{a}) = r_{k}(s, \boldsymbol{a}) + \gamma\, \mathbb{E}_{s^{\prime} \sim p}\left[V_{k}^{\boldsymbol{\pi}}(s^{\prime})\right],$$

$$E\left\{r_{k} \mid \pi_{1}, \ldots, \pi_{k}, \ldots, \pi_{N}\right\} \leq E\left\{r_{k} \mid \pi_{1}, \ldots, \pi_{k}^{*}, \ldots, \pi_{N}\right\}, \quad \forall \pi_{k}.$$

$$Q_{k}^{\mathfrak{a}}(s, a) \leftarrow Q_{k}^{\mathfrak{a}}(s, a) + \alpha \left(r_{k} + \gamma Q_{k}^{\mathfrak{b}}(s^{\prime}, a_{k}^{*}) - Q_{k}^{\mathfrak{a}}(s, a)\right),$$

$$\boldsymbol{\theta}^{\prime} \leftarrow \tau \boldsymbol{\theta} + (1 - \tau) \boldsymbol{\theta}^{\prime},$$

$$a_{k} = \operatorname{argmax}_{c \in \boldsymbol{A}_{k}} \left\{Q_{k}(s_{k}, c) + \frac{\ln R_{s_{k}}}{R_{s_{k}, c}}\right\},$$

$$Q_{k}(s_{k}, \boldsymbol{a}) = \mathrm{E}_{l \sim d}\left[Q_{k}(s_{k}, a_{k}, a_{l})\right],$$

$$a_{l} = \overline{a}_{k} + \delta_{l,k},$$

$$\begin{aligned}
Q_{k}(s_{k}, \boldsymbol{a}) &= \mathrm{E}_{l \sim d}\left[Q_{k}(s_{k}, a_{k}, a_{l})\right] \\
&= \mathrm{E}_{l \sim d}\left[Q_{k}(s_{k}, a_{k}, \overline{a}_{k}) + \nabla Q_{k}(s_{k}, a_{k}, \overline{a}_{k}) \cdot \delta_{l,k} + \frac{1}{2}\, \delta_{l,k} \cdot \nabla^{2} Q_{k}(s_{k}, a_{k}, \xi_{l,k}) \cdot \delta_{l,k}\right] \\
&= Q_{k}(s_{k}, a_{k}, \overline{a}_{k}) + \nabla Q_{k}(s_{k}, a_{k}, \overline{a}_{k}) \cdot \mathrm{E}_{l \sim d}\left[\delta_{l,k}\right] + \frac{1}{2}\, \mathrm{E}_{l \sim d}\left[\delta_{l,k} \cdot \nabla^{2} Q_{k}(s_{k}, a_{k}, \xi_{l,k}) \cdot \delta_{l,k}\right] \\
&= Q_{k}(s_{k}, a_{k}, \overline{a}_{k}) + \frac{1}{2}\, \mathrm{E}_{l \sim d}\left[R_{k}(a_{l})\right] \\
&\approx Q_{k}(s_{k}, a_{k}, \overline{a}_{k}),
\end{aligned}$$

$$\hat{r}_{k} = r_{k} + \alpha \cdot \sum_{i \in \mathcal{N}(k)} r_{i},$$

$$\hat{s}_{k} = \left\langle s_{k}, \frac{1}{N_{k}} \sum_{i \in \mathcal{N}(k)} s_{i} \right\rangle,$$

$$\ell(\phi_{k}) = \left(Q_{k}^{\mathfrak{a}}(\hat{s}_{k}, a_{k}, \overline{a}_{k}; \phi) - Y_{k}^{\text{Co-DQL}}\right)^{2},$$

$$Y_{k}^{\text{Co-DQL}} = \hat{r}_{k} + \gamma\, Q_{k}^{\mathfrak{b}}\left(\hat{s}_{k}^{\prime}, \operatorname{argmax}_{a_{k}} Q_{k}^{\mathfrak{a}}(\hat{s}_{k}^{\prime}, a_{k}, \overline{a}_{k}; \phi), \overline{a}_{k}^{\prime}; \phi_{-}\right),$$

$$L(\phi_{k}) = \frac{1}{M} \sum \left(Q_{k}^{\mathfrak{a}}(\hat{s}_{k}, a_{k}, \overline{a}_{k}; \phi) - Y_{k}^{\text{Co-DQL}}\right)^{2}$$

$$\phi_{k}^{-} \leftarrow \tau \phi_{k} + (1 - \tau) \phi_{k}^{-}$$

$$Q_{k}^{\mathfrak{a}}(s, a_{k}, \overline{a}_{k}) \leftarrow (1 - \alpha)\, Q_{k}^{\mathfrak{a}}(s, a_{k}, \overline{a}_{k}) + \alpha \left(r + \gamma Q_{k}^{\mathfrak{b}}(s^{\prime}, a_{k}^{*}, \overline{a}_{k})\right)$$

$$Q_{k}^{\mathfrak{b}}(s, a_{k}, \overline{a}_{k}) \leftarrow (1 - \alpha)\, Q_{k}^{\mathfrak{b}}(s, a_{k}, \overline{a}_{k}) + \alpha \left(r + \gamma Q_{k}^{\mathfrak{a}}(s^{\prime}, b_{k}^{*}, \overline{a}_{k})\right),$$

$$\Delta_{t+1}(x) = (1 - \alpha_{t}(x))\, \Delta_{t}(x) + \alpha_{t}(x) F_{t}(x)$$

$$\Delta_{t}(s, a) = Q_{t}^{\mathfrak{a}}(s, a) - Q_{*}(s, a)$$

$$F_{t}(s_{t}, a_{t}) = r_{t} + \gamma Q_{t}^{\mathfrak{b}}(s_{t+1}, a^{*}) - Q_{*}(s_{t}, a_{t}),$$

$$F_{t}(s_{t}, a_{t}) = F_{t}^{Q}(s_{t}, a_{t}) + \gamma \left(Q_{t}^{\mathfrak{b}}(s_{t+1}, a^{*}) - Q_{t}^{\mathfrak{a}}(s_{t+1}, a^{*})\right),$$

$$\Delta_{t+1}^{\mathfrak{ba}}(s_{t}, a_{t}) = \Delta_{t}^{\mathfrak{ba}}(s_{t}, a_{t}) + \alpha_{t} F_{t}^{\mathfrak{b}}(s_{t}, a_{t}), \text{ or}$$

$$\Delta_{t+1}^{\mathfrak{ba}}(s_{t}, a_{t}) = \Delta_{t}^{\mathfrak{ba}}(s_{t}, a_{t}) - \alpha_{t} F_{t}^{\mathfrak{b}}(s_{t}, a_{t}),$$

$$\begin{aligned}
E\left[\Delta_{t+1}^{\mathfrak{ba}}(s_{t}, a_{t}) \mid \Im_{t}\right] &= \Delta_{t}^{\mathfrak{ba}}(s_{t}, a_{t}) + E\left[\alpha_{t} F_{t}^{\mathfrak{b}}(s_{t}, a_{t}) - \alpha_{t} F_{t}^{\mathfrak{a}}(s_{t}, a_{t}) \mid \Im_{t}\right] \\
&= \Delta_{t}^{\mathfrak{ba}}(s_{t}, a_{t}) + E\left[\alpha_{t} \gamma \left(Q_{t}^{\mathfrak{a}}(s_{t+1}, b^{*}) - Q_{t}^{\mathfrak{b}}(s_{t+1}, a^{*})\right) - \alpha_{t} \left(Q_{t}^{\mathfrak{b}}(s_{t}, a_{t}) - Q_{t}^{\mathfrak{a}}(s_{t}, a_{t})\right) \mid \Im_{t}\right] \\
&= \left(1 - \xi_{t}^{\mathfrak{ba}}(s_{t}, a_{t})\right) \Delta_{t}^{\mathfrak{ba}}(s_{t}, a_{t}) + \xi_{t}^{\mathfrak{ba}}(s_{t}, a_{t})\, E\left[F_{t}^{\mathfrak{ba}}(s_{t}, a_{t}) \mid \Im_{t}\right],
\end{aligned}$$

$$\left| E\left[F_{t}^{\mathfrak{ba}}(s_{t}, a_{t}) \mid \Im_{t}\right] \right| \leq \gamma E\left[Q_{t}^{\mathfrak{a}}(s_{t+1}, a^{*}) - Q_{t}^{\mathfrak{b}}(s_{t+1}, a^{*}) \mid \Im_{t}\right] \leq \left\| \Delta_{t}^{\mathfrak{ba}} \right\|.$$

$$\left| E\left[F_{t}^{\mathfrak{ba}}(s_{t}, a_{t}) \mid \Im_{t}\right] \right| \leq \gamma E\left[Q_{t}^{\mathfrak{b}}(s_{t+1}, b^{*}) - Q_{t}^{\mathfrak{a}}(s_{t+1}, b^{*}) \mid \Im_{t}\right] \leq \left\| \Delta_{t}^{\mathfrak{ba}} \right\|.$$

$$s_{t+1} = f(s_{t}, a_{t}),$$

Taxonomy

Methods: Double Q-learning · Q-Learning

Full text

Large-scale Traffic Signal Control Using a Novel Multi-Agent Reinforcement Learning

Xiaoqiang Wang, Liangjun Ke, Zhimin Qiao, and Xinghua Chai

X. Wang, L. Ke and Z. Qiao are with the State Key Laboratory for Manufacturing Systems Engineering, School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, 710049, China (corresponding author: Liangjun Ke). X. Wang and X. Chai are with the CETC Key Laboratory of Aerospace Information Applications, Shijiazhuang, Hebei, China.

Abstract

Finding the optimal signal timing strategy is a difficult task for the problem of large-scale traffic signal control (TSC). Multi-Agent Reinforcement Learning (MARL) is a promising method to solve this problem. However, there is still room for improvement in extending to large-scale problems and modeling the behaviors of other agents for each individual agent. In this paper, a new MARL, called Cooperative double Q-learning (Co-DQL), is proposed, which has several prominent features. It uses a highly scalable independent double Q-learning method based on double estimators and the upper confidence bound (UCB) policy, which can eliminate the over-estimation problem existing in traditional independent Q-learning while ensuring exploration. It uses mean field approximation to model the interaction among agents, thereby making agents learn a better cooperative strategy. In order to improve the stability and robustness of the learning process, we introduce a new reward allocation mechanism and a local state sharing method. In addition, we analyze the convergence properties of the proposed algorithm. Co-DQL is applied to TSC and tested on various traffic flow scenarios of TSC simulators. The results show that Co-DQL outperforms the state-of-the-art decentralized MARL algorithms in terms of multiple traffic metrics.

Index Terms:

Traffic signal control, mean field approximation, multi-agent reinforcement learning, double estimators.

I Introduction

Traffic congestion is becoming a serious problem in urban areas, mainly because limited road resources (e.g., road width) are difficult to utilize effectively. By regulating the traffic flow of the road network, traffic signal control (TSC) at intersections plays an important role in utilizing road resources and helping to reduce traffic congestion [1].

Many researchers have devoted efforts to TSC, with the aim of minimizing the average waiting time in the whole traffic system and maximizing social welfare [2]. When the number of traffic signals is large, traditional control methods such as pre-timed [3] and actuated control systems [4] may fail to deal with the dynamics of traffic conditions or lack the ability to foresee traffic flow. Intelligent computing methods (such as genetic algorithms [5], swarm intelligence [6], and neuro-fuzzy networks [7] [8]), however, in many cases suffer from a slow convergence rate. Reinforcement learning (RL) [9] is a promising adaptive decision-making method in many fields and has been applied to TSC [10] [11]. It can not only make real-time decisions according to traffic flow, but also predict future traffic flow. Especially in recent years, RL has made tremendous progress, which is largely attributable to the success of deep learning [12]. By using a deep neural network to approximate the value function or action-value function (as in DQN [13] and DDPG [14]), RL can be adapted to problems with large-scale state or action spaces.

As for TSC with multiple signalized intersections, a straightforward idea is the centralized approach, in which TSC is considered as a single-agent learning problem [5] [15]. However, a centralized approach often needs to collect all traffic data in the network as the global state [16], which may lead to high latency and a high failure rate. In addition, as the number of intersections increases, the joint state space and action space of the agent grow exponentially, which incurs the curse of dimensionality. Consequently, a centralized method often imposes a very heavy computational and communication burden.

An alternative is multi-agent reinforcement learning (MARL), in which each signalized intersection is regarded as an agent. A challenge for a MARL approach is how to respond to the dynamic interaction between each signal agent and the environment, which significantly affects the adaptive decision-making of other signals [17]. Moreover, most current MARL methods have only been studied on traffic networks of very limited size [18] [19]. However, in urban traffic systems, it is often necessary to consider all the signals in a globally coordinated manner. In [20] [21], each signal is regarded as an independent agent for training. Although this class of approaches can easily be extended to large-scale scenarios, they ignore the actions of other agents in the road network and implicitly assume that the environment is static. This makes it difficult for agents to learn favorable strategies with a convergence guarantee. In [22], a max-plus method is proposed to deal with the large-scale TSC problem, but this approach requires additional computation during execution. Multi-agent A2C [23], developed from IA2C, is a scalable, decentralized MARL algorithm, but it may be difficult to determine the appropriate attenuation factor to weaken the state and reward information from other agents.

In this work, we present a decentralized and scalable MARL method named Cooperative double Q-learning (Co-DQL) and apply it to TSC. The new approach adopts a highly scalable independent double Q-learning method, with the aim of avoiding the over-estimation problem suffered by traditional independent Q-learning [24]. At the same time, it ensures exploration by using the upper confidence bound (UCB) [25] rule. In order to make agents learn a better cooperative strategy for large-scale problems, it employs mean field theory [26], which has been studied in [27]. It approximately treats the interactions within the population of agents as the interaction between a single agent and a virtual agent averaged over the other individuals, which potentially transmits action information among all agents in the environment. Furthermore, we introduce a new reward allocation mechanism and a local state sharing method to make the learning process of agents more stable and robust. To theoretically support the effectiveness of the proposed algorithm, we provide a convergence proof under some mild conditions. Numerical experiments are performed on various traffic flow scenarios of TSC simulators. The empirical results show that the proposed method outperforms several state-of-the-art decentralized MARL algorithms in terms of multiple traffic metrics.

The paper is organised into six sections. Section II describes the background on RL. Section III presents the proposed method and analyzes its convergence properties. Section IV introduces the application of Co-DQL to the TSC problem. Section V describes the setup and conditions of the experiments in detail, and provides a comparative analysis and discussion of the experimental results. Section VI summarises this paper.

II Background on Reinforcement Learning

II-A Single-Agent RL

Q-learning is one of the most popular RL methods; it solves sequential decision-making problems by learning estimates of the optimal value of each action. The optimal value can be expressed as $Q^{*}(s,a)=\max_{\pi}Q^{\pi}(s,a)$. However, it is not easy to learn the values of all actions in all states when the state space or action space is large. In this case, we can learn a parameterized action-value function $Q(s,a;\boldsymbol{\theta})$. When taking action $a_{t}$ in state $s_{t}$ and observing the immediate reward $r_{t+1}$ and resulting state $s_{t+1}$, standard Q-learning updates the parameters as follows:

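$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_{t} + \alpha \left(Y_{t}^{Q} - Q(s_{t}, a_{t}; \boldsymbol{\theta}_{t})\right) \nabla_{\boldsymbol{\theta}_{t}} Q(s_{t}, a_{t}; \boldsymbol{\theta}_{t}), \tag{1}$$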

where $t$ is the time step, $\alpha$ is the learning rate and the target $Y^{Q}_{t}$ is defined as:

$$Y_{t}^{Q} \equiv r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \boldsymbol{\theta}_{t}), \tag{2}$$

where the constant $\gamma\in[0,1)$ is the discount factor that trades off the importance of immediate and later rewards. With repeated updates, the estimate can converge to the optimal action-value function.

Note that Q-learning approximates the value of the next state by maximizing over the estimated action values in that state, namely $\max_{a}Q_{t}(s_{t+1},a;\boldsymbol{\theta}_{t})$. This quantity is an estimate of $E\left\{\max_{a}Q_{t}\left(s_{t+1},a;\boldsymbol{\theta}_{t}\right)\right\}$, which in turn is used to approximate $\max_{a}E\left\{Q_{t}\left(s_{t+1},a;\boldsymbol{\theta}_{t}\right)\right\}$. This way of approximating the maximum expected value has a positive bias [24] [28] [29], which leads to over-estimation of the optimal value and may damage performance.
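The gap between the expectation of a maximum and the maximum of expectations can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes ten actions whose true values are all zero and whose estimates carry independent unit-variance Gaussian noise (the action count, noise model, trial count and seed are all illustrative assumptions). It compares the single-estimator maximum with a two-estimator variant that selects the action with one set of estimates and evaluates it with the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10_000

# Illustrative assumption: every action has true value 0, and each estimate
# is the true value plus independent unit-variance Gaussian noise.
est_a = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
est_b = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

# Single estimator: take the maximum over one set of noisy estimates.
single = est_a.max(axis=1).mean()

# Two estimators: select the action with est_a, evaluate it with est_b.
best = est_a.argmax(axis=1)
double = est_b[np.arange(n_trials), best].mean()

print(f"single estimator: {single:.3f}")
print(f"two estimators:   {double:.3f}")
```

Under these assumptions the single-estimator average comes out clearly positive even though every true value is zero, while the two-estimator average stays close to zero.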

II-B Multi-Agent RL

Single-agent RL is based on Markov decision process (MDP) theory, while MARL mainly stems from the Markov game [30], which generalizes the MDP and was proposed as the standard framework for MARL [31].

We can formalize a Markov game as a tuple $(N,\boldsymbol{S},\boldsymbol{A}_{1,2,\ldots,N},r_{1,2,\ldots,N},p)$, where $N$ is the number of agents in the game system; $\boldsymbol{S}=\{\boldsymbol{s}_{1},\ldots,\boldsymbol{s}_{n}\}$ is a finite set of system states, with $n$ being the number of states in the system; $\boldsymbol{A}_{k}$ is the action set of agent $k\in\{1,\ldots,N\}$; $r_{k}:\boldsymbol{S}\times\boldsymbol{A}_{1}\times\ldots\times\boldsymbol{A}_{N}\times\boldsymbol{S}\rightarrow\mathbb{R}$ is the reward function of agent $k$, determining the immediate reward; and $p:\boldsymbol{S}\times\boldsymbol{A}_{1}\times\ldots\times\boldsymbol{A}_{N}\rightarrow\mu(\boldsymbol{S})$ is the transition function. Each agent has its own strategy and chooses actions according to it. Under the joint strategy $\boldsymbol{\pi}\triangleq(\pi_{1},\ldots,\pi_{N})$, at each time step the system state changes as a result of the joint action $\boldsymbol{a}=(a_{1},\ldots,a_{N})$ selected according to the joint strategy, and each agent receives an immediate reward as the consequence of that joint action. To measure the performance of a strategy, either the future discounted reward or the average reward over time can be used; both depend on the policies of the other agents. This results in the following definition of the expected discounted reward for agent $k$ under a joint policy $\boldsymbol{\pi}$ and initial state $\boldsymbol{s}(0)=\boldsymbol{s}\in\boldsymbol{S}$:

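$$V_{k}^{\boldsymbol{\pi}}(s) = E^{\boldsymbol{\pi}}\left\{\sum_{t=0}^{\infty} \gamma^{t} r_{k}(t+1) \,\middle|\, s(0) = s\right\}, \tag{3}$$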

while the average reward for agent $k$ under this joint policy is defined as:

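$$J_{k}^{\boldsymbol{\pi}}(s) = \lim_{T \to \infty} \frac{1}{T} E^{\boldsymbol{\pi}}\left\{\sum_{t=0}^{T} r_{k}(t+1) \,\middle|\, s(0) = s\right\}. \tag{4}$$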
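To make the two performance measures concrete, the sketch below evaluates both on one finite reward sequence for a single agent. It is only an illustration: the definitions above take an expectation over trajectories and an infinite horizon, whereas this example assumes a single sampled trajectory truncated at a finite length, and the function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r(t+1) over one finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def average_reward(rewards):
    """Mean reward per step over one finite reward sequence."""
    return sum(rewards) / len(rewards)


# Assumed sample: three steps, each with reward 1.
rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25
print(average_reward(rewards))
```

The discounted form weights early rewards more heavily as `gamma` shrinks, while the average form treats every step equally.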

On the basis of Eq. (3) (the most used form), the action-value function $Q_{k}^{\boldsymbol{\pi}}:\boldsymbol{S}\times\boldsymbol{A}_{1}\times\ldots\times\boldsymbol{A}_{N}\rightarrow\mathbb{R}$ of agent $k$ under the joint strategy $\boldsymbol{\pi}$ can be written as follows according to the Bellman equation:

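$$Q_{k}^{\boldsymbol{\pi}}(s, \boldsymbol{a}) = r_{k}(s, \boldsymbol{a}) + \gamma\, \mathbb{E}_{s^{\prime} \sim p}\left[V_{k}^{\boldsymbol{\pi}}(s^{\prime})\right], \tag{5}$$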

where $V_{k}^{\boldsymbol{\pi}}(s)=\mathbb{E}_{\boldsymbol{a}\sim\boldsymbol{\pi}}\left[Q_{k}^{\boldsymbol{\pi}}(s,\boldsymbol{a})\right]$ and $s^{\prime}$ is the system state at the next time step. The commonly used MARL methods are generally based on Q-learning. The general multi-agent Q-learning framework is shown in Algorithm 1.

MARL enables each agent to learn a strategy that maximizes its cumulative reward. However, the value function of each agent depends on the joint strategy $\boldsymbol{\pi}$ of all agents, so it is in general impossible for all players in a game to maximize their payoffs simultaneously. For MARL, an important solution concept is the Nash equilibrium. The best response of agent $k$ to a vector of opponent strategies is defined as the strategy $\pi_{k}^{*}$ that achieves the maximum expected reward given those opponent strategies:

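$$E\left\{r_{k} \mid \pi_{1}, \ldots, \pi_{k}, \ldots, \pi_{N}\right\} \leq E\left\{r_{k} \mid \pi_{1}, \ldots, \pi_{k}^{*}, \ldots, \pi_{N}\right\}, \quad \forall \pi_{k}. \tag{6}$$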

Then the Nash equilibrium is represented by a joint strategy $\boldsymbol{\pi}^{*}\triangleq\left(\pi_{1}^{*},\ldots,\pi_{N}^{*}\right)$ in which each agent acts with the best response $\pi_{k}^{*}$ to others and all other agents follow the joint policy $\boldsymbol{\pi}_{-k}^{*}$ of all agents except $k$ , where the joint policy $\boldsymbol{\pi}_{-k}^{*}\triangleq\left(\pi_{1}^{*},\ldots,\pi_{k-1}^{*},\pi_{k+1}^{*},\ldots,\pi_{N}^{*}\right)$ . In this case, as long as all other agents keep their policies unchanged, no agent can benefit by changing its policy. Many MARL algorithms reviewed strive to converge to Nash equilibrium. In addition, the Q-function will eventually converge to the Nash Q-value $\boldsymbol{Q}^{*}=(Q_{1}^{*},\ldots,Q_{N}^{*})$ received in a Nash equilibrium of the game.
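The best-response condition can be illustrated on a much simpler object than a Markov game. The sketch below assumes a stateless two-player matrix game with pure strategies only, which is a simplification chosen for this example; the payoff values (a prisoner's-dilemma-style game) and the helper name `is_pure_nash` are likewise assumptions and do not come from the paper.

```python
import numpy as np


def is_pure_nash(payoff_1, payoff_2, a1, a2):
    """Check that neither player gains by unilaterally changing its action."""
    # Player 1 varies its row while player 2's column stays fixed.
    best_1 = payoff_1[a1, a2] >= payoff_1[:, a2].max()
    # Player 2 varies its column while player 1's row stays fixed.
    best_2 = payoff_2[a1, a2] >= payoff_2[a1, :].max()
    return bool(best_1 and best_2)


# Assumed payoffs: action 0 = cooperate, action 1 = defect.
payoff_1 = np.array([[-1.0, -3.0],
                     [0.0, -2.0]])
payoff_2 = payoff_1.T

print(is_pure_nash(payoff_1, payoff_2, 1, 1))  # mutual defection
print(is_pure_nash(payoff_1, payoff_2, 0, 0))  # mutual cooperation
```

In this assumed game, mutual defection passes the check and mutual cooperation does not, even though cooperation yields a higher payoff for both players; this is the sense in which all players cannot in general maximize their payoffs simultaneously.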

III Description of The Proposed Method

Co-DQL is developed from a new algorithm, called the independent double Q-learning method, which is also first proposed in this paper. In the following, we first present the independent double Q-learning method, then introduce Co-DQL, and finally analyze its convergence properties.

III-A Independent Double Q-learning Method

Most MARL methods are based on Q-learning. However, as described in Section II-A, traditional RL methods cause the problem of over-estimation, which to some extent harms the performance of RL methods. In [24], a double Q-learning algorithm is proposed, which uses double estimators instead of $\max_{a}Q_{t}\left(s_{t+1},a\right)$ to approximate $\max_{a}E\left\{Q_{t}\left(s_{t+1},a\right)\right\}$ , which is helpful to avoid the problem of over-estimation in standard Q-learning.

Inspired by independent Q-learning [21], we develop an independent double Q-learning method based on the UCB rule. For each agent $k$ , it is associated with two different action-value functions, each of which is updated with a value from the other action-value function for the next state. More specifically, suppose that the two action-value functions are $Q^{\mathfrak{a}}_{k}$ and $Q^{\mathfrak{b}}_{k}$ , and one of them is randomly selected for updating each time. The updating process of the action-value function $Q^{\mathfrak{a}}_{k}$ is as follows. Firstly, the maximal valued action $a^{*}_{k}$ in the next state $s^{\prime}$ is selected according to the action-value function $Q^{\mathfrak{a}}_{k}$ , namely, $a^{*}_{k}=\operatorname{argmax}_{a}Q^{\mathfrak{a}}_{k}\left(s^{\prime},a\right)$ . Then we use the value $Q^{\mathfrak{b}}_{k}\left(s^{\prime},a^{*}_{k}\right)$ to update $Q^{\mathfrak{a}}_{k}$ :

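$$Q_{k}^{\mathfrak{a}}(s, a) \leftarrow Q_{k}^{\mathfrak{a}}(s, a) + \alpha \left(r_{k} + \gamma Q_{k}^{\mathfrak{b}}(s^{\prime}, a_{k}^{*}) - Q_{k}^{\mathfrak{a}}(s, a)\right), \tag{7}$$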

instead of using the value $Q^{\mathfrak{a}}_{k}\left(s^{\prime},a^{*}_{k}\right)=\max_{a}Q^{\mathfrak{a}}_{k}\left(s^{\prime},a\right)$ to update $Q^{\mathfrak{a}}_{k}$ in independent Q-learning. The updating of $Q^{\mathfrak{b}}_{k}$ is similar to this.
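As a rough illustration of the update just described, the sketch below applies it to a single agent with tabular Q-values. The table layout, the function name `double_q_update`, the `update_a` flag (standing in for the random choice of which function to update) and the default values of `alpha` and `gamma` are assumptions made for this example; the paper itself fits the two functions with neural networks.

```python
import numpy as np


def double_q_update(q_a, q_b, s, a, r, s_next, update_a, alpha=0.1, gamma=0.9):
    """One tabular double Q-learning step for one agent, applied in place.

    q_a, q_b: arrays of shape (n_states, n_actions).
    update_a: True to update q_a (evaluated by q_b), False for the reverse.
    """
    learner, evaluator = (q_a, q_b) if update_a else (q_b, q_a)
    # The table being updated selects the greedy next action ...
    a_star = int(np.argmax(learner[s_next]))
    # ... and the other table supplies the value of that action.
    target = r + gamma * evaluator[s_next, a_star]
    learner[s, a] += alpha * (target - learner[s, a])


# Assumed toy tables with two states and two actions.
q_a = np.zeros((2, 2))
q_b = np.zeros((2, 2))
q_a[1] = [2.0, 0.0]
q_b[1] = [1.0, 3.0]

rng = np.random.default_rng(0)
double_q_update(q_a, q_b, s=0, a=0, r=1.0, s_next=1, update_a=bool(rng.random() < 0.5))
```

Passing the coin flip in as `update_a` keeps the function deterministic, which makes the selection-versus-evaluation split easy to check by hand.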

Here two multi-layer neural networks are used to fit the two Q-functions, which are expressed as $Q^{\mathfrak{a}}_{k}\left(s,a;\boldsymbol{\theta}_{t}\right)$ and $Q^{\mathfrak{b}}_{k}\left(s,a;\boldsymbol{\theta}_{t}^{\prime}\right)$ respectively. Usually the latter is called the target Q-function (or target network). The update mode is similar to that of deep double Q-learning [28], and the target value is $Y_{k,t}\equiv r_{k,t+1}+\gamma Q^{\mathfrak{b}}_{k}\left(s_{t+1},\operatorname{argmax}_{a}Q^{\mathfrak{a}}_{k}\left(s_{t+1},a;\boldsymbol{\theta}_{t}\right);\boldsymbol{\theta}_{t}^{\prime}\right)$. In order to make the target network update smoother, we adopt the soft target update [14] instead of copying the network weights directly [13]:

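$$\boldsymbol{\theta}^{\prime} \leftarrow \tau \boldsymbol{\theta} + (1 - \tau) \boldsymbol{\theta}^{\prime}, \tag{8}$$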

where $\tau\ll 1$. The soft update makes the weights of the target Q-function change slowly, and so do the target values. Compared with directly copying the weights, the soft update can enhance learning stability [13].
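The soft target update $\boldsymbol{\theta}^{\prime}\leftarrow\tau\boldsymbol{\theta}+(1-\tau)\boldsymbol{\theta}^{\prime}$ can be sketched in a few lines. This example assumes the network weights are held as a list of NumPy arrays and uses a hypothetical helper `soft_update`; a deep learning framework would apply the same blend to its own parameter tensors.

```python
import numpy as np


def soft_update(target_params, online_params, tau=0.01):
    """Return new target weights: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


# Assumed toy weights: online network at 1, target network at 0.
online = [np.ones(3)]
target = [np.zeros(3)]

for _ in range(100):
    target = soft_update(target, online, tau=0.01)

print(target[0])  # has moved only part of the way toward the online weights
```

With a fixed online network, the remaining gap after n updates shrinks by a factor of (1 - tau) per step, which is why a small `tau` makes the target values drift slowly.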

To balance exploration and exploitation, the UCB exploration strategy is used to select an action to be performed by the agent $k$ :

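$$a_{k} = \operatorname{argmax}_{c \in \boldsymbol{A}_{k}} \left\{Q_{k}(s_{k}, c) + \frac{\ln R_{s_{k}}}{R_{s_{k}, c}}\right\}, \tag{9}$$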

where $R_{s_{k}}$ denotes the number of times state $s_{k}$ has been visited and $R_{s_{k},c}$ denotes the number of times action $c$ has been chosen in this state so far. If action $c$ has rarely been chosen in a state, the second term dominates the first and action $c$ will be explored. As learning progresses, the first term dominates the second and the UCB strategy ultimately becomes a greedy one. Although the $\epsilon$-greedy strategy is easier to implement for problems with a larger state space, we prefer the UCB strategy where possible, since in preliminary tests we observed that the UCB strategy is slightly better than the $\epsilon$-greedy strategy [32]. From the perspective of the exploration mechanism, exploratory action selection under UCB is based on both the learnt Q-values and the number of times an action has been chosen in the past, hence it is more inclined to explore actions that have rarely been tried.
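A count-based selection rule of this kind might look as follows for one agent in one state. The sketch uses the bonus term $\ln R_{s_{k}} / R_{s_{k},c}$ in the form the rule is listed in this document; the function name `ucb_action` and the treatment of never-tried actions (an infinite bonus, so they are selected first) are assumptions of this example, since the text does not specify that case.

```python
import numpy as np


def ucb_action(q_values, state_visits, action_counts):
    """Pick the action maximizing Q-value plus a count-based bonus.

    q_values: Q_k(s_k, c) for every action c in the current state.
    state_visits: number of visits to the current state (R_{s_k}).
    action_counts: times each action was chosen in this state (R_{s_k, c}).
    """
    q_values = np.asarray(q_values, dtype=float)
    action_counts = np.asarray(action_counts, dtype=float)
    # Assumption: untried actions get an infinite bonus and are tried first.
    bonus = np.full_like(q_values, np.inf)
    tried = action_counts > 0
    if state_visits > 0:
        bonus[tried] = np.log(state_visits) / action_counts[tried]
    return int(np.argmax(q_values + bonus))


# A rarely chosen action wins despite its lower Q-value ...
print(ucb_action([1.0, 0.5], state_visits=10, action_counts=[9, 1]))
# ... but with many visits the bonus fades and the choice becomes greedy.
print(ucb_action([1.0, 0.5], state_visits=1000, action_counts=[500, 500]))
```

The two calls mirror the behaviour described above: the bonus dominates for rarely chosen actions and becomes negligible as counts grow.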

In this method, agent $k$ simply regards the other agents as part of the environment. Therefore, this method ignores the dynamics resulting from the actions of the other agents, and convergence is not guaranteed. In order to learn better cooperative strategies and make the learning process more stable and robust, we introduce Co-DQL, which uses mean field approximation, a new reward allocation mechanism and a local state sharing method.

III-B Cooperative Double Q-learning Method

As the number of agents increases, the dimension of the joint action $\boldsymbol{a}$ grows exponentially, so when the number of agents is relatively large, it is often infeasible to directly compute the joint action-value function $Q_{k}(s,\boldsymbol{a})$ for each agent $k$. Mean field approximation was first proposed in [27] to deal with this problem. Its core idea is that the interactions within the population of agents are approximated by those between an agent and the average of its neighboring agents. (Footnote 1: The neighborhood size is a user-specified parameter. It can take a value in $[1,N]$, where $N$ is the total number of agents.) Specifically, a very natural approach is to decompose the joint action-value function as follows:

$$Q_{k}(s_{k}, \boldsymbol{a}) = \mathrm{E}_{l \sim d}\left[Q_{k}(s_{k}, a_{k}, a_{l})\right], \tag{10}$$

where $d$ is the uniform distribution on the index set $\mathcal{N}(k)$ which is the set of the neighboring agents of agent $k$ and the size of the index set is $N_{k}=|\mathcal{N}(k)|$ . Suppose that each agent has $C$ discrete actions $\{1,2,\ldots,C\}$ . Then the action $a_{k}$ of agent $k$ can be coded using one-hot, namely, $a_{k}\triangleq\left[a_{k,1},a_{k,2},\ldots,a_{k,C}\right]$ , where each component corresponds to a possible action, and obviously at any time only one component is one and the others are zero. Hence the mean action $\overline{a}_{k}$ can be expressed as: $\overline{a}_{k}\triangleq\left[\overline{a}_{k,1},\overline{a}_{k,2},\ldots,\overline{a}_{k,C}\right]$ , where each component $\overline{a}_{k,i}=\mathrm{E}_{l\sim d}\left[a_{l,i}\right]$ for $i\in\{1,2,\ldots,C\}$ , simply recorded as $\overline{a}_{k}=\mathrm{E}_{l\sim d}\left[a_{l}\right]$ . Intuitively, $\overline{a}_{k}$ can be seen as the empirical distribution of the actions taken by the neighbors of agent $k$ [27]. Naturally, there is the following relationship between the one-hot coding action $a_{l}$ of agent $l$ and the mean action:

$$a_{l} = \overline{a}_{k} + \delta_{l,k}, \tag{11}$$

where $\delta_{l,k}$ is a small fluctuation. Assuming $Q_{k}$ is twice differentiable, a Taylor expansion gives the following mean field approximation on the basis of Eq. 10:

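$$\begin{aligned}
Q_{k}(s_{k}, \boldsymbol{a}) &= \mathrm{E}_{l \sim d}\left[Q_{k}(s_{k}, a_{k}, a_{l})\right] \\
&= \mathrm{E}_{l \sim d}\left[Q_{k}(s_{k}, a_{k}, \overline{a}_{k}) + \nabla Q_{k}(s_{k}, a_{k}, \overline{a}_{k}) \cdot \delta_{l,k} + \frac{1}{2}\, \delta_{l,k} \cdot \nabla^{2} Q_{k}(s_{k}, a_{k}, \xi_{l,k}) \cdot \delta_{l,k}\right] \\
&= Q_{k}(s_{k}, a_{k}, \overline{a}_{k}) + \nabla Q_{k}(s_{k}, a_{k}, \overline{a}_{k}) \cdot \mathrm{E}_{l \sim d}\left[\delta_{l,k}\right] + \frac{1}{2}\, \mathrm{E}_{l \sim d}\left[\delta_{l,k} \cdot \nabla^{2} Q_{k}(s_{k}, a_{k}, \xi_{l,k}) \cdot \delta_{l,k}\right] \\
&= Q_{k}(s_{k}, a_{k}, \overline{a}_{k}) + \frac{1}{2}\, \mathrm{E}_{l \sim d}\left[R_{k}(a_{l})\right] \\
&\approx Q_{k}(s_{k}, a_{k}, \overline{a}_{k}),
\end{aligned} \tag{12}$$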

where $\mathrm{E}_{l\sim d}[\delta_{l,k}]=0$ follows directly from Eq. 11, and $R_{k}(a_{l})\triangleq\delta_{l,k}\cdot\nabla^{2}Q_{k}(s_{k},a_{k},\xi_{l,k})\cdot\delta_{l,k}$ denotes the Taylor polynomial’s remainder with $\xi_{l,k}=\overline{a}_{k}+\epsilon_{l,k}\cdot\delta_{l,k}$ and $\epsilon_{l,k}\in[0,1]$ [27]. Under some mild conditions, it can be proved that $R_{k}(a_{l})$ is a random variable close to zero and can be omitted [27]. For large-scale TSC, this implicit way of modeling the behavior of other agents has great advantages: it drastically reduces the input dimension of each agent $k$’s Q-function, and the joint action dimension decreases from $C^{N_{k}}$ to the constant $C^{2}$. It is worth noting that we only need to pay attention to the actions of the current time step, rather than the historical behavior of the neighbors. This is mainly due to the fact that the traffic state dynamics are Markovian, which will be further discussed in Section IV-A.
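The mean action and the zero-mean property of the fluctuations can be verified on a small example. The sketch below assumes four neighbors choosing among $C=3$ discrete actions (indexed from 0 here, an implementation choice) and a hypothetical helper `mean_action`; the specific actions are made up for illustration.

```python
import numpy as np


def mean_action(neighbor_actions, n_actions):
    """Return the mean of the neighbors' one-hot actions and the one-hot matrix."""
    one_hot = np.eye(n_actions)[np.asarray(neighbor_actions)]  # shape (N_k, C)
    return one_hot.mean(axis=0), one_hot


# Assumed neighborhood: four neighbors taking actions 0, 2, 2 and 1.
a_bar, one_hot = mean_action([0, 2, 2, 1], n_actions=3)

# Fluctuation of each neighbor's action around the mean action.
deltas = one_hot - a_bar

print(a_bar)                 # empirical distribution of the neighbors' actions
print(deltas.mean(axis=0))   # averages to zero over the neighborhood
```

Because the mean action is defined as the average of the one-hot vectors, the fluctuations average to zero by construction, which is what removes the first-order term in the expansion.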

For partially observable Markov traffic scenarios, each agent $k$ can get its own reward $r_{k}$ and local observation $s_{k}$ at each time step. The goal of MARL in the cooperative setting is to maximize the global benefits or minimize the regrets. (Footnote 2: In this paper, regrets refer to the waiting time of vehicles, the length of queues, etc.) However, there may be a so-called credit assignment problem [33] in MARL, so each agent often does not directly regard the global reward as its own reward. Instead, we let each agent maintain its own reward. In addition, if each agent only considers its own immediate reward, the agent may become selfish, which may be harmful to cooperation. Based on the above considerations, we propose to allocate each agent’s reward according to the following formulation:

[TABLE]

where $\alpha\in[0,1]$ is a discount factor that can be flexibly used to balance selfishness and cooperation. If $\alpha$ is set to 0, each signal agent only considers the immediate reward of its own intersection and greedily maximizes the throughput of its own intersection, which may damage the global reward of the road network; if $\alpha$ is set to 1, each agent may receive the global reward and suffer from the credit assignment problem described earlier. We therefore use $0<\alpha<1$ . The idea is as follows. For each signal agent $k$ , although its action selection may not always benefit the neighboring agents, the reallocated reward received after an action depends on both its own immediate reward and the immediate rewards of the neighboring agents. When the immediate rewards of the neighboring agents are low, the second term of Eq. 13 takes a small value, indicating that the action taken by signal agent $k$ may not have been good for the neighboring agents. Conversely, higher immediate rewards of the neighboring agents make the second term of Eq. 13 larger and thus encourage signal agent $k$ . The reward allocation mechanism in Eq. 13 in turn affects the action selection of agent $k$ , with the aim of maximizing the global reward of the road network. The mechanism is similar to the one in [23], but we do not strictly limit the distance between agent $k$ and the neighboring agents.
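The reward reallocation can be sketched as follows. The sketch assumes that Eq. 13 has the form "own reward plus $\alpha$ times the sum of the neighbors' rewards", which is consistent with the two limiting cases described above but is an assumption about the exact formula. The function name, the dictionaries, and the reward values are hypothetical.

```python
def reallocate_reward(rewards, neighbors, k, alpha):
    """Own immediate reward plus alpha times the neighbors' immediate rewards.

    Assumed form of Eq. 13; the exact formula is not reproduced here.
    """
    return rewards[k] + alpha * sum(rewards[l] for l in neighbors[k])

# Hypothetical three-agent example in which every agent neighbors the others.
rewards = {0: -4.0, 1: -2.0, 2: -6.0}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

selfish = reallocate_reward(rewards, neighbors, 0, alpha=0.0)  # own reward only
mixed = reallocate_reward(rewards, neighbors, 0, alpha=0.5)    # -4 + 0.5 * (-8)
shared = reallocate_reward(rewards, neighbors, 0, alpha=1.0)   # sum of all rewards
```

With $\alpha=0$ the agent sees only its own reward, and with $\alpha=1$ (and all other agents as neighbors) it sees the global reward, matching the two extremes discussed in the text.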

The local state sharing method is described below. For agent $k$ , the average of the local state of its neighboring agents is taken as the additional input of agent $k$ ’s action-value function. Hence, the state of agent $k$ can be represented as:

[TABLE]

where $\hat{s}_{k}$ represents agent $k$ ’s joint state. This method implicitly shares state information among agents, and if the dimension of local state is assumed to be $|s|$ , its joint dimension is constant $|s|^{2}$ , which is independent of the number of agents.
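A minimal sketch of local state sharing is given below. It assumes that the joint state is formed by concatenating the agent's own local state with the element-wise mean of its neighbors' local states; the concatenation, the function name `shared_state`, and the queue values are illustrative assumptions.

```python
import numpy as np

def shared_state(local_states, neighbors, k):
    """Concatenate agent k's local state with the mean of its neighbors' states."""
    own = np.asarray(local_states[k], dtype=float)
    mean_nb = np.mean([local_states[l] for l in neighbors[k]], axis=0)
    return np.concatenate([own, mean_nb])

# Hypothetical local states with four entries each.
local_states = {0: [3, 0, 1, 2], 1: [1, 1, 1, 1], 2: [5, 3, 1, 1]}
neighbors = {0: [1, 2]}
s_hat = shared_state(local_states, neighbors, 0)
# s_hat is [3, 0, 1, 2, 3, 2, 1, 1]: own state followed by the neighbor mean.
```

The length of `s_hat` is twice the local state length no matter how many neighbors are averaged.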

Based on the above, the Cooperative double Q-learning (Co-DQL) algorithm is proposed. Compared with the centralized control method [16][34], this algorithm reduces the joint input dimension of the action-value function from $C^{N_{k}}\cdot|s|^{N_{k}}$ to $C^{2}\cdot|s|^{2}$ at the cost of a small amount of communication and computation [27], which avoids the curse of dimensionality in large-scale problems. The pseudo code of Co-DQL is given in Algorithm 2. In this algorithm, multi-layer perceptrons parameterized by $\phi$ and $\phi_{-}$ are used to represent the two action-value functions of each agent. Co-DQL works as follows:

Step 0

Initialize: For each $k=1,\ldots,N$ , initialize neural network parameters $\phi_{k}$ , $\phi_{-,k}$ and initialize mean action $\overline{a}_{k}$ for agent $k$ .

Step 1

Check the termination condition: If a problem-specific stopping condition is met, stop and save the trained neural network model.

Step 2

Select action: For each $k=1,\ldots,N$ , according to the current observation $\hat{s}_{k}$ of agent $k$ , select action $a_{k}$ under the UCB policy.

Step 3

Execute action: For each $k=1,\ldots,N$ , agent $k$ executes action $a_{k}$ (all agents execute action synchronously), gets immediate reward $r_{k}$ and next state observation $s^{\prime}_{k}$ .

Step 4

Obtain samples: For each $k=1,\ldots,N$ , compute the mean action $\overline{a}_{k}$ , reward $\hat{r}_{k}$ after reallocation and next local state $\hat{s}^{\prime}_{k}$ after sharing.

Step 5

Store samples in buffer: For each $k=1,\ldots,N$ , store the results of Step 4 as a tuple sample $\left\langle\hat{\boldsymbol{s}},\boldsymbol{a},\boldsymbol{\hat{r}},\hat{\boldsymbol{s}}^{\prime},\overline{\boldsymbol{a}}\right\rangle$ in replay buffer $\mathcal{D}_{k}$ ; if the number of samples stored in $\mathcal{D}_{k}$ is less than the minimum number of samples required for training, go to Step 1, otherwise continue with the next step.

Step 6

Compute sample target values: For each $k=1,\ldots,N$ , $M$ samples are randomly extracted from $\mathcal{D}_{k}$ and the target value $Y_{k}^{\mathrm{Co-DQL}}$ is calculated according to the sample data.

Step 7

Update neural network parameters: For each $k=1,\ldots,N$ , the gradient of the parameter $\phi_{k}$ is obtained from the loss function, and $\phi_{k}$ is updated according to the learning rate; then $\phi_{-,k}$ is softly updated with update rate $\tau$ . Go to Step 1.
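The soft update in Step 7 can be sketched as below. The sketch assumes the common Polyak-averaging rule $\phi_{-}\leftarrow\tau\phi+(1-\tau)\phi_{-}$ , which the text does not spell out; parameters are represented as plain NumPy arrays, and the function name `soft_update` is hypothetical.

```python
import numpy as np

def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online parameter.

    Assumed rule: phi_minus <- tau * phi + (1 - tau) * phi_minus.
    """
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Hypothetical parameter lists with a single weight vector each.
phi = [np.array([1.0, 2.0])]
phi_minus = [np.array([0.0, 0.0])]
phi_minus = soft_update(phi_minus, phi, tau=0.01)
# phi_minus[0] is now [0.01, 0.02].
```

A small $\tau$ keeps the second set of parameters changing slowly, which is the usual motivation for soft rather than hard target updates.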

For most RL algorithms, the termination condition is generally that the number of episodes experienced by the agents reaches a preset number. The preset number of episodes is usually selected according to how training progresses on the given problem.

The action-value function $Q^{\mathfrak{a}}_{k}(\cdot|\phi)$ (parameterized by $\phi$ ) is trained by minimizing the loss:

[TABLE]

where $Y_{k}^{\mathrm{Co-DQL}}$ is the target value of agent $k$ and is calculated by the following formulation:

[TABLE]

In Co-DQL, the mean field approximation makes every independent agent learn the awareness of collaboration with the others. Moreover, the reward allocation mechanism and the local state sharing method of agents improve the stability and robustness of the training process compared with the independent agent learning method.

In order to theoretically support the effectiveness of our proposed Co-DQL algorithm, we provide the convergence proof under some assumptions in the next subsection.

III-C Convergence Analysis

In previous literature, the convergence of mean field Q-learning has been proved both for tabular Q-functions and for Q-functions represented by other function approximators [35] [27]. Under similar constraints, we develop the convergence proof of Co-DQL, which is mean field RL with double estimators.

Assuming that there are only a limited number of state-action pairs, for each agent $k$ , we can write updating rules of two functions $Q^{\mathfrak{a}}_{k}$ and $Q^{\mathfrak{b}}_{k}$ of agent $k$ according to Section III-A and Section III-B:

[TABLE]

where $\mathfrak{a}^{*}_{k}=\operatorname{argmax}_{a_{k}}Q^{\mathfrak{a}}_{k}(s^{\prime},a_{k},\overline{a}_{k})$ , and $\mathfrak{b}^{*}_{k}=\operatorname{argmax}_{a_{k}}Q^{\mathfrak{b}}_{k}(s^{\prime},a_{k},\overline{a}_{k})$ . At each update time step, only one of the two functions in Eq. 17 is updated. Our goal is to prove that both $\boldsymbol{Q}^{\mathfrak{a}}=(Q^{\mathfrak{a}}_{1},\ldots,Q^{\mathfrak{a}}_{N})$ and $\boldsymbol{Q}^{\mathfrak{b}}=(Q^{\mathfrak{b}}_{1},\ldots,Q^{\mathfrak{b}}_{N})$ converge to Nash Q-values. Our proof follows the convergence proof framework of single-agent Double Q-learning [24], and we use the following assumptions and lemma.
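A tabular sketch of one double-estimator update for a single agent is given below. It follows the structure described here: one of the two tables is chosen at random, its own argmax selects the action, and the other table evaluates it. The table layout, the (state, mean action) keys, the learning rate name `lr`, and all values are hypothetical choices for illustration.

```python
import random
from collections import defaultdict

import numpy as np

NUM_ACTIONS = 2  # C; hypothetical value


def make_table():
    """Q-table keyed by (state, mean action); one value per own action."""
    return defaultdict(lambda: np.zeros(NUM_ACTIONS))


def double_q_update(q_a, q_b, key, action, reward, next_key, lr, gamma, rng):
    """Update exactly one of the two estimators, chosen at random."""
    if rng.random() < 0.5:
        updated, other = q_a, q_b
    else:
        updated, other = q_b, q_a
    # The updated table selects the action; the other table evaluates it.
    best = int(np.argmax(updated[next_key]))
    target = reward + gamma * other[next_key][best]
    updated[key][action] += lr * (target - updated[key][action])


q_a, q_b = make_table(), make_table()
rng = random.Random(0)
key = ("s0", (0.25, 0.75))       # hypothetical (state, mean action) pair
next_key = ("s1", (0.5, 0.5))
double_q_update(q_a, q_b, key, action=1, reward=1.0,
                next_key=next_key, lr=0.5, gamma=0.95, rng=rng)
```

Decoupling action selection from action evaluation in this way is what counteracts the over-estimation of a single max-based estimator.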

Assumption 1.

Each action-value pair is visited infinitely often, and the reward is bounded by some constant $K$ .

Assumption 2.

Agent’s policy is Greedy in the Limit with Infinite Exploration (GLIE). In the case with the Boltzmann policy, the policy becomes greedy w.r.t. the Q-function in the limit as the temperature decays asymptotically to zero.

Assumption 3.

For each stage game $[Q_{t}^{1}(s),\ldots,Q_{t}^{N}(s)]$ at time $t$ and in state $s$ in training, for all $t,s,j\in\{1,\ldots,N\}$ , the Nash equilibrium $\boldsymbol{\pi}_{*}=[\pi_{*}^{1},\ldots,\pi_{*}^{N}]$ is recognized either as 1) the global optimum or 2) a saddle point expressed as:

1. $\mathbb{E}_{\pi_{*}}[Q_{t}^{j}(s)]\geq\mathbb{E}_{\pi}[Q_{t}^{j}(s)],\forall\pi\in\Omega(\prod_{k}\mathcal{A}^{k})$ ;

2. $\mathbb{E}_{\pi_{*}}[Q_{t}^{j}(s)]\geq\mathbb{E}_{\pi^{j}}\mathbb{E}_{\pi_{*}^{-j}}[Q_{t}^{j}(s)],\forall\pi^{j}\in\Omega(\mathcal{A}^{j})$ and $\mathbb{E}_{\pi_{*}}[Q_{t}^{j}(s)]\leq\mathbb{E}_{\pi_{*}^{j}}\mathbb{E}_{\pi^{-j}}[Q_{t}^{j}(s)],\forall\pi^{-j}\in\Omega(\prod_{k\neq j}\mathcal{A}^{k}).$

Lemma 1.

The random process $\left\{\Delta_{t}\right\}$ defined in $\mathbb{R}$ as

[TABLE]

converges to zero with probability 1 (w.p.1) when

1. $0\leq\alpha_{t}(x)\leq 1,\sum_{t}\alpha_{t}(x)=\infty,\sum_{t}\alpha_{t}^{2}(x)<\infty$ ;

2. $x\in\mathscr{X}$ , the set of possible states, and $|\mathscr{X}|<\infty$ ;

3. $\left\|\mathbb{E}\left[F_{t}(x)|\Im_{t}\right]\right\|_{W}\leq\gamma\left\|\Delta_{t}\right\|_{W}+c_{t}$ , where $\gamma\in[0,1)$ and $c_{t}$ converges to zero w.p.1;

4. $\operatorname{var}\left[F_{t}(x)|\Im_{t}\right]\leq K(1+\left\|\Delta_{t}\right\|_{W}^{2})$ with constant $K>0$ .

Here $\mathscr{F}_{t}$ denotes the filtration of an increasing sequence of $\sigma$ -fields including the history of processes; $\alpha_{t},\Delta_{t},F_{t}\in\mathscr{F}_{t}$ and $\|\cdot\|_{W}$ is a weighted maximum norm [30].
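As a worked illustration of the step-size condition in Lemma 1 (an example that is not taken from the paper), the choice $\alpha_{t}=1/t$ satisfies the condition, whereas a constant step size $\alpha_{t}=c\in(0,1]$ violates the square-summability requirement:

$$\sum_{t=1}^{\infty}\frac{1}{t}=\infty,\qquad\sum_{t=1}^{\infty}\frac{1}{t^{2}}=\frac{\pi^{2}}{6}<\infty,\qquad\text{but}\qquad\sum_{t=1}^{\infty}c^{2}=\infty.$$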

Proof.

Similar to the proof of Theorem 1 in [36] and Corollary 5 in [37]. ∎

Our theorem and proof sketches are as follows:

Theorem 1.

In a finite-state stochastic game, if Assumptions 1, 2 and 3 and the first and second conditions of Lemma 1 are met, then both $\boldsymbol{Q}^{\mathfrak{a}}$ and $\boldsymbol{Q}^{\mathfrak{b}}$ as updated by the rule of Algorithm 2 in Eq. 17 will converge to the Nash Q-value $\boldsymbol{Q}^{*}=(Q_{1}^{*},\ldots,Q_{N}^{*})$ with probability one.

Proof.

We need to show that the third and fourth conditions of Lemma 1 hold so that we can apply it to prove Theorem 1. The updates of $\boldsymbol{Q}^{\mathfrak{a}}$ and $\boldsymbol{Q}^{\mathfrak{b}}$ are symmetric, so if one of them is proved to converge, the other converges as well. Subtracting $\boldsymbol{Q}^{*}$ from both sides of Eq. 17 and comparing with the equation in Lemma 1 gives:

[TABLE]

where $\mathfrak{a}^{*}=\operatorname{argmax}_{\boldsymbol{a}}\boldsymbol{Q}^{\mathfrak{a}}(s_{t+1},\boldsymbol{a}_{t},\overline{\boldsymbol{a}}_{t})$ . Let $\Im_{t}=\{\boldsymbol{Q}^{\mathfrak{a}}_{0},\boldsymbol{Q}^{\mathfrak{b}}_{0},s_{0},\boldsymbol{a}_{0},\alpha_{0},\boldsymbol{r}_{1},s_{1},\dots,s_{t},\boldsymbol{a}_{t}\}$ denote the $\sigma$ -fields generated by all random variables in the history of the stochastic game up to time $t$ . Note that $\boldsymbol{Q}^{\mathfrak{a}}_{t}$ and $\boldsymbol{Q}^{\mathfrak{b}}_{t}$ are two random variables derived from the historical trajectory up to time $t$ . Given the fact that all $\boldsymbol{Q}^{\mathfrak{a}}_{\tau}$ and $\boldsymbol{Q}^{\mathfrak{b}}_{\tau}$ with $\tau<t$ are $\mathscr{F}_{t}$ -measurable, both $\boldsymbol{\Delta}_{t}$ and $\boldsymbol{F}_{t}$ are therefore also $\mathscr{F}_{t}$ -measurable.

Since the reward is bounded by some constant $K$ in Assumption 1, $\operatorname{Var}[\boldsymbol{r}_{t}]<\infty$ , and the fourth condition in the lemma holds.

Next, we show that the third condition of the lemma holds. We can rewrite Eq. 18 as follows:

[TABLE]

where $\boldsymbol{F}^{Q}_{t}=\boldsymbol{r}_{t}+\gamma\boldsymbol{Q}^{\mathfrak{a}}_{t}\left(s_{t+1},\mathfrak{a}^{*}\right)-\boldsymbol{Q}^{*}\left(s_{t},\boldsymbol{a}_{t}\right)$ is the value that $\boldsymbol{F}_{t}$ would take if standard MF-Q were under consideration. In [17], $\|\mathbb{E}[\boldsymbol{F}^{Q}_{t}|\Im_{t}]\|_{W}\leq\gamma\left\|\Delta_{t}\right\|_{W}$ has been proved, so in order to meet the third condition, we identify $c_{t}=\gamma(\boldsymbol{Q}^{\mathfrak{b}}_{t}\left(s_{t+1},\mathfrak{a}^{*}\right)-\boldsymbol{Q}^{\mathfrak{a}}_{t}\left(s_{t+1},\mathfrak{a}^{*}\right))$ , and it is sufficient to show that $\boldsymbol{\Delta}_{t}^{\mathfrak{b}\mathfrak{a}}=\boldsymbol{Q}_{t}^{\mathfrak{b}}-\boldsymbol{Q}_{t}^{\mathfrak{a}}$ converges to zero. The update of $\boldsymbol{\Delta}_{t}^{\mathfrak{b}\mathfrak{a}}$ depends on whether $\boldsymbol{Q}^{\mathfrak{b}}$ or $\boldsymbol{Q}^{\mathfrak{a}}$ is updated, so

[TABLE]

where $\boldsymbol{F}^{\mathfrak{a}}_{t}(s_{t},\boldsymbol{a}_{t})=\boldsymbol{r}_{t}+\gamma\boldsymbol{Q}^{\mathfrak{b}}_{t}\left(s_{t+1},\mathfrak{a}^{*}\right)-\boldsymbol{Q}^{\mathfrak{a}}_{t}\left(s_{t},\boldsymbol{a}_{t}\right)$ and $\boldsymbol{F}^{\mathfrak{b}}_{t}(s_{t},\boldsymbol{a}_{t})=\boldsymbol{r}_{t}+\gamma\boldsymbol{Q}^{\mathfrak{a}}_{t}\left(s_{t+1},\mathfrak{b}^{*}\right)-\boldsymbol{Q}^{\mathfrak{b}}_{t}\left(s_{t},\boldsymbol{a}_{t}\right)$ . We define $\xi^{\mathfrak{b}\mathfrak{a}}_{t}=\frac{1}{2}\alpha_{t}$ , then

[TABLE]

where $\mathbb{E}[\boldsymbol{F}^{\mathfrak{b}\mathfrak{a}}_{t}(s_{t},\boldsymbol{a}_{t})|\Im_{t}]\!=\!\gamma\mathbb{E}[\boldsymbol{Q}^{\mathfrak{a}}_{t}\left(\!s_{t+1},\!\mathfrak{a}^{*}\!\right)\!-\!\boldsymbol{Q}^{\mathfrak{b}}_{t}\left(\!s_{t+1},\!\mathfrak{a}^{*}\!\right)\!|\Im_{t}]$ . At each time step, one of the following two cases must hold.

Case 1: $\mathbb{E}[\boldsymbol{Q}^{\mathfrak{a}}_{t}\left(s_{t+1},\!\mathfrak{b}^{*}\right)\!|\Im_{t}]\!\geq\!\mathbb{E}[\boldsymbol{Q}^{\mathfrak{b}}_{t}\left(s_{t+1},\!\mathfrak{a}^{*}\right)\!|\Im_{t}]$ . We have $\boldsymbol{Q}^{\mathfrak{a}}_{t}\left(\!s_{t+1},\!\mathfrak{a}^{*}\!\right)\!=\!\max\boldsymbol{Q}^{\mathfrak{a}}_{t}\left(\!s_{t+1},\!\boldsymbol{a}\!\right)\!\geq\!\boldsymbol{Q}^{\mathfrak{a}}_{t}\left(\!s_{t+1},\!\mathfrak{b}^{*}\!\right),$ therefore

[TABLE]

Case 2: $\mathbb{E}[\boldsymbol{Q}^{\mathfrak{a}}_{t}\left(s_{t+1},\mathfrak{b}^{*}\right)|\Im_{t}]<\mathbb{E}[\boldsymbol{Q}^{\mathfrak{b}}_{t}\left(s_{t+1},\mathfrak{a}^{*}\right)|\Im_{t}]$ . We have $\mathbb{E}[\boldsymbol{Q}^{\mathfrak{b}}_{t}\left(s_{t+1},\mathfrak{b}^{*}\right)|\Im_{t}]\geq\mathbb{E}[\boldsymbol{Q}^{\mathfrak{b}}_{t}\left(s_{t+1},\mathfrak{a}^{*}\right)|\Im_{t}]$ . Then

[TABLE]

Hence, whichever of the above cases holds, we obtain the desired result, that is, $|\mathbb{E}[\boldsymbol{F}^{\mathfrak{b}\mathfrak{a}}_{t}(s_{t},\boldsymbol{a}_{t})|\Im_{t}]|\leq\|\boldsymbol{\Delta}^{\mathfrak{b}\mathfrak{a}}_{t}\|$ . We can then apply Lemma 1 to obtain the convergence of $\boldsymbol{\Delta}_{t}^{\mathfrak{b}\mathfrak{a}}$ to 0, so the third condition holds. Finally, with all conditions satisfied, Theorem 1 is proved. ∎

IV Application of Co-DQL to TSC

This section first uses MDP notation to represent the key elements of the TSC problem, so that MARL can be applied to TSC. To facilitate the training and evaluation of the MARL model applied to the TSC problem, we also introduce the TSC simulators.

IV-A Description of TSC Based on MDP Notations

Although we model the entire traffic network in a decentralized way as a multi-agent structure, the global state of the whole traffic system is still Markov, namely, the next state only depends on the current state:

[TABLE]

where $s_{t}$ and $s_{t+1}$ denote the state of the traffic system at time steps $t$ and $t+1$ , and $\boldsymbol{a}_{t}$ denotes the joint action of the traffic system at time step $t$ . Therefore, it can be modeled using the framework of MARL described in Section II-B.

When dealing with the TSC problem, there are many different MDP settings. They differ in the definition of the action space, state space, reward function, etc. [11][23][38] [39] [40] [41]. Here, we focus on the following two kinds of MDP settings. Note that it may be possible to extend our method to other kinds of settings. Since the source code of our method is open (https://github.com/Brucewangxq/larger_real_net), an interested reader can test it or extend it to other kinds of MDP settings.

IV-A1 A simplified MDP setting for TSC problem

Suppose a road network has $N$ signalized intersections, i.e., $N$ signal agents. The action of signal agent $k$ at time step $t$ is written as $a_{k,t}$ , and its local observation or state is $s_{k,t}$ . Each signal agent has only two possible actions $\{0,1\}$ : green lights for incoming traffic in the north and south directions with red lights in the east and west directions, or the reverse, so the joint action space is $\{0,1\}^{N}$ . The local state, i.e., the observation vector $s_{k,t}$ , is the waiting queue density (or queue length) on all the one-way lanes (or edges) leading into intersection $k$ : $s_{k,t}=[q[kn],q[ks],q[kw],q[ke]]$ , where $q[kn],q[ks],q[kw]$ and $q[ke]$ represent the waiting queue density on the lanes carrying vehicles toward intersection $k$ from each of the four directions. Each of them takes values in $\{0,1,2,\ldots,\max_{q}\}$ , where $\max_{q}$ is the maximum capacity of vehicles on a lane between two intersections. For a peripheral signal agent of the system, if there is no road connected to it in a certain direction, the number of vehicles in that direction is always zero. For simplicity, it is assumed that all normally traveling vehicles have the same speed and can start or stop immediately.

For any signal agent $k$ , the reward at time step $t$ can be calculated by the number of vehicles waiting on all lanes towards the intersection, that is,

[TABLE]

where $q_{t}[kj]$ is the number of vehicles that have zero speed on lane $kj$ leading to intersection $k$ . To avoid changing the traffic signal too frequently, an action is taken only every $\Delta t$ time steps, that is, a Markov state transition occurs only once every $\Delta t$ time steps. From the $T$ -th to the $(T+1)$ -th state transition, the signal agent then obtains the sum of the rewards over the $\Delta t$ time steps, that is,

[TABLE]

Our goal is to minimize the total waiting time of vehicles in the traffic network:

[TABLE]

where $T_{max}$ denotes the total number of state transitions, and the joint action $\boldsymbol{a}$ changes every $\Delta t$ time steps.
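The reward of the simplified setting can be sketched as follows. The sketch assumes that the immediate reward is the negative number of waiting vehicles on the incoming lanes, so that maximizing reward minimizes waiting; the sign convention, the function names, and the queue values are assumptions made for illustration.

```python
def step_reward(queues):
    """Immediate reward: negative number of waiting vehicles on incoming lanes."""
    return -sum(queues)


def transition_reward(queue_history):
    """Sum of immediate rewards over the delta_t steps between two decisions."""
    return sum(step_reward(q) for q in queue_history)


# Hypothetical queues [q[kn], q[ks], q[kw], q[ke]] over delta_t = 4 time steps.
history = [[2, 0, 1, 3], [2, 1, 1, 3], [1, 1, 0, 2], [0, 1, 0, 1]]
R = transition_reward(history)  # -(6 + 7 + 4 + 2) = -19
```

Because one waiting vehicle contributes one unit per time step, summing these rewards over an episode corresponds to the total waiting time that the objective above seeks to minimize.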

In this simplified situation, all other agents are treated as the neighboring agents of each agent. Fig. 1 shows how to apply Co-DQL to TSC. The input information of each agent includes the shared local state information and the mean action information calculated from actions of the neighboring agents in the previous time step. Each agent receives a reallocated reward after performing an action.

IV-A2 A more realistic MDP setting for TSC problem

In the literature on RL for TSC, there are several standard action definitions, such as phase duration [38], phase switch [39] [40] and the phase itself [41][23][11]. Here, we follow the last definition and pre-define a set of feasible phases for each signal agent. Specifically, we adopt the definition of feasible phases in [23], which defines five feasible phases for each signal agent: east-west straight, east-west left-turn, and three straight and left-turn phases for east, west and north-south. These five feasible phases constitute the action space, with each phase corresponding to an action. Each signal agent selects one of them to implement for a duration of $\Delta t$ at each Markov time step. In addition, a yellow time $t_{y}<\Delta t$ is enforced after each phase switch to ensure safety.

After reviewing a variety of commonly used state definitions [41] [38] [23], we follow the one in [23] and define the local state as

[TABLE]

where $lane$ is each incoming lane of intersection $k$ . $wait$ measures the cumulative delay [s] of the first vehicle and $wave$ measures the total number [veh] of approaching vehicles along each incoming lane. In our experiment, we use laneAreaDetector in Simulation of Urban Mobility (SUMO)[42] [43] to obtain the state information, and in practice, the state information can be obtained by near-intersection induction-loop detectors as described in [23].

Similar to the definition of reward in the simplified TSC problem mentioned earlier, we also further consider the cumulative delay of the first car as a regularizer:

[TABLE]

where $\beta$ is the regularization rate and typically chosen to approximately scale different reward terms into the same range. Note that the rewards are only measured at time $t+\Delta t$ . Compared to other reward definitions such as wave[38] and appropriateness of green time[44], the reward we defined emphasizes traffic congestion and travel delay, and it is directly correlated to state and action[23].

IV-B Description of the Simulation Platform

IV-B1 A simplified TSC simulator

The simulation platform used in Section V-B is a grid TSC system based on OpenAI-gym [45]. There are three different scenarios in the experiment: global random traffic flow, double-ring traffic flow and four-ring traffic flow which correspond to the three subfigures of Fig. 2 respectively.

Each rectangle denotes a signalized intersection and the number in the rectangle represents the immediate reward for the intersection. Every two adjacent intersections are connected by two one-way lanes. The color of each lane in the picture ranges from green to red and roughly indicates the number of vehicles waiting (at zero speed) on the lane, i.e., the level of congestion. Green means unimpeded and red indicates serious congestion. During the operation of the simulator, a certain number of vehicles are generated at each time step and scattered randomly in the road network. Every newly generated vehicle is assigned a route generated randomly according to a certain rule; it follows this route and is removed from the road network when it reaches its destination.

The three scenarios differ in two respects. The first is the rule for generating the driving route of a vehicle, which results in different levels of congestion at different intersections. This can simulate the traffic flow between main and secondary roads in a city; in an actual traffic network, serious congestion often occurs only in certain sections. The second is the number of new vehicles added per time step, which can be used to simulate different levels of traffic congestion.

The primary parameters of the simulator are listed in Table I. The normal driving time between two intersections, which represents the distance between them, indicates that a normally driving vehicle needs 5 time steps to travel from one intersection to an adjacent one. The initial number of vehicles in the simulator (note that this is not the number after resetting the simulator when training a model) is used to obtain random seeds. The shortest route length is 2, which means that the shortest distance a vehicle generated in the simulator can travel is two intersections. The longest route length is 20, which means that the longest distance a generated vehicle can travel is twenty intersections. The action time interval of a signal agent is 4, which means that a signal agent must hold an action for at least 4 time steps before it can change it.

IV-B2 A more realistic TSC simulator

We take the road network in some areas of Xi’an as the prototype of the real road network to design a TSC simulator based on SUMO, which has 49 signalized intersections on the real road network. Fig. 3 and Fig. 4 show the overall road network view and a local view of two adjacent intersections, respectively. The cars driving on the road network have the following properties: the length is $5m$ , the acceleration is $5m/s^{2}$ , and the deceleration is $10m/s^{2}$ . As for the setting of the signal agents’ action time interval $\Delta t$ , as discussed in [23], if $\Delta t$ is too long, the signal agent will not be adaptive enough; if $\Delta t$ is too short, the agent’s decision will not be delivered on time due to computational cost and communication latency, and it may be unsafe since the action is switched too frequently. Some recent works suggested $\Delta t=10s,t_{y}=5s$ [38] and $\Delta t=5s,t_{y}=2s$ [23]. We adopt the latter setting in the simulator so that each signal agent is more adaptive.

In order to evaluate the robustness and optimality of algorithms in a challenging TSC scenario, we design intensive, stochastic, time-variant traffic flows to simulate the peak-hour traffic, instead of fixed congestion levels in the simplified TSC simulator. The simulation time of each episode is $60min$ and we set up four traffic flow groups. Specifically, four traffic flow groups are generated as multiples of “unit” flows $1100veh/hr$ , $660veh/hr$ , $920veh/hr$ , and $552veh/hr$ . The first two traffic flows are simulated during the first $40min$ , as $[0.4,0.7,0.9,1.0,0.75,0.5,0.25]$ unit flows with $5min$ intervals, while the last two traffic flows are generated during a shifted time window from $15min$ to $55min$ , as $[0.3,0.8,0.9,1.0,0.8,0.6,0.2]$ unit flows with $5min$ intervals.
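The time-variant flow profile can be sketched as a simple schedule. The sketch assumes that each multiplier applies to one consecutive 5-minute interval starting at the beginning of the window; note that seven such intervals span 35 minutes, and how the remainder of the stated 40-minute window is handled is not modeled here. The function name `flow_schedule` is hypothetical.

```python
def flow_schedule(unit_flow, multipliers, start_min, interval_min=5):
    """Return (start_minute, end_minute, vehicles_per_hour) for each interval."""
    schedule = []
    for i, m in enumerate(multipliers):
        begin = start_min + i * interval_min
        schedule.append((begin, begin + interval_min, unit_flow * m))
    return schedule


# One flow from the first group and one from the shifted group.
first = flow_schedule(1100, [0.4, 0.7, 0.9, 1.0, 0.75, 0.5, 0.25], start_min=0)
shifted = flow_schedule(920, [0.3, 0.8, 0.9, 1.0, 0.8, 0.6, 0.2], start_min=15)

for begin, end, rate in first:
    print(f"{begin:2d}-{end:2d} min: {rate:.0f} veh/hr")
```

Under this reading, the first group peaks at its full unit flow between minutes 15 and 20, which is exactly when the shifted group begins to ramp up.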

V Numerical Experiments and Discussions

V-A Implementation Details of Algorithms

In order to analyze the performance of the proposed algorithm, we compared it with several popular RL methods in the same traffic scenarios. Details of the implementation of Co-DQL and the other methods are described as follows:

Co-DQL: The procedure described in Section III-B is implemented. A multilayer fully connected neural network is used to approximate the Q-function of each agent. We use ReLU activations between hidden layers and also apply ReLU to the final output of the Q-network. All agents share the same Q-network; the shared Q-network takes an agent embedding as input and computes a Q-value for each candidate action. We also feed in the action approximation $\overline{a}_{k}$ and the shared joint state $\hat{s}_{k}$ . We use the Adam optimizer with a learning rate of 0.0001. The discount factor $\gamma$ is set to 0.95, the mini-batch size is 1024, and the reward allocation factor $\alpha$ is set to $1/n$ , where $n$ represents the number of neighbor agents. The size of the replay buffer is $5\times 10^{5}$ and $\tau=0.01$ is used for updating the target networks. The network parameters are updated once the samples of an episode have been added to the replay buffer.

Multi-Agent A2C (MA2C): The state-of-the-art decentralized MARL algorithm for large-scale TSC. The hyper-parameters of the algorithm in the experiment are basically consistent with those of the original one [23].

Independent Q-learning (IQL): It has almost the same hyper-parameter settings as Co-DQL, and its network architecture is identical to that of Co-DQL, except that the mean action and the shared joint state are not fed as additional inputs to the Q-network.

Independent double Q-learning (IDQL): The parameter setting of this method is almost the same as that of independent Q-learning. The main difference is that the double estimators are used when calculating the target value.

Deep deterministic policy gradient (DDPG): This is also an off-policy algorithm. It consists of two parts: an actor and a critic. Each agent is trained with the DDPG algorithm; in each experiment the critic is shared among all agents and all of the actors are kept separate. It uses the Adam optimizer with learning rates of 0.001 and 0.0001 for the critic and the actors, respectively. The settings of the other parameters are the same as those of Co-DQL.

It is noteworthy that the hyper-parameter settings of each algorithm may affect its performance to a certain extent.
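For reference, the Co-DQL hyper-parameters stated above can be collected in a small configuration object. The class name `CoDQLConfig` and its field names are hypothetical; only the numerical values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoDQLConfig:
    """Co-DQL hyper-parameters as reported in this section (names are illustrative)."""
    learning_rate: float = 1e-4   # Adam learning rate
    gamma: float = 0.95           # discount factor
    batch_size: int = 1024        # mini-batch size
    buffer_size: int = 500_000    # replay buffer size, 5 x 10^5
    tau: float = 0.01             # soft update rate for the target networks

    def alpha(self, num_neighbors: int) -> float:
        """Reward allocation factor, set to 1/n for n neighbor agents."""
        return 1.0 / num_neighbors


cfg = CoDQLConfig()
alpha_four_neighbors = cfg.alpha(4)  # 0.25
```

Setting $\alpha=1/n$ means the neighbor term of the reallocated reward becomes the average, rather than the sum, of the neighbors' rewards.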

V-B Experiments in The Simplified TSC Simulator

By training and evaluating the proposed method in different traffic scenarios, we can demonstrate that the proposed method is promising. Next, we will analyze the performance of the algorithms in three scenarios.

V-B1 global random traffic flow

As shown in Fig. 5 (a), when the signal agents adopt a random strategy, the mean reward stabilizes after about 2000 time steps, which means that the traffic flow of the simulator has also reached a stable state. In order to ensure the diversity of training samples and to avoid over-fitting to particular traffic flow states as far as possible, we record 10 discrete simulator states (i.e., vehicle position, driving status, signal status) after 2000 time steps as random seeds, which are used to train and evaluate these methods. In the global random traffic flow, we set the number of new vehicles added at each time step to 5, which corresponds to a high level of traffic congestion.

Result Analysis. We run 2500 episodes for training all five models, and regularly save the trained models. The mean reward curve of the signal agents is shown in Fig. 6. It can be seen from the figure that IQL has the lowest training performance. Although IDQL is only slightly better than IQL, the results tend to indicate that over-estimation of the action-value function damages the performance of signal control and that using double estimators can improve the performance to a certain extent. Interestingly, the performance of DDPG is better than that of IDQL, which may be due to the advantages of the actor-critic structure. Although MA2C and Co-DQL both have more robust learning ability, Co-DQL greatly outperforms all the other methods. Co-DQL uses mean field approximation to directly model the strategies of other agents, so it can learn good cooperative strategies and maximize the total reward of the road network.

For each algorithm, the best model obtained in the training process is used to test in this scenario. We evaluate all of them over 100 episodes. Table II shows the results of evaluation. Average delay time is calculated from the total delay time of vehicles in the road network during an episode. The standard deviation is given in parentheses after the mean value. Co-DQL greatly reduces the average delay time compared with the other methods. The test results are basically consistent with the trained model performance, which shows the validity of our trained model.

V-B2 double-ring traffic flow

Fig. 5 (b) shows the mean reward curve of agents using random strategies in double-ring traffic flow scene. Similarly, 10 simulator states are selected as seeds. In this scenario, we set the number of new vehicles added to the network at each time step to 4, which corresponds to a medium level of traffic congestion. The other parameters of the simulator are the same as those in Section V-B1.

Result Analysis. Similarly, we train all the models in this scenario and save the model with the best training performance. The mean reward curve is shown in Fig. 7. As expected, the training performance of Co-DQL method still outperforms all the other methods. In addition, mainly due to the information transfer among agents, MA2C can obtain better training results in contrast to the independent agent methods, that is, IQL and IDQL. However, although the convergence rates of DDPG, IQL and IDQL are different, the final training results are basically similar. This may be because the problem of double-ring traffic flow is relatively simple, so these three methods can achieve relatively consistent results. In this scenario, the evaluation results are shown in Table III. Co-DQL can obtain shorter average delay time and smaller standard deviations than other methods.

V-B3 four-ring traffic flow

Seeds for the four-ring traffic flow are selected according to the curve in Fig. 5 (c). In order to simulate traffic conditions with a low level of congestion, we set the number of new vehicles added to the road network at each time step to 3. The other parameters of the simulator are set in the same way as in the other scenarios.

Result Analysis. The training curve in this scenario is shown in Fig. 8, and the test results are shown in Table IV. In this scenario, the training performance of IDQL is significantly better than that of IQL, which does not use double estimators. The learning processes of Co-DQL and MA2C are relatively stable, and their standard deviations in the evaluation are smaller than those of IQL, IDQL and DDPG, which may be because they share information among agents. Ultimately, Co-DQL achieves the shortest average delay time by means of mean field approximation for opponent modeling and local information sharing.

V-C Experiment in The More Realistic TSC Simulator

Experiment Settings. We use the simulator setup described in Section IV-B2. Regarding the MDP setting, the regularization rate $\beta$ in the reward is set to $0.2veh/s$ , and the regularization factors of $wave$ , $wait$ , and reward are $5veh$ , $100s$ , and $2000veh$ . Here, we train all MARL models for around 1400 episodes with episode horizon $T=720$ steps, and then evaluate the trained models over 10 episodes.

Result Analysis. The mean episode reward curve during training in this scenario is shown in Fig. 9. In this challenging scenario, DDPG has the worst training performance. This may be because the time-varying traffic flow leads to a large variance in the critics, so they cannot effectively guide the learning of the actors. Surprisingly, although the training performance of MA2C is much better than that of DDPG, it has no obvious advantage over IQL and IDQL. This may be because MA2C is more sensitive to the number of agents, and setting its many hyper-parameters is also a big challenge. As expected, Co-DQL achieves the best training performance.

In this more realistic simulator, we can consider more traffic metrics than in the simplified one. Table V shows the evaluation results using ten different random seeds. Avg. Vehicle Speed is calculated by dividing the total distance traveled by the driving time. Avg. Intersection Delay is calculated by dividing the total delay time at each intersection by the total number of vehicles at that intersection. Avg. Queue Length is calculated from the queue length in each time period. Trip Delay refers to the total delay time of vehicles during their trips. Trip Arrived Rate is calculated by dividing the number of vehicles that have arrived at their destination before the end of the simulation by the total number of vehicles. The comparison results are relatively consistent across all measures.
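The ratio-based metrics described above can be sketched in a few lines. This is only an illustrative sketch: the `Trip` record and its field names are hypothetical, and it assumes that Avg. Vehicle Speed is aggregated over all vehicles (total distance over total driving time), which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trip:
    """Hypothetical per-vehicle record; the field names are illustrative only."""
    distance_m: float      # total distance traveled by the vehicle
    driving_time_s: float  # time the vehicle spent driving in the network
    arrived: bool          # True if it reached its destination before the end


def avg_vehicle_speed(trips: List[Trip]) -> float:
    """Total distance traveled divided by total driving time (assumed aggregate)."""
    total_time = sum(t.driving_time_s for t in trips)
    if total_time <= 0:
        return 0.0
    return sum(t.distance_m for t in trips) / total_time


def avg_intersection_delay(total_delay_s: float, vehicle_count: int) -> float:
    """Total delay at one intersection divided by the vehicles counted there."""
    return total_delay_s / vehicle_count if vehicle_count else 0.0


def trip_arrived_rate(trips: List[Trip]) -> float:
    """Share of vehicles that arrived before the simulation ended."""
    return sum(t.arrived for t in trips) / len(trips) if trips else 0.0


# Toy data, not taken from Table V.
trips = [Trip(1000.0, 100.0, True), Trip(500.0, 50.0, False)]
speed = avg_vehicle_speed(trips)
rate = trip_arrived_rate(trips)
```

Because each metric is a simple ratio, the guards against empty inputs are the only design choice here; a real evaluation would read these quantities from the simulator's trip output.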

According to the results, over-estimation makes a difference in performance between IQL and IDQL: the use of double estimators gives IDQL a slight advantage over IQL on most of the measures. Compared with IQL, IDQL and DDPG, Co-DQL and MA2C show more robust test performance (smaller standard deviation), which indicates that information sharing among agents benefits their cooperation. Co-DQL achieves the best average performance with respect to multiple measures, which indicates the advantage of mean field approximation in agent behavior modeling.

V-D Discussions

First, we discuss the performance of the different algorithms in the three traffic flow scenarios with the simplified MDP setting. As seen from Fig. 10 (blue bar), all methods have a smaller mean episode reward in the global random traffic flow scenario than in the other scenarios, because this scenario has the highest level of traffic congestion and the largest traffic volume. According to Fig. 11 (green bar), although the mean episode reward of each evaluated model in the four-ring traffic flow scenario is moderate, the number of vehicles in this scenario is small, which may lead to a greater average vehicle delay. Although the traffic volume of the double-ring traffic flow scenario is larger than that of the four-ring scenario, the evaluation results in the former (orange bar) are even slightly better than in the latter (green bar), in terms of both the mean episode reward of the agents and the average waiting time of vehicles. Our analysis suggests that the double-ring traffic flow scenario only needs cooperation between two groups of agents, namely the signal agents in the inner and outer loops, while the four-ring traffic flow scenario needs collaboration among four groups, so the cooperation task of the signal agents in the latter may be more complex.

Experimental results on multiple scenarios show that the algorithm with double estimators consistently performs better than the one without. Compared with the simplified setting, MA2C does not achieve the desired performance in the more realistic case. Co-DQL still obtains more training reward and better evaluation performance than the state-of-the-art decentralized MARL algorithms. In addition, we conducted an experiment on a $7\times 7$ grid road network simulator; the settings and results of this experiment are given in the supplementary materials, where Co-DQL again achieves the best results.

In the RL community, a hot topic is how to apply RL in the real world. The uncertainty introduced by the exploration behavior of an RL model during training is a potential safety hazard for applying TSC in practice. For this reason, the training stage of our model is completed in a TSC simulator, as with most RL models [15][16][23], and the model deployed in reality is generally the one trained in the simulator. Although there is a gap between the simulator and the real environment, simulation to reality (sim2real) [46], as a branch of RL, has been widely studied in order to bridge this gap.

VI Conclusion

When designing a MARL algorithm, a critical challenge is how to make the agents cooperate efficiently, and one way to achieve this is to estimate the Q values properly and share local information among agents. Along this line of thought, this paper developed Co-DQL, which takes advantage of some important ideas studied in the literature. In more detail, Co-DQL employs an independent double Q-learning method based on double estimators and UCB exploration, which can eliminate the over-estimation of traditional independent Q-learning while ensuring exploration. It adopts mean field approximation to model the interaction among agents so that agents can learn a better cooperative strategy. In addition, we presented a reward allocation mechanism and a local state sharing method. Based on the characteristics of TSC, we gave the details of the algorithmic elements. To validate the performance of the proposed algorithm, we tested Co-DQL on various traffic flow scenarios in TSC simulators. Compared with several state-of-the-art MARL algorithms (i.e., IQL, IDQL, DDPG and MA2C), Co-DQL achieves promising results.
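To make the combination of double estimators and UCB exploration concrete, the following is a minimal tabular sketch for a single independent agent. It is not the paper's implementation: the class name, the exploration constant `c`, the learning rate, the discount factor, and the use of the average of the two tables for action selection are all assumptions made for illustration, and the mean field term and local state sharing are omitted.

```python
import math
import random
from collections import defaultdict


class DoubleQAgent:
    """Illustrative tabular double Q-learning agent with UCB action selection."""

    def __init__(self, n_actions, lr=0.1, gamma=0.95, c=1.0, seed=0):
        self.n_actions = n_actions
        self.lr, self.gamma, self.c = lr, gamma, c  # hypothetical defaults
        self.rng = random.Random(seed)
        self.q_a = defaultdict(lambda: [0.0] * n_actions)  # first estimator
        self.q_b = defaultdict(lambda: [0.0] * n_actions)  # second estimator
        self.counts = defaultdict(lambda: [0] * n_actions)  # visits per action

    def act(self, state):
        counts = self.counts[state]
        # Try every action once before applying the UCB bonus.
        for action, n in enumerate(counts):
            if n == 0:
                counts[action] += 1
                return action
        total = sum(counts)

        def score(action):
            mean_q = 0.5 * (self.q_a[state][action] + self.q_b[state][action])
            bonus = self.c * math.sqrt(math.log(total) / counts[action])
            return mean_q + bonus

        action = max(range(self.n_actions), key=score)
        counts[action] += 1
        return action

    def update(self, s, a, r, s_next):
        # Randomly pick the table to update; the other one evaluates the
        # greedy action, which is what counters over-estimation.
        if self.rng.random() < 0.5:
            select, evaluate = self.q_a, self.q_b
        else:
            select, evaluate = self.q_b, self.q_a
        best = max(range(self.n_actions), key=lambda b: select[s_next][b])
        target = r + self.gamma * evaluate[s_next][best]
        select[s][a] += self.lr * (target - select[s][a])


agent = DoubleQAgent(n_actions=2)
first, second = agent.act("s"), agent.act("s")  # each action is tried once
for _ in range(50):
    agent.update("s", 1, 1.0, "t")  # action 1 is consistently rewarded
preferred = agent.act("s")
```

In the toy usage, both actions have the same visit count when `preferred` is chosen, so the UCB bonuses are equal and the action with the larger averaged Q value wins.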

In the future, we hope to further test Co-DQL on a real city road network, and we will consider other approaches to large-scale MARL, such as hierarchical architectures [41][47]. In addition, note that the local optimization of an agent’s reward (throughput) may reduce the neighboring agents’ rewards in a nonlinear way. Such nonlinearity is typical in traffic flow. Using the linear weighted function with a constant $\alpha$ may not fully capture the nonlinear throughput relationship between neighboring intersections. Also, each agent’s reward appears multiple times, depending on the number of connected neighboring intersections. For instance, an intersection with five legs receives more weight than a three-leg intersection, which may bias the optimal solution. Hence, it may be interesting to study the reward allocation mechanism further.
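The weighting bias mentioned above can be illustrated with a small sketch. It assumes the linear allocation takes the form "own reward plus $\alpha$ times the sum of the neighbors' rewards"; this form, the function names, and the toy networks are assumptions for illustration rather than the paper's exact definition.

```python
from typing import Dict, List


def allocated_rewards(rewards: Dict[str, float],
                      neighbors: Dict[str, List[str]],
                      alpha: float) -> Dict[str, float]:
    """Assumed linear allocation: own reward plus alpha times neighbor rewards."""
    return {
        i: rewards[i] + alpha * sum(rewards[j] for j in neighbors[i])
        for i in rewards
    }


def total_weight(agent: str, neighbors: Dict[str, List[str]], alpha: float) -> float:
    """Total weight of one agent's raw reward summed over all allocated rewards."""
    appearances = sum(agent in nbrs for nbrs in neighbors.values())
    return 1.0 + alpha * appearances


# Toy chain a - b - c (hypothetical network and reward values).
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
shaped = allocated_rewards({"a": 1.0, "b": 2.0, "c": 3.0}, chain, alpha=0.5)

# A five-leg hub and a three-leg hub, each connected only to leaf intersections.
star = {"hub5": ["l1", "l2", "l3", "l4", "l5"], "hub3": ["m1", "m2", "m3"]}
for hub in ("hub5", "hub3"):
    for leaf in star[hub]:
        star[leaf] = [hub]
w5 = total_weight("hub5", star, alpha=0.5)
w3 = total_weight("hub3", star, alpha=0.5)
```

Under this assumed form, the raw reward of an intersection with $d$ neighbors carries a total weight of $1 + \alpha d$ across the network, so the five-leg hub is weighted more heavily than the three-leg hub for the same constant $\alpha$.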

So far, a great number of methods have been proposed for TSC, such as max pressure [48] and the cell transmission model [49]. It may be interesting to compare these methods comprehensively. Furthermore, since parameters heavily affect the performance of an algorithm, it is interesting to study how to adjust them automatically so as to achieve good solution quality. Finally, it may be interesting to study our method under other MDP settings for the TSC problem.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. 61973244, 61573277).

Bibliography

[1] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, “A survey on reinforcement learning models and algorithms for traffic signal control,” ACM Computing Surveys (CSUR), vol. 50, no. 3, p. 34, 2017.
[2] Q. Wu and J. Guo, “Optimal bidding strategies in electricity markets using reinforcement learning,” Electric Power Components and Systems, vol. 32, no. 2, pp. 175–192, 2004.
[3] B. Yin, M. Dridi, and A. El Moudni, “Traffic network micro-simulation model and control algorithm based on approximate dynamic programming,” IET Intelligent Transport Systems, vol. 10, no. 3, pp. 186–196, 2016.
[4] P. Koonce and L. Rodegerdts, “Traffic signal timing manual,” United States. Federal Highway Administration, Tech. Rep., 2008.
[5] H. Ceylan and M. G. Bell, “Traffic signal timing optimisation based on genetic algorithm approach, including drivers routing,” Transportation Research Part B: Methodological, vol. 38, no. 4, pp. 329–342, 2004.
[6] J. García-Nieto, E. Alba, and A. C. Olivera, “Swarm intelligence for traffic light scheduling: Application to real urban areas,” Engineering Applications of Artificial Intelligence, vol. 25, no. 2, pp. 274–283, 2012.
[7] J. Qiao, N. Yang, and J. Gao, “Two-stage fuzzy logic controller for signalized intersection,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 41, no. 1, pp. 178–184, 2010.
[8] D. Srinivasan, M. C. Choy, and R. L. Cheu, “Neural networks for real-time traffic signal control,” IEEE Transactions on Intelligent Transportation Systems, vol. 7, no. 3, pp. 261–272, 2006.
