Inequity aversion reduces travel time in the traffic light control problem

Mersad Hassanjani; Farinaz Alamiyan-Harandi; Pouria Ramazi

arXiv:2302.12053·cs.MA·February 24, 2023


TL;DR

This paper introduces IACoLight, an improved traffic light control model that incorporates inequity aversion to enhance traffic flow, achieving up to 11.4% better performance than previous models by reshaping agent rewards.

Contribution

It proposes a novel integration of inequity aversion into deep reinforcement learning for traffic control, exploring positive and negative reward adjustments for the first time.

Findings

01

IACoLight outperforms CoLight by up to 11.4% in traffic flow efficiency.

02

Rewarding advantageous inequities further improves traffic management.

03

Incorporating inequity aversion reduces average vehicle travel time.


Tables

Table 1: Statistical results for the Hangzhou dataset.

Methods	Episode (from $0$ to $100$)	Performance (seconds)	Convergence (episode numbers)
CoLight	$466.7$	$349.1$	$84, 86$
Disadvantageous IACoLight ($\beta = 0$)
$\alpha = 0.4$	$474.5$	$344.6$	$82, 95$
$\alpha = -0.2$	$461.2$	$335.4$	$87, 95$
Advantageous IACoLight ($\alpha = 0$)
$\beta = 0.4$	$458.4$	$320.7$	$97, 97$
$\beta = -1$	$481.3$	$371.9$	$73, 96$
IACoLight ($\alpha, \beta > 0$)
$\alpha = 0.2, \beta = 0.2$	$460.1$	$325.1$	$74, 99$
IACoLight (exhaustive search)
$\alpha = 0.6, \beta = -0.2$	$473$	$309.1$	$78, 88$

Equations

$$L(\bm{\theta}_{n}) = \mathbb{E}\left[\left(r^{k}_{t} + \gamma \max_{a^{k}_{t+1}} Q^{\pi^{k}}(s^{k}_{t+1}, a^{k}_{t+1}; \bm{\theta}_{n-1}) - Q^{\pi^{k}}(s^{k}_{t}, a^{k}_{t}; \bm{\theta}_{n})\right)^{2}\right] \qquad (1)$$

$$i^{k}_{t} = -\frac{\alpha_{k}}{N-1} \sum_{j \neq k} \max(w^{j}_{t} - w^{k}_{t}, 0) - \frac{\beta_{k}}{N-1} \sum_{j \neq k} \max(w^{k}_{t} - w^{j}_{t}, 0) \qquad (2)$$

$$w^{j}_{t} = \gamma \lambda w^{j}_{t-1} + e^{j}_{t} \quad \forall t \geq 1, \qquad w^{j}_{0} = 0 \qquad (3)$$


Code & Models

Repositories

mersadhj/iacolight
tfOfficial


Taxonomy

Topics: Traffic control and management · Traffic Prediction and Management Techniques · Traffic and Road Safety

Full text

Mersad Hassanjani, Farinaz Alamiyan-Harandi [1] (equal contribution), Pouria Ramazi

[1] Department of Electrical & Computer Engineering, Isfahan University of Technology, Isfahan, 84156-83111, Iran

[2] Department of Mathematics & Statistics, Brock University, St. Catharines, ON L2S 3A1, Canada

Abstract

The traffic light control problem is to improve the traffic flow by coordinating between the traffic lights. Recently, a successful deep reinforcement learning model, CoLight, was developed to capture the influences of neighboring intersections by a graph attention network. We propose IACoLight, which improves the performance of CoLight by up to 11.4% by incorporating the Inequity Aversion (IA) model, which reshapes each agent’s reward by adding or subtracting advantageous or disadvantageous reward inequities compared to other agents. Unlike in other applications of IA, where both advantageous and disadvantageous inequities are punished by considering negative coefficients, we also allowed them to be rewarded and explored a range of both positive and negative coefficients. Our experiments demonstrated that making CoLight agents averse to inequities improved the vehicles’ average travel time, and that rewarding rather than punishing advantageous inequities enhanced the results.

keywords:

Traffic Light Control, Deep Reinforcement Learning, Multi-Agent Systems, Inequity Aversion Model

1 Introduction

The problem of traffic light control is to coordinate between intersections by controlling their traffic lights to improve traffic flow. This problem remains one of the greatest challenges of the 21st century (Qadri et al., 2020). To tackle this challenge, researchers have taken various approaches such as the coordinated method modifying the start time of the green lights between consecutive intersections (Koonce & Rodegerdts, 2008), the optimization technique minimizing the vehicles’ travel time under certain traffic flow assumptions (Diakaki et al., 2002), and the models applying perimeter control to handle transferring flows between regions of a city (Kouvelas et al., 2017, 2015). In addition to conventional approaches, the problem was recently tackled with Reinforcement Learning (RL) methods (Qadri et al., 2020). RL is a promising machine-learning framework where an agent interacts with a given environment by applying actions and receiving signals, which are interpreted as rewards and punishments. Via the interactions, the agent learns an optimal policy, a probability distribution over the available actions that maximizes the total obtained rewards for each visited environment state (Sutton et al., 1998; Alamiyan-Harandi et al., 2018; Rasheed et al., 2020).

Encompassing several intersections, the traffic light control problem requires several actions to be executed at the same time. Hence, the Multi-Agent (MA) extension of RL, i.e., MARL, is often used for this problem. In the MARL setting, several agents coexist in an environment with multiple intersections. Each agent is responsible for controlling the traffic flow of one intersection by scheduling its traffic lights, and learns a policy from its own observations to minimize the average queue length on all lanes over all intersections. This multi-agent setting results in a conflict of interest in using intersections as common resources.

CoLight (Wei, Xu, et al., 2019) is a state-of-the-art model that enables cooperation of traffic signals in the traffic light control problem. The authors utilized a graph attention network (Veličković et al., 2018) that represents the observations of the neighboring intersections as an overall summary and enables the agents to learn the influence of the neighbors on the intersection they control. Each agent managed the traffic flow of an intersection by using a Deep Q Network (DQN) structure in the CityFlow simulator, an open-source traffic simulator designed for large-scale traffic scenarios (Zhang et al., 2019).

Recently, the Inequity Aversion (IA) model (Hughes et al., 2018) was introduced to improve the performance of MARL algorithms by manipulating the agents’ rewards based on the envy and guilt notions. Each agent compares its reward with that of each of its fellows. An advantageous inequity happens when the agent earns more in a pairwise comparison, and a disadvantageous inequity happens otherwise. Both cases are punished by subtracting the difference from the agent’s own reward, capturing the guilt and envy notions, respectively.

Nevertheless, advantageous and disadvantageous inequities may be considered as a reward, rather than a punishment, by adding a positive scale of differences to the agent’s own reward. Yet this has not been thoroughly investigated in the literature. To the best of our knowledge, all previous research on applying the IA model to MARL has used inequities as punishments. It hence remains an open question whether and how much rewarding the inequities can improve the performance of the IA model. This is partly due to the great computational costs associated with obtaining the results for a single pair of coefficients for the inequities. In the original work, a limited search over the parameter space is performed to find the best range.

Our goal is (i) to investigate the effectiveness of reshaping rewards in MARL in the traffic light control problem by using the IA model and (ii) to investigate the effect of different coefficients of the advantageous and disadvantageous inequities. To this end, we incorporate the IA model into the training process of the CoLight model (Wei, Xu, et al., 2019) and introduce the IACoLight model. The obtained results are compared with CoLight’s results under the same simulated environments. We also investigate the positive and negative ranges of the IA model’s hyperparameters to analyze the potential benefits achieved from their different combinations in the traffic light control problem.

The remainder of this paper is organized as follows: Section 2 reviews some RL methods that have been recently introduced to control traffic lights. Section 3 defines the MARL problem and the proposed IACoLight model. The experimental results are presented in Section 4 and are discussed in Section 5.

2 RL methods to control traffic lights

To apply RL methods in traffic signal control, researchers considered various traffic scenarios and scales, from controlling the traffic flow of a single intersection to several ones in crowded cities.

To control the traffic lights of a single intersection in the Simulation of Urban MObility (SUMO) simulator (Lopez et al., 2018), the DQN structure of the IntelliLight method (Wei et al., 2018) was trained using real data collected from surveillance cameras. The learned policy was evaluated not only based on the quantitative evaluation metrics including the weighted sum of the average of rewards, queue lengths, durations, and delays but also by a qualitative assessment of the pattern learned for light switching, because some policies might have the same reward but one may be more suitable for real-world practice.

Following the analogy that “when humans attempt to master a skill, they often refer to expert knowledge”, the authors of DemoLight (Xiong et al., 2019) utilized the Self-Organizing Traffic Light Control (Cools et al., 2013) to make the agents learn from an expert. They applied the Advantage Actor-Critic (A2C) algorithm (Mnih et al., 2016) to speed up the learning process in the CityFlow simulator. The results of DemoLight were evaluated by the travel time metric and demonstrated a more efficient exploration of this approach in comparison with previous methods.

The PressLight algorithm was introduced by Wei, Chen, et al. (2019), where the reward function was defined according to the Max Pressure method (Lioris et al., 2016) that minimizes the overall travel time for the whole network by minimizing the pressure of each intersection in the network. A DQN was trained to control the traffic lights of several intersections in the CityFlow simulator using both synthetic and real-world traffic data. The performance excelled under heavy traffic.

MetaLight (Zang et al., 2020) was another approach where the gradient-based meta-learning algorithm (Finn et al., 2017) was applied in the CityFlow simulator to speed up the learning process and extend generalization so that the knowledge gained in previous traffic scenarios could be used in new ones. This method also improved the FRAP model (Zheng et al., 2019), a model that is invariant to symmetric operations like Flipping and Rotation, considers All Phase configurations, and is based on the DQN structure to train the agents. Individual and global level adaptations were used to apply meta-learning on the off-policy RL method of MetaLight. MetaLight was tested on four real-world datasets.

The MPLight method (Chen et al., 2020) was another DQN-based method that employed a decentralized DQN structure, used the concept of pressure in traffic, and utilized parameter sharing to control all intersections. Although the reward function and state definition were the same as the ones introduced in the PressLight method, FRAP was used in MPLight instead of a simple DQN as the base model. Scalability, coordination, and data feasibility were the three main problems tested in the experiments with more than $1{,}000$ traffic lights using the CityFlow simulator.

3 The IACoLight setup

We define the traffic light control problem in the following MARL setting. Consider an environment in the form of an urban region consisting of several roads with $N$ intersections, each having a number of traffic lights controlled by a single agent, resulting in a total of $N$ agents. At each time step $t$, each agent $k$ makes the partial observation $o^{k}_{t}$ that consists of (i) the number of vehicles waiting on each of the lanes of the intersection and (ii) which lanes of the intersection currently have a green light. The observation $o^{k}_{t}$ is represented by some of its features extracted by a Multi-Layer Perceptron (MLP) layer. The representation is referred to as the environment’s local state $s^{k}_{t}$ and is communicated to the agent’s neighboring intersections via graph attention networks (Veličković et al., 2018). These networks prepare a comprehensive summary of the intersection’s neighborhood indicating the importance of the received information from each neighbor.

Each agent executes an action that determines the intersection’s phase, which specifies the two traffic movements of the intersection whose lights will be green during the next time interval $\Delta t$. For example, consider an agent that controls a typical intersection with four entering approaches marked with the main directions east, north, west, and south (Figure 1). The vehicles in each entering approach choose one of three lanes: the right lane to turn right, the left lane to turn left, and the middle lane to go straight ahead. For this intersection, a standard phase turns the light green for vehicles in the middle lanes of two opposite entrances.

The available actions for agent $k$ form the action set $\mathcal{A}^{k}=\{a_{1},\ldots,a_{m}\}$. Agent $k$ selects an action $a^{k}_{t}$ by using a policy $\pi^{k}$, that is, a probability distribution over the agent’s action set. By executing the joint action $\bm{a}_{t}=[a^{1}_{t},\ldots,a^{k}_{t},\ldots,a^{N}_{t}]$ at the global state $s_{t}\in\mathcal{S}$, which consists of the features extracted from all agents’ observations, the environment transitions to the global state $s_{t+1}$ according to a transition distribution $\mathcal{T}(s_{t+1}\lvert s_{t},\bm{a}_{t})$, resulting in the reward $r^{k}_{t+1}$ for each agent $k$. Aiming to minimize the vehicles’ travel time, we define the reward of each agent $k$ based on the average length of the queue formed in the incoming lanes of every direction of the agent’s intersection as $r^{k}_{t}=-\frac{1}{d}\sum_{l}u_{t}^{k,l}$, where $u_{t}^{k,l}$ is the queue length of lane $l$ of intersection $k$ at time $t$ and $d$ is the number of the intersection’s entrances.
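As a minimal sketch of the queue-based reward defined above, assume the queue lengths of the incoming lanes are available as a plain list. The function name `queue_reward` and the example numbers are illustrative and are not taken from the IACoLight code.

```python
def queue_reward(queue_lengths, num_entrances):
    """Return r_t^k = -(1/d) * sum_l u_t^{k,l} for one intersection.

    queue_lengths: queue length u_t^{k,l} of every incoming lane l.
    num_entrances: d, the number of entrances of the intersection.
    """
    if num_entrances <= 0:
        raise ValueError("num_entrances must be positive")
    return -sum(queue_lengths) / num_entrances


# Hypothetical intersection: 4 entrances with 3 lanes each (12 incoming lanes).
example_queues = [2, 0, 1, 4, 3, 0, 0, 1, 2, 5, 0, 2]
example_reward = queue_reward(example_queues, num_entrances=4)
print(example_reward)
```

Longer queues give a more negative reward, so maximizing this reward pushes the agent toward shorter queues at its own intersection.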

Each agent $k$ uses its rewards to compute a state-action value function $Q^{\pi^{k}}(s^{k}_{t},a^{k}_{t})$ for all local states $s^{k}_{t}$ and actions $a^{k}_{t}\in\mathcal{A}^{k}$, where $Q^{\pi^{k}}(s^{k}_{t},a^{k}_{t})$ approximates the return $R^{k}_{t}=\sum_{i=0}^{\infty}\gamma^{i}r^{k}_{t+i+1}$, $0\leq\gamma\leq 1$, that is, an estimate of the cumulative $\gamma$-discounted rewards over all local states visited in the future by applying action $a^{k}_{t}$ in local state $s^{k}_{t}$ and following policy $\pi^{k}$. Agent $k$ selects the action with the maximum $Q^{\pi^{k}}(s^{k}_{t},a^{k}_{t})$ at each time step $t$.
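The two ideas in this paragraph, the discounted return and greedy action selection, can be sketched as follows. This is only an illustration: the infinite sum is truncated to a finite reward sequence, and the helper names and numbers are assumptions rather than part of the paper.

```python
def discounted_return(rewards, gamma):
    """Compute sum_i gamma^i * rewards[i] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: R = r + gamma * R_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def greedy_action(q_values):
    """Return the index of the action with the largest Q value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical rewards r_{t+1}, r_{t+2}, r_{t+3} and Q values for three phases.
example_return = discounted_return([-5.0, -4.0, -2.0], gamma=0.5)
example_action = greedy_action([-7.5, -3.0, -4.2])
print(example_return, example_action)
```

In the model itself the Q values come from the DQN rather than from enumerated returns, so the first function only shows what quantity the network is trained to approximate.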

Here, similar to the CoLight model (Wei, Xu, et al., 2019), a DQN (Figure 2) (Mnih et al., 2015) is used as $Q^{\pi^{k}}(s^{k}_{t},a^{k}_{t})$; it takes the visited local state $s^{k}_{t}$ as the input and estimates the state-action value of applying each available action of agent $k$ in the local state $s^{k}_{t}$ as the output. The parameters of the DQN, denoted by $\bm{\theta}$, are learned iteratively by minimizing the following loss function:

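$$L(\bm{\theta}_{n}) = \mathbb{E}\left[\left(r^{k}_{t} + \gamma \max_{a^{k}_{t+1}} Q^{\pi^{k}}(s^{k}_{t+1}, a^{k}_{t+1}; \bm{\theta}_{n-1}) - Q^{\pi^{k}}(s^{k}_{t}, a^{k}_{t}; \bm{\theta}_{n})\right)^{2}\right] \qquad (1)$$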

where $\bm{\theta}_{n}$ is the parameter vector of the DQN’s neural networks in the $n$th iteration of the learning process. The loss is measured as the squared difference between the predicted and target $Q$ values. The term $Q^{\pi^{k}}(s^{k}_{t},a^{k}_{t};\bm{\theta}_{n})$ is the predicted $Q$ value and the term $r^{k}_{t}+\gamma\max_{a^{k}_{t+1}}Q^{\pi^{k}}(s^{k}_{t+1},a^{k}_{t+1};\bm{\theta}_{n-1})$ is the target $Q$ value, which is computed by using the output of the DQN with $\bm{\theta}_{n-1}$, the DQN’s parameters in the previous iteration of the learning process. To implement the IACoLight model, we used the linear reward function $r^{k}_{t}=\alpha e^{k}_{t}+\beta i^{k}_{t}$ where $\alpha$ and $\beta$ are constant scalars, $e^{k}_{t}$ is the extrinsic reward that agent $k$ receives from the environment, and $i^{k}_{t}$ is the intrinsic reward that agent $k$ computes according to the IA model (Hughes et al., 2018):

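$$i^{k}_{t} = -\frac{\alpha_{k}}{N-1} \sum_{j \neq k} \max(w^{j}_{t} - w^{k}_{t}, 0) - \frac{\beta_{k}}{N-1} \sum_{j \neq k} \max(w^{k}_{t} - w^{j}_{t}, 0) \qquad (2)$$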

where $\alpha_{k},\beta_{k}\in\mathbb{R}$ are adjustable parameters and $w^{j}_{t}$ is a temporary memory of the extrinsic reward occurrence (Hughes et al., 2018):

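$$w^{j}_{t} = \gamma \lambda w^{j}_{t-1} + e^{j}_{t} \quad \forall t \geq 1, \qquad w^{j}_{0} = 0 \qquad (3)$$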

and where $\lambda\in[0,1]$ is a trace-decay hyper-parameter. In equation (2), the terms $\max(w^{k}_{t}-w^{j}_{t},0)$ and $\max(w^{j}_{t}-w^{k}_{t},0)$ are the advantageous and disadvantageous inequities of agent $k$ against one of its fellows, agent $j$, respectively.
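The intrinsic reward and the reward trace described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the three-agent example, and the values of $\gamma$ and $\lambda$ are hypothetical, and the sketch stops at the intrinsic term without combining it with the extrinsic reward.

```python
def update_traces(prev_traces, extrinsic_rewards, gamma, lam):
    """Trace recursion: w_t^j = gamma * lambda * w_{t-1}^j + e_t^j for every agent j."""
    return [gamma * lam * w + e for w, e in zip(prev_traces, extrinsic_rewards)]


def intrinsic_reward(traces, k, alpha_k, beta_k):
    """Inequity term i_t^k of agent k, given the traces w_t^j of all N agents."""
    n = len(traces)
    if n < 2:
        raise ValueError("at least two agents are required")
    # Disadvantageous inequity: other agents earned more than agent k.
    disadvantageous = sum(max(traces[j] - traces[k], 0.0) for j in range(n) if j != k)
    # Advantageous inequity: agent k earned more than other agents.
    advantageous = sum(max(traces[k] - traces[j], 0.0) for j in range(n) if j != k)
    return -alpha_k / (n - 1) * disadvantageous - beta_k / (n - 1) * advantageous


# Hypothetical example: three agents, traces start at w_0 = 0.
traces = update_traces([0.0, 0.0, 0.0], [-2.0, -4.0, -6.0], gamma=0.8, lam=0.5)
for agent in range(3):
    print(agent, intrinsic_reward(traces, agent, alpha_k=0.6, beta_k=-0.2))
```

With a positive `alpha_k` and a negative `beta_k`, the agent with the shortest queues in this example receives a positive intrinsic reward, while the agent with the longest queues receives a negative one. This matches the reading in the text that $\beta<0$ rewards advantageous inequities and $\alpha>0$ punishes disadvantageous ones.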

4 Experiments

4.1 Experiment setup

We used the same experiment setup as that of CoLight (Wei, Xu, et al., 2019) where several environments were generated based on the CityFlow simulator. The authors collected data by analyzing the trajectories of vehicles in images captured by roadside cameras in the Chinese cities of Hangzhou and Jinan, as well as New York City in the United States. In these environments, several vehicles move from various origins to their destinations while encountering traffic lights. The green signal of each traffic light is accompanied by a three-second yellow signal and a red signal of at least two seconds. The authors extracted traffic information such as the number of vehicles passing through the intersections during a single day. They also used some artificially generated data.

In the Hangzhou dataset, the data is collected from roadside surveillance cameras at $16$ intersections in Gudang Sub-district, Hangzhou, China. The mean and standard deviation of the arrival rate (vehicles/$300$ s) in this dataset are $526.63$ and $86.70$, respectively. In the Jinan dataset, the extracted data is associated with roadside cameras located at $12$ intersections in Dongfeng Sub-district, Jinan, China. The mean and standard deviation of the arrival rate (vehicles/$300$ s) in this dataset are $250.70$ and $38.21$, respectively.

We conducted our experiments on the environments created by using the Hangzhou and Jinan datasets (https://traffic-signal-control.github.io), which consist of a $4\times 4$ grid with $16$ intersections and a $3\times 4$ grid with $12$ intersections, respectively. Each agent controls an intersection and sends its information to its four neighbors located at its four main sides. To learn appropriate policies in these environments, each experiment included $100$ episodes, each containing $1440$ samples, which is the maximum amount of data that can be created in an intersection during a day.
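The neighbor structure of such a grid can be sketched as follows. This is only an illustration: the (row, column) indexing and the function name are assumptions, and the sketch simply omits missing neighbors for intersections on the boundary, which may differ from how CoLight treats them.

```python
def grid_neighbors(rows, cols):
    """Map each intersection (row, col) to its neighbors on the four main sides."""
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only the candidates that lie inside the grid.
            neighbors[(r, c)] = [
                (nr, nc) for nr, nc in candidates if 0 <= nr < rows and 0 <= nc < cols
            ]
    return neighbors


hangzhou_grid = grid_neighbors(4, 4)  # 16 intersections
jinan_grid = grid_neighbors(3, 4)     # 12 intersections
print(len(hangzhou_grid), len(jinan_grid))
print(hangzhou_grid[(1, 1)])
```

Interior intersections have four neighbors in this sketch, while corner intersections have two.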

We compared IACoLight with CoLight (Wei, Xu, et al., 2019) as a baseline. For the CoLight method, we set the number of heads in the attention mechanism to $5$. For IACoLight, we used the advantageous type of the IA model where $\alpha$ is zero and $\beta$ is set to $0.05$. To evaluate the performance of each method, the time elapsed between entering and leaving the grid was measured for each vehicle and the average was used as the vehicle travel time metric. We executed each method $5$ times, with different traffic flows sampled from the main datasets, and obtained the average of the evaluation metric.
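The travel time metric described above can be sketched as follows. The trip records and run values are invented for illustration, and the sketch assumes that only vehicles that have both entered and left the grid are counted, which the text does not state explicitly.

```python
def average_travel_time(trips):
    """Average of (leave_time - enter_time) over all vehicles, in seconds.

    trips: list of (enter_time, leave_time) pairs, one per vehicle.
    """
    durations = [leave - enter for enter, leave in trips]
    if not durations:
        raise ValueError("no completed trips")
    return sum(durations) / len(durations)


def mean_over_runs(values):
    """Average the metric over repeated executions of the same method."""
    return sum(values) / len(values)


# Hypothetical trips (seconds) for one run, and hypothetical per-run averages.
run_metric = average_travel_time([(0, 300), (60, 420), (120, 390)])
overall = mean_over_runs([run_metric, 320.0, 330.0])
print(run_metric, overall)
```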

Since the IA model is sensitive to hyperparameters, we ran a sweep over the $\alpha$ and $\beta$ parameters in order to calibrate them correctly in the traffic light control problem and also analyze their synergistic effect. Here, to investigate the effect of agents’ aversion to each of the advantageous and disadvantageous inequities, the range of $-1$ to $1$ with increments of $0.2$ for both $\alpha$ and $\beta$ was tested, resulting in a total of $11\times 11=121$ cases, each repeated $3$ times for a length of $100$ episodes. This resulted in a total of $363$ experiments. Each experiment took around $5$ hours and was performed on a Linux server with $16$ CPUs and $120$ GB of RAM.
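The size of this sweep can be reproduced with a short sketch. The helper name and the serial compute estimate are illustrative assumptions. The estimate simply multiplies the stated figures and ignores any parallel execution.

```python
from itertools import product


def coefficient_grid(low=-1.0, high=1.0, step=0.2):
    """Return the evenly spaced coefficient values from low to high, inclusive."""
    count = round((high - low) / step) + 1
    # Round each value to suppress floating-point noise such as 0.6000000000000001.
    return [round(low + i * step, 10) for i in range(count)]


values = coefficient_grid()
pairs = list(product(values, values))  # every (alpha, beta) combination

REPEATS = 3               # repetitions per combination, as stated in the text
HOURS_PER_EXPERIMENT = 5  # approximate duration stated in the text

total_experiments = len(pairs) * REPEATS
rough_serial_hours = total_experiments * HOURS_PER_EXPERIMENT
print(len(values), len(pairs), total_experiments, rough_serial_hours)
```

Under these figures, a purely sequential run would need on the order of 1,800 hours, which illustrates why earlier work explored only a limited part of the parameter space.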

4.2 Experimental results

According to the results of tuning the $\alpha$ and $\beta$ hyperparameters of IACoLight on the Hangzhou dataset, using both inequities of the IA model with tuned coefficients in the IACoLight model outperforms all other methods (Table 1). The best average travel time computed over the last $20$ episodes, as a measure of the final learned performance, was $309.1$ seconds, obtained with $\alpha=0.6$ and $\beta=-0.2$. It was $11.4\%$ better than the state-of-the-art CoLight model. This performance was achieved when disadvantageous inequities were punished ($\alpha>0$) but advantageous inequities were rewarded ($\beta<0$), and the improvement is larger than that of the common setups in the literature where only disadvantageous inequities are used and punished ($\alpha>0,\beta=0$; $1.3\%$) or only advantageous inequities are used and punished ($\alpha=0,\beta>0$; $8.1\%$). Note that these differences are significant in the context of traffic flow control. The next best $3$ performance values were $311.3$, $315.9$, and $316.7$. All of these values are lower than $325.1$, the performance of the common setup in the literature where both advantageous and disadvantageous inequities were punished ($\alpha,\beta>0$). These results demonstrated that the lower mean travel time for vehicles was obtained when $\beta$ was negative (Figure 3).
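The percentages quoted in this paragraph follow from the Performance column of Table 1. A small sketch of the arithmetic, with an illustrative function name, is:

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


COLIGHT = 349.1  # CoLight performance (seconds) from Table 1

# Performance (seconds) of selected IACoLight settings from Table 1.
table1_performance = {
    "alpha=0.4, beta=0": 344.6,
    "alpha=0, beta=0.4": 320.7,
    "alpha=0.2, beta=0.2": 325.1,
    "alpha=0.6, beta=-0.2": 309.1,
}

for setting, seconds in table1_performance.items():
    reduction = relative_reduction(COLIGHT, seconds)
    print(f"{setting}: {reduction:.2f}% lower travel time than CoLight")
```

The computed reductions agree with the reported $1.3\%$, $8.1\%$, and $11.4\%$ up to rounding.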

In the Hangzhou dataset, the convergence speed of IACoLight with the best $\alpha$ and $\beta$ combination was slightly higher, but CoLight and the advantageous IACoLight, with the commonly used $\alpha$ and $\beta$ values in the literature, had a lower distance between the two convergence indexes (the vertical lines in Figure 4). The final learned performance demonstrated that IACoLight with the best $\alpha$ and $\beta$ combination outperformed the CoLight and the advantageous IACoLight methods (the horizontal lines in Figure 4). It had an $11.4\%$ and a $3.6\%$ lower travel time compared to CoLight and the advantageous IACoLight, respectively, over the last $20$ episodes.

In the Jinan dataset, IACoLight with the best $\alpha$ and $\beta$ combination reached both convergence thresholds sooner. In addition, its final learned performance had a $7\%$ and a $9.2\%$ lower travel time compared to CoLight and the advantageous IACoLight, respectively (Figure 5).

5 Discussion

We boosted the performance of agents trained based on the state-of-the-art CoLight method in the MARL setting of the traffic light control problem by utilizing the IA method inspired by the envy and guilt feelings. We conducted the experiments on real data from Hangzhou and Jinan cities and compared the result of IACoLight with CoLight. The comparison results showed that IACoLight reduces the average travel time of the vehicles.

The recommended intervals for the $\alpha$ and $\beta$ parameters of the IA model in the RL literature were as follows: $\alpha>\beta$ and $\beta\in[0,1]$ (Hughes et al., 2018; Yang et al., 2020; Jiang & Lu, 2019). These settings were adopted on the basis that people are averse to inequities against themselves more than inequities against others, so they punish themselves more when they face disadvantageous inequities. For the first time, we performed an exhaustive search over these parameters of the IA model. Unlike the common practice of using positive values for both $\alpha$ and $\beta$ in the literature, we found that negative values of $\beta$ can yield a higher performance. This corresponds to the case where advantageous inequities are rewarded.

The superior performance of IACoLight may be explained by the positive and negative coefficients of the reward inequities. A positive value of the $\alpha$ parameter implies punishing the agent when “feeling” envy, causing the agent to act in order to improve its performance. However, negative values of $\alpha$ encourage the agent to continue its current behavior. On the other hand, assigning positive values to the $\beta$ parameter induces the feeling of guilt, whereas negative values encourage the agent by increasing its received reward, which may be interpreted as the feeling of pride.

Now, the common practice is to take both $\alpha$ and $\beta$ positive, implying that both the agent who earned less and the one who earned more punish themselves and try to change their behaviour. This may lead to a population where all individuals earn the same reward, but at the possible cost of losing the valuable actions taken by the highest earners. However, when $\beta$ is negative, only the lower-earning agent punishes itself and tries to change its behaviour. The higher-earning agent continues its rewarding actions. Which of these two or other combinations of $\alpha$ and $\beta$ performs best? It seems to depend on the environment, and remains to be investigated in future work.

The best advantageous type of IACoLight, i.e., $\alpha=0$, had a positive $\beta$ value ($0.4$) and the best disadvantageous type of IACoLight, i.e., $\beta=0$, had a negative $\alpha$ value ($-0.2$). Namely, if only one type of inequity is considered, it is better to feel guilty about the advantageous inequities and make the agent punish itself using larger positive values compared to the case when both inequities are considered. On the other hand, it is better to reward the agent for the disadvantageous inequities using lower negative values compared to the case when both inequities are considered.

It is noticeable that unlike the problems discussed in the IA model in the RL literature (Hughes et al., 2018; Yang et al., 2020; Jiang & Lu, 2019), agents are not able to punish each other in the traffic light control problem and self-punishment of an agent does not encourage the agent to punish others. This may partly explain the difference between the claimed near-optimal hyperparameters.

The results highlight the potential of reshaped rewards in improving the performance of deep reinforcement learning methods. Moreover, the higher performance of the IA model for the newly tested range of parameters suggests performing a search over the coefficients $\alpha$ and $\beta$ of the inequities in future applications. Automatic search on these parameters is subject to future studies.

Acknowledgments

We would like to thank the Digital Research Alliance of Canada for providing computational resources that facilitated our experiments.

Declarations

• Funding: No funds, grants, or other support was received.

• Competing interests: The authors declare no competing interest.

• Ethics approval and Consent to participate/publication: Not applicable.

• Code/data availability: The codes are available online at https://github.com/MersadHJ/IACoLight.

• Authors’ contributions: MH contributed to the algorithm and experiments. FA contributed to the idea and analysis, and took the lead on the writing. PR contributed to the idea and analysis. FA and PR supervised the study. All authors contributed to the writing.

Bibliography

2. Alamiyan-Harandi, F., Derhami, V., & Jamshidi, F. (2018). A new framework for mobile robot trajectory tracking using depth data and learning algorithms. Journal of Intelligent & Fuzzy Systems, 34(6), 3969–3982.
3. Chen, C., Wei, H., Xu, N., Zheng, G., Yang, M., Xiong, Y., … Li, Z. (2020). Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control. Proceedings of the AAAI Conference on Art…
4. Cools, S.-B., Gershenson, C., & D’Hooghe, B. (2013). Self-organizing traffic lights: A realistic simulation. In Advances in applied self-organizing systems (pp. 45–55). Springer.
5. Diakaki, C., Papageorgiou, M., & Aboudolas, K. (2002). A multivariable regulator approach to traffic-responsive network-wide signal control. Control Engineering Practice, 10(2), 183–195.
6. Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning (pp. 1126–1135).
7. Hughes, E., Leibo, J.Z., Phillips, M., Tuyls, K., Dueñez-Guzman, E., Castañeda, A.G., … others (2018). Inequity aversion improves cooperation in intertemporal social dilemmas. Proceedings of the 32nd International Conference on Neural Information Process…
8. Jiang, J., & Lu, Z. (2019). Learning fairness in multi-agent systems. Advances in Neural Information Processing Systems, 32.