Diverse Policy Optimization for Structured Action Space

Wenhao Li; Baoxiang Wang; Shanchao Yang; Hongyuan Zha

arXiv:2302.11917·cs.LG·February 24, 2023

TL;DR

This paper introduces Diverse Policy Optimization (DPO), a novel reinforcement learning method that models policies as energy-based models and uses GFlowNet for efficient, diverse policy sampling in structured action spaces, improving robustness and exploration.

Contribution

The paper proposes DPO, combining energy-based models and GFlowNet to effectively discover diverse policies in structured action spaces, addressing scalability issues of existing methods.

Findings

01

DPO efficiently discovers diverse policies in challenging benchmarks.

02

DPO substantially outperforms existing state-of-the-art methods.

03

DPO demonstrates robustness and improved exploration in structured action spaces.

Abstract

Enhancing the diversity of policies is beneficial for robustness, exploration, and transfer in reinforcement learning (RL). In this paper, we aim to seek diverse policies in an under-explored setting, namely RL tasks with structured action spaces with the two properties of composability and local dependencies. The complex action structure, non-uniform reward landscape, and subtle hyperparameter tuning due to the properties of structured actions prevent existing approaches from scaling well. We propose a simple and effective RL method, Diverse Policy Optimization (DPO), to model the policies in structured action space as the energy-based models (EBM) by following the probabilistic RL framework. A recently proposed novel and powerful generative model, GFlowNet, is introduced as the efficient, diverse EBM-based policy sampler. DPO follows a joint optimization framework: the outer layer…

Tables

Table 1. Performance (↓) on the ATSC benchmark.

MP	Ing. Reg.	Col. Reg.	DvD	Ing. Reg.	Col. Reg.
Avg. Delay	59.64	22.06	Avg. Delay	73.22	55.91
Avg. Trip Time	197.23	86.02	Avg. Trip Time	212.81	115.54
Avg. Wait	20.19	5.46	Avg. Wait	31.36	28.35
Avg. Queue	0.8	0.38	Avg. Queue	1.42	2.28
SQL	Ing. Reg.	Col. Reg.	RSPO	Ing. Reg.	Col. Reg.
Avg. Delay	67.65	58.32	Avg. Delay	90.42	57.28
Avg. Trip Time	205.44	116.29	Avg. Trip Time	226.5	120.53
Avg. Wait	26.45	30.01	Avg. Wait	44.16	28.19
Avg. Queue	1.15	2.06	Avg. Queue	1.74	2.59
MPLight	Ing. Reg.	Col. Reg.	DPO	Ing. Reg.	Col. Reg.
Avg. Delay	78.16	60.42	Avg. Delay	57.2	20.28
Avg. Trip Time	215.72	123.93	Avg. Trip Time	192.75	81.42
Avg. Wait	34.57	30.34	Avg. Wait	18.26	4.77
Avg. Queue	1.48	2.33	Avg. Queue	0.65	0.32

Table 2. Hyperparameters of all methods used in experiments.

(a) DPO.

Name	Tuning Range
number of GPN layers	3
hidden units of GPN	{64, 128, 256}
dropout of GPN layers	0.6
dropout of GPN attention layer	0.5
alpha of GPN	(0, 1)
number of heads of GPN attention layer	4
use residual in GPN	True
norm layer in GPN	{Layernorm, Batchnorm}
number of GAT layers (soft-Q)	3
hidden units of GAT (soft-Q)	{64, 128, 256}
dropout of GAT layers (soft-Q)	0.6
dropout of GAT attention layer (soft-Q)	0.5
alpha of GAT (soft-Q)	(0, 1)
number of heads of GAT attention layer (soft-Q)	4
use residual in GAT (soft-Q)	True
norm layer in GAT (soft-Q)	{Layernorm, Batchnorm}
learning rate of GPN	(1e-5, 1e-3)
learning rate of $Z$	(1e-3, 1e-1)
learning rate of GAT	(1e-5, 1e-3)
Optimizers	AdamW
Replay Buffer Size	1e6
$γ$	(0.9, 0.99)
replay start size	32
minibatch size	32
max gradient norm	20
initial temperature (soft-Q)	1.0
temperature learning rate	(1e-5, 3e-4)
soft update coefficient	(2e-3, 5e-1)
GPN update ratio	(2, 6)
number of GPN updates	(1, 10)

(b) Baselines.

Name	Tuning Range
learning rate (IDQN)	1e-3
training frequency (IDQN)	5
batch size (IDQN)	256
target update (IDQN)	1200
memory size (IDQN)	2e20
learning rate (IDQN)	1e-4
$γ$ (IDQN)	0.99
learning rate (MFQ)	1e-4
exploration decay (MFQ)	$1.0 \to 0.05, 2000$
$γ$ (MFQ)	0.95
batch size (MFQ)	128
memory size (MFQ)	5e5
batch size (MPLight)	32
$γ$ (MPLight)	0.99
exploration decay (MPLight)	$1.0 \to 0.0, 220$
target update (MPLight)	500
demand shape (MPLight)	1
$σ$ (DvD)	(1e-4, 1e-2)
$η$ (DvD)	(1e-4, 1e-2)
hidden units (DvD)	{32, 64, 128}
ES-sensings (DvD)	{200, 300, 400}
$K$ (SQL)	(32, 100)
$M$ (SQL)	(32, 100)
$K_{V}$ (SQL)	50
$\alpha$ (RSPO)	(0.1, 1.5)
$λ_{B}^{int}$ (RSPO)	(0, 10)
$λ_{R}^{int}$ (RSPO)	(0, 1)
Initial learning rate (RSPO)	(1e-4, 1e-3)
Batch size (RSPO)	{512, 1600, 6400}
PPO epochs (RSPO)	(1, 10)

Table 3. Performance (↓) of independent learning variants on two scenarios of the ATSC benchmark.

(a) Performance of independent learning variants (TAPAS Cologne).

I-DvD.	Col. Reg.	DvD	Col. Reg.
Avg. Delay	50.16	Avg. Delay	55.91
Avg. Trip Time	108.43	Avg. Trip Time	115.54
Avg. Wait	22.50	Avg. Wait	28.35
Avg. Queue	2.12	Avg. Queue	2.28
I-SQL	Col. Reg.	SQL	Col. Reg.
Avg. Delay	28.39	Avg. Delay	58.32
Avg. Trip Time	93.48	Avg. Trip Time	116.29
Avg. Wait	6.74	Avg. Wait	30.01
Avg. Queue	0.55	Avg. Queue	2.06
I-RSPO	Col. Reg.	RSPO	Col. Reg.
Avg. Delay	23.46	Avg. Delay	57.28
Avg. Trip Time	88.81	Avg. Trip Time	120.53
Avg. Wait	5.95	Avg. Wait	28.19
Avg. Queue	0.49	Avg. Queue	2.59

(b) Performance of independent learning variants (InTAS).

I-DvD.	Ing. Reg.	DvD	Ing. Reg.
Avg. Delay	74.58	Avg. Delay	73.22
Avg. Trip Time	215.07	Avg. Trip Time	212.81
Avg. Wait	32.48	Avg. Wait	31.36
Avg. Queue	1.45	Avg. Queue	1.42
I-SQL	Ing. Reg.	SQL	Ing. Reg.
Avg. Delay	65.29	Avg. Delay	67.65
Avg. Trip Time	201.26	Avg. Trip Time	205.44
Avg. Wait	22.41	Avg. Wait	26.45
Avg. Queue	1.01	Avg. Queue	1.15
I-RSPO	Ing. Reg.	RSPO	Ing. Reg.
Avg. Delay	86.59	Avg. Delay	90.42
Avg. Trip Time	215.25	Avg. Trip Time	226.5
Avg. Wait	39.76	Avg. Wait	44.16
Avg. Queue	1.58	Avg. Queue	1.74

Table 4. Ablation studies of the proposed DPO under two scenarios of the ATSC benchmark.

Soft value regression	Termination action	Physical dependencies	Ing. Reg. Delay	Ing. Reg. Trip	Ing. Reg. Wait	Ing. Reg. Queue	Ing. Reg. Epochs	Col. Reg. Delay	Col. Reg. Trip	Col. Reg. Wait	Col. Reg. Queue	Col. Reg. Epochs
			78.92	214.63	32.75	1.51	/	57.6	120.85	31.66	2.43	/
$✓$			59.79	198.85	18.67	0.71	$\sim$ 2.6 $\times$	23.23	85.2	4.88	0.33	$\sim$ 3.5 $\times$
	$✓$		77.06	210.49	29.68	1.48	/	61.53	125.12	33.75	2.64	/
		$✓$	78.61	218.42	32.89	1.52	/	60.99	126.4	33.71	2.61	/
$✓$	$✓$		72.4	200.72	23.45	1.39	$\sim$ 2 $\times$	30.26	91.59	8.81	0.62	$\sim$ 1.7 $\times$
	$✓$	$✓$	78.5	211.46	32.57	1.51	/	58.76	123.58	31.8	2.54	/
$✓$		$✓$	59.35	194.16	18.23	0.65	$\sim$ 2.6 $\times$	20.22	85.49	4.86	0.33	$\sim$ 3.2 $\times$
$✓$	$✓$	$✓$	57.2	192.75	18.26	0.65	1 $\times$	20.28	81.42	4.77	0.32	1 $\times$

Table 5. Ablation studies of the proposed DPO under the Battle benchmark.

Soft value regression	Termination action	Avg. # Kills	Avg. # Reward	Avg. Epochs
		20.1( $\pm 3.9$ )	0.013( $\pm 0.02$ )	/
$✓$		61.3( $\pm 0.4$ )	0.135( $\pm 0.08$ )	$\sim$ 3.7 $\times$
	$✓$	20.6( $\pm 4.2$ )	0.015( $\pm 0.02$ )	/
$✓$	$✓$	62.1( $\pm$ 0.1)	0.142( $\pm$ 0.19)	1 $\times$

# Nearest agents	Avg. # Kills	Avg. # Reward	Avg. Epochs
3	61.3( $\pm 0.4$ )	0.135( $\pm 0.08$ )	$\sim$ 1 $\times$
4	62.1( $\pm$ 0.1)	0.142( $\pm$ 0.19)	1 $\times$
5	62.3( $\pm$ 0.1)	0.146( $\pm$ 0.23)	$\sim$ 1.7 $\times$
6	58.6( $\pm$ 0.1)	0.101( $\pm$ 0.25)	$\sim$ 3.1 $\times$

Table 6. The robustness of different algorithms after perturbing the traffic distribution in the ATSC benchmark. We randomly select 1 or 2 of the respective main roads, increase the traffic flow by 10%, and train all the algorithms for 50 episodes (about 3% of the standard training sample size).

MP	Ing. Reg.		Col. Reg.		I-SQL	Ing. Reg.		Col. Reg.
MP	1 road	2 roads	1 road	2 roads	I-SQL	1 road	2 roads	1 road	2 roads
Avg. Delay	70.24	85.9	26.83	31.4	Avg. Delay	91.36	101.04	38.31	44.59
Avg. Trip Time	235.77	272.82	92.02	113.25	Avg. Trip Time	258.09	286.68	134.58	143.06
Avg. Wait	23.46	26.18	5.59	6.31	Avg. Wait	32.31	35.73	7.36	7.87
Avg. Queue	0.83	0.88	0.39	0.43	Avg. Queue	1.26	1.39	0.52	0.59
I-RSPO	Ing. Reg.		Col. Reg.		DPO	Ing. Reg.		Col. Reg.
I-RSPO	1 road	2 roads	1 road	2 roads	DPO	1 road	2 roads	1 road	2 roads
Avg. Delay	88.54	97.46	35.51	39.62	Avg. Delay	60.27	72.77	21.45	25.37
Avg. Trip Time	246.73	261.9	126.99	132.55	Avg. Trip Time	211.82	245.41	86.6	102.87
Avg. Wait	30.63	32.23	7.08	7.48	Avg. Wait	19.17	22.91	5.04	5.87
Avg. Queue	1.15	1.22	0.47	0.51	Avg. Queue	0.69	0.824	0.34	0.42

Equations

$$\pi_{\mathrm{ent}}^{*}=\arg\max_{\pi}\sum_{t}\mathbb{E}_{(s_{e}^{t},a_{e}^{t})\sim\rho_{\pi}}\left[r\left(s_{e}^{t},a_{e}^{t}\right)+\alpha\mathcal{H}\left(\pi\left(\cdot\mid s_{e}^{t}\right)\right)\right],$$

$$Q_{\mathrm{soft}}^{*}\left(s_{e}^{t},a_{e}^{t}\right):=r_{e}^{t}+\mathbb{E}_{s_{e}^{t+\ell}\sim\rho_{\pi}}\left[\sum_{\ell=1}^{\infty}\gamma^{\ell}\left(r_{e}^{t+\ell}+\alpha\mathcal{H}\left(\pi_{\mathrm{ent}}^{*}\left(\cdot\mid s_{e}^{t+\ell}\right)\right)\right)\right].$$

$$\pi_{\mathrm{ent}}^{*}=\exp\left(\frac{1}{\alpha}\left(Q_{\mathrm{soft}}^{*}\left(s_{e}^{t},a_{e}^{t}\right)-V_{\mathrm{soft}}^{*}\left(s_{e}^{t}\right)\right)\right),$$

$$V_{\mathrm{soft}}^{*}\left(s_{e}^{t}\right)=\alpha\log\int_{\mathcal{A}_{e}}\exp\left(\frac{1}{\alpha}Q_{\mathrm{soft}}^{*}\left(s_{e}^{t},a_{e}^{\prime}\right)\right)da_{e}^{\prime}.$$

$$Q_{\mathrm{soft}}^{*}\left(s_{e}^{t},a_{e}^{t}\right)=r_{e}^{t}+\gamma\,\mathbb{E}_{s_{e}^{t+1}\sim p_{s_{e}}}\left[V_{\mathrm{soft}}^{*}\left(s_{e}^{t+1}\right)\right].$$

$$\left\{\begin{aligned} &\min_{\theta}J_{Q}(\theta):=\mathbb{E}_{s^{t}_{e},a^{t}_{e},r^{t}_{e},s^{t+1}_{e}\sim D}\left[\frac{1}{2}\left(r^{t}_{e}+V^{\bar{\theta}}\left(s^{t+1}_{e}\right)-Q^{\theta}\left(s^{t}_{e},a^{t}_{e}\right)\right)^{2}\right],\\ &\min_{\phi}J_{\pi}\left(\phi;s^{t}_{e}\right):=\mathrm{KL}\left(\pi^{\phi}\left(\cdot\mid s^{t}_{e}\right)\|\exp\left(\frac{1}{\alpha}\left(Q^{\theta}\left(s^{t}_{e},\cdot\right)-V^{\theta}\left(s^{t}_{e}\right)\right)\right)\right),\end{aligned}\right.$$

$$V^{\theta}\left(s_{e}^{t}\right):=\alpha\log\mathbb{E}_{a_{e}^{\prime}\sim q_{a_{e}^{\prime}}}\left[\frac{\exp\left(\frac{1}{\alpha}Q^{\theta}\left(s_{e}^{t},a_{e}^{\prime}\right)\right)}{q_{a_{e}^{\prime}}\left(a_{e}^{\prime}\right)}\right],$$

$$s_{i\mid e}^{\ell}=\gamma s_{i\mid e}^{\ell-1}\Theta+(1-\gamma)\phi_{\theta}\left(\frac{1}{|\mathcal{N}(i)|}\left\{s_{j\mid e}^{\ell-1}\right\}_{j\in\mathcal{N}(i)\cup\{i\}}\right),$$

$$Q=s_{g}^{L}W_{Q},\quad K=s_{g}^{L}W_{K},\quad V=s_{g}^{L}W_{V},\quad \mathrm{LinAttn}_{k}\left(s_{g}^{L}\right)=\frac{\sum_{j=1}^{N}\left(\psi(Q_{k})^{\top}\psi(K_{j})\right)V_{j}}{\sum_{j=1}^{N}\psi(Q_{k})^{\top}\psi(K_{j})},$$

$$u_{i}^{(j)}=\begin{cases}u_{i}^{(j)} & \text{if } j\neq\sigma(k),\ \forall k<j,\\ -\infty & \text{otherwise},\end{cases}$$

$$\mathcal{L}_{\Theta}\left(\tau\mid s_{e}\right)=\left[\log\frac{Z\left(s_{e};\theta_{Z}\right)\prod_{t=0}^{n-1}P_{F}\left(s_{g}^{t+1}\mid s_{g}^{t},s_{e};\theta_{F}\right)}{R\left(s_{g}^{n}\mid s_{e}\right)\prod_{t=0}^{n-1}P_{B}\left(s_{g}^{t}\mid s_{g}^{t+1},s_{e};\theta_{B}\right)}\right]^{2},$$

Code & Models

Repositories

qiang-ma/graph-pointer-network
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis

Full text

Proc. of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), May 29 – June 2, 2023, London, United Kingdom. A. Ricci, W. Yeoh, N. Agmon, B. An (eds.). Submission ID: 378.

Affiliations: The Chinese University of Hong Kong, Shenzhen (Shenzhen, China); The Chinese University of Hong Kong, Shenzhen, and Shenzhen Institute of AI and Robotics for Society (Shenzhen, China; corresponding author).

Diverse Policy Optimization for Structured Action Space

Wenhao Li, Baoxiang Wang, Shanchao Yang, and Hongyuan Zha

Abstract.

Enhancing the diversity of policies is beneficial for robustness, exploration, and transfer in reinforcement learning (RL). In this paper, we aim to seek diverse policies in an under-explored setting, namely RL tasks with structured action spaces with the two properties of composability and local dependencies. The complex action structure, non-uniform reward landscape, and subtle hyperparameter tuning due to the properties of structured actions prevent existing approaches from scaling well. We propose a simple and effective RL method, Diverse Policy Optimization (DPO), to model the policies in structured action space as the energy-based models (EBM) by following the probabilistic RL framework. A recently proposed novel and powerful generative model, GFlowNet, is introduced as the efficient, diverse EBM-based policy sampler. DPO follows a joint optimization framework: the outer layer uses the diverse policies sampled by the GFlowNet to update the EBM-based policies, which supports the GFlowNet training in the inner layer. Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies in challenging scenarios and substantially outperform existing state-of-the-art methods.

Key words and phrases:

Reinforcement Learning; Generative Model; Diversity; Robustness

1. Introduction

The history of human civilization can be seen as a chronicle of creative capacity, i.e., the diversity of solutions to the same puzzle (Osborn, 1953). Counter-intuitively, a popular consensus in deep learning, backed by theoretical justifications (Ma, 2021), holds that most local optima of a non-convex optimization problem are very close to the global optimum. This has led mainstream AI research to focus on finding a single local solution to a given optimization problem, rather than on which local optimum is discovered (Zhou et al., 2022). It is no coincidence that most methods in reinforcement learning (RL) are also designed to seek a single reward-maximizing policy (Sutton and Barto, 2018; Mnih et al., 2015; Schulman et al., 2017).

However, different local optima in the policy space can correspond to strategies that differ in nature, which makes the above consensus problematic in RL tasks where the environment is unstable. For example, in adaptive traffic signal control (ATSC) (Van der Pol and Oliehoek, 2016; Wei et al., 2018b, 2019) (a conceptual diagram and more examples are included in Figure 1), if two traffic flows need to travel quickly from their departure points to their target points, multiple control strategies with similar average commuting times may exist due to the combinatorial nature of traffic lights. The performance of a single policy obtained by reward maximization is bound to degrade if traffic volumes later change on other road sections connected to those the flows travel through. Moreover, if our goal is to discover a diverse set of policies, some of these may prove more valuable than others in different situations.

Therefore, celebrating the diversity of policies is beneficial for many RL applications. In addition to ATSC and the simple game in Figure 1, these RL application areas include but are not limited to conversation generation in intelligent customer service (Li et al., 2016), drug discovery in smart healthcare (Pereira et al., 2021), and simulator design in automated machine learning (AutoML) (Wang et al., 2019). Furthermore, in addition to robustness, a set of diverse policies can also be useful for exploration (Peng et al., 2020), transfer (Kumar et al., 2020), and hierarchy (Alver and Precup, 2022) in RL.

There is no doubt that RL researchers have demonstrated their creative ability in discovering diverse policies. Most of the literature falls within neuroevolution methods inspired by Quality-Diversity (QD), which typically maintain a collection of policies and adapt it using evolutionary algorithms to balance the QD trade-off (Pugh et al., 2016; Duarte et al., 2017; Parker-Holder et al., 2020; Nilsson and Cully, 2021; Gangwani et al., 2021; Lim et al., 2022). In another line of work, intrinsic rewards have been used for learning diversity in terms of the discriminability of different trajectory-specific quantities (Gregor et al., 2016; Eysenbach et al., 2019; Hartikainen et al., 2020; Goyal et al., 2020; Sharma et al., 2020a; Zahavy et al., 2020; Alver and Precup, 2022), or have been used as a regularizer when maximizing the extrinsic reward (Levine, 2018; Gangwani et al., 2019; Masood and Doshi-Velez, 2019; Sharma et al., 2020b; Zhang et al., 2019). There is also a small body of work that transforms the problem into a Constrained Markov Decision Process (CMDP) (Sun et al., 2020; Zhou et al., 2022; Derek and Isola, 2021; Zahavy et al., 2022), or implicitly induces diversity by learning policies that maximize the set robustness to the worst-possible reward (Kumar et al., 2020; Zahavy et al., 2021).

This paper considers a more complex, more realistic, and under-explored setting, namely RL tasks with structured action spaces. We define structured actions as actions with the following two properties: composability, i.e., environmental actions consist of a large number of atomic actions with complete functionality, and local dependencies, i.e., there are local physical or logical correlations between atomic actions. (In this paper, only pairwise relationships between atomic actions are considered.) For example, in ATSC, the phases of all traffic signals at all intersections in the entire road network must be redetermined at certain intervals; the atomic actions are the phases of each signal, and they interact with each other through the physical road network. In addition, for the predator-prey task in Figure 1, the atomic actions are the decisions of each predator, and there is a local spatial, logical association between them.

The high dimensionality of the RL agent’s policy due to the composability of structured actions prevents existing methods from scaling well. Specifically, composability makes the underlying reward landscape of the RL problem particularly non-uniform, so QD-like methods may require substantially larger population sizes to fully explore the policy space and to prevent the algorithm from collapsing to visually identical policies (Tang et al., 2021; Zhou et al., 2022). Also due to composability, the complex soft objective introduced by intrinsic-reward or CMDP-driven methods results in non-trivial and subtle hyperparameter tuning (Masood and Doshi-Velez, 2019; Parker-Holder et al., 2020). In addition, existing agents’ policies are mainly parameterized as categorical or Gaussian distributions. Extending them to structured actions under independence assumptions on atomic actions prevents the agent from using the structural information of environmental actions to search the policy space efficiently.

We propose a simple and effective RL method, Diverse Policy Optimization (DPO), to discover a diverse set of policies in tasks with structured action spaces. We follow the probabilistic reinforcement learning (PRL) framework (Levine, 2018) to transform reinforcement learning problems under stochastic dynamics into variational inference problems on probabilistic graphical models and model the policies of RL agents as energy-based models (EBMs). The action distribution induced by this EBM in a structured action space is highly multimodal, and sampling from such a high-dimensional distribution is intractable. To this end, we introduce a recently proposed generative model, Generative Flow Networks (GFlowNet) (Bengio et al., 2021a, b; Jain et al., 2022; Zhang et al., 2022), as an efficient sampler of diverse policies. GFlowNet can be regarded as amortized Markov chain Monte Carlo (MCMC): it gradually builds composable environmental actions from "building blocks" (i.e., atomic actions) in a single pass of a trained generative model, so that the final sampled environmental actions follow a given energy-based policy distribution.

Notably, our method does not simply plug GFlowNet into RL with structured action spaces. In the PRL framework, the energy-based policy distribution changes constantly as the soft Q function is updated. This violates the fixed-energy-model assumption of GFlowNet and confronts DPO with a more complex optimization problem. Therefore, we model DPO as a joint optimization problem: the outer layer uses the diverse policies sampled by the GFlowNet to update the soft Q function, and the inner layer trains the GFlowNet through an EBM based on the soft Q function (see Figure 3). Furthermore, a two-timescale alternating optimization method is proposed to solve it efficiently.

We empirically validate DPO on ATSC tasks (Ault and Sharon, 2021) where atomic actions have local physical dependencies, and more generally, Battle scenarios (Zheng et al., 2018) where atomic actions have logical local dependencies. Experiments demonstrate that DPO can reliably and efficiently discover surprisingly diverse strategies in all these challenging scenarios and substantially outperform existing baselines. The contributions can be summarized as follows:

(1) We propose a novel algorithm, Diverse Policy Optimization, for discovering diverse policies for structured action spaces. The GFlowNet-based sampler can efficiently sample diverse policies from the high-dimensional multimodal distribution induced by structured action spaces.

(2) We propose an efficient joint training framework that optimizes the soft-Q-function-based EBM and the reward-conditional GFlowNet-based sampler in an interleaved manner.

(3) Our algorithm is general and effective across structured action spaces with physical and logical local dependencies.

2. Preliminaries and Notations

The proposed DPO follows the PRL to model policies as a high-dimensional multimodal energy-based probability distribution and introduces GFlowNet to efficiently sample policies with diversity from this distribution. Below, we briefly review the PRL and GFlowNet.

2.1. Probabilistic Reinforcement Learning

PRL aims to learn the maximum entropy optimal policy:

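$$\pi_{\mathrm{ent}}^{*}=\arg\max_{\pi}\sum_{t}\mathbb{E}_{(s^{t}_{e},a^{t}_{e})\sim\rho_{\pi}}\left[r\left(s^{t}_{e},a^{t}_{e}\right)+\alpha\mathcal{H}\left(\pi\left(\cdot\mid s^{t}_{e}\right)\right)\right],$$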

where $s^{t}_{e}\in\mathcal{S}_{e}$ and $a^{t}_{e}\in\mathcal{A}_{e}$ denote the state and action, respectively. The subscript $e$ stands for “environment” and distinguishes RL concepts from their GFlowNet counterparts, and $\alpha$ is the coefficient that trades off entropy against reward. The function $\mathcal{H}$ denotes the entropy term. By defining the soft $Q$ function as:

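$$Q_{\mathrm{soft}}^{*}\left(s^{t}_{e},a^{t}_{e}\right):=r^{t}_{e}+\mathbb{E}_{s^{t+\ell}_{e}\sim\rho_{\pi}}\left[\sum_{\ell=1}^{\infty}\gamma^{\ell}\left(r^{t+\ell}_{e}+\alpha\mathcal{H}\left(\pi_{\mathrm{ent}}^{*}\left(\cdot\mid s^{t+\ell}_{e}\right)\right)\right)\right].$$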

The optimal maximum entropy policy can then be shown, as in Levine (2018), to be

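$$\pi_{\mathrm{ent}}^{*}=\exp\left(\frac{1}{\alpha}\left(Q_{\mathrm{soft}}^{*}\left(s^{t}_{e},a^{t}_{e}\right)-V_{\mathrm{soft}}^{*}\left(s^{t}_{e}\right)\right)\right),$$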

where the soft value function $V_{\mathrm{soft}}^{*}$ is defined by

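$$V_{\mathrm{soft}}^{*}\left(s^{t}_{e}\right)=\alpha\log\int_{\mathcal{A}_{e}}\exp\left(\frac{1}{\alpha}Q_{\mathrm{soft}}^{*}\left(s^{t}_{e},a^{\prime}_{e}\right)\right)da^{\prime}_{e}.$$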

Thus policy learning can be treated as approximating the Boltzmann-like distribution of the optimal $Q$ function. Taking the soft $Q$-learning (SQL) method (Haarnoja et al., 2017) as an example, it shows that the optimal $Q$ is the fixed point of the soft Bellman backup, i.e., it satisfies the soft Bellman equation

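$$Q_{\mathrm{soft}}^{*}\left(s^{t}_{e},a^{t}_{e}\right)=r^{t}_{e}+\gamma\,\mathbb{E}_{s^{t+1}_{e}\sim p_{s_{e}}}\left[V_{\mathrm{soft}}^{*}\left(s^{t+1}_{e}\right)\right].$$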
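For a small discrete action set, the soft value, the Boltzmann-like policy, and one soft Bellman backup can be computed directly. The following is a minimal sketch, assuming a finite action set (so the integral over actions becomes a sum); the function names `soft_value`, `boltzmann_policy`, and `soft_bellman_backup` are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def soft_value(q_values, alpha):
    """V = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    z = np.asarray(q_values, dtype=float) / alpha
    m = z.max()
    return alpha * (m + np.log(np.exp(z - m).sum()))

def boltzmann_policy(q_values, alpha):
    """pi(a | s) = exp((Q(s, a) - V(s)) / alpha); sums to one by construction."""
    q = np.asarray(q_values, dtype=float)
    return np.exp((q - soft_value(q, alpha)) / alpha)

def soft_bellman_backup(reward, gamma, next_state_probs, next_q, alpha):
    """Q(s, a) = r + gamma * E_{s'}[V(s')] for a single (s, a) pair.

    next_q[i] holds the Q values of all actions in the i-th next state.
    """
    next_v = np.array([soft_value(q, alpha) for q in next_q])
    return reward + gamma * float(np.dot(next_state_probs, next_v))

# Toy Q values for one state with three actions (illustrative numbers only).
q = np.array([1.0, 2.0, 0.5])
pi = boltzmann_policy(q, alpha=0.5)
```

A smaller `alpha` concentrates `pi` on the highest-valued action and pushes `soft_value` toward the ordinary maximum, while a larger `alpha` spreads probability mass across actions.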

Because the sets of states and actions are infinite, SQL uses a parameterized $Q$ and a function $\pi$ as an approximate sampler of the Boltzmann-like distribution of $Q$. Specifically, it updates $Q$ and $\pi$ as:

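$$\left\{\begin{aligned} &\min_{\theta}J_{Q}(\theta):=\mathbb{E}_{s^{t}_{e},a^{t}_{e},r^{t}_{e},s^{t+1}_{e}\sim D}\left[\frac{1}{2}\left(r^{t}_{e}+V^{\bar{\theta}}\left(s^{t+1}_{e}\right)-Q^{\theta}\left(s^{t}_{e},a^{t}_{e}\right)\right)^{2}\right],\\ &\min_{\phi}J_{\pi}\left(\phi;s^{t}_{e}\right):=\mathrm{KL}\left(\pi^{\phi}\left(\cdot\mid s^{t}_{e}\right)\|\exp\left(\frac{1}{\alpha}\left(Q^{\theta}\left(s^{t}_{e},\cdot\right)-V^{\theta}\left(s^{t}_{e}\right)\right)\right)\right),\end{aligned}\right.$$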

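where the function $V^{\theta}$ is defined as

$$V^{\theta}\left(s^{t}_{e}\right):=\alpha\log\mathbb{E}_{a^{\prime}_{e}\sim q_{a^{\prime}_{e}}}\left[\frac{\exp\left(\frac{1}{\alpha}Q^{\theta}\left(s^{t}_{e},a^{\prime}_{e}\right)\right)}{q_{a^{\prime}_{e}}\left(a^{\prime}_{e}\right)}\right],$$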

and $\theta,\bar{\theta},\phi$ denote the parameters of the critic, target critic, and policy, respectively; $q_{a^{\prime}_{e}}$ is an arbitrary policy distribution. The policy distribution induced by the EBM (i.e., the Boltzmann-like distribution of $Q$) under structured action spaces is highly multimodal, and sampling from such a high-dimensional distribution is intractable. In this paper, DPO introduces a powerful generative model, Generative Flow Networks (GFlowNet), as an efficient sampler of diverse policies.
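The definition of $V^{\theta}$ above is an importance-sampling estimate of the soft value under a proposal distribution. A minimal sketch of that estimator follows, assuming a small discrete action set and a uniform proposal; `soft_value_is` and the toy `q_table` are hypothetical names and values introduced only for illustration.

```python
import numpy as np

def soft_value_is(q_fn, sampled_actions, proposal_probs, alpha):
    """Estimate alpha * log E_{a ~ q}[exp(Q(a) / alpha) / q(a)] from samples.

    proposal_probs[i] is q(a_i) for the i-th sampled action.
    """
    q_vals = np.array([q_fn(a) for a in sampled_actions], dtype=float)
    log_w = q_vals / alpha - np.log(np.asarray(proposal_probs, dtype=float))
    m = log_w.max()  # subtract the max before exponentiating for stability
    return alpha * (m + np.log(np.mean(np.exp(log_w - m))))

# Toy setting: four actions, uniform proposal q(a) = 1/4.
q_table = np.array([0.0, 1.0, 2.0, 0.5])
alpha = 1.0
exact = alpha * np.log(np.sum(np.exp(q_table / alpha)))

rng = np.random.default_rng(0)
samples = rng.integers(0, 4, size=20000)
estimate = soft_value_is(lambda a: q_table[a], samples, np.full(20000, 0.25), alpha)
```

With a uniform proposal the estimate converges to the exact log-sum-exp value as the number of samples grows; a proposal closer to the Boltzmann-like distribution would reduce the variance, which is one motivation for learning a good sampler.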

2.2. Generative Flow Networks

Generative flow networks, which are trainable generative policies, model the generation or sampling process of composite objects $x\in\mathcal{X}$ by a sequence of discrete actions that incrementally modify a partially constructed object (state). Note that action and state here do not refer to the concepts in RL (Zhang et al., 2022). In this paper, we model actions in RL problems with structured action spaces as states in GFlowNet, and actions in GFlowNet correspond to the atomic actions that compose structured actions. In other words, the composite object $x$ generated by the GFlowNet is $a_{e}$, and $\mathcal{X}$ is equivalent to $\mathcal{A}_{e}$. The partially constructed objects and the corresponding action sequence space can be represented by a directed acyclic graph (DAG; see the DAG consisting of traffic lights and roads in Figure 2) $G=(\mathcal{S}_{g},\mathcal{A}_{g})$, where the subscript $g$ denotes “GFlowNet”. The vertices in $\mathcal{S}_{g}$ are states and the edges in $\mathcal{A}_{g}$ are actions that modify one state into another. The tails of the incoming edges and the heads of the outgoing edges of a state are called its parents and children, respectively. The sampling process of the composite object $a_{e}$ starts from the initial state $s^{0}_{g}$ and transitions to a terminal state $s^{n}_{g}\in\mathcal{A}_{e}$, which is a state without outgoing edges, after $n\in(0,T]$ steps, where $T$ is the maximum length. Note that the same terminal state may correspond to multiple action sequences.

A complete trajectory is a state sequence from an initial state to a terminal state, $s^{0}_{g}\rightarrow s^{1}_{g}\rightarrow\ldots\rightarrow s^{n}_{g}$, where each transition $s^{t}_{g}\rightarrow s^{t+1}_{g}$ is an action in $\mathcal{A}_{g}$. A trajectory flow is an unnormalized density, i.e., a non-negative function $F:\mathcal{T}\rightarrow\mathbb{R}_{\geq 0}$ on the set of all complete trajectories $\mathcal{T}$. The flow is called Markovian if there exist distributions $P_{F}(\cdot\mid s_{g})$ over the children of every non-terminal state $s_{g}$ and a constant $Z$, such that for any complete trajectory $\tau$ we have $P_{F}(\tau)=F(\tau)/Z$ with $P_{F}(\tau)=P_{F}\left(s^{1}_{g}\mid s^{0}_{g}\right)P_{F}\left(s^{2}_{g}\mid s^{1}_{g}\right)\ldots P_{F}\left(s^{n}_{g}\mid s^{n-1}_{g}\right)$. $P_{F}\left(s^{t+1}_{g}\mid s^{t}_{g}\right)$ is called a forward policy, which is used to sample the composite object $a_{e}$ from the density $F$. $P_{T}(a_{e})$ then denotes the probability that a complete trajectory sampled from $P_{F}$ terminates in $a_{e}$.
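To make the terminating probability $P_{T}$ concrete, the sketch below enumerates every complete trajectory of a tiny DAG and sums trajectory probabilities per terminal state. It assumes a toy setting that is not from the paper: two binary atomic actions that may be assigned in any order, and a uniform forward policy. The names `children`, `forward_policy`, and `terminating_probs` are hypothetical.

```python
def children(state):
    """All states reachable by assigning one still-unassigned atomic action."""
    out = []
    for i, v in enumerate(state):
        if v is None:
            for value in (0, 1):
                out.append(state[:i] + (value,) + state[i + 1:])
    return out

def forward_policy(state):
    """Hypothetical P_F(. | state): uniform over the children of `state`."""
    kids = children(state)
    return {k: 1.0 / len(kids) for k in kids}

def terminating_probs(state, prob=1.0, acc=None):
    """Accumulate P_T by summing the probability of every complete trajectory."""
    acc = {} if acc is None else acc
    kids = forward_policy(state)
    if not kids:  # no outgoing edges: this is a terminal state
        acc[state] = acc.get(state, 0.0) + prob
        return acc
    for nxt, p in kids.items():
        terminating_probs(nxt, prob * p, acc)
    return acc

# Initial state: neither atomic action has been chosen yet.
p_t = terminating_probs((None, None))
```

Each terminal state here is reached by two different trajectories (one per assignment order), which illustrates why $P_{T}$ is a sum over trajectories rather than a single product of forward probabilities.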

The problem we are interested in is fitting a Markovian flow to a fixed energy function on $\mathcal{A}_{e}$. Given an energy function $\mathcal{E}(a_{e}):=-\log R_{g}(a_{e})$ and the associated non-negative reward function (again, not a reward in RL) $R_{g}:\mathcal{A}_{e}\rightarrow\mathbb{R}_{\geq 0}$, one seeks a Markovian flow $F$ such that the likelihood of a complete trajectory sampled from $F$ terminating in a given $a_{e}$ is proportional to $R_{g}(a_{e})$, i.e., $P_{T}(a_{e})\propto R_{g}(a_{e})$. This $F$ can be obtained by imposing the reward-matching constraint: $R_{g}(a_{e})=\sum_{\tau=\left(s^{0}_{g}\rightarrow\ldots\rightarrow s^{n}_{g}\right),s^{n}_{g}=a_{e}}F(\tau)$. The details of how to parameterize a GFlowNet and train a Markovian flow $F$ that satisfies the reward-matching constraint will be explained soon.
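The loss $\mathcal{L}_{\Theta}(\tau\mid s_{e})$ listed among this paper's equations has the form of the trajectory balance objective from the GFlowNet literature, which is one way to enforce reward matching. The sketch below evaluates that squared log-ratio for a single trajectory in log space. It is an illustration under assumed inputs, not the authors' training code: `trajectory_balance_loss` is a hypothetical name, and the numbers describe a toy DAG with two binary atomic actions, uniform forward and backward policies, and a constant reward.

```python
import math

def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, log_reward):
    """[log(Z * prod P_F) - log(R * prod P_B)]^2 for one complete trajectory."""
    residual = log_z + sum(log_pf_steps) - log_reward - sum(log_pb_steps)
    return residual ** 2

# Toy trajectory: 4 children at the root, then 2 children; the terminal state
# has 2 parents and the intermediate state has 1 parent (the root).
loss_balanced = trajectory_balance_loss(
    log_z=math.log(4.0),                          # Z = sum of rewards over 4 terminals
    log_pf_steps=[math.log(0.25), math.log(0.5)],
    log_pb_steps=[math.log(1.0), math.log(0.5)],
    log_reward=math.log(1.0),
)

# Same trajectory with a mis-estimated partition function.
loss_off = trajectory_balance_loss(
    log_z=math.log(2.0),
    log_pf_steps=[math.log(0.25), math.log(0.5)],
    log_pb_steps=[math.log(1.0), math.log(0.5)],
    log_reward=math.log(1.0),
)
```

The loss vanishes when the forward flow along the trajectory equals the backward flow implied by the reward, and grows as the estimate of $Z$ or the policies drift from that balance.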

3. Diverse Policy Optimization

This section proposes a simple and effective RL method, Diverse Policy Optimization (DPO), to discover diverse policies in structured action spaces. We follow the probabilistic reinforcement learning (PRL) framework (Levine, 2018) to transform RL problems under stochastic dynamics into variational inference problems on probabilistic graphical models and model the policies of RL agents as EBMs. The PRL framework corresponds to a maximum entropy variant of reinforcement learning or optimal control, where the optimal policy aims to maximize the expected reward while maintaining high entropy. Because of the maximum entropy objective, some existing works (Haarnoja et al., 2017, 2018) have proposed algorithms based on this framework that discover diverse policies in low-dimensional continuous action spaces.

Our method is an instance of the maximum entropy actor-critic algorithm in the PRL framework, which adopts a message-passing approach and can produce lower-variance estimates. In addition, to keep the policy scalable in the structured action space, we do not use an explicit policy parameterization but fit only the message, i.e., the $Q$ -value function, similar to soft $Q$ -learning (Haarnoja et al., 2017). Specifically, we use general energy-based policies $\pi\left(a_{e}\mid s_{e}\right)\propto\exp\left(-\mathcal{E}\left(s_{e},a_{e}\right)\right),$ where $\mathcal{E}$ is an energy function. Setting $\mathcal{E}\left(s_{e},a_{e}\right)=-\frac{1}{\alpha}Q_{\text{soft }}\left(s_{e},a_{e}\right)$ , the optimal maximum entropy policy is an EBM that satisfies Equation (2).
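To make the energy-based policy concrete, the sketch below normalizes $\exp(Q_{\text{soft}}/\alpha)$ over a small, fully enumerable action set. The $Q$ values and the function name `energy_based_policy` are illustrative assumptions. In a structured action space this enumeration is exactly what becomes intractable, which motivates the sampler introduced next.

```python
import math


def energy_based_policy(q_values, alpha):
    """Return pi(a|s) proportional to exp(Q_soft(s, a) / alpha) for enumerable actions."""
    m = max(q_values.values())  # subtract the max for numerical stability
    weights = {a: math.exp((q - m) / alpha) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical soft Q-values for N = 2 atomic actions with K = 2 values each.
q_values = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): -1.0}

pi_sharp = energy_based_policy(q_values, alpha=0.1)   # low temperature
pi_flat = energy_based_policy(q_values, alpha=10.0)   # high temperature
print(pi_sharp)
print(pi_flat)
```

A small $\alpha$ concentrates the mass on the two tied modes, while a large $\alpha$ spreads it across all actions; both tied actions always receive equal probability.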

However, the action distribution induced by this EBM in a structured action space is highly multimodal, and sampling from such a high-dimensional distribution is intractable. Fortunately, the composability and local dependencies of the structured action space make generative flow networks naturally suitable for efficiently sampling diverse and high-quality policies from it. To introduce GFlowNet as an efficient and diverse sampler, we only need to set the energy function fitted by the Markovian flow $F(a_{e})$ (where the action $a_{e}$ corresponds to the composite object $x$ ) to $(-{1}/{\alpha})\cdot Q_{\text{soft }}\left(s_{e},a_{e}\right)$ , and its associated reward function $R_{g}(a_{e})$ to $\exp\left(({1}/{\alpha})\cdot Q_{\text{soft }}\left(s_{e},a_{e}\right)\right)$ .

Nevertheless, the unreasonable part of the above modeling is that there is no place left for the environment state $s_{e}$ in the input of the Markovian flow and the reward function. The reason is that $\pi$ in the PRL framework is a conditional distribution, but GFlowNet is an unconditional sampler. To this end, we will introduce a variant of GFlowNet, namely reward-conditional GFlowNet, to model the policy of RL agents, and details will be explained shortly.

In the PRL framework, the energy-based policy distribution changes constantly as $Q_{\mathrm{soft}}$ is updated. DPO therefore adopts a joint training framework where the EBM and the GFlowNet are optimized alternately, similar to Zhang et al. (2022): the energy function serves as the negative log-reward function for the GFlowNet, which is trained with the trajectory balance objective (Malkin et al., 2022) to sample from the evolving energy-based policies. In turn, the energy function is trained with soft Bellman backup, where the GFlowNet provides diverse samples. The schematic diagram of RL based on reward-conditional GFlowNet as the agent’s policy and the joint training framework are shown in Figure 3 and Algorithm 1.

In the following, we will explain the generation process of structured action, the parameterization and training of reward-conditional GFlowNet, and its interleaved update with EBM, respectively.

3.1. Structured Action Generative Process

The previous section introduced the framework of diverse policy optimization; this section describes the process of generating structured actions based on the reward-conditional GFlowNet. The local dependencies of structured actions indicate that there may be two kinds of correlation between atomic actions: locally physical and locally logical correlations. The former is naturally represented as a graph, while the latter is naturally represented as a set. For the unity of the framework, this paper only considers the physical correlation between atomic actions and transforms the logical correlation into a physical one without loss of generality.

Specifically, we assume that atomic actions with local logical correlations have a fixed influence range with a radius $d$ in Euclidean space. An atomic action can then establish a physical correlation with others within its influence range. Of course, other types of topologies, such as fully connected, star, or hierarchical, can also be used in addition to adjacency topologies. This paper adopts the adjacency topology as a trade-off between efficiency and performance. The experimental results also show that the algorithm performance is not sensitive to the influence radius $d$ .
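The conversion from a logical correlation to a physical one can be sketched as building an adjacency matrix from positions and an influence radius. The function name, the 2-D positions, and the inclusive distance test below are assumptions made for illustration.

```python
import math


def adjacency_from_radius(positions, d):
    """Connect two atomic actions when their Euclidean distance is at most d."""
    n = len(positions)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(positions[i], positions[j]) <= d:
                adj[i][j] = adj[j][i] = 1  # undirected adjacency topology
    return adj


# Hypothetical agent positions on a plane, with influence radius d = 4.
positions = [(0.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
adj = adjacency_from_radius(positions, d=4.0)
print(adj)
```

Here only the first two agents fall within each other's influence range, so they are the only pair that becomes adjacent.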

In the structured action space, an action consists of $N$ atomic actions, each taking one of $K$ discrete values, i.e., $a_{e}\in\mathcal{A}_{e}\triangleq[K]^{N}$ , where $[K]\triangleq\{0,\ldots,K-1\}$ . For example, $a_{e}$ could be a phase configuration of $N$ traffic lights, each with $K$ phases, or the joint action of $N$ predators, each of which can move in $K$ directions. We model the generation or sampling of vectors in $\mathcal{A}_{e}$ by a reward-conditional GFlowNet. The state space of GFlowNet is denoted as $\mathcal{S}_{g}$ , and we have $\mathcal{S}_{g}\triangleq\left\{\left(s_{g}^{1},\ldots,s_{g}^{N}\right)\mid s_{g}^{n}\in[K]\cup\{\oslash\},n=1,\ldots,N\right\},$ where the void symbol $\oslash$ represents a yet unspecified atomic action. The DAG structure on $\mathcal{S}_{g}$ is the $N$ -th Cartesian power of the DAG with states $[K]\cup\{\oslash\}$ , where the elements of $[K]$ are children of $\oslash$ . Concretely, the children of a state $s_{g}=\left(s_{g}^{1},\ldots,s_{g}^{N}\right)$ are vectors that can be obtained from $s_{g}$ by changing any one atomic action $s_{g}^{n}$ from $\oslash$ to a value in $[K]$ , and its parents are states that can be obtained by changing a single atomic action $s_{g}^{n}\in[K]$ to $\oslash$ .

Moreover, $\mathcal{A}_{e}$ is naturally identified with $\{s_{g}\in\mathcal{S}_{g}:|s_{g}|=N\}$ where $|s_{g}|\triangleq\#\left\{s_{g}^{n}\mid s_{g}^{n}\in[K],n=1,\ldots,N\right\}$ . Similarly, the initial state is denoted as $s^{0}_{g}\triangleq$ $(\oslash,\oslash,\ldots,\oslash)$ , which means that the reward-conditional GFlowNet-based RL policy needs to take $N$ steps to sample a structured action, i.e., constructing a trajectory from $s^{0}_{g}$ to $a_{e}\in\mathcal{A}_{e}$ . The forward policy $P_{F|e}(\cdot|s_{g},s_{e})$ of a reward-conditional GFlowNet (described in $\S$ 3.2), which extends the forward policy of $\S$ 2.2, is a distribution over the ways to select a position with a void atomic action in $s_{g}$ and a value $k\in[K]$ to assign to this atomic action, conditioned on the environmental state $s_{e}\in\mathcal{S}_{e}$ . Thus the action space for a state $s_{g}$ has size $K(N-|s_{g}|)$ . Since $K\ll N$ , the action space of the forward policy (and of the backward policy below) grows linearly with the number of atomic actions, so DPO scales well. Correspondingly, the backward policy $P_{B|e}(\cdot|s_{g},s_{e})$ is a distribution over the $|s_{g}|$ ways to select a position with a non-void atomic action in $s_{g}$ .
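The state space and its DAG structure can be sketched in a few lines. The names `VOID`, `children`, `parents`, and `sample_action` are illustrative, and the uniform choice among children is only a placeholder for the learned forward policy $P_{F|e}$ .

```python
import random

VOID = None  # stands for the void symbol: a not-yet-specified atomic action


def size(s_g):
    """|s_g|: the number of atomic actions already specified."""
    return sum(v is not VOID for v in s_g)


def children(s_g, K):
    """States reachable by assigning a value in [K] to one void position."""
    out = []
    for n, v in enumerate(s_g):
        if v is VOID:
            for k in range(K):
                out.append(s_g[:n] + (k,) + s_g[n + 1:])
    return out


def parents(s_g):
    """States obtained by resetting one specified position to void."""
    return [s_g[:n] + (VOID,) + s_g[n + 1:]
            for n, v in enumerate(s_g) if v is not VOID]


def sample_action(N, K, rng):
    """Build a structured action in N steps, starting from the all-void state."""
    s_g = (VOID,) * N
    while size(s_g) < N:
        s_g = rng.choice(children(s_g, K))  # placeholder for P_F|e
    return s_g


N, K = 4, 3
rng = random.Random(0)
a_e = sample_action(N, K, rng)
print(a_e, len(children((VOID,) * N, K)))
```

The number of children of a state is $K(N-|s_{g}|)$ and the number of parents is $|s_{g}|$ , matching the sizes of the forward and backward action spaces stated above.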

More efficient generation. As mentioned earlier, as an amortized version of MCMC, GFlowNets can alleviate the mode-mixing problem (Jasra et al., 2005; Pompe et al., 2020) of MCMC methods, thereby improving the efficiency of sampling diverse samples. However, if two modes are close enough, MCMC will have higher sampling efficiency because it only perturbs the previous sample slightly. In this case, GFlowNets need to rebuild the entire structured action sequentially, even though only a minimal number of atomic actions have changed. To this end, we introduce a small trick: adding a termination action to the action space. GFlowNets are then trained to sample from two close modes by deciding to terminate at different modes in different runs. Since the physical meaning of the termination action is quite different from that of the other actions, we use a separate output head to predict it, as shown in Figure 4. Once the forward policy $P_{F|e}(\cdot|s_{g},s_{e})$ decides to take the termination action, the output of the other head is ignored. Experiments show that this small trick can significantly improve the learning efficiency of the algorithm in some tasks.

3.2. GFlowNet Parameterization

After showing how to sample structured actions using the GFlowNet, this section elaborates on how to parameterize it and train a Markovian flow $F$ that satisfies the reward matching constraint. As stated earlier, if we take the form of the GFlowNet in $\S$ 2.2, there will be no place for the environment state $s_{e}$ in the forward policy $P_{F}$ as well as in the backward policy $P_{B}$ . Thus, we use an extended version of flow networks by conditioning each component on some information, which is external to the flow network but influences the terminating flows. In our setting, the external information is RL’s environmental state $s_{e}$ . Since the external information $s_{e}$ affects the reward function $R_{g}$ in $\S$ 2.2, this conditional GFlowNet is also called reward-conditional GFlowNet (Bengio et al., 2021b, Definition 29).

Since reward-conditional GFlowNets are defined using the same components as the unconditional one, they inherit all the properties of the GFlowNet for all DAGs $G_{e}=(\mathcal{S}_{g},\mathcal{A}_{g},\mathcal{S}_{e})$ and flow functions $F_{e}:\mathcal{T}\times\mathcal{S}_{e}\rightarrow\mathbb{R}_{\geq 0}$ , where $e$ represents the “environment” in RL again. In particular, we can directly extend notions of $\S$ 2.2 to reward-conditional GFlowNets with forward policy $P_{F|e}(\cdot|s_{g},s_{e})$ , backward policy $P_{B|e}(\cdot|s_{g},s_{e})$ , energy function $\mathcal{E}(a_{e}|s_{e}):=-\log R_{g|e}(a_{e}|s_{e})$ and the associated non-negative reward function $R_{g|e}:\mathcal{A}_{e}\times\mathcal{S}_{e}\rightarrow\mathbb{R}_{\geq 0}$ . The only difference is that now every term explicitly depends on the conditioning variable, i.e., the environmental state $s_{e}\in\mathcal{S}_{e}$ in the RL context.

In our experiments, we parameterize the forward and backward policy with deep neural networks $P_{F|e}(\theta_{F})$ and $P_{B|e}(\theta_{B})$ respectively, and for convenience, we omit the input without introducing ambiguity. As $P_{F}$ incrementally builds structured actions, its action space gradually decreases, similar to the traveling salesman problem (TSP) (Papadimitriou, 1977). Considering the effectiveness of the pointer network (Vinyals et al., 2015) in dealing with such problems, we introduce the modified graph pointer network (GPN, (Ma et al., 2019)) as the forward and backward policy (see Figure 2) to further model the structured information of the action space. The forward process of the modified GPN can be divided into the following three stages:

**Environmental state encoding:** In this stage, the $i$ -th row of the adjacency matrix $\ell_{i}$ and the local observed information $o_{i}$ of each atomic action are concatenated as $s_{i|e}=\left[\ell_{i}\|o_{i}\right]$ , and then $s_{i|e}$ is embedded into a higher dimensional vector $\tilde{s}_{i|e}\in\mathbb{R}^{d}$ by a shared feed-forward network, where $d$ is the hidden dimension. The context information is then obtained by encoding all atomic actions’ embeddings $s_{e}$ via a graph neural network (GNN, (Kipf and Welling, 2016; Xu et al., 2019)), where $s_{e}=[\tilde{s}_{1|e}^{\top},\ldots,\tilde{s}_{N|e}^{\top}]^{\top}$ . Each layer of the GNN is expressed as:

[TABLE]

where $s_{i|e}^{\ell}\in\mathbb{R}^{d_{\ell}}$ is the $\ell$ -th layer variable with $\ell\in\{1,\ldots,L\}$ , $s_{i|e}^{0}=s_{i|e}$ , $\gamma$ is a trainable parameter, $\Theta\in\mathbb{R}^{d_{\ell-1}\times d_{\ell}}$ is a trainable weight matrix, $\mathcal{N}(i)$ is the adjacency set of atomic action $i$ , and $\phi_{\theta}:\mathbb{R}^{d_{\ell-1}}\rightarrow\mathbb{R}^{d_{\ell}}$ is the aggregation function (Kipf and Welling, 2016), which is represented by a neural network in our experiments.
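The layer equation itself did not survive in this copy of the text. As a hedged point of reference only, the graph pointer network of Ma et al. (2019), which this section adapts, uses a layer of the following form, written here in the notation defined above. Whether DPO uses exactly this form is an assumption, not something this section confirms.

```latex
s_{i|e}^{\ell} \;=\; \gamma\, s_{i|e}^{\ell-1}\,\Theta
\;+\; (1-\gamma)\,\phi_{\theta}\!\left(
  \frac{1}{|\mathcal{N}(i)|}
  \left\{ s_{j|e}^{\ell-1} \right\}_{j\in\mathcal{N}(i)\cup\{i\}}
\right)
```

Under this reading, $\gamma$ interpolates between a linear transformation of the node's own embedding and an aggregation over its neighborhood, which is consistent with the roles of $\gamma$ , $\Theta$ , $\mathcal{N}(i)$ , and $\phi_{\theta}$ described in the text.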

**GFlowNet state encoding:** In this stage, we use the vectors pointing from the newly added atomic action to all others as the embedding of $s_{g}$ , which is similar to Ma et al. (2019). Specifically, for the newly added atomic action $\tilde{s}_{i|e}$ , suppose $s_{\mathrm{E|i}}=\left[\tilde{s}_{i|e}^{\top},\ldots,\tilde{s}_{i|e}^{\top}\right]^{\top}\in\mathbb{R}^{N\times d}$ is a matrix with identical rows $\tilde{s}_{i|e}$ . We define $s_{g}=s_{e}^{L}-s_{\mathrm{E|i}}=\left[s_{1|g}^{\top},\ldots,s_{N|g}^{\top}\right]^{\top}\in\mathbb{R}^{N\times d}$ . Then $s_{g}$ is passed into the GNN again and the embedding of each atomic action after GFlowNet state encoding is denoted as $s_{i|g}^{L}$ .

**Atomic action selection:** The atomic action selector is based on the Linear Transformer (Katharopoulos et al., 2020), which has the advantage of not suffering from the quadratic scaling in the input size. This architecture relies on a linearized attention mechanism, defined as

[TABLE]

where $\psi(\cdot)$ is a non-linear feature map, and $Q,K$ , and $V$ are linear transformations of $s_{g}^{L}$ corresponding to the queries, keys, and values respectively, as is standard with Transformers. The pointer vector outputted by the Linear Transformer is first masked by the mask $\mathbf{m}$ associated with the physical dependencies in structured action space and is then passed to a softmax layer to generate a distribution over the next candidate intersections. Similar to pointer networks (Vinyals et al., 2015), the masked pointer vector $\mathbf{u}_{i}$ is defined as:

[TABLE]

where $\sigma(k)$ denotes the $k$ -th processed atomic action and $\mathbf{u}_{i}^{(j)}$ is the $j$ -th entry of the vector $\mathbf{u}_{i}$ .
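The two ingredients of the selector, linearized attention and a masked softmax over a pointer vector, can be sketched as follows. The attention follows the general form of Katharopoulos et al. (2020) with the feature map $\psi(x)=\mathrm{elu}(x)+1$ used in that work. The masking rule (masked entries receive zero probability, as if set to $-\infty$ ) is an assumption modeled on pointer networks, and all function names are illustrative.

```python
import math


def feature_map(x):
    """psi(x) = elu(x) + 1, applied element-wise; always positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """out_i = psi(q_i)^T (sum_j psi(k_j) v_j^T) / (psi(q_i)^T sum_j psi(k_j))."""
    phi_K = [feature_map(k) for k in K]
    d, d_v = len(phi_K[0]), len(V[0])
    # The two sums over j are computed once, so the cost is linear in len(K).
    S = [[sum(pk[a] * v[b] for pk, v in zip(phi_K, V)) for b in range(d_v)]
         for a in range(d)]
    z = [sum(pk[a] for pk in phi_K) for a in range(d)]
    out = []
    for q in Q:
        pq = feature_map(q)
        denom = sum(pq[a] * z[a] for a in range(d))
        out.append([sum(pq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out


def masked_softmax(scores, mask):
    """Softmax over entries with mask True; at least one entry must be unmasked."""
    mx = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - mx) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


Q = [[0.5, -1.0], [1.0, 2.0]]
K = [[1.0, 0.0], [-0.5, 0.3], [0.2, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = linear_attention(Q, K, V)
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
print(out, probs)
```

Because the key and value sums are shared across all queries, the result equals the explicitly normalized kernel attention while avoiding the pairwise query-key matrix.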

3.3. Reward-Conditional GFlowNet Training

After parameterizing the GFlowNet, we now describe how reward-conditional GFlowNets can be trained to match a given conditional reward. Recall from $\S$ 2.2, $\S$ 3 and $\S$ 3.1 that, given a non-negative conditional reward function $R_{g|e}:\mathcal{A}_{e}\times\mathcal{S}_{e}\rightarrow\mathbb{R}_{\geq 0}$ , a reward-conditional GFlowNet can be trained so that its terminating probability distribution matches the associated energy-based model. To be precise, the marginal likelihood that a trajectory sampled from the forward policy $P_{F|e}(\cdot|s_{g},s_{e})$ terminates at a given structured action is proportional to the exponentiated, temperature-scaled soft $Q$ value of that action, i.e., $P_{T}(a_{e}|s_{e})\propto\exp\left(({1}/{\alpha})\cdot Q_{\text{soft }}\left(s_{e},a_{e}\right)\right)$ , where $a_{e}\in\mathcal{A}_{e}$ and $s_{e}\in\mathcal{S}_{e}$ .

To train the parameters $\theta_{F}$ and $\theta_{B}$ of the reward-conditional GFlowNet, we use the trajectory balance objective (Malkin et al., 2022), which is optimized along complete trajectories $\tau=(s_{g}^{0}\rightarrow s_{g}^{1}\rightarrow\ldots\rightarrow s_{g}^{n})$ :

[TABLE]

where $\Theta\triangleq\{\theta_{F},\theta_{B},\theta_{Z}\}$ . The scalar function $Z(\cdot)$ is parametrized in the log domain, as suggested by Malkin et al. (2022). With the trajectory balance objective, we train the reward-conditional GFlowNet with the stochastic gradient $\mathbb{E}_{\tau\sim\pi_{\Theta}(\tau|s_{e})}\left[\nabla_{\Theta}\mathcal{L}_{\Theta}(\tau|s_{e})\right]$ for some training trajectory distribution $\pi_{\Theta}(\tau|s_{e})$ . Akin to RL settings, we take $\pi_{\Theta}$ to be the distribution over trajectories sampled from a tempered version of the current forward policy $P_{F|e}(\cdot|s_{g},s_{e})$ . That is, $\tau$ is sampled with $s_{g}^{t+1}\sim P_{F|e}(\cdot|s_{g}^{t},s_{e})$ starting from $s_{g}^{0}$ , mixed with a uniform action policy to ensure $\pi_{\Theta}$ has full support.
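For reference, the trajectory balance objective of Malkin et al. (2022), conditioned on $s_{e}$ and with $\log R_{g|e}(a_{e}|s_{e})=Q_{\text{soft}}(s_{e},a_{e})/\alpha$ , penalizes the squared residual $\log Z(s_{e})+\sum_{t}\log P_{F|e}-\log R_{g|e}-\sum_{t}\log P_{B|e}$ along a trajectory. The sketch below evaluates this residual from precomputed log-probabilities and also shows one way to mix a policy with a uniform distribution. The function names, the mixing weight `eps`, and the toy numbers are assumptions for illustration, not the paper's implementation.

```python
import math


def trajectory_balance_loss(log_Z, log_pf_steps, log_pb_steps, q_soft, alpha):
    """Squared trajectory-balance residual for one complete trajectory."""
    log_reward = q_soft / alpha  # log R = Q_soft / alpha
    residual = log_Z + sum(log_pf_steps) - log_reward - sum(log_pb_steps)
    return residual ** 2


def mix_with_uniform(probs, eps):
    """Exploration policy: (1 - eps) * probs + eps * uniform, giving full support."""
    n = len(probs)
    return [(1.0 - eps) * p + eps / n for p in probs]


# Toy case with N = 1, K = 2, alpha = 1: R(0) = 1, R(1) = 3, hence Z = 4.
alpha = 1.0
q_soft = {0: 0.0, 1: math.log(3.0)}
log_Z = math.log(4.0)
forward = {0: 0.25, 1: 0.75}  # the forward policy that matches R / Z
loss0 = trajectory_balance_loss(log_Z, [math.log(forward[0])], [0.0], q_soft[0], alpha)
loss1 = trajectory_balance_loss(log_Z, [math.log(forward[1])], [0.0], q_soft[1], alpha)
loss_bad = trajectory_balance_loss(0.0, [math.log(forward[1])], [0.0], q_soft[1], alpha)
print(loss0, loss1, loss_bad)
```

The residual vanishes on every trajectory exactly when the forward policy, backward policy, and $Z$ are mutually consistent with the reward, and a misestimated $Z$ alone is enough to make the loss positive.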

Learning about total flow $Z$ . Experiments show that learning the scalar function $Z(\cdot)$ end-to-end is very difficult. Since $Z$ represents the total flow in the entire flow network, many samples are required for an accurate estimation. Unlike the original work on trajectory balance (Malkin et al., 2022), in our setting the scalar function $Z$ needs to be conditioned on the external environmental state $s_{e}$ and thus has higher sample complexity. Interestingly, since the target EBM of GFlowNets is derived from the PRL framework in our method, $Z$ has an additional physical meaning, i.e., the soft value function $V_{\mathrm{soft}}^{*}(\cdot)$ in $\S$ 2.1. Since the soft value function depends on the soft $Q$ value, $Z$ can be updated by a mechanism similar to the bootstrap learning adopted in RL, thereby improving the sample efficiency. To this end, in addition to end-to-end training of $Z$ using Equation (10), we estimate $V_{\mathrm{soft}}^{*}(\cdot)$ in the same way as in Haarnoja et al. (2017) and fit $Z$ to it. The experimental results show that this form of mixed gradient update can improve the learning efficiency of $Z$ .
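The link between the total flow and the soft value function can be checked numerically on an enumerable action set. The sketch assumes the discrete form of the soft value function from Haarnoja et al. (2017), $V_{\mathrm{soft}}(s_{e})=\alpha\log\sum_{a_{e}}\exp(Q_{\mathrm{soft}}(s_{e},a_{e})/\alpha)$ , so that $\log Z(s_{e})=V_{\mathrm{soft}}(s_{e})/\alpha$ . The $Q$ values are hypothetical.

```python
import math


def soft_value(q_values, alpha):
    """V_soft = alpha * log sum_a exp(Q(a) / alpha), computed stably."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


alpha = 0.5
q_values = [1.0, 0.2, -0.7, 1.3]  # hypothetical soft Q-values for one state

# Total flow: the sum of rewards R(a) = exp(Q(a) / alpha) over all actions.
Z = sum(math.exp(q / alpha) for q in q_values)
log_Z_from_value = soft_value(q_values, alpha) / alpha
print(math.log(Z), log_Z_from_value)
```

Both quantities agree, which is why a bootstrapped estimate of the soft value function can serve as a regression target for $\log Z$ in addition to the end-to-end gradient.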

3.4. Joint Training with EBM

The training of reward-conditional GFlowNets relies on a given function $R_{g|e}(a_{e}|s_{e})$ to provide reward signals. However, in the PRL framework, the energy-based policy distribution changes constantly as the soft $Q$ function is updated. Therefore, we propose a joint training framework (Algorithm 1), where the EBM and the reward-conditional GFlowNet are optimized alternately:

(1) **GFlowNet updating step:** the soft $Q$ function serves as the reward function for the GFlowNet, which is trained with the trajectory balance objective to sample from the evolving EBM;

(2) **EBM updating step:** the EBM is trained with soft $Q$ iteration (Haarnoja et al., 2017, $\S$ 3.1) where the GFlowNet provides diverse samples.

Moreover, again inspired by soft $Q$ -learning (Haarnoja et al., 2017), we find it advantageous to evaluate the forward policy, backward policy and total flow function in (10) with a separate target network, where the parameters $\bar{\theta}_{F}$ , $\bar{\theta}_{B}$ and $\bar{\theta}_{Z}$ are updated softly (Lillicrap et al., 2015).
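A soft target update in the style of Lillicrap et al. (2015) moves each target parameter a small step toward its online counterpart. The sketch below uses flat lists of floats and a hypothetical rate `tau`; the actual parameter containers and rate used by DPO are not specified in this section.

```python
def soft_update(target_params, online_params, tau):
    """Return the new target parameters: (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


# Hypothetical parameter vectors for a target network and its online network.
target = [0.0, 1.0, -2.0]
online = [1.0, 1.0, 2.0]
for _ in range(3):
    target = soft_update(target, online, tau=0.1)
print(target)
```

With a small `tau`, the target networks change slowly, which keeps the quantities evaluated in the objective from shifting abruptly between updates.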

4. Experiments

In this section, we will empirically validate DPO on two RL problems with structured action space, which include ATSC tasks (Ault and Sharon, 2021) where atomic actions have physical local dependencies; and more generally, Battle scenarios (Zheng et al., 2018) where atomic actions have logical local dependencies (see Appendix for more environment details). It is worth noting that we did not use the population diversity (PD) proposed by Parker-Holder et al. (2020) or the modified PD proposed by Zhou et al. (2022) as one of the evaluation metrics. In our experiments, we find that due to the high dimensionality and local dependencies of structured actions, PD, a locality indicator, cannot well reflect the diversity of policies. Therefore, we evaluate different global metrics for different tasks to verify the diversity.

4.1. Adaptive Traffic Signal Control

We choose the following algorithms as baselines, mainly including the state-of-the-art methods for the ATSC task and for encouraging policy diversity: Max-Pressure control (MP), where the phase combination with the maximal joint pressure is enabled as described in (Chen et al., 2020); MPLight, whose implementation is based on the FRAP open source implementation (Zheng et al., 2019) along with the ChainerRL (Fujita et al., 2021) DQN implementation and pressure sensing; DvD (Parker-Holder et al., 2020), a population-based RL method for effective diversity; SQL (Haarnoja et al., 2017), the skeleton of the proposed DPO, which can obtain diverse policies in low-dimensional continuous action spaces; and the recently proposed RSPO (Zhou et al., 2022), which transforms the problem of seeking diverse policies into a constrained Markov decision process.

From the experimental results in Table 1 and Figure 5, it can be seen that DPO achieves state-of-the-art (SOTA) performance and convergence speed on two coordinated control tasks in the TAPAS Cologne and InTAS scenarios. It is worth noting that classical MP methods based on heuristic rules and expert knowledge also show good results. DPO can outperform the MP method through a reinforcement learning mechanism, showing its superiority in solving the ATSC problem. Among the three algorithms that encourage policy diversity, DvD performs the worst, which we believe is due to the limitations of how it computes the distance between two policies on complex problems. The other two algorithms, SQL and RSPO, show near-SOTA performance on small-scale problems, i.e., the TAPAS Cologne scenario where a structured action consists of $8$ atomic actions. However, in the larger-scale InTAS scenario, their performance drops sharply, which shows that existing algorithms that encourage policy diversity have certain limitations when dealing with structured action spaces.

Figure 6 shows the comparison of the policy diversity between RSPO and DPO (see the appendix for more results). We do not focus on diversity at the atomic action level, that is, the diversity of each traffic light’s phase selection strategy, but on the diversity of the entire road network’s traffic control strategy. To this end, we calculate the average commute time of the main road under multiple random seeds for different algorithms in different scenarios. Furthermore, for visualization convenience, we normalize the results of each algorithm separately. Red indicates longer commute time and blue indicates shorter commute time. As seen from the figure, DPO learns policies with sufficient diversity in structured action spaces of different scales, but RSPO only shows some effect in small-scale tasks.

4.2. Battle Scenario

In the Battle scenario, the atomic action is each agent’s action, and we transform the logical correlation between agents into a physical correlation without loss of generality. Specifically, we assume that atomic actions with local logical correlations have a fixed influence range with a radius $d=4$ in Euclidean space. In this benchmark, we additionally select IDQN, the built-in algorithm in MAgent, and MFQ (Yang et al., 2018), the state-of-the-art algorithm on Battle, as baselines.

We first train IDQN via self-play; the blue agent then loads the resulting checkpoint and keeps its model parameters fixed. The red agent is then trained with different algorithms, and the final result is shown in Figure 7. Because DvD, SQL, and RSPO scale poorly, in the Battle scenario we combine them with independent learning to obtain the I-DvD, I-SQL, and I-RSPO variants. Independent learning by itself does not limit performance, as the IDQN algorithm also shows promising results. As seen from the figure, the three algorithms that encourage policy diversity do not show good results in large-scale structured action spaces, while DPO can still stably approach SOTA performance.

Figure 8 shows the diversity of policies between I-RSPO and DPO in the early and middle stages of the game (see appendix for more results). As seen from the figure, the policies learned by DPO show a variety of deployment strategies in the early stage; in the middle stage, the enemy can be surrounded by different formations to maximize the attack power. Although I-RSPO based on independent learning shows a specific diversity at the individual level, it is not easy to generate different policies as a whole.

Diverse policies are harder for opponents to exploit in competitive scenarios and can better adapt to changes in opponents’ policies. To verify this, we let the red agents trained with different algorithms compete against each other and compute the average winning rate. The results are shown in Figure 9. As seen from the figure, DPO shows good robustness against different opponents.

5. Closing Remarks

In this paper, we aim to seek diverse policies in an under-explored setting, namely RL tasks with structured action spaces that exhibit composability and local dependencies. The complex action structure, non-uniform reward landscape, and subtle hyperparameter tuning due to the structured actions prevent existing methods from scaling well. We propose a simple and effective method, Diverse Policy Optimization (DPO), to model the policies in structured action space as energy-based models by following the probabilistic RL framework. DPO adopts a joint training framework in which the energy-based model and the generative flow network, introduced as an efficient, diverse EBM-based policy sampler, are optimized alternately: the energy function serves as the negative log-reward function for the GFlowNet, which is trained with the trajectory balance objective to sample from the evolving energy-based policies. In turn, the energy function is trained with soft Bellman backup, where the GFlowNet provides diverse samples. Experiments demonstrate that the proposed DPO is both general and practical across structured action spaces with physical and, more generally, logical local dependencies.

Acknowledgments

This work was supported in part by Postdoctoral Science Foundation of China (2022M723039), NSFC (62106213, 72150002), SSTP (RCBS20210609104356063, JCYJ20210324120011032), and a grant from Shenzhen Institute of Artificial Intelligence and Robotics for Society.

Appendix A Related Works

To the best of our knowledge, existing work on reinforcement learning rarely pursues both the quality as well as the diversity of optimal policies in sequential decision problems with large-scale, structured action spaces. Therefore, this section will briefly review the work in reinforcement learning focusing on the diversity of solutions and dealing with sequential decision problems with large-scale or structured action spaces, respectively.

A.1. Diverse Solutions in RL

Most of the literature on this problem has been done in the field of neuroevolution methods inspired by Quality-Diversity (QD), seeking to maximize the reward of a policy through approaches strongly motivated by natural biological processes. They typically work by perturbing a policy and either computing a gradient (as in Evolution Strategies) or selecting the top-performing perturbations (as in Genetic Algorithms). Neuroevolution methods comprise two leading families of algorithms: MAP-Elites (Cully et al., 2015; Mouret and Clune, 2015) and novelty search with local competition (Lehman and Stanley, 2011). These methods typically maintain a collection of policies and adapt it using evolutionary algorithms to balance the QD trade-off (Pugh et al., 2016; Duarte et al., 2017; Parker-Holder et al., 2020; Nilsson and Cully, 2021; Gangwani et al., 2021; Lim et al., 2022).

In another part of the work, intrinsic rewards have been used for learning diversity in terms of the discriminability of different trajectory-specific quantities (Gregor et al., 2016; Eysenbach et al., 2019; Hartikainen et al., 2020; Goyal et al., 2020; Sharma et al., 2020a; Zahavy et al., 2020; Alver and Precup, 2022). These methods are similar in principle to novelty search without a reward signal but instead focus on diversity in behaviors defined by the states they visit. Other work implicitly induces diversity to learn policies that maximize the set robustness to the worst-possible reward (Kumar et al., 2020; Zahavy et al., 2021), or uses diversity as a regularizer when maximizing the extrinsic reward (Levine, 2018; Gangwani et al., 2019; Masood and Doshi-Velez, 2019; Sharma et al., 2020b; Zhang et al., 2019). There is also a small body of work that transforms the problem of seeking diversity policies into a Constrained Markov Decision Process (Sun et al., 2020; Zhou et al., 2022; Derek and Isola, 2021; Zahavy et al., 2022).

Beyond learning diverse policies in RL, several related lines of work also encourage policy diversity. In imitation learning, the problem of imitating diverse behaviors from expert demonstrations has been addressed in previous studies (Wang et al., 2017; Li et al., 2017; Sharma et al., 2019; Merel et al., 2019). In these methods, diverse behaviors are encoded in latent variables. However, these imitation learning methods assume the availability of observations of diverse behaviors performed by experts. Encouraging agents to diversify their exploration in the early stages of RL has also received significant attention in recent years (Hong et al., 2018; Conti et al., 2018; Khadka et al., 2019; Liu et al., 2019; Majumdar et al., 2020; Peng et al., 2020). The diversity of policies in multi-agent reinforcement learning (MARL) is also crucial for improving agents’ robustness and their ability to cooperate zero-shot (Tang et al., 2021; Yang et al., 2020; Lupu et al., 2021; Nieves et al., 2021).

A.2. Structured or Large-Scale Actions

A large part of the current work on policy optimization for structured action spaces addresses one particular class of problems, namely, parametric action space problems, in which the action space has a particular master-slave structure. The difficulty in solving the parameterized action space lies in the heterogeneity of discrete master actions and continuous slave actions. Current methods either learn a continuous parameter policy for each discrete action (Masson et al., 2016; Xiong et al., 2018; Bester et al., 2019); or discrete actions are output in parallel with continuous actions and employ gradient post-processing techniques or improved value function networks to solve the master-slave action correspondence problem (Hausknecht and Stone, 2015; Fan et al., 2019); or first, generate discrete actions, then generate continuous parameters based on that action and design sophisticated gradient update schemes for end-to-end training (Delalleau et al., 2019; Berner et al., 2019; Wei et al., 2018a).

In contrast, there are fewer algorithms oriented towards structured action spaces in general, and in the tasks solved by these algorithms, there are no explicit dependencies between atomic actions. Thus, existing approaches are either based on the assumption of independence of the decomposed sub-actions (Tavakoli et al., 2018; Mahajan et al., 2021a); or they are based on the inductive bias to assign a conditional dependency structure to the decomposed sub-actions and pick up the actions one by one through an autoregressive form based on recurrent neural networks, which are finally spliced into the original actions (Metz et al., 2017; PIERROT et al., 2021). There are also a series of approaches that assume a game relationship between the decomposed sub-actions, model each sub-action as an agent, and use MARL methods to solve them (Yang et al., 2018; Fu et al., 2019; Vinyals et al., 2019; Mahajan et al., 2021b; Li et al., 2021). However, the field of MARL is still in the preliminary exploration stage, and numerous theoretical problems remain unsolved. Thus modeling as a multi-agent problem will introduce more new challenges.

To address the curse of dimensionality caused by (non-structured) large-scale action spaces, existing methods are based on the idea of reshaping the action space and thus reducing the dimensionality, e.g., some works perform dimensionality reduction by clustering the actions (Dulac-Arnold et al., 2015; Chandak et al., 2019; He et al., 2015; Wang and Yu, 2016; Tennenholtz and Mannor, 2019). However, these approaches require the assumption that actions carry dense semantic information, e.g., that they consist of natural language, and thus cannot be applied to general high-dimensional tasks. Some works propose solutions for generic large-scale action spaces, such as dividing the action space by using multiple hierarchical policies similar to a tree structure to reduce the action dimension of each layer of the policy (Zahavy et al., 2018; Chen et al., 2019; Delarue et al., 2020); or gradually enlarging the action space via curriculum learning so that the policy only needs to be optimized in a smaller action space in the early stage (Farquhar et al., 2020).

Appendix B Training Details

B.1. Environments

Adaptive Traffic Signal Control. This benchmark is based on $2$ well-established scenarios for the Simulation of Urban Mobility (SUMO) traffic simulator (Behrisch et al., 2011), namely, “TAPAS Cologne” ( $8$ lights) (Varschen and Wagner, 2006) and “InTAS” ( $21$ lights) (Lobo et al., 2020), which describe traffic within two real-world cities, Cologne and Ingolstadt (Germany), respectively. There are $3$ kinds of tasks in the original work (Ault and Sharon, 2021), namely (a) controlling a single intersection, (b) controlling multiple intersections along an arterial corridor, and (c) coordinated control of multiple intersections within a congested area. We select the most complex coordinated control task (c) to demonstrate the advantage of DPO in finding diverse policies. In the coordinated control task, the atomic action is the selection of the signal light’s phase at each intersection, and the physical dependencies are the roads between the intersections. The road network is shown in Figure 10.

Battle Scenario. This benchmark is based on MAgent (Zheng et al., 2018), a research platform for many-agent reinforcement learning. We select the competitive task, Battle, as the simulation environment to highlight the advantages of diverse policies. In Battle, $n$ agents learn to fight against $n$ enemies whose abilities are superior to those of the agents (Figure 11). As an enemy’s hit points exceed the damage dealt by a single agent, agents must continuously cooperate to kill the enemy.

In our experiments on the ATSC and Battle benchmarks, all the environment settings, such as the definition of state, the definition of reward, etc., as well as the evaluation metrics, are kept the same as in Ault and Sharon (2021) (https://github.com/Pi-Star-Lab/RESCO) and Terry et al. (2021) (https://github.com/Farama-Foundation/PettingZoo), respectively.

B.2. Methods

Random seeds. Except where noted in the text, all experiments were run with $5$ random seeds each. Graphs show the average (solid line) and standard deviation (shaded region) of performance over random seeds throughout training. In the ATSC benchmark, the tables show the empirical mean of the relevant evaluation metrics.
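As a minimal sketch of how such per-seed curves could be aggregated for plotting, the snippet below computes the per-step mean and standard deviation across seeds. The function name `aggregate_over_seeds` and the use of the population standard deviation are illustrative assumptions; the source does not state which estimator was used.

```python
from statistics import mean, pstdev


def aggregate_over_seeds(curves):
    """Aggregate learning curves over random seeds.

    curves: list of per-seed curves, each a list of metric values with the
            same length (one value per logged training step).
    Returns (means, stds), each a list with one entry per training step.
    """
    # Transpose so that each tuple holds the values of all seeds at one step.
    per_step = list(zip(*curves))
    means = [mean(step) for step in per_step]
    # Population standard deviation (an assumption; the sample version
    # would use statistics.stdev instead).
    stds = [pstdev(step) for step in per_step]
    return means, stds
```

The solid line would then be `means`, and the shaded region would span `means[t] - stds[t]` to `means[t] + stds[t]` at each step `t`.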

Hyperparameters. Table 2 shows the tuning range of the hyperparameters used for all the experiments of our method and the baselines. For all hyperparameters that need to be tuned, we use the Bayesian hyperparameter search method of the wandb platform (https://wandb.ai/) for parallel tuning. During tuning, the platform builds a probabilistic model of the metric score as a function of the hyperparameters, using a Gaussian process to model this relationship, and chooses hyperparameters that maximize the probability of improving the metric.
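For illustration, a Bayesian sweep on the wandb platform is typically described by a configuration such as the one sketched below. The metric name `avg_delay`, the hyperparameter names, and all ranges are hypothetical placeholders and are not the values of Table 2; only the `method`, `metric`, and `parameters` keys follow the wandb sweep configuration format.

```python
# Hypothetical sweep configuration; names and ranges are placeholders,
# not the tuning ranges reported in Table 2.
sweep_config = {
    "method": "bayes",  # Bayesian search over the hyperparameters
    "metric": {
        "name": "avg_delay",  # assumed metric name logged by the training run
        "goal": "minimize",   # ATSC metrics are lower-is-better
    },
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-3},  # continuous range
        "batch_size": {"values": [32, 64, 128]},      # discrete choices
    },
}

# With wandb installed and a user-defined train() function, the sweep could
# be launched roughly as follows (left commented out so the block runs
# without the dependency):
#
#   import wandb
#   sweep_id = wandb.sweep(sweep=sweep_config, project="my-project")
#   wandb.agent(sweep_id, function=train, count=20)
```

Several agents can be attached to the same sweep identifier, which is one way to realize the parallel tuning described above.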

Hardware. The experiments were run on two servers: one with $128$ cores, $128$ GB of memory, and $4$ NVIDIA GeForce GTX 1080 Ti graphics cards with $11$ GB of video memory each; and one with $128$ cores, $256$ GB of memory, and $2$ NVIDIA GeForce RTX 3090 graphics cards with $24$ GB of video memory each.

The Code of Baselines. The code and licenses of the baselines are listed below:

• IDQN (Zheng et al., 2018): https://github.com/geek-ai/MAgent, MIT License;
• MFQ (Yang et al., 2018): https://github.com/mlii/mfrl, MIT License;
• Max-Pressure (Chen et al., 2020): https://github.com/Pi-Star-Lab/RESCO, No License;
• MPLight (Zheng et al., 2019): https://github.com/Pi-Star-Lab/RESCO, No License;
• DvD (Parker-Holder et al., 2020): https://github.com/jparkerholder/DvD_ES, Apache-2.0 License;
• SQL (Haarnoja et al., 2017): https://github.com/haarnoja/softqlearning, No License;
• RSPO (Zhou et al., 2022): https://github.com/footoredo/rspo-iclr-2022, No License.

Learning curves are smoothed by an exponential moving average with coefficient $0.6$. Source code is available at an anonymous code repository (https://anonymous.4open.science/r/DPO), which is based on the code of Ma et al. (2019) (https://github.com/qiang-ma/graph-pointer-network) and Zhang et al. (2022) (https://github.com/GFNOrg/EB_GFN).
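The exponential moving average smoothing of learning curves can be sketched as follows. This is an illustration under an assumed convention, in which the coefficient $0.6$ weights the previous smoothed value and $0.4$ weights the new observation; the source does not specify which of the two common conventions is used, and the function name `ema_smooth` is hypothetical.

```python
def ema_smooth(values, coeff=0.6):
    """Smooth a sequence with an exponential moving average.

    Assumed convention: s_t = coeff * s_{t-1} + (1 - coeff) * x_t,
    with s_0 = x_0. The source only states the coefficient 0.6.
    """
    smoothed = []
    prev = None
    for v in values:
        # The first point is kept as-is; later points blend with the past.
        prev = v if prev is None else coeff * prev + (1.0 - coeff) * v
        smoothed.append(prev)
    return smoothed
```

Under this convention, a larger coefficient yields a smoother curve that reacts more slowly to changes in the raw metric.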

Appendix C More Results

Due to space constraints, we place some additional validation results in the appendix. These results consist of three main parts: the first is a comparison of three algorithms that encourage policy diversity, DvD, SQL, and RSPO, with their respective independent learning variants; the second is a set of ablation studies of the proposed DPO algorithm; and the third verifies the robustness of different algorithms in the ATSC benchmark task under out-of-distribution traffic flow. Before giving these additional experimental results, we present the complete diversity visualization results here, as shown in Figures 12(a), 12(b), 13(a), and 13(b).

In addition to visualizing the global diversity of the policies obtained by different algorithms, we also show the policy diversity of the proposed DPO at local intersections. To improve the interpretability of the visualization results, we select the MP method, which is based on heuristic rules and expert knowledge, as a comparison; the results are shown in Figure 14. It can be seen from the figure that the policy output by MP better matches the traffic flow, thereby reducing the average delay and other indicators. However, DPO does not simply perform local optimization but also considers global information. As a result, the diverse policies obtained by DPO achieve a trade-off in allocating green light time at different times of a single intersection.

C.1. Independent Learning Variants

In the ATSC benchmark task, we find that the performance of the three algorithms that encourage policy diversity, DvD, SQL, and RSPO, degrades significantly in large-scale structured action spaces. This is why, in the larger-scale Battle scenarios, we directly use the corresponding independent learning variants of these algorithms. In this section, we further compare the DvD, SQL, and RSPO algorithms with their independent learning variants I-DvD, I-SQL, and I-RSPO in the TAPAS Cologne and InTAS scenarios of the ATSC benchmark; the results are shown in Tables 3(a) and 3(b).

The tables show that using the independent learning variant in a larger structured action space leads to a more considerable performance improvement. However, the DvD algorithm still does not perform as well as expected. Independent learning encourages diversity of atomic actions, which will also prevent I-SQL and I-RSPO from obtaining better policy diversity in the structured action space. To verify this, we use the same visualization method as in the experimental part of the main text, and the final results are shown in Figure 15.

As can be seen from the figure, in the small-scale structured action space, the independent learning variant does not bring a significant improvement in terms of diversity; however, in the large-scale structured action space, the independent learning variant learns policies with noticeably greater diversity.

C.2. Ablation Study

In this section, we perform ablation studies on three critical implementation choices of the DPO algorithm: the additional soft value regression task (denoted as $\mathrm{S}$) introduced to accelerate the training of the total flow $Z$, the additional termination action (denoted as $\mathrm{T}$) introduced to accelerate training, and the action space design (denoted as $\mathrm{P}$) of GFlowNet. For the last point, in the ATSC benchmark, we analyze the impact of the road network-based GFlowNet action space design on performance; in the Battle benchmark, we analyze the impact of different physical topologies resulting from different influence ranges.

We first analyze the performance of the DPO algorithm on the ATSC benchmark; the results are shown in Table 4 and Figures 16(a) and 16(b). As seen from the table, the soft value regression task plays a crucial role in the performance of DPO. This is because it guides the training of the total flow $Z$, and the accuracy of the $Z$ estimate directly determines the diversity of the sampled structured actions. While the termination action and the road network-based GFlowNet action space design have little impact on final performance, they can significantly improve the convergence speed of the algorithm. Overall, the results of the ablation study are consistent with our previous conjecture.

The ablation studies of the DPO algorithm on the Battle benchmark exhibit similar results, as shown in Table 5 and Figures 16(c) and 16(d). In our experiments, instead of varying the range of influence, we use an alternative approach: the nearest $k$ agents are chosen to form the physical dependencies. As seen in Table 5, while choosing a larger number of agents to form the physical dependencies provides a slight performance improvement, it also slows down the convergence of the algorithm because of the resulting larger GFlowNet action space.
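The nearest-$k$ construction of physical dependencies can be illustrated with the following sketch. It assumes that agents are identified by their index and described by 2D grid positions, and that distance is Euclidean; these details, as well as the function name `k_nearest_agents`, are assumptions made for illustration rather than the implementation used in the experiments.

```python
import math


def k_nearest_agents(positions, k):
    """Build a dependency map from each agent to its k nearest other agents.

    positions: list of (x, y) coordinates, one per agent (assumed layout).
    k: number of neighbors that form the physical dependencies of an agent.
    Returns a dict mapping each agent index to a list of neighbor indices,
    ordered from nearest to farthest (ties broken by agent index).
    """
    neighbors = {}
    for i, pos in enumerate(positions):
        others = [j for j in range(len(positions)) if j != i]
        # Sort the other agents by Euclidean distance to agent i.
        others.sort(key=lambda j: math.dist(pos, positions[j]))
        neighbors[i] = others[:k]
    return neighbors
```

A larger `k` produces a denser dependency map, which is consistent with the observation above that more neighbors enlarge the GFlowNet action space.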

C.3. Robustness in ATSC benchmarks

As explained in $\S$ 1, policy diversity can improve the robustness of algorithms in non-stationary environments. Therefore, this section tests the robustness of different algorithms by perturbing the traffic distribution in the ATSC benchmark and verifies whether diverse policies are effective against non-stationary factors in the environment. Specifically, for the TAPAS Cologne ($8$ lights, $5$ main roads) and InTAS ($21$ lights, $8$ main roads) scenarios in the ATSC benchmark, we first randomly select $1$ or $2$ of the respective main roads, increase their traffic flow by $10\%$, and then train all the algorithms for $50$ episodes (about $3\%$ of the standard training sample size).
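The perturbation procedure can be sketched as follows. The dictionary representation of traffic flow, the road identifiers, and the function name `perturb_traffic` are hypothetical; the sketch only mirrors the description above (select $1$ or $2$ main roads at random and scale their flow by $1.1$) and does not interact with SUMO.

```python
import random


def perturb_traffic(flows, rng, increase=0.10):
    """Increase the traffic flow on 1 or 2 randomly chosen main roads.

    flows: dict mapping a main-road identifier to its baseline flow
           (hypothetical representation of the scenario's demand).
    rng: a random.Random instance, so that the selection is reproducible.
    increase: relative increase applied to the chosen roads (10% here).
    Returns (perturbed_flows, chosen_roads).
    """
    n_roads = rng.choice([1, 2])                  # perturb 1 or 2 roads
    chosen = rng.sample(sorted(flows), n_roads)   # which roads to perturb
    perturbed = dict(flows)                       # leave the input untouched
    for road in chosen:
        perturbed[road] = flows[road] * (1.0 + increase)
    return perturbed, chosen
```

For example, `perturb_traffic({f"road_{i}": 100.0 for i in range(5)}, random.Random(0))` would correspond to the TAPAS Cologne setting with $5$ main roads, with placeholder flow values.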

Since the DvD and MPLight algorithms perform poorly on the ATSC and Battle benchmarks, we do not consider these two methods here. Also, considering the poor scalability of SQL and RSPO in large-scale structured action spaces, we only verify the robustness of their independent learning variants, i.e., I-SQL and I-RSPO. The average performance is shown in Table 6.

As seen from the table, DPO can quickly achieve good performance using only a small number of samples for fine-tuning. The other algorithms lack policy diversity, which leaves a significant performance gap between them and DPO.

Bibliography

Alver, S. and Precup, D. (2022). Constructing a good behavior basis for transfer using generalized policy updates. In ICLR.
Ault, J. and Sharon, G. (2021). Reinforcement learning benchmarks for traffic signal control. In NeurIPS.
Behrisch, M., Bieker, L., Erdmann, J. and Krajzewicz, D. (2011). SUMO – simulation of urban mobility: an overview. In SIMUL 2011.
Bengio, E., Jain, M., Korablyov, M., Precup, D. and Bengio, Y. (2021a). Flow network based generative models for non-iterative diverse candidate generation. In NeurIPS.
Bengio, Y., Deleu, T., Hu, E. J., Lahlou, S., Tiwari, M. and Bengio, E. (2021b). GFlowNet foundations. arXiv preprint arXiv:2111.09266.
Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C. et al. (2019). Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
Bester, C. J., James, S. D. and Konidaris, G. D. (2019). Multi-pass Q-networks for deep reinforcement learning with parameterised action spaces. arXiv preprint arXiv:1905.04388.
Chandak, Y., Theocharous, G., Kostas, J., Jordan, S. and Thomas, P. (2019). Learning action representations for reinforcement learning. In ICML.