Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning

Tan Jing; Xiaorui Li; Chao Yao; Xiaojuan Ban; Yuetong Fang; Renjing Xu; Zhaolin Yuan

arXiv:2508.19900·cs.LG·April 30, 2026

Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning

Tan Jing, Xiaorui Li, Chao Yao, Xiaojuan Ban, Yuetong Fang, Renjing Xu, Zhaolin Yuan

PDF

1 Repo 1 Video

TL;DR

The paper introduces ASPC, a dynamic framework for balancing policy constraints in offline RL, reducing the need for dataset-specific hyperparameter tuning and improving performance across diverse datasets.

Contribution

ASPC is a second-order differentiable method that adaptively scales policy constraints, outperforming existing methods with minimal tuning and computational overhead.

Findings

01

ASPC outperforms other adaptive constraint methods on 39 datasets.

02

ASPC requires only a single hyperparameter setting across datasets.

03

The code for ASPC will be publicly available at the specified GitHub link.

Abstract

Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL training. However, because the scale of the constraints varies across tasks and datasets of differing quality, existing methods must meticulously tune hyperparameters to match each dataset, which is time-consuming and often impractical. We propose Adaptive Scaling of Policy Constraints (ASPC), a second-order differentiable framework that dynamically balances RL and behavior cloning (BC) during training. We theoretically analyze its performance improvement guarantee. In experiments on 39 datasets across four D4RL domains, ASPC using a single hyperparameter configuration outperforms other adaptive constraint methods and state-of-the-art offline…

Figures9

Click any figure to enlarge with its caption.

Tables13

Table 1. Table 1: Average normalized score over the final evaluation across four random seeds. The best performance in each dataset is highlighted in bold , while the second-best performance is indicated with an underline . Blue shading indicates methods with top domain average performance. The symbol ± \pm denotes the standard deviation. ✓denotes fixed hyperparameters, whereas ✗denotes dataset-specific ones. * To ensure fairness, TD3+BC and wPC employ the robust critic described in Section 5.5 .

	Task Name	TD3+BC^*(✓)	A2PR(✓)	IQL(✗)	wPC^*(✓)	ReBRAC(✗)	ASPC (Ours)(✓)
HalfCheetah	Random	10.6 $\pm$ 0.7	21.1 $\pm$ 0.8	19.5 $\pm$ 0.8	18.8 $\pm$ 0.7	29.5 $\pm$ 1.5	20.8 $\pm$ 0.9
	Medium	49.6 $\pm$ 0.2	56.1 $\pm$ 0.3	50.0 $\pm$ 0.2	54.8 $\pm$ 0.2	65.6 $\pm$ 1.0	58.7 $\pm$ 0.4
	Expert	100.4 $\pm$ 0.4	99.9 $\pm$ 3.2	95.5 $\pm$ 2.1	103.8 $\pm$ 2.4	105.9 $\pm$ 1.7	105.1 $\pm$ 1.2
	Medium-Expert	97.9 $\pm$ 1.6	95.9 $\pm$ 6.0	92.7 $\pm$ 2.8	98.9 $\pm$ 8.5	101.1 $\pm$ 5.2	99.9 $\pm$ 1.2
	Medium-Replay	45.8 $\pm$ 0.2	49.0 $\pm$ 0.4	42.1 $\pm$ 3.6	48.1 $\pm$ 0.2	51.0 $\pm$ 0.8	50.6 $\pm$ 0.5
	Full-Replay	74.5 $\pm$ 1.6	79.5 $\pm$ 1.5	75.0 $\pm$ 0.7	76.7 $\pm$ 2.3	82.1 $\pm$ 1.1	79.3 $\pm$ 0.9
Hopper	Random	8.6 $\pm$ 0.2	20.1 $\pm$ 11.6	10.1 $\pm$ 5.9	8.5 $\pm$ 1.4	8.1 $\pm$ 2.4	9.4 $\pm$ 1.5
	Medium	62.0 $\pm$ 3.0	78.3 $\pm$ 4.4	65.2 $\pm$ 4.2	81.8 $\pm$ 9.8	102.0 $\pm$ 1.0	92.7 $\pm$ 7.2
	Expert	108.2 $\pm$ 4.2	83.9 $\pm$ 6.0	108.8 $\pm$ 3.1	79.1 $\pm$ 26.6	100.1 $\pm$ 8.3	112.3 $\pm$ 0.4
	Medium-Expert	103.3 $\pm$ 9.2	110.8 $\pm$ 2.6	85.5 $\pm$ 29.7	109.1 $\pm$ 4.5	107.0 $\pm$ 6.4	111.0 $\pm$ 2.1
	Medium-Replay	47.4 $\pm$ 35.4	98.9 $\pm$ 2.0	89.6 $\pm$ 13.2	100.8 $\pm$ 0.7	98.1 $\pm$ 5.3	101.3 $\pm$ 0.6
	Full-Replay	90.3 $\pm$ 22.9	97.1 $\pm$ 17.8	104.4 $\pm$ 10.8	105.6 $\pm$ 0.6	107.1 $\pm$ 0.4	107.2 $\pm$ 0.5
Walker2d	Random	5.9 $\pm$ 3.5	1.2 $\pm$ 1.5	11.3 $\pm$ 7.0	12.5 $\pm$ 10.6	18.4 $\pm$ 4.5	15.6 $\pm$ 6.4
	Medium	62.0 $\pm$ 3.0	84.2 $\pm$ 4.7	80.7 $\pm$ 3.4	89.6 $\pm$ 0.3	82.5 $\pm$ 3.6	92.4 $\pm$ 5.4
	Expert	108.2 $\pm$ 4.2	84.8 $\pm$ 49.0	96.9 $\pm$ 32.3	111.5 $\pm$ 0.1	112.3 $\pm$ 0.2	110.8 $\pm$ 0.1
	Medium-Expert	103.3 $\pm$ 9.2	88.2 $\pm$ 40.7	112.1 $\pm$ 0.5	110.1 $\pm$ 0.5	111.6 $\pm$ 0.3	111.1 $\pm$ 0.3
	Medium-Replay	76.6 $\pm$ 12.7	84.5 $\pm$ 12.3	75.4 $\pm$ 9.3	93.4 $\pm$ 3.0	77.3 $\pm$ 7.9	97.6 $\pm$ 0.5
	Full-Replay	88.3 $\pm$ 11.7	102.5 $\pm$ 0.0	97.5 $\pm$ 1.4	99.5 $\pm$ 0.5	102.2 $\pm$ 1.7	102.1 $\pm$ 0.2
	MuJoCo Avg	70.7	74.2	72.9	77.8	81.2	82.1
Maze2d	Umaze	34.5 $\pm$ 13.9	102.5 $\pm$ 6.3	-8.9 $\pm$ 6.1	73.1 $\pm$ 13.8	106.8 $\pm$ 22.1	128.1 $\pm$ 31.8
	Medium	63.3 $\pm$ 63.3	90.4 $\pm$ 29.6	34.8 $\pm$ 2.7	87.4 $\pm$ 48.7	105.1 $\pm$ 31.6	117.8 $\pm$ 17.3
	Large	108.9 $\pm$ 43.6	177.7 $\pm$ 34.2	61.7 $\pm$ 3.5	123.3 $\pm$ 70.5	78.3 $\pm$ 61.7	195.8 $\pm$ 31.3
	Maze2d Avg	68.9	123.53	46.2	94.6	96.7	147.2
AntMaze	Umaze	100.0 $\pm$ 0.0	92.5 $\pm$ 8.3	83.3 $\pm$ 4.5	97.5 $\pm$ 5.0	97.8 $\pm$ 1.0	92.5 $\pm$ 5.0
	Umaze-Diverse	87.5 $\pm$ 12.5	32.5 $\pm$ 34.9	70.6 $\pm$ 3.7	75.0 $\pm$ 20.8	88.3 $\pm$ 13.0	92.5 $\pm$ 9.5
	Medium-Play	7.5 $\pm$ 9.5	40.0 $\pm$ 7.1	64.6 $\pm$ 4.9	85.0 $\pm$ 5.7	84.0 $\pm$ 4.2	85.0 $\pm$ 12.9
	Medium-Diverse	12.5 $\pm$ 12.5	40.0 $\pm$ 25.5	61.7 $\pm$ 6.1	85.0 $\pm$ 12.9	76.3 $\pm$ 13.5	70.0 $\pm$ 11.5
	Large-Play	2.5 $\pm$ 5.0	5.0 $\pm$ 8.7	42.5 $\pm$ 6.5	65.0 $\pm$ 19.1	60.4 $\pm$ 26.1	55.0 $\pm$ 5.7
	Large-Diverse	2.5 $\pm$ 5.0	22.5 $\pm$ 14.8	27.6 $\pm$ 7.8	65.0 $\pm$ 10.0	54.4 $\pm$ 25.1	52.5 $\pm$ 18.9
	AntMaze Avg	35.4	38.75	58.3	78.7	76.8	74.5
Pen	Human	53.8 $\pm$ 15.7	-2.1 $\pm$ 0.0	81.5 $\pm$ 17.5	39.9 $\pm$ 12.8	103.5 $\pm$ 14.1	81.1 $\pm$ 8.1
	Cloned	71.7 $\pm$ 21.5	6.5 $\pm$ 6.0	77.2 $\pm$ 17.7	34.6 $\pm$ 11.3	91.8 $\pm$ 21.7	87.2 $\pm$ 4.2
	Expert	126.6 $\pm$ 24.8	51.5 $\pm$ 38.4	133.6 $\pm$ 16.0	141.8 $\pm$ 11.8	154.1 $\pm$ 5.4	141.2 $\pm$ 9.4
Door	Human	0.0 $\pm$ 0.0	-0.2 $\pm$ 0.0	3.1 $\pm$ 2.0	-0.2 $\pm$ 0.0	0.0 $\pm$ 0.0	0.0 $\pm$ 0.0
	Cloned	0.0 $\pm$ 0.0	-0.3 $\pm$ 0.0	0.8 $\pm$ 1.0	0.0 $\pm$ 0.0	1.1 $\pm$ 2.6	0.0 $\pm$ 0.0
	Expert	81.6 $\pm$ 16.3	-0.3 $\pm$ 0.0	105.3 $\pm$ 2.8	51.4 $\pm$ 55.3	104.6 $\pm$ 2.4	105.6 $\pm$ 0.4
Hammer	Human	0.0 $\pm$ 0.0	1.1 $\pm$ 0.4	2.5 $\pm$ 1.9	0.0 $\pm$ 0.1	0.2 $\pm$ 0.2	2.2 $\pm$ 3.2
	Cloned	0.1 $\pm$ 0.0	0.3 $\pm$ 0.0	1.1 $\pm$ 0.5	0.1 $\pm$ 0.1	6.7 $\pm$ 3.7	12.0 $\pm$ 9.1
	Expert	132.8 $\pm$ 0.4	0.3 $\pm$ 0.1	129.6 $\pm$ 71.5	57.6 $\pm$ 0.1	133.8 $\pm$ 0.7	128.6 $\pm$ 0.4
Relocate	Human	0.0 $\pm$ 0.0	-0.3 $\pm$ 0.0	0.1 $\pm$ 0.1	0.1 $\pm$ 0.0	0.0 $\pm$ 0.0	0.1 $\pm$ 0.2
	Cloned	0.0 $\pm$ 0.0	-0.3 $\pm$ 0.0	0.2 $\pm$ 0.4	0.1 $\pm$ 0.0	0.9 $\pm$ 1.6	0.0 $\pm$ 0.0
	Expert	90.6 $\pm$ 18.2	-0.3 $\pm$ 0.0	106.5 $\pm$ 2.5	6.7 $\pm$ 4.6	106.6 $\pm$ 3.2	111.2 $\pm$ 2.4
	Adroit Avg	46.4	4.65	53.4	28.8	58.6	55.7
	Total Avg	57.7	51.2	62.6	64.2	74.8	77.9

Table 2. Table 2: Results under different α \alpha settings. Values in parentheses indicate the percent difference from Naive. Blue denotes improvement, and red denotes degradation.

Domain	Naive $α$	Converged $α$	Linear $α$	Dynamic $α$ (ASPC)
Mujoco	70.3	79.3 ( $↑$ 12.8%)	77.0 ( $↑$ 9.5%)	82.1 ( $↑$ 16.8%)
Maze2d	61.9	133.2 ( $↑$ 115.2%)	103.3 ( $↑$ 66.9%)	147.2 ( $↑$ 137.8%)
AntMaze	28.7	64.1 ( $↑$ 123.3%)	56.3 ( $↑$ 96.2%)	74.5 ( $↑$ 159.2%)
Adroit	49.9	49.1 ( $↓$ 1.6%)	47.6 ( $↓$ 4.6%)	55.7 ( $↑$ 11.6%)
Total Avg	57.0	71.8 ( $↑$ 25.9%)	66.7 ( $↑$ 17.0%)	77.9 ( $↑$ 36.6%)

Table 3. Table 3: Performance on Gym-MuJoCo datasets. +ASPC denotes the baseline combined with ASPC, and the percent change indicates its relative improvement over the baseline.

Gym-MuJoCo	IQL	+ASPC	CQL	+ASPC	Diffusion-QL	+ASPC
halfcheetah-medium	50.0	48.4 ( $↓$ 3.2%)	46.8	56.3 ( $↑$ 20.3%)	51.5	59.2 ( $↑$ 15.0%)
halfcheetah-medium-expert	92.7	94.4 ( $↑$ 1.8%)	94.2	93.6 ( $↓$ 0.6%)	96.8	96.7 ( $↓$ 0.1%)
halfcheetah-medium-replay	42.1	44.4 ( $↑$ 5.5%)	45.3	51.0 ( $↑$ 12.6%)	47.8	58.2 ( $↑$ 21.8%)
hopper-medium	65.2	61.4 ( $↓$ 5.8%)	61.3	71.6 ( $↑$ 16.8%)	90.5	101.0 ( $↑$ 11.6%)
hopper-medium-expert	85.5	100.2 ( $↑$ 17.2%)	90.1	106.9 ( $↑$ 18.6%)	111.1	111.1 ( $↑$ 0.0%)
hopper-medium-replay	89.6	88.3 ( $↓$ 1.4%)	77.5	79.9 ( $↑$ 3.1%)	101.3	100.4 ( $↓$ 0.9%)
walker2d-medium	80.7	83.9 ( $↑$ 4.0%)	82.6	83.8 ( $↑$ 1.5%)	87.0	80.3 ( $↓$ 7.7%)
walker2d-medium-expert	112.1	112.1 ( $↑$ 0.0%)	109.1	109.7 ( $↑$ 0.6%)	110.1	110.5 ( $↑$ 0.4%)
walker2d-medium-replay	75.4	77.5 ( $↑$ 2.8%)	74.5	81.7 ( $↑$ 9.7%)	95.5	95.2 ( $↓$ 0.3%)
Average	77.0	79.0 ( $↑$ 2.5%)	75.7	81.6 ( $↑$ 7.8%)	88.0	90.3 ( $↑$ 2.6%)

Table 4. Table 4: Performance on OGBench. Each entry shows mean ± \pm std. FQL+ASPC includes the relative performance change over FQL. Bold numbers indicate the best performance for each task.

OGBench	TD3+BC	IQL	ReBRAC	ASPC	FQL	FQL+ASPC
antmaze-large-navigate-singletask-task1-v0	$20 \pm 44$	$48 \pm 9$	$91 \pm 10$	$𝟗𝟑 \pm 𝟒$	$80 \pm 8$	$84$ ( $↑$ 5.0%)
antmaze-large-navigate-singletask-task2-v0	$20 \pm 31$	$42 \pm 6$	$𝟖𝟖 \pm 𝟒$	$87 \pm 7$	$57 \pm 10$	$63$ ( $↑$ 10.5%)
antmaze-large-navigate-singletask-task3-v0	$58 \pm 31$	$72 \pm 7$	$51 \pm 18$	$𝟗𝟔 \pm 𝟒$	$93 \pm 3$	$88$ ( $↓$ 5.4%)
antmaze-large-navigate-singletask-task4-v0	$31 \pm 37$	$51 \pm 9$	$84 \pm 7$	$𝟖𝟔 \pm 𝟓$	$80 \pm 4$	$70$ ( $↓$ 12.5%)
antmaze-large-navigate-singletask-task5-v0	$35 \pm 38$	$54 \pm 2$	$𝟗𝟎 \pm 𝟐$	$88 \pm 4$	$83 \pm 4$	$80$ ( $↓$ 3.6%)
antmaze-giant-navigate-singletask-task1-v0	$0 \pm 1$	$0 \pm 0$	$𝟐𝟕 \pm 𝟐𝟐$	$22 \pm 20$	$4 \pm 5$	$2$ ( $↓$ 50.00%)
antmaze-giant-navigate-singletask-task2-v0	$15 \pm 24$	$1 \pm 1$	$16 \pm 17$	$𝟕𝟒 \pm 𝟏𝟗$	$9 \pm 7$	$26$ ( $↑$ 188.9%)
antmaze-giant-navigate-singletask-task3-v0	$0 \pm 1$	$0 \pm 0$	$𝟑𝟒 \pm 𝟐𝟐$	$18 \pm 13$	$0 \pm 1$	$0$ ( $↑$ 0.0%)
antmaze-giant-navigate-singletask-task4-v0	$11 \pm 18$	$0 \pm 0$	$5 \pm 12$	$𝟔𝟓 \pm 𝟏𝟖$	$14 \pm 23$	$33$ ( $↑$ 135.7%)
antmaze-giant-navigate-singletask-task5-v0	$16 \pm 25$	$19 \pm 7$	$49 \pm 22$	$𝟓𝟓 \pm 𝟏𝟒$	$16 \pm 28$	$49$ ( $↑$ 206.3%)
Average	$20.6$	$28.7$	$53.5$	$68.4$	$43.6$	$49.5$ ( $↑$ 13.5%)

Table 5. Table 5: ASPC hyperparameters.

	Hyperparameter	Value
TD3+BC hyperparameters	Optimizer	Adam Kingma (2014)
	Critic learning rate	3e-4
	Actor learning rate	3e-4
	Mini-batch size	256
	Discount factor	0.99
	Target update rate	5e-3
	Policy noise	0.2
	Policy noise clipping	(-0.5, 0.5)
	Policy update frequency	2
Architecture	Critic hidden dim	256
	Critic hidden layers	3
	Critic activation function	ReLU
	Critic LayerNorm	True
	Actor hidden dim	256
	Actor hidden layers	2
	Actor activation function	ReLU
ASPC hyperparameters	Initial $α$	2.5
	$α$ learning rate	2e-3
	$α$ learning rate decay	Exponential
	$α$ update interval	10
	EMA smoothing factor	0.995

Table 6. Table 6: TD3+BC hyperparameters.

	Hyperparameter	Value
TD3+BC hyperparameters	Optimizer	Adam Kingma (2014)
	Critic learning rate	3e-4
	Actor learning rate	3e-4
	Mini-batch size	256
	Discount factor	0.99
	Target update rate	5e-3
	Policy noise	0.2
	Policy noise clipping	(-0.5, 0.5)
	Policy update frequency	2
	$α$	2.5
Architecture	Critic hidden dim	256
	Critic hidden layers	3
	Critic activation function	ReLU
	Critic LayerNorm	True
	Actor hidden dim	256
	Actor hidden layers	2
	Actor activation function	ReLU

Table 7. Table 7: wPC hyperparameters.

	Hyperparameter	Value
wPC hyperparameters	Optimizer	Adam Kingma (2014)
	Critic learning rate	3e-4
	Actor learning rate	3e-4
	Value learning rate	3e-4
	Mini-batch size	256
	Discount factor	0.99
	Target update rate	5e-3
	Policy noise	0.1
	Policy noise clipping	(-0.5, 0.5)
	Policy update frequency	2
	$α$	2.5
Architecture	Critic hidden dim	256
	Critic hidden layers	3
	Critic activation function	ReLU
	Critic LayerNorm	True
	Actor hidden dim	256
	Actor hidden layers	2
	Actor activation function	ReLU
	Value hidden dim	256
	Value hidden layers	2
	Value activation function	ReLU

Table 8. Table 8: A2PR hyperparameters.

	Hyper-parameters	Value
TD3+BC hyperparameters	Optimizer	Adam Kingma (2014)
	Critic learning rate	3e-4
	Actor learning rate	3e-4
	Mini-batch size	256
	Discount factor	0.99
	Target update rate $τ$	5e-3
	Policy noise	0.2
	Policy noise clipping	(-0.5, 0.5)
	Policy update frequency	2
	$α$	2.5
Architecture	Q-Critic hidden dim	256
	Q-Critic hidden layers	3
	Q-Critic Activation function	ReLU
	V-Critic hidden dim	256
	V-Critic hidden layers	3
	V-Critic Activation function	ReLU
	Actor hidden dim	256
	Actor hidden layers	2
	Actor Activation function	ReLU
A2PR hyperparameters	Normalized state	True
	$ϵ_{A}$	0
	$w_{1}$ , $w_{2}$	1.0

Table 9. Table 9: IQL hyperparameters.

	Hyperparameter	Value
IQL hyperparameters	Optimizer	Adam Kingma (2014)
	Critic learning rate	3e-4
	Actor learning rate	3e-4
	Value learning rate	3e-4
	Mini-batch size	256
	Discount factor	0.99
	Target update rate	5e-3
	Learning rate decay	Cosine
Architecture	Critic hidden dim	256
	Critic hidden layers	2
	Critic activation function	ReLU
	Actor hidden dim	256
	Actor hidden layers	2
	Actor activation function	ReLU
	Value hidden dim	256
	Value hidden layers	2
	Value activation function	ReLU

Table 10. Table 10: IQL’s best hyperparameters used in D4RL benchmark.

Task Name	$𝜷$	$IQL 𝝉$	Deterministic policy
halfcheetah-random	3.0	0.95	False
halfcheetah-medium	3.0	0.95	False
halfcheetah-expert	6.0	0.9	False
halfcheetah-medium-expert	3.0	0.7	False
halfcheetah-medium-replay	3.0	0.95	False
halfcheetah-full-replay	1.0	0.7	False
hopper-random	1.0	0.95	False
hopper-medium	3.0	0.7	True
hopper-expert	3.0	0.5	False
hopper-medium-expert	6.0	0.7	False
hopper-medium-replay	6.0	0.7	True
hopper-full-replay	10.0	0.9	False
walker2d-random	0.5	0.9	False
walker2d-medium	6.0	0.5	False
walker2d-expert	6.0	0.9	False
walker2d-medium-expert	1.0	0.5	False
walker2d-medium-replay	0.5	0.7	False
walker2d-full-replay	1.0	0.7	False
maze2d-umaze	3.0	0.7	False
maze2d-medium	3.0	0.7	False
maze2d-large	3.0	0.7	False
antmaze-umaze	10.0	0.7	False
antmaze-umaze-diverse	10.0	0.95	False
antmaze-medium-play	6.0	0.9	False
antmaze-medium-diverse	6.0	0.9	False
antmaze-large-play	10.0	0.9	False
antmaze-large-diverse	6.0	0.9	False
pen-human	1.0	0.95	False
pen-cloned	10.0	0.9	False
pen-expert	10.0	0.8	False
door-human	0.5	0.9	False
door-cloned	6.0	0.7	False
door-expert	0.5	0.7	False
hammer-human	3.0	0.9	False
hammer-cloned	6.0	0.7	False
hammer-expert	0.5	0.95	False
relocate-human	1.0	0.95	False
relocate-cloned	6.0	0.9	False
relocate-expert	10.0	0.9	False

Table 11. Table 11: ReBRAC hyperparameters.

	Hyperparameter	Value
ReBRAC hyperparameters	Optimizer	Adam Kingma (2014)
	Mini-batch size	1024 on Gym-MuJoCo, 256 on others
	Learning rate	1e-3 on Gym-MuJoCo, 1e-4 on AntMaze
	Discount factor $γ$	0.999 on AntMaze, 0.99 on others
	Target update rate $τ$	5e-3
Architecture	Hidden dim (all networks)	256
	Hidden layers (all networks)	3
	Activation function	ReLU
	Critic LayerNorm	True

Table 12. Table 12: ReBRAC’s best hyperparameters used in D4RL benchmark.

Task Name	$𝜷_{𝟏}$ (actor)	$𝜷_{𝟐}$ (critic)
halfcheetah-random	0.001	0.1
halfcheetah-medium	0.001	0.01
halfcheetah-expert	0.01	0.01
halfcheetah-medium-expert	0.01	0.1
halfcheetah-medium-replay	0.01	0.001
halfcheetah-full-replay	0.001	0.1
hopper-random	0.001	0.01
hopper-medium	0.01	0.001
hopper-expert	0.1	0.001
hopper-medium-expert	0.1	0.01
hopper-medium-replay	0.05	0.5
hopper-full-replay	0.01	0.01
walker2d-random	0.01	0.0
walker2d-medium	0.05	0.1
walker2d-expert	0.01	0.5
walker2d-medium-expert	0.01	0.01
walker2d-medium-replay	0.05	0.01
walker2d-full-replay	0.01	0.01
maze2d-umaze	0.003	0.001
maze2d-medium	0.003	0.001
maze2d-large	0.003	0.001
antmaze-umaze	0.003	0.002
antmaze-umaze-diverse	0.003	0.001
antmaze-medium-play	0.001	0.0005
antmaze-medium-diverse	0.001	0.0
antmaze-large-play	0.002	0.001
antmaze-large-diverse	0.002	0.002
pen-human	0.1	0.5
pen-cloned	0.05	0.5
pen-expert	0.01	0.01
door-human	0.1	0.1
door-cloned	0.01	0.1
door-expert	0.05	0.01
hammer-human	0.01	0.5
hammer-cloned	0.1	0.5
hammer-expert	0.01	0.01
relocate-human	0.1	0.01
relocate-cloned	0.1	0.01
relocate-expert	0.05	0.01

Table 13. Table 13: Ablation on different formulations of the L 3 L_{3} term. Values in parentheses denote relative change (%) w.r.t. the full formulation (variant 4). Positive changes are shown in blue , negative in red .

Formulation	Gym-MuJoCo	Maze2d	AntMaze	Adroit	Total Avg
(1)	76.7 (-6.6%)	97.2 (-34.0%)	31.7 (-57.4%)	55.2 (-0.9%)	64.7 (-16.9%)
(2)	76.8 (-6.5%)	107.2 (-27.2%)	31.9 (-57.2%)	54.9 (-1.4%)	65.5 (-15.9%)
(3)	81.1 (-1.2%)	151.8 (+3.1%)	73.3 (-1.6%)	56.1 (+0.7%)	77.7 (-0.2%)
(4)	82.1	147.2	74.5	55.7	77.9
(5)	82.0 (-0.1%)	149.2 (+1.4%)	74.6 (+0.1%)	55.5 (-0.4%)	77.9 (+0.0%)

Equations175

L = α L_{RL} + L_{BC},

L = α L_{RL} + L_{BC},

Q^{π} (s, a) = E_{π} [t = 0 \sum \infty γ^{t} R (s_{t}, a_{t}) s_{0} = s, a_{0} = a],

Q^{π} (s, a) = E_{π} [t = 0 \sum \infty γ^{t} R (s_{t}, a_{t}) s_{0} = s, a_{0} = a],

π = ar g π max E_{(s, a) \sim D} [λ RL Q (s, π (s)) - BC (π (s) - a)^{2}], λ = \frac{α}{\frac{1}{N} \sum _{i} ∣ Q ( s _{i} , a _{i} ) ∣} .

π = ar g π max E_{(s, a) \sim D} [λ RL Q (s, π (s)) - BC (π (s) - a)^{2}], λ = \frac{α}{\frac{1}{N} \sum _{i} ∣ Q ( s _{i} , a _{i} ) ∣} .

L_{inner} (θ; α) = E_{(s, a) \sim D} [- λ (α) Q (s, π_{θ} (s)) + ∥ π_{θ} (s) - a ∥^{2}],

L_{inner} (θ; α) = E_{(s, a) \sim D} [- λ (α) Q (s, π_{θ} (s)) + ∥ π_{θ} (s) - a ∥^{2}],

\tilde{θ} (α) = θ - η_{θ} \nabla_{θ} L_{inner} (θ; α),

\tilde{θ} (α) = θ - η_{θ} \nabla_{θ} L_{inner} (θ; α),

L_{1}

L_{1}

L_{2}

L_{3}

L_{outer} (\tilde{θ} (α)) = L_{1} + L_{2} + L_{3} .

L_{outer} (\tilde{θ} (α)) = L_{1} + L_{2} + L_{3} .

α \leftarrow α - η_{α} (\frac{\partial L _{outer} ( θ ~ ( α ))}{\partial θ ~} \frac{\partial θ ~ ( α )}{\partial α}),

α \leftarrow α - η_{α} (\frac{\partial L _{outer} ( θ ~ ( α ))}{\partial θ ~} \frac{\partial θ ~ ( α )}{\partial α}),

∥ Q (s, a_{1}) - Q (s, a_{2}) ∥

∥ Q (s, a_{1}) - Q (s, a_{2}) ∥

J (π_{t + 1}) - J (π_{t}) \geq \frac{1}{1 - γ} (Δ Q - Φ (Δ L_{\infty}^{B C}, c_{\infty}^{2})),

J (π_{t + 1}) - J (π_{t}) \geq \frac{1}{1 - γ} (Δ Q - Φ (Δ L_{\infty}^{B C}, c_{\infty}^{2})),

J (π^{*}) - J (π_{T}) \leq Ψ (ε_{β}) - T δ_{m i n},

J (π^{*}) - J (π_{T}) \leq Ψ (ε_{β}) - T δ_{m i n},

π_{t} \equiv π_{θ}, π_{t + 1} \equiv π_{\tilde{θ}} .

π_{t} \equiv π_{θ}, π_{t + 1} \equiv π_{\tilde{θ}} .

L_{t}^{BC} : = E_{(s, a) \sim D} [∥ π_{t} (s) - a ∥^{2}], L_{t + 1}^{BC} : = E_{(s, a) \sim D} [∥ π_{t + 1} (s) - a ∥^{2}],

L_{t}^{BC} : = E_{(s, a) \sim D} [∥ π_{t} (s) - a ∥^{2}], L_{t + 1}^{BC} : = E_{(s, a) \sim D} [∥ π_{t + 1} (s) - a ∥^{2}],

Δ L_{BC} : = ∣ L_{t + 1}^{BC} - L_{t}^{BC} ∣, c : = L_{t}^{BC}, x : = E_{s \sim D} [∥ π_{t + 1} (s) - π_{t} (s) ∥^{2}] .

Δ L_{BC} : = ∣ L_{t + 1}^{BC} - L_{t}^{BC} ∣, c : = L_{t}^{BC}, x : = E_{s \sim D} [∥ π_{t + 1} (s) - π_{t} (s) ∥^{2}] .

Δ L_{B C}

Δ L_{B C}

= E_{s} [(π_{t + 1} (s) - π_{β} (s))^{⊤} (π_{t + 1} (s) - π_{β} (s)) - (π_{t} (s) - π_{β} (s))^{⊤} (π_{t} (s) - π_{β} (s))]

= E_{s} [(π_{t + 1} (s) - π_{β} (s) + π_{t} (s) - π_{β} (s))^{⊤} \cdot (π_{t + 1} (s) - π_{t} (s))]

= E_{s} [π_{t + 1} (s) - π_{t} (s)^{2} + 2 (π_{t} (s) - π_{β} (s))^{⊤} (π_{t + 1} (s) - π_{t} (s))]

= x + 2 E_{s} [(π_{t} (s) - π_{β} (s))^{⊤} (π_{t + 1} (s) - π_{t} (s))]

\geq \ref l e m : A 1 - r t i ∣ x ∣ - 2 E_{s} [(π_{t} (s) - π_{β} (s))^{⊤} (π_{t + 1} (s) - π_{t} (s))]

= x - 2 E_{s} [(π_{t} (s) - π_{β} (s))^{⊤} (π_{t + 1} (s) - π_{t} (s))]

\geq \ref l e m : A 1 - cs x - 2 E_{s} π_{t} (s) - π_{β} (s)^{2} \cdot E_{s} π_{t + 1} (s) - π_{t} (s)^{2}

= x - 2 c x .

Δ L_{B C} \geq max {x - 2 c x, 0} .

Δ L_{B C} \geq max {x - 2 c x, 0} .

(Δ Q)^{2}

(Δ Q)^{2}

\leq E_{s} [(Q (s, π_{t + 1} (s)) - Q (s, π_{t} (s)))^{2}]

\leq L_{Q}^{2} E_{s} ∥ π_{t + 1} (s) - π_{t} (s) ∥^{2}

= L_{Q}^{2} x

x \geq x_{m i n} := (Δ Q)^{2} / L_{Q}^{2} .

x \geq x_{m i n} := (Δ Q)^{2} / L_{Q}^{2} .

Δ L_{B C} \geq max {0, \frac{( Δ Q ) ^{2}}{L _{Q}^{2}} - 2 c \frac{∣Δ Q ∣}{L _{Q}}} .

Δ L_{B C} \geq max {0, \frac{( Δ Q ) ^{2}}{L _{Q}^{2}} - 2 c \frac{∣Δ Q ∣}{L _{Q}}} .

x \leq (c + c^{2} + Δ L_{B C})^{2} .

x \leq (c + c^{2} + Δ L_{B C})^{2} .

(Δ Q)^{2} \leq L_{Q}^{2} (c + c^{2} + Δ L_{B C})^{2} .

(Δ Q)^{2} \leq L_{Q}^{2} (c + c^{2} + Δ L_{B C})^{2} .

J (π_{1}) - J (π_{2}) = \frac{1}{1 - γ} E_{s \sim d_{π_{1}}} [E_{a \sim π_{1}} Q^{π_{2}} (s, a) - V^{π_{2}} (s)] .

J (π_{1}) - J (π_{2}) = \frac{1}{1 - γ} E_{s \sim d_{π_{1}}} [E_{a \sim π_{1}} Q^{π_{2}} (s, a) - V^{π_{2}} (s)] .

∥ d_{π} - d_{π_{β}} ∥_{1}

∥ d_{π} - d_{π_{β}} ∥_{1}

x_{\infty}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Colin-Jing/ASPC
github

Videos

Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning· slideslive

Full text

Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning

Jing Tan

University of Science and Technology Beijing

Beijing, China

Hong Kong University of Science and Technology (Guangzhou)

Guangzhou, China

&Xiaorui Li, Chao Yao, Xiaojuan Ban & Zhaolin Yuan

University of Science and Technology Beijing

Beijing, China

[email protected]

&Yuetong Fang & Renjing Xu11footnotemark: 1

Hong Kong University of Science and Technology (Guangzhou)

Guangzhou, China

[email protected] Corresponding authors

Abstract

Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL training. However, because the scale of the constraints varies across tasks and datasets of differing quality, existing methods must meticulously tune hyperparameters to match each dataset, which is time-consuming and often impractical. To bridge this gap, we propose Adaptive Scaling of Policy Constraints (ASPC), a second-order differentiable framework that automatically adjusts the scale of policy constraints during training. We theoretically analyze its performance improvement guarantee. In experiments on 39 datasets across four D4RL domains, ASPC using a single hyperparameter configuration outperforms other adaptive constraint methods and state-of-the-art offline RL algorithms that require per-dataset tuning, achieving an average 35% improvement in normalized performance over the baseline. Moreover, ASPC consistently yields additional gains when integrated with a variety of existing offline RL algorithms, demonstrating its broad generality.

1 Introduction

Offline reinforcement learning (RL) learns a policy exclusively from a fixed, pre-collected dataset without further interactions with the environment Levine et al. (2020). This characteristic is particularly crucial in real-world applications such as autonomous driving El Sallab et al. (2017); Kendall et al. (2019), healthcare Prasad et al. (2017); Wang et al. (2018), industry Zhan et al. (2022); Yuan et al. (2024), and other tasks, where interacting with the environment can be expensive and risky.

Despite the potential advantages, a critical challenge in offline RL is the distribution shift Levine et al. (2020) between the offline data and the training policies, often leading to suboptimal or even invalid policy updates. Many methods have been proposed to mitigate the adverse effects of the distribution shift. A common strategy is to impose explicit or implicit policy constraints Fujimoto et al. (2019); Kumar et al. (2020); Fujimoto and Gu (2021); Kostrikov et al. (2022), ensuring that the learned policy remains close to the behavior policy used to collect the dataset. By imposing constraints on policy updates, these methods can effectively mitigate the extrapolation error of the Q value Fujimoto et al. (2019) induced by the distribution shift while offering certain performance guarantees.

A central but often overlooked issue in policy constraint methods is the choice of the constraint scale, which crucially governs the balance between the RL objective and the behavior cloning (BC) term. Existing approaches fall into two categories. First, methods that rely on dataset-specific hyperparameter tuning can achieve strong results, but their performance collapses once a single configuration is applied across tasks or datasets of varying quality, as shown in Figure 1(b). Second, adaptive variants with fixed hyperparameters Peng et al. (2023); Yang et al. (2024) alleviate tuning costs, yet they only reweight actions locally and neglect the global trade-off scale, leaving a significant gap to carefully tuned baselines. In practical offline RL, where extensive tuning is prohibitively expensive or even infeasible, the pressing challenge is how to achieve robust performance with a single hyperparameter configuration across diverse datasets.

To enable a single hyperparameter configuration to match or exceed the performance of finely tuned methods across datasets of varying quality and tasks, we propose an adaptive scaling of policy constraints (ASPC) approach that dynamically adjusts the constraint scale during training. The intuition of this method is shown in Figure 1(a). Our approach leverages a second-order differentiable optimization framework Finn et al. (2017) to balance the goals of RL and BC. Specifically, we parameterize the scale factor $\alpha$ as a learnable parameter that balances the RL objective $\mathcal{L}_{\mathrm{RL}}$ and the BC objective $\mathcal{L}_{\mathrm{BC}}$ in TD3+BC Fujimoto and Gu (2021). The combined objective $\mathcal{L}$ is given by

[TABLE]

For the full definitions of $\alpha$ , refer to equation 3. During training, $\alpha$ is dynamically adjusted by constraining the rate of change of the Q-value and the BC loss, enabling the algorithm to discover a more stable learning path and exhibit remarkable adaptability across tasks and datasets.

We theoretically analyze the performance improvement guarantee of ASPC and extensively evaluate it on the D4RL benchmark Levine et al. (2020). Our empirical results demonstrate that ASPC outperforms other state-of-the-art offline RL algorithms that depend on meticulously tuned hyperparameters for each dataset, while adding only minimal computational overhead to the original TD3+BC backbone. In addition, ASPC improves a variety of offline RL algorithms beyond TD3+BC, further indicating its generality and broad applicability.

2 Related Works

2.1 Offline RL

Offline RL aims to learn policies purely from static datasets and suffers from distribution shift between the behavior policy and the learned policy, leading to value overestimation and policy collapse. Existing approaches address this challenge from several perspectives. Policy constraint methods explicitly Fujimoto et al. (2019); Fujimoto and Gu (2021) or implicitly Kumar et al. (2020); Kostrikov et al. (2022) regularize the learned policy toward the behavior distribution. Uncertainty-aware approaches penalize actions with high epistemic or aleatoric uncertainty An et al. (2021); Bai et al. (2022); Zhang et al. (2023). Sequence modeling methods reformulate RL as conditional trajectory modeling using transformers Chen et al. (2021); Janner et al. (2021). Among these, policy constraint methods have emerged as the most direct and widely adopted solution, but their effectiveness crucially depends on properly scaling the constraint. This motivates our focus on developing an adaptive scaling mechanism that eliminates the need for per-dataset tuning while retaining robustness across diverse offline RL benchmarks.

2.2 Adaptive Policy Constraints

Balancing the RL objective against BC is central to offline RL, and the strength of this constraint critically affects both stability and performance. Recent work has explored adaptive ways to tune this balance. Trajectory- or sample-weighting methods such as AW Hong et al. (2023), wPC Peng et al. (2023), and OAP Yang et al. (2023) reweight transitions or actions based on estimated value or expert preference, thereby adjusting constraint strength locally. Other approaches introduce auxiliary models to refine constraint scaling, for example PRDC Ran et al. (2023), GORL Yang et al. (2024), A2PR Liu et al. (2024), and IEPC Liu and Hofert (2024). Despite these advances, current approaches either rely on per-dataset hyperparameter tuning for optimal performance, or apply a fixed configuration that yields only limited gains across domains. Our ASPC method addresses this gap by dynamically adjusting the constraint scale during training, enabling robust performance across diverse datasets with a single hyperparameter configuration.

3 Preliminaries

RL problems are formulated as a Markov decision process (MDP), described by the tuple $(S,A$ , $P,R,\gamma)$ . The set of states is $S$ , the set of actions is $A$ , the transition probability function is $P(s^{\prime}|s,a)$ , the reward function is $R(s,a)$ , and $\gamma\in[0,1)$ is the discount factor. The objective is to find a policy $\pi:S\rightarrow A$ that maximizes the expected discounted return. This objective is equivalently expressed as maximizing the Q-value $Q^{\pi}(s,a)$ under $\pi$ , given by:

[TABLE]

where $s_{t}$ and $a_{t}$ represent the state and action at time $t$ . In practice, RL algorithms update Q-values using the Bellman equation as an iterative rule, seeking to converge to the optimal policy $\pi^{*}$ .

A central challenge for offline RL is the distribution shift. When a state–action pair $(s,a)$ lies outside the dataset $\mathcal{D}$ , directly optimizing the Q–function may cause severe over-estimation. One remedy is to constrain the target policy $\pi$ to stay close to the behaviour policy $\pi_{\beta}$ . TD3+BC Fujimoto and Gu (2021) does so by solving:

[TABLE]

normalizes the RL term to the scale of the BC loss. In vanilla TD3+BC, $\alpha$ is a fixed constant. Instead of keeping the scale factor $\alpha$ static, we update it throughout training.

4 Method

We now present the ASPC algorithm in detail. We begin by introducing its core framework, a second-order differentiable optimization that adaptively balances the RL and BC objectives (Section 4.1). We then provide a theoretical analysis (Section 4.2), which explains the role of the mutual constraint term and establishes single-step and long-term performance guarantees. Finally, we describe a practical instantiation of ASPC built on TD3+BC (Section 4.3), which enables its application to standard offline RL benchmarks.

4.1 Adaptive Scaling of Policy Constraints

To adaptively adjust the relative scaling between the RL and BC objectives, ASPC adopts a meta-learning approach Finn et al. (2017); Franceschi et al. (2018). It converts the scale factor $\alpha$ in equation 3 into a learnable parameter and optimizes it dynamically via bilevel training, utilizing inner updates and outer updates to maximize RL exploration near the behavior policy.

Inner Update To optimize the policy under offline data, we define the inner objective as

[TABLE]

where $\lambda(\alpha)=\alpha\big/\mathbb{E}_{s\sim\mathcal{D}}\!\left[|Q(s,\pi_{\theta}(s))|\right]$ . The inner update is then obtained via a gradient descent step with learning rate $\eta_{\theta}$ :

[TABLE]

and $\tilde{\theta}(\alpha)$ denotes the updated policy parameters after one inner step.

Outer Update While the inner update optimizes the policy parameters $\theta$ for a given scale $\alpha$ , the outer update is responsible for adjusting $\alpha$ itself so as to dynamically balance the RL and BC objectives. The outer loss is composed of three coordinated components. $\mathcal{L}_{1}$ mirrors TD3 + BC and steers $\alpha$ toward a better balance between RL and BC. $\mathcal{L}_{2}$ penalizes abrupt increases in the expected Q-value, while $\mathcal{L}_{3}$ constrains large shifts in the BC loss. Together, $\mathcal{L}_{2}$ and $\mathcal{L}_{3}$ adaptively regulate the step prescribed by $\mathcal{L}_{1}$ , preventing either RL or BC from dominating and thereby stabilizing training. Formally, we write:

[TABLE]

The outer objective is

[TABLE]

Here, $\pi_{\theta}$ and $\pi_{\tilde{\theta}}$ denote the policies before and after the inner update, respectively. $\mathcal{.}detach$ indicates stopping gradients. While $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ are relatively standard, the design of $\mathcal{L}_{3}$ requires clarification. Theoretically, its form follows directly from Theorem 4.4, with details in Appendix A.3. Intuitively, $\mathcal{L}_{3}$ combines three factors: the rate of change in Q-values, the upper bound of the BC loss, and the variation in BC loss across iterations. Large Q-value fluctuations or a high BC-loss bound signal rapid policy change or significant deviation from the behavior policy. In such cases, strengthening the penalty on BC variation helps suppress distributional shift and stabilize training, consistent with our intuition. To update $\alpha$ , we treat the inner update parameters $\tilde{\theta}(\alpha)$ as an implicit function of $\alpha$ and use second-order derivatives. Lets $\eta_{\alpha}$ be the learning rate of $\alpha$ . The gradient-descent step is

[TABLE]

4.2 Theoretical Analysis

We now analyze the theoretical properties of ASPC. We show that the outer objective ensures stable updates and reduces the gap to the optimal policy.

Assumption 4.1.

The critic $Q(s,a)$ and the transition kernel $P(\cdot\mid s,a)$ are Lipschitz continuous with respect to the action variable. That is, there exist constants $L_{Q},L_{P}>0$ , independent of $s$ , such that for all $s\in\mathcal{S}$ and all $a_{1},a_{2}\in\mathcal{A}$ ,

[TABLE]

Proposition 4.2 (Mutual constraints between $\Delta L_{BC}$ and $(\Delta Q)^{2}$ ).

Under Assumption 4.1, the change in BC loss ( $\Delta L_{BC}$ ) and the squared change in Q-values ( $(\Delta Q)^{2}$ ) mutually constrain each other: $(\Delta Q)^{2}$ provides a lower bound on $\Delta L_{BC}$ , while $\Delta L_{BC}$ provides an upper bound on $(\Delta Q)^{2}$ .

This result shows that the two penalties in equation 7 and equation 8 are inherently coupled rather than independent. It explains why in practice some tasks succeed with only one of them, while others require both for stable training (see Section 5.5). The detailed proof is provided in Appendix A.1.

Proposition 4.3 (Single-step performance lower bound).

For the update step from $\pi_{t}$ to $\pi_{t+1}$ , the performance improvement admits the following lower bound:

[TABLE]

where $\Phi(\Delta L_{\infty}^{BC},c_{\infty}^{2})$ is a nonnegative function depending on the BC-loss variation upper bound $\Delta L_{\infty}^{BC}$ and the BC-loss upper bound $c_{\infty}^{2}$ .

This proposition serves as the theoretical basis for Theorem 4.4. It also directly motivates the design of the penalty term $\mathcal{L}_{3}$ (equation 8), whose form is derived from bounding $\Phi(\Delta L_{\infty}^{BC},c_{\infty}^{2})$ . The detailed derivation of $\Phi$ is deferred to Appendix A.2.

Theorem 4.4 (Single-step performance condition for ASPC).

An idealized ASPC update that satisfies the condition $\Delta Q\geq\Phi$ leads to a non-decreasing policy performance: $J(\pi_{t+1})-J(\pi_{t})\geq 0.$

ASPC employs a smooth relaxation of this condition via the outer objective, which is designed to guide updates toward this provably stable regime. The detailed proof is given in Appendix A.3.

Theorem 4.5 (Performance gap to optimal).

With Theorem 4.4, after $T$ iterations when the single-step gain vanishes ( $\delta_{T}=0$ ), the gap to the optimal policy satisfies:

[TABLE]

where $\Psi(\varepsilon_{\beta})$ is a function of the mismatch $\varepsilon_{\beta}$ between the behavior policy and the optimal policy, and $\delta_{\min}$ denotes the minimal single-step improvement before convergence.

This theorem shows that ASPC progressively reduces the suboptimality gap until convergence, where the remaining gap is controlled by $\Psi(\varepsilon_{\beta})$ . The full derivation of $\Psi(\varepsilon_{\beta})$ is given in Appendix A.4.

4.3 Implementation on TD3+BC

To make ASPC practical, we instantiate it on top of the TD3+BC backbone with only two modifications: (i) a redesigned critic network, and (ii) a learnable scale factor $\alpha$ . All other network components and hyperparameters remain unchanged. See Appendix B.2 for a full specification.

Recent studies show that deeper critics Kumar et al. (2022); Lee et al. (2022) and the insertion of LayerNorm between layers Nikulin et al. (2023); Ball et al. (2023); Tarasov et al. (2024a) can mitigate Q-value over-estimation and improve stability. Following this evidence, we extend the TD3+BC critic from two to three hidden layers and insert a LayerNorm after each layer. An ablation of this choice is provided in Section 5.5.

Algorithm 1 lists the ASPC procedure. Blue highlights indicate lines that differ from the TD3+BC backbone. Although second-order gradients increase cost, we set the $\alpha$ -update interval $k_{\alpha}$ far longer than the actor-update interval $k_{\pi}$ , which maintains performance while sharply reducing runtime. Section 5.4 analyses this trade-off in detail.

5 Experiments

In this section we evaluate ASPC on the D4RL benchmark. Section 5.1 compares ASPC with strong baselines to demonstrate its adaptability and overall effectiveness. Section 5.2 analyzes the learning curves of $\alpha$ during training, further illustrating ASPC’s adaptive behaviour. Section 5.3 investigates the necessity of dynamically adjusting $\alpha$ . Section 5.4 reports runtime results to highlight the efficiency of ASPC. Section 5.5 presents ablation studies on the key components of ASPC. Section 5.6 provides results on integrating the ideas of ASPC with other methods, and Section 5.7 presents the performance of ASPC on the OGBench benchmark.

5.1 Comparative Performance on Benchmark

We evaluate ASPC on 39 datasets spanning four D4RL domains Levine et al. (2020): MuJoCo (v2), AntMaze (v2), Maze2d (v1), and Adroit (v1). Our baselines include TD3+BC Fujimoto and Gu (2021) and IQL Kostrikov et al. (2022) as standard policy-constraint methods. wPC Peng et al. (2023) and A2PR Liu et al. (2024) are state-of-the-art (SOTA) adaptive policy constraint methods built on TD3+BC. ReBRAC Tarasov et al. (2024a) integrates multiple performance-boosting components into TD3+BC and has achieved SOTA results across a wide range of datasets. TD3+BC, wPC, A2PR, and ASPC are all set as the single hyperparameter set, whereas IQL and ReBRAC rely on dataset-specific hyperparameters found via grid search. We reproduce results for TD3+BC, wPC and A2PR. IQL and ReBRAC results are taken from Tarasov et al. (2024a; b). Complete experimental details for each algorithm are provided in the appendix B.2.

The performance comparison is summarized in Table 1. ASPC achieves the best performance on MuJoCo and Maze2d, and exhibits competitive results on Adroit and AntMaze. Most notably, ASPC attains SOTA performance on average across all four domains, which not only outperforms other adaptive policy constraint methods but also surpasses approaches that rely on meticulous per-dataset hyperparameter tuning, highlighting its remarkable adaptability. Figure 2 shows that the performance profile curves (left) place ASPC above all baselines for almost every threshold, and the min-max-scaled radar chart (right) gives ASPC the largest, most balanced polygon, visually confirming its strong and stable performance across tasks without per-dataset tuning.

5.2 Adaptability of the Sacle Factor

Dataset Adaptability Figure 3 shows the evolution of $\alpha$ on HalfCheetah, Hopper, and Walker2d for four dataset quality levels, listed from highest to lowest as expert, medium-expert, medium, and medium-replay. Across all three tasks, higher-quality datasets lead to smaller $\alpha$ , which places more weight on BC, whereas lower-quality datasets lead to larger $\alpha$ , shifting the emphasis toward RL. The consistent ordering confirms that ASPC automatically adjusts the policy-constraint scale to dataset quality without any per-dataset hyperparameter tuning.

Task Adaptability Figure 4 plots the $\alpha$ trajectories on six heterogeneous tasks. Tasks such as door, pen, hammer, and relocate possess narrow expert data distributions; here $\alpha$ settles near $10^{-1}$ , giving greater weight to BC. Conversely, antmaze and maze2d, whose datasets contain highly sub-optimal trajectories, drive $\alpha$ above 10, shifting emphasis to RL. This task-aware scaling requires no manual tuning and highlights ASPC’s cross-task adaptability.

Training Adaptability Combining the curves from Figures 4 and 3, we observe a common learning dynamic: $\alpha$ first drops (or rises only slightly) during the early training phase, indicating greater reliance on BC when the policy is still immature. As learning progresses and the critic stabilises, $\alpha$ gradually increases, handing more control to RL. This smooth, stage-wise adjustment underpins ASPC’s stable convergence across tasks and datasets.

5.3 Necessity of Dynamic Scale Factor Adjustment

As shown in Table 1, the hyperparameters meticulously selected via grid search ultimately underperform compared to the ASPC algorithm, which dynamically adjusts hyperparameters during training. This observation raises the question: is grid search simply failing to find the best setting, or is the dynamic adjustment in ASPC the true source of its advantage? To answer this, we conduct three controlled tests. Naive $\alpha$ . TD3+BC is run with a fixed scale factor $\alpha=2.5$ . Converged $\alpha$ . TD3+BC is run with $\alpha$ fixed to the final value reached by ASPC on the same dataset. Linear $\alpha$ . TD3+BC starts from $\alpha=2.5$ and linearly interpolates to the above converged value over the training horizon. To ensure fairness, all TD3+BC variants utilize the same robust critic architecture as ASPC, comprising three hidden layers, each followed by a LayerNorm.

Table 2 summarises the mean normalised scores in the four D4RL domains. Percentages in blue report the relative gain over the naive baseline that fixes $\alpha=2.5$ . Converged $\alpha$ and Linear $\alpha$ both outperform the naive setting, which confirms that the value to which ASPC eventually converges is a much more appropriate scale for the policy constraint. ASPC (Dynamic $\alpha$ ) still exceeds the Converged variant by a wide margin, and the Linear schedule closes only part of the gap. These results show that simply finding a good fixed $\alpha$ is not enough. Adapting the scale throughout training is essential for the best performance. ASPC provides this dynamic adjustment automatically and therefore achieves the highest overall score.

5.4 Runtime Analysis

ASPC employs second-order gradient computations for updating $\alpha$ , which increases cost. However, its update interval ( $k_{\alpha}$ ) can be set substantially longer than that of the actor ( $k_{\pi}$ ), thereby minimizing the additional computational overhead. To evaluate runtime efficiency, we compare the execution time of one million iterations of ASPC against that of other baseline algorithms. Figure 5 presents a bar chart comparing the runtime of ASPC against TD3+BC, CQL, IQL, wPC and A2PR on the halfcheetah-medium-v2 dataset. The results indicate that ASPC introduces only a minimal additional computational overhead beyond that of TD3+BC.

We further analyze the relationship between $k_{\alpha}$ , runtime, and performance, as illustrated in Figure 5. The baseline setting for $k_{\alpha}$ is 10, we observe that reducing $k_{\alpha}$ does not lead to significant performance degradation. This suggests that ASPC effectively captures the correct gradient optimization direction, maintaining robustness even when the gradient step size is large. When $k_{\alpha}$ is set to 30, the runtime is nearly identical to that of TD3+BC while maintaining strong performance. This highlights the efficiency of the ASPC algorithm.

5.5 Ablation Studies

Robust Critic(RC) When using the original TD3+BC critic network (with two hidden layers and no LayerNorm), during the process of adjusting $\alpha$ , Q-values exhibit significant instability, frequently leading to overestimation, causing catastrophic failure of the algorithm. Since wPC is also designed based on the original TD3+BC framework, we include it in our experiments related to RC (with three hidden layers and LayerNorm). Figure 6(a) presents the experimental results. The results indicate that when RC is not utilized, both wPC and ASPC achieve limited performance improvement and even exhibit performance degradation on certain tasks.

Loss Function Figure 6(b) reveals clear, domain-dependent effects when the regularization terms are added to the base loss $\mathcal{L}_{1}$ . Adding neither term ( $\mathcal{L}_{1}$ only) gives the poorest performance. Introducing only $\mathcal{L}_{2}$ lifts performance in MuJoCo and Adroit to the level of full ASPC, while leaving AntMaze almost unchanged. Conversely, adding only $\mathcal{L}_{3}$ significantly boosts AntMaze but has little effect on MuJoCo or Adroit. For Maze2D, neither single term suffices. Only the full loss $\mathcal{L}_{1}+\mathcal{L}_{2}+\mathcal{L}_{3}$ attains the best result. These results can be explained by Proposition 4.2, which shows that $\mathcal{L}_{2}$ and $\mathcal{L}_{3}$ implicitly constrain one another. Consequently, adding $\mathcal{L}_{2}$ in MuJoCo and Adroit implicitly bounds $\Delta L_{BC}$ as well, so the single‐step performance guarantee of Theorem 4.4 is already satisfied. Conversely, in AntMaze a direct $\mathcal{L}_{3}$ penalty implicitly limits $(\Delta Q)^{2}$ , again meeting the theorem’s lower bound. For Maze2D, however, neither implicit relation is strong enough; both $\mathcal{L}_{2}$ and $\mathcal{L}_{3}$ must be enforced explicitly for the condition in Theorem 4.4 to hold.

5.6 Extending ASPC to Other Offline RL Methods

Many offline RL algorithms follow the form of equation 1. To evaluate the generality of ASPC, we integrate its adaptive policy constraint into three representative baselines, including IQL, CQL, and Diffusion-QL Wang et al. (2023). Each method contains a hyperparameter analogous to $\alpha$ that controls the balance between value learning and conservatism. We replace this manually tuned coefficient with a learnable parameter and update it using the same bi-level second-order procedure as ASPC. The detailed objectives for each algorithm are provided in Appendix D.

As shown in Table 3, incorporating ASPC consistently improves the performance of all three baselines, which demonstrates the broad applicability of our approach. IQL yields the smallest improvement, and a possible reason is that it performs implicit Q learning, so increasing $\alpha$ does not effectively shift the policy toward the RL objective. This implicit structure offers stability but limits the best achievable performance. CQL benefits more from ASPC because updating $\alpha$ directly adjusts the level of conservatism. Diffusion-QL already achieves very strong results, and ASPC further improves its performance, which highlights the robustness of ASPC even when applied to a strong baseline.

5.7 Additional Experiments on OGBench

We further evaluate the generality and robustness of ASPC on OGBench Park et al. (2025a), a new benchmark for offline goal-conditioned RL. Results across ten datasets in Table 4 show that ASPC clearly surpasses all existing baselines, indicating strong applicability beyond D4RL. Since FQL Park et al. (2025b) also follows equation 1, we integrate ASPC by making its scale factor learnable and applying the same bi level optimization procedure, with details in Appendix D. This modification consistently improves FQL, further supporting the broad generality of ASPC across standard and goal-conditioned offline RL.

6 Conclusion

We presented ASPC, a bi-level framework that adapts the RL–BC trade off by optimizing the scaling factor $\alpha$ through second-order updates. ASPC yields consistent improvements not only on TD3+BC but also when combined with other offline RL baselines, demonstrating strong generality. However, these simple integrations yield smaller gains than those seen with TD3+BC, indicating that different algorithms may require ASPC-style components tailored to their training dynamics. Future work includes developing such method-specific adaptive mechanisms under a unified principle and evaluating them on larger benchmarks and real-world datasets.

Acknowledgments

This work was supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China (2024ZD0608100), the National Natural Science Foundation of China (62506031, 62332017, U22A2022), the Postdoctoral Fellowship Program of CPSF under Grant Number GZC20251093.

Ethics Statement

This work focuses on methodological advances in offline RL. All experiments are conducted on standard simulated benchmarks, which do not involve human subjects, personally identifiable information, or sensitive data. We strictly follow the licensing terms of all datasets and simulation platforms used in this study. Our method, Adaptive Scaling of Policy Constraints (ASPC), is designed to improve the stability and reliability of offline RL algorithms. While RL has the potential for deployment in safety-critical domains, such as robotics and autonomous systems, the experiments in this paper remain purely in simulation. Any real-world use of these methods should be preceded by domain-specific safety checks and human oversight to avoid unintended harm.

Reproducibility Statement

We have taken several measures to ensure the reproducibility of our work. The proposed method is described in detail in Section 4, and the complete theoretical derivations are provided in Appendix A. Experimental settings and hyperparameters are reported in Appendix B. Moreover, we include the full implementation code in the Supplementary Material to facilitate replication of all results.

Appendix A Theoretical Proofs

A.1 Proof of Proposition 4.2

Throughout the argument, we adopt the following shorthand. We index the policies as

[TABLE]

We write

[TABLE]

Lemma A.1 (Reverse triangle inequality).

For all $A,B\in\mathbb{R}$ one has $|A+B|\geq\bigl||A|-|B|\bigr|$ .

Lemma A.2 (Cauchy–Schwarz).

For square–integrable real random variables $X,Y$ , $\bigl|\mathbb{E}[XY]\bigr|\leq\bigl(\mathbb{E}[X^{2}]\bigr)^{1/2}\bigl(\mathbb{E}[Y^{2}]\bigr)^{1/2}$ .

Proof.

The proof proceeds in three steps.

Step 1: A lower bound on $\Delta L_{BC}$ . Expand the definition of $\Delta L_{BC}$ and simplify:

[TABLE]

Since $\Delta L_{BC}\geq 0$ by definition, combining with equation 14 yields

[TABLE]

Step 2: An upper bound on $(\Delta Q)^{2}$ . Jensen’s inequality and the assumption 4.1 yield

[TABLE]

Step 3: Mutual bound. From equation 16 we have

[TABLE]

Since $\Delta L_{BC}\geq\max\{x-2c\sqrt{x},0\}$ from equation 15, we relate this expression to $\Delta Q$ as follows. The function $h(x)=x-2c\sqrt{x}$ is non-positive on $[0,4c^{2}]$ and strictly increasing on $[4c^{2},\infty)$ . When $|\Delta Q|\leq 2cL_{Q}$ , we have $x_{\min}\leq 4c^{2}$ , and $h(x_{\min})$ is non-positive; thus $\Delta L_{BC}\geq 0\geq h(x_{\min})$ . When $|\Delta Q|>2cL_{Q}$ , we have $x_{\min}>4c^{2}$ and $h(x)$ is increasing for all $x\geq x_{\min}$ , which gives $\Delta L_{BC}\geq h(x_{\min})$ . Combining the two regimes yields the bound

[TABLE]

Similarly, using equation 15 we obtain the following upper bounds for $x$ :

[TABLE]

Combining equation 16 with equation 19 gives an upper bound on $(\Delta Q)^{2}$ :

[TABLE]

equation 18 and equation 20 together yield the desired mutual bounds. ∎

A.2 Proof of Proposition 4.3

This section analyses conditions under which the one-step performance difference $J(\pi_{t+1})-J(\pi_{t})$ admits a tractable lower bound when training on a fixed offline dataset $D$ collected under behavior policy $\pi_{\beta}$ (so $D\approx d_{\pi_{\beta}}$ ).

Lemma A.3 (Performance‑difference lemma).

For any policies $\pi_{1}$ and $\pi_{2}$ ,

[TABLE]

The proof of Lemma A.3 can be found in Kakade and Langford (2002).

Lemma A.4.

Under Assumption 4.1, the total variation distance between the visitation distributions of any policy $\pi$ and the behavior policy $\pi_{\beta}$ satisfies

[TABLE]

where $C>0$ is a constant.

The proof of Lemma A.4 can be found in the appendix of Xiong et al. (2022).

Lemma A.5 (Sup‐norm version of equation 19).

Define

[TABLE]

Then

[TABLE]

Proof.

For each $s$ , let $\Delta L_{BC}(s)=\|\pi_{t+1}(s)-\pi_{\beta}(s)\|^{2}-\|\pi_{t}(s)-\pi_{\beta}(s)\|^{2}$ . Then

[TABLE]

Taking the supremum over $s$ gives the stated result:

[TABLE]

The proof of Lemma A.5 is finished. ∎

Proof.

In our deterministic setting, the conditional action distribution $\pi(\cdot|s)$ for any state $s$ is a Dirac measure concentrated at a single action. Specifically, for $\pi_{2}$ in Lemma A.3 we have:

[TABLE]

Applying Lemma A.3 with $\pi_{1}=\pi_{t+1}$ and $\pi_{2}=\pi_{t}$ gives:

[TABLE]

Write the performance–difference identity equation 27 as

[TABLE]

Thus, the one–step performance satisfies the lower bound

[TABLE]

The proof of Proposition 4.3 is finished. ∎

A.3 Proof of Theorem 4.4

We now show how our outer‐loss components ensure the performance lower bound equation LABEL:eq:per-lower-bound is maintained.

$\mathcal{L}_{1}$ equation 6 updates $\alpha$ based on the relative gradients of Q-value and the BC loss. Under the initialization assumption $\nabla_{\theta}\mathbb{E}[Q]>\nabla_{\theta}L_{BC}$ , so $\mathcal{L}_{1}$ updates $\alpha$ to favor Q‐improvement.

In our algorithm, the two regularizers $\mathcal{L}_{2}$ and $\mathcal{L}_{3}$ play complementary roles in guaranteeing safe single‐step improvements. Specifically, $\mathcal{L}_{2}$ in equation 7 penalizes the squared change in the Q‐function, $\Delta Q^{2}$ , to prevent overly large and unreliable Q‐updates. Due to the bootstrapping error inherent in RL, the single-step Q-value changes can be noisy, and therefore we apply an exponential moving average (EMA) for stabilization. In order to preserve the one‐step performance lower bound equation LABEL:eq:per-lower-bound, $\mathcal{L}_{3}$ in equation 8 must impose a matching penalty on the bias term identified in that bound. By choosing $\mathcal{L}_{3}$ so that its curvature mirrors that of $\mathcal{L}_{2}$ , we ensure the single‐step performance guarantee remains non‐negative.

Proof.

We perform a second‐order Taylor expansion of $\sqrt{c_{\infty}^{2}+\Delta L_{\infty}^{BC}}$ around $\Delta L_{\infty}^{BC}=0$ , assuming $\Delta L_{\infty}^{BC}/c_{\infty}^{2}\ll 1$ , discarding higher-order and constant terms. Substituting into the square and retaining only terms up to $O(\Delta L_{\infty}^{BC})$ yields:

[TABLE]

In practice, we scale $\mathcal{L}_{3}$ by the value of $\mathcal{L}_{2}$ to match its regularization strength and simply set $w$ to 1:

[TABLE]

By setting an appropriate $w$ , the algorithm can guarantee that:

[TABLE]

The proof of Theorem 4.4 is finished. ∎

A.4 Proof of Theorem 4.5

Proof.

We split the total performance gap into two components:

[TABLE]

We first observe that the behavior‐cloning loss

[TABLE]

decreases rapidly during early training. Hence there exists a warm‐up time $t_{0}$ such that

[TABLE]

and we set

[TABLE]

then

[TABLE]

We define

[TABLE]

Next, each one‐step update $i$ produces the gain equation 32

[TABLE]

Summing these gains yields the unified bound

[TABLE]

With a fixed regularization weight $\alpha$ , the sequence $\{\delta_{i}\}$ tends to decay rapidly toward zero or even become negative. Therefore, static $\alpha$ leaves a large residual gap in equation 39. Our meta-update dynamically adjusts $\alpha$ so that each $\delta_{i}$ stays bounded below by a positive constant $\delta_{\min}>0$ over a long horizon. Thus

[TABLE]

The proof of Theorem 4.5 is finished. ∎

Appendix B Experimental Details

B.1 Hardware and Software

We use the following hadrward:

Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10 GHz 2. 2)

NVIDIA GeForce RTX 4090 GPU

We use the following software versions:

Python 3.8.10 2. 2)

D4RL 1.1 3. 3)

MuJoCo 3.2.3 4. 4)

Gym 0.23.1 5. 5)

mujoco-py 2.1.2.14 6. 6)

PyTorch 2.2.2 + CUDA 12.1 7. 7)

TorchOpt 0.7.3 Ren* et al. (2023)

B.2 Hyperparameters

The network structures and hyperparameter configurations of each algorithm corresponding to Table 1 are as follows.

Appendix C Learning Curves

C.1 Scale Factor Curves

Figure 7 plots the $\alpha$ learning curves for all 39 datasets. The curves show that our algorithm (i) drives $\alpha$ toward distinct optima across tasks and (ii) merely modulates its step size and pace when the dataset quality changes within the same task. This dual behaviour highlights the method’s adaptability to both task differences and data-quality variations.

C.2 Performance Curves

Figure 8 shows the learning curves of all four algorithms on the 39 D4RL datasets. ASPC rises much more rapidly than the baselines, typically within the first $0.2\text{--}0.3$ M environment steps, and surpasses them long before the others stabilize. Its final normalized scores are almost always the highest (or very close to the highest) across all task families, maintaining a clear margin where the competing methods usually plateau. Moreover, the shaded regions (mean $\pm$ 1 s.d. over four seeds) remain consistently narrow for ASPC, and its curves show no late-stage collapses, pointing to lower variance and steadier adaptation across widely varying task dynamics and data quality. Overall, the figure suggests that ASPC combines greater sample efficiency, stronger ultimate performance, and more reliable behavior than the other approaches.

Appendix D Integrating ASPC with Other Offline RL Algorithms

D.1 Integration with IQL

In IQL, the policy is trained by advantage-weighted behavior cloning. Let $\mathrm{adv}(s,a)$ denote the IQL advantage estimate and $\beta>0$ the temperature parameter. We integrate ASPC by treating $\beta$ as the adaptive policy-constraint coefficient.

Inner objective.

The IQL actor minimizes

[TABLE]

where $\ell_{\mathrm{BC}}(\pi_{\theta}(a\mid s))=-\log\pi_{\theta}(a\mid s)$ . A single gradient step yields the updated policy $\pi_{\tilde{\theta}(\beta)}$ .

Outer objective.

Following ASPC, we construct an outer loss on the updated policy using a normalized Q-improvement term and the corresponding BC loss:

[TABLE]

The second term measures the change in mean Q-value induced by the inner update:

[TABLE]

The outer objective for adapting $\beta$ is

[TABLE]

D.2 Integration with CQL

CQL constrains Q-values by penalizing larger Q-values on out-of-distribution (OOD) actions. Let $\alpha>0$ denote the conservatism coefficient. Following ASPC, we treat $\alpha$ as the adaptive policy-constraint parameter.

Inner objective.

Given a batch $(s,a,r,s^{\prime})$ , the CQL critic update solves

[TABLE]

where

[TABLE]

A single gradient step produces the updated critic $Q_{\tilde{\psi}(\alpha)}$ .

Outer objective.

ASPC evaluates the updated critic with a normalized Q-improvement term and the corresponding CQL penalty, forming

[TABLE]

The Q-value change induced by the inner update is

[TABLE]

The outer objective for adapting $\alpha$ becomes

[TABLE]

D.3 Integration with Diffusion-QL

Diffusion-QL trains a diffusion policy by combining a behavior-cloning loss with a normalized Q-term. Let $\eta>0$ be the coefficient controlling the trade-off between policy improvement and imitation. Following ASPC, we treat $\eta$ as the adaptive constraint parameter.

Inner objective.

Given state–action pairs $(s,a)$ , the diffusion policy $\pi_{\theta}$ is trained under the objective

[TABLE]

A single gradient step produces the updated diffusion policy $\pi_{\tilde{\theta}(\eta)}$ .

Outer objective.

ASPC evaluates the updated diffusion policy through a normalized Q-value term and the corresponding BC term:

[TABLE]

The Q-improvement induced by the inner update is captured by

[TABLE]

The outer objective becomes

[TABLE]

D.4 Integration with FQL

FQL employs two policies: (i) a teacher flow policy trained purely by flow-matching, and (ii) a student one-step flow policy trained via distillation and Q-improvement. Only the student policy interacts with the Q-function, making it the component that requires adaptive scaling. We integrate ASPC by treating the student’s trade-off coefficient $\alpha$ as the adaptive constraint parameter.

Teacher objective (BC Flow).

The teacher flow policy is trained via standard flow-matching:

[TABLE]

where $x_{t}=(1-t)x_{0}+ta$ and $f_{\theta}$ denotes the flow velocity network. This loss is independent of $\alpha$ .

Inner objective (Student Flow).

The student one-step policy $\pi_{\theta}$ predicts an action in a single step and matches the teacher via a distillation loss, while also incorporating a normalized Q-improvement term. The inner objective is

[TABLE]

A single gradient update produces the updated student policy $\pi_{\tilde{\theta}(\alpha)}$ .

Outer objective.

ASPC evaluates the updated student policy by combining its normalized Q-value and distillation loss, and the Q-improvement incurred by the inner update:

[TABLE]

The outer objective becomes

[TABLE]

Appendix E Additional Empirical Analyses

E.1 Performance on AntMaze and Adroit

As shown in Table 1, ASPC does not achieve state-of-the-art performance on AntMaze and Adroit. Both benchmarks are characterized by extremely sparse rewards, with AntMaze in particular using a binary [math]– $1$ success signal Fu et al. (2020). In such settings, even online RL methods struggle to learn effectively, and behavior cloning plays a dominant role in determining policy quality.

Although ASPC can adapt $\alpha$ toward a more BC-dominated regime, the current BC term imitates all actions in the dataset, including suboptimal or unsuccessful trajectories. This limits the attainable performance on sparse-reward tasks. To address this issue, we experimented with augmenting the BC term using advantage-weighted behavior cloning, where high-advantage samples receive larger weights. The modified loss improves the selectivity of imitation by emphasizing demonstrably good behaviors. Experimental results, shown in Figure 9, indicate consistent performance gains on both AntMaze and Adroit when advantage-weighting is applied. This suggests that selectively imitating high-quality behaviors is crucial for sparse-reward offline RL tasks.

E.2 Ablation on the Formulation of the $L_{3}$ Term

To better understand the role of each component in the $L_{3}$ term, we consider five variants. The first variant keeps only the third component:

[TABLE]

The second variant multiplies the third component by the squared BC deviation:

[TABLE]

The third variant replaces the BC-deviation factor with the detached $L_{2}$ term:

[TABLE]

The fourth variant is the complete formulation used in our method:

[TABLE]

The fifth variant replaces both supremum operators in $L_{3}^{(4)}$ by dataset expectations:

[TABLE]

Table 13 summarizes the results. Variants (3)–(5), which include the detached $L_{2}$ term, provide clear gains on Maze2d and AntMaze, showing that this component is essential for these domains. By contrast, Adroit displays only minor differences across all variants, suggesting that Q-value gradients dominate BC-related gradients there, making the precise form of $L_{3}$ less influential. Finally, variant (5) achieves a performance nearly identical to the full formulation, implying that strict worst-case bounds using the $\sup$ operator are not essential in practice.

E.3 Case Study of ASPC Dynamics

Figure 10 shows the training dynamics on halfcheetah-medium-v2. ASPC consistently increases both the estimated Q-value and the BC loss, while simultaneously improving the normalized score. It is essential to note that the increase in BC loss under ASPC does not indicate instability or degradation. Since ASPC deliberately allows the policy to deviate from the behavior policy when such deviations yield sufficient Q-value improvement, the BC loss can increase while performance improves. This matches our theory: whenever the Q-value gain compensates for the increased deviation, the update remains beneficial. Thus, an increasing BC loss indicates that ASPC is escaping the behavior cloning regime and moving toward higher-value actions. In contrast, TD3+BC rapidly plateaus in all three curves, indicating that its fixed trade-off between RL and BC limits its ability to continue improving.

Appendix F Hyperparameter Sensitivity Analyses

F.1 Sensitivity to the Initial Value of $\alpha$

Figure 11 illustrates the influence of the initial value of $\alpha$ on ASPC. Across Gym-MuJoCo, AntMaze, and Adroit, the intermediate setting $\alpha_{0}=2.5$ provides the strongest overall performance, while a very small initialization ( $\alpha_{0}=0.1$ ) tends to bias the early update dynamics too strongly toward BC, limiting the contribution of the RL term. Conversely, an excessively large initialization (e.g., $\alpha_{0}=10$ ) can overemphasize the RL component at the beginning, which weakens the intended stabilizing effect of the BC objective and leads to performance drops, particularly on Adroit. These observations indicate that a balanced initialization is important for achieving stable optimization.

F.2 Sensitivity to the Learning Rate of $\alpha$

We study the effect of the learning rate used for updating $\alpha$ . The results show that different domains prefer different learning rate magnitudes. Too small values slow down the adjustment of the RL–BC trade-off, while too large values make the meta-update unstable and degrade performance.

Appendix G The use of LLM

Large Language Models (LLMs) were used to aid and polish the writing of this paper. In particular, they were applied to rephrase sentences for improved readability and refine grammar and wording to meet academic style requirements.

Bibliography39

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1G. An, S. Moon, J. Kim, and H. O. Song (2021) Uncertainty-based offline reinforcement learning with diversified q-ensemble . Advances in neural information processing systems 34 , pp. 7436–7447 . Cited by: §2.1 .
2C. Bai, L. Wang, Z. Yang, Z. Deng, A. Garg, P. Liu, and Z. Wang (2022) Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning . ar Xiv preprint ar Xiv:2202.11566 . Cited by: §2.1 .
3P. J. Ball, L. Smith, I. Kostrikov, and S. Levine (2023) Efficient online reinforcement learning with offline data . In International Conference on Machine Learning , pp. 1577–1594 . Cited by: §4.3 .
4L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch (2021) Decision transformer: reinforcement learning via sequence modeling . Advances in neural information processing systems 34 , pp. 15084–15097 . Cited by: §2.1 .
5A. El Sallab, M. Abdou, E. Perot, and S. Yogamani (2017) Deep reinforcement learning framework for autonomous driv-ing . stat 1050 , pp. 8 . Cited by: §1 .
6C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks . In International conference on machine learning , pp. 1126–1135 . Cited by: §1 , §4.1 .
7L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning . In International conference on machine learning , pp. 1568–1577 . Cited by: §4.1 .
8J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine (2020) D 4rl: datasets for deep data-driven reinforcement learning . ar Xiv preprint ar Xiv:2004.07219 . Cited by: §E.1 .

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning

Abstract

1 Introduction

2 Related Works

2.1 Offline RL

2.2 Adaptive Policy Constraints

3 Preliminaries

4 Method

4.1 Adaptive Scaling of Policy Constraints

4.2 Theoretical Analysis

Assumption 4.1**.**

Proposition 4.2** (Mutual constraints between ΔLBC\Delta L_{BC}ΔLBC​ and (ΔQ)2(\Delta Q)^{2}(ΔQ)2).**

Proposition 4.3** (Single-step performance lower bound).**

Theorem 4.4** (Single-step performance condition for ASPC).**

Theorem 4.5** (Performance gap to optimal).**

4.3 Implementation on TD3+BC

5 Experiments

5.1 Comparative Performance on Benchmark

5.2 Adaptability of the Sacle Factor

5.3 Necessity of Dynamic Scale Factor Adjustment

5.4 Runtime Analysis

5.5 Ablation Studies

5.6 Extending ASPC to Other Offline RL Methods

5.7 Additional Experiments on OGBench

6 Conclusion

Acknowledgments

Ethics Statement

Reproducibility Statement

Appendix A Theoretical Proofs

A.1 Proof of Proposition 4.2

Lemma A.1** (Reverse triangle inequality).**

Lemma A.2** (Cauchy–Schwarz).**

Proof.

A.2 Proof of Proposition 4.3

Lemma A.3** (Performance‑difference lemma).**

Lemma A.4**.**

Lemma A.5** (Sup‐norm version of equation 19).**

Proof.

Proof.

A.3 Proof of Theorem 4.4

Proof.

A.4 Proof of Theorem 4.5

Proof.

Appendix B Experimental Details

B.1 Hardware and Software

B.2 Hyperparameters

Appendix C Learning Curves

C.1 Scale Factor Curves

C.2 Performance Curves

Appendix D Integrating ASPC with Other Offline RL Algorithms

D.1 Integration with IQL

Inner objective.

Outer objective.

D.2 Integration with CQL

Inner objective.

Outer objective.

D.3 Integration with Diffusion-QL

Inner objective.

Outer objective.

D.4 Integration with FQL

Teacher objective (BC Flow).

Inner objective (Student Flow).

Outer objective.

Appendix E Additional Empirical Analyses

E.1 Performance on AntMaze and Adroit

E.2 Ablation on the Formulation of the L3L_{3}L3​ Term

E.3 Case Study of ASPC Dynamics

Appendix F Hyperparameter Sensitivity Analyses

F.1 Sensitivity to the Initial Value of α\alphaα

F.2 Sensitivity to the Learning Rate of α\alphaα

Appendix G The use of LLM

Assumption 4.1.

Proposition 4.2 (Mutual constraints between $\Delta L_{BC}$ and $(\Delta Q)^{2}$ ).

Proposition 4.3 (Single-step performance lower bound).

Theorem 4.4 (Single-step performance condition for ASPC).

Theorem 4.5 (Performance gap to optimal).

Lemma A.1 (Reverse triangle inequality).

Lemma A.2 (Cauchy–Schwarz).

Lemma A.3 (Performance‑difference lemma).

Lemma A.4.

Lemma A.5 (Sup‐norm version of equation 19).

E.2 Ablation on the Formulation of the $L_{3}$ Term

F.1 Sensitivity to the Initial Value of $\alpha$

F.2 Sensitivity to the Learning Rate of $\alpha$