TL;DR
This paper introduces RADT, a decision transformer model that enables zero-shot reach-avoid policies in offline reinforcement learning by encoding goals and avoid regions as prompt tokens, allowing flexible, dynamic specification at evaluation time.
Contribution
RADT is the first model to encode avoid regions directly as prompt tokens, enabling zero-shot generalization to novel avoid region configurations in offline goal-conditioned RL.
Findings
RADT outperforms existing models across 11 tasks and environments.
RADT achieves a 35.7% improvement in normalized cost in zero-shot settings.
RADT successfully reduces visits to undesirable states in biological cell reprogramming.
Abstract
Offline goal-conditioned reinforcement learning methods have shown promise for reach-avoid tasks, where an agent must reach a target state while avoiding undesirable regions of the state space. Existing approaches typically encode avoid-region information into an augmented state space and cost function, which prevents flexible, dynamic specification of novel avoid-region information at evaluation time. They also rely heavily on well-designed reward and cost functions, limiting scalability to complex or poorly structured environments. We introduce RADT, a decision transformer model for offline, reward-free, goal-conditioned, avoid region-conditioned RL. RADT encodes goals and avoid regions directly as prompt tokens, allowing any number of avoid regions of arbitrary size to be specified at evaluation time. Using only suboptimal offline trajectories from a random policy, RADT learns…
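As a rough illustration of what "goals and avoid regions as prompt tokens" could look like, here is a minimal Python sketch. It is not the authors' implementation. The box parameterization (center plus half-size), the names `AvoidBox`, `build_prompt`, and `count_violations`, and the token layout are all assumptions made for illustration. The sketch only shows that the prompt can grow with the number of avoid regions while the state representation stays unchanged.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

GOAL, AVOID = "goal", "avoid"  # hypothetical token-type tags


@dataclass(frozen=True)
class AvoidBox:
    """Hypothetical axis-aligned avoid region: center and half-size per dimension."""
    center: Tuple[float, ...]
    half_size: Tuple[float, ...]

    def contains(self, state: Sequence[float]) -> bool:
        # A state is inside the box if it is within half_size of the center on every axis.
        return all(abs(s - c) <= h for s, c, h in zip(state, self.center, self.half_size))


def build_prompt(goal: Sequence[float],
                 avoid_boxes: Sequence[AvoidBox]) -> List[Tuple[str, Tuple[float, ...]]]:
    """Return a variable-length prompt: one goal token, then one token per avoid box.

    Assumption: each avoid token carries the box center concatenated with its half-size.
    A real model would embed these vectors before feeding them to the transformer.
    """
    tokens = [(GOAL, tuple(goal))]
    for box in avoid_boxes:
        tokens.append((AVOID, box.center + box.half_size))
    return tokens


def count_violations(trajectory: Sequence[Sequence[float]],
                     avoid_boxes: Sequence[AvoidBox]) -> int:
    """Count trajectory states that fall inside at least one avoid box."""
    return sum(any(box.contains(state) for box in avoid_boxes) for state in trajectory)


if __name__ == "__main__":
    boxes = [AvoidBox(center=(0.5, 0.5), half_size=(0.1, 0.1))]
    print(build_prompt(goal=(1.0, 1.0), avoid_boxes=boxes))
    print(count_violations([(0.0, 0.0), (0.5, 0.55), (1.0, 1.0)], boxes))
```

Because the avoid regions live only in the prompt, changing their number or size at evaluation time means changing the list passed to `build_prompt`. No state augmentation or cost function needs to be redefined in this sketch.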
Peer Reviews
Decision·Submitted to ICLR 2026
- Encodes both goals and avoid regions as prompt tokens, decoupling reach-avoid specifications from state representation and enabling zero-shot generalization to arbitrary avoid region counts, locations, and sizes
- Creative strategy that generates trajectory pairs with opposite avoid success labels, allowing the model to learn from both successful and unsuccessful demonstrations without requiring expert data
- Eliminates brittle reward/cost function design by learning directly from hindsight-re
- Only 2 robotics environments with relatively simple geometric constraints; no high-dimensional state spaces (e.g., pixel observations) or complex avoid region shapes beyond boxes/spheres
- Training time (72 GPU hours mentioned in appendix), memory overhead, and model size (GPT-2 architecture) not compared against baselines; acknowledged in limitations but critical for practical deployment
- RbSL/AM-Lag baselines use impassable obstacles (Figure 3b) rather than passable avoid regions, making di
**Writing**
- The paper is well-structured.
- Fig. 2 illustrates the prompt pipeline well.
- The authors clearly develop the motivation behind the necessity of each technology and the problem setting.
- Prompting as a decoupling mechanism: prompt tokens decouple the task specification from the state and enable test-time conditioning.
- Zero-shot matters: the narrative ties zero-shot generalization to realistic deployments where avoidance constraints vary.

**Method**
- The authors explain embeddings
**LLM**
- I found three instances of text wrapped as **\*something\***. Beyond using an LLM to correct grammar and rephrase some sentences, the reviewer is concerned that some entire paragraphs in the appendix may have been written by an LLM.
- Line 422: We then run a second set of 200 episodes, this time providing the most visited intermediate state as an **\*avoid token\*** in the prompt.
- Lines 1229 and 1236: **\*larger, OOD\*** and **\*adversely\***

---

**Writing**
- "Avoid" is a v
Strengths
- The paper is intuitive and well-motivated. Incorporating avoid-region information into prompts to adapt to different numbers and sizes of constraints is reasonable and addresses some limitations of prior work.
- The experimental results are convincing, demonstrating the effectiveness of the proposed approach.
- The paper is clearly written, and the implementation details are easy to understand.
- The experiments on extreme OOD generalization are impressive, showing strong general
Weaknesses
- The maze tasks used in the experiments are relatively simple. It would be better to include more complex control tasks besides FetchReach, such as those from safe RL benchmarks, to further validate the method.
- I did not find ablation studies on the relabeling component, which is an important part of the contribution and should be discussed in the main text.
- Some prior works on offline safe RL have explored dynamic constraints, such as [1,2], which adjust the constraint threshold
