ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving
Jinqing Zhang, Zehua Fu, Zelin Xu, Wenying Dai, Qingjie Liu, Yunhong Wang

TL;DR
ResWorld introduces a novel temporal residual world model that enhances dynamic object prediction and trajectory refinement for autonomous driving, achieving state-of-the-art results without relying on detection and tracking.
Contribution
The paper proposes TR-World, a world model that uses temporal residuals to model dynamic objects, and FGTR, a future-guided trajectory refinement module that leverages future scene information.
Findings
Achieves state-of-the-art planning performance on the nuScenes and NAVSIM datasets.
Effectively models dynamic objects without detection and tracking.
Improves trajectory accuracy through future-guided refinement.
Abstract
The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained.…
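The residual idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single-channel 8x8 BEV grid already aligned to the current ego frame, a hypothetical hand-written predictor (`predict_future_residual`, which simply shifts the residual by a known one-cell displacement) in place of the learned TR-World, and plain addition as the way the prediction is combined with the current BEV features. All function names and shapes are illustrative.

```python
import numpy as np

def temporal_residual(bev_prev, bev_curr):
    """Frame-to-frame difference of BEV features.

    Static content cancels out; only cells that changed stay nonzero.
    Assumes both frames are already aligned to the same ego frame.
    """
    return bev_curr - bev_prev

def predict_future_residual(residual, shift=(0, 1)):
    """Hypothetical stand-in for the learned world model.

    Assumes constant velocity: the next change is the last change moved
    by `shift` cells. Note that np.roll wraps around at the grid border.
    """
    return np.roll(residual, shift=shift, axis=(0, 1))

def compose_future_bev(bev_curr, future_residual):
    """Combine current features (static content) with the predicted change.

    Addition is an assumption made for this sketch.
    """
    return bev_curr + future_residual

def make_frame(dynamic_col):
    """Toy BEV frame: one static cell at (1, 1), one moving cell in row 4."""
    bev = np.zeros((8, 8), dtype=np.float32)
    bev[1, 1] = 1.0            # static object
    bev[4, dynamic_col] = 1.0  # dynamic object
    return bev

bev_prev = make_frame(2)
bev_curr = make_frame(3)

residual = temporal_residual(bev_prev, bev_curr)
future_bev = compose_future_bev(bev_curr, predict_future_residual(residual))

print(np.argwhere(residual != 0))    # only the moving object's old and new cells
print(np.argwhere(future_bev != 0))  # static cell plus the object's next cell
```

In this toy setting the static cell never appears in the residual, so the predictor only has to handle the moving object, and the static cell is recovered from the current frame when the two are combined.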
Peer Reviews
Decision: ICLR 2026 Poster
1. The FGTR module establishes an explicit feedback loop between the planner and the world model, using predicted future information to correct the current plan, which is logically clear.
2. The method achieves promising results on both the nuScenes (open-loop) and NAVSIM (closed-loop) benchmarks.
1. A major weakness lies in the experimental evaluation. In the NAVSIM closed-loop tests, the authors admit (Sec 4.2) to not using the core TR-World module, leaving its closed-loop effectiveness unverified.
2. TR-World's residual calculation is highly sensitive to the stability of the BEV features themselves. If the underlying BEV encoder is unstable between frames, its "feature noise" will be conflated with "true motion," leading to an unreliable residual signal.
3. The core assumption of TR-Wo
- Temporal residuals in BEV separate dynamic from static content, alleviating redundant static region modeling and weak interaction between the world model and trajectories.
- FGTR explicitly couples predicted trajectories with future BEV features under supervision, improving robustness.
- Experiments on nuScenes and NAVSIM demonstrate the superiority of the proposed methods.
- ResWorld utilizes GeoBEV to generate high-quality BEV features, which may make the comparison with other methods, especially the baseline SSR, unfair. Moreover, the resolution of the BEV features and other hyperparameters, such as the number and dimensions of scene tokens, are not clearly defined, making it difficult to verify the effectiveness of the proposed method in practice.
- The analysis and conclusions presented in Table 4 are both confusing and overstated, lacking sufficien
1. The idea of utilizing the static information from the current frame to avoid redundant predictions of static objects in future frames, thereby focusing more on dynamic targets, is very insightful.
2. On the recent popular NAVSIM benchmark, the proposed method achieves a noticeable improvement in planning performance compared to previous approaches.
1. The statements in the abstract, line 101, and the experimental tables suggesting that the proposed method does not rely on auxiliary tasks are somewhat misleading, since the BEV encoder still requires supervision from detection and mapping tasks during training.
2. A major weakness of the paper is the lack of evaluation on inference speed. Given that the proposed method introduces multiple designs for feature interaction and adopts a two-stage trajectory generation process, it is unclear whet
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Robotic Path Planning Algorithms · Adversarial Robustness in Machine Learning
