SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting
Di Li, Jie Feng, Jiahao Chen, Weisheng Dong, Guanbin Li, Yuhui Zheng, Mingtao Feng, Guangming Shi

TL;DR
This paper introduces SeqAffordSplat, a new benchmark and method for long-horizon, scene-level 3D affordance reasoning, enabling embodied agents to understand complex multi-object interactions from instructions.
Contribution
It proposes SeqSplatNet, an end-to-end framework combining language models and 3D segmentation, along with a new benchmark and pre-training strategy for scene-level affordance reasoning.
Findings
Sets a new state-of-the-art on the SeqAffordSplat benchmark.
Effectively handles complex, multi-object, sequential affordance tasks.
Demonstrates improved scene understanding for embodied agents.
Abstract
3D affordance reasoning, the task of associating human instructions with the functional regions of 3D objects, is a critical capability for embodied agents. Current methods based on 3D Gaussian Splatting (3DGS) are fundamentally limited to single-object, single-step interactions, a paradigm that falls short of addressing the long-horizon, multi-object tasks required for complex real-world applications. To bridge this gap, we introduce the novel task of Sequential 3D Gaussian Affordance Reasoning and establish SeqAffordSplat, a large-scale benchmark featuring 1800+ scenes to support research on long-horizon affordance understanding in complex 3DGS environments. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. SeqSplatNet employs a large language model that autoregressively generates text interleaved with special…
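The instruction-to-mask-sequence idea described above can be illustrated with a toy sketch. It is not the authors' implementation: the abstract is cut off before it names what is interleaved with the text, so the marker `<AFF>`, the dot-product scoring, the threshold, and all function names here are assumptions made only for illustration. The sketch treats each marker position in the generated sequence as one step, takes the hidden state at that position as a query, and scores every Gaussian against it to get one binary mask per step.

```python
from typing import List, Sequence

# Hypothetical marker; the source does not name the special token.
AFF_TOKEN = "<AFF>"


def affordance_positions(tokens: Sequence[str]) -> List[int]:
    """Return the indices at which the (assumed) affordance marker occurs."""
    return [i for i, tok in enumerate(tokens) if tok == AFF_TOKEN]


def decode_mask(query: Sequence[float],
                gaussian_feats: Sequence[Sequence[float]],
                threshold: float = 0.0) -> List[bool]:
    """Score each Gaussian by a dot product with the query and threshold it.

    The dot-product decoder is an assumed stand-in for a learned mask decoder.
    """
    mask = []
    for feat in gaussian_feats:
        score = sum(q * f for q, f in zip(query, feat))
        mask.append(score > threshold)
    return mask


def sequential_masks(tokens: Sequence[str],
                     hidden_states: Sequence[Sequence[float]],
                     gaussian_feats: Sequence[Sequence[float]]) -> List[List[bool]]:
    """Produce one per-Gaussian mask for every marker, in generation order."""
    return [decode_mask(hidden_states[i], gaussian_feats)
            for i in affordance_positions(tokens)]


# Toy data: a two-step instruction and three Gaussians with 2-D features.
tokens = ["open", "the", "drawer", AFF_TOKEN,
          "then", "grasp", "the", "mug", AFF_TOKEN]
hidden_states = [[0.0, 0.0] for _ in tokens]
hidden_states[3] = [1.0, 0.0]   # query for step 1
hidden_states[8] = [0.0, 1.0]   # query for step 2
gaussian_feats = [[0.9, -0.1], [0.8, -0.2], [-0.3, 0.7]]

masks = sequential_masks(tokens, hidden_states, gaussian_feats)
for step, mask in enumerate(masks, start=1):
    print(f"step {step}: {mask}")
```

In this toy setup the order of the masks follows the order of the markers in the generated text, which is what makes the output a sequence rather than a single segmentation.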
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- This is the first work that combines sequential task reasoning and affordance prediction. The proposed dataset can be useful for future research.
- The experiments carefully ablate the different components and backbones to demonstrate the contribution of each design choice.
- It is hard to understand how the inference of the subtask sequence can influence the affordance prediction. In the current presentation, it can be solved by using LLMs to decompose the task into subtasks and then perform single-subtask affordance prediction with Gaussian splatting. Why is a unified LLM needed?
- The proposed method has limited novelty. The 3DGS encoder training and the lifting of semantic features are methods used in prior approaches. The use of LLMs is similar to SeqAfford.
1/ The model’s design combining a compact LLM (Qwen3-0.6B) with a 3D Gaussian feature encoder is technically sound and aligns with recent multimodal trends. The sequential affordance decoding formulation is clear and well motivated.
2/ The paper is easy to follow, with clear figures and an intuitive overall structure that guides the reader through dataset, model, and results.
1/ The core framework mostly combines existing components (a pre-trained 3D Gaussian encoder, a small-scale LLM for instruction parsing, and a mask decoder) without introducing fundamentally new algorithms. The idea of sequential reasoning itself is an incremental extension of prior affordance prediction works.
2/ While the dataset is larger and includes sequential labels, it is mainly a compositional expansion of existing resources rather than a new type of data or annotation paradigm. The us
1. This paper proposes a large-scale 3D affordance dataset and benchmark, which for the first time supports serialized, scene-level 3D affordance prediction, greatly expanding existing evaluation metrics and providing strong support for related work.
2. At the model level, this paper proposes a solution to the difficulty of convergence in training 3DGS representation encoders from scratch: a pre-training strategy based on conditional geometry reconstruction. By reconstructing affordance masks us
1. The authors point out that part of the scene construction in the dataset needs to be done manually. This process should be automated to increase the dataset's scalability. Also, the paper suggests that the automatic generation of instructions using engineering + GPT-4o followed by manual verification may introduce human bias, leading to data distribution shifts.
2. The paper provides successful visualization examples. Please supplement with typical failure examples and analyze the root causes
Taxonomy
Topics: Advanced Vision and Imaging · Anomaly Detection Techniques and Applications · Robotics and Sensor-Based Localization
