TL;DR
This paper introduces Moving Out, a benchmark for physically-grounded human-AI collaboration, and proposes BASS, a method to improve agent diversity and understanding, demonstrating superior performance in physical collaboration tasks.
Contribution
The paper presents Moving Out, a new benchmark for physical human-AI collaboration, and introduces BASS, a novel method to enhance agent diversity and physical understanding.
Findings
BASS outperforms state-of-the-art models in collaboration tasks.
Moving Out effectively evaluates adaptation to diverse human behaviors.
BASS improves agent diversity and understanding of physical outcomes.
Abstract
The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. In this paper, we introduce Moving Out, a new human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and maintaining consistent actions to move a big item around a corner. Using Moving Out, we designed two tasks and collected human-human interaction data to evaluate models' abilities to adapt to diverse human behaviors and unseen physical attributes. To address the challenges in physical environments, we propose a novel method, BASS…
Peer Reviews
Decision: Submitted to ICLR 2026
The Moving Out environment bridges the gap between symbolic multi-agent environments (e.g., Overcooked-AI) and physically grounded continuous control, with explicit physics and multi-modal collaboration types (coordination, awareness, action consistency). The BASS framework introduces behavior recombination for partner diversity and a next-state simulation module for physical reasoning, which is well-motivated and conceptually solid. Quantitative and qualitative analyses (Task Completion Rate,
The augmentation and simulation modules mainly combine known ideas (trajectory recombination, learned dynamics, and model-based scoring). The paper could clarify theoretical contributions beyond engineering integration — e.g., formal guarantees, diversity metrics, or physical constraint satisfaction proofs. Both Task 1 (adapting to diverse human behaviors) and Task 2 (generalizing to unseen physical attributes) require human demonstrations or real-time human interaction. Although the authors in
### Quality
- Paper is very well-motivated
- RQ structure is thoughtful. I appreciate not just that there are RQs, but that they target so many aspects of one concept, including failure modes.

### Clarity
- Well written
- Well-designed figures
- Research question structure is very useful - helps understand what the paper is arguing. This is even more important given that the tasks, envs, data, etc. are expansive and somewhat arbitrary. Without the RQs and with just numbers, it would be hard
### Quality
- Design is somewhat arbitrary. The tasks are very general, but their instantiations - the objects, environment, domain expansion techniques, etc. - are specific and feel somewhat arbitrary. Ultimately that might be fine, as all benchmarks have to be feasible to build and use, but it would help to have some justification for the particular instantiations.
- Evaluation lacks control. The numbers are convincing - BASS shows improvement on the tasks, and the tasks are large and expansi
1. This paper contributes a well-designed benchmark for symbolic human-AI collaboration and physically grounded interaction.
2. The BASS method integrates data augmentation and next-state prediction to enhance robustness to diverse human behavior and physical variations.
3. Extensive experiments, such as ablation studies, human user studies, and failure case analyses, validate performance improvements.
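The two ingredients the reviewers attribute to BASS (recombining behaviors to diversify partners, and predicting the next state to judge physical outcomes) can be illustrated with a small sketch. This is not the authors' implementation. The names `recombine`, `pick_action`, and `toy_dynamics`, the 2D state and action types, and the distance-to-goal score are all hypothetical choices made for illustration, and a hand-written toy function stands in for a learned dynamics model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a 2D item position and a 2D displacement.
State = Tuple[float, float]
Action = Tuple[float, float]


def recombine(
    ego_actions: Sequence[Action], partner_actions: Sequence[Action]
) -> List[Tuple[Action, Action]]:
    """Pair the ego actions from one demonstration with the partner
    actions from another, truncated to the shorter of the two.
    Assumption: this is one simple way to create new partner behavior."""
    n = min(len(ego_actions), len(partner_actions))
    return list(zip(ego_actions[:n], partner_actions[:n]))


def toy_dynamics(state: State, action: Action) -> State:
    """Stand-in for a learned next-state predictor (an assumption).
    The item moves by the action but is blocked by a wall at x = 1.0."""
    x, y = state
    dx, dy = action
    return (min(x + dx, 1.0), y + dy)


def pick_action(
    state: State,
    candidates: Sequence[Action],
    predict_next: Callable[[State, Action], State],
    goal: State,
) -> Action:
    """Score each candidate by the squared distance between its
    predicted next state and the goal, and return the best one."""

    def score(action: Action) -> float:
        nx, ny = predict_next(state, action)
        return (nx - goal[0]) ** 2 + (ny - goal[1]) ** 2

    return min(candidates, key=score)


if __name__ == "__main__":
    pairs = recombine(
        [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
        [(0.0, 0.0), (2.0, 2.0)],
    )
    print(pairs)  # two (ego, partner) pairs, limited by the shorter demo

    best = pick_action(
        state=(0.5, 0.0),
        candidates=[(1.0, 0.0), (0.0, 1.0), (0.5, 1.0)],
        predict_next=toy_dynamics,
        goal=(1.0, 1.0),
    )
    print(best)  # (0.5, 1.0): its predicted next state lands on the goal
```

In this toy setup the first candidate is clipped by the wall, so the predictor reveals that it makes less progress than its raw magnitude suggests. That is the kind of physical outcome a next-state module is meant to expose before acting.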
1. Some key terms are not clearly defined, which reduces the overall clarity of the paper.
   - L11: The phrase “adapt to physical actions and constraints” lacks precise definition. The authors should clarify what constitutes “physical actions” and “constraints.”
   - L42: The scope of “diverse physical constraints, variations, and behavior” should be specified more precisely.
   - L339, L341: The meanings of “aligned the boundaries” and “match the boundaries” are unclear.
   - Title, Fig. 1, etc.: The
