TL;DR
H2OFlow is a framework that learns comprehensive 3D human-object affordances (contact, orientation, and spatial occupancy) from synthetic data, using dense diffused flows on point clouds to reduce reliance on manual annotations.
Contribution
It introduces a novel dense flow-based approach with synthetic data to understand 3D affordances beyond contact analysis, covering orientation and spatial occupancy.
Findings
Outperforms prior annotation-dependent methods in 3D affordance modeling
Generalizes effectively to real-world objects
Uses synthetic data with dense diffused flows for comprehensive affordance learning
Abstract
Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics, and AI. Current approaches often depend on labor-intensive, hand-labeled datasets capturing real-world or simulated human-object interaction (HOI) tasks, which are costly and time-consuming to produce. Furthermore, most existing methods for 3D affordance understanding are limited to contact-based analysis, neglecting other essential aspects of human-object interactions, such as orientation (e.g., humans might have a preferential orientation with respect to certain objects, such as a TV) and spatial occupancy (e.g., humans are more likely to occupy certain regions around an object, like the front of a microwave rather than its back). To address these limitations, we introduce H2OFlow, a novel…
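To make the three affordance notions named in the abstract concrete, here is a toy sketch in Python. It is not the H2OFlow method, which learns dense diffused flows with a diffusion model; it only illustrates what contact, orientation, and spatial occupancy could mean as geometric quantities. All function names, the 5 cm contact threshold, and the front/back split are assumptions made for illustration.

```python
import math

def contact_mask(object_points, human_points, threshold=0.05):
    """Flag each object point that has a human point within `threshold`
    (an assumed cutoff, in the same units as the points)."""
    return [
        any(math.dist(p, q) <= threshold for q in human_points)
        for p in object_points
    ]

def facing_alignment(human_position, human_forward, object_center):
    """Cosine between the human's forward direction and the direction
    from the human to the object (1.0 means facing it directly)."""
    to_object = [c - h for c, h in zip(object_center, human_position)]
    dot = sum(a * b for a, b in zip(human_forward, to_object))
    norm = math.hypot(*human_forward) * math.hypot(*to_object)
    return dot / norm if norm > 0 else 0.0

def occupancy_counts(human_positions, object_center, object_front):
    """Count human positions in front of vs. behind the object, using the
    sign of the offset projected onto the object's assumed front direction."""
    counts = {"front": 0, "back": 0}
    for pos in human_positions:
        offset = [p - c for p, c in zip(pos, object_center)]
        side = sum(a * b for a, b in zip(offset, object_front))
        counts["front" if side >= 0 else "back"] += 1
    return counts

# Hypothetical scene: a TV at the origin facing +y, with three viewers.
tv_center = (0.0, 0.0, 1.0)
tv_front = (0.0, 1.0, 0.0)
viewers = [(0.0, 2.0, 1.0), (0.5, 1.5, 1.0), (0.0, -1.0, 1.0)]

occupancy = occupancy_counts(viewers, tv_center, tv_front)
alignment = facing_alignment(viewers[0], (0.0, -1.0, 0.0), tv_center)
contacts = contact_mask([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.01)])
```

In this sketch the orientation and occupancy quantities are defined even when the human never touches the object, which is the kind of non-contact interaction pattern the abstract says contact-only methods miss.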
Peer Reviews
Decision: ICLR 2026 Poster
1. The introduction of the affordance representation that captures both explicit contact and implicit non-contact interaction patterns is novel to me. 2. The problem the paper studies is important. If we want to move to spatial and physical AI in the future, it is important to understand human-object affordance.
1. The organisation of the paper needs to be improved. The overview figure referred to in Section 3 is very important for understanding the problem formulation, but the authors put it in the appendix. In Section 4, the authors describe a lot about their method, and it would be much better if there were a figure illustrating the whole method. 2. Insufficient real-world results. The authors claim that their method generalises well to unseen real-world objects. It would be more con
Originality: 1. This paper introduces a novel paradigm shift from manual contact annotation to direct learning from point cloud datasets. 2. The integration of dense diffusion flows with 3D generative models represents an innovative approach to synthetic data generation that eliminates dependency on high-quality mesh inputs. 3. It extends beyond the traditional binary contact-based affordance definition by incorporating spatial orientation and occupancy patterns, providing a more comprehensive
1. Although it uses comprehensive affordance representations for a better affordance definition, all three representations were proposed in previous methods, so the paper lacks original definitions and contributions. Additionally, there is no ablation study on whether the two additional representations actually yield better results for affordance learning. 2. The method is compared with only one previous baseline, COMA, so the experiments are lacking. 3. Most of the main figures are in the supplementary, indicating that the paper
* The paper introduces an innovative framework that learns comprehensive affordances from synthetic HOI samples generated by 3D generative models. This approach cleverly eliminates the need for manual annotation and avoids the error-prone 2D-to-3D uplifting process used in prior work. * A key contribution is the use of "dense diffused flows" as a probabilistic, point-based representation for human interaction. This design, learned via a diffusion model, elegantly circumvents the dependency on ma
* The framework's performance is fundamentally tethered to the quality and diversity of a single upstream generative model (CHOIS). The paper appears to neglect any strategy for analyzing, filtering, or augmenting synthetic data. This raises critical questions: How does the model mitigate the risk of inheriting and amplifying potential biases from its sole data source—such as limitations in interaction patterns, insufficient object diversity, and unnatural poses? What are the upper limits of the
