Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing
Yuejiao Su, Yi Wang, Lei Yao, Yawen Cui, Lap-Pui Chau

TL;DR
This paper introduces InterFormer, an interaction-aware transformer model that improves egocentric hand-object parsing by grounding queries in spatial dynamics, fusing interactive cues, and enforcing physical consistency, achieving state-of-the-art results.
Contribution
The paper proposes a novel end-to-end transformer architecture with dynamic query generation, dual-context feature selection, and a co-occurrence loss for better hand-object interaction understanding.
Findings
Achieves state-of-the-art performance on EgoHOS and mini-HOI4D datasets.
Effectively suppresses interaction-irrelevant noise in predictions.
Demonstrates strong generalization to out-of-distribution data.
Abstract
A fine-grained understanding of egocentric human-environment interactions is crucial for developing next-generation embodied agents. One fundamental challenge in this area involves accurately parsing hands and active objects. While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms rely primarily on semantic cues or learnable parameters, demonstrating limited adaptability to changing active objects across varying input scenes; 2) previous transformer-based methods utilize pixel-level semantic features to iteratively refine queries during mask generation, which may introduce interaction-irrelevant content into the final embeddings; and 3) prevailing models are susceptible to "interaction illusion", producing physically inconsistent predictions. To address these…
Peer Reviews
Decision: ICLR 2026 Poster
- Targeted Problem-Solving: It accurately identifies three critical EgoHOS pain points (e.g., interaction hallucinations harming agent safety) and designs solutions accordingly, ensuring tight problem-solution alignment.
- Logical Component Design: PQG, DFS, and CoCo Loss form a cohesive pipeline (query generation -> feature processing -> result optimization), with progressive and rigorous logic.
- Comprehensive Experiments: It outperforms baselines on out-of-domain/OOD datasets (+5.09%/+11.4%)
Section 3.4 mentions that the presence of a hand is a fundamental prerequisite for any hand-object interaction; when the right hand is not detected in the prediction results, current models may incorrectly classify the interacting object as being 'operated by both hands' despite the absence of one hand (the right hand). In terms of results, the CoCo Loss designed based on this premise has indeed achieved good performance in tests. However, this assumption has a critical flaw: the fact that a han
1. The idea of explicitly modeling interaction to extract hand-object segmentation masks is highly intuitive.
2. The experimental results support the claim of improved performance over SOTA on various datasets; qualitative evidence is also shown to strengthen that claim.
3. The paper tries to address the problem of interaction illusion, which forces the model to learn causality over correlation (weakly speaking).
1. The paper uses many ambiguous and vague terms: "enrichment", "structural priors", "task-relevant", "preliminary coarse interactive representations". I urge the authors not to use them, since they cause confusion and distract from the actual core contributions. I would recommend a proper rewriting of the paper.
2. The idea to use interaction cues (from Zhang et al., 2022, EGOHOS) is useful, but it is important to highlight whether other methods lack this supervision. If they do, t
1. The paper points to a key area, i.e., handling contact regions with a prior predictor for interactions.
2. Instead of a one-size-fits-all design, the paper addresses each specific drawback of SOTA techniques with a dedicated component (PQG, DFS, and CoCo loss), which is a fresh approach.
3. CoCo loss is a simple solution to the interaction illusion problem.
4. The codebase will be released if accepted, and the appendix provides some clarifications.
5. Comparison with extensive SOTA establishes good empirical resu
1. Compared with an end-to-end pipeline, the proposed modular approach may add time and space complexity and interrupt error backpropagation.
2. The dependency of the later modules on IPP makes contextual training of IPP necessary for generalization.
3. An anonymous codebase was expected for validation, along with more supplementary material showing qualitative results and failure cases.
4. Certain variables, like the CoCo Loss parameters, were found by ablation on specific datasets, thereby reducing generalization in lack o
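The co-occurrence premise cited in the reviews above (a hand must be present for any hand-object interaction) can be illustrated with a small post-hoc check on a predicted label map. This is only a sketch of the premise, not the paper's CoCo Loss, whose formulation is not given here. The class ids, the `REQUIRED_HANDS` table, and the `find_inconsistencies` helper are all hypothetical names introduced for illustration.

```python
# Hypothetical class ids; the paper's actual label scheme is not given here.
BACKGROUND, LEFT_HAND, RIGHT_HAND, LEFT_OBJ, RIGHT_OBJ, TWO_HAND_OBJ = range(6)

# Assumed rule table: each object class lists the hand classes
# that must co-occur with it in the same prediction.
REQUIRED_HANDS = {
    LEFT_OBJ: {LEFT_HAND},
    RIGHT_OBJ: {RIGHT_HAND},
    TWO_HAND_OBJ: {LEFT_HAND, RIGHT_HAND},
}


def find_inconsistencies(label_map):
    """Return {object_class: missing_hand_classes} for a 2D list of class ids."""
    present = {label for row in label_map for label in row}
    violations = {}
    for obj, hands in REQUIRED_HANDS.items():
        missing = hands - present
        if obj in present and missing:
            violations[obj] = missing
    return violations


# A toy prediction: a left hand and a "two-hand object", but no right hand.
prediction = [
    [0, 1, 1, 0],
    [0, 5, 5, 0],
    [0, 5, 5, 0],
]
print(find_inconsistencies(prediction))
```

On the toy prediction, the check flags the two-hand object because the right hand is missing, which is the kind of "interaction illusion" the abstract describes. A training-time loss would need a differentiable form of this rule; the check here only inspects hard labels.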
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications
