Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation
Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

TL;DR
This paper introduces a novel cross-attention mechanism for context-aware human affordance prediction in 2D scenes, improving pose prediction by encoding scene context from two different modalities.
Contribution
It proposes a disentangled, multi-step approach combining cross-attention and VAEs for more accurate human pose and location prediction in complex scenes.
Findings
Significant improvement over previous affordance prediction baselines.
Effective encoding of scene context from multiple modalities.
Enhanced pose and location prediction accuracy.
Abstract
Human affordance learning investigates contextually relevant novel pose prediction such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations makes the problem challenging and non-trivial. However, the existing datasets and methods for human affordance prediction in 2D scenes are significantly limited in the literature. In this paper, we propose a novel cross-attention mechanism to encode the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled among individual subtasks to efficiently reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational…
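To make "mutually attending spatial feature maps from two different modalities" concrete, the following is a minimal NumPy sketch of generic scaled dot-product cross-attention applied in both directions. It is an illustration under stated assumptions, not the authors' architecture: the names `cross_attend`, `mutual_cross_attention`, `feat_a`, and `feat_b` are hypothetical, no learned projections are used, both maps are assumed to share the same channel count, and the final concatenation is one possible way to fuse the two attended maps.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_map, context_map):
    """Let every spatial position of query_map attend over context_map.

    Both inputs are (H, W, C) feature maps with the same C (an assumption
    of this sketch). Returns the attended map and the attention weights.
    """
    h, w, c = query_map.shape
    q = query_map.reshape(-1, c)       # (N_q, C) queries
    kv = context_map.reshape(-1, c)    # (N_k, C) keys and values
    weights = softmax(q @ kv.T / np.sqrt(c), axis=-1)  # (N_q, N_k)
    attended = (weights @ kv).reshape(h, w, c)
    return attended, weights

def mutual_cross_attention(feat_a, feat_b):
    # Modality A queries modality B, and B queries A.
    a_ctx, _ = cross_attend(feat_a, feat_b)
    b_ctx, _ = cross_attend(feat_b, feat_a)
    # Hypothetical fusion: concatenate along the channel axis.
    return np.concatenate([a_ctx, b_ctx], axis=-1)

# Random stand-ins for the two modality feature maps.
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(4, 4, 8))
feat_b = rng.normal(size=(4, 4, 8))
fused = mutual_cross_attention(feat_a, feat_b)
print(fused.shape)  # (4, 4, 16)
```

Each modality acts as the query once and as the key/value source once, so the fused map carries context from both directions. A trained model would typically add learned query, key, and value projections before the dot product.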
