TL;DR
This paper introduces a novel three-decoder architecture with focused attention and a generalized intersection box prediction task to improve the detection of relative occlusion and distance relationships in images, achieving state-of-the-art results.
Contribution
The work presents a new architecture and training strategy specifically designed for geometric relationship detection, advancing beyond semantic relationship detection methods.
Findings
Raised the distance F1-score from 33.8% to 38.6% (+4.8 points).
Raised the occlusion F1-score from 34.4% to 41.2% (+6.8 points).
Demonstrated the effectiveness of focused attention in relationship detection.
Abstract
For humans, understanding the relationships between objects using visual signals is intuitive. For artificial intelligence, however, this task remains challenging. Researchers have made significant progress studying semantic relationship detection, such as human-object interaction detection and visual relationship detection. We take the study of visual relationships a step further, from semantic to geometric. Specifically, we predict relative occlusion and relative distance relationships. However, detecting these relationships from a single image is challenging. Enforcing focused attention to task-specific regions plays a critical role in successfully detecting these relationships. In this work, (1) we propose a novel three-decoder architecture as the infrastructure for focused attention; (2) we use the generalized intersection box prediction task to effectively guide our model to focus on…
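The abstract names a generalized intersection box but the excerpt here does not define it. As a rough illustration only, the sketch below assumes one plausible reading: for two axis-aligned boxes, take the middle two of the four sorted x-coordinates and the middle two of the four sorted y-coordinates. Under that assumption the result equals the ordinary intersection when the boxes overlap, and the gap region between them when they do not. The `Box` type and the function name `generalized_intersection_box` are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box given by its top-left (x1, y1) and bottom-right (x2, y2) corners."""
    x1: float
    y1: float
    x2: float
    y2: float


def generalized_intersection_box(a: Box, b: Box) -> Box:
    """Assumed definition (not confirmed by the source): middle two sorted coordinates per axis."""
    xs = sorted([a.x1, a.x2, b.x1, b.x2])
    ys = sorted([a.y1, a.y2, b.y1, b.y2])
    # xs[1], xs[2] are the inner x-coordinates; likewise for ys.
    return Box(xs[1], ys[1], xs[2], ys[2])


# Overlapping boxes: the result is the usual intersection.
overlap = generalized_intersection_box(Box(0, 0, 4, 4), Box(2, 2, 6, 6))

# Horizontally separated boxes: the result spans the gap between them.
gap = generalized_intersection_box(Box(0, 0, 2, 2), Box(5, 1, 8, 3))
```

Under this assumed definition a box is always produced, even for disjoint object pairs, which is one way such a target could give a model a region to attend to for every pair.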
