REACT: Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation
Maëlic Neau, Paulo E. Santos, Anne-Gwenn Bosser, Cédric Buche, Akihiro Sugimoto

TL;DR
REACT is a scene graph generation architecture that balances real-time inference speed, object detection accuracy, and relation prediction performance. It reports the highest inference speed among existing SGG models and improves object detection accuracy without sacrificing relation prediction.
Contribution
The paper introduces REACT, an architecture for scene graph generation that improves inference speed and object detection accuracy while reducing model size.
Findings
REACT is 2.7 times faster than existing models.
REACT improves object detection accuracy by 58%.
REACT reduces model size by an average of 5.5x.
Abstract
Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we propose the Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture, which achieves the highest inference speed among existing SGG models, improving object detection accuracy without sacrificing relation prediction performance. Compared…
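To make the "graph structure" in the abstract concrete, here is a minimal sketch of how a scene graph could be represented: detected objects as nodes and relations as (subject, predicate, object) triples. The `SceneGraph` and `Relation` classes, their methods, and the example labels are hypothetical illustrations, not taken from the REACT paper or its code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A directed edge between two detected objects (hypothetical type)."""
    subject: int    # index into SceneGraph.objects
    predicate: str  # relation label, e.g. "riding"
    object: int     # index into SceneGraph.objects


@dataclass
class SceneGraph:
    """Illustrative container: object labels as nodes, relations as edges."""
    objects: list[str] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

    def add_object(self, label: str) -> int:
        # Store the label and return its node index.
        self.objects.append(label)
        return len(self.objects) - 1

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        # Record a directed edge subj -> obj labelled with the predicate.
        self.relations.append(Relation(subj, predicate, obj))

    def triples(self) -> list[tuple[str, str, str]]:
        # Resolve node indices back to labels for readable output.
        return [
            (self.objects[r.subject], r.predicate, self.objects[r.object])
            for r in self.relations
        ]


# Made-up example image content: a person wearing a hat and riding a horse.
graph = SceneGraph()
person = graph.add_object("person")
horse = graph.add_object("horse")
hat = graph.add_object("hat")
graph.add_relation(person, "riding", horse)
graph.add_relation(person, "wearing", hat)

print(graph.triples())
# [('person', 'riding', 'horse'), ('person', 'wearing', 'hat')]
```

In this framing, the object detection accuracy discussed in the abstract concerns the nodes, while relation prediction concerns the labelled edges between them.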
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Games · Human Motion and Animation
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Focus · You Only Look Once
