YOLOA: Real-Time Affordance Detection via LLM Adapter
Yuqi Ji, Junjie Ke, Lihuo He, Jun Liu, Kaifan Zhang, Yu-Kun Lai, Guiguang Ding, Xinbo Gao

TL;DR
YOLOA is a real-time affordance detection model that jointly predicts object classes, locations, and affordances using a lightweight detector enhanced by an LLM adapter, achieving state-of-the-art accuracy and efficiency.
Contribution
The paper introduces YOLOA, a real-time affordance detection framework that integrates an LLM adapter to improve joint understanding of 'what', 'where', and 'how' in embodied AI.
Findings
Achieves 52.8 mAP on ADG-Det and 73.1 mAP on IIT-Heat.
Runs at up to 89.77 FPS, with a lightweight variant reaching 846.24 FPS.
Outperforms previous methods in accuracy and real-time performance.
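To put the throughput figures in per-frame terms, the short sketch below converts the reported FPS values into average time per frame. It assumes frames are processed one at a time with no batching or pipelining, which the summary does not state; the helper name `frame_latency_ms` is hypothetical and used only for illustration.

```python
def frame_latency_ms(fps: float) -> float:
    """Average per-frame time in milliseconds for a given throughput in FPS.

    Assumption: sequential, unbatched processing (not stated in the source).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Throughput figures quoted in the findings above.
reported_fps = {"YOLOA": 89.77, "YOLOA (lightweight variant)": 846.24}

for name, fps in reported_fps.items():
    print(f"{name}: {fps} FPS is about {frame_latency_ms(fps):.2f} ms per frame")
```

Under that assumption, the two figures correspond to roughly 11 ms and 1.2 ms per frame, both well inside the 33 ms budget of a 30 FPS camera stream.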
Abstract
Affordance detection aims to jointly address the fundamental "what-where-how" challenge in embodied AI by understanding "what" an object is, "where" the object is located, and "how" it can be used. However, most affordance learning methods focus solely on "how" objects can be used while neglecting the "what" and "where" aspects. Other affordance detection methods treat object detection and affordance learning as two independent tasks, lacking effective interaction and real-time capability. To overcome these limitations, we introduce YOLO Affordance (YOLOA), a real-time affordance detection model that jointly handles these two tasks via a large language model (LLM) adapter. Specifically, YOLOA employs a lightweight detector consisting of object detection and affordance learning branches refined through the LLM Adapter. During training, the LLM Adapter interacts with object and affordance…
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
