TL;DR
This paper presents a multi-modal approach for estimating the filling mass of containers in human-robot handovers, combining visual, audio, and depth data to improve perception accuracy in collaborative robotics.
Contribution
It introduces a multi-modal method that jointly predicts filling type, filling level, and container capacity, then combines these three indicators to estimate filling mass. The method achieved Top-1 overall performance among all submissions to the CORSMAL 2020 Challenge.
Findings
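Achieved Top-1 overall performance among all CORSMAL 2020 Challenge submissions, on both the public and private subsets
Demonstrated that fusing visual, audio, and depth data is effective for estimating the filling mass of a container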
No evidence of overfitting in the proposed method
Abstract
Human-robot object handover is a key skill for the future of human-robot collaboration. The CORSMAL 2020 Challenge focuses on the perception part of this problem: the robot needs to estimate the filling mass of a container held by a human. Although there are powerful methods for image processing and audio processing individually, solving this problem requires processing data from multiple sensors together. The appearance of the container, the sound of the filling, and the depth data all provide essential information. We propose a multi-modal method to predict three key indicators of the filling mass: filling type, filling level, and container capacity. These indicators are then combined to estimate the filling mass of a container. Our method obtained Top-1 overall performance among all submissions to the CORSMAL 2020 Challenge on both public and private subsets while showing no evidence of…
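The abstract says the three predicted indicators are combined into a filling mass but does not give the formula. A minimal sketch of one plausible combination follows, assuming mass is the product of the filled fraction, the container capacity, and a per-type density. The function name `estimate_filling_mass`, the class labels, and the density values are hypothetical and chosen for illustration only; they are not taken from the text above.

```python
# Assumed densities in g/mL for each filling type.
# These labels and values are illustrative assumptions, not from the paper summary.
DENSITY_G_PER_ML = {
    "empty": 0.0,
    "pasta": 0.41,
    "rice": 0.85,
    "water": 1.0,
}


def estimate_filling_mass(filling_type: str, filling_level_pct: float, capacity_ml: float) -> float:
    """Combine the three predicted indicators into a mass estimate in grams.

    Assumes mass = (level / 100) * capacity * density(type).
    """
    if filling_type not in DENSITY_G_PER_ML:
        raise ValueError(f"unknown filling type: {filling_type!r}")
    if not 0.0 <= filling_level_pct <= 100.0:
        raise ValueError("filling level must be a percentage in [0, 100]")
    if capacity_ml < 0.0:
        raise ValueError("capacity must be non-negative")

    filled_volume_ml = (filling_level_pct / 100.0) * capacity_ml
    return filled_volume_ml * DENSITY_G_PER_ML[filling_type]


if __name__ == "__main__":
    # A 500 mL container half full of water.
    print(estimate_filling_mass("water", 50.0, 500.0))  # 250.0
```

Under this assumption, an error in any one indicator scales the mass estimate multiplicatively, which is one reason to predict the three indicators jointly rather than in isolation.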
Taxonomy
Methods: Gated Recurrent Unit · Adam
