Audio-Visual Contact Classification for Tree Structures in Agriculture
Ryan Spears, Moonyoung Lee, George Kantor, Oliver Kroemer

TL;DR
This paper introduces a multi-modal audio-visual classification system for identifying contact types in agricultural tree manipulation, improving robot safety and effectiveness in cluttered, unstructured environments.
Contribution
It presents a novel fusion of vibrotactile and visual data for contact classification, with zero-shot transfer from hand-held to robot-mounted sensors.
Findings
Achieved an F1 score of 0.82 in contact classification.
Demonstrated zero-shot generalization to robot-mounted sensors.
Showed that audio signals effectively distinguish material types.
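For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score can be computed over the four contact classes named in the abstract. The summary does not say whether the reported 0.82 is macro-averaged, micro-averaged, or weighted, so the macro average used here is an assumption. The function names and the toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter

# The four contact classes named in the abstract.
CLASSES = ["leaf", "twig", "trunk", "ambient"]


def per_class_f1(y_true, y_pred, classes=CLASSES):
    """Return a dict mapping each class to its F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for c in classes:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)


# Toy labels for illustration only; NOT results from the paper.
y_true = ["leaf", "leaf", "twig", "twig", "trunk", "trunk", "ambient", "ambient"]
y_pred = ["leaf", "twig", "twig", "twig", "trunk", "trunk", "ambient", "leaf"]

scores = per_class_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
```

Macro averaging weights every class equally, which matters when one class (for example, ambient) dominates the recordings; a weighted or micro average would give a different number on the same predictions.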
Abstract
Contact-rich manipulation tasks in agriculture, such as pruning and harvesting, require robots to physically interact with tree structures to maneuver through cluttered foliage. Identifying whether the robot is contacting rigid or soft materials is critical for the downstream manipulation policy to be safe, yet vision alone is often insufficient due to occlusion and limited viewpoints in this unstructured environment. To address this, we propose a multi-modal classification framework that fuses vibrotactile (audio) and visual inputs to identify the contact class: leaf, twig, trunk, or ambient. Our key insight is that contact-induced vibrations carry material-specific signals, making audio effective for detecting contact events and distinguishing material types, while visual features add complementary semantic cues that support more fine-grained classification. We collect training data…
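The idea of combining an audio-based prediction with a visual one can be illustrated with a small late-fusion sketch. This excerpt does not describe the paper's actual fusion architecture, so everything below is an assumption made for illustration: the weighted geometric mean of per-modality class probabilities, the `audio_weight` parameter, the function names, and the example probability vectors are all hypothetical.

```python
import math

# The four contact classes named in the abstract.
CLASSES = ["leaf", "twig", "trunk", "ambient"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(audio_probs, visual_probs, audio_weight=0.5):
    """Late fusion (assumed scheme): weighted geometric mean of two
    class-probability vectors, renormalized to sum to 1."""
    eps = 1e-12  # avoids log(0) for zero-probability classes
    w = audio_weight
    log_mix = [
        w * math.log(a + eps) + (1.0 - w) * math.log(v + eps)
        for a, v in zip(audio_probs, visual_probs)
    ]
    return softmax(log_mix)


def classify(audio_probs, visual_probs, audio_weight=0.5):
    """Return (predicted class name, fused probabilities)."""
    fused = fuse(audio_probs, visual_probs, audio_weight)
    best = max(range(len(CLASSES)), key=fused.__getitem__)
    return CLASSES[best], fused


# Hypothetical per-modality outputs, in CLASSES order.
# Audio alone is ambiguous between leaf and twig; vision favors leaf.
audio_probs = [0.40, 0.45, 0.10, 0.05]
visual_probs = [0.70, 0.20, 0.05, 0.05]

label, fused = classify(audio_probs, visual_probs)
audio_only_label, _ = classify(audio_probs, visual_probs, audio_weight=1.0)
```

In this toy case the audio-only prediction is twig, while the fused prediction is leaf, which mirrors the abstract's point that visual cues can refine a decision that audio leaves ambiguous. A learned model operating on joint features would be an equally plausible design; this sketch only shows the simplest form of combining the two modalities.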
Taxonomy
Topics: Tree Root and Stability Studies · Smart Agriculture and AI · Tactile and Sensory Interactions
