Audio-Visual Traffic Light State Detection for Urban Robots
Sagar Gupta, Akansel Cosgun

TL;DR
This paper introduces a multimodal approach combining vision and sound for traffic light detection in urban robots, improving robustness against occlusion and noise compared to single-modality methods.
Contribution
It presents a novel fusion method that combines audio features with visual pixel ratios, enhancing traffic light detection in challenging urban robot navigation scenarios.
Findings
Outperforms single-modality solutions during robot motion
Effectively handles visual occlusions and noise
Demonstrates the potential of multimodal perception in robotics
Abstract
We present a multimodal traffic light state detection method using vision and sound, from the viewpoint of a quadruped robot navigating in urban settings. This is a challenging problem because of visual occlusions and noise from robot locomotion. Our method combines features from raw audio with the ratios of red and green pixels within bounding boxes, identified by established vision-based detectors. The fusion method aggregates features across multiple frames in a given timeframe, increasing robustness and adaptability. Results show that our approach effectively addresses the challenge of visual occlusion and surpasses the performance of single-modality solutions when the robot is in motion. This study serves as a proof of concept, highlighting the significant, yet often overlooked, potential of multimodal perception in robotics.
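The visual half of the pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `color_ratios` and `fuse_window`, the color thresholds, the mean aggregation over frames, and the simple concatenation with audio features are all assumptions chosen for illustration. The abstract only states that red and green pixel ratios inside detected bounding boxes are combined with audio features and aggregated over a time window.

```python
from statistics import mean


def color_ratios(image, box, min_value=150, margin=50):
    """Return (red_ratio, green_ratio) for the pixels inside a bounding box.

    image: list of rows, each row a list of (r, g, b) tuples.
    box:   (x0, y0, x1, y1), end-exclusive pixel coordinates.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    x0, y0, x1, y1 = box
    red = green = total = 0
    for row in image[y0:y1]:
        for r, g, b in row[x0:x1]:
            total += 1
            # A pixel counts as red/green if that channel is bright and
            # clearly dominates the other two channels.
            if r >= min_value and r - max(g, b) >= margin:
                red += 1
            elif g >= min_value and g - max(r, b) >= margin:
                green += 1
    if total == 0:
        return 0.0, 0.0
    return red / total, green / total


def fuse_window(frame_ratios, audio_features):
    """Average per-frame ratios over a window and append audio features.

    frame_ratios:   list of (red_ratio, green_ratio), one per frame with a detection.
    audio_features: list of floats for the same window (hypothetical extractor).
    """
    if not frame_ratios:
        # No detection in the window (e.g. occlusion): only audio is informative.
        red_mean, green_mean = 0.0, 0.0
    else:
        red_mean = mean(r for r, _ in frame_ratios)
        green_mean = mean(g for _, g in frame_ratios)
    return [red_mean, green_mean, *audio_features]
```

The fused vector would then be passed to a classifier of the reader's choice. Averaging over the window is one simple way to smooth out frames where the light is briefly occluded; the paper may use a different aggregation.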
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Measurement and Detection Methods
