AMAVA: Adaptive Motion-Aware Video-to-Audio Framework for Visually-Impaired Assistance
Benjamin Klein, Kazi Ruslan Rahman, and Sanchita Ghose

TL;DR
AMAVA is a real-time video-to-audio system designed to assist visually impaired users by providing contextually relevant sound cues and descriptions, improving environmental awareness and safety.
Contribution
It introduces a motion-aware, AI-driven framework that dynamically switches between scene descriptions and hazard alerts to enhance navigation for the visually impaired.
Findings
User confidence increased significantly with AMAVA.
AMAVA effectively distinguishes between static and dynamic scenes.
The system reduces auditory clutter and latency in real-time navigation.
Abstract
Navigational aids for blind and low vision individuals struggle to convey dynamic real-world environments, leading to cognitive overload from continuous, undifferentiated feedback. We present AMAVA, a novel real-time video-to-audio framework that converts mobile device video into contextually relevant sound effects or text-to-speech descriptions. We propose a motion-aware pipeline that uses a lightweight AI classification model to distinguish between low- and high-movement scenes, followed by a real-time text-to-audio synthesis pipeline to enhance environmental perception more efficiently. In static environments, AMAVA generates spoken audio scene descriptions for situational awareness. In high-movement situations, it prioritizes safety by delivering sound cues, such as spoken hazard alerts and environmental sound effects. These audio outputs are produced by a decoder-only transformer-based…
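The motion-aware switching described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the classification model, so simple frame differencing stands in for the lightweight AI classifier here. The function names (`motion_score`, `choose_mode`, `route_frames`), the mode labels, and the threshold value of 0.1 are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

# A grayscale frame: rows of pixel intensities in [0, 1] (assumed representation).
Frame = Sequence[Sequence[float]]


def motion_score(prev: Frame, curr: Frame) -> float:
    """Mean absolute per-pixel difference between two same-sized frames.

    Stand-in for the paper's lightweight classifier, which is not
    described in the abstract.
    """
    total = 0.0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def choose_mode(score: float, threshold: float = 0.1) -> str:
    """Pick the audio output mode; the threshold is an arbitrary assumption."""
    return "hazard_alert" if score >= threshold else "scene_description"


def route_frames(frames: List[Frame], threshold: float = 0.1) -> List[str]:
    """Return one output mode per consecutive pair of frames."""
    return [
        choose_mode(motion_score(prev, curr), threshold)
        for prev, curr in zip(frames, frames[1:])
    ]


if __name__ == "__main__":
    static = [[0.5, 0.5], [0.5, 0.5]]
    moved = [[0.9, 0.1], [0.5, 0.5]]
    print(route_frames([static, static, moved]))
```

In this sketch a low score routes the frame to a spoken scene description and a high score routes it to a hazard alert, mirroring the static versus high-movement split in the abstract. A real system would replace `motion_score` with a learned classifier and would likely smooth the decision over several frames to avoid rapid switching.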
