TL;DR
This paper introduces a bimodal neural network that uses audio and video signals to improve the synchronization of sensory effects in mulsemedia applications, enhancing timing accuracy and reducing manual effort.
Contribution
The paper presents a novel bimodal neural network architecture that leverages both audio and video data to assist in synchronizing sensory effects in mulsemedia applications, outperforming unimodal methods.
Findings
The bimodal approach yields better synchronization accuracy than unimodal methods.
The model trained on Google's AudioSet demonstrates effective scene component prediction.
Experimental results confirm the superiority of combined audio-video signals for timing synchronization.
Abstract
In mulsemedia applications, traditional media content (text, image, audio, video, etc.) can be related to media objects that target other human senses (e.g., smell, haptics, taste). Such applications aim at bridging the virtual and real worlds through sensors and actuators. Actuators are responsible for executing sensory effects (e.g., wind, heat, light), which produce sensory stimulation for the users. In these applications, sensory stimulation must be timed to match the traditional media content being presented. For example, at the moment an explosion appears in the audiovisual content, it may be appropriate to activate actuators that produce heat and light. It is common to use a declarative multimedia authoring language to relate the timestamp at which each media object is presented to the execution of a sensory effect. One problem…
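The timestamp-to-effect relation described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method or authoring language: the names Detection, EffectCue, EFFECTS_FOR_LABEL, and cues_from_detections are hypothetical, and the label-to-effect table is an assumption built only from the abstract's explosion example. The sketch assumes some recognizer (such as the bimodal network summarized above) has already produced labeled time intervals.

```python
from dataclasses import dataclass

# Hypothetical table: which sensory effects a scene component should trigger.
# Only the "explosion" entry comes from the abstract's example.
EFFECTS_FOR_LABEL = {
    "explosion": ["heat", "light"],
    "wind": ["wind"],
}


@dataclass
class Detection:
    """A scene component recognized in the audiovisual content (assumed input)."""
    label: str
    start_s: float
    end_s: float


@dataclass
class EffectCue:
    """A sensory effect scheduled on the media timeline."""
    effect: str
    start_s: float
    end_s: float


def cues_from_detections(detections, effects_for_label=EFFECTS_FOR_LABEL):
    """Turn labeled time intervals into effect cues sharing the same interval."""
    cues = []
    for det in detections:
        # Labels with no mapped effect (e.g. speech) produce no cue.
        for effect in effects_for_label.get(det.label, []):
            cues.append(EffectCue(effect, det.start_s, det.end_s))
    # Order by start time so an actuator scheduler can consume cues in sequence.
    return sorted(cues, key=lambda c: (c.start_s, c.effect))


detections = [
    Detection("explosion", 12.0, 14.5),
    Detection("speech", 0.0, 5.0),
]
cues = cues_from_detections(detections)
for cue in cues:
    print(f"{cue.start_s:.1f}s to {cue.end_s:.1f}s: {cue.effect}")
```

In a real authoring workflow, the resulting cues would be written into the declarative document rather than printed; that step is omitted here because the summary does not name the language used.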
