TL;DR
TMac introduces a graph-based approach to model temporal relationships in multi-modal audiovisual data, significantly improving acoustic event classification by capturing dynamic intra- and inter-modal information.
Contribution
The paper presents a novel graph learning method that explicitly models temporal relations in multi-modal data for acoustic event classification, outperforming existing methods.
Findings
TMac outperforms state-of-the-art models on acoustic event classification.
Explicit temporal modeling enhances multi-modal acoustic event classification.
The approach effectively captures intra- and inter-modal temporal dynamics.
Abstract
Audiovisual data is everywhere in this digital age, which places higher demands on the deep learning models built on it. Handling the information in multi-modal data well is the key to a better audiovisual model. We observe that audiovisual data naturally has temporal attributes, such as the time information for each frame in a video. More concretely, such data is inherently multi-modal, comprising both audio and visual cues that proceed in strict chronological order. This indicates that temporal information is important for both intra- and inter-modal modeling of multi-modal acoustic events. However, existing methods process each modality's features independently and simply fuse them together, which neglects temporal relations and thus leads to sub-optimal performance. With this motivation, we propose a Temporal Multi-modal graph learning method for…
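To make the idea of intra- and inter-modal temporal relations concrete, here is a minimal sketch of one way such a graph could be laid out. It is not the paper's construction, which the truncated abstract does not specify. The `Node` class, the `build_temporal_graph` function, the `window` parameter, and the rule of linking audio and video nodes whose timestamps fall within that window are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    modality: str  # "audio" or "video"
    index: int     # position within its own modality
    time: float    # timestamp in seconds


def build_temporal_graph(audio_times, video_times, window=0.5):
    """Return (nodes, edges); each edge is (src, dst, kind, time_gap).

    Hypothetical construction for illustration only:
    - "intra" edges link consecutive nodes of the same modality,
    - "inter" edges link audio and video nodes at most `window` seconds apart.
    """
    nodes = [Node("audio", i, t) for i, t in enumerate(audio_times)]
    nodes += [Node("video", i, t) for i, t in enumerate(video_times)]
    edges = []

    # Intra-modal edges: each node points to its successor in time.
    for modality in ("audio", "video"):
        seq = sorted((n for n in nodes if n.modality == modality),
                     key=lambda n: n.time)
        for a, b in zip(seq, seq[1:]):
            edges.append((a, b, "intra", b.time - a.time))

    # Inter-modal edges: connect audio and video nodes that are close in time.
    audio_nodes = [n for n in nodes if n.modality == "audio"]
    video_nodes = [n for n in nodes if n.modality == "video"]
    for a in audio_nodes:
        for v in video_nodes:
            if abs(a.time - v.time) <= window:
                edges.append((a, v, "inter", v.time - a.time))

    return nodes, edges


# Three audio segments at 1 s spacing, five video frames at 0.5 s spacing.
nodes, edges = build_temporal_graph([0.0, 1.0, 2.0], [0.0, 0.5, 1.0, 1.5, 2.0])
intra = [e for e in edges if e[2] == "intra"]
inter = [e for e in edges if e[2] == "inter"]
```

Storing the time gap on each edge keeps the chronological order available to whatever model consumes the graph, rather than discarding it when modality features are fused.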
