TL;DR
XFlow introduces cross-modal deep neural networks that enable data exchange between audio and visual streams, improving audiovisual classification by exploiting correlations and achieving state-of-the-art results on multiple datasets.
Contribution
The paper presents novel cross-modality dataflow architectures and extends cross-connections to non-compatible data, enhancing multimodal learning capabilities.
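To make the idea of cross-modality dataflow concrete, here is a minimal pure-Python sketch of a forward pass in which two feature extractors exchange intermediate representations. It is not the authors' architecture: the layer sizes, the dense projections used as cross-connections, the merge by concatenation, and every name in it (`dense`, `linear`, `make_params`, `xflow_forward`) are assumptions chosen for illustration only.

```python
import random

def linear(x, w, b):
    """Affine map: one output per row of the weight matrix w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def dense(x, w, b):
    """Affine map followed by a ReLU non-linearity."""
    return [max(0.0, v) for v in linear(x, w, b)]

def init(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def make_params(rng):
    """Hypothetical layer sizes; the two streams deliberately differ in width."""
    return {
        "visual": init(5, 6, rng),   # visual input (6) -> visual features (5)
        "audio": init(3, 4, rng),    # audio input (4)  -> audio features (3)
        "v_to_a": init(2, 5, rng),   # cross-connection: visual -> audio stream
        "a_to_v": init(2, 3, rng),   # cross-connection: audio -> visual stream
        "visual2": init(4, 7, rng),  # takes 5 own features + 2 exchanged
        "audio2": init(4, 5, rng),   # takes 3 own features + 2 exchanged
        "out": init(3, 8, rng),      # fused features (4 + 4) -> 3 class logits
    }

def xflow_forward(visual, audio, params):
    # Stream-specific feature extractors.
    h_v = dense(visual, *params["visual"])
    h_a = dense(audio, *params["audio"])
    # Cross-connections: a learned projection reshapes each stream's features
    # so the other stream can consume them despite the size mismatch.
    v_to_a = dense(h_v, *params["v_to_a"])
    a_to_v = dense(h_a, *params["a_to_v"])
    # Each stream continues on its own features plus the exchanged ones
    # (merged by list concatenation; this merge rule is an assumption).
    h_v2 = dense(h_v + a_to_v, *params["visual2"])
    h_a2 = dense(h_a + v_to_a, *params["audio2"])
    # Late fusion of both streams, then unnormalised class scores.
    return linear(h_v2 + h_a2, *params["out"])

params = make_params(random.Random(0))
logits = xflow_forward([0.1] * 6, [0.2] * 4, params)
```

The point of the sketch is only that each stream's second layer sees features from both modalities before the final fusion, which is what distinguishes this design from models that keep the streams separate until the end.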
Findings
Models outperform baselines by up to 11.5%
Achieve state-of-the-art results on AVletters, CUAVE, and Digits datasets
Learn interpretable features that improve discrimination ability
Abstract
In recent years, there have been numerous developments towards solving multimodal tasks, aiming to learn a stronger representation than is possible through a single modality. Certain aspects of the data can be particularly useful in this case - for example, correlations in the space or time domain across modalities - but they must be exploited wisely to realize their full predictive potential. We propose two deep learning architectures with multimodal cross-connections that allow for dataflow between several feature extractors (XFlow). Our models derive more interpretable features and achieve better performance than models which do not exchange representations, usefully exploiting correlations between audio and visual data, which differ in dimensionality and cannot be exchanged trivially. Our work improves on existing multimodal deep learning algorithms in two essential ways: (1)…
