AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition
Rameswar Panda, Chun-Fu Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris

TL;DR
AdaMML introduces an adaptive framework for multi-modal video recognition that dynamically selects the most relevant modalities per segment, significantly reducing computation while improving accuracy.
Contribution
The paper proposes an adaptive multi-modal learning framework in which a policy network chooses, per video segment, which modalities to process, improving both efficiency and accuracy in video recognition.
Findings
Achieves a 35%-55% reduction in computation compared with the traditional baseline that uses all modalities regardless of the input.
Consistently outperforms state-of-the-art methods in accuracy.
Demonstrates effectiveness across four diverse datasets.
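As a rough illustration of how a computation reduction like the one above could be measured, the sketch below compares the cost of the modalities a policy actually selects against the cost of running every modality on every segment. The per-modality costs, the modality names, and the `compute_reduction` helper are hypothetical and are not taken from the paper.

```python
# Hypothetical per-segment costs in GFLOPs (illustrative, not from the paper).
COST = {"rgb": 4.0, "audio": 1.0}


def compute_reduction(selections, cost=COST):
    """Fractional saving relative to running every modality on every segment.

    `selections` holds one set of modality names per video segment.
    """
    full = len(selections) * sum(cost.values())
    used = sum(cost[m] for segment in selections for m in segment)
    return 1.0 - used / full


# Four segments: both modalities, RGB only, audio only, and a skipped segment.
selections = [{"rgb", "audio"}, {"rgb"}, {"audio"}, set()]
print(f"{compute_reduction(selections):.0%}")  # prints "50%"
```

With these made-up costs the selected modalities use 10 of the 20 GFLOPs the all-modalities baseline would spend, giving a 50% reduction.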
Abstract
Multi-modal learning, which focuses on utilizing various modalities to improve the performance of a model, is widely used in video recognition. While traditional multi-modal learning offers excellent recognition results, its computational expense limits its impact for many real-world applications. In this paper, we propose an adaptive multi-modal learning framework, called AdaMML, that selects on-the-fly the optimal modalities for each segment conditioned on the input for efficient video recognition. Specifically, given a video segment, a multi-modal policy network is used to decide what modalities should be used for processing by the recognition model, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. Extensive experiments on four challenging diverse datasets demonstrate…
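The per-segment selection described in the abstract can be sketched as a set of binary use/skip decisions, one per modality. Discrete choices are not directly differentiable, so this sketch assumes a Gumbel-Softmax relaxation, a common way to keep such decisions trainable with back-propagation; the abstract text here does not name the technique. The modality set, the logit layout, and the function names are illustrative assumptions, and the code shows only forward sampling in plain Python (a gradient path would need an autodiff framework).

```python
import math
import random

MODALITIES = ("rgb", "flow", "audio")  # illustrative modality set


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed one-hot sample from a categorical distribution."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 to avoid log(0).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_modalities(policy_logits, tau=1.0, rng=random):
    """Pick the modalities to process for one segment.

    `policy_logits` maps each modality to (skip_logit, use_logit), standing in
    for the output of a policy network on that segment.
    """
    chosen = []
    for name in MODALITIES:
        skip_p, use_p = gumbel_softmax(policy_logits[name], tau, rng)
        if use_p > skip_p:
            chosen.append(name)
    return chosen


# Example: logits that strongly favour RGB and audio for this segment.
logits = {"rgb": (-50.0, 50.0), "flow": (50.0, -50.0), "audio": (-50.0, 50.0)}
print(select_modalities(logits, rng=random.Random(0)))
```

Only the modalities returned for a segment would be passed to the recognition model, which is where the computational saving comes from.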
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Cancer-related molecular mechanisms research
