TL;DR
This paper introduces MAC-X, a multimodal extension of Compositional Attention Networks (MAC) for reasoning about social interactions in videos, and reports a 2.5% absolute improvement in binary accuracy over the prior state of the art on the Social IQ video question answering dataset.
Contribution
The paper presents MAC-X, a novel multimodal deep architecture that performs iterative mid-level fusion of visual, auditory, and text inputs for social reasoning in videos.
Findings
MAC-X effectively leverages multimodal cues through mid-level fusion.
Achieves a 2.5% absolute improvement in binary accuracy on the Social IQ dataset.
Outperforms the previous state of the art on Social Video Question Answering.
Abstract
We propose a novel deep architecture for the task of reasoning about social interactions in videos. We leverage the multi-step reasoning capabilities of Compositional Attention Networks (MAC), and propose a multimodal extension (MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level fusion of input modalities (visual, auditory, text) over multiple reasoning steps, by use of a temporal attention mechanism. We then combine MAC-X with LSTMs for temporal input processing in an end-to-end architecture. Our ablation studies show that the proposed MAC-X architecture can effectively leverage multimodal input cues using mid-level fusion mechanisms. We apply MAC-X to the task of Social Video Question Answering in the Social IQ dataset and obtain a 2.5% absolute improvement in terms of binary accuracy over the current state-of-the-art.
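The iterative fusion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' MAC-X cell: the function names (`attend`, `fuse_step`, `reason`), the fixed control vectors, the averaging used for fusion, and the memory update rule are all assumptions chosen for brevity. A real model would presumably use learned projections, LSTM-encoded inputs, and question-conditioned controls. The sketch only shows the general pattern: at each reasoning step, a query attends over the time steps of each modality, the per-modality summaries are fused, and the result updates a running memory.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, sequence):
    """Temporal attention: weight each time step by its similarity to the query."""
    weights = softmax([dot(query, step) for step in sequence])
    dim = len(sequence[0])
    summary = [sum(w * step[i] for w, step in zip(weights, sequence))
               for i in range(dim)]
    return summary, weights


def fuse_step(memory, control, modalities):
    """One reasoning step: attend over every modality, then fuse into memory."""
    # Query combines the current memory with this step's control vector.
    query = [m * c for m, c in zip(memory, control)]
    reads = [attend(query, seq)[0] for seq in modalities.values()]
    # Assumed fusion rule: plain average of the per-modality summaries.
    fused = [sum(vals) / len(reads) for vals in zip(*reads)]
    # Assumed update rule: equal blend of old memory and fused read.
    return [0.5 * m + 0.5 * f for m, f in zip(memory, fused)]


def reason(modalities, controls, dim):
    """Run one fusion step per control vector and return the final memory."""
    memory = [1.0] * dim
    for control in controls:
        memory = fuse_step(memory, control, modalities)
    return memory


random.seed(0)
DIM, STEPS = 4, 3


def random_sequence(length):
    """Stand-in for an encoded feature sequence (hypothetical data)."""
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(length)]


# Sequences may differ in length because attention runs per modality.
modalities = {
    "visual": random_sequence(5),
    "audio": random_sequence(7),
    "text": random_sequence(6),
}
controls = [[1.0] * DIM for _ in range(STEPS)]  # fixed controls, for illustration only
final_memory = reason(modalities, controls, DIM)
```

Because fusion happens inside each reasoning step rather than once at the input or at the output, later steps can attend to different moments in each modality depending on what earlier steps wrote to memory.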
