Multi-Modal Video Dialog State Tracking in the Wild
Adnen Abdessaied, Lei Shi, Andreas Bulling

TL;DR
MST-MIXER is a multi-modal video dialog model that tracks the most important constituents of every input modality, not only the visual one, and combines them through learned local and global graphs, targeting real-world (in the wild) video dialog rather than synthetic datasets.
Contribution
It introduces a novel multi-modal graph structure learning method and a comprehensive approach to real-world multi-modal state tracking in video dialogs.
Findings
Achieves state-of-the-art results on five benchmarks.
Effectively models complex real-world multi-modal interactions.
Learns local and global graph structures for better state tracking.
Abstract
We present MST-MIXER, a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major respects: (1) they track only one modality (mostly the visual input), or (2) they target synthetic datasets that do not reflect the complexity of real-world, in-the-wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities which…
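The three steps named in the abstract (select constituents per modality, build a local graph per modality, merge into a global graph) can be illustrated with a toy sketch. This is not the authors' method: the abstract does not specify how constituents are scored or how graph structure is learned, so the sketch assumes given importance scores, substitutes fixed cosine-similarity thresholds for learned structure, and uses hypothetical names (`top_k`, `local_graph`, `global_graph`) and made-up 2-D features throughout.

```python
import math

def top_k(features, scores, k):
    """Keep the k highest-scoring feature vectors, in their original order.
    ASSUMPTION: importance scores are given; the paper's selection rule is not shown here."""
    best = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)[:k]
    return [features[i] for i in sorted(best)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def local_graph(nodes, threshold=0.5):
    """0/1 adjacency over one modality's nodes.
    ASSUMPTION: a similarity threshold stands in for learned graph structure."""
    n = len(nodes)
    return [[1 if i != j and cosine(nodes[i], nodes[j]) > threshold else 0
             for j in range(n)] for i in range(n)]

def global_graph(modalities, local_adjs, cross_threshold=0.2):
    """Place local graphs on the block diagonal, then add cross-modal edges.
    ASSUMPTION: cross-modal edges also come from a fixed similarity threshold."""
    nodes = [node for m in modalities for node in m]
    owner = [mi for mi, m in enumerate(modalities) for _ in m]
    size = len(nodes)
    adj = [[0] * size for _ in range(size)]
    offset = 0
    for local in local_adjs:
        for i, row in enumerate(local):
            for j, value in enumerate(row):
                adj[offset + i][offset + j] = value
        offset += len(local)
    for i in range(size):
        for j in range(size):
            if owner[i] != owner[j] and cosine(nodes[i], nodes[j]) > cross_threshold:
                adj[i][j] = 1
    return adj

# Made-up features and scores for two modalities (illustration only).
video_nodes = top_k([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], [0.9, 0.8, 0.1], k=2)
text_nodes = top_k([[0.0, 1.0], [1.0, 1.0], [0.1, 0.9]], [0.7, 0.2, 0.6], k=2)

local_adjs = [local_graph(video_nodes), local_graph(text_nodes)]
adj = global_graph([video_nodes, text_nodes], local_adjs)

for row in adj:
    print(row)
```

In this toy setup each modality keeps two nodes, the local graphs connect nodes within a modality, and the global adjacency adds edges between nodes of different modalities that are similar enough. A learned model would replace both the scores and the thresholds with trainable components.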
Taxonomy
Topics: Video Analysis and Summarization · Video Surveillance and Tracking Methods · Advanced Vision and Imaging
