Joint Video and Text Parsing for Understanding Events and Answering Queries
Kewei Tu, Meng Meng, Mun Wai Lee, Tae Eun Choe, Song-Chun Zhu

TL;DR
This paper introduces a joint parsing framework for videos and texts that models spatial, temporal, and causal structures to improve event understanding and query answering.
Contribution
It presents a novel probabilistic model and a hierarchical graph representation for deep semantic joint parsing of video and text data.
Findings
Parses event and causal structures accurately from video and text.
Improves query-answering accuracy by using the joint parse graph.
Demonstrates applications in narrative generation and question answering.
Abstract
We propose a framework for parsing video and text jointly for understanding events and answering user queries. Our framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events) and causal information (causalities between events and fluents) in the video and text. The knowledge representation of our framework is based on a spatial-temporal-causal And-Or graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes and events as well as their interactions and mutual contexts, and specifies the prior probabilistic distribution of the parse graphs. We present a probabilistic generative model for joint parsing that captures the relations between the input video/text, their corresponding parse graphs and the joint parse graph. Based on the probabilistic model,…
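To make the And-Or graph idea in the abstract concrete, here is a minimal sketch in Python. It is a toy illustration, not the authors' S/T/C-AOG or their parsing algorithm. The node names, the branch probabilities, and the helper `enumerate_parses` are all hypothetical. The sketch assumes that an And node composes all of its children, that an Or node selects exactly one child with a given branch probability, and that the prior of a parse graph is the product of the branch probabilities chosen along it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One node of a toy And-Or graph (hypothetical structure, for illustration)."""
    name: str
    kind: str  # "and", "or", or "leaf"
    children: List["Node"] = field(default_factory=list)
    probs: List[float] = field(default_factory=list)  # branch probabilities, Or nodes only


def enumerate_parses(node: Node) -> List[Tuple[Tuple[str, ...], float]]:
    """Return (leaf names, prior probability) for every parse rooted at `node`."""
    if node.kind == "leaf":
        return [((node.name,), 1.0)]
    if node.kind == "or":
        # An Or node picks exactly one alternative, weighted by its branch probability.
        results = []
        for child, p in zip(node.children, node.probs):
            for leaves, q in enumerate_parses(child):
                results.append((leaves, p * q))
        return results
    # An And node combines one parse from each child; priors multiply.
    results = [((), 1.0)]
    for child in node.children:
        child_parses = enumerate_parses(child)
        results = [(l1 + l2, p1 * p2)
                   for l1, p1 in results
                   for l2, p2 in child_parses]
    return results


# Hypothetical event: an agent performs an action on an object.
agent = Node("person", "leaf")
action = Node("action", "or",
              [Node("drink", "leaf"), Node("pour", "leaf")], [0.75, 0.25])
obj = Node("object", "or",
           [Node("cup", "leaf"), Node("bottle", "leaf")], [0.75, 0.25])
event = Node("event", "and", [agent, action, obj])

parses = enumerate_parses(event)
best = max(parses, key=lambda item: item[1])  # parse with the highest prior
```

Exhaustive enumeration is only workable for a graph this small. The sketch covers the prior over parse graphs alone and leaves out the likelihood of the video and text inputs that the paper's generative model also captures.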
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization
