Discourse Parsing in Videos: A Multi-modal Approach
Arjun R. Akula, Song-Chun Zhu

TL;DR
This paper introduces the task of visual discourse parsing in videos, proposing a method to identify discourse cues without explicit scene annotation, supported by a new dataset of 310 videos.
Contribution
It presents a novel approach to extract discourse cues from videos without scene annotation and provides a new dataset for this task.
Findings
- Successfully identified discourse cues without scene annotation
- Created a dataset with 310 videos for visual discourse parsing
- Potential applications in Visual Dialog and Visual Storytelling
Table 1: Performance of encoder-decoder RNN configurations (accuracies scaled between 0 and 1).

| RNN Type | #Hidden Units | Bidirectional | #Layers | Relations | Edges | Relations+Edges | BLEU-4 |
|---|---|---|---|---|---|---|---|
| LSTM | 256 | NO | 1 | 0.3 | 0.51 | 0.21 | 0.22 |
| LSTM | 512 | NO | 1 | 0.52 | 0.62 | 0.42 | 0.41 |
| LSTM | 1024 | YES | 1 | 0.49 | 0.51 | 0.42 | 0.33 |
| LSTM | 1024 | NO | 1 | 0.35 | 0.51 | 0.21 | 0.34 |
| LSTM | 512 | NO | 2 | 0.35 | 0.51 | 0.21 | 0.38 |
| LSTM | 512 | NO | 3 | 0.56 | 0.62 | 0.42 | 0.39 |
| LSTM | 512 | NO | 4 | 0.56 | 0.62 | 0.42 | 0.39 |
| GRU | 512 | NO | 1 | 0.3 | 0.51 | 0.21 | 0.33 |
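
Table 2: Performance of attention-based sequence-to-sequence configurations (accuracies scaled between 0 and 1).

| RNN Type | #Hidden Units | Bidirectional | #Layers | Attention Type | Relations | Edges | Relations+Edges | BLEU-4 |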
|---|---|---|---|---|---|---|---|---|
| LSTM | 512 | NO | 1 | general | 0.63 | 0.69 | 0.53 | 0.59 |
| LSTM | 512 | NO | 1 | dot | 0.52 | 0.65 | 0.45 | 0.52 |
| LSTM | 512 | NO | 1 | concat | 0.52 | 0.65 | 0.45 | 0.51 |
| LSTM | 512 | NO | 2 | general | 0.52 | 0.65 | 0.45 | 0.47 |
| LSTM | 512 | NO | 3 | general | 0.5 | 0.65 | 0.39 | 0.41 |
Arjun R. Akula
University of California, Los Angeles
Song-Chun Zhu
University of California, Los Angeles
Abstract
Text-level discourse parsing aims to unmask how two segments (or sentences) in a text are related to each other. We propose the task of Visual Discourse Parsing, which requires understanding discourse relations among scenes in a video. Here we use the term scene to refer to a subset of video frames that can better summarize the video. To collect a dataset for learning discourse cues from videos, one needs to manually identify the scenes from a large pool of video frames and then annotate the discourse relations between them. This is clearly a time-consuming, expensive and tedious task. In this work, we propose an approach to identify discourse cues from videos without the need to explicitly identify and annotate the scenes. We also present a novel dataset containing 310 videos and the corresponding discourse cues to evaluate our approach. We believe that many multi-disciplinary Artificial Intelligence problems such as Visual Dialog and Visual Storytelling would greatly benefit from the use of visual discourse cues. Our code is publicly available on GitHub: https://github.com/arjunakula/Visual-Discourse-Parsing
1 Introduction
Discourse structure aids in understanding a piece of text by linking it with other text units (such as surrounding clauses, sentences, etc.) from its context Carlson et al. (2003); Soricut and Marcu (2003); LeThanh et al. (2004). A text span may be linked to another span through semantic relationships such as contrast relation, causal relation, etc. Marcu and Echihabi (2002); Duverle and Prendinger (2009). Text-level discourse parsing algorithms aim to unmask such relationships in text, which is central to many downstream natural language processing (NLP) applications such as information retrieval, text summarization, sentiment analysis Wang and Lan (2015) and question answering Chai and Jin (2004); Akula et al. (2013); Akula (2015).
Recently, there has been a lot of focus on multi-discipline Artificial Intelligence (AI) research problems such as visual storytelling Huang et al. (2016) and visual dialog Das et al. (2016). Solving these problems requires multi-modal knowledge that combines computer vision (CV), NLP, and knowledge representation & reasoning (KR), making the need for commonsense knowledge and complex reasoning more essential. In an effort to fill this need, we introduce the task of Visual Discourse Parsing.
Task Definition. The concrete task in Visual Discourse Parsing is the following: given a video, understand the discourse relationships among its scenes. Specifically, given a video, the task is to identify each scene's relation with its context. Here we use the term scene to refer to a subset of video frames that can better summarize the video. We use Rhetorical Structure Theory (RST) Mann and Thompson (1988) to capture discourse relations among the scenes Akula and Zhu (2019a); Carlson et al. (2003); Soricut and Marcu (2003); LeThanh et al. (2004).
Consider, for example, the nine frames of a video shown in Figure 1. We can represent the discourse structure of this video using only 3 of the 9 frames, i.e. there are only 3 scenes in this video. The discourse structure in Figure 1 interprets the video as follows: the event “person going to the bathroom and cleaning his stains” is caused by the event “the person spilling coffee over his shirt”; the event “the person used his handkerchief to dry the water on his shirt” is simply an elaboration of the event “person going to the bathroom and cleaning his stains” Akula and Zhu (2019a); Akula et al. (2020a); Akula and Zhu (2019b); Akula et al. (2021c, d, b, 2020c); R Akula et al. (2019); Pulijala et al. (2013); Gupta et al. (2012).
Collecting a dataset for learning discourse cues from videos is a time-consuming, expensive and tedious task, because one needs to manually identify the scenes from a large pool of video frames and annotate the discourse relations between them. To this end, we propose an approach to identify discourse cues from videos without the need to explicitly identify and annotate the scenes. We also present a novel dataset containing 310 videos and the corresponding discourse cues to evaluate our approach Akula et al. (2013, 2018, 2021a); Gupta et al. (2016); Akula et al. (2019b); Akula (2021); Akula et al. (2019a, 2020b).
2 Approach
Algorithm 1 presents our approach for learning a model to identify discourse structure from the videos. Firstly, we generate natural language text descriptions from the videos automatically using video captioning methods such as Yu et al. (2016), Venugopalan et al. (2016) and Pasunuru and Bansal (2017). Secondly, we obtain discourse structures of the above text descriptions using text-level discourse parsers such as Duverle and Prendinger (2009) and Ji and Eisenstein (2014). We represent the discourse structure as a sequence of words (see Figure 1).
Next, we use an end-to-end trainable architecture for learning to predict the text-level discourse structures (i.e. sequences of words) from the videos (i.e. sequences of video frames). This gives us a model that maps videos to their corresponding text-level discourse structures. Finally, we use saliency methods Ramanishka et al. (2017) to replace the textual descriptions in the discourse structure, i.e. elementary discourse units (EDUs), with the scenes. For example, in Figure 1, (b) is the text description of the video shown in (a). The text-level discourse structure of (b) is shown in (c) as a sequence of words. We then map (c) to (d) using saliency methods.
The quality of the natural language descriptions and their text-level discourse structures is crucial for learning a robust model. The state-of-the-art video captioning and text-level discourse parsing approaches, as we found in our experiments, may generate a lot of noise in their outputs. While developing our corpus, we therefore performed these two steps manually. These manual annotations are still much easier to perform than the tedious task of directly annotating discourse relations between video frames.
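Representing a discourse structure as a sequence of words can be illustrated with a minimal Python sketch. The token format below (brackets, relation label, nuclearity arrow, EDU markers) is an assumption made for illustration and is not necessarily the serialization shown in Figure 1; the names `Leaf`, `Node` and `linearize` are hypothetical, and the nuclearity choices in the example are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    """An elementary discourse unit (EDU): one sentence of the description."""
    text: str


@dataclass
class Node:
    """An RST relation between two spans.

    `nucleus` is "left" or "right" and says which child is the nucleus;
    the other child is the satellite (this is the edge direction).
    """
    relation: str
    nucleus: str
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def linearize(tree: Union[Node, Leaf]) -> List[str]:
    """Flatten a tree into a token sequence with a pre-order traversal."""
    if isinstance(tree, Leaf):
        return ["<edu>"] + tree.text.split() + ["</edu>"]
    # The arrow points at the nucleus child (assumed convention).
    arrow = "<-" if tree.nucleus == "left" else "->"
    return (["(", tree.relation, arrow]
            + linearize(tree.left)
            + linearize(tree.right)
            + [")"])


# Three-sentence description, so two relations and two edges.
example = Node(
    relation="cause",
    nucleus="right",
    left=Leaf("a person spills coffee on his shirt"),
    right=Node(
        relation="elaboration",
        nucleus="left",
        left=Leaf("he goes to the bathroom and cleans the stain"),
        right=Leaf("he dries his shirt with a handkerchief"),
    ),
)
tokens = linearize(example)
print(" ".join(tokens))
```

A sequence of this kind can serve as the target side of a sequence-to-sequence model, with the video frames as the source side.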
In step 3 of our algorithm, we use the standard machine translation encoder-decoder RNN model Sutskever et al. (2014). As plain RNNs suffer from vanishing and exploding gradients, we use LSTM units, which are good at memorizing long-range dependencies due to forget-style gates Hochreiter and Schmidhuber (1997). The sequence of video frames is passed to the encoder. The last hidden state of the encoder is then passed to the decoder. The decoder generates the discourse structure as a sequence of words. Let the input sequence of video frames be $\{f_1, f_2, \dots, f_n\}$ and the output sequence of words be $\{w_1, w_2, \dots, w_m\}$. The distribution of the output sequence w.r.t. the input sequence is:
$$p(w_1, \dots, w_m \mid f_1, \dots, f_n) = \prod_{t=1}^{m} p(w_t \mid h_t^d)$$
where $h_t^d$ is the hidden state at the $t$-th time step of the decoding LSTM.
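In a standard encoder-decoder model, the probability of an output sequence given the input is a product of per-step word probabilities, each conditioned on the decoder hidden state at that step. The sketch below shows this numerically by summing log-probabilities. The per-step distributions are made-up numbers standing in for the decoder's softmax outputs, and `sequence_log_prob` is a hypothetical helper, not part of the released code.

```python
import math
from typing import Dict, List


def sequence_log_prob(step_distributions: List[Dict[str, float]],
                      words: List[str]) -> float:
    """Sum of log p(word_t | decoder state at step t) over the sequence."""
    if len(step_distributions) != len(words):
        raise ValueError("need exactly one distribution per output word")
    return sum(math.log(dist[word])
               for dist, word in zip(step_distributions, words))


# Made-up decoder outputs for a three-token target sequence.
steps = [
    {"(": 0.9, "cause": 0.05, ")": 0.05},
    {"(": 0.1, "cause": 0.8, ")": 0.1},
    {"(": 0.05, "cause": 0.05, ")": 0.9},
]
log_p = sequence_log_prob(steps, ["(", "cause", ")"])
print(math.exp(log_p))  # product of 0.9, 0.8 and 0.9
```

Working in log space avoids numerical underflow when the output sequence is long.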
Soft Attention: We further improve our encoder-decoder model using an attention-based sequence-to-sequence model Bahdanau et al. (2014). The attention weights act as an alignment mechanism by re-weighting the encoder hidden states according to their relevance to the current decoder time step.
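The re-weighting of encoder hidden states can be sketched in plain Python. This is a minimal illustration, assuming that the "dot" and "general" attention types reported in the results refer to the commonly used multiplicative score functions (a dot product, optionally with a learned matrix in between); the function names and toy vectors are hypothetical and the "concat" variant is omitted.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(decoder_state: Vector,
           encoder_states: List[Vector],
           weight: Optional[Matrix] = None) -> Tuple[Vector, Vector]:
    """Return (attention weights, context vector).

    weight=None gives "dot" scoring; a square matrix W gives "general"
    scoring, i.e. score = decoder_state . (W @ encoder_state).
    """
    def project(h: Vector) -> Vector:
        if weight is None:
            return h
        return [dot(row, h) for row in weight]

    scores = [dot(decoder_state, project(h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Three toy encoder states (one per sampled frame) and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

The context vector is then combined with the decoder state to predict the next output token; with an identity matrix as `weight`, "general" scoring reduces to "dot" scoring.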
3 Experiments
We developed a new dataset containing 310 videos. These videos were shot in various settings such as sports (table tennis, Frisbee, tennis, rugby), bus stop, dining hall, elevator, classroom, library, garden, study room, etc. On average, each video is about 19 seconds long. With the help of 5 graduate students, we first manually wrote descriptions of each video and then annotated the discourse structure of these descriptions. Each video was annotated by at least 2 students, and we resolved annotation disagreements jointly. As the training data is not large, we chose short videos and described each video using only three sentences, i.e. the discourse structure (RST tree) of each video contains only two relations and two edges. This reduces the total number of parameters that need to be learned in the end-to-end training (step 3 of Algorithm 1).
We evaluate our approach using the following four metrics:
- (a) BLEU score: We used the BLEU score Papineni et al. (2002) to evaluate the translation quality of the discourse structure generated from the videos. We computed our BLEU score on the tokenized predictions and ground truth.
- (b) Relations Accuracy: Each video in our dataset contains two discourse relations. The Relations Accuracy metric is defined as the total number of relations correctly predicted by the model.
- (c) Edges Accuracy: Each video in our dataset contains two edges. The Edges Accuracy metric is defined as the total number of edges (i.e. RST node nuclearity directions) correctly predicted by the model.
- (d) Relations+Edges Accuracy: Here, we compute the correctness of the complete discourse structure, i.e. the predicted discourse structure is considered correct only if all the relations and the edges are correctly predicted by the model.
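The three accuracy metrics can be sketched as follows. This is an illustrative computation, not the paper's evaluation script: it assumes that the correct-prediction counts are normalized to fractions in [0, 1] and that predicted and gold labels are compared position by position; `discourse_accuracies` and the example labels are hypothetical.

```python
from typing import List, Sequence, Tuple

# (relations, edges) for one video; each holds two labels in this dataset.
Structure = Tuple[Sequence[str], Sequence[str]]


def discourse_accuracies(predicted: List[Structure],
                         gold: List[Structure]) -> Tuple[float, float, float]:
    """Return (relations, edges, relations+edges) accuracies in [0, 1]."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need equally sized, non-empty lists")
    rel_hits = rel_total = edge_hits = edge_total = full_hits = 0
    for (p_rel, p_edge), (g_rel, g_edge) in zip(predicted, gold):
        rel_hits += sum(p == g for p, g in zip(p_rel, g_rel))
        rel_total += len(g_rel)
        edge_hits += sum(p == g for p, g in zip(p_edge, g_edge))
        edge_total += len(g_edge)
        # The whole structure counts only if every relation and edge matches.
        full_hits += int(list(p_rel) == list(g_rel)
                         and list(p_edge) == list(g_edge))
    return (rel_hits / rel_total, edge_hits / edge_total,
            full_hits / len(gold))


gold = [
    (("cause", "elaboration"), ("->", "<-")),
    (("contrast", "elaboration"), ("<-", "<-")),
]
pred = [
    (("cause", "elaboration"), ("->", "<-")),
    (("cause", "elaboration"), ("<-", "->")),
]
scores = discourse_accuracies(pred, gold)
print(scores)
```

In this toy example the first video is fully correct and the second has one wrong relation and one wrong edge, so the combined metric is stricter than either individual accuracy.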
We use 210 videos for training, 30 videos for validation and the remaining 70 videos for testing. We fix our sampling rate to 5 fps to bring uniformity to the temporal representation of actions across all videos. These sampled frames are then converted into features using VGGNet Simonyan and Zisserman (2014). We initialize the decoder embedding with Google pre-trained word2vec word embeddings Mikolov et al. (2013). We tune all hyperparameters using our validation data: learning rate, weight initializations, hidden states. We use a 1024-dimension RNN hidden state size. We use the Adam optimizer Kingma and Ba (2015) and apply a dropout of 0.5.
Table 1 presents our results for several configurations of the encoder-decoder RNN model on the four evaluation metrics. The Relations, Edges and Relations+Edges accuracies are scaled between 0 and 1. The LSTM configurations with 512 hidden units and 3 or 4 encoder layers outperformed the other RNN configurations. Table 2 reports the performance of our attention-based sequence-to-sequence model. Attention-based models gave better accuracies than the simple sequence-to-sequence models. In particular, the LSTM with 512 hidden units, a single encoder layer and general attention outperformed the other configurations. These results are encouraging considering the small size of our training dataset.
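Fixing the sampling rate to 5 fps amounts to keeping a regularly spaced subset of each video's frames. The sketch below shows one way to pick the frame indices; the native frame rate of 30 fps is an assumed value (the recording rate is not stated), and `sample_frame_indices` is a hypothetical helper.

```python
from typing import List


def sample_frame_indices(num_frames: int, native_fps: float,
                         target_fps: float = 5.0) -> List[int]:
    """Indices of the frames to keep so the video is sampled at target_fps."""
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("all arguments must be positive")
    # Native frames per sampled frame; never finer than the native rate.
    step = max(1.0, native_fps / target_fps)
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))
        k += 1
    return indices


# A 19-second clip at an assumed native rate of 30 fps.
indices = sample_frame_indices(num_frames=19 * 30, native_fps=30.0)
print(len(indices), indices[:4])
```

Under these assumptions an average-length video yields roughly 95 sampled frames, each of which is then turned into a feature vector for the encoder.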
4 Related Work
**Text-level Discourse Parsing** Several works Carlson et al. (2003); LeThanh et al. (2004); Marcu and Echihabi (2002); Duverle and Prendinger (2009) have been proposed in the past to parse text documents into RST-style discourse representations. Carlson et al. (2003) proposed a data-driven approach using the RST Discourse Treebank. Reitter (2003) proposed chart-parsing-style techniques to derive discourse trees from documents Akula and Zhu (2022); Agarwal et al. (2018); Akula et al. (2019c); Akula (2015); Palakurthi et al. (2015); Agarwal et al. (2017); Dasgupta et al. (2014). LeThanh et al. (2004) proposed a three-stage approach to find a discourse parse. It segments the elementary discourse units (EDUs) into trees for each successive level of the document: first at the sentence level, then at the paragraph level and finally at the document level. However, using a three-stage approach exponentially increased their search space, making it computationally intractable to find the optimal discourse tree. Duverle and Prendinger (2009) proposed end-to-end parsing algorithms. However, none of these works addresses the problem of extracting discourse information from videos.
**Video Captioning** There are a few works on video captioning that come close to our line of research. Video captioning is the task of describing the content of a video. Guadarrama et al. (2013); Thomason et al. (2014); Yu et al. (2016) proposed a multi-stage approach to identify dependency roles such as subject and object in order to generate a description of the video. More recently, Venugopalan et al. (2016) proposed a sequence-to-sequence model with encoder and decoder RNNs. Pasunuru and Bansal (2017) proposed a multi-task learning approach to generate captions from the video. However, all of these works mainly focus on generating text from the video rather than on understanding relationships between the scenes (or frames).
5 Conclusions
We introduced a new AI task - Visual Discourse Parsing, where the AI agent needs to understand discourse relations among scenes in a video. We presented an end-to-end learning approach to identify the discourse structure of videos. Central to our approach is the use of text descriptions of videos to identify discourse relations. In the future, we plan to extend this dataset to include longer videos that need more than three sentences to describe. We also intend to experiment with multi-task learning approaches. Our results indicate that there is significant scope for improvement.
References
- Agarwal et al. (2017) Shivali Agarwal, Vishalaksh Aggarwal, Arjun R Akula, Gargi Banerjee Dasgupta, and Giriprasad Sridhara. 2017. Automatic problem extraction and analysis from unstructured text in IT tickets. IBM Journal of Research and Development 61(1):4–41.
- Agarwal et al. (2018) Shivali Agarwal, Arjun R Akula, Gaargi B Dasgupta, Shripad J Nadgowda, and Tapan K Nayak. 2018. Structured representation and classification of noisy and unstructured tickets in service delivery. US Patent 10,095,779.
- Akula et al. (2021a) Arjun Akula, Spandana Gella, Keze Wang, Song-Chun Zhu, and Siva Reddy. 2021a. Mind the context: The impact of contextualization in neural module networks for grounding visual referring expressions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6398–6416.
- Akula et al. (2021b) Arjun Akula, Varun Jampani, Soravit Changpinyo, and Song-Chun Zhu. 2021b. Robust visual reasoning via language guided neural module networks. Advances in Neural Information Processing Systems 34.
- Akula et al. (2013) Arjun Akula, Rajeev Sangal, and Radhika Mamidi. 2013. A novel approach towards incorporating context processing capabilities in NLIDB system. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1216–1222.
- Akula and Zhu (2022) Arjun Akula and Song-Chun Zhu. 2022. Effective representation to capture collaboration behaviors between explainer and user. arXiv preprint arXiv:2201.03147.
- Akula (2015) Arjun R Akula. 2015. A novel approach towards building a generic, portable and contextual NLIDB system. International Institute of Information Technology Hyderabad.
- Akula et al. (2021c) Arjun R Akula, Beer Changpinyo, Boqing Gong, Piyush Sharma, Song-Chun Zhu, and Radu Soricut. 2021c. Crossvqa: Scalably generating benchmarks for systematically testing vqa generalization.
