Visual Understanding and Narration: A Deeper Understanding and Explanation of Visual Scenes
Stephanie M. Lukin, Claire Bonial, and Clare R. Voss

TL;DR
This paper introduces the task of Visual Understanding and Narration, where an agent generates descriptive text for images captured during navigation, aiming to enhance interpretability of visual scenes.
Contribution
It formalizes the task of visual narration for robots, proposing methods for generating open-ended descriptive text based on visual data.
Findings
Proposed a framework for visual narration in robotic navigation
Demonstrated the system's ability to answer open-ended questions about scenes
Improved understanding of scene context through narration
Abstract
We describe the task of Visual Understanding and Narration, in which a robot (or agent) generates text for the images that it collects when navigating its environment, by answering open-ended questions, such as 'what happens, or might have happened, here?'
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Visual Understanding and Narration:
A Deeper Understanding and Explanation of Visual Scenes
Stephanie M. Lukin*, Claire Bonial*, and Clare R. Voss
U.S. Army Research Laboratory
Adelphi, MD 20783
* Indicates equal contribution
1 Introduction
We describe the task of Visual Understanding and Narration, in which a robot (or agent) generates text for the images that it collects when navigating its environment, by answering open-ended questions, such as “what happens, or might have happened, here?” This task was first explored in Lukin et al. (2018a), where humans wrote narratives answering such questions about images taken by a robot (e.g., Fig. 1) during our human-robot interaction research (Lukin et al., 2018b). The intersection of object identification and text generation has been explored by Das et al. (2017) and Antol et al. (2015); however, these works stop short of the inferencing and narration that our task requires. Zellers et al. (2018) and Goyal et al. (2017) exploit commonsense knowledge of stereotypical, human-centric scenarios in individual images and video clips, respectively, yet convey their deductions through multiple choice or slot-filling rather than by generating language or narratives. We briefly survey related current technology and resources, and then sketch our two-pronged approach to bridging the gaps between these fundamental tasks and the requirements of the new task.
2 Visual Understanding
Addressing the gap between recognizing particular objects in images and reasoning about why they may be present in a given physical environment requires commonsense knowledge.
2.1 Commonsense Gaps
Commonsense knowledge about objects for computer vision has been shown both to improve object and activity recognition and to provide additional information necessary for deeper reasoning (Gupta and Malik, 2015; Yatskar et al., 2016b; Ronchi and Perona, 2015). This type of object knowledge is primarily visual, supporting tasks such as object and activity recognition, as well as transfer learning to visually similar objects and scenes. Such knowledge has included spatial relations (Yatskar et al., 2016a), shape similarity to other objects, and visual attributes such as color (Singh et al., 2018).
However, the knowledge humans exploit when analyzing an environment goes beyond visual clues. To interpret Fig. 1 possibly as a kitchen, a system needs not only to recognize the objects, but to know which actions are commonly performed with these objects, and then to infer where such actions may occur. The actions, termed ‘object affordances’ (Gibson, 1979), have been defined in computer vision studies as the combination of: an affordance label, a human pose representation of the action, and a relative position of the object with respect to the human (Grabner et al., 2011; Kjellström et al., 2011; Yao and Huang, 2018; Zhu et al., 2014). Though the latter two can be extracted from visual data, a challenge is how to systematically collect appropriate affordance labels for shared re-use by vision and language researchers, reducing the redundant labor of independent, manual assignments of verbs as labels for small, fixed sets of objects (e.g., sit-on – chair).
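To make the three-part definition of an affordance concrete, the following is a minimal Python sketch of how such a record might be represented. It is an illustration only, not a format used by the cited works: the class name `Affordance`, the field encodings (2D joint coordinates for the pose, a 3D offset for the relative position), the added `object_name` field, and the helper `labels_for` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Affordance:
    """One affordance record: label + human pose + object position relative to the human.

    All field encodings are assumptions made for illustration.
    """
    label: str                                     # verb-like label, e.g. "sit-on"
    object_name: str                               # object the label applies to (added for lookup)
    human_pose: Tuple[Tuple[float, float], ...]    # assumed: 2D joint coordinates
    relative_position: Tuple[float, float, float]  # assumed: object offset from the human


def labels_for(object_name: str, records: Iterable[Affordance]) -> List[str]:
    """Return the distinct affordance labels recorded for one object, sorted."""
    return sorted({r.label for r in records if r.object_name == object_name})


# Toy records; the numeric values are placeholders, not measured data.
records = [
    Affordance("sit-on", "chair", ((0.0, 1.0), (0.0, 0.5)), (0.0, -0.4, 0.0)),
    Affordance("cook-with", "pot", ((0.0, 1.0), (0.3, 0.8)), (0.4, 0.0, 0.2)),
]
```

Under these assumptions, `labels_for("chair", records)` collects the shared labels for an object, which is the part of the record the text identifies as hard to gather systematically; the pose and position fields stand in for what can be extracted from visual data.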
2.2 Approach to Bridging Gaps
‘Qualia’ are relations associated with a particular object (Pustejovsky, 1991), including Agentive (created_by), Telic (functions_as, used_for), Constitutive (part_of, made_of), and Formal (is_a), providing a rich source of commonsense and affordance information, and a framework for disambiguating senses of a word (e.g., book: physical item vs. content). They have been demonstrated as useful knowledge representations for intelligent agents (McDonald et al., 2013; Pustejovsky et al., 2017; Narayana et al., 2018).
A comprehensive set of qualia relations has yet to be defined and organized. We are tackling this challenge and aim to make the qualia usable for visual understanding tasks: qualia have been automatically extracted and evaluated for quality via crowdsourcing (Kazeminejad et al., 2018), then encoded as relations between entities and events in the Rich Event Ontology (REO) (Bonial et al., 2016). Assuming, for example, that the objects in Fig. 1 can be recognized accurately, the resulting list of objects (e.g., pot, cereal) can first be queried for their qualia in REO, to discover that pot is used_for cooking, and that cereal functions_as nourishing and is_a Prepared_Food. These activities and object classes can next be queried for their common locations in REO via their semantic roles, to discover that cooking Prepared_Food returns kitchen. In this approach, the objects, their affordances, and REO roles would support the inference that this space functions_as a kitchen.
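The two-step lookup described above (objects to qualia, then activities and object classes to a location) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the REO interface: the dictionaries `QUALIA` and `LOCATIONS` are hypothetical in-memory stand-ins holding only the pot/cereal example from the text, and the function name `infer_space` and its voting rule are introduced here for illustration.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical stand-in for qualia stored in an ontology (only the paper's example).
QUALIA = {
    "pot": {"used_for": ["cooking"]},
    "cereal": {"functions_as": ["nourishing"], "is_a": ["Prepared_Food"]},
}

# Hypothetical stand-in for role-based location lookup: (activity, object class) -> location.
LOCATIONS = {
    ("cooking", "Prepared_Food"): "kitchen",
}

TELIC_RELATIONS = ("used_for", "functions_as")


def infer_space(objects: Iterable[str]) -> Optional[str]:
    """Guess what a space functions_as from the objects recognized in it."""
    activities, classes = set(), set()
    # Step 1: query each recognized object for its Telic and Formal qualia.
    for obj in objects:
        qualia = QUALIA.get(obj, {})
        for relation in TELIC_RELATIONS:
            activities.update(qualia.get(relation, []))
        classes.update(qualia.get("is_a", []))
    # Step 2: look up common locations for each (activity, class) pair and vote.
    votes = Counter(
        LOCATIONS[(activity, cls)]
        for activity in sorted(activities)
        for cls in sorted(classes)
        if (activity, cls) in LOCATIONS
    )
    return votes.most_common(1)[0][0] if votes else None
```

With these toy tables, `infer_space(["pot", "cereal"])` yields `"kitchen"`, and an unrecognized object list yields `None`. A real system would replace the dictionaries with ontology queries and would need to handle object-recognition errors, which this sketch assumes away.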
3 Narrative Building
Once the visual scene is interpreted, we determine what is needed to answer the task question via content selection and narrative generation.
3.1 Generation Gaps
Content selection, or framing, is the relationship between the narrating agent and what they know and choose to talk about (Lönneker, 2005). The choice of appropriate framing device depends on the intended audience of the final narrative. Many recent works in vision treat framing as an observational task, describing the image in a single sentence (Rashtchian et al., 2010; Hodosh et al., 2013; Lin et al., 2014; Chen et al., 2015; Krishna et al., 2016). (Ferraro et al. (2015) survey vision and language resources; the framing prompts are similar to those listed here.) This limits the scope to the visually observable and restricts what can be learned by extrapolation from the past or to the future. With just a handful of open-ended prompts, e.g., “what happened”, creative scene interpretations can be elicited that go beyond single sentences (Gordon and Roemmele, 2014; Huang et al., 2016; Vaidyanathan et al., 2018).
After assessing what to talk about, the narrating agent must establish how to talk about it. Recent neural vision and text models rely solely on crowd-sourced data for guidance in this phase of narrative crafting (Park and Kim, 2015; Yu et al., 2017; Huang et al., 2016; Fan et al., 2018; Wang et al., 2018). Much can be learned from narratological studies, such as the categorization, combination, and presentation of narrative elements (Labov and Waletzky, 1997; Rahimtoroghi et al., 2013; Niehaus and Young, 2009; Lehnert, 1981; Elson, 2012). However, the template-based approaches (Montfort, 2007; Callaway and Lester, 2002) and statistical models (Li, 2015) that have successfully leveraged these elements for content selection and narrative shaping in text-based story generation have not yet been applied to visual narration.
3.2 Approach to Bridging Gaps
Lukin et al. (2018a) performed a pilot data collection with framing to elicit a narrative connecting a sequence of images. In our ongoing work, preliminary analysis of human-authored narratives about Fig. 1 has found both extrapolation beyond the observable in the image (“[someone intends] to live here at least until they finish the project that they are working on”) and creative causal reasoning for what is not visually depicted in the image (“[someone] is pulling an all-nighter and brought breakfast for the next morning”).
4 Next Steps
The two prongs of our approach provide complementary information for identifying and reasoning about a visual scene, from which succinct and targeted text can be generated in support of human-robot interactions to talk about what happens in the robot’s environment. Qualia encoded in REO provide bottom-up, commonsense knowledge for reasoning, and existing narrative schema can be applied in a top-down manner to formulate narratives, leveraging content from crowd-sourced narrative elements. Our ontology and crowdsourced annotations will be made available to the community, supplementing existing resources.
References
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
- Claire Bonial, David Tahmoush, Susan Windisch Brown, and Martha Palmer. 2016. Multimodal Use of an Upper-Level Event Ontology. In Proceedings of the Fourth Workshop on Events, pages 18–26.
- Charles B Callaway and James C Lester. 2002. Narrative Prose Generation. Artificial Intelligence, 139(2):213–252.
- Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv preprint arXiv:1504.00325.
- Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335.
- David Elson. 2012. DramaBank: Annotating Agency in Narrative Discourse. In LREC, pages 2813–2819.
- Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical Neural Story Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898. Association for Computational Linguistics.
- Francis Ferraro, Nasrin Mostafazadeh, Lucy Vanderwende, Jacob Devlin, Michel Galley, Margaret Mitchell, et al. 2015. A Survey of Current Datasets for Vision and Language Research. arXiv preprint arXiv:1506.06833.
