Co-attentional Transformers for Story-Based Video Understanding

Björn Bebensee; Byoung-Tak Zhang

arXiv:2010.14104 · cs.CV · October 28, 2020

TL;DR

This paper introduces a co-attentional transformer model for story-based video understanding that aims to capture long-term dependencies and character interactions in visual narratives. On the DramaQA dataset, it outperforms the baseline model by 8 percentage points overall.

Contribution

The paper presents a novel co-attentional transformer architecture tailored for visual story understanding, improving long-term dependency modeling in video question answering tasks.

Findings

01 Outperforms the baseline model by 8 percentage points overall

02 Achieves at least 4.95 and up to 12.8 percentage points improvement across difficulty levels

03 Beats the winner of the DramaQA challenge

Abstract

Inspired by recent trends in vision and language learning, we explore applications of attention mechanisms for visio-lingual fusion within an application to story-based video understanding. Like other video-based QA tasks, video story understanding requires agents to grasp complex temporal dependencies. However, as it focuses on the narrative aspect of video it also requires understanding of the interactions between different characters, as well as their actions and their motivations. We propose a novel co-attentional transformer model to better capture long-term dependencies seen in visual stories such as dramas and measure its performance on the video question answering task. We evaluate our approach on the recently introduced DramaQA dataset which features character-centered video story understanding questions. Our model outperforms the baseline model by 8 percentage points overall,…
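
The abstract does not describe the architecture in detail, so the following is only a minimal sketch of the general co-attention idea, under stated assumptions: two streams (video features and text features), where each stream forms queries from its own tokens and attends over the keys and values of the other stream, followed by a residual connection. The names `cross_attention`, `co_attention`, `random_matrix`, the single attention head, the feature size, and the random weights are all hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: `queries` attend over `context`."""
    q = matmul(queries, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose keys
    scores = matmul(q, k_t)                        # (n_queries, n_context)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def co_attention(video, text, params):
    """Assumed co-attention step: each modality attends to the other one."""
    video_ctx, video_w = cross_attention(video, text, *params["video"])
    text_ctx, text_w = cross_attention(text, video, *params["text"])
    return add(video, video_ctx), add(text, text_ctx), video_w, text_w


def random_matrix(rng, rows, cols, scale=1.0):
    """Hypothetical helper: Gaussian matrix standing in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 8                                   # assumed feature size
video = random_matrix(rng, 5, d)        # 5 video tokens (e.g. frames)
text = random_matrix(rng, 3, d)         # 3 text tokens (e.g. question words)
params = {
    name: tuple(random_matrix(rng, d, d, d ** -0.5) for _ in range(3))
    for name in ("video", "text")
}
video_new, text_new, video_w, text_w = co_attention(video, text, params)
print(len(video_new), len(video_new[0]), len(text_new), len(text_new[0]))
```

Each row of `video_w` is a distribution over the text tokens, and each row of `text_w` is a distribution over the video tokens, so both streams keep their own length while mixing in information from the other modality. A full transformer layer would typically add multiple heads, layer normalization, and a feed-forward sublayer, which are left out here to keep the sketch short.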


Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning