# Two-stream Spatiotemporal Feature for Video QA Task

**Authors:** Chiwan Song, Woobin Im, and Sung-eui Yoon

arXiv: 1907.05006 · 2019-07-12

## TL;DR

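This paper introduces a two-stream neural network with attention mechanisms for video question answering, combining spatiotemporal video features with textual features from the question. On the TVQA dataset it improves results in the textual-only setting, while results with visual features expose both limitations and potential.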

## Contribution

The paper proposes a multi-channel two-stream network with squeeze-and-excitation channel-wise attention, plus a context matching module with a level adjusting layer that jointly models visual and textual features for video question answering.
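The abstract describes the channel-wise attention as a squeeze-and-excitation structure applied to the two-stream features. The following is a minimal sketch of a generic squeeze-and-excitation gate in plain Python, not the authors' implementation. The function name `squeeze_excite`, the weight matrices `w1` and `w2`, the reduction ratio `r`, and the feature layout (channels by temporal positions) are all assumptions made for illustration, and biases are omitted.

```python
import math
import random


def squeeze_excite(features, w1, w2):
    """Generic squeeze-and-excitation gate (illustrative sketch).

    features: list of C channels, each a list of T floats (assumed layout).
    w1: (C // r) x C bottleneck weights; w2: C x (C // r) expansion weights.
    Returns the rescaled features and the per-channel gates.
    """
    # Squeeze: average each channel over its positions.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation, step 1: bottleneck linear layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expansion linear layer followed by a sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[g * x for x in ch] for g, ch in zip(gates, features)]
    return scaled, gates


# Hypothetical sizes: 4 channels, 3 positions, reduction ratio 2.
random.seed(0)
C, T, r = 4, 3, 2
features = [[random.random() for _ in range(T)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
scaled, gates = squeeze_excite(features, w1, w2)
```

Each gate lies between 0 and 1, so the block can only attenuate channels relative to one another; in a trained network `w1` and `w2` would be learned rather than sampled.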

## Key findings

- Improved results in the textual-only setting on the TVQA dataset.
- Identified limitations and potential of visual features in the proposed model.
- Demonstrated the effectiveness of attention mechanisms in joint spatiotemporal and textual modeling.

## Abstract

Understanding the content of videos is one of the core techniques for developing helpful real-world applications, such as recognizing human actions for surveillance systems or analyzing customer behavior in an autonomous shop. However, understanding the content or story of a video remains a challenging problem because of its sheer amount of data and its temporal structure. In this paper, we propose a multi-channel neural network that adopts a two-stream network structure, which has shown high performance in the human action recognition field, and use it as a spatiotemporal video feature extractor for the video question answering task. We also add a squeeze-and-excitation structure to the two-stream network to obtain a channel-wise attended spatiotemporal feature. To jointly model the spatiotemporal features from the video and the textual features from the question, we design a context matching module with a level adjusting layer, which removes the information gap between visual and textual features by applying an attention mechanism to the joint modeling. Finally, we adopt a scoring mechanism and a smoothed ranking loss objective function to select the correct answer from the answer candidates. We evaluate our model on the TVQA dataset. Our approach shows improved results in the textual-only setting, while the results with visual features show both the limitations and the potential of our approach.
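The answer selection step can be illustrated with a small sketch. The abstract does not give the formula for the smoothed ranking loss, so the code below assumes one common smooth surrogate of the pairwise hinge loss: the softplus of each score difference between a wrong candidate and the correct one. The function names `softplus`, `smoothed_ranking_loss`, and `predict`, and the example scores, are hypothetical and may differ from the paper's formulation.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def smoothed_ranking_loss(scores, correct_idx):
    """Sum of softplus(negative score - positive score) over wrong candidates.

    Assumed form of a smoothed ranking loss; the loss shrinks toward zero as
    the correct candidate outscores every other candidate by a wide margin.
    """
    pos = scores[correct_idx]
    return sum(
        softplus(neg - pos) for i, neg in enumerate(scores) if i != correct_idx
    )


def predict(scores):
    """Select the candidate with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical scores for five answer candidates; candidate 1 is correct.
scores = [0.2, 1.5, -0.3, 0.7, 0.1]
loss = smoothed_ranking_loss(scores, correct_idx=1)
best = predict(scores)
```

Unlike a hard hinge, the softplus form has a nonzero gradient everywhere, so every wrong candidate keeps contributing a small training signal even after it is ranked below the correct answer.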

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.05006/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1907.05006/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/1907.05006/full.md

---
Source: https://tomesphere.com/paper/1907.05006