# Dual-modality seq2seq network for audio-visual event localization

**Authors:** Yan-Bo Lin, Yu-Jhe Li, Yu-Chiang Frank Wang

arXiv: 1902.07473 · 2020-08-07

## TL;DR

This paper introduces AVSDN, a deep neural network that jointly processes audio and visual data for precise localization of events in videos, outperforming recent methods in both supervised and weakly supervised scenarios.

## Contribution

The paper presents a dual-modality seq2seq network that takes audio and visual features at each time segment as joint inputs for event localization in videos, and that can be trained in either fully supervised or weakly supervised settings.

## Key findings

- Performs favorably against recent deep learning approaches in the paper's experiments
- Effective in both fully supervised and weakly supervised settings
- Learns global and local event information from combined audio-visual data

## Abstract

Audio-visual event localization requires one to identify the event which is both visible and audible in a video (either at a frame or video level). To address this task, we propose a deep neural network named Audio-Visual sequence-to-sequence dual network (AVSDN). By jointly taking both audio and visual features at each time segment as inputs, our proposed model learns global and local event information in a sequence to sequence manner, which can be realized in either fully supervised or weakly supervised settings. Empirical results confirm that our proposed method performs favorably against recent deep learning approaches in both settings.
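
The two output granularities named in the abstract (a label per time segment versus a single label per video) can be illustrated with a small sketch. This is not AVSDN itself: the function names, the concatenation-based fusion, and the mean pooling over segments are all assumptions made for illustration, since the abstract does not say how the model fuses the modalities or aggregates segment predictions.

```python
from typing import List

Vector = List[float]


def fuse_segments(audio: List[Vector], visual: List[Vector]) -> List[Vector]:
    """Concatenate the audio and visual feature vectors of each time segment.

    Concatenation is an assumed, simplest-possible fusion; the paper may
    combine the two modalities differently.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual must cover the same segments")
    return [a + v for a, v in zip(audio, visual)]


def argmax(values: Vector) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def segment_predictions(scores: List[Vector]) -> List[int]:
    """Segment-level output: one predicted class index per time segment."""
    return [argmax(s) for s in scores]


def video_prediction(scores: List[Vector]) -> int:
    """Video-level output: one class index for the whole video.

    Mean pooling over segments is an assumption, not the paper's stated
    aggregation rule.
    """
    if not scores:
        raise ValueError("scores must contain at least one segment")
    num_classes = len(scores[0])
    pooled = [sum(s[c] for s in scores) / len(scores) for c in range(num_classes)]
    return argmax(pooled)
```

With hypothetical per-segment class scores, `segment_predictions` returns one label per segment, while `video_prediction` returns a single label for the whole video, which is the only kind of label a video-level annotation provides.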

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.07473/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1902.07473/full.md

## References

22 references — full list in the complete paper: https://tomesphere.com/paper/1902.07473/full.md

---
Source: https://tomesphere.com/paper/1902.07473