AVATAR submission to the Ego4D AV Transcription Challenge

Paul Hongsuck Seo; Arsha Nagrani; Cordelia Schmid

arXiv:2211.09966 · cs.CV · November 21, 2022

TL;DR

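This report presents an AVATAR-based submission to the Ego4D AV Speech Transcription Challenge 2022. Built on AVATAR, a state-of-the-art audiovisual speech recognition (AV-ASR) model, the submission achieved a WER of 68.40 on the challenge test set, outperforming the baseline by 43.7% and winning the challenge.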

Contribution

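The report describes a challenge pipeline based on AVATAR, an encoder-decoder AV-ASR model that performs early fusion of spectrograms and RGB images, together with the datasets, experimental settings, and ablations behind the winning entry.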

Findings

1. Achieved a WER of 68.40 on the challenge test set
2. Outperformed the baseline by 43.7%
3. Won the Ego4D AV Speech Transcription Challenge 2022

Abstract

In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images. We describe the datasets, experimental settings, and ablations. Our final method achieves a WER of 68.40 on the challenge test set, outperforming the baseline by 43.7% and winning the challenge.
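
For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate (WER) is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words and reported as a percentage. The function name `word_error_rate` is hypothetical, and this is not the challenge's official scoring script; any text normalization applied by the organizers is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using whitespace tokenization (an assumption)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the dog sat"))
    print(word_error_rate("hello", "hello there big world"))
```

Because insertions count as errors while the denominator is the reference length, WER is not bounded by 100; a hypothesis that adds many spurious words can score well above it.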

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing

Methods: Test