AVATAR submission to the Ego4D AV Transcription Challenge
Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

TL;DR
This paper presents AVATAR, a state-of-the-art AV-ASR model that achieved top performance in the Ego4D AV Speech Transcription Challenge 2022, significantly outperforming the baseline.
Contribution
The paper introduces AVATAR, a novel encoder-decoder AV-ASR model that performs early fusion of spectrograms and RGB images. The submission built on this model won the challenge.
Findings
Achieved a WER of 68.40 on the challenge test set
Outperformed the baseline by 43.7%
Won the Ego4D AV Speech Transcription Challenge 2022
Abstract
In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images. We describe the datasets, experimental settings, and ablations. Our final method achieves a WER of 68.40 on the challenge test set, outperforming the baseline by 43.7% and winning the challenge.
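For readers unfamiliar with the metric, WER (word error rate) is conventionally the word-level edit distance between the reference transcript and the model output, divided by the number of reference words and reported as a percentage. The sketch below illustrates that standard definition only. The function name `word_error_rate` is hypothetical, and the challenge's actual scoring script and any text normalization it applies are not described here, so this should not be read as the official evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance.

    Assumption: words are split on whitespace with no further
    normalization (casing, punctuation); real evaluations often normalize.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
    print(round(word_error_rate("the cat sat on the mat",
                                "the cat sit on mat"), 2))
```

Because insertions count as errors while the denominator is the reference length, WER under this definition can exceed 100 when the hypothesis contains many extra words.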
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Test
