QuAVF: Quality-aware Audio-Visual Fusion for Ego4D Talking to Me Challenge

Hsi-Che Lin; Chien-Yi Wang; Min-Hung Chen; Szu-Wei Fu; Yu-Chiang Frank Wang

arXiv:2306.17404 · cs.CV · July 3, 2023 · 1 citation

TL;DR

This paper presents QuAVF, a dual-model approach for audio-visual fusion in ego-centric video analysis, utilizing face quality scores for filtering and fusion, achieving top performance in the Ego4D TTM challenge.

Contribution

The novel use of separate models for audio and video processing combined with face quality-aware fusion improves performance in ego-centric talking-to-me tasks.

Findings

01. Achieved 67.4% mAP on the test set, ranking first on the leaderboard.
02. Used face quality scores for both data filtering and fusion.
03. Outperformed the baseline method by a large margin.

Abstract

This technical report describes our QuAVF@NTU-NVIDIA submission to the Ego4D Talking to Me (TTM) Challenge 2023. Based on observations about the TTM task and the provided dataset, we propose to use two separate models to process the input video and audio. By doing so, we can utilize all the labeled training data, including samples without bounding box labels. Furthermore, we leverage the face quality score from a facial landmark prediction model to filter noisy face input data. The face quality score is also employed in our proposed quality-aware fusion, which integrates the results from the two branches. With this simple architecture design, our model achieves 67.4% mean average precision (mAP) on the test set, which ranks first on the leaderboard and outperforms the baseline method by a large margin. Code is available at: https://github.com/hsi-che-lin/Ego4D-QuAVF-TTM-CVPR23
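The abstract does not spell out the fusion rule, so the following is only a minimal sketch of how a face quality score could drive both filtering and fusion. It assumes a quality score in [0, 1], a fixed filtering threshold, and a convex combination of the two branch scores weighted by quality; the names `FramePrediction`, `filter_low_quality`, and `quality_aware_fusion` are hypothetical and are not taken from the authors' repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FramePrediction:
    """Per-frame outputs of the two branches (hypothetical container)."""
    audio_score: float             # audio-branch probability of "talking to me"
    visual_score: Optional[float]  # visual-branch probability, None if no face crop
    face_quality: float            # assumed face quality score in [0, 1]


def filter_low_quality(samples: List[FramePrediction],
                       threshold: float = 0.5) -> List[FramePrediction]:
    """Keep samples that have a face crop whose quality clears the threshold.

    The threshold value is an assumption made for illustration.
    """
    return [s for s in samples
            if s.visual_score is not None and s.face_quality >= threshold]


def quality_aware_fusion(sample: FramePrediction) -> float:
    """Blend the branch scores, trusting the visual branch more when the
    face quality is high (assumed rule: convex combination weighted by quality).
    """
    if sample.visual_score is None:
        # No face available: fall back to the audio branch alone.
        return sample.audio_score
    q = min(max(sample.face_quality, 0.0), 1.0)  # clamp quality to [0, 1]
    return q * sample.visual_score + (1.0 - q) * sample.audio_score
```

With this rule, a frame with a sharp, frontal face is scored mostly by the visual branch, while a blurry or missing face defers to the audio branch. This also illustrates why separate models allow audio-only samples (those without bounding boxes) to remain usable.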

Code & Models

Repositories

hsi-che-lin/ego4d-quavf-ttm-cvpr23
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Advanced Data Compression Techniques