Missingness-resilient Video-enhanced Multimodal Disfluency Detection

Payal Mohapatra; Shamika Likhite; Subrata Biswas; Bashima Islam; Qi Zhu

arXiv:2406.06964·cs.CL·June 12, 2024

Open Access · 1 Repo

TL;DR

This paper introduces a multimodal disfluency detection method that combines audio with video and stays accurate when video is missing for some samples at inference time. It relies on a newly curated audiovisual dataset and a fusion technique built on weight-sharing, modality-agnostic encoders.

Contribution

The paper presents a new audiovisual dataset and a fusion approach with shared encoders that handles missing video data during inference.

Findings

01 Outperforms audio-only methods by an average of 10 percentage points when both modalities are always available.

02 Still achieves a 7% improvement when video is missing in half of the samples.

03 Results hold across five disfluency-detection tasks.
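The abstract specifies that the 10% figure is an absolute gain, that is, a difference in percentage points. The short sketch below shows how an absolute gain differs from a relative one. The baseline and multimodal scores are hypothetical placeholders chosen only to make the arithmetic visible; they are not results from the paper.

```python
def absolute_gain(baseline_pct: float, new_pct: float) -> float:
    """Difference in percentage points between two scores."""
    return new_pct - baseline_pct


def relative_gain(baseline_pct: float, new_pct: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    return 100.0 * (new_pct - baseline_pct) / baseline_pct


# Hypothetical scores for illustration only (NOT taken from the paper).
audio_only = 70.0
multimodal = 80.0

print(absolute_gain(audio_only, multimodal))  # 10.0 percentage points
print(round(relative_gain(audio_only, multimodal), 1))  # larger than 10 for this baseline
```

With any baseline below 100, the same 10-point absolute gain corresponds to a larger relative gain, which is why the distinction matters when comparing reported numbers.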

Abstract

Most existing speech disfluency detection techniques rely only on acoustic data. In this work, we present a practical multimodal disfluency detection approach that leverages available video data together with audio. We curate an audiovisual dataset and propose a novel fusion technique with unified weight-sharing modality-agnostic encoders to learn the temporal and semantic context. Our resilient design accommodates real-world scenarios where the video modality may sometimes be missing during inference. We also present alternative fusion strategies for when both modalities are assured to be complete. In experiments across five disfluency-detection tasks, our unified multimodal approach significantly outperforms audio-only unimodal methods, yielding an average absolute improvement of 10% (i.e., 10 percentage point increase) when both video and audio modalities are always available, and 7%…
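To make the idea of a weight-sharing, modality-agnostic encoder with optional video more concrete, here is a minimal sketch in plain Python. It is not the authors' architecture (see the linked repository for that). The feature size `DIM`, the helper names, the mean-pooling step, and the choice to substitute a zero embedding when video is absent are all assumptions made for illustration.

```python
import math
import random

random.seed(0)

DIM = 4  # hypothetical feature size, assumed identical for audio and video


def make_linear(n_in, n_out):
    """Random weight matrix: n_out rows, each holding n_in weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights, x):
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# A single set of encoder weights is reused for both modalities (weight sharing).
SHARED_ENCODER = make_linear(DIM, DIM)
CLASSIFIER = make_linear(2 * DIM, 1)


def encode(frames):
    """Encode every frame with the shared weights, then mean-pool over time."""
    encoded = [[math.tanh(v) for v in linear(SHARED_ENCODER, f)] for f in frames]
    return [sum(col) / len(encoded) for col in zip(*encoded)]


def predict(audio_frames, video_frames=None):
    """Return a disfluency probability; video_frames=None means video is missing."""
    audio_emb = encode(audio_frames)
    # Assumption: a missing video stream is replaced by a zero embedding.
    video_emb = encode(video_frames) if video_frames is not None else [0.0] * DIM
    logit = linear(CLASSIFIER, audio_emb + video_emb)[0]  # concatenate, then classify
    return 1.0 / (1.0 + math.exp(-logit))


audio = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1]]
video = [[0.5, 0.1, -0.2, 0.0]]
print(predict(audio, video))  # both modalities present
print(predict(audio))         # video missing at inference
```

Because the same `encode` function serves both streams, the model has no video-specific encoder parameters to leave untrained or unused when video is absent; only the fused input to the classifier changes.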

Code & Models

Repositories

payalmohapatra/Multimodal-Speech-Disfluency
PyTorch · Official

Taxonomy

Topics: Subtitles and Audiovisual Media · Speech and Audio Processing