Temporally Aligning Long Audio Interviews with Questions: A Case Study in Multimodal Data Integration

Piyush Singh Pasi; Karthikeya Battepati; Preethi Jyothi; Ganesh Ramakrishnan; Tanmay Mahapatra; Manoj Singh

arXiv:2310.06702·cs.CL·October 11, 2023

TL;DR

This paper introduces INDENT, a cross-attention-based framework for aligning long audio interviews with questions, improving retrieval accuracy in multilingual, noisy, real-world survey recordings without requiring verbatim text matches.

Contribution

The work presents a novel cross-attention model that leverages temporal sentence order and semantic embeddings to align questions with long audio recordings in multiple languages.
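As a rough illustration of the attention idea named above, the sketch below scores a question embedding against a sequence of audio-segment embeddings with scaled dot-product attention. It is not the INDENT architecture: the function names, the toy two-dimensional embeddings, and the absence of any temporal-order prior are all assumptions made for this example.

```python
import math

def attention_weights(question, segments):
    """Scaled dot-product attention of one question over audio segments.

    question: list of floats (hypothetical question embedding)
    segments: list of equally sized float lists (hypothetical segment embeddings)
    Returns one weight per segment; the weights sum to 1.
    """
    d = len(question)
    logits = [
        sum(q * s for q, s in zip(question, seg)) / math.sqrt(d)
        for seg in segments
    ]
    # Subtract the maximum logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_segment(question, segments):
    """Index of the segment that receives the highest attention weight."""
    weights = attention_weights(question, segments)
    return max(range(len(weights)), key=weights.__getitem__)
```

With toy segments `[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]` and question `[0.0, 1.0]`, the second segment has the largest dot product with the question, so `best_segment` selects index 1. A real system would obtain embeddings from speech and text encoders rather than hand-written vectors.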

Findings

01

Retrieval accuracy improves by about 3% R-avg over text-based heuristics.

02

Effective use of noisy ASR outputs for better alignment.

03

Model trained on Hindi generalizes to 11 Indic languages.
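The first finding is reported as R-avg, which this page does not define. A common convention in retrieval work is to average recall@k over a few cutoffs, and the sketch below assumes that reading. The cutoffs `(1, 5, 10)`, the function names, and the input layout are assumptions for illustration, not the paper's evaluation code.

```python
def recall_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold segment appears in the top-k results.

    ranked_lists: one ranked list of segment ids per question
    gold: the correct segment id for each question
    """
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)

def r_avg(ranked_lists, gold, ks=(1, 5, 10)):
    """Mean of recall@k over the cutoffs in ks (assumed definition of R-avg)."""
    return sum(recall_at_k(ranked_lists, gold, k) for k in ks) / len(ks)
```

For two questions with rankings `[[2, 0, 1], [1, 2, 0]]` and gold segment 2 in both cases, recall@1 is 0.5 and recall@2 is 1.0, so `r_avg(..., ks=(1, 2))` is 0.75.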

Abstract

The problem of audio-to-text alignment has seen a significant amount of research using complete supervision during training. However, this is typically not in the context of long audio recordings wherein the text being queried does not appear verbatim within the audio file. This work is a collaboration with a non-governmental organization called CARE India that collects long audio health surveys from young mothers residing in rural parts of Bihar, India. Given a question drawn from a questionnaire that is used to guide these surveys, we aim to locate where the question is asked within a long audio recording. This is of great value to African and Asian organizations that would otherwise have to painstakingly go through long and noisy audio recordings to locate questions (and answers) of interest. Our proposed framework, INDENT, uses a cross-attention-based model and prior information on…

Code & Models

Repositories

piyushsinghpasi/INDENT
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis