Temporally Aligning Long Audio Interviews with Questions: A Case Study in Multimodal Data Integration
Piyush Singh Pasi, Karthikeya Battepati, Preethi Jyothi, Ganesh Ramakrishnan, Tanmay Mahapatra, Manoj Singh

TL;DR
This paper introduces INDENT, a cross-attention-based framework for aligning long audio interviews with questions, improving retrieval accuracy in multilingual, noisy, real-world survey recordings without requiring verbatim text matches.
Contribution
The work presents a novel cross-attention model that leverages temporal sentence order and semantic embeddings to align questions with long audio recordings in multiple languages.
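As a rough illustration of the idea, the sketch below scores one question embedding against a sequence of audio-segment embeddings using scaled dot-product cross-attention. It is not the authors' implementation: the function names, the toy embeddings, and the optional `log_prior` term (a stand-in for a bias derived from temporal order) are all assumptions made for this example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def align_question(question_emb, segment_embs, log_prior=None):
    """Attention weights of one question over the audio segments.

    question_emb: (d,) embedding of the questionnaire question.
    segment_embs: (n, d) embeddings of consecutive audio segments.
    log_prior:    optional (n,) additive bias, a hypothetical way to
                  inject knowledge of where the question should occur.
    """
    d = question_emb.shape[0]
    scores = segment_embs @ question_emb / np.sqrt(d)
    if log_prior is not None:
        scores = scores + log_prior
    return softmax(scores)

# Toy data (assumed): six orthogonal segment embeddings in 8 dimensions,
# and a question that is a slightly shifted copy of segment 3.
segment_embs = np.eye(6, 8)
question_emb = segment_embs[3] + 0.1

weights = align_question(question_emb, segment_embs)
best_segment = int(np.argmax(weights))
```

In this toy setup the highest weight falls on the segment the question was derived from. A real system would replace the toy vectors with learned semantic embeddings of the question text and of the (noisy) ASR output.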
Findings
Significant improvement in retrieval accuracy (about 3% R-avg) over text heuristics.
Effective use of noisy ASR outputs for better alignment.
Model trained on Hindi generalizes to 11 Indic languages.
Abstract
The problem of audio-to-text alignment has seen a significant amount of research using complete supervision during training. However, this is typically not in the context of long audio recordings wherein the text being queried does not appear verbatim within the audio file. This work is a collaboration with a non-governmental organization called CARE India that collects long audio health surveys from young mothers residing in rural parts of Bihar, India. Given a question drawn from a questionnaire that is used to guide these surveys, we aim to locate where the question is asked within a long audio recording. This is of great value to African and Asian organizations that would otherwise have to painstakingly go through long and noisy audio recordings to locate questions (and answers) of interest. Our proposed framework, INDENT, uses a cross-attention-based model and prior information on…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
