A Synchronized Audio-Visual Multi-View Capture System
Xiangwei Shi, Gara Dorta, Ruud de Jong, Ojas Shirekar, Chirag Raman

TL;DR
This paper introduces a multi-view capture system that synchronizes audio and video signals for studying conversational interactions with high temporal precision.
Contribution
It presents a unified system combining multi-camera and multi-microphone setups with a calibration workflow for scalable, synchronized audio-visual data collection.
Findings
Synchronization performance is quantitatively validated.
Recordings are temporally consistent for detailed conversation analysis.
The system supports repeatable, large-scale data acquisition.
Abstract
Multi-view capture systems have been an important tool in research for recording human motion under controlled conditions. Most existing systems are specified around video streams and provide little or no support for audio acquisition and rigorous audio-video alignment, despite both being essential for studying conversational interaction where timing at the level of turn-taking, overlap, and prosody matters. In this technical report, we describe an audio-visual multi-view capture system that addresses this gap by treating synchronized audio and synchronized video as first-class signals. The system combines a multi-camera pipeline with multi-channel microphone recording under a unified timing architecture and provides a practical workflow for calibration, acquisition, and quality control that supports repeatable recordings at scale. We quantify synchronization performance in deployment…
