Detecting In-Person Conversations in Noisy Real-World Environments with Smartwatch Audio and Motion Sensing
Alice Zhang, Callihan Bertley, Dawei Liang, Edison Thomaz

TL;DR
This paper presents a multimodal smartwatch-based system that detects face-to-face conversations from synchronized audio and motion data, reaching a macro F1-score of 82% in a lab study and 77.2% in a semi-naturalistic study.
Contribution
It introduces a novel neural network framework that fuses audio and inertial data for real-time conversation detection on commercial smartwatches.
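To make the idea of fusing two modalities concrete, here is a minimal sketch of two generic fusion strategies: feature-level (early) fusion and decision-level (late) fusion. This summary does not name the paper's three fusion methods, so these two are illustrative assumptions and are not the authors' architecture. The function names, the `audio_weight` parameter, and the toy values are all hypothetical.

```python
# Illustrative sketch only: generic fusion strategies, not the paper's method.

def early_fusion(audio_features, motion_features):
    """Feature-level fusion: concatenate per-window feature vectors
    so that a single downstream model sees both modalities."""
    return list(audio_features) + list(motion_features)


def late_fusion(audio_prob, motion_prob, audio_weight=0.5):
    """Decision-level fusion: weighted average of the conversation
    probabilities produced by two separate per-modality models.
    audio_weight is a hypothetical tuning parameter in [0, 1]."""
    return audio_weight * audio_prob + (1.0 - audio_weight) * motion_prob


# Toy example: a 3-dim audio embedding and a 6-axis inertial reading.
fused = early_fusion([0.1, 0.2, 0.3], [0.0, 0.0, 9.8, 0.01, 0.02, 0.03])
score = late_fusion(audio_prob=0.8, motion_prob=0.4)
```

Early fusion lets one model learn cross-modal interactions, while late fusion keeps the per-modality models independent and combines only their outputs.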
Findings
Achieved 82% macro F1-score in lab settings.
Achieved 77.2% macro F1-score in semi-naturalistic environments.
Demonstrated real-time detection on a commercial smartwatch.
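The scores above are macro F1-scores, that is, the unweighted mean of the per-class F1-scores. The sketch below shows how that metric is computed. The labels are made-up toy data (1 = conversation, 0 = no conversation) and are not taken from the paper, and `macro_f1` is a hypothetical helper name.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
result = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the average, macro F1 does not reward a classifier for simply predicting the majority class, which matters when conversation and non-conversation windows are imbalanced.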
Abstract
Social interactions play a crucial role in shaping human behavior, relationships, and societies. They encompass various forms of communication, such as verbal conversation, non-verbal gestures, facial expressions, and body language. In this work, we develop a novel computational approach to detect face-to-face verbal conversations, a foundational aspect of human social interactions. We leverage multimodal data captured by a commodity smartwatch, specifically synchronizing microphone audio with 6-axis inertial signals (accelerometer and gyroscope). We design, train, and evaluate convolutional and attention-based neural networks using three different fusion methods to integrate the audio and motion modalities. To validate this framework, we conduct a lab study with 11 participants and a semi-naturalistic study with 24 participants. Our comprehensive evaluation demonstrates that fusing…
