Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Heeseung Yun; Ruohan Gao; Ishwarya Ananthabhotla; Anurag Kumar; Jacob Donley; Chao Li; Gunhee Kim; Vamsi Krishna Ithapu; Calvin Murdock

arXiv:2408.05364·cs.CV·August 13, 2024

TL;DR

This paper introduces Spherical World-Locking (SWL), a novel egocentric scene representation framework that improves multisensory spatial synchronization by transforming data with respect to head orientation on a sphere, enhancing understanding of egocentric videos.

Contribution

The paper presents SWL, a new spherical world-locked representation that better handles self-motion and multisensory data alignment in egocentric videos, along with a transformer-based architecture for scene understanding.

Findings

01 SWL improves spatial synchronization across modalities.

02 The framework enhances performance on egocentric video tasks.

03 It effectively handles self-motion challenges in multisensory data.

Abstract

Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed…
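The geometric idea behind world-locking can be illustrated with a small sketch. It is not the paper's implementation: the abstract describes an implicit transformation inside a transformer, whereas this example applies an explicit rotation to a single direction on the unit sphere. The function names, the z-up axis convention, and the yaw-only head motion are all assumptions made for illustration.

```python
import numpy as np

def yaw_rotation(yaw_rad):
    """Hypothetical helper: head-to-world rotation for a head turn of
    yaw_rad about the vertical z-axis (assumed convention)."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def to_unit_vector(azimuth_rad, elevation_rad):
    """Convert spherical angles to a 3D unit direction vector."""
    ce = np.cos(elevation_rad)
    return np.array([ce * np.cos(azimuth_rad),
                     ce * np.sin(azimuth_rad),
                     np.sin(elevation_rad)])

def to_angles(v):
    """Convert a 3D direction vector back to (azimuth, elevation)."""
    v = v / np.linalg.norm(v)
    return np.arctan2(v[1], v[0]), np.arcsin(np.clip(v[2], -1.0, 1.0))

def world_lock(azimuth_rad, elevation_rad, head_to_world):
    """Rotate a head-locked direction into the world-locked frame
    using a measured head orientation (3x3 rotation matrix)."""
    return to_angles(head_to_world @ to_unit_vector(azimuth_rad, elevation_rad))

# A static source at 30 degrees world azimuth is observed while the head
# turns by 0, 20, and 40 degrees. Its head-locked azimuth drifts, but the
# world-locked azimuth stays the same.
for yaw_deg in (0.0, 20.0, 40.0):
    head_locked_az = np.radians(30.0 - yaw_deg)
    az, el = world_lock(head_locked_az, 0.0, yaw_rotation(np.radians(yaw_deg)))
    print(yaw_deg, np.degrees(head_locked_az), np.degrees(az))
```

In this toy setting the head-locked azimuth changes with every head turn while the world-locked azimuth does not, which is the sense in which a world-locked sphere offsets self-motion.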

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies

Methods: Sparse Evolutionary Training