Babel: A Scalable Pre-trained Model for Multi-Modal Sensing via Expandable Modality Alignment

Shenghong Dai; Shiqi Jiang; Yifan Yang; Ting Cao; Mo Li; Suman Banerjee; Lili Qiu

arXiv:2407.17777·cs.AI·March 24, 2025

Open Access

TL;DR

Babel is a scalable, expandable multi-modal sensing model that effectively aligns multiple sensing modalities, overcoming data scarcity and partial pairing challenges, to enhance human activity recognition and enable new sensing applications.

Contribution

The paper introduces the concept of expandable modality alignment, transforming multi-modality alignment into binary alignments, with novel techniques to handle data scarcity and modality integration.
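
To make the decomposition concrete, the following is a minimal sketch of how an N-modality alignment could be scheduled as a chain of binary alignments when only some modality pairs have paired data. It is an illustration under stated assumptions, not the authors' implementation: the function name `expansion_schedule`, the breadth-first ordering, and the example pairings are all hypothetical. Only the six modality names come from the paper's reported findings.

```python
from collections import deque


def expansion_schedule(paired_datasets, start):
    """Order binary alignments so each step adds exactly one new modality.

    paired_datasets: iterable of (modality_a, modality_b) tuples, one per
    dataset that contains paired samples of those two modalities.
    start: the modality that seeds the aligned set.
    Returns a list of (already_aligned, newly_added) steps.
    """
    # Undirected graph: an edge means some dataset pairs the two modalities.
    neighbors = {}
    for a, b in paired_datasets:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    aligned = {start}
    steps = []
    queue = deque([start])
    while queue:
        known = queue.popleft()
        # Sort for a deterministic order; the real ordering policy is an
        # assumption here and may differ from what Babel uses.
        for new in sorted(neighbors.get(known, ())):
            if new not in aligned:
                aligned.add(new)
                steps.append((known, new))
                queue.append(new)
    return steps


# Hypothetical partial pairings over the six modalities named in the paper.
pairs = [
    ("IMU", "video"),
    ("video", "depth"),
    ("depth", "LiDAR"),
    ("depth", "mmWave"),
    ("Wi-Fi", "mmWave"),
]
schedule = expansion_schedule(pairs, start="IMU")
for known, new in schedule:
    print(f"align {new} to the space that already contains {known}")
```

With six modalities connected by partial pairings, the schedule contains five binary steps, and no dataset ever needs to cover all six modalities at once. How each binary step is trained, and how the new modality is balanced against the existing alignment, is outside the scope of this sketch.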

Findings

1. Achieves up to 22% accuracy improvement in multi-modal sensing tasks.

2. Effectively aligns six sensing modalities including Wi-Fi, mmWave, IMU, LiDAR, video, and depth.

3. Enables cross-modality retrieval and sensing comprehension through case studies.

Abstract

This paper presents Babel, an expandable modality alignment model designed specifically for multi-modal sensing. Although there has been considerable work on multi-modality alignment, existing approaches struggle to incorporate multiple sensing modalities effectively because of data scarcity. How to utilize multi-modal data with partial pairings in sensing remains an unresolved challenge. Babel tackles this challenge by introducing the concept of expandable modality alignment. The key idea is to transform N-modality alignment into a series of binary-modality alignments. Novel techniques are also proposed to further mitigate the data scarcity issue and to balance the contribution of the newly incorporated modality against the previously established modality alignment during the expandable alignment process. We provide a comprehensive implementation. In the pre-training phase, Babel…

Taxonomy

Topics: Speech and dialogue systems · Hand Gesture Recognition Systems