The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge
Ya Jiang, Hongbo Lan, Jun Du, Qing Wang, Shutong Niu

TL;DR
This paper presents the USTC-NERCSLIP systems for the CHiME-8 MMCSG challenge, focusing on real-time speech recognition in two-person conversations using simulated multi-modal data and IMU sensor integration.
Contribution
It introduces a novel training strategy with simulated data and multi-modal fusion, improving real-time speech recognition in smart glasses scenarios.
Findings
Simulation data with multiple overlap rates narrows the domain gap between simulated and real recordings.
Multi-modal data improves recognition accuracy.
Integrating IMU data alongside the audio improves real-time recognition performance.
Abstract
In a two-person conversation where one participant wears smart glasses, transcribing and displaying the speaker's words in real time is an intriguing application that provides a priori information for downstream tasks such as translation and comprehension. However, multi-modal data captured from smart glasses is scarce. We therefore propose using simulation data with multiple overlap rates, together with a one-to-one matching training strategy, to narrow the gap between real and simulated data during model training. In addition, incorporating IMU data into the model complements the audio and yields better real-time speech recognition performance.
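The abstract does not describe how the simulated conversations are built, so the following is only a minimal sketch of one way to mix two single-speaker utterances at a chosen overlap rate. The function name `mix_with_overlap`, the definition of overlap rate as a fraction of the shorter utterance, and the toy sample values are all assumptions made for illustration, not the authors' pipeline.

```python
def mix_with_overlap(utt_a, utt_b, overlap_rate):
    """Mix two mono utterances so that the start of utt_b overlaps the end of utt_a.

    Assumption (not from the paper): overlap_rate is the fraction of the
    shorter utterance that is overlapped by the other speaker.
    Returns the mixture and the sample offset at which utt_b starts.
    """
    if not 0.0 <= overlap_rate <= 1.0:
        raise ValueError("overlap_rate must lie in [0, 1]")

    # Number of samples during which both speakers are active.
    overlap_len = round(overlap_rate * min(len(utt_a), len(utt_b)))

    # utt_b starts overlap_len samples before utt_a ends.
    offset = len(utt_a) - overlap_len

    mixture = [0.0] * (offset + len(utt_b))
    for i, sample in enumerate(utt_a):
        mixture[i] += sample
    for i, sample in enumerate(utt_b):
        mixture[offset + i] += sample
    return mixture, offset


# Toy signals standing in for real waveforms (hypothetical values).
speaker_a = [1.0] * 8
speaker_b = [2.0] * 4

# Generate one simulated mixture per overlap rate.
mixtures = {rate: mix_with_overlap(speaker_a, speaker_b, rate)
            for rate in (0.0, 0.5, 1.0)}
```

Sweeping `overlap_rate` over several values gives a training set that covers both turn-taking without overlap and heavily overlapped speech, which is the kind of variation the "multiple overlap rates" in the abstract suggests.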
