Snap, Segment, Deploy: A Visual Data and Detection Pipeline for Wearable Industrial Assistants
Di Wen, Junwei Zheng, Ruiping Liu, Yi Xu, Kunyu Peng, Rainer Stiefelhagen

TL;DR
This paper presents a fully on-device, modular system for industrial assistance that combines lightweight perception, speech recognition, and retrieval techniques to support real-time, privacy-preserving training and operations in constrained environments.
Contribution
It introduces a novel on-device pipeline integrating detection, speech, and RAG for industrial support, with an automated data pipeline and a two-stage refinement strategy for robustness.
Findings
Improved robustness to domain shifts and visual corruptions.
Positive user feedback on guidance clarity and interaction quality.
Effective real-time support without cloud reliance.
Abstract
Industrial assembly tasks increasingly demand rapid adaptation to complex procedures and varied components, yet are often conducted in environments with limited computing and connectivity and with strict privacy requirements. These constraints make conventional cloud-based or fully autonomous solutions impractical for factory deployment. This paper introduces a mobile-device-based assistant system for industrial training and operational support, enabling real-time, semi-hands-free interaction through on-device perception and voice interfaces. The system integrates lightweight object detection, speech recognition, and Retrieval-Augmented Generation (RAG) into a modular pipeline that operates entirely on-device, enabling intuitive support for part handling and procedure understanding without relying on manual supervision or cloud services. To enable scalable training, we adopt an…
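To make the modular structure described in the abstract concrete, here is a minimal sketch of how detector output and a speech transcript might be combined into a retrieval query. It is not the authors' implementation: the `Detection` type, the `assist` function, the confidence threshold, and the word-overlap retriever are all hypothetical stand-ins, and the generation step of RAG is omitted.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detector output: a part label and its confidence."""
    label: str
    confidence: float


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, passages: list[str]) -> str:
    """Toy retriever: return the passage sharing the most words with the query."""
    query_words = tokenize(query)
    return max(passages, key=lambda p: len(query_words & tokenize(p)))


def assist(detections: list[Detection], transcript: str,
           passages: list[str], min_conf: float = 0.5) -> str:
    """Combine confident detections with the spoken question, then retrieve.

    The threshold and the query format are assumptions for illustration only.
    """
    labels = [d.label for d in detections if d.confidence >= min_conf]
    query = " ".join(labels + [transcript])
    return retrieve(query, passages)
```

In this sketch each stage (detection filtering, query building, retrieval) is a separate function, so any one of them could be swapped for a real on-device model without touching the others.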
