OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
Hymalai Bello, Lala Ray, Joanna Sorysz, Sungho Suh, Paul Lukowicz

TL;DR
OpenMarcie is a comprehensive multimodal dataset for human action recognition in manufacturing environments, capturing diverse activities with wearable sensors and cameras to enhance industrial worker safety and productivity.
Contribution
It introduces what the authors describe as, to the best of their knowledge, the largest multimodal dataset for industrial human activity monitoring, covering two experimental settings with data from wearable sensors and cameras.
Findings
Dataset includes over 37 hours of multimodal data from 36 participants.
Benchmark results provided for activity classification, captioning, and cross-modal alignment.
Demonstrates potential for improving activity recognition in industrial settings.
Abstract
Smart factories use advanced technologies to optimize production and increase efficiency. To this end, the recognition of worker activity allows for accurate quantification of performance metrics, improving efficiency holistically while contributing to worker safety. OpenMarcie is, to the best of our knowledge, the biggest multimodal dataset designed for human action monitoring in manufacturing environments. It includes data from wearable sensing modalities and cameras distributed in the surroundings. The dataset is structured around two experimental settings, involving a total of 36 participants. In the first setting, twelve participants perform a bicycle assembly and disassembly task under semi-realistic conditions without a fixed protocol, promoting divergent and goal-oriented problem-solving. The second experiment involves twenty-five volunteers (24 with valid data) engaged in a 3D…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
This is a good topic. Well‑scoped industrial focus with realistic goal‑driven workflows (incl. sequential collaborative assembly), bridging the gap between short scripted actions and long‑horizon procedures. The dataset covers many modalities and placements (IMU, RGB‑LiDAR, thermal, spectrometer, stereo audio, ego/exo RGB‑D), enabling research on fusion, substitution, and privacy‑preserving sensing. The benchmarks, models, and metrics are standard and appropriate; the late‑fusion transformer a
External validity and generalization are limited: the data were collected on a test bench, not in a running factory, and the acoustic environment lacks authentic machinery noise. This likely underestimates the value of audio and may limit deployment realism. LLM‑assisted labeling is scalable but introduces distributional priors; although consistency checks are reported, a more thorough human validation on a held‑out subset (e.g., inter‑annotator agreement vs. LLM outputs) would strengthen the claims.
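To make the suggested validation concrete, here is a minimal sketch of one common agreement measure, Cohen's kappa, computed between human labels and LLM labels on a held-out subset. This is only an illustration of what the reviewer asks for, not the authors' procedure: the function name `cohen_kappa` and the example labels are hypothetical, and it assumes one categorical label per segment, whereas the dataset's multi-action labels would need a per-field or set-based variant.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)

    # Observed agreement: fraction of segments where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both raters pick the same class
    # if each labels independently according to its own class frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical class; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four segments (not taken from the dataset).
human = ["tighten", "tighten", "pick", "pick"]
llm = ["tighten", "pick", "pick", "pick"]
print(cohen_kappa(human, llm))
```

Kappa corrects raw agreement for agreement expected by chance, which matters when a few action classes dominate the label distribution.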
The overall idea is quite interesting: to collect data relevant for manufacturing.
One weakness is the lack of diversity in the tasks (only two scenarios are included). It is not clear whether the data collection approach is scalable. Given the size of the dataset and the limited set of tasks, it is unclear whether this dataset offers something new that cannot be done with existing datasets such as Ego4D. The main advantage of this dataset is the inclusion of more sensing modalities, but it is not clear that this compensates for the lack of diversity. Despite that the goal is to collect d
1. Exceptional multimodal richness and scale: combines a wide range of synchronized data streams: egocentric/exocentric RGB-D video, LiDAR, stereo sound, IMUs, thermal cameras, and spectrometers. 2. Realistic experimental scenarios: bicycle assembly and 3D printer assembly, both specific to an industrial environment. 3. Benchmarking and validation: apart from constructing a dataset, the paper also proposes strong baselines. 4. Use a high-quality annotation pipeline with human annotation
1. Limited demographic generalizability: for example, the majority of participants were male (24 of 36), right-handed (31 of 36), young (47% aged 22-24), and had an engineering background (72%). 2. Missing real-world setting: the experiments were conducted in a lab or "test-bench" setting, not an actual real-world environment. 3. Underperformance of the audio modality: there is an obvious performance gap between the audio-only modality and the other single modalities. In addition, adding the audio into o
1. Dual-scenario design: Captures both open-ended (goal-oriented) and procedural workflows, reflecting real variations in industrial activity. 2. Comprehensive sensing: Rich multimodal setup with 282 synchronized channels across ego and exo viewpoints, combining wearable, visual, and acoustic streams. 3. Thoughtful labeling: LLM-assisted annotation pipeline with multi-action verb–object–tool labels enhances expressiveness and scalability. 4. Fusion effectiveness: Particularly interesting finding
1. Controlled environment: Although participants come from over 20 countries, all recordings were conducted in a lab-like setting. This limits the dataset’s realism compared to actual industrial floors with variable lighting, background noise, and worker fatigue. 2. Audio realism: The recorded audio lacks machinery noise and natural acoustic interference, making it less representative of real industrial conditions. A noisy-environment extension or augmentation study would strengthen robustness
Taxonomy
Topics: Human Pose and Action Recognition · Context-Aware Activity Recognition Systems · Emotion and Mood Recognition
