TL;DR
The paper introduces IKEA ASM, a three-million-frame, multi-view furniture assembly video dataset annotated with depth, atomic actions, object segmentation, and human pose, and uses it to benchmark methods for video action recognition, object segmentation, and human pose estimation.
Contribution
It provides a comprehensive, multi-modal dataset for human activity understanding in furniture assembly, including benchmarks for various vision tasks.
Findings
Benchmark results for prominent methods on video action recognition, object segmentation, and human pose estimation.
Demonstration of the dataset's utility for developing holistic multi-modal methods.
Insights into challenges of multi-view, multi-modal activity analysis.
Abstract
The availability of a large labeled dataset is a key requirement for applying deep learning methods to solve various computer vision tasks. In the context of understanding human activities, existing public datasets, while large in size, are often limited to a single RGB camera and provide only per-frame or per-clip action annotations. To enable richer analysis and understanding of human activities, we introduce IKEA ASM -- a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose. Additionally, we benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset. The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
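One simple way to picture how multi-view data could be integrated is late fusion: each camera view produces per-class action scores, and the scores are averaged before picking a label. The sketch below is illustrative only and is not taken from the paper. The view names, action labels, scores, and the function names `fuse_view_scores` and `predict_action` are all hypothetical.

```python
from typing import Dict, List


def fuse_view_scores(view_scores: Dict[str, List[float]]) -> List[float]:
    """Average per-class scores across camera views (simple late fusion)."""
    if not view_scores:
        raise ValueError("need at least one view")
    score_lists = list(view_scores.values())
    num_classes = len(score_lists[0])
    if any(len(s) != num_classes for s in score_lists):
        raise ValueError("all views must score the same set of classes")
    # zip(*...) groups the scores of each class across views.
    return [sum(per_class) / len(score_lists) for per_class in zip(*score_lists)]


def predict_action(view_scores: Dict[str, List[float]], class_names: List[str]) -> str:
    """Return the class name with the highest fused score."""
    fused = fuse_view_scores(view_scores)
    best = max(range(len(fused)), key=fused.__getitem__)
    return class_names[best]


# Hypothetical per-view scores for one clip (assumed values, not dataset output).
scores = {
    "front": [0.2, 0.7, 0.1],
    "side": [0.5, 0.4, 0.1],
    "top": [0.1, 0.8, 0.1],
}
# Hypothetical atomic action labels.
labels = ["pick up leg", "spin leg", "flip table"]

fused = fuse_view_scores(scores)
prediction = predict_action(scores, labels)
```

Averaging is only one possible fusion rule. It treats every view as equally reliable, which may not hold when a view is occluded.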
