Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents

Yao-Hung Hubert Tsai; Hanlin Goh; Ali Farhadi; Jian Zhang

arXiv:2209.13156·cs.CV·September 28, 2022

TL;DR

This paper introduces MMISM, a multimodal, multitask model for indoor scene understanding in mobile agents. It takes multiple input modalities, produces multiple task outputs, addresses labeled-data scarcity and modality fusion, and performs comparably to or better than single-task models.

Contribution

The paper presents MMISM, a multi-modality-input, multi-task-output indoor scene understanding model that fuses RGB images and Lidar point clouds to serve several perception tasks, such as 3D object localization, depth estimation, and human pose estimation.

Findings

01

MMISM performs comparably to or better than single-task models.

02

MMISM improves 3D object detection results by 11.7% on the ARKitScenes dataset.

03

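MMISM addresses key challenges in indoor scene understanding for mobile agents: scarce labeled data, fusion of heterogeneous inputs, relationships among diverse outputs, and computational efficiency.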

Abstract

The perception system in personalized mobile agents requires developing indoor scene understanding models, which can understand 3D geometries, capture objectiveness, analyze human behaviors, etc. Nonetheless, this direction has not been well-explored in comparison with models for outdoor environments (e.g., the autonomous driving system that includes pedestrian prediction, car detection, traffic sign recognition, etc.). In this paper, we first discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments, and other challenges such as fusion between heterogeneous sources of information (e.g., RGB images and Lidar point clouds), modeling relationships between a diverse set of outputs (e.g., 3D object locations, depth estimation, and human poses), and computational efficiency. Then, we describe MMISM (Multi-modality input Multi-task output Indoor…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Robotics and Sensor-Based Localization