MMIS: Multimodal Dataset for Interior Scene Visual Generation and Recognition
Hozaifa Kassab, Ahmed Mahmoud, Mohamed Bahaa, Ammar Mohamed, Ali Hamdi

TL;DR
The MMIS dataset offers a large, multimodal collection of interior scene images with text and audio annotations, supporting advancements in multi-modal scene generation and recognition tasks.
Contribution
We introduce MMIS, a comprehensive multimodal dataset with images, text, and audio for interior scene understanding and generation, facilitating multi-modal learning research.
Findings
Supports diverse interior scene tasks like generation, retrieval, captioning, and classification.
Enables research on multi-modal representation learning.
Contains nearly 160,000 richly annotated images.
Abstract
We introduce MMIS, a novel dataset designed to advance MultiModal Interior Scene generation and recognition. MMIS consists of nearly 160,000 images. Each image within the dataset is accompanied by its corresponding textual description and an audio recording of that description, providing rich and diverse sources of information for scene generation and recognition. MMIS encompasses a wide range of interior spaces, capturing various styles, layouts, and furnishings. To construct this dataset, we employed careful processes involving the collection of images, the generation of textual descriptions, and corresponding speech annotations. The presented dataset contributes to research in multi-modal representation learning tasks such as image generation, retrieval, captioning, and classification.
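As a rough illustration of the image, text, and audio pairing described in the abstract, one sample could be modeled as below. This is only a sketch under stated assumptions: the class name `InteriorSample`, the helper `pair_samples`, the file names, and the convention of matching modalities by shared file stem are all hypothetical and are not taken from the MMIS release.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class InteriorSample:
    """One hypothetical record: an image, its caption, and the spoken caption."""
    image_path: Path
    caption: str
    audio_path: Path


def pair_samples(image_paths, captions, audio_paths):
    """Join the three modalities on a shared file stem (an assumed convention).

    `captions` maps stem -> description text. Images that lack either a
    caption or an audio file are skipped.
    """
    audio_by_stem = {Path(p).stem: Path(p) for p in audio_paths}
    samples = []
    for image in sorted(Path(p) for p in image_paths):
        stem = image.stem
        if stem in captions and stem in audio_by_stem:
            samples.append(
                InteriorSample(image, captions[stem], audio_by_stem[stem])
            )
    return samples


# Made-up file names, for illustration only.
samples = pair_samples(
    ["images/room_001.jpg", "images/room_002.jpg"],
    {"room_001": "A bright living room with a grey sofa."},
    ["audio/room_001.wav"],
)
```

Keeping the three modalities in a single record makes it straightforward to draw aligned pairs for the tasks the abstract lists, for example (image, caption) for captioning or retrieval and (caption, image) for generation.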
Taxonomy
Topics: Cultural Heritage Management and Preservation · Remote Sensing and Land Use
