Semantic Scene Difference Detection in Daily Life Patroling by Mobile Robots using Pre-Trained Large-Scale Vision-Language Model

Yoshiki Obinata; Kento Kawaharazuka; Naoaki Kanazawa; Naoya Yamaguchi; Naoto Tsukamoto; Iori Yanokura; Shingo Kitagawa; Koki Shinjo; Kei Okada; Masayuki Inaba

arXiv:2309.16552·cs.RO·September 29, 2023

TL;DR

This paper presents a novel method for detecting semantic environmental changes in daily life using large-scale vision-language models, enabling robots to identify meaningful scene differences without training.

Contribution

It introduces a training-free, noise-robust semantic change detection approach leveraging vision-language models' VQA capabilities for mobile robot patrols.

Findings

01. Effective semantic change detection demonstrated in real-world robot patrols

02. Method is robust to noise and does not require training or fine-tuning

03. Potential for adding explanatory language to environmental changes

Abstract

It is important for daily life support robots to detect changes in their environment and perform tasks. In the field of anomaly detection in computer vision, probabilistic and deep learning methods have been used to calculate the image distance. These methods calculate distances by focusing on image pixels. In contrast, this study aims to detect semantic changes in the daily life environment using the current development of large-scale vision-language models. Using its Visual Question Answering (VQA) model, we propose a method to detect semantic changes by applying multiple questions to a reference image and a current image and obtaining answers in the form of sentences. Unlike deep learning-based methods in anomaly detection, this method does not require any training or fine-tuning, is not affected by noise, and is sensitive to semantic state changes in the real world. In our…
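The comparison step described in the abstract (asking the same questions about a reference image and a current image, then comparing the sentence answers) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `vqa` callable, the `normalize` helper, the exact-match comparison rule, and the stub answers are all assumptions introduced here, and a real system would plug in an actual vision-language model and likely a more tolerant answer comparison.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: any VQA backend can be wrapped as (image, question) -> answer sentence.
VQAFn = Callable[[str, str], str]


def normalize(answer: str) -> str:
    """Lowercase and drop surrounding whitespace and trailing periods."""
    return answer.strip().lower().rstrip(".")


def semantic_diff(
    vqa: VQAFn,
    reference_image: str,
    current_image: str,
    questions: List[str],
) -> Dict[str, Tuple[str, str]]:
    """Return the questions whose answers differ between the two images.

    Assumption: a change is flagged when the normalized answers are not
    identical; the paper may use a different comparison.
    """
    changes: Dict[str, Tuple[str, str]] = {}
    for question in questions:
        ref_answer = vqa(reference_image, question)
        cur_answer = vqa(current_image, question)
        if normalize(ref_answer) != normalize(cur_answer):
            changes[question] = (ref_answer, cur_answer)
    return changes


def stub_vqa(image: str, question: str) -> str:
    """Hypothetical stand-in for a real model, with canned answers."""
    canned = {
        ("kitchen_ref.jpg", "Is the refrigerator door open?"): "No.",
        ("kitchen_now.jpg", "Is the refrigerator door open?"): "Yes.",
        ("kitchen_ref.jpg", "Is the light on?"): "Yes.",
        ("kitchen_now.jpg", "Is the light on?"): "yes",
    }
    return canned[(image, question)]


QUESTIONS = ["Is the refrigerator door open?", "Is the light on?"]
```

Calling `semantic_diff(stub_vqa, "kitchen_ref.jpg", "kitchen_now.jpg", QUESTIONS)` with the stub flags only the refrigerator question, since the two answers about the light differ only in capitalization and punctuation. Because the differing answers are returned as sentences, they could also serve as a textual explanation of what changed.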

Taxonomy

Topics: Multimodal Machine Learning Applications · Anomaly Detection Techniques and Applications · Advanced Image and Video Retrieval Techniques