Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments

Ruiping Liu; Jiaming Zhang; Kunyu Peng; Junwei Zheng; Ke Cao; Yufan Chen; Kailun Yang; Rainer Stiefelhagen

arXiv:2307.07757 · cs.CV · July 18, 2023 · 1 citation

TL;DR

This paper introduces OpenSU, a system that combines Grounded Situation Recognition (GSR) with the Segment Anything Model (SAM) to give people with visual impairments (PVI) detailed scene understanding. It reports state-of-the-art results on the SWiG dataset and demonstrates practical assistive applications.

Contribution

We propose OpenSU, which integrates GSR with SAM and transformer backbones to generate pixel-wise segmentation masks of the involved entities, enhancing scene understanding for assistive technology.

Findings

01 Achieves state-of-the-art performance on the SWiG dataset.

02 Demonstrates practical utility in assistive technology for people with visual impairments (PVI).

03 Reduces training time with GELU activation functions.
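
For reference, GELU weights its input by the standard normal CDF, GELU(x) = x · Φ(x). The sketch below only illustrates that definition next to ReLU. It does not reproduce the paper's training setup, and the function names are illustrative rather than taken from the OpenSU code.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x: float) -> float:
    """ReLU, shown only for comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    # GELU is smooth around zero and lets small negative values through,
    # whereas ReLU clips every negative input to exactly zero.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```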

Abstract

Grounded Situation Recognition (GSR) is capable of recognizing and interpreting visual scenes in a contextually intuitive way, yielding salient activities (verbs) and the involved entities (roles) depicted in images. In this work, we focus on the application of GSR in assisting people with visual impairments (PVI). However, precise localization information of detected objects is often required to navigate their surroundings confidently and make informed decisions. For the first time, we propose an Open Scene Understanding (OpenSU) system that aims to generate pixel-wise dense segmentation masks of involved entities instead of bounding boxes. Specifically, we build our OpenSU system on top of GSR by additionally adopting an efficient Segment Anything Model (SAM). Furthermore, to enhance the feature extraction and interaction between the encoder-decoder structure, we construct our OpenSU…
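
The data flow the abstract describes, in which GSR yields a verb plus role entities with bounding boxes and a promptable segmenter turns each box into a pixel-wise mask, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the OpenSU implementation: `Situation`, `RoleEntity`, `masks_for_situation`, and `rectangle_segmenter` are hypothetical names, and the segmenter is a stand-in that simply fills the box where a real system would call a SAM-style model with the box as its prompt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, max edge exclusive
Mask = List[List[bool]]          # H x W binary mask


@dataclass
class RoleEntity:
    """One role of the situation (hypothetical structure)."""
    noun: str
    box: Optional[Box]  # None when the role has no grounding in the image


@dataclass
class Situation:
    """GSR-style output: a verb and its roles (hypothetical structure)."""
    verb: str
    roles: Dict[str, RoleEntity]


def rectangle_segmenter(height: int, width: int) -> Callable[[Box], Mask]:
    """Stand-in for a box-prompted segmenter: marks every pixel in the box."""
    def segment(box: Box) -> Mask:
        x0, y0, x1, y1 = box
        return [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
                for y in range(height)]
    return segment


def masks_for_situation(situation: Situation,
                        segment_box: Callable[[Box], Mask]) -> Dict[str, Mask]:
    """Use each grounded role's box as a prompt; skip roles without a box."""
    return {role: segment_box(entity.box)
            for role, entity in situation.roles.items()
            if entity.box is not None}


# Toy 4x4 "image" with made-up boxes, for illustration only.
situation = Situation(
    verb="riding",
    roles={
        "agent": RoleEntity("person", (1, 0, 3, 2)),
        "vehicle": RoleEntity("bicycle", (0, 2, 4, 4)),
        "place": RoleEntity("street", None),
    },
)
masks = masks_for_situation(situation, rectangle_segmenter(4, 4))
for role, mask in masks.items():
    area = sum(cell for row in mask for cell in row)
    print(f"{situation.verb}: {role}={situation.roles[role].noun}, mask area={area}")
```

Passing the segmenter in as a callable keeps the role-to-mask bookkeeping independent of whichever segmentation model is plugged in.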

Code & Models

Repositories

ruipingl/opensu (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Tactile and Sensory Interactions · Domain Adaptation and Few-Shot Learning

Methods: Focus