LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

Hongyan Zhi; Peihao Chen; Junyan Li; Shuailei Ma; Xinyu Sun; Tianhang Xiang; Yinjie Lei; Mingkui Tan; Chuang Gan

arXiv:2412.01292·cs.CV·February 4, 2025

Open Access · 1 Repo

TL;DR

LSceneLLM introduces an adaptive framework that leverages large language models' visual preferences to focus on task-relevant areas in large 3D scenes, improving understanding and performance in scene-related tasks.

Contribution

The paper proposes LSceneLLM, a novel adaptive approach that identifies and magnifies task-relevant visual details in large 3D scenes using LLMs, and introduces the XR-Scene benchmark for comprehensive evaluation.

Findings

01

Outperforms existing methods in large scene understanding tasks.

02

Enhances scene understanding by magnifying task-relevant areas.

03

Improves performance of existing 3D-VLMs with the scene magnifier module.

Abstract

Research on 3D Vision-Language Models (3D-VLMs), which is crucial for developing embodied AI within 3D scenes for tasks such as visual navigation and embodied question answering, is gaining increasing attention. Due to the high density of visual features, especially in large 3D scenes, accurately locating task-relevant visual information is challenging. Existing works attempt to segment all objects and treat their features as the scene representation. However, these task-agnostic object features contain much redundant information and miss details of the task-relevant area. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging the LLM's visual preference for different tasks, followed by a plug-and-play scene magnifier module that captures fine-grained details in the focused areas. Specifically, a dense token selector examines…
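The abstract is cut off before it describes the dense token selector, so the sketch below is only an illustration of the general idea it names: rank coarse scene regions by how much attention a query places on them, then append finer-grained tokens from the top-ranked regions. It is not the authors' implementation. The function names (`select_focus_regions`, `magnify`), the top-k selection rule, and the toy data are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_focus_regions(attn_logits, top_k):
    """Pick the top_k coarse regions with the highest attention weight.

    attn_logits is a hypothetical list of raw query-to-region attention
    scores, one per coarse scene token. Returns (sorted indices, weights).
    """
    weights = softmax(attn_logits)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:top_k]), weights


def magnify(coarse_tokens, dense_tokens_by_region, focus):
    """Keep all coarse tokens and append dense tokens of the focused regions."""
    sequence = list(coarse_tokens)
    for region in focus:
        sequence.extend(dense_tokens_by_region[region])
    return sequence


# Toy example (assumed data): four coarse regions, two of them magnified.
attn_logits = [0.1, 2.0, 0.3, 1.5]
coarse_tokens = ["c0", "c1", "c2", "c3"]
dense_tokens_by_region = {
    0: ["d0a", "d0b"],
    1: ["d1a", "d1b"],
    2: ["d2a"],
    3: ["d3a"],
}

focus, weights = select_focus_regions(attn_logits, top_k=2)
tokens = magnify(coarse_tokens, dense_tokens_by_region, focus)
print(focus)
print(tokens)
```

In this toy setup the token sequence grows only by the dense tokens of the selected regions, which is one way a task-dependent selection could avoid feeding every fine-grained feature of a large scene to the model.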

Code & Models

Repositories

Hoyyyaard/LSceneLLM
PyTorch · Official

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Computer Graphics and Visualization Techniques

Methods: Softmax · Attention Is All You Need