LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Hongyan Zhi, Peihao Chen, Junyan Li, Shuailei Ma, Xinyu Sun, Tianhang Xiang, Yinjie Lei, Mingkui Tan, Chuang Gan

TL;DR
LSceneLLM introduces an adaptive framework that leverages large language models' visual preferences to focus on task-relevant areas in large 3D scenes, improving understanding and performance in scene-related tasks.
Contribution
The paper proposes LSceneLLM, a novel adaptive approach that identifies and magnifies task-relevant visual details in large 3D scenes using LLMs, and introduces the XR-Scene benchmark for comprehensive evaluation.
Findings
Outperforms existing methods in large scene understanding tasks.
Enhances scene understanding by magnifying task-relevant areas.
Improves performance of existing 3D-VLMs with the scene magnifier module.
Abstract
Research on 3D Vision-Language Models (3D-VLMs) is gaining increasing attention, which is crucial for developing embodied AI within 3D scenes, such as visual navigation and embodied question answering. Due to the high density of visual features, especially in large 3D scenes, accurately locating task-relevant visual information is challenging. Existing works attempt to segment all objects and consider their features as scene representations. However, these task-agnostic object features include much redundant information and miss details in the task-relevant area. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging the LLM's visual preference for different tasks, followed by a plug-and-play scene magnifier module to capture fine-grained details in focused areas. Specifically, a dense token selector examines…
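The abstract breaks off before describing the dense token selector, so the following is only a minimal sketch of the general idea it names: rank coarse scene regions by attention weight, then add fine-grained tokens for the top-ranked regions. It is not the authors' implementation. The function names (`select_focus_regions`, `magnify`), the top-k rule, and the toy token labels are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_focus_regions(attn_logits, k):
    """Hypothetical selector: indices of the k regions with the highest attention weight."""
    weights = softmax(attn_logits)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])

def magnify(coarse_tokens, dense_tokens_by_region, focus):
    """Hypothetical magnifier: keep every coarse token, append dense tokens for focused regions."""
    tokens = list(coarse_tokens)
    for region in focus:
        tokens.extend(dense_tokens_by_region[region])
    return tokens

# Toy scene with four regions; the logits stand in for the model's attention to each region.
attn_logits = [0.1, 2.0, -1.0, 1.5]
coarse_tokens = ["c0", "c1", "c2", "c3"]
dense_tokens_by_region = {
    0: ["d0a", "d0b"],
    1: ["d1a", "d1b"],
    2: ["d2a"],
    3: ["d3a", "d3b", "d3c"],
}

focus = select_focus_regions(attn_logits, k=2)
scene_tokens = magnify(coarse_tokens, dense_tokens_by_region, focus)
print(focus)
print(scene_tokens)
```

In this toy setup the scene keeps its four coarse tokens and gains dense tokens only for the two regions with the highest weights, so the token count grows with the size of the focused area and not with the size of the whole scene.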
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Computer Graphics and Visualization Techniques
Methods: Softmax · Attention Is All You Need
