Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations
Yizhen Li, Dell Zhang, Xuelong Li, Yiqing Shen

TL;DR
This paper introduces DTwinSeger, a reasoning segmentation approach that uses Digital Twin representations to decouple perception from reasoning, so that a large language model reasons over a structured scene description rather than image tokens; the approach reaches state-of-the-art performance.
Contribution
The paper proposes a two-stage reasoning segmentation method using Digital Twin representations to separate perception and reasoning, improving multimodal understanding.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Demonstrates Digital Twin as an effective bridge for vision-text reasoning.
Enhances LLM reasoning with a new fine-tuning dataset, Seg-DT.
Abstract
Reasoning Segmentation (RS) is a multimodal vision-text task that requires segmenting objects based on implicit text queries, demanding both precise visual perception and vision-text reasoning capabilities. Current RS approaches rely on fine-tuning vision-language models (VLMs) for both perception and reasoning, but their tokenization of images fundamentally disrupts continuous spatial relationships between objects. We introduce DTwinSeger, a novel RS approach that leverages Digital Twin (DT) representation as an intermediate layer to decouple perception from reasoning. Innovatively, DTwinSeger reformulates RS as a two-stage process: the first stage transforms the image into a structured DT representation that preserves spatial relationships and semantic properties, and the second stage employs a Large Language Model (LLM) to perform explicit reasoning over this representation to identify target…
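The two-stage idea in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the paper's actual format or code: the `DTObject` fields, the JSON prompt layout, and the helper names `build_prompt` and `nearest_to` are all hypothetical. Stage 1 is represented by a hand-written list of objects (in practice a perception model would produce it), and a simple nearest-center rule stands in for the LLM reasoning of stage 2.

```python
import json
import math
from dataclasses import dataclass, asdict, field


@dataclass
class DTObject:
    """One entry of a hypothetical Digital Twin scene description."""
    obj_id: int
    label: str
    bbox: tuple  # (x_min, y_min, x_max, y_max) in pixels
    attributes: dict = field(default_factory=dict)


def center(obj):
    """Return the center point of an object's bounding box."""
    x0, y0, x1, y1 = obj.bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def build_prompt(objects, query):
    """Serialize the scene as JSON text so a text-only LLM could reason over it."""
    scene_json = json.dumps([asdict(o) for o in objects])
    return (
        "Scene objects (JSON):\n" + scene_json
        + "\nQuery: " + query
        + "\nAnswer with the ids of the target objects."
    )


def nearest_to(objects, target_label, anchor_label):
    """Toy stand-in for the LLM: pick the target_label object closest to an anchor."""
    anchors = [o for o in objects if o.label == anchor_label]
    targets = [o for o in objects if o.label == target_label]
    best = min(
        targets,
        key=lambda t: min(math.dist(center(t), center(a)) for a in anchors),
    )
    return best.obj_id


# Stage 1 output (hand-written here; a perception model would produce this).
scene = [
    DTObject(1, "cup", (100, 100, 140, 150), {"color": "white"}),
    DTObject(2, "cup", (400, 100, 440, 150), {"color": "blue"}),
    DTObject(3, "laptop", (150, 80, 300, 200)),
]

# Stage 2: reasoning over the structured representation instead of pixels.
prompt = build_prompt(scene, "Which cup is within reach of someone typing?")
target_id = nearest_to(scene, "cup", "laptop")
```

The point of the sketch is that spatial relationships stay explicit as coordinates in the intermediate representation, so the reasoning step never touches image tokens. The selected id would then be mapped back to that object's segmentation mask.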
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Graph Neural Networks
