Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning
Yiqing Shen, Mathias Unberath

TL;DR
This paper introduces DT-R1, a reinforcement learning framework that constructs digital twin representations of visual inputs, enabling a unified approach to diverse visual reasoning tasks and outperforming specialized models.
Contribution
The paper presents a novel reinforcement learning method for building digital twin representations that unify various visual reasoning tasks, improving over task-specific models.
Findings
DT-R1 outperforms state-of-the-art models on six benchmarks.
It effectively handles multiple modalities and task types.
The approach demonstrates the potential of digital twins in visual reasoning.
Abstract
Visual reasoning may require models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures. For example, reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, preventing unified solutions and limiting cross-task and cross-modality generalization. Hence, we propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations as a unified approach to visual reasoning. Specifically, we train DT-R1 using GRPO with a novel reward that validates both structural integrity and…
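GRPO (Group Relative Policy Optimization), as it is commonly described, scores each sampled response relative to the other responses drawn for the same prompt rather than against a learned value function. The snippet below is a minimal sketch of that group-relative advantage step only, assuming one scalar reward per response. The function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions; the paper's actual reward is truncated in the abstract above and is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Hypothetical helper: each advantage is the reward's deviation from the
    group mean, scaled by the group standard deviation (eps avoids division
    by zero when all rewards are equal).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one query, each with a scalar reward.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so the policy update depends only on relative quality within the group.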
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
