Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning

Yiqing Shen; Mathias Unberath

arXiv:2511.12365 · cs.CV · November 18, 2025

Open Access

TL;DR

This paper introduces DT-R1, a reinforcement learning framework that constructs digital twin representations of visual inputs, enabling a unified approach to diverse visual reasoning tasks and outperforming specialized models.

Contribution

The paper presents a reinforcement learning method that trains large language models to build digital twin representations of visual inputs and reason over them, replacing the separate task-specific models that visual reasoning tasks otherwise require.

Findings

1. DT-R1 outperforms state-of-the-art models on six benchmarks.

2. DT-R1 handles multiple modalities (images and videos) and multiple task types.

3. The approach demonstrates the potential of digital twin representations as a unified basis for visual reasoning.

Abstract

Visual reasoning may require models to interpret images and videos and respond to implicit text queries across diverse output formats, from pixel-level segmentation masks to natural language descriptions. Existing approaches rely on supervised fine-tuning with task-specific architectures. For example, reasoning segmentation, grounding, summarization, and visual question answering each demand distinct model designs and training, preventing unified solutions and limiting cross-task and cross-modality generalization. Hence, we propose DT-R1, a reinforcement learning framework that trains large language models to construct digital twin representations of complex multi-modal visual inputs and then reason over these high-level representations as a unified approach to visual reasoning. Specifically, we train DT-R1 using GRPO with a novel reward that validates both structural integrity and…
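The abstract names GRPO as the training algorithm. As a rough illustration only, the sketch below shows the group-relative advantage step that GRPO is generally known for: several responses are sampled for the same query, each receives a scalar reward, and every reward is normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for this sketch. The paper's actual reward design is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the
    same query. Returns one advantage per response.
    Assumption: population standard deviation plus a small `eps`
    to avoid division by zero when all rewards are equal.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one query.
example_rewards = [1.0, 0.0, 0.0, 1.0]
example_advantages = group_relative_advantages(example_rewards)
```

Responses that score above their group's mean get a positive advantage and those below get a negative one, so no separately learned value model is needed to provide a baseline.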

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning