A 4D Representation for Training-Free Agentic Reasoning from Monocular Laparoscopic Video

Maximilian Fehrentz; Nicolas Stellwag; Robert Wiebe; Nicole Thorisch; Fabian Grob; Patrick Remerscheid; Ken-Joel Simmoteit; Benjamin D. Killeen; Christian Heiliger; Nassir Navab

arXiv:2604.00867·cs.CV·April 2, 2026

TL;DR

This paper introduces an explicit 4D spatiotemporal representation for monocular laparoscopic video. Built from point tracking, depth, and segmentation models, it lets a multimodal large language model reason about a surgical scene and ground its answers in time and 3D space without additional training.

Contribution

It presents a framework that integrates explicit 4D representations with multimodal language models for training-free spatiotemporal reasoning in surgery.

Findings

1. Improved spatiotemporal understanding in surgical scenes.
2. Effective grounding of natural language in 4D representations.
3. No fine-tuning needed for reasoning with the proposed system.

Abstract

Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise at understanding surgical video, the spatial complexity of surgical scenes suggests that reasoning systems may benefit from explicit 4D representations. Here, we propose a framework for equipping surgical agents with spatiotemporal tools based on an explicit 4D representation, enabling AI systems to ground their natural language reasoning in both time and 3D space. Leveraging models for point tracking, depth, and segmentation, we develop a coherent 4D model with spatiotemporally consistent tool and tissue semantics. A Multimodal Large Language Model (MLLM) then acts as an agent on tools derived from the explicit 4D representation (e.g., trajectories)…
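
To make the idea of a trajectory "tool" concrete, here is a minimal sketch of how a tracked 2D point and per-frame depth could be lifted into a 3D trajectory and summarized for an agent. It is not the authors' implementation. The names `Intrinsics`, `backproject`, `lift_track`, and `trajectory_tool` are hypothetical, and the sketch assumes a pinhole camera with known intrinsics, metric depth, and a static camera (no camera-motion compensation).

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics (assumed known; hypothetical helper type)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u, v, depth, k):
    """Lift pixel (u, v) with depth along the optical axis to a 3D camera-space point."""
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)


def lift_track(track_2d, depths, k):
    """Turn a 2D point track plus per-frame depth samples into a 3D trajectory."""
    if len(track_2d) != len(depths):
        raise ValueError("track and depth sequences must have the same length")
    return [backproject(u, v, d, k) for (u, v), d in zip(track_2d, depths)]


def path_length(trajectory):
    """Sum of Euclidean distances between consecutive 3D points."""
    return sum(dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


def trajectory_tool(label, track_2d, depths, k):
    """Summarize one labeled track as a small record an agent could query."""
    traj = lift_track(track_2d, depths, k)
    return {
        "label": label,
        "start": traj[0],
        "end": traj[-1],
        "path_length": path_length(traj),
        "net_displacement": dist(traj[0], traj[-1]),
    }


if __name__ == "__main__":
    k = Intrinsics(fx=100.0, fy=100.0, cx=50.0, cy=50.0)
    track = [(50.0, 50.0), (150.0, 50.0), (150.0, 150.0)]  # pixel positions per frame
    depths = [1.0, 1.0, 1.0]                               # depth at each tracked pixel
    print(trajectory_tool("instrument_tip", track, depths, k))
```

Returning a compact summary rather than raw points reflects one plausible design choice: a language model can reason over a few named quantities more reliably than over long coordinate lists. A real system would also need to handle depth scale ambiguity and camera motion, which this sketch ignores.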

Code & Models

Repositories

https://tum-ai.github.io/surg4d
github
