TrajPred: Trajectory-Conditioned Joint Embedding Prediction for Surgical Instrument-Tissue Interaction Recognition in Vision-Language Models

Jiajun Cheng; Xiaofan Yu; Subarna Tripathi; Sainan Liu; Shan Lin

arXiv:2603.06999·cs.CV·March 17, 2026

Open Access

TL;DR

TrajPred enhances vision-language models for surgical instrument-tissue interaction recognition by encoding instrument trajectories and fine-tuning visual-textual alignment, leading to improved accuracy and better semantic alignment.

Contribution

The paper introduces TrajPred, a novel framework that incorporates instrument trajectories and a predictor module for improved fine-grained interaction recognition in surgical vision-language models.

Findings

01. Improves Average Precision and Top-K accuracy on the CholecT50 benchmark.

02. Enhances alignment between visual and textual embeddings.

03. Achieves effective adaptation through prompt tuning and verb rephrasing.

Abstract

Recognizing instruments' interactions with tissues is essential for building context-aware AI assistants in robotic surgery. Vision-language models (VLMs) have opened a new avenue for surgical perception and achieved better generalization on a wide range of tasks compared to conventional task-specific deep learning approaches. However, their performance on instrument-tissue interaction recognition remains limited, largely due to two challenges: (1) many models do not effectively leverage temporal information, and (2) alignment between vision and text often misses fine-grained action details. To address these issues, we propose TrajPred, a framework that encodes instrument trajectories to incorporate temporal motion cues and, conditioned on these trajectories, introduces a predictor module to generate visual semantic embeddings that better capture fine-grained action details. We further…
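The idea of conditioning a predictor on instrument trajectories can be illustrated with a small toy sketch. This is not the authors' implementation: the hand-crafted motion features, the single linear layer standing in for the predictor, the random weights, and every function name below (`encode_trajectory`, `predict_embedding`, `rank_labels`) are assumptions made purely for illustration. The sketch only shows the general shape of the pipeline: summarize a trajectory, combine it with a visual feature, map the result into a shared embedding space, and rank candidate text labels by cosine similarity.

```python
import math
import random


def encode_trajectory(centers):
    """Summarize a sequence of (x, y) instrument positions as simple motion cues.

    Hypothetical stand-in for a learned trajectory encoder.
    """
    steps = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(centers, centers[1:])]
    n = max(len(steps), 1)
    mean_dx = sum(dx for dx, _ in steps) / n
    mean_dy = sum(dy for _, dy in steps) / n
    path_len = sum(math.hypot(dx, dy) for dx, dy in steps)
    return [mean_dx, mean_dy, path_len]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def predict_embedding(visual, traj, weights):
    """Linear stand-in for a predictor: maps [visual; traj] into the text space."""
    x = list(visual) + list(traj)
    z = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return l2_normalize(z)


def rank_labels(pred, text_embeddings):
    """Rank candidate labels by cosine similarity to the predicted embedding."""
    scores = {
        label: sum(p * t for p, t in zip(pred, l2_normalize(emb)))
        for label, emb in text_embeddings.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy inputs: all values are made up; a real system would use learned features.
rng = random.Random(0)
centers = [(0.10, 0.20), (0.15, 0.22), (0.22, 0.25), (0.30, 0.27)]
visual = [rng.uniform(-1, 1) for _ in range(4)]
traj = encode_trajectory(centers)
weights = [[rng.uniform(-1, 1) for _ in range(len(visual) + len(traj))] for _ in range(5)]
text_embeddings = {
    label: [rng.uniform(-1, 1) for _ in range(5)]
    for label in ("grasp", "retract", "dissect")  # illustrative verb labels
}

pred = predict_embedding(visual, traj, weights)
ranking = rank_labels(pred, text_embeddings)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Because the weights here are random, the resulting ranking carries no meaning. In a trained model, the predictor and encoders would be learned so that the predicted embedding lands near the text embedding of the correct interaction.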

Taxonomy

Topics: Multimodal Machine Learning Applications · Surgical Simulation and Training · Soft Robotics and Applications