TrajPred: Trajectory-Conditioned Joint Embedding Prediction for Surgical Instrument-Tissue Interaction Recognition in Vision-Language Models
Jiajun Cheng, Xiaofan Yu, Subarna Tripathi, Sainan Liu, Shan Lin

TL;DR
TrajPred enhances vision-language models for surgical instrument-tissue interaction recognition by encoding instrument trajectories and fine-tuning visual-textual alignment, leading to improved accuracy and better semantic alignment.
Contribution
The paper introduces TrajPred, a novel framework that incorporates instrument trajectories and a predictor module for improved fine-grained interaction recognition in surgical vision-language models.
Findings
Improves Average Precision and Top-K accuracy on the CholecT50 benchmark.
Enhances alignment between visual and textual embeddings.
Adapts effectively through prompt tuning and verb rephrasing.
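For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how Top-K accuracy and per-class Average Precision are commonly computed. It is not the paper's evaluation code: the function names and toy inputs are hypothetical, and the exact protocol used on CholecT50 is not described in this listing.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-sample score lists (one score per class).
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def average_precision(scores, targets):
    """Average Precision for a single class.

    scores: one confidence per sample for this class.
    targets: 1 if the class is present in the sample, else 0.
    AP is the mean of the precision values at each rank where a positive appears.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if targets[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy inputs (hypothetical, not taken from the paper).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

A mean Average Precision would then be the average of `average_precision` over all classes; that aggregation step is omitted here.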
Abstract
Recognizing instruments' interactions with tissues is essential for building context-aware AI assistants in robotic surgery. Vision-language models (VLMs) have opened a new avenue for surgical perception and achieved better generalization on a wide range of tasks compared to conventional task-specific deep learning approaches. However, their performance on instrument-tissue interaction recognition remains limited, largely due to two challenges: (1) many models do not effectively leverage temporal information, and (2) alignment between vision and text often misses fine-grained action details. To address these issues, we propose TrajPred, a framework that encodes instrument trajectories to incorporate temporal motion cues and, conditioned on these trajectories, introduces a predictor module to generate visual semantic embeddings that better capture fine-grained action details. We further…
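The general idea in the abstract (summarize an instrument's motion, condition a predictor on it, and compare the predicted embedding with text embeddings) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the hand-crafted trajectory features, the linear predictor, the weights, and the two verb embeddings are hypothetical and do not reflect TrajPred's actual architecture, which this listing does not describe.

```python
import math


def encode_trajectory(points):
    """Summarize a 2D instrument track as [mean dx, mean dy, path length].

    points: list of (x, y) positions over time; needs at least two points.
    This hand-crafted encoding is a stand-in for a learned trajectory encoder.
    """
    steps = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]
    n = len(steps)
    mean_dx = sum(dx for dx, _ in steps) / n
    mean_dy = sum(dy for _, dy in steps) / n
    length = sum(math.hypot(dx, dy) for dx, dy in steps)
    return [mean_dx, mean_dy, length]


def predict_embedding(visual, traj, weights):
    """Toy predictor: a linear map applied to [visual features, trajectory features]."""
    x = visual + traj  # list concatenation, i.e. conditioning on the trajectory
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def rank_verbs(pred, text_embeddings):
    """Order candidate verbs by similarity between the prediction and each text embedding."""
    return sorted(
        text_embeddings,
        key=lambda name: cosine(pred, text_embeddings[name]),
        reverse=True,
    )


# Hypothetical inputs: a short rightward track, a 2-D visual feature,
# hand-picked weights, and made-up text embeddings for two verbs.
track = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
traj_feat = encode_trajectory(track)
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # output dim 0 copies the first visual feature
    [0.0, 0.0, 1.0, 0.0, 0.0],  # output dim 1 copies mean dx from the trajectory
]
pred = predict_embedding([1.0, 0.0], traj_feat, weights)
ranking = rank_verbs(pred, {"grasp": [1.0, 1.0], "cut": [1.0, -1.0]})
```

In this toy setup the trajectory term changes which verb ranks first: without the motion feature the prediction would sit equally close to both text embeddings.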
Taxonomy
Topics: Multimodal Machine Learning Applications · Surgical Simulation and Training · Soft Robotics and Applications
