Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy

Jorge Tapias Gomez; Despoina Kanata; Aneesh Rangnekar; Christina Lee; Julio Garcia-Aguilar; Joshua Jesse Smith; Harini Veeraraghavan

arXiv:2512.03883·cs.CV·May 8, 2026

TL;DR

This paper introduces a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) for accurately detecting rectal tumor regrowth during watch-and-wait endoscopy follow-up, leveraging pretrained transformers for robust feature extraction.

Contribution

The study presents a novel SSDCA model that combines longitudinal endoscopic images without spatial alignment, achieving high accuracy and robustness in detecting tumor regrowth.

Findings

01

SSDCA achieved 81.76% balanced accuracy, 90.07% sensitivity, and 72.86% specificity.

02

The model showed stable performance on images affected by blood, stool, and poor image quality.

03

UMAP analysis confirmed discriminative feature representations with maximal inter-cluster separation.
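For readers less familiar with the metrics in the first finding, the following minimal sketch shows how sensitivity, specificity, and balanced accuracy are conventionally computed from a binary confusion matrix. It assumes LR is treated as the positive class and uses made-up counts; the function name `binary_metrics` is hypothetical, and the paper may aggregate its reported figures differently (for example, across folds or patients).

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, balanced accuracy) from confusion counts.

    Assumption: the positive class is local regrowth (LR) and the
    negative class is clinical complete response (cCR).
    """
    sensitivity = tp / (tp + fn)   # fraction of LR cases correctly flagged
    specificity = tn / (tn + fp)   # fraction of cCR cases correctly cleared
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy


# Hypothetical counts for illustration only; these are not the paper's data.
sens, spec, bal_acc = binary_metrics(tp=9, fn=1, tn=7, fp=3)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} balanced_accuracy={bal_acc:.2f}")
```

Balanced accuracy weights both classes equally, which is why it is often preferred over plain accuracy when the classes are imbalanced.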

Abstract

Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objective and accurate methods for early detection of local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain-agnostic features and enhance robustness to imaging variations. Dual cross-attention is implemented to emphasize features from the two scans without requiring any spatial alignment of images to predict response. SSDCA as well as Swin-based baselines were trained using image pairs from…
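The idea of cross-attending between two scans without spatial alignment can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: it assumes single-head scaled dot-product attention with identity projections (no learned weights), random arrays standing in for Swin patch features, and mean pooling followed by concatenation as the fusion step. The names `cross_attention` and `dual_cross_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention: `queries` attend to `context`.

    Simplification: identity projections are used instead of learned
    query/key/value weight matrices.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (n_queries, n_context)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context, weights


def dual_cross_attention(restaging_tokens, followup_tokens):
    """Attend in both directions, then pool and concatenate (assumed fusion)."""
    r2f, _ = cross_attention(restaging_tokens, followup_tokens)
    f2r, _ = cross_attention(followup_tokens, restaging_tokens)
    return np.concatenate([r2f.mean(axis=0), f2r.mean(axis=0)])


# Random stand-ins for patch features from the two time points.
rng = np.random.default_rng(0)
restaging = rng.normal(size=(49, 32))   # 49 patch tokens, 32 dimensions
followup = rng.normal(size=(49, 32))
fused = dual_cross_attention(restaging, followup)
print(fused.shape)
```

In this sketch, every token of one image can attend to every token of the other, so no patch-to-patch correspondence is assumed; with mean pooling, shuffling the order of the follow-up tokens leaves the fused vector unchanged up to floating-point error.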
