BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation
Ye Yu, Jialin Yuan, Gaurav Mittal, Li Fuxin, and Mei Chen

TL;DR
BATMAN introduces a novel bilateral attention transformer that leverages motion and appearance cues, along with optical flow calibration, to significantly improve semi-supervised video object segmentation performance across multiple benchmarks.
Contribution
The paper proposes a new Bilateral Attention Transformer with optical flow calibration for better segmentation of similar objects in close proximity.
Findings
Reported to outperform prior state-of-the-art methods on four VOS benchmarks.
Effectively captures motion and appearance for improved segmentation.
Reduces noise at object boundaries through optical flow calibration.
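To make the last finding concrete, here is a minimal sketch of what "fusing the segmentation mask with optical flow" could look like. It is not the paper's calibration module, which this summary does not describe in detail. It is a hand-written stand-in that replaces every flow vector inside the object mask with the mean flow over the mask, which smooths within-object flow and overrides noisy vectors on the object's side of the boundary. The function name `calibrate_flow` and the averaging rule are assumptions made for illustration.

```python
def calibrate_flow(flow, mask):
    """Toy stand-in for flow calibration (assumption, not the paper's module).

    flow: H x W grid of (dx, dy) tuples.
    mask: H x W grid of 0/1 values marking the object.
    Returns a new grid where every masked pixel carries the mean masked flow.
    """
    height, width = len(flow), len(flow[0])
    inside = [flow[y][x]
              for y in range(height)
              for x in range(width)
              if mask[y][x]]
    if not inside:
        # No object pixels: return an unchanged copy.
        return [row[:] for row in flow]
    mean_dx = sum(v[0] for v in inside) / len(inside)
    mean_dy = sum(v[1] for v in inside) / len(inside)
    return [[(mean_dx, mean_dy) if mask[y][x] else flow[y][x]
             for x in range(width)]
            for y in range(height)]


# Hypothetical 2 x 2 example: the right column belongs to the object.
flow = [[(0.0, 0.0), (2.0, 0.0)],
        [(0.0, 0.0), (4.0, 2.0)]]
mask = [[0, 1],
        [0, 1]]
calibrated = calibrate_flow(flow, mask)
```

In this toy example both object pixels end up with the same vector, while background pixels keep their original flow. A learned module would be expected to do something far less crude than a global mean.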
Abstract
Video Object Segmentation (VOS) is fundamental to video understanding. Transformer-based methods show significant performance improvement on semi-supervised VOS. However, existing work faces challenges segmenting visually similar objects in close proximity to each other. In this paper, we propose a novel Bilateral Attention Transformer in Motion-Appearance Neighboring space (BATMAN) for semi-supervised VOS. It captures object motion in the video via a novel optical flow calibration module that fuses the segmentation mask with optical flow estimation to improve within-object optical flow smoothness and reduce noise at object boundaries. This calibrated optical flow is then employed in our novel bilateral attention, which computes the correspondence between the query and reference frames in the neighboring bilateral space considering both motion and appearance. Extensive experiments…
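The idea of attention restricted to a motion-appearance neighborhood can be sketched for a single query token as follows. This is a simplified illustration under stated assumptions, not the authors' formulation: the abstract does not say how the neighborhood is defined, so the hard thresholds `max_dist` and `max_flow_diff`, the function name `bilateral_attention`, and the use of plain scaled dot-product scores for appearance are all hypothetical choices.

```python
import math


def bilateral_attention(query, keys, values, q_pos, k_pos, q_flow, k_flow,
                        max_dist=1.0, max_flow_diff=1.0):
    """Attend from one query token to reference tokens that are close in
    both position and flow. Thresholds are illustrative assumptions."""
    scale = math.sqrt(len(query))
    scores = []
    for key, pos, flow in zip(keys, k_pos, k_flow):
        near = math.dist(q_pos, pos) <= max_dist
        similar_motion = math.dist(q_flow, flow) <= max_flow_diff
        if near and similar_motion:
            # Appearance similarity: scaled dot product.
            scores.append(sum(a * b for a, b in zip(query, key)) / scale)
        else:
            scores.append(None)  # excluded from the softmax
    kept = [s for s in scores if s is not None]
    if not kept:
        return [0.0] * len(values[0]), [0.0] * len(keys)
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if s is not None else 0.0 for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return out, weights


# Hypothetical example: three reference tokens with identical appearance.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
k_pos = [(0, 0), (0, 1), (5, 5)]
k_flow = [(1.0, 0.0), (-3.0, 0.0), (1.0, 0.0)]
out, weights = bilateral_attention(query, keys, values,
                                   q_pos=(0, 0), k_pos=k_pos,
                                   q_flow=(1.0, 0.0), k_flow=k_flow)
```

All three reference tokens look identical to the query, so appearance alone cannot separate them. In this sketch the second token is excluded because it moves in a different direction and the third because it is spatially far away, which is the kind of case (visually similar objects close together) the abstract says existing methods struggle with.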
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Byte Pair Encoding · Label Smoothing · Residual Connection
