CCNet: Criss-Cross Attention for Semantic Segmentation
Zilong Huang, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, Thomas S. Huang

TL;DR
CCNet introduces a novel criss-cross attention module that efficiently captures full-image contextual information for semantic segmentation, achieving state-of-the-art results with reduced memory and computational costs.
Contribution
The paper proposes a recurrent criss-cross attention module that is GPU memory friendly, computationally efficient, and effective for full-image context modeling in semantic segmentation.
Findings
Achieves state-of-the-art mIoU scores on Cityscapes, ADE20K, and LIP benchmarks.
Requires 11x less GPU memory than non-local blocks.
Reduces FLOPs by approximately 85% compared to non-local attention.
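As a rough illustration of where savings of this kind come from, the sketch below counts attention-map entries only. It assumes a dense non-local block scores every position against every other position, giving (H*W) x (H*W) entries, while a criss-cross pass scores each position against the H + W - 1 positions on its own row and column. The function names and the 97 x 97 feature-map size are hypothetical, and the count ignores convolutions, activations, and recurrence, so it is not expected to reproduce the 11x or 85% figures above.

```python
def nonlocal_affinity_entries(h: int, w: int) -> int:
    """Entries in a dense attention map: every position scores every position."""
    n = h * w
    return n * n


def criss_cross_affinity_entries(h: int, w: int) -> int:
    """Entries in one criss-cross pass: each position scores its row and column.

    A row has w positions and a column has h; the position itself is shared,
    so each position attends to h + w - 1 positions.
    """
    return h * w * (h + w - 1)


if __name__ == "__main__":
    h, w = 97, 97  # hypothetical feature-map size, chosen only for illustration
    dense = nonlocal_affinity_entries(h, w)
    sparse = criss_cross_affinity_entries(h, w)
    print(f"dense entries:       {dense}")
    print(f"criss-cross entries: {sparse}")
    print(f"ratio:               {dense / sparse:.1f}")
```

The ratio simplifies to H*W / (H + W - 1), so in this simplified count the gap widens as the feature map grows.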
Abstract
Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet is with the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11x less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention significantly…
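The claim that a further recurrent operation reaches the full image can be checked with a small connectivity sketch. It tracks only which positions can contribute information to a given pixel, not the attention weights themselves, and the helper names and grid size are hypothetical. After one pass a pixel sees its own row and column; after a second pass, every pixel on that row has gathered from its own column, so the whole grid is covered.

```python
def criss_cross_path(pos, h, w):
    """Positions sharing a row or a column with pos (pos itself included)."""
    r, c = pos
    row = {(r, j) for j in range(w)}
    col = {(i, c) for i in range(h)}
    return row | col


def receptive_field(pos, h, w, loops):
    """Positions whose information can reach pos after `loops` criss-cross passes."""
    field = {pos}
    for _ in range(loops):
        # Each position already in the field has gathered from its own path.
        field = set().union(*(criss_cross_path(p, h, w) for p in field))
    return field


if __name__ == "__main__":
    h, w = 4, 5  # hypothetical grid size
    one_pass = receptive_field((1, 2), h, w, loops=1)
    two_pass = receptive_field((1, 2), h, w, loops=2)
    print(len(one_pass), "positions after one pass")
    print(len(two_pass), "positions after two passes, out of", h * w)
```

This only shows that two passes are enough for connectivity; how strongly distant pixels contribute depends on the learned attention weights, which the sketch does not model.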
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Criss-Cross Network · Residual Connection · Non-Local Operation · 1x1 Convolution · Non-Local Block
