Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers

Markus Hiller; Krista A. Ehinger; Tom Drummond

arXiv:2402.12138 · cs.CV · November 1, 2024 · 1 citation

TL;DR

The paper introduces BiXT, a bi-directional Transformer architecture that efficiently processes longer sequences across various modalities, outperforming larger models in speed and resource usage while maintaining competitive accuracy.

Contribution

BiXT replaces iterative attention with a bi-directional cross-attention module, enabling linear scaling with input size and simultaneous interpretation of semantics and location.
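As a rough illustration of the linear-scaling claim, the sketch below counts the entries of the attention score matrix for full self-attention over N tokens and for cross-attention between N tokens and a fixed set of M latents. The function name and the choice of 64 latents are hypothetical, picked only for illustration, and the count ignores projections and feed-forward layers.

```python
def attention_matrix_entries(num_tokens, num_latents=None):
    """Count pairwise scores computed by one attention layer.

    With num_latents=None this models full self-attention (N x N scores).
    Otherwise it models cross-attention between N tokens and M latents
    (N x M scores), which grows linearly in N when M is held fixed.
    """
    if num_latents is None:
        return num_tokens * num_tokens
    return num_tokens * num_latents


if __name__ == "__main__":
    # 64 latents is an assumed value for illustration only.
    for n in (1024, 2048, 4096):
        full = attention_matrix_entries(n)
        cross = attention_matrix_entries(n, num_latents=64)
        print(f"N={n}: self-attention={full}, cross-attention={cross}")
```

Doubling the number of tokens quadruples the self-attention count but only doubles the cross-attention count, which is the sense in which the cost is linear in input size.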

Findings

01. Outperforms larger models in vision tasks with 28% fewer FLOPs

02. Achieves up to 8.4x faster processing speed

03. Performs comparably to full Transformers on sequence tasks

Abstract

We present a novel bi-directional Transformer architecture (BiXT) which scales linearly with input size in terms of computational cost and memory consumption, but does not suffer the drop in performance or limitation to only one input modality seen with other efficient Transformer-based approaches. BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module in which input tokens and latent variables attend to each other simultaneously, leveraging a naturally emerging attention-symmetry between the two. This approach unlocks a key bottleneck experienced by Perceiver-like architectures and enables the processing and interpretation of both semantics ('what') and location ('where') to develop alongside each other over multiple layers -- allowing its direct application to dense and instance-based tasks alike. By…
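A minimal NumPy sketch of the mechanism the abstract describes, assuming a single head, no normalization layers, and one shared latent-token similarity matrix that is softmax-normalized along each axis to serve both attention directions. The function and weight names are hypothetical, and this should be read as an illustration under those assumptions rather than the authors' implementation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def bidirectional_cross_attention(latents, tokens,
                                  w_ref_lat, w_val_lat,
                                  w_ref_tok, w_val_tok):
    """One bi-directional cross-attention step (illustrative sketch).

    latents: (M, d) array, tokens: (N, d) array, all weights: (d, d).
    A single (M, N) similarity matrix is reused for both directions.
    """
    ref_lat = latents @ w_ref_lat            # (M, d) reference vectors
    ref_tok = tokens @ w_ref_tok             # (N, d) reference vectors
    val_lat = latents @ w_val_lat            # (M, d) values
    val_tok = tokens @ w_val_tok             # (N, d) values

    scale = np.sqrt(ref_lat.shape[-1])
    sim = ref_lat @ ref_tok.T / scale        # (M, N), computed once

    # Latents attend to tokens: normalize over the token axis.
    lat_update = softmax(sim, axis=1) @ val_tok       # (M, d)
    # Tokens attend to latents: normalize over the latent axis.
    tok_update = softmax(sim.T, axis=1) @ val_lat     # (N, d)

    # Residual connections keep both streams refined across layers.
    return latents + lat_update, tokens + tok_update


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lat = rng.normal(size=(4, 8))            # 4 latents (assumed size)
    tok = rng.normal(size=(32, 8))           # 32 input tokens (assumed size)
    weights = [rng.normal(size=(8, 8)) * 0.1 for _ in range(4)]
    new_lat, new_tok = bidirectional_cross_attention(lat, tok, *weights)
    print(new_lat.shape, new_tok.shape)
```

Because the score matrix has M x N entries and both directions reuse it, the cost of this step grows linearly with the number of tokens for a fixed number of latents, and the tokens themselves are updated in every layer rather than being read only once.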

Code & Models

Videos

Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers · SlidesLive

Taxonomy

Topics: Medical Image Segmentation Techniques · Industrial Vision Systems and Defect Detection

Methods: Attention Is All You Need · Linear Layer · Concatenated Skip Connection · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection