Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers
Markus Hiller, Krista A. Ehinger, Tom Drummond

TL;DR
The paper introduces BiXT, a bi-directional Transformer architecture that efficiently processes longer sequences across various modalities, outperforming larger models in speed and resource usage while maintaining competitive accuracy.
Contribution
BiXT replaces iterative attention with a bi-directional cross-attention module, enabling linear scaling with input size and simultaneous interpretation of semantics and location.
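To make the idea concrete, here is a minimal single-head sketch of a bi-directional cross-attention step, assuming the attention symmetry is exploited by computing one latent-token similarity matrix and normalising it along each axis. The function name, weight names, and shapes are illustrative assumptions, not the authors' code, and the sketch omits multi-head splitting, residual connections, and layer normalisation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def bidirectional_cross_attention(latents, tokens, w_lat_ref, w_tok_ref, w_lat_val, w_tok_val):
    """Hypothetical single-head sketch; all names here are assumptions.

    latents: (M, D) latent vectors, tokens: (N, D) input tokens.
    """
    ref_lat = latents @ w_lat_ref  # (M, d) latent-side projections
    ref_tok = tokens @ w_tok_ref   # (N, d) token-side projections

    # One shared similarity matrix serves both attention directions.
    sim = ref_lat @ ref_tok.T / np.sqrt(ref_lat.shape[-1])  # (M, N)

    attn_lat = softmax(sim, axis=1)    # (M, N): each latent attends over tokens
    attn_tok = softmax(sim.T, axis=1)  # (N, M): each token attends over latents

    new_latents = attn_lat @ (tokens @ w_tok_val)   # (M, d)
    new_tokens = attn_tok @ (latents @ w_lat_val)   # (N, d)
    return new_latents, new_tokens, attn_lat, attn_tok


# Toy usage with random data: 4 latents, 16 tokens, width 8.
rng = np.random.default_rng(0)
M, N, D = 4, 16, 8
latents = rng.normal(size=(M, D))
tokens = rng.normal(size=(N, D))
weights = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(4)]

new_latents, new_tokens, attn_lat, attn_tok = bidirectional_cross_attention(
    latents, tokens, *weights
)
print(new_latents.shape, new_tokens.shape)
```

Because the similarity matrix has shape (M, N) and the number of latents M is fixed, both compute and memory in this sketch grow linearly with the number of input tokens N, which is the scaling behaviour the summary describes.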
Findings
Outperforms larger models in vision tasks with 28% fewer FLOPs
Achieves up to 8.4x faster processing speed
Performs comparably to full Transformers on sequence tasks
Abstract
We present a novel bi-directional Transformer architecture (BiXT) which scales linearly with input size in terms of computational cost and memory consumption, but does not suffer the drop in performance or limitation to only one input modality seen with other efficient Transformer-based approaches. BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module in which input tokens and latent variables attend to each other simultaneously, leveraging a naturally emerging attention-symmetry between the two. This approach unlocks a key bottleneck experienced by Perceiver-like architectures and enables the processing and interpretation of both semantics ('what') and location ('where') to develop alongside each other over multiple layers -- allowing its direct application to dense and instance-based tasks alike. By…
Taxonomy
Topics: Medical Image Segmentation Techniques · Industrial Vision Systems and Defect Detection
Methods: Attention Is All You Need · Linear Layer · Concatenated Skip Connection · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection
